Telematics, Internet and e-commerce services typically run 24 hours a day. This requires a highly available production software infrastructure. Infrastructure upgrades, software issues and environmental impacts such as lost connectivity or electrical power outages are sources of downtime that affect service quality.
The good news is that there are best practices for high availability; the bad news is that most approaches focus on the technical perspective only. Availability is defined as the percentage of uptime during a year and is calculated from the mean time between failures of the system's components. Reality shows that even the most highly available systems are affected by outages, and that increased investment in high-availability infrastructure has not effectively improved overall quality.
Customer service orientation instead of uptime
With our approach, we move the focus to where it belongs: your customer. Resources are spent on customer experience and business objectives rather than on greater technical complexity that leads to even more problems. We accept the weaknesses of an environment as a reality and learn how to deal with bad weather conditions. It's stunning to see that even with low technical availability, you can achieve much higher customer satisfaction, at a level that can compete with a 99.9999% uptime system.
Imagine a 5-nine (99.999%) available payment transaction processing system that processes 1000 requests per second. A one-minute downtime affects 60’000 customers. Imagine that 5% of them dial in to customer care (at a cost of 6$ per call, that is 18’000$) and 1% switch to another provider. Assuming a customer acquisition cost of 50$, that 1% leads to a recovery cost of 30’000$. And we have not yet mentioned the 5% who tweet about the bad experience and reach another 50’000 people with a negative brand-related message.
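The arithmetic behind these figures can be reproduced with a few lines of Python. This is only a sketch: the function and parameter names are made up for illustration, and it assumes, as the example does, that each request belongs to a different customer. The total is simply the sum of the two cost figures above.

```python
# Back-of-the-envelope cost of a short outage, using the figures
# from the example above. All names are illustrative assumptions.

def outage_cost(requests_per_second, outage_seconds,
                call_percent, cost_per_call,
                churn_percent, acquisition_cost):
    """Return the affected customers and the resulting costs."""
    # Assumption: one request corresponds to one affected customer.
    affected = requests_per_second * outage_seconds
    # Integer percentages keep the arithmetic exact.
    callers = affected * call_percent // 100
    churned = affected * churn_percent // 100
    care_cost = callers * cost_per_call
    recovery_cost = churned * acquisition_cost
    return {
        "affected": affected,
        "care_cost": care_cost,
        "recovery_cost": recovery_cost,
        "total": care_cost + recovery_cost,
    }

result = outage_cost(
    requests_per_second=1000,
    outage_seconds=60,
    call_percent=5,       # share of customers calling customer care
    cost_per_call=6,      # dollars per call
    churn_percent=1,      # share of customers switching provider
    acquisition_cost=50,  # dollars to win a customer back
)
print(result)
```

Changing a single input, for example a longer outage or a higher call rate, shows how quickly the business cost grows while the technical availability figure barely moves.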
How could that happen? Customers initiated a payment and received the configured “Error in transaction” message, which is the text displayed for error code 445. Because the software is highly available by definition, no system architect ever thought about that scenario: error 445 was never supposed to occur. Note that this one-minute outage is entirely permissible within a 5-nine uptime SLA, since 99.999% of a year still leaves roughly five minutes of allowed downtime.
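To see why a one-minute outage fits inside such an SLA, the yearly downtime budget can be derived from the number of nines. The sketch below assumes a 365-day year and availability measured as the share of uptime over the whole year; the function name is hypothetical.

```python
# Yearly downtime budget implied by an availability of N nines.
# Assumption: a 365-day year, availability measured over the full year.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525600 minutes

def allowed_downtime_minutes(nines):
    """Minutes of downtime per year permitted by an N-nines SLA."""
    unavailability = 10 ** -nines  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability

budget = allowed_downtime_minutes(5)
print(f"5-nine budget: {budget:.2f} minutes per year")
print("one-minute outage within SLA:", 1 <= budget)
```

A 5-nine system may therefore be down for a little over five minutes per year and still meet its SLA, which is more than enough room for the one-minute outage described above.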