Key takeaways:
Availability is the percentage of time a service or product is accessible and performing intended operations under normal conditions.
Five 9s represents a 99.999% uptime, meaning less than 5.26 minutes of downtime per year.
To achieve high availability in system design, use load balancing, redundancy, rate limiting, circuit breakers, and failover mechanisms.
Availability is measured using metrics like mean time to failure (MTTF), mean time to repair (MTTR), and mean time between failures (MTBF).
Availability in system design is the percentage of time a service or product is accessible and performing its intended operations under normal conditions. For example, if a service’s availability is 50%, it is accessible and operational for half of the year; 100% availability means the service is available all the time, which is impossible in practice.
When deploying a service, clients are often assured that the application will be available 99.999% (five 9s) of the time. This allows for only 0.001% of downtime. Let’s take a look at some calculations to see exactly what that entails:
A downtime of 0.001% equals roughly 5.26 minutes––our service will be down for less than 6 minutes a year.
Now, 6 minutes in a year doesn’t sound like much, but there’s something we haven’t considered yet. Assume an application runs on 200 microservices to handle the load, and each of these 200 services can fail at a different time. If the services depend on each other, the failure of any one of them means failure of the overall service. In the worst case, the overall downtime adds up to 200 × 5.26 minutes ≈ 1,052 minutes, or roughly 17.5 hours.
So even though each service is down for merely 0.001% of the time, the overall service can be unavailable for approximately 18 hours a year. Such high numbers are unacceptable, and planned downtimes are not even included in this figure.
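The arithmetic above can be sketched in a few lines of Python (the 200-service count is the example figure from the scenario above):

```python
# Minutes in an average year (365.25 days).
MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_minutes_per_year(availability_pct):
    """Minutes of downtime per year implied by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

# Five 9s: about 5.26 minutes of downtime per year for a single service.
print(round(downtime_minutes_per_year(99.999), 2))

# Worst case for 200 interdependent services failing at different times:
# the per-service downtimes add up to roughly 17.5 hours.
print(round(200 * downtime_minutes_per_year(99.999) / 60, 1))
```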
The following table shows the effect of different time percentages on availability with approximate numbers:
| Availability | Downtime per year | Downtime per month | Downtime per day |
| --- | --- | --- | --- |
| 1 nine –– 90% | 36.5 days | 72 hours | 2.4 hours |
| 2 nines –– 99.0% | 3.65 days | 7.20 hours | 14.4 minutes |
| 3 nines –– 99.9% | 8.76 hours | 43.8 minutes | 1.44 minutes |
| 4 nines –– 99.99% | 52.56 minutes | 4.32 minutes | 8.64 seconds |
| 5 nines –– 99.999% | 5.26 minutes | 25.9 seconds | 0.86 seconds |
Availability is crucial for the success of any service and for delivering a seamless user experience.
It ensures that users can access the service and perform their intended tasks at any time. When availability is low and downtime is frequent, it can lead to financial losses and increased customer dissatisfaction—both of which can seriously impact the reputation and reliability of the service.
Usually, a system’s availability is defined through different metrics, such as the percentage of total operational time, mean time to failure (MTTF), mean time to repair (MTTR), and mean time between failures (MTBF).
In system design, minimizing these times and achieving five 9s of availability is the ultimate goal for an optimal and efficient system, but it is also challenging. To achieve high availability, we must carefully follow some design principles that reduce the probability of failure, as discussed in the following sections.
For a service with millions of users, numerous requests can arrive every second.
Even with thousands of servers available to handle these requests, if they are all redirected to a single server, they can overload and crash it. A load balancer is a component that avoids overloading any one server by fairly dividing incoming requests among the pool of available servers.
This helps improve availability, scalability, and performance.
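As a minimal sketch, a round-robin policy (one of several common strategies) spreads requests evenly across a pool of servers; the server names here are hypothetical:

```python
import itertools

class RoundRobinBalancer:
    """Minimal round-robin load balancer: hands each incoming request
    to the next server in the pool, cycling back to the first."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def route(self, request):
        server = next(self._cycle)
        return server, request

lb = RoundRobinBalancer(["server-a", "server-b", "server-c"])
for i in range(4):
    server, _ = lb.route(f"req-{i}")
    print(server)  # cycles server-a, server-b, server-c, server-a
```

Real load balancers also weigh servers by capacity and skip unhealthy ones, but the fair-division idea is the same.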
We can duplicate important or critical components, including servers and data.
For example, balancing load in a multiple-server environment ensures that the load is shifted to other available servers if a server fails. Similarly, data replication ensures the availability of data in case a storage device fails by continuously duplicating data from a primary location to a separate secondary location.
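The data-replication idea can be sketched as a toy key-value store that synchronously copies every write to a secondary location (a simplified illustration, not a real replication protocol):

```python
class ReplicatedStore:
    """Sketch of primary/secondary replication: every write is copied
    to the secondary, so reads can fail over if the primary dies."""
    def __init__(self):
        self.primary = {}
        self.secondary = {}
        self.primary_up = True

    def write(self, key, value):
        self.primary[key] = value
        self.secondary[key] = value  # replicate to the secondary location

    def read(self, key):
        # Serve from the secondary if the primary storage has failed.
        store = self.primary if self.primary_up else self.secondary
        return store[key]

db = ReplicatedStore()
db.write("user:1", "alice")
db.primary_up = False       # simulate a primary storage failure
print(db.read("user:1"))    # still served from the secondary: alice
```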
Rate limiting is the practice of restricting the number of requests a user can make to a service within a defined time period.
It helps throttle requests that exceed a predefined threshold. Typically used as a protective mechanism, a rate limiter prevents excessive or abusive usage, whether intentional or accidental, ensuring fair resource allocation and consistent availability for all users.
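A fixed-window counter is one simple way to implement this; the sketch below allows at most `limit` requests per user in each time window (other schemes, such as token buckets or sliding windows, refine the same idea):

```python
import time
from collections import defaultdict

class FixedWindowRateLimiter:
    """Allow at most `limit` requests per user per `window` seconds,
    using fixed-window counting (a sketch, not production code)."""
    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.counts = defaultdict(int)

    def allow(self, user, now=None):
        now = time.monotonic() if now is None else now
        # Requests in the same window share one counter per user.
        key = (user, int(now // self.window))
        self.counts[key] += 1
        return self.counts[key] <= self.limit

limiter = FixedWindowRateLimiter(limit=3, window=60)
results = [limiter.allow("alice", now=10.0) for _ in range(5)]
print(results)  # [True, True, True, False, False]
```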
When a server fails, the circuit breaker temporarily stops sending requests to it, allowing it to recover and preventing cascading failures. This helps keep the service available by isolating failed servers. If redundant servers are present, requests are redirected to them.
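A minimal circuit breaker can be sketched as follows (the thresholds and error types are illustrative assumptions): after a few consecutive failures the circuit "opens" and calls are rejected immediately, giving the failed server time to recover.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after `max_failures` consecutive failures
    the circuit opens and calls are rejected until `reset_timeout` passes."""
    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, now=None):
        now = time.monotonic() if now is None else now
        if self.opened_at is not None:
            if now - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: server isolated")
            self.opened_at = None   # half-open: allow one trial call through
            self.failures = 0
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = now   # trip the breaker
            raise
        self.failures = 0              # a success resets the failure count
        return result

breaker = CircuitBreaker(max_failures=2, reset_timeout=30.0)

def flaky():
    raise IOError("server down")

for _ in range(2):                 # two failures trip the breaker
    try:
        breaker.call(flaky, now=0.0)
    except IOError:
        pass
try:
    breaker.call(flaky, now=1.0)   # rejected without touching the server
except RuntimeError as e:
    print(e)                       # circuit open: server isolated
```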
A failover mechanism refers to a feature that provides fault tolerance by automatically redirecting traffic from a failed component to a standby or redundant component. This ensures continuous service with minimum disruption during a failure. A failover system detects failures, switches to backup resources, and tries to recover automatically.
This is possible by continuously monitoring the health of services and components.
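The failover idea can be sketched as a router that consults a health check and sends traffic to the first healthy server, with standbys ordered after the primary (server names and the health-check function are hypothetical):

```python
class FailoverRouter:
    """Sketch of automatic failover: a health check picks the first
    healthy server, so traffic moves off a failed primary automatically."""
    def __init__(self, servers, health_check):
        self.servers = servers          # ordered: primary first, then standbys
        self.health_check = health_check

    def route(self, request):
        for server in self.servers:
            if self.health_check(server):   # continuous monitoring, simplified
                return server
        raise RuntimeError("all servers down")

healthy = {"primary": False, "standby": True}   # the primary has just failed
router = FailoverRouter(["primary", "standby"], lambda s: healthy[s])
print(router.route("GET /"))   # standby
```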
Quiz
You might think circuit breakers have a similar function to rate limiters. Why should we use one over the other?
To conclude, achieving the highest possible availability is challenging, but it is possible with the right techniques chosen during system design. Balancing load between servers, keeping backups for redundancy, limiting access to resources for fair usage, and implementing a failover mechanism backed by monitoring all help achieve this goal.