Within a time period of units, if a system is down for
units, then its availability
for the time period
is defined as
. It usually is expressed as percentage.
In other words, . We note that
is a linear function of down time
with a slope of
. Therefore, if down time increases, availability drops. However, the rate of drop does not depend on how close down time is to the total time.
Availability is measured in 9’s. For example, three 9’s means . One year has
days. So, within a year, a system with three 9’s is allowed to be down for at most
days or
hours.
A system with more nines is costlier to build.
| 9’s | down /year | products | techniques | cost |
| six | 0.5 min | 1. DNS: google cloud, azure, AWS 2. Identity access mgmt 3. Financial transactions | 1. Global load balancing (Anycast, GeoDNS) 2. Real-time fault detection and rollback 3. Edge computing for ultra-low latency 4. AI/ML-driven anomaly detection | big |
| five | 5 min | 1. DynamoDB 2. Spanner 3. Cosmos DB 4. Amzn Route 53 (DNS service) 5. Twilio | 1. Synchronous replication across regions 2. Quorum-based consensus (Paxos, Raft) 3. Self-healing systems 4. Chaos engineering for resilience | 100x |
| four | 50 min | 1. S3 standard storage 2. Google cloud storage (standard class) 3. Azure VM 4. Cloudflare CDN / Load balancer 5. Netflix streaming | 1. Multi-region replication 2. Active-active architecture 3. Traffic routing with DNS load balancing 4. Zero-downtime deployment (blue-green, canary releases) | 10x |
| three | 9 hr | 1. Slack/Zoom 2. Microsoft Teams 3. Gmail/Drive 4. Amzn RDS Multi-az | 1. Multi-zone deployment 2. Automated backups 3. Basic auto-scaling 4. Active-passive failover | 2x |
| two | 0.5 week | Budget hosting | 1. Load balancing (basic) 2. Scheduled maintenance 3. Hot standbys (manual switchover) | 1x |
Availability is not perfect. Two systems both having three 9’s of availability may be perceived differently by users. Say in a given year, one system had a big SEV that consumed the entire 8.76 hours of downtime in a single day. The other system was down 43.5 minutes every month. Which one is better?
Leave a comment