Availability

Within a time period of t units, if a system is down for d units, then its availability A for the time period t is defined as A = \frac{t - d}{t}. It usually is expressed as percentage.

In other words, A = 1 - \frac{d}{t}. We note that A is a linear function of down time d with a slope of -\frac{1}{t}. Therefore, if down time increases, availability drops. However, the rate of drop does not depend on how close down time is to the total time.

Availability is measured in 9’s. For example, three 9’s means A = 99.9 \%. One year has 365 days. So, within a year, a system with three 9’s is allowed to be down for at most (1-0.999) \times 365 = 0.365 days or 8.76 hours.

A system with more nines is costlier to build.

9’s down
/year
productstechniquescost
six0.5 min1. DNS: google cloud, azure, AWS
2. Identity access mgmt
3. Financial transactions
1. Global load balancing (Anycast, GeoDNS)
2. Real-time fault detection and rollback
3. Edge computing for ultra-low latency
4. AI/ML-driven anomaly detection
big
five5 min1. DynamoDB
2. Spanner
3. Cosmos DB
4. Amzn Route 53 (DNS service)
5. Twilio
1. Synchronous replication across regions
2. Quorum-based consensus (Paxos, Raft)
3. Self-healing systems
4. Chaos engineering for resilience
100x
four50 min1. S3 standard storage
2. Google cloud storage (standard class)
3. Azure VM
4. Cloudflare CDN / Load balancer
5. Netflix streaming
1. Multi-region replication
2. Active-active architecture
3. Traffic routing with DNS load balancing
4. Zero-downtime deployment (blue-green, canary releases)
10x
three9 hr1. Slack/Zoom
2. Microsoft Teams
3. Gmail/Drive
4. Amzn RDS Multi-az
1. Multi-zone deployment
2. Automated backups
3. Basic auto-scaling
4. Active-passive failover
2x
two0.5 weekBudget hosting1. Load balancing (basic)
2. Scheduled maintenance
3. Hot standbys (manual switchover)
1x

Availability is not perfect. Two systems both having three 9’s of availability may be perceived differently by users. Say in a given year, one system had a big SEV that consumed the entire 8.76 hours of downtime in a single day. The other system was down 43.5 minutes every month. Which one is better?

Leave a comment