SLA, SLO & SLI

An SLA (Service Level Agreement) is a formal commitment between a service provider and its customers. It defines what level of service will be delivered, often with financial or contractual penalties if the provider fails to meet it. To design reliable systems, you start by defining SLAs and then architect infrastructure to meet those guarantees.


SLA vs SLO vs SLI

👉 SLI = metric → SLO = target → SLA = binding agreement.
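
One way to picture the chain is as three nested records. The sketch below is illustrative only: the class names, fields, penalty text, and all percentages are assumptions made up for the example, not taken from any real contract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLI:
    """A measured indicator, e.g. the percentage of successful requests."""
    name: str
    value: float  # measured value, as a percentage


@dataclass(frozen=True)
class SLO:
    """An internal target for one SLI."""
    sli: SLI
    target: float  # minimum acceptable percentage

    def is_met(self) -> bool:
        return self.sli.value >= self.target


@dataclass(frozen=True)
class SLA:
    """A contractual promise, typically looser than the internal SLO."""
    slo: SLO
    committed: float  # percentage promised to customers
    penalty: str      # what the provider owes on a breach

    def is_breached(self) -> bool:
        return self.slo.sli.value < self.committed


# Hypothetical numbers for illustration.
availability = SLI(name="request success rate", value=99.93)
internal_target = SLO(sli=availability, target=99.95)
contract = SLA(slo=internal_target, committed=99.9, penalty="10% service credit")

print(internal_target.is_met())  # False: the internal target was missed
print(contract.is_breached())    # False: the contractual floor still holds
```

Keeping the SLO stricter than the SLA means a missed internal target acts as an early warning before any contractual penalty applies.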


Key SLA Components

  1. Availability
    Defines uptime guarantees.
      Availability = (Total Time – Downtime) ÷ Total Time × 100
    
    • 99.9% → 8.77 hrs downtime/year.
    • 99.99% → 52.6 mins downtime/year.
  2. Performance
    • Response time (e.g., P95 < 100ms).
    • Throughput (e.g., ≥500 req/sec).
  3. Error Rate
    Acceptable % of failed requests.
      Error Rate = (Failed Requests ÷ Total Requests) × 100
    
  4. Recovery
    • RTO (Recovery Time Objective): Max acceptable downtime per incident.
    • RPO (Recovery Point Objective): Max acceptable data loss, measured as a window of time (e.g., at most the last 5 minutes of writes).
  5. Other Dimensions
    • Durability: e.g., S3 is designed for 99.999999999% (11 nines) object durability.
    • Consistency: strong vs eventual.
    • Support SLAs: e.g., P1 ticket response within 1 hour.
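
The availability and error-rate formulas above, plus a P95 latency check, can be sketched in a few lines. This is a minimal sketch: the function names and sample latencies are hypothetical, a year is taken as 365.25 days (which is what yields the 8.77 hrs and 52.6 mins figures), and the percentile uses the simple nearest-rank method rather than whatever a given monitoring tool implements.

```python
import math

MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def availability_pct(total_minutes: float, downtime_minutes: float) -> float:
    """Availability = (Total Time - Downtime) / Total Time x 100."""
    return (total_minutes - downtime_minutes) / total_minutes * 100


def allowed_downtime_minutes(target_pct: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Downtime a given availability target tolerates over the period."""
    return (1 - target_pct / 100) * period_minutes


def error_rate_pct(failed: int, total: int) -> float:
    """Error Rate = Failed Requests / Total Requests x 100."""
    return failed * 100 / total if total else 0.0


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct% of all samples are at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


print(round(allowed_downtime_minutes(99.9) / 60, 2))  # 8.77 (hours per year)
print(round(allowed_downtime_minutes(99.99), 1))      # 52.6 (minutes per year)
print(error_rate_pct(failed=25, total=10_000))        # 0.25

# Hypothetical latency samples in milliseconds.
latencies_ms = [12, 15, 20, 35, 40, 48, 60, 75, 90, 240]
print(percentile(latencies_ms, 95))                   # 240, so "P95 < 100ms" fails
```

Note how a single slow request dominates P95 with only ten samples; percentile SLIs only become meaningful over a large request volume.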

SLA and System Design

  1. SLA First, Architecture Second
    • A 99.9% SLA means tolerating up to 8.77 hrs downtime/year.
    • Higher targets require multi-AZ or multi-region redundancy.
  2. Availability Targets and Architecture
    • 90% (36.5 days downtime/year): Single AZ, minimal redundancy.
    • 99.9% (8.77 hrs/year): Single AZ with fast recovery, or Multi-AZ for resilience.
    • 99.99% (52.6 mins/year): Multi-AZ failover, automated recovery.
    • 99.999% (~5 mins/year): Multi-region active-active, very high cost.
  3. Key Considerations
    • Planned maintenance often excluded.
    • Unplanned downtime directly impacts SLA.
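    • Keep a buffer: set internal targets stricter than the external SLA so unexpected incidents do not immediately cause a breach.
    • Redundancy raises achievable availability but increases cost and complexity.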
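
The considerations above can be turned into a small downtime-budget report. This is a sketch under stated assumptions: the `Outage` record and `budget_report` function are hypothetical names, the incident list is invented, and planned maintenance is removed from both the downtime and the measured period, which is one common convention (the actual rule depends on the contract wording).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outage:
    minutes: float
    planned: bool  # True for maintenance windows the contract excludes


def budget_report(target_pct: float, period_minutes: float,
                  outages: list[Outage]) -> dict[str, float]:
    planned = sum(o.minutes for o in outages if o.planned)
    unplanned = sum(o.minutes for o in outages if not o.planned)
    measured = period_minutes - planned         # excluded windows do not count
    budget = (1 - target_pct / 100) * measured  # downtime the target tolerates
    return {
        "availability_pct": (measured - unplanned) / measured * 100,
        "budget_minutes": budget,
        "remaining_minutes": budget - unplanned,  # negative means a breach
    }


# Hypothetical 30-day month: one maintenance window and two incidents.
month = 30 * 24 * 60
report = budget_report(
    99.9, month,
    [Outage(120, planned=True), Outage(20, planned=False), Outage(10, planned=False)],
)
print({k: round(v, 2) for k, v in report.items()})
# {'availability_pct': 99.93, 'budget_minutes': 43.08, 'remaining_minutes': 13.08}
```

With a 99.9% monthly target the whole budget is about 43 minutes, so two modest incidents already consume most of it.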

Example: E-Commerce Platform

Step 1: Define SLIs (What to Measure)

➡️ These are end-to-end metrics as seen by customers. Internally, engineers also watch component health (EC2, RDS, ALB, etc.), but those are inputs, not customer-facing outcomes.


Step 2: Define SLOs (Targets for Each SLI)

Each SLO directly maps to an SLI in Step 1:

➡️ Step 1 defines variables, Step 2 sets thresholds. If SLOs aren’t realistic, the SLA that follows will be hollow.


Step 3: Define SLA (Customer-Facing Contract)

Only a subset of SLOs become contractual:

➡️ Customers don’t care about component health; they only care about request success. That’s why SLAs are expressed in terms of end-to-end outcomes, not subsystem metrics.


Step 4: Architecture to Meet SLA

The system is engineered to satisfy the SLA (Step 3) by aligning infrastructure to the SLOs (Step 2):

➡️ This isn’t random architecture — each choice directly maps back to the metrics that underpin the SLA.


Customer-Facing SLA (External View)

👉 Customers never see the internal math. They only see whether they can load the homepage and complete checkout.


Internal SLA Baseline (AWS as Our Provider)

As AWS customers, we rely on provider SLAs to know our lower bound:

Service                        SLA (Published by AWS)                      Link
CloudFront                     99.9%                                       CloudFront SLA
ALB (Elastic Load Balancing)   99.99%                                      ALB SLA
EC2 (per AZ)                   99.99%                                      EC2 SLA
RDS Multi-AZ                   99.99%                                      RDS SLA
S3                             99.99% availability, 11 nines durability    S3 SLA

With Multi-AZ deployments, availability improves because both AZs would need to fail simultaneously for the service to be unavailable. If each AZ has 99.99% availability (0.9999) and the two AZs fail independently of each other, then the probability of both failing at the same time is:

Joint failure = (1 – 0.9999) × (1 – 0.9999)
              = 0.0001 × 0.0001
              = 0.00000001  (0.000001%)

So effective availability is:

Multi-AZ Availability ≈ 1 – Joint failure
                      ≈ 1 – 0.00000001
                      ≈ 99.999999%  (8 nines)

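That’s why AWS often markets Multi-AZ RDS or EC2 setups as “highly available”: the redundancy pushes the number of nines up dramatically. Treat 8 nines as a theoretical ceiling, though. AZ failures are not fully independent (regional incidents, shared control planes, and bad deployments can hit both at once), and failover itself takes time, so real-world availability is lower.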
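
The redundancy arithmetic generalizes to any number of replicas. The sketch below assumes independent failures and that any one healthy replica can carry the full load. The second function adds a hypothetical shared-failure probability (the 0.00005 value is invented) to show how quickly a common-mode failure erodes the theoretical figure.

```python
def parallel_availability(per_replica: float, replicas: int) -> float:
    """Availability of N redundant replicas: the service is down only
    when every replica is down at the same time."""
    return 1 - (1 - per_replica) ** replicas


def with_shared_failure(redundant: float, p_shared: float) -> float:
    """Apply a common-mode failure (probability p_shared) that takes out
    all replicas together, regardless of redundancy."""
    return (1 - p_shared) * redundant


single_az = parallel_availability(0.9999, 1)
multi_az = parallel_availability(0.9999, 2)

print(f"{single_az:.6%}")  # 99.990000%
print(f"{multi_az:.6%}")   # 99.999999%

# Hypothetical: a 0.005% chance of an event that affects both AZs at once.
print(f"{with_shared_failure(multi_az, 0.00005):.6%}")  # 99.994999%
```

Once a shared failure mode exists, it dominates the result: the combined figure cannot be better than the shared component allows.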


Engineering roll-up (approx)

When a request depends on every service in the chain (a serial dependency), the system’s effective availability is roughly the product of each component’s availability:

System Availability ≈ CloudFront × ALB × EC2(Multi-AZ) × RDS(Multi-AZ) × S3
≈ 0.999 × 0.9999 × 0.99999999 × 0.99999999 × 0.9999
≈ 99.88%

➡️ This illustrates a gap: even though individual services have strong SLAs, the multiplicative effect drags the total down. To confidently offer an external 99.95% SLA (or higher), you need extra redundancy such as multi-region, edge caching, or graceful degradation (e.g., read-only mode if the database is down).
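
The roll-up can be reproduced directly. This is a rough sketch using the same figures as the calculation above; it assumes every component is required for a request to succeed and that component failures are independent. The dictionary keys are just labels.

```python
import math

# Availability figures used in the roll-up above.
components = {
    "CloudFront": 0.999,
    "ALB": 0.9999,
    "EC2 (Multi-AZ)": 0.99999999,
    "RDS (Multi-AZ)": 0.99999999,
    "S3": 0.9999,
}

# Serial dependency: the request succeeds only if every component is up.
system = math.prod(components.values())
downtime_hours = (1 - system) * 365.25 * 24
weakest = min(components, key=components.get)

print(f"{system:.2%}")           # 99.88%
print(round(downtime_hours, 1))  # 10.5 (hours per year)
print(weakest)                   # CloudFront
```

The weakest component caps the result: no amount of redundancy behind CloudFront lifts this chain above 99.9% unless the edge tier itself gains a fallback.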


Key Insight
