Reliability Overview

Building systems that continue operating correctly even when parts fail.

Key Concepts

Concept               Description
Redundancy            Duplicate critical components
Failover              Automatic switch to backup
Circuit breaker       Stop calling failing services
Graceful degradation  Reduce functionality instead of crashing
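
To make two of these concepts concrete, here is a minimal sketch of a circuit breaker combined with graceful degradation. It is an illustration under stated assumptions, not a reference implementation: the names `CircuitBreaker`, `CircuitOpenError`, and `fetch_with_fallback`, along with the `failure_threshold` and `reset_timeout` parameters, are hypothetical and chosen for this example only.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed default
        self.reset_timeout = reset_timeout          # seconds, assumed default
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still open: stop calling the failing service.
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def fetch_with_fallback(breaker, primary, fallback):
    """Graceful degradation: return a reduced result instead of crashing."""
    try:
        return breaker.call(primary)
    except Exception:
        return fallback()
```

A caller might wrap a remote lookup as `fetch_with_fallback(breaker, load_live_prices, load_cached_prices)`, where both function names are placeholders. While the circuit is open, the primary function is not invoked at all, which gives the failing service time to recover.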

Observability

Three pillars for understanding production systems:

  1. Metrics — numeric measurements over time
  2. Logs — discrete event records
  3. Traces — request flow across services
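
As a rough illustration of how the three signals differ, the sketch below emits one of each while handling a single request, using only the Python standard library. The handler `handle_request`, the counter name `requests_total`, and the span dictionary layout are assumptions made for this example; a production system would typically send these signals to dedicated metrics, logging, and tracing backends.

```python
import logging
import time
import uuid
from collections import Counter

metrics = Counter()                              # metric: numeric value over time
logger = logging.getLogger("reliability_demo")   # hypothetical logger name


def handle_request(path, trace_id=None):
    """Handle one request and emit a metric, a log line, and a trace span."""
    # A trace ID is passed between services so spans can be joined later.
    trace_id = trace_id or uuid.uuid4().hex
    start = time.perf_counter()

    metrics["requests_total"] += 1               # metric: increment a counter
    logger.info("request handled path=%s trace_id=%s", path, trace_id)  # log: one event

    # Trace: a span records which operation ran, for which request, and how long.
    return {
        "trace_id": trace_id,
        "name": "handle_request",
        "duration_s": time.perf_counter() - start,
    }
```

Including the trace ID in the log line is a design choice that lets an operator move from an aggregate metric, to the matching log events, to the full request path.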

Further Reading