Services Engineering Reading List
A reading list for services engineering, with a focus on cloud infrastructure services.
We welcome suggestions.
Papers
- Fault Injection in Production (Allspaw)
- Making Reliable Distributed Systems in the Presence of Software Errors (Armstrong)
- Highly Available Transactions: Virtues and Limitations (Bailis et al.)
- The Incident Command System (Bigley and Roberts)
- The Chubby Lock Service for Loosely-Coupled Distributed Systems (Burrows)
- Bigtable: a Distributed Storage System for Structured Data (Chang et al.)
- Spanner: Google’s Globally-Distributed Database (Corbett et al.)
- Dynamo: Amazon’s Highly Available Key-Value Store (DeCandia et al.)
- MapReduce: Simplified Data Processing on Large Clusters (Dean and Ghemawat)
- The Google File System (Ghemawat et al.)
- On Designing and Deploying Internet-Scale Services (Hamilton)
- Kafka: A Distributed Messaging System for Log Processing (Kreps et al.)
- Weathering the Unexpected (Krishnan)
- The Unified Logging Infrastructure for Data Analytics at Twitter (Lee et al.)
- Automatic Management of Partitioned, Replicated Search Services (Leibert et al.)
- Learning to Embrace Failure (Limoncelli et al.)
- Scaling Big Data Mining Infrastructure: The Twitter Experience (Lin and Ryaboy)
- Dremel: Interactive Analysis of Web-Scale Datasets (Melnik et al.)
- Out of the Tar Pit (Moseley and Marks)
- The Log-Structured Merge-Tree (O'Neil et al.)
- In Search of an Understandable Consensus Algorithm (Ongaro and Ousterhout)
- Failure Trends in a Large Disk Drive Population (Pinheiro et al.)
- Fallacies of Distributed Computing Explained (Rotem-Gal-Oz)
- F1 - The Fault-Tolerant Distributed RDBMS Supporting Google’s Ad Business (Shute et al.)
- Dapper, a Large-Scale Distributed Systems Tracing Infrastructure (Sigelman et al.)
- Resilient Distributed Datasets: a Fault-Tolerant Abstraction for In-Memory Cluster Computing (Zaharia et al.)
- The Human Side of Postmortems (Zwieback)
- Crew Resource Management: a Positive Change for the Fire Service
Posts
- Resilience Engineering: Part I, Part II (Allspaw)
- Systems Engineering: a Great Definition (Allspaw)
- Chaos Monkey Released Into The Wild (Bennett and Tseitlin)
- Some Rules for Engineering and Operations (Black)
- Service Level Disagreements Part I, Part II (Black)
- Incuriosity Will Kill Your Infrastructure (Crayford)
- My Philosophy on Alerting (Ewaschuk)