Services Engineering Guide - Cloud Infrastructure Services

Services Engineering Reading List

A reading list for services engineering, with a focus on cloud infrastructure services.

We welcome suggestions.

Papers

Fault Injection in Production (Allspaw)
Making Reliable Distributed Systems in the Presence of Software Errors (Armstrong)
Highly Available Transactions: Virtues and Limitations (Bailis et al.)
The Incident Command System (Bigley and Roberts)
The Chubby Lock Service for Loosely Coupled Distributed Systems (Burrows)
Bigtable: a Distributed Storage System for Structured Data (Chang et al.)
Spanner: Google’s Globally-Distributed Database (Corbett et al.)
Dynamo: Amazon’s Highly Available Key-Value Store (DeCandia et al.)
MapReduce: Simplified Data Processing on Large Clusters (Dean and Ghemawat)
The Google File System (Ghemawat et al.)
On Designing and Deploying Internet Scale Services (Hamilton)
Kafka: A Distributed Messaging System for Log Processing (Kreps et al.)
Weathering the Unexpected (Krishnan)
The Unified Logging Infrastructure for Data Analytics at Twitter (Lee et al.)
Automatic Management of Partitioned, Replicated Search Services (Leibert et al.)
Learning to Embrace Failure (Limoncelli et al.)
Scaling Big Data Mining Infrastructure: The Twitter Experience (Lin and Ryaboy)
Dremel: Interactive Analysis of Web-Scale Datasets (Melnik et al.)
Out of the Tar Pit (Moseley and Marks)
The Log-Structured Merge-Tree (O'Neil et al.)
In Search of an Understandable Consensus Algorithm (Ongaro and Ousterhout)
Failure Trends in a Large Disk Drive Population (Pinheiro et al.)
Fallacies of Distributed Computing Explained (Rotem-Gal-Oz)
F1 - The Fault-Tolerant Distributed RDBMS Supporting Google’s Ad Business (Shute et al.)
Dapper, A Large Scale Distributed Systems Tracing Infrastructure (Sigelman et al.)
Resilient Distributed Datasets: a Fault-Tolerant Abstraction for In-Memory Cluster Computing (Zaharia et al.)
The Human Side of Postmortems (Zwieback)
Crew Resource Management: a Positive Change for the Fire Service

Posts

Resilience Engineering: Part I, Part II (Allspaw)
Systems Engineering: a Great Definition (Allspaw)
Chaos Monkey Released Into The Wild (Bennett and Tseitlin)
Some Rules for Engineering and Operations (Black)
Service Level Disagreements Part I, Part II (Black)
Incuriosity Will Kill Your Infrastructure (Crayford)
My Philosophy on Alerting (Ewaschuk)