designing resilient systems

Designing resilient systems: Circuit Breakers or Retries? (Part 2)

This post can also be found on the Grab Engineering blog. This post is the second part of the series on Designing Resilient Systems. In Part 1, we looked...
designing resilient systems

Designing resilient systems: Circuit Breakers or Retries? (Part 1)

This post can also be found on the Grab Engineering blog. This post is the first of a two-part series on Circuit Breakers and Retries, where we will introduce...

Post-Incident Questionnaire for Engineers

This is my light-hearted attempt to help engineers get the most value out of a downtime incident.

Getting Started with SRE – Step 2 – Dashboards

Introduction: In Part 1 of this series, we introduced the goal of understanding how our system performs by adding instrumentation. This article expands on this goal by taking...

Getting Started with SRE – Step 1 – Instrumentation

Introduction: After we set ourselves the goal of system reliability, our first goal must be awareness. Simply put, we must be fully aware of how...