Demo 04 ~3 min

Configuration Incident Post-Mortem

Someone changes a config value at 3am. Throughput drops 95%. How fast can you figure out what happened?

Before: hours digging through git blame → After: instant who/what/when/why

The Problem

It's Monday morning. PagerDuty fires: pipeline throughput has dropped 95% since 3am Saturday. The on-call engineer pulls up the dashboard and confirms the numbers are real. The pipeline is processing at a crawl. The config file shows max_batch_size = 50, but nobody on the team remembers changing it. The last known good value was 1000.

Now begins the archaeology. Git blame shows the file was modified, but from a service account — not a personal login. Slack search turns up nothing. The change happened during a maintenance window, but the maintenance runbook doesn't mention batch size. After two hours of cross-referencing timestamps, someone finds a thread in a private channel: an ops engineer reduced batch size to debug an out-of-memory error, fixed the OOM, but forgot to revert. Two hours of investigation for a two-second revert.

What You Will See

Without Makoto
━━━ PART 1: Config Change WITHOUT DBOM ━━━

📋 Current config:
   max_batch_size = 1000
   retry_limit = 3
   timeout_seconds = 30

🔧 Someone changes max_batch_size: 1000 → 50

🚨 Monday morning: pipeline throughput dropped 95%!

❓ Incident questions:
   • What changed?
   • Who changed it?
   • Why was it changed?
   • What was the previous value?

🔍 Investigation (all we have):
   max_batch_size = 50 (updated: 2025-01-15T03:22:00Z)
   ❌ No record of who changed it or why
   ❌ No record of the previous value
   ❌ Mean time to resolution: hours of git-blame and Slack archaeology
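The unaudited flow above can be sketched in a few lines of Python. The store and function names here are illustrative, not Makoto's API: a plain key-value update overwrites the old value in place, so after the change the only evidence left is the new value and a modification timestamp.

```python
from datetime import datetime, timezone

# A plain config store: a dict plus a last-modified timestamp.
# Once a value is overwritten, the old value, the author, and the
# reason are gone -- exactly the situation in Part 1.
config = {"max_batch_size": 1000, "retry_limit": 3, "timeout_seconds": 30}
updated_at = None

def set_value(key, value):
    global updated_at
    config[key] = value  # the previous value is destroyed here
    updated_at = datetime.now(timezone.utc).isoformat()

set_value("max_batch_size", 50)

# Monday-morning investigation: this is all we have.
print(config["max_batch_size"], updated_at)
# a new value and a timestamp; no who, no why, no previous value
```

Everything the incident responder needs (who, why, what it was before) was never written down anywhere, which is why the investigation falls back to git blame and Slack search.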
With Makoto
━━━ PART 2: Config Change WITH DBOM Audit ━━━

📋 Current config:
   max_batch_size = 1000

🔧 Changing max_batch_size: 1000 → 50
📝 DBOM audit entry created automatically

🚨 Monday morning: same alert — pipeline throughput dropped 95%!

🔍 Investigation (with DBOM):
   ✅ Changed at: 2025-01-15T03:22:00Z
   ✅ Key: max_batch_size
   ✅ Old value: 1000
   ✅ New value: 50
   ✅ Reason: Reducing batch size to debug OOM errors in prod
   ✅ Changed by: github:ops-team

   Resolution: revert max_batch_size to 1000, debug memory issue separately
   ✅ Mean time to resolution: 2 minutes
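For contrast, here is a minimal sketch of the audited flow. The `AuditedConfig` class and its field names are hypothetical stand-ins for whatever Makoto's DBOM layer actually records; the point is only that every write appends an entry capturing old value, new value, actor, and reason before the value changes.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AuditEntry:
    # One record per change: everything the incident responder needs.
    timestamp: str
    key: str
    old_value: object
    new_value: object
    changed_by: str
    reason: str

@dataclass
class AuditedConfig:
    # Hypothetical audited store; Makoto's real DBOM API may differ.
    values: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)

    def set(self, key, value, changed_by, reason):
        # Record the audit entry first, then apply the change.
        self.audit_log.append(AuditEntry(
            timestamp=datetime.now(timezone.utc).isoformat(),
            key=key,
            old_value=self.values.get(key),
            new_value=value,
            changed_by=changed_by,
            reason=reason,
        ))
        self.values[key] = value

cfg = AuditedConfig(values={"max_batch_size": 1000})
cfg.set("max_batch_size", 50,
        changed_by="github:ops-team",
        reason="Reducing batch size to debug OOM errors in prod")

# Monday-morning investigation: the last audit entry answers
# who/what/when/why, including the value to revert to.
last = cfg.audit_log[-1]
print(last.old_value, last.new_value, last.changed_by, last.reason)
```

The two-minute resolution in the demo is just `cfg.audit_log[-1].old_value`: the previous value rides along with the record of the change, so the revert target never has to be reconstructed from history.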

Run It

$ git clone https://github.com/makoto-project/makoto
$ cd makoto/demos/04-config-postmortem
$ uv run demo.py
Key Insight: Configuration is data too — and data without provenance turns every incident into a detective story.

What Else This Handles