Demo 04
~3 min
Configuration Incident Post-Mortem
Someone changes a config value at 3am. Throughput drops 95%. How fast can you figure out what happened?
Hours digging through git blame
→
Instant who/what/when/why
The Problem
It's Monday morning. PagerDuty fires: pipeline throughput has dropped 95% since 3am Saturday. The on-call engineer pulls up the dashboard and confirms the numbers are real. The pipeline is processing at a crawl. The config file shows max_batch_size = 50, but nobody on the team remembers changing it. The last known good value was 1000.
Now begins the archaeology. Git blame shows the file was modified, but from a service account — not a personal login. Slack search turns up nothing. The change happened during a maintenance window, but the maintenance runbook doesn't mention batch size. After two hours of cross-referencing timestamps, someone finds a thread in a private channel: an ops engineer reduced batch size to debug an out-of-memory error, fixed the OOM, but forgot to revert. Two hours of investigation for a two-second revert.
What You Will See
Without Makoto
━━━ PART 1: Config Change WITHOUT DBOM ━━━
📋 Current config:
max_batch_size = 1000
retry_limit = 3
timeout_seconds = 30
🔧 Someone changes max_batch_size: 1000 → 50
🚨 Monday morning: pipeline throughput dropped 95%!
❓ Incident questions:
• What changed?
• Who changed it?
• Why was it changed?
• What was the previous value?
🔍 Investigation (all we have):
max_batch_size = 50 (updated: 2025-01-15T03:22:00Z)
❌ No record of who changed it or why
❌ No record of the previous value
❌ Mean time to resolution: hours of git-blame and Slack archaeology
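The failure mode above takes only a few lines to reproduce: a plain key-value store overwrites in place, so the previous value, the actor, and the reason are gone the moment the write lands. This is a generic sketch, not Makoto code:

```python
# A bare config store: writes overwrite in place, so provenance is lost.
config = {"max_batch_size": 1000, "retry_limit": 3, "timeout_seconds": 30}

# The 3am change: who made it, why, and the old value are not recorded anywhere.
config["max_batch_size"] = 50

# Monday morning, this is the only fact the store can give you.
print(config["max_batch_size"])
```

File mtime gives you *when*; everything else is git-blame and Slack archaeology.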
With Makoto
━━━ PART 2: Config Change WITH DBOM Audit ━━━
📋 Current config:
max_batch_size = 1000
🔧 Changing max_batch_size: 1000 → 50
📝 DBOM audit entry created automatically
🚨 Monday morning: same alert — pipeline throughput dropped 95%!
🔍 Investigation (with DBOM):
✅ Changed at: 2025-01-15T03:22:00Z
✅ Key: max_batch_size
✅ Old value: 1000
✅ New value: 50
✅ Reason: Reducing batch size to debug OOM errors in prod
✅ Changed by: github:ops-team
Resolution: revert max_batch_size to 1000, debug memory issue separately
✅ Mean time to resolution: 2 minutes
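The audit trail in Part 2 can be sketched as a thin wrapper that appends an entry on every write. This is an illustrative stand-in, not the actual Makoto/DBOM API; the `AuditedConfig` class, its field names, and the `history` method are all assumptions for the sketch:

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class AuditEntry:
    """One recorded config change (hypothetical schema, mirroring the demo output)."""
    key: str
    old_value: object
    new_value: object
    reason: str
    changed_by: str
    changed_at: str


class AuditedConfig:
    """Config store that records an audit entry on every write (sketch, not Makoto's API)."""

    def __init__(self, initial: dict):
        self._values = dict(initial)
        self._log: list[AuditEntry] = []

    def set(self, key, value, *, reason: str, changed_by: str) -> AuditEntry:
        entry = AuditEntry(
            key=key,
            old_value=self._values.get(key),
            new_value=value,
            reason=reason,
            changed_by=changed_by,
            changed_at=datetime.now(timezone.utc).isoformat(),
        )
        self._values[key] = value
        self._log.append(entry)
        return entry

    def history(self, key) -> list[AuditEntry]:
        return [e for e in self._log if e.key == key]


cfg = AuditedConfig({"max_batch_size": 1000})
cfg.set(
    "max_batch_size", 50,
    reason="Reducing batch size to debug OOM errors in prod",
    changed_by="github:ops-team",
)

# Monday morning: the whole who/what/when/why is one lookup away.
last = cfg.history("max_batch_size")[-1]
print(f"{last.key}: {last.old_value} -> {last.new_value}")
print(f"reason: {last.reason} | by: {last.changed_by} | at: {last.changed_at}")
```

The design point is that the write path and the audit path are the same call, so the record cannot be skipped or forgotten the way a Slack post can.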
Run It
$ git clone https://github.com/makoto-project/makoto
$ cd makoto/demos/04-config-postmortem
$ uv run demo.py
Key Insight: Configuration is data too — and data without provenance turns every incident into a detective story.
What Else This Handles
- Feature flag changes that cause performance regressions in production
- Database connection pool settings modified during on-call firefighting
- Infrastructure-as-code drift where Terraform state diverges from reality
- Compliance audits requiring a complete history of configuration changes