Demo 02
~3 min
The Reproducibility Gap
A reviewer asks four simple questions about your dataset. Without a DBOM, you can't answer any of them.
Origin and processing unknown → Full lineage chain verified
The Problem
A postdoc submits a paper to a top-tier journal. The methods section references experiment_v2.csv — ten rows of normalized spectroscopy data used to support the paper's central claim. The reviewer asks four questions: Which version of the raw data is this? Were outliers removed, and by what method? Who prepared this file? Can I get back to the original instrument readings?
The postdoc can't answer a single one. The file was passed through three hands over six months. The person who ran the outlier removal has graduated. The normalization parameters live in a Jupyter notebook that nobody can find. The paper is rejected with a one-line note: "insufficient provenance for reproducibility." Six months of research, blocked by a CSV with no memory of its own history.
What You Will See
Without Makoto
━━━ PART 1: Without DBOM — The Mystery Dataset ━━━
📄 Found: experiment_v2.csv (10 rows)
Columns: sample_id,trial_date,raw_measurement,normalized_value,label
❓ Questions the reviewer is asking:
• Which version of the raw data is this?
• Were outliers removed? By what method?
• Who prepared this file?
• Can I get back to the original instrument readings?
❌ No answers. The paper cites "experiment_v2.csv" — that's it.
Reviewer rejects: 'insufficient provenance for reproducibility'
With Makoto
━━━ PART 2: With DBOM — Full Lineage ━━━
📄 Found: experiment_v2.dbom.json
DBOM ID: dbom-a3f7c2e1-...
Created: 2025-01-10T14:32:00Z
📦 Source:
  URI: lab-instruments://spectro-3/run-2025-01-10
  Hash: a1b2c3d4e5f6...
  Format: csv
🔏 Signature:
  Signer: github:dr-chen-lab
  Algorithm: sha256
🔗 Lineage (3 steps):
  Step 1: Raw spectroscopy reading from Instrument-3
    Tool: spectro-reader v4.2
    Input: n/a
    Output: a1b2c3d4e5f6...
  Step 2: Outlier removal (IQR method, threshold 1.5)
    Tool: scipy-outlier-filter v1.0
    Input: a1b2c3d4e5f6...
    Output: b2c3d4e5f6a7...
  Step 3: Normalization (z-score, mean=0.483, std=0.127)
    Tool: sklearn-normalizer v2.1
    Input: b2c3d4e5f6a7...
    Output: c3d4e5f6a7b8...
🔍 Chain verification:
  ✅ Step 1 → Step 2: hashes link correctly
  ✅ Step 2 → Step 3: hashes link correctly
  ✅ Final output matches source hash
✅ Full provenance chain verified — every question answered.
Reviewer accepts: 'lineage is complete and verifiable'
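The chain verification shown in the output above can be sketched roughly as follows. This is a minimal illustration, not Makoto's actual implementation: the `verify_chain` function, the dictionary field names (`step`, `input`, `output`), and the stand-in byte strings are all assumptions made for the example. The check itself is simple: each step's input hash must equal the previous step's output hash.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of some bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_chain(lineage):
    """Compare each step's input hash with the previous step's output hash.

    Returns a list of (from_step, to_step, ok) tuples, one per adjacent pair.
    The record layout is hypothetical, loosely modeled on the demo output.
    """
    results = []
    for prev, curr in zip(lineage, lineage[1:]):
        results.append((prev["step"], curr["step"], prev["output"] == curr["input"]))
    return results


# Stand-in artifacts (assumed); a real run would hash the actual files.
raw = b"raw instrument readings"
filtered = b"readings after outlier removal"
normalized = b"readings after normalization"

lineage = [
    {"step": 1, "input": None, "output": sha256_hex(raw)},
    {"step": 2, "input": sha256_hex(raw), "output": sha256_hex(filtered)},
    {"step": 3, "input": sha256_hex(filtered), "output": sha256_hex(normalized)},
]

links = verify_chain(lineage)
for a, b, ok in links:
    status = "hashes link correctly" if ok else "BROKEN LINK"
    print(f"Step {a} -> Step {b}: {status}")

# If someone edits an intermediate file by hand, the recorded chain breaks.
tampered = [dict(step) for step in lineage]
tampered[1]["output"] = sha256_hex(b"edited by hand")
tampered_links = verify_chain(tampered)
```

In this sketch, tampering with the output of step 2 leaves the step 1 to step 2 link intact but breaks the step 2 to step 3 link, which is how a reviewer would locate where the undocumented change happened.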
Run It
$ git clone https://github.com/makoto-project/makoto
$ cd makoto/demos/02-reproducibility-gap
$ uv run demo.py
Key Insight: Reproducibility isn't about trusting people — it's about data that can answer questions about itself.
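One way to picture data that can answer questions about itself is to record each processing step at the moment it runs. The sketch below is a hypothetical illustration using only the Python standard library; it does not use the tools named in the demo output, and `record_step`, `content_hash`, the tool labels, and the sample readings are all invented for the example. It applies the same two techniques the demo mentions (IQR outlier removal with threshold 1.5, then z-score normalization) and logs the method, parameters, and input/output hashes for each.

```python
import hashlib
import json
import statistics


def content_hash(values):
    """Hash a list of numbers via its JSON serialization (an assumed convention)."""
    return hashlib.sha256(json.dumps(values).encode("utf-8")).hexdigest()


def remove_outliers_iqr(values, threshold=1.5):
    """Keep values within [Q1 - t*IQR, Q3 + t*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - threshold * iqr, q3 + threshold * iqr
    return [v for v in values if low <= v <= high]


def zscore(values):
    """Return z-scored values plus the mean and population std that were used."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values], mean, std


def record_step(lineage, description, tool, inputs, outputs):
    """Append a lineage record; the field names are hypothetical."""
    lineage.append({
        "description": description,
        "tool": tool,
        "input": content_hash(inputs) if inputs is not None else None,
        "output": content_hash(outputs),
    })


# Made-up readings with one obvious outlier.
raw = [0.41, 0.45, 0.47, 0.48, 0.50, 0.52, 0.55, 3.90]
lineage = []

record_step(lineage, "Raw reading", "example-reader", None, raw)

filtered = remove_outliers_iqr(raw, threshold=1.5)
record_step(lineage, "Outlier removal (IQR method, threshold 1.5)",
            "example-filter", raw, filtered)

normalized, mean, std = zscore(filtered)
record_step(lineage, f"Normalization (z-score, mean={mean:.3f}, std={std:.3f})",
            "example-normalizer", filtered, normalized)

print(json.dumps(lineage, indent=2))
```

Because the normalization parameters and the outlier method are written into the record when the step runs, they no longer depend on a notebook or on a person who may have left the lab.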
What Else This Handles
- Journal submissions requiring data availability statements with verifiable provenance
- Clinical trial datasets where processing steps must be auditable by regulators
- Collaborative research where datasets pass between institutions over months
- Reproducibility challenges in ML research where training data preprocessing matters