Demo 02 ~3 min

The Reproducibility Gap

A reviewer asks four simple questions about your dataset. Without a DBOM, you can't answer any of them.

Without DBOM: origin and processing unknown. With DBOM: full lineage chain verified.

The Problem

A postdoc submits a paper to a top-tier journal. The methods section references experiment_v2.csv — ten rows of normalized spectroscopy data used to support the paper's central claim. The reviewer asks four questions: Which version of the raw data is this? Were outliers removed, and by what method? Who prepared this file? Can I get back to the original instrument readings?

The postdoc can't answer a single one. The file was passed through three hands over six months. The person who ran the outlier removal has graduated. The normalization parameters live in a Jupyter notebook that nobody can find. The paper is rejected with a one-line note: "insufficient provenance for reproducibility." Six months of research, blocked by a CSV with no memory of its own history.

What You Will See

Without Makoto
━━━ PART 1: Without DBOM — The Mystery Dataset ━━━

📄 Found: experiment_v2.csv (10 rows)
   Columns: sample_id,trial_date,raw_measurement,normalized_value,label

❓ Questions the reviewer is asking:
   • Which version of the raw data is this?
   • Were outliers removed? By what method?
   • Who prepared this file?
   • Can I get back to the original instrument readings?

❌ No answers. The paper cites "experiment_v2.csv" — that's it.
   Reviewer rejects: 'insufficient provenance for reproducibility'

With Makoto
━━━ PART 2: With DBOM — Full Lineage ━━━

📄 Found: experiment_v2.dbom.json
   DBOM ID: dbom-a3f7c2e1-...
   Created: 2025-01-10T14:32:00Z

📦 Source:
   URI:    lab-instruments://spectro-3/run-2025-01-10
   Hash:   a1b2c3d4e5f6...
   Format: csv

🔏 Signature:
   Signer:    github:dr-chen-lab
   Algorithm: sha256

🔗 Lineage (3 steps):
   Step 1: Raw spectroscopy reading from Instrument-3
           Tool:   spectro-reader v4.2
           Input:  n/a
           Output: a1b2c3d4e5f6...
   Step 2: Outlier removal (IQR method, threshold 1.5)
           Tool:   scipy-outlier-filter v1.0
           Input:  a1b2c3d4e5f6...
           Output: b2c3d4e5f6a7...
   Step 3: Normalization (z-score, mean=0.483, std=0.127)
           Tool:   sklearn-normalizer v2.1
           Input:  b2c3d4e5f6a7...
           Output: c3d4e5f6a7b8...

🔍 Chain verification:
   ✅ Step 1 → Step 2: hashes link correctly
   ✅ Step 2 → Step 3: hashes link correctly
   ✅ Final output matches source hash

✅ Full provenance chain verified — every question answered.
   Reviewer accepts: 'lineage is complete and verifiable'
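
The "hashes link correctly" checks in the output can be pictured as comparing each step's input hash with the previous step's output hash. The sketch below is only an illustration of that idea, not Makoto's actual implementation: the `LineageStep` class, its field names, and the `broken_links` helper are hypothetical, and the hash strings are the truncated placeholders from the demo output rather than real digests.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LineageStep:
    """One processing step (hypothetical structure, not the real DBOM schema)."""
    description: str
    input_hash: Optional[str]  # None for the first step (shown as "n/a" in the demo)
    output_hash: str


def broken_links(steps: List[LineageStep]) -> List[int]:
    """Return the 1-based numbers of steps whose input does not match the previous output."""
    broken = []
    for number, (prev, cur) in enumerate(zip(steps, steps[1:]), start=2):
        if cur.input_hash != prev.output_hash:
            broken.append(number)
    return broken


# Truncated placeholder hashes copied from the demo output, not real digests.
chain = [
    LineageStep("Raw spectroscopy reading", None, "a1b2c3d4e5f6..."),
    LineageStep("Outlier removal (IQR, threshold 1.5)", "a1b2c3d4e5f6...", "b2c3d4e5f6a7..."),
    LineageStep("Normalization (z-score)", "b2c3d4e5f6a7...", "c3d4e5f6a7b8..."),
]

# Hypothetical tampering: step 2 is re-run and produces a different output,
# but step 3 still records the old input hash.
tampered = [
    chain[0],
    LineageStep("Outlier removal (edited)", "a1b2c3d4e5f6...", "ffff..."),
    chain[2],
]

print(broken_links(chain))     # []
print(broken_links(tampered))  # [3]
```

An empty list means every step consumed exactly what the previous step produced. In the tampered chain, step 3 is reported because its recorded input no longer matches step 2's output.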

Run It

$ git clone https://github.com/makoto-project/makoto
$ cd makoto/demos/02-reproducibility-gap
$ uv run demo.py
Key Insight: Reproducibility isn't about trusting people — it's about data that can answer questions about itself.
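
One ingredient of data that can "answer questions about itself" is a content hash recorded next to the file. The sketch below uses Python's standard `hashlib` to show why such a hash ties a record to one exact version of a file. The CSV bytes are invented for illustration, and nothing here is taken from Makoto's code.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical one-row CSV contents, invented for illustration only.
original = b"sample_id,normalized_value\nS1,0.12\n"
edited = b"sample_id,normalized_value\nS1,0.13\n"

recorded = content_hash(original)          # what a provenance record would store
print(content_hash(original) == recorded)  # True: the file is unchanged
print(content_hash(edited) == recorded)    # False: any edit changes the digest
```

Because a single changed digit changes the digest, a recorded hash lets a reviewer confirm that the file in hand is the one the record describes.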

What Else This Handles