Demo 02 ~3 min

The Reproducibility Gap

A reviewer asks four simple questions about your dataset. Without a DBOM, you can't answer any of them.

Origin and processing unknown → Full lineage chain verified

The Problem

A postdoc submits a paper to a top-tier journal. The methods section references experiment_v2.csv — ten rows of normalized spectroscopy data used to support the paper's central claim. The reviewer asks four questions: Which version of the raw data is this? Were outliers removed, and by what method? Who prepared this file? Can I get back to the original instrument readings?

The postdoc can't answer a single one. The file was passed through three hands over six months. The person who ran the outlier removal has graduated. The normalization parameters live in a Jupyter notebook that nobody can find. The paper is rejected with a one-line note: "insufficient provenance for reproducibility." Six months of research, blocked by a CSV with no memory of its own history.

What You Will See

Without Makoto

━━━ PART 1: Without DBOM — The Mystery Dataset ━━━

📄 Found: experiment_v2.csv (10 rows)
   Columns: sample_id,trial_date,raw_measurement,normalized_value,label

❓ Questions the reviewer is asking:
   • Which version of the raw data is this?
   • Were outliers removed? By what method?
   • Who prepared this file?
   • Can I get back to the original instrument readings?

❌ No answers. The paper cites "experiment_v2.csv" — that's it.
   Reviewer rejects: 'insufficient provenance for reproducibility'

With Makoto

━━━ PART 2: With DBOM — Full Lineage ━━━

📄 Found: experiment_v2.dbom.json
   DBOM ID: dbom-a3f7c2e1-...
   Created: 2025-01-10T14:32:00Z

📦 Source:
   URI: lab-instruments://spectro-3/run-2025-01-10
   Hash: a1b2c3d4e5f6...
   Format: csv

🔏 Signature:
   Signer: github:dr-chen-lab
   Algorithm: sha256

🔗 Lineage (3 steps):
   Step 1: Raw spectroscopy reading from Instrument-3
      Tool: spectro-reader v4.2
      Input: n/a
      Output: a1b2c3d4e5f6...
   Step 2: Outlier removal (IQR method, threshold 1.5)
      Tool: scipy-outlier-filter v1.0
      Input: a1b2c3d4e5f6...
      Output: b2c3d4e5f6a7...
   Step 3: Normalization (z-score, mean=0.483, std=0.127)
      Tool: sklearn-normalizer v2.1
      Input: b2c3d4e5f6a7...
      Output: c3d4e5f6a7b8...

🔍 Chain verification:
   ✅ Step 1 → Step 2: hashes link correctly
   ✅ Step 2 → Step 3: hashes link correctly
   ✅ Final output matches source hash

✅ Full provenance chain verified — every question answered.
   Reviewer accepts: 'lineage is complete and verifiable'
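
The step-to-step link checks in the chain verification output can be sketched in a few lines of Python. This is a minimal illustration and not Makoto's actual implementation: the `Step` type, its field names, and `verify_links` are hypothetical, and the hash strings are the truncated placeholders from the demo output, not real digests.

```python
from typing import Optional, TypedDict


class Step(TypedDict):
    # Hypothetical shape for one lineage entry; the real DBOM schema may differ.
    description: str
    input: Optional[str]  # None for the first step, which has no upstream hash
    output: str


def verify_links(steps: list[Step]) -> list[bool]:
    """Return one flag per adjacent pair of steps: True when the earlier
    step's output hash equals the later step's input hash."""
    return [prev["output"] == nxt["input"] for prev, nxt in zip(steps, steps[1:])]


# Placeholder hashes taken from the truncated demo output; real digests
# would be full-length hex strings.
lineage: list[Step] = [
    {"description": "Raw spectroscopy reading", "input": None, "output": "a1b2c3d4e5f6"},
    {"description": "Outlier removal (IQR, threshold 1.5)", "input": "a1b2c3d4e5f6", "output": "b2c3d4e5f6a7"},
    {"description": "Normalization (z-score)", "input": "b2c3d4e5f6a7", "output": "c3d4e5f6a7b8"},
]

for i, ok in enumerate(verify_links(lineage), start=1):
    status = "hashes link correctly" if ok else "BROKEN LINK"
    print(f"Step {i} -> Step {i + 1}: {status}")
```

If any intermediate file were swapped or edited without a recorded step, the corresponding pair would no longer match and the break would show up at exactly that link.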

Run It

$ git clone https://github.com/makoto-project/makoto

$ cd makoto/demos/02-reproducibility-gap

$ uv run demo.py

Key Insight: Reproducibility isn't about trusting people — it's about data that can answer questions about itself.
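
One way to read this: each processing step writes down its own method, parameters, and input and output hashes as it runs, so nothing depends on a notebook someone has to find later. The sketch below assumes a hypothetical lineage entry format and helper names (`digest`, `record`), uses only the Python standard library in place of the tools named in the demo, and runs on made-up readings.

```python
import hashlib
import json
import statistics


def digest(values):
    """SHA-256 hex digest of a JSON encoding of the values (an assumed canonical form)."""
    return hashlib.sha256(json.dumps(values).encode("utf-8")).hexdigest()


def iqr_filter(values, k=1.5):
    """Keep values within k * IQR of the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = k * (q3 - q1)
    return [v for v in values if q1 - spread <= v <= q3 + spread]


def zscore(values):
    """Return z-scored values plus the mean and standard deviation used."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values], mean, std


def record(lineage, description, before, after):
    """Append a lineage entry linking the input hash to the output hash."""
    lineage.append({
        "description": description,
        "input": digest(before) if before is not None else None,
        "output": digest(after),
    })


# Made-up readings for illustration only; not data from the demo.
raw = [0.41, 0.47, 0.52, 0.45, 0.50, 0.49, 0.44, 0.53, 0.48, 1.90]

lineage = []
record(lineage, "Raw reading", None, raw)

cleaned = iqr_filter(raw)
record(lineage, "Outlier removal (IQR method, threshold 1.5)", raw, cleaned)

normalized, mean, std = zscore(cleaned)
record(lineage, f"Normalization (z-score, mean={mean:.3f}, std={std:.3f})", cleaned, normalized)

print(json.dumps(lineage, indent=2))
```

Because the parameters are captured in the step description at the moment they are computed, the reviewer's questions about method and thresholds can be answered from the record alone.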

What Else This Handles

Journal submissions requiring data availability statements with verifiable provenance
Clinical trial datasets where processing steps must be auditable by regulators
Collaborative research where datasets pass between institutions over months
Reproducibility challenges in ML research where training data preprocessing matters
