Demo 05 ~3 min

AI Dataset Verification

Your ML training dataset looks fine — but one poisoned label can compromise every prediction. Can you tell the difference?

Without verification: dataset loaded on blind trust. With Makoto: hash mismatch blocks training.

The Problem

Your ML team downloads a training dataset from a shared drive. It's an ImageNet sample — ten images with labels, nothing unusual. The team lead uploaded it last month, or maybe the intern did. Nobody's quite sure. The file name looks right, the row count matches expectations, and the columns are correct. Training begins.

What nobody notices is that one label has been changed: a "tench" became a "goldfish." It's a subtle change — both are fish, both are plausible labels. But that single modification is a data poisoning attack. The model will learn a wrong association, and every future prediction involving those classes will be slightly off. The attack is invisible because the dataset has no memory of what it should contain. There's no hash to check, no signer to verify, no lineage to trace. The file simply says "trust me."
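The core of the fix is that a cryptographic digest does have a memory: flipping even one label changes a file's SHA-256 hash entirely. A minimal sketch (the CSV rows here are illustrative, not the demo's actual data):

```python
import hashlib

# Illustrative rows: one image file and its label per line.
original = "img_001.jpg,tench\nimg_002.jpg,goldfish\n"
poisoned = "img_001.jpg,goldfish\nimg_002.jpg,goldfish\n"  # one label flipped

h_orig = hashlib.sha256(original.encode()).hexdigest()
h_poisoned = hashlib.sha256(poisoned.encode()).hexdigest()

# The digests differ completely, so the tampering is detectable
# as soon as anything bothers to compare them.
print(h_orig == h_poisoned)  # False
```

Without a recorded expected digest, though, there is nothing to compare against, which is exactly the gap the DBOM closes.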

What You Will See

Without Makoto
━━━ PART 1: Training WITHOUT Verification ━━━
📥 Loading dataset: imagenet_sample.csv (10 images)
   Source: downloaded from shared drive
❓ Is this the official dataset?
❓ Has anyone modified the labels?
❓ Who uploaded this version?
🤷 No way to know. Training proceeds on blind trust.
❌ If labels were poisoned, the model learns wrong — silently.
With Makoto
━━━ PART 2: Training WITH DBOM Verification ━━━
📥 Loading dataset: imagenet_sample.csv
   DBOM found: imagenet_sample.dbom.json
🔍 Step 1: Hash verification
   Expected: a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6...
   Actual:   a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6...
✅ Hash matches — file is unmodified
🔏 Step 2: Signer verification
   Signed by: github:ml-data-team
✅ Trusted signer confirmed
🔗 Step 3: Lineage (2 steps)
   Step 1: Downloaded from ImageNet official mirror
           Tool: dataset-downloader v1.0
   Step 2: Label verification against ImageNet taxonomy
           Tool: label-verifier v2.0
✅ All checks passed — safe to train
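The three checks above reduce to a short gate in front of training. A hedged sketch, assuming the DBOM is a JSON file with `sha256`, `signer`, and `lineage` fields (field names and the trust policy are illustrative, not Makoto's actual schema):

```python
import hashlib
import json
from pathlib import Path

TRUSTED_SIGNERS = {"github:ml-data-team"}  # assumed trust policy

def verify_dbom(dataset_path: str, dbom_path: str) -> None:
    """Run the three checks from Part 2; raise on any failure."""
    dbom = json.loads(Path(dbom_path).read_text())

    # Step 1: hash verification — recompute the digest and compare.
    actual = hashlib.sha256(Path(dataset_path).read_bytes()).hexdigest()
    if actual != dbom["sha256"]:
        raise ValueError("HASH MISMATCH — training blocked")

    # Step 2: signer verification (identity check only in this sketch;
    # real signature validation needs the signer's key material).
    if dbom["signer"] not in TRUSTED_SIGNERS:
        raise ValueError(f"untrusted signer: {dbom['signer']}")

    # Step 3: lineage — require at least one recorded provenance step.
    if not dbom.get("lineage"):
        raise ValueError("no lineage recorded")

    print("All checks passed — safe to train")
```

Calling `verify_dbom("imagenet_sample.csv", "imagenet_sample.dbom.json")` before the training loop turns "trust me" into a check that either passes or stops the pipeline.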
Tampered Dataset Detected
━━━ PART 3: Tampered Dataset → Instant Detection ━━━
🦹 Attacker scenario: one label changed (tench → goldfish)
   Modified 1 of 10 labels — a subtle poisoning attack
🔍 Hash verification:
   Expected: a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6...
   Actual:   f9e8d7c6b5a4938271605f4e3d2c1b0a...
❌ HASH MISMATCH DETECTED
🛑 Training blocked — dataset integrity compromised
📋 Action: alert data team, quarantine file, check access logs
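The response actions listed in Part 3 can themselves be automated. A minimal sketch of the quarantine step (the directory name and log format are assumptions, not part of the demo):

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path

def quarantine(path: str, quarantine_dir: str = "quarantine") -> Path:
    """Move a file that failed verification aside and log the incident."""
    qdir = Path(quarantine_dir)
    qdir.mkdir(exist_ok=True)
    dest = qdir / Path(path).name
    shutil.move(path, dest)  # remove the file from the training pipeline's reach
    stamp = datetime.now(timezone.utc).isoformat()
    with (qdir / "incident.log").open("a") as log:
        log.write(f"{stamp} hash mismatch: {path} quarantined\n")
    return dest
```

Alerting the data team and pulling access logs are environment-specific, but the pattern is the same: the failed check triggers the response, rather than relying on someone noticing.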

Run It

$ git clone https://github.com/makoto-project/makoto
$ cd makoto/demos/05-ai-dataset-verification
$ uv run demo.py

Key Insight: In AI, your model is only as trustworthy as your training data — and trust requires verification, not assumption.

What Else This Handles