Apache Spark + Makoto Integration Concept

Stage-level attestations via SparkListener — works on EMR, Databricks, OSS.

Note: This page explores how Makoto Levels could be implemented on Apache Spark. It is a conceptual integration proposal — illustrative, not a shipped library. The patterns shown use real Apache Spark APIs; the Makoto pieces are sketches you (or we) could build out.

What is Apache Spark?

Apache Spark is the workhorse of large-scale batch and streaming. Every job is a DAG of stages; every stage reads and writes typed datasets. SparkListener gives you a hook on every stage boundary — perfect for emitting per-stage Transform attestations without changing user code.

SparkListener APISubscribe to stage start/end events for free attestations
UDF for Hashing`makoto_hash()` UDF computes content digests in parallel
Catalog IntegrationUnity Catalog + Iceberg tables surface lineage already
Structured StreamingPer-batch attestations on streaming queries

Integration Approach

Primary pattern: SparkListener + UDF + DataFrame writer wrapper. Below are the integration options ordered by lift required.

How Makoto attaches to Apache Spark

  • makoto-spark JAR — Drop the JAR on the classpath, configure `spark.extraListeners=dev.makoto.AttestationListener`. Every stage emits an attestation.
  • DataFrame writer wrapper — `df.makoto.write.parquet(path, level=2)` instead of `df.write.parquet(path)` — explicit attestation per output.
  • Iceberg/Delta hook — Catalog table commit triggers Origin attestation referencing the snapshot ID.
  • Streaming foreachBatch — Structured Streaming queries emit a Stream-window attestation per micro-batch.

Conceptual Code Example

Concept: SparkListener emits an attestation per stage

No user code changes — listener picks up every stage boundary

// Configure once at session startup
val spark = SparkSession.builder()
  .appName("sales-pipeline")
  .config("spark.extraListeners", "dev.makoto.AttestationListener")
  .config("makoto.level", "2")
  .config("makoto.signing.key", sys.env("MAKOTO_KEY_ID"))
  .config("makoto.dbom.sink", "s3://attestations/sales/")
  .getOrCreate()

// User code is unchanged
val orders = spark.read.parquet("s3://raw/orders/")

val cleaned = orders
  .withColumn("email_hash", makoto_hash($"email"))   // udf computes & records digest
  .filter($"status" === "paid")

cleaned.write
  .mode("overwrite")
  .parquet("s3://curated/orders/")

// The listener emits, for every stage:
//   - input dataset digests (Origin)
//   - transform predicate
//   - output dataset digest (Transform)
//   - signed DSSE envelope to s3://attestations/sales/

Potential Use Cases

Lakehouse Provenance

Iceberg or Delta tables ship with attestations matched to snapshot IDs.

ML Feature Pipelines

Every feature view emits a DBOM consumed by your model registry.

Cross-region Replication

Prove the destination dataset is byte-identical to the source after replication.

Hard-Multi-Tenant Clusters

Per-job DBOMs prove tenant A's job didn't pollute tenant B's data.

Interested in Apache Spark + Makoto?

This is a conceptual integration. If you're shipping Apache Spark pipelines and want to add Makoto attestations, open an issue or reach out — we'd love to scope a real implementation.

Learn about Apache Spark Read Makoto Spec All Integrations