✨ Apache Spark + Makoto Integration Concept
Stage-level attestations via SparkListener — works on EMR, Databricks, OSS.
What is Apache Spark?
Apache Spark is the workhorse of large-scale batch and streaming processing. Every job is a DAG of stages, and every stage reads and writes datasets, either from external storage or across shuffle boundaries. The SparkListener interface exposes callbacks for stage submission and completion, which makes it a natural place to emit per-stage Transform attestations without changing user code.
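To make the stage-boundary hook concrete, here is a minimal sketch of a listener that reacts to completed stages. It relies only on Spark's public `SparkListener` API; `StageRecord`, `StageLoggingListener`, and the default value for `makoto.level` are hypothetical names chosen for illustration and are not part of any published Makoto interface. It prints a summary line where a real integration would presumably build and sign an attestation.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

// Hypothetical record type, for illustration only.
final case class StageRecord(
    stageId: Int,
    attempt: Int,
    name: String,
    succeeded: Boolean,
    bytesRead: Long,
    bytesWritten: Long
)

// Classes named in spark.extraListeners are created reflectively. Spark uses
// a constructor taking a SparkConf if one exists, otherwise a no-arg one.
class StageLoggingListener(conf: SparkConf) extends SparkListener {

  // "makoto.level" mirrors the config key used on this page; the default
  // of "1" is an assumption.
  private val level: String = conf.get("makoto.level", "1")

  override def onStageCompleted(event: SparkListenerStageCompleted): Unit = {
    val info = event.stageInfo
    // Guard against missing metrics rather than assuming they are present.
    val metrics = Option(info.taskMetrics)
    val record = StageRecord(
      stageId = info.stageId,
      attempt = info.attemptNumber(),
      name = info.name,
      succeeded = info.failureReason.isEmpty,
      bytesRead = metrics.map(_.inputMetrics.bytesRead).getOrElse(0L),
      bytesWritten = metrics.map(_.outputMetrics.bytesWritten).getOrElse(0L)
    )
    // A real listener would sign and persist the record; this one prints it.
    println(s"level=$level ${StageLoggingListener.render(record)}")
  }
}

object StageLoggingListener {
  // Pure formatting helper, kept separate so it can be checked without Spark running.
  def render(r: StageRecord): String =
    s"stage=${r.stageId} attempt=${r.attempt} ok=${r.succeeded} " +
      s"in=${r.bytesRead}B out=${r.bytesWritten}B name=${r.name}"
}
```

Listener callbacks run on Spark's listener bus thread, so anything slow (hashing, signing, network writes) is probably better handed to a separate queue than done inline.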
Integration Approach
Primary pattern: SparkListener + UDF + DataFrame writer wrapper. The integration options below are ordered by the effort each one requires.
How Makoto attaches to Apache Spark
- makoto-spark JAR: Drop the JAR on the classpath and set `spark.extraListeners=dev.makoto.AttestationListener`. Every stage then emits an attestation.
- DataFrame writer wrapper: Call `df.makoto.write.parquet(path, level=2)` instead of `df.write.parquet(path)` to get an explicit attestation per output.
- Iceberg/Delta hook: A catalog table commit triggers an Origin attestation that references the snapshot ID.
- Streaming foreachBatch: Structured Streaming queries emit a Stream-window attestation per micro-batch.
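The writer-wrapper option can be sketched with a Scala implicit class that adds a `makoto` member to any DataFrame. Everything Makoto-specific here is an assumption: `AttestationSink`, `MakotoSyntax`, and the choice to record the output path, level, and schema are illustrative stand-ins, not the actual makoto-spark API. Only `df.write.mode(...).parquet(...)` and `df.schema.json` are real Spark calls.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical sink for attestation records; not a published Makoto interface.
trait AttestationSink {
  def record(outputPath: String, level: Int, schemaJson: String): Unit
}

object MakotoSyntax {

  final class MakotoWriter(df: DataFrame, sink: AttestationSink) {
    // Write first, record second: if the write throws, nothing is recorded.
    def parquet(
        path: String,
        level: Int = 1,
        mode: SaveMode = SaveMode.ErrorIfExists // same default as DataFrameWriter
    ): Unit = {
      df.write.mode(mode).parquet(path)
      sink.record(path, level, df.schema.json)
    }
  }

  final class MakotoOps(df: DataFrame, sink: AttestationSink) {
    def write: MakotoWriter = new MakotoWriter(df, sink)
  }

  // Enables df.makoto.write.parquet(...) wherever an implicit sink is in scope.
  implicit class MakotoDataFrame(df: DataFrame)(implicit sink: AttestationSink) {
    def makoto: MakotoOps = new MakotoOps(df, sink)
  }
}

// Usage, assuming a DataFrame `df` and some sink implementation:
//   import MakotoSyntax._
//   implicit val sink: AttestationSink = (p, l, s) => println(s"$p level=$l")
//   df.makoto.write.parquet("/tmp/out", level = 2)
```

Passing the sink implicitly keeps the call site close to plain `df.write`, at the cost of making the dependency less visible. An explicit parameter would be an equally reasonable design.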
Conceptual Code Example
Concept: SparkListener emits an attestation per stage
No user code changes — listener picks up every stage boundary
```scala
import org.apache.spark.sql.SparkSession

// Configure once at session startup
val spark = SparkSession.builder()
  .appName("sales-pipeline")
  .config("spark.extraListeners", "dev.makoto.AttestationListener")
  .config("makoto.level", "2")
  .config("makoto.signing.key", sys.env("MAKOTO_KEY_ID"))
  .config("makoto.dbom.sink", "s3://attestations/sales/")
  .getOrCreate()

import spark.implicits._ // enables the $"column" syntax used below

// User code is unchanged apart from the optional makoto_hash UDF,
// which is assumed to be provided by the makoto-spark JAR
val orders = spark.read.parquet("s3://raw/orders/")
val cleaned = orders
  .withColumn("email_hash", makoto_hash($"email")) // udf computes & records digest
  .filter($"status" === "paid")

cleaned.write
  .mode("overwrite")
  .parquet("s3://curated/orders/")

// The listener emits, for every stage:
//   - input dataset digests (Origin)
//   - transform predicate
//   - output dataset digest (Transform)
//   - signed DSSE envelope to s3://attestations/sales/
```
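The "signed DSSE envelope" step can be illustrated with the JDK alone. The sketch below follows the DSSE pre-authentication encoding (PAE), signs it with ECDSA, and renders the envelope fields `payload`, `payloadType`, and `signatures`. The object name `DsseSketch` and the choice of ECDSA over P-256 are assumptions made for the example; how Makoto actually manages keys and payload types is not specified on this page.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.security.{PrivateKey, PublicKey, Signature}
import java.util.Base64

object DsseSketch {

  // DSSE pre-authentication encoding:
  //   "DSSEv1" SP LEN(type) SP type SP LEN(body) SP body
  // where LEN is the byte length written as ASCII decimal.
  def pae(payloadType: String, payload: Array[Byte]): Array[Byte] = {
    val typeLen = payloadType.getBytes(UTF_8).length
    val header = s"DSSEv1 $typeLen $payloadType ${payload.length} "
    header.getBytes(UTF_8) ++ payload
  }

  // The signature covers the PAE bytes, not the raw payload.
  def sign(payloadType: String, payload: Array[Byte], key: PrivateKey): Array[Byte] = {
    val signer = Signature.getInstance("SHA256withECDSA")
    signer.initSign(key)
    signer.update(pae(payloadType, payload))
    signer.sign()
  }

  def verify(payloadType: String, payload: Array[Byte], sig: Array[Byte], key: PublicKey): Boolean = {
    val verifier = Signature.getInstance("SHA256withECDSA")
    verifier.initVerify(key)
    verifier.update(pae(payloadType, payload))
    verifier.verify(sig)
  }

  // Renders the envelope as JSON. Assumes payloadType and keyId contain no
  // characters that need JSON escaping; a real implementation would use a
  // JSON library instead of string interpolation.
  def envelope(payloadType: String, payload: Array[Byte], keyId: String, sig: Array[Byte]): String = {
    val b64 = Base64.getEncoder
    s"""{"payload":"${b64.encodeToString(payload)}","payloadType":"$payloadType",""" +
      s""""signatures":[{"keyid":"$keyId","sig":"${b64.encodeToString(sig)}"}]}"""
  }
}
```

Because the payload type is part of the signed bytes, a verifier that checks the signature also confirms how the payload is meant to be interpreted.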
Potential Use Cases
Lakehouse Provenance
Iceberg or Delta tables ship with attestations matched to snapshot IDs.
ML Feature Pipelines
Every feature view emits a DBOM consumed by your model registry.
Cross-region Replication
Prove the destination dataset is byte-identical to the source after replication.
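One way to back a byte-identical claim is to compare per-file digests of the two copies. The sketch below does this for local directories using only the JDK; `ReplicaCheck` and its methods are hypothetical names, and a real deployment would read from object storage and stream large files rather than load them fully into memory.

```scala
import java.nio.file.{Files, Path}
import java.security.MessageDigest

object ReplicaCheck {

  def sha256Hex(bytes: Array[Byte]): String =
    MessageDigest.getInstance("SHA-256").digest(bytes).map("%02x".format(_)).mkString

  // Maps each regular file under root (by relative path) to its SHA-256 digest.
  def manifest(root: Path): Map[String, String] = {
    val stream = Files.walk(root)
    try {
      val entries = Map.newBuilder[String, String]
      val it = stream.iterator()
      while (it.hasNext) {
        val p = it.next()
        if (Files.isRegularFile(p)) {
          // Reads the whole file into memory; acceptable for a sketch only.
          entries += root.relativize(p).toString -> sha256Hex(Files.readAllBytes(p))
        }
      }
      entries.result()
    } finally stream.close()
  }

  // True only if both trees hold the same relative paths with the same bytes.
  def identical(source: Path, destination: Path): Boolean =
    manifest(source) == manifest(destination)
}
```

This compares files as stored, so it suits a straight copy. Two independent Spark writes of the same data would typically produce different part-file names and would not match under this check.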
Hard Multi-Tenant Clusters
Per-job DBOMs prove tenant A's job didn't pollute tenant B's data.
Interested in Apache Spark + Makoto?
This is a conceptual integration. If you're shipping Apache Spark pipelines and want to add Makoto attestations, open an issue or reach out — we'd love to scope a real implementation.