Follow along in the notebook — run this example live in your browser, no install needed.

The Problem: The Same Rule, Written Three Times

Here's something most data engineering teams don't notice until it bites them. The rule status IN ('active', 'churned') exists in your codebase at least three times:

three_places.py — spot the problem
# 1. Spark job — Silver layer processing
df.filter(col("status").isin(["active", "churned"]))

# 2. Lambda / ACA — Bronze ingestion pre-check
assert df["status"].is_in(["active", "churned"]).all()

# 3. dbt — Gold layer test
# accepted_values: {column: status, values: [active, churned]}

Product adds "trial" to the status enum. One PR. Which of those three breaks tonight?
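To make that question concrete, here's a toy sketch in plain Python of how each copy of the rule reacts when "trial" rows arrive (the lists and variable names are illustrative, not LakeLogic code):

```python
allowed = ["active", "churned"]            # the stale list, duplicated three times
incoming = ["active", "trial", "churned"]  # "trial" just shipped

# Copy 1 (Spark-style filter): silently DROPS the new rows
kept = [s for s in incoming if s in allowed]
print(kept)  # ['active', 'churned'] -- the "trial" rows vanish with no error

# Copy 2 (ingestion assert): fails loudly and blocks the whole load
load_blocked = False
try:
    assert all(s in allowed for s in incoming)
except AssertionError:
    load_blocked = True  # the 2am page

# Copy 3 (dbt accepted_values): fails at test time, after the data already landed
print(load_blocked)  # True
```

Three copies, three different failure modes: one loses data silently, one blocks the pipeline, one complains after the fact.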

This is the real Spark Tax — not the cost of running Spark, but the cost of maintaining the same validation logic separately in every engine across your stack.

When the rule drifts between environments, your CI passes and your production job fails. Or worse: both pass, but they're enforcing different rules, and you don't know it until a bad record reaches your Gold layer six weeks later.

The Fix: Same Data Contract, Different Engine

LakeLogic is engine-agnostic. You write your quality rules once in YAML, and the same data contract runs identically on Polars, Spark, DuckDB, or Pandas. No code changes. No separate validation logic per environment.

contract.yaml
# Same YAML. Runs on Polars, Spark, DuckDB, Pandas.
quality:
  row_rules:
    - sql: "revenue >= 0"
    - sql: "email LIKE '%@%.%'"
    - sql: "status IN ('active', 'churned', 'pending')"

quarantine:
  enabled: true
  target: "quarantine/customers"
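Under the hood, a runner has to turn those row rules into a good/bad split. Here is a minimal, engine-free sketch of that quarantine logic in plain Python (the rows and predicates are illustrative; LakeLogic's actual execution is engine-specific):

```python
import re

rows = [
    {"revenue": 12.5, "email": "a@shop.com", "status": "active"},
    {"revenue": -3.0, "email": "b@shop.com", "status": "churned"},   # violates revenue >= 0
    {"revenue": 40.0, "email": "not-an-email", "status": "pending"}, # violates the email rule
]

# The three row rules from the contract, expressed as predicates
rules = [
    lambda r: r["revenue"] >= 0,
    lambda r: re.search(r"@.+\.", r["email"]) is not None,
    lambda r: r["status"] in ("active", "churned", "pending"),
]

good = [r for r in rows if all(rule(r) for rule in rules)]
bad = [r for r in rows if not all(rule(r) for rule in rules)]  # bound for quarantine/customers
print(len(good), len(bad))  # 1 2
```

The point of the YAML contract is that this split logic is written once, and each engine compiles the same rules into its native expressions.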

Running on Polars (Development + Lightweight Production)

validate.py
from lakelogic import DataProcessor

# Uses Polars under the hood — no JVM, no cluster
result = DataProcessor("contract.yaml").run_source()
print(f"✅ {len(result.good)} valid | ❌ {len(result.bad)} quarantined")

Running on Spark (Large-Scale Production)

validate_spark.py
from lakelogic import DataProcessor

# Same contract — just change the engine
result = DataProcessor(
    "contract.yaml",
    engine="spark",
).run_source()

The key insight: you don't rewrite your validation logic when you move between engines. Develop your data contract locally on Polars (instant startup, no JVM to spin up), then deploy the same contract to Spark only when your data volumes actually require distributed compute.

When to Use What

| Scenario | Engine | Why |
| --- | --- | --- |
| Files under 1 GB | Polars | 10x faster startup than Spark, runs in a $5/month container |
| 1–50 GB files | DuckDB | Single-node analytical power, no cluster |
| 50+ GB / multi-TB | Spark | Distributed compute is justified at this scale |
| Local development | Polars or Pandas | Instant startup, no infrastructure needed |

The rule of thumb: if your file fits in memory on a single node, you don't need a distributed compute cluster to validate it.
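That rule of thumb can live in code. A hedged sketch of an engine-picking helper, with thresholds taken from the table above (the helper and its pairing with DataProcessor are our convention, not a LakeLogic API):

```python
import os

GB = 1024 ** 3

def pick_engine(path: str) -> str:
    """Map file size to the cheapest adequate engine, per the table above."""
    size = os.path.getsize(path)
    if size < 1 * GB:
        return "polars"   # fits in memory, no cluster needed
    if size < 50 * GB:
        return "duckdb"   # single-node analytical engine
    return "spark"        # distributed compute is justified

# e.g. DataProcessor("contract.yaml", engine=pick_engine("data/orders.csv")).run_source()
```

File size is a crude proxy (compression and row width matter too), but it beats defaulting every job to a cluster.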

The Real Cost: Maintenance, Drift, and the 2am Incident

The Spark Tax isn't primarily a cloud-bill problem; it's an engineering reliability problem. Every copy of the same validation logic is maintenance debt: each one must be updated, tested, and deployed in lockstep, and any one of them can silently drift from the others.

LakeLogic doesn't tell you to stop using Spark. It tells you to stop reimplementing the same rules for every engine you touch. Use Spark where you genuinely need distributed compute — 50 GB+ loads, multi-TB reprocessing, streaming at scale. Use Polars or DuckDB for everything else. And let one data contract drive both.

Try It Yourself

terminal
# Install
pip install lakelogic

# Bootstrap contracts from your landing zone
lakelogic bootstrap --landing data/ --output contracts/ \
    --registry contracts/reg.yaml --suggest-rules

# Run validation on Polars (default)
lakelogic run --contract contracts/orders.yaml --source data/orders.csv

# Generate test data to verify quarantine
lakelogic generate --contract contracts/orders.yaml \
    --rows 1000 --invalid-ratio 0.1 --preview 5

One Data Contract. Any Engine.

Write your quality rules once in YAML. Run them on Polars, Spark, DuckDB, or Pandas with no rewrites. Open source, MIT licensed.