Follow along in the notebook — run this example live in your browser, no install needed.

The Problem: The Same Rule, Written Three Times

Here's something most data engineering teams don't notice until it bites them. The rule status IN ('active', 'churned') exists in your codebase at least three times:

three_places.py — spot the problem
# 1. Spark job — Silver layer processing
df.filter(col("status").isin(["active", "churned"]))

# 2. Lambda / ACA — Bronze ingestion pre-check
assert df["status"].is_in(["active", "churned"]).all()

# 3. dbt — Gold layer test
# accepted_values: {column: status, values: [active, churned]}

Product adds "trial" to the status enum. One PR. Which of those three breaks tonight?
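To make that question concrete, here's a toy sketch in plain Python of how each copy of the rule reacts when "trial" rows arrive (the lists and variable names are illustrative, not LakeLogic code):

```python
allowed = ["active", "churned"]            # the stale list, duplicated three times
incoming = ["active", "trial", "churned"]  # "trial" just shipped

# Copy 1 (Spark-style filter): silently DROPS the new rows
kept = [s for s in incoming if s in allowed]
print(kept)  # ['active', 'churned'] -- the "trial" rows vanish with no error

# Copy 2 (ingestion assert): fails loudly and blocks the whole load
load_blocked = False
try:
    assert all(s in allowed for s in incoming)
except AssertionError:
    load_blocked = True  # the 2am page

# Copy 3 (dbt accepted_values): fails at test time, after the data already landed
print(load_blocked)  # True
```

Three copies, three different failure modes: one loses data silently, one blocks the pipeline, one complains after the fact.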

This is the real Spark Tax — not the cost of running Spark, but the cost of maintaining the same validation logic separately in every engine across your stack.

When the rule drifts between environments, your CI passes and your production job fails. Or worse: both pass, but they're enforcing different rules, and you don't know it until a bad record reaches your Gold layer six weeks later.

The Fix: Same Data Contract, Different Engine

LakeLogic is engine-agnostic. You write your quality rules once in YAML, and the same data contract runs identically on Polars, Spark, DuckDB, or Pandas. No code changes. No separate validation logic per environment.

contract.yaml
# Same YAML. Runs on Polars, Spark, DuckDB, Pandas.
quality:
  row_rules:
    - sql: "revenue >= 0"
    - sql: "email LIKE '%@%.%'"
    - sql: "status IN ('active', 'churned', 'pending')"

quarantine:
  enabled: true
  target: "quarantine/customers"
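Under the hood, a runner has to turn those row rules into a good/bad split. Here is a minimal, engine-free sketch of that quarantine logic in plain Python (the rows and predicates are illustrative; LakeLogic's actual execution is engine-specific):

```python
import re

rows = [
    {"revenue": 12.5, "email": "a@shop.com", "status": "active"},
    {"revenue": -3.0, "email": "b@shop.com", "status": "churned"},   # violates revenue >= 0
    {"revenue": 40.0, "email": "not-an-email", "status": "pending"}, # violates the email rule
]

# The three row rules from the contract, expressed as predicates
rules = [
    lambda r: r["revenue"] >= 0,
    lambda r: re.search(r"@.+\.", r["email"]) is not None,
    lambda r: r["status"] in ("active", "churned", "pending"),
]

good = [r for r in rows if all(rule(r) for rule in rules)]
bad = [r for r in rows if not all(rule(r) for rule in rules)]  # bound for quarantine/customers
print(len(good), len(bad))  # 1 2
```

The point of the YAML contract is that this split logic is written once, and each engine compiles the same rules into its native expressions.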

Running on Polars (Development + Lightweight Production)

validate.py
from lakelogic import DataProcessor

# Uses Polars under the hood — no JVM, no cluster
result = DataProcessor("contract.yaml").run_source()
print(f"✅ {len(result.good)} valid | ❌ {len(result.bad)} quarantined")

Running on Spark (Large-Scale Production)

validate_spark.py
from lakelogic import DataProcessor

# Same contract — just change the engine
result = DataProcessor(
    "contract.yaml",
    engine="spark",
).run_source()

The key insight: you don't rewrite your validation logic when you move between engines. Develop your data contract locally on Polars (instant startup, no JVM to spin up), then deploy the same contract to Spark only when your data volumes actually require distributed compute.

When to Use What

| Scenario | Engine | Why |
| --- | --- | --- |
| Files under 1 GB | Polars | 10x faster startup than Spark, runs in a $5/month container |
| 1–50 GB files | DuckDB | Single-node analytical power, no cluster |
| 50+ GB / multi-TB | Spark | Distributed compute is justified at this scale |
| Local development | Polars or Pandas | Instant startup, no infrastructure needed |

The rule of thumb: if your file fits in memory on a single node, you don't need a distributed compute cluster to validate it.
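That rule of thumb can live in code. A hedged sketch of an engine-picking helper, with thresholds taken from the table above (the helper and its pairing with DataProcessor are our convention, not a LakeLogic API):

```python
import os

GB = 1024 ** 3

def pick_engine(path: str) -> str:
    """Map file size to the cheapest adequate engine, per the table above."""
    size = os.path.getsize(path)
    if size < 1 * GB:
        return "polars"   # fits in memory, no cluster needed
    if size < 50 * GB:
        return "duckdb"   # single-node analytical engine
    return "spark"        # distributed compute is justified

# e.g. DataProcessor("contract.yaml", engine=pick_engine("data/orders.csv")).run_source()
```

File size is a crude proxy (compression and row width matter too), but it beats defaulting every job to a cluster.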

The Real Cost: Maintenance, Drift, and the 2am Incident

The Spark Tax isn't primarily a cloud-bill problem; it's an engineering reliability problem. Every copy of the same validation logic is maintenance debt: each one must be updated, tested, and deployed in lockstep, and any one of them can silently drift from the others.

LakeLogic doesn't tell you to stop using Spark. It tells you to stop reimplementing the same rules for every engine you touch. Use Spark where you genuinely need distributed compute — 50 GB+ loads, multi-TB reprocessing, streaming at scale. Use Polars or DuckDB for everything else. And let one data contract drive both.

Try It Yourself

terminal
# Install
pip install lakelogic

# Bootstrap contracts from your landing zone
lakelogic bootstrap --landing data/ --output contracts/ \
    --registry contracts/reg.yaml --suggest-rules

# Run validation on Polars (default)
lakelogic run --contract contracts/orders.yaml --source data/orders.csv

# Generate test data to verify quarantine
lakelogic generate --contract contracts/orders.yaml \
    --rows 1000 --invalid-ratio 0.1 --preview 5

One Data Contract. Any Engine.

Write your quality rules once in YAML. Run them on Polars, Spark, DuckDB, or Pandas with no rewrites. Open source, MIT licensed.