The Problem: Validation Code That Grows Forever
You're building a data pipeline with Polars. You read a CSV, Parquet, or Delta file. You need to validate it before it goes downstream. So you start writing:
```python
import polars as pl

df = pl.read_parquet("data/orders.parquet")

# Check for nulls
bad_nulls = df.filter(
    pl.col("order_id").is_null() | pl.col("customer_id").is_null()
)

# Check amount range
bad_amounts = df.filter(pl.col("amount") < 0)

# Check status enum
valid_statuses = ["pending", "shipped", "delivered", "returned"]
bad_status = df.filter(~pl.col("status").is_in(valid_statuses))

# Combine all bad rows... somehow
# Add error reasons... manually
# Track which rules each row failed... painfully
# Write good rows to output... separately
# Write bad rows to quarantine... with metadata
# Add lineage columns... don't forget the run_id
```
This works for three rules. But then someone adds a regex check. Someone else adds a dataset-level uniqueness rule. You need to track which specific rule each row failed. You need to add lineage columns. You need to write the good and bad rows to different locations.
Before you know it, you have 200 lines of validation boilerplate that needs to be replicated in every pipeline. And when you need the same checks on Spark? You rewrite it all.
Validation logic shouldn't live in Python. It should live in a contract.
The Fix: One YAML, Zero Validation Code
With LakeLogic, you define your quality rules once in YAML. The framework handles the Polars expressions, quarantine routing, error reasons, and lineage injection for you:
```yaml
info:
  version: 1
  name: "orders_silver"
  domain: "crm"
  system: "shopify"

model:
  fields:
    - name: order_id
      type: string
      required: true
    - name: customer_id
      type: string
      required: true
    - name: amount
      type: float
    - name: status
      type: string
    - name: order_date
      type: date

quality:
  row_rules:
    - sql: "amount >= 0"
    - accepted_values:
        field: status
        values: ["pending", "shipped", "delivered", "returned"]
  dataset_rules:
    - unique: order_id

quarantine:
  enabled: true
  target: "quarantine/orders"
  include_error_reason: true

lineage:
  enabled: true
```
Now your Python code is three lines:
```python
from lakelogic import DataProcessor

result = DataProcessor("contracts/orders.yaml").run_source("data/orders.parquet")
print(f"✅ {len(result.good)} validated ❌ {len(result.bad)} quarantined")

# result.good → clean rows, ready for Silver layer
# result.bad  → quarantined rows with _lakelogic_errors column
# Each bad row tells you EXACTLY which rule(s) failed
```
What You Get for Free
That YAML contract, combined with three lines of Python, gives you everything the 200-line manual version couldn't:
| Feature | Manual Polars | LakeLogic Contract |
|---|---|---|
| Null checks | Write `is_null()` per field | Set `required: true` in schema |
| Range validation | Write `filter(col > x)` | One SQL line: `"amount >= 0"` |
| Enum validation | Write `is_in([...])` | `accepted_values` shorthand |
| Uniqueness | Write `group_by().count()` | `unique: order_id` |
| Error reasons per row | Build it yourself (painful) | Automatic `_lakelogic_errors` column |
| Quarantine routing | Write separate output logic | Built-in with `target` path |
| Lineage columns | Inject manually (run_id, timestamp) | Automatic: run ID, source path, timestamp |
| 100% reconciliation | Hope for the best | Guaranteed: raw = good + bad |
| Engine portability | Rewrite for Spark/DuckDB | Same YAML, any engine |
Inspecting Quarantined Rows
The killer feature for debugging: every quarantined row includes a `_lakelogic_errors` column that tells you exactly which rule(s) failed. No guessing, no log diving:
```python
from lakelogic import DataProcessor

result = DataProcessor("contracts/orders.yaml").run_source("data/orders.parquet")

# See exactly what went wrong
print(result.bad.select(["order_id", "_lakelogic_errors"]))

# Output:
# ┌──────────┬─────────────────────────────────────┐
# │ order_id │ _lakelogic_errors                   │
# ├──────────┼─────────────────────────────────────┤
# │ ORD-4291 │ amount >= 0                         │
# │ ORD-8833 │ accepted_values: status             │
# │ null     │ required: order_id; amount >= 0     │
# └──────────┴─────────────────────────────────────┘
```
Same Contract, Switch to Spark in One Line
This is where the contract-driven approach pays off. When your data grows from 500K rows to 200M rows and you need Spark, you change one argument — not your validation logic, not your quality rules, not your quarantine setup:
```python
from lakelogic import DataProcessor

# Local development — Polars (auto-detected)
result = DataProcessor("contracts/orders.yaml").run_source("data/orders.parquet")

# Production Databricks — Spark (same contract, same rules)
result = DataProcessor("contracts/orders.yaml", engine="spark").run_source("catalog.bronze.orders")

# CI/CD testing — DuckDB (fast, zero dependencies)
result = DataProcessor("contracts/orders.yaml", engine="duckdb").run_source("data/orders.parquet")
```
Read more about why this matters in Stop the Spark Tax: One Data Contract, Any Engine.
Bootstrap from Existing Data
You don't need to write the YAML from scratch. LakeLogic can infer a full contract from your existing data — detecting types, null patterns, and even suggesting quality rules based on the data distribution:
```python
from lakelogic import infer_contract

# Point at any file — CSV, Parquet, JSON, Excel
draft = infer_contract(
    "data/orders.parquet",
    title="Orders Silver",
    suggest_rules=True,
    detect_pii=True,
)

draft.show()                          # Preview the generated YAML
draft.save("contracts/orders.yaml")   # Save and customize
```
Get Started in 60 Seconds
```bash
# Install
pip install lakelogic

# Bootstrap a contract from your data
lakelogic bootstrap --source data/orders.parquet --output contracts/

# Validate
lakelogic run --contract contracts/orders.yaml --source data/orders.parquet

# See quarantined rows
lakelogic run --contract contracts/orders.yaml --source data/orders.parquet \
  --show-quarantine
```
Your Quality Rules. One YAML. Any Engine.
Stop rewriting validation logic. Define it once, run it everywhere — Polars, Spark, DuckDB.