The Problem: Validation Code That Grows Forever
You're building a data pipeline with Polars. You read a CSV, Parquet, or Delta file. You need to validate it before it goes downstream. So you start writing:
```python
import polars as pl

df = pl.read_parquet("data/orders.parquet")

# Check for nulls
bad_nulls = df.filter(
    pl.col("order_id").is_null() | pl.col("customer_id").is_null()
)

# Check amount range
bad_amounts = df.filter(pl.col("amount") < 0)

# Check status enum
valid_statuses = ["pending", "shipped", "delivered", "returned"]
bad_status = df.filter(~pl.col("status").is_in(valid_statuses))

# Combine all bad rows... somehow
# Add error reasons... manually
# Track which rules each row failed... painfully
# Write good rows to output... separately
# Write bad rows to quarantine... with metadata
# Add lineage columns... don't forget the run_id
```
This works for three rules. But then someone adds a regex check. Someone else adds a dataset-level uniqueness rule. You need to track which specific rule each row failed. You need to add lineage columns. You need to write the good and bad rows to different locations.
Before you know it, you have 200 lines of validation boilerplate that needs to be replicated in every pipeline. And when you need the same checks on Spark? You rewrite it all.
Validation logic shouldn't live in Python. It should live in a contract.
The Fix: One YAML, Zero Validation Code
With LakeLogic, you define your quality rules once in YAML. The framework handles the Polars expressions, quarantine routing, error reasons, and lineage injection for you:
```yaml
info:
  version: 1
  name: "orders_silver"
  domain: "crm"
  system: "shopify"

model:
  fields:
    - name: order_id
      type: string
      required: true
    - name: customer_id
      type: string
      required: true
    - name: amount
      type: float
    - name: status
      type: string
    - name: order_date
      type: date

quality:
  row_rules:
    - sql: "amount >= 0"
    - accepted_values:
        field: status
        values: ["pending", "shipped", "delivered", "returned"]
  dataset_rules:
    - unique: order_id

quarantine:
  enabled: true
  target: "quarantine/orders"
  include_error_reason: true

lineage:
  enabled: true
```
Now your Python code is three lines:
```python
from lakelogic import DataProcessor

result = DataProcessor("contracts/orders.yaml").run_source("data/orders.parquet")
print(f"✅ {len(result.good)} validated ❌ {len(result.bad)} quarantined")

# result.good → clean rows, ready for Silver layer
# result.bad  → quarantined rows with _lakelogic_errors column
# Each bad row tells you EXACTLY which rule(s) failed
```
What You Get for Free
That YAML contract, combined with three lines of Python, gives you everything the 200-line manual version couldn't:
| Feature | Manual Polars | LakeLogic Contract |
|---|---|---|
| Null checks | Write `is_null()` per field | Set `required: true` in schema |
| Range validation | Write `filter(col > x)` | One SQL line: `"amount >= 0"` |
| Enum validation | Write `is_in([...])` | `accepted_values` shorthand |
| Uniqueness | Write `group_by().count()` | `unique: order_id` |
| Error reasons per row | Build it yourself (painful) | Automatic `_lakelogic_errors` column |
| Quarantine routing | Write separate output logic | Built-in with `target` path |
| Lineage columns | Inject manually (run_id, timestamp) | Automatic: run ID, source path, timestamp |
| 100% reconciliation | Hope for the best | Guaranteed: raw = good + bad |
| Engine portability | Rewrite for Spark/DuckDB | Same YAML, any engine |
Inspecting Quarantined Rows
The killer feature for debugging: every quarantined row includes a `_lakelogic_errors` column that tells you exactly which rule(s) failed. No guessing, no log diving:
```python
from lakelogic import DataProcessor

result = DataProcessor("contracts/orders.yaml").run_source("data/orders.parquet")

# See exactly what went wrong
print(result.bad.select(["order_id", "_lakelogic_errors"]))

# Output:
# ┌──────────┬─────────────────────────────────────┐
# │ order_id │ _lakelogic_errors                   │
# ├──────────┼─────────────────────────────────────┤
# │ ORD-4291 │ amount >= 0                         │
# │ ORD-8833 │ accepted_values: status             │
# │ null     │ required: order_id; amount >= 0     │
# └──────────┴─────────────────────────────────────┘
```
Same Contract, Switch to Spark in One Line
This is where the contract-driven approach pays off. When your data grows from 500K rows to 200M rows and you need Spark, you change one argument — not your validation logic, not your quality rules, not your quarantine setup:
```python
from lakelogic import DataProcessor

# Local development — Polars (auto-detected)
result = DataProcessor("contracts/orders.yaml").run_source("data/orders.parquet")

# Production Databricks — Spark (same contract, same rules)
result = DataProcessor("contracts/orders.yaml", engine="spark").run_source("catalog.bronze.orders")

# CI/CD testing — DuckDB (fast, zero dependencies)
result = DataProcessor("contracts/orders.yaml", engine="duckdb").run_source("data/orders.parquet")
```
Read more about why this matters in Stop the Spark Tax: One Data Contract, Any Engine.
Bootstrap from Existing Data
You don't need to write the YAML from scratch. LakeLogic can infer a full contract from your existing data — detecting types, null patterns, and even suggesting quality rules based on the data distribution:
```python
from lakelogic import infer_contract

# Point at any file — CSV, Parquet, JSON, Excel
draft = infer_contract(
    "data/orders.parquet",
    title="Orders Silver",
    suggest_rules=True,
    detect_pii=True,
)

draft.show()                          # Preview the generated YAML
draft.save("contracts/orders.yaml")   # Save and customize
```
Get Started in 60 Seconds
```bash
# Install
pip install lakelogic

# Bootstrap a contract from your data
lakelogic bootstrap --source data/orders.parquet --output contracts/

# Validate
lakelogic run --contract contracts/orders.yaml --source data/orders.parquet

# See quarantined rows
lakelogic run --contract contracts/orders.yaml --source data/orders.parquet \
  --show-quarantine
```
Your Quality Rules. One YAML. Any Engine.
Stop rewriting validation logic. Define it once, run it everywhere — Polars, Spark, DuckDB.