Start With What Schema Validation Actually Does
Schema validation answers one question: “Does this data have the right structure?” That means checking column names, types, and nullability. Tools like JSON Schema, Pydantic, Pandera, and dbt tests are all doing some version of this.
{
  "type": "object",
  "properties": {
    "email":     { "type": "string" },
    "age":       { "type": "integer", "minimum": 0 },
    "signup_id": { "type": "string" }
  },
  "required": ["email", "signup_id"]
}
This is genuinely useful. It catches the obvious structural problems: age arriving
as a string, email missing entirely, unexpected extra fields.
You should absolutely be doing this.
But it can’t tell you that "not-an-email" isn’t a valid email address — as far as
the schema is concerned, it’s a perfectly good string. Or what to do with bad rows
when they arrive. Or who owns this dataset, or how it moves through your Bronze → Silver → Gold layers.
Schema validation enforces structure. A data contract enforces intent.
The Gap: Same Column, Different Problem
Consider an email field. Schema validation tells you it’s
a non-null string. That’s it. These five values all pass JSON Schema:
"jane@example.com" ✓ valid "not-an-email" ✓ passes — it's a non-null string "NULL" ✓ passes — still technically a string "@" ✓ passes — has an @, so it's a "string" "" ✓ passes — empty string is still a string
A data contract adds a row-level quality rule that catches the business problem, not just the type problem:
model:
  fields:
    - name: email
      type: string
      required: true

quality:
  row_rules:
    - name: email_format
      sql: "email LIKE '%@%.%'"    # enforces actual format
      category: correctness

quarantine:
  include_error_reason: true       # bad rows routed out, not dropped
Now "not-an-email" is caught at ingest, routed to quarantine with
_reject_reason: row_rule:email_format, and the rest of the pipeline
continues on clean data.
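The quarantine pattern itself is simple. Here's a minimal sketch in plain Python — LakeLogic evaluates the SQL rule in-engine, so the predicate and helper names here are illustrative, but the rule name and `_reject_reason` format mirror the YAML above:

```python
rows = [
    {"email": "jane@example.com", "signup_id": "SU-001"},
    {"email": "not-an-email",     "signup_id": "SU-002"},
]

def email_format(row):
    # Python stand-in for the SQL rule: email LIKE '%@%.%'
    e = row["email"]
    return "@" in e and "." in e.split("@")[-1]

clean, quarantine = [], []
for row in rows:
    if email_format(row):
        clean.append(row)
    else:
        # route out with an error reason instead of dropping silently
        quarantine.append({**row, "_reject_reason": "row_rule:email_format"})

print(len(clean), len(quarantine))   # 1 1
```

The key design point: the bad row still exists, with a machine-readable reason attached, so it can be audited and replayed once the upstream issue is fixed.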
Six Things a Data Contract Does That Schema Validation Cannot
1. Business-level quality rules
SQL row rules and dataset rules that encode what the data means, not just what type it is.
age >= 0, status IN ('active','churned'), COUNT(*) > 0.
2. Quarantine & reject reasons
Bad rows route to an audit table with a _reject_reason column. Clean rows keep flowing.
No silent drops, no catastrophic failures.
3. Lineage & provenance
Every run stamped with source path, run ID, and timestamp. You can trace any row in Gold back to its Bronze source file and exact ingest timestamp.
4. Engine portability
The same YAML data contract runs on Polars locally and Spark in production. No rewriting rules per environment. One contract, any engine.
5. Ownership & governance
The info.owner field makes every dataset owned. Combined with version control, contracts
become the team’s shared agreement on what data means.
6. Schema evolution policy
Define what happens when a new field appears (evolution: strict vs
additive) and what to do with unknown fields (unknown_fields: drop).
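The `unknown_fields` policy from point 6 can be sketched in a few lines of plain Python. This is an illustration of the behaviour, not LakeLogic's actual implementation, and the known-field set is hypothetical:

```python
KNOWN_FIELDS = {"email", "signup_id"}   # illustrative contract fields

def apply_unknown_fields_policy(record, policy="drop"):
    """Split a record into contract fields and unknowns, per policy."""
    known = {k: v for k, v in record.items() if k in KNOWN_FIELDS}
    unknown = {k: v for k, v in record.items() if k not in KNOWN_FIELDS}
    if policy == "quarantine":
        return known, unknown   # unknowns routed out for review
    return known, {}            # "drop": unknowns never land in prod

record = {"email": "user@example.com", "signup_id": "SU-001",
          "phone": "+44 7700 900123"}

kept, flagged = apply_unknown_fields_policy(record, policy="drop")
print(sorted(kept))   # ['email', 'signup_id'] — phone is gone
```

Either policy makes the outcome a decision rather than an accident: the new field is dropped or flagged, never silently persisted.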
Side by Side: The Same Problem, Two Approaches
Your upstream team adds a new phone field to the CSV they send you.
Here’s what happens with each approach:
# JSON Schema — unknown fields pass silently
{
  "email": "user@example.com",
  "signup_id": "SU-001",
  "phone": "+44 7700 900123"    ← passes
}
# The schema said nothing about phone.
# It lands in your table, untested,
# with no governance policy applied.
# PII in prod, undetected.
# contract.yaml defines the policy explicitly
schema_policy:
  evolution: strict
  unknown_fields: drop          ← phone is dropped

# Or flag it for review:
schema_policy:
  unknown_fields: quarantine

# Either way: intentional behaviour,
# not a silent accident.
# The contract is the governance layer.
How a Full Data Contract Looks
Here’s a complete LakeLogic data contract for a web signups Bronze table — the same one from our quarantine post. Notice how much more information it carries than a JSON Schema:
version: "1.0.0" info: title: Bronze Web Signups version: "1.0.0" owner: marketing-data # who owns this dataset dataset: bronze_web_signups schema_policy: evolution: strict # schema changes must be intentional unknown_fields: drop # unrecognised fields never land in prod model: fields: - name: signup_id type: string required: true - name: email type: string required: true - name: event_date type: date required: true - name: age type: int quality: row_rules: - name: email_format sql: "email LIKE '%@%'" category: correctness # business correctness, not schema - name: age_positive sql: "age IS NULL OR age >= 0" dataset_rules: - name: total_signups sql: "SELECT COUNT(*) FROM bronze_web_signups" must_be_greater_than: 0 # dataset-level assertion lineage: enabled: true capture_source_path: true # trace every row to its source file capture_timestamp: true capture_run_id: true quarantine: include_error_reason: true # bad rows audited, never silently dropped
The Capability Comparison
| Capability | Schema Validation (JSON Schema, Pydantic, dbt tests) | Data Contract (LakeLogic) |
|---|---|---|
| Column type checking | ✓ Yes | ✓ Yes |
| Null / required fields | ✓ Yes | ✓ Yes |
| Business quality rules (email format, value ranges, referential checks) | ✗ Not natively | ✓ SQL row rules + dataset rules |
| Bad row handling | ✗ Raise / fail or silent drop | ✓ Quarantine with reject reason |
| Schema evolution policy | ✗ No built-in policy | ✓ strict / additive, unknown_fields policy |
| Data lineage | ✗ Not applicable | ✓ Source path, run ID, timestamp per row |
| Ownership & governance | ✗ Not applicable | ✓ owner, version, dataset declared in YAML |
| Engine portability | ✗ Per-tool, per-environment | ✓ Same YAML on Polars, Spark, DuckDB, Pandas |
| Git-versioned as code | ✓ Possible | ✓ YAML files, PR-reviewed, changelog-tracked |
Are They Competing Tools?
Not exactly. Schema validation tools like Pydantic or dbt schema.yml tests
operate at the application layer or the transformation layer.
They validate after the fact, or at one specific point in your stack.
A data contract operates at the ingest layer — the moment raw data enters your lakehouse from an external source. It’s the agreement between the producer of the data and your pipeline. Once your Bronze layer has been validated by a contract, your Silver and Gold transformations can trust what they receive.
Use schema validation within your services and transformations. Use a data contract at the boundary between data producers and your lakehouse.
When to Use What
| Situation | Recommended approach |
|---|---|
| Validating a REST API request or response | JSON Schema / Pydantic — fast, in-process |
| Testing dbt model outputs | dbt tests — native, already in your DAG |
| Ingesting CSV / Parquet from an external partner | Data contract — quarantine, lineage, ownership |
| Defining the Bronze → Silver quality gate | Data contract — business rules + quarantine |
| Testing the same rules across Polars locally and Spark in prod | Data contract — engine-agnostic YAML |
| Documenting who owns a dataset and what it means | Data contract — info.owner, version history in git |
Write Your First Data Contract in 60 Seconds
One YAML file. Runs on Polars, DuckDB, and Pandas. Quarantine included. Open source, MIT licensed.