Start With What Schema Validation Actually Does
Schema validation answers one question: “Does this data have the right structure?” That means checking column names, types, and nullability. Tools like JSON Schema, Pydantic, Pandera, and dbt tests are all doing some version of this.
{
  "type": "object",
  "properties": {
    "email":     { "type": "string" },
    "age":       { "type": "integer", "minimum": 0 },
    "signup_id": { "type": "string" }
  },
  "required": ["email", "signup_id"]
}
This is genuinely useful. It catches the obvious structural problems: age arriving
as a string, email missing entirely, unexpected extra fields.
You should absolutely be doing this.
But it can’t tell you that "not-an-email" isn’t a valid email address — as far as
the schema is concerned, it’s a perfectly good string. Or what to do with bad rows
when they arrive. Or who owns this dataset, or how it moves through your Bronze → Silver → Gold layers.
Schema validation enforces structure. A data contract enforces intent.
The Gap: Same Column, Different Problem
Consider an email field. Schema validation tells you it’s
a non-null string. That’s it. These five values all pass JSON Schema:
"jane@example.com" ✓ valid "not-an-email" ✓ passes — it's a non-null string "NULL" ✓ passes — still technically a string "@" ✓ passes — has an @, so it's a "string" "" ✓ passes — empty string is still a string
A data contract adds a row-level quality rule that catches the business problem, not just the type problem:
model:
  fields:
    - name: email
      type: string
      required: true

quality:
  row_rules:
    - name: email_format
      sql: "email LIKE '%@%.%'"    # enforces actual format
      category: correctness

quarantine:
  include_error_reason: true       # bad rows routed out, not dropped
Now "not-an-email" is caught at ingest, routed to quarantine with
_reject_reason: row_rule:email_format, and the rest of the pipeline
continues on clean data.
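The quarantine pattern itself is simple. Here's a minimal sketch in plain Python — LakeLogic evaluates the SQL rule in-engine, so the predicate and helper names here are illustrative, but the rule name and `_reject_reason` format mirror the YAML above:

```python
rows = [
    {"email": "jane@example.com", "signup_id": "SU-001"},
    {"email": "not-an-email",     "signup_id": "SU-002"},
]

def email_format(row):
    # Python stand-in for the SQL rule: email LIKE '%@%.%'
    e = row["email"]
    return "@" in e and "." in e.split("@")[-1]

clean, quarantine = [], []
for row in rows:
    if email_format(row):
        clean.append(row)
    else:
        # route out with an error reason instead of dropping silently
        quarantine.append({**row, "_reject_reason": "row_rule:email_format"})

print(len(clean), len(quarantine))   # 1 1
```

The key design point: the bad row still exists, with a machine-readable reason attached, so it can be audited and replayed once the upstream issue is fixed.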
Six Things a Data Contract Does That Schema Validation Cannot
1. Business-level quality rules
SQL row rules and dataset rules that encode what the data means, not just what type it is.
age >= 0, status IN ('active','churned'), COUNT(*) > 0.
2. Quarantine & reject reasons
Bad rows route to an audit table with a _reject_reason column. Clean rows keep flowing.
No silent drops, no catastrophic failures.
3. Lineage & provenance
Every run stamped with source path, run ID, and timestamp. You can trace any row in Gold back to its Bronze source file and exact ingest timestamp.
4. Engine portability
The same YAML data contract runs on Polars locally and Spark in production. No rewriting rules per environment. One contract, any engine.
5. Ownership & governance
The info.owner field makes every dataset owned. Combined with version control, contracts
become the team’s shared agreement on what data means.
6. Schema evolution policy
Define what happens when a new field appears (evolution: strict vs
additive) and what to do with unknown fields (unknown_fields: drop).
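The `unknown_fields` policy from point 6 can be sketched in a few lines of plain Python. This is an illustration of the behaviour, not LakeLogic's actual implementation, and the known-field set is hypothetical:

```python
KNOWN_FIELDS = {"email", "signup_id"}   # illustrative contract fields

def apply_unknown_fields_policy(record, policy="drop"):
    """Split a record into contract fields and unknowns, per policy."""
    known = {k: v for k, v in record.items() if k in KNOWN_FIELDS}
    unknown = {k: v for k, v in record.items() if k not in KNOWN_FIELDS}
    if policy == "quarantine":
        return known, unknown   # unknowns routed out for review
    return known, {}            # "drop": unknowns never land in prod

record = {"email": "user@example.com", "signup_id": "SU-001",
          "phone": "+44 7700 900123"}

kept, flagged = apply_unknown_fields_policy(record, policy="drop")
print(sorted(kept))   # ['email', 'signup_id'] — phone is gone
```

Either policy makes the outcome a decision rather than an accident: the new field is dropped or flagged, never silently persisted.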
Side by Side: The Same Problem, Two Approaches
Your upstream team adds a new phone field to the CSV they send you.
Here’s what happens with each approach:
# JSON Schema — unknown fields pass silently
{
  "email": "user@example.com",
  "signup_id": "SU-001",
  "phone": "+44 7700 900123"    ← passes
}
# The schema said nothing about phone.
# It lands in your table, untested,
# with no governance policy applied.
# PII in prod, undetected.
# contract.yaml defines the policy explicitly
schema_policy:
  evolution: strict
  unknown_fields: drop          ← phone is dropped

# Or flag it for review:
schema_policy:
  unknown_fields: quarantine

# Either way: intentional behaviour,
# not a silent accident.
# The contract is the governance layer.
How a Full Data Contract Looks
Here’s a complete LakeLogic data contract for a web signups Bronze table — the same one from our quarantine post. Notice how much more information it carries than a JSON Schema:
version: "1.0.0" info: title: Bronze Web Signups version: "1.0.0" owner: marketing-data # who owns this dataset dataset: bronze_web_signups schema_policy: evolution: strict # schema changes must be intentional unknown_fields: drop # unrecognised fields never land in prod model: fields: - name: signup_id type: string required: true - name: email type: string required: true - name: event_date type: date required: true - name: age type: int quality: row_rules: - name: email_format sql: "email LIKE '%@%'" category: correctness # business correctness, not schema - name: age_positive sql: "age IS NULL OR age >= 0" dataset_rules: - name: total_signups sql: "SELECT COUNT(*) FROM bronze_web_signups" must_be_greater_than: 0 # dataset-level assertion lineage: enabled: true capture_source_path: true # trace every row to its source file capture_timestamp: true capture_run_id: true quarantine: include_error_reason: true # bad rows audited, never silently dropped
The Capability Comparison
| Capability | Schema Validation (JSON Schema, Pydantic, dbt tests) | Data Contract (LakeLogic) |
|---|---|---|
| Column type checking | ✓ Yes | ✓ Yes |
| Null / required fields | ✓ Yes | ✓ Yes |
| Business quality rules (email format, value ranges, referential checks) | ✗ Not natively | ✓ SQL row rules + dataset rules |
| Bad row handling | ✗ Raise / fail or silent drop | ✓ Quarantine with reject reason |
| Schema evolution policy | ✗ No built-in policy | ✓ strict / additive, unknown_fields policy |
| Data lineage | ✗ Not applicable | ✓ Source path, run ID, timestamp per row |
| Ownership & governance | ✗ Not applicable | ✓ owner, version, dataset declared in YAML |
| Engine portability | ✗ Per-tool, per-environment | ✓ Same YAML on Polars, Spark, DuckDB, Pandas |
| Git-versioned as code | ✓ Possible | ✓ YAML files, PR-reviewed, changelog-tracked |
Are They Competing Tools?
Not exactly. Schema validation tools like Pydantic or dbt schema.yml tests
operate at the application layer or the transformation layer.
They validate after the fact, or at one specific point in your stack.
A data contract operates at the ingest layer — the moment raw data enters your lakehouse from an external source. It’s the agreement between the producer of the data and your pipeline. Once your Bronze layer has been validated by a contract, your Silver and Gold transformations can trust what they receive.
Use schema validation within your services and transformations. Use a data contract at the boundary between data producers and your lakehouse.
When to Use What
| Situation | Recommended approach |
|---|---|
| Validating a REST API request or response | JSON Schema / Pydantic — fast, in-process |
| Testing dbt model outputs | dbt tests — native, already in your DAG |
| Ingesting CSV / Parquet from an external partner | Data contract — quarantine, lineage, ownership |
| Defining the Bronze → Silver quality gate | Data contract — business rules + quarantine |
| Testing the same rules across Polars locally and Spark in prod | Data contract — engine-agnostic YAML |
| Documenting who owns a dataset and what it means | Data contract — info.owner, version history in git |
Write Your First Data Contract in 60 Seconds
One YAML file. Runs on Polars, DuckDB, and Pandas. Quarantine included. Open source, MIT licensed.