A declarative, contract-driven medallion pipeline engine for data mesh architectures. Describe your data products in YAML.
LakeLogic is an open-source, contract-driven data engineering framework. Define your schema once in YAML — and get quality enforcement, quarantine routing, synthetic data generation, incremental processing, and data lineage out of the box. No boilerplate. No sprawl.
LakeLogic is easy to install using your standard package manager. Select your environment and get started.
pip install lakelogic
poetry add lakelogic
conda install -c conda-forge lakelogic
Integrates with your stack
100% reconciliation, guaranteed by construction. Every record is accounted for via SQL-first rules and strict Pydantic parsing. Invalid rows are safely quarantined, never silently dropped.
Contract-driven right-to-erasure (GDPR/HIPAA) with nullify, hash, or redact strategies. Automatically inherit global PII masking and access rules across your federated data mesh.
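The three erasure strategies named above (nullify, hash, redact) can be sketched in plain Python. This is an illustrative stand-in, not the LakeLogic API; the function name and placeholder token are assumptions:

```python
import hashlib

def apply_erasure(value, strategy):
    """Illustrative right-to-erasure strategies: nullify, hash, or redact."""
    if value is None:
        return None
    if strategy == "nullify":
        return None  # drop the value entirely
    if strategy == "hash":
        # One-way SHA-256 digest: removes the PII but keeps joinability,
        # since the same input always maps to the same digest.
        return hashlib.sha256(str(value).encode()).hexdigest()
    if strategy == "redact":
        return "***REDACTED***"  # fixed placeholder token (assumed here)
    raise ValueError(f"unknown strategy: {strategy}")
```

The hash strategy is the usual choice when downstream joins on the erased column must keep working after the erasure request is honored.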
Write once, run anywhere. Seamlessly execute your medallion pipelines across DuckDB, Polars, or Spark without rewriting declarative contracts. A vendor-neutral runtime that scales to billions of rows.
Deep contextual logging, full pipeline dry-runs without moving data, and DDL-only generation for CI/CD. Multi-channel alerts flow automatically to Slack, Teams, and email.
Built-in Faker simulation for referential integrity and AI edge case injection. LLM-powered contract onboarding can auto-generate schemas and quality rules from sample data.
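The core idea behind seeded synthetic data, deterministic generation that preserves referential integrity between parent and child tables, can be sketched with the standard library alone (Faker adds realistic names, addresses, and other providers on top of this pattern):

```python
import random
import uuid

rng = random.Random(42)  # seeded so CI runs are reproducible

# Parent table: customers with stable synthetic ids
customers = [
    {"customer_id": str(uuid.UUID(int=rng.getrandbits(128))), "name": f"user_{i}"}
    for i in range(5)
]

# Child table: every order's foreign key is drawn from the parent table,
# so referential integrity holds by construction
orders = [
    {
        "order_id": str(uuid.UUID(int=rng.getrandbits(128))),
        "customer_id": rng.choice(customers)["customer_id"],
        "amount": round(rng.uniform(5, 500), 2),
    }
    for _ in range(20)
]
```

Because the generator is seeded, a failing CI run can be replayed with the exact same rows.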
Reuse existing dbt schema.yml models natively. Integrate with dlt for 100+ verified REST and database sources. All synced through your central declarative contracts.
Most data mesh initiatives stall because building data products is too complex. Domain teams are forced to string together Great Expectations for quality, Airflow for lineage, Faker for tests, and custom Spark scripts to merge data.
LakeLogic provides the missing self-serve platform. A domain owner simply declares their dataset in a single YAML contract. The engine automatically handles the rest—materializing Gold tables with quality gates, lineage, and SCD2 fully built in.
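The SCD2 (slowly changing dimension, type 2) mechanics the engine builds in can be illustrated with a minimal, dependency-free merge. This is a sketch of the technique, not LakeLogic's internals; the function and column names are assumptions:

```python
from datetime import date

def scd2_upsert(history, incoming, key, tracked, today):
    """Minimal SCD2 merge: when a tracked attribute changes, close the
    current version (set valid_to) and append a new open-ended version."""
    out = list(history)
    # Index the currently-open row (valid_to is None) for each key
    current = {r[key]: r for r in out if r["valid_to"] is None}
    for row in incoming:
        cur = current.get(row[key])
        if cur is None or any(cur[c] != row[c] for c in tracked):
            if cur is not None:
                cur["valid_to"] = today  # close the old version
            out.append({**row, "valid_from": today, "valid_to": None})
    return out
```

Each key keeps its full change history: one closed row per superseded version, plus one open row carrying the current values.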
Stop configuring isolated tools. Start shipping governed data products.
Drop any file on infer_contract() — LakeLogic infers the schema, detects null patterns, suggests quality rules from your data's actual distribution, and returns a ContractDraft you can inspect, save, or chain directly into a data generator. No configuration files. No Python classes to subclass. One function call.
from lakelogic import infer_contract, DataGenerator

# Infer a full contract from any file
draft = infer_contract(
    "data/orders.csv",
    title="Orders Bronze",
    suggest_rules=True,
    detect_pii=True,
)
draft.show()  # print YAML to inspect
draft.save("contracts/bronze_orders.yaml")

# Generate 10k test rows from the inferred contract
df = draft.to_generator(seed=42).generate(rows=10_000)

# Or skip YAML entirely — pure in-memory flow
df = (
    DataGenerator.from_file("data/orders.csv")
    .generate(rows=5_000)
)

# Validate on ingest
from lakelogic import DataProcessor

result = DataProcessor("contracts/bronze_orders.yaml").run(df)
LakeLogic reads and validates any format your lakehouse ingests. Contracts are format-agnostic — the same YAML covers CSV landing files today and Delta tables tomorrow.
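A contract in this style might look like the following. The field names below are illustrative, inferred from the features described on this page, not the exact LakeLogic contract schema:

```yaml
# contracts/bronze_orders.yaml (illustrative shape, not the exact schema)
title: Orders Bronze
source:
  format: csv            # the same contract can later point at delta or parquet
  path: data/orders.csv
columns:
  - name: order_id
    type: string
    nullable: false
  - name: amount
    type: decimal(10,2)
    quality:
      - rule: "amount >= 0"
        on_fail: quarantine
  - name: email
    type: string
    pii: true
    erasure: hash
```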
Drop any file on infer_contract() — LakeLogic infers schema, nullability, and quality rules from your data's actual distribution.
DataProcessor enforces every rule at runtime. Bad rows route to quarantine with a per-row reason column — no silent failures, no pipeline crashes.
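The quarantine-with-reason pattern can be sketched in a few lines of plain Python. The routing function and rule names here are illustrative stand-ins, not the LakeLogic API:

```python
def route_rows(rows, rules):
    """Split rows into valid and quarantined, tagging each rejected row
    with a human-readable reason column (illustrative sketch)."""
    valid, quarantine = [], []
    for row in rows:
        # Collect the name of every rule the row fails
        reasons = [name for name, check in rules.items() if not check(row)]
        if reasons:
            quarantine.append({**row, "reject_reason": "; ".join(reasons)})
        else:
            valid.append(row)
    return valid, quarantine

# Example rules (assumed for illustration)
rules = {
    "amount_non_negative": lambda r: r["amount"] >= 0,
    "order_id_present": lambda r: bool(r.get("order_id")),
}
```

The key property is that the pipeline never raises on a bad row: every record lands in exactly one of the two outputs, and the reason column makes quarantined rows debuggable.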
DataGenerator produces realistic rows from the same contract — seeded from your real data's distributions. CI pipelines that actually catch issues.
Built-in watermark strategies — max target, lookback window, CDC — keep Bronze→Silver runs idempotent automatically. No manual bookmarks.
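The max-target watermark strategy with a lookback window can be sketched as a pure function; re-running it with an unchanged watermark selects nothing, which is what makes the run idempotent. Function and column names are assumptions for illustration:

```python
def incremental_filter(rows, watermark, ts_col="updated_at", lookback=0):
    """Max-target watermark with optional lookback: keep only rows newer
    than (watermark - lookback), then advance the watermark to the max
    timestamp seen. Illustrative sketch, not the LakeLogic API."""
    cutoff = watermark - lookback
    batch = [r for r in rows if r[ts_col] > cutoff]
    # Advance the watermark only if the batch actually contained newer rows
    new_watermark = max((r[ts_col] for r in batch), default=watermark)
    return batch, new_watermark
```

Persist the returned watermark after each successful run; the next run picks up from it with no manual bookmark.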
Every run stamped with source path, run ID, and timestamp. Full lineage from source file → Bronze → Silver → Gold. Debug in minutes, not days.
YAML contracts are version-controlled, PR-reviewed, and environment-aware. Compliance teams get an audit trail. Engineers get a review process.
Your team has the same data validation rule written in at least three places — Spark, Lambda, dbt. When one changes, the others drift. One YAML data contract fixes all three.
Reliability: One bad row in a data pipeline shouldn't crash a 2-hour job. Route bad records out automatically, with a reject reason column.
📐 Architecture: Schema validation checks shape. A data contract enforces meaning — quality rules, lineage, quarantine, and engine portability.
The full framework is open source under the MIT license. No feature gates, no usage limits.
LakeLogic is MIT-licensed and free forever. Deploy governed data products across any engine from a single YAML contract.
Release notes & new playbooks. No spam. Unsubscribe any time.