Your Data Estate.
Under Contract.

A declarative, contract-driven medallion pipeline engine for data mesh architectures. Describe your data products in YAML.

Open source · MIT License · Polars-native speed · Databricks & Fabric-ready
★ Azure Reference Architecture · Bronze → Silver → Gold · 40+ contracts included

LakeLogic is an open-source, contract-driven data engineering framework. Define your schema once in YAML — and get quality enforcement, quarantine routing, synthetic data generation, incremental processing, and data lineage out of the box. No boilerplate. No sprawl.

Quick install

LakeLogic is easy to install using your standard package manager. Select your environment and get started.

pip install lakelogic
poetry add lakelogic
conda install -c conda-forge lakelogic

Integrates with your stack

Databricks
Snowflake
dbt
Delta Lake
Azure ADLS
AWS S3
BigQuery
Data Mesh Alignment

The missing runtime for your data mesh

01

Data Quality & Trust

100% reconciliation, mathematically guaranteed: every record is accounted for through SQL-first rules and strict Pydantic parsing. Invalid rows are safely quarantined, never silently dropped.
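The quarantine flow can be sketched engine-agnostically. A minimal sketch in plain Python, assuming nothing about LakeLogic's internal API (the rule names and reason column are illustrative):

```python
# Illustrative rule set: each rule maps a row to an error message, or None if it passes.
RULES = {
    "amount_positive": lambda r: None if r["amount"] > 0 else "amount must be > 0",
    "currency_known": lambda r: None if r["currency"] in {"EUR", "USD"} else "unknown currency",
}

def route(rows):
    """Split rows into (valid, quarantined); quarantined rows carry a per-row reason column."""
    valid, quarantine = [], []
    for row in rows:
        reasons = [msg for rule in RULES.values() if (msg := rule(row))]
        if reasons:
            quarantine.append({**row, "_quarantine_reason": "; ".join(reasons)})
        else:
            valid.append(row)
    return valid, quarantine

valid, bad = route([
    {"amount": 10.0, "currency": "EUR"},
    {"amount": -5.0, "currency": "XXX"},
])
```

Every input row lands in exactly one bucket, which is what makes the reconciliation total: `len(valid) + len(quarantine)` always equals the input count.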

02

Compliance & Governance

Contract-driven right-to-erasure (GDPR/HIPAA) with nullify, hash, or redact strategies. Automatically inherit global PII masking and access rules across your federated data mesh.
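A minimal sketch of the three erasure strategies in plain Python (the field names and strategy map are illustrative, not LakeLogic's contract syntax):

```python
import hashlib

def erase(row, pii_fields):
    """Apply a per-field erasure strategy: 'nullify', 'hash', or 'redact'."""
    out = dict(row)
    for field, strategy in pii_fields.items():
        if strategy == "nullify":
            out[field] = None                      # value removed entirely
        elif strategy == "hash":                   # irreversible but join-stable
            out[field] = hashlib.sha256(str(out[field]).encode()).hexdigest()
        elif strategy == "redact":
            out[field] = "***REDACTED***"          # visibly masked
    return out

row = {"id": 7, "email": "a@b.com", "name": "Ada", "ssn": "123-45-6789"}
clean = erase(row, {"email": "hash", "name": "redact", "ssn": "nullify"})
```

Hashing keeps the column usable as a join key across tables while making the original value unrecoverable, which is why it is often preferred over nullify for identifiers.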

03

Engine Portability & Scale

Write once, run anywhere. Execute your medallion pipelines on DuckDB, Polars, or Spark without rewriting declarative contracts. A vendor-neutral runtime that scales to billions of rows.
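One way to picture the portability: the contract stays declarative and a per-engine backend renders it. A conceptual sketch, not LakeLogic internals (the registry and the generated snippets are illustrative):

```python
# Engine registry: each backend renders one declarative predicate as engine-specific code.
BACKENDS = {
    "duckdb": lambda pred: f"duckdb.sql('SELECT * FROM src WHERE {pred}')",
    "spark":  lambda pred: f"spark.sql('SELECT * FROM src WHERE {pred}')",
    "polars": lambda pred: f"df.sql('SELECT * FROM self WHERE {pred}')",
}

def compile_check(predicate, engine):
    """Render one contract predicate for the chosen engine; the contract itself never changes."""
    if engine not in BACKENDS:
        raise ValueError(f"unsupported engine: {engine}")
    return BACKENDS[engine](predicate)
```

Swapping engines is then a one-line change in deployment config, never an edit to the contract.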

04

Developer Experience

Deep contextual logging, full pipeline dry-runs without moving data, and DDL-only generation for CI/CD. Multi-channel alerts flow automatically to Slack, Teams, and email.
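DDL-only generation is easy to picture: derive CREATE TABLE statements from the contract and run nothing else. A hedged sketch with an illustrative contract shape (LakeLogic's real YAML schema may differ):

```python
# Illustrative contract fragment, as it might look after parsing the YAML.
CONTRACT = {
    "table": "bronze_orders",
    "columns": [
        {"name": "order_id", "type": "BIGINT", "nullable": False},
        {"name": "amount", "type": "DECIMAL(18,2)", "nullable": False},
        {"name": "note", "type": "VARCHAR", "nullable": True},
    ],
}

def to_ddl(contract):
    """Emit CREATE TABLE DDL from a contract; usable in CI without moving any data."""
    cols = ",\n  ".join(
        f"{c['name']} {c['type']}{'' if c['nullable'] else ' NOT NULL'}"
        for c in contract["columns"]
    )
    return f"CREATE TABLE {contract['table']} (\n  {cols}\n);"

print(to_ddl(CONTRACT))
```

Because the output is plain text, a CI job can diff it against the deployed schema and fail the build on drift.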

05

Data Generation & AI

Built-in Faker-based simulation with referential integrity and AI-driven edge-case injection. LLM-powered contract onboarding can auto-generate schemas and quality rules from sample data.
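The referential-integrity idea can be shown without Faker: generate parent keys first, then have every child row draw from them. A stdlib-only sketch (the seed and field names are illustrative):

```python
import random

def generate_orders(rows, seed=42):
    """Seeded synthetic orders with referential integrity: every order references a real customer."""
    rng = random.Random(seed)
    customer_ids = [f"C{n:04d}" for n in range(1, 101)]    # parent keys generated first
    return [
        {
            "order_id": i,
            "customer_id": rng.choice(customer_ids),       # the FK always resolves
            "amount": round(rng.uniform(1, 500), 2),
        }
        for i in range(1, rows + 1)
    ]

orders = generate_orders(10_000)
```

Seeding makes the dataset reproducible, so a CI failure can be replayed bit-for-bit.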

06

Integrations

Reuse existing dbt schema.yml models natively. Integrate with dlt for 100+ verified REST and database sources. All synced through your central declarative contracts.
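Conceptually, the dbt reuse is a mapping from `schema.yml` column tests to contract rules. A sketch over an already-parsed model dict (the output rule shapes are illustrative, not LakeLogic's schema):

```python
# A dbt schema.yml model after parsing (yaml.safe_load would produce this shape).
DBT_MODEL = {
    "name": "orders",
    "columns": [
        {"name": "order_id", "tests": ["unique", "not_null"]},
        {"name": "status", "tests": [{"accepted_values": {"values": ["open", "closed"]}}]},
    ],
}

def dbt_to_contract(model):
    """Map dbt generic tests onto contract-style quality rules."""
    rules = []
    for col in model["columns"]:
        for test in col.get("tests", []):
            if test == "not_null":
                rules.append({"column": col["name"], "rule": "not_null"})
            elif test == "unique":
                rules.append({"column": col["name"], "rule": "unique"})
            elif isinstance(test, dict) and "accepted_values" in test:
                rules.append({"column": col["name"], "rule": "in",
                              "values": test["accepted_values"]["values"]})
    return {"dataset": model["name"], "quality_rules": rules}
```

The four generic dbt tests (unique, not_null, accepted_values, relationships) cover most quality rules teams already maintain, so nothing has to be rewritten.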

From centralized bottleneck
to a federated mesh

The Data Mesh Reality

Most data mesh initiatives stall because building data products is too complex. Domain teams are forced to string together Great Expectations for quality, Airflow for orchestration, Faker for test data, and custom Spark scripts to merge data.

LakeLogic provides the missing self-serve platform. A domain owner simply declares their dataset in a single YAML contract. The engine automatically handles the rest—materializing Gold tables with quality gates, lineage, and SCD2 fully built in.

Stop configuring isolated tools. Start shipping governed data products.

The LakeLogic self-serve platform replaces Airflow + Great Expectations + PySpark + custom test generators with 1 YAML contract.
Zero to pipeline

From raw file to
contract in one line

Drop any file on infer_contract() — LakeLogic infers the schema, detects null patterns, suggests quality rules from your data's actual distribution, and returns a ContractDraft you can inspect, save, or chain directly into a data generator.

No configuration files. No Python classes to subclass. One function call.

quickstart.py
from lakelogic import infer_contract, DataGenerator

# Infer a full contract from any file
draft = infer_contract(
    "data/orders.csv",
    title="Orders Bronze",
    suggest_rules=True,
    detect_pii=True,
)

draft.show()   # print YAML to inspect
draft.save("contracts/bronze_orders.yaml")

# Generate 10k test rows from the inferred contract
df = draft.to_generator(seed=42).generate(rows=10_000)

# Or skip YAML entirely — pure in-memory flow
df = (
    DataGenerator.from_file("data/orders.csv")
                 .generate(rows=5_000)
)

# Validate on ingest
from lakelogic import DataProcessor
result = DataProcessor("contracts/bronze_orders.yaml").run(df)
Support

Works with all common data formats

LakeLogic reads and validates any format your lakehouse ingests. Contracts are format-agnostic — the same YAML covers CSV landing files today and Delta tables tomorrow.

  • Text: CSV, JSON, NDJSON
  • Binary: Parquet, Avro, Excel, ORC
  • Open table formats: Delta Lake, Apache Iceberg
  • Cloud storage: S3, Azure Blob / ADLS, GCS
  • DataFrames & Compute: DuckDB, Polars, PySpark
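Format-agnosticism usually comes down to extension-based reader dispatch; the contract is applied after loading, so it never mentions the format. A toy sketch with stub readers standing in for real loaders:

```python
from pathlib import Path

# Extension -> reader dispatch; the stubs stand in for real format loaders.
READERS = {
    ".csv": lambda p: f"read_csv({p})",
    ".parquet": lambda p: f"read_parquet({p})",
    ".ndjson": lambda p: f"read_ndjson({p})",
}

def read_any(path):
    """Pick the reader from the file extension; the same contract validates the result."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"unsupported format: {suffix}")
    return READERS[suffix](path)
```

Adding a format means registering one reader; every existing contract works with it immediately.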

From raw file to production pipeline
in under 10 minutes

01

Ship your first contract in seconds

Drop any file on infer_contract() — LakeLogic infers schema, nullability, and quality rules from your data's actual distribution.

02

Catch bad rows before they reach Silver

DataProcessor enforces every rule at runtime. Bad rows route to quarantine with a per-row reason column — no silent failures, no pipeline crashes.

03

50,000 realistic test rows from one command

DataGenerator produces realistic rows from the same contract — seeded from your real data's distributions. CI pipelines that actually catch issues.

04

Never reprocess what you've already ingested

Built-in watermark strategies — max target, lookback window, CDC — keep Bronze→Silver runs idempotent automatically. No manual bookmarks.
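The max-target-plus-lookback strategy can be sketched in a few lines (the column names and one-hour window are illustrative):

```python
from datetime import datetime, timedelta

def incremental_filter(rows, last_watermark, lookback=timedelta(hours=1)):
    """Keep only rows newer than the watermark minus a lookback safety margin.
    Reprocessing the small margin, plus downstream dedup, keeps the run idempotent."""
    cutoff = last_watermark - lookback
    return [r for r in rows if r["updated_at"] > cutoff]

rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1, 8, 0)},   # already ingested
    {"id": 2, "updated_at": datetime(2024, 1, 1, 12, 0)},  # new since last run
]
fresh = incremental_filter(rows, last_watermark=datetime(2024, 1, 1, 10, 0))
```

After the run, the new watermark is simply the maximum `updated_at` seen, persisted for next time.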

05

Know exactly where every bad row came from

Every run stamped with source path, run ID, and timestamp. Full lineage from source file → Bronze → Silver → Gold. Debug in minutes, not days.
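The lineage stamp itself is lightweight. A sketch of per-run metadata columns (the underscore-prefixed names are illustrative, not LakeLogic's actual column names):

```python
import uuid
from datetime import datetime, timezone

def stamp(rows, source_path):
    """Attach lineage metadata to every ingested row: where it came from, which run, when."""
    run_id = str(uuid.uuid4())
    ingested_at = datetime.now(timezone.utc).isoformat()
    return [
        {**row, "_source_path": source_path, "_run_id": run_id, "_ingested_at": ingested_at}
        for row in rows
    ]

stamped = stamp([{"id": 1}, {"id": 2}], "s3://landing/orders.csv")
```

A bad Gold row can then be traced back to the exact source file and run that produced it.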

06

Your contracts live in git. So does your trust.

YAML contracts are version-controlled, PR-reviewed, and environment-aware. Compliance teams get an audit trail. Engineers get a review process.

From the blog

Data engineering,
without the guesswork

Read all posts →
Pricing

100% free. Forever.

The full framework is open source under the MIT license. No feature gates, no usage limits.

LakeLogic Framework
$0
Free forever · MIT License
  • Full DataProcessor + DataGenerator
  • infer_contract() from any file format
  • Quarantine with per-row error reasons
  • Incremental watermark strategies
  • CLI: generate, bootstrap, run, validate
  • DuckDB + Spark + Polars engines
  • Community support (GitHub Issues)
View on GitHub →

Open Source. Contract-Driven. Mesh-Ready.

LakeLogic is MIT-licensed and free forever. Deploy governed data products across any engine from a single YAML contract.

Release notes & new playbooks. No spam. Unsubscribe any time.

Get started with pip ★ Star on GitHub