A declarative, contract-driven medallion pipeline engine for data mesh architectures. Describe your data products in YAML.
LakeLogic is an open-source, contract-driven data engineering framework. Define your schema once in YAML — and get quality enforcement, quarantine routing, synthetic data generation, incremental processing, and data lineage out of the box. No boilerplate. No sprawl.
LakeLogic is easy to install using your standard package manager. Select your environment and get started.
pip install lakelogic
poetry add lakelogic
conda install -c conda-forge lakelogic
Integrates with your stack
100% reconciliation, guaranteed by construction. Every record is accounted for via SQL-first rules and strict Pydantic parsing. Invalid rows are safely quarantined, never silently dropped.
Contract-driven right-to-erasure (GDPR/HIPAA) with nullify, hash, or redact strategies. Automatically inherit global PII masking and access rules across your federated data mesh.
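The three erasure strategies named above (nullify, hash, redact) can be sketched in plain Python. This is an illustrative stand-in, not the LakeLogic API; the function name and placeholder token are assumptions:

```python
import hashlib

def apply_erasure(value, strategy):
    """Illustrative right-to-erasure strategies: nullify, hash, or redact."""
    if value is None:
        return None
    if strategy == "nullify":
        return None  # drop the value entirely
    if strategy == "hash":
        # One-way SHA-256 digest: removes the PII but keeps joinability,
        # since the same input always maps to the same digest.
        return hashlib.sha256(str(value).encode()).hexdigest()
    if strategy == "redact":
        return "***REDACTED***"  # fixed placeholder token (assumed here)
    raise ValueError(f"unknown strategy: {strategy}")
```

The hash strategy is the usual choice when downstream joins on the erased column must keep working after the erasure request is honored.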
Write once, run anywhere. Seamlessly execute your medallion pipelines across DuckDB, Polars, or Spark without rewriting declarative contracts. A vendor-neutral runtime that scales to billions of rows.
Deep contextual logging, full pipeline dry-runs without moving data, and DDL-only generation for CI/CD. Multi-channel alerts flow automatically to Slack, Teams, and email.
Built-in Faker simulation for referential integrity and AI edge case injection. LLM-powered contract onboarding can auto-generate schemas and quality rules from sample data.
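The core idea behind seeded synthetic data, deterministic generation that preserves referential integrity between parent and child tables, can be sketched with the standard library alone (Faker adds realistic names, addresses, and other providers on top of this pattern):

```python
import random
import uuid

rng = random.Random(42)  # seeded so CI runs are reproducible

# Parent table: customers with stable synthetic ids
customers = [
    {"customer_id": str(uuid.UUID(int=rng.getrandbits(128))), "name": f"user_{i}"}
    for i in range(5)
]

# Child table: every order's foreign key is drawn from the parent table,
# so referential integrity holds by construction
orders = [
    {
        "order_id": str(uuid.UUID(int=rng.getrandbits(128))),
        "customer_id": rng.choice(customers)["customer_id"],
        "amount": round(rng.uniform(5, 500), 2),
    }
    for _ in range(20)
]
```

Because the generator is seeded, a failing CI run can be replayed with the exact same rows.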
Reuse existing dbt schema.yml models natively. Integrate with dlt for 100+ verified REST and database sources. All synced through your central declarative contracts.
Most data mesh initiatives stall because building data products is too complex. Domain teams are forced to string together Great Expectations for quality, Airflow for lineage, Faker for tests, and custom Spark scripts to merge data.
LakeLogic provides the missing self-serve platform. A domain owner simply declares their dataset in a single YAML contract. The engine automatically handles the rest—materializing Gold tables with quality gates, lineage, and SCD2 fully built in.
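The SCD2 (slowly changing dimension, type 2) mechanics the engine builds in can be illustrated with a minimal, dependency-free merge. This is a sketch of the technique, not LakeLogic's internals; the function and column names are assumptions:

```python
from datetime import date

def scd2_upsert(history, incoming, key, tracked, today):
    """Minimal SCD2 merge: when a tracked attribute changes, close the
    current version (set valid_to) and append a new open-ended version."""
    out = list(history)
    # Index the currently-open row (valid_to is None) for each key
    current = {r[key]: r for r in out if r["valid_to"] is None}
    for row in incoming:
        cur = current.get(row[key])
        if cur is None or any(cur[c] != row[c] for c in tracked):
            if cur is not None:
                cur["valid_to"] = today  # close the old version
            out.append({**row, "valid_from": today, "valid_to": None})
    return out
```

Each key keeps its full change history: one closed row per superseded version, plus one open row carrying the current values.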
Stop configuring isolated tools. Start shipping governed data products.
Drop any file on infer_contract() — LakeLogic infers the schema, detects null patterns, suggests quality rules from your data's actual distribution, and returns a ContractDraft you can inspect, save, or chain directly into a data generator. No configuration files. No Python classes to subclass. One function call.
from lakelogic import infer_contract, DataGenerator

# Infer a full contract from any file
draft = infer_contract(
    "data/orders.csv",
    title="Orders Bronze",
    suggest_rules=True,
    detect_pii=True,
)
draft.show()  # print YAML to inspect
draft.save("contracts/bronze_orders.yaml")

# Generate 10k test rows from the inferred contract
df = draft.to_generator(seed=42).generate(rows=10_000)

# Or skip YAML entirely — pure in-memory flow
df = (
    DataGenerator.from_file("data/orders.csv")
    .generate(rows=5_000)
)

# Validate on ingest
from lakelogic import DataProcessor

result = DataProcessor("contracts/bronze_orders.yaml").run(df)
LakeLogic reads and validates any format your lakehouse ingests. Contracts are format-agnostic — the same YAML covers CSV landing files today and Delta tables tomorrow.
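A contract in this style might look like the following. The field names below are illustrative, inferred from the features described on this page, not the exact LakeLogic contract schema:

```yaml
# contracts/bronze_orders.yaml (illustrative shape, not the exact schema)
title: Orders Bronze
source:
  format: csv            # the same contract can later point at delta or parquet
  path: data/orders.csv
columns:
  - name: order_id
    type: string
    nullable: false
  - name: amount
    type: decimal(10,2)
    quality:
      - rule: "amount >= 0"
        on_fail: quarantine
  - name: email
    type: string
    pii: true
    erasure: hash
```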
Drop any file on infer_contract() — LakeLogic infers schema, nullability, and quality rules from your data's actual distribution.
DataProcessor enforces every rule at runtime. Bad rows route to quarantine with a per-row reason column — no silent failures, no pipeline crashes.
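The quarantine-with-reason pattern can be sketched in a few lines of plain Python. The routing function and rule names here are illustrative stand-ins, not the LakeLogic API:

```python
def route_rows(rows, rules):
    """Split rows into valid and quarantined, tagging each rejected row
    with a human-readable reason column (illustrative sketch)."""
    valid, quarantine = [], []
    for row in rows:
        # Collect the name of every rule the row fails
        reasons = [name for name, check in rules.items() if not check(row)]
        if reasons:
            quarantine.append({**row, "reject_reason": "; ".join(reasons)})
        else:
            valid.append(row)
    return valid, quarantine

# Example rules (assumed for illustration)
rules = {
    "amount_non_negative": lambda r: r["amount"] >= 0,
    "order_id_present": lambda r: bool(r.get("order_id")),
}
```

The key property is that the pipeline never raises on a bad row: every record lands in exactly one of the two outputs, and the reason column makes quarantined rows debuggable.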
DataGenerator produces realistic rows from the same contract — seeded from your real data's distributions. CI pipelines that actually catch issues.
Built-in watermark strategies — max target, lookback window, CDC — keep Bronze→Silver runs idempotent automatically. No manual bookmarks.
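The max-target watermark strategy with a lookback window can be sketched as a pure function; re-running it with an unchanged watermark selects nothing, which is what makes the run idempotent. Function and column names are assumptions for illustration:

```python
def incremental_filter(rows, watermark, ts_col="updated_at", lookback=0):
    """Max-target watermark with optional lookback: keep only rows newer
    than (watermark - lookback), then advance the watermark to the max
    timestamp seen. Illustrative sketch, not the LakeLogic API."""
    cutoff = watermark - lookback
    batch = [r for r in rows if r[ts_col] > cutoff]
    # Advance the watermark only if the batch actually contained newer rows
    new_watermark = max((r[ts_col] for r in batch), default=watermark)
    return batch, new_watermark
```

Persist the returned watermark after each successful run; the next run picks up from it with no manual bookmark.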
Every run stamped with source path, run ID, and timestamp. Full lineage from source file → Bronze → Silver → Gold. Debug in minutes, not days.
YAML contracts are version-controlled, PR-reviewed, and environment-aware. Compliance teams get an audit trail. Engineers get a review process.
Your team has the same data validation rule written in at least three places — Spark, Lambda, dbt. When one changes, the others drift. One YAML data contract fixes all three.
Reliability: One bad row in a data pipeline shouldn't crash a 2-hour job. Route bad records out automatically, with a reject reason column.
📐 Architecture: Schema validation checks shape. A data contract enforces meaning — quality rules, lineage, quarantine, and engine portability.
The full framework is open source under the MIT license. No feature gates, no usage limits.
LakeLogic is MIT-licensed and free forever. Deploy governed data products across any engine from a single YAML contract.
Release notes & new playbooks. No spam. Unsubscribe any time.