The 2am Scenario
⚠️ Incident — 02:17 AM
Your overnight Bronze ingestion job, the one that processes 400,000 web signup records and feeds
your Silver marketing tables, has failed.
The dashboard is empty. The daily email campaign query is pulling from yesterday's data.
Stakeholders are waking up to stale numbers. And you're at 2am, digging through a stack trace,
hunting for the one row that contains a null where your pipeline expected a date.
This is the default behaviour of most data pipelines: one bad record vetoes the entire run. The job either errors out entirely or silently drops bad records with no audit trail. Neither is acceptable in production.
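The fail-loud half of that default is easy to reproduce. A minimal sketch in plain Python (the `process` function and records here are illustrative, not from any library):

```python
# Three records; one has a null where a date string is expected.
records = [
    {"event_date": "2026-02-28"},
    {"event_date": None},          # the one bad row
    {"event_date": "2026-03-01"},
]

def process(row):
    # Expects a date string; a null blows up the whole call.
    return row["event_date"].replace("-", "")

try:
    results = [process(r) for r in records]
except AttributeError as e:
    # One bad record vetoes the entire run: the two good rows are lost too.
    print(f"Job failed: {e}")
```

Note that the two perfectly good rows never make it out either; the single null takes the whole batch down with it.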
The Nuclear Option Problem
The typical "fix" is to wrap everything in try/except and log a warning.
But that just hides the problem — bad data silently disappears and you have no way to
investigate, reprocess, or alert on it.
```python
for row in records:
    try:
        process(row)
    except Exception as e:
        # Bad row silently vanishes. No audit trail.
        # No way to know what failed, or what % of data is bad.
        logger.warning(f"Skipping row: {e}")
```
You have two choices: fail loud (the 2am incident) or fail silent (broken dashboards that nobody notices for days). Neither is the right answer.
The right answer is a third option: route bad records out, keep clean records flowing, and leave a full audit trail of exactly what was rejected and why.
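Stripped of any framework, the pattern is simple: run each record past the rules, and route failures aside with a reason instead of raising or dropping them. A minimal, framework-free sketch (`quarantine_split` and the rule predicates are illustrative, not LakeLogic's API):

```python
def quarantine_split(records, rules):
    """rules: list of (name, predicate) pairs; predicate returns True if OK."""
    good, bad = [], []
    for rec in records:
        reasons = [name for name, check in rules if not check(rec)]
        if reasons:
            # Preserve the record exactly as it arrived, plus the reason(s).
            bad.append({**rec, "_reject_reason": ";".join(reasons)})
        else:
            good.append(rec)
    return good, bad

rules = [
    ("email_format", lambda r: "@" in (r.get("email") or "")),
    ("age_positive", lambda r: r.get("age") is None or r["age"] >= 0),
]

records = [
    {"signup_id": "SU-1", "email": "jane@example.com", "age": 32},
    {"signup_id": "SU-2", "email": "not-an-email", "age": 28},
    {"signup_id": "SU-3", "email": "bob@example.com", "age": -5},
]

good, bad = quarantine_split(records, rules)
print(len(good), len(bad))       # 1 good, 2 quarantined
print(bad[0]["_reject_reason"])  # email_format
```

Everything that follows is this pattern with the bookkeeping (audit columns, storage targets, schema evolution) handled for you.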
The Quarantine Pattern
In LakeLogic, you add three lines to your data contract. That's it.
```yaml
version: "1.0.0"
info:
  title: Bronze Web Signups
  dataset: bronze_web_signups
model:
  fields:
    - name: signup_id
      type: string
      required: true
    - name: email
      type: string
      required: true
    - name: event_date
      type: date
      required: true
    - name: age
      type: int
quality:
  row_rules:
    - name: email_format
      sql: "email LIKE '%@%'"
    - name: age_positive
      sql: "age IS NULL OR age >= 0"
quarantine:
  include_error_reason: true  # adds _reject_reason column
```
LakeLogic now does something different when it encounters a bad row. Instead of raising
an exception, it routes the row to a separate quarantine file — and stamps it with a
_reject_reason column explaining exactly which rule it violated.
Running It
```python
from lakelogic import DataProcessor

result = DataProcessor("contract.yaml").run_source()

# result.good → clean records, ready for Silver
# result.bad  → quarantined records with _reject_reason
print(f"✅ {len(result.good):,} valid rows")
print(f"🔒 {len(result.bad):,} quarantined rows")

# Inspect the reject reasons
print(result.bad["_reject_reason"].value_counts())
```
What the quarantined rows look like
| signup_id | email | event_date | age | _reject_reason |
|---|---|---|---|---|
| SU-10042 | not-an-email | 2026-02-28 | 32 | row_rule:email_format |
| SU-10091 | jane@example.com | 2026-03-01 | -5 | row_rule:age_positive |
| SU-10103 | NULL | 2026-03-01 | 28 | required_field:email |
Every quarantined row is preserved exactly as it arrived, with a clear reason. You can now investigate, fix the upstream source, and reprocess — without any data loss.
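Reprocessing can then be as simple as filtering the quarantine set by reason, applying a targeted fix, and feeding only the affected rows back through the pipeline. A sketch, where `fix_negative_age` stands in for whatever upstream bug you diagnosed (both the fix and the sample rows are hypothetical):

```python
# Rows as they might sit in quarantine, audit column included.
quarantined = [
    {"signup_id": "SU-10091", "email": "jane@example.com", "age": -5,
     "_reject_reason": "row_rule:age_positive"},
    {"signup_id": "SU-10042", "email": "not-an-email", "age": 32,
     "_reject_reason": "row_rule:email_format"},
]

def fix_negative_age(row):
    # Hypothetical fix for a known sign-flip bug in the signup form.
    fixed = {k: v for k, v in row.items() if k != "_reject_reason"}
    fixed["age"] = abs(fixed["age"])
    return fixed

# Only touch rows that failed for the reason we understand.
reprocessable = [r for r in quarantined
                 if r["_reject_reason"] == "row_rule:age_positive"]
repaired = [fix_negative_age(r) for r in reprocessable]
print(repaired)  # ready to re-submit through the normal pipeline
```

The `email_format` failures stay in quarantine untouched until someone diagnoses them, which is exactly the point: nothing is guessed at, nothing is lost.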
The Result: What Changes
✅ With quarantine enabled
The pipeline finishes. The dashboard is populated. The 0.4% of bad records are held in a quarantine file with full audit columns. Your data quality team can review them in the morning — not at 2am.
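One companion check worth adding yourself: an upper bound on the quarantine rate, so a systemic upstream failure (40% bad rows rather than 0.4%) still fails loudly instead of quietly filling the quarantine table. A sketch, assuming you have the good/bad counts from the run (the threshold and function are illustrative, not a LakeLogic feature):

```python
def check_quarantine_rate(n_good, n_bad, max_bad_ratio=0.05):
    """Quarantine tolerates scattered bad rows, but a spike usually means
    a systemic upstream problem, which *should* fail the run after all."""
    total = n_good + n_bad
    ratio = n_bad / total if total else 0.0
    if ratio > max_bad_ratio:
        raise RuntimeError(
            f"Quarantine rate {ratio:.1%} exceeds {max_bad_ratio:.1%}: "
            "likely an upstream schema or source failure."
        )
    return ratio

rate = check_quarantine_rate(398_400, 1_600)  # 0.4% of 400,000 → passes
print(f"{rate:.1%}")  # → 0.4%
```

This keeps the best of both worlds: isolated bad rows are routed out, but a wholesale source breakage still pages someone.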
Writing Quarantine to a Table (Not Just a File)
By default, quarantined rows write to a Parquet file next to your output. For production pipelines, you probably want them in a queryable table your data quality team can monitor:
```yaml
quarantine:
  include_error_reason: true
  target: "table:data_quality.bronze_signups_quarantine"
  # Writes to DuckDB by default (Polars/Pandas pipelines)
  # Writes to a Spark Delta table on Databricks pipelines
  # Schema evolves automatically as your contract changes
```
Now your DQ team can run:
```sql
-- Which rules are failing most often this week?
SELECT
  _reject_reason,
  COUNT(*)                      AS failures,
  DATE_TRUNC('day', event_date) AS day
FROM data_quality.bronze_signups_quarantine
WHERE _run_timestamp > NOW() - INTERVAL '7 days'
GROUP BY 1, 3
ORDER BY 2 DESC;
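For a quick look before wiring up dashboards, the same failure-count breakdown is a one-liner over the `_reject_reason` column in Python. A minimal sketch with fabricated sample values:

```python
from collections import Counter

# Toy stand-in for the quarantine table's _reject_reason column.
reasons = [
    "row_rule:email_format", "row_rule:email_format",
    "row_rule:age_positive", "required_field:email",
    "row_rule:email_format",
]

# Equivalent of: SELECT _reject_reason, COUNT(*) ... ORDER BY 2 DESC
for reason, n in Counter(reasons).most_common():
    print(f"{reason}: {n}")
```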
When to Use Quarantine vs. Fail Fast
| Scenario | Behaviour | Why |
|---|---|---|
| Overnight batch jobs | Quarantine | Bad rows shouldn't cancel a 2-hour run |
| Bronze ingestion layer | Quarantine | Raw data is always dirty — expect and handle it |
| Financial calculations (revenue, payroll) | Fail fast | Partial results are worse than no results |
| Schema contract violations (wrong column types) | Fail fast | Structural problems need immediate attention |
| Silver/Gold quality gates | Quarantine | Route bad rows out, let clean rows continue downstream |
| CI / data contract tests | Fail fast | A broken contract should block the PR |
The Rule of Thumb
If a bad record would cause all the other records to be wrong, fail fast. If a bad record is just itself wrong, quarantine it and keep going.
Most data pipelines deal with the second case far more often than the first. Bronze ingestion especially — you're processing data from external sources you don't control. Quarantine is your safety net.
Add Quarantine to Your Pipeline Today
Three lines of YAML. Works on Polars, Spark, DuckDB and Pandas. Full audit trail included. Open source, MIT licensed.