IRON LAW: Data Quality Checks at Every Stage
A pipeline that moves bad data fast is worse than no pipeline — it
corrupts downstream analytics and decisions. Every pipeline stage
(extract, transform, load) must have data quality checks:
row counts, null checks, schema validation, freshness checks.
"Garbage in, garbage out" is not a warning — it's a guarantee.
ETL vs ELT
| Aspect |
ETL (Extract, Transform, Load) |
ELT (Extract, Load, Transform) |
| Transform where? |
Before loading (in pipeline) |
After loading (in warehouse) |
| Best for |
Structured data, compliance-heavy |
Cloud warehouses (BigQuery, Snowflake) |
| Flexibility |
Less (transform logic is fixed) |
More (transform in SQL after loading) |
| Cost |
Compute in pipeline |
Compute in warehouse |
| Trend |
Legacy/on-prem |
Modern/cloud-native |
Pipeline Architecture
[Sources] → [Extract] → [Stage] → [Transform] → [Load] → [Serve]
↑ ↓
| [Quality Checks at every stage] [Dashboard]
| [Monitoring & Alerting] [API]
└──────────────── [Orchestrator (Airflow/Prefect)] ──────┘
Data Source Types
| Source |
Extraction Method |
Challenges |
| Database |
CDC (Change Data Capture), bulk query, replication |
Schema changes, performance impact on source |
| API |
REST/GraphQL polling, webhooks |
Rate limits, pagination, auth token refresh |
| Files |
S3/GCS pickup, SFTP, email attachment |
Format inconsistency, encoding issues |
| Streaming |
Kafka, Kinesis, Pub/Sub |
Ordering, exactly-once processing |
| SaaS tools |
Pre-built connectors (Fivetran, Airbyte) |
API changes, data model complexity |
Orchestration Tools
| Tool |
Type |
Best For |
Complexity |
| Airflow |
Python DAGs |
Complex pipelines, team of engineers |
High |
| Prefect |
Python, modern API |
Simpler than Airflow, good DX |
Medium |
| dbt |
SQL transforms only |
Transform layer in ELT |
Low-Medium |
| Cron |
Simple scheduling |
Single script, low complexity |
Low |
| Fivetran/Airbyte |
Managed connectors |
Extract + Load (no transform) |
Low |
Data Quality Framework
| Check |
What It Validates |
When |
| Row count |
Expected number of rows (within ±10% of prior run) |
After extract, after load |
| Null check |
Critical columns have no unexpected nulls |
After extract |
| Schema validation |
Column names, types match expected |
After extract |
| Freshness |
Data is recent (not stale) |
After load |
| Uniqueness |
No duplicate primary keys |
After load |
| Range check |
Values within expected bounds |
After transform |
| Referential integrity |
Foreign keys match parent tables |
After load |
Pipeline Design Steps
- Map sources and destinations: What data, from where, to where?
- Define freshness requirements: Real-time? Hourly? Daily?
- Choose architecture: ETL or ELT based on tools and team
- Build incrementally: Start with one source, one destination, one schedule
- Add quality checks: At minimum: row count + null check + freshness
- Set up monitoring: Alert on failure, quality check violations, latency
- Document: Data dictionary, pipeline diagram, SLAs