Data Pipeline Design
Design data pipelines covering ETL vs ELT architectures, data source integration, scheduling, quality checks, and warehouse design. Use this skill when the user needs to move data between systems, build a data warehouse, automate data processing, or improve data reliability — even if they say 'move data from X to Y', 'build an ETL pipeline', 'our data is a mess', or 'set up a data warehouse'.
Technical methodology skill: analysis and application of Data Pipeline Design.
Methodology
IRON LAW: Data Quality Checks at Every Stage
A pipeline that moves bad data fast is worse than no pipeline: it corrupts downstream analytics and decisions. Every pipeline stage (extract, transform, load) must have data quality checks: row counts, null checks, schema validation, freshness checks. "Garbage in, garbage out" is not a warning; it is a guarantee.
ETL vs ELT
| Aspect | ETL (Extract, Transform, Load) | ELT (Extract, Load, Transform) |
|---|---|---|
| Transform where? | Before loading (in pipeline) | After loading (in warehouse) |
| Best for | Structured data, compliance-heavy | Cloud warehouses (BigQuery, Snowflake) |
| Flexibility | Less (transform logic is fixed) | More (transform in SQL after loading) |
| Cost | Compute in pipeline | Compute in warehouse |
| Trend | Legacy/on-prem | Modern/cloud-native |
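To make the distinction concrete, a hedged sketch of the ELT pattern: raw rows are loaded untouched, then transformed with SQL inside the warehouse. The connection is any DB-API-style client (sqlite placeholder style shown); table and column names are illustrative only:

```python
def elt_load_orders(conn, raw_rows):
    """ELT sketch: load raw rows first, then transform with SQL in-warehouse."""
    cur = conn.cursor()

    # Load: rows land in a staging table exactly as extracted.
    cur.executemany(
        "INSERT INTO staging_orders_raw"
        " (order_id, amount, customer_email, created_at) VALUES (?, ?, ?, ?)",
        raw_rows,
    )

    # Transform: SQL runs where the data already lives, so the logic can
    # change later without re-extracting anything (the ELT advantage).
    cur.execute("""
        CREATE TABLE IF NOT EXISTS orders_clean AS
        SELECT order_id,
               CAST(amount AS NUMERIC)     AS amount,
               LOWER(TRIM(customer_email)) AS customer_email,
               DATE(created_at)            AS order_date
        FROM staging_orders_raw
        WHERE order_id IS NOT NULL
    """)
    conn.commit()
```

An ETL version would run the same cleanup in pipeline code before the INSERT; the table above reduces to one decision: where the compute lives, and how often the transform logic will change.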
Pipeline Architecture
[Sources] → [Extract] → [Stage] → [Transform] → [Load] → [Serve]
    ↑                                                       ↓
    │       [Quality checks at every stage]            [Dashboard]
    │       [Monitoring & alerting]                    [API]
    └──────────── [Orchestrator (Airflow/Prefect)] ─────────┘
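In code, this flow reduces to stages chained through quality gates. A minimal, orchestrator-agnostic sketch with stub stage bodies:

```python
class QualityCheckError(Exception):
    """Raised when a stage's output fails validation."""

def gate(stage: str, rows: list[dict]) -> list[dict]:
    # Minimal per-stage gate: never let an empty result pass silently.
    if not rows:
        raise QualityCheckError(f"{stage} produced 0 rows")
    return rows

def run_pipeline(extract, transform, load) -> None:
    """Chain the stages; any orchestrator can wrap this callable."""
    raw = gate("extract", extract())
    clean = gate("transform", transform(raw))
    load(clean)  # load-stage checks run against the warehouse afterwards
```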
Data Source Types
| Source | Extraction Method | Challenges |
|---|---|---|
| Database | CDC (Change Data Capture), bulk query, replication | Schema changes, performance impact on source |
| API | REST/GraphQL polling, webhooks | Rate limits, pagination, auth token refresh |
| Files | S3/GCS pickup, SFTP, email attachment | Format inconsistency, encoding issues |
| Streaming | Kafka, Kinesis, Pub/Sub | Ordering, exactly-once processing |
| SaaS tools | Pre-built connectors (Fivetran, Airbyte) | API changes, data model complexity |
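As an example of the API row, a sketch of paginated REST extraction with basic rate-limit handling using `requests`. The endpoint, the `page` parameter, and the response shape ({"results": [...], "next": bool}) are assumptions to adapt to the real API:

```python
import time
import requests

def extract_api(base_url: str, token: str) -> list[dict]:
    """Pull all pages from a REST endpoint, honoring 429 rate limits."""
    session = requests.Session()
    session.headers["Authorization"] = f"Bearer {token}"
    rows, page = [], 1

    while True:
        resp = session.get(base_url, params={"page": page}, timeout=30)
        if resp.status_code == 429:
            # Rate limited: honor Retry-After if present, else back off.
            time.sleep(int(resp.headers.get("Retry-After", 30)))
            continue
        resp.raise_for_status()
        payload = resp.json()
        rows.extend(payload["results"])
        if not payload.get("next"):
            return rows
        page += 1
```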
Orchestration Tools
| Tool | Type | Best For | Complexity |
|---|---|---|---|
| Airflow | Python DAGs | Complex pipelines, team of engineers | High |
| Prefect | Python, modern API | Simpler than Airflow, good DX | Medium |
| dbt | SQL transforms only | Transform layer in ELT | Low-Medium |
| Cron | Simple scheduling | Single script, low complexity | Low |
| Fivetran/Airbyte | Managed connectors | Extract + Load (no transform) | Low |
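To ground the table, a minimal Airflow DAG sketch using the TaskFlow API, assuming Airflow 2.x; the schedule, names, and task bodies are illustrative stubs:

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def orders_pipeline():
    @task
    def extract() -> list[dict]:
        return [{"order_id": 1}]  # stub: pull from the real source here

    @task
    def check(rows: list[dict]) -> list[dict]:
        # Quality gate between extract and load.
        assert rows, "quality check failed: extract produced 0 rows"
        return rows

    @task
    def load(rows: list[dict]) -> None:
        ...  # stub: upsert into the warehouse

    load(check(extract()))

orders_pipeline()
```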
Data Quality Framework
| Check | What It Validates | When |
|---|---|---|
| Row count | Expected number of rows (within ±10% of prior run) | After extract, after load |
| Null check | Critical columns have no unexpected nulls | After extract |
| Schema validation | Column names, types match expected | After extract |
| Freshness | Data is recent (not stale) | After load |
| Uniqueness | No duplicate primary keys | After load |
| Range check | Values within expected bounds | After transform |
| Referential integrity | Foreign keys match parent tables | After load |
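After-load checks are often simplest to express as SQL run in the warehouse itself. A sketch where each query returns offending rows, so zero rows means pass (Postgres-flavored SQL; table and column names are hypothetical):

```python
# Each check is a query that returns offending rows; zero rows = pass.
# `cursor` is any DB-API cursor pointed at the warehouse.
AFTER_LOAD_CHECKS = {
    "uniqueness": """
        SELECT order_id FROM orders
        GROUP BY order_id HAVING COUNT(*) > 1
    """,
    "referential_integrity": """
        SELECT o.order_id FROM orders o
        LEFT JOIN customers c ON o.customer_id = c.customer_id
        WHERE c.customer_id IS NULL
    """,
    "freshness": """
        SELECT 1 WHERE (SELECT MAX(loaded_at) FROM orders)
                       < CURRENT_TIMESTAMP - INTERVAL '1 day'
    """,
}

def run_after_load_checks(cursor) -> None:
    for name, sql in AFTER_LOAD_CHECKS.items():
        cursor.execute(sql)
        if cursor.fetchone() is not None:
            raise RuntimeError(f"after-load check failed: {name}")
```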
Pipeline Design Steps
- Map sources and destinations: What data, from where, to where?
- Define freshness requirements: Real-time? Hourly? Daily?
- Choose architecture: ETL or ELT based on tools and team
- Build incrementally: Start with one source, one destination, one schedule
- Add quality checks: At minimum: row count + null check + freshness
- Set up monitoring: Alert on failure, quality check violations, and latency (see the sketch after this list)
- Document: Data dictionary, pipeline diagram, SLAs
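For the monitoring step, a sketch of a failure- and latency-alerting wrapper around a pipeline run. Here `send_alert` is a placeholder for Slack, PagerDuty, email, or similar:

```python
import logging
import time

logger = logging.getLogger("pipeline")

def send_alert(message: str) -> None:
    # Placeholder: wire this to Slack, PagerDuty, email, etc.
    logger.error("ALERT: %s", message)

def monitored_run(name: str, pipeline, max_seconds: float = 3600) -> None:
    """Run a pipeline callable; alert on failure or excessive latency."""
    start = time.monotonic()
    try:
        pipeline()
    except Exception as exc:  # includes quality-check violations
        send_alert(f"{name} failed: {exc}")
        raise
    elapsed = time.monotonic() - start
    if elapsed > max_seconds:
        send_alert(f"{name} exceeded latency SLA: {elapsed:.0f}s")
```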
Output Format
# Data Pipeline Design: {Project}
Gotchas
- Idempotency is essential: A pipeline that runs twice should produce the same result as running once. Use upsert (not insert) and date-partitioned loads; see the sketch after this list.
- Schema drift: Source systems change schemas without warning. Build schema detection and alerting.
- Backfill capability: When a pipeline fails for 3 days, can you rerun for those days without duplicating data? Design for this from day 1.
- Don't build what you can buy: Fivetran/Airbyte handle 200+ source connectors. Writing a custom Salesforce extractor is rarely worth the engineering time.
- Data warehouse vs data lake: Warehouse (BigQuery, Snowflake) = structured, SQL-queryable. Lake (S3, GCS) = raw, any format. Most modern stacks use both (lakehouse pattern).
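A sketch tying the idempotency and backfill gotchas together: a date-partitioned, delete-then-reload load (a MERGE/upsert is the alternative for keyed tables). Postgres-style placeholders; table and column names are hypothetical. Rerunning any day is safe, which makes backfill a loop over missed days:

```python
from datetime import date

def load_partition(conn, day: date, rows: list[tuple]) -> None:
    """Idempotent load: replace exactly one day's partition.

    Rerunning for the same `day` deletes and reloads that slice, so a
    three-day outage is fixed by calling this once per missed day.
    """
    cur = conn.cursor()
    # Delete-then-insert in one transaction makes the load idempotent.
    cur.execute("DELETE FROM orders WHERE order_date = %s", (day,))
    cur.executemany(
        "INSERT INTO orders (order_id, amount, order_date) VALUES (%s, %s, %s)",
        rows,
    )
    conn.commit()

def backfill(conn, days: list[date], extract_day) -> None:
    # Rerun any missed days without duplicating data.
    for day in days:
        load_partition(conn, day, extract_day(day))
```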
References
- For dbt project structure, see references/dbt-guide.md
- For data warehouse modeling (star schema), see references/dimensional-modeling.md
Tags
technology, data-pipeline, etl, data-engineering