Exploratory Data Analysis (EDA)
Released · methodology · analytics
Conduct Exploratory Data Analysis (EDA) using descriptive statistics, visualizations, and data quality checks. Use this skill when the user has a dataset and needs to understand its structure, find patterns, detect anomalies, or prepare data for further analysis — even if they say 'what does this data look like', 'find interesting patterns', 'clean this data', or 'summarize this dataset'.
Statistical methodology skill: Exploratory Data Analysis (EDA), analysis and application.
Overview
- Rows: {N}, Columns: {N}
- Date range: {if applicable}
- Key columns: {description}
Methodology
IRON LAW: Perform EDA Only AFTER Train/Test Split — Or You Leak the Future
Agents know "do EDA first." But they almost always do EDA on the FULL
dataset before splitting. This is information leakage: you've seen the
test set's distributions, outliers, and correlations, and your subsequent
modeling choices (feature scaling, outlier treatment, imputation strategy)
are now informed by data the model shouldn't see. Split first, then EDA
only on the training set. Apply the same transformations to the test set
without re-examining it.
Exception: data quality checks (nulls, dtypes, duplicates) CAN run on
the full dataset since they don't inform model hyperparameters.
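A minimal sketch of the split-first rule, using only the Python standard library so it stays self-contained. The helper names (train_test_split, fit_scaler, apply_scaler) and the synthetic data are hypothetical and chosen for illustration; in practice a library such as scikit-learn would likely fill this role. The point it illustrates is that scaling parameters are estimated on the training rows only and then reused unchanged on the test rows.

```python
import random
import statistics


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it; the test part is set aside untouched."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_scaler(values):
    """Learn standardization parameters (mean, sample std) from the given values."""
    return statistics.mean(values), statistics.stdev(values)


def apply_scaler(values, mean, std):
    """Apply previously learned parameters without re-estimating them."""
    return [(v - mean) / std for v in values]


# Hypothetical toy data: 100 synthetic numeric values.
rng = random.Random(42)
data = [rng.gauss(50, 10) for _ in range(100)]

# 1. Split first.
train, test = train_test_split(data)

# 2. Explore and fit on the training set only.
mean, std = fit_scaler(train)
train_scaled = apply_scaler(train, mean, std)

# 3. Reuse the same parameters on the test set without inspecting it.
test_scaled = apply_scaler(test, mean, std)
```

Because the mean and standard deviation never see the test rows, the scaled test values will generally not have exactly zero mean or unit variance, which is expected.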
EDA Workflow
Standard five-phase flow (structure → quality → univariate → bivariate → findings summary). Assume the agent already knows these steps. Focus on the non-obvious traps below instead.
Critical additions most EDA guides miss:
- Split BEFORE explore (see IRON LAW above)
- Missing data pattern matters more than count: MCAR is safe to impute; MNAR (e.g. high-income respondents skip income question) requires domain modeling, not mean-fill
- Simpson's paradox check: If a trend holds in the aggregate but reverses within subgroups, the aggregate trend is misleading. Always stratify by the most obvious confound before reporting a bivariate finding
- Data leakage in features: A feature that correlates almost perfectly with the target is usually derived FROM the target (e.g. "refund_amount" predicting churn: it's an effect, not a cause). Flag any feature with |r| > 0.95 (positive or negative) for causal review
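The Simpson's paradox check in the list above can be sketched as follows. This is an illustrative example under stated assumptions: the function names (pearson, simpson_check), the record layout (group, x, y), and the toy data are all hypothetical. It compares the sign of the aggregate Pearson correlation with the sign inside each subgroup and reports the groups where the direction flips.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def simpson_check(records):
    """Return (aggregate r, per-group r, groups whose sign opposes the aggregate).

    records is a list of (group, x, y) tuples.
    """
    overall = pearson([x for _, x, _ in records], [y for _, _, y in records])
    by_group = defaultdict(list)
    for group, x, y in records:
        by_group[group].append((x, y))
    per_group = {
        group: pearson([x for x, _ in pts], [y for _, y in pts])
        for group, pts in by_group.items()
    }
    reversed_groups = [g for g, r in per_group.items() if r * overall < 0]
    return overall, per_group, reversed_groups


# Hypothetical toy data: y falls with x inside each group,
# but group B sits higher on both axes, so the pooled trend rises.
records = [
    ("A", 1, 5), ("A", 2, 4), ("A", 3, 3),
    ("B", 6, 10), ("B", 7, 9), ("B", 8, 8),
]

overall, per_group, reversed_groups = simpson_check(records)
```

A non-empty reversed_groups list is a signal to report the stratified result rather than the aggregate one. Which column to stratify by remains a domain judgment that the code cannot make.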
For missing data handling strategies, see references/missing-data.md.
Output Format
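The leakage flag from the list above (absolute correlation with the target greater than 0.95) might look like the sketch below. The function name flag_leakage_candidates, the feature names other than refund_amount, and all values are hypothetical. Per the rule above, it would be run on training rows only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def flag_leakage_candidates(features, target, threshold=0.95):
    """Return {feature name: r} for features with |r| above the threshold.

    A flagged feature is a candidate for causal review, not proof of leakage.
    """
    flagged = {}
    for name, values in features.items():
        r = pearson(values, target)
        if abs(r) > threshold:
            flagged[name] = r
    return flagged


# Hypothetical training rows: churn is the target (1 = churned).
churn = [0, 0, 1, 1, 0, 1, 0, 1]
features = {
    "refund_amount": [0, 0, 40, 55, 0, 60, 0, 45],   # only exists after churn
    "retained": [1, 1, 0, 0, 1, 0, 1, 0],            # the target, inverted
    "tenure_months": [12, 30, 8, 15, 24, 10, 36, 20],
}

flagged = flag_leakage_candidates(features, churn)
```

Using the absolute value matters: the inverted retained column correlates at roughly -1 and would slip past a check that only tests r > 0.95.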
# EDA Report: {Dataset Name}
Gotchas
- Correlation ≠ causation: EDA finds associations. Establishing causation requires controlled experiments or causal inference methods.
- Outliers can be data errors OR real signal: Don't auto-remove. Investigate. A transaction amount of $1M might be a typo or your biggest customer.
- Missing data has meaning: Data missing from one column may be related to values in another. "Missing income" may mean "unemployed", not random. Check patterns.
- Visualization lies: Truncated Y-axes, cherry-picked time ranges, and misleading scales can distort insights. Always use appropriate scales and note limitations.
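One way to check whether missing values in one column relate to another column, as the missing-data gotcha above suggests, is to compare missing rates across groups. The sketch below is illustrative only: the function name missing_rate_by_group, the column names, and the rows are hypothetical, and missing values are assumed to be encoded as None.

```python
from collections import defaultdict


def missing_rate_by_group(rows, target_col, group_col):
    """Fraction of rows where target_col is None, per value of group_col."""
    counts = defaultdict(lambda: [0, 0])  # group -> [missing, total]
    for row in rows:
        entry = counts[row[group_col]]
        entry[0] += row[target_col] is None
        entry[1] += 1
    return {group: missing / total for group, (missing, total) in counts.items()}


# Hypothetical survey rows; None marks a skipped income question.
rows = [
    {"employment": "employed", "income": 52000},
    {"employment": "employed", "income": 61000},
    {"employment": "employed", "income": None},
    {"employment": "employed", "income": 48000},
    {"employment": "employed", "income": 75000},
    {"employment": "unemployed", "income": None},
    {"employment": "unemployed", "income": None},
    {"employment": "unemployed", "income": 9000},
    {"employment": "unemployed", "income": None},
]

rates = missing_rate_by_group(rows, target_col="income", group_col="employment")
```

A large gap between groups suggests the values are not missing completely at random, so a plain mean-fill would be questionable. A small gap does not rule out dependence on the unobserved value itself, which cannot be detected from the observed data alone.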
References
- For missing data handling strategies, see references/missing-data.md
Tags
data-analysis, eda, statistics, visualization