Exploratory Data Analysis (EDA)

Released

methodology analytics

Conduct Exploratory Data Analysis (EDA) using descriptive statistics, visualizations, and data quality checks. Use this skill when the user has a dataset and needs to understand its structure, find patterns, detect anomalies, or prepare data for further analysis — even if they say 'what does this data look like', 'find interesting patterns', 'clean this data', or 'summarize this dataset'.

Statistical methodology skill: analysis and application of Exploratory Data Analysis (EDA).

Methodology

IRON LAW: Perform EDA Only AFTER Train/Test Split — Or You Leak the Future

Agents know "do EDA first," but they almost always run it on the FULL
dataset before splitting. That is information leakage: once you have seen
the test set's distributions, outliers, and correlations, your subsequent
modeling choices (feature scaling, outlier treatment, imputation strategy)
are informed by data the model should never see. Split first, then run EDA
on the training set only. Fit every transformation on the training set and
apply it unchanged to the test set, without re-examining the test set.

Exception: data quality checks (nulls, dtypes, duplicates) CAN run on
the full dataset since they don't inform model hyperparameters.
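The split-first rule can be sketched as follows, using only the Python standard library. The helper names (`split_rows`, `fit_scaler`, `apply_scaler`), the 80/20 ratio, the seed, and the toy income values are all assumptions made for illustration; in a real project a library splitter and scaler would typically play the same roles.

```python
import random
import statistics


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_scaler(train_values):
    """Learn standardization parameters from the training values only."""
    return {
        "mean": statistics.fmean(train_values),
        "std": statistics.stdev(train_values),
    }


def apply_scaler(values, params):
    """Apply parameters learned on train; never refit on test."""
    return [(v - params["mean"]) / params["std"] for v in values]


# Hypothetical toy data: a single numeric column.
incomes = [31.0, 45.5, 52.0, 38.2, 61.7, 47.3, 29.9, 55.1, 40.8, 49.6]

train, test = split_rows(incomes)          # 1. split first
params = fit_scaler(train)                 # 2. explore and fit on train only
train_scaled = apply_scaler(train, params)
test_scaled = apply_scaler(test, params)   # 3. reuse the train parameters on test
```

The test set is touched only in the last line, and only through parameters that were fixed before it was looked at.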

EDA Workflow

Standard five-phase flow (structure → quality → univariate → bivariate → findings summary). Assume the agent already knows these steps. Focus on the non-obvious traps below instead.

Critical additions most EDA guides miss:

Split BEFORE explore (see IRON LAW above)
Missing data pattern matters more than count: data that is MCAR (missing completely at random) is generally safe to impute; MNAR (missing not at random, e.g. high-income respondents skip the income question) requires domain modeling, not mean-fill
Simpson's paradox check: If a trend holds in the aggregate but reverses within subgroups, the aggregate trend is misleading. Always stratify by the most obvious confound before reporting a bivariate finding
Data leakage in features: A feature that correlates almost perfectly with the target is usually derived FROM the target (e.g. "refund_amount" predicting churn: it is an effect, not a cause). Flag any feature with |r| > 0.95 against the target for causal review
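Two of the checks above can be sketched in a few lines of standard-library Python. The function names, the record layout, and all counts and values below are hypothetical toy data chosen to make the effects visible. The leakage check uses the absolute value of Pearson's r so that strong negative correlations are flagged too.

```python
from collections import defaultdict


def success_rates(records, by):
    """Return {key: success rate}, where key is a tuple of the `by` fields."""
    totals = defaultdict(lambda: [0, 0])  # key -> [successes, trials]
    for rec in records:
        key = tuple(rec[f] for f in by)
        totals[key][0] += rec["successes"]
        totals[key][1] += rec["trials"]
    return {k: s / n for k, (s, n) in totals.items()}


def simpson_reversal(records, treatment, stratum):
    """True if the aggregate ranking of two treatments flips in every stratum."""
    overall = success_rates(records, [treatment])
    (a,), (b,) = sorted(overall)  # assumes exactly two treatment values
    aggregate_sign = overall[(a,)] > overall[(b,)]
    within = success_rates(records, [treatment, stratum])
    strata = {k[1] for k in within}
    return all((within[(a, s)] > within[(b, s)]) != aggregate_sign for s in strata)


def pearson_r(xs, ys):
    """Pearson correlation; assumes neither column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def flag_leakage_candidates(features, target, threshold=0.95):
    """Names of features whose |r| with the target exceeds the threshold."""
    return [name for name, values in features.items()
            if abs(pearson_r(values, target)) > threshold]


# Hypothetical counts: B looks better overall, A is better in each stratum.
records = [
    {"treatment": "A", "severity": "small", "successes": 9, "trials": 10},
    {"treatment": "A", "severity": "large", "successes": 60, "trials": 100},
    {"treatment": "B", "severity": "small", "successes": 80, "trials": 100},
    {"treatment": "B", "severity": "large", "successes": 5, "trials": 10},
]
reversed_within_strata = simpson_reversal(records, "treatment", "severity")

# Hypothetical training-set columns.
churned = [0, 0, 1, 1, 0, 1]
features = {
    "refund_amount": [0, 0, 50, 40, 0, 60],
    "tenure_months": [12, 3, 8, 15, 6, 10],
    "is_active": [1, 1, 0, 0, 1, 0],
}
flagged = flag_leakage_candidates(features, churned)
```

A flagged feature is a candidate for causal review, not an automatic drop: the point is to ask whether the value could have been known before the outcome occurred.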

For missing-data handling strategies, see references/missing-data.md.

Output Format

# EDA Report: {Dataset Name}

Overview

Rows: {N}, Columns: {N}
Date range: {if applicable}
Key columns: {description}

Data Quality

| Issue | Columns Affected | Count/% | Action |
|---|---|---|---|
| Missing values | {cols} | {N / %} | {drop / impute / investigate} |
| Outliers | {cols} | {N} | {cap / remove / keep} |
| Duplicates | — | {N} | {remove} |

Key Statistics

| Variable | Mean | Median | Std | Min | Max | Distribution |
|---|---|---|---|---|---|---|
| {var} | ... | ... | ... | ... | ... | {normal/skewed/bimodal} |

Key Findings

{insight with supporting data}
{insight}
{insight}

Recommendations

{next analysis step or data issue to resolve}
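The numbers that fill the Data Quality and Key Statistics tables in the template above could be gathered along these lines. This is a minimal standard-library sketch: the function names, the use of `None` as the missing-value marker, and the toy rows are assumptions, and the Distribution column is left to visual inspection rather than computed.

```python
import statistics


def quality_summary(rows, columns):
    """Count missing values per column and exact duplicate rows."""
    n = len(rows)
    missing = {c: sum(1 for r in rows if r.get(c) is None) for c in columns}
    seen, duplicates = set(), 0
    for r in rows:
        key = tuple(r.get(c) for c in columns)
        if key in seen:
            duplicates += 1
        seen.add(key)
    return {
        "rows": n,
        # column -> (count, percent of rows)
        "missing": {c: (m, round(100 * m / n, 1)) for c, m in missing.items()},
        "duplicates": duplicates,
    }


def key_statistics(values):
    """Mean, median, std, min, max over the non-missing values of one column."""
    present = [v for v in values if v is not None]
    return {
        "mean": statistics.fmean(present),
        "median": statistics.median(present),
        "std": statistics.stdev(present),
        "min": min(present),
        "max": max(present),
    }


# Hypothetical toy rows; None marks a missing value.
rows = [
    {"age": 34, "income": 52.0},
    {"age": 41, "income": None},
    {"age": 29, "income": 38.5},
    {"age": 34, "income": 52.0},   # exact duplicate of the first row
    {"age": None, "income": 61.0},
]

quality = quality_summary(rows, ["age", "income"])
income_stats = key_statistics([r["income"] for r in rows])
```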

Gotchas

Correlation ≠ causation: EDA finds associations. Establishing causation requires controlled experiments or causal inference methods.
Outliers can be data errors OR real signal: Don't auto-remove. Investigate. A transaction amount of $1M might be a typo or your biggest customer.
Missing data has meaning: Data missing from one column may be related to values in another. "Missing income" may mean "unemployed", not random. Check patterns.
Visualization lies: Truncated Y-axes, cherry-picked time ranges, and misleading scales can distort insights. Always use appropriate scales and note limitations.
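Two of these gotchas lend themselves to a quick check. The sketch below flags outliers without removing them and compares the missing rate of one column across the values of another. The function names, the conventional 1.5 × IQR fence, and the toy data are assumptions for illustration, not a prescribed method.

```python
import statistics
from collections import defaultdict


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]; flags only, removes nothing."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]


def missing_rate_by_group(rows, target_col, group_col):
    """Share of rows where target_col is missing, per value of group_col."""
    counts = defaultdict(lambda: [0, 0])  # group -> [missing, total]
    for r in rows:
        counts[r[group_col]][0] += r[target_col] is None
        counts[r[group_col]][1] += 1
    return {g: m / n for g, (m, n) in counts.items()}


# Hypothetical transaction amounts: one value needs a human look.
amounts = [120, 95, 130, 110, 105, 1_000_000, 125, 115]
to_investigate = iqr_outliers(amounts)

# Hypothetical survey rows: is missing income related to employment status?
survey = [
    {"status": "employed", "income": 52.0},
    {"status": "employed", "income": 47.5},
    {"status": "employed", "income": 61.0},
    {"status": "employed", "income": None},
    {"status": "unemployed", "income": None},
    {"status": "unemployed", "income": None},
    {"status": "unemployed", "income": 0.0},
    {"status": "unemployed", "income": None},
]
rates = missing_rate_by_group(survey, "income", "status")
```

A large gap in missing rates between groups suggests the data is not missing completely at random, so mean-filling would bias the column.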

References

For missing data handling strategies, see references/missing-data.md

Tags

data-analysis, eda, statistics, visualization