
Named Entity Recognition (NER)

Released
algorithm

Implement Named Entity Recognition to identify and classify entities in text. Use this skill when the user needs to extract people, organizations, locations, dates, or custom entities from documents — even if they say 'extract names from text', 'find companies mentioned', or 'entity extraction'.

Algorithm skill: Named Entity Recognition analysis and application.


Overview

NER identifies and classifies named entities in text into predefined categories (Person, Organization, Location, Date, Money, etc.). Approaches range from rule-based (regex, gazetteers) through statistical (CRF) to neural (BiLSTM-CRF, transformer-based). Modern NER typically uses spaCy or Hugging Face models, reaching F1 scores of 85-95% on standard benchmarks.
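
As a contrast to the model-based approaches, a rule-based extractor can be as simple as a set of regexes. A minimal sketch in Python; the patterns and entity set here are illustrative only, not part of this skill:

import re

# Toy rule-based NER for two entity types. The patterns are illustrative,
# not production-grade; real rule-based systems add gazetteers and many more rules.
PATTERNS = {
    "DATE": re.compile(
        r"\b(?:January|February|March|April|May|June|July|August|"
        r"September|October|November|December)\s+\d{1,2}\b"),
    "MONEY": re.compile(r"\$\d[\d,]*(?:\.\d{2})?"),
}

def rule_based_ner(text):
    """Return entity spans as dicts with text, type, and character offsets."""
    return [
        {"text": m.group(), "type": etype, "start": m.start(), "end": m.end()}
        for etype, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]

print(rule_based_ner("Apple paid $3,000,000 for the campus on March 15."))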

When to Use

Trigger conditions:

  • Extracting structured entities from unstructured text
  • Building knowledge graphs from documents
  • Preprocessing for information retrieval or question answering

When NOT to use:

  • For text classification (categorizing whole documents, not extracting entities)
  • For relation extraction between entities (requires an additional RE model)

Algorithm

IRON LAW: NER Performance Depends on DOMAIN Match
A model trained on news text (OntoNotes) performs poorly on medical
records or legal documents. Domain-specific entities (drug names,
legal citations, product SKUs) require domain-specific training data
or fine-tuning. Always evaluate on YOUR domain's data.

Phase 1: Input Validation

Determine: target entity types (standard: PER, ORG, LOC, DATE, MONEY or custom), input language, domain. Select appropriate pre-trained model or prepare training data. Gate: Entity types defined, model or training data available.

Phase 2: Core Algorithm

Pre-trained model approach (sketched below):

  1. Load model (spaCy, Hugging Face NER pipeline)
  2. Process text through the pipeline
  3. Extract entity spans with type labels and confidence scores
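
A minimal sketch of this path using the Hugging Face pipeline; the checkpoint dslim/bert-base-NER is just one public example, and any token-classification model can be substituted:

from transformers import pipeline

# aggregation_strategy="simple" merges B-/I- subword pieces into whole entity spans
ner = pipeline("token-classification", model="dslim/bert-base-NER",
               aggregation_strategy="simple")

for ent in ner("Tim Cook announced that Apple will open a new store in Taipei."):
    # Each result carries: entity_group, score, word, start, end
    print(ent["entity_group"], round(float(ent["score"]), 2), ent["word"])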

Fine-tuning approach (label alignment sketched below):

  1. Annotate 200+ domain-specific examples in BIO format
  2. Fine-tune transformer model (BERT, RoBERTa) on annotated data
  3. Evaluate on held-out test set
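
The fiddly part of step 2 is aligning word-level BIO labels with subword tokens. A sketch assuming a fast tokenizer; the example sentence, label set, and base checkpoint are illustrative:

from transformers import AutoTokenizer

# One BIO-annotated example (hypothetical; real fine-tuning needs 200+ of these)
tokens = ["Tim", "Cook", "announced", "that", "Apple", "will", "open", "a", "store"]
labels = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "O", "O", "O"]
label2id = {l: i for i, l in enumerate(
    ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"])}

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # any fast tokenizer

def align_labels(tokens, labels):
    """Tokenize pre-split words and align BIO labels to subword pieces.
    Special tokens and subword continuations get -100 so the loss ignores them."""
    enc = tokenizer(tokens, is_split_into_words=True, truncation=True)
    aligned, prev = [], None
    for word_id in enc.word_ids():
        aligned.append(-100 if word_id is None or word_id == prev
                       else label2id[labels[word_id]])
        prev = word_id
    enc["labels"] = aligned
    return enc

print(align_labels(tokens, labels)["labels"])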

Phase 3: Verification

Evaluate: precision, recall, F1 per entity type. Check: boundary detection (exact span match) and type classification accuracy. Gate: F1 > 0.80 per entity type on domain-relevant test data.
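
Entity-level metrics with exact span matching are what the seqeval library computes. A toy check of the F1 gate; the tag sequences are fabricated for illustration:

from seqeval.metrics import classification_report, f1_score

# Gold vs. predicted BIO tag sequences for two toy sentences
y_true = [["B-PER", "I-PER", "O", "B-ORG"], ["B-LOC", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O"], ["B-LOC", "O"]]

print(classification_report(y_true, y_pred))  # per-type precision/recall/F1
print("gate passed:", f1_score(y_true, y_pred) > 0.80)  # exact span matching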

Phase 4: Output

Return extracted entities with types, positions, and confidence.

Output Format

{
  "entities": [{"text": "Apple Inc.", "type": "ORG", "start": 0, "end": 10, "confidence": 0.95}],
  "metadata": {"model": "en_core_web_trf", "entities_found": 15, "types": {"PER": 5, "ORG": 6, "LOC": 4}}
}
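
A sketch shaping spaCy output into this format. It assumes the en_core_web_trf pipeline is installed; spaCy does not expose per-entity confidence by default, so a placeholder value stands in:

import json
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_trf")  # assumes: python -m spacy download en_core_web_trf
doc = nlp("Tim Cook announced that Apple will open a new store in Taipei on March 15.")

entities = [{"text": e.text, "type": e.label_, "start": e.start_char,
             "end": e.end_char, "confidence": 1.0}  # placeholder confidence
            for e in doc.ents]

print(json.dumps({
    "entities": entities,
    "metadata": {"model": "en_core_web_trf",
                 "entities_found": len(entities),
                 "types": dict(Counter(e["type"] for e in entities))},
}, indent=2))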

Examples

Sample I/O

Input: "Tim Cook announced that Apple will open a new store in Taipei on March 15." Expected: [Tim Cook/PER, Apple/ORG, Taipei/LOC, March 15/DATE]

Edge Cases

Input | Expected | Why
"Apple" (no context) | Ambiguous (fruit or company) | Context-dependent entity typing
Nested entities | Depends on annotation scheme | "Bank of America" = ORG, with "America" = LOC nested inside
Misspelled entity ("Appel") | May be missed | Not in the training data

Gotchas

  • Boundary errors: NER often gets the entity type right but the span wrong ("New" vs "New York City"). Evaluate with both exact and partial match metrics (see the sketch after this list).
  • Ambiguity: "Jordan" can be a person, country, or brand. Context-dependent disambiguation is hard; some models output the most likely type.
  • Chinese/Japanese NER: No whitespace tokenization makes boundary detection harder. Use language-specific tokenizers (jieba for Chinese).
  • Annotation consistency: Training data quality is critical. Inconsistent annotations (sometimes labeling "Dr." as part of name, sometimes not) degrade model performance.
  • Entity linking: NER identifies mentions; entity linking resolves them to knowledge base entries. "Apple" → Apple Inc. (Q312) or apple (fruit). These are separate tasks.
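
To make the first gotcha concrete, a sketch of exact vs. partial span matching; the span dicts are fabricated:

def exact_match(pred, gold):
    """Same type and identical character span."""
    return (pred["type"], pred["start"], pred["end"]) == \
           (gold["type"], gold["start"], gold["end"])

def partial_match(pred, gold):
    """Same type and overlapping character spans."""
    return (pred["type"] == gold["type"]
            and pred["start"] < gold["end"] and gold["start"] < pred["end"])

gold = {"text": "New York City", "type": "LOC", "start": 30, "end": 43}
pred = {"text": "New", "type": "LOC", "start": 30, "end": 33}
print(exact_match(pred, gold), partial_match(pred, gold))  # False True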

References

  • For BIO annotation format and guidelines, see references/bio-annotation.md
  • For fine-tuning NER with transformers, see references/transformer-ner.md

Tags

nlp, ner, entity-extraction, information-extraction