Yggdrasil

Text Summarization

Released
algorithm

Implement text summarization using extractive and abstractive approaches. Use this skill when the user needs to condense long documents, build an automatic summarization pipeline, or compare summarization strategies — even if they say 'summarize this document', 'TLDR', or 'key points extraction'.

Algorithm skill: analysis and application of Text Summarization.


Overview

Text summarization condenses documents while preserving key information. Extractive: selects and concatenates important sentences from the original. Abstractive: generates new text that paraphrases the content. Extractive is simpler and more faithful; abstractive is more fluent but may hallucinate.

When to Use

Trigger conditions:

  • Condensing long documents, reports, or article collections
  • Building automated summary pipelines for content curation
  • Comparing extractive vs abstractive approaches for a use case

When NOT to use:

  • When full document understanding is needed (summarization loses detail)
  • For structured data extraction (use NER or information extraction)

Algorithm

IRON LAW: Abstractive Summarization Can HALLUCINATE
Abstractive models may generate fluent text containing facts NOT in
the source. Always verify key claims in abstractive summaries against
the original document. For high-stakes use cases (legal, medical),
prefer extractive or use abstractive with factual consistency checking.

Phase 1: Input Validation

Determine: input length, target summary length (ratio or word count), single-doc vs multi-doc, domain. Gate: Input text available, target length defined.
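The Phase 1 gate can be sketched as a small check. Everything here is illustrative: `validate_input`, the 10% ratio default, and the 100-word floor are assumptions for the sketch, not part of the skill spec.

```python
import re

def validate_input(text: str, target_ratio: float = 0.1, min_words: int = 100):
    """Phase 1 gate: confirm input is usable and derive a target length.

    Returns (ok, target_words). A failed gate signals the short-input
    edge case (return the text as-is rather than summarize).
    """
    words = re.findall(r"\w+", text)
    if len(words) < min_words:
        return False, len(words)  # already concise; skip summarization
    return True, max(1, int(len(words) * target_ratio))
```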

Phase 2: Core Algorithm

Extractive (TextRank/LexRank):

  1. Split document into sentences
  2. Build similarity graph (sentence nodes, cosine similarity edges)
  3. Run PageRank on sentence graph
  4. Select top-k sentences by rank, reorder by original position
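The four steps above can be sketched with the standard library alone. This is a minimal illustration: word-overlap (Jaccard) similarity stands in for the cosine similarity a real implementation would compute over TF-IDF or embedding vectors, and `textrank_summary` is an assumed name, not a library API.

```python
import re
from itertools import combinations

def textrank_summary(text, k=3, d=0.85, iters=30):
    """Extractive summary: sentence graph + PageRank + top-k selection."""
    # 1. Split document into sentences (naive punctuation-based split)
    sents = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sents) <= k:
        return sents
    words = [set(re.findall(r"\w+", s.lower())) for s in sents]
    n = len(sents)
    # 2. Similarity graph: Jaccard overlap between sentence word sets
    sim = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        union = words[i] | words[j]
        sim[i][j] = sim[j][i] = len(words[i] & words[j]) / len(union) if union else 0.0
    # 3. PageRank by power iteration over the weighted graph
    rank = [1.0 / n] * n
    for _ in range(iters):
        rank = [
            (1 - d) / n
            + d * sum(sim[j][i] * rank[j] / (sum(sim[j]) or 1.0)
                      for j in range(n) if j != i)
            for i in range(n)
        ]
    # 4. Top-k sentences by rank, reordered by original position
    top = sorted(sorted(range(n), key=lambda i: -rank[i])[:k])
    return [sents[i] for i in top]
```

Reordering by position (step 4) matters for readability: rank order and discourse order usually differ.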

Abstractive (transformer-based):

  1. Use pre-trained model (BART, T5, Pegasus)
  2. Encode input document (handle length limits with chunking if needed)
  3. Generate summary with beam search
  4. Post-process: check for repetition, factual consistency
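Step 2's length handling can be sketched as chunk-then-summarize. In the sketch below, `summarize_fn` is a hypothetical placeholder for a real model call (e.g. a pre-trained BART/T5 generating with beam search); only the chunking and hierarchical-recombination logic is shown.

```python
def chunked_summary(text, summarize_fn, max_words=400):
    """Chunk-then-summarize for documents over the model's length limit:
    split into word-budgeted chunks, summarize each via the model
    callback, then concatenate -- or, if the concatenation is still too
    long, summarize the partial summaries (hierarchical summarization)."""
    words = text.split()
    if len(words) <= max_words:
        return summarize_fn(text)
    chunks = [" ".join(words[i:i + max_words])
              for i in range(0, len(words), max_words)]
    partials = [summarize_fn(c) for c in chunks]
    combined = " ".join(partials)
    return summarize_fn(combined) if len(combined.split()) > max_words else combined
```

A word budget is a crude proxy for the model's token limit; a production pipeline would count tokens with the model's own tokenizer and split on sentence boundaries rather than mid-sentence.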

Phase 3: Verification

Evaluate: ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L) against reference summaries. Manual check for factual accuracy and coherence. Gate: ROUGE scores reasonable for domain, no hallucinations in spot-check.
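ROUGE measures n-gram overlap with reference summaries; a minimal stdlib sketch of ROUGE-N makes the metric concrete. Production evaluation should use an established implementation (e.g. the rouge-score package), which adds stemming and ROUGE-L; this version uses bare lowercased tokens.

```python
from collections import Counter

def rouge_n(candidate: str, reference: str, n: int = 1):
    """ROUGE-N precision/recall/F1 from clipped n-gram overlap counts."""
    def ngrams(s):
        toks = s.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # & clips counts to the minimum
    p = overlap / max(sum(cand.values()), 1)
    r = overlap / max(sum(ref.values()), 1)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return {"precision": p, "recall": r, "f1": f1}
```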

Phase 4: Output

Return summary with metadata.

Output Format

{
  "summary": "The company reported Q4 revenue of...",
  "method": "extractive_textrank",
  "metadata": {"input_words": 2000, "summary_words": 200, "compression_ratio": 0.10, "sentences_selected": 5}
}

Examples

Sample I/O

Input: 2000-word news article about quarterly earnings
Expected: 200-word summary covering revenue, profit, guidance, and key highlights. Extractive: 5-6 selected sentences. Abstractive: a coherent paragraph.

Edge Cases

  • Very short input (< 100 words) → Return as-is or minimal trimming (already concise)
  • Multiple contradicting sections → Summary may miss nuance (summarization favors the dominant theme)
  • Technical jargon → Extractive preserves it; abstractive may simplify (domain expertise affects quality)

Gotchas

  • ROUGE ≠ quality: ROUGE measures n-gram overlap with references. A high-ROUGE summary can be incoherent, and a low-ROUGE summary can be excellent with different word choices.
  • Input length limits: Transformer models have max token limits (512-4096). Long documents need chunking strategies (chunk-then-summarize or hierarchical summarization).
  • Repetition: Abstractive models sometimes repeat phrases. Use repetition penalty during generation (no_repeat_ngram_size).
  • Position bias: In news text, important information is front-loaded (inverted pyramid). Simple "take first N sentences" is a strong extractive baseline.
  • Multi-document summarization: Summarizing multiple related documents requires handling redundancy and contradiction across sources.
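The position-bias point above is worth having as runnable code: the lead-N baseline is a few lines, yet on news text it is notoriously hard for learned extractive systems to beat.

```python
import re

def lead_n(text: str, n: int = 3):
    """Lead-N baseline: return the first n sentences verbatim.
    Strong on news text because of the inverted-pyramid structure."""
    sents = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return " ".join(sents[:n])
```

Any proposed extractive method should report its margin over this baseline, not just absolute ROUGE.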

References

  • For TextRank/LexRank implementation details, see references/graph-based-extraction.md
  • For factual consistency checking, see references/factual-consistency.md

Tags

nlp, summarization, extractive, abstractive