PageRank Algorithm
Implement the PageRank algorithm to compute web page importance scores using the random surfer model. Use this skill when the user needs to rank pages by link authority, build a simplified search ranking system, or understand how link structure determines page importance, even if they say 'which pages are most important', 'link analysis', or 'page authority score'.
Algorithm skill: PageRank Algorithm analysis and application.
Overview
PageRank computes the importance of web pages by modeling a random surfer who follows links with probability d (the damping factor) and jumps to a uniformly random page with probability 1-d. Each iteration costs O(E), so the total runtime is O(k * E), where k is the number of iterations to convergence and E is the number of edges.
When to Use
Trigger conditions:
- Computing page importance from link graph structure
- Building link-based authority scoring systems
- Analyzing citation networks or any directed graph importance
When NOT to use:
- When you only need keyword relevance (use TF-IDF instead)
- When the graph is undirected or unweighted (consider centrality measures)
Algorithm
IRON LAW: PageRank Convergence
- Damping factor d MUST be < 1 (typically 0.85)
- Without damping, rank sinks and spider traps break convergence
- Correctness invariant: sum of all PageRank values = 1.0
Phase 1: Input Validation
Build adjacency list from link data. Verify: no self-loops counted, all nodes accounted for (including dangling nodes with no outlinks). Gate: Graph is well-formed, dangling nodes identified.
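Phase 1 can be sketched as follows (a minimal sketch; the `build_graph` name and the `(source, target)` pair input are illustrative, not a prescribed API):

```python
def build_graph(link_pairs):
    """Build an adjacency list from (source, target) pairs, dropping self-loops."""
    links = {}
    nodes = set()
    for src, dst in link_pairs:
        nodes.update((src, dst))
        if src == dst:
            continue                      # self-loops are not counted
        links.setdefault(src, []).append(dst)
    # Gate: every node is accounted for; dangling nodes are those with no outlinks.
    dangling = sorted(nodes - set(links))
    return links, dangling
```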
Phase 2: Core Algorithm
- Initialize all N pages with PR = 1/N
- For each iteration:
- For each page p: PR(p) = (1-d)/N + d * Σ(PR(q)/L(q)) for all q linking to p
- Distribute dangling node rank equally to all pages
- Repeat until convergence (L1 norm change < ε, typically 1e-6)
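The steps above can be sketched in Python (names like `pagerank` are illustrative; `links` is assumed to map each page to its outlink list):

```python
def pagerank(links, d=0.85, eps=1e-6, max_iter=100):
    """Power-iteration PageRank with dangling-node redistribution."""
    # Collect every node, including link targets with no entry of their own.
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    pr = {p: 1.0 / n for p in nodes}           # initialize PR = 1/N
    for it in range(max_iter):
        nxt = {p: (1 - d) / n for p in nodes}  # random-jump term (1-d)/N
        dangling = 0.0
        for p in nodes:
            out = links.get(p, [])
            if not out:
                dangling += pr[p]              # no outlinks: hold for redistribution
            else:
                share = pr[p] / len(out)       # PR(q)/L(q)
                for q in out:
                    nxt[q] += d * share
        for p in nodes:                        # spread dangling mass uniformly
            nxt[p] += d * dangling / n
        if sum(abs(nxt[p] - pr[p]) for p in nodes) < eps:  # L1 convergence test
            return nxt, it + 1
        pr = nxt
    return pr, max_iter
```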
Phase 3: Verification
Check: all PR values sum to ~1.0. Compare top-k rankings against known authority pages. Gate: |Σ PR - 1.0| < 0.001 and convergence achieved within max iterations.
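The mass-conservation gate can be written as a one-line assertion (the `verify` helper is hypothetical):

```python
def verify(pr, tol=1e-3):
    """Phase 3 gate: PageRank mass must be conserved, |sum(PR) - 1.0| < tol."""
    total = sum(pr.values())
    assert abs(total - 1.0) < tol, f"rank leaked: sum={total}"
    return total
```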
Phase 4: Output
Return sorted page scores with rank position.
Output Format
{
"rankings": [{"page": "url", "score": 0.042, "rank": 1}],
"metadata": {"nodes": 1000, "edges": 5000, "iterations": 45, "damping": 0.85, "converged": true}
}
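One way to assemble this structure (a sketch; `format_output` and its parameters are assumptions, with `scores` a page-to-PageRank dict):

```python
import json

def format_output(scores, edges, iterations, damping=0.85, converged=True):
    """Serialize PageRank scores into the output schema above."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return json.dumps({
        "rankings": [
            {"page": p, "score": round(s, 3), "rank": i + 1}
            for i, (p, s) in enumerate(ranked)
        ],
        "metadata": {
            "nodes": len(scores),
            "edges": edges,
            "iterations": iterations,
            "damping": damping,
            "converged": converged,
        },
    })
```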
Examples
Sample I/O
Input: Pages A→B, A→C, B→C, C→A (3 nodes, 4 edges, d=0.85). Expected Output: C ≈ 0.397, A ≈ 0.388, B ≈ 0.215 (approximate).
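These values can be reproduced by iterating the update equations for this specific graph (a self-contained sketch, not a general implementation):

```python
# Fixed-point iteration of PR(p) = (1-d)/N + d * sum(PR(q)/L(q)) for the
# 3-node graph A->B, A->C, B->C, C->A. A has 2 outlinks, B and C have 1.
d, n = 0.85, 3
pr = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}
for _ in range(200):
    pr = {
        "A": (1 - d) / n + d * pr["C"],                   # in-link: C -> A
        "B": (1 - d) / n + d * pr["A"] / 2,               # in-link: A -> B
        "C": (1 - d) / n + d * (pr["A"] / 2 + pr["B"]),   # in-links: A -> C, B -> C
    }
print({p: round(v, 3) for p, v in pr.items()})
# → {'A': 0.388, 'B': 0.215, 'C': 0.397}
```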
Edge Cases
| Input | Expected | Why |
|---|---|---|
| Single node, no links | PR = 1.0 | Only node gets all rank |
| All nodes link to one | Target gets highest PR | Star topology concentrates rank |
| Dangling node (no outlinks) | Distribute its rank equally | Prevents rank leakage |
Gotchas
- Dangling nodes: Pages with no outgoing links leak rank. Redistribute their rank equally across all pages each iteration.
- Spider traps: A group of pages that link only to each other accumulates rank. The damping factor mitigates this but doesn't eliminate it entirely.
- Convergence speed: Dense graphs converge faster. Sparse graphs with long chains may need 100+ iterations.
- Floating point accumulation: For large graphs, use double precision. Single precision drifts noticeably after 50+ iterations.
- Personalized PageRank: Standard PageRank uses uniform random jump. For personalized recommendations, bias the jump vector toward seed pages.
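For personalization, only the jump term changes: the uniform (1-d)/N becomes (1-d) * v[p], where v is a seed-biased vector summing to 1 (a sketch; `personalized_jump` is a hypothetical helper):

```python
def personalized_jump(nodes, seeds):
    """Jump vector biased toward seed pages: uniform over seeds, zero elsewhere."""
    v = {p: 0.0 for p in nodes}
    for s in seeds:
        v[s] = 1.0 / len(seeds)
    return v
# In the update rule, replace (1 - d) / N with (1 - d) * v[p] for each page p.
```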
References
- For the mathematical derivation of the convergence proof, see references/convergence-proof.md
- For an efficient sparse matrix implementation, see references/sparse-implementation.md