
Web Crawler


Implement a web crawler pipeline covering URL discovery, fetching, parsing, and storage. Use this skill when the user needs to build a site crawler, audit website structure, or collect web data systematically — even if they say 'scrape a website', 'crawl all pages', or 'site audit spider'.

Algorithm skill: Web Crawler analysis and application.


Overview

A web crawler systematically traverses web pages by discovering URLs, fetching content, parsing HTML, and storing results. It manages its URL frontier with BFS or a priority queue. Performance is I/O-bound, typically limited by politeness constraints rather than compute.

When to Use

Trigger conditions:

  • Building a site audit tool to discover all pages and their link structure
  • Collecting structured data from websites at scale
  • Mapping site architecture for SEO analysis

When NOT to use:

  • When you need data from a single API endpoint (use HTTP client directly)
  • When a sitemap.xml provides all needed URLs (parse sitemap instead)

Algorithm

IRON LAW: Respect robots.txt and Rate Limits
A crawler MUST:
1. Parse and obey robots.txt before crawling any path
2. Enforce crawl-delay (default 1s if unspecified)
3. Identify itself with a descriptive User-Agent
Ignoring these is unethical and will get your IP blocked.

Phase 1: Input Validation

Parse seed URLs, fetch and parse robots.txt for each domain, set crawl scope (same-domain, subdomain, or cross-domain). Gate: Valid seed URLs, robots.txt rules loaded, scope defined.
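The robots.txt gate can be sketched with Python's standard `urllib.robotparser`. This is a minimal sketch: the `MyCrawler/1.0` user-agent string and the sample rules are placeholders, and a real crawler would fetch `/robots.txt` over HTTP per domain rather than receive it as a string.

```python
from urllib.robotparser import RobotFileParser

def load_robots_rules(robots_txt: str, user_agent: str = "MyCrawler/1.0"):
    """Parse a robots.txt body; return the parser and the crawl delay.

    Defaults the delay to 1 second when unspecified, per the iron law.
    """
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    delay = rp.crawl_delay(user_agent) or 1.0
    return rp, delay

# Sample rules (placeholder content for illustration)
robots = """User-agent: *
Disallow: /private/
Crawl-delay: 2
"""
rp, delay = load_robots_rules(robots)
# rp.can_fetch("MyCrawler/1.0", url) now gates every dequeue in Phase 2
```

Calling `rp.can_fetch(user_agent, url)` before every fetch satisfies rule 1 of the iron law; sleeping `delay` seconds between requests to the same domain satisfies rule 2.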

Phase 2: Core Algorithm

  1. Initialize URL frontier with seed URLs (priority queue or FIFO)
  2. Dequeue URL, check: not visited, allowed by robots.txt, within scope
  3. Fetch page with timeout and retry logic, respect crawl-delay
  4. Parse HTML: extract links (normalize, deduplicate), extract content/metadata
  5. Enqueue discovered URLs, store parsed data
  6. Repeat until frontier empty or limit reached
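The six steps above can be sketched as a BFS loop. This is a sketch under stated assumptions: `fetch(url)` is an injected callable standing in for real HTTP fetching plus HTML link extraction (timeouts, retries, and the robots.txt/crawl-delay checks from Phase 1 are elided to keep the frontier logic visible).

```python
from collections import deque
from urllib.parse import urljoin, urldefrag

def crawl(seed, fetch, max_pages=100, max_depth=2):
    """BFS crawl sketch: `fetch(url)` returns the list of hrefs on a page."""
    frontier = deque([(seed, 0)])      # step 1: FIFO frontier -> breadth-first order
    visited, pages = set(), []
    while frontier and len(pages) < max_pages:
        url, depth = frontier.popleft()
        if url in visited or depth > max_depth:   # step 2: scope/visited checks
            continue
        visited.add(url)
        links = fetch(url)                        # step 3: fetch (politeness elided)
        pages.append({"url": url, "links_out": len(links), "depth": depth})
        for href in links:                        # steps 4-5: normalize, dedupe, enqueue
            absolute, _ = urldefrag(urljoin(url, href))
            if absolute not in visited:
                frontier.append((absolute, depth + 1))
    return pages                                  # step 6: loop exits on empty/limit

# Usage against a tiny in-memory "site" (hypothetical URLs)
site = {
    "http://example.com/":  ["/a", "/b"],
    "http://example.com/a": ["/b"],
    "http://example.com/b": [],
}
pages = crawl("http://example.com/", lambda u: site.get(u, []))
```

Swapping the `deque` for a `heapq`-based priority queue keyed on page importance turns the same loop into a priority-based frontier.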

Phase 3: Verification

Check: no robots.txt violations in crawl log, no duplicate pages stored, all discovered URLs accounted for. Gate: Crawl completed within scope, politeness maintained.

Phase 4: Output

Return site map with pages, link graph, and extracted metadata.

Output Format

{
  "pages": [{"url": "...", "status": 200, "title": "...", "links_out": 15, "depth": 2}],
  "metadata": {"pages_crawled": 500, "errors": 12, "duration_seconds": 300, "domain": "example.com"}
}

Examples

Sample I/O

Input: Seed: "https://example.com", max_depth: 2, max_pages: 100
Expected: Crawl tree with the homepage at depth 0 and linked pages at depths 1-2, respecting robots.txt

Edge Cases

| Input | Expected | Why |
| --- | --- | --- |
| robots.txt disallows `/` | Zero pages crawled | Must respect a full disallow |
| Redirect loop | Stop after 5 redirects | Prevent infinite loops |
| Soft 404 (200 with an error page) | Flag as soft 404 | Status code alone is insufficient |

Gotchas

  • URL normalization: http://Example.COM/path/ and http://example.com/path are the same URL. Normalize: lowercase host, remove default port, remove trailing slash, sort query params.
  • JavaScript-rendered content: A basic HTTP fetch misses JS-rendered content. Use headless browser (Playwright/Puppeteer) for SPAs.
  • Trap detection: Calendar pages, session IDs in URLs, and infinite pagination create crawler traps. Set max depth and URL pattern limits.
  • Rate limiting yourself: Parallel fetching without per-domain rate limiting will overwhelm small servers. Use per-domain semaphores.
  • Character encoding: Not all pages are UTF-8. Detect encoding from HTTP headers and meta tags; fall back to charset detection libraries.
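The normalization rules from the first gotcha can be sketched with `urllib.parse`. A hedged sketch only: real crawlers also canonicalize percent-encoding case and internationalized domain names, which are omitted here.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize(url: str) -> str:
    """Canonicalize a URL: lowercase scheme/host, drop default ports,
    strip the trailing slash, and sort query parameters."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it is not the scheme's default (80/443)
    if port and not ((scheme == "http" and port == 80) or
                     (scheme == "https" and port == 443)):
        host = f"{host}:{port}"
    path = parts.path.rstrip("/") or "/"
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunsplit((scheme, host, path, query, ""))
```

Running every discovered URL through `normalize` before the visited-set check prevents `http://Example.COM:80/path/` and `http://example.com/path` from being crawled twice.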

References

  • For URL normalization rules (RFC 3986), see references/url-normalization.md
  • For distributed crawling architecture, see references/distributed-crawl.md

Tags

seo, web-crawler, scraping, site-audit