
Web Crawler


Implement a web crawler pipeline covering URL discovery, fetching, parsing, and storage. Use this skill when the user needs to build a site crawler, audit website structure, or collect web data systematically — even if they say 'scrape a website', 'crawl all pages', or 'site audit spider'.

Algorithm skill: Web Crawler analysis and application.


Overview

A web crawler systematically traverses web pages by discovering URLs, fetching content, parsing HTML, and storing results. It manages its URL frontier with BFS or a priority queue. Performance is I/O-bound, typically limited by politeness constraints rather than compute.

When to Use

Trigger conditions:

  • Building a site audit tool to discover all pages and their link structure
  • Collecting structured data from websites at scale
  • Mapping site architecture for SEO analysis

When NOT to use:

  • When you need data from a single API endpoint (use HTTP client directly)
  • When a sitemap.xml provides all needed URLs (parse sitemap instead)

Algorithm

IRON LAW: Respect robots.txt and Rate Limits
A crawler MUST:
1. Parse and obey robots.txt before crawling any path
2. Enforce crawl-delay (default 1s if unspecified)
3. Identify itself with a descriptive User-Agent
Ignoring these is unethical and will get your IP blocked.

Phase 1: Input Validation

Parse seed URLs, fetch and parse robots.txt for each domain, set crawl scope (same-domain, subdomain, or cross-domain). Gate: Valid seed URLs, robots.txt rules loaded, scope defined.
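The robots.txt gate can be sketched with Python's standard `urllib.robotparser`. This is a minimal sketch: the `MyCrawler/1.0` user-agent string and the sample rules are placeholders, and a real crawler would fetch `/robots.txt` over HTTP per domain rather than receive it as a string.

```python
from urllib.robotparser import RobotFileParser

def load_robots_rules(robots_txt: str, user_agent: str = "MyCrawler/1.0"):
    """Parse a robots.txt body; return the parser and the crawl delay.

    Defaults the delay to 1 second when unspecified, per the iron law.
    """
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    delay = rp.crawl_delay(user_agent) or 1.0
    return rp, delay

# Sample rules (placeholder content for illustration)
robots = """User-agent: *
Disallow: /private/
Crawl-delay: 2
"""
rp, delay = load_robots_rules(robots)
# rp.can_fetch("MyCrawler/1.0", url) now gates every dequeue in Phase 2
```

Calling `rp.can_fetch(user_agent, url)` before every fetch satisfies rule 1 of the iron law; sleeping `delay` seconds between requests to the same domain satisfies rule 2.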

Phase 2: Core Algorithm

  1. Initialize URL frontier with seed URLs (priority queue or FIFO)
  2. Dequeue URL, check: not visited, allowed by robots.txt, within scope
  3. Fetch page with timeout and retry logic, respect crawl-delay
  4. Parse HTML: extract links (normalize, deduplicate), extract content/metadata
  5. Enqueue discovered URLs, store parsed data
  6. Repeat until frontier empty or limit reached
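The six steps above can be sketched as a BFS loop. This is a sketch under stated assumptions: `fetch(url)` is an injected callable standing in for real HTTP fetching plus HTML link extraction (timeouts, retries, and the robots.txt/crawl-delay checks from Phase 1 are elided to keep the frontier logic visible).

```python
from collections import deque
from urllib.parse import urljoin, urldefrag

def crawl(seed, fetch, max_pages=100, max_depth=2):
    """BFS crawl sketch: `fetch(url)` returns the list of hrefs on a page."""
    frontier = deque([(seed, 0)])      # step 1: FIFO frontier -> breadth-first order
    visited, pages = set(), []
    while frontier and len(pages) < max_pages:
        url, depth = frontier.popleft()
        if url in visited or depth > max_depth:   # step 2: scope/visited checks
            continue
        visited.add(url)
        links = fetch(url)                        # step 3: fetch (politeness elided)
        pages.append({"url": url, "links_out": len(links), "depth": depth})
        for href in links:                        # steps 4-5: normalize, dedupe, enqueue
            absolute, _ = urldefrag(urljoin(url, href))
            if absolute not in visited:
                frontier.append((absolute, depth + 1))
    return pages                                  # step 6: loop exits on empty/limit

# Usage against a tiny in-memory "site" (hypothetical URLs)
site = {
    "http://example.com/":  ["/a", "/b"],
    "http://example.com/a": ["/b"],
    "http://example.com/b": [],
}
pages = crawl("http://example.com/", lambda u: site.get(u, []))
```

Swapping the `deque` for a `heapq`-based priority queue keyed on page importance turns the same loop into a priority-based frontier.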

Phase 3: Verification

Check: no robots.txt violations in crawl log, no duplicate pages stored, all discovered URLs accounted for. Gate: Crawl completed within scope, politeness maintained.

Phase 4: Output

Return site map with pages, link graph, and extracted metadata.

Output Format

{
  "pages": [{"url": "...", "status": 200, "title": "...", "links_out": 15, "depth": 2}],
  "metadata": {"pages_crawled": 500, "errors": 12, "duration_seconds": 300, "domain": "example.com"}
}

Examples

Sample I/O

Input: Seed: "https://example.com", max_depth: 2, max_pages: 100
Expected: Crawl tree with the homepage at depth 0 and linked pages at depths 1-2, respecting robots.txt

Edge Cases

| Input | Expected | Why |
| --- | --- | --- |
| robots.txt disallows `/` | Zero pages crawled | Must respect a full disallow |
| Redirect loop | Stop after 5 redirects | Prevent infinite loops |
| Soft 404 (200 with an error page) | Flag as soft 404 | Status code alone is insufficient |

Gotchas

  • URL normalization: http://Example.COM/path/ and http://example.com/path are the same URL. Normalize: lowercase host, remove default port, remove trailing slash, sort query params.
  • JavaScript-rendered content: A basic HTTP fetch misses JS-rendered content. Use headless browser (Playwright/Puppeteer) for SPAs.
  • Trap detection: Calendar pages, session IDs in URLs, and infinite pagination create crawler traps. Set max depth and URL pattern limits.
  • Rate limiting yourself: Parallel fetching without per-domain rate limiting will overwhelm small servers. Use per-domain semaphores.
  • Character encoding: Not all pages are UTF-8. Detect encoding from HTTP headers and meta tags; fall back to charset detection libraries.
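The normalization rules from the first gotcha can be sketched with `urllib.parse`. A hedged sketch only: real crawlers also canonicalize percent-encoding case and internationalized domain names, which are omitted here.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize(url: str) -> str:
    """Canonicalize a URL: lowercase scheme/host, drop default ports,
    strip the trailing slash, and sort query parameters."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it is not the scheme's default (80/443)
    if port and not ((scheme == "http" and port == 80) or
                     (scheme == "https" and port == 443)):
        host = f"{host}:{port}"
    path = parts.path.rstrip("/") or "/"
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunsplit((scheme, host, path, query, ""))
```

Running every discovered URL through `normalize` before the visited-set check prevents `http://Example.COM:80/path/` and `http://example.com/path` from being crawled twice.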

References

  • For URL normalization rules (RFC 3986), see references/url-normalization.md
  • For distributed crawling architecture, see references/distributed-crawl.md

Tags

seo, web-crawler, scraping, site-audit