Crawl4Claude

Crawl4Claude is a Python library designed for automatically collecting and analyzing information from websites. It offers an easy-to-use interface that efficiently handles data extraction and organization. Particularly useful for data science and machine learning projects, it helps users quickly gather the necessary data.

Documentation Scraper & MCP Server

A comprehensive, domain-agnostic documentation scraping and AI integration toolkit. Scrape any documentation website, create structured databases, and integrate with Claude Desktop via MCP (Model Context Protocol) for seamless AI-powered documentation assistance.

🚀 Features
Core Functionality
  • 🌐 Universal Documentation Scraper: Works with any documentation website
  • 📊 Structured Database: SQLite database with full-text search capabilities
  • 🤖 MCP Server Integration: Native Claude Desktop integration via Model Context Protocol
  • 📝 LLM-Optimized Output: Ready-to-use context files for AI applications
  • ⚙️ Configuration-Driven: Single config file controls all settings
Advanced Tools
  • 🔍 Query Interface: Command-line tool for searching and analyzing scraped content
  • 🛠️ Debug Suite: Comprehensive debugging tools for testing and validation
  • 📋 Auto-Configuration: Automatic MCP setup file generation
  • 📈 Progress Tracking: Detailed logging and error handling
  • 💾 Resumable Crawls: Smart caching for interrupted crawls
📋 Prerequisites
  • Python 3.8 or higher
  • Internet connection
  • ~500MB free disk space per documentation site
🛠️ Quick Start
1. Installation
# Clone the repository
git clone <repository-url>
cd documentation-scraper

# Install dependencies
pip install -r requirements.txt
2. Configure Your Target

Edit config.py to set your documentation site:

SCRAPER_CONFIG = {
    "base_url": "https://docs.example.com/",  # Your documentation site
    "output_dir": "docs_db",
    "max_pages": 200,
    # ... other settings
}
3. Run the Scraper
python docs_scraper.py
4. Query Your Documentation
# Search for content
python query_docs.py --search "tutorial"

# Browse by section
python query_docs.py --section "getting-started"

# Get statistics
python query_docs.py --stats
5. Set Up Claude Integration
# Generate MCP configuration files
python utils/gen_mcp.py

# Follow the instructions to add to Claude Desktop
๐Ÿ—๏ธ Project Structure
๐Ÿ“ documentation-scraper/
โ”œโ”€โ”€ ๐Ÿ“„ config.py                    # Central configuration file
โ”œโ”€โ”€ ๐Ÿ•ท๏ธ docs_scraper.py              # Main scraper script
โ”œโ”€โ”€ ๐Ÿ” query_docs.py                # Query and analysis tool
โ”œโ”€โ”€ ๐Ÿค– mcp_docs_server.py           # MCP server for Claude integration
โ”œโ”€โ”€ ๐Ÿ“‹ requirements.txt             # Python dependencies
โ”œโ”€โ”€ ๐Ÿ“ utils/                       # Debug and utility tools
โ”‚   โ”œโ”€โ”€ ๐Ÿ› ๏ธ gen_mcp.py               # Generate MCP config files
โ”‚   โ”œโ”€โ”€ ๐Ÿงช debug_scraper.py         # Test scraper functionality
โ”‚   โ”œโ”€โ”€ ๐Ÿ”ง debug_mcp_server.py      # Debug MCP server
โ”‚   โ”œโ”€โ”€ ๐ŸŽฏ debug_mcp_client.py      # Test MCP tools directly
โ”‚   โ”œโ”€โ”€ ๐Ÿ“ก debug_mcp_server_protocol.py # Test MCP via JSON-RPC
โ”‚   โ””โ”€โ”€ ๐ŸŒ debug_site_content.py    # Debug content extraction
โ”œโ”€โ”€ ๐Ÿ“ docs_db/                     # Generated documentation database
โ”‚   โ”œโ”€โ”€ ๐Ÿ“Š documentation.db         # SQLite database
โ”‚   โ”œโ”€โ”€ ๐Ÿ“„ documentation.json       # JSON export
โ”‚   โ”œโ”€โ”€ ๐Ÿ“‹ scrape_summary.json      # Statistics
โ”‚   โ””โ”€โ”€ ๐Ÿ“ llm_context/             # LLM-ready context files
โ””โ”€โ”€ ๐Ÿ“ mcp/                         # Generated MCP configuration
    โ”œโ”€โ”€ ๐Ÿ”ง run_mcp_server.bat       # Windows launcher script
    โ””โ”€โ”€ โš™๏ธ claude_mcp_config.json   # Claude Desktop config
⚙️ Configuration
Main Configuration (config.py)

The entire system is controlled by a single configuration file:

# Basic scraping settings
SCRAPER_CONFIG = {
    "base_url": "https://docs.example.com/",
    "output_dir": "docs_db",
    "max_depth": 3,
    "max_pages": 200,
    "delay_between_requests": 0.5,
}

# URL filtering rules
URL_FILTER_CONFIG = {
    "skip_patterns": [r'/api/', r'\.pdf$'],
    "allowed_domains": ["docs.example.com"],
}

# MCP server settings
MCP_CONFIG = {
    "server_name": "docs-server",
    "default_search_limit": 10,
    "max_search_limit": 50,
}
Environment Overrides

You can override any setting with environment variables:

export DOCS_DB_PATH="/custom/path/documentation.db"
export DOCS_BASE_URL="https://different-docs.com/"
python mcp_docs_server.py
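
Inside the scripts, such overrides are typically resolved with environment-first lookups. A minimal sketch, assuming the variable names above and the keys shown in config.py:

import os
from config import SCRAPER_CONFIG

# Environment variables take precedence over config.py defaults
db_path = os.environ.get(
    "DOCS_DB_PATH",
    os.path.join(SCRAPER_CONFIG["output_dir"], "documentation.db"),
)
base_url = os.environ.get("DOCS_BASE_URL", SCRAPER_CONFIG["base_url"])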
🤖 Claude Desktop Integration
Automatic Setup
  1. Generate configuration files:

    python utils/gen_mcp.py
    
  2. Copy the generated config to Claude Desktop:

    • Windows: %APPDATA%\Claude\claude_desktop_config.json
    • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  3. Restart Claude Desktop

Manual Setup

If you prefer manual setup, add this to your Claude Desktop config:

{
  "mcpServers": {
    "docs": {
      "command": "python",
      "args": ["path/to/mcp_docs_server.py"],
      "cwd": "path/to/project",
      "env": {
        "DOCS_DB_PATH": "path/to/docs_db/documentation.db"
      }
    }
  }
}
Available MCP Tools

Once connected, Claude can use these tools:

  • 🔍 search_documentation: Search for content across all documentation
  • 📚 get_documentation_sections: List all available sections
  • 📄 get_page_content: Get full content of specific pages
  • 🗂️ browse_section: Browse pages within a section
  • 📊 get_documentation_stats: Get database statistics
🔧 Command Line Tools
Documentation Scraper
# Basic scraping (all settings come from config.py)
python docs_scraper.py

# To change behavior, edit config.py or use the environment
# variable overrides described above
Query Tool
# Search for content
python query_docs.py --search "authentication guide"

# Browse specific sections  
python query_docs.py --section "api-reference"

# Get database statistics
python query_docs.py --stats

# List all sections
python query_docs.py --list-sections

# Export section to file
python query_docs.py --export-section "tutorials" --format markdown > tutorials.md

# Use custom database
python query_docs.py --db "custom/path/docs.db" --search "example"
Debug Tools
# Test scraper functionality
python utils/debug_scraper.py

# Test MCP server
python utils/debug_mcp_server.py

# Test MCP tools directly
python utils/debug_mcp_client.py

# Test MCP protocol
python utils/debug_mcp_server_protocol.py

# Debug content extraction
python utils/debug_site_content.py

# Generate MCP config files
python utils/gen_mcp.py
📊 Database Schema
Pages Table
CREATE TABLE pages (
    id INTEGER PRIMARY KEY,
    url TEXT UNIQUE NOT NULL,
    title TEXT,
    content TEXT,
    markdown TEXT,
    word_count INTEGER,
    section TEXT,
    subsection TEXT,
    scraped_at TIMESTAMP,
    metadata TEXT
);
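
The section and word_count columns make quick ad-hoc analysis possible without the query tool. For example, counting pages per section:

import sqlite3

# Count scraped pages per section, largest sections first
conn = sqlite3.connect('docs_db/documentation.db')
for section, count in conn.execute(
    "SELECT section, COUNT(*) FROM pages GROUP BY section ORDER BY COUNT(*) DESC"
):
    print(f"{section or '(none)'}: {count} pages")
conn.close()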
Full-Text Search
-- Search using FTS5
SELECT * FROM pages_fts WHERE pages_fts MATCH 'your search term';

-- Or use the query tool
python query_docs.py --search "your search term"
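
The same FTS5 index can also be queried from Python. A minimal sketch, assuming pages_fts shares rowids with the pages table (adjust the join to the actual schema):

import sqlite3

conn = sqlite3.connect('docs_db/documentation.db')
# bm25() ranks matches (lower is better); snippet() returns a short
# highlighted excerpt from the best-matching column (-1 = auto-pick)
rows = conn.execute("""
    SELECT p.title, p.url,
           snippet(pages_fts, -1, '>>', '<<', ' ... ', 12) AS excerpt
    FROM pages_fts
    JOIN pages AS p ON p.id = pages_fts.rowid
    WHERE pages_fts MATCH ?
    ORDER BY bm25(pages_fts)
    LIMIT 5
""", ("authentication",)).fetchall()
for title, url, excerpt in rows:
    print(f"{title}\n  {url}\n  {excerpt}\n")
conn.close()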
🎯 Example Use Cases
1. Documentation Analysis
# Get overview of documentation
python query_docs.py --stats

# Find all tutorial content
python query_docs.py --search "tutorial guide example"

# Export specific sections
python query_docs.py --export-section "getting-started" > onboarding.md
2. AI Integration with Claude
# Once MCP is set up, ask Claude:
# "Search the documentation for authentication examples"
# "What sections are available in the documentation?"
# "Show me the content for the API reference page"
3. Custom Applications
import sqlite3

# Connect to your scraped documentation
conn = sqlite3.connect('docs_db/documentation.db')

# Query for specific content
results = conn.execute("""
    SELECT title, url, markdown 
    FROM pages 
    WHERE section = 'tutorials' 
    AND word_count > 500
    ORDER BY word_count DESC
""").fetchall()

# Build your own tools on top of the structured data
🔍 Debugging and Testing
Test Scraper Before Full Run
python utils/debug_scraper.py
Validate Content Extraction
python utils/debug_site_content.py
Test MCP Integration
# Test server functionality
python utils/debug_mcp_server.py

# Test tools directly
python utils/debug_mcp_client.py

# Test JSON-RPC protocol
python utils/debug_mcp_server_protocol.py
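
Under the hood, the protocol-level test exercises the same JSON-RPC exchange Claude Desktop performs. A rough sketch of that handshake, assuming the standard MCP stdio transport (newline-delimited JSON); the tool arguments here are assumptions, so check mcp_docs_server.py for the real input schema:

import json
import subprocess

proc = subprocess.Popen(
    ["python", "mcp_docs_server.py"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)

def send(msg):
    proc.stdin.write(json.dumps(msg) + "\n")
    proc.stdin.flush()

# 1. The initialize handshake is required before any tool call
send({"jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {
    "protocolVersion": "2024-11-05",
    "capabilities": {},
    "clientInfo": {"name": "debug-client", "version": "0.1"},
}})
print(proc.stdout.readline())  # initialize response
send({"jsonrpc": "2.0", "method": "notifications/initialized"})

# 2. Call one tool and print its raw JSON-RPC result
send({"jsonrpc": "2.0", "id": 2, "method": "tools/call", "params": {
    "name": "search_documentation",
    "arguments": {"query": "authentication"},
}})
print(proc.stdout.readline())  # tool result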
📈 Performance and Optimization
Scraping Performance
  • Start small: Use max_pages=50 for testing
  • Adjust depth: max_depth=2 covers most content efficiently
  • Rate limiting: Increase delay_between_requests if getting blocked
  • Caching: Enabled by default for resumable crawls
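
Putting those tips together, a conservative first-run profile using only the settings shown earlier might look like:

# config.py -- cautious test profile
SCRAPER_CONFIG = {
    "base_url": "https://docs.example.com/",
    "output_dir": "docs_db",
    "max_pages": 50,                 # small run before a full crawl
    "max_depth": 2,                  # shallow but covers most content
    "delay_between_requests": 2.0,   # generous delay to avoid blocks
}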
Database Performance
  • Full-text search: Automatic FTS5 index for fast searching
  • Indexing: Optimized indexes on URL and section columns
  • Word counts: Pre-calculated for quick statistics
MCP Performance
  • Configurable limits: Set appropriate search and section limits
  • Snippet length: Adjust snippet size for optimal response times
  • Connection pooling: Efficient database connections
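
For example (the first three keys appear in the configuration above; snippet_length is a hypothetical name for the snippet-size knob, so check config.py for the real one):

MCP_CONFIG = {
    "server_name": "docs-server",
    "default_search_limit": 10,   # results returned when no limit is given
    "max_search_limit": 50,       # hard cap keeps responses fast
    "snippet_length": 300,        # hypothetical key: characters of context per hit
}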
🌐 Supported Documentation Sites

This scraper works with most documentation websites including:

  • Static sites: Hugo, Jekyll, MkDocs, Docusaurus
  • Documentation platforms: GitBook, Notion, Confluence
  • API docs: Swagger/OpenAPI documentation
  • Wiki-style: MediaWiki, TiddlyWiki
  • Custom sites: Any site with consistent HTML structure
Site-Specific Configuration

Customize URL filtering and content extraction for your target site:

URL_FILTER_CONFIG = {
    "skip_patterns": [
        r'/api/',           # Skip API endpoint docs
        r'/edit/',          # Skip edit pages  
        r'\.pdf$',          # Skip PDF files
    ],
    "allowed_domains": ["docs.yoursite.com"],
}

CONTENT_FILTER_CONFIG = {
    "remove_patterns": [
        r'Edit this page.*?\n',      # Remove edit links
        r'Was this helpful\?.*?\n',  # Remove feedback sections
    ],
}
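
To preview what a remove pattern will strip before running a full scrape, the same regexes can be tested standalone:

import re

# Simulate cleaning one scraped page: each pattern deletes a full
# boilerplate line (non-greedy match up to the newline)
markdown = (
    "Setup guide\n"
    "Edit this page on GitHub\n"
    "Was this helpful? Yes / No\n"
    "Real content starts here.\n"
)
for pattern in [r'Edit this page.*?\n', r'Was this helpful\?.*?\n']:
    markdown = re.sub(pattern, '', markdown)
print(markdown)  # -> "Setup guide" / "Real content starts here."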
🤝 Contributing

We welcome contributions! Here are some areas where you can help:

  • New export formats: PDF, EPUB, Word documents
  • Enhanced content filtering: Better noise removal
  • Additional debug tools: More comprehensive testing
  • Documentation: Improve guides and examples
  • Performance optimizations: Faster scraping and querying
⚠️ Responsible Usage
  • Respect robots.txt: Check the target site's robots.txt file
  • Rate limiting: Use appropriate delays between requests
  • Terms of service: Respect the documentation site's terms
  • Fair use: Use for educational, research, or personal purposes
  • Attribution: Credit the original documentation source
📄 License

This project is provided as-is for educational and research purposes. Please respect the terms of service and licensing of the documentation sites you scrape.


🎉 Getting Started Examples
Example 1: Scrape Python Documentation
# config.py
SCRAPER_CONFIG = {
    "base_url": "https://docs.python.org/3/",
    "max_pages": 500,
    "max_depth": 3,
}
Example 2: Scrape API Documentation
# config.py  
SCRAPER_CONFIG = {
    "base_url": "https://api-docs.example.com/",
    "max_pages": 200,
}

URL_FILTER_CONFIG = {
    "skip_patterns": [r'/changelog/', r'/releases/'],
}
Example 3: Corporate Documentation
# config.py
SCRAPER_CONFIG = {
    "base_url": "https://internal-docs.company.com/",
    "output_dir": "company_docs",
}

MCP_CONFIG = {
    "server_name": "company-docs-server",
    "docs_display_name": "Company Internal Docs",
}

Happy Documenting! 📚✨

For questions, issues, or feature requests, please check the debug logs first, then create an issue with relevant details.


🙏 Attribution

This project is powered by Crawl4AI, an amazing open-source, LLM-friendly web crawler and scraper.


Crawl4AI enables the intelligent web scraping capabilities that make this documentation toolkit possible. A huge thanks to @unclecode and the Crawl4AI community for building such an incredible tool! 🚀

Check out Crawl4AI: https://github.com/unclecode/crawl4ai
