🩺 Doctor


A tool for discovering, crawling, and indexing web sites, exposed as an MCP server so LLM agents can reason and generate code with better, more up-to-date information.


🔍 Overview

Doctor provides a complete stack for:

  • Crawling web pages using crawl4ai with hierarchy tracking
  • Chunking text with LangChain
  • Creating embeddings with OpenAI via litellm
  • Storing data in DuckDB with vector search support
  • Exposing search functionality via a FastAPI web service
  • Making these capabilities available to LLMs through an MCP server
  • Navigating crawled sites with hierarchical site maps
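
Taken together, the overview amounts to a chunk → embed → store → search pipeline. The sketch below illustrates that flow in a few lines; it is not Doctor's actual code, and the table schema, chunk sizes, and model name are assumptions for illustration only.

# Illustrative pipeline sketch: chunk with LangChain, embed via litellm,
# store and search in DuckDB. Requires OPENAI_API_KEY in the environment.
import duckdb
import litellm
from langchain_text_splitters import RecursiveCharacterTextSplitter

con = duckdb.connect("docs.db")
con.execute("CREATE TABLE IF NOT EXISTS chunks (text VARCHAR, embedding DOUBLE[])")

def index_page(page_text: str) -> None:
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.split_text(page_text)
    response = litellm.embedding(model="text-embedding-3-small", input=chunks)
    for chunk, item in zip(chunks, response.data):
        con.execute("INSERT INTO chunks VALUES (?, ?)", [chunk, item["embedding"]])

def search(query: str, k: int = 3):
    emb = litellm.embedding(model="text-embedding-3-small", input=[query]).data[0]["embedding"]
    return con.execute(
        "SELECT text, list_cosine_similarity(embedding, ?) AS score "
        "FROM chunks ORDER BY score DESC LIMIT ?",
        [emb, k],
    ).fetchall()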

🏗️ Core Infrastructure
🗄️ DuckDB
  • Database for storing document data and embeddings with vector search capabilities
  • Managed by a unified Database class
📨 Redis
  • Message broker for asynchronous task processing
🕸️ Crawl Worker
  • Processes crawl jobs
  • Chunks text
  • Creates embeddings
🌐 Web Server
  • FastAPI service exposing endpoints for fetching, searching, and viewing data
  • Exposes the MCP server
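
Conceptually, the crawl worker sits in a loop pulling jobs off Redis and pushing the results through the pipeline. A minimal sketch, assuming a hypothetical queue name and job format (the project's real queue wiring may differ):

# Hypothetical worker loop: the "crawl_jobs" queue name and JSON job shape
# are illustrative assumptions, not Doctor's actual implementation.
import json
import redis

r = redis.Redis(host="localhost", port=6379)
while True:
    _queue, raw = r.blpop("crawl_jobs")  # block until a job arrives
    job = json.loads(raw)
    # A real worker would crawl job["url"], chunk it, embed it, and store it.
    print("processing", job.get("url"))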

💻 Setup
⚙️ Prerequisites
  • Docker and Docker Compose
  • Python 3.10+
  • uv (Python package manager)
  • OpenAI API key
📦 Installation
  1. Clone this repository
  2. Set up environment variables:
    export OPENAI_API_KEY=your-openai-key
    
  3. Run the stack:
    docker compose up
    

👁 Usage
  1. Go to http://localhost:9111/docs to see the OpenAPI docs
  2. Look for the /fetch_url endpoint and start a crawl job by providing a URL
  3. Use /job_progress to see the current job status
  4. Configure your editor to use http://localhost:9111/mcp as an MCP server
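
The same flow can also be scripted against the API. A hypothetical example using requests; the parameter and response field names (url, job_id, completed) are assumptions, so check the OpenAPI docs at /docs for the real schema:

# Hypothetical client script; field names are assumptions, see /docs.
import time
import requests

BASE = "http://localhost:9111"

job = requests.post(f"{BASE}/fetch_url", params={"url": "https://docs.example.com"}).json()
while True:
    progress = requests.get(f"{BASE}/job_progress", params={"job_id": job.get("job_id")}).json()
    print(progress)
    if progress.get("completed"):
        break
    time.sleep(5)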

☁️ Web API
Core Endpoints
  • POST /fetch_url: Start crawling a URL
  • GET /search_docs: Search indexed documents
  • GET /job_progress: Check crawl job progress
  • GET /list_doc_pages: List indexed pages
  • GET /get_doc_page: Get full text of a page
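
As an example, a search against the indexed documents might look like the following; the query parameter name and response shape are assumptions, with the generated OpenAPI docs being authoritative:

# Hypothetical search call; parameter name and response shape are assumptions.
import requests

results = requests.get(
    "http://localhost:9111/search_docs",
    params={"query": "how do I configure the client?"},
).json()
for hit in results:
    print(hit)
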
Site Map Feature

The Maps feature provides a hierarchical view of crawled websites, making it easy to navigate and explore the structure of indexed sites.

Endpoints:

  • GET /map: View an index of all crawled sites
  • GET /map/site/{root_page_id}: View the hierarchical tree structure of a specific site
  • GET /map/page/{page_id}: View a specific page with navigation (parent, siblings, children)
  • GET /map/page/{page_id}/raw: Get the raw markdown content of a page
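
For scripted access, the raw markdown of a crawled page can be fetched directly; the page_id below is a placeholder that would come from browsing /map or /map/site/{root_page_id} first:

# page_id is a placeholder; real ids come from the /map index.
import requests

page_id = "<page-id>"
markdown = requests.get(f"http://localhost:9111/map/page/{page_id}/raw").text
print(markdown[:500])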

Features:

  • Hierarchical Navigation: Pages maintain parent-child relationships, allowing you to navigate through the site structure
  • Domain Grouping: Pages from the same domain crawled individually are automatically grouped together
  • Automatic Title Extraction: Page titles are extracted from HTML or markdown content
  • Breadcrumb Navigation: Easy navigation with breadcrumbs showing the path from root to current page
  • Sibling Navigation: Quick access to pages at the same level in the hierarchy
  • Legacy Page Support: Pages crawled before hierarchy tracking are grouped by domain for easy access
  • No JavaScript Required: All navigation works with pure HTML and CSS for maximum compatibility

Usage Example:

  1. Crawl a website using the /fetch_url endpoint
  2. Visit /map to see all crawled sites
  3. Click on a site to view its hierarchical structure
  4. Navigate through pages using the provided links

🔧 MCP Integration

Ensure that your Docker Compose stack is up, then add the following to your Cursor or VS Code MCP servers configuration:

"doctor": {
    "type": "sse",
    "url": "http://localhost:9111/mcp"
}

🧪 Testing
Running Tests

To run all tests:

# Run all tests with coverage report
pytest

To run specific test categories:

# Run only unit tests
pytest -m unit

# Run only async tests
pytest -m async_test

# Run tests for a specific component
pytest tests/lib/test_crawler.py
Test Coverage

The project is configured to generate coverage reports automatically:

# Run tests with detailed coverage report
pytest --cov=src --cov-report=term-missing
Test Structure
  • tests/conftest.py: Common fixtures for all tests
  • tests/lib/: Tests for library components
    • test_crawler.py: Tests for the crawler module
    • test_crawler_enhanced.py: Tests for enhanced crawler with hierarchy tracking
    • test_chunker.py: Tests for the chunker module
    • test_embedder.py: Tests for the embedder module
    • test_database.py: Tests for the unified Database class
    • test_database_hierarchy.py: Tests for database hierarchy operations
  • tests/common/: Tests for common modules
  • tests/services/: Tests for service layer
    • test_map_service.py: Tests for the map service
  • tests/api/: Tests for API endpoints
    • test_map_api.py: Tests for map API endpoints
  • tests/integration/: Integration tests
    • test_processor_enhanced.py: Tests for enhanced processor with hierarchy

🐞 Code Quality
Pre-commit Hooks

The project is configured with pre-commit hooks that run automatically before each commit:

  • ruff check --fix: Lints code and automatically fixes issues
  • ruff format: Formats code according to project style
  • Trailing whitespace removal
  • End-of-file fixing
  • YAML validation
  • Large file checks
Setup Pre-commit

To set up pre-commit hooks:

# Install pre-commit
uv pip install pre-commit

# Install the git hooks
pre-commit install
Running Pre-commit Manually

You can run the pre-commit hooks manually on all files:

# Run all pre-commit hooks
pre-commit run --all-files

Or on staged files only:

# Run on staged files
pre-commit run

⚖️ License

This project is licensed under the MIT License - see the LICENSE.md file for details.