🩺 Doctor
A tool for discovering, crawling, and indexing web sites, exposed as an MCP server so LLM agents get better and more up-to-date reasoning and code generation.
🔍 Overview
Doctor provides a complete stack for:
- Crawling web pages using crawl4ai with hierarchy tracking
- Chunking text with LangChain
- Creating embeddings with OpenAI via litellm
- Storing data in DuckDB with vector search support
- Exposing search functionality via a FastAPI web service
- Making these capabilities available to LLMs through an MCP server
- Navigating crawled sites with hierarchical site maps
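As a rough sketch of the chunk-and-embed stage of this pipeline, the snippet below uses LangChain's RecursiveCharacterTextSplitter and litellm's embedding API. The helper name, chunk sizes, and model choice are illustrative assumptions, not Doctor's actual code.

```python
# Illustrative sketch only: chunk a crawled page and embed the chunks.
# Helper name, chunk sizes, and model are assumptions, not Doctor's code.
from langchain_text_splitters import RecursiveCharacterTextSplitter
import litellm

def chunk_and_embed(page_text: str) -> list[tuple[str, list[float]]]:
    # Split the page text into overlapping chunks suited to retrieval.
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.split_text(page_text)
    # Create one embedding per chunk via litellm's OpenAI-compatible API.
    response = litellm.embedding(model="text-embedding-3-small", input=chunks)
    return [(chunk, item["embedding"]) for chunk, item in zip(chunks, response.data)]
```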
🏗️ Core Infrastructure
🗄️ DuckDB
- Database for storing document data and embeddings with vector search capabilities
- Managed by a unified Database class
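For a sense of what a similarity query over DuckDB looks like, here is a hypothetical example; the table and column names and the 1536-dimension FLOAT array are assumptions, not Doctor's actual schema.

```python
# Hypothetical similarity query; schema names and dimensions are assumed.
import duckdb

con = duckdb.connect("doctor.duckdb")
query_embedding = [0.1] * 1536  # stand-in for a real query embedding

rows = con.execute(
    """
    SELECT chunk_text,
           array_cosine_similarity(embedding, ?::FLOAT[1536]) AS score
    FROM document_chunks
    ORDER BY score DESC
    LIMIT 5
    """,
    [query_embedding],
).fetchall()
```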
📨 Redis
- Message broker for asynchronous task processing
🕸️ Crawl Worker
- Processes crawl jobs
- Chunks text
- Creates embeddings
🌐 Web Server
- FastAPI service
- Exposes endpoints for fetching, searching, and viewing data
- Exposes the MCP server
💻 Setup
⚙️ Prerequisites
- Docker and Docker Compose
- Python 3.10+
- uv (Python package manager)
- OpenAI API key
📦 Installation
- Clone this repository
- Set up environment variables:
export OPENAI_API_KEY=your-openai-key
- Run the stack:
docker compose up
👁 Usage
- Go to http://localhost:9111/docs to see the OpenAPI docs
- Look for the /fetch_url endpoint and start a crawl job by providing a URL
- Use /job_progress to see the current job status
- Configure your editor to use http://localhost:9111/mcp as an MCP server
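The same flow can be scripted with Python's requests; the request body and response fields (`url`, `job_id`) are assumptions inferred from the endpoint names, so check the OpenAPI docs for the actual schema.

```python
# Hypothetical client walkthrough of the steps above; field names are
# assumptions -- consult http://localhost:9111/docs for the real schema.
import requests

BASE = "http://localhost:9111"

# Start a crawl job.
job = requests.post(f"{BASE}/fetch_url", json={"url": "https://example.com/docs"}).json()

# Check on its progress.
progress = requests.get(f"{BASE}/job_progress", params={"job_id": job["job_id"]}).json()
print(progress)
```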
☁️ Web API
Core Endpoints
- POST /fetch_url: Start crawling a URL
- GET /search_docs: Search indexed documents
- GET /job_progress: Check crawl job progress
- GET /list_doc_pages: List indexed pages
- GET /get_doc_page: Get full text of a page
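For example, a search against the index might look like the following; the `query` parameter name and the response shape are assumptions, so verify them against the OpenAPI docs.

```python
# Hypothetical search call; parameter name and response shape assumed.
import requests

results = requests.get(
    "http://localhost:9111/search_docs",
    params={"query": "how does hierarchy tracking work?"},
).json()
print(results)
```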
Site Map Feature
The Maps feature provides a hierarchical view of crawled websites, making it easy to navigate and explore the structure of indexed sites.
Endpoints:
- GET /map: View an index of all crawled sites
- GET /map/site/{root_page_id}: View the hierarchical tree structure of a specific site
- GET /map/page/{page_id}: View a specific page with navigation (parent, siblings, children)
- GET /map/page/{page_id}/raw: Get the raw markdown content of a page
Features:
- Hierarchical Navigation: Pages maintain parent-child relationships, allowing you to navigate through the site structure
- Domain Grouping: Pages crawled individually from the same domain are automatically grouped together
- Automatic Title Extraction: Page titles are extracted from HTML or markdown content
- Breadcrumb Navigation: Easy navigation with breadcrumbs showing the path from root to current page
- Sibling Navigation: Quick access to pages at the same level in the hierarchy
- Legacy Page Support: Pages crawled before hierarchy tracking are grouped by domain for easy access
- No JavaScript Required: All navigation works with pure HTML and CSS for maximum compatibility
Usage Example:
- Crawl a website using the /fetch_url endpoint
- Visit /map to see all crawled sites
- Click on a site to view its hierarchical structure
- Navigate through pages using the provided links
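Pages can also be retrieved programmatically, for example to feed a page's markdown to an LLM; the page ID comes from /map or /list_doc_pages.

```python
# Fetch the raw markdown of a crawled page.
import requests

page_id = "<page-id>"  # obtain from /map or /list_doc_pages
markdown = requests.get(f"http://localhost:9111/map/page/{page_id}/raw").text
print(markdown[:500])
```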
🔧 MCP Integration
Ensure that your Docker Compose stack is up, and then add to your Cursor or VSCode MCP Servers configuration:
"doctor": {
"type": "sse",
"url": "http://localhost:9111/mcp"
}
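In Cursor, for example, this entry typically nests under an mcpServers key (e.g. in .cursor/mcp.json); the exact file and wrapper key vary by editor, so treat this as a sketch:

```json
{
  "mcpServers": {
    "doctor": {
      "type": "sse",
      "url": "http://localhost:9111/mcp"
    }
  }
}
```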
🧪 Testing
Running Tests
To run all tests:
# Run all tests with coverage report
pytest
To run specific test categories:
# Run only unit tests
pytest -m unit
# Run only async tests
pytest -m async_test
# Run tests for a specific component
pytest tests/lib/test_crawler.py
Test Coverage
The project is configured to generate coverage reports automatically:
# Run tests with detailed coverage report
pytest --cov=src --cov-report=term-missing
Test Structure
- tests/conftest.py: Common fixtures for all tests
- tests/lib/: Tests for library components
  - test_crawler.py: Tests for the crawler module
  - test_crawler_enhanced.py: Tests for the enhanced crawler with hierarchy tracking
  - test_chunker.py: Tests for the chunker module
  - test_embedder.py: Tests for the embedder module
  - test_database.py: Tests for the unified Database class
  - test_database_hierarchy.py: Tests for database hierarchy operations
- tests/common/: Tests for common modules
- tests/services/: Tests for the service layer
  - test_map_service.py: Tests for the map service
- tests/api/: Tests for API endpoints
  - test_map_api.py: Tests for map API endpoints
- tests/integration/: Integration tests
  - test_processor_enhanced.py: Tests for the enhanced processor with hierarchy
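A hypothetical unit test in this layout might look like the following; it exercises the LangChain splitter directly rather than Doctor's own chunker module, whose API is not documented here.

```python
# Hypothetical tests/lib/test_chunker.py-style test; it targets the
# LangChain splitter, not Doctor's actual chunker API.
import pytest
from langchain_text_splitters import RecursiveCharacterTextSplitter

@pytest.mark.unit
def test_split_respects_chunk_size():
    splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=10)
    chunks = splitter.split_text("lorem ipsum dolor sit amet " * 100)
    assert chunks, "expected at least one chunk"
    assert all(len(chunk) <= 100 for chunk in chunks)
```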
🐞 Code Quality
Pre-commit Hooks
The project is configured with pre-commit hooks that run automatically before each commit:
- ruff check --fix: Lints code and automatically fixes issues
- ruff format: Formats code according to project style
- Trailing whitespace removal
- End-of-file fixing
- YAML validation
- Large file checks
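A typical .pre-commit-config.yaml wiring up the ruff hooks looks like this; the rev shown is a placeholder, so pin whatever version the project actually uses:

```yaml
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.4.4  # placeholder; pin the project's actual version
    hooks:
      - id: ruff
        args: [--fix]
      - id: ruff-format
```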
Setup Pre-commit
To set up pre-commit hooks:
# Install pre-commit
uv pip install pre-commit
# Install the git hooks
pre-commit install
Running Pre-commit Manually
You can run the pre-commit hooks manually on all files:
# Run all pre-commit hooks
pre-commit run --all-files
Or on staged files only:
# Run on staged files
pre-commit run
⚖️ License
This project is licensed under the MIT License - see the LICENSE.md file for details.