Document Extractor MCP Server
A Model Context Protocol (MCP) server that extracts document content from Microsoft Learn and GitHub URLs, storing them in PocketBase for easy retrieval and search.
Features
✅ Latest MCP SDK Features (v1.12.0+)
- Modern McpServer architecture with enhanced capabilities
- Multiple transport protocols: STDIO, Streamable HTTP, SSE
- Dynamic tool management with lazy loading
- Session management for stateful connections
- Server-Sent Events support with backwards compatibility
- Real-time server statistics and metrics
✅ Content Extraction
- Microsoft Learn articles with rich metadata
- GitHub files (README, documentation, code files)
- Intelligent content parsing and cleaning
- Duplicate detection and updates
✅ PocketBase Integration
- Persistent document storage
- Full-text search capabilities
- Metadata preservation
- CRUD operations
✅ Advanced Server Features
- Multiple transport modes (STDIO/HTTP)
- Health check and info endpoints
- Read-only mode support
- Enhanced error handling and debugging
- Resource endpoints for server metrics
✅ Rich Metadata
- Word counts and content statistics
- Source attribution and URLs
- Extraction timestamps
- Content headers and descriptions
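As a rough illustration of the metadata listed above, the sketch below derives word counts, source attribution, headers, and a timestamp from extracted text. The function name `buildMetadata` and every field name are hypothetical; the server's actual metadata layout is not documented here.

```javascript
// Hypothetical helper: field names are illustrative, not the server's real schema.
function buildMetadata(content, sourceUrl, now = new Date()) {
  // Split on any whitespace run; filter(Boolean) drops the empty string
  // produced when the content is empty.
  const words = content.trim().split(/\s+/).filter(Boolean);

  // Treat lines that start with 1-6 '#' characters as Markdown-style headers.
  const headers = content
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => /^#{1,6}\s+\S/.test(line))
    .map((line) => line.replace(/^#{1,6}\s+/, ''));

  return {
    sourceUrl,
    domain: new URL(sourceUrl).hostname,
    wordCount: words.length,
    characterCount: content.length,
    headers,
    extractedAt: now.toISOString(),
  };
}
```

Passing the clock in as a parameter keeps the helper deterministic, which makes it easier to test.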
Requirements
- Node.js 18+ with ES modules support
- PocketBase server running
- Network access for content extraction
Installation
1. Install Dependencies
# Navigate to the project directory
cd c:\powershell_scripts\pocketbase_document_mcp\document-extractor-mcp
# Install dependencies
npm install
2. PocketBase Setup
The MCP server supports both local and remote PocketBase instances. Choose the setup that best fits your needs:
Option A: Local PocketBase Instance
Download and install PocketBase:
# Download from https://pocketbase.io/docs/
# Extract the executable to your preferred directory
Start local PocketBase server:
# Run from the directory containing pocketbase.exe
.\pocketbase.exe serve
# Or specify custom port and data directory
.\pocketbase.exe serve --http="127.0.0.1:8090" --dir="./pb_data"
Set up admin account:
- Access PocketBase Admin UI at http://127.0.0.1:8090/_/
- Create your admin account
- Note the email/password for configuration
Option B: Remote PocketBase Instance
Deploy PocketBase to your preferred hosting:
- Railway, Fly.io, DigitalOcean, AWS, etc.
- Follow your hosting provider's deployment guide
- Ensure HTTPS is enabled for production
Configure your remote instance:
- Set up admin account through the web interface
- Configure CORS settings if needed
- Note the full URL (e.g., https://your-pb-instance.com)
Option C: Docker PocketBase
Using Docker Compose:
version: '3.8'
services:
  pocketbase:
    image: ghcr.io/muchobien/pocketbase:latest
    ports:
      - "8090:8090"
    volumes:
      - ./pb_data:/pb/pb_data
Collection Management (Automatic for all setups):
- The server will automatically create the required documents collection on startup
- If AUTO_CREATE_COLLECTION=true (default), no manual setup is needed
- Use the ensure_collection tool to manually verify/create collections
- Use the collection_info tool to check collection status
Manual Collection Setup (if needed):
- Access PocketBase Admin UI
- Create a new collection named documents
- Add these fields:
title (Text, required)
content (Text, required)
metadata (JSON, required)
created (Date, auto-generated)
updated (Date, optional)
3. Environment Configuration
Create a .env file in the project root. The server supports both local and remote PocketBase instances:
For Local PocketBase Instance:
# PocketBase Configuration - Local
POCKETBASE_URL=http://127.0.0.1:8090
POCKETBASE_ADMIN_EMAIL=admin@example.com
POCKETBASE_ADMIN_PASSWORD=your-secure-password
# Collection Settings
DOCUMENTS_COLLECTION=documents
# Transport Configuration
TRANSPORT_MODE=stdio
HTTP_PORT=3000
# Development Settings
DEBUG=true
NODE_ENV=development
READ_ONLY_MODE=false
# Collection Management ✨ New!
AUTO_CREATE_COLLECTION=true
For Remote PocketBase Instance:
# PocketBase Configuration - Remote
POCKETBASE_URL=https://your-pocketbase-instance.com
POCKETBASE_ADMIN_EMAIL=admin@yourdomain.com
POCKETBASE_ADMIN_PASSWORD=your-secure-password
# Collection Settings
DOCUMENTS_COLLECTION=documents
# Transport Configuration
TRANSPORT_MODE=stdio
HTTP_PORT=3000
# Production Settings
DEBUG=false
NODE_ENV=production
READ_ONLY_MODE=false
# Collection Management
AUTO_CREATE_COLLECTION=true
For Dockerized PocketBase:
# PocketBase Configuration - Docker
POCKETBASE_URL=http://pocketbase:8090
POCKETBASE_ADMIN_EMAIL=admin@localhost
POCKETBASE_ADMIN_PASSWORD=admin123
# Collection Settings
DOCUMENTS_COLLECTION=documents
# Transport Configuration
TRANSPORT_MODE=stdio
HTTP_PORT=3000
# Container Settings
DEBUG=false
NODE_ENV=production
READ_ONLY_MODE=false
# Collection Management
AUTO_CREATE_COLLECTION=true
Usage
Starting the Server
The server supports multiple transport modes:
# STDIO mode (default) - for Claude Desktop and CLI clients
npm start
# or explicitly
npm run start:stdio
# HTTP mode - for web clients and testing
npm run start:http
# Development modes with debug logging
npm run dev # STDIO mode with debugging
npm run dev:http # HTTP mode with debugging
npm run dev:stdio # STDIO mode with debugging
# Test the setup
npm run test
Transport Modes
STDIO Mode (Default)
Perfect for Claude Desktop and command-line MCP clients:
npm start
HTTP Mode
Enables web-based clients and testing with multiple protocols:
npm run start:http
Available endpoints in HTTP mode:
- POST /mcp: Streamable HTTP transport (modern protocol 2025-03-26)
- GET /sse: Server-Sent Events transport (legacy protocol 2024-11-05)
- POST /messages: SSE message endpoint
- GET /health: Health check endpoint
- GET /info: Server information endpoint
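To make the Streamable HTTP endpoint more concrete, here is a minimal sketch of the first request an MCP client would send to POST /mcp. The helper name `buildInitializeRequest`, the client name and version, and the base URL are assumptions for illustration. The JSON-RPC shape and the Accept header follow the MCP 2025-03-26 transport specification.

```javascript
// Hypothetical helper: builds (but does not send) an MCP "initialize" request.
function buildInitializeRequest(baseUrl, id = 1) {
  return {
    url: new URL('/mcp', baseUrl).toString(),
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        // Streamable HTTP clients must accept both plain JSON and an SSE stream.
        Accept: 'application/json, text/event-stream',
      },
      body: JSON.stringify({
        jsonrpc: '2.0',
        id,
        method: 'initialize',
        params: {
          protocolVersion: '2025-03-26',
          capabilities: {},
          // Placeholder client identity, not a real client.
          clientInfo: { name: 'example-client', version: '0.0.1' },
        },
      }),
    },
  };
}
```

Assuming the server is running in HTTP mode on port 3000 (the HTTP_PORT value from the example configuration), the result can be passed straight to Node's built-in `fetch(url, options)`.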
Available Tools
1. extract_document
Extract and store content from URLs.
Parameters:
- url (string, required): Microsoft Learn or GitHub URL
Example:
{
  "url": "https://learn.microsoft.com/en-us/azure/cognitive-services/openai/"
}
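On the wire, an MCP client wraps these arguments in a JSON-RPC `tools/call` request. The sketch below shows that envelope; the helper name `buildToolCall` is hypothetical, and the `tools/call` method and the `name`/`arguments` parameters come from the MCP specification.

```javascript
// Hypothetical helper: wraps tool arguments in an MCP JSON-RPC envelope.
function buildToolCall(name, args = {}, id = 1) {
  return {
    jsonrpc: '2.0',
    id,
    method: 'tools/call',
    params: { name, arguments: args },
  };
}

// Example: the extract_document call shown above.
const extractCall = buildToolCall('extract_document', {
  url: 'https://learn.microsoft.com/en-us/azure/cognitive-services/openai/',
});
```

Most users never build this by hand, since MCP clients such as Claude Desktop generate it, but it is useful when testing HTTP mode with a plain HTTP client.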
2. list_documents
List stored documents with pagination.
Parameters:
- limit (number, optional): Max results per page (1-100, default: 20)
- page (number, optional): Page number (default: 1)
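The limits and defaults above can be sketched as a small validation step. This is an illustration in plain JavaScript only: the changelog says the server validates input with Zod, and the function name `parseListParams` and the error messages are assumptions.

```javascript
// Hypothetical validator mirroring the documented ranges and defaults.
function parseListParams({ limit = 20, page = 1 } = {}) {
  if (!Number.isInteger(limit) || limit < 1 || limit > 100) {
    throw new RangeError('limit must be an integer between 1 and 100');
  }
  if (!Number.isInteger(page) || page < 1) {
    throw new RangeError('page must be an integer greater than or equal to 1');
  }
  return { limit, page };
}
```

Rejecting out-of-range values, rather than silently clamping them, surfaces client mistakes early. Whether the real server rejects or clamps is not stated in this document.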
3. search_documents
Search documents by title or content.
Parameters:
- query (string, required): Search query
- limit (number, optional): Max results (1-100, default: 50)
4. get_document
Retrieve a specific document by ID.
Parameters:
- id (string, required): Document ID
5. delete_document
Delete a document by ID.
Parameters:
- id (string, required): Document ID to delete
6. ensure_collection
✨ New!
Check if the documents collection exists and create it if needed.
Parameters: None
Description: Automatically verifies the documents collection exists in PocketBase. If not found, creates the collection with the proper schema including all required fields and indexes.
7. collection_info
✨ New!
Get detailed information about the documents collection including statistics.
Parameters: None
Description: Returns comprehensive collection information including schema details, record counts, indexes, and timestamps.
Available Resources
1. stats://server
Real-time server statistics and metrics.
Content:
- Total document count
- Server information (name, version, uptime)
- Memory usage statistics
- Environment information
- Read-only mode status
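A payload covering the items above could be assembled roughly as follows. `process.uptime()`, `process.memoryUsage()`, and `process.version` are standard Node.js APIs. The function name `buildServerStats` and the field layout are hypothetical, and the document count is passed in instead of being queried from PocketBase.

```javascript
// Hypothetical sketch of a stats://server payload; field names are illustrative.
function buildServerStats({ name, version, documentCount, env = process.env }) {
  const memory = process.memoryUsage();
  return {
    server: { name, version, uptimeSeconds: Math.round(process.uptime()) },
    documents: { total: documentCount },
    memory: { rssBytes: memory.rss, heapUsedBytes: memory.heapUsed },
    environment: {
      nodeVersion: process.version,
      nodeEnv: env.NODE_ENV || 'development',
    },
    readOnlyMode: env.READ_ONLY_MODE === 'true',
  };
}
```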
Dynamic Tool Management
The server supports dynamic tool management with lazy loading:
// Tools can be dynamically enabled/disabled
if (process.env.READ_ONLY_MODE === 'true') {
  // Write operations are disabled in read-only mode
  deleteDocumentTool.disable();
  extractDocumentTool.disable();
}
// Tools can be re-enabled at runtime
tool.enable();
Session Management
In HTTP mode, the server supports session management:
- Streamable HTTP: Modern session management with automatic session ID generation
- SSE (Legacy): Backwards compatible session handling
- Session persistence: Sessions are maintained across requests
- Automatic cleanup: Sessions are cleaned up when connections close
Supported Sources
Microsoft Learn
- Full article extraction
- Metadata preservation (description, keywords, author)
- Section headers extraction
- Content cleaning and formatting
Example URLs:
https://learn.microsoft.com/en-us/azure/cognitive-services/openai/
https://learn.microsoft.com/en-us/dotnet/core/introduction
GitHub
- File content extraction (README, docs, code)
- Repository metadata
- Branch handling (main/master fallback)
- File type detection
Supported URL formats:
- https://github.com/owner/repo (assumes README.md)
- https://github.com/owner/repo/blob/main/file.md
- https://raw.githubusercontent.com/owner/repo/main/file.md
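One way to handle the three URL formats above is to normalize each into a list of raw-content URLs to try in order, which also covers the main/master fallback. This is a sketch under assumptions, not the server's actual implementation: the function name `toRawGitHubCandidates` is hypothetical, and branch names containing slashes are not handled.

```javascript
// Hypothetical helper: maps a supported GitHub URL to raw URLs to try in order.
function toRawGitHubCandidates(input) {
  const url = new URL(input);
  const parts = url.pathname.split('/').filter(Boolean);

  // Already a raw URL: use it as-is.
  if (url.hostname === 'raw.githubusercontent.com') {
    return [url.toString()];
  }
  if (url.hostname !== 'github.com' || parts.length < 2) {
    throw new Error(`Unsupported GitHub URL: ${input}`);
  }

  const [owner, repo, kind, branch, ...filePath] = parts;
  const raw = (b, p) => `https://raw.githubusercontent.com/${owner}/${repo}/${b}/${p}`;

  // https://github.com/owner/repo/blob/<branch>/<path>
  if (kind === 'blob' && branch && filePath.length > 0) {
    return [raw(branch, filePath.join('/'))];
  }
  // https://github.com/owner/repo -> README.md on main, then master.
  if (parts.length === 2) {
    return [raw('main', 'README.md'), raw('master', 'README.md')];
  }
  throw new Error(`Unsupported GitHub URL: ${input}`);
}
```

A caller would fetch each candidate in turn and stop at the first successful response.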
Configuration Options
Environment Variables
Variable | Description | Default |
---|---|---|
POCKETBASE_URL | PocketBase server URL | http://127.0.0.1:8090 |
POCKETBASE_ADMIN_EMAIL | Admin email for authentication | Required |
POCKETBASE_ADMIN_PASSWORD | Admin password | Required |
DOCUMENTS_COLLECTION | Collection name for documents | documents |
DEBUG | Enable debug logging | false |
NODE_ENV | Environment mode | development |
READ_ONLY_MODE | Disable write operations | false |
AUTO_CREATE_COLLECTION | Auto-create collections on startup | true |
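The defaults in the table can be applied with a small loader like the one below. This is a minimal sketch: the function name `loadConfig` and the returned property names are assumptions, and the server's own configuration code may differ.

```javascript
// Hypothetical loader applying the documented defaults to an env-like object.
function loadConfig(env = process.env) {
  // Environment variables are strings, so compare against 'true' explicitly.
  const flag = (value, fallback) =>
    value === undefined ? fallback : String(value).toLowerCase() === 'true';

  const required = ['POCKETBASE_ADMIN_EMAIL', 'POCKETBASE_ADMIN_PASSWORD'];
  const missing = required.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(', ')}`);
  }

  return {
    pocketbaseUrl: env.POCKETBASE_URL || 'http://127.0.0.1:8090',
    adminEmail: env.POCKETBASE_ADMIN_EMAIL,
    adminPassword: env.POCKETBASE_ADMIN_PASSWORD,
    collection: env.DOCUMENTS_COLLECTION || 'documents',
    debug: flag(env.DEBUG, false),
    nodeEnv: env.NODE_ENV || 'development',
    readOnly: flag(env.READ_ONLY_MODE, false),
    autoCreateCollection: flag(env.AUTO_CREATE_COLLECTION, true),
  };
}
```

Failing fast on missing credentials gives a clearer startup error than a later authentication failure against PocketBase.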
Debug Mode
Enable detailed logging:
$env:DEBUG="true"; node server.js
Debug logs include:
- Authentication status
- Content extraction details
- Database operations
- Error context
Error Handling
The server implements comprehensive error handling:
- Network errors: Timeout and connection issues
- Authentication errors: PocketBase connection problems
- Validation errors: Invalid input parameters
- Content errors: Extraction failures
- Database errors: Storage and retrieval issues
All errors are returned as structured MCP responses with appropriate error codes.
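In MCP, a failed tool call is conventionally reported as a normal tool result with `isError: true` and a text content item. The sketch below shows that shape. The helper name `toToolError` and the category labels are hypothetical, and the server's real error codes are not documented here.

```javascript
// Hypothetical helper: converts a thrown value into an MCP tool error result.
function toToolError(error, category = 'unknown') {
  const message = error instanceof Error ? error.message : String(error);
  return {
    isError: true,
    content: [{ type: 'text', text: `[${category}] ${message}` }],
  };
}

// Example: reporting a validation failure back to the client.
const validationFailure = toToolError(new Error('url is required'), 'validation');
```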
Development
Scripts
# Start in development mode
npm run dev
# Start in production mode
npm start
# Install dependencies
npm run install-deps
Testing the Server
# Test basic functionality
$env:DEBUG="true"; node server.js
# In another terminal, you can test with MCP tools or:
# Use Claude Desktop with MCP configuration
# Use other MCP-compatible clients
Troubleshooting
Common Issues
Authentication Failed
- Verify PocketBase is running: http://127.0.0.1:8090
- Check admin credentials in .env
- Ensure admin user exists in PocketBase
Content Extraction Errors
- Check network connectivity
- Verify URL accessibility
- Review debug logs for details
Collection Not Found
- Use the ensure_collection tool to automatically create the collection
- Check collection name in environment variables
- Verify AUTO_CREATE_COLLECTION is enabled
- Check collection permissions
Module Import Errors
- Ensure "type": "module" is set in package.json
- Use Node.js 18+ with ES modules support
- Check all dependencies are installed
Debug Information
Enable debug mode to see detailed logs:
$env:DEBUG="true"; node server.js
PocketBase Collection Schema
If you need to recreate the collection, use this schema:
{
  "name": "documents",
  "type": "base",
  "schema": [
    {
      "name": "title",
      "type": "text",
      "required": true,
      "options": {
        "max": 255
      }
    },
    {
      "name": "content",
      "type": "text",
      "required": true
    },
    {
      "name": "metadata",
      "type": "json",
      "required": true
    },
    {
      "name": "created",
      "type": "date",
      "required": false
    },
    {
      "name": "updated",
      "type": "date",
      "required": false
    }
  ]
}
MCP Client Configuration
Claude Desktop Configuration
Add this to your Claude Desktop MCP settings:
{
  "mcpServers": {
    "document-extractor": {
      "command": "node",
      "args": ["c:\\powershell_scripts\\pocketbase_document_mcp\\document-extractor-mcp\\server.js"],
      "env": {
        "POCKETBASE_URL": "http://127.0.0.1:8090",
        "POCKETBASE_ADMIN_EMAIL": "your-admin@example.com",
        "POCKETBASE_ADMIN_PASSWORD": "your-password",
        "DEBUG": "false"
      }
    }
  }
}
License
MIT License - see LICENSE file for details.
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
Changelog
v1.1.0 ✨ Latest Update
- Latest MCP SDK v1.13.1+: Upgraded to the newest Model Context Protocol SDK
- Latest PocketBase SDK v0.26.1+: Updated to the latest PocketBase features
- Collection Management Tools: Added ensure_collection and collection_info tools
- Auto-Collection Creation: Automatic database schema setup on startup
- Enhanced Lazy Loading: Improved dynamic tool management
- Latest SSE Features: Modern Server-Sent Events implementation
- Improved Error Handling: Better collection management error recovery
- Enhanced Documentation: Comprehensive usage examples and troubleshooting
v1.0.0
- Updated to latest Anthropic MCP SDK
- Added comprehensive error handling
- Implemented input validation with Zod
- Enhanced metadata extraction
- Added debug logging
- Improved documentation
- Added PocketBase integration
- Support for Microsoft Learn and GitHub
Deployment
Smithery Deployment
This MCP server supports deployment on Smithery, a platform for hosting MCP servers.
TypeScript Deploy (Recommended)
The fastest way to deploy this server on Smithery:
- Fork or Clone this repository to your GitHub account
- Connect GitHub to Smithery (or claim your server if already listed)
- Navigate to the Deployments tab on your server page
- Click Deploy - Smithery will automatically build and host your server
The smithery.yaml file is already configured for TypeScript/Node.js deployment.
Note: Despite being called "TypeScript Deploy", this method works perfectly for Node.js projects with ES modules.
Custom Deploy (Docker)
For advanced deployment with full Docker control:
- Replace smithery.yaml with the container configuration:
cp smithery-container.yaml smithery.yaml
- Push to GitHub with the updated configuration
- Deploy via Smithery's Deployments tab
The Dockerfile is optimized for production deployment with security best practices.
Configuration
When deploying on Smithery, you'll configure:
- PocketBase URL: Your PocketBase instance URL
- Admin Credentials: Email and password for PocketBase admin
- Collection Settings: Default collection name and auto-creation
- Debug Mode: Enable detailed logging (optional)
Best Practices for Smithery
- Tool Discovery: All tools are available without authentication for discovery
- Lazy Authentication: API validation occurs only when tools are invoked
- Environment Variables: Configuration is handled via Smithery's config schema
- Health Checks: Built-in health monitoring at the /health endpoint