Dataproc MCP Server
A production-ready Model Context Protocol (MCP) server for Google Cloud Dataproc operations with intelligent parameter injection, enterprise-grade security, and comprehensive tooling. Designed for seamless integration with Roo (VS Code).
🚀 Quick Start
Recommended: Roo (VS Code) Integration
Add this to your Roo MCP settings:
```json
{
  "mcpServers": {
    "dataproc": {
      "command": "npx",
      "args": ["@dipseth/dataproc-mcp-server@latest"],
      "env": {
        "LOG_LEVEL": "info"
      }
    }
  }
}
```
With Custom Config File
```json
{
  "mcpServers": {
    "dataproc": {
      "command": "npx",
      "args": ["@dipseth/dataproc-mcp-server@latest"],
      "env": {
        "LOG_LEVEL": "info",
        "DATAPROC_CONFIG_PATH": "/path/to/your/config.json"
      }
    }
  }
}
```
Alternative: Global Installation
```bash
# Install globally
npm install -g @dipseth/dataproc-mcp-server

# Start the server
dataproc-mcp-server

# Or run directly
npx @dipseth/dataproc-mcp-server@latest
```
5-Minute Setup
1. Install the package:
```bash
npm install -g @dipseth/dataproc-mcp-server@latest
```
2. Run the setup:
```bash
dataproc-mcp --setup
```
3. Configure authentication:
```bash
# Edit the generated config file
nano config/server.json
```
4. Start the server:
```bash
dataproc-mcp
```
🌐 Claude.ai Web App Compatibility
✅ PRODUCTION-READY: Full Claude.ai Integration with HTTPS Tunneling & OAuth
The Dataproc MCP Server now provides complete Claude.ai web app compatibility with a working solution that includes all 22 MCP tools!
🚀 Working Solution (Tested & Verified)
Terminal 1 - Start MCP Server:
```bash
DATAPROC_CONFIG_PATH=config/github-oauth-server.json npm start -- --http --oauth --port 8080
```
Terminal 2 - Start Cloudflare Tunnel:
```bash
cloudflared tunnel --url https://localhost:8443 --origin-server-name localhost --no-tls-verify
```
Result: Claude.ai can see and use all tools successfully! 🎉
Key Features:
- ✅ Complete Tool Access - All 22 MCP tools available in Claude.ai
- ✅ HTTPS Tunneling - Cloudflare tunnel for secure external access
- ✅ OAuth Authentication - GitHub OAuth for secure authentication
- ✅ Trusted Certificates - No browser warnings or connection issues
- ✅ WebSocket Support - Full WebSocket compatibility with Claude.ai
- ✅ Production Ready - Tested and verified working solution
Quick Setup:
1. Set up GitHub OAuth (5 minutes)
2. Generate SSL certificates:
```bash
npm run ssl:generate
```
3. Start the services (2 terminals as shown above)
4. Connect Claude.ai to your tunnel URL
📖 Complete Guide: See docs/claude-ai-integration.md for detailed setup instructions, troubleshooting, and advanced features.
📖 Certificate Setup: See docs/trusted-certificates.md for SSL certificate configuration.
✨ Features
🎯 Core Capabilities
- 22 Production-Ready MCP Tools - Complete Dataproc management suite
- 🧠 Knowledge Base Semantic Search - Natural language queries with optional Qdrant integration
- 🚀 Response Optimization - 60-96% token reduction with Qdrant storage
- 🔄 Generic Type Conversion System - Automatic, type-safe data transformations
- 60-80% Parameter Reduction - Intelligent default injection
- Multi-Environment Support - Dev/staging/production configurations
- Service Account Impersonation - Enterprise authentication
- Real-time Job Monitoring - Comprehensive status tracking
🚀 Response Optimization
- 96.2% Token Reduction - `list_clusters`: 7,651 → 292 tokens
- Automatic Qdrant Storage - Full data preserved and searchable
- Resource URI Access - `dataproc://responses/clusters/list/abc123`
- Graceful Fallback - Works without Qdrant, falls back to full responses
- 9.95ms Processing - Lightning-fast optimization with <1MB memory usage
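The full, untruncated data behind an optimized response stays reachable through the listed resource URI. As a minimal sketch (assuming the standard @modelcontextprotocol/sdk TypeScript client; the URI is the illustrative one above), a client can read it back like this:

```typescript
// Minimal sketch: read the full data behind an optimized response via its
// resource URI, using the standard MCP TypeScript SDK client.
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

async function readFullClusterList(): Promise<void> {
  const transport = new StdioClientTransport({
    command: "npx",
    args: ["@dipseth/dataproc-mcp-server@latest"],
  });
  const client = new Client({ name: "example-client", version: "1.0.0" });
  await client.connect(transport);

  // The URI below is the illustrative example from this README; real IDs are
  // returned alongside each optimized tool response.
  const resource = await client.readResource({
    uri: "dataproc://responses/clusters/list/abc123",
  });
  console.log(resource.contents[0]);

  await client.close();
}

readFullClusterList().catch(console.error);
```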
🔄 Generic Type Conversion System
- 75% Code Reduction - Eliminates manual conversion logic across services
- Type-Safe Transformations - Automatic field detection and mapping
- Intelligent Compression - Field-level compression with configurable thresholds
- 0.50ms Conversion Times - Lightning-fast processing with 100% compression ratios
- Zero-Configuration - Works automatically with existing TypeScript types
- Backward Compatible - Seamless integration with existing functionality
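The converter itself ships inside the server, but the underlying idea is easy to picture. Below is a purely illustrative TypeScript sketch (not the project's actual API): one generic function maps any source type to a storage payload using per-field rules and field-level compression, instead of hand-written converters per type.

```typescript
// Illustrative sketch only -- not the project's actual converter API.
import { gzipSync } from "node:zlib";

// Per-field rules: keep the value, compress it when large, or drop it.
type FieldRules<T> = { [K in keyof T]?: "keep" | "compress" | "drop" };

function toPayload<T extends Record<string, unknown>>(
  source: T,
  rules: FieldRules<T>,
  compressThresholdBytes = 1024 // hypothetical threshold
): Record<string, unknown> {
  const payload: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(source)) {
    const rule = rules[key as keyof T] ?? "keep";
    if (rule === "drop") continue;
    const serialized = JSON.stringify(value);
    if (rule === "compress" && serialized.length > compressThresholdBytes) {
      // Field-level compression: large fields become base64-encoded gzip.
      payload[key] = { compressed: gzipSync(serialized).toString("base64") };
    } else {
      payload[key] = value;
    }
  }
  return payload;
}

// Example: the same generic call covers cluster data, query results, job data, etc.
const payload = toPayload(
  { clusterName: "analytics-prod", config: { workers: 10 }, driverLog: "..." },
  { driverLog: "compress" }
);
console.log(payload);
```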
🔒 Enterprise Security
- Input Validation - Zod schemas for all 16 tools
- Rate Limiting - Configurable abuse prevention
- Credential Management - Secure handling and rotation
- Audit Logging - Comprehensive security event tracking
- Threat Detection - Injection attack prevention
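To illustrate the validation layer (these are not the server's actual schemas), a Zod schema guarding a tool such as get_cluster might look like this:

```typescript
// Illustrative Zod validation sketch -- not the server's actual schemas.
import { z } from "zod";

const getClusterArgs = z.object({
  projectId: z.string().regex(/^[a-z][a-z0-9-]{4,28}[a-z0-9]$/).optional(),
  region: z.string().min(1).optional(),
  clusterName: z.string().min(1).max(54),
});

// Malformed or injected values are rejected before any Dataproc call is made.
const args = getClusterArgs.parse({ clusterName: "analytics-prod" });
console.log(args.clusterName);
```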
📊 Quality Assurance
- 90%+ Test Coverage - Comprehensive test suite
- Performance Monitoring - Configurable thresholds
- Multi-Environment Testing - Cross-platform validation
- Automated Quality Gates - CI/CD integration
- Security Scanning - Vulnerability management
🚀 Developer Experience
- 5-Minute Setup - Quick start guide
- Interactive Documentation - HTML docs with examples
- Comprehensive Examples - Multi-environment configs
- Troubleshooting Guides - Common issues and solutions
- IDE Integration - TypeScript support
🛠️ Complete MCP Tools Suite (22 Tools)
🔄 Enhanced with Generic Type Conversion: All tools now benefit from automatic, type-safe data transformations with intelligent compression and field mapping.
🚀 Cluster Management (8 Tools)
Tool | Description | Smart Defaults | Key Features |
---|---|---|---|
`start_dataproc_cluster` | Create and start new clusters | ✅ 80% fewer params | Profile-based, auto-config |
`create_cluster_from_yaml` | Create from YAML configuration | ✅ Project/region injection | Template-driven setup |
`create_cluster_from_profile` | Create using predefined profiles | ✅ 85% fewer params | 8 built-in profiles |
`list_clusters` | List all clusters with filtering | ✅ No params needed | Semantic queries, pagination |
`list_tracked_clusters` | List MCP-created clusters | ✅ Profile filtering | Creation tracking |
`get_cluster` | Get detailed cluster information | ✅ 75% fewer params | Semantic data extraction |
`delete_cluster` | Delete existing clusters | ✅ Project/region defaults | Safe deletion |
`get_zeppelin_url` | Get Zeppelin notebook URL | ✅ Auto-discovery | Web interface access |
💼 Job Management (7 Tools)
Tool | Description | Smart Defaults | Key Features |
---|---|---|---|
`submit_hive_query` | Submit Hive queries to clusters | ✅ 70% fewer params | Async support, timeouts |
`submit_dataproc_job` | Submit Spark/PySpark/Presto jobs | ✅ 75% fewer params | Multi-engine support, local file staging |
`cancel_dataproc_job` | Cancel running or pending jobs | ✅ Job ID only needed | Emergency cancellation, cost control |
`get_job_status` | Get job execution status | ✅ Job ID only needed | Real-time monitoring |
`get_job_results` | Get job outputs and results | ✅ Auto-pagination | Result formatting |
`get_query_status` | Get Hive query status | ✅ Minimal params | Query tracking |
`get_query_results` | Get Hive query results | ✅ Smart pagination | Enhanced async support |
📋 Configuration & Profiles (3 Tools)
Tool | Description | Smart Defaults | Key Features |
---|---|---|---|
`list_profiles` | List available cluster profiles | ✅ Category filtering | 8 production profiles |
`get_profile` | Get detailed profile configuration | ✅ Profile ID only | Template access |
`query_cluster_data` | Query stored cluster data | ✅ Natural language | Semantic search |
📊 Analytics & Insights (4 Tools)
Tool | Description | Smart Defaults | Key Features |
---|---|---|---|
`check_active_jobs` | Quick status of all active jobs | ✅ No params needed | Multi-project view |
`get_cluster_insights` | Comprehensive cluster analytics | ✅ Auto-discovery | Machine types, components |
`get_job_analytics` | Job performance analytics | ✅ Success rates | Error patterns, metrics |
`query_knowledge` | Query comprehensive knowledge base | ✅ Natural language | Clusters, jobs, errors |
🎯 Key Capabilities
- 🧠 Semantic Search: Natural language queries with Qdrant integration
- ⚡ Smart Defaults: 60-80% parameter reduction through intelligent injection
- 📊 Response Optimization: 96% token reduction with full data preservation
- 🔄 Async Support: Non-blocking job submission and monitoring
- 🏷️ Profile System: 8 production-ready cluster templates
- 📈 Analytics: Comprehensive insights and performance tracking
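As a quick illustration of the smart defaults, the sketch below (standard @modelcontextprotocol/sdk client, assumed here) calls list_clusters with no arguments at all; project and region are injected from the server's configuration.

```typescript
// Minimal sketch: call a tool with zero arguments and let default injection
// supply project and region.
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

async function main(): Promise<void> {
  const transport = new StdioClientTransport({
    command: "npx",
    args: ["@dipseth/dataproc-mcp-server@latest"],
  });
  const client = new Client({ name: "example-client", version: "1.0.0" });
  await client.connect(transport);

  const result = await client.callTool({ name: "list_clusters", arguments: {} });
  console.log(result.content);

  await client.close();
}

main().catch(console.error);
```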
📋 Configuration
Project-Based Configuration
The server supports a project-based configuration format:
```yaml
# profiles/@analytics-workloads.yaml
my-company-analytics-prod-1234:
  region: us-central1
  tags:
    - DataProc
    - analytics
    - production
  labels:
    service: analytics-service
    owner: data-team
    environment: production
  cluster_config:
    # ... cluster configuration
```
Authentication Methods
- Service Account Impersonation (Recommended)
- Direct Service Account Key
- Application Default Credentials
- Hybrid Authentication with fallbacks
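For reference, service account impersonation with Google's Node.js auth stack is typically wired as in the sketch below. This is a hedged example using google-auth-library, not necessarily the server's internal implementation; the service account email is hypothetical.

```typescript
// Sketch of service account impersonation with google-auth-library.
// The target principal below is hypothetical.
import { GoogleAuth, Impersonated } from "google-auth-library";

async function getImpersonatedClient(): Promise<Impersonated> {
  const auth = new GoogleAuth({
    scopes: ["https://www.googleapis.com/auth/cloud-platform"],
  });
  const sourceClient = await auth.getClient();

  return new Impersonated({
    sourceClient,
    targetPrincipal: "dataproc-runner@my-project.iam.gserviceaccount.com",
    delegates: [],
    targetScopes: ["https://www.googleapis.com/auth/cloud-platform"],
    lifetime: 3600, // seconds
  });
}
```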
📚 Documentation
- Quick Start Guide - Get started in 5 minutes
- Knowledge Base Semantic Search - Natural language queries and setup
- Generic Type Conversion System - Architectural design and implementation
- Generic Converter Migration Guide - Migration from manual conversions
- API Reference - Complete tool documentation
- Configuration Examples - Real-world configurations
- Security Guide - Best practices and compliance
- Installation Guide - Detailed setup instructions
🔧 MCP Client Integration
Claude Desktop
```json
{
  "mcpServers": {
    "dataproc": {
      "command": "npx",
      "args": ["@dataproc/mcp-server"],
      "env": {
        "LOG_LEVEL": "info"
      }
    }
  }
}
```
Roo (VS Code)
```json
{
  "mcpServers": {
    "dataproc-server": {
      "command": "npx",
      "args": ["@dataproc/mcp-server"],
      "disabled": false,
      "alwaysAllow": [
        "list_clusters",
        "get_cluster",
        "list_profiles"
      ]
    }
  }
}
```
🏗️ Architecture
```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   MCP Client    │────│   Dataproc MCP   │────│  Google Cloud   │
│  (Claude/Roo)   │    │      Server      │    │    Dataproc     │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                │
                         ┌──────┴──────┐
                         │  Features   │
                         ├─────────────┤
                         │ • Security  │
                         │ • Profiles  │
                         │ • Validation│
                         │ • Monitoring│
                         │ • Generic   │
                         │   Converter │
                         └─────────────┘
```
🔄 Generic Type Conversion System Architecture
```
┌─────────────────┐    ┌───────────────────┐    ┌─────────────────┐
│  Source Types   │────│ Generic Converter │────│ Qdrant Payloads │
│  • ClusterData  │    │      System       │    │  • Compressed   │
│  • QueryResults │    │                   │    │  • Type-Safe    │
│  • JobData      │    │  ┌──────────────┐ │    │  • Optimized    │
└─────────────────┘    │  │Field Analyzer│ │    └─────────────────┘
                       │  │Transformation│ │
                       │  │Engine        │ │
                       │  │Compression   │ │
                       │  │Service       │ │
                       │  └──────────────┘ │
                       └───────────────────┘
```
🚦 Performance
Response Time Achievements
- Schema Validation: ~2ms (target: <5ms) ✅
- Parameter Injection: ~1ms (target: <2ms) ✅
- Generic Type Conversion: ~0.50ms (target: <2ms) ✅
- Credential Validation: ~25ms (target: <50ms) ✅
- MCP Tool Call: ~50ms (target: <100ms) ✅
Throughput Achievements
- Schema Validation: ~2000 ops/sec ✅
- Parameter Injection: ~5000 ops/sec ✅
- Generic Type Conversion: ~2000 ops/sec ✅
- Credential Validation: ~200 ops/sec ✅
- MCP Tool Call: ~100 ops/sec ✅
Compression Achievements
- Field-Level Compression: Up to 100% compression ratios ✅
- Memory Optimization: 30-60% reduction in memory usage ✅
- Type Safety: Zero runtime type errors with automatic validation ✅
🧪 Testing
```bash
# Run all tests
npm test

# Run specific test suites
npm run test:unit
npm run test:integration
npm run test:performance

# Run with coverage
npm run test:coverage
```
🤝 Contributing
We welcome contributions! Please see our Contributing Guide for details.
Development Setup
```bash
# Clone the repository
git clone https://github.com/dipseth/dataproc-mcp.git
cd dataproc-mcp

# Install dependencies
npm install

# Build the project
npm run build

# Run tests
npm test

# Start development server
npm run dev
```
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🆘 Support
- GitHub Issues: Report bugs and request features
- Documentation: Complete documentation
- NPM Package: Package information
🏆 Acknowledgments
- Model Context Protocol - The protocol that makes this possible
- Google Cloud Dataproc - The service we're integrating with
- Qdrant - High-performance vector database powering our semantic search and knowledge indexing
- TypeScript - For type safety and developer experience
Made with ❤️ for the MCP and Google Cloud communities