dataproc-mcp

dataproc-mcp is a TypeScript Model Context Protocol (MCP) server for Google Cloud Dataproc. It exposes tools for managing clusters, submitting and monitoring jobs, and querying stored cluster and job data from MCP clients such as Roo (VS Code) and Claude.

Dataproc MCP Server

A production-ready Model Context Protocol (MCP) server for Google Cloud Dataproc operations with intelligent parameter injection, enterprise-grade security, and comprehensive tooling. Designed for seamless integration with Roo (VS Code).

🚀 Quick Start
Recommended: Roo (VS Code) Integration

Add this to your Roo MCP settings:

{
  "mcpServers": {
    "dataproc": {
      "command": "npx",
      "args": ["@dipseth/dataproc-mcp-server@latest"],
      "env": {
        "LOG_LEVEL": "info"
      }
    }
  }
}
With Custom Config File
{
  "mcpServers": {
    "dataproc": {
      "command": "npx",
      "args": ["@dipseth/dataproc-mcp-server@latest"],
      "env": {
        "LOG_LEVEL": "info",
        "DATAPROC_CONFIG_PATH": "/path/to/your/config.json"
      }
    }
  }
}
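The two settings snippets above differ only in the optional DATAPROC_CONFIG_PATH variable. As a minimal sketch of that relationship, the hypothetical helper below (buildDataprocEntry is not part of the package) builds the same entry and adds the variable only when a config path is given:

```typescript
// Shape of one entry under "mcpServers", as shown in the JSON snippets above.
interface McpServerEntry {
  command: string;
  args: string[];
  env: Record<string, string>;
}

// Hypothetical helper (an assumption, not part of the package): builds the
// "dataproc" entry and sets DATAPROC_CONFIG_PATH only when a path is supplied.
function buildDataprocEntry(logLevel: string, configPath?: string): McpServerEntry {
  const env: Record<string, string> = { LOG_LEVEL: logLevel };
  if (configPath !== undefined) {
    env["DATAPROC_CONFIG_PATH"] = configPath;
  }
  return {
    command: "npx",
    args: ["@dipseth/dataproc-mcp-server@latest"],
    env,
  };
}

// Reproduces the "With Custom Config File" settings as a JSON string.
const settings = {
  mcpServers: { dataproc: buildDataprocEntry("info", "/path/to/your/config.json") },
};
const settingsJson = JSON.stringify(settings, null, 2);
```

Calling buildDataprocEntry("info") without a path yields the first, minimal entry.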
Alternative: Global Installation
# Install globally
npm install -g @dipseth/dataproc-mcp-server

# Start the server
dataproc-mcp-server

# Or run directly
npx @dipseth/dataproc-mcp-server@latest
5-Minute Setup
  1. Install the package:

    npm install -g @dipseth/dataproc-mcp-server@latest
    
  2. Run the setup:

    dataproc-mcp --setup
    
  3. Configure authentication:

    # Edit the generated config file
    nano config/server.json
    
  4. Start the server:

    dataproc-mcp
    
🌐 Claude.ai Web App Compatibility

✅ PRODUCTION-READY: Full Claude.ai Integration with HTTPS Tunneling & OAuth

The Dataproc MCP Server works with the Claude.ai web app, exposing all 22 MCP tools through an HTTPS tunnel with OAuth authentication.

🚀 Working Solution (Tested & Verified)

Terminal 1 - Start MCP Server:

DATAPROC_CONFIG_PATH=config/github-oauth-server.json npm start -- --http --oauth --port 8080

Terminal 2 - Start Cloudflare Tunnel:

cloudflared tunnel --url https://localhost:8443 --origin-server-name localhost --no-tls-verify

Result: Claude.ai can see and use all tools successfully! 🎉

Key Features:
  • ✅ Complete Tool Access - All 22 MCP tools available in Claude.ai
  • ✅ HTTPS Tunneling - Cloudflare tunnel for secure external access
  • ✅ OAuth Authentication - GitHub OAuth for secure authentication
  • ✅ Trusted Certificates - No browser warnings or connection issues
  • ✅ WebSocket Support - Full WebSocket compatibility with Claude.ai
  • ✅ Production Ready - Tested and verified working solution
Quick Setup:
  1. Setup GitHub OAuth (5 minutes)
  2. Generate SSL certificates: npm run ssl:generate
  3. Start services (2 terminals as shown above)
  4. Connect Claude.ai to your tunnel URL

📖 Complete Guide: See docs/claude-ai-integration.md for detailed setup instructions, troubleshooting, and advanced features.

📖 Certificate Setup: See docs/trusted-certificates.md for SSL certificate configuration.

✨ Features
🎯 Core Capabilities
  • 22 Production-Ready MCP Tools - Complete Dataproc management suite
  • 🧠 Knowledge Base Semantic Search - Natural language queries with optional Qdrant integration
  • 🚀 Response Optimization - 60-96% token reduction with Qdrant storage
  • 🔄 Generic Type Conversion System - Automatic, type-safe data transformations
  • 60-80% Parameter Reduction - Intelligent default injection
  • Multi-Environment Support - Dev/staging/production configurations
  • Service Account Impersonation - Enterprise authentication
  • Real-time Job Monitoring - Comprehensive status tracking
🚀 Response Optimization
  • 96.2% Token Reduction - list_clusters: 7,651 → 292 tokens
  • Automatic Qdrant Storage - Full data preserved and searchable
  • Resource URI Access - dataproc://responses/clusters/list/abc123
  • Graceful Fallback - Works without Qdrant, falls back to full responses
  • 9.95ms Processing - Lightning-fast optimization with <1MB memory usage
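The 96.2% figure follows directly from the two token counts quoted for list_clusters. A short worked calculation (the function name is illustrative, not taken from the project):

```typescript
// Percentage reduction between an original and an optimized token count.
function tokenReductionPercent(before: number, after: number): number {
  if (before <= 0) {
    throw new RangeError("before must be positive");
  }
  return ((before - after) / before) * 100;
}

// Figures quoted above for list_clusters: 7,651 tokens reduced to 292.
const reduction = tokenReductionPercent(7651, 292);

// Rounded to one decimal place this gives 96.2.
const roundedReduction = Math.round(reduction * 10) / 10;
```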
🔄 Generic Type Conversion System
  • 75% Code Reduction - Eliminates manual conversion logic across services
  • Type-Safe Transformations - Automatic field detection and mapping
  • Intelligent Compression - Field-level compression with configurable thresholds
  • 0.50ms Conversion Times - Lightning-fast processing with 100% compression ratios
  • Zero-Configuration - Works automatically with existing TypeScript types
  • Backward Compatible - Seamless integration with existing functionality
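To make "field-level compression with configurable thresholds" concrete, here is a minimal sketch under stated assumptions: the names toStoredPayload and StoredField are hypothetical, the threshold is measured in characters of JSON text rather than bytes, and the codec is injected instead of guessing which one the server uses.

```typescript
// One stored field: either the raw JSON text or a compressed form of it.
interface StoredField {
  compressed: boolean;
  data: string;
}

// Hypothetical converter (an assumption, not the project's API): serializes
// each field and compresses only those whose JSON text exceeds thresholdChars.
function toStoredPayload(
  source: Record<string, unknown>,
  thresholdChars: number,
  compress: (json: string) => string,
): Record<string, StoredField> {
  const payload: Record<string, StoredField> = {};
  for (const field of Object.keys(source)) {
    // JSON.stringify yields undefined for values such as functions; store null instead.
    const json: string = JSON.stringify(source[field]) ?? "null";
    payload[field] =
      json.length > thresholdChars
        ? { compressed: true, data: compress(json) }
        : { compressed: false, data: json };
  }
  return payload;
}
```

Small fields stay readable as plain JSON, while large ones (for example a full cluster configuration) go through the codec, which is the trade-off a per-field threshold is meant to control.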
Enterprise Security
  • Input Validation - Zod schemas for tool inputs
  • Rate Limiting - Configurable abuse prevention
  • Credential Management - Secure handling and rotation
  • Audit Logging - Comprehensive security event tracking
  • Threat Detection - Injection attack prevention
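The list above does not say how rate limiting is implemented. As one common approach, and purely as an assumption about what "configurable" could mean here, a token bucket exposes two knobs: burst capacity and refill rate. The clock is passed in so the behaviour is deterministic.

```typescript
// Hypothetical token-bucket limiter; not taken from the server's source.
class TokenBucket {
  private readonly capacity: number;
  private readonly refillPerSecond: number;
  private tokens: number;
  private lastRefillMs: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start with a full bucket
    this.lastRefillMs = nowMs;
  }

  // Returns true if a request at time nowMs is allowed, false if it should be rejected.
  tryAcquire(nowMs: number): boolean {
    const elapsedSeconds = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = nowMs;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

With a capacity of 2 and a refill rate of 1 per second, two back-to-back calls are allowed, a third is rejected, and one more is allowed a second later.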
📊 Quality Assurance
  • 90%+ Test Coverage - Comprehensive test suite
  • Performance Monitoring - Configurable thresholds
  • Multi-Environment Testing - Cross-platform validation
  • Automated Quality Gates - CI/CD integration
  • Security Scanning - Vulnerability management
🚀 Developer Experience
  • 5-Minute Setup - Quick start guide
  • Interactive Documentation - HTML docs with examples
  • Comprehensive Examples - Multi-environment configs
  • Troubleshooting Guides - Common issues and solutions
  • IDE Integration - TypeScript support
🛠️ Complete MCP Tools Suite (22 Tools)

🔄 Enhanced with Generic Type Conversion: All tools now benefit from automatic, type-safe data transformations with intelligent compression and field mapping.

🚀 Cluster Management (8 Tools)
| Tool | Description | Smart Defaults | Key Features |
| --- | --- | --- | --- |
| start_dataproc_cluster | Create and start new clusters | ✅ 80% fewer params | Profile-based, auto-config |
| create_cluster_from_yaml | Create from YAML configuration | ✅ Project/region injection | Template-driven setup |
| create_cluster_from_profile | Create using predefined profiles | ✅ 85% fewer params | 8 built-in profiles |
| list_clusters | List all clusters with filtering | ✅ No params needed | Semantic queries, pagination |
| list_tracked_clusters | List MCP-created clusters | ✅ Profile filtering | Creation tracking |
| get_cluster | Get detailed cluster information | ✅ 75% fewer params | Semantic data extraction |
| delete_cluster | Delete existing clusters | ✅ Project/region defaults | Safe deletion |
| get_zeppelin_url | Get Zeppelin notebook URL | ✅ Auto-discovery | Web interface access |
💼 Job Management (7 Tools)
| Tool | Description | Smart Defaults | Key Features |
| --- | --- | --- | --- |
| submit_hive_query | Submit Hive queries to clusters | ✅ 70% fewer params | Async support, timeouts |
| submit_dataproc_job | Submit Spark/PySpark/Presto jobs | ✅ 75% fewer params | Multi-engine support, local file staging |
| cancel_dataproc_job | Cancel running or pending jobs | ✅ JobID only needed | Emergency cancellation, cost control |
| get_job_status | Get job execution status | ✅ JobID only needed | Real-time monitoring |
| get_job_results | Get job outputs and results | ✅ Auto-pagination | Result formatting |
| get_query_status | Get Hive query status | ✅ Minimal params | Query tracking |
| get_query_results | Get Hive query results | ✅ Smart pagination | Enhanced async support |
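The job tools above pair asynchronous submission with later status checks. A client-side polling loop might look like the sketch below. The state names are those of the Dataproc API's JobStatus.State enum; fetchState is a hypothetical stand-in for a get_job_status call, and waitForJob is not part of the server.

```typescript
// Job states defined by the Dataproc API (JobStatus.State), minus STATE_UNSPECIFIED.
type JobState =
  | "PENDING"
  | "SETUP_DONE"
  | "RUNNING"
  | "CANCEL_PENDING"
  | "CANCEL_STARTED"
  | "CANCELLED"
  | "DONE"
  | "ERROR"
  | "ATTEMPT_FAILURE";

// A job in one of these states will not change state again.
function isTerminal(state: JobState): boolean {
  return state === "CANCELLED" || state === "DONE" || state === "ERROR";
}

// Hypothetical polling loop: fetchState stands in for a get_job_status call.
async function waitForJob(
  fetchState: () => Promise<JobState>,
  intervalMs: number,
  maxAttempts: number,
): Promise<JobState> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const state = await fetchState();
    if (isTerminal(state)) {
      return state;
    }
    // Wait before the next poll so the API is not hammered.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`job did not reach a terminal state after ${maxAttempts} polls`);
}
```

Bounding the loop with maxAttempts keeps a stuck job from blocking the caller indefinitely; cancel_dataproc_job is the natural follow-up when the bound is hit.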
📋 Configuration & Profiles (3 Tools)
| Tool | Description | Smart Defaults | Key Features |
| --- | --- | --- | --- |
| list_profiles | List available cluster profiles | ✅ Category filtering | 8 production profiles |
| get_profile | Get detailed profile configuration | ✅ Profile ID only | Template access |
| query_cluster_data | Query stored cluster data | ✅ Natural language | Semantic search |
📊 Analytics & Insights (4 Tools)
| Tool | Description | Smart Defaults | Key Features |
| --- | --- | --- | --- |
| check_active_jobs | Quick status of all active jobs | ✅ No params needed | Multi-project view |
| get_cluster_insights | Comprehensive cluster analytics | ✅ Auto-discovery | Machine types, components |
| get_job_analytics | Job performance analytics | ✅ Success rates | Error patterns, metrics |
| query_knowledge | Query comprehensive knowledge base | ✅ Natural language | Clusters, jobs, errors |
🎯 Key Capabilities
  • 🧠 Semantic Search: Natural language queries with Qdrant integration
  • ⚡ Smart Defaults: 60-80% parameter reduction through intelligent injection
  • 📊 Response Optimization: 96% token reduction with full data preservation
  • 🔄 Async Support: Non-blocking job submission and monitoring
  • 🏷️ Profile System: 8 production-ready cluster templates
  • 📈 Analytics: Comprehensive insights and performance tracking
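Smart defaults mean a caller can omit values such as project and region and let the server fill them in. A minimal sketch of that merge, assuming parameter names (clusterName, projectId, region) that may differ from the server's actual schema:

```typescript
// Hypothetical defaults held by the server for the active environment.
interface DataprocDefaults {
  projectId: string;
  region: string;
}

// Parameters a caller may supply; projectId and region are optional because
// the defaults can fill them in.
interface GetClusterParams {
  clusterName: string;
  projectId?: string;
  region?: string;
}

// Fills in any missing parameter from the defaults; explicit values always win.
function injectDefaults(
  params: GetClusterParams,
  defaults: DataprocDefaults,
): Required<GetClusterParams> {
  return {
    clusterName: params.clusterName,
    projectId: params.projectId ?? defaults.projectId,
    region: params.region ?? defaults.region,
  };
}
```

In this sketch a get_cluster call needs one argument instead of three, which is the kind of saving the parameter-reduction percentages describe.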
📋 Configuration
Project-Based Configuration

The server supports a project-based configuration format:

# profiles/@analytics-workloads.yaml
my-company-analytics-prod-1234:
  region: us-central1
  tags:
    - DataProc
    - analytics
    - production
  labels:
    service: analytics-service
    owner: data-team
    environment: production
  cluster_config:
    # ... cluster configuration
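Read as data, the profile file above is a map from project ID to that project's settings. The types below are an assumed TypeScript mirror of the YAML, and resolveProject is a hypothetical lookup showing how a tool call could inherit the region from it:

```typescript
// Assumed shape of one project entry, mirroring the YAML keys shown above.
interface ProjectProfile {
  region: string;
  tags: string[];
  labels: Record<string, string>;
  cluster_config: Record<string, unknown>;
}

// A profile file maps project IDs to their settings.
type ProfileFile = Record<string, ProjectProfile>;

const analyticsProfiles: ProfileFile = {
  "my-company-analytics-prod-1234": {
    region: "us-central1",
    tags: ["DataProc", "analytics", "production"],
    labels: { service: "analytics-service", owner: "data-team", environment: "production" },
    cluster_config: {}, // elided in the YAML example, so left empty here
  },
};

// Hypothetical lookup: returns the project ID and region a tool call could inherit.
function resolveProject(file: ProfileFile, projectId: string): { projectId: string; region: string } {
  const profile = file[projectId];
  if (profile === undefined) {
    throw new Error(`unknown project: ${projectId}`);
  }
  return { projectId, region: profile.region };
}
```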
Authentication Methods
  1. Service Account Impersonation (Recommended)
  2. Direct Service Account Key
  3. Application Default Credentials
  4. Hybrid Authentication with fallbacks
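Hybrid authentication with fallbacks can be pictured as trying each configured method in order until one yields credentials. The sketch below is an assumption about that control flow only; the strategy names, their order, and the string credential are illustrative, and no real Google authentication call is made.

```typescript
// One authentication method; tryResolve returns null when the method is not configured.
interface AuthStrategy {
  method: string;
  tryResolve: () => string | null;
}

interface ResolvedAuth {
  method: string;
  credential: string;
}

// Tries each strategy in the given order and returns the first that succeeds.
function resolveWithFallback(strategies: AuthStrategy[]): ResolvedAuth {
  for (const strategy of strategies) {
    const credential = strategy.tryResolve();
    if (credential !== null) {
      return { method: strategy.method, credential };
    }
  }
  throw new Error("no authentication method is configured");
}
```

Placing impersonation first matches the recommendation in the list above; the position of the remaining methods is a guess.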
📚 Documentation
🔧 MCP Client Integration
Claude Desktop
{
  "mcpServers": {
    "dataproc": {
      "command": "npx",
      "args": ["@dipseth/dataproc-mcp-server@latest"],
      "env": {
        "LOG_LEVEL": "info"
      }
    }
  }
}
Roo (VS Code)
{
  "mcpServers": {
    "dataproc-server": {
      "command": "npx",
      "args": ["@dipseth/dataproc-mcp-server@latest"],
      "disabled": false,
      "alwaysAllow": [
        "list_clusters",
        "get_cluster",
        "list_profiles"
      ]
    }
  }
}
🏗️ Architecture
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   MCP Client    │────│  Dataproc MCP    │────│  Google Cloud   │
│  (Claude/Roo)   │    │     Server       │    │    Dataproc     │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                              │
                       ┌──────┴──────┐
                       │   Features  │
                       ├─────────────┤
                       │ • Security  │
                       │ • Profiles  │
                       │ • Validation│
                       │ • Monitoring│
                       │ • Generic   │
                       │   Converter │
                       └─────────────┘
🔄 Generic Type Conversion System Architecture
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│  Source Types   │────│ Generic Converter│────│ Qdrant Payloads │
│ • ClusterData   │    │    System        │    │ • Compressed    │
│ • QueryResults  │    │                  │    │ • Type-Safe     │
│ • JobData       │    │ ┌──────────────┐ │    │ • Optimized     │
└─────────────────┘    │ │Field Analyzer│ │    └─────────────────┘
                       │ │Transformation│ │
                       │ │Engine        │ │
                       │ │Compression   │ │
                       │ │Service       │ │
                       │ └──────────────┘ │
                       └──────────────────┘
🚦 Performance
Response Time Achievements
  • Schema Validation: ~2ms (target: <5ms) ✅
  • Parameter Injection: ~1ms (target: <2ms) ✅
  • Generic Type Conversion: ~0.50ms (target: <2ms) ✅
  • Credential Validation: ~25ms (target: <50ms) ✅
  • MCP Tool Call: ~50ms (target: <100ms) ✅
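The same figures can be expressed as a threshold check of the kind a performance test might run. The numbers are copied from the list above; the type and function names are illustrative and not taken from the project's test suite.

```typescript
interface LatencyCheck {
  name: string;
  measuredMs: number; // approximate figure from the list above
  targetMs: number; // upper bound from the list above
}

const latencyChecks: LatencyCheck[] = [
  { name: "Schema Validation", measuredMs: 2, targetMs: 5 },
  { name: "Parameter Injection", measuredMs: 1, targetMs: 2 },
  { name: "Generic Type Conversion", measuredMs: 0.5, targetMs: 2 },
  { name: "Credential Validation", measuredMs: 25, targetMs: 50 },
  { name: "MCP Tool Call", measuredMs: 50, targetMs: 100 },
];

// Names of the operations whose measured latency is not below its target.
function failingChecks(items: LatencyCheck[]): string[] {
  return items.filter((c) => c.measuredMs >= c.targetMs).map((c) => c.name);
}

// Fraction of the latency budget still unused (0 means exactly at the target).
function headroom(check: LatencyCheck): number {
  return 1 - check.measuredMs / check.targetMs;
}
```

With the quoted figures failingChecks returns an empty list, and every operation uses at most half of its budget.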
Throughput Achievements
  • Schema Validation: ~2000 ops/sec ✅
  • Parameter Injection: ~5000 ops/sec ✅
  • Generic Type Conversion: ~2000 ops/sec ✅
  • Credential Validation: ~200 ops/sec ✅
  • MCP Tool Call: ~100 ops/sec ✅
Compression Achievements
  • Field-Level Compression: Up to 100% compression ratios ✅
  • Memory Optimization: 30-60% reduction in memory usage ✅
  • Type Safety: Zero runtime type errors with automatic validation ✅
🧪 Testing
# Run all tests
npm test

# Run specific test suites
npm run test:unit
npm run test:integration
npm run test:performance

# Run with coverage
npm run test:coverage
🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

Development Setup
# Clone the repository
git clone https://github.com/dipseth/dataproc-mcp.git
cd dataproc-mcp

# Install dependencies
npm install

# Build the project
npm run build

# Run tests
npm test

# Start development server
npm run dev
📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🆘 Support
🏆 Acknowledgments

Made with ❤️ for the MCP and Google Cloud communities