mcp-monitoring
📊 Monitoring MCP Server
A sophisticated Model Context Protocol (MCP) server that provides intelligent monitoring and observability integration. This server enables natural language interactions with Prometheus, AlertManager, and Grafana through chat-style commands, advanced query processing, and comprehensive monitoring automation.
🌟 Overview
This MCP server transforms how you interact with monitoring infrastructure by providing:
- Natural Language Processing: Ask monitoring questions in plain English
- Intelligent Query Translation: Automatically converts questions to PromQL queries
- Historical Alert Analysis: Count failures, outages, and incidents over time
- Multi-Source Integration: Seamlessly works with Prometheus, AlertManager, and Grafana
- Automated Incident Detection: Smart pattern recognition for service failures
✨ Key Features
🧠 Natural Language Query Engine
- Smart Intent Recognition: Understands monitoring questions like "How many times did service X fail?"
- Automatic Time Range Parsing: Handles phrases like "last 2 weeks", "yesterday", "past month"
- Service Name Detection: Recognizes services like opengrok, jenkins, grafana, prometheus
- Alert Pattern Matching: Identifies automation failures, service outages, and critical incidents
- Context-Aware Responses: Provides detailed breakdowns with incident counts and durations
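The time range parsing described above could look roughly like the sketch below. It is a minimal illustration, not the server's actual parser: `parseTimeRange` and `UNIT_MS` are hypothetical names, a month is approximated as 30 days, and "yesterday" is interpreted as the previous UTC calendar day.

```typescript
// Hypothetical helper, not taken from the project's source.
interface TimeRange {
  start: Date;
  end: Date;
}

const UNIT_MS: Record<string, number> = {
  minute: 60 * 1000,
  hour: 60 * 60 * 1000,
  day: 24 * 60 * 60 * 1000,
  week: 7 * 24 * 60 * 60 * 1000,
  month: 30 * 24 * 60 * 60 * 1000, // assumption: a month is approximated as 30 days
};

function parseTimeRange(phrase: string, now: Date = new Date()): TimeRange | null {
  const text = phrase.toLowerCase();

  // "yesterday" maps to the previous UTC calendar day.
  if (/\byesterday\b/.test(text)) {
    const end = new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate()));
    return { start: new Date(end.getTime() - UNIT_MS.day), end };
  }

  // "last 2 weeks", "past month", "past 7 days": optional count, then a unit.
  const match = text.match(/\b(?:last|past)\s+(?:(\d+)\s+)?(minute|hour|day|week|month)s?\b/);
  if (!match) return null;

  const count = match[1] ? parseInt(match[1], 10) : 1;
  const span = count * UNIT_MS[match[2]];
  return { start: new Date(now.getTime() - span), end: now };
}
```

Passing `now` explicitly keeps the function deterministic, which makes phrases such as "last 2 weeks" easy to unit test.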
🔍 Prometheus Integration
- Advanced PromQL Generation: Automatically creates complex queries based on natural language
- Historical Data Analysis: Analyzes alert trends and service availability over time
- Metric Discovery: Browse and search available metrics with intelligent filtering
- Range Query Optimization: Smart step sizing for different time ranges
- Alert History Tracking: Tracks firing periods and incident detection
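To illustrate what "smart step sizing" might mean in practice, here is a small sketch that picks the finest step keeping a range query under a target number of samples per series. The candidate steps, the `maxPoints` default, and the function name are assumptions for illustration; the server's real selection logic is not documented here. Prometheus itself rejects range queries that would return more than 11,000 points per series, so some cap of this kind is needed for long windows.

```typescript
// Hypothetical helper; candidate steps and the default cap are illustrative choices.
const CANDIDATE_STEPS_SECONDS = [15, 30, 60, 300, 900, 3600, 21600, 86400];

function chooseStepSeconds(start: Date, end: Date, maxPoints: number = 1000): number {
  const rangeSeconds = Math.max(0, (end.getTime() - start.getTime()) / 1000);

  // Pick the finest candidate step that keeps each series at or below maxPoints samples.
  for (const step of CANDIDATE_STEPS_SECONDS) {
    if (rangeSeconds / step <= maxPoints) return step;
  }

  // Very long windows: derive the step directly from the range.
  return Math.ceil(rangeSeconds / maxPoints);
}
```

With these defaults a one-hour window resolves to a 15 second step and a two-week window to a one-hour step.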
🚨 AlertManager Integration
- Real-time Alert Monitoring: Query active, pending, and resolved alerts
- Smart Alert Filtering: Filter by service, severity, alertname, or custom labels
- Alert Fingerprinting: Track unique alert instances and their lifecycle
- Incident Correlation: Group related alerts and calculate total impact
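As a rough sketch of label-based alert filtering, the snippet below parses a filter string such as `alertname=cleanup-zuultmp` and applies it to alerts already fetched from AlertManager. The `Alert` shape is a reduced subset of what AlertManager returns, only equality matchers are handled, and `parseFilter` and `filterAlerts` are hypothetical names rather than the server's API.

```typescript
// Reduced alert shape for illustration; real AlertManager alerts carry more fields.
interface Alert {
  fingerprint: string;
  labels: Record<string, string>;
}

// Parse "key=value,key2=value2" into label matchers (equality only in this sketch).
function parseFilter(filter: string): Array<[string, string]> {
  return filter
    .split(",")
    .map((part) => part.trim())
    .filter((part) => part.includes("="))
    .map((part) => {
      const idx = part.indexOf("=");
      return [part.slice(0, idx).trim(), part.slice(idx + 1).trim()] as [string, string];
    });
}

// Keep only the alerts whose labels satisfy every matcher.
function filterAlerts(alerts: Alert[], filter?: string): Alert[] {
  if (!filter) return alerts;
  const matchers = parseFilter(filter);
  return alerts.filter((alert) => matchers.every(([key, value]) => alert.labels[key] === value));
}
```

Filtering on the client keeps the sketch independent of any HTTP details; a real implementation could instead forward the matchers to AlertManager.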
📊 Grafana Integration (Optional)
- Dashboard Discovery: Find dashboards related to specific services
- Dynamic Dashboard Links: Generate direct links to relevant monitoring views
- Service Context Mapping: Connect services to their monitoring dashboards
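Dashboard discovery can be sketched against Grafana's search endpoint (`/api/search`), which accepts `query` and `type` parameters and returns entries with a relative `url`. The helper names below are hypothetical, and the sketch assumes Grafana is served from the root of `GRAFANA_URL` (a sub-path deployment would need different URL joining).

```typescript
// Subset of the fields present in a Grafana search result.
interface DashboardHit {
  title: string;
  url: string; // relative path such as "/d/<uid>/<slug>"
}

// Build the search request URL for dashboards matching a service name.
function buildSearchUrl(grafanaUrl: string, service: string): string {
  const url = new URL("/api/search", grafanaUrl);
  url.searchParams.set("type", "dash-db"); // dashboards only, no folders
  url.searchParams.set("query", service);
  return url.toString();
}

// Turn a search hit into a direct, absolute dashboard link.
function dashboardLink(grafanaUrl: string, hit: DashboardHit): string {
  return new URL(hit.url, grafanaUrl).toString();
}
```

The request itself would be sent with an `Authorization: Bearer <token>` header carrying the Grafana token from the configuration.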
🛠️ Available Tools
Natural Language Query
// Ask monitoring questions in plain English
mcp_monitoring_natural_language_query({
  question: "how many times did jenkins fail in the last week?",
  timeRange: "last week" // optional
})
Active Alerts
// Get currently firing alerts
mcp_monitoring_get_active_alerts({
  filter: "alertname=cleanup-zuultmp" // optional filter
})
Prometheus Instant Query
// Execute PromQL queries
mcp_monitoring_query_prometheus({
  query: "up{job='prometheus'}",
  time: "2024-01-15T10:30:00Z" // optional timestamp
})
Prometheus Range Query
// Get historical time series data
mcp_monitoring_query_prometheus_range({
  query: "ALERTS{severity='critical'}",
  start: "2024-01-01T00:00:00Z",
  end: "2024-01-15T00:00:00Z",
  step: "1h" // optional resolution
})
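A range query over `ALERTS` returns samples only for the periods in which an alert was pending or firing, so failure counts can be derived from gaps between samples. The following is a minimal sketch of that idea, assuming samples arrive as `[timestamp, value]` pairs sorted by time; `summarizeIncidents` is a hypothetical name and the duration is only an approximation (one step per sample).

```typescript
// [unix timestamp in seconds, value], the pair format used in Prometheus matrix results.
type Sample = [number, string];

interface IncidentSummary {
  incidents: number;
  totalSeconds: number;
}

function summarizeIncidents(samples: Sample[], stepSeconds: number): IncidentSummary {
  let incidents = 0;
  let totalSeconds = 0;
  let previous: number | null = null;

  for (const [timestamp] of samples) {
    // A gap larger than one step means the alert cleared and then started again.
    if (previous === null || timestamp - previous > stepSeconds) incidents += 1;
    totalSeconds += stepSeconds; // each sample accounts for roughly one step of firing time
    previous = timestamp;
  }

  return { incidents, totalSeconds };
}
```

A coarse step can merge short separate incidents into one, so the step passed to the range query affects the count.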
🚀 Quick Start
Installation
git clone <repository-url>
cd mcp-monitoring
npm install
npm run build
Configuration
Set environment variables:
export PROMETHEUS_URL="https://prometheus.example.com"
export ALERTMANAGER_URL="https://alertmanager.example.com"
export GRAFANA_URL="https://grafana.example.com" # Optional
export GRAFANA_API_TOKEN="your-grafana-token" # Optional - Ask admin to create service user and provide token
Running the Server
npm start
# or
node dist/index.js
💬 Natural Language Examples
Service Failure Analysis
Q: "How many times did prevent-opengrok automation fail in the last 2 weeks?"
A: 46 failures, totaling 2 days and 3 hours of downtime
Q: "Show me jenkins outages yesterday"
A: Detailed breakdown of jenkins service interruptions
Q: "Count critical alerts for grafana service this month"
A: Historical analysis with incident timeline
Service Availability Queries
Q: "How many times was prometheus down last week?"
A: Service downtime incidents with duration analysis
Q: "Show cleanup-zuultmp disk usage alerts"
A: Disk space warnings and critical alerts breakdown
Q: "What automation failures happened in the past 7 days?"
A: Comprehensive automation failure report
🔧 Integration Examples
VS Code MCP Configuration
{
  "servers": {
    "monitoring-mcp": {
      "command": "node",
      "args": [
        "/Users/MCP/mcp-monitoring/dist/index.js"
      ],
      "env": {
        "PROMETHEUS_URL": "${input:prometheus_base_url}",
        "ALERTMANAGER_URL": "${input:alertmanager_base_url}",
        "GRAFANA_URL": "${input:grafana_base_url}",
        "GRAFANA_API_KEY": "${input:grafana_api_key}"
      }
    }
  }
}
For the Grafana token, ask your Grafana admin to create a service user and provide its token.
🎯 Use Cases
DevOps Teams
- Incident Response: Quickly assess service health and failure patterns
- Postmortem Analysis: Historical incident data for root cause analysis
- Capacity Planning: Trend analysis and resource utilization monitoring
- Alert Fatigue Management: Identify noisy alerts and optimization opportunities
SRE Teams
- SLI/SLO Monitoring: Service availability and performance tracking
- Error Budget Analysis: Calculate error rates and availability metrics
- Automated Reporting: Generate incident reports and availability summaries
- Proactive Monitoring: Identify patterns before they become critical issues
Development Teams
- Deployment Monitoring: Track deployment success/failure rates
- Performance Regression Detection: Compare metrics across releases
- Integration Testing: Monitor test environment stability
- Feature Flag Impact: Assess performance impact of feature rollouts
🧩 Architecture
Smart Query Processing Pipeline
- Intent Recognition: Parse natural language to understand query type
- Service Detection: Identify target services and components
- Time Range Extraction: Parse temporal expressions into date ranges
- PromQL Generation: Create optimized queries based on intent
- Data Analysis: Process results and calculate meaningful metrics
- Response Formatting: Present data in human-readable format
Supported Query Types
- current_alerts: Active/firing alerts right now
- historical_alerts: Past incidents and failure counts
- service_availability: Uptime/downtime analysis
- dashboard_discovery: Find relevant monitoring dashboards
- metrics: General metric queries and analysis
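The intent recognition step could map a question onto one of these query types with a few ordered keyword rules. This is a deliberately simple sketch: the rule list, its ordering, and the `classify` function are assumptions for illustration, and the server may use a different approach.

```typescript
type QueryType =
  | "current_alerts"
  | "historical_alerts"
  | "service_availability"
  | "dashboard_discovery"
  | "metrics";

// Rules are checked in order; the first matching pattern wins.
const RULES: Array<{ type: QueryType; pattern: RegExp }> = [
  { type: "dashboard_discovery", pattern: /\bdashboards?\b/ },
  { type: "service_availability", pattern: /\b(down|downtime|uptime|available|availability)\b/ },
  { type: "historical_alerts", pattern: /\b(how many|count|fail(ed|ures?)?|outages?|incidents?|last|past|yesterday)\b/ },
  { type: "current_alerts", pattern: /\b(alerts?|firing|active)\b/ },
];

function classify(question: string): QueryType {
  const text = question.toLowerCase();
  for (const rule of RULES) {
    if (rule.pattern.test(text)) return rule.type;
  }
  return "metrics"; // fallback when no rule matches
}
```

Rule order matters here: availability is checked before historical alerts so that "How many times was prometheus down last week?" is treated as an availability question.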
📈 Performance Features
- Intelligent Query Optimization: Automatic step sizing for different time ranges
- Result Caching: Avoid redundant API calls for recent queries
- Timeout Handling: Graceful handling of slow monitoring APIs
- Batch Processing: Efficient handling of multi-service queries
- Memory Management: Optimized for long-running server deployment
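Result caching for recent queries can be approximated with a small time-to-live cache. The class below is a sketch under stated assumptions: `TtlCache` is a hypothetical name, entries are evicted lazily on read, and the clock is injectable so the expiry behavior can be tested without waiting.

```typescript
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Return the cached value, or undefined if it is missing or expired.
  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // lazy eviction on read
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

A cache key would typically combine the PromQL query with its start, end, and step so that different windows do not collide.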
🔒 Security & Best Practices
Authentication
- Secure API token storage for Grafana integration
- Support for basic auth with Prometheus/AlertManager
- Environment variable configuration for sensitive data
Network Security
- HTTPS-only connections to monitoring services
- Configurable timeout and retry policies
- Certificate validation for secure connections
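One way to enforce the HTTPS-only and environment-variable points is to validate configuration at startup. The sketch below is illustrative only: `loadConfig` and `requireHttpsUrl` are hypothetical names, and the Grafana token variable follows the name used in the Configuration section (`GRAFANA_API_TOKEN`).

```typescript
interface MonitoringConfig {
  prometheusUrl: string;
  alertmanagerUrl: string;
  grafanaUrl?: string;
  grafanaToken?: string;
}

// Reject missing, malformed, or non-HTTPS URLs early.
function requireHttpsUrl(name: string, value: string | undefined): string {
  if (!value) throw new Error(`${name} is not set`);
  const parsed = new URL(value); // throws on malformed input
  if (parsed.protocol !== "https:") {
    throw new Error(`${name} must use https, got ${parsed.protocol}`);
  }
  return value;
}

function loadConfig(env: Record<string, string | undefined>): MonitoringConfig {
  const config: MonitoringConfig = {
    prometheusUrl: requireHttpsUrl("PROMETHEUS_URL", env.PROMETHEUS_URL),
    alertmanagerUrl: requireHttpsUrl("ALERTMANAGER_URL", env.ALERTMANAGER_URL),
  };

  // Grafana is optional: validate it only when a URL is provided.
  if (env.GRAFANA_URL) {
    config.grafanaUrl = requireHttpsUrl("GRAFANA_URL", env.GRAFANA_URL);
    if (env.GRAFANA_API_TOKEN) config.grafanaToken = env.GRAFANA_API_TOKEN;
  }
  return config;
}
```

In a Node.js process this would be called as `loadConfig(process.env)`, failing fast before any monitoring API is contacted.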
Access Control
- Read-only operations by design
- No data modification capabilities
- Audit logging for all monitoring queries
🐛 Troubleshooting
Common Issues
# Connection errors
Error: connect ECONNREFUSED
Solution: Check PROMETHEUS_URL and network connectivity
# Authentication failures
Error: 401 Unauthorized
Solution: Verify API tokens and authentication credentials
# Query timeouts
Error: timeout of 30000ms exceeded
Solution: Reduce query complexity or time range
# No data returned
Warning: No matching metrics found
Solution: Check service names and time range validity
Debug Mode
# Enable verbose logging
DEBUG=monitoring-mcp node dist/index.js
# Check configuration
node -e "console.log(process.env.PROMETHEUS_URL)"
🚀 Advanced Usage
Custom Service Detection
The server automatically recognizes these services:
- cleanup-zuultmp, opengrok, jenkins
- grafana, prometheus, alertmanager
- gerrit, nginx, mysql, redis, elasticsearch
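Service detection against this list, followed by PromQL generation, might be sketched as follows. Both function names are hypothetical, and the selector assumes that alert names contain the service name (as in "prevent-opengrok"); that label convention is an assumption, not something this README confirms.

```typescript
const KNOWN_SERVICES = [
  "cleanup-zuultmp", "opengrok", "jenkins",
  "grafana", "prometheus", "alertmanager",
  "gerrit", "nginx", "mysql", "redis", "elasticsearch",
];

// Return the known service mentioned in the question, or null if none is found.
function detectService(question: string): string | null {
  const text = question.toLowerCase();
  const matches = KNOWN_SERVICES.filter((name) => text.includes(name));
  // If several names match, prefer the longest one (an arbitrary tie-break for this sketch).
  matches.sort((a, b) => b.length - a.length);
  return matches.length > 0 ? matches[0] : null;
}

// Build an ALERTS selector for the service, optionally restricted by severity.
function buildAlertsSelector(service: string, severity?: string): string {
  // Assumption: alert names contain the service name.
  const matchers = [`alertname=~".*${service}.*"`];
  if (severity) matchers.push(`severity="${severity}"`);
  return `ALERTS{${matchers.join(",")}}`;
}
```

For example, a question about jenkins with critical severity yields `ALERTS{alertname=~".*jenkins.*",severity="critical"}`, which can then be passed to the range query tool.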
Advanced Natural Language Patterns
"How many times did [service] fail in the last [time period]?"
"Show me [severity] alerts for [service] [time range]"
"Count [alert name] incidents in [time period]"
"When was [service] down last [time period]?"
🤝 Contributing
Contributions welcome! Please ensure:
- TypeScript compilation passes (npm run build)
- Natural language query tests pass
- Documentation is updated for new features
- Error handling is comprehensive
Built with ❤️ for DevOps and SRE teams who want smarter monitoring interactions