mcp-monitoring

A sophisticated Model Context Protocol (MCP) server that provides intelligent monitoring and observability integration. This server enables natural language interactions with Prometheus, AlertManager, and Grafana through chat-style commands, advanced query processing, and comprehensive monitoring automation.

📊 Monitoring MCP Server

🌟 Overview

This MCP server transforms how you interact with monitoring infrastructure by providing:

  • Natural Language Processing: Ask monitoring questions in plain English
  • Intelligent Query Translation: Automatically converts questions to PromQL queries
  • Historical Alert Analysis: Count failures, outages, and incidents over time
  • Multi-Source Integration: Seamlessly works with Prometheus, AlertManager, and Grafana
  • Automated Incident Detection: Smart pattern recognition for service failures
✨ Key Features
🧠 Natural Language Query Engine
  • Smart Intent Recognition: Understands monitoring questions like "How many times did service X fail?"
  • Automatic Time Range Parsing: Handles phrases like "last 2 weeks", "yesterday", "past month"
  • Service Name Detection: Recognizes services like opengrok, jenkins, grafana, prometheus
  • Alert Pattern Matching: Identifies automation failures, service outages, and critical incidents
  • Context-Aware Responses: Provides detailed breakdowns with incident counts and durations
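
To make the time range parsing concrete, here is a minimal sketch of how phrases such as "last 2 weeks" or "yesterday" could be turned into a start and end date. The function name `parseTimeRange`, the accepted phrase grammar, and the 30-day month are assumptions made for illustration; the server's actual parser may differ.

```typescript
// Hypothetical helper, not the server's real parser.
interface TimeRange {
  start: Date;
  end: Date;
}

const HOUR_MS = 60 * 60 * 1000;
const UNIT_MS = {
  hour: HOUR_MS,
  day: 24 * HOUR_MS,
  week: 7 * 24 * HOUR_MS,
  month: 30 * 24 * HOUR_MS, // assumption: a "month" is approximated as 30 days
} as const;
type Unit = keyof typeof UNIT_MS;

function parseTimeRange(phrase: string, now: Date = new Date()): TimeRange | null {
  const text = phrase.trim().toLowerCase();

  if (text === "yesterday") {
    // Midnight to midnight (UTC) of the previous calendar day.
    const end = new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate()));
    return { start: new Date(end.getTime() - UNIT_MS.day), end };
  }

  // Matches "last week", "past month", "last 2 weeks", "past 7 days", ...
  const match = /^(?:last|past)\s+(?:(\d+)\s+)?(hour|day|week|month)s?$/.exec(text);
  if (!match) return null;

  const [, countText, unit] = match;
  const count = countText ? parseInt(countText, 10) : 1;
  const spanMs = count * UNIT_MS[unit as Unit];
  return { start: new Date(now.getTime() - spanMs), end: now };
}
```

Passing `now` explicitly keeps the function deterministic, which makes it easy to unit test. Unrecognized phrases return `null` so the caller can fall back to a default range.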
🔍 Prometheus Integration
  • Advanced PromQL Generation: Automatically creates complex queries based on natural language
  • Historical Data Analysis: Analyzes alert trends and service availability over time
  • Metric Discovery: Browse and search available metrics with intelligent filtering
  • Range Query Optimization: Smart step sizing for different time ranges
  • Alert History Tracking: Tracks firing periods and incident detection
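
As an illustration of step sizing, the sketch below picks a range-query step that grows with the time range. Prometheus rejects range queries that would return more than 11,000 points per series, so some form of scaling is needed. The step ladder, the 500-point target, and the name `chooseStepSeconds` are assumptions, not the server's actual values.

```typescript
// Prometheus' hard limit on points per series in a range query.
const MAX_POINTS = 11000;

// Candidate steps in seconds, smallest first (an assumed ladder, chosen for readability).
const STEP_LADDER = [15, 30, 60, 300, 900, 3600, 21600, 86400];

function chooseStepSeconds(start: Date, end: Date, targetPoints: number = 500): number {
  const rangeSeconds = Math.max(0, (end.getTime() - start.getTime()) / 1000);
  const points = Math.min(targetPoints, MAX_POINTS);

  // Pick the smallest step that keeps the series at or under the target point count.
  for (const step of STEP_LADDER) {
    if (rangeSeconds / step <= points) return step;
  }
  // Very long ranges: fall back to an evenly divided step.
  return Math.ceil(rangeSeconds / points);
}
```

With these assumed defaults, a one-hour range resolves to a 15-second step and a two-week range to a one-hour step.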
🚨 AlertManager Integration
  • Real-time Alert Monitoring: Query active, pending, and resolved alerts
  • Smart Alert Filtering: Filter by service, severity, alertname, or custom labels
  • Alert Fingerprinting: Track unique alert instances and their lifecycle
  • Incident Correlation: Group related alerts and calculate total impact
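
For label-based filtering, one plausible approach is to pass matchers to the Alertmanager v2 API, which accepts repeated `filter` parameters on `GET /api/v2/alerts`. The helper name `buildAlertsUrl` and the choice to request only active, unsilenced, uninhibited alerts are assumptions for this sketch.

```typescript
// Hypothetical helper: build an Alertmanager v2 alerts URL from label filters.
function buildAlertsUrl(baseUrl: string, labels: Record<string, string>): string {
  // Assumed defaults: only alerts that are currently active and not muted.
  const params = ["active=true", "silenced=false", "inhibited=false"];

  for (const [name, value] of Object.entries(labels)) {
    // Each matcher becomes one `filter` parameter, e.g. severity="critical".
    const escaped = value.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
    params.push("filter=" + encodeURIComponent(`${name}="${escaped}"`));
  }

  return `${baseUrl.replace(/\/+$/, "")}/api/v2/alerts?${params.join("&")}`;
}
```

For example, `buildAlertsUrl(alertmanagerUrl, { alertname: "cleanup-zuultmp" })` yields a URL that can be fetched with any HTTP client.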
📊 Grafana Integration (Optional)
  • Dashboard Discovery: Find dashboards related to specific services
  • Dynamic Dashboard Links: Generate direct links to relevant monitoring views
  • Service Context Mapping: Connect services to their monitoring dashboards
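
Dashboard discovery can be sketched against Grafana's search endpoint (`GET /api/search`), whose results include a `title` and a relative `url` for each dashboard. The helper names below are hypothetical, and the request itself would also need an `Authorization: Bearer <token>` header.

```typescript
// Subset of the fields returned by Grafana's /api/search endpoint.
interface DashboardHit {
  title: string;
  url: string; // relative path, e.g. "/d/<uid>/<slug>"
}

// Hypothetical helper: URL that searches dashboards by service name.
function dashboardSearchUrl(grafanaUrl: string, service: string): string {
  const base = grafanaUrl.replace(/\/+$/, "");
  return `${base}/api/search?type=dash-db&query=${encodeURIComponent(service)}`;
}

// Hypothetical helper: turn search hits into absolute, human-readable links.
function dashboardLinks(grafanaUrl: string, hits: DashboardHit[]): string[] {
  const base = grafanaUrl.replace(/\/+$/, "");
  return hits.map((hit) => `${hit.title}: ${base}${hit.url}`);
}
```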
🛠️ Available Tools
Natural Language Query
// Ask monitoring questions in plain English
mcp_monitoring_natural_language_query({
  question: "how many times did jenkins fail in the last week?",
  timeRange: "last week"  // optional
})
Active Alerts
// Get currently firing alerts
mcp_monitoring_get_active_alerts({
  filter: "alertname=cleanup-zuultmp"  // optional filter
})
Prometheus Instant Query
// Execute PromQL queries
mcp_monitoring_query_prometheus({
  query: "up{job='prometheus'}",
  time: "2024-01-15T10:30:00Z"  // optional timestamp
})
Prometheus Range Query
// Get historical time series data
mcp_monitoring_query_prometheus_range({
  query: "ALERTS{severity='critical'}",
  start: "2024-01-01T00:00:00Z",
  end: "2024-01-15T00:00:00Z",
  step: "1h"  // optional resolution
})
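
A range query like the one above presumably maps onto the Prometheus HTTP API endpoint `/api/v1/query_range`, which takes `query`, `start`, `end`, and `step` parameters. The following sketch shows that mapping; the function name `buildRangeQueryUrl` and the `1h` fallback step are assumptions, since the tool's real default is not documented here.

```typescript
interface RangeQuery {
  query: string;
  start: string; // RFC 3339 timestamp
  end: string;   // RFC 3339 timestamp
  step?: string; // e.g. "1h"
}

// Hypothetical helper: build the Prometheus query_range URL for a tool call.
function buildRangeQueryUrl(prometheusUrl: string, q: RangeQuery): string {
  const params: Array<[string, string]> = [
    ["query", q.query],
    ["start", q.start],
    ["end", q.end],
    ["step", q.step ?? "1h"], // assumed fallback when no step is given
  ];
  const queryString = params
    .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
    .join("&");
  return `${prometheusUrl.replace(/\/+$/, "")}/api/v1/query_range?${queryString}`;
}
```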
🚀 Quick Start
Installation
git clone <repository-url>
cd mcp-monitoring
npm install
npm run build
Configuration

Set environment variables:

export PROMETHEUS_URL="https://prometheus.example.com"
export ALERTMANAGER_URL="https://alertmanager.example.com"
export GRAFANA_URL="https://grafana.example.com"          # Optional
export GRAFANA_API_TOKEN="your-grafana-token"             # Optional: ask your Grafana admin to create a service account and issue a token
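
A server like this would typically validate these variables at startup. The sketch below shows one way to do that; `loadConfig` and `MonitoringConfig` are hypothetical names, and treating the two Grafana variables as optional follows the comments above.

```typescript
interface MonitoringConfig {
  prometheusUrl: string;
  alertmanagerUrl: string;
  grafanaUrl: string | undefined;      // optional integration
  grafanaApiToken: string | undefined; // optional integration
}

// Hypothetical helper: read and validate configuration from an environment map.
function loadConfig(env: Record<string, string | undefined>): MonitoringConfig {
  const required = (name: string): string => {
    const value = env[name];
    if (!value) throw new Error(`Missing required environment variable: ${name}`);
    return value.replace(/\/+$/, ""); // drop trailing slashes so paths can be appended
  };

  return {
    prometheusUrl: required("PROMETHEUS_URL"),
    alertmanagerUrl: required("ALERTMANAGER_URL"),
    grafanaUrl: env["GRAFANA_URL"]?.replace(/\/+$/, ""),
    grafanaApiToken: env["GRAFANA_API_TOKEN"],
  };
}
```

In a Node process this would be called as `loadConfig(process.env)`. Failing fast on a missing URL gives a clearer error than a later connection failure.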
Running the Server
npm start
# or
node dist/index.js
💬 Natural Language Examples
Service Failure Analysis
Q: "How many times did prevent-opengrok automation fail in the last 2 weeks?"
A: 46 failures, totaling 2 days and 3 hours of downtime

Q: "Show me jenkins outages yesterday"
A: Detailed breakdown of jenkins service interruptions

Q: "Count critical alerts for grafana service this month"
A: Historical analysis with incident timeline
Service Availability Queries
Q: "How many times was prometheus down last week?"
A: Service downtime incidents with duration analysis

Q: "Show cleanup-zuultmp disk usage alerts"  
A: Disk space warnings and critical alerts breakdown

Q: "What automation failures happened in the past 7 days?"
A: Comprehensive automation failure report
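
Failure counts and downtime totals like those in the answers above can be derived from a range query over the `ALERTS` series, which only has samples while an alert is pending or firing. The sketch below is one possible approach, not the server's confirmed algorithm: a gap longer than one step starts a new incident, and each sample is counted as one step of firing time. `countIncidents` is a hypothetical name.

```typescript
interface IncidentSummary {
  incidents: number;
  firingSeconds: number;
}

// timestamps: sample times in seconds at which the alert series had a value.
function countIncidents(timestamps: number[], stepSeconds: number): IncidentSummary {
  const sorted = [...timestamps].sort((a, b) => a - b);
  let incidents = 0;
  let previous: number | undefined;

  for (const t of sorted) {
    // A gap larger than one step means the alert resolved and fired again.
    if (previous === undefined || t - previous > stepSeconds) {
      incidents += 1;
    }
    previous = t;
  }

  // Approximation: every sample stands for one step of firing time.
  return { incidents, firingSeconds: sorted.length * stepSeconds };
}
```

The accuracy of both numbers depends on the step: incidents shorter than one step can be missed, and durations are rounded to whole steps.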
🔧 Integration Examples
VS Code MCP Configuration
{
  "servers": {
    "monitoring-mcp": {
      "command": "node",
      "args": [
        "/Users/MCP/mcp-monitoring/dist/index.js"
      ],
      "env": {
        "PROMETHEUS_URL": "${input:prometheus_base_url}",
        "ALERTMANAGER_URL": "${input:alertmanager_base_url}",
        "GRAFANA_URL": "${input:grafana_base_url}",
        "GRAFANA_API_TOKEN": "${input:grafana_api_key}"
      }
    }
  }
}

To obtain a Grafana token, ask your Grafana administrator to create a service account and provide its token.

🎯 Use Cases
DevOps Teams
  • Incident Response: Quickly assess service health and failure patterns
  • Postmortem Analysis: Historical incident data for root cause analysis
  • Capacity Planning: Trend analysis and resource utilization monitoring
  • Alert Fatigue Management: Identify noisy alerts and optimization opportunities
SRE Teams
  • SLI/SLO Monitoring: Service availability and performance tracking
  • Error Budget Analysis: Calculate error rates and availability metrics
  • Automated Reporting: Generate incident reports and availability summaries
  • Proactive Monitoring: Identify patterns before they become critical issues
Development Teams
  • Deployment Monitoring: Track deployment success/failure rates
  • Performance Regression Detection: Compare metrics across releases
  • Integration Testing: Monitor test environment stability
  • Feature Flag Impact: Assess performance impact of feature rollouts
🧩 Architecture
Smart Query Processing Pipeline
  1. Intent Recognition: Parse natural language to understand query type
  2. Service Detection: Identify target services and components
  3. Time Range Extraction: Parse temporal expressions into date ranges
  4. PromQL Generation: Create optimized queries based on intent
  5. Data Analysis: Process results and calculate meaningful metrics
  6. Response Formatting: Present data in human-readable format
Supported Query Types
  • current_alerts: Active/firing alerts right now
  • historical_alerts: Past incidents and failure counts
  • service_availability: Uptime/downtime analysis
  • dashboard_discovery: Find relevant monitoring dashboards
  • metrics: General metric queries and analysis
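
The intent recognition step can be pictured as a classifier that maps a question onto one of the query types listed above. The keyword rules in this sketch are assumptions chosen to fit the example questions in this README; the real engine may use different or richer rules.

```typescript
type QueryType =
  | "current_alerts"
  | "historical_alerts"
  | "service_availability"
  | "dashboard_discovery"
  | "metrics";

// Hypothetical keyword-based classifier; rules are checked from most to least specific.
function classifyIntent(question: string): QueryType {
  const q = question.toLowerCase();
  if (/\bdashboards?\b/.test(q)) return "dashboard_discovery";
  if (/\b(down|uptime|downtime|availability|outages?)\b/.test(q)) return "service_availability";
  if (/\b(how many|count|fail(ed|ures?)?|incidents?|last|past|yesterday)\b/.test(q)) return "historical_alerts";
  if (/\b(alerts?|firing)\b/.test(q)) return "current_alerts";
  return "metrics"; // fallback: treat everything else as a general metric query
}
```

Ordering matters here: a question such as "How many times was prometheus down last week?" contains historical keywords too, so the availability rule is checked first.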
📈 Performance Features
  • Intelligent Query Optimization: Automatic step sizing for different time ranges
  • Result Caching: Avoid redundant API calls for recent queries
  • Timeout Handling: Graceful handling of slow monitoring APIs
  • Batch Processing: Efficient handling of multi-service queries
  • Memory Management: Optimized for long-running server deployment
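
Result caching for recent queries is commonly implemented as a small time-to-live (TTL) cache keyed by the query string. The class below is a minimal sketch of that idea; `TtlCache` is a hypothetical name and the expiry policy is an assumption, not the server's documented behavior.

```typescript
// Minimal TTL cache: entries expire a fixed time after they are stored.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  // `now` is injectable so expiry can be tested without waiting.
  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired: drop it so the caller re-queries
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

A short TTL (seconds rather than minutes) is a reasonable choice for monitoring data, since stale alert states are worse than an extra API call.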
🔒 Security & Best Practices
Authentication
  • Secure API token storage for Grafana integration
  • Support for basic auth with Prometheus/AlertManager
  • Environment variable configuration for sensitive data
Network Security
  • HTTPS-only connections to monitoring services
  • Configurable timeout and retry policies
  • Certificate validation for secure connections
Access Control
  • Read-only operations by design
  • No data modification capabilities
  • Audit logging for all monitoring queries
🐛 Troubleshooting
Common Issues
# Connection errors
Error: connect ECONNREFUSED
Solution: Check PROMETHEUS_URL and network connectivity

# Authentication failures  
Error: 401 Unauthorized
Solution: Verify API tokens and authentication credentials

# Query timeouts
Error: timeout of 30000ms exceeded
Solution: Reduce query complexity or time range

# No data returned
Warning: No matching metrics found
Solution: Check service names and time range validity
Debug Mode
# Enable verbose logging
DEBUG=monitoring-mcp node dist/index.js

# Check configuration
node -e "console.log(process.env.PROMETHEUS_URL)"
🚀 Advanced Usage
Custom Service Detection

The server automatically recognizes these services:

  • cleanup-zuultmp, opengrok, jenkins
  • grafana, prometheus, alertmanager
  • gerrit, nginx, mysql, redis, elasticsearch
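
A simple way to picture service detection is a substring match of the question against the list above. The sketch below does exactly that; `detectServices` and its `extra` parameter for custom service names are hypothetical and only illustrate how the list could be extended.

```typescript
// Service names taken from the list above.
const KNOWN_SERVICES = [
  "cleanup-zuultmp", "opengrok", "jenkins",
  "grafana", "prometheus", "alertmanager",
  "gerrit", "nginx", "mysql", "redis", "elasticsearch",
];

// Hypothetical helper: return every known (or extra) service mentioned in a question.
function detectServices(question: string, extra: string[] = []): string[] {
  const q = question.toLowerCase();
  return [...KNOWN_SERVICES, ...extra].filter((name) => q.includes(name.toLowerCase()));
}
```

Because this is a substring match, a name such as "prevent-opengrok" still resolves to the `opengrok` service.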
Advanced Natural Language Patterns
"How many times did [service] fail in the last [time period]?"
"Show me [severity] alerts for [service] [time range]"
"Count [alert name] incidents in [time period]"
"When was [service] down last [time period]?"
🤝 Contributing

Contributions welcome! Please ensure:

  • TypeScript compilation passes (npm run build)
  • Natural language query tests pass
  • Documentation is updated for new features
  • Error handling is comprehensive

Built with ❤️ for DevOps and SRE teams who want smarter monitoring interactions
