inference-gateway

An open-source, high-performance gateway unifying multiple LLM providers, from local solutions like Ollama to major cloud providers such as OpenAI, Groq, Cohere, Anthropic, Cloudflare and DeepSeek.

Inference Gateway


The Inference Gateway is a proxy server designed to facilitate access to various language model APIs. It lets users interact with different language models through a unified interface, simplifying configuration and the process of sending requests to and receiving responses from multiple LLMs, and making it easy to use a Mixture of Experts approach.

Key Features
  • 📜 Open Source: Available under the MIT License.
  • 🚀 Unified API Access: Proxy requests to multiple language model APIs, including OpenAI, Ollama, Groq, Cohere, and more.
  • ⚙️ Environment Configuration: Easily configure API keys and URLs through environment variables.
  • 🔧 Tool-use Support: Enable function calling capabilities across supported providers with a unified API.
  • 🌐 MCP Support: Full Model Context Protocol integration - automatically discover and expose tools from MCP servers to LLMs without client-side tool management.
  • 🤝 A2A Support: Agent-to-Agent protocol integration - connect to external A2A-compliant agents and automatically expose their skills as tools.
  • 🌊 Streaming Responses: Stream tokens in real-time as they're generated from language models.
  • 🖥️ Web Interface: Access through a modern web UI for easy interaction and management.
  • 🐳 Docker Support: Use Docker and Docker Compose for easy setup and deployment.
  • ☸️ Kubernetes Support: Ready for deployment in Kubernetes environments.
  • 📊 OpenTelemetry: Monitor and analyze performance.
  • 🛡️ Production Ready: Built with production in mind, with configurable timeouts and TLS support.
  • 🌿 Lightweight: Includes only essential libraries and runtime, resulting in a small binary of ~10.8 MB.
  • 📉 Minimal Resource Consumption: Designed to consume minimal resources with a low footprint.
  • 📚 Documentation: Well documented with examples and guides.
  • 🧪 Tested: Extensively tested with unit tests and integration tests.
  • 🛠️ Maintained: Actively maintained and developed.
  • 📈 Scalable: Easily scalable and can be used in a distributed environment - with HPA in Kubernetes.
  • 🔒 Compliance and Data Privacy: This project does not collect data or analytics, ensuring compliance and data privacy.
  • 🏠 Self-Hosted: Can be self-hosted for complete control over the deployment environment.
  • ⌨️ CLI Tool: An improved command-line interface for managing and interacting with the Inference Gateway.
Overview

You can horizontally scale the Inference Gateway to handle multiple requests from clients. The Inference Gateway will forward the requests to the respective provider and return the response to the client.
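As a minimal sketch of horizontal scaling in Kubernetes (the Deployment name inference-gateway and the thresholds below are illustrative assumptions, not taken from the project's manifests), an HPA could be attached like this:

# Hypothetical example: autoscale a Deployment named "inference-gateway"
# between 2 and 10 replicas based on CPU utilization
kubectl autoscale deployment inference-gateway --min=2 --max=10 --cpu-percent=80

# Inspect the resulting HorizontalPodAutoscaler
kubectl get hpa inference-gateway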

Note: Both A2A and MCP middleware components can be easily toggled on/off via environment variables (A2A_ENABLE, MCP_ENABLE) or bypassed per-request using headers (X-A2A-Bypass, X-MCP-Bypass), giving you full control over which capabilities are active.
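For example, a brief sketch using the variables and headers named above (assuming the toggles accept the usual true/false values):

# Disable A2A globally while keeping MCP active
export A2A_ENABLE=false
export MCP_ENABLE=true

# ...or bypass MCP for a single request only
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "X-MCP-Bypass: true" \
  -d '{"model": "openai/gpt-4", "messages": [{"role": "user", "content": "Hello"}]}'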

The following diagram illustrates the flow:

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#326CE5', 'primaryTextColor': '#fff', 'lineColor': '#5D8AA8', 'secondaryColor': '#006100' }, 'fontFamily': 'Arial', 'flowchart': {'nodeSpacing': 50, 'rankSpacing': 70, 'padding': 15}}}%%


graph TD
    %% Client nodes
    A["๐Ÿ‘ฅ Clients / ๐Ÿค– Agents"] --> |POST /v1/chat/completions| Auth
    UI["๐Ÿ’ป Web UI"] --> |API requests| Auth

    %% Auth node
    Auth["๐Ÿ”’ Optional OIDC"] --> |Auth?| IG1
    Auth --> |Auth?| IG2
    Auth --> |Auth?| IG3

    %% Gateway nodes
    IG1["๐Ÿ–ฅ๏ธ Inference Gateway"] --> P
    IG2["๐Ÿ–ฅ๏ธ Inference Gateway"] --> P
    IG3["๐Ÿ–ฅ๏ธ Inference Gateway"] --> P

    %% Middleware Processing (Sequential) and Direct Routing
    P["๐Ÿ”Œ Proxy Gateway"] --> A2A["๐Ÿค A2A Middleware"]
    P --> |"Direct routing bypassing middleware"| Direct["๐Ÿ”Œ Direct Providers"]
    A2A --> |"If A2A bypassed or complete"| MCP["๐ŸŒ MCP Middleware"]
    MCP --> |"Middleware chain complete"| Providers["๐Ÿค– LLM Providers"]

    %% A2A External Agents (First Layer)
    A2A --> A2A1["๐Ÿ“… Calendar Agent"]
    A2A --> A2A2["๐Ÿงฎ Calculator Agent"]
    A2A --> A2A3["๐ŸŒค๏ธ Weather Agent"]
    A2A --> A2A4["โœˆ๏ธ Booking Agent"]

    %% MCP Tool Servers (Second Layer)
    MCP --> MCP1["๐Ÿ“ File System Server"]
    MCP --> MCP2["๐Ÿ” Search Server"]
    MCP --> MCP3["๐ŸŒ Web Server"]

    %% LLM Providers (Middleware Enhanced)
    Providers --> C1["๐Ÿฆ™ Ollama"]
    Providers --> D1["๐Ÿš€ Groq"]
    Providers --> E1["โ˜๏ธ OpenAI"]

    %% Direct Providers (Bypass Middleware)
    Direct --> C["๐Ÿฆ™ Ollama"]
    Direct --> D["๐Ÿš€ Groq"]
    Direct --> E["โ˜๏ธ OpenAI"]
    Direct --> G["โšก Cloudflare"]
    Direct --> H1["๐Ÿ’ฌ Cohere"]
    Direct --> H2["๐Ÿง  Anthropic"]
    Direct --> H3["๐Ÿ‹ DeepSeek"]

    %% Define styles
    classDef client fill:#9370DB,stroke:#333,stroke-width:1px,color:white;
    classDef auth fill:#F5A800,stroke:#333,stroke-width:1px,color:black;
    classDef gateway fill:#326CE5,stroke:#fff,stroke-width:1px,color:white;
    classDef provider fill:#32CD32,stroke:#333,stroke-width:1px,color:white;
    classDef ui fill:#FF6B6B,stroke:#333,stroke-width:1px,color:white;
    classDef mcp fill:#FF69B4,stroke:#333,stroke-width:1px,color:white;
    classDef a2a fill:#FFA500,stroke:#333,stroke-width:1px,color:white;

    %% Apply styles
    class A client;
    class UI ui;
    class Auth auth;
    class IG1,IG2,IG3,P gateway;
    class C,D,E,G,H1,H2,H3,C1,D1,E1,Providers provider;
    class MCP,MCP1,MCP2,MCP3 mcp;
    class A2A,A2A1,A2A2,A2A3,A2A4 a2a;
    class Direct direct;

The client sends:

curl -X POST http://localhost:8080/v1/chat/completions \
  -d '{
    "model": "openai/gpt-3.5-turbo",
    "messages": [
      {
        "role": "system",
        "content": "You are a pirate."
      },
      {
        "role": "user",
        "content": "Hello, world! How are you doing today?"
      }
    ]
  }'

Internally, the request is proxied to OpenAI; the Inference Gateway infers the provider from the model name.

You can also specify the provider explicitly by appending ?provider=openai (or any other supported provider) to the URL.
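For example, making the routing explicit with the query parameter (assuming the gateway runs on the default localhost:8080 used throughout this README):

curl -X POST "http://localhost:8080/v1/chat/completions?provider=openai" \
  -d '{
    "model": "openai/gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hello, world!"}]
  }'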

Finally, the client receives:

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "Ahoy, matey! ๐Ÿดโ€โ˜ ๏ธ The seas be wild, the sun be bright, and this here pirate be ready to conquer the day! What be yer business, landlubber? ๐Ÿฆœ",
        "role": "assistant"
      }
    }
  ],
  "created": 1741821109,
  "id": "chatcmpl-dc24995a-7a6e-4d95-9ab3-279ed82080bb",
  "model": "N/A",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 0,
    "prompt_tokens": 0,
    "total_tokens": 0
  }
}

To stream tokens, simply add "stream": true to the request body.
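A minimal streaming request might look like this (same endpoint as above; only the stream field is new):

curl -X POST http://localhost:8080/v1/chat/completions \
  -d '{
    "model": "openai/gpt-3.5-turbo",
    "stream": true,
    "messages": [{"role": "user", "content": "Tell me a short pirate joke."}]
  }'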

Middleware Control and Bypass Mechanisms

The Inference Gateway uses middleware to process requests and add capabilities like MCP (Model Context Protocol) and A2A (Agent-to-Agent) integrations. Clients can control which middlewares are active using bypass headers:

Bypass Headers
  • X-MCP-Bypass: Skip MCP middleware processing
  • X-A2A-Bypass: Skip A2A middleware processing
Client Control Examples
# Use only MCP capabilities (skip A2A)
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "X-A2A-Bypass: true" \
  -d '{
    "model": "openai/gpt-4",
    "messages": [{"role": "user", "content": "Help me with file operations"}]
  }'

# Use only A2A capabilities (skip MCP)
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "X-MCP-Bypass: true" \
  -d '{
    "model": "anthropic/claude-3-haiku",
    "messages": [{"role": "user", "content": "Connect to external agents"}]
  }'

# Skip both middlewares for direct provider access
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "X-MCP-Bypass: true" \
  -H "X-A2A-Bypass: true" \
  -d '{
    "model": "groq/llama-3-8b",
    "messages": [{"role": "user", "content": "Simple chat without tools"}]
  }'
When to Use Bypass Headers

For Performance:

  • Skip middleware processing when you don't need tool capabilities
  • Reduce latency for simple chat interactions

For Selective Features:

  • Use only MCP tools (skip A2A): Add X-A2A-Bypass: true
  • Use only A2A agents (skip MCP): Add X-MCP-Bypass: true
  • Direct provider access (skip both): Add both headers

For Development:

  • Test middleware behavior in isolation
  • Debug tool integration issues
  • Ensure backward compatibility with existing applications

For Agent Communication:

  • Prevent infinite loops when A2A agents make their own chat completion requests
  • Use X-A2A-Bypass: true to avoid triggering A2A servers recursively
How It Works Internally

The middlewares use these same headers to prevent infinite loops during their operation:

MCP Processing:

  • When tools are detected in a response, the MCP agent makes up to 10 follow-up requests
  • Each follow-up request includes X-MCP-Bypass: true to skip middleware re-processing
  • This allows the agent to iterate without creating circular calls

A2A Processing:

  • When A2A agents execute skills, they may need to make their own chat requests
  • The X-A2A-Bypass: true header prevents these internal calls from triggering more A2A processing
  • This enables clean agent-to-agent communication

Note: These bypass headers only affect middleware processing. The core chat completions functionality remains available regardless of header values.

Model Context Protocol (MCP) Integration

Enable MCP to automatically provide tools to LLMs without requiring clients to manage them:

# Enable MCP and connect to tool servers
export MCP_ENABLE=true
export MCP_SERVERS="http://filesystem-server:3001/mcp,http://search-server:3002/mcp"

# LLMs will automatically discover and use available tools
curl -X POST http://localhost:8080/v1/chat/completions \
  -d '{
    "model": "openai/gpt-4",
    "messages": [{"role": "user", "content": "List files in the current directory"}]
  }'

The gateway automatically injects available tools into requests and handles tool execution, making external capabilities seamlessly available to any LLM.

Learn more: Model Context Protocol Documentation | MCP Integration Example

Agent-to-Agent (A2A) Integration

Enable A2A to connect with external agents and expose their skills as tools:

Manual Configuration:

# Enable A2A and connect to agent endpoints
export A2A_ENABLE=true
export A2A_AGENTS="http://booking-agent:3001,http://calculator-agent:3002"

Kubernetes Service Discovery:

# Enable A2A with automatic Kubernetes service discovery
export A2A_ENABLE=true
export A2A_SERVICE_DISCOVERY_ENABLE=true
export A2A_SERVICE_DISCOVERY_NAMESPACE=agents  # Optional: defaults to current namespace
# LLMs will automatically discover and use agent skills
curl -X POST http://localhost:8080/v1/chat/completions \
  -d '{
    "model": "openai/gpt-4",
    "messages": [{"role": "user", "content": "Book a flight to New York and calculate the cost"}]
  }'

The gateway automatically discovers agent skills, converts them to chat completion tools, and handles skill execution, enabling seamless collaboration between LLMs and external agents. In Kubernetes environments, agents deployed with the inference-gateway operator are automatically discovered without manual configuration.

Learn more: A2A Protocol Documentation | A2A Integration Example | Curated A2A Agents

Metrics and Observability

The Inference Gateway provides comprehensive OpenTelemetry metrics for monitoring performance, usage, and function/tool call activity. Metrics are automatically exported to Prometheus format and available on port 9464 by default.

Enabling Metrics
# Enable telemetry and set metrics port (default: 9464)
export TELEMETRY_ENABLE=true
export TELEMETRY_METRICS_PORT=9464

# Access metrics endpoint
curl http://localhost:9464/metrics
Available Metrics
Token Usage Metrics

Track token consumption across different providers and models:

  • llm_usage_prompt_tokens_total - Counter for prompt tokens consumed
  • llm_usage_completion_tokens_total - Counter for completion tokens generated
  • llm_usage_total_tokens_total - Counter for total token usage

Labels: provider, model

# Total tokens used by OpenAI models in the last hour
sum(increase(llm_usage_total_tokens_total{provider="openai"}[1h])) by (model)
Request/Response Metrics

Monitor API performance and reliability:

  • llm_requests_total - Counter for total requests processed
  • llm_responses_total - Counter for responses by HTTP status code
  • llm_request_duration - Histogram for end-to-end request duration (milliseconds)

Labels: provider, request_method, request_path, status_code (responses only)

# 95th percentile request latency by provider
histogram_quantile(0.95, sum(rate(llm_request_duration_bucket{provider=~"openai|anthropic"}[5m])) by (provider, le))

# Error rate percentage by provider
100 * sum(rate(llm_responses_total{status_code!~"2.."}[5m])) by (provider) / sum(rate(llm_responses_total[5m])) by (provider)
Function/Tool Call Metrics

Comprehensive tracking of tool executions for MCP, A2A, and standard function calls:

  • llm_tool_calls_total - Counter for total function/tool calls executed
  • llm_tool_calls_success_total - Counter for successful tool executions
  • llm_tool_calls_failure_total - Counter for failed tool executions
  • llm_tool_call_duration - Histogram for tool execution duration (milliseconds)

Labels: provider, model, tool_type, tool_name, error_type (failures only)

Tool Types:

  • mcp - Model Context Protocol tools (prefix: mcp_)
  • a2a - Agent-to-Agent tools (prefix: a2a_)
  • standard_tool_use - Other function calls
# Tool call success rate by type
100 * sum(rate(llm_tool_calls_success_total[5m])) by (tool_type) / sum(rate(llm_tool_calls_total[5m])) by (tool_type)

# Average tool execution time by provider
sum(rate(llm_tool_call_duration_sum[5m])) by (provider) / sum(rate(llm_tool_call_duration_count[5m])) by (provider)

# Most frequently used tools
topk(10, sum(increase(llm_tool_calls_total[1h])) by (tool_name))
Monitoring Setup
Docker Compose Example

Complete monitoring stack with Grafana dashboards:

cd examples/docker-compose/monitoring/
cp .env.example .env  # Configure your API keys
docker compose up -d

# Access Grafana at http://localhost:3000 (admin/admin)
Kubernetes Example

Production-ready monitoring with Prometheus Operator:

cd examples/kubernetes/monitoring/
task deploy-infrastructure
task deploy-inference-gateway

# Access via port-forward or ingress
kubectl port-forward svc/grafana-service 3000:3000
Histogram Boundaries

Request and tool call duration histograms use optimized boundaries for millisecond precision:

[1, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000] ms
Grafana Dashboard

The included Grafana dashboard provides:

  • Real-time Metrics: 5-second refresh rate for immediate feedback
  • Tool Call Analytics: Success rates, duration analysis, and failure tracking
  • Provider Comparison: Performance metrics across all supported providers
  • Usage Insights: Token consumption patterns and cost analysis
  • Error Monitoring: Failed requests and tool call error classification
Prometheus Configuration

The gateway exposes metrics compatible with Prometheus scraping:

scrape_configs:
  - job_name: 'inference-gateway'
    static_configs:
      - targets: ['localhost:9464']
    scrape_interval: 5s
    scrape_timeout: 4s
Provider Detection

Metrics automatically detect providers from:

  • Model prefixes: openai/gpt-4, anthropic/claude-3-haiku, groq/llama-3-8b
  • URL parameters: ?provider=openai

Supported providers: openai, anthropic, groq, cohere, ollama, cloudflare, deepseek, google

Learn more: Docker Compose Monitoring | Kubernetes Monitoring | OpenTelemetry Documentation

Supported APIs

The gateway currently supports OpenAI, Anthropic, Groq, Cohere, Ollama, Cloudflare, DeepSeek, and Google.
Development Environment

The Inference Gateway uses Flox to provide a reproducible, cross-platform development environment. Flox eliminates the need to manually install and manage development tools, ensuring all developers have the same setup regardless of their operating system.

Prerequisites
  • Flox installed on your machine
Quick Start
  1. Clone the repository:

    git clone https://github.com/inference-gateway/inference-gateway.git
    cd inference-gateway
    
  2. Activate the development environment:

    flox activate
    

    This command will:

    • ✅ Install all required development tools with pinned versions
    • ✅ Set up Go environment variables and paths
    • ✅ Download Go dependencies automatically
    • ✅ Configure shell aliases for common commands
    • ✅ Display helpful getting started information
  3. Install git hooks (recommended):

    task pre-commit:install
    
  4. Build and test:

    task build
    task test
    
Available Tools

The Flox environment provides all necessary development tools with pinned versions for reproducibility:

Tool              Version   Purpose
Go                1.24.5    Primary language runtime
Task              3.44.0    Task runner and build automation
Docker            28.3.2    Container runtime
Docker Compose    2.38.1    Multi-container orchestration
golangci-lint     2.3.0     Go code linting
mockgen           0.5.2     Go mock generation
Node.js           22.17.0   JavaScript runtime (for npm tools)
Prettier          3.6.2     Code formatting
Spectral          6.15.0    OpenAPI/JSON Schema linting (via npx)
curl              8.14.1    HTTP client for testing
jq                1.8.1     JSON processing
kubectl           1.33.3    Kubernetes CLI
Helm              3.18.4    Kubernetes package manager
Common Commands

The environment provides convenient aliases for frequently used commands:

Alias   Command                        Description
build   task build                     Build the gateway binary
test    task test                      Run all tests
lint    task lint                      Run code linting
gen     task generate                  Generate code from schemas
spec    npx @stoplight/spectral-cli    Lint OpenAPI specs
gs      git status                     Git status
gl      git log --oneline -10          Git log (last 10 commits)
gd      git diff                       Git diff

Task Commands:

task --list                    # Show all available tasks
task build                     # Build the gateway
task run                       # Run the gateway locally
task test                      # Run tests
task lint                      # Run linting
task generate                  # Generate code from schemas
task pre-commit:install        # Install git hooks
task mcp:schema:download       # Download latest MCP schema
task a2a:schema:download       # Download latest A2A schema

Development Workflow:

# Lint OpenAPI specifications
spec lint openapi.yaml

# Format code
prettier --write .

# Generate mocks
mockgen -source=internal/provider.go -destination=mocks/provider.go

# Test with curl
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Hello"}]}'
Environment Details

Cross-Platform Support:

  • ✅ macOS (ARM64 & x86_64)
  • ✅ Linux (ARM64 & x86_64)
  • ✅ Automatic nvm compatibility (no conflicts)

Environment Variables:

  • GOPATH: $HOME/go
  • GOPROXY: https://proxy.golang.org,direct
  • GOSUMDB: sum.golang.org
  • GO111MODULE: on
  • CGO_ENABLED: 1

Path Configuration:

  • Go binaries: $GOPATH/bin
  • Project binaries: ./bin
  • npm packages: Handled automatically via npx

Shell Integration:

  • Bash and Zsh completion support
  • Custom aliases for productivity
  • Automatic tool availability detection

Reproducibility:

  • All tools use pinned versions
  • Consistent environment across team members
  • No manual tool installation required
  • Isolated from system packages

To exit the development environment, simply run:

exit
Configuration

The Inference Gateway is configured through environment variables; see the project documentation for the full list of supported variables.
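As a brief sketch combining variables that appear elsewhere in this README (the OPENAI_API_KEY name at the end is an assumed placeholder; check the documentation for the exact provider credential variables):

# Middleware toggles and connected servers/agents (from the sections above)
export MCP_ENABLE=true
export MCP_SERVERS="http://filesystem-server:3001/mcp,http://search-server:3002/mcp"
export A2A_ENABLE=true
export A2A_AGENTS="http://booking-agent:3001,http://calculator-agent:3002"

# Telemetry (Prometheus metrics on port 9464)
export TELEMETRY_ENABLE=true
export TELEMETRY_METRICS_PORT=9464

# Provider credentials -- variable name assumed for illustration
export OPENAI_API_KEY="sk-..."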

Examples

See the examples/ directory in the repository, including the Docker Compose and Kubernetes monitoring setups referenced above.
SDKs

Several SDKs are already available; see the project documentation for the current list. Additional SDKs can be generated from the OpenAPI specification.
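If you want to generate a client yourself, one possible (non-project-specific) approach is to point a standard OpenAPI generator at the spec; the target language and output directory below are arbitrary:

# Generate a Go client from the gateway's OpenAPI specification
# (openapi-generator is one common tool for this; it is not part of this repository)
npx @openapitools/openapi-generator-cli generate \
  -i openapi.yaml \
  -g go \
  -o ./sdk/go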

CLI Tool

The Inference Gateway CLI provides a powerful command-line interface for managing and interacting with the Inference Gateway. It offers tools for configuration, monitoring, and management of inference services.

Key Features
  • Status Monitoring: Check gateway health and resource usage
  • Interactive Chat: Chat with models using an interactive interface
  • Configuration Management: Manage gateway settings via YAML config
  • Project Initialization: Set up local project configurations
  • Tool Execution: LLMs can execute whitelisted commands and tools
Installation
Using Go Install
go install github.com/inference-gateway/cli@latest
Using Install Script
curl -fsSL https://raw.githubusercontent.com/inference-gateway/cli/main/install.sh | bash
Manual Download

Download the latest release from the releases page.

Quick Start
  1. Initialize project configuration:

    infer init
    
  2. Check gateway status:

    infer status
    
  3. Start an interactive chat:

    infer chat
    

For more details, see the CLI documentation.

License

This project is licensed under the MIT License.

Contributing

Found a bug, missing provider, or have a feature in mind?
You're more than welcome to submit pull requests or open issues for any fixes, improvements, or new ideas!

Please read the CONTRIBUTING.md for more details.

Motivation

My motivation is to build AI Agents without being tied to a single vendor. By avoiding vendor lock-in and supporting self-hosted LLMs from a single interface, organizations gain both portability and data privacy. You can choose to consume LLMs from a cloud provider or run them entirely offline with Ollama.

Note: This project is independently developed and is not backed by any venture capital or corporate interests, ensuring that Inference Gateway remains focused on developer needs rather than investor demands.