
Install with Envoy AI Gateway

This guide provides step-by-step instructions for integrating the vLLM Semantic Router with Envoy AI Gateway on Kubernetes for advanced traffic management and AI-specific features.

For large request bodies or streamed immediate responses from Semantic Router, also see Streamed ExtProc and immediate responses. That guide shows how to switch the ExtProc filter from BUFFERED to STREAMED request bodies and how streamed Chat Completions clients receive looper or fast_response immediate responses.

Architecture Overview

The deployment consists of:

vLLM Semantic Router: Provides intelligent request routing and semantic understanding
Envoy Gateway: Core gateway functionality and traffic management
Envoy AI Gateway: AI Gateway built on Envoy Gateway for LLM providers

Benefits of Integration

Integrating vLLM Semantic Router with Envoy AI Gateway provides enterprise-grade capabilities for production LLM deployments:

1. Hybrid Model Selection

Seamlessly route requests between cloud LLM providers (OpenAI, Anthropic, etc.) and self-hosted models.

2. Token Rate Limiting

Protect your infrastructure and control costs with fine-grained rate limiting:

Input token limits: Control request size to prevent abuse
Output token limits: Manage response generation costs
Total token limits: Set overall usage quotas per user/tenant
Time-based windows: Configure limits per second, minute, or hour

3. Model/Provider Failover

Ensure high availability with automatic failover mechanisms:

Detect unhealthy backends and route traffic to healthy instances
Support active-passive and active-active failover strategies
Degrade gracefully when primary models are unavailable

4. Traffic Splitting & Canary Testing

Deploy new models safely with progressive rollout capabilities:

A/B Testing: Split traffic between model versions to compare performance
Canary Deployments: Gradually shift traffic to new models (e.g., 5% → 25% → 50% → 100%)
Shadow Traffic: Send duplicate requests to new models without affecting production
Weight-based routing: Fine-tune traffic distribution across model variants

5. LLM Observability & Monitoring

Gain deep insights into your LLM infrastructure:

Request/Response Metrics: Track latency, throughput, token usage, and error rates
Model Performance: Monitor accuracy, quality scores, and user satisfaction
Cost Analytics: Analyze spending patterns across models and providers
Distributed Tracing: End-to-end visibility with OpenTelemetry integration
Custom Dashboards: Visualize metrics in Prometheus, Grafana, or your preferred monitoring stack

Supported LLM Providers

Provider Name	API Schema Config on AIServiceBackend	Upstream Authentication Config on BackendSecurityPolicy	Status
OpenAI	`{"name":"OpenAI","version":"v1"}`	API Key	✅
AWS Bedrock	`{"name":"AWSBedrock"}`	AWS Bedrock Credentials	✅
Azure OpenAI	`{"name":"AzureOpenAI","version":"2025-01-01-preview"}` or `{"name":"OpenAI", "version": "openai/v1"}`	Azure Credentials or Azure API Key	✅
Google Gemini on AI Studio	`{"name":"OpenAI","version":"v1beta/openai"}`	API Key	✅
Google Vertex AI	`{"name":"GCPVertexAI"}`	GCP Credentials	✅
Anthropic on GCP Vertex AI	`{"name":"GCPAnthropic", "version":"vertex-2023-10-16"}`	GCP Credentials	✅
Groq	`{"name":"OpenAI","version":"openai/v1"}`	API Key	✅
Grok	`{"name":"OpenAI","version":"v1"}`	API Key	✅
Together AI	`{"name":"OpenAI","version":"v1"}`	API Key	✅
Cohere	`{"name":"Cohere","version":"v2"}` or `{"name":"OpenAI","version":"v1"}`	API Key	✅
Mistral	`{"name":"OpenAI","version":"v1"}`	API Key	✅
DeepInfra	`{"name":"OpenAI","version":"v1/openai"}`	API Key	✅
DeepSeek	`{"name":"OpenAI","version":"v1"}`	API Key	✅
Hunyuan	`{"name":"OpenAI","version":"v1"}`	API Key	✅
MiniMax	`{"name":"OpenAI","version":"v1"}`	API Key	✅
Tencent LLM Knowledge Engine	`{"name":"OpenAI","version":"v1"}`	API Key	✅
Tetrate Agent Router Service (TARS)	`{"name":"OpenAI","version":"v1"}`	API Key	✅
SambaNova	`{"name":"OpenAI","version":"v1"}`	API Key	✅
Anthropic	`{"name":"Anthropic"}`	Anthropic API Key	✅
Self-hosted models	`{"name":"OpenAI","version":"v1"}`	N/A	✅

Prerequisites

Before starting, ensure you have the following tools installed:

kind - Kubernetes in Docker (Optional)
kubectl - Kubernetes CLI
Helm - Package manager for Kubernetes
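Before continuing, it can help to confirm that the tools above are on your PATH. The following is a minimal sketch; the function name `check_tools` is illustrative and is not part of kind, kubectl, or Helm.

```shell
# check_tools: report which of the given commands are missing from PATH.
# Returns 0 when every command is found, 1 otherwise.
check_tools() {
  local tool missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Run `check_tools kubectl helm kind` to check all three; drop `kind` if you plan to skip the optional Step 1.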

Step 1: Create Kind Cluster (Optional)

Create a local Kubernetes cluster for the semantic router workload:

kind create cluster --name semantic-router-cluster

# Verify cluster is ready
kubectl wait --for=condition=Ready nodes --all --timeout=300s
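If you re-run this guide, `kind create cluster` fails when a cluster with the same name already exists. The sketch below creates the cluster only when it is missing, based on the output of `kind get clusters`; the helper name `ensure_kind_cluster` is hypothetical.

```shell
# ensure_kind_cluster: create the named kind cluster only if it does not exist yet.
ensure_kind_cluster() {
  local name="${1:-semantic-router-cluster}"
  # `kind get clusters` prints one cluster name per line
  if kind get clusters 2>/dev/null | grep -qxF "$name"; then
    echo "cluster '$name' already exists"
  else
    kind create cluster --name "$name"
  fi
}
```

Calling `ensure_kind_cluster` with no argument uses the cluster name from this step.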

Step 2: Deploy vLLM Semantic Router

Deploy the semantic router service with all required components using Helm:

# Install with custom values from GHCR OCI registry
# (Optional) If you use a registry mirror/proxy, append: --set global.imageRegistry=<your-registry>
helm install semantic-router oci://ghcr.io/vllm-project/charts/semantic-router \
  --version v0.0.0-latest \
  --namespace vllm-semantic-router-system \
  --create-namespace \
  -f https://raw.githubusercontent.com/vllm-project/semantic-router/refs/heads/main/deploy/kubernetes/ai-gateway/semantic-router-values/values.yaml

# Wait for deployment to be ready (this may take several minutes for model downloads)
kubectl wait --for=condition=Available deployment/semantic-router -n vllm-semantic-router-system --timeout=600s

# Verify deployment status
kubectl get pods -n vllm-semantic-router-system

Note: The values file contains the configuration for the semantic router, including model settings, categories, and routing rules. To customize it, download values.yaml from the URL passed to -f above, edit your local copy, and pass that file to -f instead.
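One way to work with a customized values file is sketched below. It downloads the file once, keeps any local edits on later runs, and applies it with `helm upgrade -i` so the same command serves both first install and re-apply. The helper name `install_with_local_values` and the default local filename are assumptions made for illustration; the chart, version, namespace, and URL are the ones used in this step.

```shell
VALUES_URL="https://raw.githubusercontent.com/vllm-project/semantic-router/refs/heads/main/deploy/kubernetes/ai-gateway/semantic-router-values/values.yaml"

# install_with_local_values: install or upgrade the chart from a local values file.
install_with_local_values() {
  local values_file="${1:-values.yaml}"
  # Download only if no local copy exists, so local edits are not overwritten
  [ -f "$values_file" ] || curl -fsSL -o "$values_file" "$VALUES_URL"
  helm upgrade -i semantic-router oci://ghcr.io/vllm-project/charts/semantic-router \
    --version v0.0.0-latest \
    --namespace vllm-semantic-router-system \
    --create-namespace \
    -f "$values_file"
}
```

After editing the local file, run `install_with_local_values values.yaml` again to roll out the change.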

Step 3: Install Envoy Gateway

Install the core Envoy Gateway for traffic management:

# Install Envoy Gateway using Helm
helm upgrade -i eg oci://docker.io/envoyproxy/gateway-helm \
  --version v0.0.0-latest \
  --namespace envoy-gateway-system \
  --create-namespace \
  -f https://raw.githubusercontent.com/envoyproxy/ai-gateway/main/manifests/envoy-gateway-values.yaml

kubectl wait --timeout=2m -n envoy-gateway-system deployment/envoy-gateway --for=condition=Available

Step 4: Install Envoy AI Gateway

Install the AI-specific extensions for inference workloads:

# Install Envoy AI Gateway using Helm
helm upgrade -i aieg oci://docker.io/envoyproxy/ai-gateway-helm \
  --version v0.0.0-latest \
  --namespace envoy-ai-gateway-system \
  --create-namespace

# Install Envoy AI Gateway CRDs
helm upgrade -i aieg-crd oci://docker.io/envoyproxy/ai-gateway-crds-helm \
  --version v0.0.0-latest \
  --namespace envoy-ai-gateway-system

# Wait for AI Gateway Controller to be ready
kubectl wait --timeout=300s -n envoy-ai-gateway-system deployment/ai-gateway-controller --for=condition=Available
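To confirm that the CRD chart was applied, you can list the matching CustomResourceDefinitions. This sketch assumes the AI Gateway CRDs (such as AIServiceBackend and BackendSecurityPolicy) belong to the `aigateway.envoyproxy.io` API group; adjust the pattern if your version uses a different group. The helper name `list_aigw_crds` is hypothetical.

```shell
# list_aigw_crds: print the installed CRDs whose name contains aigateway.envoyproxy.io.
# Returns non-zero if none are found (this is grep's exit status).
list_aigw_crds() {
  kubectl get crd -o name | grep -F 'aigateway.envoyproxy.io'
}
```

An empty result suggests the `aieg-crd` release did not install correctly.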

Step 5: Deploy Demo LLM

Create a demo LLM to serve as the backend for the semantic router:

# Deploy demo LLM
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/refs/heads/main/deploy/kubernetes/ai-gateway/aigw-resources/base-model.yaml

Step 6: Create Gateway API Resources

Create the necessary Gateway API resources for the AI gateway:

kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/refs/heads/main/deploy/kubernetes/ai-gateway/aigw-resources/gwapi-resources.yaml
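Applying the manifests returns before the gateway is ready to serve traffic. A minimal sketch for waiting on it follows, assuming the Gateway is named `semantic-router` in the `default` namespace (the values used in the Troubleshooting section) and that it reports the standard Gateway API `Programmed` condition. The helper name `wait_for_gateway` is illustrative.

```shell
# wait_for_gateway: block until the Gateway reports the Programmed condition.
# Defaults match the Gateway name and namespace used in the Troubleshooting section.
wait_for_gateway() {
  local name="${1:-semantic-router}" namespace="${2:-default}"
  kubectl wait --for=condition=Programmed "gateway/$name" \
    -n "$namespace" --timeout=300s
}
```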

Testing the Deployment

Method 1: Port Forwarding (Recommended for Local Testing)

Set up port forwarding to access the gateway locally:

# Get the Envoy service name
export ENVOY_SERVICE=$(kubectl get svc -n envoy-gateway-system \
  --selector=gateway.envoyproxy.io/owning-gateway-namespace=default,gateway.envoyproxy.io/owning-gateway-name=semantic-router \
  -o jsonpath='{.items[0].metadata.name}')

# Forward local port 8080 to port 80 of the Envoy service (runs in the foreground)
kubectl port-forward -n envoy-gateway-system "svc/$ENVOY_SERVICE" 8080:80
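When scripting the test, the port-forward is usually started in the background, and the script then has to wait until the local port accepts connections. The sketch below polls with curl; the helper name `wait_for_http` and its default of 30 attempts are assumptions, not part of the project.

```shell
# wait_for_http: poll a URL until it returns any HTTP response.
# Arguments: URL, maximum number of attempts (default 30, one second apart).
wait_for_http() {
  local url="$1" attempts="${2:-30}" i=0
  while [ "$i" -lt "$attempts" ]; do
    # Without -f, curl exits 0 on any HTTP status (even 404), which is enough
    # to show that the port-forward is accepting connections
    if curl -s -o /dev/null --max-time 2 "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}
```

For example, append `&` to the port-forward command, record the process ID with `PF_PID=$!`, call `wait_for_http http://localhost:8080/`, and run `kill "$PF_PID"` once testing is finished.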

Send Test Requests

With the port-forward still running, open a second terminal and test the inference endpoint:

# Test math domain chat completions endpoint
curl -i -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MoM",
    "messages": [
      {"role": "user", "content": "What is the derivative of f(x) = x^3?"}
    ]
  }'
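To try prompts from several domains without retyping the request body, the curl call above can be wrapped in two small helpers. This is a sketch under stated assumptions: the names `chat_payload`, `ask`, and the `GATEWAY_URL` variable are hypothetical, and the prompt is inserted into the JSON verbatim, so it must not contain double quotes, backslashes, or newlines.

```shell
GATEWAY_URL="${GATEWAY_URL:-http://localhost:8080}"

# chat_payload: print a Chat Completions request body for a single user prompt.
# The prompt is not JSON-escaped; keep it free of quotes, backslashes, and newlines.
chat_payload() {
  printf '{"model": "MoM", "messages": [{"role": "user", "content": "%s"}]}' "$1"
}

# ask: send one prompt to the gateway and print the raw JSON response.
ask() {
  curl -sS -X POST "$GATEWAY_URL/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$(chat_payload "$1")"
}
```

For example, `ask "What is the derivative of f(x) = x^3?"` repeats the request shown above.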

Troubleshooting

Common Issues

Gateway not accessible:

# Check gateway status
kubectl get gateway semantic-router -n default

# Check Envoy service
kubectl get svc -n envoy-gateway-system

AI Gateway controller not ready:

# Check AI gateway controller logs
kubectl logs -n envoy-ai-gateway-system deployment/ai-gateway-controller

# Check controller status
kubectl get deployment -n envoy-ai-gateway-system

Semantic router not responding:

# Check semantic router pod status
kubectl get pods -n vllm-semantic-router-system

# Check semantic router logs
kubectl logs -n vllm-semantic-router-system deployment/semantic-router
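When it is unclear which component is failing, a single pass over all three namespaces used in this guide can narrow it down. The helper below is a hypothetical convenience wrapper around standard kubectl commands, not a tool shipped with the project.

```shell
# diagnose: print pod status and the most recent events for each namespace
# used in this guide.
diagnose() {
  local ns
  for ns in vllm-semantic-router-system envoy-gateway-system envoy-ai-gateway-system; do
    echo "== $ns =="
    kubectl get pods -n "$ns" -o wide
    # Events are sorted oldest first, so tail keeps the latest ones
    kubectl get events -n "$ns" --sort-by=.lastTimestamp | tail -n 10
  done
}
```

Pods stuck in Pending or CrashLoopBackOff, together with the events listed beneath them, usually point to the namespace that needs a closer look.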

Cleanup

To remove the entire deployment:

# Remove Gateway API resources and Demo LLM
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/refs/heads/main/deploy/kubernetes/ai-gateway/aigw-resources/gwapi-resources.yaml
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/refs/heads/main/deploy/kubernetes/ai-gateway/aigw-resources/base-model.yaml

# Remove semantic router
helm uninstall semantic-router -n vllm-semantic-router-system

# Remove AI gateway
helm uninstall aieg -n envoy-ai-gateway-system
helm uninstall aieg-crd -n envoy-ai-gateway-system

# Remove Envoy gateway
helm uninstall eg -n envoy-gateway-system

# Delete kind cluster (optional)
kind delete cluster --name semantic-router-cluster
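If you keep the cluster (for example, a shared cluster rather than the kind cluster from Step 1), note that `helm uninstall` does not delete the namespaces that `--create-namespace` created. The sketch below removes them; the helper name is hypothetical. Deleting a namespace removes everything still inside it, so run this only if nothing else uses these namespaces.

```shell
# delete_guide_namespaces: remove the namespaces created with --create-namespace.
# --ignore-not-found makes the command safe to re-run.
delete_guide_namespaces() {
  kubectl delete namespace \
    vllm-semantic-router-system \
    envoy-ai-gateway-system \
    envoy-gateway-system \
    --ignore-not-found
}
```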

Next Steps

Configure custom routing rules in the AI Gateway
Set up monitoring and observability
Implement authentication and authorization
Scale the semantic router deployment for production workloads
