Install with Envoy AI Gateway
This guide provides step-by-step instructions for integrating the vLLM Semantic Router with Envoy AI Gateway on Kubernetes for advanced traffic management and AI-specific features.
For large request bodies or streamed immediate responses from Semantic Router, also see Streamed ExtProc and immediate responses. That guide shows how to switch the ExtProc filter from BUFFERED to STREAMED request bodies and how streamed Chat Completions clients receive looper or fast_response immediate responses.
Architecture Overview
The deployment consists of:
- vLLM Semantic Router: Provides intelligent request routing and semantic understanding
- Envoy Gateway: Core gateway functionality and traffic management
- Envoy AI Gateway: Extends Envoy Gateway with AI-specific features for LLM providers
Benefits of Integration
Integrating vLLM Semantic Router with Envoy AI Gateway provides enterprise-grade capabilities for production LLM deployments:
1. Hybrid Model Selection
Seamlessly route requests between cloud LLM providers (OpenAI, Anthropic, etc.) and self-hosted models.
2. Token Rate Limiting
Protect your infrastructure and control costs with fine-grained rate limiting:
- Input token limits: Control request size to prevent abuse
- Output token limits: Manage response generation costs
- Total token limits: Set overall usage quotas per user/tenant
- Time-based windows: Configure limits per second, minute, or hour
3. Model/Provider Failover
Ensure high availability with automatic failover mechanisms:
- Detect unhealthy backends and route traffic to healthy instances
- Support for active-passive and active-active failover strategies
- Graceful degradation when primary models are unavailable
4. Traffic Splitting & Canary Testing
Deploy new models safely with progressive rollout capabilities:
- A/B Testing: Split traffic between model versions to compare performance
- Canary Deployments: Gradually shift traffic to new models (e.g., 5% → 25% → 50% → 100%)
- Shadow Traffic: Send duplicate requests to new models without affecting production
- Weight-based routing: Fine-tune traffic distribution across model variants
5. LLM Observability & Monitoring
Gain deep insights into your LLM infrastructure:
- Request/Response Metrics: Track latency, throughput, token usage, and error rates
- Model Performance: Monitor accuracy, quality scores, and user satisfaction
- Cost Analytics: Analyze spending patterns across models and providers
- Distributed Tracing: End-to-end visibility with OpenTelemetry integration
- Custom Dashboards: Visualize metrics in Prometheus, Grafana, or your preferred monitoring stack
Supported LLM Providers
| Provider Name | API Schema Config on AIServiceBackend | Upstream Authentication Config on BackendSecurityPolicy | Status |
|---|---|---|---|
| OpenAI | {"name":"OpenAI","version":"v1"} | API Key | ✅ |
| AWS Bedrock | {"name":"AWSBedrock"} | AWS Bedrock Credentials | ✅ |
| Azure OpenAI | {"name":"AzureOpenAI","version":"2025-01-01-preview"} or {"name":"OpenAI","version":"openai/v1"} | Azure Credentials or Azure API Key | ✅ |
| Google Gemini on AI Studio | {"name":"OpenAI","version":"v1beta/openai"} | API Key | ✅ |
| Google Vertex AI | {"name":"GCPVertexAI"} | GCP Credentials | ✅ |
| Anthropic on GCP Vertex AI | {"name":"GCPAnthropic","version":"vertex-2023-10-16"} | GCP Credentials | ✅ |
| Groq | {"name":"OpenAI","version":"openai/v1"} | API Key | ✅ |
| Grok | {"name":"OpenAI","version":"v1"} | API Key | ✅ |
| Together AI | {"name":"OpenAI","version":"v1"} | API Key | ✅ |
| Cohere | {"name":"Cohere","version":"v2"} or {"name":"OpenAI","version":"v1"} | API Key | ✅ |
| Mistral | {"name":"OpenAI","version":"v1"} | API Key | ✅ |
| DeepInfra | {"name":"OpenAI","version":"v1/openai"} | API Key | ✅ |
| DeepSeek | {"name":"OpenAI","version":"v1"} | API Key | ✅ |
| Hunyuan | {"name":"OpenAI","version":"v1"} | API Key | ✅ |
| MiniMax | {"name":"OpenAI","version":"v1"} | API Key | ✅ |
| Tencent LLM Knowledge Engine | {"name":"OpenAI","version":"v1"} | API Key | ✅ |
| Tetrate Agent Router Service (TARS) | {"name":"OpenAI","version":"v1"} | API Key | ✅ |
| SambaNova | {"name":"OpenAI","version":"v1"} | API Key | ✅ |
| Anthropic | {"name":"Anthropic"} | Anthropic API Key | ✅ |
| Self-hosted models | {"name":"OpenAI","version":"v1"} | N/A | ✅ |
Prerequisites
Before starting, ensure you have the following tools installed:
- kind - Kubernetes in Docker (Optional)
- kubectl - Kubernetes CLI
- Helm - Package manager for Kubernetes
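Before continuing, it can help to confirm that the tools are actually on your PATH. The following is a minimal sketch in POSIX shell; the helper name `missing_tools` is made up for this guide and is not part of any of the tools listed above.

```shell
# Print the name of every given tool that is not on PATH.
# Returns non-zero if at least one tool is missing.
missing_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      missing=1
    fi
  done
  return "$missing"
}

# Usage for this guide (kind is only needed if you follow Step 1):
#   missing_tools kubectl helm kind || echo "install the tools printed above first"
```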
Step 1: Create Kind Cluster (Optional)
Create a local Kubernetes cluster to host the semantic router workload:
kind create cluster --name semantic-router-cluster
# Verify cluster is ready
kubectl wait --for=condition=Ready nodes --all --timeout=300s
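If you rerun this guide, `kind create cluster` fails when a cluster with the same name already exists. A small wrapper, sketched below, creates the cluster only when `kind get clusters` does not list it yet. The function name `ensure_kind_cluster` is hypothetical.

```shell
CLUSTER_NAME=semantic-router-cluster

# Create the kind cluster only if no cluster with this name is listed.
ensure_kind_cluster() {
  name=$1
  if kind get clusters 2>/dev/null | grep -qxF -- "$name"; then
    echo "kind cluster '$name' already exists, skipping creation"
  else
    kind create cluster --name "$name"
  fi
}

# Usage:
#   ensure_kind_cluster "$CLUSTER_NAME"
```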
Step 2: Deploy vLLM Semantic Router
Deploy the semantic router service with all required components using Helm:
# Install the chart from the GHCR OCI registry, using the AI Gateway values file
# (Optional) If you use a registry mirror/proxy, append: --set global.imageRegistry=<your-registry>
helm install semantic-router oci://ghcr.io/vllm-project/charts/semantic-router \
--version v0.0.0-latest \
--namespace vllm-semantic-router-system \
--create-namespace \
-f https://raw.githubusercontent.com/vllm-project/semantic-router/refs/heads/main/deploy/kubernetes/ai-gateway/semantic-router-values/values.yaml
# Wait for deployment to be ready (this may take several minutes for model downloads)
kubectl wait --for=condition=Available deployment/semantic-router -n vllm-semantic-router-system --timeout=600s
# Verify deployment status
kubectl get pods -n vllm-semantic-router-system
Note: The values file contains the configuration for the semantic router, including model settings, categories, and routing rules. To customize it, download values.yaml, edit your local copy, and pass that copy to Helm with -f.
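One way to work with a customized values file is to download it first and install from the local copy. The sketch below reuses the chart reference and flags from the install command above; the helper names `fetch_values` and `install_with_values` and the local file name are illustrative assumptions.

```shell
VALUES_URL=https://raw.githubusercontent.com/vllm-project/semantic-router/refs/heads/main/deploy/kubernetes/ai-gateway/semantic-router-values/values.yaml

# Download the values file to a local path of your choice.
fetch_values() {
  curl -fsSL -o "$1" "$VALUES_URL"
}

# Install the chart with a local (possibly edited) values file.
install_with_values() {
  helm install semantic-router oci://ghcr.io/vllm-project/charts/semantic-router \
    --version v0.0.0-latest \
    --namespace vllm-semantic-router-system \
    --create-namespace \
    -f "$1"
}

# Usage:
#   fetch_values ./semantic-router-values.yaml
#   "${EDITOR:-vi}" ./semantic-router-values.yaml
#   install_with_values ./semantic-router-values.yaml
```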
Step 3: Install Envoy Gateway
Install the core Envoy Gateway for traffic management:
# Install Envoy Gateway using Helm
helm upgrade -i eg oci://docker.io/envoyproxy/gateway-helm \
--version v0.0.0-latest \
--namespace envoy-gateway-system \
--create-namespace \
-f https://raw.githubusercontent.com/envoyproxy/ai-gateway/main/manifests/envoy-gateway-values.yaml
kubectl wait --timeout=2m -n envoy-gateway-system deployment/envoy-gateway --for=condition=Available
Step 4: Install Envoy AI Gateway
Install the AI-specific extensions for inference workloads:
# Install the Envoy AI Gateway CRDs first so the controller finds them at startup
helm upgrade -i aieg-crd oci://docker.io/envoyproxy/ai-gateway-crds-helm \
--version v0.0.0-latest \
--namespace envoy-ai-gateway-system \
--create-namespace
# Install the Envoy AI Gateway controller using Helm
helm upgrade -i aieg oci://docker.io/envoyproxy/ai-gateway-helm \
--version v0.0.0-latest \
--namespace envoy-ai-gateway-system \
--create-namespace
# Wait for AI Gateway Controller to be ready
kubectl wait --timeout=300s -n envoy-ai-gateway-system deployment/ai-gateway-controller --for=condition=Available
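At this point three controllers should be running. The sketch below rechecks all of them in one pass, using the namespaces and deployment names from Steps 2 to 4. The variable `COMPONENTS` and the function `wait_for_components` are names invented for this example, and the loop relies on sh/bash word splitting.

```shell
# namespace/deployment pairs installed in Steps 2-4
COMPONENTS="
vllm-semantic-router-system/semantic-router
envoy-gateway-system/envoy-gateway
envoy-ai-gateway-system/ai-gateway-controller
"

# Wait for each deployment in turn; report the ones that are not
# Available within the timeout and return non-zero if any failed.
wait_for_components() {
  timeout=${1:-300s}
  failed=0
  for pair in $COMPONENTS; do
    ns=${pair%%/*}
    deploy=${pair#*/}
    if ! kubectl wait --for=condition=Available "deployment/$deploy" -n "$ns" --timeout="$timeout"; then
      echo "not ready: $deploy (namespace $ns)" >&2
      failed=1
    fi
  done
  return "$failed"
}

# Usage:
#   wait_for_components 300s
```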
Step 5: Deploy Demo LLM
Create a demo LLM to serve as the backend for the semantic router:
# Deploy demo LLM
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/refs/heads/main/deploy/kubernetes/ai-gateway/aigw-resources/base-model.yaml
Step 6: Create Gateway API Resources
Create the necessary Gateway API resources for the AI gateway:
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/refs/heads/main/deploy/kubernetes/ai-gateway/aigw-resources/gwapi-resources.yaml
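Before testing, you may want to confirm that the controller has picked up the Gateway. The sketch below waits on a Gateway API status condition, assuming the Gateway is named `semantic-router` in the `default` namespace, as the commands in the testing and troubleshooting sections of this guide do. It defaults to `Accepted` because, on clusters without a LoadBalancer implementation (a plain kind cluster, for example), `Programmed` may stay false while no address is assigned. The function name `wait_for_gateway` is hypothetical.

```shell
# Block until the Gateway reports the given Gateway API condition.
# Arguments (all optional): condition, gateway name, namespace.
wait_for_gateway() {
  condition=${1:-Accepted}
  name=${2:-semantic-router}
  ns=${3:-default}
  kubectl wait --for="condition=$condition" "gateway/$name" -n "$ns" --timeout=300s
}

# Usage:
#   wait_for_gateway              # waits for Accepted
#   wait_for_gateway Programmed   # stricter; needs an assigned address
```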
Testing the Deployment
Method 1: Port Forwarding (Recommended for Local Testing)
Set up port forwarding to access the gateway locally:
# Get the Envoy service name
export ENVOY_SERVICE=$(kubectl get svc -n envoy-gateway-system \
--selector=gateway.envoyproxy.io/owning-gateway-namespace=default,gateway.envoyproxy.io/owning-gateway-name=semantic-router \
-o jsonpath='{.items[0].metadata.name}')
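# Forward local port 8080 to port 80 of the Envoy service.
# This command keeps running; send test requests from a second terminal.
kubectl port-forward -n envoy-gateway-system "svc/$ENVOY_SERVICE" 8080:80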
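In scripts, the port-forward is usually started in the background, and requests sent too early fail with a refused connection. The following sketch polls the local port until something answers. `wait_for_http` is a made-up helper; it treats any HTTP response, including an error status, as "the port is reachable".

```shell
# Poll a URL once per second until curl gets any HTTP response,
# or give up after the given number of attempts (default 30).
wait_for_http() {
  url=$1
  attempts=${2:-30}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -s -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Usage:
#   kubectl port-forward -n envoy-gateway-system "svc/$ENVOY_SERVICE" 8080:80 &
#   PF_PID=$!
#   wait_for_http http://localhost:8080/ 30 || echo "gateway not reachable"
#   ... send requests ...
#   kill "$PF_PID"
```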
Send Test Requests
Once the gateway is accessible, test the inference endpoint:
# Test math domain chat completions endpoint
curl -i -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MoM",
"messages": [
{"role": "user", "content": "What is the derivative of f(x) = x^3?"}
]
}'
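The request above uses the model name "MoM", so the `model` field of the response is a quick way to see which backend model actually answered. The sketch below extracts that field with `tr` and `sed` only. It assumes a standard Chat Completions JSON body with a single `"model"` key and is not a general JSON parser; the function name `response_model` is hypothetical.

```shell
# Read a Chat Completions JSON body on stdin and print the value
# of its "model" field (the last one, if the key appears repeatedly).
response_model() {
  tr -d '\n' | sed -n 's/.*"model"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Usage (note: no -i, so only the body reaches the helper):
#   curl -s -X POST http://localhost:8080/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d '{"model": "MoM", "messages": [{"role": "user", "content": "What is the derivative of f(x) = x^3?"}]}' \
#     | response_model
```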
Troubleshooting
Common Issues
Gateway not accessible:
# Check gateway status
kubectl get gateway semantic-router -n default
# Check Envoy service
kubectl get svc -n envoy-gateway-system
AI Gateway controller not ready:
# Check AI gateway controller logs
kubectl logs -n envoy-ai-gateway-system deployment/ai-gateway-controller
# Check controller status
kubectl get deployment -n envoy-ai-gateway-system
Semantic router not responding:
# Check semantic router pod status
kubectl get pods -n vllm-semantic-router-system
# Check semantic router logs
kubectl logs -n vllm-semantic-router-system deployment/semantic-router
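When it is unclear which component is at fault, it can be convenient to collect the same information for each one. The sketch below gathers pod status, recent events, and the last log lines for one deployment; `diagnose` is a helper name invented here, and the line counts are arbitrary.

```shell
# Print pod status, recent events and recent logs for one deployment.
diagnose() {
  ns=$1
  deploy=$2
  echo "== $ns/$deploy =="
  kubectl get pods -n "$ns" -o wide
  kubectl get events -n "$ns" --sort-by=.lastTimestamp | tail -n 20
  kubectl logs -n "$ns" "deployment/$deploy" --tail=50 --all-containers
}

# Usage:
#   diagnose vllm-semantic-router-system semantic-router
#   diagnose envoy-gateway-system envoy-gateway
#   diagnose envoy-ai-gateway-system ai-gateway-controller
```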
Cleanup
To remove the entire deployment:
# Remove Gateway API resources and Demo LLM
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/refs/heads/main/deploy/kubernetes/ai-gateway/aigw-resources/gwapi-resources.yaml
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/refs/heads/main/deploy/kubernetes/ai-gateway/aigw-resources/base-model.yaml
# Remove semantic router
helm uninstall semantic-router -n vllm-semantic-router-system
# Remove AI gateway
helm uninstall aieg -n envoy-ai-gateway-system
helm uninstall aieg-crd -n envoy-ai-gateway-system
# Remove Envoy gateway
helm uninstall eg -n envoy-gateway-system
# Delete kind cluster (optional)
kind delete cluster --name semantic-router-cluster
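If you keep the cluster instead of deleting it, note that `helm uninstall` does not remove the namespaces that `--create-namespace` created. The sketch below deletes the three namespaces used in this guide; the names `NAMESPACES` and `cleanup_namespaces` are illustrative. Run it only after the Helm releases above have been uninstalled.

```shell
NAMESPACES="vllm-semantic-router-system envoy-ai-gateway-system envoy-gateway-system"

# Delete each namespace created during installation; namespaces
# that are already gone are skipped without an error.
cleanup_namespaces() {
  for ns in $NAMESPACES; do
    kubectl delete namespace "$ns" --ignore-not-found
  done
}

# Usage:
#   cleanup_namespaces
```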
Next Steps
- Configure custom routing rules in the AI Gateway
- Set up monitoring and observability
- Implement authentication and authorization
- Scale the semantic router deployment for production workloads