Install with Envoy AI Gateway

This guide provides step-by-step instructions for integrating the vLLM Semantic Router with Envoy AI Gateway on Kubernetes for advanced traffic management and AI-specific features.

Version: Latest

If you expect large request bodies or streamed immediate responses from Semantic Router, also see the guide "Streamed ExtProc and immediate responses". It shows how to switch the ExtProc filter from BUFFERED to STREAMED request bodies, and how streamed Chat Completions clients receive looper or fast_response immediate responses.

Architecture Overview

The deployment consists of:

  • vLLM Semantic Router: Provides intelligent request routing and semantic understanding
  • Envoy Gateway: Core gateway functionality and traffic management
  • Envoy AI Gateway: Extends Envoy Gateway with AI-specific features for routing to LLM providers

Benefits of Integration

Integrating vLLM Semantic Router with Envoy AI Gateway provides enterprise-grade capabilities for production LLM deployments:

1. Hybrid Model Selection

Seamlessly route requests between cloud LLM providers (OpenAI, Anthropic, etc.) and self-hosted models.

2. Token Rate Limiting

Protect your infrastructure and control costs with fine-grained rate limiting:

  • Input token limits: Control request size to prevent abuse
  • Output token limits: Manage response generation costs
  • Total token limits: Set overall usage quotas per user/tenant
  • Time-based windows: Configure limits per second, minute, or hour

3. Model/Provider Failover

Ensure high availability with automatic failover mechanisms:

  • Detect unhealthy backends and route traffic to healthy instances
  • Support for active-passive and active-active failover strategies
  • Graceful degradation when primary models are unavailable

4. Traffic Splitting & Canary Testing

Deploy new models safely with progressive rollout capabilities:

  • A/B Testing: Split traffic between model versions to compare performance
  • Canary Deployments: Gradually shift traffic to new models (e.g., 5% → 25% → 50% → 100%)
  • Shadow Traffic: Send duplicate requests to new models without affecting production
  • Weight-based routing: Fine-tune traffic distribution across model variants
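
As a rough illustration of weight-based routing, the sketch below prints a Gateway API style backendRefs fragment for one canary stage. The backend names are hypothetical, and where the fragment nests inside your route resource is an assumption to check against the Envoy AI Gateway API reference.

```shell
# Print a backendRefs fragment that sends <percent>% of traffic to the canary
# backend and the remainder to the stable backend.
# Backend names are placeholders chosen for illustration.
canary_backend_refs() {
  stable="$1"; canary="$2"; percent="$3"
  if [ "$percent" -lt 0 ] || [ "$percent" -gt 100 ]; then
    echo "canary percent must be between 0 and 100" >&2
    return 1
  fi
  cat <<EOF
backendRefs:
  - name: $stable
    weight: $((100 - percent))
  - name: $canary
    weight: $percent
EOF
}

# Usage: render the 25% stage of a 5 -> 25 -> 50 -> 100 rollout.
# canary_backend_refs model-v1 model-v2 25
```

Raising the percentage in steps and re-applying the route after each step gives the progressive rollout described above.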

5. LLM Observability & Monitoring

Gain deep insights into your LLM infrastructure:

  • Request/Response Metrics: Track latency, throughput, token usage, and error rates
  • Model Performance: Monitor accuracy, quality scores, and user satisfaction
  • Cost Analytics: Analyze spending patterns across models and providers
  • Distributed Tracing: End-to-end visibility with OpenTelemetry integration
  • Custom Dashboards: Visualize metrics in Prometheus, Grafana, or your preferred monitoring stack

Supported LLM Providers

| Provider Name | API Schema Config on AIServiceBackend | Upstream Authentication Config on BackendSecurityPolicy |
|---|---|---|
| OpenAI | {"name":"OpenAI","version":"v1"} | API Key |
| AWS Bedrock | {"name":"AWSBedrock"} | AWS Bedrock Credentials |
| Azure OpenAI | {"name":"AzureOpenAI","version":"2025-01-01-preview"} or {"name":"OpenAI","version":"openai/v1"} | Azure Credentials or Azure API Key |
| Google Gemini on AI Studio | {"name":"OpenAI","version":"v1beta/openai"} | API Key |
| Google Vertex AI | {"name":"GCPVertexAI"} | GCP Credentials |
| Anthropic on GCP Vertex AI | {"name":"GCPAnthropic","version":"vertex-2023-10-16"} | GCP Credentials |
| Groq | {"name":"OpenAI","version":"openai/v1"} | API Key |
| Grok | {"name":"OpenAI","version":"v1"} | API Key |
| Together AI | {"name":"OpenAI","version":"v1"} | API Key |
| Cohere | {"name":"Cohere","version":"v2"} or {"name":"OpenAI","version":"v1"} | API Key |
| Mistral | {"name":"OpenAI","version":"v1"} | API Key |
| DeepInfra | {"name":"OpenAI","version":"v1/openai"} | API Key |
| DeepSeek | {"name":"OpenAI","version":"v1"} | API Key |
| Hunyuan | {"name":"OpenAI","version":"v1"} | API Key |
| MiniMax | {"name":"OpenAI","version":"v1"} | API Key |
| Tencent LLM Knowledge Engine | {"name":"OpenAI","version":"v1"} | API Key |
| Tetrate Agent Router Service (TARS) | {"name":"OpenAI","version":"v1"} | API Key |
| SambaNova | {"name":"OpenAI","version":"v1"} | API Key |
| Anthropic | {"name":"Anthropic"} | Anthropic API Key |
| Self-hosted-models | {"name":"OpenAI","version":"v1"} | N/A |
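
To show how a schema entry from the table might end up in a manifest, here is a minimal sketch that renders an AIServiceBackend from a schema name and optional version. The apiVersion, the backendRef shape, and the convention of pointing at a Backend with the same name are assumptions, not taken from this guide; verify them against the Envoy AI Gateway API reference for your version.

```shell
# Render an AIServiceBackend manifest for a schema name and optional version.
# Assumptions: the v1alpha1 API group/version and a Backend resource that
# shares the same name as the AIServiceBackend.
render_ai_service_backend() {
  name="$1"; schema_name="$2"; schema_version="${3:-}"
  cat <<EOF
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIServiceBackend
metadata:
  name: $name
spec:
  schema:
    name: $schema_name
EOF
  # Some schemas in the table (for example Anthropic) carry no version.
  if [ -n "$schema_version" ]; then
    printf '    version: "%s"\n' "$schema_version"
  fi
  cat <<EOF
  backendRef:
    name: $name
    kind: Backend
    group: gateway.envoyproxy.io
EOF
}

# Usage (resource name is a placeholder):
# render_ai_service_backend my-openai OpenAI v1 | kubectl apply -f -
```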

Prerequisites

Before starting, ensure you have the following tools installed:

  • kind - Kubernetes in Docker (Optional)
  • kubectl - Kubernetes CLI
  • Helm - Package manager for Kubernetes
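
A quick way to confirm the tools are on your PATH before starting is sketched below. The function names are illustrative; kind is left out of the required list because Step 1 is optional.

```shell
# Print the name of every listed tool that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Fail early if a required tool is missing.
check_prerequisites() {
  missing="$(missing_tools kubectl helm)"
  if [ -n "$missing" ]; then
    echo "Missing required tools:" $missing >&2
    return 1
  fi
  echo "All required tools found."
}

# Usage:
# check_prerequisites && missing_tools kind
```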

Step 1: Create Kind Cluster (Optional)

Create a local Kubernetes cluster for the semantic router workload:

kind create cluster --name semantic-router-cluster

# Verify cluster is ready
kubectl wait --for=condition=Ready nodes --all --timeout=300s

Step 2: Deploy vLLM Semantic Router

Deploy the semantic router service with all required components using Helm:

# Install with custom values from GHCR OCI registry
# (Optional) If you use a registry mirror/proxy, append: --set global.imageRegistry=<your-registry>
helm install semantic-router oci://ghcr.io/vllm-project/charts/semantic-router \
--version v0.0.0-latest \
--namespace vllm-semantic-router-system \
--create-namespace \
-f https://raw.githubusercontent.com/vllm-project/semantic-router/refs/heads/main/deploy/kubernetes/ai-gateway/semantic-router-values/values.yaml

# Wait for deployment to be ready (this may take several minutes for model downloads)
kubectl wait --for=condition=Available deployment/semantic-router -n vllm-semantic-router-system --timeout=600s

# Verify deployment status
kubectl get pods -n vllm-semantic-router-system

Note: The values file contains the configuration for the semantic router, including model settings, categories, and routing rules. You can download and customize it from values.yaml.
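
If you want to customize the values before deploying, one possible workflow is to download the file, edit it locally, and pass the local copy to Helm. The sketch below assumes the same chart, version, and namespace as the command above; the function names are illustrative, and it uses helm upgrade --install so it can be re-run after each edit.

```shell
VALUES_URL="https://raw.githubusercontent.com/vllm-project/semantic-router/refs/heads/main/deploy/kubernetes/ai-gateway/semantic-router-values/values.yaml"

# Download the reference values file to a local path for editing.
fetch_values() {
  curl -fsSL -o "${1:-values.yaml}" "$VALUES_URL"
}

# Install or upgrade the release from a local values file.
deploy_semantic_router() {
  values_file="${1:-values.yaml}"
  if [ ! -f "$values_file" ]; then
    echo "values file not found: $values_file" >&2
    return 1
  fi
  helm upgrade --install semantic-router oci://ghcr.io/vllm-project/charts/semantic-router \
    --version v0.0.0-latest \
    --namespace vllm-semantic-router-system \
    --create-namespace \
    -f "$values_file"
}

# Usage:
# fetch_values values.yaml    # then edit values.yaml
# deploy_semantic_router values.yaml
```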

Step 3: Install Envoy Gateway

Install the core Envoy Gateway for traffic management:

# Install Envoy Gateway using Helm
helm upgrade -i eg oci://docker.io/envoyproxy/gateway-helm \
--version v0.0.0-latest \
--namespace envoy-gateway-system \
--create-namespace \
-f https://raw.githubusercontent.com/envoyproxy/ai-gateway/main/manifests/envoy-gateway-values.yaml

kubectl wait --timeout=2m -n envoy-gateway-system deployment/envoy-gateway --for=condition=Available

Step 4: Install Envoy AI Gateway

Install the AI-specific extensions for inference workloads:

# Install the Envoy AI Gateway CRDs first so the controller finds its resource types
helm upgrade -i aieg-crd oci://docker.io/envoyproxy/ai-gateway-crds-helm \
--version v0.0.0-latest \
--namespace envoy-ai-gateway-system \
--create-namespace

# Install Envoy AI Gateway using Helm
helm upgrade -i aieg oci://docker.io/envoyproxy/ai-gateway-helm \
--version v0.0.0-latest \
--namespace envoy-ai-gateway-system \
--create-namespace

# Wait for AI Gateway Controller to be ready
kubectl wait --timeout=300s -n envoy-ai-gateway-system deployment/ai-gateway-controller --for=condition=Available
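
Before moving on, it can help to confirm that the deployments from Steps 2 through 4 are all Available. The helper below is a sketch that reuses the kubectl wait calls from this guide; the function name and the 300s timeout are choices made for illustration.

```shell
# Wait for each "<namespace>/<deployment>" pair to report Available.
# Stops at the first deployment that does not become ready in time.
wait_for_deployments() {
  for target in "$@"; do
    ns="${target%%/*}"
    deploy="${target#*/}"
    echo "Waiting for deployment/$deploy in $ns ..."
    kubectl wait --for=condition=Available "deployment/$deploy" -n "$ns" --timeout=300s || return 1
  done
}

# Usage:
# wait_for_deployments \
#   vllm-semantic-router-system/semantic-router \
#   envoy-gateway-system/envoy-gateway \
#   envoy-ai-gateway-system/ai-gateway-controller
```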

Step 5: Deploy Demo LLM

Create a demo LLM to serve as the backend for the semantic router:

# Deploy demo LLM
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/refs/heads/main/deploy/kubernetes/ai-gateway/aigw-resources/base-model.yaml

Step 6: Create Gateway API Resources

Create the necessary Gateway API resources for the AI gateway:

kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/refs/heads/main/deploy/kubernetes/ai-gateway/aigw-resources/gwapi-resources.yaml

Testing the Deployment

Set up port forwarding to access the gateway locally:

# Get the Envoy service name
export ENVOY_SERVICE=$(kubectl get svc -n envoy-gateway-system \
--selector=gateway.envoyproxy.io/owning-gateway-namespace=default,gateway.envoyproxy.io/owning-gateway-name=semantic-router \
-o jsonpath='{.items[0].metadata.name}')

kubectl port-forward -n envoy-gateway-system svc/$ENVOY_SERVICE 8080:80
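
If the service lookup returns nothing (for example because the Gateway has not been programmed yet), the port-forward command fails with a confusing error about an empty service name. A defensive variant, sketched here with illustrative function names, checks the lookup result first. Note that kubectl port-forward runs in the foreground until interrupted.

```shell
# Look up the Envoy service created for the semantic-router Gateway.
envoy_service_name() {
  kubectl get svc -n envoy-gateway-system \
    --selector=gateway.envoyproxy.io/owning-gateway-namespace=default,gateway.envoyproxy.io/owning-gateway-name=semantic-router \
    -o jsonpath='{.items[0].metadata.name}'
}

# Forward a local port (default 8080) to port 80 of the Envoy service.
start_port_forward() {
  local_port="${1:-8080}"
  svc="$(envoy_service_name 2>/dev/null)" || svc=""
  if [ -z "$svc" ]; then
    echo "No Envoy service found for Gateway default/semantic-router" >&2
    return 1
  fi
  kubectl port-forward -n envoy-gateway-system "svc/$svc" "${local_port}:80"
}

# Usage:
# start_port_forward 8080
```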

Send Test Requests

The port-forward command keeps running in the foreground. Leave it open, then test the inference endpoint from a second terminal:

# Test math domain chat completions endpoint
curl -i -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MoM",
"messages": [
{"role": "user", "content": "What is the derivative of f(x) = x^3?"}
]
}'
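
To try several prompts without retyping the JSON body, a small wrapper along these lines may be convenient. It is a sketch: the function names and the GATEWAY_URL variable are introduced here for illustration, and the escaping only handles double quotes and backslashes in single-line prompts.

```shell
# Build a Chat Completions request body for the "MoM" model.
# Escapes backslashes and double quotes so the prompt stays valid JSON.
chat_payload() {
  prompt="$(printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g')"
  printf '{"model":"MoM","messages":[{"role":"user","content":"%s"}]}' "$prompt"
}

# Send a prompt through the gateway (defaults to the port-forward address).
ask_router() {
  curl -sS -X POST "${GATEWAY_URL:-http://localhost:8080}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$(chat_payload "$1")"
}

# Usage:
# ask_router "What is the derivative of f(x) = x^3?"
```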

Troubleshooting

Common Issues

Gateway not accessible:

# Check gateway status
kubectl get gateway semantic-router -n default

# Check Envoy service
kubectl get svc -n envoy-gateway-system

AI Gateway controller not ready:

# Check AI gateway controller logs
kubectl logs -n envoy-ai-gateway-system deployment/ai-gateway-controller

# Check controller status
kubectl get deployment -n envoy-ai-gateway-system

Semantic router not responding:

# Check semantic router pod status
kubectl get pods -n vllm-semantic-router-system

# Check semantic router logs
kubectl logs -n vllm-semantic-router-system deployment/semantic-router
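
When it is unclear which component is at fault, a single pass over all three namespaces can narrow things down. The helper below is a sketch with an illustrative function name; it prints pod status and the ten most recent events per namespace.

```shell
# Print pod status and recent events for each namespace used in this guide.
collect_diagnostics() {
  for ns in vllm-semantic-router-system envoy-gateway-system envoy-ai-gateway-system; do
    echo "=== $ns ==="
    kubectl get pods -n "$ns" -o wide
    kubectl get events -n "$ns" --sort-by=.lastTimestamp | tail -n 10
  done
}

# Usage:
# collect_diagnostics > diagnostics.txt
```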

Cleanup

To remove the entire deployment:

# Remove Gateway API resources and Demo LLM
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/refs/heads/main/deploy/kubernetes/ai-gateway/aigw-resources/gwapi-resources.yaml
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/refs/heads/main/deploy/kubernetes/ai-gateway/aigw-resources/base-model.yaml

# Remove semantic router
helm uninstall semantic-router -n vllm-semantic-router-system

# Remove AI gateway
helm uninstall aieg -n envoy-ai-gateway-system
helm uninstall aieg-crd -n envoy-ai-gateway-system

# Remove Envoy gateway
helm uninstall eg -n envoy-gateway-system

# Delete kind cluster (optional)
kind delete cluster --name semantic-router-cluster
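
If you keep the cluster, note that helm uninstall does not delete namespaces created with --create-namespace. The sketch below (function name is illustrative) lists which of this guide's namespaces still exist so you can decide whether to remove them.

```shell
# Print each namespace from this guide that still exists in the cluster.
leftover_namespaces() {
  for ns in vllm-semantic-router-system envoy-ai-gateway-system envoy-gateway-system; do
    if kubectl get namespace "$ns" >/dev/null 2>&1; then
      printf '%s\n' "$ns"
    fi
  done
}

# Usage (destructive: removes everything left in those namespaces):
# for ns in $(leftover_namespaces); do kubectl delete namespace "$ns"; done
```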

Next Steps

  • Configure custom routing rules in the AI Gateway
  • Set up monitoring and observability
  • Implement authentication and authorization
  • Scale the semantic router deployment for production workloads