Kubernetes Operator
The Semantic Router Operator provides a Kubernetes-native way to deploy and manage vLLM Semantic Router instances using Custom Resource Definitions (CRDs). It simplifies deployment, configuration, and lifecycle management across Kubernetes and OpenShift platforms.
Features
- 🚀 Declarative Deployment: Define semantic router instances using Kubernetes CRDs
- 🔄 Automatic Configuration: Generates and manages ConfigMaps for semantic router configuration
- 📦 Persistent Storage: Manages PVCs for ML model storage with automatic lifecycle
- 🔐 Platform Detection: Automatically detects and configures for OpenShift or standard Kubernetes
- 📊 Built-in Observability: Metrics, tracing, and monitoring support out of the box
- 🎯 Production Features: HPA, ingress, service mesh integration, and pod disruption budgets
- 🛡️ Secure by Default: Drops all capabilities, prevents privilege escalation
Quick Start
Prerequisites
- Kubernetes 1.24+ or OpenShift 4.12+
- kubectl or oc CLI configured
- Cluster admin access (for CRD installation)
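You can quickly confirm the CLI and cluster versions before installing; these are standard checks, not operator-specific commands:
# Check client and server versions (use oc instead of kubectl on OpenShift)
kubectl version
kubectl get nodes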
Installation
Option 1: Using Kustomize (Standard Kubernetes)
# Clone the repository
git clone https://github.com/vllm-project/semantic-router
cd semantic-router/deploy/operator
# Install CRDs
make install
# Deploy the operator
make deploy IMG=ghcr.io/vllm-project/semantic-router-operator:latest
Verify the operator is running:
kubectl get pods -n semantic-router-operator-system
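You can also confirm that the CRD was registered. The CRD name below assumes the default plural form for the SemanticRouter kind in the vllm.ai group:
# The SemanticRouter CRD should be listed once `make install` succeeds
kubectl get crd semanticrouters.vllm.ai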
Option 2: Using OLM (OpenShift)
For OpenShift deployments using Operator Lifecycle Manager:
cd semantic-router/deploy/operator
# Build and push to your registry (Quay, internal registry, etc.)
podman login quay.io
make podman-build IMG=quay.io/<your-org>/semantic-router-operator:latest
make podman-push IMG=quay.io/<your-org>/semantic-router-operator:latest
# Deploy using OLM
make openshift-deploy
See the OpenShift Quick Start Guide for detailed instructions.
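After the OLM deployment completes, a ClusterServiceVersion for the operator should appear. A rough check (the exact CSV name and namespace depend on your catalog and subscription):
# Look for the operator's ClusterServiceVersion and pods across namespaces
oc get csv -A | grep -i semantic-router
oc get pods -A | grep -i semantic-router-operator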
Deploy Your First Router
Quick Start with Sample Configurations
Choose a pre-configured sample based on your infrastructure:
# Simple standalone deployment with KServe backend
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_simple.yaml
# Full-featured OpenShift deployment with Routes
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_openshift.yaml
# Gateway integration mode (Istio/Envoy Gateway)
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_gateway.yaml
# Llama Stack backend discovery
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_llamastack.yaml
Custom Configuration
Create a my-router.yaml file:
apiVersion: vllm.ai/v1alpha1
kind: SemanticRouter
metadata:
  name: my-router
  namespace: default
spec:
  replicas: 2
  image:
    repository: ghcr.io/vllm-project/semantic-router/extproc
    tag: latest
  # Configure vLLM backend endpoints
  vllmEndpoints:
    # KServe InferenceService (RHOAI 3.x)
    - name: llama3-8b-endpoint
      model: llama3-8b
      reasoningFamily: qwen3
      backend:
        type: kserve
        inferenceServiceName: llama-3-8b
      weight: 1
  resources:
    limits:
      memory: "7Gi"
      cpu: "2"
    requests:
      memory: "3Gi"
      cpu: "1"
  persistence:
    enabled: true
    size: 10Gi
    storageClassName: "standard"
  config:
    bert_model:
      model_id: "models/mom-embedding-light"
      threshold: "0.6"
      use_cpu: true
    semantic_cache:
      enabled: true
      backend_type: "memory"
      max_entries: 1000
      ttl_seconds: 3600
    tools:
      enabled: true
      top_k: 3
      similarity_threshold: "0.2"
    prompt_guard:
      enabled: true
      threshold: "0.7"
  toolsDb:
    - tool:
        type: "function"
        function:
          name: "get_weather"
          description: "Get weather information for a location"
          parameters:
            type: "object"
            properties:
              location:
                type: "string"
                description: "City and state, e.g. San Francisco, CA"
            required: ["location"]
      description: "Weather information tool"
      category: "weather"
      tags: ["weather", "temperature"]
Apply the configuration:
kubectl apply -f my-router.yaml
Verify Deployment
# Check the SemanticRouter resource
kubectl get semanticrouter my-router
# Check created resources
kubectl get deployment,service,configmap -l app.kubernetes.io/instance=my-router
# View status
kubectl describe semanticrouter my-router
# View logs
kubectl logs -f deployment/my-router
Expected output:
NAME                               PHASE     REPLICAS   READY   AGE
semanticrouter.vllm.ai/my-router   Running   2          2       5m
Backend Discovery Types
The operator supports three types of backend discovery for connecting the semantic router to vLLM model servers. Choose the type that matches your infrastructure.
KServe InferenceService Discovery
For RHOAI 3.x or standalone KServe deployments. The operator automatically discovers the predictor service created by KServe.
spec:
  vllmEndpoints:
    - name: llama3-8b-endpoint
      model: llama3-8b
      reasoningFamily: qwen3
      backend:
        type: kserve
        inferenceServiceName: llama-3-8b  # InferenceService in same namespace
      weight: 1
When to use:
- Running on Red Hat OpenShift AI (RHOAI) 3.x
- Using KServe for model serving
- Want automatic service discovery
How it works:
- Discovers the predictor service: {inferenceServiceName}-predictor
- Uses port 8443 (KServe default HTTPS port)
- Works in the same namespace as the SemanticRouter
Llama Stack Service Discovery
Discovers Llama Stack deployments using Kubernetes label selectors.
spec:
  vllmEndpoints:
    - name: llama-405b-endpoint
      model: llama-3.3-70b-instruct
      reasoningFamily: gpt
      backend:
        type: llamastack
        discoveryLabels:
          app: llama-stack
          model: llama-3.3-70b
      weight: 1
When to use:
- Using Meta's Llama Stack for model serving
- Multiple Llama Stack services with different models
- Want label-based service discovery
How it works:
- Lists services matching the label selector
- Uses the first matching service if multiple are found
- Extracts port from service definition
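Any Kubernetes Service carrying the matching labels is eligible for discovery. A minimal sketch of such a Service (the pod selector and port are illustrative):
apiVersion: v1
kind: Service
metadata:
  name: llama-stack-70b
  labels:
    app: llama-stack            # matched by discoveryLabels above
    model: llama-3.3-70b
spec:
  selector:
    app: llama-stack-70b        # illustrative pod selector
  ports:
    - name: http
      port: 8321                # illustrative Llama Stack port
      targetPort: 8321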
Direct Kubernetes Service
Direct connection to any Kubernetes service (vLLM, TGI, etc.).
spec:
  vllmEndpoints:
    - name: custom-vllm-endpoint
      model: deepseek-r1-distill-qwen-7b
      reasoningFamily: deepseek
      backend:
        type: service
        service:
          name: vllm-deepseek
          namespace: vllm-serving  # Can reference service in another namespace
          port: 8000
      weight: 1
When to use:
- Direct vLLM deployments
- Custom model servers with OpenAI-compatible API
- Cross-namespace service references
- Maximum control over service endpoints
How it works:
- Connects to specified service directly
- No discovery - uses explicit configuration
- Supports cross-namespace references
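The referenced service is just an ordinary Kubernetes Service in front of your model server. A minimal sketch matching the configuration above (the pod selector is illustrative):
apiVersion: v1
kind: Service
metadata:
  name: vllm-deepseek
  namespace: vllm-serving
spec:
  selector:
    app: vllm-deepseek          # illustrative label on the vLLM pods
  ports:
    - name: http
      port: 8000                # port referenced in the SemanticRouter spec
      targetPort: 8000          # vLLM's OpenAI-compatible API port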
Multiple Backends
You can configure multiple backends with load balancing weights:
spec:
  vllmEndpoints:
    # KServe backend
    - name: llama3-8b
      model: llama3-8b
      reasoningFamily: qwen3
      backend:
        type: kserve
        inferenceServiceName: llama-3-8b
      weight: 2  # Higher weight = more traffic
    # Direct service backend
    - name: qwen-7b
      model: qwen2.5-7b
      reasoningFamily: qwen3
      backend:
        type: service
        service:
          name: vllm-qwen
          port: 8000
      weight: 1
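Assuming weights are treated as proportional shares, the example above sends roughly two-thirds of eligible traffic to the KServe backend and one-third to the direct service backend.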
Deployment Modes
The operator supports two deployment modes with different architectures.
Standalone Mode (Default)
Deploys the semantic router with an Envoy sidecar container that acts as an ingress gateway.
Architecture:
Client → Service (8080) → Envoy Sidecar → ExtProc gRPC → Semantic Router → vLLM
When to use:
- Simple deployments without existing service mesh
- Testing and development
- Self-contained deployment with minimal dependencies
Configuration:
spec:
  # No gateway configuration - defaults to standalone mode
  service:
    type: ClusterIP
    api:
      port: 8080        # Client traffic enters here
      targetPort: 8080  # Envoy ingress port
    grpc:
      port: 50051       # ExtProc communication
      targetPort: 50051
Operator behavior:
- Deploys pod with two containers: semantic router + Envoy sidecar
- Envoy handles ingress and forwards to semantic router via ExtProc gRPC
- Status shows gatewayMode: "standalone"
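To smoke-test a standalone deployment, you can port-forward the API port and send an OpenAI-style request. This sketch assumes the Service is named after the SemanticRouter resource and that the router exposes an OpenAI-compatible chat completions path:
# Forward the Envoy ingress port locally (Service name assumed to match the CR name)
kubectl port-forward svc/my-router 8080:8080

# In another terminal, send a chat completion routed by model name
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3-8b", "messages": [{"role": "user", "content": "Hello"}]}'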
Gateway Integration Mode
Reuses an existing Gateway (Istio, Envoy Gateway, etc.) and creates an HTTPRoute.
Architecture:
Client → Gateway (Istio/Envoy) → HTTPRoute → Service (8080) → Semantic Router API → vLLM
When to use:
- Existing Istio or Envoy Gateway deployment
- Centralized ingress management
- Multi-tenancy with shared gateway
- Advanced traffic management (circuit breaking, retries, rate limiting)
Configuration:
spec:
  gateway:
    existingRef:
      name: istio-ingressgateway  # Or your Envoy Gateway name
      namespace: istio-system
  # Service only needs API port in gateway mode
  service:
    type: ClusterIP
    api:
      port: 8080
      targetPort: 8080
Operator behavior:
- Creates HTTPRoute resource pointing to the specified Gateway
- Skips Envoy sidecar container in pod spec
- Sets status.gatewayMode: "gateway-integration"
- Semantic router operates in pure API mode (no ExtProc)
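As a rough sketch, the generated HTTPRoute could look like the following; the names and namespace are illustrative, and the operator owns and manages the actual resource:
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-router                    # illustrative; created by the operator
  namespace: default
spec:
  parentRefs:
    - name: istio-ingressgateway     # from spec.gateway.existingRef
      namespace: istio-system
  rules:
    - backendRefs:
        - name: my-router            # semantic router Service
          port: 8080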
Example: See vllm.ai_v1alpha1_semanticrouter_gateway.yaml
OpenShift Routes
For OpenShift deployments, the operator can create Routes for external access with TLS termination.
Basic OpenShift Route
spec:
  openshift:
    routes:
      enabled: true
      hostname: semantic-router.apps.openshift.example.com  # Optional - auto-generated if omitted
      tls:
        termination: edge  # TLS terminates at Route, plain HTTP to backend
        insecureEdgeTerminationPolicy: Redirect  # Redirect HTTP to HTTPS
TLS Termination Options
- edge (recommended): TLS terminates at Route, plain HTTP to backend
- passthrough: TLS passthrough to backend (requires backend TLS)
- reencrypt: TLS terminates at Route, re-encrypts to backend
When to Use OpenShift Routes
- Running on OpenShift 4.x
- Need external access without configuring Ingress
- Want auto-generated hostnames
- Require OpenShift-native TLS management
Status Information
After creating a Route, check the status:
kubectl get semanticrouter my-router -o jsonpath='{.status.openshiftFeatures}'
Output:
{
  "routesEnabled": true,
  "routeHostname": "semantic-router-default.apps.openshift.example.com"
}
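You can also list the Route directly; the label selector below matches the one used earlier for the other managed resources:
# Inspect the Route created by the operator (OpenShift only)
oc get route -l app.kubernetes.io/instance=my-router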
Example: See vllm.ai_v1alpha1_semanticrouter_route.yaml
Choosing Your Configuration
Use this decision tree to select the right configuration:
┌─ Need to run on OpenShift?
│ ├─ YES → Use openshift sample (Routes + KServe/service backends)
│ └─ NO ↓
│
├─ Have existing Gateway (Istio/Envoy)?
│ ├─ YES → Use gateway sample (Gateway integration mode)
│ └─ NO ↓
│