Installation
This guide walks you through installing and running the vLLM Semantic Router. The router runs entirely on CPU and does not require a GPU for inference.
System Requirements
note
No GPU required - the router runs efficiently on CPU using optimized BERT models.
Requirements:
- Python: 3.10 or higher
- Container Runtime: Docker or Podman (required for running the router container)
Quick Start
1. Install vLLM Semantic Router
# Create a virtual environment (recommended)
python -m venv vsr
source vsr/bin/activate # On Windows: vsr\Scripts\activate
# Install from PyPI
pip install vllm-sr
Verify installation:
vllm-sr --version
2. Initialize Configuration
# Create config.yaml in current directory
vllm-sr init
This creates a config.yaml file with default settings.
3. Configure Your Backend
Edit the generated config.yaml to configure your model and backend endpoint:
providers:
  # Model configuration
  models:
    - name: "qwen/qwen3-1.8b"           # Model name
      endpoints:
        - name: "my_vllm"
          weight: 1
          endpoint: "localhost:8000"    # Domain or IP:port
          protocol: "http"              # http or https
          access_key: "your-token-here" # Optional: for authentication
  # Default model for fallback
  default_model: "qwen/qwen3-1.8b"
Configuration Options:
- endpoint: Domain name or IP address with port (e.g., localhost:8000, api.openai.com)
- protocol: http or https
- access_key: Optional authentication token (sent as a Bearer token)
- weight: Load-balancing weight (default: 1)
Example: Local vLLM
providers:
  models:
    - name: "qwen/qwen3-1.8b"
      endpoints:
        - name: "local_vllm"
          weight: 1
          endpoint: "localhost:8000"
          protocol: "http"
  default_model: "qwen/qwen3-1.8b"
Example: External API with HTTPS
providers:
  models:
    - name: "openai/gpt-4"
      endpoints:
        - name: "openai_api"
          weight: 1
          endpoint: "api.openai.com"
          protocol: "https"
          access_key: "sk-xxxxxx"
  default_model: "openai/gpt-4"
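Example: Load balancing across two endpoints
Because each endpoint carries a weight, the router can spread requests for one model across several replicas. This sketch is illustrative: the endpoint names and ports are placeholders, and the exact balancing semantics are covered in the Configuration Guide.
providers:
  models:
    - name: "qwen/qwen3-1.8b"
      endpoints:
        - name: "vllm_a"             # hypothetical replica on port 8000
          weight: 2                  # relative share of traffic
          endpoint: "localhost:8000"
          protocol: "http"
        - name: "vllm_b"             # hypothetical replica on port 8001
          weight: 1
          endpoint: "localhost:8001"
          protocol: "http"
  default_model: "qwen/qwen3-1.8b"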
4. Start the Router
vllm-sr serve
The router will:
- Automatically download the required ML models (~1.5 GB, one-time download)
- Start an Envoy proxy on port 8888
- Start the semantic router service
- Expose metrics on port 9190
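Once the services are up, a quick way to confirm the router is healthy is to probe the metrics port. This assumes the metrics are exposed in Prometheus text format at /metrics, which is the usual convention:
# Should print Prometheus-format metric lines
curl -s http://localhost:9190/metrics | head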
5. Test the Router
curl http://localhost:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MoM",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
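An OpenAI-compatible response echoes the model that actually served the request, so you can also check the router's selection directly. A small sketch (assumes jq is installed; the question is arbitrary):
# Show the selected model alongside the answer
curl -s http://localhost:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MoM",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
  }' | jq '{model, answer: .choices[0].message.content}'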
6. Launch Dashboard
vllm-sr dashboard
Common Commands
# View logs
vllm-sr logs router # Router logs
vllm-sr logs envoy # Envoy logs
vllm-sr logs router -f # Follow logs
# Check status
vllm-sr status
# Stop the router
vllm-sr stop
Advanced Configuration
HuggingFace Settings
Set environment variables before starting:
export HF_ENDPOINT=https://huggingface.co # Or mirror: https://hf-mirror.com
export HF_TOKEN=your_token_here # Only for gated models
export HF_HOME=/path/to/cache # Custom cache directory
vllm-sr serve
Custom Options
# Use custom config file
vllm-sr serve --config my-config.yaml
# Use custom Docker image
vllm-sr serve --image ghcr.io/vllm-project/semantic-router/vllm-sr:latest
# Control image pull policy
vllm-sr serve --image-pull-policy always
Kubernetes Deployment
For production deployments on Kubernetes or OpenShift, use the Kubernetes Operator:
Quick Start with Operator
# Clone repository
git clone https://github.com/vllm-project/semantic-router
cd semantic-router/deploy/operator
# Install CRDs and operator
make install
make deploy IMG=ghcr.io/vllm-project/semantic-router-operator:latest
# Deploy a semantic router instance
kubectl apply -f config/samples/vllm_v1alpha1_semanticrouter.yaml
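To confirm the rollout, standard kubectl checks are enough; the grep patterns below are illustrative and depend on the names used in the sample manifest:
# Confirm the CRD is installed and the router pods are running
kubectl get crds | grep -i semanticrouter
kubectl get pods -A | grep -i semantic-router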
Benefits:
- Declarative configuration using Kubernetes CRDs
- Automatic platform detection (OpenShift/Kubernetes)
- Built-in high availability and scaling
- Integrated monitoring and observability
- Lifecycle management and upgrades
See the Kubernetes Operator Guide for complete documentation.
Other Kubernetes Deployment Options
- Istio Integration - Service mesh deployment
- AI Gateway - Gateway API integration
- Production Stack - Complete production setup
- Dynamo - Dynamic configuration management
Docker Compose
For local development and testing:
- Docker Compose - Quick local deployment
Next Steps
- Configuration Guide - Advanced routing and signal configuration
- Kubernetes Operator - Production Kubernetes deployment
- API Documentation - Complete API reference
- Tutorials - Learn by example
Getting Help
- Issues: GitHub Issues
- Community: Join the #semantic-router channel in vLLM Slack
- Documentation: vllm-semantic-router.com