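Install with the Operator
The Semantic Router Operator provides a Kubernetes-native way to deploy and manage vLLM Semantic Router instances through custom resource definitions (CRDs). It simplifies deployment, configuration, and lifecycle management on both Kubernetes and OpenShift.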
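Features
- Declarative deployment: define semantic router instances with Kubernetes CRDs
- Automatic configuration: generates and manages the ConfigMap that holds the semantic router configuration
- Persistent storage: manages PVCs for ML model storage and handles their lifecycle automatically
- Platform detection: detects OpenShift or standard Kubernetes automatically and configures itself accordingly
- Built-in observability: metrics, tracing, and monitoring are supported by default
- Production features: HPA, Ingress, service mesh integration, Pod Disruption Budgets
- Secure by default: drops all capabilities and disallows privilege escalation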
Prerequisites
- Kubernetes 1.24+ or OpenShift 4.12+
- A configured kubectl or oc CLI
- Cluster administrator privileges (to install the CRDs)
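The prerequisites above can be checked from a terminal before installing anything. The following is a minimal sketch, not part of the operator tooling: the helper names pick_cli, version_at_least, and can_install_crds are hypothetical, and version_at_least assumes a sort that supports -V (GNU coreutils).

```shell
#!/bin/sh
# Print the first available cluster CLI: kubectl if present, otherwise oc.
pick_cli() {
  for cli in kubectl oc; do
    if command -v "$cli" >/dev/null 2>&1; then
      echo "$cli"
      return 0
    fi
  done
  echo "neither kubectl nor oc found on PATH" >&2
  return 1
}

# Succeed if version $1 is at least version $2 (for example "1.28.3" vs "1.24").
# Assumes `sort -V` (version sort) is available.
version_at_least() {
  lowest="$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)"
  [ "$lowest" = "$2" ]
}

# Ask the API server whether the current user may create CRDs.
# Prints "yes" or "no".
can_install_crds() {
  "$(pick_cli)" auth can-i create customresourcedefinitions.apiextensions.k8s.io
}
```

For example, `version_at_least "1.28.3" "1.24" && can_install_crds` succeeds only when the supplied version meets the minimum and the current user is allowed to create CRDs. The version string has to come from your own cluster.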
Installation
Option 1: Kustomize (standard Kubernetes)
# Clone the repository
git clone https://github.com/vllm-project/semantic-router
cd semantic-router/deploy/operator
# Install CRDs
make install
# Deploy the operator
make deploy IMG=ghcr.io/vllm-project/semantic-router/operator:latest
Verify that the operator is running:
kubectl get pods -n semantic-router-operator-system
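Listing pods shows a snapshot. In scripts it can be more convenient to block until the operator is ready. A minimal sketch using kubectl wait follows; the function name wait_for_operator and the 120s default timeout are illustrative choices, and the namespace default is the one used in the command above.

```shell
# Block until every Deployment in the operator namespace reports the
# Available condition, or fail once the timeout expires.
wait_for_operator() {
  ns="${1:-semantic-router-operator-system}"
  timeout="${2:-120s}"
  kubectl wait deployment --all \
    --namespace "$ns" \
    --for=condition=Available \
    --timeout="$timeout"
}
```

Calling `wait_for_operator` with no arguments uses the defaults; pass a namespace and a timeout to override them.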
Option 2: OLM (OpenShift)
Use this option when deploying to OpenShift through the Operator Lifecycle Manager:
cd semantic-router/deploy/operator
# Build and push to your registry (Quay, internal registry, etc.)
podman login quay.io
make podman-build IMG=quay.io/<your-org>/semantic-router-operator:latest
make podman-push IMG=quay.io/<your-org>/semantic-router-operator:latest
# Deploy using OLM
make openshift-deploy
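After an OLM-based install, the operator normally appears as a ClusterServiceVersion (CSV). The sketch below lists CSVs whose name matches a pattern. It assumes that the make target registers the operator with OLM and that the CSV name contains "semantic-router"; neither is confirmed by this guide, and find_operator_csv is a hypothetical helper name.

```shell
# List ClusterServiceVersions (the OLM record of an installed operator)
# whose line contains the given pattern. The default pattern is an assumption.
find_operator_csv() {
  pattern="${1:-semantic-router}"
  oc get clusterserviceversions --all-namespaces --no-headers | grep -- "$pattern"
}
```

If nothing is printed, try a different pattern or inspect the full output of `oc get clusterserviceversions --all-namespaces`.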
Deploy your first Router
Quick start with sample configurations
Choose a preconfigured sample that matches your infrastructure:
# Simple standalone deployment with KServe backend
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_simple.yaml
# Full-featured OpenShift deployment with Routes
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_openshift.yaml
# Gateway integration mode (Istio/Envoy Gateway)
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_gateway.yaml
# Llama Stack backend discovery
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_llamastack.yaml
# Redis cache backend for production caching
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_redis_cache.yaml
# Milvus cache backend for large-scale deployments
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_milvus_cache.yaml
# Hybrid cache backend for optimal performance
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_hybrid_cache.yaml
# mmBERT 2D Matryoshka embeddings with layer early exit
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_mmbert.yaml
# Complexity-aware routing for intelligent model selection
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_complexity.yaml
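All of the sample URLs above share one prefix and differ only in their suffix, so a small helper can build them and validate a sample before it is applied. This is a sketch: sample_url and preview_sample are hypothetical helper names, and the server-side dry run assumes the CRDs are already installed.

```shell
# Base URL shared by every sample manifest listed above.
SAMPLES_BASE="https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples"

# Build the raw URL for a sample from its suffix (simple, openshift, gateway, ...).
sample_url() {
  printf '%s/vllm.ai_v1alpha1_semanticrouter_%s.yaml\n' "$SAMPLES_BASE" "$1"
}

# Ask the API server to validate a sample without persisting it.
preview_sample() {
  kubectl apply --dry-run=server -f "$(sample_url "$1")"
}
```

For example, `preview_sample redis_cache` validates the Redis cache sample, and `kubectl apply -f "$(sample_url redis_cache)"` applies it.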
Custom configuration
Create a file named my-router.yaml:
apiVersion: vllm.ai/v1alpha1
kind: SemanticRouter
metadata:
  name: my-router
  namespace: default
spec:
  replicas: 2
  image:
    repository: ghcr.io/vllm-project/semantic-router/extproc
    tag: latest
  # Configure vLLM backend endpoints
  vllmEndpoints:
    # KServe InferenceService (RHOAI 3.x)
    - name: llama3-8b-endpoint
      model: llama3-8b
      reasoningFamily: qwen3
      loras:
        - name: computer-science-expert
          description: Adapter for advanced computer science prompts
      backend:
        type: kserve
        inferenceServiceName: llama-3-8b
      weight: 1
  resources:
    limits:
      memory: "7Gi"
      cpu: "2"
    requests:
      memory: "3Gi"
      cpu: "1"
  persistence:
    enabled: true
    size: 10Gi
    storageClassName: "standard"
  config:
    providers:
      defaults:
        default_model: llama3-8b
        default_reasoning_effort: medium
        reasoning_families:
          qwen3:
            type: chat_template_kwargs
            parameter: enable_thinking
      models:
        - name: llama3-8b
          provider_model_id: llama3-8b
          backend_refs:
            - name: llama3-8b-endpoint
              endpoint: llama-3-8b-predictor.default.svc.cluster.local:80
              protocol: http
    routing:
      modelCards:
        - name: llama3-8b
          modality: text
          capabilities: ["chat", "reasoning"]
      decisions:
        - name: default-route
          description: Catch-all route
          priority: 100
          rules:
            operator: AND
            conditions: []
          modelRefs:
            - model: llama3-8b
              use_reasoning: false
    global:
      stores:
        semantic_cache:
          enabled: true
          backend_type: memory
          max_entries: 1000
          ttl_seconds: 3600
      integrations:
        tools:
          enabled: true
          top_k: 3
          similarity_threshold: 0.2
      model_catalog:
        system:
          prompt_guard: models/mmbert32k-jailbreak-detector-merged
        modules:
          prompt_guard:
            enabled: true
            model_ref: prompt_guard
            threshold: 0.7
  toolsDb:
    - tool:
        type: "function"
        function:
          name: "get_weather"
          description: "Get weather information for a location"
          parameters:
            type: "object"
            properties:
              location:
                type: "string"
                description: "City and state, e.g. San Francisco, CA"
            required: ["location"]
      description: "Weather information tool"
      category: "weather"
      tags: ["weather", "temperature"]
Apply the configuration:
kubectl apply -f my-router.yaml
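Once the manifest is applied, it helps to confirm that the custom resource exists and that the operator has created workloads for it. The sketch below assumes the CRD registers the singular resource name semanticrouter (the usual default for a kind named SemanticRouter) and that the generated objects live in the same namespace as the resource; show_router is a hypothetical helper name.

```shell
# Show the custom resource, then the objects the operator is expected
# to create for it (Deployment, ConfigMap, PVC, Pods).
show_router() {
  name="${1:-my-router}"
  ns="${2:-default}"
  kubectl get semanticrouter "$name" --namespace "$ns" -o yaml
  kubectl get deployments,configmaps,persistentvolumeclaims,pods --namespace "$ns"
}
```

Run `show_router` for the example above, or `show_router <name> <namespace>` for a resource created elsewhere.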
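spec.config should use the same canonical providers/routing/global layout as a local config.yaml. spec.vllmEndpoints remains the Kubernetes adapter layer, used for backend discovery and served-model aliases. When the operator renders the runtime config, it converts these entries into canonical providers.models[].backend_refs[] and routing.modelCards entries (including the optional loras).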
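Advanced features
Embedding model configuration
The operator supports three high-performance embedding models for semantic understanding and caching. You can configure them to suit your workload.
Available embedding models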
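- Qwen3-Embedding (1024 dimensions, 32K context)
  - Best for: high-quality semantic understanding and long contexts
  - Use cases: complex queries, research documents, detailed analysis
- EmbeddingGemma (768 dimensions, 8K context)
  - Best for: faster performance with good accuracy
  - Use cases: real-time applications, high throughput
- mmBERT 2D Matryoshka (64-768 dimensions, multilingual)
  - Best for: adaptive speed/quality trade-offs through layer early exit
  - Use cases: multilingual deployments, workloads that need a flexible quality/speed trade-off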
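The dimension counts above translate directly into the memory footprint of cached vectors. As a rough illustration, the sketch below estimates raw vector storage under the assumption that each component is a 4-byte float32 and that index and metadata overhead are ignored. How the router actually stores embeddings is not stated here, and embedding_bytes is a hypothetical helper name.

```shell
# Approximate bytes needed to hold N embedding vectors of D dimensions,
# assuming 4-byte float32 components and no index or metadata overhead.
embedding_bytes() {
  dims="$1"
  entries="$2"
  echo $(( dims * 4 * entries ))
}

embedding_bytes 1024 1000   # Qwen3-Embedding      -> 4096000
embedding_bytes 768 1000    # EmbeddingGemma       -> 3072000
embedding_bytes 64 1000     # mmBERT at 64 dims    -> 256000
```

With a cache limit of 1000 entries, as in the max_entries setting of the custom configuration, raw vector storage stays in the low megabytes for every model under these assumptions, so the choice between models is more likely to be driven by latency and quality than by cache memory.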