Kubernetes Operator

The Semantic Router Operator provides a Kubernetes-native way to deploy and manage vLLM Semantic Router instances using Custom Resource Definitions (CRDs). It simplifies deployment, configuration, and lifecycle management across Kubernetes and OpenShift platforms.

Features​

  • 🚀 Declarative Deployment: Define semantic router instances using Kubernetes CRDs
  • 🔄 Automatic Configuration: Generates and manages ConfigMaps for semantic router configuration
  • 📦 Persistent Storage: Manages PVCs for ML model storage with automatic lifecycle
  • 🔐 Platform Detection: Automatically detects and configures for OpenShift or standard Kubernetes
  • 📊 Built-in Observability: Metrics, tracing, and monitoring support out of the box
  • 🎯 Production Features: HPA, ingress, service mesh integration, and pod disruption budgets
  • 🛡️ Secure by Default: Drops all capabilities, prevents privilege escalation

Prerequisites​

  • Kubernetes 1.24+ or OpenShift 4.12+
  • kubectl or oc CLI configured
  • Cluster admin access (for CRD installation)
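
To confirm these prerequisites, you can run a quick check (this assumes your kubeconfig already points at the target cluster):

# Check CLI and cluster versions, and confirm you may install CRDs
kubectl version
kubectl auth can-i create customresourcedefinitions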

Installation​

Option 1: Using Kustomize (Standard Kubernetes)​

# Clone the repository
git clone https://github.com/vllm-project/semantic-router
cd semantic-router/deploy/operator

# Install CRDs
make install

# Deploy the operator
make deploy IMG=ghcr.io/vllm-project/semantic-router-operator:latest

Verify the operator is running:

kubectl get pods -n semantic-router-operator-system

Option 2: Using OLM (OpenShift)​

For OpenShift deployments using Operator Lifecycle Manager:

cd semantic-router/deploy/operator

# Build and push to your registry (Quay, internal registry, etc.)
podman login quay.io
make podman-build IMG=quay.io/<your-org>/semantic-router-operator:latest
make podman-push IMG=quay.io/<your-org>/semantic-router-operator:latest

# Deploy using OLM
make openshift-deploy

Deploy Your First Router​

Quick Start with Sample Configurations​

Choose a pre-configured sample based on your infrastructure:

# Simple standalone deployment with KServe backend
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_simple.yaml

# Full-featured OpenShift deployment with Routes
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_openshift.yaml

# Gateway integration mode (Istio/Envoy Gateway)
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_gateway.yaml

# Llama Stack backend discovery
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_llamastack.yaml

# Redis cache backend for production caching
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_redis_cache.yaml

# Milvus cache backend for large-scale deployments
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_milvus_cache.yaml

# Hybrid cache backend for optimal performance
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_hybrid_cache.yaml

# mmBERT 2D Matryoshka embeddings with layer early exit
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_mmbert.yaml

# Complexity-aware routing for intelligent model selection
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_complexity.yaml

Custom Configuration​

Create a my-router.yaml file:

apiVersion: vllm.ai/v1alpha1
kind: SemanticRouter
metadata:
  name: my-router
  namespace: default
spec:
  replicas: 2

  image:
    repository: ghcr.io/vllm-project/semantic-router/extproc
    tag: latest

  # Configure vLLM backend endpoints
  vllmEndpoints:
    # KServe InferenceService (RHOAI 3.x)
    - name: llama3-8b-endpoint
      model: llama3-8b
      reasoningFamily: qwen3
      backend:
        type: kserve
        inferenceServiceName: llama-3-8b
      weight: 1

  resources:
    limits:
      memory: "7Gi"
      cpu: "2"
    requests:
      memory: "3Gi"
      cpu: "1"

  persistence:
    enabled: true
    size: 10Gi
    storageClassName: "standard"

  config:
    bert_model:
      model_id: "models/mom-embedding-light"
      threshold: "0.6"
      use_cpu: true

    semantic_cache:
      enabled: true
      backend_type: "memory"
      max_entries: 1000
      ttl_seconds: 3600

    tools:
      enabled: true
      top_k: 3
      similarity_threshold: "0.2"

    prompt_guard:
      enabled: true
      threshold: "0.7"

  toolsDb:
    - tool:
        type: "function"
        function:
          name: "get_weather"
          description: "Get weather information for a location"
          parameters:
            type: "object"
            properties:
              location:
                type: "string"
                description: "City and state, e.g. San Francisco, CA"
            required: ["location"]
      description: "Weather information tool"
      category: "weather"
      tags: ["weather", "temperature"]

Apply the configuration:

kubectl apply -f my-router.yaml

Advanced Features​

Embedding Models Configuration​

The operator supports three high-performance embedding models for semantic understanding and caching. You can configure these models to optimize for your specific use case.

Available Embedding Models​

  1. Qwen3-Embedding (1024 dimensions, 32K context)

    • Best for: High-quality semantic understanding with long context
    • Use case: Complex queries, research documents, detailed analysis
  2. EmbeddingGemma (768 dimensions, 8K context)

    • Best for: Fast performance with good accuracy
    • Use case: Real-time applications, high-throughput scenarios
  3. mmBERT 2D Matryoshka (64-768 dimensions, multilingual)

    • Best for: Adaptive performance with layer early exit
    • Use case: Multilingual deployments, flexible quality/speed trade-offs

Example: mmBERT with Layer Early Exit​

spec:
  config:
    # Configure mmBERT 2D Matryoshka embeddings
    embedding_models:
      mmbert_model_path: "models/mmbert-embedding"
      use_cpu: true

    # HNSW configuration for embedding-based classification
    hnsw_config:
      model_type: "mmbert"

      # Layer early exit: balance speed vs accuracy
      # Layer 3: ~7x speedup (fast, good for high-volume queries)
      # Layer 6: ~3.6x speedup (balanced - recommended)
      # Layer 11: ~2x speedup (higher accuracy)
      # Layer 22: full model (maximum accuracy)
      target_layer: 6

      # Dimension reduction for faster similarity search
      # Options: 64, 128, 256, 512, 768
      target_dimension: 256

      preload_embeddings: true
      enable_soft_matching: true
      min_score_threshold: "0.5"

    # Use mmBERT in semantic cache
    semantic_cache:
      enabled: true
      backend_type: "memory"
      embedding_model: "mmbert"
      similarity_threshold: "0.85"
      max_entries: 5000
      ttl_seconds: 7200

See mmbert sample configuration for a complete example.

Example: Qwen3 with Redis Cache​

spec:
  config:
    embedding_models:
      qwen3_model_path: "models/qwen3-embedding"
      use_cpu: true

    semantic_cache:
      enabled: true
      backend_type: "redis"
      embedding_model: "qwen3"
      redis:
        connection:
          host: redis.cache-backends.svc.cluster.local
          port: 6379
        index:
          vector_field:
            dimension: 1024 # Qwen3 dimension

See redis cache sample configuration for a complete example.

Complexity-Aware Routing​

Route queries to different models based on complexity classification. Simple queries go to fast models, complex queries go to powerful models.

Example Configuration​

spec:
  # Configure multiple backends with different capabilities
  vllmEndpoints:
    - name: llama-8b-fast
      model: llama3-8b
      reasoningFamily: qwen3
      backend:
        type: kserve
        inferenceServiceName: llama-3-8b
      weight: 2 # Prefer for simple queries

    - name: llama-70b-reasoning
      model: llama3-70b
      reasoningFamily: deepseek
      backend:
        type: kserve
        inferenceServiceName: llama-3-70b
      weight: 1 # Use for complex queries

  config:
    # Define complexity rules
    complexity_rules:
      # Rule 1: Code complexity
      - name: "code-complexity"
        description: "Classify coding tasks by complexity"
        threshold: "0.3" # Lower threshold works better for embedding-based similarity

        # Examples of complex coding tasks
        hard:
          candidates:
            - "Implement a distributed lock manager with leader election"
            - "Design a database migration system with rollback support"
            - "Create a compiler optimization pass for loop unrolling"

        # Examples of simple coding tasks
        easy:
          candidates:
            - "Write a function to reverse a string"
            - "Create a class to represent a rectangle"
            - "Implement a simple counter with increment/decrement"

      # Rule 2: Reasoning complexity
      - name: "reasoning-complexity"
        description: "Classify reasoning and problem-solving tasks"
        threshold: "0.3" # Lower threshold works better for embedding-based similarity

        hard:
          candidates:
            - "Analyze the geopolitical implications of renewable energy adoption"
            - "Evaluate the ethical considerations of AI in healthcare"
            - "Design a multi-stage marketing strategy for a new product launch"

        easy:
          candidates:
            - "What is the capital of France?"
            - "How many days are in a week?"
            - "Name three common pets"

      # Rule 3: Domain-specific complexity with conditional application
      - name: "medical-complexity"
        description: "Classify medical queries (only for medical domain)"
        threshold: "0.3" # Lower threshold works better for embedding-based similarity

        hard:
          candidates:
            - "Differential diagnosis for chest pain with dyspnea"
            - "Treatment protocol for multi-drug resistant tuberculosis"

        easy:
          candidates:
            - "What is the normal body temperature?"
            - "What are common symptoms of a cold?"

        # Only apply this rule if domain signal indicates medical domain
        composer:
          operator: "AND"
          conditions:
            - type: "domain"
              name: "medical"

How it works:

  1. Incoming query is compared against hard and easy candidate examples
  2. Similarity scores determine complexity classification
  3. Output signals: {rule-name}:hard, {rule-name}:easy, or {rule-name}:medium
  4. Router uses signals to select appropriate backend model
  5. Composer allows conditional rule application based on other signals

See complexity routing sample configuration for a complete example.

Verify Deployment​

# Check the SemanticRouter resource
kubectl get semanticrouter my-router

# Check created resources
kubectl get deployment,service,configmap -l app.kubernetes.io/instance=my-router

# View status
kubectl describe semanticrouter my-router

# View logs
kubectl logs -f deployment/my-router

Expected output:

NAME                               PHASE     REPLICAS   READY   AGE
semanticrouter.vllm.ai/my-router   Running   2          2       5m

Backend Discovery Types​

The operator supports three types of backend discovery for connecting semantic router to vLLM model servers. Choose the type that matches your infrastructure.

KServe InferenceService Discovery​

For RHOAI 3.x or standalone KServe deployments. The operator automatically discovers the predictor service created by KServe.

spec:
  vllmEndpoints:
    - name: llama3-8b-endpoint
      model: llama3-8b
      reasoningFamily: qwen3
      backend:
        type: kserve
        inferenceServiceName: llama-3-8b # InferenceService in same namespace
      weight: 1

When to use:

  • Running on Red Hat OpenShift AI (RHOAI) 3.x
  • Using KServe for model serving
  • Want automatic service discovery

How it works:

  • Discovers the predictor service: {inferenceServiceName}-predictor
  • Uses port 8443 (KServe default HTTPS port)
  • Works in the same namespace as SemanticRouter
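
Before pointing an endpoint at an InferenceService, you can confirm it is ready and that KServe created its predictor service (names match the example above):

# The operator resolves the predictor service for the InferenceService
kubectl get inferenceservice llama-3-8b
kubectl get service llama-3-8b-predictor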

Llama Stack Service Discovery​

Discovers Llama Stack deployments using Kubernetes label selectors.

spec:
  vllmEndpoints:
    - name: llama-405b-endpoint
      model: llama-3.3-70b-instruct
      reasoningFamily: gpt
      backend:
        type: llamastack
        discoveryLabels:
          app: llama-stack
          model: llama-3.3-70b
      weight: 1

When to use:

  • Using Meta's Llama Stack for model serving
  • Multiple Llama Stack services with different models
  • Want label-based service discovery

How it works:

  • Lists services matching the label selector
  • Uses first matching service if multiple found
  • Extracts port from service definition
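
For discovery to succeed, the Llama Stack Service must carry labels that match discoveryLabels. A minimal sketch of such a Service follows (the name, selector, and port value are illustrative assumptions):

apiVersion: v1
kind: Service
metadata:
  name: llama-stack-70b        # illustrative name
  labels:
    app: llama-stack           # must match discoveryLabels in the CR
    model: llama-3.3-70b
spec:
  selector:
    app: llama-stack           # adjust to match your Llama Stack pods
  ports:
    - port: 8321               # operator reads the port from the service definition; value here is an assumption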

Direct Kubernetes Service​

Direct connection to any Kubernetes service (vLLM, TGI, etc.).

spec:
  vllmEndpoints:
    - name: custom-vllm-endpoint
      model: deepseek-r1-distill-qwen-7b
      reasoningFamily: deepseek
      backend:
        type: service
        service:
          name: vllm-deepseek
          namespace: vllm-serving # Can reference service in another namespace
          port: 8000
      weight: 1

When to use:

  • Direct vLLM deployments
  • Custom model servers with OpenAI-compatible API
  • Cross-namespace service references
  • Maximum control over service endpoints

How it works:

  • Connects to specified service directly
  • No discovery - uses explicit configuration
  • Supports cross-namespace references
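
Since no discovery happens, it is worth confirming that the target service answers the OpenAI-compatible API before wiring it in (service name, namespace, and port taken from the example above):

# In one terminal: forward the backend service locally
kubectl port-forward -n vllm-serving svc/vllm-deepseek 8000:8000

# In another terminal: list the models it serves
curl http://localhost:8000/v1/models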

Multiple Backends​

You can configure multiple backends with load balancing weights:

spec:
  vllmEndpoints:
    # KServe backend
    - name: llama3-8b
      model: llama3-8b
      reasoningFamily: qwen3
      backend:
        type: kserve
        inferenceServiceName: llama-3-8b
      weight: 2 # Higher weight = more traffic

    # Direct service backend
    - name: qwen-7b
      model: qwen2.5-7b
      reasoningFamily: qwen3
      backend:
        type: service
        service:
          name: vllm-qwen
          port: 8000
      weight: 1

Deployment Modes​

The operator supports two deployment modes with different architectures.

Standalone Mode (Default)​

Deploys semantic router with an Envoy sidecar container that acts as an ingress gateway.

Architecture:

Client → Service (8080) → Envoy Sidecar → ExtProc gRPC → Semantic Router → vLLM

When to use:

  • Simple deployments without existing service mesh
  • Testing and development
  • Self-contained deployment with minimal dependencies

Configuration:

spec:
  # No gateway configuration - defaults to standalone mode
  service:
    type: ClusterIP
    api:
      port: 8080 # Client traffic enters here
      targetPort: 8080 # Envoy ingress port
    grpc:
      port: 50051 # ExtProc communication
      targetPort: 50051

Operator behavior:

  • Deploys pod with two containers: semantic router + Envoy sidecar
  • Envoy handles ingress and forwards to semantic router via ExtProc gRPC
  • Status shows gatewayMode: "standalone"

Gateway Integration Mode​

Reuses an existing Gateway (Istio, Envoy Gateway, etc.) and creates an HTTPRoute.

Architecture:

Client → Gateway (Istio/Envoy) → HTTPRoute → Service (8080) → Semantic Router API → vLLM

When to use:

  • Existing Istio or Envoy Gateway deployment
  • Centralized ingress management
  • Multi-tenancy with shared gateway
  • Advanced traffic management (circuit breaking, retries, rate limiting)

Configuration:

spec:
  gateway:
    existingRef:
      name: istio-ingressgateway # Or your Envoy Gateway name
      namespace: istio-system

  # Service only needs API port in gateway mode
  service:
    type: ClusterIP
    api:
      port: 8080
      targetPort: 8080

Operator behavior:

  1. Creates HTTPRoute resource pointing to the specified Gateway
  2. Skips Envoy sidecar container in pod spec
  3. Sets status.gatewayMode: "gateway-integration"
  4. Semantic router operates in pure API mode (no ExtProc)

Example: See vllm.ai_v1alpha1_semanticrouter_gateway.yaml

OpenShift Routes​

For OpenShift deployments, the operator can create Routes for external access with TLS termination.

Basic OpenShift Route​

spec:
  openshift:
    routes:
      enabled: true
      hostname: semantic-router.apps.openshift.example.com # Optional - auto-generated if omitted
      tls:
        termination: edge # TLS terminates at Route, plain HTTP to backend
        insecureEdgeTerminationPolicy: Redirect # Redirect HTTP to HTTPS

TLS Termination Options​

  • edge (recommended): TLS terminates at Route, plain HTTP to backend
  • passthrough: TLS passthrough to backend (requires backend TLS)
  • reencrypt: TLS terminates at Route, re-encrypts to backend

When to Use OpenShift Routes​

  • Running on OpenShift 4.x
  • Need external access without configuring Ingress
  • Want auto-generated hostnames
  • Require OpenShift-native TLS management

Status Information​

After creating a Route, check the status:

kubectl get semanticrouter my-router -o jsonpath='{.status.openshiftFeatures}'

Output:

{
  "routesEnabled": true,
  "routeHostname": "semantic-router-default.apps.openshift.example.com"
}

Example: See vllm.ai_v1alpha1_semanticrouter_route.yaml

Choosing Your Configuration​

Use this decision tree to select the right configuration:

┌─ Need to run on OpenShift?
│ ├─ YES → Use openshift sample (Routes + KServe/service backends)
│ └─ NO ↓
│
├─ Have existing Gateway (Istio/Envoy)?
│ ├─ YES → Use gateway sample (Gateway integration mode)
│ └─ NO ↓
│
├─ Using Meta Llama Stack?
│ ├─ YES → Use llamastack sample
│ └─ NO ↓
│
└─ Simple deployment → Use simple sample (standalone mode)

Backend choice:

┌─ Running RHOAI 3.x or KServe?
│ ├─ YES → Use KServe backend type
│ └─ NO ↓
│
├─ Using Meta Llama Stack?
│ ├─ YES → Use llamastack backend type
│ └─ NO ↓
│
└─ Have direct vLLM service? → Use service backend type

Architecture​

The operator manages a complete stack of resources for each SemanticRouter:

┌─────────────────────────────────────────────────────┐
│ SemanticRouter CR                                    │
│   apiVersion: vllm.ai/v1alpha1                       │
│   kind: SemanticRouter                               │
└──────────────────────────┬──────────────────────────┘
                           │
                           ▼
                ┌─────────────────────┐
                │ Operator Controller │
                │ - Watches CR        │
                │ - Reconciles state  │
                │ - Platform detection│
                └──────────┬──────────┘
                           │
      ┌─────────────┬──────┴──────┬────────────┐
      ▼             ▼             ▼            ▼
 ┌──────────┐ ┌───────────┐ ┌───────────┐ ┌──────────┐
 │Deployment│ │ Service   │ │ ConfigMap │ │   PVC    │
 │          │ │ - gRPC    │ │ - config  │ │ - models │
 │          │ │ - API     │ │ - tools   │ │          │
 │          │ │ - metrics │ │           │ │          │
 └──────────┘ └───────────┘ └───────────┘ └──────────┘

Managed Resources:

  • Deployment: Runs semantic router pods with configurable replicas
  • Service: Exposes gRPC (50051), HTTP API (8080), and metrics (9190)
  • ConfigMap: Contains semantic router configuration and tools database
  • ServiceAccount: For RBAC (optional, created when specified)
  • PersistentVolumeClaim: For ML model storage (optional, when persistence enabled)
  • HorizontalPodAutoscaler: For auto-scaling (optional, when autoscaling enabled)
  • Ingress: For external access (optional, when ingress enabled)
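
You can list everything the operator created for a given instance with the shared instance label:

kubectl get deployment,service,configmap,serviceaccount,pvc,hpa,ingress -l app.kubernetes.io/instance=my-router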

Platform Detection and Security​

The operator automatically detects the platform and configures security contexts appropriately.

OpenShift Platform​

When running on OpenShift, the operator:

  • Detects: Checks for route.openshift.io API resources
  • Security Context: Does NOT set runAsUser, runAsGroup, or fsGroup
  • Rationale: Lets OpenShift SCCs assign UIDs/GIDs from the namespace's allowed range
  • Compatible with: restricted SCC (default) and custom SCCs

Standard Kubernetes​

When running on standard Kubernetes, the operator:

  • Security Context: Sets runAsUser: 1000, fsGroup: 1000, runAsNonRoot: true
  • Rationale: Provides secure defaults for pod security policies/standards

Both Platforms​

Regardless of platform:

  • Drops ALL capabilities (drop: [ALL])
  • Prevents privilege escalation (allowPrivilegeEscalation: false)
  • No special permissions or SCCs required beyond defaults
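
To see which security context was actually applied to a running pod, you can inspect it with the instance label used elsewhere in this guide:

kubectl get pod -l app.kubernetes.io/instance=my-router \
  -o jsonpath='{.items[0].spec.securityContext}'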

Override Security Context​

You can override automatic security contexts in your CR:

spec:
  # Container security context
  securityContext:
    runAsNonRoot: true
    runAsUser: 2000
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL

  # Pod security context
  podSecurityContext:
    runAsNonRoot: true
    runAsUser: 2000
    fsGroup: 2000

OpenShift Note

When running on OpenShift, it's recommended to omit runAsUser and fsGroup and let SCCs handle UID/GID assignment automatically.

Configuration Reference​

Image Configuration​

spec:
  image:
    repository: ghcr.io/vllm-project/semantic-router/extproc
    tag: latest
    pullPolicy: IfNotPresent
    imageRegistry: "" # Optional: custom registry prefix

  # Optional: Image pull secrets
  imagePullSecrets:
    - name: ghcr-secret

Service Configuration​

spec:
  service:
    type: ClusterIP # or NodePort, LoadBalancer

    grpc:
      port: 50051
      targetPort: 50051

    api:
      port: 8080
      targetPort: 8080

    metrics:
      enabled: true
      port: 9190
      targetPort: 9190

Persistence Configuration​

spec:
  persistence:
    enabled: true
    storageClassName: "standard" # Adjust for your cluster
    accessMode: ReadWriteOnce
    size: 10Gi

    # Optional: Use existing PVC
    existingClaim: "my-existing-pvc"

    # Optional: PVC annotations
    annotations:
      backup.velero.io/backup-volumes: "models"

Storage Validation

The operator validates that the specified StorageClass exists before creating the PVC. If storageClassName is omitted, the cluster's default StorageClass is used.

Storage Class Examples:

  • AWS EKS: gp3-csi, gp2
  • GKE: standard, premium-rwo
  • Azure AKS: managed, managed-premium
  • OpenShift: gp3-csi, thin, ocs-storagecluster-ceph-rbd

Semantic Cache Backends​

The operator supports multiple cache backends for semantic caching, which significantly improves latency and reduces token usage by caching similar queries and their responses.

Prerequisites

The operator does not deploy Redis or Milvus. You must deploy these services separately in your cluster before using them as cache backends. The operator only configures the SemanticRouter to connect to your existing Redis/Milvus deployment.

For deployment examples, see the Redis and Milvus sections below.

Alternative: If you prefer automatic deployment of Redis/Milvus, consider using the Helm chart, which can deploy cache backends as Helm chart dependencies.

Supported Backends​

1. Memory Cache (Default)​

Simple in-memory cache suitable for development and small deployments.

Characteristics:

  • No external dependencies
  • Fast access
  • Not persistent (cleared on restart)
  • Limited by pod memory

Configuration:

spec:
  config:
    bert_model:
      model_id: models/mom-embedding-light
      threshold: "0.6"
      use_cpu: true

    semantic_cache:
      enabled: true
      backend_type: memory
      similarity_threshold: "0.8"
      max_entries: 1000
      ttl_seconds: 3600
      eviction_policy: fifo # fifo, lru, or lfu

When to use:

  • Development and testing
  • Small deployments (<1000 cached queries)
  • No persistence requirements

2. Redis Cache

High-performance distributed cache using Redis with vector search capabilities.

Characteristics:

  • Distributed and scalable
  • Persistent storage (with AOF/RDB)
  • HNSW or FLAT indexing
  • Wide ecosystem support

Prerequisites:

  • Redis 7.0+ with RediSearch module
  • Create Kubernetes Secret for password:
kubectl create secret generic redis-credentials \
--from-literal=password='your-redis-password'

Configuration:

spec:
  config:
    bert_model:
      model_id: models/mom-embedding-light
      threshold: "0.6"
      use_cpu: true

    semantic_cache:
      enabled: true
      backend_type: redis
      similarity_threshold: "0.85"
      ttl_seconds: 3600
      embedding_model: bert

      redis:
        connection:
          host: redis.default.svc.cluster.local
          port: 6379
          database: 0
          password_secret_ref:
            name: redis-credentials
            key: password
          timeout: 30
          tls:
            enabled: false

        index:
          name: semantic_cache_idx
          prefix: "cache:"
          vector_field:
            name: embedding
            dimension: 384 # Match your embedding model
            metric_type: COSINE
          index_type: HNSW
          params:
            M: 16
            efConstruction: 64

        search:
          topk: 1

        development:
          auto_create_index: true
          verbose_errors: true

When to use:

  • Production deployments with moderate scale
  • Need persistence and high availability
  • Existing Redis infrastructure
  • Fast in-memory performance required

Example: vllm.ai_v1alpha1_semanticrouter_redis_cache.yaml

3. Milvus Cache​

Enterprise-grade vector database for production deployments with large cache volumes.

Characteristics:

  • Highly scalable and distributed
  • Advanced indexing (HNSW, IVF, etc.)
  • Built-in data lifecycle management
  • High availability support

Prerequisites:

  • Milvus 2.3+ (standalone or cluster)
  • Create Kubernetes Secret for credentials:
kubectl create secret generic milvus-credentials \
--from-literal=password='your-milvus-password'

Configuration:

spec:
  config:
    bert_model:
      model_id: models/mom-embedding-light
      threshold: "0.6"
      use_cpu: true

    semantic_cache:
      enabled: true
      backend_type: milvus
      similarity_threshold: "0.90"
      ttl_seconds: 7200
      embedding_model: bert

      milvus:
        connection:
          host: milvus-standalone.default.svc.cluster.local
          port: 19530
          database: semantic_router_cache
          timeout: 30
          auth:
            enabled: true
            username: root
            password_secret_ref:
              name: milvus-credentials
              key: password

        collection:
          name: semantic_cache
          description: "Semantic cache for LLM responses"
          vector_field:
            name: embedding
            dimension: 384 # Match your embedding model
            metric_type: IP
          index:
            type: HNSW
            params:
              M: 16
              efConstruction: 64

        search:
          params:
            ef: 64
          topk: 10
          consistency_level: Session

        performance:
          connection_pool:
            max_connections: 10
            max_idle_connections: 5
          batch:
            insert_batch_size: 100

        data_management:
          ttl:
            enabled: true
            timestamp_field: created_at
            cleanup_interval: 3600

        development:
          auto_create_collection: true

When to use:

  • Large-scale production deployments
  • Need advanced vector search capabilities
  • Require data lifecycle management (TTL, compaction)
  • High availability and scalability requirements

Example: vllm.ai_v1alpha1_semanticrouter_milvus_cache.yaml

4. Hybrid Cache​

Combines in-memory HNSW index with persistent Milvus storage for optimal performance and durability.

Characteristics:

  • Fast in-memory search with HNSW
  • Persistent storage in Milvus
  • Best of both worlds
  • Automatic synchronization

Configuration:

spec:
  config:
    bert_model:
      model_id: models/mom-embedding-light
      threshold: "0.6"
      use_cpu: true

    semantic_cache:
      enabled: true
      backend_type: hybrid
      similarity_threshold: "0.85"
      ttl_seconds: 3600
      max_entries: 5000
      eviction_policy: lru
      embedding_model: bert

      # HNSW in-memory configuration
      hnsw:
        use_hnsw: true
        hnsw_m: 32
        hnsw_ef_construction: 128
        max_memory_entries: 5000

      # Milvus persistent storage (same config as milvus backend)
      milvus:
        connection:
          host: milvus-standalone.default.svc.cluster.local
          port: 19530
        # ... rest of milvus config

When to use:

  • Need fastest possible cache lookups
  • Require persistence and durability
  • Willing to trade memory for performance
  • High-throughput production deployments

Example: vllm.ai_v1alpha1_semanticrouter_hybrid_cache.yaml

Deploying Redis​

Before using Redis cache backend, deploy Redis with RediSearch module to your cluster:

---
apiVersion: v1
kind: Namespace
metadata:
  name: cache-backends
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  namespace: cache-backends
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis/redis-stack-server:latest
          ports:
            - containerPort: 6379
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
---
apiVersion: v1
kind: Service
metadata:
  name: redis
  namespace: cache-backends
spec:
  type: ClusterIP
  ports:
    - port: 6379
      targetPort: 6379
  selector:
    app: redis

Apply the configuration:

kubectl apply -f redis-deployment.yaml

Create the credentials Secret:

kubectl create secret generic redis-credentials \
--from-literal=password='' # Empty for no password, or set your password

For production deployments, consider using a Redis operator or Helm chart, or a managed Redis service, instead of this minimal Deployment (see the Production Best Practices note below).

Deploying Milvus​

Before using Milvus cache backend, deploy Milvus to your cluster:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: milvus-standalone
  namespace: cache-backends
spec:
  replicas: 1
  selector:
    matchLabels:
      app: milvus
  template:
    metadata:
      labels:
        app: milvus
    spec:
      containers:
        - name: milvus
          image: milvusdb/milvus:v2.4.0
          command: ["milvus", "run", "standalone"]
          ports:
            - containerPort: 19530
              name: grpc
            - containerPort: 9091
              name: metrics
          env:
            - name: ETCD_USE_EMBED
              value: "true"
            - name: COMMON_STORAGETYPE
              value: "local"
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: 2000m
              memory: 4Gi
          volumeMounts:
            - name: milvus-data
              mountPath: /var/lib/milvus
      volumes:
        - name: milvus-data
          emptyDir: {} # Use PVC for production
---
apiVersion: v1
kind: Service
metadata:
  name: milvus-standalone
  namespace: cache-backends
spec:
  type: ClusterIP
  ports:
    - port: 19530
      targetPort: 19530
      name: grpc
  selector:
    app: milvus

Apply the configuration:

kubectl apply -f milvus-deployment.yaml

Create the credentials Secret:

kubectl create secret generic milvus-credentials \
--from-literal=password='Milvus' # Default Milvus password

For production deployments, consider using the Milvus Operator or Helm chart, or a managed vector database service, instead of this minimal standalone Deployment (see the Production Best Practices note below).

Production Best Practices

For production cache backends:

  1. Use persistent volumes (not emptyDir)
  2. Enable authentication and TLS
  3. Configure resource limits appropriately
  4. Set up monitoring and alerting
  5. Use operators or Helm charts for easier management
  6. Consider managed services for reduced operational overhead
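
As a sketch of item 1, the emptyDir volume in the Milvus example above could be replaced with a PersistentVolumeClaim (storage class and size are assumptions; adjust for your cluster):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: milvus-data
  namespace: cache-backends
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: standard   # assumption: use a class available in your cluster
  resources:
    requests:
      storage: 20Gi            # assumption: size to your expected cache volume

# Then reference it from the Deployment instead of emptyDir:
#   volumes:
#     - name: milvus-data
#       persistentVolumeClaim:
#         claimName: milvus-data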

Embedding Models​

The semantic cache supports different embedding models for similarity calculation:

  • bert (default): Lightweight, 384 dimensions, good for general use
  • qwen3: Higher quality, 1024 dimensions, better accuracy
  • gemma: Balanced, 768 dimensions, moderate performance

Configure via:

spec:
  config:
    semantic_cache:
      embedding_model: bert # or qwen3, gemma

Note: Ensure the dimension in cache config matches your chosen embedding model.

Migration Between Backends​

Migrating from memory cache to Redis or Milvus is straightforward:

  1. Deploy Redis or Milvus in your cluster
  2. Create the credentials Secret
  3. Update SemanticRouter CR with new backend configuration
  4. Apply the changes - operator will perform rolling update

The cache will be empty after migration but will populate naturally as queries are processed.
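
For example, moving the earlier my-router example from the in-memory cache to Redis only requires updating the semantic_cache block (Redis host as in the Redis deployment example above):

spec:
  config:
    semantic_cache:
      enabled: true
      backend_type: redis          # was: memory
      similarity_threshold: "0.85"
      redis:
        connection:
          host: redis.cache-backends.svc.cluster.local
          port: 6379
          password_secret_ref:
            name: redis-credentials
            key: password

Apply the change with kubectl apply -f my-router.yaml and the operator performs the rolling update.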

Cache Configuration Reference​

For detailed configuration options, use kubectl explain:

# Redis cache configuration
kubectl explain semanticrouter.spec.config.semantic_cache.redis

# Milvus cache configuration
kubectl explain semanticrouter.spec.config.semantic_cache.milvus

# HNSW configuration
kubectl explain semanticrouter.spec.config.semantic_cache.hnsw

Semantic Router Configuration​

Full semantic router configuration is embedded in the CR. See the complete examples above and in deploy/operator/config/samples/.

Key configuration sections:

spec:
  config:
    # BERT model for embeddings (required for semantic cache)
    bert_model:
      model_id: "models/mom-embedding-light"
      threshold: 0.6
      use_cpu: true

    # Semantic cache (see Semantic Cache Backends section above)
    semantic_cache:
      enabled: true
      backend_type: "memory" # or redis, milvus, hybrid
      similarity_threshold: 0.8
      max_entries: 1000
      ttl_seconds: 3600
      eviction_policy: "fifo"

    # Tools auto-selection
    tools:
      enabled: true
      top_k: 3
      similarity_threshold: 0.2
      tools_db_path: "config/tools_db.json"
      fallback_to_empty: true

    # Prompt guard (jailbreak detection)
    prompt_guard:
      enabled: true
      model_id: "models/mom-jailbreak-classifier"
      threshold: 0.7
      use_cpu: true

    # Classifiers
    classifier:
      category_model:
        model_id: "models/lora_intent_classifier_bert-base-uncased_model"
        threshold: 0.6
        use_cpu: true
      pii_model:
        model_id: "models/pii_classifier_modernbert-base_presidio_token_model"
        threshold: 0.7
        use_cpu: true

    # Reasoning configuration per model family
    reasoning_families:
      deepseek:
        type: "chat_template_kwargs"
        parameter: "thinking"
      qwen3:
        type: "chat_template_kwargs"
        parameter: "enable_thinking"
      gpt:
        type: "reasoning_effort"
        parameter: "reasoning_effort"

    # API batch classification
    api:
      batch_classification:
        max_batch_size: 100
        concurrency_threshold: 5
        max_concurrency: 8
        metrics:
          enabled: true
          detailed_goroutine_tracking: true
          sample_rate: 1.0

    # Observability
    observability:
      tracing:
        enabled: false
        provider: "opentelemetry"
        exporter:
          type: "otlp"
          endpoint: "jaeger:4317"

Tools Database​

Define available tools for auto-selection:

spec:
  toolsDb:
    - tool:
        type: "function"
        function:
          name: "search_web"
          description: "Search the web for information"
          parameters:
            type: "object"
            properties:
              query:
                type: "string"
                description: "Search query"
            required: ["query"]
      description: "Search the internet, web search, find information online"
      category: "search"
      tags: ["search", "web", "internet"]

    - tool:
        type: "function"
        function:
          name: "calculate"
          description: "Perform mathematical calculations"
          parameters:
            type: "object"
            properties:
              expression:
                type: "string"
            required: ["expression"]
      description: "Calculate mathematical expressions"
      category: "math"
      tags: ["math", "calculation"]

Autoscaling (HPA)​

spec:
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 80
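
After the change is applied, the generated HorizontalPodAutoscaler can be checked with the instance label:

kubectl get hpa -l app.kubernetes.io/instance=my-router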

Ingress Configuration​

spec:
  ingress:
    enabled: true
    className: "nginx" # or "haproxy", "traefik", etc.
    annotations:
      cert-manager.io/cluster-issuer: "letsencrypt-prod"
    hosts:
      - host: router.example.com
        paths:
          - path: /
            pathType: Prefix
            servicePort: 8080
    tls:
      - secretName: router-tls
        hosts:
          - router.example.com

Production Deployment​

High Availability Setup​

apiVersion: vllm.ai/v1alpha1
kind: SemanticRouter
metadata:
  name: prod-router
spec:
  replicas: 3

  # Anti-affinity for spreading across nodes
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/instance: prod-router
          topologyKey: kubernetes.io/hostname

  # Autoscaling
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 20
    targetCPUUtilizationPercentage: 70

  # Production resources
  resources:
    limits:
      memory: "10Gi"
      cpu: "4"
    requests:
      memory: "5Gi"
      cpu: "2"

  # Strict probes
  livenessProbe:
    enabled: true
    initialDelaySeconds: 60
    periodSeconds: 30
    failureThreshold: 3

  readinessProbe:
    enabled: true
    initialDelaySeconds: 30
    periodSeconds: 10
    failureThreshold: 3

Pod Disruption Budget​

Create a PDB to ensure availability during updates:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: prod-router-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/instance: prod-router

Resource Allocation Guidelines​

Workload Type   Memory Request   CPU Request   Memory Limit   CPU Limit
Development     1Gi              500m          2Gi            1
Staging         3Gi              1             7Gi            2
Production      5Gi              2             10Gi           4

Monitoring and Observability​

Metrics​

Prometheus metrics are exposed on port 9190:

# Port-forward to access metrics locally
kubectl port-forward svc/my-router 9190:9190

# View metrics
curl http://localhost:9190/metrics

Key Metrics:

  • semantic_router_request_duration_seconds - Request latency
  • semantic_router_cache_hit_total - Cache hit rate
  • semantic_router_classification_duration_seconds - Classification latency
  • semantic_router_tokens_total - Token usage
  • semantic_router_reasoning_requests_total - Reasoning mode usage
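
A quick way to spot-check these counters after the port-forward above (exact metric and label sets may vary between releases):

curl -s http://localhost:9190/metrics | grep -E 'semantic_router_(cache_hit_total|request_duration_seconds)'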

ServiceMonitor (Prometheus Operator)​

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: semantic-router-metrics
spec:
  selector:
    matchLabels:
      app.kubernetes.io/instance: my-router
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics

Distributed Tracing​

Enable OpenTelemetry tracing:

spec:
  config:
    observability:
      tracing:
        enabled: true
        provider: "opentelemetry"
        exporter:
          type: "otlp"
          endpoint: "jaeger-collector:4317"
          insecure: true
        sampling:
          type: "always_on"
          rate: 1.0

Troubleshooting​

Common Issues​

Backend Discovery Failures​

Symptom: "No backends found" or "Failed to discover backend" in logs

For KServe backends:

# Check InferenceService exists and is ready
kubectl get inferenceservice llama-3-8b

# Check predictor service was created by KServe
kubectl get service llama-3-8b-predictor

# Verify InferenceService status
kubectl describe inferenceservice llama-3-8b

For Llama Stack backends:

# Verify services exist with correct labels
kubectl get services -l app=llama-stack,model=llama-3.3-70b

# Check service labels match discoveryLabels in CR
kubectl get service <service-name> -o jsonpath='{.metadata.labels}'

For direct service backends:

# Verify service exists in specified namespace
kubectl get service vllm-deepseek -n vllm-serving

# Check service has ports defined
kubectl get service vllm-deepseek -n vllm-serving -o jsonpath='{.spec.ports[0]}'

Gateway Integration Issues​

Symptom: HTTPRoute not created or traffic not reaching semantic router

# Verify Gateway exists
kubectl get gateway istio-ingressgateway -n istio-system

# Check HTTPRoute was created
kubectl get httproute -l app.kubernetes.io/instance=my-router

# Verify Gateway supports HTTPRoute (Gateway API v1)
kubectl get gateway istio-ingressgateway -n istio-system -o yaml | grep -A5 listeners

# Check operator status
kubectl get semanticrouter my-router -o jsonpath='{.status.gatewayMode}'
# Should show: "gateway-integration"

OpenShift Route Issues​

Symptom: Route not created on OpenShift

# Verify running on OpenShift cluster
kubectl api-resources | grep route.openshift.io

# Check if Route was created
kubectl get route -l app.kubernetes.io/instance=my-router

# Check operator detected OpenShift
kubectl logs -n semantic-router-operator-system \
deployment/semantic-router-operator-controller-manager \
| grep -i "openshift\|route"

# Verify Route status
kubectl get semanticrouter my-router -o jsonpath='{.status.openshiftFeatures}'

Pod stuck in ImagePullBackOff​

# Check image pull secrets
kubectl describe pod <pod-name>

# Create image pull secret
kubectl create secret docker-registry ghcr-secret \
--docker-server=ghcr.io \
--docker-username=<username> \
--docker-password=<personal-access-token>

# Add to SemanticRouter
spec:
  imagePullSecrets:
    - name: ghcr-secret

PVC stuck in Pending​

# Check storage class exists
kubectl get storageclass

# Check PVC events
kubectl describe pvc my-router-models

# Update storage class in CR
spec:
  persistence:
    storageClassName: "your-available-storage-class"

Models not downloading​

# Check if HF token secret exists
kubectl get secret hf-token-secret

# Create HF token secret
kubectl create secret generic hf-token-secret \
--from-literal=token=hf_xxxxxxxxxxxxx

# Add to SemanticRouter CR
spec:
  env:
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          name: hf-token-secret
          key: token

Operator not detecting platform correctly​

# Check operator logs for platform detection
kubectl logs -n semantic-router-operator-system \
deployment/semantic-router-operator-controller-manager \
| grep -i "platform\|openshift"

# Should see one of:
# "Detected OpenShift platform - will use OpenShift-compatible security contexts"
# "Detected standard Kubernetes platform - will use standard security contexts"

Development​

Local Development​

cd deploy/operator

# Run tests
make test

# Generate CRDs and code
make generate
make manifests

# Build operator binary
make build

# Run locally against your kubeconfig
make run

Testing with kind​

# Create kind cluster
kind create cluster --name operator-test

# Build and load image
make docker-build IMG=semantic-router-operator:dev
kind load docker-image semantic-router-operator:dev --name operator-test

# Deploy
make install
make deploy IMG=semantic-router-operator:dev

# Create test instance
kubectl apply -f config/samples/vllm_v1alpha1_semanticrouter.yaml

Next Steps​