Kubernetes Operator

The Semantic Router Operator provides a Kubernetes-native way to deploy and manage vLLM Semantic Router instances using Custom Resource Definitions (CRDs). It simplifies deployment, configuration, and lifecycle management across Kubernetes and OpenShift platforms.

Features​

  • 🚀 Declarative Deployment: Define semantic router instances using Kubernetes CRDs
  • 🔄 Automatic Configuration: Generates and manages ConfigMaps for semantic router configuration
  • 📦 Persistent Storage: Manages PVCs for ML model storage with automatic lifecycle
  • 🔐 Platform Detection: Automatically detects and configures for OpenShift or standard Kubernetes
  • 📊 Built-in Observability: Metrics, tracing, and monitoring support out of the box
  • 🎯 Production Features: HPA, ingress, service mesh integration, and pod disruption budgets
  • 🛡️ Secure by Default: Drops all capabilities, prevents privilege escalation

Prerequisites​

  • Kubernetes 1.24+ or OpenShift 4.12+
  • kubectl or oc CLI configured
  • Cluster admin access (for CRD installation)
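
To confirm these prerequisites, you can run a quick check (this assumes your kubeconfig already points at the target cluster):

# Check CLI and cluster versions, and confirm you may install CRDs
kubectl version
kubectl auth can-i create customresourcedefinitions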

Installation​

Option 1: Using Kustomize (Standard Kubernetes)​

# Clone the repository
git clone https://github.com/vllm-project/semantic-router
cd semantic-router/deploy/operator

# Install CRDs
make install

# Deploy the operator
make deploy IMG=ghcr.io/vllm-project/semantic-router-operator:latest

Verify the operator is running:

kubectl get pods -n semantic-router-operator-system

Option 2: Using OLM (OpenShift)​

For OpenShift deployments using Operator Lifecycle Manager:

cd semantic-router/deploy/operator

# Build and push to your registry (Quay, internal registry, etc.)
podman login quay.io
make podman-build IMG=quay.io/<your-org>/semantic-router-operator:latest
make podman-push IMG=quay.io/<your-org>/semantic-router-operator:latest

# Deploy using OLM
make openshift-deploy

Deploy Your First Router​

Quick Start with Sample Configurations​

Choose a pre-configured sample based on your infrastructure:

# Simple standalone deployment with KServe backend
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_simple.yaml

# Full-featured OpenShift deployment with Routes
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_openshift.yaml

# Gateway integration mode (Istio/Envoy Gateway)
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_gateway.yaml

# Llama Stack backend discovery
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_llamastack.yaml

# Redis cache backend for production caching
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_redis_cache.yaml

# Milvus cache backend for large-scale deployments
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_milvus_cache.yaml

# Hybrid cache backend for optimal performance
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_hybrid_cache.yaml

# mmBERT 2D Matryoshka embeddings with layer early exit
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_mmbert.yaml

# Complexity-aware routing for intelligent model selection
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_complexity.yaml

Custom Configuration​

Create a my-router.yaml file:

apiVersion: vllm.ai/v1alpha1
kind: SemanticRouter
metadata:
  name: my-router
  namespace: default
spec:
  replicas: 2

  image:
    repository: ghcr.io/vllm-project/semantic-router/extproc
    tag: latest

  # Configure vLLM backend endpoints
  vllmEndpoints:
    # KServe InferenceService (RHOAI 3.x)
    - name: llama3-8b-endpoint
      model: llama3-8b
      reasoningFamily: qwen3
      backend:
        type: kserve
        inferenceServiceName: llama-3-8b
      weight: 1

  resources:
    limits:
      memory: "7Gi"
      cpu: "2"
    requests:
      memory: "3Gi"
      cpu: "1"

  persistence:
    enabled: true
    size: 10Gi
    storageClassName: "standard"

  config:
    bert_model:
      model_id: "models/mom-embedding-light"
      threshold: "0.6"
      use_cpu: true

    semantic_cache:
      enabled: true
      backend_type: "memory"
      max_entries: 1000
      ttl_seconds: 3600

    tools:
      enabled: true
      top_k: 3
      similarity_threshold: "0.2"

    prompt_guard:
      enabled: true
      threshold: "0.7"

  toolsDb:
    - tool:
        type: "function"
        function:
          name: "get_weather"
          description: "Get weather information for a location"
          parameters:
            type: "object"
            properties:
              location:
                type: "string"
                description: "City and state, e.g. San Francisco, CA"
            required: ["location"]
      description: "Weather information tool"
      category: "weather"
      tags: ["weather", "temperature"]

Apply the configuration:

kubectl apply -f my-router.yaml

Advanced Features​

Embedding Models Configuration​

The operator supports three high-performance embedding models for semantic understanding and caching. You can configure these models to optimize for your specific use case.

Available Embedding Models​

  1. Qwen3-Embedding (1024 dimensions, 32K context)

    • Best for: High-quality semantic understanding with long context
    • Use case: Complex queries, research documents, detailed analysis
  2. EmbeddingGemma (768 dimensions, 8K context)

    • Best for: Fast performance with good accuracy
    • Use case: Real-time applications, high-throughput scenarios
  3. mmBERT 2D Matryoshka (64-768 dimensions, multilingual)

    • Best for: Adaptive performance with layer early exit
    • Use case: Multilingual deployments, flexible quality/speed trade-offs

Example: mmBERT with Layer Early Exit​

spec:
  config:
    # Configure mmBERT 2D Matryoshka embeddings
    embedding_models:
      mmbert_model_path: "models/mmbert-embedding"
      use_cpu: true

    # HNSW configuration for embedding-based classification
    hnsw_config:
      model_type: "mmbert"

      # Layer early exit: balance speed vs accuracy
      # Layer 3: ~7x speedup (fast, good for high-volume queries)
      # Layer 6: ~3.6x speedup (balanced - recommended)
      # Layer 11: ~2x speedup (higher accuracy)
      # Layer 22: full model (maximum accuracy)
      target_layer: 6

      # Dimension reduction for faster similarity search
      # Options: 64, 128, 256, 512, 768
      target_dimension: 256

      preload_embeddings: true
      enable_soft_matching: true
      min_score_threshold: "0.5"

    # Use mmBERT in semantic cache
    semantic_cache:
      enabled: true
      backend_type: "memory"
      embedding_model: "mmbert"
      similarity_threshold: "0.85"
      max_entries: 5000
      ttl_seconds: 7200

See mmbert sample configuration for a complete example.

Example: Qwen3 with Redis Cache​

spec:
  config:
    embedding_models:
      qwen3_model_path: "models/qwen3-embedding"
      use_cpu: true

    semantic_cache:
      enabled: true
      backend_type: "redis"
      embedding_model: "qwen3"
      redis:
        connection:
          host: redis.cache-backends.svc.cluster.local
          port: 6379
        index:
          vector_field:
            dimension: 1024 # Qwen3 dimension

See redis cache sample configuration for a complete example.

Complexity-Aware Routing​

Route queries to different models based on complexity classification. Simple queries go to fast models, complex queries go to powerful models.

Example Configuration​

spec:
  # Configure multiple backends with different capabilities
  vllmEndpoints:
    - name: llama-8b-fast
      model: llama3-8b
      reasoningFamily: qwen3
      backend:
        type: kserve
        inferenceServiceName: llama-3-8b
      weight: 2 # Prefer for simple queries

    - name: llama-70b-reasoning
      model: llama3-70b
      reasoningFamily: deepseek
      backend:
        type: kserve
        inferenceServiceName: llama-3-70b
      weight: 1 # Use for complex queries

  config:
    # Define complexity rules
    complexity_rules:
      # Rule 1: Code complexity
      - name: "code-complexity"
        description: "Classify coding tasks by complexity"
        threshold: "0.3" # Lower threshold works better for embedding-based similarity

        # Examples of complex coding tasks
        hard:
          candidates:
            - "Implement a distributed lock manager with leader election"
            - "Design a database migration system with rollback support"
            - "Create a compiler optimization pass for loop unrolling"

        # Examples of simple coding tasks
        easy:
          candidates:
            - "Write a function to reverse a string"
            - "Create a class to represent a rectangle"
            - "Implement a simple counter with increment/decrement"

      # Rule 2: Reasoning complexity
      - name: "reasoning-complexity"
        description: "Classify reasoning and problem-solving tasks"
        threshold: "0.3" # Lower threshold works better for embedding-based similarity

        hard:
          candidates:
            - "Analyze the geopolitical implications of renewable energy adoption"
            - "Evaluate the ethical considerations of AI in healthcare"
            - "Design a multi-stage marketing strategy for a new product launch"

        easy:
          candidates:
            - "What is the capital of France?"
            - "How many days are in a week?"
            - "Name three common pets"

      # Rule 3: Domain-specific complexity with conditional application
      - name: "medical-complexity"
        description: "Classify medical queries (only for medical domain)"
        threshold: "0.3" # Lower threshold works better for embedding-based similarity

        hard:
          candidates:
            - "Differential diagnosis for chest pain with dyspnea"
            - "Treatment protocol for multi-drug resistant tuberculosis"

        easy:
          candidates:
            - "What is the normal body temperature?"
            - "What are common symptoms of a cold?"

        # Only apply this rule if domain signal indicates medical domain
        composer:
          operator: "AND"
          conditions:
            - type: "domain"
              name: "medical"

How it works:

  1. Incoming query is compared against hard and easy candidate examples
  2. Similarity scores determine complexity classification
  3. Output signals: {rule-name}:hard, {rule-name}:easy, or {rule-name}:medium
  4. Router uses signals to select appropriate backend model
  5. Composer allows conditional rule application based on other signals

See complexity routing sample configuration for a complete example.

Verify Deployment​

# Check the SemanticRouter resource
kubectl get semanticrouter my-router

# Check created resources
kubectl get deployment,service,configmap -l app.kubernetes.io/instance=my-router

# View status
kubectl describe semanticrouter my-router

# View logs
kubectl logs -f deployment/my-router

Expected output:

NAME                               PHASE     REPLICAS   READY   AGE
semanticrouter.vllm.ai/my-router   Running   2          2       5m

Backend Discovery Types​

The operator supports three types of backend discovery for connecting semantic router to vLLM model servers. Choose the type that matches your infrastructure.

KServe InferenceService Discovery​

For RHOAI 3.x or standalone KServe deployments. The operator automatically discovers the predictor service created by KServe.

spec:
  vllmEndpoints:
    - name: llama3-8b-endpoint
      model: llama3-8b
      reasoningFamily: qwen3
      backend:
        type: kserve
        inferenceServiceName: llama-3-8b # InferenceService in same namespace
      weight: 1

When to use:

  • Running on Red Hat OpenShift AI (RHOAI) 3.x
  • Using KServe for model serving
  • Want automatic service discovery

How it works:

  • Discovers the predictor service: {inferenceServiceName}-predictor
  • Uses port 8443 (KServe default HTTPS port)
  • Works in the same namespace as SemanticRouter
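
Before pointing an endpoint at an InferenceService, you can confirm it is ready and that KServe created its predictor service (names match the example above):

# The operator resolves the predictor service for the InferenceService
kubectl get inferenceservice llama-3-8b
kubectl get service llama-3-8b-predictor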

Llama Stack Service Discovery​

Discovers Llama Stack deployments using Kubernetes label selectors.

spec:
  vllmEndpoints:
    - name: llama-405b-endpoint
      model: llama-3.3-70b-instruct
      reasoningFamily: gpt
      backend:
        type: llamastack
        discoveryLabels:
          app: llama-stack
          model: llama-3.3-70b
      weight: 1

When to use:

  • Using Meta's Llama Stack for model serving
  • Multiple Llama Stack services with different models
  • Want label-based service discovery

How it works:

  • Lists services matching the label selector
  • Uses first matching service if multiple found
  • Extracts port from service definition
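
For discovery to succeed, the Llama Stack Service must carry labels that match discoveryLabels. A minimal sketch of such a Service follows (the name, selector, and port value are illustrative assumptions):

apiVersion: v1
kind: Service
metadata:
  name: llama-stack-70b        # illustrative name
  labels:
    app: llama-stack           # must match discoveryLabels in the CR
    model: llama-3.3-70b
spec:
  selector:
    app: llama-stack           # adjust to match your Llama Stack pods
  ports:
    - port: 8321               # operator reads the port from the service definition; value here is an assumption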

Direct Kubernetes Service​

Direct connection to any Kubernetes service (vLLM, TGI, etc.).

spec:
  vllmEndpoints:
    - name: custom-vllm-endpoint
      model: deepseek-r1-distill-qwen-7b
      reasoningFamily: deepseek
      backend:
        type: service
        service:
          name: vllm-deepseek
          namespace: vllm-serving # Can reference service in another namespace
          port: 8000
      weight: 1

When to use:

  • Direct vLLM deployments
  • Custom model servers with OpenAI-compatible API
  • Cross-namespace service references
  • Maximum control over service endpoints

How it works:

  • Connects to specified service directly
  • No discovery - uses explicit configuration
  • Supports cross-namespace references
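
Since no discovery happens, it is worth confirming that the target service answers the OpenAI-compatible API before wiring it in (service name, namespace, and port taken from the example above):

# In one terminal: forward the backend service locally
kubectl port-forward -n vllm-serving svc/vllm-deepseek 8000:8000

# In another terminal: list the models it serves
curl http://localhost:8000/v1/models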

Multiple Backends​

You can configure multiple backends with load balancing weights:

spec:
  vllmEndpoints:
    # KServe backend
    - name: llama3-8b
      model: llama3-8b
      reasoningFamily: qwen3
      backend:
        type: kserve
        inferenceServiceName: llama-3-8b
      weight: 2 # Higher weight = more traffic

    # Direct service backend
    - name: qwen-7b
      model: qwen2.5-7b
      reasoningFamily: qwen3
      backend:
        type: service
        service:
          name: vllm-qwen
          port: 8000
      weight: 1

Deployment Modes​

The operator supports two deployment modes with different architectures.

Standalone Mode (Default)​

Deploys semantic router with an Envoy sidecar container that acts as an ingress gateway.

Architecture:

Client → Service (8080) → Envoy Sidecar → ExtProc gRPC → Semantic Router → vLLM

When to use:

  • Simple deployments without existing service mesh
  • Testing and development
  • Self-contained deployment with minimal dependencies

Configuration:

spec:
  # No gateway configuration - defaults to standalone mode
  service:
    type: ClusterIP
    api:
      port: 8080 # Client traffic enters here
      targetPort: 8080 # Envoy ingress port
    grpc:
      port: 50051 # ExtProc communication
      targetPort: 50051

Operator behavior:

  • Deploys pod with two containers: semantic router + Envoy sidecar
  • Envoy handles ingress and forwards to semantic router via ExtProc gRPC
  • Status shows gatewayMode: "standalone"

Gateway Integration Mode​

Reuses an existing Gateway (Istio, Envoy Gateway, etc.) and creates an HTTPRoute.

Architecture:

Client → Gateway (Istio/Envoy) → HTTPRoute → Service (8080) → Semantic Router API → vLLM

When to use:

  • Existing Istio or Envoy Gateway deployment
  • Centralized ingress management
  • Multi-tenancy with shared gateway
  • Advanced traffic management (circuit breaking, retries, rate limiting)

Configuration:

spec:
  gateway:
    existingRef:
      name: istio-ingressgateway # Or your Envoy Gateway name
      namespace: istio-system

  # Service only needs API port in gateway mode
  service:
    type: ClusterIP
    api:
      port: 8080
      targetPort: 8080

Operator behavior:

  1. Creates HTTPRoute resource pointing to the specified Gateway
  2. Skips Envoy sidecar container in pod spec
  3. Sets status.gatewayMode: "gateway-integration"
  4. Semantic router operates in pure API mode (no ExtProc)

Example: See vllm.ai_v1alpha1_semanticrouter_gateway.yaml

OpenShift Routes​

For OpenShift deployments, the operator can create Routes for external access with TLS termination.

Basic OpenShift Route​

spec:
  openshift:
    routes:
      enabled: true
      hostname: semantic-router.apps.openshift.example.com # Optional - auto-generated if omitted
      tls:
        termination: edge # TLS terminates at Route, plain HTTP to backend
        insecureEdgeTerminationPolicy: Redirect # Redirect HTTP to HTTPS

TLS Termination Options​

  • edge (recommended): TLS terminates at Route, plain HTTP to backend
  • passthrough: TLS passthrough to backend (requires backend TLS)
  • reencrypt: TLS terminates at Route, re-encrypts to backend

When to Use OpenShift Routes​

  • Running on OpenShift 4.x
  • Need external access without configuring Ingress
  • Want auto-generated hostnames
  • Require OpenShift-native TLS management

Status Information​

After creating a Route, check the status:

kubectl get semanticrouter my-router -o jsonpath='{.status.openshiftFeatures}'

Output:

{
  "routesEnabled": true,
  "routeHostname": "semantic-router-default.apps.openshift.example.com"
}

Example: See vllm.ai_v1alpha1_semanticrouter_route.yaml

Choosing Your Configuration​

Use this decision tree to select the right configuration:

┌─ Need to run on OpenShift?
│ ├─ YES → Use openshift sample (Routes + KServe/service backends)
│ └─ NO ↓
│
├─ Have existing Gateway (Istio/Envoy)?
│ ├─ YES → Use gateway sample (Gateway integration mode)
│ └─ NO ↓
│
├─ Using Meta Llama Stack?
│ ├─ YES → Use llamastack sample
│ └─ NO ↓
│
└─ Simple deployment → Use simple sample (standalone mode)

Backend choice:

┌─ Running RHOAI 3.x or KServe?
│ ├─ YES → Use KServe backend type
│ └─ NO ↓
│
├─ Using Meta Llama Stack?
│ ├─ YES → Use llamastack backend type
│ └─ NO ↓
│
└─ Have direct vLLM service? → Use service backend type

Architecture​

The operator manages a complete stack of resources for each SemanticRouter:

┌─────────────────────────────────────────────────────┐
│ SemanticRouter CR                                    │
│   apiVersion: vllm.ai/v1alpha1                       │
│   kind: SemanticRouter                               │
└──────────────────────────┬──────────────────────────┘
                           │
                           ▼
                ┌─────────────────────┐
                │ Operator Controller │
                │ - Watches CR        │
                │ - Reconciles state  │
                │ - Platform detection│
                └──────────┬──────────┘
                           │
      ┌─────────────┬──────┴──────┬────────────┐
      ▼             ▼             ▼            ▼
 ┌──────────┐ ┌───────────┐ ┌───────────┐ ┌──────────┐
 │Deployment│ │ Service   │ │ ConfigMap │ │   PVC    │
 │          │ │ - gRPC    │ │ - config  │ │ - models │
 │          │ │ - API     │ │ - tools   │ │          │
 │          │ │ - metrics │ │           │ │          │
 └──────────┘ └───────────┘ └───────────┘ └──────────┘

Managed Resources:

  • Deployment: Runs semantic router pods with configurable replicas
  • Service: Exposes gRPC (50051), HTTP API (8080), and metrics (9190)
  • ConfigMap: Contains semantic router configuration and tools database
  • ServiceAccount: For RBAC (optional, created when specified)
  • PersistentVolumeClaim: For ML model storage (optional, when persistence enabled)
  • HorizontalPodAutoscaler: For auto-scaling (optional, when autoscaling enabled)
  • Ingress: For external access (optional, when ingress enabled)
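
You can list everything the operator created for a given instance with the shared instance label:

kubectl get deployment,service,configmap,serviceaccount,pvc,hpa,ingress -l app.kubernetes.io/instance=my-router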

Platform Detection and Security​

The operator automatically detects the platform and configures security contexts appropriately.

OpenShift Platform​

When running on OpenShift, the operator:

  • Detects: Checks for route.openshift.io API resources
  • Security Context: Does NOT set runAsUser, runAsGroup, or fsGroup
  • Rationale: Lets OpenShift SCCs assign UIDs/GIDs from the namespace's allowed range
  • Compatible with: restricted SCC (default) and custom SCCs

Standard Kubernetes​

When running on standard Kubernetes, the operator:

  • Security Context: Sets runAsUser: 1000, fsGroup: 1000, runAsNonRoot: true
  • Rationale: Provides secure defaults for pod security policies/standards

Both Platforms​

Regardless of platform:

  • Drops ALL capabilities (drop: [ALL])
  • Prevents privilege escalation (allowPrivilegeEscalation: false)
  • No special permissions or SCCs required beyond defaults
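
To see which security context was actually applied to a running pod, you can inspect it with the instance label used elsewhere in this guide:

kubectl get pod -l app.kubernetes.io/instance=my-router \
  -o jsonpath='{.items[0].spec.securityContext}'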

Override Security Context​

You can override automatic security contexts in your CR:

spec:
  # Container security context
  securityContext:
    runAsNonRoot: true
    runAsUser: 2000
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL

  # Pod security context
  podSecurityContext:
    runAsNonRoot: true
    runAsUser: 2000
    fsGroup: 2000

OpenShift Note

When running on OpenShift, it's recommended to omit runAsUser and fsGroup and let SCCs handle UID/GID assignment automatically.

Configuration Reference​

Image Configuration​

spec:
  image:
    repository: ghcr.io/vllm-project/semantic-router/extproc
    tag: latest
    pullPolicy: IfNotPresent
    imageRegistry: "" # Optional: custom registry prefix

  # Optional: Image pull secrets
  imagePullSecrets:
    - name: ghcr-secret

Service Configuration​

spec:
  service:
    type: ClusterIP # or NodePort, LoadBalancer

    grpc:
      port: 50051
      targetPort: 50051

    api:
      port: 8080
      targetPort: 8080

    metrics:
      enabled: true
      port: 9190
      targetPort: 9190

Persistence Configuration​

spec:
  persistence:
    enabled: true
    storageClassName: "standard" # Adjust for your cluster
    accessMode: ReadWriteOnce
    size: 10Gi

    # Optional: Use existing PVC
    existingClaim: "my-existing-pvc"

    # Optional: PVC annotations
    annotations:
      backup.velero.io/backup-volumes: "models"

Storage Validation

The operator validates that the specified StorageClass exists before creating the PVC. If storageClassName is omitted, the cluster's default StorageClass is used.

Storage Class Examples:

  • AWS EKS: gp3-csi, gp2
  • GKE: standard, premium-rwo
  • Azure AKS: managed, managed-premium
  • OpenShift: gp3-csi, thin, ocs-storagecluster-ceph-rbd

Semantic Cache Backends​

The operator supports multiple cache backends for semantic caching, which significantly improves latency and reduces token usage by caching similar queries and their responses.

Prerequisites

The operator does not deploy Redis or Milvus. You must deploy these services separately in your cluster before using them as cache backends. The operator only configures the SemanticRouter to connect to your existing Redis/Milvus deployment.

For deployment examples, see the Redis and Milvus sections below.

Alternative: If you prefer automatic deployment of Redis/Milvus, consider using the Helm chart, which can deploy cache backends as Helm chart dependencies.

Supported Backends​

1. Memory Cache (Default)​

Simple in-memory cache suitable for development and small deployments.

Characteristics:

  • No external dependencies
  • Fast access
  • Not persistent (cleared on restart)
  • Limited by pod memory

Configuration:

spec:
  config:
    bert_model:
      model_id: models/mom-embedding-light
      threshold: "0.6"
      use_cpu: true

    semantic_cache:
      enabled: true
      backend_type: memory
      similarity_threshold: "0.8"
      max_entries: 1000
      ttl_seconds: 3600
      eviction_policy: fifo # fifo, lru, or lfu

When to use:

  • Development and testing
  • Small deployments (<1000 cached queries)
  • No persistence requirements

2. Redis Cache

High-performance distributed cache using Redis with vector search capabilities.

Characteristics:

  • Distributed and scalable
  • Persistent storage (with AOF/RDB)
  • HNSW or FLAT indexing
  • Wide ecosystem support

Prerequisites:

  • Redis 7.0+ with RediSearch module
  • Create Kubernetes Secret for password:
kubectl create secret generic redis-credentials \
--from-literal=password='your-redis-password'

Configuration:

spec:
  config:
    bert_model:
      model_id: models/mom-embedding-light
      threshold: "0.6"
      use_cpu: true

    semantic_cache:
      enabled: true
      backend_type: redis
      similarity_threshold: "0.85"
      ttl_seconds: 3600
      embedding_model: bert

      redis:
        connection:
          host: redis.default.svc.cluster.local
          port: 6379
          database: 0
          password_secret_ref:
            name: redis-credentials
            key: password
          timeout: 30
          tls:
            enabled: false

        index:
          name: semantic_cache_idx
          prefix: "cache:"
          vector_field:
            name: embedding
            dimension: 384 # Match your embedding model
            metric_type: COSINE
          index_type: HNSW
          params:
            M: 16
            efConstruction: 64

        search:
          topk: 1

        development:
          auto_create_index: true
          verbose_errors: true

When to use:

  • Production deployments with moderate scale
  • Need persistence and high availability
  • Existing Redis infrastructure
  • Fast in-memory performance required

Example: vllm.ai_v1alpha1_semanticrouter_redis_cache.yaml

3. Milvus Cache​

Enterprise-grade vector database for production deployments with large cache volumes.

Characteristics:

  • Highly scalable and distributed
  • Advanced indexing (HNSW, IVF, etc.)
  • Built-in data lifecycle management
  • High availability support

Prerequisites:

  • Milvus 2.3+ (standalone or cluster)
  • Create Kubernetes Secret for credentials:
kubectl create secret generic milvus-credentials \
--from-literal=password='your-milvus-password'

Configuration:

spec:
  config:
    bert_model:
      model_id: models/mom-embedding-light
      threshold: "0.6"
      use_cpu: true

    semantic_cache:
      enabled: true
      backend_type: milvus
      similarity_threshold: "0.90"
      ttl_seconds: 7200
      embedding_model: bert

      milvus:
        connection:
          host: milvus-standalone.default.svc.cluster.local
          port: 19530
          database: semantic_router_cache
          timeout: 30
          auth:
            enabled: true
            username: root
            password_secret_ref:
              name: milvus-credentials
              key: password

        collection:
          name: semantic_cache
          description: "Semantic cache for LLM responses"
          vector_field:
            name: embedding
            dimension: 384 # Match your embedding model
            metric_type: IP
          index:
            type: HNSW
            params:
              M: 16
              efConstruction: 64

        search:
          params:
            ef: 64
          topk: 10
          consistency_level: Session

        performance:
          connection_pool:
            max_connections: 10
            max_idle_connections: 5
          batch:
            insert_batch_size: 100

        data_management:
          ttl:
            enabled: true
            timestamp_field: created_at
            cleanup_interval: 3600

        development:
          auto_create_collection: true

When to use:

  • Large-scale production deployments
  • Need advanced vector search capabilities
  • Require data lifecycle management (TTL, compaction)
  • High availability and scalability requirements

Example: vllm.ai_v1alpha1_semanticrouter_milvus_cache.yaml

4. Hybrid Cache​

Combines in-memory HNSW index with persistent Milvus storage for optimal performance and durability.

Characteristics:

  • Fast in-memory search with HNSW
  • Persistent storage in Milvus
  • Best of both worlds
  • Automatic synchronization

Configuration:

spec:
  config:
    bert_model:
      model_id: models/mom-embedding-light
      threshold: "0.6"
      use_cpu: true

    semantic_cache:
      enabled: true
      backend_type: hybrid
      similarity_threshold: "0.85"
      ttl_seconds: 3600
      max_entries: 5000
      eviction_policy: lru
      embedding_model: bert

      # HNSW in-memory configuration
      hnsw:
        use_hnsw: true
        hnsw_m: 32
        hnsw_ef_construction: 128
        max_memory_entries: 5000

      # Milvus persistent storage (same config as milvus backend)
      milvus:
        connection:
          host: milvus-standalone.default.svc.cluster.local
          port: 19530
        # ... rest of milvus config

When to use:

  • Need fastest possible cache lookups
  • Require persistence and durability
  • Willing to trade memory for performance
  • High-throughput production deployments

Example: vllm.ai_v1alpha1_semanticrouter_hybrid_cache.yaml

Deploying Redis​

Before using Redis cache backend, deploy Redis with RediSearch module to your cluster:

---
apiVersion: v1
kind: Namespace
metadata:
  name: cache-backends
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  namespace: cache-backends
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis/redis-stack-server:latest
          ports:
            - containerPort: 6379
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
---
apiVersion: v1
kind: Service
metadata:
  name: redis
  namespace: cache-backends
spec:
  type: ClusterIP
  ports:
    - port: 6379
      targetPort: 6379
  selector:
    app: redis

Apply the configuration:

kubectl apply -f redis-deployment.yaml

Create the credentials Secret:

kubectl create secret generic redis-credentials \
--from-literal=password='' # Empty for no password, or set your password

For production deployments, consider using a Redis operator or Helm chart, or a managed Redis service, instead of this minimal Deployment (see the Production Best Practices note below).

Deploying Milvus​

Before using Milvus cache backend, deploy Milvus to your cluster:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: milvus-standalone
  namespace: cache-backends
spec:
  replicas: 1
  selector:
    matchLabels:
      app: milvus
  template:
    metadata:
      labels:
        app: milvus
    spec:
      containers:
        - name: milvus
          image: milvusdb/milvus:v2.4.0
          command: ["milvus", "run", "standalone"]
          ports:
            - containerPort: 19530
              name: grpc
            - containerPort: 9091
              name: metrics
          env:
            - name: ETCD_USE_EMBED
              value: "true"
            - name: COMMON_STORAGETYPE
              value: "local"
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: 2000m
              memory: 4Gi
          volumeMounts:
            - name: milvus-data
              mountPath: /var/lib/milvus
      volumes:
        - name: milvus-data
          emptyDir: {} # Use PVC for production
---
apiVersion: v1
kind: Service
metadata:
  name: milvus-standalone
  namespace: cache-backends
spec:
  type: ClusterIP
  ports:
    - port: 19530
      targetPort: 19530
      name: grpc
  selector:
    app: milvus

Apply the configuration:

kubectl apply -f milvus-deployment.yaml

Create the credentials Secret:

kubectl create secret generic milvus-credentials \
--from-literal=password='Milvus' # Default Milvus password

For production deployments, consider using the Milvus Operator or Helm chart, or a managed vector database service, instead of this minimal standalone Deployment (see the Production Best Practices note below).

Production Best Practices

For production cache backends:

  1. Use persistent volumes (not emptyDir)
  2. Enable authentication and TLS
  3. Configure resource limits appropriately
  4. Set up monitoring and alerting
  5. Use operators or Helm charts for easier management
  6. Consider managed services for reduced operational overhead
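
As a sketch of item 1, the emptyDir volume in the Milvus example above could be replaced with a PersistentVolumeClaim (storage class and size are assumptions; adjust for your cluster):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: milvus-data
  namespace: cache-backends
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: standard   # assumption: use a class available in your cluster
  resources:
    requests:
      storage: 20Gi            # assumption: size to your expected cache volume

# Then reference it from the Deployment instead of emptyDir:
#   volumes:
#     - name: milvus-data
#       persistentVolumeClaim:
#         claimName: milvus-data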

Embedding Models​

The semantic cache supports different embedding models for similarity calculation:

  • bert (default): Lightweight, 384 dimensions, good for general use
  • qwen3: Higher quality, 1024 dimensions, better accuracy
  • gemma: Balanced, 768 dimensions, moderate performance

Configure via:

spec:
  config:
    semantic_cache:
      embedding_model: bert # or qwen3, gemma

Note: Ensure the dimension in cache config matches your chosen embedding model.

Migration Between Backends​

Migrating from memory cache to Redis or Milvus is straightforward:

  1. Deploy Redis or Milvus in your cluster
  2. Create the credentials Secret
  3. Update SemanticRouter CR with new backend configuration
  4. Apply the changes - operator will perform rolling update

The cache will be empty after migration but will populate naturally as queries are processed.
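
For example, moving the earlier my-router example from the in-memory cache to Redis only requires updating the semantic_cache block (Redis host as in the Redis deployment example above):

spec:
  config:
    semantic_cache:
      enabled: true
      backend_type: redis          # was: memory
      similarity_threshold: "0.85"
      redis:
        connection:
          host: redis.cache-backends.svc.cluster.local
          port: 6379
          password_secret_ref:
            name: redis-credentials
            key: password

Apply the change with kubectl apply -f my-router.yaml and the operator performs the rolling update.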

Cache Configuration Reference​

For detailed configuration options, use kubectl explain:

# Redis cache configuration
kubectl explain semanticrouter.spec.config.semantic_cache.redis

# Milvus cache configuration
kubectl explain semanticrouter.spec.config.semantic_cache.milvus

# HNSW configuration
kubectl explain semanticrouter.spec.config.semantic_cache.hnsw

Semantic Router Configuration​

Full semantic router configuration is embedded in the CR. See the complete examples above and in deploy/operator/config/samples/.

Key configuration sections:

spec:
  config:
    # BERT model for embeddings (required for semantic cache)
    bert_model:
      model_id: "models/mom-embedding-light"
      threshold: 0.6
      use_cpu: true

    # Semantic cache (see Semantic Cache Backends section above)
    semantic_cache:
      enabled: true
      backend_type: "memory" # or redis, milvus, hybrid
      similarity_threshold: 0.8
      max_entries: 1000
      ttl_seconds: 3600
      eviction_policy: "fifo"

    # Tools auto-selection
    tools:
      enabled: true
      top_k: 3
      similarity_threshold: 0.2
      tools_db_path: "config/tools_db.json"
      fallback_to_empty: true

    # Prompt guard (jailbreak detection)
    prompt_guard:
      enabled: true
      model_id: "models/mom-jailbreak-classifier"
      threshold: 0.7
      use_cpu: true

    # Classifiers
    classifier:
      category_model:
        model_id: "models/lora_intent_classifier_bert-base-uncased_model"
        threshold: 0.6
        use_cpu: true
      pii_model:
        model_id: "models/pii_classifier_modernbert-base_presidio_token_model"
        threshold: 0.7
        use_cpu: true

    # Reasoning configuration per model family
    reasoning_families:
      deepseek:
        type: "chat_template_kwargs"
        parameter: "thinking"
      qwen3:
        type: "chat_template_kwargs"
        parameter: "enable_thinking"
      gpt:
        type: "reasoning_effort"
        parameter: "reasoning_effort"

    # API batch classification
    api:
      batch_classification:
        max_batch_size: 100
        concurrency_threshold: 5
        max_concurrency: 8
        metrics:
          enabled: true
          detailed_goroutine_tracking: true
          sample_rate: 1.0

    # Observability
    observability:
      tracing:
        enabled: false
        provider: "opentelemetry"
        exporter:
          type: "otlp"
          endpoint: "jaeger:4317"

Tools Database​

Define available tools for auto-selection:

spec:
  toolsDb:
    - tool:
        type: "function"
        function:
          name: "search_web"
          description: "Search the web for information"
          parameters:
            type: "object"
            properties:
              query:
                type: "string"
                description: "Search query"
            required: ["query"]
      description: "Search the internet, web search, find information online"
      category: "search"
      tags: ["search", "web", "internet"]

    - tool:
        type: "function"
        function:
          name: "calculate"
          description: "Perform mathematical calculations"
          parameters:
            type: "object"
            properties:
              expression:
                type: "string"
            required: ["expression"]
      description: "Calculate mathematical expressions"
      category: "math"
      tags: ["math", "calculation"]

Autoscaling (HPA)​

spec:
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 80
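
After the change is applied, the generated HorizontalPodAutoscaler can be checked with the instance label:

kubectl get hpa -l app.kubernetes.io/instance=my-router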

Ingress Configuration​

spec:
  ingress:
    enabled: true
    className: "nginx" # or "haproxy", "traefik", etc.
    annotations:
      cert-manager.io/cluster-issuer: "letsencrypt-prod"
    hosts:
      - host: router.example.com
        paths:
          - path: /
            pathType: Prefix
            servicePort: 8080
    tls:
      - secretName: router-tls
        hosts:
          - router.example.com

Production Deployment​

High Availability Setup​

apiVersion: vllm.ai/v1alpha1
kind: SemanticRouter
metadata:
  name: prod-router
spec:
  replicas: 3

  # Anti-affinity for spreading across nodes
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/instance: prod-router
          topologyKey: kubernetes.io/hostname

  # Autoscaling
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 20
    targetCPUUtilizationPercentage: 70

  # Production resources
  resources:
    limits:
      memory: "10Gi"
      cpu: "4"
    requests:
      memory: "5Gi"
      cpu: "2"

  # Strict probes
  livenessProbe:
    enabled: true
    initialDelaySeconds: 60
    periodSeconds: 30
    failureThreshold: 3

  readinessProbe:
    enabled: true
    initialDelaySeconds: 30
    periodSeconds: 10
    failureThreshold: 3

Pod Disruption Budget​

Create a PDB to ensure availability during updates:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: prod-router-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/instance: prod-router

Resource Allocation Guidelines​

Workload Type   Memory Request   CPU Request   Memory Limit   CPU Limit
Development     1Gi              500m          2Gi            1
Staging         3Gi              1             7Gi            2
Production      5Gi              2             10Gi           4

Monitoring and Observability​

Metrics​

Prometheus metrics are exposed on port 9190:

# Port-forward to access metrics locally
kubectl port-forward svc/my-router 9190:9190

# View metrics
curl http://localhost:9190/metrics

Key Metrics:

  • semantic_router_request_duration_seconds - Request latency
  • semantic_router_cache_hit_total - Cache hit rate
  • semantic_router_classification_duration_seconds - Classification latency
  • semantic_router_tokens_total - Token usage
  • semantic_router_reasoning_requests_total - Reasoning mode usage
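
A quick way to spot-check these counters after the port-forward above (exact metric and label sets may vary between releases):

curl -s http://localhost:9190/metrics | grep -E 'semantic_router_(cache_hit_total|request_duration_seconds)'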

ServiceMonitor (Prometheus Operator)​

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: semantic-router-metrics
spec:
  selector:
    matchLabels:
      app.kubernetes.io/instance: my-router
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics

Distributed Tracing​

Enable OpenTelemetry tracing:

spec:
  config:
    observability:
      tracing:
        enabled: true
        provider: "opentelemetry"
        exporter:
          type: "otlp"
          endpoint: "jaeger-collector:4317"
          insecure: true
        sampling:
          type: "always_on"
          rate: 1.0

Troubleshooting​

Common Issues​

Backend Discovery Failures​

Symptom: "No backends found" or "Failed to discover backend" in logs

For KServe backends:

# Check InferenceService exists and is ready
kubectl get inferenceservice llama-3-8b

# Check predictor service was created by KServe
kubectl get service llama-3-8b-predictor

# Verify InferenceService status
kubectl describe inferenceservice llama-3-8b

For Llama Stack backends:

# Verify services exist with correct labels
kubectl get services -l app=llama-stack,model=llama-3.3-70b

# Check service labels match discoveryLabels in CR
kubectl get service <service-name> -o jsonpath='{.metadata.labels}'

For direct service backends:

# Verify service exists in specified namespace
kubectl get service vllm-deepseek -n vllm-serving

# Check service has ports defined
kubectl get service vllm-deepseek -n vllm-serving -o jsonpath='{.spec.ports[0]}'

Gateway Integration Issues​

Symptom: HTTPRoute not created or traffic not reaching semantic router

# Verify Gateway exists
kubectl get gateway istio-ingressgateway -n istio-system

# Check HTTPRoute was created
kubectl get httproute -l app.kubernetes.io/instance=my-router

# Verify Gateway supports HTTPRoute (Gateway API v1)
kubectl get gateway istio-ingressgateway -n istio-system -o yaml | grep -A5 listeners

# Check operator status
kubectl get semanticrouter my-router -o jsonpath='{.status.gatewayMode}'
# Should show: "gateway-integration"

OpenShift Route Issues​

Symptom: Route not created on OpenShift

# Verify running on OpenShift cluster
kubectl api-resources | grep route.openshift.io

# Check if Route was created
kubectl get route -l app.kubernetes.io/instance=my-router

# Check operator detected OpenShift
kubectl logs -n semantic-router-operator-system \
deployment/semantic-router-operator-controller-manager \
| grep -i "openshift\|route"

# Verify Route status
kubectl get semanticrouter my-router -o jsonpath='{.status.openshiftFeatures}'

Pod stuck in ImagePullBackOff​

# Check image pull secrets
kubectl describe pod <pod-name>

# Create image pull secret
kubectl create secret docker-registry ghcr-secret \
--docker-server=ghcr.io \
--docker-username=<username> \
--docker-password=<personal-access-token>

# Add to SemanticRouter
spec:
  imagePullSecrets:
    - name: ghcr-secret

PVC stuck in Pending​

# Check storage class exists
kubectl get storageclass

# Check PVC events
kubectl describe pvc my-router-models

# Update storage class in CR
spec:
  persistence:
    storageClassName: "your-available-storage-class"

Models not downloading​

# Check if HF token secret exists
kubectl get secret hf-token-secret

# Create HF token secret
kubectl create secret generic hf-token-secret \
--from-literal=token=hf_xxxxxxxxxxxxx

# Add to SemanticRouter CR
spec:
  env:
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          name: hf-token-secret
          key: token

Operator not detecting platform correctly​

# Check operator logs for platform detection
kubectl logs -n semantic-router-operator-system \
deployment/semantic-router-operator-controller-manager \
| grep -i "platform\|openshift"

# Should see one of:
# "Detected OpenShift platform - will use OpenShift-compatible security contexts"
# "Detected standard Kubernetes platform - will use standard security contexts"

Development​

Local Development​

cd deploy/operator

# Run tests
make test

# Generate CRDs and code
make generate
make manifests

# Build operator binary
make build

# Run locally against your kubeconfig
make run

Testing with kind​

# Create kind cluster
kind create cluster --name operator-test

# Build and load image
make docker-build IMG=semantic-router-operator:dev
kind load docker-image semantic-router-operator:dev --name operator-test

# Deploy
make install
make deploy IMG=semantic-router-operator:dev

# Create test instance
kubectl apply -f config/samples/vllm_v1alpha1_semanticrouter.yaml

Next Steps​