Model Serving Architecture
Moving models from notebooks to production requires a robust serving infrastructure. This guide covers patterns for serving ML models at scale.
Serving Architecture Patterns
┌─────────────────────────────────────────────────────┐
│ Load Balancer │
│ (NGINX / Istio Gateway) │
├─────────────────────────────────────────────────────┤
│ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Model A │ │ Model B │ │ Model C │ │
│ │ (Triton) │ │ (vLLM) │ │ (TorchServe)│ │
│ │ GPU: A100 │ │ GPU: A100 │ │ CPU only │ │
│ └────────────┘ └────────────┘ └────────────┘ │
│ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Model Registry (MLflow) │ │
│ └──────────────────────────────────────────────┘ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Feature Store (Feast) │ │
│ └──────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘
Serving Frameworks Comparison
| Framework | Best For | Protocol | GPU Support | Auto-scaling |
|---|---|---|---|---|
| Triton | Multi-framework models | gRPC/HTTP | Excellent | Yes |
| vLLM | LLM inference | OpenAI-compatible | Excellent | Yes |
| TorchServe | PyTorch models | REST/gRPC | Good | Yes |
| TF Serving | TensorFlow models | REST/gRPC | Good | Yes |
| KServe | K8s-native serving | REST/gRPC | Good | Built-in |
Deploying with Triton Inference Server
Model Repository Structure
model_repository/
├── text_classification/
│ ├── config.pbtxt
│ └── 1/
│ └── model.onnx
├── image_detection/
│ ├── config.pbtxt
│ └── 1/
│ └── model.plan
└── embeddings/
├── config.pbtxt
└── 1/
└── model.pt
Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: triton-inference
namespace: ml-serving
spec:
replicas: 2
selector:
matchLabels:
app: triton
template:
metadata:
labels:
app: triton
spec:
containers:
- name: triton
image: nvcr.io/nvidia/tritonserver:24.01-py3
args:
- tritonserver
- --model-repository=s3://models/repository
- --log-verbose=1
ports:
- containerPort: 8000 # HTTP
- containerPort: 8001 # gRPC
- containerPort: 8002 # Metrics
resources:
limits:
nvidia.com/gpu: 1
memory: "16Gi"
requests:
cpu: "4"
memory: "8Gi"
livenessProbe:
httpGet:
path: /v2/health/live
port: 8000
initialDelaySeconds: 30
readinessProbe:
httpGet:
path: /v2/health/ready
port: 8000
---
apiVersion: v1
kind: Service
metadata:
name: triton-service
spec:
selector:
app: triton
ports:
- name: http
port: 8000
- name: grpc
port: 8001
- name: metrics
port: 8002
Serving LLMs with vLLM
vLLM provides high-throughput LLM serving with PagedAttention:
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-server
spec:
replicas: 1
selector:
matchLabels:
app: vllm
template:
spec:
containers:
- name: vllm
image: vllm/vllm-openai:latest
args:
- --model=meta-llama/Llama-3-8B-Instruct
- --tensor-parallel-size=1
- --max-model-len=4096
- --gpu-memory-utilization=0.9
ports:
- containerPort: 8000
resources:
limits:
nvidia.com/gpu: 1
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-secret
key: token
Query the LLM (OpenAI-compatible API)
from openai import OpenAI
client = OpenAI(
base_url="http://vllm-service:8000/v1",
api_key="not-needed"
)
response = client.chat.completions.create(
model="meta-llama/Llama-3-8B-Instruct",
messages=[
{"role": "user", "content": "Explain AIOps in 3 sentences."}
],
max_tokens=200
)
print(response.choices[0].message.content)
Auto-Scaling Model Servers
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: triton-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: triton-inference
minReplicas: 1
maxReplicas: 8
metrics:
- type: Pods
pods:
metric:
name: nv_inference_queue_duration_us
target:
type: AverageValue
averageValue: "1000" # Scale up when queue > 1ms
Best Practices
- Batch requests — group inference requests for higher GPU utilization
- Model caching — pre-load models to avoid cold starts
- Health checks — implement both liveness and readiness probes
- A/B testing — use traffic splitting for model version testing
- Monitor latency — track p50, p95, p99 inference latency
Next Steps
- GPU Cluster Setup — Configure GPU infrastructure
- MLOps Pipelines — Automate model lifecycle