Kubernetes Operations Guide

This guide covers day-2 Kubernetes operations — keeping your clusters secure, observable, and running smoothly in production.

Cluster Architecture

┌──────────────────────────────────────────────────────────┐
│                   Production Cluster                      │
│                                                           │
│  Control Plane (3 nodes, HA)                             │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐              │
│  │ API Server│  │ API Server│  │ API Server│             │
│  │ etcd      │  │ etcd      │  │ etcd      │            │
│  │ Scheduler │  │ Controller│  │           │             │
│  └──────────┘  └──────────┘  └──────────┘              │
│                                                           │
│  Worker Nodes                                             │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌────────┐ │
│  │ App Pods  │  │ App Pods  │  │ App Pods  │  │GPU Pods│  │
│  │ Sidecar   │  │ Sidecar   │  │ Sidecar   │  │ML Jobs │  │
│  │ (Envoy)   │  │ (Envoy)   │  │ (Envoy)   │  │        │  │
│  └──────────┘  └──────────┘  └──────────┘  └────────┘ │
│                                                           │
│  System Components                                        │
│  ┌────────────────────────────────────────────────────┐  │
│  │ Ingress Controller │ cert-manager │ external-dns   │  │
│  │ Prometheus │ Grafana │ Loki │ Karpenter            │  │
│  └────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────┘

Resource Management

Resource Quotas per Namespace

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-backend
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "50"
    services: "10"
    persistentvolumeclaims: "20"

Limit Ranges (Default Pod Resources)

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-backend
spec:
  limits:
  - default:
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:
      cpu: "100m"
      memory: "128Mi"
    max:
      cpu: "4"
      memory: "8Gi"
    min:
      cpu: "50m"
      memory: "64Mi"
    type: Container

Auto-Scaling

Horizontal Pod Autoscaler (HPA)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 20
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5min before scale down
      policies:
      - type: Percent
        value: 10           # Scale down 10% at a time
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
      - type: Percent
        value: 50           # Scale up 50% at a time
        periodSeconds: 60
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"

Cluster Autoscaler with Karpenter

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64", "arm64"]
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand", "spot"]
      - key: node.kubernetes.io/instance-type
        operator: In
        values: ["m6i.large", "m6i.xlarge", "m7g.large", "m7g.xlarge"]
  limits:
    cpu: 100
    memory: 400Gi
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h  # Replace nodes every 30 days

Security Hardening

Network Policies

# Deny all ingress by default
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-ingress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress

---
# Allow specific traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-to-db
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: database
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: api-server
    ports:
    - protocol: TCP
      port: 5432

Pod Security Standards

apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

Troubleshooting Runbook

Pod Not Starting

# Check pod events
kubectl describe pod <pod-name> -n <namespace>

# Check container logs
kubectl logs <pod-name> -n <namespace> --previous

# Common causes:
# - ImagePullBackOff: wrong image name or missing credentials
# - CrashLoopBackOff: application crash, check logs
# - Pending: insufficient resources, check node capacity
# - OOMKilled: increase memory limits

High Latency Investigation

# Check pod resource usage
kubectl top pods -n <namespace> --sort-by=cpu

# Check node resource pressure
kubectl top nodes

# Check for throttled pods
kubectl get pods -n <namespace> -o json | \
  jq '.items[] | select(.status.containerStatuses[]?.state.waiting.reason == "CrashLoopBackOff") | .metadata.name'

# Check HPA status
kubectl get hpa -n <namespace>

# Check for pending pods (resource starvation)
kubectl get pods --field-selector=status.phase=Pending -A

Cluster Health Check Script

#!/bin/bash
echo "=== Node Status ==="
kubectl get nodes -o wide

echo -e "\n=== Pod Issues ==="
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

echo -e "\n=== Resource Usage ==="
kubectl top nodes

echo -e "\n=== Recent Events (Warnings) ==="
kubectl get events -A --field-selector type=Warning --sort-by='.lastTimestamp' | tail -20

echo -e "\n=== PVC Issues ==="
kubectl get pvc -A --field-selector=status.phase!=Bound

Next Steps

CI/CD Pipeline Patterns — Deploy to Kubernetes
Observability Stack — Monitor your clusters
Terraform Best Practices — IaC for K8s

Cluster Architecture​

Resource Management​

Resource Quotas per Namespace​

Limit Ranges (Default Pod Resources)​

Auto-Scaling​

Horizontal Pod Autoscaler (HPA)​

Cluster Autoscaler with Karpenter​

Security Hardening​

Network Policies​

Pod Security Standards​

Troubleshooting Runbook​

Pod Not Starting​

High Latency Investigation​

Cluster Health Check Script​

Next Steps​

Cluster Architecture

Resource Management

Resource Quotas per Namespace

Limit Ranges (Default Pod Resources)

Auto-Scaling

Horizontal Pod Autoscaler (HPA)

Cluster Autoscaler with Karpenter

Security Hardening

Network Policies

Pod Security Standards

Troubleshooting Runbook

Pod Not Starting

High Latency Investigation

Cluster Health Check Script

Next Steps