Kubernetes in Production: The 15-Point Checklist Every Team Needs
Running Kubernetes in development is easy. Running it in production is where teams get burned. After helping dozens of teams move to production Kubernetes, here's the checklist we use in every engagement.
Cluster Architecture
1. Multi-AZ / Multi-Region Deployment
Production clusters should span at least 3 availability zones. Single-AZ clusters are a single point of failure.
# EKS node group spread across AZs
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
name: production
region: us-east-1
availabilityZones:
- us-east-1a
- us-east-1b
- us-east-1c
2. Dedicated Node Pools
Separate workloads by node pool type:
- System pool — CoreDNS, kube-proxy, monitoring agents
- Application pool — Your services
- Spot/Preemptible pool — Batch jobs, non-critical workloads
3. Resource Requests and Limits
Every pod must define resource requests and limits. Without them, the scheduler can't make intelligent placement decisions.
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
Security
4. RBAC with Least Privilege
Never use cluster-admin for application service accounts. Define granular roles:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: app-reader
namespace: production
rules:
- apiGroups: [""]
resources: ["pods", "services"]
verbs: ["get", "list", "watch"]
5. Network Policies
Default deny all traffic, then explicitly allow what's needed:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
6. Pod Security Standards
Enforce pod security standards to prevent privileged containers:
- No root containers
- Read-only root filesystem
- No privilege escalation
- Drop all capabilities
7. Secrets Management
Never store secrets in plain Kubernetes Secrets. Use:
- External Secrets Operator with AWS Secrets Manager or HashiCorp Vault
- Sealed Secrets for GitOps workflows
- SOPS for encrypted secrets in Git
Observability
8. Metrics (Prometheus + Grafana)
Monitor cluster health, node resources, and application metrics:
- Cluster-level: node CPU, memory, disk, network
- Kubernetes-level: pod restarts, pending pods, deployment status
- Application-level: request rate, error rate, latency (RED metrics)
9. Logging (Loki or ELK)
Centralized logging with structured output:
- Application logs → stdout/stderr
- Log aggregation → Fluentd/Fluent Bit → Loki or Elasticsearch
- Retention policy → 30 days hot, 90 days warm
10. Distributed Tracing
Implement OpenTelemetry for request tracing across microservices. This is critical for debugging latency and understanding service dependencies.
Operations
11. Horizontal Pod Autoscaler (HPA)
Scale based on CPU, memory, or custom metrics:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: api-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: api
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
12. Pod Disruption Budgets
Prevent deployments and node drains from taking down too many pods:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: api-pdb
spec:
minAvailable: 2
selector:
matchLabels:
app: api
13. Health Checks
Every container needs liveness, readiness, and startup probes:
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 10
periodSeconds: 15
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
14. GitOps Deployment
Use ArgoCD or Flux for declarative, Git-based deployments:
- All manifests in Git
- Automated sync from Git to cluster
- Drift detection and reconciliation
- Audit trail for every change
15. Disaster Recovery
- etcd backups — Automated daily backups with tested restore procedures
- Cluster recreation — Infrastructure as Code for cluster provisioning
- Application state — PersistentVolume snapshots and database backups
- Runbooks — Documented procedures for every failure scenario
The Bottom Line
Production Kubernetes requires discipline across security, observability, and operational readiness. Skip any of these items and you're running on borrowed time.
Need help getting your Kubernetes clusters production-ready? Schedule a consultation — we've done this for 50+ teams.
