Kubernetes in Production: The 15-Point Checklist Every Team Needs

March 10, 2026 · 3 min read

AIOps & DevOps Consultant

Running Kubernetes in development is easy. Running it in production is where teams get burned. After helping dozens of teams move to production Kubernetes, here's the checklist we use in every engagement.

Cluster Architecture

1. Multi-AZ / Multi-Region Deployment

Production clusters should span at least 3 availability zones. Single-AZ clusters are a single point of failure.

# EKS node group spread across AZs
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: production
  region: us-east-1
availabilityZones:
  - us-east-1a
  - us-east-1b
  - us-east-1c

2. Dedicated Node Pools

Separate workloads by node pool type:

System pool — CoreDNS, kube-proxy, monitoring agents
Application pool — Your services
Spot/Preemptible pool — Batch jobs, non-critical workloads

3. Resource Requests and Limits

Every pod must define resource requests and limits. Without them, the scheduler can't make intelligent placement decisions.

resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi

Security

4. RBAC with Least Privilege

Never use cluster-admin for application service accounts. Define granular roles:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-reader
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["pods", "services"]
    verbs: ["get", "list", "watch"]

5. Network Policies

Default deny all traffic, then explicitly allow what's needed:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress

6. Pod Security Standards

Enforce pod security standards to prevent privileged containers:

No root containers
Read-only root filesystem
No privilege escalation
Drop all capabilities

7. Secrets Management

Never store secrets in plain Kubernetes Secrets. Use:

External Secrets Operator with AWS Secrets Manager or HashiCorp Vault
Sealed Secrets for GitOps workflows
SOPS for encrypted secrets in Git

Observability

8. Metrics (Prometheus + Grafana)

Monitor cluster health, node resources, and application metrics:

Cluster-level: node CPU, memory, disk, network
Kubernetes-level: pod restarts, pending pods, deployment status
Application-level: request rate, error rate, latency (RED metrics)

9. Logging (Loki or ELK)

Centralized logging with structured output:

Application logs → stdout/stderr
Log aggregation → Fluentd/Fluent Bit → Loki or Elasticsearch
Retention policy → 30 days hot, 90 days warm

10. Distributed Tracing

Implement OpenTelemetry for request tracing across microservices. This is critical for debugging latency and understanding service dependencies.

Operations

11. Horizontal Pod Autoscaler (HPA)

Scale based on CPU, memory, or custom metrics:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

12. Pod Disruption Budgets

Prevent deployments and node drains from taking down too many pods:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api

13. Health Checks

Every container needs liveness, readiness, and startup probes:

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5

14. GitOps Deployment

Use ArgoCD or Flux for declarative, Git-based deployments:

All manifests in Git
Automated sync from Git to cluster
Drift detection and reconciliation
Audit trail for every change

15. Disaster Recovery

etcd backups — Automated daily backups with tested restore procedures
Cluster recreation — Infrastructure as Code for cluster provisioning
Application state — PersistentVolume snapshots and database backups
Runbooks — Documented procedures for every failure scenario

The Bottom Line

Production Kubernetes requires discipline across security, observability, and operational readiness. Skip any of these items and you're running on borrowed time.

Need help getting your Kubernetes clusters production-ready? Schedule a consultation — we've done this for 50+ teams.

Cluster Architecture​

1. Multi-AZ / Multi-Region Deployment​

2. Dedicated Node Pools​

3. Resource Requests and Limits​

Security​

4. RBAC with Least Privilege​

5. Network Policies​

6. Pod Security Standards​

7. Secrets Management​

Observability​

8. Metrics (Prometheus + Grafana)​

9. Logging (Loki or ELK)​

10. Distributed Tracing​

Operations​

11. Horizontal Pod Autoscaler (HPA)​

12. Pod Disruption Budgets​

13. Health Checks​

14. GitOps Deployment​

15. Disaster Recovery​

The Bottom Line​