Kubernetes Pod Troubleshooting Guide | Debug CrashLoopBackOff, OOMKilled, Pending

Key Takeaways

Pod crashes, OOM kills, and scheduling failures are the most common Kubernetes pain points. This guide gives you a systematic troubleshooting workflow and kubectl one-liners for every scenario.

Troubleshooting Workflow

When a pod misbehaves, follow this sequence:

# 1. Check pod status
kubectl get pods

# 2. Describe the pod (events, conditions, resource usage)
kubectl describe pod <pod-name>

# 3. Check logs
kubectl logs <pod-name>
kubectl logs <pod-name> --previous  # last crash

# 4. Shell into container (if it's running)
kubectl exec -it <pod-name> -- /bin/sh

# 5. Check events for the namespace
kubectl get events --sort-by='.lastTimestamp'

1. CrashLoopBackOff

The container starts and exits repeatedly. Kubernetes backs off exponentially between restarts.
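
To quantify the crash loop, you can read the restart count straight from pod status (a quick sketch; the jsonpath assumes the crashing container is the first one in the pod):

```shell
# Restart count of the first container (assumes a single-container pod)
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].restartCount}'

# Watch restarts accumulate live in the RESTARTS column
kubectl get pod <pod-name> -w
```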

Diagnose:

kubectl describe pod <pod-name>
# Look at: Last State, Exit Code, Reason

kubectl logs <pod-name> --previous
# Application output from the last crashed container

Common causes and fixes:

Exit Code   Meaning             Fix
1           Application error   Check --previous logs
127         Command not found   Fix command/args in spec
137         OOMKilled           Increase memory limit
143         SIGTERM timeout     Fix graceful shutdown

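
Exit codes above 128 encode a signal: subtract 128 to get the signal number, which is why 137 means SIGKILL (what the OOM killer sends) and 143 means SIGTERM:

```shell
# Exit codes above 128 mean "killed by signal (code - 128)"
echo $(( 137 - 128 ))  # 9  -> SIGKILL
echo $(( 143 - 128 ))  # 15 -> SIGTERM
kill -l 9              # look up a signal name by number
```
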
# Check what command the container is running
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].command}'

# Check environment variables (missing config?)
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].env}'

Typical fix — missing environment variable:

env:
  - name: DATABASE_URL
    valueFrom:
      secretKeyRef:
        name: app-secrets
        key: database-url
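
It's worth confirming that the referenced secret and key actually exist — a typo in either produces the same crash loop (names below match the example above):

```shell
# Confirm the secret and key exist; empty output means the key name is wrong,
# a NotFound error means the secret itself is missing
kubectl get secret app-secrets -o jsonpath='{.data.database-url}' | base64 -d
```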

2. OOMKilled (Exit Code 137)

Your container exceeded its memory limit.

Diagnose:

kubectl describe pod <pod-name>
# Look for: OOMKilled, Last State reason

# Check current memory usage
kubectl top pod <pod-name>
kubectl top pod <pod-name> --containers

Fix — increase memory limit:

resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"  # increase this
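
If you'd rather not edit YAML by hand, the same change can be applied with kubectl set resources (a sketch; myapp is a placeholder deployment name). Note this triggers a rolling restart:

```shell
# Apply new memory settings directly; "myapp" is a placeholder name
kubectl set resources deployment myapp \
  --requests=memory=256Mi --limits=memory=512Mi
```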

Find memory-hungry pods cluster-wide:

kubectl top pods --all-namespaces --sort-by=memory | head -20

Tips:

  • Set limits at 2× your typical peak usage
  • Use requests for scheduling, limits for enforcement
  • Profile your app before setting limits — guessing leads to OOM or resource waste
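
One way to find your actual peak is to sample usage under load and take the maximum (a rough sketch; requires metrics-server):

```shell
# Sample container memory every 30s during peak load, then take the max
while true; do
  kubectl top pod <pod-name> --containers --no-headers
  sleep 30
done | tee memory-samples.txt
```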

3. ImagePullBackOff / ErrImagePull

Kubernetes can’t pull the container image.

Diagnose:

kubectl describe pod <pod-name>
# Look at Events: Failed to pull image "..."

Common causes:

# 1. Wrong image name or tag
# Fix: correct the image in your Deployment spec
image: myapp:v1.2.3  # verify this tag exists in your registry

# 2. Private registry — missing imagePullSecret
kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=user \
  --docker-password=password

# Reference in pod spec:
imagePullSecrets:
  - name: regcred

# 3. Rate limiting (Docker Hub)
# Fix: authenticate or use a mirror
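
For the private-registry case, you can decode the pull secret to verify the server and credentials it actually contains (regcred as created above):

```shell
# Inspect the registry credentials stored in the pull secret
kubectl get secret regcred -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
```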

4. Pending

Pod is stuck waiting to be scheduled.

Diagnose:

kubectl describe pod <pod-name>
# Look at Events: "0/3 nodes are available: ..."

Cause: Insufficient resources

kubectl describe nodes | grep -A5 "Allocated resources"
kubectl top nodes

# Fix: reduce requests or scale the cluster
resources:
  requests:
    cpu: "100m"    # not "1000m" unless you need it
    memory: "128Mi"

Cause: Node selector / affinity mismatch

kubectl get nodes --show-labels
# Verify your nodeSelector labels exist on nodes

# Fix nodeSelector mismatch
nodeSelector:
  kubernetes.io/arch: amd64  # make sure nodes have this label

Cause: Taints with no toleration

kubectl describe node <node-name> | grep Taint

# Add toleration to your pod spec
tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"

Cause: PVC not bound

kubectl get pvc
# If STATUS is Pending, the PV doesn't exist or StorageClass is wrong

kubectl describe pvc <pvc-name>
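
If the PVC is stuck Pending, check that a matching StorageClass exists and whether any pre-provisioned PVs are available:

```shell
# The PVC's storageClassName must match one of these
kubectl get storageclass

# Any Available PVs that could satisfy the claim?
kubectl get pv
```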

5. Running But Not Ready

Pod is running but failing readiness probe — traffic isn’t routed to it.

kubectl describe pod <pod-name>
# Look for: Readiness probe failed

Fix — check your probe config:

readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10   # wait before first check
  periodSeconds: 5
  failureThreshold: 3
  timeoutSeconds: 2

Test the probe endpoint manually:

kubectl exec -it <pod-name> -- curl http://localhost:8080/health
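
If the container image ships without curl, you can hit the same endpoint from your machine through a port-forward instead (a sketch; adjust the port to match your probe config):

```shell
# Forward the probe port locally, hit the endpoint, then clean up
kubectl port-forward pod/<pod-name> 8080:8080 &
PF_PID=$!
sleep 2  # give the tunnel a moment to establish
curl -s http://localhost:8080/health
kill "$PF_PID"
```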

6. Network Issues

Pod can’t reach another service:

# Check service exists and has endpoints
kubectl get service <service-name>
kubectl get endpoints <service-name>
# If ENDPOINTS is <none>, no pods match the selector

# Test DNS resolution from inside a pod
kubectl exec -it <pod-name> -- nslookup my-service.default.svc.cluster.local

# Test connectivity
kubectl exec -it <pod-name> -- curl http://my-service:8080/health

# Check NetworkPolicy
kubectl get networkpolicy -n <namespace>
kubectl describe networkpolicy <policy-name>

Service selector mismatch (most common cause of empty endpoints):

# Service selector
kubectl get service my-app -o jsonpath='{.spec.selector}'
# Output: {"app":"my-app"}

# Pod labels
kubectl get pods --show-labels
# Make sure pods have app=my-app label
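
A quick way to confirm the mismatch is the culprit: add the missing label to one pod and watch endpoints appear. The durable fix is aligning .spec.template.metadata.labels in the Deployment:

```shell
# Temporary: label one pod so the Service selector matches it
kubectl label pod <pod-name> app=my-app

# ENDPOINTS should no longer be <none>
kubectl get endpoints my-app
```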

7. Init Container Failures

kubectl describe pod <pod-name>
# Look at Init Containers section

kubectl logs <pod-name> -c <init-container-name>

Common: init container waiting for a database that isn’t ready.

initContainers:
  - name: wait-for-db
    image: busybox
    command: ['sh', '-c', 'until nc -z postgres 5432; do echo waiting; sleep 2; done']

8. Useful kubectl One-Liners

# All pods not Running across all namespaces
kubectl get pods -A --field-selector=status.phase!=Running

# Watch pod restarts
kubectl get pods -w

# Pod resource usage sorted by CPU
kubectl top pods --sort-by=cpu

# Describe all pods matching a label
kubectl describe pods -l app=myapp

# Copy file from pod
kubectl cp <pod-name>:/var/log/app.log ./app.log

# Port forward to local machine
kubectl port-forward pod/<pod-name> 8080:8080

# Run a debug pod with full tools
kubectl run debug --image=nicolaka/netshoot -it --rm -- bash

# Force delete a stuck Terminating pod
kubectl delete pod <pod-name> --grace-period=0 --force

# View resource quotas and limits
kubectl describe resourcequota -n <namespace>
kubectl describe limitrange -n <namespace>

9. Pod Status Quick Reference

Status             Meaning                        First action
Pending            Not scheduled                  describe pod → check Events
Init:0/1           Init container running         logs -c <init-name>
PodInitializing    Init done, main starting       Wait or check image pull
Running            Running but may not be ready   Check readiness probe
CrashLoopBackOff   App crashing                   logs --previous
OOMKilled          Memory limit exceeded          Increase memory limit
Terminating        Being deleted                  Wait; force delete if stuck
ImagePullBackOff   Can’t pull image               Check image name + registry creds
Error              Container exited with error    logs --previous + exit code

Systematic Checklist

□ kubectl get pods          → identify status
□ kubectl describe pod      → read Events section
□ kubectl logs --previous   → app-level error
□ kubectl top pod           → memory/CPU usage
□ kubectl get events        → cluster-level context
□ kubectl exec -- curl      → test endpoints from inside
□ kubectl get endpoints     → verify service routing
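
The checklist above can be wrapped in a small helper script (a sketch; pass the pod name as the first argument, and expect some steps to no-op if metrics-server is absent):

```shell
#!/bin/sh
# triage.sh <pod-name> -- run the checklist top to bottom (sketch)
POD="$1"
kubectl get pod "$POD"
kubectl describe pod "$POD" | grep -A 10 'Events:'
kubectl logs "$POD" --previous --tail=50 2>/dev/null || echo "no previous container"
kubectl top pod "$POD" 2>/dev/null || echo "metrics-server not available"
kubectl get events --sort-by='.lastTimestamp' | tail -20
```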

Most Kubernetes issues fall into five categories: application errors (logs), resource constraints (top/describe), scheduling conflicts (describe node), image problems (events), and network misconfigurations (endpoints/NetworkPolicy). Work through the checklist in order and you’ll resolve the vast majority of issues within minutes.