
Kubernetes OOMKilled: How to Fix It


State:       Terminated
Reason:      OOMKilled
Exit Code:   137

Your container exceeded its memory limit and was killed. Exit code 137 means the process received SIGKILL (128 + 9): the kernel's Out of Memory (OOM) killer terminated it because there was no memory left within the cgroup.
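The 128 + 9 arithmetic applies to any container that dies from a signal. As a minimal sketch (the function name describe_exit_code is illustrative and not part of any Kubernetes tooling), an exit code can be decoded like this:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain an exit code using the convention 128 + signal number."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))
print(describe_exit_code(143))
```

On Linux, 137 decodes to SIGKILL and 143 to SIGTERM, which is a quick way to tell a forced kill from a graceful termination request.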

Why this happens

Every container in Kubernetes runs inside a cgroup with a memory ceiling. When the container's memory usage hits the limits.memory value, the Linux kernel's OOM killer immediately terminates the process. There's no graceful shutdown; it's an instant kill.

Common causes:

  • Memory leak: The application gradually consumes more memory until it hits the limit
  • Limit set too low: The application legitimately needs more memory than allocated
  • Traffic spike: More concurrent requests means more memory for buffers, connections, and in-flight data
  • JVM/Node.js heap misconfiguration: The runtime's maximum heap size exceeds the container limit
  • Large file processing: Loading entire files into memory instead of streaming

Step 1: Diagnose the actual memory usage

Before changing limits, understand how much memory your pod actually needs.

# Current memory usage (requires metrics-server)
kubectl top pod my-pod --containers

# Last termination state: reason, exit code, start and finish times (not memory usage)
kubectl describe pod my-pod | grep -A 10 "Last State"

# Check the OOM event
kubectl get events --field-selector involvedObject.name=my-pod --sort-by='.lastTimestamp'

# Get logs from the crashed container
kubectl logs my-pod --previous

If kubectl top shows memory climbing steadily over time, you likely have a memory leak. If it spikes suddenly, it's probably a traffic or workload spike.
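The leak-versus-spike distinction can be made mechanical once you have a series of readings. The sketch below is a rough heuristic, not a Kubernetes feature: the function name and both thresholds (spike_share, rising_share) are assumptions chosen for illustration, and the samples are memory readings in MiB collected by whatever means you have.

```python
from statistics import mean


def classify_memory_trend(samples_mib, spike_share=0.5, rising_share=0.8):
    """Label a series of memory readings (MiB, oldest first).

    Thresholds are illustrative assumptions, not Kubernetes defaults.
    """
    if len(samples_mib) < 3:
        return "not enough data"
    span = max(samples_mib) - min(samples_mib)
    # Less than 10% variation around the mean counts as flat
    if span < 0.1 * mean(samples_mib):
        return "stable"
    deltas = [b - a for a, b in zip(samples_mib, samples_mib[1:])]
    # One step accounts for most of the observed range: sudden spike
    if max(deltas) >= spike_share * span:
        return "sudden spike"
    # Most steps go up a little: steady climb
    rising = sum(d > 0 for d in deltas) / len(deltas)
    if rising >= rising_share:
        return "steady climb (possible leak)"
    return "inconclusive"
```

Treat the result as a hint for where to look next, not as a diagnosis.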

Fix 1: Increase the memory limit

The simplest fix is to give the container more memory. But don't just double it blindly. Set it based on observed peak usage plus a 20-30% buffer.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  template:
    spec:
      containers:
      - name: my-app
        resources:
          requests:
            memory: "256Mi"   # Scheduler uses this for placement
          limits:
            memory: "512Mi"   # Hard ceiling: OOMKilled if exceeded

How to choose the right value:

  1. Run kubectl top pod during peak traffic for a few days
  2. Note the maximum observed usage
  3. Set the limit to 1.3× that value (30% headroom)
  4. Set the request to the average usage
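The sizing steps above can be sketched as a small calculation. This assumes you have already collected usage samples in whole MiB; the function name recommend_memory and the sample numbers in the usage note are hypothetical.

```python
def recommend_memory(samples_mib, headroom_percent=30):
    """Derive request and limit (as Kubernetes quantities) from observed usage in MiB."""
    if not samples_mib:
        raise ValueError("need at least one sample")
    peak = max(samples_mib)
    average = round(sum(samples_mib) / len(samples_mib))
    # Integer ceiling of peak * (1 + headroom) avoids float rounding surprises
    limit = -(-peak * (100 + headroom_percent) // 100)
    return {"request": f"{average}Mi", "limit": f"{limit}Mi"}
```

For example, samples of 180, 210, 390, and 240 MiB give a request of 255Mi (the average) and a limit of 507Mi (1.3 times the 390 MiB peak).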

Fix 2: Fix the memory leak

If memory grows continuously until OOM, you have a leak. Common sources by language:

Node.js:

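# Enable the inspector on the running process (assumes Node.js is PID 1 and the image ships a kill binary)
kubectl exec my-pod -- kill -USR1 1
# Forward the inspector port, then take heap snapshots from Chrome DevTools
kubectl port-forward my-pod 9229:9229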

# Or add to deployment
env:
  - name: NODE_OPTIONS
    value: "--max-old-space-size=384 --expose-gc"

Common Node.js leaks: event listeners that are never removed, arrays or maps that grow without ever being cleared, closures holding references to large objects, and unresolved promises that accumulate.

Python:

# Use tracemalloc to find leaks
import tracemalloc
tracemalloc.start()
# ... run your code ...
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')
for stat in top_stats[:10]:
    print(stat)

Common Python leaks: global lists that grow forever, caching without eviction, and reference cycles that keep large objects alive until the cyclic garbage collector runs.

Go:

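import _ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
// The app must also serve that mux, e.g. go http.ListenAndServe("localhost:6060", nil)
// Then: kubectl port-forward my-pod 6060:6060
// Visit: http://localhost:6060/debug/pprof/heap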

Common Go leaks: goroutines that never exit, growing slices that are never trimmed, sync.Pool misuse.

Fix 3: Configure JVM heap size (Java/Kotlin/Scala)

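Older JVMs size the heap from the host's memory rather than the container limit, and even a container-aware JVM uses memory beyond the heap (metaspace, thread stacks, native buffers). If the JVM's total footprint exceeds the container's cgroup limit, the container is OOMKilled.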

containers:
- name: my-java-app
  resources:
    limits:
      memory: "1Gi"
  env:
    - name: JAVA_TOOL_OPTIONS   # read by the JVM itself; JAVA_OPTS only works if your start script passes it on
      value: "-Xmx768m -Xms512m -XX:+UseContainerSupport"

Rules for JVM in containers:

  • Set -Xmx to ~75% of the container memory limit (the rest is for metaspace, thread stacks, native memory, and OS overhead)
  • Use -XX:+UseContainerSupport (on by default since Java 10, and backported to Java 8u191) so the JVM respects cgroup limits
  • For Java 8 builds older than 8u191, use -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap

Container limit    Recommended -Xmx
512Mi              384m
1Gi                768m
2Gi                1536m
4Gi                3072m
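The recommended values follow directly from the 75% rule. A minimal sketch of the calculation is shown here; the helper name xmx_for_limit is hypothetical, and it only handles whole-number Mi and Gi quantities.

```python
_UNITS_MIB = {"Mi": 1, "Gi": 1024}


def xmx_for_limit(limit: str, heap_percent: int = 75) -> str:
    """Return a -Xmx flag sized to heap_percent of a container memory limit.

    Only Mi and Gi suffixes with whole numbers are handled in this sketch.
    """
    number, unit = limit[:-2], limit[-2:]
    if unit not in _UNITS_MIB or not number.isdigit():
        raise ValueError(f"unsupported quantity: {limit!r}")
    limit_mib = int(number) * _UNITS_MIB[unit]
    return f"-Xmx{limit_mib * heap_percent // 100}m"


for limit in ("512Mi", "1Gi", "2Gi", "4Gi"):
    print(limit, xmx_for_limit(limit))
```

The same arithmetic applies to any runtime heap cap that is expressed as a share of the container limit.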

Fix 4: Configure Node.js memory limit

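Node.js has its own heap limit. Older versions default to roughly 1.5 GB on 64-bit systems, and newer versions size it from the memory they detect, which may not match the container limit. In a container with less memory, cap it explicitly.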

containers:
- name: my-node-app
  resources:
    limits:
      memory: "512Mi"
  env:
    - name: NODE_OPTIONS
      value: "--max-old-space-size=384"

Set --max-old-space-size to ~75% of the container limit (in MB). The remaining 25% covers the V8 new space, native addons, buffers, and OS overhead.

Fix 5: Use Guaranteed QoS class

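Kubernetes has three Quality of Service classes. Setting requests equal to limits for both CPU and memory, in every container, gives your pod "Guaranteed" QoS, meaning it's the last to be evicted under node memory pressure. This protects against node-level eviction only; a container that exceeds its own limit is still OOMKilled.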

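resources:
  requests:
    cpu: "500m"      # example value; CPU must also match for Guaranteed QoS
    memory: "512Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"  # requests == limits for CPU and memory = Guaranteed QoS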

QoS classes (from most to least protected):

  1. Guaranteed: CPU and memory requests == limits for every container
  2. Burstable: at least one container sets a CPU or memory request or limit, but the pod doesn't meet the Guaranteed criteria
  3. BestEffort: no requests or limits set (first to be evicted)
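The classification can be sketched as a function over each container's resources mapping. It follows the Kubernetes rule that Guaranteed requires CPU and memory requests equal to limits in every container. The function name qos_class is illustrative, and the string comparison of quantities is a simplification noted in the code.

```python
def qos_class(containers):
    """Approximate the pod QoS class from each container's `resources` mapping.

    Simplification: quantities are compared as strings, so "0.5" and "500m"
    would not be treated as equal here even though Kubernetes treats them so.
    """
    any_set = False
    guaranteed = True
    for resources in containers:
        requests = resources.get("requests", {})
        limits = resources.get("limits", {})
        for name in ("cpu", "memory"):
            limit = limits.get(name)
            # An omitted request defaults to the limit
            request = requests.get(name, limit)
            if limit is not None or request is not None:
                any_set = True
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"
```

Note that a container with matching memory request and limit but no CPU settings comes out as Burstable, which is a common surprise.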

Fix 6: Scale horizontally instead of vertically

If your app handles concurrent requests, running more pods with less memory each is often better than one pod with lots of memory.

# Scale out
kubectl scale deployment my-app --replicas=3

# Or use HPA (Horizontal Pod Autoscaler)
kubectl autoscale deployment my-app --min=2 --max=10 --cpu-percent=70

This works well for stateless web servers and API services. It doesn't help for batch jobs or single-process workloads.

Fix 7: Optimize memory usage in your application

Before throwing more memory at the problem:

  • Stream large files instead of loading them entirely into memory
  • Use pagination for database queries instead of fetching all rows
  • Limit concurrency: fewer simultaneous requests means less memory
  • Clear caches: add TTL or LRU eviction to in-memory caches
  • Use connection pooling โ€” reuse database connections instead of creating new ones
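As an illustration of the streaming point, here is a minimal sketch that hashes a file-like object in fixed-size chunks, so memory use stays near the chunk size regardless of the input size. The function name and the 1 MiB chunk size are arbitrary choices for the example.

```python
import hashlib
import io


def sha256_streaming(stream, chunk_size=1024 * 1024):
    """Hash a binary stream chunk by chunk so memory use stays near chunk_size."""
    digest = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
    return digest.hexdigest()


# Demo with an in-memory stream; with a real file use open(path, "rb")
data = b"x" * (3 * 1024 * 1024 + 17)
assert sha256_streaming(io.BytesIO(data)) == hashlib.sha256(data).hexdigest()
```

The same read-a-chunk, process, discard loop applies to parsing, uploading, or transforming large inputs.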

Debugging: OOMKilled vs eviction

There are two ways a pod can be killed for memory:

              OOMKilled                                   Eviction
Trigger       Container exceeds its own limit             Node runs out of memory
Exit code     137                                         Varies
Visible in    kubectl describe pod → Reason: OOMKilled    kubectl get events → Evicted
Fix           Increase container limit or reduce usage    Add more nodes or reduce cluster load
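To tell the two cases apart programmatically, you can inspect the pod object itself. The sketch below assumes the input is the dictionary parsed from kubectl get pod my-pod -o json; the function name memory_kill_reason is hypothetical.

```python
def memory_kill_reason(pod: dict) -> str:
    """Classify a pod (parsed from `kubectl get pod -o json`) as OOMKilled, Evicted, or neither."""
    status = pod.get("status", {})
    # Evicted pods are marked at the pod level
    if status.get("reason") == "Evicted":
        return "Evicted"
    for container in status.get("containerStatuses", []):
        # Check both the current and the previous container state
        for key in ("state", "lastState"):
            terminated = container.get(key, {}).get("terminated")
            if terminated and terminated.get("reason") == "OOMKilled":
                return f"OOMKilled ({container.get('name', 'unknown')})"
    return "neither"
```

Checking lastState as well as state matters because a restarted container is usually Running again by the time you look.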

FAQ

My pod keeps getting OOMKilled during startup. What's wrong?

Some applications (especially JVMs) need significant memory during initialization: loading classes, warming caches, building indexes. Set a higher limit to cover that startup peak. A startup probe with a generous timeout is also worth adding so that slow initialization isn't mistaken for a failed liveness check, but a probe by itself does not prevent OOM kills.

Can I get a warning before OOMKill happens?

Not directly from Kubernetes. But you can set up Prometheus alerts on container_memory_usage_bytes approaching container_spec_memory_limit_bytes. Alert at 80% to give yourself time to react.
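The same 80% rule can also be checked from inside the container. The sketch below is an assumption-laden illustration, not a replacement for Prometheus alerting: it assumes cgroup v2 mounted at /sys/fs/cgroup, where memory.current and memory.max hold the usage and the limit, and the function names are hypothetical.

```python
from pathlib import Path

CGROUP_DIR = Path("/sys/fs/cgroup")  # assumption: cgroup v2 mounted at the usual path


def usage_ratio(current_text: str, max_text: str):
    """Return usage / limit, or None when the cgroup has no memory limit ("max")."""
    limit = max_text.strip()
    if limit == "max":
        return None
    return int(current_text.strip()) / int(limit)


def near_limit(threshold: float = 0.8, cgroup_dir: Path = CGROUP_DIR) -> bool:
    """True when this container's memory usage is at or above threshold of its limit."""
    ratio = usage_ratio(
        (cgroup_dir / "memory.current").read_text(),
        (cgroup_dir / "memory.max").read_text(),
    )
    return ratio is not None and ratio >= threshold
```

An application could call near_limit() periodically and log a warning or shed load before the kernel steps in.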

Does OOMKilled count against my restart policy?

Yes. Each OOMKill increments the restart count. After repeated restarts, the pod enters CrashLoopBackOff with exponential backoff delays. See CrashLoopBackOff fix.
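To see how quickly the delays grow, here is a small sketch of the backoff schedule. It assumes the kubelet's documented defaults (10 seconds initially, doubling on each restart, capped at five minutes); the function name is illustrative.

```python
def crashloop_delays(restarts: int, base: int = 10, cap: int = 300):
    """Seconds waited before each restart: doubles from base, capped at cap."""
    return [min(base * 2 ** n, cap) for n in range(restarts)]


print(crashloop_delays(7))
```

Under those assumptions the cap is reached by the sixth restart, so a pod that is OOMKilled on every start spends most of its time waiting.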

Should I set memory limits on all containers?

Yes. Without limits, a single container can consume all node memory and cause other pods to be evicted. Always set limits, even generous ones. The only exception is development or test clusters where you want maximum flexibility.
