5 Kubernetes Pod Troubleshooting Tricks That Save Hours
I've been debugging Kubernetes pods for over six years, and I've watched too many engineers spend entire afternoons hunting down issues that could be resolved in minutes with the right approach. Last week, a client called me in a panic because their payment processing pods were crashing every few hours, and their team had been troubleshooting for two days straight.
Using the techniques I'm about to share, we identified the root cause—memory pressure from a misconfigured JVM heap—in under 20 minutes. The fix took another 10 minutes to deploy.
These aren't the basic kubectl get pods commands you'll find in every tutorial. These are the debugging tricks that separate experienced Kubernetes operators from those still fumbling through documentation at 2 AM.
The Debug kubectl Commands Most Engineers Don't Know
Before diving into specific tricks, let's talk about the debugging commands that most engineers never learn. Recent Kubernetes releases include several powerful debugging features that remain underutilized (ephemeral containers reached beta in 1.23 and became stable in 1.25):
- kubectl debug for ephemeral containers
- kubectl debug node/node-name for node debugging
- kubectl logs --previous with advanced filtering
- kubectl top with custom resource queries
The problem is that most teams learn Kubernetes through basic tutorials that focus on deployment, not production troubleshooting. When things break, they fall back to kubectl describe and hope for the best.
Trick #1: Using kubectl debug for Live Container Inspection
This is my go-to technique for debugging running containers without restarting them or modifying the original pod spec. The kubectl debug command adds an ephemeral container to the running pod; with --target, it shares the process namespace of the container you name.
Here's how I used it to debug that payment processing issue:
# Attach an ephemeral debug container with full debugging tools
kubectl debug -it payment-processor-7d9b8c-xk2m9 \
  --image=nicolaka/netshoot \
  --target=payment-app
The nicolaka/netshoot image is packed with debugging tools: tcpdump, curl, dig, nslookup, iperf, and more. Once inside the debug container, I could inspect the running Java process:
# Inside the debug container
ps aux | grep java
top -p 1234 # Monitor the specific Java process (1234 stands in for the PID from ps)
cat /proc/1234/limits # Check process limits
grep -E '^Vm(Peak|Size|HWM|RSS)' /proc/1234/status # Memory usage details
This revealed the problem: -Xmx was set to 2GB, but the container memory limit was only 1.5GB. The heap was allowed to grow past what the container could hold, so the pod crashed intermittently whenever memory use reached the limit.
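The mismatch is simple arithmetic, so it can be checked before a deploy. The sketch below is illustrative only: heap_headroom_mib is a hypothetical helper, it treats 2GB as 2048 MiB and 1.5GB as 1536 MiB, and the 25% allowance for non-heap JVM memory (metaspace, thread stacks, direct buffers) is an assumed rule of thumb, not a figure from this incident.

```shell
#!/bin/sh
# heap_headroom_mib XMX_MIB LIMIT_MIB [NON_HEAP_PERCENT]
# Prints how many MiB are left in the container limit after the maximum heap
# plus an assumed non-heap allowance (default: 25% of the heap).
# A negative result means the JVM can outgrow the container.
heap_headroom_mib() {
  local xmx=$1 limit=$2 non_heap_pct=${3:-25}
  echo $(( limit - xmx - xmx * non_heap_pct / 100 ))
}

# The values from the incident above: a 2048 MiB heap in a 1536 MiB container.
heap_headroom_mib 2048 1536
```

A negative number here is the signal to either lower -Xmx or raise the container limit.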
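Pro tip: When you'd rather not touch a production pod at all, use --copy-to. This creates a copy of the pod with the debug container attached, leaving your original pod untouched. Keep in mind that --copy-to can't be combined with --target (use --share-processes on the copy instead), and that the copy is a fresh pod, so it won't carry the original's in-memory state.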
Trick #2: Pod Resource Limits Detective Work with kubectl top
The kubectl top command is powerful, but most engineers only use its basic form. Here's how to use it for serious detective work:
# Get detailed resource usage for specific pods
kubectl top pods --containers --sort-by=memory -n production
# Monitor resource usage over time with watch
watch -n 2 'kubectl top pods --sort-by=cpu --no-headers | head -10'
# Compare resource usage to limits
kubectl get pods -o custom-columns=\
"NAME:.metadata.name,\
CPU_REQ:.spec.containers[0].resources.requests.cpu,\
CPU_LIM:.spec.containers[0].resources.limits.cpu,\
MEM_REQ:.spec.containers[0].resources.requests.memory,\
MEM_LIM:.spec.containers[0].resources.limits.memory"
But here's the real trick: combining kubectl top with resource calculations to identify pods approaching their limits:
#!/bin/bash
# Script to find pods using >80% of their memory limit.
# Assumes single-container pods and whole-number limits in Mi or Gi.
to_mi() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
  esac
}
kubectl top pods --no-headers | while read -r pod cpu memory; do
  mem_usage=$(to_mi "$memory")
  mem_limit=$(to_mi "$(kubectl get pod "$pod" -o jsonpath='{.spec.containers[0].resources.limits.memory}')")
  # Skip pods with no limit or an unrecognized unit
  if [ -n "$mem_usage" ] && [ -n "$mem_limit" ]; then
    usage_percent=$(( mem_usage * 100 / mem_limit ))
    if [ "$usage_percent" -gt 80 ]; then
      echo "WARNING: $pod using ${usage_percent}% of memory limit"
    fi
  fi
done
This script has saved me countless times by identifying pods that are about to hit memory limits before they crash.
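The percentage check is only as good as its unit handling. Kubernetes memory quantities are not always written in Mi, so a converter along these lines could normalize values first. This is a rough sketch: to_mib is a hypothetical helper that covers only Ki, Mi, Gi, and plain byte counts, and it deliberately rejects anything else (including fractional values such as 1.5Gi) rather than guessing.

```shell
#!/bin/sh
# to_mib QUANTITY
# Converts a Kubernetes memory quantity to whole MiB (rounded down).
# Unsupported or fractional quantities print nothing and return 1,
# so the caller can skip that pod.
to_mib() {
  local q=$1 num unit
  num=${q%%[!0-9]*}   # leading digits
  unit=${q#"$num"}    # whatever follows them
  [ -n "$num" ] || return 1
  case "$unit" in
    Ki) echo $(( num / 1024 )) ;;
    Mi) echo "$num" ;;
    Gi) echo $(( num * 1024 )) ;;
    '') echo $(( num / 1048576 )) ;;   # plain bytes
    *)  return 1 ;;
  esac
}

# Example: share of a 2Gi limit used by a pod reporting 1800Mi.
usage=$(to_mib 1800Mi)
limit=$(to_mib 2Gi)
echo "$(( usage * 100 / limit ))%"
```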
Trick #3: Network Troubleshooting with Ephemeral Containers
Network issues are notoriously difficult to debug in Kubernetes. DNS resolution failures, service discovery problems, and connectivity issues can be maddening to track down. Ephemeral containers make this much easier.
Here's my standard network debugging workflow:
# Attach a network debugging container to the target pod
kubectl debug -it target-pod-name \
  --image=nicolaka/netshoot \
  --target=main-container
# Inside the debug container, run comprehensive network tests
nslookup kubernetes.default.svc.cluster.local
dig +short api-service.production.svc.cluster.local
curl -v http://api-service:8080/health
# Test connectivity to external services
curl -v -m 10 https://api.stripe.com
nslookup api.stripe.com
# Capture network traffic
tcpdump -i eth0 -w /tmp/capture.pcap port 8080
Last month, I debugged a mysterious service connectivity issue where pods could reach external APIs but not internal services. Using this approach, I discovered that the CoreDNS configuration was corrupted, causing internal DNS queries to fail intermittently.
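To make the internal-versus-external comparison repeatable, the lookups can be wrapped in a small loop. This is a sketch, not a standard tool: dns_check and the RESOLVER override are hypothetical names, and it assumes the lookup command exits non-zero when resolution fails.

```shell
#!/bin/sh
# dns_check NAME...
# Tries to resolve each name and prints "OK <name>" or "FAIL <name>".
# RESOLVER is a hypothetical override for the lookup command; it defaults
# to nslookup and is assumed to exit non-zero on failure.
dns_check() {
  for name in "$@"; do
    if "${RESOLVER:-nslookup}" "$name" >/dev/null 2>&1; then
      echo "OK   $name"
    else
      echo "FAIL $name"
    fi
  done
}

# Inside the debug container, compare internal and external names:
#   dns_check kubernetes.default.svc.cluster.local \
#             api-service.production.svc.cluster.local \
#             api.stripe.com
```

Running it a few times in a row is a cheap way to spot intermittent failures like the one described above.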
If you can't attach an ephemeral container to the pod (for example, on an older cluster or where policy blocks it), you can create a standalone debug pod in the same namespace:
kubectl run network-debug --rm -it \
--image=nicolaka/netshoot \
--restart=Never \
-- /bin/bash
Trick #4: Quick Log Pattern Analysis with kubectl logs --previous
Most engineers know about kubectl logs --previous for crashed containers, but few use it effectively for pattern analysis. Here are the advanced techniques I use:
# Compare current and previous container logs side by side
diff <(kubectl logs pod-name --previous) <(kubectl logs pod-name)
# Find the exact moment a container started failing
kubectl logs pod-name --previous --timestamps | tail -50
# Search for specific error patterns across restarts
kubectl logs pod-name --previous | grep -E "(ERROR|FATAL|Exception)" | tail -10
kubectl logs pod-name | grep -E "(ERROR|FATAL|Exception)" | head -10
# Analyze startup time differences
echo "Previous startup:"
kubectl logs pod-name --previous --timestamps | grep -i "started" | head -1
echo "Current startup:"
kubectl logs pod-name --timestamps | grep -i "started" | head -1
For applications that log in JSON format, I use jq to parse and analyze log patterns:
# Extract error messages from JSON logs
kubectl logs pod-name --previous | jq -r 'select(.level=="error") | .message'
# Compare error frequencies between restarts
echo "Previous errors:"
kubectl logs pod-name --previous | jq -r 'select(.level=="error") | .error_type' | sort | uniq -c
echo "Current errors:"
kubectl logs pod-name | jq -r 'select(.level=="error") | .error_type' | sort | uniq -c
This approach helped me identify a subtle memory leak in a Node.js application. The logs showed that garbage collection cycles were becoming progressively longer with each restart, indicating that the heap wasn't being properly cleaned up.
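For plain-text logs, the same before-and-after comparison can be reduced to a count per keyword. The helper below is a minimal sketch: error_summary is a hypothetical name, the keyword list simply mirrors the grep pattern used earlier, and it relies on grep -o, which GNU and BSD grep both provide.

```shell
#!/bin/sh
# error_summary
# Reads log lines on stdin and prints KEYWORD=COUNT for each severity
# keyword found, most frequent first.
error_summary() {
  grep -oE 'ERROR|FATAL|Exception' | sort | uniq -c | sort -rn |
    awk '{ print $2 "=" $1 }'
}

# Usage against a pod, comparing the previous and current container:
#   kubectl logs pod-name --previous | error_summary
#   kubectl logs pod-name | error_summary
```

A keyword whose count jumps between the two runs is usually the first place to look.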
Trick #5: Node Affinity Issues - The Hidden Pod Killer
Node affinity and anti-affinity rules are often overlooked when debugging pod scheduling issues. I've seen pods stuck in "Pending" state for hours because of misconfigured affinity rules that no one remembered setting.
Here's how to quickly diagnose node affinity problems:
# Check why a pod isn't scheduling
kubectl describe pod stuck-pod-name | grep -A 10 "Events:"
# List all nodes with their labels
kubectl get nodes --show-labels
# Find pods with affinity rules
kubectl get pods -o yaml | grep -A 20 -B 5 "affinity:"
# Show the pod's nodeSelector and affinity rules to compare against the node labels
kubectl get pod stuck-pod-name -o yaml | grep -A 20 "nodeSelector\|affinity"
But here's the real trick: a script that gathers a pod's scheduling requirements in one place so you can compare them against what your nodes actually offer:
#!/bin/bash
# Usage: pass the pod name and, optionally, its namespace
POD_NAME=${1:?usage: $0 POD_NAME [NAMESPACE]}
NAMESPACE=${2:-default}
echo "Analyzing scheduling requirements for $POD_NAME..."
# Extract node selector
NODE_SELECTOR=$(kubectl get pod "$POD_NAME" -n "$NAMESPACE" -o jsonpath='{.spec.nodeSelector}')
if [ -n "$NODE_SELECTOR" ] && [ "$NODE_SELECTOR" != "{}" ]; then
  echo "Node Selector: $NODE_SELECTOR"
fi
# Extract affinity rules
AFFINITY=$(kubectl get pod "$POD_NAME" -n "$NAMESPACE" -o jsonpath='{.spec.affinity}')
if [ -n "$AFFINITY" ] && [ "$AFFINITY" != "{}" ]; then
  echo "Affinity: $AFFINITY"
fi
# Check resource requests (first container only)
CPU_REQUEST=$(kubectl get pod "$POD_NAME" -n "$NAMESPACE" -o jsonpath='{.spec.containers[0].resources.requests.cpu}')
MEM_REQUEST=$(kubectl get pod "$POD_NAME" -n "$NAMESPACE" -o jsonpath='{.spec.containers[0].resources.requests.memory}')
echo "Resource Requests - CPU: $CPU_REQUEST, Memory: $MEM_REQUEST"
# Show node status; cordoned nodes appear as SchedulingDisabled
kubectl get nodes
# List nodes with available resources
kubectl describe nodes | grep -A 5 "Allocated resources" | grep -E "(cpu|memory)"
I once spent three hours debugging a "pending" pod only to discover that someone had added a node selector for disk=ssd, but all our SSD nodes were cordoned for maintenance. Checking the selector against node status, as this script does, would have caught that immediately.
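That specific failure, a selector that only matches cordoned nodes, can be checked directly. The filter below is a small sketch: cordoned_nodes is a hypothetical name, and it assumes the default kubectl get nodes column layout, where the second column is STATUS.

```shell
#!/bin/sh
# cordoned_nodes
# Reads `kubectl get nodes --no-headers` output on stdin and prints the
# names of nodes whose STATUS column includes SchedulingDisabled.
cordoned_nodes() {
  awk '$2 ~ /SchedulingDisabled/ { print $1 }'
}

# Usage against a live cluster, for the disk=ssd selector from the story:
#   kubectl get nodes -l disk=ssd --no-headers | cordoned_nodes
```

If every node returned by the label query also shows up in this list, the pod has nowhere to go until a node is uncordoned.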
Bonus: One-Liner Scripts for Common Pod Problems
Here are my favorite one-liners that I keep in my .bashrc for quick pod troubleshooting:
# Find pods with high restart counts
alias k8s-restarts="kubectl get pods --all-namespaces --sort-by='.status.containerStatuses[0].restartCount' | tail -10"
# Get pods consuming the most CPU
alias k8s-cpu-hogs="kubectl top pods --sort-by=cpu --no-headers | head -10"
# Find pods in non-running states
alias k8s-broken="kubectl get pods --all-namespaces --field-selector=status.phase!=Running"
# Quick resource summary for a namespace
k8s-resources() {
  echo "=== Pods ==="
  kubectl get pods -n "$1"
  echo "=== Resource Usage ==="
  kubectl top pods -n "$1" --sort-by=memory
  echo "=== Events ==="
  kubectl get events -n "$1" --sort-by=.metadata.creationTimestamp | tail -5
}
# Find pods scheduled on a specific node
k8s-node-pods() {
  kubectl get pods --all-namespaces -o wide --field-selector "spec.nodeName=$1"
}
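The restart alias always shows ten pods, whether or not any of them are actually restarting. A threshold filter is one possible companion. This is a sketch under stated assumptions: restarts_over is a hypothetical name, and it expects the default column layout of kubectl get pods --all-namespaces --no-headers, where the fifth column is the restart count.

```shell
#!/bin/sh
# restarts_over [MIN]
# Reads `kubectl get pods --all-namespaces --no-headers` output on stdin
# and prints pods whose restart count is at least MIN (default: 5).
restarts_over() {
  awk -v min="${1:-5}" '$5 + 0 >= min { print $1 "/" $2 ": " $5 " restarts" }'
}

# Usage against a live cluster:
#   kubectl get pods --all-namespaces --no-headers | restarts_over 10
```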
When to Escalate Beyond These Quick Fixes
These tricks will solve about 80% of common pod issues, but sometimes you need to dig deeper. Escalate to more advanced debugging when you encounter:
- Kernel-level issues: Use kubectl debug node/node-name to debug node-level problems
- Network policy violations: Requires analyzing Calico/Cilium logs and policy configurations
- Storage issues: Need to examine PersistentVolume status and underlying storage provider logs
- Multi-container coordination problems: Requires analyzing init containers and sidecar interactions
- Cluster-wide resource exhaustion: Need cluster-level monitoring and capacity planning
The key is knowing when a quick fix won't work and you need to involve platform engineering or SRE teams.
Wrapping Up
These five troubleshooting tricks have saved me hundreds of hours over the years. The kubectl debug command alone has eliminated the need to rebuild containers with debugging tools. Combined with smart resource monitoring and log analysis, you can resolve most pod issues in minutes instead of hours.
The real secret isn't knowing every Kubernetes command—it's having a systematic approach to debugging that starts with the most likely causes and uses the right tools for each type of problem.
At BeddaTech, we help teams build robust Kubernetes operations practices that prevent these issues before they happen. If your team is spending too much time fighting fires instead of building features, we should talk. Our Fractional CTO services include establishing monitoring, alerting, and debugging workflows that keep your applications running smoothly.
What's your go-to Kubernetes debugging trick? I'd love to hear about the techniques that have saved your team time and frustration.