
5 Kubernetes Pod Troubleshooting Tricks That Save Hours

Matthew J. Whitney
9 min read
cloud computing · devops · best practices · software architecture

I've been debugging Kubernetes pods for over six years, and I've watched too many engineers spend entire afternoons hunting down issues that could be resolved in minutes with the right approach. Last week, a client called me in a panic because their payment processing pods were crashing every few hours, and their team had been troubleshooting for two days straight.

Using the techniques I'm about to share, we identified the root cause—memory pressure from a misconfigured JVM heap—in under 20 minutes. The fix took another 10 minutes to deploy.

These aren't the basic kubectl get pods commands you'll find in every tutorial. These are the debugging tricks that separate experienced Kubernetes operators from those still fumbling through documentation at 2 AM.

The kubectl Debugging Commands Most Engineers Don't Know

Before diving into specific tricks, let's talk about the debugging commands that most engineers never learn. Ephemeral containers have been enabled by default since Kubernetes 1.23, and together with a handful of older commands they form a debugging toolkit that remains underutilized:

  • kubectl debug for ephemeral containers
  • kubectl debug node/<node-name> for node-level debugging
  • kubectl logs --previous combined with shell filtering for crash analysis
  • kubectl top paired with custom-columns output to compare usage against limits

The problem is that most teams learn Kubernetes through basic tutorials that focus on deployment, not production troubleshooting. When things break, they fall back to kubectl describe and hope for the best.
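
Of these, the node-debugging form is the one I see used least, even though it removes the need to SSH into nodes at all. A minimal sketch, assuming a node named worker-node-1 and a systemd-based host:

# Start an interactive debugging pod on the node; the node's filesystem is mounted at /host
kubectl debug node/worker-node-1 -it --image=busybox

# Inside the pod, switch into the host's filesystem to use its tools
chroot /host
journalctl -u kubelet --since "1 hour ago" | tail -50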

Trick #1: Using kubectl debug for Live Container Inspection

This is my go-to technique for debugging running containers without rebuilding images or modifying the original pod spec. The kubectl debug command can attach an ephemeral container that shares the process namespace with your target container, or create a disposable copy of the pod with a debug container added.

Here's how I used it to debug that payment processing issue:

# Create a debuggable copy of the pod with full debugging tools attached
# (--share-processes lets the debug container see the payment-app process;
#  --target is only for the ephemeral-container form, so it isn't combined with --copy-to)
kubectl debug payment-processor-7d9b8c-xk2m9 -it \
  --image=nicolaka/netshoot \
  --share-processes \
  --copy-to=payment-debug

The nicolaka/netshoot image is packed with debugging tools: tcpdump, curl, dig, nslookup, iperf, and more. Once inside the debug container, I could inspect the running Java process:

# Inside the debug container (1234 is the Java PID taken from the ps output)
ps aux | grep java
top -p 1234  # Monitor the specific Java process
cat /proc/1234/limits  # Check process limits
cat /proc/1234/status | grep -i mem  # Memory usage details

This revealed that the JVM's -Xmx was set to 2GB while the container's memory limit was only 1.5GB. Whenever the heap grew toward its configured maximum, the container blew past its cgroup limit and the kernel OOM-killed it, which is exactly what intermittent crashes every few hours look like.
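
The fix was equally small. Here's a sketch of the approach, assuming the deployment is named payment-processor and the JVM honors JAVA_TOOL_OPTIONS: size the heap from the container's cgroup limit instead of hard-coding -Xmx.

# Let the JVM derive its max heap from the container memory limit
# (deployment name is an example; adjust the percentage to leave room for off-heap memory)
kubectl set env deployment/payment-processor \
  JAVA_TOOL_OPTIONS="-XX:MaxRAMPercentage=75.0"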

Pro tip: Always use --copy-to when debugging production pods. This creates a copy of the pod with the debug container attached, leaving your original pod untouched.

Trick #2: Pod Resource Limits Detective Work with kubectl top

The kubectl top command is powerful, but most engineers only use its basic form. Here's how to use it for serious detective work:

# Get detailed resource usage for specific pods
kubectl top pods --containers --sort-by=memory -n production

# Monitor resource usage over time with watch
watch -n 2 'kubectl top pods --sort-by=cpu --no-headers | head -10'

# Compare resource usage to limits
kubectl get pods -o custom-columns=\
"NAME:.metadata.name,\
CPU_REQ:.spec.containers[0].resources.requests.cpu,\
CPU_LIM:.spec.containers[0].resources.limits.cpu,\
MEM_REQ:.spec.containers[0].resources.requests.memory,\
MEM_LIM:.spec.containers[0].resources.limits.memory"

But here's the real trick: combining kubectl top with resource calculations to identify pods approaching their limits:

#!/bin/bash
# Find pods using >80% of their memory limit.
# Assumes kubectl top reports memory in Mi; pods whose limit is missing
# or expressed in other units are skipped.
kubectl top pods --no-headers | while read pod cpu memory; do
  mem_usage=$(echo $memory | sed 's/Mi//')
  mem_limit=$(kubectl get pod $pod -o jsonpath='{.spec.containers[0].resources.limits.memory}' | sed 's/Mi//')

  if [ -n "$mem_limit" ] && [[ "$mem_limit" =~ ^[0-9]+$ ]]; then
    usage_percent=$(echo "scale=2; $mem_usage * 100 / $mem_limit" | bc)
    if (( $(echo "$usage_percent > 80" | bc -l) )); then
      echo "WARNING: $pod using ${usage_percent}% of memory limit"
    fi
  fi
done

This script has saved me countless times by identifying pods that are about to hit memory limits before they crash.
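
I keep it saved as a standalone script and wrap it in watch during incidents (the filename and interval here are arbitrary):

# Re-run the check every 60 seconds
watch -n 60 ./mem-limit-check.sh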

Trick #3: Network Troubleshooting with Ephemeral Containers

Network issues are notoriously difficult to debug in Kubernetes. DNS resolution failures, service discovery problems, and connectivity issues can be maddening to track down. Ephemeral containers make this much easier.

Here's my standard network debugging workflow:

# Add an ephemeral debug container targeting the main app container
# (the ephemeral container shares the target's process namespace automatically)
kubectl debug target-pod-name -it \
  --image=nicolaka/netshoot \
  --target=main-container

# Inside the debug container, run comprehensive network tests
nslookup kubernetes.default.svc.cluster.local
dig +short api-service.production.svc.cluster.local
curl -v http://api-service:8080/health

# Test connectivity to external services
curl -v -m 10 https://api.stripe.com
nslookup api.stripe.com

# Capture network traffic
tcpdump -i eth0 -w /tmp/capture.pcap port 8080

Last month, I debugged a mysterious service connectivity issue where pods could reach external APIs but not internal services. Using this approach, I discovered that the CoreDNS configuration was corrupted, causing internal DNS queries to fail intermittently.
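
When the symptoms point at cluster DNS like that, CoreDNS itself is the next stop. A quick sketch, assuming the standard kube-system deployment with the usual k8s-app=kube-dns label:

# Check CoreDNS pod health and recent errors
kubectl -n kube-system get pods -l k8s-app=kube-dns
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=100 | grep -i error

# Review the Corefile for anything unexpected
kubectl -n kube-system get configmap coredns -o yaml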

For pods that don't allow process sharing, you can create a standalone debug pod:

kubectl run network-debug --rm -it \
  --image=nicolaka/netshoot \
  --restart=Never \
  -- /bin/bash
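
From the standalone pod you can run the same tests; just remember that services in other namespaces need their fully qualified name (the service and namespace below match the earlier examples):

# Reach a service in another namespace by FQDN
nslookup api-service.production.svc.cluster.local
curl -v http://api-service.production.svc.cluster.local:8080/health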

Trick #4: Quick Log Pattern Analysis with kubectl logs --previous

Most engineers know about kubectl logs --previous for crashed containers, but few use it effectively for pattern analysis. Here are the advanced techniques I use:

# Compare current and previous container logs side by side
diff <(kubectl logs pod-name --previous) <(kubectl logs pod-name)

# Find the exact moment a container started failing
kubectl logs pod-name --previous --timestamps | tail -50

# Search for specific error patterns across restarts
kubectl logs pod-name --previous | grep -E "(ERROR|FATAL|Exception)" | tail -10
kubectl logs pod-name | grep -E "(ERROR|FATAL|Exception)" | head -10

# Analyze startup time differences
echo "Previous startup:"
kubectl logs pod-name --previous --timestamps | grep -i "started" | head -1
echo "Current startup:"
kubectl logs pod-name --timestamps | grep -i "started" | head -1

For applications that log in JSON format, I use jq to parse and analyze log patterns:

# Extract error messages from JSON logs
kubectl logs pod-name --previous | jq -r 'select(.level=="error") | .message'

# Compare error frequencies between restarts
echo "Previous errors:"
kubectl logs pod-name --previous | jq -r 'select(.level=="error") | .error_type' | sort | uniq -c
echo "Current errors:"
kubectl logs pod-name | jq -r 'select(.level=="error") | .error_type' | sort | uniq -c

This approach helped me identify a subtle memory leak in a Node.js application. The logs showed that garbage collection cycles grew progressively longer the longer each process ran, a classic sign that objects were piling up on the heap and never being released.
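
The check that surfaced it was a short jq pipeline over the JSON logs. This is only a sketch; the timestamp and gc_pause_ms fields are hypothetical, so substitute whatever your logger actually emits for GC timings:

# Print GC pause durations in order, to see whether they trend upward over time
# (field names are hypothetical; adjust to your application's log schema)
kubectl logs pod-name | jq -r 'select(.gc_pause_ms != null) | "\(.timestamp) \(.gc_pause_ms)ms"'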

Trick #5: Node Affinity Issues - The Hidden Pod Killer

Node affinity and anti-affinity rules are often overlooked when debugging pod scheduling issues. I've seen pods stuck in "Pending" state for hours because of misconfigured affinity rules that no one remembered setting.

Here's how to quickly diagnose node affinity problems:

# Check why a pod isn't scheduling
kubectl describe pod stuck-pod-name | grep -A 10 "Events:"

# List all nodes with their labels
kubectl get nodes --show-labels

# Find pods with affinity rules
kubectl get pods -o yaml | grep -A 20 -B 5 "affinity:"

# Check if any nodes match the pod's requirements
kubectl get pod stuck-pod-name -o yaml | grep -A 20 "nodeSelector\|affinity"

But here's the real trick: a script that pulls a pod's scheduling requirements and the cluster's node allocations into one view, so the mismatch jumps out immediately:

#!/bin/bash
POD_NAME=$1
NAMESPACE=${2:-default}

echo "Analyzing scheduling requirements for $POD_NAME..."

# Dump the pod spec once so the selector and affinity rules can be inspected
kubectl get pod $POD_NAME -n $NAMESPACE -o yaml > /tmp/pod.yaml

# Show any affinity/anti-affinity rules
grep -A 20 "affinity:" /tmp/pod.yaml

# Extract node selector
NODE_SELECTOR=$(kubectl get pod $POD_NAME -n $NAMESPACE -o jsonpath='{.spec.nodeSelector}')
if [ -n "$NODE_SELECTOR" ] && [ "$NODE_SELECTOR" != "{}" ]; then
  echo "Node Selector: $NODE_SELECTOR"
fi

# Check resource requests
CPU_REQUEST=$(kubectl get pod $POD_NAME -n $NAMESPACE -o jsonpath='{.spec.containers[0].resources.requests.cpu}')
MEM_REQUEST=$(kubectl get pod $POD_NAME -n $NAMESPACE -o jsonpath='{.spec.containers[0].resources.requests.memory}')

echo "Resource Requests - CPU: $CPU_REQUEST, Memory: $MEM_REQUEST"

# Compare against what each node has already allocated
kubectl describe nodes | grep -A 5 "Allocated resources" | grep -E "(cpu|memory)"
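
Usage is just the pod name plus an optional namespace (the filename is whatever you save it as):

# Example invocation
./check-scheduling.sh stuck-pod-name production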

I once spent three hours debugging a "pending" pod only to discover that someone had added a node selector for disk=ssd, but all our SSD nodes were cordoned for maintenance. This script would have caught that immediately.
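
The checks that would have caught it are trivial in hindsight, assuming that same disk=ssd label:

# Are there any schedulable nodes carrying the label the pod demands?
kubectl get nodes -l disk=ssd

# Cordoned nodes show SchedulingDisabled in the STATUS column
kubectl get nodes | grep SchedulingDisabled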

Bonus: One-Liner Scripts for Common Pod Problems

Here are my favorite one-liners that I keep in my .bashrc for quick pod troubleshooting:

# Find pods with high restart counts
alias k8s-restarts="kubectl get pods --all-namespaces --sort-by='.status.containerStatuses[0].restartCount' | tail -10"

# Get pods consuming the most CPU
alias k8s-cpu-hogs="kubectl top pods --sort-by=cpu --no-headers | head -10"

# Find pods in non-running states (note: this also lists completed Job pods in the Succeeded phase)
alias k8s-broken="kubectl get pods --all-namespaces --field-selector=status.phase!=Running"

# Quick resource summary for a namespace
k8s-resources() {
  echo "=== Pods ==="
  kubectl get pods -n $1
  echo "=== Resource Usage ==="
  kubectl top pods -n $1 --sort-by=memory
  echo "=== Events ==="
  kubectl get events -n $1 --sort-by=.metadata.creationTimestamp | tail -5
}

# Find pods scheduled on a specific node
k8s-node-pods() {
  kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=$1
}
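
Once these are sourced in your shell, a quick namespace check looks like this (the namespace and node name are examples):

# Example usage of the helpers above
k8s-resources production
k8s-node-pods worker-node-1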

When to Escalate Beyond These Quick Fixes

These tricks will solve about 80% of common pod issues, but sometimes you need to dig deeper. Escalate to more advanced debugging when you encounter:

  • Kernel-level issues: Use kubectl debug node/node-name to debug node-level problems
  • Network policy violations: Requires analyzing Calico/Cilium logs and policy configurations
  • Storage issues: Need to examine PersistentVolume status and underlying storage provider logs
  • Multi-container coordination problems: Requires analyzing init containers and sidecar interactions
  • Cluster-wide resource exhaustion: Need cluster-level monitoring and capacity planning

The key is knowing when a quick fix won't work and you need to involve platform engineering or SRE teams.

Wrapping Up

These five troubleshooting tricks have saved me hundreds of hours over the years. The kubectl debug command alone has eliminated the need to rebuild containers with debugging tools. Combined with smart resource monitoring and log analysis, you can resolve most pod issues in minutes instead of hours.

The real secret isn't knowing every Kubernetes command—it's having a systematic approach to debugging that starts with the most likely causes and uses the right tools for each type of problem.

At BeddaTech, we help teams build robust Kubernetes operations practices that prevent these issues before they happen. If your team is spending too much time fighting fires instead of building features, we should talk. Our Fractional CTO services include establishing monitoring, alerting, and debugging workflows that keep your applications running smoothly.

What's your go-to Kubernetes debugging trick? I'd love to hear about the techniques that have saved your team time and frustration.
