Introduction: When All Else Fails, Layer by Layer

This week, I found myself knee-deep in troubleshooting Grafana Labs telemetry containers (Loki, Mimir, Prometheus, Tempo, the OpenTelemetry Collector, Spring Boot services, etc.) that refused to work properly in an OpenShift cluster. The workloads looked good and the containers were running, but the telemetry data simply wouldn’t flow. Sound familiar?

After hours of fruitless searching and even hitting a dead end with AI assistance (yes, even Claude couldn’t crack it), I realized I needed to fall back on long-learned principles. That’s when I decided to document the mental framework I have developed over years of debugging network issues.

This article presents my Network Troubleshooting Matrix—a systematic method based on the OSI model that takes you from symptoms to root cause, layer by layer. Feel free to use it as a reference, or create your own.

My Network Troubleshooting Matrix

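Here’s the systematic approach I use for diagnosing network connectivity issues, organized by OSI layer. The table lists the layers from Layer 1 up to Layer 7; I work through it in the opposite direction, starting at Layer 7: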

| OSI Layer | Step | Common Issues | Tools | Example Command | Focus Area |
|---|---|---|---|---|---|
| Layer 1 – Physical | Physical Layer | Cable unplugged, bad NIC, interface down, link negotiation failure, hardware error | ethtool, ip | ethtool eth0 | Hardware / Interface |
| Layer 2-3 – Data Link / Network | Basic Connectivity | IP conflict, wrong gateway, no link, DHCP failure, interface down, no default route | ip, ping, netstat | ip addr show | Local Connectivity / IP Configuration |
| Layer 2-4 – Data Link / Network / Transport | Packet Analysis | Packet loss, TCP retransmissions, protocol issues, connection resets, latency spikes | tcpdump, netstat | tcpdump -i eth0 port 80 | Packet Inspection / Protocol Analysis |
| Layer 3-4 – Network / Transport | Container Networking | Pod communication failure, service discovery issues, CNI problems, overlay network | kubectl, docker, nc, ping | kubectl get pods -o wide | Container Orchestration |
| Layer 3-4 – Network / Transport | Routing & Port Reachability | Firewall block, server unreachable, port closed, routing loop, no route to host | traceroute, nc, nmap, ping, ip | traceroute example.com | Routing / Firewall / Connectivity |
| Layer 4-7 – Transport / Application | Load Balancer / Proxy | Health check failures, upstream timeouts, SSL termination issues, 502/504 errors | curl, wget, openssl | curl -I https://example.com | Load Balancing / Reverse Proxy |
| Layer 7 – Application | Identify Problem | Misconfiguration, app changes, deployment errors, unclear symptoms | Review logs and symptoms | Check recent deployments and changes | Problem Definition / Operations |
| Layer 7 – Application | Application Layer | Wrong endpoint, auth failure, invalid SSL cert, API timeout, HTTP errors | curl, wget, openssl | curl -v https://api.example.com | Application / API Communication |
| Layer 7 – Application | DNS Resolution | DNS lookup failure, misconfigured nameserver, stale cache, NXDOMAIN | dig, nslookup, host | dig example.com | DNS / Name Resolution |
| Layer 7 – Application | Server / Service Health | Service stopped, port not listening, connections refused, service binding issues | netstat, nc, nmap | netstat -tuln | Server / Infrastructure |
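
When the usual CLI tools are not available (for example inside a minimal container image), two of the checks in the matrix can be approximated with Python's standard library. The sketch below is only an illustration: `resolves` and `port_open` are hypothetical helper names, roughly standing in for a `dig` lookup and an `nc`-style port probe, and the 2-second timeout is an arbitrary assumption.

```python
import socket


def resolves(hostname: str) -> bool:
    """DNS Resolution step: True if the name resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except socket.gaierror:
        # Lookup failed (for example NXDOMAIN or no reachable nameserver).
        return False


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Routing & Port Reachability step: True if a TCP connection succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and "no route to host".
        return False
```

A call such as `port_open("example.com", 443)` tells you only whether the TCP handshake completed; it says nothing about TLS or the application behind the port, which is what the `curl` and `openssl` rows are for.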

Why do I start from Layer 7? I begin there because the higher you go in the OSI stack, the more semantic and human-readable the information becomes. It’s much easier to understand an error like “API request returned 403 Forbidden” than something like “dropped packets on eth0.” Starting from the application layer gives you immediate clues about intent, configuration, and context before you dive into the lower-level network details.
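
The top-down ordering can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a finished tool: the `Check` class and `failing_steps` function are hypothetical names, and the probes are stubbed lambdas where real ones would wrap commands like `curl`, `dig`, or `ip`.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Check:
    layer: int               # highest OSI layer the step touches
    step: str                # step name from the matrix
    run: Callable[[], bool]  # hypothetical probe: True means healthy


def failing_steps(checks: list[Check]) -> list[str]:
    """Run every check from the highest layer down and list the failures."""
    ordered = sorted(checks, key=lambda c: c.layer, reverse=True)
    return [c.step for c in ordered if not c.run()]


# Stubbed probes for illustration only; the results are made up.
checks = [
    Check(1, "Physical Layer", lambda: True),
    Check(3, "Basic Connectivity", lambda: False),
    Check(7, "DNS Resolution", lambda: False),
    Check(7, "Application Layer", lambda: True),
]

print(failing_steps(checks))  # ['DNS Resolution', 'Basic Connectivity']
```

Reporting the failures highest layer first means the most readable symptom appears at the top of the list, with the lower-level findings underneath it as candidates for the root cause.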