Introduction: When All Else Fails, Layer by Layer
This week, I found myself knee-deep in troubleshooting Grafana Labs telemetry containers (Loki, Mimir, Prometheus, Tempo, OTel Collector, Spring Boot services, etc.) that refused to start properly in an OpenShift cluster. The workloads looked good, the containers were running, but the telemetry data simply wouldn’t flow. Sound familiar?
After hours of fruitless searching, and after even AI assistance came up empty (yes, even Claude couldn’t crack it), I realized I needed to fall back on long-learned principles. That’s when I decided to document the mental framework I have developed over years of debugging network issues.
This article presents my Network Troubleshooting Matrix—a systematic method based on the OSI model that takes you from symptoms to root cause, layer by layer. Feel free to use it as a reference, or create your own.
My Network Troubleshooting Matrix
Here’s the systematic approach I use for diagnosing network connectivity issues, organized by OSI layer. The table lists the layers from Layer 1 up to Layer 7:
| OSI Layer | Step | Common Issues | Tools | Example Command | Focus Area |
|---|---|---|---|---|---|
| Layer 1 – Physical | Physical Layer | Cable unplugged, bad NIC, interface down, link negotiation failure, hardware error | ethtool, ip | ethtool eth0 | Hardware / Interface |
| Layer 2-3 – Data Link / Network | Basic Connectivity | IP conflict, wrong gateway, no link, DHCP failure, interface down, no default route | ip, ping, netstat | ip addr show | Local Connectivity / IP Configuration |
| Layer 2-4 – Data Link / Network / Transport | Packet Analysis | Packet loss, TCP retransmissions, protocol issues, connection resets, latency spikes | tcpdump, netstat | tcpdump -i eth0 port 80 | Packet Inspection / Protocol Analysis |
| Layer 3-4 – Network / Transport | Container Networking | Pod communication failure, service discovery issues, CNI problems, overlay network | kubectl, docker, nc, ping | kubectl get pods -o wide | Container Orchestration |
| Layer 3-4 – Network / Transport | Routing & Port Reachability | Firewall block, server unreachable, port closed, routing loop, no route to host | traceroute, nc, nmap, ping, ip | traceroute example.com | Routing / Firewall / Connectivity |
| Layer 4-7 – Transport / Application | Load Balancer / Proxy | Health check failures, upstream timeouts, SSL termination issues, 502/504 errors | curl, wget, openssl | curl -I https://example.com | Load Balancing / Reverse Proxy |
| Layer 7 – Application | Identify Problem | Misconfiguration, app changes, deployment errors, unclear symptoms | Review logs and symptoms | Check recent deployments and changes | Problem Definition / Operations |
| Layer 7 – Application | Application Layer | Wrong endpoint, auth failure, invalid SSL cert, API timeout, HTTP errors | curl, wget, openssl | curl -v https://api.example.com | Application / API Communication |
| Layer 7 – Application | DNS Resolution | DNS lookup failure, misconfigured nameserver, stale cache, NXDOMAIN | dig, nslookup, host | dig example.com | DNS / Name Resolution |
| Layer 7 – Application | Server / Service Health | Service stopped, port not listening, connections refused, service binding issues | netstat, nc, nmap | netstat -tuln | Server / Infrastructure |
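The rows of the matrix can be turned into a small repeatable checklist. The following is a minimal sketch, not a finished tool: the `run_matrix` function and its `label::command` argument format are my own illustrative inventions, and the host and port in the commented example are placeholders. It runs each check in the order given and prints whether it passed, so you can see at a glance which step of the matrix breaks.

```shell
#!/bin/sh
# Run each "label::command" check in the order given and print PASS or FAIL.
# run_matrix and the "label::command" format are illustrative assumptions,
# not part of any standard tool.
run_matrix() {
  for entry in "$@"; do
    label=${entry%%::*}   # text before the first "::"
    cmd=${entry#*::}      # text after the first "::"
    if sh -c "$cmd" >/dev/null 2>&1; then
      printf 'PASS %s\n' "$label"
    else
      printf 'FAIL %s\n' "$label"
    fi
  done
}

# Example invocation with a placeholder host and port:
#   HOST=api.example.com; PORT=443
#   run_matrix \
#     "Application Layer::curl -fsS -o /dev/null https://$HOST" \
#     "DNS Resolution::dig +short $HOST | grep -q ." \
#     "Port Reachability::nc -z -w 3 $HOST $PORT"
```

The checks in the commented example assume `curl`, `dig`, and a netcat variant that supports `-z` are installed; swap in whichever tools from the matrix are available in your environment.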
Why I Start from Layer 7
I begin at Layer 7 because the higher you go in the OSI stack, the more semantic and human-readable the information becomes. It’s much easier to understand an error like “API request returned 403 Forbidden” than something like “dropped packets on eth0.” Starting from the application layer gives you immediate clues about intent, configuration, and context before you dive into the lower-level network details.
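To illustrate how an application-layer result already points toward a row of the matrix, here is a small sketch that maps a few well-known `curl` exit codes to the step worth checking next. The `classify_curl_exit` function and the mapping to matrix steps are my own assumptions for illustration; the exit-code meanings come from the curl manual, but a given code can have more than one underlying cause.

```shell
#!/bin/sh
# Map a curl exit code to the matrix step to look at next.
# classify_curl_exit is a hypothetical helper; the mapping is a rough guide.
classify_curl_exit() {
  case "$1" in
    0)     echo "OK" ;;
    6)     echo "DNS Resolution" ;;               # could not resolve host
    7)     echo "Server / Service Health" ;;      # failed to connect to host
    22)    echo "Application Layer" ;;            # HTTP error >= 400 (with -f)
    28)    echo "Routing & Port Reachability" ;;  # operation timed out
    35|60) echo "Application Layer (TLS)" ;;      # TLS handshake / certificate
    *)     echo "Identify Problem" ;;             # anything else: review logs
  esac
}

# Example with a placeholder URL:
#   curl -fsS -o /dev/null https://api.example.com
#   classify_curl_exit $?
```

A timeout (code 28) is the least specific of these, since a silently dropping firewall, a routing problem, and a slow upstream can all produce it, so treat the suggested step as a starting point only.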