Troubleshooting Methodologies
A systematic, evidence-led method for diagnosing and resolving Linux problems.
Effective troubleshooting is a discipline, not intuition. This chapter presents a repeatable, evidence-led method and applies it to the failure classes you’ll meet most: services, disk, network, and permissions.
By the end of this chapter you will be able to:
- Apply a structured troubleshooting workflow.
- Use logs and status tools as primary evidence.
- Isolate faults by working through layers.
- Resolve service, disk, network, and permission failures.
- Document root cause and remediation for reuse.
16.1 A Repeatable Method
- Define the problem precisely: what’s broken, since when, what changed?
- Reproduce it if you can — a reliable repro is half the fix.
- Gather evidence: status, logs, metrics. Don’t guess.
- Form one hypothesis and test it; change one thing at a time.
- Fix, then verify the fix actually resolved it.
- Document the root cause and the remedy.
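One way to keep the six steps honest is to fill in a short record as you work. The sketch below is only an illustration: the function name incident_note and the field labels are invented for this example and are not part of any standard tool.

```shell
# incident_note: print a fill-in record that mirrors the six steps above.
# The function name and field labels are illustrative assumptions.
incident_note() {
    title="${1:-untitled incident}"
    printf 'Incident:     %s\n' "$title"
    printf 'Opened:       %s\n' "$(date '+%Y-%m-%d %H:%M')"
    printf 'Problem:      (what is broken, since when, what changed?)\n'
    printf 'Reproduce:    (exact steps, or "not reproducible")\n'
    printf 'Evidence:     (status, logs, metrics)\n'
    printf 'Hypothesis:   (one at a time; note each change made)\n'
    printf 'Fix + verify: (what was changed, how it was confirmed)\n'
    printf 'Root cause:   \n'
    printf 'Remedy:       \n'
}
```

For example, incident_note "nginx returns 502" > incident.txt creates a file you can fill in during the incident and keep afterwards as the documentation step.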
16.2 Evidence Sources
systemctl status <svc> # is it running? why did it fail?
journalctl -u <svc> -e # the service's own error story
journalctl -p err -b # all errors since boot
dmesg -T | tail # kernel/hardware messages
df -h ; free -h ; uptime # capacity and load at a glance
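During an incident it can help to capture all of this evidence in one pass, so you can read it calmly and attach it to your notes. The following is a minimal sketch, not a standard tool: the names section and collect_evidence are made up here, and --no-pager and -n 50 are added so the output can be redirected to a file (they stand in for the interactive -e shown above).

```shell
# section: print a labelled header, then run one command.
# Commands that are not installed are reported and skipped.
section() {
    # usage: section "label" command [args...]
    label="$1"; shift
    printf '\n===== %s =====\n' "$label"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1 || printf '(exit status %s)\n' "$?"
    else
        printf '(skipped: %s not found)\n' "$1"
    fi
}

# collect_evidence: run the evidence commands above for one service.
# The 50-line and 20-line limits are arbitrary illustrative choices.
collect_evidence() {
    svc="$1"
    section "service status" systemctl status --no-pager "$svc"
    section "service log"    journalctl -u "$svc" -n 50 --no-pager
    section "errors (boot)"  journalctl -p err -b --no-pager
    section "kernel"         sh -c 'dmesg -T | tail -n 20'
    section "disk"           df -h
    section "memory"         free -h
    section "load"           uptime
}
```

A typical call would be collect_evidence nginx > evidence.txt. On many systems dmesg and the full journal need root, so expect to run it with sudo or to see partial output.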
16.3 Layered Isolation
For anything network- or service-related, move outward one layer at a time and confirm each before blaming the next:
- Is the process running? (systemctl status, ss -tulpn)
- Is it listening on the expected port?
- Does the host firewall allow it? Then any cloud security group?
- Does it answer locally (curl localhost) but not remotely? Then the service itself works, and the fault lies in its bind address, a firewall, or the network path.
- Is DNS resolving? Test by IP vs by name.
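The layer walk above can be sketched as a short script that stops at the first failing layer. Treat it as an illustration under assumptions: check and walk_layers are hypothetical names, the local test assumes an HTTP service, and the firewall layer is left out because the right command depends on which firewall the host runs.

```shell
# check: run one test quietly and report ok or FAIL.
check() {
    # usage: check "label" command [args...]
    label="$1"; shift
    if "$@" >/dev/null 2>&1; then
        printf 'ok    %s\n' "$label"
    else
        printf 'FAIL  %s\n' "$label"
        return 1
    fi
}

# walk_layers: test each layer in order; the && chain stops at the first failure.
# Arguments (service, port, hostname) are placeholders for your own values.
walk_layers() {
    svc="$1"; port="$2"; name="$3"
    check "process active ($svc)"    systemctl is-active --quiet "$svc" &&
    check "listening on port $port"  sh -c "ss -Htln 'sport = :$port' | grep -q ." &&
    check "answers on localhost"     curl -sS -o /dev/null --max-time 5 "http://localhost:$port/" &&
    check "name resolves ($name)"    getent hosts "$name"
}
```

Running walk_layers nginx 80 www.example.com prints one line per layer. The first FAIL line is where to look, and every layer before it has been confirmed.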
16.4 Common Failure Patterns
| Symptom | First moves |
|---|---|
| Service won’t start | systemctl status + journalctl -u; validate its config; check the port isn’t taken. |
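| Disk full | df -h to find the full filesystem; du to find what is using the space (often /var/log); rotate or compress logs. |
| Can’t connect | Test by IP, then by name; check the listening port, the firewall, then the app. Refused usually means the host answered but nothing accepted the connection on that port; a timeout usually means packets are being dropped (firewall or routing). |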
| Permission denied | ls -l / ls -Z; fix ownership/permissions or SELinux context — never chmod 777. |
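For the "can't connect" row, curl's exit status already separates the common cases. The sketch below maps the documented exit codes 0, 6, 7, and 28 to a likely fault layer. The function names explain_connect and probe and the wording of the hints are illustrative assumptions, and the hints are starting points rather than verdicts.

```shell
# explain_connect: translate a curl exit status into a likely fault layer.
explain_connect() {
    case "$1" in
        0)  echo "connected: the network path and the listener are working" ;;
        6)  echo "could not resolve host: check DNS, then retry by IP" ;;
        7)  echo "connection failed (often refused): check that the service is up and listening on that port" ;;
        28) echo "timed out: packets are probably being dropped; check firewalls and routing" ;;
        *)  echo "curl exit $1: see the EXIT CODES section of man curl" ;;
    esac
}

# probe: try to connect for up to 5 seconds, then explain the result.
probe() {
    # usage: probe URL
    curl -sS -o /dev/null --connect-timeout 5 "$1"
    explain_connect "$?"
}
```

For example, probe http://192.0.2.10:8080/ followed by the same probe using the hostname covers the "by IP, then by name" step in one go (the address is a documentation placeholder).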
16.5 Guided Lab: Structured Diagnosis
Estimated time: 25 minutes. Practise the method on a safe, self-made scenario so the workflow is muscle memory before a real incident.
- Pick a service (e.g. nginx on a test VM). Confirm it’s healthy: systemctl status nginx and curl -I http://localhost.
- Introduce a fault on purpose: stop it (sudo systemctl stop nginx).
- Now diagnose as if you didn’t know: define the symptom, then gather evidence with status + journalctl.
- State a hypothesis (‘the service is stopped’), test it, and fix it (sudo systemctl start nginx).
- Verify with curl that it’s back, then write a two-line note: root cause + remedy.
- Repeat with a different fault (e.g. a config typo, then nginx -t to catch it).
Troubleshooting
| Symptom | Likely cause and fix |
|---|---|
| Changed several things and now it works, but you don’t know why | Avoid this: change one variable at a time. If it has already happened, revert all but one change and re-test to find the real cause. |
| Logs are overwhelming | Filter: journalctl -u <svc> -e for one service, -p err for errors, --since for a time window. Start narrow. |
| Can’t reproduce the problem | Capture exact steps, environment, and timing. Intermittent issues need monitoring/logging over time — see observability. |
| Fixed it but it came back | You treated a symptom, not the root cause. Re-run the method, focusing on ‘what changed’ and underlying conditions. |
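The "start narrow" advice for logs can be wrapped in a small helper that combines the three filters. This is a sketch with assumed defaults: narrow_logs is a hypothetical name, and the one-hour window and err priority are only starting values to widen as needed.

```shell
# narrow_logs: show one unit's messages at a given priority within a time window.
# usage: narrow_logs SERVICE [SINCE] [PRIORITY]
# Defaults (illustrative): the last hour, priority err and worse.
narrow_logs() {
    journalctl -u "$1" -p "${3:-err}" --since "${2:-1 hour ago}" --no-pager
}
```

Start with narrow_logs nginx. If that shows nothing useful, widen one filter at a time, for example narrow_logs nginx today or narrow_logs nginx "1 hour ago" warning.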
Practice & Prove It
Write-the-command drills
- Show a service’s status plus its most recent log lines (one command each).
- Show all error-priority messages since the last boot.
- Show recent kernel messages with human-readable timestamps.
- Get an at-a-glance view of disk, memory, and load with three commands.
- List what’s listening on the network with owning processes.
Quick quiz
- What’s the most powerful first question in troubleshooting?
- Why change only one thing at a time?
- What’s the benefit of working in layers?
- For ‘can’t connect’, what does refused vs timed out suggest?
- What’s the final step of the method, and why?
Key Takeaways
- Use a repeatable method: define, reproduce, gather evidence, hypothesise, fix, verify, document.
- ‘What changed?’ points at most incidents — deploys, edits, updates, full disks.
- Lead with evidence: systemctl status, journalctl, dmesg, df/free/uptime.
- Isolate in layers (process → port → firewall → app → DNS) to converge fast.
- Fix the root cause, verify, and document — never chmod 777 your way out.
This concludes the core chapters. The certification booklets and reference appendix consolidate and assess the full curriculum.