Troubleshooting Methodologies

A systematic, evidence-led method for diagnosing and resolving Linux problems.

Effective troubleshooting is a discipline, not intuition. This chapter presents a repeatable, evidence-led method and applies it to the failure classes you’ll meet most: services, disk, network, and permissions.

By the end of this chapter you will be able to:

Apply a structured troubleshooting workflow.
Use logs and status tools as primary evidence.
Isolate faults by working through layers.
Resolve service, disk, network, and permission failures.
Document root cause and remediation for reuse.

16.1 A Repeatable Method

Define the problem precisely: what’s broken, since when, what changed?
Reproduce it if you can — a reliable repro is half the fix.
Gather evidence: status, logs, metrics. Don’t guess.
Form one hypothesis and test it; change one thing at a time.
Fix, then verify the fix actually resolved it.
Document the root cause and the remedy.
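The ‘what changed?’ question in the first step can often be answered with evidence rather than memory. The sketch below lists files modified recently under a directory. The helper name recently_changed and its defaults (/etc, one day) are illustrative assumptions, not a standard tool.

```shell
# recently_changed DIR [DAYS]: regular files under DIR modified in the last DAYS days.
# Hypothetical helper; the defaults (/etc, 1 day) are only examples.
recently_changed() {
    # -xdev stays on one filesystem; -mtime -N means "less than N days ago".
    find "${1:-/etc}" -xdev -type f -mtime "-${2:-1}" 2>/dev/null | sort
}
```

For example, recently_changed /etc 2 shows configuration files edited in the last two days. Package history is a complementary source: rpm -qa --last on RPM-based systems, or /var/log/dpkg.log on Debian-family systems.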

16.2 Evidence Sources

Your standard evidence-gathering sweep

systemctl status <svc>      # is it running? why did it fail?
journalctl -u <svc> -e       # the service's own error story
journalctl -p err -b         # all errors since boot
dmesg -T | tail              # kernel/hardware messages
df -h ; free -h ; uptime     # capacity and load at a glance
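
During an incident it helps to capture the whole sweep in one file so you can compare before and after a fix. This is a minimal sketch, assuming a systemd host; collect_evidence and the output file naming are hypothetical, and each command's output is trimmed to its last 50 lines.

```shell
# collect_evidence SERVICE [OUTPUT_DIR]: run the standard sweep into one file.
# Hypothetical helper, not a standard tool. Prints the path of the file written.
collect_evidence() {
    svc=$1
    out="${2:-/tmp}/evidence-$svc-$(date +%Y%m%d-%H%M%S).txt"
    for cmd in \
        "systemctl status $svc --no-pager" \
        "journalctl -u $svc -n 50 --no-pager" \
        "journalctl -p err -b -n 50 --no-pager" \
        "dmesg -T" \
        "df -h" \
        "free -h" \
        "uptime"
    do
        echo "=== $cmd"
        # A failing or missing command is recorded rather than aborting the sweep.
        sh -c "$cmd" 2>&1 | tail -n 50
    done > "$out"
    echo "$out"
}
```

Run it as collect_evidence nginx (with sudo if dmesg or the journal is restricted on your system), then read the file from the top.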

16.3 Layered Isolation

For anything network- or service-related, move outward one layer at a time and confirm each before blaming the next:

Is the process running? (systemctl status)
Is it listening on the expected port? (ss -tulpn)
Does the host firewall allow it? Then any cloud security group?
Does it answer locally (curl localhost) but not remotely? Then the service itself is working, and the fault lies in the bind address, a firewall, or the network path.
Is DNS resolving? Test by IP vs by name.
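
The layers above can be sketched as a script that stops at the first one that fails. This assumes an HTTP service managed by systemd, with ss, curl, and getent installed; report and check_layers are hypothetical names. The firewall layer is left as a comment because the tool differs by distribution.

```shell
# report LABEL CMD...: run CMD quietly and print whether that layer passed.
report() {
    label=$1; shift
    if "$@" >/dev/null 2>&1; then
        echo "ok   $label"
    else
        echo "FAIL $label"
        return 1
    fi
}

# check_layers SERVICE PORT NAME: confirm each layer before moving to the next.
check_layers() {
    svc=$1 port=$2 name=$3
    report "process: $svc is active" \
        systemctl is-active --quiet "$svc" || return
    report "port: something listens on TCP $port" \
        sh -c "ss -tln | grep -q ':$port '" || return
    # Firewall layer: inspect by hand here (firewall-cmd, ufw, or nft,
    # depending on the distribution), then any cloud security group.
    report "local: HTTP answer on localhost:$port" \
        curl -sS --max-time 5 -o /dev/null "http://localhost:$port/" || return
    report "dns: $name resolves" \
        getent hosts "$name" || return
}
```

For example, check_layers nginx 80 www.example.com prints one line per layer and stops at the first FAIL. The remote half of the local-versus-remote comparison still has to be run from another machine.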

16.4 Common Failure Patterns

Symptom	First moves
Service won’t start	systemctl status + journalctl -u; validate its config; check the port isn’t taken.
Disk full	df -h to find the full filesystem; du to find what is filling it (often /var/log); then rotate or compress the offending logs.
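Can’t connect	Test by IP then name; check listening port, firewall, then app. Refused vs timeout tells you a lot: refused means the host replied but nothing accepted the connection on that port, while a timeout usually means packets are being dropped (firewall, routing, or host down).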
Permission denied	ls -l / ls -Z; fix ownership/permissions or SELinux context — never chmod 777.
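
For the disk-full row, the du step is usually a drill-down: list the largest directories one level at a time until the culprit is obvious. A minimal sketch, assuming GNU du and sort; biggest_dirs is a hypothetical helper name and /var is only an example default.

```shell
# biggest_dirs DIR [N]: the N largest entries from a one-level du of DIR,
# sizes in KiB, sorted ascending. The list includes DIR's own total.
biggest_dirs() {
    # -x stays on one filesystem so other mounts do not distort the totals.
    du -xk --max-depth=1 "${1:-/var}" 2>/dev/null | sort -n | tail -n "${2:-10}"
}
```

Start from the mount point that df -h shows as full, then call biggest_dirs again on the largest subdirectory it reports. Run it as root if some directories are unreadable to your user.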

16.5 Guided Lab: Structured Diagnosis

Estimated time: 25 minutes. Practise the method on a safe, self-made scenario so the workflow is muscle memory before a real incident.

Pick a service (e.g. nginx on a test VM). Confirm it’s healthy: systemctl status nginx and curl -I http://localhost.
Introduce a fault on purpose: stop it (sudo systemctl stop nginx).
Now diagnose as if you didn’t know: define the symptom, then gather evidence with status + journalctl.
State a hypothesis (‘the service is stopped’), test it, and fix it (sudo systemctl start nginx).
Verify with curl that it’s back, then write a two-line note: root cause + remedy.
Repeat with a different fault, e.g. introduce a config typo and use nginx -t to catch it.
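
The two-line note from the lab is easier to keep up if writing it takes one command. This sketch appends a dated entry to a plain-text file; log_incident and the entry layout are assumptions for illustration, not a prescribed format.

```shell
# log_incident FILE ROOT_CAUSE REMEDY: append a dated two-line note to FILE.
# Hypothetical helper; adapt the layout to whatever your team already uses.
log_incident() {
    printf '%s\nRoot cause: %s\nRemedy: %s\n\n' \
        "$(date '+%Y-%m-%d %H:%M')" "$2" "$3" >> "$1"
}
```

For the lab fault you might run log_incident ~/incidents.txt "nginx was stopped" "sudo systemctl start nginx".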

Troubleshooting

Symptom	Likely cause and fix
Changed several things and now it works, but you don’t know why	Avoid this by changing one variable at a time. If it has already happened, revert the changes one by one, re-testing after each, to find the one that mattered.
Logs are overwhelming	Filter: journalctl -u <svc> -e for one service, -p err for errors, --since for a time window. Start narrow.
Can’t reproduce the problem	Capture exact steps, environment, and timing. Intermittent issues need monitoring/logging over time — see observability.
Fixed it but it came back	You treated a symptom, not the root cause. Re-run the method, focusing on ‘what changed’ and underlying conditions.

Practice & Prove It

Write-the-command drills

Show a service’s status plus its most recent log lines (one command each).
Show all error-priority messages since the last boot.
Show recent kernel messages with human-readable timestamps.
Get an at-a-glance view of disk, memory, and load with three commands.
List what’s listening on the network with owning processes.

Quick quiz

What’s the most powerful first question in troubleshooting?
Why change only one thing at a time?
What’s the benefit of working in layers?
For ‘can’t connect’, what does refused vs timed out suggest?
What’s the final step of the method, and why?

Key Takeaways

Use a repeatable method: define, reproduce, gather evidence, hypothesise, fix, verify, document.
‘What changed?’ explains most incidents: deploys, config edits, updates, full disks.
Lead with evidence: systemctl status, journalctl, dmesg, df/free/uptime.
Isolate in layers (process → port → firewall → app → DNS) to converge fast.
Fix the root cause, verify, and document — never chmod 777 your way out.

This concludes the core chapters. The certification booklets and reference appendix consolidate and assess the full curriculum.