Practical Professional Linux — Professional

Chapter 16 · Skill Level: Professional

Troubleshooting Methodologies

A systematic, evidence-led method for diagnosing and resolving Linux problems.

Effective troubleshooting is a discipline, not intuition. This chapter presents a repeatable, evidence-led method and applies it to the failure classes you’ll meet most: services, disk, network, and permissions.

By the end of this chapter you will be able to

  • Apply a structured troubleshooting workflow.
  • Use logs and status tools as primary evidence.
  • Isolate faults by working through layers.
  • Resolve service, disk, network, and permission failures.
  • Document root cause and remediation for reuse.

16.1 A Repeatable Method

  • Define the problem precisely: what’s broken, since when, what changed?
  • Reproduce it if you can — a reliable repro is half the fix.
  • Gather evidence: status, logs, metrics. Don’t guess.
  • Form one hypothesis and test it; change one thing at a time.
  • Fix, then verify the fix actually resolved it.
  • Document the root cause and the remedy.
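
The ‘what changed?’ question in the first step can often be answered straight from the filesystem. The sketch below lists files modified within a recent window. It is a minimal example, not a standard tool: recently_changed is a hypothetical helper name, and it assumes GNU find. It only sees file modification times, so a package update or deploy that left files untouched, or that predates the window, will not show up.

```shell
#!/usr/bin/env bash
# Sketch: answer "what changed?" by listing recently modified files.
# recently_changed is a hypothetical helper; it assumes GNU find.

recently_changed() {
    local dir="${1:-/etc}"      # where to look (most config lives in /etc)
    local minutes="${2:-1440}"  # look-back window, default 24 hours

    # -xdev stays on one filesystem; -mmin -N means "modified less
    # than N minutes ago". Errors for unreadable paths are discarded.
    find "$dir" -xdev -type f -mmin "-$minutes" -print 2>/dev/null | sort
}
```

For example, recently_changed /etc 120 lists files under /etc edited in the last two hours, which is a quick way to test the hypothesis ‘someone changed a config’.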

16.2 Evidence Sources

Your standard evidence-gathering sweep
systemctl status <svc>      # is it running? why did it fail?
journalctl -u <svc> -e       # the service's own error story
journalctl -p err -b         # all errors since boot
dmesg -T | tail              # kernel/hardware messages
df -h ; free -h ; uptime     # capacity and load at a glance
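
During an incident it helps to save this sweep rather than just read it, so the evidence survives a restart and can go into the write-up. The following is one possible sketch under stated assumptions: collect_evidence, capture, and the output file names are illustrative, and --no-pager is added so the commands do not wait for a pager when output is redirected.

```shell
#!/usr/bin/env bash
# Sketch: run the evidence sweep once and keep the output for the write-up.
# collect_evidence, capture, and the file names are illustrative choices.

# Run a command and save its output (stdout and stderr) under a label.
# A non-zero exit status is noted in the file; the sweep continues.
capture() {
    local outdir="$1" label="$2"
    shift 2
    {
        echo "\$ $*"
        "$@" 2>&1 || echo "[exit status $?]"
    } > "$outdir/$label.txt"
}

# Usage: collect_evidence <service> [output-directory]
collect_evidence() {
    local svc="$1"
    local outdir="${2:-/tmp/evidence-$svc-$(date +%Y%m%d-%H%M%S)}"
    mkdir -p "$outdir" || return 1

    capture "$outdir" status  systemctl status "$svc" --no-pager
    capture "$outdir" journal journalctl -u "$svc" -e --no-pager
    capture "$outdir" errors  journalctl -p err -b --no-pager
    capture "$outdir" dmesg   sh -c 'dmesg -T | tail -n 50'
    capture "$outdir" df      df -h
    capture "$outdir" free    free -h
    capture "$outdir" uptime  uptime

    echo "$outdir"   # print where the evidence was saved
}
```

Run it as collect_evidence nginx and read the files in the directory it prints. Some commands (dmesg on hardened systems, the full journal) may need root; when they fail, the error text is saved in place of the output.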

16.3 Layered Isolation

For anything network- or service-related, move outward one layer at a time and confirm each before blaming the next:

  • Is the process running? (systemctl status)
  • Is it listening on the expected port? (ss -tulpn)
  • Does the host firewall allow it? Then any cloud security group?
  • Does it answer locally (curl localhost) but not remotely? That narrows it instantly.
  • Is DNS resolving? Test by IP vs by name.
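
The layer walk above can be sketched as a short script that stops at the first failing layer, so the output points at where to look next. This is an illustration, not a complete tool: check_layer and diagnose are hypothetical names, the local check assumes an HTTP service, and the firewall layer is left out because the command differs by distro (firewalld, ufw, nftables).

```shell
#!/usr/bin/env bash
# Sketch: walk the layers in order and stop at the first one that fails.
# check_layer and diagnose are illustrative names; the firewall layer is
# omitted because the command to query it differs between distros.

# Run one check quietly and report ok/FAIL with a label.
check_layer() {
    local label="$1"
    shift
    if "$@" >/dev/null 2>&1; then
        echo "ok    $label"
    else
        echo "FAIL  $label"
        return 1
    fi
}

# Usage: diagnose <service> <port> <hostname>
diagnose() {
    local svc="$1" port="$2" name="$3"

    # The && chain means a later layer is only tested once the
    # previous one has been confirmed.
    check_layer "process running ($svc)" systemctl is-active --quiet "$svc" &&
    check_layer "listening on port $port" sh -c "ss -tln | grep -q ':$port '" &&
    check_layer "answers locally" curl -sS -o /dev/null --max-time 5 "http://localhost:$port/" &&
    check_layer "name resolves ($name)" getent hosts "$name"
}
```

A call such as diagnose nginx 80 www.example.com prints one line per confirmed layer and ends on the first FAIL. If every layer passes locally but remote clients still cannot connect, the host firewall and any cloud security group are the next suspects.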

16.4 Common Failure Patterns

Symptom | First moves
Service won’t start | systemctl status + journalctl -u; validate its config; check the port isn’t taken.
Disk full | df -h to find the filesystem; du to find the cause (usually /var/log); rotate/compress.
Can’t connect | Test by IP then name; check listening port, firewall, then app. Refused vs timeout tells you a lot.
Permission denied | ls -l / ls -Z; fix ownership/permissions or SELinux context; never chmod 777.
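
Two of these rows benefit from a concrete sketch. For ‘Disk full’, du sorted by size shows which directory is growing. For ‘Can’t connect’, a refused connection usually means the host was reached but nothing is listening on that port, while a timeout usually means packets are being dropped by a firewall or routing. The helpers below are illustrative (largest_dirs, explain_curl_rc, and probe are hypothetical names) and assume GNU du and sort plus curl; exit statuses 6, 7, and 28 are curl's documented codes for resolve failure, connect failure, and timeout.

```shell
#!/usr/bin/env bash
# Sketch for two rows of the table: "Disk full" and "Can't connect".
# largest_dirs, explain_curl_rc, and probe are illustrative names.

# Disk full: list the largest directories one level below a starting point.
# Assumes GNU du (--max-depth) and GNU sort (-h).
largest_dirs() {
    local dir="${1:-/var}" count="${2:-10}"
    du -xh --max-depth=1 "$dir" 2>/dev/null | sort -rh | head -n "$count"
}

# Can't connect: translate curl's exit status into the likely layer.
explain_curl_rc() {
    case "$1" in
        0)  echo "connected: the network path works, look at the application" ;;
        6)  echo "name did not resolve: check DNS, then retry by IP" ;;
        7)  echo "connect failed, usually refused: host reachable, nothing listening on that port" ;;
        28) echo "timed out: packets are likely being dropped, suspect a firewall or routing" ;;
        *)  echo "curl exit $1: see EXIT CODES in the curl man page" ;;
    esac
}

# Usage: probe <host-or-ip> <port>
probe() {
    local rc=0
    curl -s -o /dev/null --connect-timeout 5 "http://$1:$2/" || rc=$?
    explain_curl_rc "$rc"
}
```

Run largest_dirs /var first, then repeat on the biggest subdirectory it reports until you reach the files responsible. The first line of output is the total for the starting directory itself.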

16.5 Guided Lab: Structured Diagnosis

Estimated time: 25 minutes. Practise the method on a safe, self-made scenario so the workflow is muscle memory before a real incident.

  • Pick a service (e.g. nginx on a test VM). Confirm it’s healthy: systemctl status nginx and curl -I http://localhost.
  • Introduce a fault on purpose: stop it (sudo systemctl stop nginx).
  • Now diagnose as if you didn’t know: define the symptom, then gather evidence with status + journalctl.
  • State a hypothesis (‘the service is stopped’), test it, and fix it (sudo systemctl start nginx).
  • Verify with curl that it’s back, then write a two-line note: root cause + remedy.
  • Repeat with a different fault (e.g. a config typo, then nginx -t to catch it).
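
For the note in the verification step, a small helper keeps every write-up in the same shape so it is easy to search later. This is only a sketch: log_incident and the default ~/incidents.log path are hypothetical choices, and a ticket system or wiki serves the same purpose.

```shell
#!/usr/bin/env bash
# Sketch: append a dated root-cause note so the fix is reusable later.
# log_incident and the default ~/incidents.log path are illustrative.

# Usage: log_incident <symptom> <root cause> <remedy> [logfile]
log_incident() {
    local symptom="$1" cause="$2" remedy="$3"
    local logfile="${4:-$HOME/incidents.log}"
    {
        echo "$(date '+%Y-%m-%d %H:%M') | $symptom"
        echo "  root cause: $cause"
        echo "  remedy:     $remedy"
    } >> "$logfile"
}
```

For the first lab fault the call might be: log_incident "curl to localhost refused" "nginx was stopped" "systemctl start nginx".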

Troubleshooting

Symptom | Likely cause and fix
Changed several things and now it works, but you don’t know why | Avoid this: change one variable at a time. If you must, revert all but one and re-test to find the real cause.
Logs are overwhelming | Filter: journalctl -u <svc> -e for one service, -p err for errors, --since for a time window. Start narrow.
Can’t reproduce the problem | Capture exact steps, environment, and timing. Intermittent issues need monitoring/logging over time (see observability).
Fixed it but it came back | You treated a symptom, not the root cause. Re-run the method, focusing on ‘what changed’ and underlying conditions.
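
For the ‘Logs are overwhelming’ row, the filters combine into one narrow query that you widen only if it comes back empty. A minimal sketch, assuming a systemd journal; recent_errors and recent_all are hypothetical helper names, and the one-hour default window is an arbitrary starting point.

```shell
#!/usr/bin/env bash
# Sketch: start with the narrowest useful journal query, then widen.
# recent_errors and recent_all are illustrative helper names.

# Messages at priority err or more severe, one service, one time window.
recent_errors() {
    local svc="$1" since="${2:-1 hour ago}"
    journalctl -u "$svc" -p err --since "$since" --no-pager
}

# Same service and window, all priorities: use this when the
# error-only view is empty or lacks context.
recent_all() {
    local svc="$1" since="${2:-1 hour ago}"
    journalctl -u "$svc" --since "$since" --no-pager
}
```

For example, recent_errors nginx "10 minutes ago" covers the period just before a reported failure; if nothing appears, recent_all nginx "10 minutes ago" shows the surrounding messages.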

Practice & Prove It

Write-the-command drills

  • Show a service’s status plus its most recent log lines (one command each).
  • Show all error-priority messages since the last boot.
  • Show recent kernel messages with human-readable timestamps.
  • Get an at-a-glance view of disk, memory, and load with three commands.
  • List what’s listening on the network with owning processes.

Quick quiz

  • What’s the most powerful first question in troubleshooting?
  • Why change only one thing at a time?
  • What’s the benefit of working in layers?
  • For ‘can’t connect’, what does refused vs timed out suggest?
  • What’s the final step of the method, and why?

Key Takeaways

  • Use a repeatable method: define, reproduce, gather evidence, hypothesise, fix, verify, document.
  • ‘What changed?’ leads to the cause of most incidents: deploys, edits, updates, full disks.
  • Lead with evidence: systemctl status, journalctl, dmesg, df/free/uptime.
  • Isolate in layers (process → port → firewall → app → DNS) to converge fast.
  • Fix the root cause, verify, and document — never chmod 777 your way out.

This concludes the core chapters. The certification booklets and reference appendix consolidate and assess the full curriculum.