Performance Tuning & Monitoring

Measure resource usage, locate bottlenecks, and tune systematically.

Performance work is measurement-driven. This chapter covers the key metrics (CPU, memory, disk I/O), the tools that expose them, and a disciplined method for locating the real bottleneck before tuning.

By the end of this chapter you will be able to:

Interpret load average relative to core count.
Use top/htop, vmstat, iostat, and free to profile resources.
Distinguish CPU-, memory-, and I/O-bound situations.
Apply the USE method to localise bottlenecks.
Tune kernel parameters and limits responsibly.

13.1 Load Average

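Load average is the average number of tasks that are running or waiting to run, reported over the last 1, 5, and 15 minutes. On Linux it also counts tasks in uninterruptible sleep, which usually means waiting on disk I/O, so a high load does not always mean a busy CPU. Compare it to your CPU core count (nproc): a load roughly equal to the core count means the CPUs are fully busy, and a load well above it means work is queuing and the system is overloaded.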
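The comparison can be sketched as a small shell function. The name classify_load and the 0.7 and 1.3 per-core cut-offs are illustrative assumptions, not standard values; adjust them to your own workload.

```shell
# Classify a load average against the core count.
# Assumption: "near" means 0.7 to 1.3 runnable tasks per core (illustrative only).
classify_load() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if (cores <= 0) { print "unknown"; exit }
        ratio = load / cores                 # load per core
        if (ratio < 0.7)        print "below"
        else if (ratio <= 1.3)  print "near"
        else                    print "above"
    }'
}

# On a live Linux system, feed it the 1-minute figure and the core count:
#   classify_load "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

Dividing by the core count is what makes the figure comparable between machines: a load of 8 is comfortable on 16 cores and a backlog on 2.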

13.2 Core Tools

Resource	Tool
Overview / per-process	top or htop
CPU + memory + swap over time	vmstat 1 5
Memory (look at ‘available’)	free -h
Disk I/O per device	iostat -xz 1
Logs / errors	journalctl -p err -b
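As one way to act on the ‘available’ hint in the table, the sketch below turns the output of free into a single percentage. The function name available_pct is hypothetical, and the parsing assumes a version of free that prints an ‘available’ column; older versions without that column produce no output.

```shell
# Print available memory as a whole-number percentage of total.
# Reads the output of free on standard input and locates columns by header name.
available_pct() {
    awk '
        $1 == "total" {                      # header row has no label, data rows do
            for (i = 1; i <= NF; i++) col[$i] = i + 1
            next
        }
        $1 == "Mem:" && ("available" in col) && $(col["total"]) > 0 {
            printf "%d\n", 100 * $(col["available"]) / $(col["total"])
        }
    '
}

# Usage (LC_ALL=C keeps the row labels in English):
#   LC_ALL=C free -b | available_pct
```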

13.3 Finding the Bottleneck

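The goal is to identify which resource is the limit. The USE method gives the search a structure: for each resource, check its utilisation (how busy it is), saturation (how much work is queued), and errors. The three common patterns are: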

CPU-bound: high user (us) or system (sy) time in top, load above core count, low iowait (wa).
Memory-bound: low ‘available’ in free, sustained swapping (non-zero si/so in vmstat); the system thrashes.
I/O-bound: high iowait, near-100% %util on a disk in iostat; the CPU is waiting on storage.
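The three patterns above can be turned into a rough first-pass check on vmstat output. This is a sketch under stated assumptions: the function name classify_vmstat is hypothetical, and the 20% iowait and 80% CPU cut-offs are illustrative rather than established thresholds. It judges only the most recent sample, because the first line vmstat prints is an average since boot.

```shell
# Rough triage of vmstat output read from standard input.
# Columns are located by header name, so extra columns do not break the parsing.
classify_vmstat() {
    awk '
        $1 == "r" && $2 == "b" {             # header row: map names to positions
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        ("us" in col) && $1 ~ /^[0-9]+$/ {   # data row: keep the latest sample
            us = $(col["us"]); sy = $(col["sy"]); wa = $(col["wa"])
            si = $(col["si"]); so = $(col["so"])
            seen = 1
        }
        END {
            if (!seen)               print "no data"
            else if (si + so > 0)    print "memory-bound"        # swapping in or out
            else if (wa >= 20)       print "io-bound"            # assumed cut-off
            else if (us + sy >= 80)  print "cpu-bound"           # assumed cut-off
            else                     print "no clear bottleneck"
        }
    '
}

# Usage:
#   vmstat 1 5 | classify_vmstat
```

Swap activity is tested first because a thrashing system also shows high iowait, and treating it as a disk problem would point the tuning at the wrong resource.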

13.4 Tuning

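Only tune what you have proven is the bottleneck, and change one setting at a time so you can measure its effect. Kernel parameters are read and set with sysctl and persisted in /etc/sysctl.conf or a file under /etc/sysctl.d/. Limits for a service belong in its systemd unit (for example LimitNOFILE=); /etc/security/limits.conf applies to login sessions through PAM, not to services started by systemd.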
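Before changing a kernel parameter it helps to record its current value. The sketch below does that without modifying anything. The helper names sysctl_path and read_param are hypothetical, the value 10 and the file name 99-tuning.conf in the comments are placeholders rather than recommendations, and the simple dot-to-slash mapping does not handle parameter names that themselves contain dots (some network interface names do).

```shell
# Map a sysctl name to its file under /proc/sys (dots become slashes).
sysctl_path() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr . /)"
}

# Print the current value of a kernel parameter, read-only.
read_param() {
    if [ -r "$(sysctl_path "$1")" ]; then
        cat "$(sysctl_path "$1")"
    else
        echo "unavailable"
    fi
}

# Record the starting point first:
#   read_param vm.swappiness
# Temporary change, lost at reboot (needs root):
#   sysctl -w vm.swappiness=10
# Persistent change: put "vm.swappiness = 10" in a file such as
#   /etc/sysctl.d/99-tuning.conf, then reload with: sysctl --system
```

Trying a value with sysctl -w before persisting it means a bad setting disappears at the next reboot.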

13.5 Guided Lab: Performance Triage

Estimated time: 25 minutes. Profile CPU, memory, swap, and disk I/O, then decide where a bottleneck would be. Every step is read-only and safe to run anywhere.

Read load vs cores: uptime then nproc. Is load near, below, or above core count?
Watch processes live: top — note the top CPU and memory consumers, then quit with q.
Sample CPU/memory/swap: vmstat 1 5. Any swap activity (si/so)?
Check memory honestly: free -h — compare ‘available’ to ‘total’.
Check disk I/O: iostat -xz 1 3 (install sysstat if needed). Any device near 100% util?
Scan for recent errors: journalctl -p err -b | tail.
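Once you have worked through the steps by hand, they can be collected into one read-only pass. The script below is a sketch: the function names run_if_present and triage are hypothetical, and it swaps the interactive top for a single batch-mode snapshot (top -b -n 1) so the run finishes without keyboard input.

```shell
# Run a command only if it is installed; otherwise say it was skipped.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "== $* =="
        "$@"
    else
        echo "skipped: $1 (not installed)"
    fi
}

# One pass over the lab steps. Nothing here changes system state.
triage() {
    run_if_present uptime
    run_if_present nproc
    run_if_present top -b -n 1 | head -n 15      # one snapshot instead of live view
    run_if_present vmstat 1 5
    run_if_present free -h
    run_if_present iostat -xz 1 3                # part of the sysstat package
    run_if_present journalctl -p err -b --no-pager | tail -n 10
}

# Start the pass with:
#   triage
```

Saving the output to a file gives you a baseline to compare against after any later tuning.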

Troubleshooting

Symptom	Likely cause and fix
High load but low CPU usage	Likely I/O wait. Check iostat for a saturated disk and high await; the CPU is waiting on storage, not computing.
Memory looks ‘full’ but system is fine	That’s file cache. Look at ‘available’ in free -h; cache is reclaimable. Real pressure shows as swapping.
Sudden slowdowns with heavy swapping	Insufficient RAM for the workload. Reduce memory use, add RAM, or move load; tune swappiness only after measuring.
One process pegs a core	Profile that PID; it may be a bug, a tight loop, or legitimate work. Consider nice/renice or fixing the app.
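For the first symptom in the table, high load with low CPU usage, the iostat check can be narrowed to the devices that matter. This is a sketch: the function name saturated_disks is hypothetical and the default cut-off of 90% is an assumed value. Note that on devices that serve requests in parallel, such as SSDs and RAID arrays, a high %util does not by itself prove saturation, so read it together with await.

```shell
# List devices whose %util is at or above a threshold (default 90).
# Reads iostat -x output on standard input; finds %util by header name.
saturated_disks() {
    awk -v limit="${1:-90}" '
        $1 ~ /^Device:?$/ {                  # header row of a device table
            util = 0
            for (i = 1; i <= NF; i++) if ($i == "%util") util = i
            next
        }
        NF == 0 { util = 0; next }           # blank line ends the device table
        util && $util + 0 >= limit { print $1, $util }
    '
}

# Usage (LC_ALL=C keeps decimal points rather than commas):
#   LC_ALL=C iostat -xz 1 3 | saturated_disks
#   LC_ALL=C iostat -xz 1 3 | saturated_disks 75
```

The first report iostat prints is an average since boot, so a device that appears only in the later reports is busy now rather than historically.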

Practice & Prove It

Write-the-command drills

Show the current load average and uptime.
Show memory usage including the available figure, human-readable.
Sample CPU, memory, and swap every second, five times.
Show extended per-device disk I/O three times at one-second intervals.
Show all error-priority journal messages from the current boot.

Quick quiz

What do the three load-average numbers represent?
Which memory column should you actually watch?
What does sustained high iowait indicate?
What does the USE method stand for?
What’s the golden rule before tuning?

Key Takeaways

Compare load average to core count to judge overload.
Use the right tool per resource: top, vmstat, free, iostat, journalctl.
Identify whether the system is CPU-, memory-, or I/O-bound before acting.
Watch ‘available’ memory; cache is reclaimable, swapping signals real pressure.
Tune one proven bottleneck at a time and measure the effect.

Next — Chapter 14: virtualisation and containers.