pythonstuff/errorcheck.dat at Work · edlafferty/pythonstuff


Free memory^free_mem^free -m^The right two columns show:&buffers: For the buffer cache, used for block device I/O.&cached: For the page cache, used by file systems.&We just want to check that these aren’t near-zero in size, which can lead to higher disk I/O (confirm using iostat), and worse performance. &The “-/+ buffers/cache” provides less confusing values for used and free memory. Linux uses free memory for the caches, but can reclaim it quickly if applications need it. So in a way the cached memory should be included in the free memory column, which this line does.&^
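
The “-/+ buffers/cache” reasoning above amounts to simple addition. A minimal sketch, assuming the three values have already been read from the `free -m` output in megabytes (the function name and the sample numbers are hypothetical, chosen only for illustration):

```python
def effective_free_mb(free_mb, buffers_mb, cached_mb):
    """Memory applications could use once reclaimable caches are counted.

    Mirrors the idea behind the "-/+ buffers/cache" line: buffers and
    cached memory can be reclaimed quickly, so they are added to free.
    """
    return free_mb + buffers_mb + cached_mb


# Illustrative numbers only, not taken from a real system.
print(effective_free_mb(free_mb=100, buffers_mb=50, cached_mb=850))
```

A small “free” column on its own is therefore not alarming; the sum is the more useful figure.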

dMessage^dmesg^dmesg | tail^This views the last 10 system messages, if there are any. Look for errors that can cause performance issues. The example above includes the oom-killer, and TCP dropping a request.&

IOStat^iostat^iostat -txz 5 5^Look for:&r/s, w/s, rkB/s, wkB/s: These are the delivered reads, writes, read Kbytes, and write Kbytes per second to the device. Use these for workload characterization. A performance problem may simply be due to an excessive load applied.&await: The average time for the I/O in milliseconds. This is the time that the application suffers, as it includes both time queued and time being serviced. Larger than expected average times can be an indicator of device saturation, or device problems.&avgqu-sz: The average number of requests issued to the device. Values greater than 1 can be evidence of saturation.&%util: Device utilization. This is really a busy percent, showing the time each second that the device was doing work. Values greater than 60% typically lead to poor performance (which should be seen in await), although it depends on the device. Values close to 100% usually indicate saturation.&If the storage device is a logical disk device fronting many back-end disks, then 100% utilization may just mean that some I/O is being processed 100% of the time, however, the back-end disks may be far from saturated, and may be able to handle much more work.&Bear in mind that poor performing disk I/O isn’t necessarily an application issue. Many techniques are typically used to perform I/O asynchronously, so that the application doesn’t block and suffer the latency directly (e.g., read-ahead for reads, and buffering for writes).^
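
The avgqu-sz and %util rules of thumb above can be written as a small check. This is a sketch, not a definitive rule set: the function name is hypothetical, the inputs are assumed to be already parsed from one iostat device row, and the 99% cut-off for “close to 100%” is an assumption made for illustration.

```python
def iostat_flags(avgqu_sz, util_pct, saturation_pct=99.0):
    """Return human-readable warnings for one iostat device row.

    saturation_pct is an assumed stand-in for "close to 100%".
    """
    flags = []
    if avgqu_sz > 1:
        flags.append("avgqu-sz above 1: possible saturation")
    if util_pct >= saturation_pct:
        flags.append("%util near 100: device likely saturated")
    elif util_pct > 60:
        flags.append("%util above 60: expect higher await")
    return flags


# Illustrative values only.
for line in iostat_flags(avgqu_sz=2.3, util_pct=75.0):
    print(line)
```

As the text notes, these thresholds depend on the device; a logical disk fronting many back-end disks can report 100% utilization without being saturated.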

VMStat^vmstat^vmstat -t 2 5^Short for virtual memory stat, it prints a summary of key server statistics on each line.&vmstat was run with arguments of 2 and 5, to print five summaries at two-second intervals. The first line of output (in this version of vmstat) has some columns that show the average since boot, instead of the previous interval. For now, skip the first line, unless you want to learn and remember which column is which.&Columns to check:&r: Number of processes running on CPU and waiting for a turn. This provides a better signal than load averages for determining CPU saturation, as it does not include I/O. To interpret: an “r” value greater than the CPU count is saturation.&free: Free memory in kilobytes. If there are too many digits to count, you have enough free memory. The “free -m” command better explains the state of free memory.&si, so: Swap-ins and swap-outs. If these are non-zero, you’re out of memory.&us, sy, id, wa, st: These are breakdowns of CPU time, on average across all CPUs. They are user time, system time (kernel), idle, wait I/O, and stolen time (by other guests, or with Xen, the guest’s own isolated driver domain).&The CPU time breakdowns will confirm if the CPUs are busy, by adding user + system time. A constant degree of wait I/O points to a disk bottleneck; this is where the CPUs are idle, because tasks are blocked waiting for pending disk I/O. You can treat wait I/O as another form of CPU idle, one that gives a clue as to why they are idle.&System time is necessary for I/O processing. A high system time average, over 20%, can be interesting to explore further: perhaps the kernel is processing the I/O inefficiently.&In the above example, CPU time is almost entirely in user-level, pointing to application level usage instead. The CPUs are also well over 90% utilized on average. This isn’t necessarily a problem; check for the degree of saturation using the “r” column.^
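
The vmstat checks described above (r against the CPU count, user + system as busy time, system time over 20%) can be sketched as follows. The function name and the sample values are hypothetical; the inputs are assumed to come from one already-parsed vmstat line.

```python
import os


def cpu_summary(r, us, sy, cpu_count=None):
    """Apply the vmstat rules of thumb to one line of output.

    r  : run-queue length (running plus waiting for a turn)
    us : user CPU time, percent
    sy : system CPU time, percent
    """
    if cpu_count is None:
        # Fall back to the local machine's CPU count.
        cpu_count = os.cpu_count() or 1
    return {
        "busy_pct": us + sy,           # CPUs busy = user + system
        "saturated": r > cpu_count,    # more runnable tasks than CPUs
        "high_system_time": sy > 20,   # worth exploring further
    }


# Illustrative values only.
print(cpu_summary(r=34, us=96, sy=1, cpu_count=32))
```

High busy time alone is not a problem; the “saturated” flag is the one that indicates tasks are waiting for CPU.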

MPStat^mpstat^mpstat -P ALL 2 5^This command prints CPU time breakdowns per CPU, which can be used to check for an imbalance. A single hot CPU can be evidence of a single-threaded application.&^

Top (sorted by CPU)^top_CPU^top -n 1 -o %CPU -b^top output, sorted by %CPU.&^

Top (sorted by MEM)^top_MEM^top -n 1 -o %MEM -b^top output, sorted by %MEM.&^

SAR (devices)^sar_dev^sar -n DEV 1 5^Use this tool to check network interface throughput: rxkB/s and txkB/s, as a measure of workload, and also to check if any limit has been reached.&^

SAR (TCP)^sar_tcp^sar -n TCP,ETCP 1 5^This is a summarized view of some key TCP metrics. These include:&active/s: Number of locally-initiated TCP connections per second (e.g., via connect()).&passive/s: Number of remotely-initiated TCP connections per second (e.g., via accept()).&retrans/s: Number of TCP retransmits per second.&The active and passive counts are often useful as a rough measure of server load: number of new accepted connections (passive), and number of&downstream connections (active). It might help to think of active as outbound, and passive as inbound, but this isn’t strictly true&(e.g., consider a localhost to localhost connection). Retransmits are a sign of a network or server issue; it may be an unreliable network&(e.g., the public Internet), or it may be due to a server being overloaded and dropping packets.&^

PIDStat^pidstat^pidstat 1 5^Pidstat is a little like top’s per-process summary, but prints a rolling summary instead of clearing the screen. This can be useful for watching&patterns over time, and also recording what you saw (copy-n-paste) into a record of your investigation.&^

Uptime^uptime^uptime^This is a quick way to view the load averages, which indicate the number of tasks (processes) wanting to run. On Linux systems, these numbers&include processes wanting to run on CPU, as well as processes blocked in uninterruptible I/O (usually disk I/O). This gives a high level idea&of resource load (or demand), but can’t be properly understood without other tools. Worth a quick look only.&The three numbers are exponentially damped moving sum averages with a 1 minute, 5 minute, and 15 minute constant. The three numbers give&us some idea of how load is changing over time.&^
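
To make “exponentially damped” concrete, here is a minimal floating-point sketch of such an average. It assumes a fixed 5-second sampling interval and is not the kernel's actual fixed-point code; the function name and the sample data are hypothetical.

```python
import math


def damped_load(samples, window_s, interval_s=5.0):
    """Exponentially damped average of task counts.

    samples    : number of runnable/blocked tasks seen at each sample
    window_s   : time constant in seconds (60, 300 or 900)
    interval_s : assumed time between samples
    """
    decay = math.exp(-interval_s / window_s)
    load = 0.0
    for n in samples:
        # Old value fades by `decay`; the new sample fills the rest.
        load = load * decay + n * (1.0 - decay)
    return load


# A constant demand of 4 tasks for one minute (12 samples of 5 s each).
one_minute_of_four = [4] * 12
print(round(damped_load(one_minute_of_four, window_s=60), 2))
print(round(damped_load(one_minute_of_four, window_s=900), 2))
```

With the same input, the 1-minute figure rises much faster than the 15-minute one, which is why comparing the three numbers hints at whether load is growing or shrinking.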
