Over the course of your day as a developer, you use many tools. Appliscale engineers have a few favourites they would like to share with you over the coming weeks. First up is a dive into /proc. Over the next few posts, we will also look at jmap, gprof and netstat.
/proc
You may wonder what the data source of most Linux monitoring commands is. The answer is the /proc filesystem. The interface is simple, yet it offers tremendous possibilities for accessing low-level system structures. What’s more, accessing the data this way is more accurate, because some day-to-day tools make assumptions that may not be obvious.
Let’s see a few examples of what we can squeeze out of it.
- Process resource monitoring
In systems where a few orchestrated applications are running, it may be desirable to measure system resources per process. Such measurements can lead us to a faster root-cause analysis of performance glitches, like process starvation or memory problems. Sometimes a single application can bring down the entire system (starving the others, which leads to timeouts and recovery actions like triggering the OOM killer), so it’s good to have such data in place.
When asked which tool to use to monitor process CPU usage, you immediately have the answer: ps. ps is a helpful tool, but it has one caveat: it averages CPU usage over the entire lifetime of the process. The ps(1) man page states:
%CPU shows the CPU time/realtime percentage. It will not add up to 100% unless you are lucky. It is time used divided by the time the process has been running.
This is not the most intuitive idea. ps reports the CPU time the process has used since it started, which is not very useful when monitoring over time is needed. As a workaround, we can use the top command. But again, the top man page is misleading. On my Arch system, it always seems to work as expected when monitoring a single process, and precisely as the documentation says when monitoring overall CPU usage (which again is counter-intuitive). The top(1) man page describes %CPU as follows:
The task’s share of the elapsed CPU time since the last screen update expressed as a percentage of total CPU time. In a true SMP environment, if ‘Irix mode’ is Off, top will operate in ‘Solaris mode’ where a task’s CPU usage is divided by the total number of CPUs. You toggle ‘Irix/Solaris’ modes with the ‘I’ interactive command.
The following command can return misleading numbers, because a single batch iteration reports the average since system boot:
top -b -p <PID> -n 1
So we have inaccurate ps measurements and top inconsistencies; can we do better? For real purists, we can calculate CPU usage ourselves using fields from /proc/<pid>/stat and /proc/stat. The proc(5) man page documents them:
/proc/<pid>/stat:

(14) utime %lu
     Amount of time that this process has been scheduled in user mode,
     measured in clock ticks (divide by sysconf(_SC_CLK_TCK)). This
     includes guest time, guest_time (time spent running a virtual CPU,
     see below), so that applications that are not aware of the guest
     time field do not lose that time from their calculations.

(15) stime %lu
     Amount of time that this process has been scheduled in kernel mode,
     measured in clock ticks (divide by sysconf(_SC_CLK_TCK)).

(16) cutime %ld
     Amount of time that this process's waited-for children have been
     scheduled in user mode, measured in clock ticks (divide by
     sysconf(_SC_CLK_TCK)). (See also times(2).) This includes guest
     time, cguest_time (time spent running a virtual CPU, see below).

(17) cstime %ld
     Amount of time that this process's waited-for children have been
     scheduled in kernel mode, measured in clock ticks (divide by
     sysconf(_SC_CLK_TCK)).

/proc/stat:

Kernel/system statistics. Varies with architecture. Common entries include:

cpu  10132153 290696 3084719 46828483 16683 0 25195 0 175628 0
cpu0 1393280 32966 572056 13343292 6130 0 17875 0 23933 0

The amount of time, measured in units of USER_HZ (1/100ths of a second on
most architectures, use sysconf(_SC_CLK_TCK) to obtain the right value),
that the system ("cpu" line) or the specific CPU ("cpuN" line) spent in
various states.
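As a quick illustration (a sketch of mine, not code from the original post), here is how these four fields and the aggregate "cpu" line can be read from Python. The function names are hypothetical:

```python
def proc_cpu_times(pid='self'):
    """Return (utime, stime, cutime, cstime) in clock ticks for a process."""
    with open('/proc/{}/stat'.format(pid)) as f:
        data = f.read()
    # Field 2 (comm) may contain spaces or parentheses, so split after
    # the *last* closing paren; the remaining fields start at field 3.
    fields = data.rsplit(')', 1)[1].split()
    # utime is field 14, i.e. fields[14 - 3] after the comm split.
    return tuple(int(fields[i]) for i in range(11, 15))

def total_cpu_ticks():
    """Sum all per-state counters of the aggregate "cpu" line of /proc/stat."""
    with open('/proc/stat') as f:
        cpu_line = f.readline()
    return sum(int(x) for x in cpu_line.split()[1:])
```

Note the rsplit(')', 1): naively splitting the whole line on whitespace breaks for processes whose command name contains spaces.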
The idea is simple:
- sample total CPU usage at some time by summing the fields of the "cpu" line of /proc/stat
- sample the process's user- and kernel-mode times, taking waited-for children into account if necessary
- repeat both samples after an interval and plug the deltas into the following formulas:
u = 100 * ((utime + cutime)_after - (utime + cutime)_before) / (cpu_sum_after - cpu_sum_before)
s = 100 * ((stime + cstime)_after - (stime + cstime)_before) / (cpu_sum_after - cpu_sum_before)
Check out the chart of the CPU usage of a Firefox process, taken while randomly surfing the internet. Can you guess which colour represents which measurement method? (The interval is one second and the data may be slightly misaligned, but that doesn’t matter here.)
Yellow is the most obvious one; remembering the story so far, you correctly guessed that the culprit is ps:
ps -p <pid> -o %cpu | tail -1 | sed 's/^ //'
ps uses the largest possible window: from the moment the process started up to now. This is why the line is flat; small deviations can’t change the big picture.
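For the curious, that flat line can be reproduced by hand from /proc alone: divide the process's total CPU time by its elapsed wall-clock time since it started. A sketch (field indices per proc(5); the function name is mine):

```python
import os

CLK_TCK = os.sysconf('SC_CLK_TCK')  # clock ticks per second, usually 100

def ps_style_pcpu(pid):
    """%CPU the way ps computes it: cputime / elapsed time since process start."""
    with open('/proc/{}/stat'.format(pid)) as f:
        fields = f.read().rsplit(')', 1)[1].split()
    utime, stime = int(fields[11]), int(fields[12])   # fields 14 and 15
    starttime = int(fields[19])                       # field 22: start time, in ticks after boot
    with open('/proc/uptime') as f:
        uptime = float(f.read().split()[0])           # seconds since boot
    elapsed = uptime - starttime / CLK_TCK
    return 100.0 * ((utime + stime) / CLK_TCK) / elapsed
```

The longer the process has been alive, the less a recent burst of activity can move this number, which is exactly why the yellow line barely moves.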
The red and blue ones are similar, with red often taking quantized values and blue being more “analogue” and smooth. Red was collected using:
top -b -n1 -p 1578 | tail -1 | awk '{print $7}'
And blue, with the method described above (using Python):
import argparse
import time

parser = argparse.ArgumentParser(description='Measure CPU time of process.')
parser.add_argument('-p', metavar='PID', required=True, type=int,
                    help='PID of process to monitor')
parser.add_argument('-i', metavar='INTERVAL', required=True, type=int,
                    help='Monitoring interval (milliseconds)')

def main():
    args = parser.parse_args()
    with open('./cpu.log', 'a') as out:
        while True:
            # Sample total CPU time: sum all counters of the "cpu" line.
            with open('/proc/stat') as stat:
                s = stat.readline()
            cpu_sum_before = 0
            for m in s.split()[1:]:
                cpu_sum_before += int(m)
            # Sample process times: utime+cutime and stime+cstime (fields 14-17).
            with open('/proc/{}/stat'.format(args.p)) as stat:
                l = stat.readline().split()
                user_mode_before = int(l[13]) + int(l[15])
                kernel_mode_before = int(l[14]) + int(l[16])

            time.sleep(args.i / 1000)

            # Repeat both samples after the interval.
            with open('/proc/stat') as stat:
                s = stat.readline()
            cpu_sum_after = 0
            for m in s.split()[1:]:
                cpu_sum_after += int(m)
            with open('/proc/{}/stat'.format(args.p)) as stat:
                l = stat.readline().split()
                user_mode_after = int(l[13]) + int(l[15])
                kernel_mode_after = int(l[14]) + int(l[16])

            # Plug the deltas into the formulas above.
            u = 100 * (user_mode_after - user_mode_before) / (cpu_sum_after - cpu_sum_before)
            k = 100 * (kernel_mode_after - kernel_mode_before) / (cpu_sum_after - cpu_sum_before)
            out.write('{}\n'.format(u + k))

if __name__ == '__main__':
    main()
This programmatic method can be especially useful when monitoring a set of processes. Instead of parsing top output, which can be cumbersome, ships a lot of raw text to a remote node and eats CPU itself, a clean and simple solution can be built in any programming language. It also leaves room for additional actions, such as logging; it preserves the distinction between user and kernel space; and it avoids top's limit of monitoring at most 20 PIDs with the -p flag.
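A sketch of the multi-process variant (illustrative, not the exact code from above): sample /proc once per interval and report the per-interval usage for every PID:

```python
import time

def read_cpu_sum():
    """Sum all per-state counters of the "cpu" line of /proc/stat."""
    with open('/proc/stat') as f:
        return sum(int(x) for x in f.readline().split()[1:])

def read_process_ticks(pid):
    """utime + stime of a process, in clock ticks."""
    with open('/proc/{}/stat'.format(pid)) as f:
        fields = f.read().rsplit(')', 1)[1].split()
    return int(fields[11]) + int(fields[12])

def cpu_usage(pids, interval=1.0):
    """Return {pid: %CPU over the interval} for every PID in `pids`."""
    cpu_before = read_cpu_sum()
    proc_before = {pid: read_process_ticks(pid) for pid in pids}
    time.sleep(interval)
    cpu_delta = read_cpu_sum() - cpu_before
    return {pid: 100.0 * (read_process_ticks(pid) - proc_before[pid]) / cpu_delta
            for pid in pids}
```

One read of /proc/stat covers all monitored processes, so the overhead stays constant no matter how many PIDs you watch.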
But, be sure to use the method that better fits your environment. Sometimes you don’t need a Maserati to drive your kids to school.
SNIPPET 1

#include <sys/mman.h>
#include <unistd.h>

#define ALLOC_SIZE 10485760

int main() {
    char *shared = static_cast<char *>(mmap(NULL, ALLOC_SIZE,
        PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_SHARED, -1, 0));
    int pid = fork();
    if (pid == 0) {
        for (int i = 0; i < ALLOC_SIZE; ++i)
            shared[i] = 1;
        sleep(180);
    } else {
        char a;
        char *p = static_cast<char *>(shared);
        for (int i = 0; i < ALLOC_SIZE; ++i)
            a = p[i];
        sleep(180);
    }
    return 0;
}

SNIPPET 2

#include <unistd.h>
#include <stdlib.h>

#define ALLOC_SIZE 10485760

int main() {
    char *priv = static_cast<char *>(malloc(ALLOC_SIZE));
    for (int i = 0; i < ALLOC_SIZE; ++i)
        priv[i] = 1;
    sleep(180);
    free(priv);
    return 0;
}

SNIPPET 3

#include <unistd.h>
#include <stdlib.h>

#define ALLOC_SIZE 10485760

int main() {
    char *priv = static_cast<char *>(malloc(ALLOC_SIZE));
    sleep(180);
    free(priv);
    return 0;
}
Another essential resource to monitor is process memory. ps and top give us at most two statistics: VSS and RSS. We are still missing two other important ones: PSS and USS. What’s the difference between them?
- VSS (Virtual Set Size) is the virtual memory assigned to the process, which is not necessarily resident in physical memory
- RSS (Resident Set Size) is the part of the process memory loaded into physical memory, but it also counts shared code in full
- PSS (Proportional Set Size) is the process memory resident in physical memory plus a proportional share of shared code
- USS (Unique Set Size) is the private process memory resident in physical memory
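These statistics can be computed by hand from /proc/<pid>/smaps, which breaks the process's memory down per mapping. A minimal sketch (the function name is mine; assumes a kernel built with smaps support):

```python
def memory_stats(pid='self'):
    """Sum Rss, Pss and private (USS) counters, in KiB, over /proc/<pid>/smaps."""
    rss = pss = uss = 0
    with open('/proc/{}/smaps'.format(pid)) as f:
        for line in f:
            if line.startswith('Rss:'):
                rss += int(line.split()[1])
            elif line.startswith('Pss:'):
                pss += int(line.split()[1])
            elif line.startswith(('Private_Clean:', 'Private_Dirty:')):
                uss += int(line.split()[1])
    return {'rss_kib': rss, 'pss_kib': pss, 'uss_kib': uss}
```

By construction RSS >= PSS >= USS: RSS counts shared pages in full, PSS divides them among the sharers, and USS drops them entirely.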
Fortunately, there is a tool that collects the missing statistics. Be sure to look at its code (it is straightforward):
https://github.com/pixelb/ps_mem
Let’s see example outputs and their interpretation.
 Private  +   Shared  =  RAM used    Program
24.0 KiB  +  5.1 MiB  =  5.1 MiB     shared
---------------------------------
                         5.1 MiB
=================================

 Private  +   Shared  =  RAM used    Program
24.0 KiB  +  5.1 MiB  =  5.1 MiB     shared
---------------------------------
                         5.1 MiB
=================================

 PID USER      PR  NI  VIRT   RES   %CPU %MEM  TIME+    S COMMAND
6171 morfeus+  20   0  23,1m  12,1m  0,0  0,1  0:00.03  S shared

 PID USER      PR  NI  VIRT   RES   %CPU %MEM  TIME+    S COMMAND
6172 morfeus+  20   0  23,1m  10,6m  0,0  0,1  0:00.03  S shared
The output above is the result of measuring the code from SNIPPET 1. The program allocates a chunk of shared memory; then one process writes to it while the other reads from it. ps_mem correctly calculates PSS, taking each process's USS plus the size of the shared pages divided by the number of processes that use them. VSS for both processes is 23,1 MB; 12,1 MB and 10,6 MB respectively are resident in physical memory (note that the shared memory is counted twice!).
 Private  +   Shared  =  RAM used    Program
10.2 MiB  + 32.5 KiB  =  10.2 MiB    private_paging
---------------------------------
                         10.2 MiB
=================================

 PID USER      PR  NI  VIRT   RES   %CPU %MEM  TIME+    S COMMAND
6511 morfeus+  20   0  23,1m  12,2m  0,0  0,1  0:00.02  S private+
With SNIPPET 2, memory is both allocated and used. All pages must be loaded into physical memory, and both ps_mem and top confirm this. VSS stays the same as before.
 Private  +   Shared  =  RAM used    Program
180.0 KiB + 31.5 KiB  =  211.5 KiB   private
---------------------------------
                         211.5 KiB
=================================

 PID USER      PR  NI  VIRT   RES   %CPU %MEM  TIME+    S COMMAND
6397 morfeus+  20   0  23,1m  1,8m   0,0  0,0  0:00.00  S private
On the contrary, with SNIPPET 3 the allocated memory is never touched, so no pages are loaded into physical memory. The tools' output matches this expectation.
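This demand-paging effect can also be reproduced from Python with an anonymous mmap: the mapping raises VSS immediately, but RSS only grows once pages are actually touched. A sketch (assumes /proc/self/status is available; sizes are my own choice):

```python
import mmap

def vm_rss_kib():
    """Current RSS of this process, in KiB, from /proc/self/status."""
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])

SIZE = 50 * 1024 * 1024                  # 50 MiB
mapping = mmap.mmap(-1, SIZE)            # anonymous mapping: VSS grows, RSS does not
before = vm_rss_kib()
for offset in range(0, SIZE, 4096):      # touch one byte per page...
    mapping[offset] = 1                  # ...faulting each page into physical memory
after = vm_rss_kib()                     # RSS has now grown by roughly SIZE
```

Touching a single byte per page is enough: the kernel faults in whole pages, not individual bytes.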
If you want to read more about memory nuances, I refer you to the excellent series of posts by Florent Bruneau:
https://techtalk.intersec.com/2013/07/memory-part-1-memory-types/
- Network stack manipulation
You can write to the /proc file system to enable handy features like IP forwarding, ICMP echo filtering, or reverse path filtering. This is exceptionally useful in complicated network environments where there are routing loops but reorganising the network infrastructure is not an option.
Be sure to check this link:
http://www.tldp.org/HOWTO/Adv-Routing-HOWTO/lartc.kernel.rpf.html
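These knobs are just files, so toggling them is a plain read and write (the real paths under /proc/sys require root). A generic sketch, with hypothetical helper names:

```python
def read_tunable(path):
    """Read a one-line /proc/sys-style tunable, e.g. /proc/sys/net/ipv4/ip_forward."""
    with open(path) as f:
        return f.read().strip()

def write_tunable(path, value):
    """Write a new value; for real tunables under /proc/sys this requires root."""
    with open(path, 'w') as f:
        f.write('{}\n'.format(value))
```

For example, write_tunable('/proc/sys/net/ipv4/ip_forward', 1) is the equivalent of sysctl -w net.ipv4.ip_forward=1.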
- Bonus: accidentally deleted file trick
Let’s imagine you’re working on a vital configuration file of a demo project for an important client. The directory structure is organised in a way that forces you to use a lot of file manipulation commands. You are almost done and just need to remove the temporary files by issuing rm -f config.*… Whoops! You’ve deleted the configuration you worked on all day.
There is a nifty trick that allows you to recover the file's content, assuming the application using it is still running (and still holds the descriptor):
[morfeush22@morfeush22-pc tmp]$ echo very_impportant_line > very_important_file
[morfeush22@morfeush22-pc tmp]$ tail -f ./very_important_file &
[morfeush22@morfeush22-pc tmp]$ lsof ./very_important_file
COMMAND   PID       USER FD   TYPE DEVICE SIZE/OFF    NODE NAME
tail    13036 morfeush22  3r   REG    8,2       21 6428935 ./very_important_file
[morfeush22@morfeush22-pc tmp]$ cat /proc/13036/fd/3
very_impportant_line
In the process's file descriptor directory (/proc/<pid>/fd) we can see the standard descriptors corresponding to stdin, stdout and stderr (0, 1 and 2), plus number three, which points to our missing file. Now all we have to do is save the content of that descriptor using a simple tool like cat and output redirection.
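The same trick condensed into a self-contained Python demonstration (the file and its content are made up for the example): create a file, keep a descriptor open on it, delete it, then read it back through /proc/self/fd:

```python
import os
import tempfile

# Create a "very important file" and keep a descriptor open on it.
fd, path = tempfile.mkstemp()
os.write(fd, b'very important line\n')

os.unlink(path)                    # whoops, the file is gone from the directory

# ...but its content is still reachable through the open descriptor:
# /proc/self/fd/<fd> re-opens the still-live inode from the start.
with open('/proc/self/fd/{}'.format(fd)) as recovered_file:
    recovered = recovered_file.read()
os.close(fd)
```

The data survives because the kernel only frees an inode once its link count and its open-descriptor count both drop to zero.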