Dienstag, 3. Dezember 2013

how to strace a process

strace is a Linux command that can help you with error analysis. By default it will output every syscall a process does. This means

  • it will output a lot of lines that you will have to examine
  • in the worst case your problem will not even show up. Consider an endless loop like while (true); - this one does not do any syscalls at all.
There is no tool that "drills down" deeper on this analysis than strace. By this it is comparable to
  • disassembling a program as I give an example here. A program, as a contrast to a process, is something stored on disk rather than running in RAM. Disassembling will reveal problems like the above endless loop, but it does not work on a running process. 
  • network sniffing like I give an example here. There is no drill-down to looking at the bytes that are transmitted via a network cable
  • sniffing the SCSI bus with blktrace like I give an example here
  • sniffing the USB bus with wireshark like I give an example here
  • testing IMAP via telnet - going down to the very elements defined in the IMAP protocol
It takes some time till you can use strace for an actual purpose. I started with Linux in 2000 and the first time strace helped me to accomplish a task was in 2013. Here is what happened:

I was setting up guacaMole, the tool to control a Linux desktop from a browser. It basically consists of a daemon, guacd that (if I understood it right) accepts data from the browser and uses it to control a VNC session. It gave me the error message "Server error" which does not really mean anything. So I strace'd it:
# ps -A|grep guac
 7361 ?        00:00:00 guacd
# strace -p 7361
for every login I did on the web interface I saw this in strace's output:
accept(4, {sa_family=AF_INET, sin_port=htons(42845), ...
clone(child_stack=0, ...
close(5)                                = 0
listen(4, 5)                            = 0
now the commands "man 2 accept", "man 2 clone", "man 2 close" and "man 2 listen" told me more. "Clone" means that a separate process will be spawned off. We want strace to follow this process as well so we call it again with the ff parameter:
# strace -ffp 7361
This time there was more output. There were several lines like
[pid 20344] open("/usr/lib/x86_64-linux-gnu/libguac-client-vnc.so", O_RDONLY) = -1 ENOENT (No such file or directory)
reading the complete output, this libguac-client-vnc.so was never found. This told me I had to re-compile the guacamole server with VNC development libraries so its VNC module would be buildable. When I did this, I also had to move the library to the right place (make install did not do), but these are the little hurdles described in my guacaMole tutorial.

You can do much more with strace. If I had known this was all about a missing file I would have started it with -e open. This tells strace to only print occurences of the syscall open. In effect it allows you to monitor which files are accessed by a process. I use it to find out where configuration changes are stored (ok, maybe this was not the first time strace was useful to me).

You do not need to know a process ID to call strace. You can just call 
strace program
and strace will load program as a process and output its syscalls.

strace shortens information for you and replaces strings by "..." before they get too long. To change this use the -s parameter to determine how long strings are supposed to be. For more information see the man page.

strace can also be used a bit more abstract for performance analysis.
strace -c ls -R
will show you in which syscalls the ls program passed most of its time. This is helpful if you want to optimize routines for speed. The output could look like this:
Entries  Repository  Root
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 67.09    0.412153          14     29664           getdents64
 27.70    0.170168          11     14849        14 open
  4.24    0.026043           0    123740           write
  0.72    0.004443           0     14837           close
  0.20    0.001204           0     14836           fstat
  0.05    0.000285         285         1           execve
  0.00    0.000000           0        12           read
  0.00    0.000000           0         4         3 stat
  0.00    0.000000           0        33           mmap
  0.00    0.000000           0        18           mprotect
  0.00    0.000000           0         4           munmap
  0.00    0.000000           0        12           brk
  0.00    0.000000           0         2           rt_sigaction
  0.00    0.000000           0         1           rt_sigprocmask
  0.00    0.000000           0         2           ioctl
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         3           mremap
  0.00    0.000000           0         1           fcntl
  0.00    0.000000           0         1           getrlimit
  0.00    0.000000           0         1           statfs
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         3         1 futex
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         8           fadvise64
  0.00    0.000000           0         1           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00    0.614296                198036        19 total
which means that more than two thirds of ls' time was passed in the getdents64 syscall. I'll leave you alone for now if you want to man 2 getdents64 and start optimizing the ls program to list directories. Have fun!

Keine Kommentare:

Kommentar veröffentlichen

Raspberry Pi, test my internet connection

Since start of Corona times, I use my internet at home also for work. So its reliability and performance has become crucial. Performance see...