Change-Id: I7578e582ac7fd3dd2012f783f273080ec2c2b18b
Reviewed-by: Ulf Hermann <ulf.hermann@qt.io>
Added the start address and size to Symbol, and included them in its
stream output. A disassembler can then use a symbol's relative address
and size to find the instructions that belong to a function.
Change-Id: I96709ca380d0e58cd5cf5a8cc87116147b2754d6
Reviewed-by: Ulf Hermann <ulf.hermann@qt.io>
Apparently there are situations where the compiler generates entries
in the symbol table with zero size that are still referenced by some
call stacks. After commit d8d56b7e we ended up losing the symbol names
in such cases. This patch restores the symbol names for such
zero-size symbols.
Change-Id: If98f68626ab4251ccfed89d791ebd333f6a6a60a
Reviewed-by: Ulf Hermann <ulf.hermann@qt.io>
Apparently the symtab isn't necessarily sorted. That means every
call to dwfl_module_addrinfo may end up iterating over many symbol
entries to find a match. For libxul.so, which contains roughly one
million symbols, this can be exceedingly slow.
Now we instead iterate over all symbols once and store them in a
sorted array for quick lookups later on. When we then look up a
symbol, we only need to demangle it on demand.
I believe the following numbers speak for themselves. Both runs are
for a 1.1GB perf.data file profiling firefox with debug symbols in
libxul. A hefty 10x speedup!
Before:
```
592.765,37 msec task-clock:u # 0,999 CPUs utilized
0 context-switches:u # 0,000 K/sec
0 cpu-migrations:u # 0,000 K/sec
587.913 page-faults:u # 0,992 K/sec
2.610.836.174.604 cycles:u # 4,405 GHz (83,33%)
9.249.001.490 stalled-cycles-frontend:u # 0,35% frontend cycles idle (83,33%)
188.323.380.515 stalled-cycles-backend:u # 7,21% backend cycles idle (83,33%)
6.294.821.871.279 instructions:u # 2,41 insn per cycle
# 0,03 stalled cycles per insn (83,33%)
1.593.493.508.805 branches:u # 2688,236 M/sec (83,33%)
1.613.875.121 branch-misses:u # 0,10% of all branches (83,34%)
593,078170383 seconds time elapsed
589,590379000 seconds user
1,591781000 seconds sys
```
After:
```
57.292,74 msec task-clock:u # 0,999 CPUs utilized
0 context-switches:u # 0,000 K/sec
0 cpu-migrations:u # 0,000 K/sec
598.808 page-faults:u # 0,010 M/sec
246.209.111.444 cycles:u # 4,297 GHz (83,34%)
8.990.996.482 stalled-cycles-frontend:u # 3,65% frontend cycles idle (83,33%)
52.443.604.272 stalled-cycles-backend:u # 21,30% backend cycles idle (83,34%)
583.136.689.772 instructions:u # 2,37 insn per cycle
# 0,09 stalled cycles per insn (83,33%)
150.053.278.261 branches:u # 2619,063 M/sec (83,33%)
833.143.959 branch-misses:u # 0,56% of all branches (83,34%)
57,370841100 seconds time elapsed
55,799767000 seconds user
1,291568000 seconds sys
```
Note that this patch also uncovers some broken offset computations.
Checking the offsets manually with addr2line indicates that the new
offsets we report are better than the old ones, at least for the
cases I compared, e.g.:
```
$ addr2line -C -i -f -e .../fork -a 255800 22c800
0x0000000000255800
__vfwprintf_internal
fork.c:?
0x000000000022c800
__cos_fma
??:?
$ addr2line -C -i -f -e .../fork -a 252585 229585
0x0000000000252585
printf_positional
fork.c:?
0x0000000000229585
main
??:?
$ addr2line -C -i -f -e .../vector_static_gcc_v9.1.0 -a 45d3e0
0x000000000045d3e0
__munmap
crtstuff.c:?
```
Additionally, we now resolve symbols the way binutils does: we pick
the first symbol we find and don't skip weak symbols, as eu-addr2line
seems to do. I.e. for this:
```
0000000000417a40 w F .text 0000000000000074 hypot
0000000000417a40 w F .text 0000000000000074 hypotf64
0000000000417a40 w F .text 0000000000000074 hypotf32x
0000000000417a40 g F .text 0000000000000074 __hypot
```
We used to get `__hypot`, but now we get `hypot`. I think this is
just as good, and as I said, it's also what you'd get from binutils'
`addr2line`:
```
$ addr2line -C -i -f -e vector_static_gcc_v9.1.0 -a 417a40
0x0000000000418480
hypot
??:?
$ eu-addr2line -C -i -f -e vector_static_gcc_v9.1.0 -a 417a40
0x0000000000418480
__hypot
??:0
```
Initially, I thought about just skipping all weak symbols, but that's
not a feasible approach: some symbols are weak and never overridden
by a non-weak symbol, like this one:
```
$ objdump -C -t .../vector_static_gcc_v9.1.0 | grep 401c70
0000000000401c70 w F .text 0000000000000162 void std::vector<double, std::allocator<double> >::_M_realloc_insert<double>(__gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, double&&)
```
And that one even contains a bunch of inlined frames, so we definitely
want to keep it. We could potentially pass that information along and
implement custom logic to prefer non-weak symbols, but quite frankly,
I don't think that effort is worth it.
Change-Id: Ic91764aaab36e77be1c4df4a32d4ac2b4c28e7e0
Reviewed-by: Ulf Hermann <ulf.hermann@qt.io>
I have encountered cases where the ElfInfo we pass to the address
cache is the pgoff-mmapped one. In that case, we need to use the
baseAddr to compute the relative address, not the absolute address
at which the offsetted ELF was mapped. In my case, the address was
actually outside of the pgoff-mmapped region and apparently within
a different section; elfutils handled that properly, but our cache
didn't - we just asserted and failed.
Change-Id: I2cd9d2cebbd60f00353ecbf413e020783374769e
Reviewed-by: Ulf Hermann <ulf.hermann@qt.io>
Asking for an address that's not within the range specified by the
ElfInfo leads to an assert. Avoid that by giving both Elf maps the
same addresses. The cache operates on relative addresses internally
anyway, so the absolute addresses are irrelevant.
Change-Id: I33afdc762ca74d1ec4243420e4bc886aa4820581
Reviewed-by: Ulf Hermann <ulf.hermann@qt.io>
Change-Id: I5c98b0f427d846d9e5cc987b27fb963176425c5d
Reviewed-by: Christian Stenger <christian.stenger@qt.io>
The symbol table isn't necessarily sorted, and thus repeated lookups
in it can be expensive when a DSO has many entries in its symtab.
For example, librustc_driver from rustc 1.40.0 has about 202594
symbols, and a single call to dwfl_module_addrinfo can take
milliseconds on my laptop. Every time we get a sample at a so far
unknown address, we have to find the corresponding symbol, so we
called this function a lot, which adds up to a significant amount
of time. Now we cache the symbol name along with its offset and
size in a sorted list and look symbols up there quickly. The impact
of this patch on the overall time required to analyze a ~1GB
perf.data file for a `cargo build` process (and its child
processes) is huge:
before:
```
447.681,66 msec task-clock:u # 0,989 CPUs utilized
0 context-switches:u # 0,000 K/sec
0 cpu-migrations:u # 0,000 K/sec
45.214 page-faults:u # 0,101 K/sec
1.272.289.956.854 cycles:u # 2,842 GHz
3.497.255.264.964 instructions:u # 2,75 insn per cycle
863.671.557.196 branches:u # 1929,209 M/sec
2.666.320.642 branch-misses:u # 0,31% of all branches
452,806895428 seconds time elapsed
441,996666000 seconds user
2,557237000 seconds sys
```
after:
```
63.770,08 msec task-clock:u # 0,995 CPUs utilized
0 context-switches:u # 0,000 K/sec
0 cpu-migrations:u # 0,000 K/sec
35.102 page-faults:u # 0,550 K/sec
191.267.750.628 cycles:u # 2,999 GHz
501.316.536.714 instructions:u # 2,62 insn per cycle
122.234.405.333 branches:u # 1916,799 M/sec
443.671.470 branch-misses:u # 0,36% of all branches
64,063443896 seconds time elapsed
62,188041000 seconds user
1,136533000 seconds sys
```
That means we are now roughly 7x faster than before.
Fixes: https://github.com/KDAB/hotspot/issues/225
Change-Id: Ib7dbc800c9372044a847de68a8459dd7f7b0d3da
Reviewed-by: Ulf Hermann <ulf.hermann@qt.io>
When we profile a multi-process ensemble, it often happens that we
encounter samples at the same relative address of a DSO in different
processes. In such cases, we can leverage a central cache for this
information instead of recomputing the same data for every process.
As an example, I wrote a shell script that runs the same process
four times in parallel. When I parse the resulting perf.data file,
the perf stat results are as follows:
before:
```
Performance counter stats for '/home/milian/projects/compiled/other/lib/libexec/hotspot-perfparser --input ./perf.data --output /dev/null':
4.240,50 msec task-clock:u # 0,956 CPUs utilized
0 context-switches:u # 0,000 K/sec
0 cpu-migrations:u # 0,000 K/sec
17.389 page-faults:u # 0,004 M/sec
11.195.771.907 cycles:u # 2,640 GHz
26.585.168.652 instructions:u # 2,37 insn per cycle
6.234.491.027 branches:u # 1470,227 M/sec
35.149.387 branch-misses:u # 0,56% of all branches
4,435152034 seconds time elapsed
3,732758000 seconds user
0,490148000 seconds sys
```
after:
```
Performance counter stats for '/home/milian/projects/compiled/other/lib/libexec/hotspot-perfparser --input ./perf.data --output /dev/null':
4.160,90 msec task-clock:u # 0,979 CPUs utilized
0 context-switches:u # 0,000 K/sec
0 cpu-migrations:u # 0,000 K/sec
15.476 page-faults:u # 0,004 M/sec
10.635.798.451 cycles:u # 2,556 GHz
16.616.035.720 instructions:u # 1,56 insn per cycle
3.838.148.777 branches:u # 922,433 M/sec
24.902.558 branch-misses:u # 0,65% of all branches
4,249408917 seconds time elapsed
3,612442000 seconds user
0,533933000 seconds sys
```
Note that the overall elapsed time doesn't change much here, but
the number of instructions required is massively reduced. I bet
there are other situations where this patch brings a more tangible
improvement to the overall runtime.
Change-Id: I4531ec648af40dd44b9e4290fab7bbd2a89609da
Reviewed-by: Ulf Hermann <ulf.hermann@qt.io>
Instead of using one large shared address cache that needs to be
cleared every now and then, cache the address location information
per ELF file using relative addresses. This means we don't need to
clear this cache at all, as the lookup of a relative address in a
given ELF file will always return the same data.
This greatly improves performance in situations where the cache is
cleared often. For one of my files, it drops the time from 55s down
to 18s.
Additionally, this patch paves the way to sharing (parts of) this
cache in PerfUnwind. Most of its contents are no longer
PID-specific.
Change-Id: I79616fbb5c45a2543845df2d05d9936e49401627
Reviewed-by: Milian Wolff <milian.wolff@kdab.com>