Change-Id: I7578e582ac7fd3dd2012f783f273080ec2c2b18b
Reviewed-by: Ulf Hermann <ulf.hermann@qt.io>
Added the start address and size to Symbol, and included them in its
stream output. A disassembler can then use a symbol's relative address
and size to find the instructions that belong to a function.
Change-Id: I96709ca380d0e58cd5cf5a8cc87116147b2754d6
Reviewed-by: Ulf Hermann <ulf.hermann@qt.io>
Apparently there are situations where the compiler generates entries
in the symbol table with zero size that are still referenced by some
call stacks. After commit d8d56b7e we ended up losing the symbol names
in such cases. This patch restores the symbol names for such
zero-size symbols.
Change-Id: If98f68626ab4251ccfed89d791ebd333f6a6a60a
Reviewed-by: Ulf Hermann <ulf.hermann@qt.io>
Apparently the symtab isn't necessarily sorted. That means every
call to dwfl_module_addrinfo may end up iterating over many symbol
entries to find a match. For libxul.so, which contains roughly one
million symbols, this can be exceedingly slow.
Now we instead iterate over all symbols once and store them in a
sorted array for quick lookups later on. When we then look up a
symbol, we only need to demangle it on demand.
I believe the following numbers speak for themselves. Both runs are
for a 1.1GB perf.data file profiling firefox with debug symbols in
libxul. A hefty 10x speedup!
Before:
```
592.765,37 msec task-clock:u # 0,999 CPUs utilized
0 context-switches:u # 0,000 K/sec
0 cpu-migrations:u # 0,000 K/sec
587.913 page-faults:u # 0,992 K/sec
2.610.836.174.604 cycles:u # 4,405 GHz (83,33%)
9.249.001.490 stalled-cycles-frontend:u # 0,35% frontend cycles idle (83,33%)
188.323.380.515 stalled-cycles-backend:u # 7,21% backend cycles idle (83,33%)
6.294.821.871.279 instructions:u # 2,41 insn per cycle
# 0,03 stalled cycles per insn (83,33%)
1.593.493.508.805 branches:u # 2688,236 M/sec (83,33%)
1.613.875.121 branch-misses:u # 0,10% of all branches (83,34%)
593,078170383 seconds time elapsed
589,590379000 seconds user
1,591781000 seconds sys
```
After:
```
57.292,74 msec task-clock:u # 0,999 CPUs utilized
0 context-switches:u # 0,000 K/sec
0 cpu-migrations:u # 0,000 K/sec
598.808 page-faults:u # 0,010 M/sec
246.209.111.444 cycles:u # 4,297 GHz (83,34%)
8.990.996.482 stalled-cycles-frontend:u # 3,65% frontend cycles idle (83,33%)
52.443.604.272 stalled-cycles-backend:u # 21,30% backend cycles idle (83,34%)
583.136.689.772 instructions:u # 2,37 insn per cycle
# 0,09 stalled cycles per insn (83,33%)
150.053.278.261 branches:u # 2619,063 M/sec (83,33%)
833.143.959 branch-misses:u # 0,56% of all branches (83,34%)
57,370841100 seconds time elapsed
55,799767000 seconds user
1,291568000 seconds sys
```
Note that this patch also uncovers some broken offset computations.
Checking the offsets manually with addr2line indicates that the new
offsets we report are better than the old ones, at least for the
cases I compared, e.g.:
```
$ addr2line -C -i -f -e .../fork -a 255800 22c800
0x0000000000255800
__vfwprintf_internal
fork.c:?
0x000000000022c800
__cos_fma
??:?
$ addr2line -C -i -f -e .../fork -a 252585 229585
0x0000000000252585
printf_positional
fork.c:?
0x0000000000229585
main
??:?
$ addr2line -C -i -f -e .../vector_static_gcc_v9.1.0 -a 45d3e0
0x000000000045d3e0
__munmap
crtstuff.c:?
```
Additionally, we now resolve symbols the way binutils does: we pick
the first symbol we find and don't skip weak symbols, as eu-addr2line
seems to do. I.e. for this:
```
0000000000417a40 w F .text 0000000000000074 hypot
0000000000417a40 w F .text 0000000000000074 hypotf64
0000000000417a40 w F .text 0000000000000074 hypotf32x
0000000000417a40 g F .text 0000000000000074 __hypot
```
We used to get `__hypot`, but now we get `hypot`. I think this is
just as good, and as I said, it's also what you'd get from binutils'
`addr2line`:
```
$ addr2line -C -i -f -e vector_static_gcc_v9.1.0 -a 417a40
0x0000000000418480
hypot
??:?
$ eu-addr2line -C -i -f -e vector_static_gcc_v9.1.0 -a 417a40
0x0000000000418480
__hypot
??:0
```
Initially, I thought about just skipping all weak symbols, but that's
not a feasible approach: some symbols are weak and never overridden
by a non-weak symbol, like this one:
```
$ objdump -C -t .../vector_static_gcc_v9.1.0 | grep 401c70
0000000000401c70 w F .text 0000000000000162 void std::vector<double, std::allocator<double> >::_M_realloc_insert<double>(__gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, double&&)
```
And that one even contains a bunch of inlined frames, so we definitely
want to keep it. We could potentially pass that information along and
implement custom logic to prefer non-weak symbols, but quite frankly,
I don't think that effort is worth it.
Change-Id: Ic91764aaab36e77be1c4df4a32d4ac2b4c28e7e0
Reviewed-by: Ulf Hermann <ulf.hermann@qt.io>
I have encountered cases where the ElfInfo we pass to the address
cache is the pgoff-mmapped one. In that case, we need to use the
baseAddr to compute the relative address, not the absolute address
at which the offsetted ELF was mapped. In my case, the address was
actually outside of the pgoff-mmapped region and apparently within
a different section; elfutils handled that properly, but our cache
didn't - we just asserted and failed.
Change-Id: I2cd9d2cebbd60f00353ecbf413e020783374769e
Reviewed-by: Ulf Hermann <ulf.hermann@qt.io>
Asking for an address that's not within the range specified by the
ElfInfo leads to an assert. Avoid that by giving both Elf maps the
same addresses. The cache operates on relative addresses internally
anyway, so the absolute addresses are irrelevant.
Change-Id: I33afdc762ca74d1ec4243420e4bc886aa4820581
Reviewed-by: Ulf Hermann <ulf.hermann@qt.io>
Change-Id: I5c98b0f427d846d9e5cc987b27fb963176425c5d
Reviewed-by: Christian Stenger <christian.stenger@qt.io>
The symbol table isn't necessarily sorted, and thus repeated lookups
in it can be expensive when a DSO has many entries in its symtab.
For example, librustc_driver from rustc 1.40.0 has about 202594
symbols, and a single call to dwfl_module_addrinfo can take
milliseconds on my laptop. Every time we get a sample at a so far
unknown address, we have to find the corresponding symbol, so we
called this function a lot, which adds up to a significant amount
of time. Now we cache the symbol name along with its offset and
size in a sorted list and look symbols up there quickly. The impact
of this patch on the overall time required to analyze a ~1GB
perf.data file for a `cargo build` process (and its child
processes) is huge:
before:
```
447.681,66 msec task-clock:u # 0,989 CPUs utilized
0 context-switches:u # 0,000 K/sec
0 cpu-migrations:u # 0,000 K/sec
45.214 page-faults:u # 0,101 K/sec
1.272.289.956.854 cycles:u # 2,842 GHz
3.497.255.264.964 instructions:u # 2,75 insn per cycle
863.671.557.196 branches:u # 1929,209 M/sec
2.666.320.642 branch-misses:u # 0,31% of all branches
452,806895428 seconds time elapsed
441,996666000 seconds user
2,557237000 seconds sys
```
after:
```
63.770,08 msec task-clock:u # 0,995 CPUs utilized
0 context-switches:u # 0,000 K/sec
0 cpu-migrations:u # 0,000 K/sec
35.102 page-faults:u # 0,550 K/sec
191.267.750.628 cycles:u # 2,999 GHz
501.316.536.714 instructions:u # 2,62 insn per cycle
122.234.405.333 branches:u # 1916,799 M/sec
443.671.470 branch-misses:u # 0,36% of all branches
64,063443896 seconds time elapsed
62,188041000 seconds user
1,136533000 seconds sys
```
That means we are now roughly 7x faster than before.
Fixes: https://github.com/KDAB/hotspot/issues/225
Change-Id: Ib7dbc800c9372044a847de68a8459dd7f7b0d3da
Reviewed-by: Ulf Hermann <ulf.hermann@qt.io>
When we profile a multi-process ensemble, it often happens that we
encounter samples at the same relative address of a DSO in different
processes. In such cases, we can leverage a central cache for this
information instead of recomputing the same data for every process.
As an example, I wrote a shell script that runs the same process
four times in parallel. When I parse the resulting perf.data file,
the perf stat results are as follows:
before:
```
Performance counter stats for '/home/milian/projects/compiled/other/lib/libexec/hotspot-perfparser --input ./perf.data --output /dev/null':
4.240,50 msec task-clock:u # 0,956 CPUs utilized
0 context-switches:u # 0,000 K/sec
0 cpu-migrations:u # 0,000 K/sec
17.389 page-faults:u # 0,004 M/sec
11.195.771.907 cycles:u # 2,640 GHz
26.585.168.652 instructions:u # 2,37 insn per cycle
6.234.491.027 branches:u # 1470,227 M/sec
35.149.387 branch-misses:u # 0,56% of all branches
4,435152034 seconds time elapsed
3,732758000 seconds user
0,490148000 seconds sys
```
after:
```
Performance counter stats for '/home/milian/projects/compiled/other/lib/libexec/hotspot-perfparser --input ./perf.data --output /dev/null':
4.160,90 msec task-clock:u # 0,979 CPUs utilized
0 context-switches:u # 0,000 K/sec
0 cpu-migrations:u # 0,000 K/sec
15.476 page-faults:u # 0,004 M/sec
10.635.798.451 cycles:u # 2,556 GHz
16.616.035.720 instructions:u # 1,56 insn per cycle
3.838.148.777 branches:u # 922,433 M/sec
24.902.558 branch-misses:u # 0,65% of all branches
4,249408917 seconds time elapsed
3,612442000 seconds user
0,533933000 seconds sys
```
Note that the overall elapsed time doesn't change much here, but
the number of instructions required is massively reduced. I bet
there are other situations where this patch brings a more tangible
improvement to the overall runtime.
Change-Id: I4531ec648af40dd44b9e4290fab7bbd2a89609da
Reviewed-by: Ulf Hermann <ulf.hermann@qt.io>
Instead of using one large shared address cache that needs to be
cleared every now and then, cache the address location information
per ELF file using relative addresses. This means we don't need to
clear this cache at all, as the lookup of a relative address in a
given ELF file will always return the same data.
This greatly improves performance in situations where the cache is
cleared often. For one of my files, it drops the time from 55s down
to 18s.
Additionally, this patch paves the way to sharing (parts of) this
cache in PerfUnwind. Most of its contents are no longer
PID-specific.
Change-Id: I79616fbb5c45a2543845df2d05d9936e49401627
Reviewed-by: Milian Wolff <milian.wolff@kdab.com>