Flame Graphs in Rust on Linux
Sep 5, 2016 · 3 minute readI’ve been working on a few toy Rust programs and libraries of late. One of these is dolby
, an implementation of Adaptive Count-Min Sketch as detailed in this paper from Microsoft Research. CMS data structures are designed to deal with the problem of counting things at scale - when your incoming data stream is (essentially) infinite, how can you reason about the data using only finite resources? CMS algorithms and structures get around that by using probability - they can give you approximate answers.
Running an algorithm like this at scale is likely to amplify any inefficiencies in the implementation, and knocking off a few seconds, milliseconds, or even microseconds off the running time of a loop or other section of the program can yield massive rewards. And a great way to identify potential improvements or discovering problematic issues is with flame graphs.
(The following was all done on a Ubuntu 16.04 machine, but don’t worry! I’m not using any fancy BPF tricks or anything - as long as you can run a recent-ish perf
on your machine (mine is 4.4.15), you should be able to replicate this fairly easily.)
In order to get useful information out of our profiling tools, we need to tell the Rust compiler (rustc
) to include DWARF debugging symbols in the binary it is going to create. To do this, we add a small section to our cargo.toml
file:
[profile.release]
debug = true
Once we have built the binary with cargo build
, we can do a system-wide profile using perf
. Here I’m running a test against 10 million entries being added to a dolby
data structure:
root# perf record -ag ~/dolby/target/debug/dolby
(the -a flag ensures profiling capture on all CPUs, the -g flag switches on stack-chain recording, which is needed to build the flame graph structure)
After that has finished, we can use Brendan Gregg’s flame graph tools to convert the output into graphical form.
perf script | ~/FlameGraph/stackcollapse-perf.pl | ~/FlameGraph/flamegraph.pl > rust.svg
Opening rust.svg
in a browser gives us a fancy flame graph!
Looking at the graph, it seems that the majority of the time dolby
is on-CPU, it is running the insert
method. Which makes sense, seeing as the test harness is firing 10m numbers into the structure! We can also see by walking up the chain that insert
spends most of its time in the hash_index
method, and most of the samples inside hash_index
are actually while Rust is doing hashing calculations within the murmurhash3
crate. Changing the hash function to something less expensive than murmurhash3
may therefore result in improved insert
performance.
Anyhow - flame graphs in Rust! They’re easy! Use them!