As a performance engineer, I spend a ton of time trying to visualize latency and other system data in ways that make it easy to summarize the characteristics of complex systems. In looking for ways to plot many discrete histograms side-by-side (3 dimensions, x=value, y=count, z=group), I came across Brendan Gregg’s outstanding work with latency heatmaps and waterfall plots. Coalescing the distributions into a heatmap did not fit well with my specific use case, as each distribution was discrete and independent of the other distributions, but the waterfall visualizations would perfectly capture what I was trying to show.
Brendan provides the source code to generate this style of plot, though it requires jumping from R to ImageMagick to lay out the distributions. After searching further for a fully encapsulated solution, I could not find a way to plot data in this style entirely inside R, without depending on post-processing in another program (e.g., gnuplot or ImageMagick).
The strategy
I decided to take a crack at it using ggplot. My idea was to take each group in the dataset and shift it up the y-axis in proportion to the group’s ordinal index among all groups. I’d then use a white or black fill under the curve to “cover up” the groups that sit below the current group in terms of z-index (in web design terms). Finally, I’d remove all axes, labels, and legends to keep the visualization clean. The y-axis certainly doesn’t make sense to display since we are artificially setting y values, but the x-axis could be kept should the need arise.
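To make the shifting step concrete, here is a minimal sketch in base R. The sample data, the column names (`groupVar`, `value`), and the spacing constant `offsetStep` are assumptions made for illustration, not part of the original dataset.

```r
# Hypothetical sample data: three groups of latency-like values.
set.seed(42)
raw <- data.frame(
  groupVar = rep(c("a", "b", "c"), each = 200),
  value    = c(rlnorm(200, 1.0, 0.4), rlnorm(200, 1.3, 0.4), rlnorm(200, 1.6, 0.4))
)

# Evaluate every density on the same x grid so the curves line up.
xRange     <- range(raw$value)
offsetStep <- 0.1  # assumed vertical spacing between neighbouring groups

groups <- sort(unique(raw$groupVar))
curves <- do.call(rbind, lapply(seq_along(groups), function(i) {
  d <- density(raw$value[raw$groupVar == groups[i]],
               from = xRange[1], to = xRange[2], n = 256)
  shift <- (i - 1) * offsetStep        # proportional to the group's ordinal index
  data.frame(groupVar = groups[i],
             x    = d$x,
             ymin = shift,             # baseline of this group's curve
             ymax = d$y + shift)       # density shifted up the y-axis
}))
```

Each group ends up with its own baseline (`ymin`) and shifted curve (`ymax`), which is the shape of data a ribbon-plus-line drawing needs. A suitable `offsetStep` depends on the scale of the densities and is best tuned by eye.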
What doesn’t work
My first attempt using standard ggplot syntax looked something like this:
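A minimal sketch of such an attempt, assuming a hypothetical data frame `curves` that already holds pre-shifted values in columns `x`, `ymin`, `ymax`, and `groupVar` (synthetic normal curves stand in for real densities here):

```r
library(ggplot2)

# Hypothetical pre-shifted curves: one bell curve per group, each raised by 0.1.
x <- seq(0, 10, length.out = 200)
curves <- do.call(rbind, lapply(1:4, function(i) {
  data.frame(groupVar = factor(i), x = x,
             ymin = (i - 1) * 0.1,
             ymax = dnorm(x, mean = 3 + i, sd = 1) + (i - 1) * 0.1)
}))

# Standard syntax: one ribbon layer and one line layer shared by all groups.
# Every ribbon is drawn before any line, so ribbons cannot hide lines.
p <- ggplot(curves, aes(x = x, group = groupVar)) +
  geom_ribbon(aes(ymin = ymin, ymax = ymax), fill = "white") +
  geom_line(aes(y = ymax), colour = "black")
```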
Although ggplot2 will draw a ribbon and a line for every value in groupVar, the ordering of the layers causes this plot to fall short. ggplot2 first draws all N ribbons, then draws all N lines on top of them. Since the line and the ribbon for any given group do not share the same z-index, we aren’t able to cover up lines with ribbons.
Switching the order of geom_ribbon and geom_line doesn’t help either: the ribbons are then drawn last and hide the lines beneath them.
What works
Changing our ggplot construction to add each group individually, two layers at a time (a ribbon followed by its line), gives us the z-index grouping we need for this visualization. While it’s not the prettiest ggplot syntax, it quickly gives us what we’re looking for.
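The layer-by-layer construction can be sketched as a loop. This is an illustration under assumptions rather than the exact original listing: the `curves` data frame and its columns are hypothetical, and drawing the highest group first (so that lower groups paint over it) is one reasonable reading of the strategy.

```r
library(ggplot2)

# Hypothetical pre-shifted curves: one bell curve per group, each raised by 0.1.
x <- seq(0, 10, length.out = 200)
nGroups <- 4
curves <- do.call(rbind, lapply(seq_len(nGroups), function(i) {
  data.frame(groupVar = i, x = x,
             ymin = (i - 1) * 0.1,
             ymax = dnorm(x, mean = 3 + i, sd = 1) + (i - 1) * 0.1)
}))

p <- ggplot()
# Add the rearmost (highest) group first; each later group's white ribbon
# then covers whatever was drawn behind it before its own line is added.
for (i in rev(seq_len(nGroups))) {
  g <- curves[curves$groupVar == i, ]
  p <- p +
    geom_ribbon(data = g, aes(x = x, ymin = ymin, ymax = ymax), fill = "white") +
    geom_line(data = g, aes(x = x, y = ymax), colour = "black")
}

# Strip axes, labels, and background for a clean waterfall look.
p <- p + theme_void()
```

Because each ribbon and line pair is added as its own pair of layers, the plot ends up with 2N layers in interleaved order, which is what gives each group a distinct z-index.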
Plotting Histograms directly
Plotting histograms instead of density curves makes things a little easier and allows us to use standard ggplot syntax. The white outlines (the “color” aesthetic in geom_rect) can be removed in favor of just the “fill” aesthetic if they become distracting.
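A minimal sketch of the histogram variant follows. The sample data, the shared bin edges, and `offsetStep` are assumptions for illustration. The sketch also assumes that geom_rect renders rows in data order, which is worth verifying against your ggplot2 version.

```r
library(ggplot2)

# Hypothetical sample data: four groups of latency-like values.
set.seed(7)
raw <- data.frame(
  groupVar = rep(1:4, each = 300),
  value    = unlist(lapply(1:4, function(i) rlnorm(300, meanlog = 1 + 0.2 * i, sdlog = 0.4)))
)

breaks     <- seq(0, max(raw$value) + 1, length.out = 41)  # 40 shared bins
offsetStep <- 20                                           # assumed spacing, in counts

# Bin each group with base hist() and shift its bars up by the group index.
bars <- do.call(rbind, lapply(sort(unique(raw$groupVar)), function(i) {
  h <- hist(raw$value[raw$groupVar == i], breaks = breaks, plot = FALSE)
  shift <- (i - 1) * offsetStep
  data.frame(groupVar = i,
             xmin = h$breaks[-length(h$breaks)],
             xmax = h$breaks[-1],
             ymin = shift,
             ymax = h$counts + shift)
}))

# Put rear (highest) groups first so front groups are drawn over them
# (assumes rectangles are rendered in row order).
bars <- bars[order(-bars$groupVar), ]

p <- ggplot(bars) +
  geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
            fill = "black", colour = "white") +
  theme_void()
```

Dropping the `colour = "white"` argument leaves only the fill, which removes the outlines between bars.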