fighting oranges with systemtap probes, latency fighting; step 2

Recap from step 1: Sometimes unit test failures on the mozilla tinderboxen are (allegedly, per me) due to insufficiently constrained asynchronous processes.  Sometimes the tests time out because of the asynchronous ordering thing, sometimes it’s just because they’re slow.  Systemtap is awesome.  Using systemtap we can get exciting debug output in JSON form which we can use to fight the aforementioned things.

Advances in bullet point technology have given us the following new features:

  • Integration of latencytap by William Cohen.  Latencytap is a systemtap script that figures out why your thread/process started blocking.  It uses kernel probes to notice when the task gets activated/deactivated, which tells us how long it was asleep.  It performs a kernel backtrace and uses a reasonably extensive built-in knowledge base to figure out the best explanation for why it decided to block.  This gets us not only fsync() but memory allocation internals and other neat stuff too.  (A minimal sketch of the off-CPU measurement follows this list.)
    • We ignore everything shorter than a microsecond, because that is effectively what latencytap already did by virtue of dealing in microseconds, and it seems like a good idea.  (Since we use nanoseconds, we actually filter out slightly more; the cutoff is explicit rather than an artifact of quantization.)
    • We get JS backtraces where available for anything longer than 0.1 ms.
  • The visualization is now based on wall time by default.
  • Latency and GC activities are now shown on the visualization in the UI.
  • Automated summarization of latency including aggregation of JS call stacks.
  • The new UI bits drove and benefit from various wmsy improvements and cleanup.  Many thanks to my co-worker James Burke for helping me with a number of design decisions there.
  • The systemtap probe compilation non-determinism bug mentioned last time is not gone yet, but thanks to the friendly and responsive systemtap developers, it will be gone soon!
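
Here is roughly what the off-CPU measurement boils down to, as a minimal sketch rather than the real latencytap: the scheduler tapset’s cpu_off/cpu_on probe points are my assumption about a reasonable place to hook, this version does not distinguish voluntary blocking from preemption, and the classification of the kernel backtrace against the knowledge base, which is the real script’s whole point, is omitted.  Run it with -x against the process you care about.

global off_ns, sleep_ns

probe scheduler.cpu_off {
  if (pid() != target()) next
  # remember when this thread left the CPU
  off_ns[tid()] = gettimeofday_ns()
}

probe scheduler.cpu_on {
  if (pid() != target()) next
  t = tid()
  if (!(t in off_ns)) next
  delta = gettimeofday_ns() - off_ns[t]
  delete off_ns[t]
  # ignore anything shorter than a microsecond, per the bullet above
  if (delta < 1000) next
  sleep_ns[execname()] <<< delta
}

probe end {
  foreach (name in sleep_ns)
    printf("%-20s slept %d times, %d ns total\n",
           name, @count(sleep_ns[name]), @sum(sleep_ns[name]))
}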

Using these new and improved bullet points we were able to look at one of the tests that seemed to be intermittently timing out (the bug) for legitimate reasons of slowness.  And recently, not just that one test, but many of its friends using the same test infrastructure (asyncTestUtils).

So if you look at the top visualization, you will see lots of reds and pinks; it’s like a trip to Arizona but without all of the tour buses.  Those are all our fsyncs.  How many fsyncs?  This many fsyncs:

Why so many fsyncs?

Oh dear, someone must have snuck into messageInjection.js when I was not looking!  (Note: comment made for comedic purposes; all the blame is mine, although I have several high quality excuses up my sleeve if required.)
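
Incidentally, you do not need the full latencytap machinery just to count the fsyncs and see who is issuing them.  A throwaway sketch like the following is enough; it is not the probe set I actually used, just the shape of the question, and it wants -x with the target pid plus -d/--ldd so the user backtraces resolve.

global fsyncs

probe syscall.fsync {
  if (pid() != target()) next
  # key on the user-space backtrace so we can see who is issuing them
  fsyncs[ubacktrace()]++
}

probe end {
  foreach (bt in fsyncs- limit 5) {
    printf("%d fsync() calls from:\n", fsyncs[bt])
    print_ustack(bt)
  }
}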

What would happen if we change the underlying C++ class to support batching semantics and the injection logic to use it?

Nice.

NB: No, I don’t know exactly what the lock contention is.  The label might be misleading, since it is based on sys_futex/do_futex being on the stack rather than on the syscall itself.  The contention only shows up on one thread, and the latencytap kernel probes have to self-filter (they fire for everything and use globals to do the filtering), so I would not be surprised if it turned out that the systemtap probes themselves use futexes and that is what we are seeing.  It’s not trivial to find out because the latencytap probes can’t really get a good native userspace backtrace (the bug), and when I dabbled in that area I managed to hard-lock my system and I really dislike rebooting.  So a mystery they will stay.  Unless someone tells me, or I read more of the systemtap source, or I implement hover-brushing in the visualization, or I otherwise figure things out.

There is probably more to come, including me running the probes against a bunch of mozilla-central/comm-central tests and putting the results up in interactive web-app form (with the distilled JSON available).  It sounds like I might get access to a MoCo VM to facilitate that, which would be nice.

fighting non-deterministic xpcshell unit tests through causality tracking with systemtap; step 1

It’s a story as old as time itself.  You write a unit test.  It works for you.  But the evil spirits in the tinderboxes cause the test to fail.  The trick is knowing that evil spirits are a lot like genies.  When you ask for a wish, genies will try and screw you over by choosing ridiculous values wherever you forgot to constrain things.  For example, I got my dream job, but now I live in Canada.  The tinderbox evil spirits do the same thing, but usually by causing pathological thread scheduling.

This happened to me the other day, and a number of our other tests are running slower than they should, so I decided it was time for another round of incremental, amortized tool-building.  Building on my previous systemtap-applied-to-mozilla adventures by porting things to the new JS representations in mozilla-2.0 and adding some new trace events, I can now generate traces that:

  • Know when the event loop is processing an event.  Because we reconstruct a nested tree from our trace information and we have a number of other probes, we can also attribute the event to higher-level concepts like timer callbacks.
  • Know when a new runnable is scheduled for execution on a thread or a new timer firing is scheduled, etc.  To help understand why this happened, we emit a JS backtrace at that point.  (We could also get a native backtrace cheaply, or even a unified backtrace with some legwork.)  A rough sketch of these dispatch/run probes follows the list.
  • Know when high-level events occur during the execution of the unit test.  We hook the dump() implementations in the relevant contexts (xpcshell, JS components/modules, sandboxes) and then we can listen in on all the juicy secrets the test framework shouts into the wind.  What is notable about this choice of probe point is that it:
    • is low frequency, at least if you are reasonably sane about your output.
    • provides a useful correlation between what is going on under the hood and something that makes sense to the developer.
    • does not cause the JS engine to need to avoid tracing or start logging everything that ever happens.
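
To give a flavor of those dispatch/run probes, here is a skeletal sketch, not the real mozperfish.stp: the library is passed as the first stap argument, nsThread::Dispatch and nsThread::ProcessNextEvent are my assumed hook points, and the real probes capture runnable identities and JS backtraces rather than the bare timestamps shown here.

global depth

probe process(@1).function("nsThread::Dispatch") {
  # something is scheduling a runnable; say so and show who asked
  # (needs unwind data: run stap with -d <binary> / --ldd)
  printf("{\"type\":\"dispatch\",\"ts\":%d,\"tid\":%d}\n", gettimeofday_ns(), tid())
  print_ubacktrace()
}

probe process(@1).function("nsThread::ProcessNextEvent") {
  d = depth[tid()]
  depth[tid()] = d + 1
  printf("{\"type\":\"run-start\",\"ts\":%d,\"tid\":%d,\"depth\":%d}\n",
         gettimeofday_ns(), tid(), d)
}

probe process(@1).function("nsThread::ProcessNextEvent").return {
  d = depth[tid()] - 1
  depth[tid()] = d
  printf("{\"type\":\"run-end\",\"ts\":%d,\"tid\":%d,\"depth\":%d}\n",
         gettimeofday_ns(), tid(), d)
}

Invocation would look like stap sketch.stp /path/to/libxpcom_core.so; post-processing turns streams of records like these into the causality graph described next.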

Because we know when the runnables are created (including what runnable they live inside of) and when they run, we are able to build what I call a causality graph because it sounds cool.  Right now this takes the form of a hierarchical graph.  Branches form when a runnable (or the top-level) schedules more than one runnable during the execution of a single runnable.  The dream is to semi-automatically (heuristics / human annotations may be required) transform the hierarchical graph into one that merges these branches back into a single branch when appropriate.  Collapsing linear runs into a single graph node is also desired, but easy.  Such fanciness may not actually be required to fix test non-determinism, of course.

The protovis-based visualization above has the following exciting bullet points describing it:

  • Time flows vertically downwards.  Time in this case is defined by a globally sequential counter incremented for each trace event.  Time used to be the number of nanoseconds since the start of the run, but there seemed to somehow be clock skew between my various processor cores that live in a single chip.
  • Horizontally, threads are spaced out, and within each thread the types of events are spaced out.
    • The weird nested alpha grey levels convey nesting of JS_ExecuteScript calls which indicates both xpcshell file loads and JS component/module top-level executions as a result of initial import.  If there is enough vertical space, labels are shown, otherwise they are collapsed.
  • Those sweet horizontal bands convey the various phases of operation and have helpful labels.
  • The nodes are either events caused by a top-level runnable being executed by the event loop or important events that merit the creation of synthetic nodes in the causal graph.  For example, we promote the execution of a JS file to its own link so we can more clearly see when a file caused something to happen.  Likewise, we generate new links when analysis of dump() output tells us a test started or stopped.
  • The blue edges are expressing the primary causal chain as determined by the dump() analysis logic.  If you are telling us a test started/ended, it only follows that you are on the primary causal chain.
  • If you were viewing it in a web browser, you could click on the nodes and it would console.log them and then you could see what is actually happening in there.  If you hovered over nodes they would also highlight their ancestors and descendants in various loud shades of red.

  • The specific execution being visualized had a lot of asynchronous mozStorage stuff going on.  (The busy thread on the right is the asynchronous thread for the database.)  The second image has the main init chain hovered, resulting in all the async fallout from the initialization process being highlighted in red.  At first glance it’s rather concerning that the initialization process is still going on inside the first real test.  Thanks to the strict ordering of event queues, we can see that everything that happens on the primary causal chain after it hops over to the async thread and back is safe because of that ordering.  The question is whether bad things could happen prior to that joining.  The answer?  In another blog post, I fear.
    • Or “probably”, given that the test has a known potential intermittent failure.  (The test is in dangerous waters because it is re-triggering synchronous database code that normally is only invoked during the startup phase, before references are available to the code and before any asynchronous dispatches occur.  All the other tests in the directory are able to play by the rules and so all of their logic should be well-ordered, although we might still expect the initialization logic to likewise complete in what is technically a test phase.  Since the goal is just to make tests deterministic (as long as we do not sacrifice realism), the simplest solution may just be to force the init phase to complete before allowing the tests to start.  The gloda support logic already has an unused hook capable of accomplishing this that I may have forgotten to hook up…)

The repo is my usual systemtapping repo.  Things you might type if you wanted to use the tools follow:

  • make SOLO_FILE=test_name.js EXTRA_TEST_ARGS="--debugger /path/to/tb-test-help/systemtap/chewchewwoowoo.py --debugger-args '/path/to/tb-test-help/systemtap/mozperfish/mozperfish.stp --'" check-one
  • python /path/to/tb-test-help/chewchewwoowoo.py --re-run=/tmp/chewtap-##### mozperfish/mozperfish.stp /path/to/mozilla/dist/bin/xpcshell

The two major gotchas to be aware of are that: a) you need to make your xpcshell build with jemalloc, since I left some jemalloc-specific memory probes in there, and b) you may need to run the first command several times, because systemtap has some seriously non-deterministic DWARF logic going on right now where it claims not to believe that certain types are actually unions/structs/whatnot.

work-in-progress tooling: feeding GWT SpeedTracer systemtapped mozilla performance data

Now that Thunderbird 3.1 is string/feature-frozen it’s time to focus on performance.  This post is just to make sure that people who are also working on similar things know what I’m up to.  For example, there’s some very exciting work going on involving adding a startup timeline to mozilla-central that everyone keeping up-to-date with mozilla performance work should be aware of.  In summary, no results yet, but soon!

The big bullet points for now are:

  • GWT SpeedTracer is a very impressive tool.
  • I have seriously cleaned up / automated my previous SystemTap script work for mozilla.  Invoking python chewchewwoowoo.py speedtracer/mozspeedtrace.stp `pgrep thunderbird-bin` does the following:
    • generates a version of the source script with any line statements fixed up based on one or more line seeks in a file.
    • builds the command line and invokes the systemtap script.
    • invokes a post-processing script specified in the tap file, feeding it address-translation helpers and the fused-together per-CPU output from systemtap’s bulk logging mode.
  • mozspeedtrace.stp and its post-processor produce an HTML file with embedded JSON that the SpeedTracer Chrome extension recognizes as one of its own.  (SpeedTracer runs in Firefox through the GWT development mode, but being very new to GWT I am not yet clear on if/how to be able to wrap it up to be used as a webapp from static data.)
  • My mapping of events as recorded by my SystemTap probes to SpeedTracer events is somewhat sketchy, but part of that is due to the limited set of events and their preconfigured display output.  (I believe it turns out I can generate optional data like fake stack traces to be able to expand the set of mapped events without having to modify SpeedTracer.)  This is why basically everything in the screenshot is a yellow ‘JavaScript Callback’.
  • I have not implemented probes for all of the event types defined by SpeedTracer, I have not ported all of my previous probes yet, and there are still some new probes yet to add.  Happily, most of the hard work is already in the can.
  • The motivation behind this is very different from the startup timeline.  Thunderbird uses mozilla-1.9.2, not mozilla-central, and my primary investigatory focus is memory usage in steady-state after start-up, not the time it takes to start up.  (We do, of course, want to improve performance where possible.)  It just so happens that in order to usefully explain who is allocating memory we also have to know who is active at the time / why they are active, and that provides us with the performance timeline data.
  • This work is not likely to be useful for per-tab/webpage performance data gathering… just whole-program performance investigation like Thunderbird needs.
  • This is a work-in-progress, but the focus is on improving Thunderbird’s performance issues, not building tooling.  As such, I may have to bail on SpeedTracer in favor of command-line heuristic analysis if I can’t readily modify SpeedTracer to suit my needs.  I picked SpeedTracer partially to (briefly) investigate GWT, but from my exploration of the code thus far, the activation energy required may be too great.

a systemtap script to attribute JS string memory costs to the creating JS functions

----- js_GC data pages: 96731 total vm pages: 234849
   ###      KiB
 11376      839 BF_format                                        log4moz.j:487
 15906      725 mime_emitter_writeBody                           jsmimeemitter.j:381
   148      576 MimeMessage_coerceBodyToPlaintext                mimemsg.j:341
  4460      492 gloda_ds_queryFromQuery                          datastore.j:2988
  9655      387 <MYSTERIOUS>                                     <BEYOND THE VEIL>:0
   911      237 gloda_content_quoted                             connotent.j:267
  1723      144 gloda_indexMessage                               index_msg.j:2597
   187      141 gloda_content_content                            connotent.j:290
   170      108 gloda_content_getContent                         connotent.j:166
  1917       73 gloda_identity_toString                          datamodel.j:785
  1775       69 Log4Moz_enumerateProps                           log4moz.j:220
value |-------------------------------------------------- count
    0 |@                                                    366
    1 |@@@@                                                1371
    2 |@@@@@                                               1584
    4 |@@@@@@@@@@@@@@@@@@@@@@@@                            6863
    8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@          11784
   16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      12794
   32 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  14016
   64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      8662
  128 |@@@@@@@@@@@@@@                                      4093
  256 |@@@@@                                               1567
  512 |@                                                    309
 1024 |                                                     105
 2048 |                                                      58
 4096 |                                                      28
 8192 |                                                      17
16384 |                                                      12
32768 |                                                       0
65536 |                                                       0

The script.  It tells you what JS code has been creating strings since the last GC and the overall distribution of the marginal costs of those strings, leaving out puny code that allocates less than 64k’s worth.  Marginal costs, because spidermonkey has support for dependent strings: “a += b” will actually result in the storage of a being realloc’d, with the original a becoming a dependent string that points into the new, larger string.  I forgot to check for the js_ConcatStrings case at first and, let me tell you, the stats were not pretty.  Poor mime_emitter_writeBody was accused of allocating 11 gigabytes of strings in one intra-GC cycle!
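
The measurement core is small.  Here is a stripped-down sketch that only produces the size histogram, with none of the per-function attribution; it assumes the 1.9.2-era js_NewString entry point and that its length parameter is visible through DWARF, and the js_GC hook matches the since-last-GC reporting above.

global sizes, nstrings

probe process(@1).function("js_NewString") {
  # jschars are two bytes each; treat 2 * length as the marginal cost
  nstrings++
  sizes <<< $length * 2
}

probe process(@1).function("js_GC") {
  # dump and reset the distribution at each GC, like the output above
  if (nstrings) {
    print(@hist_log(sizes))
    delete sizes
    nstrings = 0
  }
}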

What is interesting about this script is that it is actually pulling all of the function name / script name / line number information directly out of a JSContext object.  What is exciting is that it was pretty easy to do thanks to systemtap’s understanding of DWARF debug symbols.  This is in stark contrast to my efforts using VProbes, where there was no such support and I had to write a gdb python script to begin to approach the same thing.  Also, because systemtap’s guru mode allows you to bail into real C code when you need to, I was able to write a function that truncates the jschar (16-bit unicode) strings that make up the function names into ASCII in a way that is both straightforward and efficient.  It also allowed for a quick/easy/fast basename() implementation too.

It is worth noting that spidermonkey does have built-in (conditionally compiled) static probe points originally created for DTrace, and they do provide more generic object lifetime hooks.  Although they were created for DTrace, systemtap is also able to use them.  The limitation of those probes is that once any probe using a hook is active, a non-trivial cost is incurred, because filtering can only be performed inside the probe and you may receive orders of magnitude more calls than you actually care about.  The cost is worst for probes that provide function names, since that requires a call to js_GetStringBytes every time (although there is some caching).  Keep in mind that I’m not saying the systemtap userspace probes are free, but since they can be placed at arbitrary points, you can have probes that fire only in the cases you actually care about, only incur data-processing costs when desired, and are disturbingly easy to implement since you can just reach into the data structures you care about.
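
For contrast, using those static probes from systemtap looks something like the sketch below.  The mark name and argument order are my recollection of the javascript provider and should be treated as assumptions, but the structural problem is visible either way: the filtering happens inside the probe body, after every JS function call has already paid for the firing.

# requires a SpiderMonkey built with the DTrace probes compiled in
probe process(@1).mark("function__entry") {
  # the arguments (filename, class name, function name) arrive as raw
  # pointers; we pay for every single JS function call just to throw
  # most of them away right here
  func = user_string($arg3)
  if (func != "mime_emitter_writeBody") next
  printf("%s entered\n", func)
}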

Since everyone loves an implementation fight, DTrace, great and magical though it is, is basically incapable of pulling off the JS stack-walking trick that systemtap makes possible.  Unless DTrace has grown these features since I last checked, it lacks the DWARF magic to do such things easily and the language intentionally avoids the control-flow constructs that would allow it to do anything quite so complex.  And you can entirely forget about arbitrary C code for the unicode string truncation…

This post is brought to you by efforts to decrease the memory footprint of Thunderbird when gloda’s indexer is active.  Because Thunderbird loves you.  But it loves your memory more.  For now.  That’s what we’re trying to fix.

performance annotated SQLite EXPLAINation visualizations using systemtap

For the Thunderbird 3.1 release cycle we are not just fixing UX problems but also resolving various performance issues.  Building on my previous work on a visualization of SQLite opcode control flow graphs using graphviz I give you… the same thing!  But updated to more recent versions of SQLite and integrating performance information retrieved through the use of systemtap with utrace.

In this case we are using systemtap to extract the number of times each opcode is executed and the number of btree pages that are requested during the course of executing the opcode.  (We do not differentiate between cache hits and misses because we are looking at big-O right now.)  Because virtual tables (like those used by FTS3) result in nested SQLite queries and we do not care about analyzing the queries used by FTS3, we ignore nested calls and attribute all btree page accesses to the top-level opcode under execution.
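
A reduced sketch of the counting half looks like the following.  The real script attributes page requests to the specific top-level opcode by probing statements inside sqlite3VdbeExec’s opcode dispatch loop and emits JSON; this version just counts per thread, and sqlite3PagerAcquire as the btree page request entry point is an assumption about the SQLite version of that era.

global vdbe_depth, pages

probe process(@1).function("sqlite3VdbeExec") {
  vdbe_depth[tid()]++
}

probe process(@1).function("sqlite3VdbeExec").return {
  vdbe_depth[tid()]--
}

probe process(@1).function("sqlite3PagerAcquire") {
  # count page requests whenever a query is executing; the real script
  # attributes them to the top-level opcode currently being executed,
  # ignoring the nested queries issued by virtual tables like FTS3
  if (vdbe_depth[tid()]) pages[tid()]++
}

probe end {
  foreach (t in pages)
    printf("tid %d: %d btree page requests\n", t, pages[t])
}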

Because the utility of a tool is just as dependent on ease of use as on its feature set, I’ve cleaned things up and made it much easier to get information out of Thunderbird/gloda with this patch, which should land soon and provides the following:

  • The gloda datastore will write a JSON file with the EXPLAINed results of all its SQL queries to the path found in the preference mailnews.database.global.datastore.explainToPath.  The preference is observed, so setting it at runtime will cause it to create the file and begin explaining all subsequently created queries.  Clearing/changing the preference closes out the current file and potentially opens a new one.
  • Gloda unit tests will automatically set the preference to the value of the environment variable GLODA_DATASTORE_EXPLAIN_TO_PATH if set.
  • A schema dump is no longer required for meta-data because we just assume that you are using a SQLite DEBUG build that tells us everything we want to know about in the ‘comment’ column.
  • grokexplain.py now uses optparse and has more internal documentation and such.

So what do the pretty pictures show?

  • Before: A gloda fulltext query search retrieves all of the results data before applying the LIMIT.  This results in a lot more data transplanted into temporary results tables than we will end up using; wasted bookkeeping.  Additionally, we incur the row lookup costs for both our messages data storage table and our fulltext messagesText table for all hits, even the ones we will not return.  (Noting that there was no escaping hitting both tables since we were using offsets() and it hits the fulltext table as part of its operation.)
  • After: We perform an initial query phase where we minimize wasted bookkeeping by only retrieving and using the bare minimum required to compute the LIMITed list of document id’s.  Additionally, by use of the FTS3 matchinfo() function instead of the offsets() function we are able to avoid performing row lookup on the messagesText table for results that will not be returned to the user.  Use of the matchinfo() function requires a custom ranking function which allows us to be somewhat more clever about boosting results based on fulltext matches too.
  • The poor man’s bar graphs in the pictures are expressing a hand-rolled logarithmic scale for the number of invocations (left bar) and number of btree pages accessed (right bar).  On the right side of each line the actual numbers are also presented in the same order.  The goal is not to convey good/bad so much as to draw the eye to hot spots.

Notes for people who want more to read:

  • SQLite has built-in infrastructure to track the number of times an opcode is executed as well as the wall-clock time used; you do not have to use systemtap.  It’s a conditional compilation kind of thing, just -DVDBE_PROFILE and every statement you execute gets its performance data appended to vdbe_profile.out when you reset the statement.  It doesn’t do the btree pages consulted trick, but it’s obviously within its power with some code changes.
  • Our use of systemtap is primarily a question of discretionary reporting control, the ability to integrate the analysis across other abstraction layers, and various build concerns.  The JSON output is not a driving concern, just something nice we are able to do for ourselves since we are formatting the output.
  • The tool has really crossed over into the super-win-payoff category with this fix investigation.  (Having said that, I probably could have skipped some of the data-flow stuff last time around.  But these projects are both for personal interest and amusement as well as practical benefit, so what are you gonna do?  Also, that effort could pay off a bit more by propagating comments along register uses so that the LIMIT counter register r8 and the compute-once r7 in the after diagram would not require human thought.)

References:

  • The grokexplain repo.  Used like so: python grokexplain.py --vdbe-stats=/tmp/glodaNewSearchPerf.json /tmp/glodaNewSearchExplained.json -o /tmp/glodasearch-newcheck
  • The systemtap script in its repo.  Used like so: sudo stap -DMAXSTRINGLEN=1024 sqlite-perf.stp /path/to/thunderbird-objdir/mozilla/dist/lib/libsqlite3.so > /tmp/glodaNewSearchPerf.json
  • The bug with the gloda explain logic patch and the improved SQL query logic.  I also used the tooling to fix another (now-fixed) bug, but that problem turned out to be less interesting.

more systemtap mozilla event loop awareness

====================                                             ms    #
----- Event Loop:
  nsInputStreamReadyEvent                                      7119   11
----- Input Events:
  nsInputStreamPump                                            7119   11
----- Pump Events:
  thunk:nsNNTPProtocol::QueryInterface(...)                    7119   11

====================                                             ms    #
----- Event Loop:
  nsInputStreamReadyEvent                                      6284   15
  nsTimerEvent                                                  242  123
----- Timers:
  nsUITimerCallback                                             121    2
----- Input Events:
  nsInputStreamPump                                            6284   15
----- Pump Events:
  thunk:nsNNTPProtocol::QueryInterface(...)                    6283   15

As of late, I’ve noticed that Thunderbird has been pegging one of my CPU cores periodically in ways not correlated with gloda activity, the usual scapegoat.  With some enhancements to my systemtap script from last time, I can now see that it apparently takes something like 13 seconds of dominating the event loop to perform the check for new messages in my 26 subscribed news folders on news.mozilla.org.  Yowch.  At least this is on a debug build!

The changes are that we now cover input stream ready events, input pump listeners, and proxied objects (which lets us see what Thunderbird IMAP gets up to).  Also, the output is more concise and only shows things that take at least 100 ms of aggregate wall-clock time.

Script invocation currently wants to look like this:
sudo stap -v moz-event-loop.stp /path/to/thunderbird/objdir/mozilla/dist/lib/libxpcom_core.so /path/to/thunderbird/objdir/mozilla/dist/bin/components/libnecko.so | ../addrsymfilt.py `pgrep thunderbird-bin`

using systemtap to figure out what your mozilla app’s event loop is up to

====================                                             ms    #
----- Event Loop:
  nsTimerEvent                                                 1233   31
  nsProxyObjectCallInfo                                          19   44
  nsStreamCopierOB                                                1   48
  nsStreamCopierIB                                                0   18
  nsProxyCallCompletedEvent                                       0   44
  nsProxyReleaseEvent                                             0   27
  nsTransportStatusEvent                                          0   19
  nsSocketEvent                                                   0   18
  nsHttpConnectionMgr::nsConnEvent                                0    1
----- Timers:
  OnBiffTimer(...)                                             1129    3
  nsGlobalWindow::TimerCallback(...)                             70   10
  nsAutoSyncManager::TimerCallback(...)                          29    6
  nsExpirationTracker::TimerCallback(...)                         1    1
  nsIdleService::IdleTimerCallback(...)                           0    5
  nsExpirationTracker::TimerCallback(...)                         0    1
  nsHttpHandler                                                   0    1
  nsUITimerCallback                                               0    2
  imgContainer::sDiscardTimerCallback(...)                        0    1
  nsExpirationTracker::TimerCallback(...)                         0    1

That’s one of the periodic outputs (10 seconds currently) of this systemtap script filtered through this python script to translate addresses to useful symbol names in realtime.  It’s like top for mozilla.

Actual invocation looks like so:
sudo stap -v moz-event-loop.stp /path/to/thunderbird/objdir/mozilla/dist/lib/libxpcom_core.so | ../addrsymfilt.py `pgrep thunderbird-bin`
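
The shape of the script is simple even if its probe placement is not.  A stripped-down analogue that only does per-thread accounting, with none of the per-runnable/per-timer breakdown and no address translation via addrsymfilt.py, might look like this; it uses function-boundary probes rather than the source-line probes the real script needs (see the caveat below).

global depth, start_us, busy_us

probe process(@1).function("nsThread::ProcessNextEvent") {
  if (depth[tid()]++ == 0)
    start_us[tid()] = gettimeofday_us()
}

probe process(@1).function("nsThread::ProcessNextEvent").return {
  depth[tid()]--
  if (depth[tid()] == 0) {
    busy_us[tid()] += gettimeofday_us() - start_us[tid()]
    delete start_us[tid()]
  }
}

# report every 10 seconds and only mention threads that racked up at
# least 100 ms of aggregate wall-clock time, like the real script
probe timer.s(10) {
  printf("==================== event loop time, last 10s\n")
  foreach (t in busy_us-)
    if (busy_us[t] >= 100000)
      printf("  tid %-6d %6d ms\n", t, busy_us[t] / 1000)
  delete busy_us
}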

The giant caveat (and giant hooray for utrace and Fedora having kernels with the utrace patches built-in) is that the magic in here is done dynamically using utrace source line probes.  As such, the probes aren’t resilient in the face of changes to the underlying source files; the current line numbers are targeted at 1.9.2.  There are various in-tree and out-of-tree solutions possible.