work-in-progress tooling: feeding GWT SpeedTracer systemtapped mozilla performance data

Now that Thunderbird 3.1 is string/feature-frozen it’s time to focus on performance.  This post is just to make sure that people who are also working on similar things know what I’m up to.  For example, there’s some very exciting work going on involving adding a startup timeline to mozilla-central that everyone keeping up-to-date with mozilla performance work should be aware of.  In summary, no results yet, but soon!

The big bullet points for now are:

  • GWT SpeedTracer is a very impressive tool.
  • I have seriously cleaned up / automated my previous SystemTap script work for mozilla.  Invoking python chewchewwoowoo.py speedtracer/mozspeedtrace.stp `pgrep thunderbird-bin` does the following:
    • generates a version of the source script with any line statements fixed up based on one or more line seeks in a file (see the sketch after this list).
    • builds the stap command line and invokes the SystemTap script.
    • invokes a post-processing script specified in the tap file, feeding it address-translation helpers and the fused per-CPU output from systemtap’s bulk logging mode.
  • mozspeedtrace.stp and its post-processor produce an HTML file with embedded JSON that the SpeedTracer Chrome extension recognizes as one of its own.  (SpeedTracer runs in Firefox through the GWT development mode, but being very new to GWT I am not yet clear on whether/how it can be wrapped up for use as a webapp against static data.)
  • My mapping of events as recorded by my SystemTap probes to SpeedTracer events is somewhat sketchy, but part of that is due to the limited set of events and their preconfigured display output.  (I believe it turns out I can generate optional data like fake stack traces to be able to expand the set of mapped events without having to modify SpeedTracer.)  This is why basically everything in the screenshot is a yellow ‘JavaScript Callback’.
  • I have not implemented probes for all of the event types defined by SpeedTracer, I have not ported all of my previous probes yet, and there are still some new probes yet to add.  Happily, most of the hard work is already in the can.
  • The motivation behind this is very different from the startup timeline.  Thunderbird uses mozilla-1.9.2, not mozilla-central, and my primary investigatory focus is memory usage in steady-state after start-up, not the time it takes to start up.  (We do, of course, want to improve performance where possible.)  It just so happens that in order to usefully explain who is allocating memory we also have to know who is active at the time / why they are active, and that provides us with the performance timeline data.
  • This work is not likely to be useful for per-tab/webpage performance data gathering… just whole-program performance investigation like Thunderbird needs.
  • This is a work-in-progress, but the focus is on improving Thunderbird’s performance issues, not building tooling.  As such, I may have to bail on SpeedTracer in favor of command-line heuristic analysis if I can’t readily modify SpeedTracer to suit my needs.  I picked SpeedTracer partially to (briefly) investigate GWT, but from my exploration of the code thus far, the activation energy required may be too great.
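
To make the line-seek fixup mentioned above concrete, here is a minimal Python sketch of that flow.  The @LINESEEK@ marker syntax, the output path, and the helper names are hypothetical, invented purely for illustration; chewchewwoowoo’s actual mechanism and options may differ, but the shape is the same: find a marker in a source file, substitute the resulting line number into the .stp probe points, then invoke stap against the target pid.

# hypothetical sketch; not chewchewwoowoo's real marker syntax or options
import re, subprocess, sys

def seek_line(path, needle):
    # return the 1-based line number of the first line containing `needle`
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            if needle in line:
                return lineno
    raise ValueError("marker %r not found in %s" % (needle, path))

def fixup_script(stp_path, out_path):
    # rewrite placeholders like @LINESEEK(foo.cpp, "some marker")@ into concrete
    # line numbers so .statement("*@foo.cpp:1234") style probes stay valid
    pattern = re.compile(r'@LINESEEK\(([^,]+),\s*"([^"]+)"\)@')
    with open(stp_path) as f:
        text = f.read()
    text = pattern.sub(lambda m: str(seek_line(m.group(1).strip(), m.group(2))), text)
    with open(out_path, "w") as f:
        f.write(text)
    return out_path

if __name__ == "__main__":
    script, pid = sys.argv[1], sys.argv[2]
    fixed = fixup_script(script, "/tmp/fixed-up.stp")
    # -x scopes process-filtered probes to the target pid; bulk/per-cpu output
    # handling and the post-processing hand-off are omitted from this sketch
    subprocess.check_call(["stap", "-x", pid, fixed])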

a systemtap script to attribute JS string memory costs to the creating JS functions

----- js_GC data pages: 96731 total vm pages: 234849
   ###      KiB
 11376      839 BF_format                                        log4moz.j:487
 15906      725 mime_emitter_writeBody                           jsmimeemitter.j:381
   148      576 MimeMessage_coerceBodyToPlaintext                mimemsg.j:341
  4460      492 gloda_ds_queryFromQuery                          datastore.j:2988
  9655      387 <MYSTERIOUS>                                     <BEYOND THE VEIL>:0
   911      237 gloda_content_quoted                             connotent.j:267
  1723      144 gloda_indexMessage                               index_msg.j:2597
   187      141 gloda_content_content                            connotent.j:290
   170      108 gloda_content_getContent                         connotent.j:166
  1917       73 gloda_identity_toString                          datamodel.j:785
  1775       69 Log4Moz_enumerateProps                           log4moz.j:220
value |-------------------------------------------------- count
    0 |@                                                    366
    1 |@@@@                                                1371
    2 |@@@@@                                               1584
    4 |@@@@@@@@@@@@@@@@@@@@@@@@                            6863
    8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@          11784
   16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      12794
   32 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  14016
   64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      8662
  128 |@@@@@@@@@@@@@@                                      4093
  256 |@@@@@                                               1567
  512 |@                                                    309
 1024 |                                                     105
 2048 |                                                      58
 4096 |                                                      28
 8192 |                                                      17
16384 |                                                      12
32768 |                                                       0
65536 |                                                       0

The script.  It tells you what JS code has been creating strings since the last GC and the overall distribution of the marginal costs of those strings, leaving out puny code that allocates less than 64k’s worth.  Marginal costs, because spidermonkey has support for dependent strings: “a += b” will actually result in the storage of a being realloc’d, with the original a then pointing into the storage of the new string.  I forgot to check for the js_ConcatStrings case at first and, let me tell you, the stats were not pretty.  Poor mime_emitter_writeBody was accused of allocating 11 gigabytes of strings in one intra-GC cycle!
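
For a feel of the reporting side, here is a rough Python re-imagining of what gets aggregated and printed at js_GC time.  The real aggregation happens in systemtap itself; the record format below is just an assumption for illustration.  The idea: sum the marginal bytes per creating function, drop anything under the 64KiB threshold, and bucket individual string costs into a power-of-two histogram.

# Python re-imagining of the tap script's report step (illustrative only)
from collections import Counter

def report(allocs, threshold=64 * 1024):
    """allocs: iterable of (func_name, script_colon_line, marginal_bytes) tuples."""
    total_bytes, call_counts, size_hist = Counter(), Counter(), Counter()
    for func, where, nbytes in allocs:
        total_bytes[(func, where)] += nbytes
        call_counts[(func, where)] += 1
        size_hist[nbytes.bit_length()] += 1      # log2 bucket, like stap's hist_log
    print("   ###      KiB")
    for key, nbytes in total_bytes.most_common():
        if nbytes < threshold:
            continue                             # skip puny allocators (< 64KiB total)
        func, where = key
        print("%6d %8d  %-45s %s" % (call_counts[key], nbytes // 1024, func, where))
    for bucket in sorted(size_hist):
        label = 0 if bucket == 0 else 1 << (bucket - 1)
        print("%6d | %d" % (label, size_hist[bucket]))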

What is interesting about this script is that it is actually pulling all of the function name/script name/line number information directly out of a JSContext object.  What is exciting is that it was pretty easy to do thanks to systemtap’s understanding of DWARF debug symbols.  This is in stark contrast to my efforts using VProbes, where there was no such support and I had to write a gdb python script to begin to approach the same thing.  Also, because systemtap’s guru mode allows you to bail into real C code when you need to, I was able to write a straightforward and efficient function that truncates the jschar (16-bit unicode) strings that make up the function names down to ASCII.  It also allowed for a quick/easy/fast basename() implementation too.

It is worth noting that spidermonkey does have built-in (conditionally compiled) static probe points originally created for DTrace, and they do provide more generic object lifetime hooks.  Although they were created for DTrace, systemtap is also able to use them.  The limitation with those probes is that once any probe using a hook is active, a non-trivial cost is incurred, because filtering can only be performed inside the probe and you may receive orders of magnitude more calls than you actually care about.  The cost is worst for probes that provide function names, since that requires a call to js_GetStringBytes every time (although there is some caching).  Keep in mind that I’m not saying the systemtap userspace probes are free, but since they can be placed at arbitrary points, you can have probes that fire only in the cases you actually care about, only incur data-processing costs when desired, and are disturbingly easy to implement since you can just reach into the data structures you care about.

Since everyone loves an implementation fight, DTrace, great and magical though it is, is basically incapable of pulling off the JS stack-walking trick that systemtap makes possible.  Unless DTrace has grown these features since I last checked, it lacks the DWARF magic to do such things easily and the language intentionally avoids the control-flow constructs that would allow it to do anything quite so complex.  And you can entirely forget about arbitrary C code for the unicode string truncation…

This post is brought to you by efforts to decrease the memory footprint of Thunderbird when gloda’s indexer is active.  Because Thunderbird loves you.  But it loves your memory more.  For now.  That’s what we’re trying to fix.

performance annotated SQLite EXPLAINation visualizations using systemtap

For the Thunderbird 3.1 release cycle we are not just fixing UX problems but also resolving various performance issues.  Building on my previous work on a visualization of SQLite opcode control flow graphs using graphviz I give you… the same thing!  But updated to more recent versions of SQLite and integrating performance information retrieved through the use of systemtap with utrace.

In this case we are using systemtap to extract the number of times each opcode is executed and the number of btree pages that are requested during the course of executing the opcode.  (We do not differentiate between cache hits and misses because we are looking at big-O right now.)  Because virtual tables (like those used by FTS3) result in nested SQLite queries and we do not care about analyzing the queries used by FTS3, we ignore nested calls and attribute all btree page accesses to the top-level opcode under execution.
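
Here is an illustrative Python model of that attribution rule (the real logic lives in the systemtap probes on libsqlite3; the event stream and its names below are invented for the sketch): opcode executions are only counted while the top-level statement is stepping, and every btree page access is credited to the top-level opcode currently under execution, even if a nested virtual-table statement actually triggered it.

# illustrative model only; not the systemtap script itself
from collections import Counter

def attribute(events):
    depth = 0                 # how many nested sqlite3_step invocations are active
    current_addr = None       # opcode address under execution in the outer VDBE
    exec_counts, page_counts = Counter(), Counter()
    for kind, value in events:
        if kind == "enter_step":
            depth += 1
        elif kind == "exit_step":
            depth -= 1
        elif kind == "opcode" and depth == 1:
            current_addr = value                 # nested opcodes are ignored outright
            exec_counts[current_addr] += 1
        elif kind == "page_access" and current_addr is not None:
            page_counts[current_addr] += 1       # charged to the top-level opcode
    return exec_counts, page_counts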

Because the utility of a tool is just as dependent on its ease of use as on its feature set, I’ve cleaned things up and made it much easier to get information out of Thunderbird/gloda with this patch, which should land soon and provides the following:

  • The gloda datastore will write a JSON file with the EXPLAINed results of all its SQL queries to the path found in the preference mailnews.database.global.datastore.explainToPath.  This preference is observed, so setting it at runtime will cause it to create the file and begin explaining all subsequently created queries.  Clearing/changing the preference closes out the current file and potentially opens a new one.
  • Gloda unit tests will automatically set the preference to the value of the environment variable GLODA_DATASTORE_EXPLAIN_TO_PATH if set.
  • A schema dump is no longer required for meta-data because we just assume that you are using a SQLite DEBUG build that tells us everything we want to know in the ‘comment’ column.
  • grokexplain.py now uses optparse and has more internal documentation and such.

So what do the pretty pictures show?

  • Before: A gloda fulltext query search retrieves all of the results data before applying the LIMIT.  This results in a lot more data transplanted into temporary results tables than we will end up using; wasted bookkeeping.  Additionally, we incur the row lookup costs for both our messages data storage table and our fulltext messagesText table for all hits, even the ones we will not return.  (Note that there was no avoiding hitting both tables, since we were using offsets() and it hits the fulltext table as part of its operation.)
  • After: We perform an initial query phase where we minimize wasted bookkeeping by only retrieving and using the bare minimum required to compute the LIMITed list of document id’s.  Additionally, by use of the FTS3 matchinfo() function instead of the offsets() function we are able to avoid performing row lookup on the messagesText table for results that will not be returned to the user.  Use of the matchinfo() function requires a custom ranking function which allows us to be somewhat more clever about boosting results based on fulltext matches too.
  • The poor man’s bar graphs in the pictures express a hand-rolled logarithmic scale for the number of invocations (left bar) and the number of btree pages accessed (right bar); a sketch of the idea follows this list.  On the right side of each line the actual numbers are also presented in the same order.  The goal is not to convey good/bad so much as to draw the eye to hot spots.
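
The bars themselves are simple to produce; here is a minimal Python sketch of the kind of hand-rolled log-scale annotation described above.  The character, width, and layout choices are arbitrary here, not grokexplain’s actual ones.

# minimal sketch of log-scale bar annotation; widths/characters are arbitrary
import math

def log_bar(count, width=10, char="#"):
    if count <= 0:
        return ""
    return char * min(width, 1 + int(math.log(count, 2)))

def annotate(opcode, invocations, pages):
    # left bar: invocations, right bar: btree pages; raw numbers follow in order
    return "%-12s %-10s %-10s %6d %6d" % (
        opcode, log_bar(invocations), log_bar(pages), invocations, pages)

print(annotate("SeekGe", 4, 128))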

Notes for people who want more to read:

  • SQLite has built-in infrastructure to track the number of times an opcode is executed as well as the wall-clock time used; you do not have to use systemtap.  It’s a conditional compilation kind of thing, just -DVDBE_PROFILE and every statement you execute gets its performance data appended to vdbe_profile.out when you reset the statement.  It doesn’t do the btree pages consulted trick, but it’s obviously within its power with some code changes.
  • Our use of systemtap is primarily a question of discretionary reporting control, the ability to integrate the analysis across other abstraction layers, and various build concerns.  The JSON output is not a driving concern, just something nice we are able to do for ourselves since we are formatting the output.
  • The tool has really crossed over into the super-win-payoff category with this fix investigation.  (Having said that, I probably could have skipped some of the data-flow stuff last time around.  But these projects are both for personal interest and amusement as well as practical benefit, so what are you gonna do?  Also, that effort could pay off a bit more by propagating comments along register uses so that the LIMIT counter register r8 and the compute-once r7 in the after diagram would not require human thought.)

References:

  • The grokexplain repo.  Used like so: python grokexplain.py --vdbe-stats=/tmp/glodaNewSearchPerf.json /tmp/glodaNewSearchExplained.json -o /tmp/glodasearch-newcheck
  • The systemtap script in its repo.  Used like so: sudo stap -DMAXSTRINGLEN=1024 sqlite-perf.stp /path/to/thunderbird-objdir/mozilla/dist/lib/libsqlite3.so > /tmp/glodaNewSearchPerf.json
  • The bug with the gloda explain logic patch and the improved SQL query logic.  I also used the tooling to fix another (now-fixed) bug, but that problem turned out to be less interesting.

Using VMWare Record/Replay and VProbes for low time-distortion performance profiling

profile-performance-graph-enumerateProps

The greatest problem with performance profiling is getting as much information as possible while affecting the results as little as possible.  For my work on pecobro I used mozilla’s JavaScript DTrace probes.  Because the probes are limited to notifications of all function invocations/returns with no discretion and there is no support for JS backtraces, the impact on performance was heavy.  Although I have never seriously entertained using chronicle-recorder (via chroniquery) for performance investigations, it is a phenomenal tool and it would be fantastic if it were usable for this purpose.

VMware introduced with Workstation 6/6.5 the ability to efficiently record VM execution by recording the non-deterministic parts of VM execution.  When you hit the record button it takes a snapshot and then does its thing.  For a 2 minute execution trace where Thunderbird is started up and gloda starts indexing, adaptively targeting 80% cpu usage, I have a 1G memory snapshot (the amount of memory allocated to the VM), a 57M vmlog file, and a 28M vmsn file.  There is also a 40M disk delta file (against the disk snapshot), but I presume that’s a side effect of the execution rather than a component of it.

The record/replay functionality is the key to being able to analyze performance while minimizing the distortion of the data-gathering mechanisms.  There are apparently a lot of other solutions in the pipeline, many of them open source.  VMware peeps apparently also created a record/replay-ish mechanism for valgrind, valgrind-rr, which roc has thought about leveraging for chronicle-recorder.  I have also heard of Xen solutions to the problem, but am not currently aware of any usable solutions today.  And of course, there are many precursors to VMware’s work, but this blog post is not a literature survey.

There are 3 ways to get data out of a VM under replay, only 2 of which are usable for my purposes.

  1. Use gdb/the gdb remote target protocol.  The VMware server opens up a port that you can attach to.  The server has some built-in support to understand linux processes if you spoon feed it some critical offsets.  Once you do that, “info threads” lists every process in the image as a thread which you can attach to.  If you do the dance right, gdb provides perfect back-traces and you can set breakpoints and generally do your thing.  You can even rewind execution if you want, but since that means restoring state at the last checkpoint and running execution forward until it reaches the right spot, it’s not cheap.  In contrast, chronicle-recorder can run (process) time backwards, albeit at a steep initial cost.
  2. Use VProbes.  Using a common analogy, dtrace is like a domesticated assassin black bear that comes from the factory understanding English and knowing how to get you a beer from the fridge as well as off your enemies.  VProbes, in contrast, is a grizzly bear that speaks no English.  Assuming you can convince it to go after your enemies, it will completely demolish them.  And you can probably teach it to get you a beer too, it just takes a lot more effort.
  3. Use VAssert.  Just like asserts only happen in debug builds, VAsserts only happen during replay (but not during recording).  Except for the requirement that you think ahead to VAssert-enable your code, it’s awesome because, like static dtrace probes, you can use your code that already understands your code rather than trying to wail on things from outside using gdb or the like.  This one was not an option because it is Windows only as of WS 6.5.  (And Windows was not an option because building mozilla in a VM is ever so slow, and, let’s face it, I’m a linux kind of guy.  At least until someone buys me a solid gold house and a rocket car.)

profile-performance-graph-callbackDriver-doubleClicked

My first step in this direction has been using a combination of #1 and #2 to get javascript backtraces using a timer-interval probe.  The probe roughly does the following:

  • Get a pointer to the current linux kernel task_struct:
    • Assume we are uniprocessor and retrieve the value of x86_hw_tss.sp0 from the TSS struct for the first processor.
    • Now that we know the per-task kernel stack pointer, we can find a pointer to the task_struct at the base of the page.
  • Check if the name of our task is “thunderbird-bin” and bail if it is not.
  • Pull the current timestamp from the linux-kernel-maintained xtime.  Ideally we could use VProbes’ getsystemtime function, but it doesn’t seem to work and/or is not well defined.  Our goal is to have a reliable indicator of what the real time is at this stage in the execution, because with a rapidly polling probe our execution will obviously be slower than realtime.  xtime is pretty good for this, but ticks at 10ms out of the box (Ubuntu 9.04 i386 VM-targeted build), which is a rather limited granularity.  Presumably we can increase its tick rate, but not without some additional (though probably acceptable) time distortion.
  • Perform a JS stack dump:
    • Get XPConnect’s context for the thread.
      • Using information from gdb on where XPCPerThreadData::gTLSIndex is, load the tls slot.  (We could also just directly retrieve the tls slot from gdb; see the sketch after this list.)
      • Get the NSPR thread private data for that TLS slot.
        • Using information from gdb on where pt_book is located, get the pthread_key for NSPR’s per-thread data.
        • Using the current task_struct from earlier, get the value of the GS segment register by looking into tls0_base and un-scrambling it from its hardware-specific configuration.
        • Use the pthread_key and GS to traverse the pthread structure and then the NSPR structure…
      • Find the last XPCJSContextInfo in the nsTArray in the XPCJSContextStack.
    • Pull the JSContext out, then get its JSStackFrame.
    • Recursively walk the frames (no iteration), manually/recursively (ugh) “converting” the 16-bit characters into 8-bit strings through violent truncation and dubious use of sprintf.
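
As an aside on the “retrieve it from gdb” option mentioned above, the two magic inputs can be pulled out of a replay-attached archer-gdb session with its Python support along these lines.  Whether the symbols resolve this cleanly depends on the build’s debug info, so treat this as a sketch rather than a recipe.

# run inside gdb's embedded Python interpreter while attached to the replay
import gdb

# the TLS index XPConnect uses for its per-thread data
tls_index = int(gdb.parse_and_eval("(unsigned int) XPCPerThreadData::gTLSIndex"))
# the address of NSPR's pthread bookkeeping structure
pt_book_addr = int(gdb.parse_and_eval("(unsigned long) &pt_book"))
print("gTLSIndex=%d pt_book=0x%x" % (tls_index, pt_book_addr))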

The obvious-ish limitation is that by relying on XPConnect’s understanding of the JS stack, we miss out on the most specific pure interpreter stack frames at any given time.  This is mitigated by the fact that XPConnect is like air to the Thunderbird code-base and that we still have the functions higher up the call stack.  This can also presumably be addressed by detecting when we are in the interpreter code and poking around.  It’s been a while since I’ve been in that part of SpiderMonkey’s guts… there may be complications with fast natives that could require clever stack work.

This blog post is getting rather long, so let’s just tie this off and say that I have extended doccelerator to be able to parse the trace files, spitting the output into its own CouchDB database.  Then doccelerator is able to expose that data via Kyle Scholz’s JSViz in an interactive force-directed graph that is related back to the documentation data.  The second screenshot demonstrates that double-clicking on the (blue) node that is the source of the tooltip brings up our documentation on GlodaIndexer.callbackDriver.  doccelerator hg repo; vprobe emmett script in hg repo.

See a live demo here.  It will eat your cpu although it will eventually back off once it feels that layout has converged.  You should be able to drag nodes around.  You should also be able to double-click on nodes and have the documentation for that function be shown *if it is available*.  We have no mapping for native frames or XBL stuff at this time.  Depending on what other browsers do when they see JS 1.8 code, it may not work in non-Firefox browsers.  (If they ignore the 1.8 file, all should be well.)  I will ideally fix that soon by adding an explicit extension mechanism.

displaying execution traces on a SIMILE timeline

chronimoz-timeline-1

chroniquery can now output JSON for display using the SIMILE timeline widget.  And I am finally done trying to figure out what was going wrong.  Click the pictures to experience the timeline sensation for yourself.

Related interesting things:

  • Supports a fancy, if specific, constraint mechanism.  For example, PrepareAndDispatch is only displayed in the context of nsTimerImpl::Fire using the following line in the config file (a sketch of how such a constraint might be evaluated follows this list):
[func@PrepareAndDispatch]
only_show_if=backtrace,4,nsTimerImpl::Fire
  • Supports displaying arguments in the title using the config file.  For example, displaying the observer service’s topic (plus putting it in its own band and giving it a nice prefix):
[class@nsObserverService]
prefixAlias=Obs:
band=context
list_args=aTopic
  • We actually get the dump() output by logging where _IO_fputs shows up.  We constrain its output by adding an extensible (if currently somewhat specialized) argument-based constraint.
  • Things in red are red because their return codes were exceptional, per my last blog post.  Things in other colors are that way because of the config file.
  • The bubbles that show up when you click on things tell you the return value and arguments for the call in question.  We could include a backtrace, but they are expensive to compute (currently), so I don’t.
  • All layout is using the timeline widget’s layout.  There is no correspondence between stack depth and anything else.
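
For the curious, the backtrace constraint above boils down to something like the following Python sketch.  The exact semantics of the depth argument and the config parsing are assumptions on my part, not chroniquery’s documented behavior.

# hypothetical sketch of evaluating only_show_if=backtrace,N,FuncName
def parse_only_show_if(value):
    # e.g. "backtrace,4,nsTimerImpl::Fire" -> look for the function within
    # the innermost 4 frames of the caller's backtrace
    kind, depth, func = value.split(",")
    assert kind == "backtrace"
    return int(depth), func

def should_show(backtrace, constraint_value):
    """backtrace: innermost-first list of function names."""
    depth, func = parse_only_show_if(constraint_value)
    return func in backtrace[:depth]

print(should_show(["PrepareAndDispatch", "SharedStub", "nsTimerImpl::Fire"],
                  "backtrace,4,nsTimerImpl::Fire"))   # -> True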

chronimoz-timeline-2

An important limitation is that all functions displayed here were explicitly chosen for display.  Some form of automated mechanism could be used, but unless chronicle-query starts caching more things, it would be really slow.  Here’s the command-line used for the above:

chron-xpc mozrun-9022.db jsontrace -t 1 -f nsObserverService::NotifyObservers -f nsThread::ProcessNextEvent -f nsTimerImpl::Fire -f nsMsgSearchSession::NotifyListenersDone -f nsMsgDBFolder::Shutdown  -f nsMsgLocalMailFolder::AddMessage -f nsMsgDatabase::Open -f nsMsgDatabase::ForceClosed -f nsMsgLocalMailFolder::GetDatabaseWithReparse -f nsMsgDBFolder::OnAnnouncerGoingAway -f nsMsgDBFolder::AddSubfolder -f nsMsgSearchOfflineMail::OpenSummaryFile -f nsMsgSearchSession::TimerCallback -f nsMsgSearchSession::TimeSlice -f nsMsgSearchSession::TimeSliceSerial -f nsMsgSearchScopeTerm::TimeSlice -f nsMsgXFVirtualFolderDBView::Open -f nsMsgXFVirtualFolderDBView::OnNewSearch -f nsMsgXFVirtualFolderDBView::OnSearchDone  -f js_GC -f PrepareAndDispatch -f nsMsgLocalMailFolder::UpdateFolder -f _IO_fputs

understanding where unit tests go wrong with object diffs

chronimoz-diff-1

In our last chroniquery adventure, we used roc’s chronicle-recorder to trace the execution of an xpcshell unit test and chroniquery (repo info) to use that trace to help us understand what was happening in some particularly confusing C-ish code.  Actually, we’re doing that this time too, but the bag of tricks now contains additional tricks.

The exciting things happening in the screenshot above, in order of increasing awesomeness:

  • The argument list is exposing out-parameters (via a heuristic though, it doesn’t know the XPCOM details.)  ex: “*folderInfo: 0x0” used to be “folderInfo: some pointer to hold an out-parameter that leaves us none the wiser.”
  • Chronifer now sees through ‘this’-adjusting trampolines/wrappers.  Previously, function calls might be hidden: the stack-pointer change that chronicle uses to detect function calls would land in the trampoline, whose bytecode was not associated with the actual function being called, and so the call could be assumed to be boring.  The trampoline would modify rdi and then perform a trivial jump (it could have just fallen through) into the function proper, which the existing trampoline detector would not notice.  (That detector would only notice an instant jump of more than 128 bytes; it was intended for dynamic linking indirection purposes.)
  • Pretty printers allow us to translate nsresult error codes to strings, display the strings referenced by nsString/nsCString instead of the raw structure (mData/mLength/mFlags), pierce nsCOMPtr’s, etc.
  • Pretty printers allow us to detect exceptional values and show them in red!  For example, that NS_ERROR_INVALID_POINTER is red because the nsresult pretty printer’s is_exceptional method returned True, the error bit being high.  (A sketch of the idea follows this list.)
  • The configuration method that lets us say what directories/files/classes/functions are interesting/boring has been augmented to let us specify arguments and the sub-fields of arguments to dump.  For example, in the above screenshot, we have added “dump_this=True” to the section labeled “[class@nsMsgLocalMailFolder]”.  As a result, any time a method is called on nsMsgLocalMailFolder, we dump the contents of its ‘this’.  If we only wanted a few fields from “this”, we could have used “dump_this=mPath,mIsServer,mName,mURI” instead.  You can see the end of a full object dump at the top of the screenshot.
  • The object dump output is diffed against the result of previous calls.  This is why we have that text in yellow.  The first entire GetDBFolderInfoAndDB call you see on the screen is a successful call (no red!) diffed against a previous successful call.  Mainly we see the name-derived things changing and things we expect to be distinct (file path, database).  Then we have an error case, with a few things we probably would not have noticed if we were doing this by hand.  I was somewhat surprised that mHaveParsedURI, the reference count, and m_numOfflineMsgLines showed up.  The others are not entirely shocking, but it’s useful to have them all there.  Without the diff, we would be looking at 65 lines of attribute values, and that is with us already benefitting from the pretty printers.
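
As an illustration of the pretty-printer idea referenced above (the class shape and method names here are a sketch, not chroniquery’s actual API): an nsresult is just a 32-bit value whose high bit is the failure bit, so is_exceptional only has to test that bit and to_string can consult a name table.

# sketch only; names/shape are illustrative, not chroniquery's API
NS_ERROR_NAMES = {0x80004003: "NS_ERROR_INVALID_POINTER"}  # tiny sample table

class NsresultPrinter(object):
    def __init__(self, value):
        self.value = value & 0xffffffff
    def is_exceptional(self):
        # the high (failure) bit being set is what earns the red coloring
        return bool(self.value & 0x80000000)
    def to_string(self):
        return NS_ERROR_NAMES.get(self.value, hex(self.value))

p = NsresultPrinter(0x80004003)
print("%s exceptional=%s" % (p.to_string(), p.is_exceptional()))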

Exciting possibilities opened by enhancements here:

  • The pretty printers and internal cleanups/refactoring make it feasible to express complicated data-based constraints simply.  For example, with a tractable amount of work I could opt to only see calls on the folder named “gabba3” using the constraint “mURI == ‘mailbox://nobody@Local%20Folders/gabba0’”.  This could be made particularly efficient if we make an assumption that the attribute is invariant once observed and then allow us to key off the pointer/memory region.
  • Code-sharing of pretty printers between my chroniquery code and my archer gdb plugins.
  • Now that archer-gdb has python inferior control, potential use of archer-gdb as an alternate back-end to chroniquery.
  • It would be neat to be able to expose chroniquery’s functionality via a web interface so that things could be expanded/drilled down without having to re-run things.  This would also help offset the non-trivial startup costs to chronicle-recorder’s own chronicle-query.  (Although arguably the startup time is nothing compared to the inefficiency of all the work chronicle-query does for a call and throws away when the next call pretty much wants that same data.)

understanding libmime using chroniquery and unit tests

chroniquery-trace-failed-return-value

Mailnews’ libmime is one of the harder modules to wrap one’s head around.  8-letter filenames where the first four letters tend to be “mime”, home-grown glib-style OO rather than actual C++, and intermingling of display logic with parsing logic do not make for a fun development or even comprehension experience.

Running an xpcshell unit test under roc’s chronicle-recorder and then processing it with my chroniquery stuff (repo info), we get a colorful HTML trace of the execution, somewhat broken out (note: pretend the stream_write/stream_completes are interleaved; they are paired).  Specific examples of libmime processing for the inquisitive (there are more if you check the former link though):

The thing that kickstarted this exciting foray is the error pictured in the screenshot above from an earlier run.  The return values are in yellow, and you can see where the error propagates from (the -1 cast to unsigned).  If you look at the HTML file, you will also note that the file stops after the error because the functions bail out as soon as they hit an error.

However, our actual problem and driving reason for the unit tests is the JS emitter choking on multipart/related in writeBody (which it receives by way of ‘output_fn’).  Why?  Why JS Emitter?!  (You can open the links and control-F along at home!)

  • We look at the stream_write trace for our multipart/related.  That’s odd, no ‘output_fn’ in there.

chroniquery-stream_complete-output_fn

  • We look at the stream_complete trace for the multipart/related.  There’s our ‘output_fn’!  And it’s got some weird HTML processing friends happening.  That must be to transform links to the related content into something our docshell can render.  This also explains why this processing is happening in stream_complete rather than stream_write… it must have buffered up the content so it could make sure it had seen all the ‘related’ documents and their Content-IDs so it could know how to transform the links.
  • Uh oh… that deferred processing might be doing something bad, since our consumer receives a stream of events.  We had to do something special for SMIME for just such a reason…
  • We check stream_complete for ‘mimeEmitterAddHeaderField’ calls, which the JS emitter keys off of to know what part is currently being processed and what type of body (if any) it should expect to receive.  Uh-oh, none in here.

chroniquery-stream_write-addheaderfield

  • We check stream_write for ‘mimeEmitterAddHeaderField’ calls, specifically with a “Content-Type” field.  And we find them.  The bad news is that they are apparently emitted as the initial streaming happens.  So we see the content-type for our “text/html”, then our “image/png”.  So when stream_complete happens, the last thing our JS emitter will have seen is “image/png” and it will not be expecting a body write.  (It will think that the text/html had no content whatsoever.)
  • Khaaaaaaaaaaaaaaaaaaaaaaaaan!

In summary, unit tests and execution tracing working together with pretty colors have helped us track down an annoying problem without going insane.  (libmime is a lot to hold in your head if you don’t hack on it every day.  also, straight debugger breakpoint fun generally also requires you to try and formulate and hold a complex mental model… and that’s assuming you don’t go insane from manually stepping aboot and/or are lucky with your choices of where you put your breakpoints.)  The more important thing is that next time I want to refresh my understanding of what libmime is up to, I have traces already available.  (And the mechanics to generate new ones easily.  But not particularly quickly.  chronicle and my trace-generating mechanism be mad slow, yo.  It may be much faster in the future to use the hopefully-arriving-soon archer-gdb python-driven inferior support, even if it can’t be as clever about function-call detection without cramming int 0x3’s all over the place.)

Useful files for anyone trying to duplicate my efforts: my ~/.chroniquery.cfg for this run, the unit test as it existed, and the command-line args were: trace -f mime_display_stream_write -f mime_display_stream_complete -c -H /tmp/trace3/trace.html --file-per-func-invoc

visualization of control-flow/data-flow analysis of sqlite EXPLAIN-ations

control-flow before

While doing some work on the gloda-based search targeted for beta 2, I came upon some query slowness and decided to look into it.  Because gloda queries are (conceptually, at least) dynamic, we also generate dynamic SQL.  Our schema is fairly normalized, and it’s not always obvious to me what I can expect to have going on.  EXPLAIN QUERY PLAN is a good litmus test for massive insanity, but it’s a bit concise.  For example, the troublesome SQL was as follows:

SELECT * FROM messages INNER JOIN messagesText ON messages.id = messagesText.rowid WHERE id IN (SELECT docid FROM messagesText WHERE subject MATCH "sutherland") AND deleted = 0 AND folderID IS NOT NULL AND messageKey IS NOT NULL ORDER BY date DESC LIMIT 100;

This nets us:

0|0|TABLE messages WITH INDEX deleted
1|1|TABLE messagesText VIRTUAL TABLE INDEX 1:
0|0|TABLE messagesText VIRTUAL TABLE INDEX 2:

Everything’s an index!  We win!  Sorta.  The index it chose is somewhat worrying if you think about it.  But really, it’s all quite nebulous.  I have basically no idea what is happening in there.  The good news is that EXPLAIN will tell us the actual opcodes in use.  The bad news is that it is quite hard to understand (104 opcodes), at least for this query.
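
As an aside, if you want to dump those opcodes yourself without any of the tooling below, Python’s sqlite3 module will happily run EXPLAIN for you; the column layout is addr, opcode, p1, p2, p3, p4, p5, comment, with the comment column only populated by a suitably built (debug-ish) SQLite.  The database path and the simplified query below are just placeholders, not the real gloda query shown above.

# quick-and-dirty EXPLAIN dump via Python's sqlite3 module
import sqlite3

def dump_explain(db_path, sql):
    conn = sqlite3.connect(db_path)
    for row in conn.execute("EXPLAIN " + sql):
        print("|".join(str(col) for col in row))
    conn.close()

# placeholder path/query for illustration
dump_explain("global-messages-db.sqlite",
             "SELECT * FROM messages WHERE deleted = 0 ORDER BY date DESC LIMIT 100")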

Python, graphviz, pygraphviz, and compulsive tool-building to the rescue!  My original plan was that data-flow analysis could make everything incredibly obvious as to what is going on.  I’m not sure I reached that point.  But in the process, the efforts required to make the system clever enough to do data analysis allow the control flow diagram to be quite pretty and have lots of useful information.  The main sign of data-flow analysis is that all cursor write operations have the list of cursors that data flowed from in parens.  Each cursor gets its own tango-ish color, and opcodes involving the cursor are colored by that cursor.

The major flaw in the data-flow analysis that springs to mind right now is that it ignores control flow.  An action that will only be taken because of a control-flow decision based on data retrieved from a cursor should ideally establish a relationship.  This is important because gloda throws around a lot of set intersections (although not in this case) in its queries, and it would be nice to be able to concisely express that.  The control-flow diagram is arguably orders of magnitude better than a raw EXPLAIN, but my hope is/was that the data analysis could make it trivial to figure out how things are being hooked up.  Given the amount of effort already expended and the present results, I figure I’m probably at the “control flow is good enough for the foreseeable future” stage of things.

In any event, the control-flow graph makes it (more) obvious that the outer loop is using the ‘deleted’ index to walk over *every non-deleted message in the database*.  A one-off idiom computes the full-text search and stashes it in an intermediary table.  As we then walk over every non-deleted message, we see if that message can be found in the full-text search result table.  If it is, our previously queued lazy-seek against the main messages table happens and we search against the full-text table to do our join.  And out pop results!

My hope for this query was that the deleted condition would be executed as a constraint as we were walking our result set, just like the folderID and messageKey constraints.  Making sure your queries are using the right index and helping the optimizer make the right call is a fairly textbook problem and the SQLite docs are actually pretty good on this.  For the sake of the example, I have dropped the index entirely.  (The ‘deleted’ index was added so we can quickly mark messages as needing deletion processing without slowing down deletes for the user.  It may turn out to be more beneficial to leave the field as-is, but use a separate table as our work queue.)

dataflow before

control-flow after

After dropping the index, we get the revised diagram.  The full-text search still happens completely before we produce any other results, but this is in keeping with our query.  (The query generation code is designed to handle the case where we have multiple constraints and must intersect the results of each.  It can clearly be optimized for the case where no primary-key intersection is required.)  The traversal of the full-text search result set is now the outer loop.  Deleted is now just filtering like folderID and messageKey, as expected.

The major lingering red flag is the block that calls Last() and Delete().  The conditional that makes the decision to do that is using “r1”, which is apparently implementing our limit logic inside the code.  This was confusing until I remembered that we’ve got an ORDER BY going on.  The Delete is just culling the results that will never potentially be seen.  The bad news is that we are doing our full-text search join prior to the ORDER BY culling.  So, my original fear that we are doing the JOIN no matter what still holds.  And that’s enough excitement for one day.

Given that any full-text search with ordered results will require retrieving the entire result set and correlating it against at least one table, I’m tempted to create separate tables for different windows of time, given the importance of recency.  Thoughts/input appreciated, as anyone who has read this far is probably very detail oriented…

UPDATE (Feb 25, 2009): The hg repo is http://hg.mozilla.org/users/bugmail_asutherland.org/grokexplain/.  It now does some actual value propagation (and has had various bugs fixed).  By default it ignores ‘Yield’ because the way in which they are used (basically generator/continuation) wreaks havoc on the CFG with just straightforward static analysis.  If you want the yields, pass “yields” on the command line.  Also, passing “debug” gets you the states of the registers at the entrance/exit of the “basic blocks”.  (Not technically basic blocks, but close enough for my purposes.)

LogSploder, logsploding its way to your logs soon! also, logsplosion!

logsploder screenshot with gloda

In our last logging adventure, we hooked Log4Moz up to Chainsaw.  As great as Chainsaw is, it did not meet all of my needs, least of all easy redistribution.  So I present another project in a long line of fantastically named projects… LogSploder!

The general setup is this:

  • log4moz with a concept of “contexts”, a change in logging function argument expectations (think FireBug’s console.log), a JSON formatter that knows to send the contexts over the wire as JSON rather than stringifying them, plus our SocketAppender from the Chainsaw fun.  The JSONed message representations get sent to…
  • LogSploder (a XULRunner app) listening on localhost.  It currently is context-centric, binning all log messages based on their context (a bare-bones sketch of this binning idea follows the list).  The contexts (and their state transitions) are tracked and visualized (using the still-quite-hacky visophyte-js).  Clicking on a context displays the list of log messages associated with that context and their timestamps.  We really should also display any other metadata hiding in the context, but we don’t.  (Although the visualization does grab stuff out of there for the dubious coloring choices.)
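
To give a flavor of the listening/binning side: this is not LogSploder’s actual code, and the port number and the assumption of newline-delimited JSON records carrying a string “context” field are mine, but a bare-bones Python equivalent might look like this.

# bare-bones illustration only; not LogSploder, wire format is assumed
import json, socket
from collections import defaultdict

def listen(port=4448):                      # port number is arbitrary here
    contexts = defaultdict(list)            # context id -> list of log records
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("127.0.0.1", port))
    server.listen(1)
    conn, _ = server.accept()
    buf = b""
    while True:
        chunk = conn.recv(4096)
        if not chunk:
            break
        buf += chunk
        while b"\n" in buf:                 # newline-delimited JSON records
            line, buf = buf.split(b"\n", 1)
            record = json.loads(line.decode("utf-8"))
            contexts[record.get("context", "(none)")].append(record)
    return contexts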

So, why, and what are we looking at?

When developing/using Thunderbird’s exciting new prototype message/contact/etc views, it became obvious that performance was not all that it could be.  As we all know, the proper way to optimize performance is to figure out what’s taking up the most time.  And the proper way to figure that out is to write a new tool from near-scratch.  We are interested in both comprehension of what is actually happening as well as a mechanism for performance tracking.

The screenshot above shows the result of issuing a gloda query with a constraint of one of my Inbox folders with a fulltext search for “gloda” *before any optimization*.  (We already have multiple optimizations in place!) The pinkish fill with greenish borders are our XBL result bindings, the blue-ish fill with more obviously blue borders are message streaming requests, and everything else (with a grey border and varying colors) is a gloda database query.  The white bar in the middle of the display is a XBL context I hovered over and clicked on.

The brighter colored vertical bars inside the rectangles are markers for state changes in the context.  The bright red markers are the most significant, they are state changes we logged before large blocks of code in the XBL that we presumed might be expensive.  And boy howdy, do they look expensive!  The first/top XBL bar (which ends up creating a whole bunch of other XBL bindings which result in lots of gloda queries) ties up the event thread for several seconds (from the red-bar to the end of the box).  The one I hovered over likewise ties things up from its red bar until the green bar several seconds later.

Now, I should point out that the heavy lifting of the database queries actually happens on a background thread, and without instrumentation of that mechanism, it’s hard for us to know when they are active or actually complete.  (We can only see the results when the main thread’s event queue is draining, and only remotely accurately when it’s not backlogged.)  But just from the visualization we can see that, at the very least, the first XBL dude is not being efficient with its queries.  The second expensive one (the hovered one) appears to be chewing up processor cycles without much help from background processes.  (There is one recent gloda query, but we know it to be cheap.  The message stream requests may have some impact since mailnews’ IMAP code is multi-threaded, although they really only should be happening on the main thread (might not be, though!).  Since the query was against one folder, we know that there is no mailbox reparse happening.)

Er, so, I doubt anyone actually cares about what was inefficient, so I’ll stop now.  My point is mainly that even with the incredibly ugly visualization and what not, this tool is already quite useful.  It’s hard to tell that just from a screenshot, since a lot of the power is being able to click on a bar and see the log messages from that context.  There’s obviously a lot to do.  Probably one of the lower-hanging pieces of fruit is to display context causality and/or ownership.  Unfortunately this requires explicit state passing or use of a shared execution mechanism; the trick of using thread-locals that log4j gets to use for its nested diagnostic contexts is simply not an option for us.

chroniquery does useful things!

Before pecobro, there was chroniquery (shiny, also shiny, texty), a python binding for roc’s chronicle-recorder omniscient debugger.  Now, there is still chroniquery.  It just does more stuff, and does the stuff it previously did more correctly.

In any event, what the awesome ‘screenshot’ (it’s from a console…) shows:

  • A trace from running chronicle-recorder against xpcshell on amd64 running a gloda unit test.
  • Javascript being ‘traced’ with extraction of the function name, script, and line number… everything you’d expect from your dtrace probes or your JSD hooks/venkman/firebug…
  • Hm… it seems to know something about an XPConnect call going down…
  • Oh noes!  It has escaped into the C++ codes!