Using VMWare Record/Replay and VProbes for low time-distortion performance profiling

profile-performance-graph-enumerateProps

The greatest problem with performance profiling is getting as much information as possible while affecting the results as little as possible.  For my work on pecobro I used mozilla’s JavaScript DTrace probes.  Because the probes are limited to notifications of all function invocations/returns with no discretion and there is no support for JS backtraces, the impact on performance was heavy.  Although I have never seriously entertained using chronicle-recorder (via chroniquery) for performance investigations, it is a phenomenal tool and it would be fantastic if it were usable for this purpose.

VMware introduced with Workstation 6/6.5 the ability to efficiently record VM execution by recording the non-deterministic parts of VM execution.  When you hit the record button it takes a snapshot and then does its thing.  For a 2 minute execution trace where Thunderbird is started up and gloda starts indexing and adaptively targets for 80% cpu usage, I have a 1G memory snapshot (the amount of memory allocated to the VM), a 57M vmlog file, and a 28M vmsn file.  There is also and a 40M disk delta file (against the disk snapshot), but I presume that’s a side effect of the execution rather than a component of it.

The record/replay functionality is the key to being able to analyze performance while minimizing the distortion of the data-gathering mechanisms.  There are apparently a lot of other solutions in the pipeline, many of them open source.  VMware peeps apparently also created a record/replay-ish mechanism for valgrind, valgrind-rr, which roc has thought about leveraging for chronicle-recorder.  I have also heard of Xen solutions to the problem, but am not currently aware of any usable solutions today.  And of course, there are many precursors to VMware’s work, but this blog post is not a literature survey.

There are 3 ways to get data out of a VM under replay, only 2 of which are usable for my purposes.

  1. Use gdb/the gdb remote target protocol.  The VMware server opens up a port that you can attach to.  The server has some built-in support to understand linux processes if you spoon feed it some critical offsets.  Once you do that, “info threads” lists every process in the image as a thread which you can attach to.  If you do the dance right, gdb provides perfect back-traces and you can set breakpoints and generally do your thing.  You can even rewind execution if you want, but since that means restoring state at the last checkpoint and running execution forward until it reaches the right spot, it’s not cheap.  In contrast, chronicle-recorder can run (process) time backwards, albeit at a steep initial cost.
  2. Use VProbes.  Using a common analogy, dtrace is like a domesticated assassin black bear that comes from the factory understanding English and knowing how to get you a beer from the fridge as well as off your enemies.  VProbes, in contrast, is a grizzly bear that speaks no English.  Assuming you can convince it to go after your enemies, it will completely demolish them.  And you can probably teach it to get you a beer too, it just takes a lot more effort.
  3. Use VAssert.  Just like asserts only happen in debug builds, VAsserts only happen during replay (but not during recording).  Except for the requirement that you think ahead to VAssert-enable your code, it’s awesome because, like static dtrace probes, you can use your code that already understands your code rather than trying to wail on things from outside using gdb or the like.  This one was not an option because it is Windows only as of WS 6.5.  (And Windows was not an option because building mozilla in a VM is ever so slow, and, let’s face it, I’m a linux kind of guy.  At least until someone buys me a solid gold house and a rocket car.)

profile-performance-graph-callbackDriver-doubleClicked

My first step in this direction has been using a combination of #1 and #2 to get javascript backtraces using a timer-interval probe.  The probe roughly does the following:

  • Get a pointer to the current linux kernel task_struct:
    • Assume we are uniprocessor and retrieve the value of x86_hw_tss.sp0 from the TSS struct for the first processor.
    • Now that we know the per-task kernel stack pointer, we can find a pointer to the task_struct at the base of the page.
  • Check if the name of our task is “thunderbird-bin” and bail if it is not.
  • Pull the current timestamp from the linux kernel maintained xtime.  Ideally we could use VProbe’s getsystemtime function, but it doesn’t seem to work and/or is not well defined.  Our goal is to have a reliable indicator of what the real time is at this stage in the execution, because with a rapidly polling probe our execution will obviously be slower than realtime.  xtime is pretty good for this, but ticks at 10ms out of box (Ubuntu 9.04 i386 VM-targeted build), which is a rather limited granularity.  Presumably we can increase its tick rate, but not without some additional (though probably acceptable) time distortion.
  • Perform a JS stack dump:
    • Get XPConnect’s context for the thread.
      • Using information from gdb on where XPCPerThreadData::gTLSIndex is, load the tls slot.  (We could also just directly retrieve the tls slot from gdb.)
      • Get the NSPR thread private data for that TLS slot.
        • Using information from gdb on where pt_book is located, get the pthread_key for NSPR’s per-thread data.
        • Using the current task_struct from earlier, get the value of the GS segment register by looking into tls0_base and un-scrambling it from its hardware-specific configuration.
        • Use the pthread_key and GS to traverse the pthread structure and then the NSPR structure…
      • Find the last XPCJSContextInfo in the nsTArray in the XPCJSContextStack.
    • Pull the JSContext out, then get its JSStackFrame.
    • Recursively walk the frames (no iteration), manually/recursively (ugh) “converting” the 16-bit characters into 8-bit strings through violent truncation and dubious use of sprintf.

The obvious-ish limitation is that by relying on XPConnect’s understanding of the JS stack, we miss out on the most specific pure interpreter stack frames at any given time.  This is mitigated by the fact that XPConnect is like air to the Thunderbird code-base and that we still have the functions higher up the call stack.  This can also presumably be addressed by detecting when we are in the interpreter code and poking around.  It’s been a while since I’ve been in that part of SpiderMonkey’s guts… there may be complications with fast natives that could require clever stack work.

This blog post is getting rather long, so let’s just tie this off and say that I have extended doccelerator to be able to parse the trace files, spitting the output into its own CouchDB database.  Then doccelerator is able to expose that data via Kyle Scholz‘s JSViz in an interactive force-directed graph that is related back to the documentation data.  The second screenshot demonstrates that double-clicking on the (blue) node that is the source of the tooltip brings up our documentation on GlodaIndexer.callbackDriver.  doccelerator hg repovprobe emmett script in hg repo.

See a live demo here.  It will eat your cpu although it will eventually back off once it feels that layout has converged.  You should be able to drag nodes around.  You should also be able to double-click on nodes and have the documentation for that function be shown *if it is available*.  We have no mapping for native frames or XBL stuff at this time.  Depending on what other browsers do when they see JS 1.8 code, it may not work in non-Firefox browsers.  (If they ignore the 1.8 file, all should be well.)  I will ideally fix that soon by adding an explicit extension mechanism.

dxrpy with references against mailnews, plus bespin wishes

dxrpy-nsimsgfolder-delete

bsmedberg‘s dxrpy likes to generate documentation using dehydra (bsmedberg’s blog posts on this: 1 2).  I have added support for references/referencedBy extraction and otherwise wailed on it a little bit, and run this against JUST comm-central’s mailnews/ directory.  The collection process is still strictly serial, and there’s a lot of stuff in mozilla-central/comm-central, which is why the subset.  You can find the outputs here.  My repository with my changes is here.  There are all kinds of issues and limitations, of course.

The picture above is of nsIMsgFolder::Delete(), the major advantage over mxr being that we’re not going to have ambiguity from completely unrelated Delete methods.  Note that in the mxr link I am cheating by not filtering to mailnews/, but the point is that with dxrpy you don’t need to do that.  The picture below shows a slice of the ‘references’ information for nsMsgDBView::Sort.  Another thing you could see is who is using nsVoidArray (in mailnews/!) by virtue of referencing its destructor.

dxrpy-nsmsgdbview-sort-references

My interest in this is a continuation of my efforts on pecobro to provide an interface to expose performance information from various means of instrumentation.  That such a thing requires a useful interface substrate is just a nice surprise.  The dehydra C++ stuff complements the references/referenced by work I did for javascript in pecobro.  Eventually I would hope to use the jshydra infrastructure to accomplish the js stuff since the parser pecobro uses is an unholy/brittle thing, and a usable jshydra would have all kinds of other exciting uses.

Things I would be very happy for other people to magically do so I don’t need to do them:

  • Magic up bespin so that it can pull files out of hg.mozilla.org.  One of the significant reasons I did not do more work on pecobro is that the approach of syntax highlighting in DOM explodes quite badly when confronted with source files you might find in mozilla-central/comm-central.
  • Make dxrpy generate json files representing the information it knows about each source file (positions => semantic info) as well as each function (positions in a function of references=> semantic info on those references).
  • Make bespin retrieve those json files when the file is loaded so that its naive syntax highlighting can be augmented to correlate with that semantic info.  One could attempt to be clever and use simple regexes or something to allow this information to be maintained while you edit the file, but I only need bespin to be read-only.
  • Make it easy for bespin to use that info to follow references and return without sucking.
  • Make something that extracts stack-traces from socorro and clusters this information by class, sticking it in static json files too, why not.
  • Make bespin use its semantic and just general line number info to grab the stack-trace crash info so that it can highlight code in bespin that is suspected to be crashy.  (You would probably want to get clever about determining what is a potentially innocent frame in the trace and what should be considered actually crasy.  Knowing when you’re calling out of your module is probably a simple and useful heuristic.)
  • Make something that chews the crash stacks with the dehydra/dxrpy data to make a visualization of where the crasy stuff is happening, module-wise.)
  • Make bespin retrieve json files of file/class-oriented json files that my performance tooling generates, letting me display cool sparkline/sparkbar things in the source.
  • Cleanup/infrastructurization of dxrpy’s generation/display mechanisms.  My greatest need of dxrpy is actually a coherent framework that can easily integrate with bespin.  The tricky part for my performance work is the javascript side of the equation, and the pecobro tooling actually does pretty well there.  It just wants to have a migration path and an interface that doesn’t cause firefox to lock up every time I change syntax-highlighted documents…

And if anyone already has a tool that is basically a statistical profiler that can do stack-traces that fully understand the interleaved nature of the C++ and spidermonkey JS stacks, that would be great.  (Bonus points for working with the JIT enabled!)  My dream would be a timer-driven VProbe based solution; then I could run things against a replay and avoid distortion of the timescale while also allowing for a realistic user experience in generating the trace.  I would like to avoid having to do trace reconstruction as I have done most recently with dtrace and chronicle-recorder.  I would not be opposed to a PythonGDB solution, especially if VMware 6.5 still provides a gdb remote thinger (did the eclipse UI make that disappear?), although that would probably still be more crossings than I would like and perhaps slower (if less distorting) than an in-process or ptrace based solution.

pecobro, the tell-you-who-reads/writes-what performance code browser

No cool pictures, but I’ve enhanced and exposed the global reader/writer understanding of javascript code.  In other words, pecobro now does a passable job at telling you what global variables a given javascript file reads from/writes to, and who else writes to/reads from those variables.  False negatives are expected (don’t rely on things to be exhaustive), and false positives are quite conceivable.

As an example starting point, take a look at messageWindow.js’s global reads or global writes.  From either of these you can click on other files in the sidebar to go to them.  There is no execution trace, so the overview diagram won’t be any help.  Be sure you have some free memory available and won’t try and hurt me if Firefox does something crazy.  Note: for long documents, some delay is expected as it attempts to apply various fix-ups.  Also note: clicking on things in the global reads/global writes tab won’t get you anywhere for now.

A quick example of a global read from that file (and who writes to that global):

gDBView

  • Top Level: commandglue.js
  • Function: ClearThreadPane (in commandglue.js)
  • Function: CreateBareDBView (in commandglue.js)
  • Function: RerootFolder (in commandglue.js)
  • Function: SwitchView (in commandglue.js)
  • Function: openFolderTab (in mailWindowOverlay.js)
  • Function: setMailTabState (in mailWindowOverlay.js)
  • Function: restorePreSearchView (in searchBar.js)
  • Function: MsgGroupBySort (in threadPane.js)

A quick example of a global write from that file (and who reads from that global):

SelectFolder

  • Function: OpenMessageByHeader (in mailContextMenus.js)
  • Function: OpenInboxForServer (in mailWindow.js)
  • Function: selectFolder (in mailWindow.js)
  • Function: openFolderTab (in mailWindowOverlay.js)
  • Function: LoadNavigatedToMessage (in messageWindow.js)
  • Function: LoadNavigatedToMessage (in msgMail3PaneWindow.js)
  • Function: OnLocationTreeSelect (in msgMail3PaneWindow.js)
  • Function: SelectServer (in msgMail3PaneWindow.js)
  • Function: loadStartFolder (in msgMail3PaneWindow.js)
  • Function: RenameFolder (in widgetglue.js)
  • Function: loadInboxForNewAccount (in accountUtils.js)
  • Function: DropOnFolderTree (in messengerdnd.js)
  • Function: CrossFolderNavigation (in msgViewNavigation.js)

The code, as always, is available at http://hg.mozilla.org/users/bugmail_asutherland.org/pecobro/

master control pecobro, the overkill performance code browser

The title isn’t true yet, but it’s disturbingly almost-true.  Pecobro’s hackish underpinnings have been “accidentally” replaced with surprisingly forward-looking code capable of supporting a much fancier feature set.  (I didn’t mean to, but I got tricked because the incremental cost to doing the ‘right thing’ was always so low.)

The trace that you can see here, by clicking on any of this text what has coloring (and using firefox 3 in a session that you don’t mind if it crashes), is of running Mark Banner (Standard8)‘s Thunderbird bloatTest on OS X with DTrace probes enabled, but without actually doing the bloat testing.  So Thunderbird starts up, opens the address book, closes it, opens the compose window, closes it, and then quits.

Here is a preliminary processed trace from a run triggering bug 296453.  Be forewarned that there is some missing information from the underlying trace and so it’s not all that it could be.  I think the XPConnect probes need to be fleshed out slightly more (and processed).

The code (both pecobro and mozilla codebase patches) is at: http://hg.mozilla.org/users/bugmail_asutherland.org/

What crazy overkill things can pecobro do now?

  • Parse all of the javascript code used in Thunderbird, including nearly all (maybe all?) of the 1.x enhancements.  The parser is still a bit hackish, especially when it comes to support of regexes, but it actually parses things.  Thanks to Chris Lambrou for posting his initial JavaScript.g antlr3 grammar (and BSD licensing it) that I used as a basis for my hacky version.  This was certainly a learning experience; I now know that javascript can never replace python in my heart…
  • Find all of the places functions can be defines in javascript code and do a pretty good job at figuring out reasonable names for them.  This even includes “(new)? Function()”.  This allows us to use the function-info DTrace probes to use line number information to identify functions.  We do this so that we can identify anonymous functions.
  • Handle XBL as well as we handle pure-javascript files.  (Parsing/AST fix-ups/syntax highlighting with AST traversal/etc.)
  • Parse/interpret disturbingly large portions of makefiles.  Sure, it ignores the rules, and I’ve only implemented the functions I require, but it does all the conditional and include stuff with variable expansion, including some key functions like foreach/filter/etc.
  • Parse jar manifests sufficiently well to figure out what file ends up at what chrome path.
  • Cache its javascript parse results at both the AST and (sorta-semantic) processing level, among other things.  Hooray for cerealizer which dares persist/de-persist that which pickle dare not.  (Nested classes.)  (Note: We don’t cache XBL stuff yet, and the ASTs are large enough that de-persisting them can be quite perceptible.)

What is not crazy, but still nice:

  • The traces are now much more reliable thanks to modifications to the Mozilla USDT probes to provide sufficient information for us to distinguish between JavaScript execution contexts (namely, JSContext).  Modifications were also made to avoid the ‘guessing’/ghost value problems that happen when native functions were called.
  • A foolish bug in the sparkbar calculation has been fixed.  aka “Now With Accuracy”!
  • If a function in a trace called another function or was called by another function, this will be displayed inline in the source display.
  • Infer ‘native’ functions (with the above additional modifications to the Mozilla USDT probes), assigning them to an object.  This ends up being Object, ChromeWindow, or other native Class type.  Some of this may be mooted by getting the XPConnect probes reliably in the mix.

What doesn’t it do?

  • Add the XPConnect js_Invoke probes; I punted on that because that was turning out to be difficult to get right.
  • It ignores .xul files for now.  Although xul files primarily appears in a caller context (as told by function-info probes), they can also be a callee when someone pulls a fast one and inlines some simple code in a script tag.  We brutally mis-attribute the call to the previous function when this happens.  This will eventually get resolved because we will need to understand .xul files for namespace reasons.  Also, people sound like they’re interested in who overlays what and the like, and that’s sorta right up our alley thanks to all the overkill stuff we do.
  • Exhaustively determine reads from/contributions to the ‘global’ (window or what not) namespace/scope.  The groundwork is there in terms of understanding the contribution of top-level statements to the namespace or its reads from it, but we don’t traverse into functions.
  • Associate functions with an object/type (ignoring the native function case).  This requires more semantic understanding.
  • Clicking on functions still doesn’t do anything.  I disabled that before my first post on pecobro due to firefox-crashing issues, and still haven’t gotten around to resolving it.
  • A usable overview visualization.  The overview diagram has become cluttered by the presence of so many ‘relevant’ (touched) files.

pecobro: the performance code browser (early stage)

At the beginning of last week, I had gotten dtrace working on a Mac Mini using the Mozilla javascript-provider probes.  Very cool stuff, but it left me with several questions about what would really be best to do next:

  • How do I best understand what I’m seeing?  (Most of the codebase is brand new to me…)
  • How do I share the data with others in a way that is both comprehensible and allows them to draw their own conclusions from the data?
  • What can I do to reduce the effort required to work on performance problems?

I was tempted to just try and just dig into the system so I could have something to show immediately, but I knew it would still take a while to see the big picture just using an editor/ctags/lxr/opengrok, even informed by dtrace.  And even then, that big picture doesn’t scale well; whatever picture I managed to formulate would be stuck inside my head…

So my solution was to try and build a tool that could help me accomplish my short-term goals soon, and have the potential to grow into a usable solution to all of the above… eventually.  The goal, in a nutshell, is to provide a code browser for javascript that is able to integrate performance information (retrieved from traces) alongside the code.  Seeing that lxr/mxr and opengrok didn’t understand javascript or XBL all that well, it also seemed feasible to try and improve on their browsing capabilities for javascript.  A far-down-the-road goal is also to be able to pull in information from the underlying C++ code as well, potentially leveraging dehydra, etc.  (This would primarily be for understanding what happens when we leave the javascript layer, not trying to be the same solution for C++ space.)

So what can it do so far?  You can go try it for yourself if you like as long as you keep your expectations very low and realize the current state does not reflect all of the bullet points below.  Also, you probably want firefox 3.0.  Or you can read my bullet points:

  • Parse custom DTrace script output!  The Mozilla DTrace probe points could probably use a little love to improve what we are able to get out.  Also, I think it’s betraying us somewhere.
  • Parse JavaScript! Sorta!  (I hacked in support for the regular expression syntax, but I haven’t corrected the ambiguity with division, so things with division break.  Also, there’s at least one or two other glitches that cause early termination.) [Yay antlr!]
  • Parse XBL!  Even with entity inlining!  Even when people put #ifdefs in the XML document! Sorta!  We don’t actually do anything intelligent with the XBL right now or with its JavaScript, but it won’t take much to get that much improved. [Yay elementtree!]
  • Visualize some stuff!  Inter-file relationship graph in the overview.  In the code and ‘Funcs’ sidebar tab you get a sparkbar where each bar represents a time interval.  The height of the par is the percentage of possible time we could have spent in that time interval.  Red means we belive that time was spent in the function itself, green means we think we spent that time in calls to other functions. [Yay visophyte!]
  • Navigate with history!  Click on the overview graph and you go to things.  Click on the file names in the ‘Files’ list and you go to the files.  I tried to make it so you could click on function names in the side bars to go to them, but jquery.scrollTo and/or firefox 3.0b5 had serious crashing issues.  [Yay jquery, jquery.history!]
  • See syntax-highlighted code with random headings intertwined (shows the parser worked) and potentially a visualization.  [Yay pygments!]

My hope in the near-term is to fix the outright bugs (parsing issues), get XBL going, and then augment the function information with more trace-derived data including more traditional call-stacks, etc.  Then the tool should be sufficiently usable that my immediate focus can change to creating automated tests to actually gather performance/execution traces so we can use the tool for what I started it for.  This may also get shelved for a while if it turns out that we need action (patches) immediately.