Using VMware Record/Replay and VProbes for low time-distortion performance profiling

profile-performance-graph-enumerateProps

The greatest problem with performance profiling is getting as much information as possible while affecting the results as little as possible.  For my work on pecobro I used Mozilla's JavaScript DTrace probes.  Because the probes fire for every function invocation/return with no way to be selective, and because there is no support for JS backtraces, the performance impact was heavy.  Although I have never seriously entertained using chronicle-recorder (via chroniquery) for performance investigations, it is a phenomenal tool and it would be fantastic if it were usable for this purpose.

With Workstation 6/6.5, VMware introduced the ability to efficiently record VM execution by capturing only the non-deterministic parts of that execution.  When you hit the record button it takes a snapshot and then does its thing.  For a 2 minute execution trace where Thunderbird starts up and gloda begins indexing while adaptively targeting 80% CPU usage, I have a 1G memory snapshot (the amount of memory allocated to the VM), a 57M vmlog file, and a 28M vmsn file.  There is also a 40M disk delta file (against the disk snapshot), but I presume that's a side effect of the execution rather than a component of it.

The record/replay functionality is the key to being able to analyze performance while minimizing the distortion introduced by the data-gathering mechanisms.  There are apparently a lot of other solutions in the pipeline, many of them open source.  VMware peeps also created a record/replay-ish mechanism for valgrind, valgrind-rr, which roc has thought about leveraging for chronicle-recorder.  I have also heard of Xen-based approaches to the problem, but am not aware of any that are usable today.  And of course, there are many precursors to VMware's work, but this blog post is not a literature survey.

There are 3 ways to get data out of a VM under replay, only 2 of which are usable for my purposes.

  1. Use gdb/the gdb remote target protocol.  The VMware server opens up a port that you can attach to.  The server has some built-in support to understand linux processes if you spoon-feed it some critical offsets.  Once you do that, “info threads” lists every process in the image as a thread which you can attach to.  If you do the dance right, gdb provides perfect back-traces and you can set breakpoints and generally do your thing.  You can even rewind execution if you want, but since that means restoring state at the last checkpoint and running execution forward until it reaches the right spot, it’s not cheap.  In contrast, chronicle-recorder can run (process) time backwards, albeit at a steep initial cost.
  2. Use VProbes.  Using a common analogy, dtrace is like a domesticated assassin black bear that comes from the factory understanding English and knowing how to get you a beer from the fridge as well as off your enemies.  VProbes, in contrast, is a grizzly bear that speaks no English.  Assuming you can convince it to go after your enemies, it will completely demolish them.  And you can probably teach it to get you a beer too, it just takes a lot more effort.
  3. Use VAssert.  Just like asserts only happen in debug builds, VAsserts only happen during replay (but not during recording).  Apart from the requirement that you think ahead and VAssert-enable your code, it’s awesome because, like static dtrace probes, you get to use code that already understands your program rather than trying to wail on things from outside using gdb or the like.  This one was not an option because it is Windows-only as of WS 6.5.  (And Windows was not an option because building mozilla in a VM is ever so slow, and, let’s face it, I’m a linux kind of guy.  At least until someone buys me a solid gold house and a rocket car.)

profile-performance-graph-callbackDriver-doubleClicked

My first step in this direction has been using a combination of #1 and #2 to get JavaScript backtraces using a timer-interval probe.  The probe roughly does the following:

  • Get a pointer to the current linux kernel task_struct:
    • Assume we are uniprocessor and retrieve the value of x86_hw_tss.sp0 from the TSS struct for the first processor.
    • Now that we know the per-task kernel stack pointer, we can find a pointer to the task_struct at the base of the page.
  • Check if the name of our task is “thunderbird-bin” and bail if it is not.
  • Pull the current timestamp from the Linux-kernel-maintained xtime.  Ideally we could use VProbes’ getsystemtime function, but it doesn’t seem to work and/or is not well defined.  Our goal is to have a reliable indicator of what the real time is at this stage in the execution, because with a rapidly polling probe our execution will obviously be slower than realtime.  xtime is pretty good for this, but ticks at 10ms out of the box (Ubuntu 9.04 i386 VM-targeted build), which is rather limited granularity.  Presumably we can increase its tick rate, but not without some additional (though probably acceptable) time distortion.
  • Perform a JS stack dump:
    • Get XPConnect’s context for the thread.
      • Using information from gdb on where XPCPerThreadData::gTLSIndex is, load the tls slot.  (We could also just directly retrieve the tls slot from gdb; a sketch of pulling these values out of gdb follows this list.)
      • Get the NSPR thread private data for that TLS slot.
        • Using information from gdb on where pt_book is located, get the pthread_key for NSPR’s per-thread data.
        • Using the current task_struct from earlier, get the value of the GS segment register by looking into tls0_base and un-scrambling it from its hardware-specific configuration.
        • Use the pthread_key and GS to traverse the pthread structure and then the NSPR structure…
      • Find the last XPCJSContextInfo in the nsTArray in the XPCJSContextStack.
    • Pull the JSContext out, then get its JSStackFrame.
    • Recursively walk the frames (no iteration), manually/recursively (ugh) “converting” the 16-bit characters into 8-bit strings through violent truncation and dubious use of sprintf.
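
For reference, the constants that the probe steps above need from gdb can be pulled out of the attached replay session with (archer-)gdb's Python scripting.  This is only a minimal sketch under the assumption that symbols for the relevant Mozilla libraries are loaded; the helper functions are mine, not part of any existing tool.

# Run inside a gdb session attached to the VMware replay target (hypothetical
# helper; assumes debug symbols for XPConnect and NSPR are available).
import gdb

def value_of(expr):
    """Evaluate an expression and return it as a plain integer."""
    return int(gdb.parse_and_eval(expr))

def addr_of(symbol):
    """Return the address of a symbol, for pasting into the Emmett probe."""
    return int(gdb.parse_and_eval("&" + symbol))

# The two pieces of information the probe wants from gdb:
print("XPCPerThreadData::gTLSIndex = %d" % value_of("XPCPerThreadData::gTLSIndex"))
print("&pt_book = 0x%x" % addr_of("pt_book"))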

The obvious-ish limitation is that by relying on XPConnect’s understanding of the JS stack, we miss the innermost pure-interpreter stack frames at any given time.  This is mitigated by the fact that XPConnect is like air to the Thunderbird code-base and that we still have the functions higher up the call stack.  This could presumably also be addressed by detecting when we are in interpreter code and poking around.  It’s been a while since I’ve been in that part of SpiderMonkey’s guts… there may be complications with fast natives that could require clever stack work.

This blog post is getting rather long, so let’s just tie this off and say that I have extended doccelerator to be able to parse the trace files, spitting the output into its own CouchDB database.  Then doccelerator is able to expose that data via Kyle Scholz‘s JSViz in an interactive force-directed graph that is related back to the documentation data.  The second screenshot demonstrates that double-clicking on the (blue) node that is the source of the tooltip brings up our documentation on GlodaIndexer.callbackDriver.  Links: the doccelerator hg repo and the vprobe emmett script in hg.
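
To give a flavor of what "spitting the output into its own CouchDB database" involves, here is a minimal sketch using the couchdb-python package; the database name and document shape are made up for illustration and are not doccelerator's actual schema.

# Illustrative only: store parsed probe samples in CouchDB via couchdb-python.
import couchdb

server = couchdb.Server("http://localhost:5984/")
db = server.create("probe-trace")  # use server["probe-trace"] if it already exists

def store_sample(timestamp, frames):
    """Store one timer-interval sample; frames is a list of JS function names,
    innermost frame first."""
    db.save({
        "type": "js-stack-sample",
        "timestamp": timestamp,
        "frames": frames,
    })

store_sample(12.34, ["GlodaIndexer.callbackDriver", "(anonymous)"])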

See a live demo here.  It will eat your CPU, although it will eventually back off once it feels that layout has converged.  You should be able to drag nodes around.  You should also be able to double-click on nodes and have the documentation for that function be shown *if it is available*.  We have no mapping for native frames or XBL stuff at this time.  Depending on what other browsers do when they see JS 1.8 code, it may not work in non-Firefox browsers.  (If they ignore the 1.8 file, all should be well.)  I will ideally fix that soon by adding an explicit extension mechanism.

displaying execution traces on a SIMILE timeline

chronimoz-timeline-1

chroniquery can now output JSON for display using the SIMILE timeline widget.  And I am finally done trying to figure out what was going wrong.  Click the pictures to experience the timeline sensation for yourself.

Related interesting things:

  • Supports a fancy, if rather specific, constraint mechanism.  For example, PrepareAndDispatch is only displayed in the context of nsTimerImpl::Fire using the following lines in the config file:
[func@PrepareAndDispatch]
only_show_if=backtrace,4,nsTimerImpl::Fire
  • Supports displaying arguments in the title using the config file.  For example, displaying the observer service’s topic (plus putting it in its own band and giving it a nice prefix):
[class@nsObserverService]
prefixAlias=Obs:
band=context
list_args=aTopic
  • We actually get the dump() output by logging where _IO_fputs shows up.  We constrain its output by adding an extensible (if currently somewhat specialized) argument-based constraint.
  • Things in red are red because their return codes were exceptional, per my last blog post.  Things in other colors are that way because of the config file.
  • The bubbles that show up when you click on things tell you the return value and arguments for the call in question.  We could include a backtrace, but they are expensive to compute (currently), so I don’t.
  • All layout is using the timeline widget’s layout.  There is no correspondence between stack depth and anything else.
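
For the curious, the JSON the timeline widget consumes is quite simple; roughly, the conversion looks like the sketch below.  The (function, start, end, color) call-record tuples are invented for illustration and are not chroniquery's internal representation; only the event fields are what the SIMILE Timeline JSON event source expects.

# Rough sketch of emitting SIMILE Timeline event JSON from call records.
import json

def calls_to_timeline(calls):
    """calls is a list of (function_name, start_iso, end_iso, color) tuples."""
    events = []
    for func, start, end, color in calls:
        events.append({
            "start": start,           # ISO 8601 timestamps
            "end": end,
            "durationEvent": True,    # draw a bar rather than a point
            "title": func,
            "color": color,
        })
    return json.dumps({"dateTimeFormat": "iso8601", "events": events}, indent=2)

print(calls_to_timeline([
    ("nsTimerImpl::Fire", "2009-06-01T00:00:01Z", "2009-06-01T00:00:02Z", "#cc0000"),
]))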

chronimoz-timeline-2

An important limitation is that all functions displayed here were explicitly chosen for display.  Some form of automated mechanism could be used, but unless chronicle-query starts caching more things, it would be really slow.  Here’s the command-line used for the above:

chron-xpc mozrun-9022.db jsontrace -t 1 -f nsObserverService::NotifyObservers -f nsThread::ProcessNextEvent -f nsTimerImpl::Fire -f nsMsgSearchSession::NotifyListenersDone -f nsMsgDBFolder::Shutdown  -f nsMsgLocalMailFolder::AddMessage -f nsMsgDatabase::Open -f nsMsgDatabase::ForceClosed -f nsMsgLocalMailFolder::GetDatabaseWithReparse -f nsMsgDBFolder::OnAnnouncerGoingAway -f nsMsgDBFolder::AddSubfolder -f nsMsgSearchOfflineMail::OpenSummaryFile -f nsMsgSearchSession::TimerCallback -f nsMsgSearchSession::TimeSlice -f nsMsgSearchSession::TimeSliceSerial -f nsMsgSearchScopeTerm::TimeSlice -f nsMsgXFVirtualFolderDBView::Open -f nsMsgXFVirtualFolderDBView::OnNewSearch -f nsMsgXFVirtualFolderDBView::OnSearchDone  -f js_GC -f PrepareAndDispatch -f nsMsgLocalMailFolder::UpdateFolder -f _IO_fputs

understanding where unit tests go wrong with object diffs

chronimoz-diff-1

In our last chroniquery adventure, we used roc‘s chronicle-recorder to trace the execution of an xpcshell unit test and chroniquery (repo info) to use that trace to help us understand what was happening in some particularly confusing C-ish code.  Actually, we’re doing that this time too, but the bag of tricks now contains additional tricks.

The exciting things happening in the screenshot above, in order of increasing awesomeness:

  • The argument list exposes out-parameters (via a heuristic, though; it doesn’t know the XPCOM details).  For example, “*folderInfo: 0x0” used to be “folderInfo: some pointer to hold an out-parameter that leaves us none the wiser.”
  • Chronifer now sees through ‘this’-adjusting trampolines/wrappers.  Previously, function calls could be hidden because the stack-pointer change that chronicle uses to detect function calls landed in the trampoline, whose bytecode was not associated with the actual function being called, and so the call could be assumed to be boring.  The trampoline would modify rdi and then perform a trivial jump (it could just as well have fallen through) into the function proper, which the existing trampoline detector would not notice.  (That detector would only notice a jump of more than 128 bytes; it was intended for dynamic-linking indirection purposes.)
  • Pretty printers allow us to translate nsresult error codes to strings, display the strings referenced by nsString/nsCString instead of the raw structure (mData/mLength/mFlags), pierce nsCOMPtrs, etc.
  • Pretty printers also allow us to detect exceptional values and show them in red!  For example, that NS_ERROR_INVALID_POINTER is red because the nsresult pretty printer’s is_exceptional method returned True, the error bit being high.  (A minimal sketch of such a printer follows this list.)
  • The configuration method that lets us say what directories/files/classes/functions are interesting/boring has been augmented to let us specify arguments and the sub-fields of arguments to dump.  For example, in the above screenshot, we have added “dump_this=True” to the section labeled “[class@nsMsgLocalMailFolder]”.  As a result, any time a method is called on nsMsgLocalMailFolder, we dump the contents of its ‘this’.  If we only wanted a few fields from “this”, we could have used “dump_this=mPath,mIsServer,mName,mURI” instead.  You can see the end of a full object dump at the top of the screenshot.
  • The object dump output is diffed against the result of previous calls.  This is why we have that text in yellow.  The first entire GetDBFolderInfoAndDB call you see on the screen is a successful call (no red!) diffed against a previous successful call.  Mainly we see the name-derived things changing and things we expect to be distinct (file path, database).  Then we have an error case, with a few things we probably would not have noticed if we were doing this by hand.  I was somewhat surprised that mHaveParsedURI, the reference count, and m_numOfflineMsgLines showed up.  The others are not entirely shocking, but it’s useful to have them all there.  Without the diff, we would be looking at 65 lines of attribute values, and that is with us already benefitting from the pretty printers.
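
To make the pretty-printer idea concrete, here is a minimal sketch of an nsresult printer along the lines described above.  The to_string/is_exceptional interface mirrors what is described; the exact shape of the real chroniquery printers is assumed, and only a couple of error codes are filled in for illustration.

# Minimal sketch of an nsresult pretty printer (assumed shape, not the real one).
NSRESULT_NAMES = {
    0x80004003: "NS_ERROR_INVALID_POINTER",
    0x80004005: "NS_ERROR_FAILURE",
    # ... more codes as needed
}

class NsresultPrinter:
    def __init__(self, value):
        self.value = value & 0xffffffff

    def is_exceptional(self):
        # Failure nsresults have the high (error) bit set; that is what
        # makes them render in red.
        return bool(self.value & 0x80000000)

    def to_string(self):
        return NSRESULT_NAMES.get(self.value, "nsresult(0x%08x)" % self.value)

p = NsresultPrinter(0x80004003)
print("%s exceptional=%s" % (p.to_string(), p.is_exceptional()))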

Exciting possibilities opened by enhancements here:

  • The pretty printers and internal cleanups/refactoring make it feasible to express complicated data-based constraints simply.  For example, with a tractable amount of work I could opt to only see calls on the folder named “gabba3” using the constraint “mURI == ‘mailbox://nobody@Local%20Folders/gabba0′”.  This could be made particularly efficient if we make an assumption that the attribute is invariant once observed and then allow us to key off the pointer/memory region.
  • Code-sharing of pretty printers between my chroniquery code and my archer gdb plugins.
  • Now that archer-gdb has python inferior control, potential use of archer-gdb as an alternate back-end to chroniquery.
  • It would be neat to be able to expose chroniquery’s functionality via a web interface so that things could be expanded/drilled down without having to re-run things.  This would also help offset the non-trivial startup costs to chronicle-recorder’s own chronicle-query.  (Although arguably the startup time is nothing compared to the inefficiency of all the work chronicle-query does for a call and throws away when the next call pretty much wants that same data.)

understanding libmime using chroniquery and unit tests

chroniquery-trace-failed-return-value

Mailnews’ libmime is one of the harder modules to wrap one’s head around.  8-letter filenames where the first four letters tend to be “mime”, home-grown glib-style OO rather than actual C++, and intermingling of display logic with parsing logic do not make for a fun development or even comprehension experience.

Running an xpcshell unit test under roc‘s chronicle-recorder and then processing it with my chroniquery stuff (repo info), we get a colorful HTML trace of the execution, somewhat broken out (note: pretend the stream_write/stream_completes are interleaved; they are paired).  Specific examples of libmime processing are linked there for the inquisitive (there are more if you check the former link).

The thing that kickstarted this exciting foray is the error pictured in the screenshot above from an earlier run.  The return values are in yellow, and you can see where the error propagates from (the -1 cast to unsigned).  If you look at the HTML file, you will also note that the file stops after the error because the functions bail out as soon as they hit an error.

However, our actual problem and driving reason for the unit tests is the JS emitter choking on multipart/related in writeBody (which it receives by way of ‘output_fn’).  Why?  Why JS Emitter?!  (You can open the links and control-F along at home!)

  • We look at the stream_write trace for our multipart/related.  That’s odd, no ‘output_fn’ in there.

chroniquery-stream_complete-output_fn

  • We look at the stream_complete trace for the multipart/related.  There’s our ‘output_fn’!  And it’s got some weird HTML processing friends happening.  That must be to transform links to the related content into something our docshell can render.  This also explains why this processing is happening in stream_complete rather than stream_write… it must have buffered up the content so it could make sure it had seen all the ‘related’ documents and their Content-IDs so it could know how to transform the links.
  • Uh oh… that deferred processing might be doing something bad, since our consumer receives a stream of events.  We had to do something special for SMIME for just such a reason…
  • We check stream_complete for ‘mimeEmitterAddHeaderField’ calls, which the JS emitter keys off of to know what part is currently being processed and what type of body (if any) it should expect to receive.  Uh-oh, none in here.

chroniquery-stream_write-addheaderfield

  • We check stream_write for ‘mimeEmitterAddHeaderField’ calls, specifically with a “Content-Type” field.  And we find them.  The bad news is that they are apparently emitted as the initial streaming happens.  So we see the content-type for our “text/html”, then our “image/png”.  So when stream_complete happens, the last thing our JS emitter will have seen is “image/png” and it will not be expecting a body write.  (It will think that the text/html had no content whatsoever.)
  • Khaaaaaaaaaaaaaaaaaaaaaaaaan!

In summary, unit tests and execution tracing working together with pretty colors have helped us track down an annoying problem without going insane.  (libmime is a lot to hold in your head if you don’t hack on it every day.  also, straight debugger breakpoint fun generally also requires you to try and formulate and hold a complex mental model… and that’s assuming you don’t go insane from manually stepping aboot and/or are lucky with your choices of where you put your breakpoints.)  The more important thing is that next time I want to refresh my understanding of what libmime is up to, I have traces already available.  (And the mechanics to generate new ones easily.  But not particularly quickly.  chronicle and my trace-generating mechanism be mad slow, yo.  It may be much faster in the future to use the hopefully-arriving-soon archer-gdb python-driven inferior support, even if it can’t be as clever about function-call detection without cramming int 0x3’s all over the place.)

Useful files for anyone trying to duplicate my efforts: my ~/.chroniquery.cfg for this run, the unit test as it existed, and the command-line args were: trace -f mime_display_stream_write -f mime_display_stream_complete -c -H /tmp/trace3/trace.html --file-per-func-invoc

chronicle-recorder and amd64, hooray!

overview: visichron.py trace python -f main -d 3

My personal laptop rolls amd64-style (rather than i686), and chronicle-recorder’s valgrind component was not working on it (“illegal instruction”). I have done some vendor-branch dancing to get chronicle-recorder’s valgrind sub-directory to use valgrind 3.3.0. A bzr branch of just the valgrind subdirectory (drop-in or build separately and make sure you invoke valgrind from that location) is available here: http://www.visophyte.org/rev_control/bzr/valgrind-chronicle/valgrind-chronicle/

A probably faster location to bzr branch from is here: http://clicky.visophyte.org/rev_control/bzr/valgrind-chronicle/valgrind-chronicle/

and a tarball of the tip is here: http://www.visophyte.org/rev_control/tarballs/valgrind-chronicle-r6.tar.bz2

I have also updated chroniquery (my python binding for chronicle-query’s JSON interface, and its own set of tools that build on it) to work with amd64. Its bzr branch is here: http://www.visophyte.org/rev_control/bzr/chroniquery/trunk/

The goal of all of this was to be able to run chronicle against Thunderbird, which I was able to do. Unfortunately, visichron (the visualizing part of chroniquery, built on visophyte) is not quite ready for such an undertaking at this time.  (Not to mention a few C++ symbol issues…)

snippet: visichron.py trace python -f main -d 3

However, it was powerful enough to handle visualizing the trace resulting from invoking python on a python file with just “print ‘hello world'” in it. So that’s what you see here, albeit limited to only 3 call levels deep. Click on the upper picture to see the whole thing at 2000×2000, or just look at the lower picture to get an idea of what’s going on. Just like my first post using visichron, the rings are colored on a (naive) control-flow-taken basis. The ring colors are per-function, however. Also, the node colors are ‘hot’-colored based on how many ‘ticks’ were spent inside each function, counting multiply in the case of recursion.
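
As an illustration of the ‘hot’ node coloring, mapping per-function tick counts onto a heat ramp might look like the following minimal sketch; the ramp itself is my own choice, not necessarily what visichron does.

# Illustrative 'hot' coloring by tick count: 0.0 maps to dark red, 1.0 to a
# bright yellow-white.
def heat_color(ticks, max_ticks):
    t = min(ticks / float(max_ticks), 1.0) if max_ticks else 0.0
    red = 0.4 + 0.6 * t             # dark red -> bright
    green = t * 0.9                 # yellow creeps in as t grows
    blue = max(0.0, t - 0.8) * 2.0  # only the very hottest wash out toward white
    return "#%02x%02x%02x" % (int(red * 255), int(green * 255), int(blue * 255))

print(heat_color(10, 1000))   # cool-ish function: dark red
print(heat_color(950, 1000))  # hot function: near yellow-white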

Other interesting changes include some primitive watch functionality for chronisole’s ‘trace’ mode. Also, the previously unmentioned readchron.py now understands and prints BunchedEffect and RegEffect info. (readchron aspires to be along the lines of readelf for chronicle databases, but with more colors.)

Chronicle-Recorder Graph/Ring Visualization! Hooray!

Thanks to new and improved time management skills (and a holiday weekend doesn’t hurt), I’ve got a chronicle-recorder visualization going on via chroniquery:

visichron.py trace vfancy -f smain

Above, we have the visualization run against the ‘fancy’ program seen in the previous chroniquery posts (with one caveat, addressed later). What does it mean?

  • The circular nodes are functions in the executed program. In this case, we start from ‘smain’ and pull in all the subroutines that we detect.
  • The edges between nodes indicate that a function call occurred between the two functions sometime during the execution; it could be once, it could be many times. The color of the edge is a slightly more saturated version of the color of the node that performed the call. If they call each other, only one color wins.
  • The rings around the outside of each node indicate when it was called, specifically:
    • The ring starts and stops based on the chronicle-query timestamp. Like on a clock, time starts at due north and flows around clock-wise, with the last smidge of time just before we reach due north again.  This has its ups and downs. The reason we are using ‘smain’ instead of ‘main’ is that when we used the untouched main, the first “new” ended up taking up most of our timestamp space. So I turned main into smain and had a new main that takes the memory allocator start-up cost hit and then calls smain.
    • The thickness of the ring indicates the depth of the call-stack at the time. The thickest ring corresponds to the outermost function, the thinnest ring to the innermost function. This results in a nice nesting effect for recursive functions, even if it’s more of an ‘indirect’ recursion.
    • The color of each ring slice is based on the control flow taken during the function call. (I think this is awesome, hence the bolding.) Now, I’m making it sound fancier than it is; as a hackish first pass, we simply determine the coverage of all the instructions executed during that function call and derive a color from it (a minimal sketch of one way to do this follows the list). A more clever implementation might do something when iteration is detected. A better implementation would probably move the analysis into the chronicle-query core, where information on the basic blocks under execution should be available. Specific examples you can look at above:
      • print_list: The outermost calls are aqua-green because their boolean arguments are true. The ‘middle’ calls are light green because their boolean arguments are false. The final, innermost calls are orange-ish because they are the terminal case of the linked-list traversal where we realize we have a NULL pointer and should bail without calling printf. They are also really tiny because no printf means basically no timestamps for the function.
      • nuke: Four calls are made to nuke; the first and third times (light blue) we are asking to remove something that is not there, the second and fourth times (purple) we are asking to remove something that is. I have no idea why the third removal is so tiny; either I have a bug somewhere or using the timestamps is far more foolish than I thought.
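
A minimal sketch of that hackish coverage-to-color mapping might look like this; the hashing scheme is my own stand-in, not necessarily what visichron actually does.

# Illustrative only: derive a stable ring color from the set of instruction
# addresses executed during one function call, so calls that take the same
# control-flow path share a color.
import colorsys
import hashlib

def coverage_color(executed_addresses):
    key = ",".join("%x" % addr for addr in sorted(executed_addresses))
    digest = hashlib.md5(key.encode("ascii")).hexdigest()
    hue = int(digest[:2], 16) / 255.0      # stable pseudo-random hue per path
    r, g, b = colorsys.hsv_to_rgb(hue, 0.6, 0.9)
    return "#%02x%02x%02x" % (int(r * 255), int(g * 255), int(b * 255))

# Two calls with identical coverage get the same color; a different path differs.
print(coverage_color({0x400510, 0x400514, 0x400518}))
print(coverage_color({0x400510, 0x400524}))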

Besides the obvious shout-out to chronicle-recorder, pygraphviz and graphviz power the graph layout. Now, a good question would be whether this actually works on something more complex. Could it be? Well, you can probably see the next picture as you’re reading this, so we’ll cut this rhetorical parade short.

visichron.py trace chronicle-query -f load_all_mmapped_objects

Besides the obvious font issues, this actually looks pretty nice. But what does it tell us? Honestly, the full traversal here is excessive. All we care about is the center node, load_all_mmapped_objects, and load_dwarf2_for (due north and a teeny bit east of the center node). If we look at the calls to load_dwarf2_for, we can see that two of them have different control-flow coverages. Those happen to be the times debug symbols could not be found (which is my problem I want to debug). The first one is for ld-2.6.1.so (I’m looking at the textual output of chronisole for the same command), and the second one is for /usr/bin/python2.5. The second one should definitely not fail, because it should find the symbols in /usr/lib/debug, but it does.

Unfortunately, the trail sorta stops cold there with a bug in either chronicle-query or in chroniquery (possibly cribbed from chronomancer). load_dwarf2_for should be calling dwarf2_load, but we don’t see it. I don’t know why, but I haven’t actually looked into it either. Rest assured that a much more awesome graph awaits me in the future!

Because there’s too much text and not enough pictures, I’ll also throw in the output of the ring visualization test which uses a stepped tick-count (1 per function call) to show how things could look if we went with an artificial time base…

ring visualization example

Bzr repositories for both can be found at http://www.visophyte.org/rev_control/bzr/.

Hooray for dwarf location lists!

chronisole-trace-fancy-main-working.png

I resolved the time-stamp issues from last time. Arguments are now happy because we move to the time-stamp corresponding to when the function’s prologue has completed. The booleans were wrong because they were copied to locals during the prologue, and that is where the argument list was referencing them from. Things like this are much easier to diagnose thanks to support for disassembly print-outs via the diStorm64 disassembler, although I may move to using libdisassemble in the future since it is pure python and presumably provides (or can be more easily coerced to provide) a richer set of info about the disassembly.

Chroniquery slightly more chronoriffic

So, I was playing with the fantastic Chronomancer (roc‘s announcement, project link), an Eclipse GUI for chronicle-recorder, using a trace from my PyXPCOM troubles. I noticed that some symbols which seemed like they should be there were missing, and decided to run chronicle against chronicle-query itself and view the problem in Chronomancer so that I might smite it. Unfortunately, some stuff didn’t show up in that trace either, so I went investigating by way of enhancing chroniquery, both to better understand the problem and to make chroniquery more usable. Sadly, there’s still more work to be done, but I have a picture to show, and I now understand a good deal of the chronomancer internals and how to drive chronicle-query better.

chroniquery against fancy, tracing main, minor issues

The above is pretty representative of what chroniquery’s chronisole can now do. It’s an invocation of chronisole.py trace fancy -f main. The ‘fancy’ program builds a linked list using ‘cons’, uses the recursive ‘print_list’ to print the list, and uses the iterative ‘nuke’ to delete entries from the list. ‘cons’ returns the newly allocated list entry/cons cell, and nuke returns 0 if the entry existed and was nuked or 1 if it didn’t exist. We traced main because of the -f; we could start with any bunch of functions as the root, and can use -x to filter out noisy functions from the recursive function discovery. Return values and parameters are shown.

Of course, even in this example, the first/newline booleans are wrong, presumably indicative of some wackiness/off-by-one error in my mechanism for determining the sub-called functions’ entry timestamps. I’m confident I’m doing some foolish things, but it will have to wait until the weekend and a better-rested brain. I’ve pushed the changes to the bzr tree if anyone wants to do my work for me.