why so slow, pushlog?

I am doing something where I need to talk to the Mozilla hg pushlog.  I noticed things were running disturbingly slow, so I figured I’d look into it.  I’m using node.js so htracr (a node.js, libpcap-based via node_pcap http transaction visualizer) seemed like the easiest and most fun way to figure out what is going on.  (In a web browser I would just use the built-in timeline.)

slooooooooooow...

The short request is a comm-central pushlog request by date.  The long request is a mozilla-central pushlog request by date.  mozilla-central has more pushes, suggesting that the query is either doing something sketchy like a table scan or there is lock contention.  Quick investigation showed no successful pushes in the given timeframe, eliminating lock contention from the likely list of causes.  (The implementation uses SQLite, which, when not running with the Write-Ahead-Log enabled, will only experience contention when the reads are occurring from separate connections.)

This suggests the query is doing a table scan.  This tends to be a reasonably straightforward problem.  But since I have a SQLite opcode visualizer in my toolkit (now on github!) that needs plugs periodically to keep it on the tip of everyone’s tongue, I used that. Plug! Plug! Plug!

If you care about these kind of things, click on the small image on the left and look for the red text.  The giant warning flag is the “Rewind” on the changesets table and the control flow that performs a “Next”  that appears to be part of a loop case right off the bat.  (Note that this is using SQLite 3.6.23.1 since I assume that’s what hg.mozilla.org is using.)  Summary: we perform one table scan of the changesets table for every push in the date range.

The “bad” schema is:

CREATE TABLE changesets (pushid INTEGER, rev INTEGER, node text);
CREATE TABLE pushlog (id INTEGER PRIMARY KEY AUTOINCREMENT, user TEXT, date INTEGER);
CREATE UNIQUE INDEX changeset_node ON changesets (node);
CREATE UNIQUE INDEX changeset_rev ON changesets (rev);
CREATE INDEX pushlog_date ON pushlog (date);
CREATE INDEX pushlog_user ON pushlog (user)

The query is: SELECT id, user, date, node from pushlog LEFT JOIN changesets ON id = pushid WHERE date > 1298306930 AND date < 1298306955 ORDER BY id DESC, rev DESC;

The query plan via EXPLAIN QUERY PLAN from SQLite 3.6.23.1 on the “bad” schema is:

0|0|TABLE pushlog WITH INDEX pushlog_date
1|1|TABLE changesets

This is obvious if you know what you’re looking for; unfortunately the indication of the problem is the lack of “WITH INDEX” or the like rather than text that calls out the problem. The much nicer SQLite 3.7.4 EXPLAIN QUERY PLAN (which has great documentation!) would make the problem much more obvious by saying “SCAN TABLE” if not for the fact that it ends up creating an automatic index:

0|0|0|SEARCH TABLE pushlog USING INDEX pushlog_date (date>? AND date<?) (~110000 rows)
0|1|1|SEARCH TABLE changesets USING AUTOMATIC COVERING INDEX (pushid=?) (~7 rows)
0|0|0|USE TEMP B-TREE FOR ORDER BY

Although one table scan accompanied by a b-tree building is probably a better idea than N table scans, if you read this and think “SQLite 3.7.x just saved someone’s bacon”, you would be wrong because it still throws away the index at the end.  The server is still experiencing an initial vicious kick in the pants every time the statement is run; there are just no lighter follow-up kicks (lighter because of the OS disk cache, you see…).

In any event, the problem is that there is no index on the pushid column in the changesets table.  (And pushid can’t be the primary key because it is not unique.  There’s no benefit to using a composite key since SQLite will still create a simple rowid key once we use a composite, so an index is the way to go.)

Once we fix this, our graph looks like the second one at left (in SQLite 3.6.23.1).  Again, looking for red text and now also orange text, the key things are that we no longer have a “Rewind” or “Next”, and instead have a SeekGe on the index and a Seek on the table using the row id the index told us about.  (We are not using a covering index because we expect extremely high locality in the table based on rowid because the insertions happen consecutively in a single transaction.)

The 3.6.23.1 query plan now looks like:

0|0|TABLE pushlog WITH INDEX pushlog_date
1|1|TABLE changesets WITH INDEX changeset_pushid

Much nicer! What does the lovely 3.7.4 say?:

0|0|0|SEARCH TABLE pushlog USING INDEX pushlog_date (date>? AND date<?) (~110000 rows)
0|1|1|SEARCH TABLE changesets USING INDEX changeset_pushid (pushid=?) (~10 rows)
0|0|0|USE TEMP B-TREE FOR ORDER BY

Awww yeah. Anywho, here’s the schema one wants to use:

CREATE TABLE changesets (pushid INTEGER, rev INTEGER, node text);
CREATE TABLE pushlog (id INTEGER PRIMARY KEY AUTOINCREMENT, user TEXT, date INTEGER);
CREATE INDEX changeset_pushid ON changesets (pushid);
CREATE UNIQUE INDEX changeset_node ON changesets (node);
CREATE UNIQUE INDEX changeset_rev ON changesets (rev);
CREATE INDEX pushlog_date ON pushlog (date);
CREATE INDEX pushlog_user ON pushlog (user);

I am off to file this as a bug… filed as bug 635765.

UPDATE: I initially had a brain glitch where I proposed using a primary key rather than an additional index.  Unfortunately, this is what got pushed out to feed readers during the 30 seconds before I realized my massive mistake.  If you have come to this page to note the bad-idea-ness of that, please enjoy the corrected blog-post which explains why the primary key is not the best idea before going on to point out any other serious brain glitches 🙂

UPDATE 2 (Feb 22, 2011):  Good news: the change has landed and all the existing databases have been updated.  Woo!  Bad news: although things are probably faster now, things are still way too slow.  (Given that this specific deficiency was determined by inspection and not actual profiling, this is not so surprising.)  It would appear that the retrieval of the information about the changesets is the cause of the slowdown.  More inspection on my part suggests that populating the list of tags may be involved, but that’s just a quick guess.  Please follow the bug or follow-on bugs if you want to witness the entire exciting saga.

performance annotated SQLite EXPLAINation visualizations using systemtap

For the Thunderbird 3.1 release cycle we are not just fixing UX problems but also resolving various performance issues.  Building on my previous work on a visualization of SQLite opcode control flow graphs using graphviz I give you… the same thing!  But updated to more recent versions of SQLite and integrating performance information retrieved through the use of systemtap with utrace.

In this case we are using systemtap to extract the number of times each opcode is executed and the number of btree pages that are requested during the course of executing the opcode.  (We do not differentiate between cache hits and misses because we are looking at big-O right now.)  Because virtual tables (like those used by FTS3) result in nested SQLite queries and we do not care about analyzing the queries used by FTS3, we ignore nested calls and attribute all btree page accesses to the top-level opcode under execution.

Because the utility of a tool is just as dependent on ease of use as its feature set, I’ve cleaned things up and made it much easier to get information out of Thunderbird/gloda with this patch which should land soon and provides the following:

  • The gloda datastore will write a JSON file with the EXPLAINed results of all its SQL queries to the path found in the preference mailnews.database.global.datastore.explainToPath.  This preference is observed so that setting it during runtime will cause it to create the file and begin explaining all subequently created queries.  Clearing/changing the preference closes out the current file and potentially opens a new one.
  • Gloda unit tests will automatically set the preference to the value of the environment variable GLODA_DATASTORE_EXPLAIN_TO_PATH if set.
  • A schema dump is no longer required for meta-data because we just assume that you are using a SQLite DEBUG build that tells us everything we want to know about in the ‘comment’ column.
  • grokexplain.py now uses optparse and has more internal documentation and such.

So what do the pretty pictures show?

  • Before: A gloda fulltext query search retrieves all of the results data before applying the LIMIT.  This results in a lot more data transplanted into temporary results tables than we will end up using; wasted bookkeeping.  Additionally, we incur the row lookup costs for both our messages data storage table and our fulltext messagesText table for all hits, even the ones we will not return.  (Noting that there was no escaping hitting both tables since we were using offsets() and it hits the fulltext table as part of its operation.)
  • After: We perform an initial query phase where we minimize wasted bookkeeping by only retrieving and using the bare minimum required to compute the LIMITed list of document id’s.  Additionally, by use of the FTS3 matchinfo() function instead of the offsets() function we are able to avoid performing row lookup on the messagesText table for results that will not be returned to the user.  Use of the matchinfo() function requires a custom ranking function which allows us to be somewhat more clever about boosting results based on fulltext matches too.
  • The poor man’s bar graphs in the pictures are expressing a hand-rolled logarithmic scale for the number of invocations (left bar) and number of btree pages accessed (right bar).  On the right side of each line the actual numbers are also presented in the same order.  The goal is not to convey good/bad so much as to draw the eye to hot spots.

Notes for people who want more to read:

  • SQLite has built-in infrastructure to track the number of times an opcode is executed as well as the wall-clock time used; you do not have to use systemtap.  It’s a conditional compilation kind of thing, just -DVDBE_PROFILE and every statement you execute gets its performance data appended to vdbe_profile.out when you reset the statement.  It doesn’t do the btree pages consulted trick, but it’s obviously within its power with some code changes.
  • Our use of systemtap is primarily a question of discretionary reporting control, the ability to integrate the analysis across other abstraction layers, and various build concerns.  The JSON output is not a driving concern, just something nice we are able to do for ourselves since we are formatting the output.
  • The tool has really crossed over into the super-win-payoff category with this fix investigation.  (Having said that, I probably could have skipped some of the data-flow stuff last time around.  But these projects are both for personal interest and amusement as well as practical benefit, so what are you gonna do?  Also, that effort could pay off a bit more by propagating comments along register uses so that the LIMIT counter register r8 and the compute-once r7 in the after diagram would not require human thought.)

References:

  • The grokexplain repo.  Used like so: python grokexplain.py –vdbe-stats=/tmp/glodaNewSearchPerf.json /tmp/glodaNewSearchExplained.json -o /tmp/glodasearch-newcheck
  • The systemtap script in its repo.  Used like so: sudo stap -DMAXSTRINGLEN=1024 sqlite-perf.stp /path/to/thunderbird-objdir/mozilla/dist/lib/libsqlite3.so > /tmp/glodaNewSearchPerf.json
  • The bug with the gloda explain logic patch and the improved SQL query logic.  I also used the tooling to fix another (now-fixed) bug, but that problem turned out to be less interesting.

Thunderbird ĝLȬdÅ full text search tokenizer now supports accent folding, non-ASCII case-folding, and more!

Thanks to the efforts of Makoto Kato (jp-blog en-blog twitter) whom you may remember as the bringer of CJK bi-gram tokenization, I have just pushed tokenizer support for case-folding, accent-folding, and voiced sound mark magic to comm-central.  The net result is that searches for “ĝLȬdÅ” will now find “gloda” and vice versa.  Less crazy searches should also be improved.

Starting tomorrow, Thunderbird (in the guise of Lanikai) 3.1 nightlies will have this capability and it will also show up in the next beta (beta 2).  No action on your part is required; when you start running a build with this capability your existing gloda database will get blown away and indexing will start from scratch.  If you go back to an older version for some reason after having used the updated build, the old build will also decide to blow away your database, so try to avoid making a habit of switching between them.

We also have some significant performance enhancements related to gloda search and indexing slated for beta 2, not to mention* usability enhancements to avoid the dreaded gloda search/quick search toggle dance.

* This expression is weird given that I go on to mention the allegedly unmentioned, but it still makes more sense than “all but X”.

visualization of control-flow/data-flow analysis of sqlite EXPLAIN-ations

control-flow before

control-flow before

While doing some work on the gloda-based search targeted for beta 2, I came upon some query slowness and decided to look into it.  Because gloda queries are (conceptually, at least) dynamic, we also generate dynamic SQL.  Our schema is fairly normalized, and it’s not always obvious to me what I can expect to have going on.  EXPLAIN QUERY PLAN is a good litmus test for massive insanity, but it’s a bit concise.  For example, the troublesome SQL was as follows:

SELECT * FROM messages INNER JOIN messagesText ON messages.id = messagesText.rowid WHERE id IN (SELECT docid FROM messagesText WHERE subject MATCH "sutherland") AND deleted = 0 AND folderID IS NOT NULL AND messageKey IS NOT NULL ORDER BY date DESC LIMIT 100;

This nets us:

0|0|TABLE messages WITH INDEX deleted
1|1|TABLE messagesText VIRTUAL TABLE INDEX 1:
0|0|TABLE messagesText VIRTUAL TABLE INDEX 2:

Everything’s an index!  We win!  Sorta.  The index it chose is somewhat worrying if you think about it.  But really, it’s all quite nebulous.  I have basically no idea what is happening in there.  The good news is that EXPLAIN will tell us the actual opcodes in use.  The bad news is that it is quite hard to understand (104 opcodes), at least for this query.

Python, graphviz, pygraphviz, and compulsive tool-building to the rescue!  My original plan was that data-flow analysis could make everything incredibly obvious as to what is going on.  I’m not sure I reached that point.  But in the process, the efforts required to make the system clever enough to do data analysis allow the control flow diagram to be quite pretty and have lots of useful information.  The main sign of data-flow analysis is that all cursor write operations have the list of cursors that data flowed from in parens.  Each cursor gets its own tango-ish color, and opcodes involving the cursor are colored by that cursor.

The major flaw in the data-flow analysis that springs to mind right now is that it ignores control flow.  An action that will only be taken because of a control-flow decision based on data retrieved from a cursor should ideally establish a relationship.  This is important because gloda throws around a lot of set intersections (although not in this case) in its queries, and it would be nice to be able to concisely express that.  The control-flow diagram is arguably orders of magnitude better than a raw EXPLAIN, but my hope is/was that the data analysis can make it trivial to figure out how things are being hooked up.  Given the amount of effort already expended and the present results, I figure I’m probably at the “control flow is good enough for the foreseeable future stage of things”.

In any event, the control-flow graph makes it (more) obvious that the outer loop is using the ‘deleted’ index to walk over *every deleted message in the database*.  A one-off idiom computes the full-text search and stashes it in an intermediary table.  As we then walk over every deleted message, we see if that message can be found in the full-text search result table.  If it is, our previously queued lazy-seek against the main messages table happens and we search against the full-text table to do our join.  And out pop results!

My hope for this query was that the deleted condition would be executed as a constraint as we were walking our result set, just like the folderID and messageKey constraints.  Making sure your queries are using the right index and helping the optimizer make the right call is a fairly textbook problem and the SQLite docs are actually pretty good on this.  For the sake of the example, I have dropped the index entirely.  (The ‘deleted’ index was added so we can quickly mark messages as needing deletion processing without slowing down deletes for the user.  It may turn out to be more beneficial to leave the field as-is, but use a separate table as our work queue.)

dataflow before

dataflow before

control-flow after

control-flow after

After the deletion, we get the revised diagram.  The full-text search still happens completely before we produce any other results, but this is keeping in keeping with our query.  (The query generation code is designed to handle the case where we have multiple constraints and must intersect the results of each.  It can clearly be optimized for the case where no primary-key intersection is required.)  The traversal of the full-text search result set is now the outer loop.  Deleted is now just filtering like folderID and messageKey, as expected.

The major lingering red flag is the block that calls Last() and Delete().  The conditional that makes the decision to do that is using “r1”, which is apparently implementing our limit logic inside the code.  This was confusing until I remembered that we’ve got an ORDER BY going on.  The Delete is just culling the results that will never potentially be seen.  The bad news is that we are doing our full-text search join prior to the ORDER BY culling.  So, my original fear that we are doing the JOIN no matter what still holds.  And that’s enough excitement for one day.

Given that any full-text search with ordered results will require retrieval the entire result set and correlating it against at least one table, I’m tempted to create separate tables for different windows of time given the importance of recency.  Thoughts/input appreciated, as anyone who has read this far is probably very detail oriented…

UPDATE (Feb 25, 2009): The hg repo is http://hg.mozilla.org/users/bugmail_asutherland.org/grokexplain/.  It now does some actual value propagation (and has had various bugs fixed).  By default it ignores ‘Yield’ because the way in which they are used (basically generator/continuation) wreaks havoc on the CFG with just straightforward static analysis.  If you want the yields, pass “yields” on the command line.  Also, passing “debug” gets you the states of the registers at the entrance/exit of the “basic blocks”.  (Not technically basic blocks, but close enough for my purposes.)