Thunderbird Jetpack Teasers: Words per Minute in Compose

jetpack.future.import("thunderbird.compose");
jetpack.thunderbird.compose.appendComposePanel({
  onReady: function (panel, composeContext) {
    let doc = panel.contentDocument;
    let msgNode = $("<span />", doc.body).appendTo(doc.body);
 
    let started = Date.now();
    setInterval(function() {
      let words = composeContext.getPlaintextContents().split(/\s+/);
      let secs = Math.ceil((Date.now() - started) / 1000);
      let wordsPerMinute = Math.floor((words.length * 60) / secs);
      msgNode.text(wordsPerMinute + " words per minute.");
    }, 1000);
 
    panel.show();
  },
  html: <><body style="overflow: hidden"></body></>
});

[Screenshot: thunderbird-jetpack-words-per-minute-example]

So’s your facet: Faceted global search for Mozilla Thunderbird

[Screenshot: faceting-gloda-hover-davida-1]

Following in the footsteps of the MIT SIMILE project’s Exhibit tool (originally authored by David Huynh) and Thunderbird Seek extension (again by David Huynh), we are hoping to land faceted global search for Thunderbird 3.0 (a la gloda) in beta 4.

I think it’s important to point out how ridiculously awesome the Seek extension is.  It is the only example of faceted browsing or search in an e-mail client that I am aware of.  (Note: I have to assume there are some research e-mail clients out there with faceting, but I haven’t seen them.)  Given the data model available to extensions in Thunderbird 2.0 and the idiosyncratic architecture of the UI code in 2.0, it’s not only a feature marvel but also a technical marvel.

Unfortunately, there was only so much Seek could do before it hit a wall given the limitations it had to work with.  Thunderbird 2.0’s per-folder indices are just that, per-folder.  They also require (fast) O(n) search on any attribute other than their unique key.  Although Seek populated an in-memory index for each folder, it was faced with having to implement its own global indexer and persistent database.

Gloda is now at a point where a global database should no longer be the limiting factor for extensions, or the core Thunderbird experience…

[Screenshot: faceting-gloda-action-tag-hover-bienvenu-1]

The screenshots are of a fulltext search for “gloda” in my message store.  The first screenshot is without any facets applied and me hovering over one of David Ascher’s e-mail addresses.  The second is after having selected the “!action” tag and hovering over one of David Bienvenu’s e-mail addresses.  Gloda has a concept of aggregating identities into contacts, but since the address book currently lacks UI for this, that aggregation doesn’t happen.  We also do not yet coalesce (approximately) duplicate messages, which explains any apparent duplicates you see.

The current state of things is a result of development effort by David Ascher and myself, with design input from Bryan Clark and Andreas Nilsson (with hopefully much more to come soon :).  Although we aren’t using much code from our previous exptoolbar efforts, a lot of the thinking is based on the work David, Bryan, and I did on that.  Many thanks to Kent James, Siddharth Agarwal, and David Bienvenu for their recent and ongoing improvements to the gloda (and mailnews) back-end, which help make this hopefully compelling UI feature actually usable through efficient and comprehensive indexing that does not make you want to throw your computer through a window.

If you use Linux or OS X, I just linked you to try server builds.  The Windows try server was sadly on fire and so couldn’t attend the build party.  The bug tracking the enhancement is bug 474711; it has repository info if you want to spin your own build, and new try server builds will also be noted there.  Please keep in mind that this is an in-progress development effort; it is not finished, and there are bugs.  Accordingly, please direct any feedback/discussion to the dev-apps-thunderbird list / newsgroup rather than the bug.  Please be aware that increases in awesomeness require that your gloda database be automatically blown away when you try the new version.  And first you have to turn gloda on if you have not already.

Using VMware Record/Replay and VProbes for low time-distortion performance profiling

[Screenshot: profile-performance-graph-enumerateProps]

The greatest problem with performance profiling is getting as much information as possible while affecting the results as little as possible.  For my work on pecobro I used Mozilla’s JavaScript DTrace probes.  Because the probes are limited to indiscriminate notifications of all function invocations/returns and there is no support for JS backtraces, the impact on performance was heavy.  Although I have never seriously entertained using chronicle-recorder (via chroniquery) for performance investigations, it is a phenomenal tool and it would be fantastic if it were usable for this purpose.

With Workstation 6/6.5, VMware introduced the ability to efficiently record VM execution by capturing only the non-deterministic parts of that execution.  When you hit the record button it takes a snapshot and then does its thing.  For a 2 minute execution trace where Thunderbird is started up and gloda starts indexing and adaptively targets 80% cpu usage, I have a 1G memory snapshot (the amount of memory allocated to the VM), a 57M vmlog file, and a 28M vmsn file.  There is also a 40M disk delta file (against the disk snapshot), but I presume that’s a side effect of the execution rather than a component of it.

The record/replay functionality is the key to being able to analyze performance while minimizing the distortion of the data-gathering mechanisms.  There are apparently a lot of other solutions in the pipeline, many of them open source.  VMware peeps apparently also created a record/replay-ish mechanism for valgrind, valgrind-rr, which roc has thought about leveraging for chronicle-recorder.  I have also heard of Xen solutions to the problem, but am not currently aware of any usable solutions today.  And of course, there are many precursors to VMware’s work, but this blog post is not a literature survey.

There are 3 ways to get data out of a VM under replay, only 2 of which are usable for my purposes.

  1. Use gdb/the gdb remote target protocol.  The VMware server opens up a port that you can attach to.  The server has some built-in support to understand linux processes if you spoon feed it some critical offsets.  Once you do that, “info threads” lists every process in the image as a thread which you can attach to.  If you do the dance right, gdb provides perfect back-traces and you can set breakpoints and generally do your thing.  You can even rewind execution if you want, but since that means restoring state at the last checkpoint and running execution forward until it reaches the right spot, it’s not cheap.  In contrast, chronicle-recorder can run (process) time backwards, albeit at a steep initial cost.
  2. Use VProbes.  Using a common analogy, dtrace is like a domesticated assassin black bear that comes from the factory understanding English and knowing how to get you a beer from the fridge as well as off your enemies.  VProbes, in contrast, is a grizzly bear that speaks no English.  Assuming you can convince it to go after your enemies, it will completely demolish them.  And you can probably teach it to get you a beer too, it just takes a lot more effort.
  3. Use VAssert.  Just like asserts only happen in debug builds, VAsserts only happen during replay (but not during recording).  Except for the requirement that you think ahead to VAssert-enable your code, it’s awesome because, like static dtrace probes, you can use your code that already understands your code rather than trying to wail on things from outside using gdb or the like.  This one was not an option because it is Windows only as of WS 6.5.  (And Windows was not an option because building mozilla in a VM is ever so slow, and, let’s face it, I’m a linux kind of guy.  At least until someone buys me a solid gold house and a rocket car.)

[Screenshot: profile-performance-graph-callbackDriver-doubleClicked]

My first step in this direction has been using a combination of #1 and #2 to get javascript backtraces using a timer-interval probe.  The probe roughly does the following:

  • Get a pointer to the current linux kernel task_struct:
    • Assume we are uniprocessor and retrieve the value of x86_hw_tss.sp0 from the TSS struct for the first processor.
    • Now that we know the per-task kernel stack pointer, we can find a pointer to the task_struct at the base of the page.
  • Check if the name of our task is “thunderbird-bin” and bail if it is not.
  • Pull the current timestamp from the linux-kernel-maintained xtime.  Ideally we could use VProbes’ getsystemtime function, but it doesn’t seem to work and/or is not well defined.  Our goal is to have a reliable indicator of what the real time is at this stage in the execution, because with a rapidly polling probe our execution will obviously be slower than realtime.  xtime is pretty good for this, but ticks at 10ms out of the box (Ubuntu 9.04 i386 VM-targeted build), which is a rather limited granularity.  Presumably we can increase its tick rate, but not without some additional (though probably acceptable) time distortion.
  • Perform a JS stack dump:
    • Get XPConnect’s context for the thread.
      • Using information from gdb on where XPCPerThreadData::gTLSIndex is, load the TLS slot.  (We could also just directly retrieve the TLS slot from gdb.)
      • Get the NSPR thread private data for that TLS slot.
        • Using information from gdb on where pt_book is located, get the pthread_key for NSPR’s per-thread data.
        • Using the current task_struct from earlier, get the value of the GS segment register by looking into tls0_base and un-scrambling it from its hardware-specific configuration.
        • Use the pthread_key and GS to traverse the pthread structure and then the NSPR structure…
      • Find the last XPCJSContextInfo in the nsTArray in the XPCJSContextStack.
    • Pull the JSContext out, then get its JSStackFrame.
    • Recursively walk the frames (no iteration), manually/recursively (ugh) “converting” the 16-bit characters into 8-bit strings through violent truncation and dubious use of sprintf.

The obvious-ish limitation is that by relying on XPConnect’s understanding of the JS stack, we miss out on the most specific pure interpreter stack frames at any given time.  This is mitigated by the fact that XPConnect is like air to the Thunderbird code-base and that we still have the functions higher up the call stack.  This can also presumably be addressed by detecting when we are in the interpreter code and poking around.  It’s been a while since I’ve been in that part of SpiderMonkey’s guts… there may be complications with fast natives that could require clever stack work.

This blog post is getting rather long, so let’s just tie this off and say that I have extended doccelerator to be able to parse the trace files, spitting the output into its own CouchDB database.  Then doccelerator is able to expose that data via Kyle Scholz’s JSViz in an interactive force-directed graph that is related back to the documentation data.  The second screenshot demonstrates that double-clicking on the (blue) node that is the source of the tooltip brings up our documentation on GlodaIndexer.callbackDriver.  doccelerator hg repo; vprobe emmett script in hg repo.
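
For a feel of the plumbing, here is a minimal sketch of storing one parsed trace sample in a CouchDB database over its plain HTTP API.  The database name, document id scheme, and fields are illustrative; they are not doccelerator’s actual schema.

// Hypothetical example: persist one parsed trace sample as a CouchDB
// document.  CouchDB documents are created with PUT /<db>/<docid>; the
// "traces" database name and the fields below are made up for illustration.
function storeTraceSample(sample) {
  let docId = sample.timestamp + "-" + sample.frames[0];
  let req = new XMLHttpRequest();
  req.open("PUT", "http://localhost:5984/traces/" + encodeURIComponent(docId),
           false); // synchronous for brevity
  req.setRequestHeader("Content-Type", "application/json");
  req.send(JSON.stringify({
    timestamp: sample.timestamp, // xtime-derived wall-clock estimate
    frames: sample.frames        // JS function names, innermost first
  }));
  return JSON.parse(req.responseText); // {"ok":true,"id":...,"rev":...}
}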

See a live demo here.  It will eat your cpu although it will eventually back off once it feels that layout has converged.  You should be able to drag nodes around.  You should also be able to double-click on nodes and have the documentation for that function be shown *if it is available*.  We have no mapping for native frames or XBL stuff at this time.  Depending on what other browsers do when they see JS 1.8 code, it may not work in non-Firefox browsers.  (If they ignore the 1.8 file, all should be well.)  I will ideally fix that soon by adding an explicit extension mechanism.

Thunderbird Jetpack messageDisplay.overrideMessageDisplay fun.

[Screenshot: jetpack-twitter-follow-notification]

As part of our goal to make it easy to write extensions for Thunderbird 3, we’ve been working on getting Jetpack running under Thunderbird and exposing Thunderbird-specific extension points.  This is all experimental, but it’s having good results.

The first example replaces the message you get from Twitter when someone follows you, instead showing you that person’s Twitter page so you can see what they’ve written.  Unfortunately, if you try to click on links on the page you will become sad, because they all try to open in your web browser; but Standard8 is hard at work resolving the content display issues.  Besides demonstrating registration via a regex over the sender’s e-mail address, the example also shows us extracting headers from the message.  We also introduce a small HTML snippet that precedes the nested browser, so the result is not just an embedded web page.

jetpack.future.import("thunderbird.messageDisplay");
jetpack.thunderbird.messageDisplay.overrideMessageDisplay({
  match: {
    fromAddress: /twitter-follow-[^@]+@postmaster.twitter.com/
  },
  onDisplay: function(aGlodaMsg, aMimeMsg) {
    let desc = aMimeMsg.get("X-Twittersendername", "some anonymous jerk") +
      " has followed you on Twitter.  Check out their twitter page below.";
    return {
      beforeHtml:
        <>
          <div style="background-color: black; color: white; padding: 3px; margin: 3px; -moz-border-radius: 3px;">
            {desc}
          </div>
        </>,
      url: "http://twitter.com/" + aMimeMsg.get("X-Twittersenderscreenname")
    };
  }
});

[Screenshot: jetpack-amazon-big-total]

Our second example of the extension point replaces e-mails from Amazon about an order (order confirmation and shipment confirmation) with the amount of money you spent on the order in BIG LETTERS (or rather BIG NUMBERS). It uses a regular expression run against the message body to find the total order cost. Then it generates a simple web page to present the information to you.

jetpack.future.import("thunderbird.messageDisplay");
jetpack.thunderbird.messageDisplay.overrideMessageDisplay({
  match: {
    fromAddress: /(?:auto-confirm|ship-confirm)@amazon.(?:com|ca)/
  },
  _totalRe: /Total(?: for this Order)?:[^$]+\$\s*(\d+\.\d{2})/,
  onDisplay: function(aGlodaMsg, aMimeMsg, aMsgHdr) {
    let bodyText = aMimeMsg.coerceBodyToPlaintext(aMsgHdr.folder);
    let match = this._totalRe.exec(bodyText);
    let total = match ? match[1] : "hard to say";
    return {
      html:
      <>
        <style><![CDATA[
          body { background-color: #ffffff; }
          .amount { font-size: 800%; }
        ]]></style>
        <body>
          you spent... <span class="amount">${total}</span>
        </body>
      </>
    };
  }
});

The modified version of Jetpack can be found here on the “thunderbird” branch. “about:jetpack” can be triggered from the “Tools” menu. Besides the development jetpack, you can also add jetpacks from the about:jetpack “Installed Features” tab by providing a URL directly to the javascript file. Unfortunately, I just tried installing more than one Feature at the same time and that fell down. I’m unclear whether that’s a Thunderbird content issue, a problem with my changes, or a problem in Jetpack/Ubiquity that may go away when I update the branch.

thunderbird, gloda, exptoolbar, protovis, paninaro, oh oh oh

[Screenshot: exptoolbar-protovis-gloda-256]

Thunderbird.  With the global database, gloda.  Using the exptoolbar extension.  Using the protovis JavaScript visualization library.  For reals!  Not a prank!  Just grab the most recent XPI or grab the repo.  And be using a nightly (beta 2 might work?).

What you are looking at:

  • The exptoolbar search results page, augmented with a visualization.
  • Each conversation with search results gets its own wedge.
    • Wedges can be distinguished because of the alternating background colors.
    • Conversations that you sent a message to will have a red shading to them.  The examples may be somewhat misleading because the account where a lot of my sent mail ends up is not part of the profile used to create the screenshots.
  • Each message is placed in its conversation wedge…
    • The radius is based on the ‘age’ of the message using a log-ish scale.  Interpolation is actually linear at each level (one day, one week, one month, three months, one year, 5 years, ‘forever’).
    • The angular placement within the wedge is based on the author of the message.  Across all wedges the placement is the same.  This helps ‘bursty’ parts of conversations (which are extremely likely) be made more obvious, while also helping to provide some understanding of conversation dynamics.  (A sketch of one way to compute this placement follows the list.)
  • Message shapes are determined by whether the message is starred (diamond), sent by a ‘popular’ contact (circle), or an unpopular one (cross).  The use of popularity is a temporary measure because current gloda in trunk does not cache address-book lookups, and they are expensive.  Once the new gloda search code lands with those changes, we can rely on the existence of an address book entry.  (Starring a contact using the new message reader adds them to your address book.)
  • Message opacity is determined by whether the message is a ‘hit’ or not.  All messages in a conversation are eventually retrieved, though initially we only have the hits.
  • Message color is determined by applied tags (using the closest tango color for the first tag), or whether the message is starred (closest tango color to yellow, where I think I had removed the yellow tango colors for some unknown reason, so we get green I guess).  It’s grey if the message has no tag or star.
  • The subject of the conversation is displayed in the wedge.
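
To make the radius and angular-placement rules above concrete, here is a small sketch of one way they could be computed.  This is illustrative only: the function names, the breakpoint table, and the even spacing of authors are my assumptions, not the extension’s actual code.

// Hypothetical sketch of the wedge placement described above; names and
// details are assumptions, not the exptoolbar extension's actual code.
const AGE_LEVELS_IN_DAYS = [1, 7, 30, 90, 365, 365 * 5, Infinity];

// Map a message's age to a radius: one ring per level, linear interpolation
// within the ring, which yields the "log-ish" scale described above.
// Everything older than five years lands at the inner edge of the last ring.
function ageToRadius(ageInDays, maxRadius) {
  let ringWidth = maxRadius / AGE_LEVELS_IN_DAYS.length;
  let prevBound = 0;
  for (let i = 0; i < AGE_LEVELS_IN_DAYS.length; i++) {
    let bound = AGE_LEVELS_IN_DAYS[i];
    if (ageInDays <= bound) {
      let frac = (bound == Infinity) ? 0
                                     : (ageInDays - prevBound) / (bound - prevBound);
      return (i + frac) * ringWidth;
    }
    prevBound = bound;
  }
  return maxRadius;
}

// Give each author a stable fraction of a wedge's angular extent so the same
// person lands at the same angle in every conversation wedge.
let authorAngleFractions = {};
let nextAuthorIndex = 0;
function authorToAngleFraction(authorEmail, totalAuthorCount) {
  if (!(authorEmail in authorAngleFractions))
    authorAngleFractions[authorEmail] = (nextAuthorIndex++ + 0.5) / totalAuthorCount;
  return authorAngleFractions[authorEmail];
}

A message in a wedge spanning angles [wedgeStart, wedgeEnd] would then be drawn at radius ageToRadius(age, maxRadius) and angle wedgeStart + authorToAngleFraction(author, authorCount) * (wedgeEnd - wedgeStart).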

[Screenshot: exptoolbar-protovis-seek-thunderbird-256]

Things that are happy:

Things that are sad (aka caveats):

  • It would probably be better if the visualization was not radar-inspired.  Besides the perceptual reasons, the subjects are harder to read than they would be in an equivalent linear-styled visualization.
  • The visualization is not interactive.  protovis officially has no interaction support yet, but if you look in the (only available minified?) source, it’s almost there.  It might be entirely there, but it didn’t work for me immediately after a quick reading of the (indented) source.
  • There is some low probability failure that occurs during the visualization updating as gloda backfills the message collections.  If it happens on the last update, you can end up with a half-built visualization.  Re-running the search will generally resolve the issue.
  • The visualization does a pretty solid job of taking up all the screen real estate and has no way to be disabled, so you have to scroll past it every time.

Future work:

  • Interactivity.
  • Perhaps showing the gravatars for the people involved in a conversation at the outer rim of the wedge, positioning them based on the author positioning we determined.
  • Perhaps lose the radar motif.
  • Your thoughts / patches!

understanding libmime using chroniquery and unit tests

[Screenshot: chroniquery-trace-failed-return-value]

Mailnews’ libmime is one of the harder modules to wrap one’s head around.  Eight-letter filenames where the first four letters tend to be “mime”, home-grown glib-style OO rather than actual C++, and intermingling of display logic with parsing logic do not make for a fun development or even comprehension experience.

Running an xpcshell unit test under roc’s chronicle-recorder and then processing it with my chroniquery stuff (repo info), we get a colorful HTML trace of the execution, somewhat broken out (note: pretend the stream_write/stream_completes are interleaved; they are paired).  Specific examples of libmime processing for the inquisitive (there are more if you check the former link though):

The thing that kickstarted this exciting foray is the error pictured in the screenshot above from an earlier run.  The return values are in yellow, and you can see where the error propagates from (the -1 cast to unsigned).  If you look at the HTML file, you will also note that the file stops after the error because the functions bail out as soon as they hit an error.

However, our actual problem and driving reason for the unit tests is the JS emitter choking on multipart/related in writeBody (which it receives by way of ‘output_fn’).  Why?  Why JS Emitter?!  (You can open the links and control-F along at home!)

  • We look at the stream_write trace for our multipart/related.  That’s odd, no ‘output_fn’ in there.

[Screenshot: chroniquery-stream_complete-output_fn]

  • We look at the stream_complete trace for the multipart/related.  There’s our ‘output_fn’!  And it’s got some weird HTML processing friends happening.  That must be to transform links to the related content into something our docshell can render.  This also explains why this processing is happening in stream_complete rather than stream_write… it must have buffered up the content so it could make sure it had seen all the ‘related’ documents and their Content-IDs so it could know how to transform the links.
  • Uh oh… that deferred processing might be doing something bad, since our consumer receives a stream of events.  We had to do something special for SMIME for just such a reason…
  • We check stream_complete for ‘mimeEmitterAddHeaderField’ calls, which the JS emitter keys off of to know what part is currently being processed and what type of body (if any) it should expect to receive.  Uh-oh, none in here.

[Screenshot: chroniquery-stream_write-addheaderfield]

  • We check stream_write for ‘mimeEmitterAddHeaderField’ calls, specifically with a “Content-Type” field.  And we find them.  The bad news is that they are apparently emitted as the initial streaming happens.  So we see the content-type for our “text/html”, then our “image/png”.  So when stream_complete happens, the last thing our JS emitter will have seen is “image/png” and it will not be expecting a body write.  (It will think that the text/html had no content whatsoever.)
  • Khaaaaaaaaaaaaaaaaaaaaaaaaan!

In summary, unit tests and execution tracing working together with pretty colors have helped us track down an annoying problem without going insane.  (libmime is a lot to hold in your head if you don’t hack on it every day.  Also, straight debugger breakpoint fun generally requires you to try to formulate and hold a complex mental model… and that’s assuming you don’t go insane from manually stepping aboot and/or are lucky with your choices of where you put your breakpoints.)  The more important thing is that next time I want to refresh my understanding of what libmime is up to, I have traces already available.  (And the mechanics to generate new ones easily.  But not particularly quickly.  chronicle and my trace-generating mechanism be mad slow, yo.  It may be much faster in the future to use the hopefully-arriving-soon archer-gdb python-driven inferior support, even if it can’t be as clever about function-call detection without cramming int 0x3’s all over the place.)

Useful files for anyone trying to duplicate my efforts: my ~/.chroniquery.cfg for this run, the unit test as it existed, and the command-line args were: trace -f mime_display_stream_write -f mime_display_stream_complete -c -H /tmp/trace3/trace.html --file-per-func-invoc

LogSploder, logsploding its way to your logs soon! also, logsplosion!

[Screenshot: LogSploder with gloda]

In our last logging adventure, we hooked Log4Moz up to Chainsaw.  As great as Chainsaw is, it did not meet all of my needs, not least of which was easy redistribution.  So I present another project in a long line of fantastically named projects… LogSploder!

The general setup is this:

  • log4moz with a concept of “contexts”, a change in logging function argument expectations (think Firebug’s console.log), a JSON formatter that knows to send the contexts over the wire as JSON rather than stringifying them, plus our SocketAppender from the Chainsaw fun.  (A hedged sketch of what logging against this might look like follows the list.)  The JSONed message representations get sent to…
  • LogSploder (a XULRunner app) listening on localhost.  It is currently context-centric, binning all log messages based on their context.  The contexts (and their state transitions) are tracked and visualized (using the still-quite-hacky visophyte-js).  Clicking on a context displays the list of log messages associated with that context and their timestamps.  We really should also display any other metadata hiding in the context, but we don’t.  (Although the visualization does grab stuff out of there for the dubious coloring choices.)
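
Here is that sketch.  The getLogger/addAppender/Level pieces are stock Log4Moz; the SocketAppender and JSONFormatter constructors, the port number, and the context-as-trailing-argument convention are my guesses at the modified API, not its verified signatures.

// Illustrative only: SocketAppender/JSONFormatter signatures, the port
// number, and the context-argument convention are assumptions about the
// modified log4moz described above.
Components.utils.import("resource://app/modules/log4moz.js"); // path is a guess

let log = Log4Moz.repository.getLogger("gloda.datastore");
log.level = Log4Moz.Level.Debug;
// Send JSON-formatted records to LogSploder listening on localhost.
log.addAppender(new Log4Moz.SocketAppender("localhost", 9363,
                                           new Log4Moz.JSONFormatter()));

// A "context" object rides along with each message, console.log-style, and
// is what LogSploder bins on.
let queryContext = {type: "gloda-query", folder: "Inbox", state: "started"};
log.debug("issuing fulltext query for 'gloda'", queryContext);
// ... later ...
queryContext.state = "completed";
log.debug("query completed", queryContext);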

So, why, and what are we looking at?

When developing/using Thunderbird’s exciting new prototype message/contact/etc views, it became obvious that performance was not all that it could be.  As we all know, the proper way to optimize performance is to figure out what’s taking up the most time.  And the proper way to figure that out is to write a new tool from near-scratch.  We are interested in both comprehension of what is actually happening as well as a mechanism for performance tracking.

The screenshot above shows the result of issuing a gloda query constrained to one of my Inbox folders with a fulltext search for “gloda” *before any optimization*.  (We already have multiple optimizations in place!)  The pinkish fills with greenish borders are our XBL result bindings, the blue-ish fills with more obviously blue borders are message streaming requests, and everything else (with a grey border and varying colors) is a gloda database query.  The white bar in the middle of the display is an XBL context I hovered over and clicked on.

The brighter colored vertical bars inside the rectangles are markers for state changes in the context.  The bright red markers are the most significant, they are state changes we logged before large blocks of code in the XBL that we presumed might be expensive.  And boy howdy, do they look expensive!  The first/top XBL bar (which ends up creating a whole bunch of other XBL bindings which result in lots of gloda queries) ties up the event thread for several seconds (from the red-bar to the end of the box).  The one I hovered over likewise ties things up from its red bar until the green bar several seconds later.

Now, I should point out that the heavy lifting of the database queries actually happens on a background thread, and without instrumentation of that mechanism, it’s hard for us to know when they are active or actually complete.  (We can only see the results when the main thread’s event queue is draining, and only remotely accurately when it’s not backlogged.)  But just from the visualization we can see that at the very least the first XBL dude is not being efficient with its queries.  The second expensive one (the hovered one) appears to be chewing up processor cycles without much help from background processes.  (There is one recent gloda query, but we know it to be cheap.  The message stream requests may have some impact since mailnews’ IMAP code is multi-threaded, although they really should only be happening on the main thread (might not be, though!).  Since the query was against one folder, we know that there is no mailbox reparse happening.)

Er, so, I doubt anyone actually cares about what was inefficient, so I’ll stop now.  My point is mainly that even with the incredibly ugly visualization and what not, this tool is already quite useful.  It’s hard to tell that just from a screenshot, since a lot of the power is being able to click on a bar and see the log messages from that context.  There’s obviously a lot to do.  Probably one of the lower-hanging pieces of fruit is to display context causality and/or ownership.  Unfortunately this requires explicit state passing or use of a shared execution mechanism; the trick of using thread-locals that log4j gets to use for its nested diagnostic contexts is simply not an option for us.

Thunderbird and gloda go to meme-town

Sure, a word cloud of your blog posts is cool… but what if you could take any search of your e-mail, and turn that into a word cloud?  And then, if you click on one of those words, your search constraints would be revised to use the word you clicked on (and you’d get a useful search result, not another word cloud)?  And what if that layout algorithm were not as good as wordle?  The future is now, people!  (At least if you install like 5 extra extensions out of mercurial.)

The screenshot above is from Thunderbird trunk with a hacked exptoolbar extension (generalized, committed changes happening soon), visophyte-js, and the new glodacloud extension.  It is a proof-of-easy-gloda-extensions as suggested by David Ascher.

The layout algorithm is what we in the business of making up terminology call a recursive sub-optimal tic-tac-toe subdivision thinger.  We under-use a neat (and somewhat slow) hack to find the bounds of the words through use of canvas.mozPathText and canvas.isPointInPath to sample a grid to know where the text is and isn’t.  It’s under-used because all we use it for right now is to find the actual height above the baseline that the text stretches to (because metrics only gives us the width).  We are lazy and don’t check below the baseline at all, and totally squander our chance to be cool and put small words in the gaps in larger words.  But given the amount of time spent, I’m very happy.
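
Roughly, the measuring hack looks something like the following sketch.  It assumes mozPathText lays the word down with its baseline at the current origin and that a coarse sampling grid is good enough; it is not the glodacloud extension’s actual code.

// Sketch of the bounds hack described above: use Mozilla's (non-standard,
// Firefox 3.5-era) canvas.mozPathText + isPointInPath to grid-sample where a
// word actually has ink above its baseline.  Not the extension's real code.
function measureAscent(ctx, word, fontSize) {
  ctx.mozTextStyle = fontSize + "px sans-serif"; // font used by the moz* text APIs
  let width = ctx.mozMeasureText(word);          // metrics only give us the width
  ctx.beginPath();
  // Assumption: mozPathText adds the word's outline with its baseline at the
  // current origin; translate first if a different placement is needed.
  ctx.mozPathText(word);
  let step = 2;                                  // sampling grid granularity, px
  for (let y = -fontSize; y < 0; y += step) {    // scan down toward the baseline
    for (let x = 0; x <= width; x += step) {
      if (ctx.isPointInPath(x, y))               // first row with ink is the top
        return -y;                               // ascent above the baseline, px
    }
  }
  return 0;                                      // found no ink above the baseline
}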

Oh, and of course it uses JS and Canvas.

I’ll be wanting that latte machine now…

in context

credit where credit is due:

  • thread arcs a la the nice people at the IBM CUE group
  • the search view prototype is implemented by David Ascher.  the positioning of the visualization is on me as a quick hack, though.
  • the search view prototype is designed by Bryan Clark, and he has even better stuff on the way

The actual implementation is a first step of adapting knowledge from my python “visophyte” library to a JS implementation using canvas.  I am trying a more batch-oriented style of processing this time that uses explicit attributes for value-passing between logic blocks.  This is in comparison to the python implementation which is more functional in nature.  We’ll see how it turns out.

Thunderbird full-text search prototype a la SQLite FTS3

Full-text search using FTS3.

Full-text search with a contact constraint.

The global database SQLite file resulting from indexing all of mozilla.dev.apps.thunderbird is about 13M for something like 4500 messages.  We’re providing FTS3 with the bodies (but not attachments!) of all the newsgroup messages and the subjects of the messages which initiate new threads.  For real usage, we will also need to index the subjects of each message.

Note that the message bodies have not been processed at all by the Thunderbird/gloda code before handing them off to FTS3.  So quoted messages get indexed even though it’s a lot of excess data.  We’re relying on FTS3 to handle stop words, etc.  FTS3’s Porter stemming/tokenization is in use.
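
For the curious, here is a minimal sketch of the SQLite FTS3 mechanics involved, expressed via mozStorage.  The file, table, and column names are illustrative, not gloda’s actual schema; gloda’s real schema creation and querying go through its datastore module.

// Illustrative only: table/column/file names are made up, not gloda's schema.
// Shows the FTS3 pieces in play: a porter-tokenized virtual table, insertion
// of message text, and a fulltext MATCH query.
let Cc = Components.classes, Ci = Components.interfaces;
let dbFile = Cc["@mozilla.org/file/directory_service;1"]
               .getService(Ci.nsIProperties)
               .get("ProfD", Ci.nsIFile);
dbFile.append("fts3-demo.sqlite");              // hypothetical database file
let conn = Cc["@mozilla.org/storage/service;1"]
             .getService(Ci.mozIStorageService)
             .openDatabase(dbFile);

// Porter stemming/tokenization is requested when the virtual table is created.
conn.executeSimpleSQL(
  "CREATE VIRTUAL TABLE messagesText USING fts3(subject, body, tokenize=porter)");

// Index one message's subject and body (bodies handed over unprocessed).
let messageId = 1, subject = "gloda full-text search",
    body = "example body text that mentions gloda";
let ins = conn.createStatement(
  "INSERT INTO messagesText (rowid, subject, body) VALUES (?1, ?2, ?3)");
ins.bindInt64Parameter(0, messageId);
ins.bindStringParameter(1, subject);
ins.bindStringParameter(2, body);
ins.execute();
ins.finalize();

// A fulltext query; MATCH searches every FTS3 column of the table.
let query = conn.createStatement(
  "SELECT rowid FROM messagesText WHERE messagesText MATCH ?1");
query.bindStringParameter(0, "gloda");
while (query.executeStep())
  dump("hit message id: " + query.getInt64(0) + "\n");
query.finalize();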