Thunderbird ĝLȬdÅ full text search tokenizer now supports accent folding, non-ASCII case-folding, and more!

Thanks to the efforts of Makoto Kato (jp-blog en-blog twitter) whom you may remember as the bringer of CJK bi-gram tokenization, I have just pushed tokenizer support for case-folding, accent-folding, and voiced sound mark magic to comm-central.  The net result is that searches for “ĝLȬdÅ” will now find “gloda” and vice versa.  Less crazy searches should also be improved.
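To give a rough feel for what folding means, here is a minimal JavaScript sketch. It is not the tokenizer code that was pushed; `foldForSearch` is a made-up name, and the sketch makes no attempt at the voiced sound mark handling. It decomposes each character, drops the combining marks, and lower-cases what is left.

```javascript
// Illustrative only: approximate accent folding plus case folding.
// Assumes a JS engine with Unicode property escapes (\p{M}).
function foldForSearch(text) {
  return text
    .normalize("NFD")          // split base letters from combining marks
    .replace(/\p{M}+/gu, "")   // drop the combining marks (accents)
    .toLowerCase();            // simple case folding
}

console.log(foldForSearch("ĝLȬdÅ"));
```

Folding both the indexed text and the query the same way is what lets either spelling find the other.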

Starting tomorrow, Thunderbird (in the guise of Lanikai) 3.1 nightlies will have this capability and it will also show up in the next beta (beta 2).  No action on your part is required; when you start running a build with this capability your existing gloda database will get blown away and indexing will start from scratch.  If you go back to an older version for some reason after having used the updated build, the old build will also decide to blow away your database, so try to avoid making a habit of switching between them.

We also have some significant performance enhancements related to gloda search and indexing slated for beta 2, not to mention* usability enhancements to avoid the dreaded gloda search/quick search toggle dance.

* This expression is weird given that I go on to mention the allegedly unmentioned, but it still makes more sense than “all but X”.

So’s your facet: Faceted global search for Mozilla Thunderbird

faceting-gloda-hover-davida-1

Following in the footsteps of the MIT SIMILE project’s Exhibit tool (originally authored by David Huynh) and Thunderbird Seek extension (again by David Huynh), we are hoping to land faceted global search for Thunderbird 3.0 (a la gloda) in beta 4.

I think it’s important to point out how ridiculously awesome the Seek extension is.  It is the only example of faceted browsing or search in an e-mail client that I am aware of.  (Note: I have to assume there are some research e-mail clients out there with faceting, but I haven’t seen them.)  Given the data model available to extensions in Thunderbird 2.0 and the idiosyncratic architecture of the UI code in 2.0, it’s not only a feature marvel but also a technical marvel.

Unfortunately, there was only so much Seek could do before it hit a wall given the limitations it had to work with.  Thunderbird 2.0’s per-folder indices are just that, per-folder.  They also require (fast) O(n) search on any attribute other than their unique key.  Although Seek populated an in-memory index for each folder, it was faced with having to implement its own global indexer and persistent database.

Gloda is now at a point where a global database should no longer be the limiting factor for extensions, or the core Thunderbird experience…

faceting-gloda-action-tag-hover-bienvenu-1

The screenshots are of a fulltext search for “gloda” in my message store.  The first screenshot is without any facets applied and me hovering over one of David Ascher’s e-mail addresses.  The second is after having selected the “!action” tag and hovering over one of David Bienvenu’s e-mail addresses.  Gloda has a concept of contact aggregation of identities but owing to a want of UI for this in the address-book right now, it doesn’t happen.  We do not yet coalesce (approximately) duplicate messages, which explains any apparent duplicates you see.

The current state of things is a result of development effort by David Ascher and me with design input from Bryan Clark and Andreas Nilsson (with hopefully much more to come soon :).  Although we aren’t using much code from our previous exptoolbar efforts, a lot of the thinking is based on the work David, Bryan, and I did on that.  Many thanks to Kent James, Siddharth Agarwal, and David Bienvenu for their recent and ongoing improvements to the gloda (and mailnews) back-end which help make this hopefully compelling UI feature actually usable through efficient and comprehensive indexing that does not make you want to throw your computer through a window.

If you use Linux or OS X, I just linked you to try server builds.  The Windows try server was sadly on fire and so couldn’t attend the build party.  The bug tracking the enhancement is bug 474711 and has repository info if you want to spin your own build.  New try server builds will also be noted there.  Please keep in mind that this is an in-progress development effort; it is not finished, and there are bugs.  Accordingly, please direct any feedback/discussion to the dev-apps-thunderbird list / newsgroup rather than the bug.  Please beware that increases in awesomeness require that your gloda database be automatically blown away if you try the new version.  And first you have to turn gloda on if you have not already.

thunderbird, gloda, exptoolbar, protovis, paninaro, oh oh oh

exptoolbar-protovis-gloda-256

Thunderbird.  With the global database, gloda.  Using the exptoolbar extension.  Using the protovis javascript visualization library.  For reals!  Not a prank!  Just grab the most recent XPI or grab the repo.  And be using a nightly (beta 2 might work?)

What you are looking at:

  • The exptoolbar search results page, augmented with a visualization.
  • Each conversation with search results gets its own wedge.
    • Wedges can be distinguished because of the alternating background colors.
    • Conversations that you sent a message to will have a red shading to them.  The examples may be somewhat misleading because the account where a lot of my sent mail ends up is not part of the profile used to create the screenshots.
  • Each message is placed in its conversation wedge…
    • The radius is based on the ‘age’ of the message using a log-ish scale.  Interpolation is actually linear within each level (one day, one week, one month, three months, one year, five years, ‘forever’).
    • The angular placement within the wedge is based on the author of the message.  Across all wedges the placement is the same.  This helps ‘bursty’ parts of conversations (which are extremely likely) be made more obvious, while also helping to provide some understanding of conversation dynamics.
  • Message shapes are determined by whether the message is starred (diamond), sent by a ‘popular’ contact (circle), or an unpopular one (cross).  The use of popularity is a temporary measure because current gloda in trunk does not cache address-book lookups, and they are expensive.  Once the new gloda search code lands with those changes, we can rely on the existence of an address book entry.  (Starring a contact using the new message reader adds them to your address book.)
  • Message opacity is determined by whether the message is a ‘hit’ or not.  All messages in a conversation are eventually retrieved, though initially we only have the hits.
  • Message color is determined by applied tags (using the closest tango color for the first tag), or whether the message is starred (closest tango color to yellow, where I think I had removed the yellow tango colors for some unknown reason, so we get green I guess).  It’s grey if the message has no tag or star.
  • The subject of the conversation is displayed in the wedge.
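The placement rules above can be sketched roughly as follows. This is not the exptoolbar code. The age levels come from the list above, but giving every level an equal-width ring, capping ‘forever’ at 20 years, and the names `ageToRadius` and `authorAngle` are all assumptions made for illustration.

```javascript
// Upper bound of each age level, in days (levels taken from the post).
const AGE_LEVELS_DAYS = [1, 7, 30, 90, 365, 5 * 365];
// Hypothetical cap for the open-ended 'forever' level.
const FOREVER_DAYS = 20 * 365;

// Map a message age to a radius: each level gets an equal-width ring
// (an assumption), and we interpolate linearly inside the ring.
function ageToRadius(ageDays, maxRadius) {
  const bounds = [...AGE_LEVELS_DAYS, FOREVER_DAYS];
  const ringWidth = maxRadius / bounds.length;
  const age = Math.max(0, ageDays);
  let lower = 0;
  for (let i = 0; i < bounds.length; i++) {
    const upper = bounds[i];
    if (age <= upper) {
      const t = (age - lower) / (upper - lower);
      return (i + t) * ringWidth;
    }
    lower = upper;
  }
  return maxRadius; // older than the cap: pin to the outer rim
}

// Give each author the same relative slot in every wedge, so the same
// person lines up across conversations.
function authorAngle(wedgeStart, wedgeSpan, authorIndex, authorCount) {
  return wedgeStart + wedgeSpan * (authorIndex + 0.5) / authorCount;
}

console.log(ageToRadius(4, 70), authorAngle(0, 1, 0, 2));
```

The net effect is a scale that behaves logarithmically across levels while staying simple to reason about inside each one.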

exptoolbar-protovis-seek-thunderbird-256

Things that are happy:

Things that are sad (aka caveats):

  • It would probably be better if the visualization was not radar-inspired.  Besides the perceptual reasons, the subjects are harder to read than they would be in an equivalent linear-styled visualization.
  • The visualization is not interactive.  protovis officially has no interaction support yet, but if you look in the (only available minified?) source, it’s almost there.  It might be entirely there, but it didn’t work for me immediately after a quick reading of the (indented) source.
  • There is some low probability failure that occurs during the visualization updating as gloda backfills the message collections.  If it happens on the last update, you can end up with a half-built visualization.  Re-running the search will generally resolve the issue.
  • The visualization does a pretty solid job of taking up all the screen real estate and has no way to be disabled, so you have to scroll past it every time.

Future work:

  • Interactivity.
  • Perhaps showing the gravatars for the people involved in a conversation at the outer rim of the wedge, positioning them based on the author positioning we determined.
  • Perhaps lose the radar motif.
  • Your thoughts / patches!

support your neighborhood Thunderbird global database

Do you know JavaScript?  Would you like to help improve Thunderbird and its exciting global database, gloda?  Now is your chance!  Check out these exciting bugs that are reasonably sized and independent tasks:

Exciting? Exciting!

I don’t know much about psychology, but I have heard that people on the internet see a call-to-arms like this and say “I’m sure someone else better qualified will step up, maybe even hundreds of them… I’ll just let them take care of it.”  I have news for you: people on the internet are lazy!  Oh, so lazy!  (I am reasonably confident that won’t happen.  If it does happen, I will find enough work for everyone to do while I retire to a life of luxury funded by my ability to inexplicably motivate large swathes of the internet to do my bidding.)

Important steps!

  1. Get yourself a copy of the comm-central codebase.
  2. Build thunderbird! (Actually, that above link covers it, but you might also want to check out the general building info page.)
  3. Dance a victory jig!
  4. Leave a note on one of those bugs saying that you are interested.  Or just e-mail me at asuth@mozillamessaging.com!

Thunderbird and gloda go to meme-town

Sure, a word cloud of your blog posts is cool… but what if you could take any search of your e-mail, and turn that into a word cloud?  And then, if you click on one of those words, your search constraints would be revised to use the word you clicked on (and you’d get a useful search result, not another word cloud)?  And what if that layout algorithm were not as good as wordle?  The future is now, people!  (At least if you install like 5 extra extensions out of mercurial.)

The screenshot above is from Thunderbird trunk with a hacked exptoolbar extension (generalized, committed changes happening soon), visophyte-js, and the new glodacloud extension.  It is a proof-of-easy-gloda-extensions as suggested by David Ascher.

The layout algorithm is what we in the business of making up terminology call a recursive sub-optimal tic-tac-toe subdivision thinger.  We under-use a neat (and somewhat slow) hack to find the bounds of the words through use of canvas.mozPathText and canvas.isPointInPath to sample a grid to know where the text is and isn’t.  It’s under-used because all we use it for right now is to find the actual height above the baseline that the text stretches to (because metrics only gives us the width).  We are lazy and don’t check below the baseline at all, and totally squander our chance to be cool and put small words in the gaps in larger words.  But given the amount of time spent, I’m very happy.
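The grid-sampling idea for finding how far text reaches above its baseline can be sketched like this. It is a stand-in rather than the glodacloud code: `measureAscent` is a made-up name, and the hit test is passed in as a plain function so the logic can run anywhere. In a page you would hand it something like `(x, y) => ctx.isPointInPath(x, y)` after building the text path.

```javascript
// Scan a grid from the top down and report the first row (as a height
// above the baseline) where any sample point falls inside the shape.
// Coordinates: baseline at y = 0, "up" is negative y, as on a canvas.
function measureAscent(isInside, width, maxHeight, step) {
  for (let up = maxHeight; up > 0; up -= step) {
    for (let x = 0; x <= width; x += step) {
      if (isInside(x, -up)) {
        return up;
      }
    }
  }
  return 0; // nothing found above the baseline
}

// Fake "glyph": a box 12 units tall sitting on the baseline.
const boxGlyph = (x, y) => x >= 2 && x <= 8 && y >= -12 && y <= 0;
console.log(measureAscent(boxGlyph, 10, 20, 1));
```

A coarser `step` is faster but can miss thin strokes, which is the usual trade-off with sampling hacks like this.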

Oh, and of course it uses JS and Canvas.

I’ll be wanting that latte machine now…

in context

credit where credit is due:

  • thread arcs a la the nice people at the IBM CUE group
  • the search view prototype is implemented by David Ascher.  the positioning of the visualization is on me as a quick hack, though.
  • the search view prototype is designed by Bryan Clark, and he has even better stuff on the way

The actual implementation is a first step of adapting knowledge from my python “visophyte” library to a JS implementation using canvas.  I am trying a more batch-oriented style of processing this time that uses explicit attributes for value-passing between logic blocks.  This is in comparison to the python implementation which is more functional in nature.  We’ll see how it turns out.

Thunderbird full-text search prototype a la SQLite FTS3

Full-text search using FTS3.

Full-text search with a contact constraint.

The global database sqlite file resulting from indexing all of mozilla.dev.apps.thunderbird is about 13M for something like 4500 messages.  We’re providing FTS3 with the bodies (but not attachments!) of all the newsgroup messages and the subjects of the messages which initiate new threads.  For real usage, we will need to also index the subjects of each message.

Note that the message bodies have not been processed at all by the Thunderbird/gloda code before handing them off to FTS3.  So quoted messages get indexed even though it’s a lot of excess data.  We’re relying on FTS3 to do all stop-words, etc.  FTS3’s Porter stemming/tokenization is in use.

Thunderbird contact auto-completion… with bubbles!

Autocompletion screenshot

Type type type type.  Autocomplete contact…

Completed contact becomes a bubble!  Bubble becomes a constraint, showing us only the messages involving the given contact.  (The idea is that you could then click on/select/whatever the bubble and change the constraint to be only to/from/cc/whatever if you are so inclined.)

Type type type, autocomplete, new constraint!  Now we’re looking at all the messages involving the two given contacts.  (Some of the messages with just one constraint were mailing list postings, but not explicitly involving the second contact.  This listing shows only messages where both contacts were directly involved.  We will have the ability to filter-out messages involving lists as desired, which may be desired by default in a case like this.)
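Conceptually, stacking constraints is just an "every contact must be directly involved" filter. Here is a tiny sketch; the real filtering is done by gloda queries, and the `involves` field and `filterByContacts` name are invented for the example.

```javascript
// Keep only messages that directly involve every constraint contact.
// Each message is assumed to carry an `involves` array of addresses.
function filterByContacts(messages, constraintContacts) {
  return messages.filter(msg =>
    constraintContacts.every(contact => msg.involves.includes(contact)));
}

const sampleMessages = [
  { id: 1, involves: ["a@example.com", "list@example.com"] },
  { id: 2, involves: ["a@example.com", "b@example.com"] },
];
console.log(filterByContacts(sampleMessages, ["a@example.com", "b@example.com"]));
```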

What is exciting about this?

  • The contacts are matched using a suffix-tree implementation on a reduced set of contacts (as a first-pass).  In this case, those with sufficient ‘popularity’.  ‘Frecency’ a la ‘places’ is also planned.  And of course, we can hit the database as needed.  The suffix-tree is nice because it allows extremely rapid lookups while also allowing for substring matching.
  • The contact popularity is computed automatically by the gloda indexing process, taking into account both messages you receive and send.  (I think the current address-book code just increments popularity on send?)
  • I think the bubbles are cool.  (Hyperlink-styling would also work, but would not be cool.)
  • Having the text converted into an explicit object representation (bubbles) is better than just doing string filtering (as quicksearch does) because it allows explicit actions on the object given knowledge of the object type.
  • We can convert more than just contacts/identities to explicit objects.  As demonstrated at the summit, we have a plugin that detects bugzilla bug references in messages as well as (American/NANP-style) phone-numbers in messages.  We could detect these and promote them as well, etc.
  • The filtered messages are being delivered by gloda, the global database (backed by SQLite), which means that we aren’t searching just one folder.
  • There are a lot of places that you, the reader, will shortly be able to hack on and contribute to make this even more exciting.  A vicious cycle of exciting-ness will ensue until everyone is dancing in the streets.
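For the curious, here is roughly why a suffix structure gives you substring matching for free. This is a naive suffix trie, not the compact suffix-tree the prototype uses, and `SuffixTrie` and its methods are names made up for the sketch. Every suffix of every contact name is inserted, so any substring of a name is a path from the root.

```javascript
// Naive suffix trie: O(n^2) space per name, fine for a small set of
// popular contacts, and only meant to illustrate the lookup idea.
class SuffixTrie {
  constructor() {
    this.root = { children: new Map(), items: new Set() };
  }

  // Index `item` under every suffix of `key` (case-insensitively).
  add(key, item) {
    const text = key.toLowerCase();
    for (let start = 0; start < text.length; start++) {
      let node = this.root;
      for (let i = start; i < text.length; i++) {
        const ch = text[i];
        let child = node.children.get(ch);
        if (!child) {
          child = { children: new Map(), items: new Set() };
          node.children.set(ch, child);
        }
        child.items.add(item);
        node = child;
      }
    }
  }

  // Return every item whose key contains `query` as a substring.
  find(query) {
    const text = query.toLowerCase();
    let node = this.root;
    for (let i = 0; i < text.length; i++) {
      node = node.children.get(text[i]);
      if (!node) {
        return [];
      }
    }
    return [...node.items];
  }
}

const contacts = new SuffixTrie();
for (const name of ["David Ascher", "David Bienvenu", "Bryan Clark"]) {
  contacts.add(name, name);
}
console.log(contacts.find("avid"));
```

Lookup cost depends only on the length of what you typed, not on how many contacts are indexed, which is what makes it pleasant for as-you-type completion.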

gloda’s first (primitive) visualization

Author activity over time, current thread in blue, selected message in darkest blue.

A primitive visualization augments the gloda “other messages by author” listing by showing the messages sent by the author over time.  Messages are stacked by day.  The currently selected message is in darkest blue and also very wide.  Other messages from the same thread/conversation are in lighter blue and less wide.  Messages not in the conversation are light grey and rather narrow.
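The layout logic amounts to bucketing by day, stacking within each bucket, and picking one of three styles. A sketch of that follows; it is not the expmess code, and the message fields (`id`, `date` in milliseconds, `conversationId`), the UTC day bucketing, and the `layoutAuthorActivity` name are assumptions.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// For each message, work out which day column it lands in, how high it
// sits in that column's stack, and which of the three styles it gets.
function layoutAuthorActivity(messages, selectedId) {
  const selected = messages.find(m => m.id === selectedId);
  const days = messages.map(m => Math.floor(m.date / MS_PER_DAY));
  const firstDay = Math.min(...days);
  const stackHeights = new Map();
  return messages.map((m, i) => {
    const day = days[i] - firstDay;
    const level = stackHeights.get(day) || 0;
    stackHeights.set(day, level + 1);
    let kind = "other";                 // light grey, narrow
    if (m.id === selectedId) {
      kind = "selected";                // darkest blue, widest
    } else if (selected && m.conversationId === selected.conversationId) {
      kind = "conversation";            // lighter blue, less wide
    }
    return { id: m.id, day, level, kind };
  });
}

console.log(layoutAuthorActivity([
  { id: "a", date: 0, conversationId: 1 },
  { id: "b", date: 1000, conversationId: 1 },
  { id: "c", date: 2 * MS_PER_DAY, conversationId: 2 },
], "a"));
```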

It’s not clickable, it lacks any form of scale or any feedback at all, and there are scaling issues.  (If anyone wants to save me the effort of figuring out how to get the canvas to maintain a 1:1 pixel mapping to the actual display and still ‘flex’ by adding/losing pixels, please do drop me a message or leave a comment.)  These will all change, but not yet.

I’ve pushed the changes to the mercurial repos and updated the stable tag, but I’m not publishing updated xpi’s, so you’ll need to roll your own if you care.  (The DB schema has not changed and so does not need to be blown away.)

gloda milestone 1

gloda m1 getting its indexing on

I am declaring milestone 1 of gloda (the global database extension for Thunderbird 3.x) / expmess (the experimental message view extension for Thunderbird 3.x) reached.  Milestone 1 basically consists of:

  • It statically indexes all of your folders.  It does not track changes made to your mailboxes.  It will become confused and angry as time goes on and your message stores change but it stays the same.  Thankfully, it is also passive aggressive and will merely stop doing useful things rather than trying to eat your data.  It also refuses to change its ways; if you try and trick it into indexing a message it has already processed but you moved, it will not update the message’s index.  You can, however, trick it into indexing new messages.
  • The indexing sorta happens in the background and has pretty, if dubious from a UX perspective, progress bars (see screenshot).  This was stolen from M3, making M1 wildly more usable than originally planned.  At least on my computer, I didn’t notice much performance impact from the indexing, but my system is arguably fairly beefy.  This can all be tweaked though, especially once we hook the nsIIdleService in.
  • It adds a “data mine” pane to the right side of the message window.  It has a splitter so you can hide it if you want, but you can never be rid of it.  The data mine shows you the other messages in the current thread and other messages sent by the author… globally!
  • If you double-click on a message in the “data mine” added by expmess, it will take you there!  This is stolen from M2.
  • It will print out a lot of debug on standard out.  It used to print more.

Having said all that, you can get the XPI’s here if you are using Thunderbird/Shredder 3.0a2pre or later, and your build is from July 5th 2008 or later.  You need to install both of them if you want anything interesting to happen.  The easiest way to do this is go to “Tools”…”Add-ons” in Shredder, and drag the links into the add-ons pane, at which point it will prompt you and such.  These extensions will not auto-update.

And the code (in mercurial) is here:

Because of the static indexing, you will probably want to install this extension, mention something about needing to wear sunglasses because of the brightness of the future, and then uninstall it.

Un-installation consists of:

  1. Disable / remove the gloda and expmess extensions.
  2. Delete the global-messages-db.sqlite file from your profile directory.  Or don’t.  It’s up to you.

I’ll be following this post up with a newsgroup post on mozilla.dev.apps.thunderbird on Monday with more details about planning out the rest of the milestones, as well as the arbitrary changes I had made to my (always tentative) milestone 1 plan.  Discussion about the global database is probably best directed to the newsgroup, but feel free to post comments here if you want to.