Thunderbird full-text search prototype a la SQLite FTS3

Full-text search using FTS3.

Full-text search with a contact constraint.

The global database SQLite file that results from indexing all of mozilla.dev.apps.thunderbird is about 13 MB for roughly 4,500 messages.  We’re providing FTS3 with the bodies (but not attachments!) of all the newsgroup messages, plus the subjects of only those messages which initiate new threads.  For real usage, we will also need to index the subject of every message.

Note that the Thunderbird/gloda code does not process the message bodies at all before handing them off to FTS3, so quoted text gets indexed even though it’s a lot of excess data.  We’re relying on FTS3 to handle stop-words and the like.  FTS3’s Porter stemming tokenizer is in use.
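As a rough sketch of the setup described above, here is how an FTS3 table with the Porter tokenizer can be created and queried from Python's sqlite3 module.  The table name (message_text), the column names, and the sample messages are assumptions made up for illustration; this is not gloda's actual schema, and it only runs if your SQLite build has FTS3 compiled in.

```python
import sqlite3


def build_index(messages):
    """Create an in-memory FTS3 index using the Porter stemming tokenizer.

    `messages` is an iterable of (docid, subject, body) tuples. The table
    and column names are hypothetical, not gloda's real schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE VIRTUAL TABLE message_text "
        "USING fts3(subject, body, tokenize=porter)"
    )
    conn.executemany(
        "INSERT INTO message_text (docid, subject, body) VALUES (?, ?, ?)",
        messages,
    )
    return conn


def search(conn, query):
    """Return the docids of messages whose subject or body match `query`."""
    rows = conn.execute(
        "SELECT docid FROM message_text "
        "WHERE message_text MATCH ? ORDER BY docid",
        (query,),
    )
    return [row[0] for row in rows]


# Only the thread-starting message carries a subject, mirroring the
# prototype; the reply's quoted line is indexed along with the rest.
sample = [
    (1, "Indexing speed", "Gloda seems fast here."),
    (2, None, "> someone wrote about searching folders\nI agree."),
]
conn = build_index(sample)
print(search(conn, "indexed"))   # stemming lets this match "Indexing"
print(search(conn, "searched"))  # matches the quoted "searching"
```

Because the raw body is stored as-is, the second query hits message 2 purely through its quoted line, which is the excess data mentioned above.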

Thunderbird contact auto-completion… with bubbles!

Autocompletion screenshot

Type type type type.  Autocomplete contact…

Completed contact becomes a bubble!  Bubble becomes a constraint, showing us only the messages involving the given contact.  (The idea is that you could then click on/select/whatever the bubble and change the constraint to be only to/from/cc/whatever if you are so inclined.)

Type type type, autocomplete, new constraint!  Now we’re looking at all the messages involving the two given contacts.  (Some of the messages shown with just one constraint were mailing list postings that did not explicitly involve the second contact.  This listing shows only messages where both contacts were directly involved.  We will have the ability to filter out messages involving lists as desired, which may be desired by default in a case like this.)
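The way stacked bubbles narrow the listing can be sketched as a simple "every constrained contact must be involved" check.  The Message class, its fields, and the sample addresses below are hypothetical stand-ins for illustration, not gloda's object model.

```python
from dataclasses import dataclass


@dataclass
class Message:
    # Hypothetical, simplified message record.
    subject: str
    involves: frozenset      # every contact on from/to/cc
    via_list: bool = False   # True if it arrived through a mailing list


def apply_constraints(messages, contacts, include_lists=True):
    """Keep messages that directly involve every constrained contact."""
    wanted = set(contacts)
    return [
        m for m in messages
        if wanted <= m.involves and (include_lists or not m.via_list)
    ]


inbox = [
    Message("list post", frozenset({"alice@example.com"}), via_list=True),
    Message("direct mail", frozenset({"alice@example.com", "bob@example.com"})),
]

one = apply_constraints(inbox, ["alice@example.com"])
two = apply_constraints(inbox, ["alice@example.com", "bob@example.com"])
print([m.subject for m in one])  # ['list post', 'direct mail']
print([m.subject for m in two])  # ['direct mail']
```

Passing include_lists=False corresponds to the optional filtering of list traffic mentioned above.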

What is exciting about this?

  • The contacts are matched using a suffix-tree implementation on a reduced set of contacts (as a first pass); in this case, those with sufficient ‘popularity’.  ‘Frecency’ a la ‘places’ is also planned.  And of course, we can hit the database as needed.  The suffix-tree is nice because it allows extremely rapid lookups while also allowing for substring matching.
  • The contact popularity is computed automatically by the gloda indexing process, taking into account both messages you receive and send.  (I think the current address-book code just increments popularity on send?)
  • I think the bubbles are cool.  (Hyperlink-styling would also work, but would not be cool.)
  • Having the text converted into an explicit object representation (bubbles) is better than just doing string filtering (as quicksearch does) because it allows explicit actions on the object given knowledge of the object type.
  • We can convert more than just contacts/identities to explicit objects.  As demonstrated at the summit, we have a plugin that detects bugzilla bug references in messages as well as (American/NANP-style) phone numbers.  We could detect these and promote them to bubbles as well.
  • The filtered messages are being delivered by gloda, the global database (backed by SQLite), which means that we aren’t searching just one folder.
  • There are a lot of places that you, the reader, will shortly be able to hack on and contribute to make this even more exciting.  A vicious cycle of exciting-ness will ensue until everyone is dancing in the streets.
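
To make the substring-matching point from the first bullet concrete, here is a minimal sketch that uses a sorted list of suffixes (a simple stand-in for a real suffix tree, not the prototype's implementation) and ranks hits by popularity.  The class name, contact names, and popularity numbers are all made up for illustration.

```python
import bisect


class ContactSuffixIndex:
    """Substring lookup over contact names via a sorted list of suffixes."""

    def __init__(self, contacts):
        # contacts: list of (name, popularity) pairs; hypothetical data.
        self.contacts = list(contacts)
        entries = []
        for idx, (name, _popularity) in enumerate(self.contacts):
            lowered = name.lower()
            for start in range(len(lowered)):
                entries.append((lowered[start:], idx))
        entries.sort()
        self.entries = entries

    def lookup(self, fragment):
        """Return names containing `fragment`, most popular first."""
        fragment = fragment.lower()
        if not fragment:
            return []
        # Every substring is a prefix of some suffix, so all matches sit
        # in one contiguous run of the sorted suffix list.
        pos = bisect.bisect_left(self.entries, (fragment,))
        hits = set()
        while (pos < len(self.entries)
               and self.entries[pos][0].startswith(fragment)):
            hits.add(self.entries[pos][1])
            pos += 1
        ranked = sorted(
            hits,
            key=lambda i: (-self.contacts[i][1], self.contacts[i][0]),
        )
        return [self.contacts[i][0] for i in ranked]


index = ContactSuffixIndex([
    ("Alice Anderson", 5),
    ("Bob Sanders", 30),
    ("Carol Diaz", 12),
])
print(index.lookup("and"))  # ['Bob Sanders', 'Alice Anderson']
print(index.lookup("dia"))  # ['Carol Diaz']
```

Storing every suffix costs memory quadratic in name length, which is one reason to build the index over only a reduced, popular set of contacts and fall back to the database for the rest.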