So’s your facet: Faceted global search for Mozilla Thunderbird

faceting-gloda-hover-davida-1

Following in the footsteps of the MIT SIMILE project’s Exhibit tool (originally authored by David Huynh) and Thunderbird Seek extension (again by David Huynh), we are hoping to land faceted global search for Thunderbird 3.0 (a la gloda) in beta 4.

I think it’s important to point out how ridiculously awesome the Seek extension is.  It is the only example of faceted browsing or search in an e-mail client that I am aware of.  (Note: I have to assume there are some research e-mail clients out there with faceting, but I haven’t seen them.)  Given the data model available to extensions in Thunderbird 2.0 and the idiosyncratic architecture of the UI code in 2.0, it’s not only a feature marvel but also a technical marvel.

Unfortunately, there was only so much Seek could do before it hit a wall given the limitations it had to work with.  Thunderbird 2.0’s per-folder indices are just that, per-folder.  They also require (fast) O(n) search on any attribute other than their unique key.  Although Seek populated an in-memory index for each folder, it was faced with having to implement its own global indexer and persistent database.
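
To make that cost concrete, here is a minimal sketch of the difference between scanning a folder on every lookup and building an in-memory index once.  This is not Seek's code; the message shape and the names `scanFolder` and `buildIndex` are made up for illustration.

```javascript
// Hypothetical message headers for one folder; the field names are assumptions.
const folder = [
  { key: 1, author: "a@example.com", subject: "gloda" },
  { key: 2, author: "b@example.com", subject: "facets" },
  { key: 3, author: "a@example.com", subject: "more gloda" },
];

// Without an index, any lookup on a non-key attribute walks every header: O(n).
function scanFolder(headers, attr, value) {
  return headers.filter((hdr) => hdr[attr] === value);
}

// Build the index once (still O(n)); after that each lookup is a Map probe.
function buildIndex(headers, attr) {
  const index = new Map();
  for (const hdr of headers) {
    if (!index.has(hdr[attr])) {
      index.set(hdr[attr], []);
    }
    index.get(hdr[attr]).push(hdr);
  }
  return index;
}

const byAuthor = buildIndex(folder, "author");
console.log(scanFolder(folder, "author", "a@example.com").length);
console.log(byAuthor.get("a@example.com").map((hdr) => hdr.key));
```

An index like this only lives in memory and only covers one folder, which is exactly the wall: anything cross-folder or persistent needs a global database.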

Gloda is now at a point where a global database should no longer be the limiting factor for extensions, or the core Thunderbird experience…

faceting-gloda-action-tag-hover-bienvenu-1

The screenshots are of a fulltext search for “gloda” in my message store.  The first screenshot is without any facets applied and me hovering over one of David Ascher’s e-mail addresses.  The second is after having selected the “!action” tag and hovering over one of David Bienvenu’s e-mail addresses.  Gloda has a concept of contact aggregation of identities but owing to a want of UI for this in the address-book right now, it doesn’t happen.  We do not yet coalesce (approximately) duplicate messages, which explains any apparent duplicates you see.

The current state of things is the result of development effort by David Ascher and me, with design input from Bryan Clark and Andreas Nilsson (with hopefully much more to come soon :).  Although we aren’t using much code from our previous exptoolbar efforts, a lot of the thinking is based on the work David, Bryan, and I did on that.  Many thanks to Kent James, Siddharth Agarwal, and David Bienvenu for their recent and ongoing improvements to the gloda (and mailnews) back-end, which help make this hopefully compelling UI feature actually usable through efficient and comprehensive indexing that does not make you want to throw your computer through a window.

If you use Linux or OS X, I just linked you to try server builds.  The Windows try server was sadly on fire and so couldn’t attend the build party.  The bug tracking the enhancement is bug 474711 and has repository info if you want to spin your own build.  New try server builds will also be noted there.  Please keep in mind that this is an in-progress development effort; it is not finished, and there are bugs.  Accordingly, please direct any feedback/discussion to the dev-apps-thunderbird list / newsgroup rather than the bug.  Please beware that increases in awesomeness require that your gloda database be automatically blown away if you try the new version.  And first you have to turn gloda on if you have not already.

Using BugXhibit to find that bug you know you saw recently but can’t find

bugxhibit-cc-search

BugXhibit, the Bugzilla search results viewer made with the SIMILE Exhibit widget, is now more fancy, and now addresses another one of my use cases.  I frequently find myself wanting to point someone at a bug, or go back to a bug that I know passed through my bugmail recently, and have trouble finding it.  So now BugXhibit can do easy searches based on reporter/assignee/cc/commenter with time ranges.

Examples by way of live links this time (noting that the default time interval is 7 days).  Uh, and if it gives you an error for reasons I don’t fully understand if you open it in a new tab (in the background) from here, just hitting enter in the address bar should fix it.  I’m going to lazyweb that problem for now.

Other changes:

  • It now is also self documenting, just click on “Show Docs” on the page.
  • You can now use arguments to specify the sort and whether grouping is active on the page.
  • The date parsing is better.  Bugzilla doesn’t provide the raw dates but instead changes how it displays them based on how recent the date is.  BugXhibit does a good job of fixing up the date if you are in the same timezone as the Bugzilla server, and a less good but acceptable job if you aren’t.
  • Upgraded to exhibit 2.1.0 and now the numeric sliders with histograms work for me.  Woo!
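
A rough sketch of the kind of date fix-up described above.  I am assuming three display shapes (a bare time for very recent changes, a weekday plus time for the past week, and a plain date otherwise); the name `fixUpDate` and those shapes are illustrative, not lifted from bugxhibit.js.  Everything is interpreted in the local timezone, which is why this sort of thing only works well when you share a timezone with the server.

```javascript
const DAY_NAMES = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"];

// Turn a display string into a Date, relative to `now`.
// Returns null when the string matches none of the assumed shapes.
function fixUpDate(text, now) {
  let m;
  // Assumed shape 1: "HH:MM:SS" means today, or yesterday if that time
  // has not happened yet today.
  if ((m = /^(\d{1,2}):(\d{2}):(\d{2})$/.exec(text))) {
    const d = new Date(now.getFullYear(), now.getMonth(), now.getDate(),
                       Number(m[1]), Number(m[2]), Number(m[3]));
    if (d > now) {
      d.setDate(d.getDate() - 1);
    }
    return d;
  }
  // Assumed shape 2: "Ddd HH:MM" means the most recent such weekday.
  if ((m = /^([A-Z][a-z]{2}) (\d{1,2}):(\d{2})$/.exec(text))) {
    const wanted = DAY_NAMES.indexOf(m[1]);
    if (wanted === -1) {
      return null;
    }
    const d = new Date(now.getFullYear(), now.getMonth(), now.getDate(),
                       Number(m[2]), Number(m[3]), 0);
    // Step back a day at a time until we hit the wanted weekday, no later than now.
    while (d.getDay() !== wanted || d > now) {
      d.setDate(d.getDate() - 1);
    }
    return d;
  }
  // Assumed shape 3: "YYYY-MM-DD" is already absolute.
  if ((m = /^(\d{4})-(\d{2})-(\d{2})$/.exec(text))) {
    return new Date(Number(m[1]), Number(m[2]) - 1, Number(m[3]));
  }
  return null;
}

const now = new Date(2009, 5, 10, 12, 0, 0); // a fixed "now" for the example
console.log(String(fixUpDate("13:00:00", now)));
console.log(String(fixUpDate("Mon 09:30", now)));
```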

Other notes:

The hg repo is here, as always.

DevMoXhibit: Exhibit on DevMo (Deki Wiki) results

devmo-search-customize-toolbar

The above screenshot is of a normal search query on DevMo for “customize toolbar”.  I see 2.5 results, and I honestly have no interest in the first item at all.  (It’s a page that only advanced DevMo authors would care about, for those who refuse to squint or click on images to see bigger versions of images.)

devmoxhibit-search-customize-toolbar-corrected

The above screenshot is of the same query using DevMoXhibit.  You will note you can see more things, and the first result from the other page is completely elided because we filter by default so that only “Real” result pages are shown.  (In general, I am not looking for talk pages or user pages or meta-pages.)

But enough about my interpretations of pictures, why don’t you:

Neat things we do that may not be immediately obvious:

  • We flatten the score into deciles, and then within each decile range we sort based on the view count for the page.  The theory is that, given equally likely results, the one that more people have looked at is probably more interesting to you, roughly speaking.
  • We use a simple heuristic to figure out the page type, as mentioned above (“Real”, “Talk”, “User”, etc.)
  • We try to hide all things related to the language, as we explicitly query on a language, which means it’s just noise.  Right now, that language is always English, but the code uses a variable if you want to write the code to hook that up and expose it in the UI.
  • We produce a “smart” snippet.  The snippets provided by the search results naively will include “chrome” that is part of the document, which makes for a nearly useless snippet.  For example, take a gander at XUL/toolbar:
    • Plain old snippet:
      • « XUL Reference home    [ Examples | Attributes | Properties | Methods | Related ] A container which typically contains a row of buttons. It is a type of box that defaults to horizontal orientation. …
    • Smart snippet:
      • A container which typically contains a row of buttons. It is a type of box that defaults to horizontal orientation. …
  • We produce a sometimes over-zealous smart snippet.  If you were to keep reading both of those snippets, you would notice that the smart snippet eats a bit that the non-smart-snippet does not.  That is because the smart snippet is based on looking at a version of the snippet which has HTML tags in it, and then it tries to nuke those HTML tags out of existence using simple regexps.
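
For the curious, the decile ranking and the regexp-based snippet cleanup above boil down to something like the following.  This is a sketch and not the actual DevMoXhibit code: the names `decileOf`, `rankResults`, and `stripTags`, the `score` and `views` fields, and the assumption that scores fall in the range 0..1 are all mine.

```javascript
// Assumption: scores are normalized to the range 0..1.
function decileOf(score) {
  // 0.0x -> 0, 0.1x -> 1, ...; a perfect 1.0 is folded into the top bucket.
  return Math.min(9, Math.floor(score * 10));
}

// Sort by decile (best first), then by view count within the same decile.
function rankResults(results) {
  return results.slice().sort((a, b) => {
    const byDecile = decileOf(b.score) - decileOf(a.score);
    return byDecile !== 0 ? byDecile : b.views - a.views;
  });
}

// Crude tag removal with regexps: drop anything that looks like a tag,
// then collapse the leftover whitespace.
function stripTags(html) {
  return html.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();
}

const ranked = rankResults([
  { title: "A", score: 0.93, views: 10 },
  { title: "B", score: 0.97, views: 500 },
  { title: "C", score: 0.41, views: 9000 },
]);
console.log(ranked.map((r) => r.title));
console.log(stripTags("<p>A container which <b>typically</b> contains a row of buttons.</p>"));
```

Note that B beats A despite the two having nearly the same score, and that C's huge view count does not rescue it from a lower decile.  The over-zealousness is also easy to reproduce: feed `stripTags` something like `a < b and c > d` and it will cheerfully eat the middle.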

Implementation notes:

  • This probably should work on other deki wikis if so adapted, but I don’t use any others, so YMMV.
  • We actually issue two search queries because there are two result formats that can be produced.  “xml” is an inexplicable mixture of too much data and too little data.  Namely, it does not tell you the tags on a document, which is basically the most useful piece of info, but it does tell you every link to and from that page (which we expose, although I doubt it will be useful enough to justify it).  It does give you a link to be able to get the tags, but that’s a costly operation when you have to perform it for each search result.  In contrast, “search” gives you the tags; they are only space-delimited, but that’s fine.  (“Inexplicable” may be a bit harsh; looking at the source, it’s just dumping the page info without further processing/lookups, but arguably it would be very useful if they made the effort to fetch that data.)
  • Because of cross-site XHR issues, this is not quite as hackable as I would like.  My demo server above is using mod_proxy (with a very specific constraint) to proxy the search to DevMo.  When I develop locally, I have to do the same thing.  Presumably if you are using Firefox 3.5 and DevMo is set up correctly, then this would not be a problem.  But 1) for no good reason, I only use Firefox 3.0, and 2) I have no clue whether DevMo is emitting the headers that would enable that to work.  I strongly encourage someone to look into #2 and fix it if not.
  • As with BugXhibit, the sliders are totally broken for me and it’s sad, but I left them in there in the hopes that they work for someone, somewhere.  Alternately, I would not complain if someone, somewhere, fixed them.
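
The two-query dance described above amounts to a join on the page.  Here is a minimal sketch, assuming both responses have already been parsed into plain objects; the field names (`id`, `tags`, `inbound`, `outbound`), the sample rows, and the helpers `splitTags` and `mergeResults` are hypothetical and not the real Deki Wiki response shape.

```javascript
// Hypothetical, already-parsed rows from the "xml" format (links, no tags).
const xmlRows = [
  { id: "101", title: "XUL/toolbar", inbound: ["XUL/toolbox"], outbound: ["XUL/toolbarbutton"] },
  { id: "102", title: "Talk:XUL/toolbar", inbound: [], outbound: [] },
];
// Hypothetical, already-parsed rows from the "search" format (space-delimited tags).
const searchRows = [
  { id: "101", tags: "XUL  XUL_Reference" },
  { id: "102", tags: "" },
];

// Split a space-delimited tag string, tolerating repeated or missing spaces.
function splitTags(tagString) {
  return tagString.split(/\s+/).filter((tag) => tag.length > 0);
}

// Attach the tags from the "search" rows to the matching "xml" rows.
function mergeResults(xml, search) {
  const tagsById = new Map(search.map((row) => [row.id, splitTags(row.tags)]));
  return xml.map((row) => ({ ...row, tags: tagsById.get(row.id) || [] }));
}

const merged = mergeResults(xmlRows, searchRows);
console.log(merged.map((page) => page.title + ": " + page.tags.join(", ")));
```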

The hg repo is here.

BugXhibit: Exhibit on Bugzilla results

bugxhibit-timeline

I know it has sorta been done before (found via Bugzilla Fixup Wiki Page on a comment by faaborg), and I feel like there has to be another live version somewhere, but here we are.  BugXhibit is an MIT SIMILE Exhibit widget fronting a bugzilla.mozilla.org quicksearch query.

Click here to go to a BugXhibit page where you can enter your own query.  Enter “gloda” if you want to see what the screenshots are based on.  I feel like it would be improper of me to provide a link with a live query though.

Go visit the hg repo.  Or just download the source from the previous link.  Please improve!  (See the SIMILE Exhibit docs for how to do that.  It’s all really easy.)

bugxhibit-tile-view

Notes:

  • This uses Bugzilla’s ctype=js for buglist.cgi.  It apparently has been around since 2003 (bug)!  And thanks to Gerv!  Perhaps not too surprisingly, the format of the results is not inert JSON but live JS code that builds a would-be-Array where each bug’s info is stored in an array.  What each element in the array stands for cannot be known from the results.  I find that using ctype=csv is a good way to get the headers.  Rather than doing that every time (cost concerns on the redundant query), I did it once for columnlist=all (which we always use) and stashed it in bugxhibit.js.  This is dangerous because it is brittle; when I tried to use bugxhibit against a saved search someone made public, I got far fewer columns (despite columnlist=all), and things just didn’t match.  Not to mention there is a “cf_blocking_fennec” flag in there that I feel like should not be there.
  • It looks pretty easy to have Bugzilla produce saner JSON output via a template (although the security code that logs you out for a js request should still run, so don’t forget buglist.cgi).
  • Even with all columns exposed when using buglist.cgi, there are lots of interesting things that are not exposed.  For example, flags are not exposed via buglist.cgi, so faceting on whether things are blockers or wanted can’t be done.  Once you know the bug numbers from the query, you can obviously go fetch additional information; I think that currently still has to be in XML format, but that’s not that hard.
  • The code is friendly and splits up the whiteboard and keyword things so it does what you would expect and is not stupid.
  • I made sliders for patch count and votes.  They don’t work for me anymore, and I see XUL wrapper anger (on Firefox 3.0.x), so, uh, don’t be surprised if they fall down.
  • The UI obviously sucks.  But it’s a proof of concept, and you are the internet!  You can do anything!
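
To illustrate the positional-array business in the notes above, here is a minimal sketch of pairing a stashed header list with one row of results and splitting up the keyword and whiteboard fields.  The column list is a made-up four-column subset (the real one lives in bugxhibit.js and is not reproduced here), the sample row is invented, and the splitting rules (comma-separated keywords, [bracketed] whiteboard chunks) are my assumptions.

```javascript
// Hypothetical subset of column names, in the order the row arrays use.
const COLUMNS = ["bug_id", "short_desc", "keywords", "status_whiteboard"];

// Pair each positional value with its column name.
function rowToBug(row, columns) {
  if (row.length !== columns.length) {
    // The brittleness mentioned above: refuse rather than mislabel fields.
    throw new Error("expected " + columns.length + " columns, got " + row.length);
  }
  const bug = {};
  columns.forEach((name, i) => {
    bug[name] = row[i];
  });
  return bug;
}

// Assumption: keywords arrive as one comma-separated string.
function splitKeywords(text) {
  return text.split(",").map((k) => k.trim()).filter((k) => k.length > 0);
}

// Assumption: whiteboard entries are written as [bracketed] chunks.
function splitWhiteboard(text) {
  const chunks = text.match(/\[[^\]]*\]/g) || [];
  return chunks.map((chunk) => chunk.slice(1, -1).trim());
}

// An invented row, not a real bug.
const bug = rowToBug([12345, "Example summary", "perf, ux", "[gloda][needs review]"], COLUMNS);
bug.keywords = splitKeywords(bug.keywords);
bug.status_whiteboard = splitWhiteboard(bug.status_whiteboard);
console.log(bug);
```

Throwing on a column-count mismatch is a design choice for the sketch: it turns the "things just don't match" failure into a loud error instead of quietly mislabeled facets.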