Web Worker-assisted Email Visualizations using Vega

Faceted and overview visualizations

tl;dr: glodastrophe, the experimental, entirely-client-side JS desktop-ish email app, now supports Vega-based visualizations, in addition to new support infrastructure for extension-y things and for creating derived views based on the search/filter infrastructure.

Two of the dreams of Mozilla Messaging were:

  1. Shareable email workflows (credit to :davida).  If you could figure out how to set up your email client in a way that worked for you, you should be able to share that with others in a way that doesn’t require them to manually duplicate your efforts and ideally without you having to write code.  (And ideally without anyone having to review code/anything in order to ensure there are no privacy or security problems in the workflow.)
  2. Useful email visualizations.  While in the end, the only visualization ever shipped with Thunderbird was the simple timeline view of the faceted global search, various experiments happened along the way, some abandoned.  For example, the following screenshot shows one of the earlier stages of faceted search development where each facet attempted to visualize the relative proportion of messages sharing that facet.

faceted search UI prototype

At the time, the protovis JS visualization library was the state of the art.  Its successor, the amazing and continually evolving d3, has since eclipsed it.  d3, being a JS library, requires someone to write JS code.  A visualization written directly in JS runs into the whole code review issue.  What would be ideal is a means of specifying visualizations that is substantially more inert and easy to sandbox.

Enter Vega, a visualization grammar expressed in JSON that can define not only “simple” static visualizations but also mind-blowing gapminder-style interactive visualizations.  Also, it has some very clever dataflow stuff under the hood and builds on d3 and its well-proven magic.  I performed a fairly extensive survey of the current visualization, faceting, and data-processing options to help bring visualizations and faceted, filtered search to glodastrophe and other potential consumers of the gaia mail backend like the Firefox OS Gaia Mail App.

Digression: Two significant, relevant changes in how the gaia mail backend was designed compared to its predecessor Thunderbird (and its global database) are:

  1. As much work as possible is done in a DOM/Web Worker (or several).  This greatly assists in UI responsiveness.  Thunderbird has to do most things on the main thread because of hard-to-unwind implementation choices that permeate the codebase.
  2. It’s assumed that the local mail client may only have a subset of the messages known to the server, that the server may be smart, and that it’s possible to convince servers to support new functionality.  In many ways, this is still aspirational (the backend has not yet implemented search on server), but the architecture has always kept this in mind.

In terms of visualizations, what this means is that we pre-chew as much of the data in the worker as we can, drastically reducing both the amount of computation that needs to happen on the main (page) thread and the amount of data we have to send to it.  It also means that we could potentially farm all of this out to the server if its search capabilities are sufficiently advanced.  And/or the backend could cache previous results.

For example, in the faceted visualizations on the sidebar (placed side-by-side here):

faceted-histograms

In the “Prolific Authors” visualization definition, the backend in the worker constructs a Vega dataflow (only!).  The search/filter mechanism is spun up, and the visualization’s data-gathering needs specify that we will load the messages belonging to each conversation under consideration.  Then for each message we extract the author and the age of the message and feed those to the dataflow graph.  The data transforms bin the messages by date, facet the messages by author, and aggregate the message bins within each author.  We then sort the authors by the number of messages they authored, limit the result to the top 5 authors, and sort those alphabetically.  If we were doing this on the front-end, we’d have to send all N messages from the back-end.  Instead, we send over just 5 histograms with a maximum of 60 data-points in each histogram, one per bin.
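To make the shape of that pipeline concrete, here is a minimal plain-JS sketch of the same aggregation; it is written as a straight function rather than as a Vega dataflow spec, and the message shape, bin width, and the prolificAuthors name are illustrative assumptions, not glodastrophe’s actual code.

```js
// Hypothetical message shape: { author: "alice@example.com", date: Date }.
// Bin by week, facet by author, keep the 5 most prolific authors.
function prolificAuthors(messages, binMillis = 7 * 24 * 60 * 60 * 1000) {
  const binsByAuthor = new Map();
  for (const msg of messages) {
    const bin = Math.floor(msg.date.valueOf() / binMillis) * binMillis;
    let bins = binsByAuthor.get(msg.author);
    if (!bins) {
      binsByAuthor.set(msg.author, (bins = new Map()));
    }
    bins.set(bin, (bins.get(bin) || 0) + 1);
  }
  // Rank authors by total message count, keep the top 5, then sort those
  // alphabetically for stable display.  Only these few histograms cross the
  // worker/page boundary, not the N messages they summarize.
  return [...binsByAuthor.entries()]
    .map(([author, bins]) => ({
      author,
      total: [...bins.values()].reduce((a, b) => a + b, 0),
      histogram: [...bins.entries()].map(([time, count]) => ({ time, count })),
    }))
    .sort((a, b) => b.total - a.total)
    .slice(0, 5)
    .sort((a, b) => a.author.localeCompare(b.author));
}
```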

Same deal with “Prolific domains”, but we extract the author’s mail domain and aggregate based on that.

Authored content size overview heatmap

Similarly, the overview Authored content size over time heatmap visualization sends only the aggregated heatmap bins over the wire, not all the messages.  Elaborating, for each message body part, we (now) compute an estimate of the number of actual “fresh” content bytes in the message.  Anything we can detect as a quote or a mailing list footer or multiple paragraphs of legal disclaimers doesn’t count.  The x-axis bins by time; now is on the right, the oldest considered message is on the left.  The y-axis bins by the log of the authored content size.  Messages with zero new bytes are at the bottom, massive essays are at the top.  The current visualization is useless, but I think the ingredients can and will be used to create something more informative.
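A rough sketch of that binning, again as plain JS rather than the actual Vega dataflow; the field names, bin counts, and log base are assumptions for illustration:

```js
// Bin messages into a time × log(authoredBytes) grid for the heatmap.
// `messages` is assumed to look like { date: Date, authoredBytes: Number }.
function heatmapBins(messages, { timeBins = 30, sizeBins = 8, now = Date.now() } = {}) {
  const oldest = Math.min(...messages.map(m => m.date.valueOf()));
  const span = Math.max(now - oldest, 1);
  const grid = Array.from({ length: sizeBins }, () => new Array(timeBins).fill(0));
  for (const msg of messages) {
    // x: oldest considered message on the left, "now" on the right.
    const x = Math.min(
      timeBins - 1,
      Math.floor(((msg.date.valueOf() - oldest) / span) * timeBins));
    // y: log-scaled authored-content size; zero fresh bytes lands in the bottom row.
    const y = Math.min(
      sizeBins - 1,
      Math.floor(Math.log10(1 + msg.authoredBytes)));
    grid[y][x] += 1;
  }
  return grid; // only this small grid is sent to the page, not the messages
}
```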

Other notable glodastrophe changes since the last blog post:

  • Front-end state management is now done using redux
  • The Material UI React library has been adopted for UI widget purposes, though the conversation and message summaries still need to be overhauled.
  • React was upgraded
  • A war was fought with flexbox and flexbox won.  Hard-coding and calc() are the only reason the visualizations look reasonably sized.
  • Webpack is now used for bundling in order to facilitate all of these upgrades and reduce potential contributor friction.

More to come!

An email conversation summary visualization

We’ve been overhauling the Firefox OS Gaia Email app and its back-end to understand email conversations.  I also created a react.js-based desktop-ish development UI, glodastrophe, that consumes the same back-end.

My first attempt at summaries for glodastrophe was the following:

old summaries; 3 message tidbits

The back-end derives a conversation summary object from all of the messages that make up the conversation whenever any message in the conversation changes.  While there are some things that are always computed (the number of messages in the conversation, whether there are any unread messages, any starred/flagged messages, etc.), the back-end also provides hooks for the front-end to provide application logic to do its own processing to meet its UI needs.

In the case of this conversation summary, the application logic finds the first 3 unread messages in the conversation and stashes their date, author, and extracted snippet (if any) in a list of “tidbits”.  This is also used to determine the height of the conversation summary in the conversation list.  (The virtual list is aware of a quantized coordinate space where each conversation summary object is between 1 and 4 units high in this case.)
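As a sketch of what such an application-logic hook could look like (the function name, message fields, and height quantization below are hypothetical stand-ins rather than the real back-end API):

```js
// Hypothetical summarizer hook: given all messages in a conversation
// (sorted oldest-first), stash up to 3 unread "tidbits" on the summary.
function summarizeConversation(messages, summary) {
  const tidbits = [];
  for (const msg of messages) {
    if (!msg.isRead && tidbits.length < 3) {
      tidbits.push({ date: msg.date, author: msg.author, snippet: msg.snippet || null });
    }
  }
  summary.tidbits = tidbits;
  // The virtual list quantizes heights: 1 base unit plus 1 per tidbit, capped at 4.
  summary.heightUnits = Math.min(1 + tidbits.length, 4);
  return summary;
}
```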

While this is interesting because it’s something Thunderbird’s thread pane could not do, it’s not clear that the tidbits are an efficient use of screen real-estate.  At least not when the number of unread messages in the conversation exceeds the 3 we cap the tidbits at.

time-based thread summary visualization

But our app logic can actually do anything it wants.  It could, say, establish the threading relationships among the messages in the conversation, enabling us to make a highly dubious visualization of the thread structure as well as show the activity in the conversation over time.  Much like the visualization you already saw before you read this sentence.  We can see the rhythm of the conversation.  We can know whether this is a highly active conversation that’s still ongoing, or just one that someone has brought back to life.
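A minimal sketch of that approach, assuming d3 v3’s layout API and a hypothetical message-tree shape; the dimensions and the exact scale are arbitrary:

```js
// Assumed tree shape: { msg: { date: Date }, children: [...] }.
var width = 280, height = 72;
var cluster = d3.layout.cluster().size([width, height]);
var nodes = cluster.nodes(conversationRoot);
var links = cluster.links(nodes); // the edges we would draw between messages

// The cluster layout spreads siblings evenly; we then clobber the horizontal
// position with a quasi-logarithmic time scale so recent activity bunches up
// on the right and long-dormant stretches get compressed.
var oldest = d3.min(nodes, function (d) { return d.msg.date; });
var newest = d3.max(nodes, function (d) { return d.msg.date; });
var timeScale = d3.scale.linear()
    .domain([0, Math.log1p(newest - oldest)])
    .range([0, width]);
nodes.forEach(function (d) {
  d.x = width - timeScale(Math.log1p(newest - d.msg.date));
});
```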

Here’s the same visualization where we still use the d3 cluster layout but don’t clobber the x-position with our manual-quasi-logarithmic time-based scale:

the visualization without time-based x-positioning

Disclaimer: This visualization is ridiculously impractical in cases where a conversation has only a small number of messages.  But a neat thing is that the application logic could decide to use textual tidbits for small numbers of unread and a cool graph for larger numbers.  The graph’s vertical height could even vary based on the number of messages in the conversation.  Or the visualization could use thread-arcs if you like visualizations but want them based on actual research.

If you’re interested in the moving pieces in the implementation, they’re here:

webpd: a Polymer-based web UI for the beets music library manager

beets webpd filtered artists list

beets is the extensible music database tool every programmer with a music collection has dreamed of writing.  At its simplest it’s a clever tagger that can normalize your music against the MusicBrainz database and then store the results in a searchable SQLite database.  But with plugins it can fetch album art, use the Discogs music database for tagging too, calculate ReplayGain values for all your music, integrate meta-data from The Echo Nest, etc.  It even has a Music Player Daemon server-mode (bpd) and a simple HTML interface (web) that lets you search for tracks and play them in your browser using the HTML5 audio tag.

I’ve tried a lot of music players through the years (alphabetically: amarok, banshee, exaile, quodlibet, rhythmbox).  They are all great music players and (at least!) satisfy the traditional Artist/Album/Track hierarchy use-case, but when you exceed 20,000 tracks and have a lot of compilation CDs, that frequently ends up not being enough.  Extending them usually turned out to be too hard / not fun enough, although sometimes it was just a question of time and seeking greener pastures.

But enough context; if you’re reading my blog you probably are on board with the web platform being the greatest platform ever.  The notable bits of the implementation are:

  • Server-wise, it’s a mash-up of beets’ MPD-alike plugin bpd and its web plugin.  Rather than needing to speak the MPD protocol over TCP to get your server to play music, you can just hit it with an HTTP POST and it will enqueue and play the song.  Server-sent events/EventSource are used to let the web UI hypothetically update as things happen on the server.  Right now the client can indeed tell the server to play a song and hear an update via the EventSource channel, but there’s almost certainly a resource leak on the server-side and there’s a lot more web/bpd interlinking required to get it reliable.  (Python’s Flask is neat, but I’m not yet clear on how to properly manage the life-cycle of a long-lived request that only dies when the connection dies, since I’m seeing the app context get torn down even before the generator starts running.)  A minimal sketch of the client side of this interaction follows the list below.
  • The client is implemented in Polymer on top of some simple backbone.js collections that build on the existing logic from the beets web plugin.
    • The artist list uses the polymer-virtual-list element which is important if you’re going to be scrolling through a ton of artists.  The implementation is page-based; you tell it how many pages you want and how many items are on each page.  As you scroll it fires events that compel you to generate the appropriate page.  It’s an interesting implementation:
      • Pages are allowed to be variable height and therefore their contents are too, although a fixedHeight mode is also supported.
      • In variable-height mode, scroll offsets are translated to page positions by guessing the page based on the height of the first page and then walking up/down from there based on cached page sizes until the right page is found.  If there is missing information because the user managed to trigger a huge jump, extrapolation is performed based on the average item size from the first page.
      • Any changes to the contents of the list regrettably require discarding all existing pages/bindings.  At this time there is no way to indicate a splice at a certain point that should simply result in a displacement of the existing items.
    • Albums are loaded in batches from the server and artists dynamically derived from them.  Although this would allow for the UI to update as things are retrieved, the virtual-list invalidation issue concerned me enough to have the artist-list defer initialization until all albums are loaded.  On my machine a couple thousand albums load pretty quickly, so this isn’t a huge deal.
    • There’s filtering by artist name and by the number of albums in the database by that artist, built on backbone-filtered-collection.  The latter is important to me because scrolling through hundreds of artists where I might only own one CD, or not even one whole CD, is annoying.  (Although that is somewhat addressed currently by only using the albumartist for the artist generation, so various-artists compilations don’t clutter things up.)
    • If you click on an artist it plays the first track (numerically) from the first album (alphabetically) associated with the artist.  This does limit the songs you can listen to somewhat…
    • Visualizations are done using d3.js; one SVG per visualization.
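As promised above, here is a minimal sketch of the client side of that play/EventSource interaction; the endpoint paths and payload shapes are assumptions for illustration, not webpd’s actual routes:

```js
// Hypothetical endpoints: POST /play enqueues a track id; GET /events is an
// EventSource stream of player-state updates.
function playTrack(trackId) {
  return fetch('/play', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ id: trackId }),
  });
}

const events = new EventSource('/events');
events.onmessage = (evt) => {
  // e.g. { state: "playing", track: 1234 }
  const update = JSON.parse(evt.data);
  console.log('player state:', update.state, 'track:', update.track);
};
```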

beets webpd madonna and morrissey

“What’s with all those tastefully chosen colors?” is what you are probably asking yourself.  The answer?  Two things!

  1. A visualization of albums/releases in the database by time, heat-map style.
    • We bin all of the albums that beets knows about by year.  In this case we assume that 1980 is the first interesting year and so put 1979 and everything before it (including albums without a year) in the very first bin on the left.  The current year is the rightmost bucket.
    • We vertically divide the albums into “albums” (red), “singles” (green), and “compilations” (blue).  This is accomplished by taking the MusicBrainz Release Group / Types and mapping them down to our smaller space.
    • The more albums in a bin, the stronger the color.  (A rough sketch of this binning and the type mapping follows the list below.)
  2. A scatter-plot using The Echo Nest’s acoustic attributes for the tracks where:
    • the x-axis is “danceability”.  Things to the left are less danceable.  Things to the right are more danceable.
    • the y-axis is “valence” which they define as “the musical positiveness conveyed by a track”.  Things near the top are sadder, things near the bottom are happier.
    • the colors are based on the type of album the track is from.  The idea was that singles tend to have remixes on them, so it’s interesting if we always see a big cluster of green remixes to the right.
    • tracks without the relevant data all end up in the upper-left corner.  There are a lot of these.  The Echo Nest is extremely generous in allowing non-commercial use of their API, but they limit you to 20 requests per minute, and at this point the beets echonest plugin needs to upload (transcoded) versions of all my tracks since my music collection is apparently more esoteric than what the servers already have fingerprints for.
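As referenced above, a rough sketch of the year binning and release-type mapping behind the first visualization; the handling of the MusicBrainz type strings is a guess at the common cases, not an exhaustive mapping:

```js
// Clamp albums into year bins: everything 1979-and-earlier (or undated)
// shares the leftmost bin; the current year is the rightmost.
function yearBin(year, firstYear = 1980, lastYear = new Date().getFullYear()) {
  if (!year || year < firstYear) return 0;
  return Math.min(year, lastYear) - firstYear + 1;
}

// Collapse MusicBrainz release-group types into the three heatmap rows.
function albumKind(albumtype) {
  const t = (albumtype || '').toLowerCase();
  if (t.includes('compilation')) return 'compilation';           // blue row
  if (t.includes('single') || t.includes('ep')) return 'single'; // green row
  return 'album';                                                 // red row, also the fallback
}
```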

Together these visualizations let us infer:

  • Madonna is more dancey than Morrissey!  Shocking, right?
  • I bought the Morrissey singles box sets. And I got ripped off because there’s a distinct lack of green dots over on the right side.

Code is currently in the webpd branch of my beets fork although I should probably try and split it out into a separate repo.  You need to enable the webpd plugin like you would any other plugin for it to work.  There’s still a lot lot lot more work to be done for it to be usable, but I think it’s neat already.  It definitely works in Firefox and Chrome.

about:nosy is about:memory with charts, helps you lay blame more easily

about:memory and the memory reporter infrastructure that powers it are amazing.  They provide an explicit hierarchy that breaks down the memory use in the system to the subsystems and increasingly the causes of allocation.  about:memory looks like this (if you stand a few feet back from your monitor and take off your glasses):

If you are going to look at about:memory, it is probably for one of two reasons:

  1. You are Nicholas Nethercote or one of his merry band of MemShrink hackers kicking ass and taking names (of inefficient uses of memory).  In this case, about:memory is exactly what you need.
  2. You suspect some tab in Firefox has gone crazy and you want to figure out which one it is and take your vengeance upon it.  Vengeance can take the form of thinking mean thoughts, closing the tab, or writing a snarky tweet.  about:memory will let you do this, but you have to look at a lot of text and you may already be too late to find the culprit!  If only there was an easier way…

Enter about:nosy:

It can show us a list of all the open tabs and their memory usage (sans JS, for now), as per the above screenshot.  If you expand the tab capsules, you get to see the list of all the inner windows/iframes that live in the hierarchy of that page.  In most cases the list is either really short and boring or really long and boring.  In the case of www.cnn.com I end up with 26 inner windows.

It can also show us memory aggregated by origin.  We do show JS for this case because JS is currently only trackable on a per-origin basis.  When Bug 650353 gets fixed or the memory reporters get more specific we should be able to apportion JS usage to pages directly.

It also attempts to aggregate extension JS compartments back to their owning extension.  We ask the add-on manager for a list of the installed extensions to find their filesystem roots, ask the resource protocol to explain resource mappings, and from there are able to translate such paths.  Just keep in mind that traditional overlay-based extensions do not create their own compartments and so are invisible for tracking purposes.

In the screenshot above, you can see that about:nosy keeps the charts exciting by generating a ridiculous amount of garbage all by itself.  Much of this is just the about:memory tree-building code that we are reusing.  If you refreshed about:memory once a second you would probably see similar garbage creation from the main system JS compartment.

You can install a restartless XPI of the current state of things (update: it points at 0.3 now, which does not screw up style shell apportionment and uses a better add-on SDK that does not create throwaway JS compartments every second); it will not auto-update.  It wants a recent nightly build of Firefox because it makes assumptions about the structure of the memory reporters in order to better serve you.

You can find the source repo on github.  It requires the add-on SDK to build.  It might seem a little overkill for just graphing memory history, but if you look at the repo you will notice my goal is to use Brian Burg’s jsprobes work, aided by Steve Fink and now de-bitrotted by me (but still a bit crashy), to be able to graph CPU usage, including raw JS, layout/reflow, and paint (eventually, after adding probe points).  It’s also possible for those statistics to be gathered via static mechanisms, but the probes are fun and I want to see them work.

LogSploder, logsploding its way to your logs soon! also, logsplosion!

logsploder screenshot with gloda

In our last logging adventure, we hooked Log4Moz up to Chainsaw.  As great as Chainsaw is, it did not meet all of my needs, least of all easy redistribution.  So I present another project in a long line of fantastically named projects… LogSploder!

The general setup is this:

  • log4moz with a concept of “contexts”, a change in logging-function argument expectations (think FireBug’s console.log), a JSON formatter that knows to send the contexts over the wire as JSON rather than stringifying them, plus our SocketAppender from the ChainSaw fun.  The JSONed message representations get sent to…
  • LogSploder (a XULRunner app) listening on localhost.  It currently is context-centric, binning all log messages based on their context.  The contexts (and their state transitions) are tracked and visualized (using the still-quite-hacky visophyte-js).  Clicking on a context displays the list of log messages associated with that context and their timestamps.  We really should also display any other metadata hiding in the context, but we don’t.  (Although the visualization does grab stuff out of there for the dubious coloring choices.)
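As a sketch of what that context-centric binning amounts to (the message shape and function name here are hypothetical, not the actual wire format):

```js
// Bin incoming JSON log messages by their context id and track state
// transitions so the visualization can draw markers for them.
const contexts = new Map();
function onLogMessage(msg) {
  // msg is assumed to look like { context: { id, state, ... }, time, text }
  let bin = contexts.get(msg.context.id);
  if (!bin) {
    bin = { messages: [], states: [] };
    contexts.set(msg.context.id, bin);
  }
  bin.messages.push(msg);
  const last = bin.states[bin.states.length - 1];
  if (!last || last.state !== msg.context.state) {
    bin.states.push({ state: msg.context.state, time: msg.time });
  }
}
```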

So, why, and what are we looking at?

When developing/using Thunderbird’s exciting new prototype message/contact/etc views, it became obvious that performance was not all that it could be.  As we all know, the proper way to optimize performance is to figure out what’s taking up the most time.  And the proper way to figure that out is to write a new tool from near-scratch.  We are interested in both comprehension of what is actually happening as well as a mechanism for performance tracking.

The screenshot above shows the result of issuing a gloda query with a constraint of one of my Inbox folders with a fulltext search for “gloda” *before any optimization*.  (We already have multiple optimizations in place!) The pinkish fill with greenish borders are our XBL result bindings, the blue-ish fill with more obviously blue borders are message streaming requests, and everything else (with a grey border and varying colors) is a gloda database query.  The white bar in the middle of the display is a XBL context I hovered over and clicked on.

The brighter colored vertical bars inside the rectangles are markers for state changes in the context.  The bright red markers are the most significant, they are state changes we logged before large blocks of code in the XBL that we presumed might be expensive.  And boy howdy, do they look expensive!  The first/top XBL bar (which ends up creating a whole bunch of other XBL bindings which result in lots of gloda queries) ties up the event thread for several seconds (from the red-bar to the end of the box).  The one I hovered over likewise ties things up from its red bar until the green bar several seconds later.

Now, I should point out that the heavy lifting of the database queries actually happens on a background thread, and without instrumentation of that mechanism, it’s hard for us to know when they are active or actually complete.  (We can only see the results when the main thread’s event queue is draining, and only remotely accurately when it’s not backlogged.)  But just from the visualization we can see that at the very least the first XBL dude is not being efficient with its queries.  The second expensive one (the hovered one) appears to be chewing up processor cycles without much help from background processes.  (There is one recent gloda query, but we know it to be cheap.  The message stream requests may have some impact since mailnews’ IMAP code is multi-threaded, although they really only should be happening on the main thread (might not be, though!).  Since the query was against one folder, we know that there is no mailbox reparse happening.)

Er, so, I doubt anyone actually cares about what was inefficient, so I’ll stop now.  My point is mainly that even with the incredibly ugly visualization and what not, this tool is already quite useful.  It’s hard to tell that just from a screenshot, since a lot of the power is being able to click on a bar and see the log messages from that context.  There’s obviously a lot to do.  Probably one of the lower-hanging pieces of fruit is to display context causality and/or ownership.  Unfortunately this requires explicit state passing or use of a shared execution mechanism; the trick of using thread-locals that log4j gets to use for its nested diagnostic contexts is simply not an option for us.

Thunderbird and gloda go to meme-town

Sure, a word cloud of your blog posts is cool… but what if you could take any search of your e-mail, and turn that into a word cloud?  And then, if you click on one of those words, your search constraints would be revised to use the word you clicked on (and you’d get a useful search result, not another word cloud)?  And what if that layout algorithm were not as good as wordle?  The future is now, people!  (At least if you install like 5 extra extensions out of mercurial.)

The screenshot above is from Thunderbird trunk with a hacked exptoolbar extension (generalized, committed changes happening soon), visophyte-js, and the new glodacloud extension.  It is a proof-of-easy-gloda-extensions as suggested by David Ascher.

The layout algorithm is what we in the business of making up terminology call a recursive sub-optimal tic-tac-toe subdivision thinger.  We under-use a neat (and somewhat slow) hack that uses canvas.mozPathText and canvas.isPointInPath to sample a grid and figure out where the text is and isn’t.  It’s under-used because all we use it for right now is to find the actual height above the baseline that the text stretches to (because metrics only gives us the width).  We are lazy and don’t check below the baseline at all, and totally squander our chance to be cool and put small words in the gaps in larger words.  But given the amount of time spent, I’m very happy.
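A sketch of that grid-sampling hack, assuming mozPathText lays the glyphs out with their baseline at the origin (mozPathText was a Firefox-only canvas extension, and the grid step and scan height here are arbitrary):

```js
// Estimate how far above the baseline a word's glyphs actually reach by
// adding the text to the current path and probing a grid of points.
// Assumes ctx.font has already been set.
function textAscent(ctx, word, step = 2, maxScan = 200) {
  const width = ctx.measureText(word).width;
  ctx.beginPath();
  ctx.mozPathText(word);
  let ascent = 0;
  for (let y = 1; y < maxScan; y += step) {
    for (let x = 0; x < width; x += step) {
      // Glyphs above the baseline live at negative y in canvas coordinates.
      if (ctx.isPointInPath(x, -y)) {
        ascent = y;
      }
    }
  }
  return ascent;
}
```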

Oh, and of course it uses JS and Canvas.

I’ll be wanting that latte machine now…

in context

credit where credit is due:

  • thread arcs a la the nice people at the IBM CUE group
  • the search view prototype is implemented by David Ascher.  the positioning of the visualization is on me as a quick hack, though.
  • the search view prototype is designed by Bryan Clark, and he has even better stuff on the way

The actual implementation is a first step of adapting knowledge from my python “visophyte” library to a JS implementation using canvas.  I am trying a more batch-oriented style of processing this time that uses explicit attributes for value-passing between logic blocks.  This is in comparison to the python implementation which is more functional in nature.  We’ll see how it turns out.

Pretty Polish

pretty-pretty.png

As part of a continuing effort to make visophyte’s byproducts look attractive, I implemented a bit more shiny today. Using this aqua sphere effect photoshop tutorial at skdstudio.com as a basis, I have made the simple circle renderer support a ‘pretty’ option.

Unfortunately, this took a lot longer than I was hoping. Cairo lacks a Gaussian blur mechanism, PIL only supports 5×5 image kernels (iterative application is too slow), and using SciPy was absurdly slow and didn’t even work right before I gave up on it. Thankfully, in my googling it turned out that box blurs can be used to approximate a Gaussian blur. So visophyte’s cairo renderer now has a home-grown “box blur” filter using a boxcar average to keep the iterations and redundant calculations down.  (And only using the array module, so no new dependencies.)
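The idea, sketched in JS here for consistency with the other examples in this archive even though visophyte’s implementation is Python: a box blur is just a running (boxcar) average, and a few iterated passes of it closely approximate a Gaussian.

```js
// One horizontal pass of a box blur over a single grayscale channel.  Run it
// along rows, then along columns, and repeat ~3 times to approach a Gaussian.
// The running (boxcar) sum keeps the cost per pixel constant regardless of
// how large the radius is.
function boxBlurRow(src, dst, width, radius) {
  const span = 2 * radius + 1;
  let sum = 0;
  for (let i = -radius; i <= radius; i++) {
    sum += src[Math.max(0, Math.min(width - 1, i))]; // edge pixels repeat
  }
  for (let x = 0; x < width; x++) {
    dst[x] = sum / span;
    sum -= src[Math.max(0, x - radius)];             // drop the pixel leaving the window
    sum += src[Math.min(width - 1, x + radius + 1)]; // add the pixel entering it
  }
}
```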

mailman-pretty-circles.png

The latter vis is just the same vis as in my post about pretty pie charts, but with the pie visualizations replaced with circles. A net loss in information, but perhaps a net gain in prettiness? (Utility probably stays about the same…)

Cairo bakes Pretty Pies

Vacation found me actually relaxing, but some pretty progress has been had. I forgot to push my 64-bit aggdraw patches to the laptop I brought, so I implemented a cairo renderer. This has resulted in some backend cleanup and refactoring, although there is more to do. This also allows for attractive use of gradients:

pretty-pie.png

The changes to the pie-chart rendering are based on an Illustrator tutorial on how to make pretty pie charts. Although the pie chart has long known how to label itself and is now more competent at it, labels still overlap so they have been mercilessly disabled in this picture.

pretty-graphito-pies-python-dev-2007-july-twopi-fancy.png

This is a graph of the python-dev traffic (from the mailman archive) for July 2007 once more. This time:

  • The nodes are authors.
    • The radius of the node is a linear mapping, lower-bounded at 8 and upper-bounded at 24, of the number of messages the author wrote during the time period.  (A sketch of this mapping follows the list below.)
    • The pie slices are the threads the author replied to/started during the month.
      • Their weights are the number of messages the author wrote in that thread.
      • The slices are distinctly colored.  Because the previous distinct-color mechanism clearly fell down by providing colors that were too similar, I did a first pass at varying the saturation in addition to varying the hue.  Varying value/brightness seemed a little too distracting, but it might be okay with less severe variations.
  • The edges indicate that the author replied to a message by another author.
    • The width is (linearly) based on the number of times the author replied to the other author.
    • The color is always 25% opaque 50%-gray. Since the edges are effectively directed (but not visually distinguished), a case in which two authors replied to each other will result in a darker gray, at least in the region of overlap (since width can vary).
  • The layout is graphviz’s twopi.
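For illustration, the clamped linear radius mapping from the first bullet could look like the following; the normalization against the busiest author is an assumption, not necessarily what visophyte does.

```js
// Map an author's message count to a node radius in the 8..24 range.
function nodeRadius(messageCount, maxCount) {
  const t = Math.min(messageCount / maxCount, 1);
  return 8 + t * (24 - 8);
}
```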