Email

gloda’s first (primitive) visualization

Author activity over time, current thread in blue, selected message in darkest blue.

A primitive visualization augments the gloda “other messages by author” listing by showing the messages sent by the author over time.  Messages are stacked by day.  The currently selected message is in darkest blue and also very wide.  Other messages from the same thread/conversation are in lighter blue and less wide.  Messages not in the conversation are light grey and rather narrow.

It’s not clickable, it lacks any form of scale or any feedback at all, and there are scaling issues.  (If anyone wants to save me the effort of figuring out how to get the canvas to maintain a 1:1 pixel mapping to the actual display and still ‘flex’ by adding/losing pixels, please do drop me a message or leave a comment.)  These will all change, but not yet.

I’ve pushed the changes to the mercurial repos and updated the stable tag, but I’m not publishing updated xpi’s, so you’ll need to roll your own if you care.  (The DB schema has not changed and so does not need to be blown away.)

Email
Mozilla
Thunderbird
Visualizing

Comments (0)

Permalink

Adding stews (hackish destructive accumulation/reduction) to CouchDB

As all misguidedly-lazy programmers are wont to do, I decided that it would be easier to ‘enhance’ CouchDB to meet my needs rather than to rewrite visotank to use SQLAlchemy. Also, I wanted to understand what CouchDB was doing under the hood with views and try my hand at some Erlang.

This Has Nothing To Do With Anything

CouchDB as currently implemented maintains a lot of information for each mapped document. There is a B-tree associated with each View Group whose keys are Document Ids and whose Values are a list of {View Id, Actual-Key-You-Mapped-In-That-View} tuples for every key mapped from that document for every view in the view group. Next, each View has a B-tree associated with it whose keys are {Actual-Key-You-Mapped, Document Id} tuples and whose values are the Actual-Value-You-Mapped.

This is all well and good, but is a poor fit for one of my key use-cases: reducing e-mail message traffic to date-binned summary statistics so I can render graphics. If I want the weekly-messages-sent count for a given ‘author’, map(message.author, blah) will allow me to filter only to messages sent by that author, but no matter what blah is, I will still get one per message.

Long blog post short, I have implemented a hackish first-pass reduce/accumulate solution to my problem. The idea is that ’stews’ allow you to aggregate mapped data that shares the same key. I’m a little fuzzy on exactly what the definition of ‘reduce’ is in the map/reduce papers (it’s been a while, if ever), so we’ll call this ‘accumulate’ (in the SICP/Scheme sense). It is a hack because:

  • It does not unify views and ’stews’. Whereas views are defined under ‘_design’ and accessed via ‘_view’, stews are defined under ‘_pot’ and accessed via ‘_stew’.
  • Values can only be integers right now, and it’s assumed you want to add them. (No custom JavaScript logic!)
  • I have not yet dealt with modified/removed documents. Which is to say that if you modify or remove a stew-mapped document, your accumulated values will climb ever-skyward.
  • It is in no way, shape, or form intended to be anything other than a learning experiment. (It is my hope that Damien Katz magically solves my problems in the next release. Having said that, I’m not opposed to trying to actually implement a more solid feature along these lines; coding in Erlang is wicked awesome. (sounds better with a fake accent))

It just so happens that these constraints are perfectly in line with visotank’s needs. Using stews and otherwise limiting my use of views, CouchDB is less ridiculous in its view-update times and the fully-populated (view/stew-wise) from-scratch ‘messages’ database tops out at 77M rather than 1.2G.

This also has nothing to do with anything

Anyways, if anyone is interested in the code (or the comments I added to the existing couch_view_group.erl logic), my bzr branch for CouchDB is at: http://www.visophyte.org/rev_control/bzr/couchdb/visbrero-couchdb/ . My bzr branch for couchdb-python, adding a simple unit test for stews is at: http://www.visophyte.org/rev_control/bzr/couchdb-python/visbrero/ .

Update!  The bzr repository is powerful messed up, so a better choice might be my changes in patch form:  http://www.visophyte.org/rev_control/patches/couchdb/visbrero-couchdb-stews-1.patch

Update 2! The bzr repository accessible at http://clicky.visophyte.org/rev_control/bzr/couchdb/visbrero-couchdb/ works and there’s a checkout with working copy (that you can browse) at http://clicky.visophyte.org/rev_control/bzr-checkouts/couchdb/visbrero-couchdb/ .   Note that these locations are not guaranteed to be valid for all time, but will be good for at least a month or two.

I fear my (sleepy) explanation may not be sufficient, so the unit test I added to couchdb-python may speak better to this end:

self.db['tom1'] = {’author’: ‘tom’, ’subject’: ‘cheese’}
self.db['tom2'] = {’author’: ‘tom’, ’subject’: ‘cats’}
self.db['tom3'] = {’author’: ‘tom’, ’subject’: ‘mice’}
self.db['bob1'] = {’author’: ‘bob’, ’subject’: ‘hats’}
self.db['jon1'] = {’author’: ‘jon’, ’subject’: ‘hats’}
self.db['kim1'] = {’author’: ‘kim’, ’subject’: ‘cats’}
self.db['kim2'] = {’author’: ‘kim’, ’subject’: ‘cows’}
self.db['_pot/test'] = {’views’: {
‘authors’: ‘function(doc) { map(doc.author, 1) }’,
’subjects’: ‘function(doc) { map(doc.subject, 1) }’
}}
authors = dict([(row.key, row.value) for row in self.db.view('_stew/test/authors')])
self.assertEqual(authors['tom'], 3)
self.assertEqual(authors['bob'], 1)
self.assertEqual(authors['jon'], 1)
self.assertEqual(authors['kim'], 2)
subjects = dict([(row.key, row.value) for row in self.db.view('_stew/test/subjects')])
self.assertEqual(subjects['cheese'], 1)
self.assertEqual(subjects['cats'], 2)
self.assertEqual(subjects['mice'], 1)

Uh, the spiral visualizations have nothing to do with the post. They are new insofar as I have never posted them before, but they are in fact rather quite old. They have a new aspect in that they now work with the cairo renderer, having relied upon ’special’ (horrible) custom renderers in the old agg backend.

CouchDB
Email
Software

Comments (4)

Permalink

SVG in visotank

visotank-conversation-timeline-snippet.png

visotank now has AJAX-loaded SVG graphics. The hooks are there to actually do something when you click on stuff, but it doesn’t do anything. The visualization is ripped from my visterity plugin for posterity; it’s not supposed to be new or exciting. The fact that the SVG is loaded via AJAX is new (visterity didn’t have that) and exciting. Pretty much everything else is simply legwork relating to using application/xhtml+xml instead of txt/html and the ramifications of that, especially with AJAX.

I’ve updated what is running at http://clicky.visophyte.org:8080/, but it looks like my VPS has some issues, so I wouldn’t be surprised if it things hang or are very slow when not yet cached.  (Specifically, I think it has very serious IO issues, but its absurd amounts of memory available avoid that problem from impacting things too much.)  (Normal slowness like its refusal to pipeline requests and there being hundreds of images is not a VM problem.)  Also, the SVG stuff is unlikely to work on anything but Firefox 2; at least 3.0a8 gets angry for me on gutsy.

To see the SVG graphs, the steps are: 1) select at least one contact in the contacts list, 2) click on the ‘conversations’ tab in the bottom half, select a conversation (you can only select one), and 3) click on the ‘conversation’ tab in the bottom half.  I should note that you might want to wait for all of the sparkbars to load before proceeding to the next step…

visotank-screenshot-conversation-timeline.png

Email
Visualizing
clicky

Comments (2)

Permalink

more (clicky!) mailing-list visualization a la visotank, couchdb

visotank-shot-1.png

Visotank now allows you to select some authors of interest from a sortable list of contacts, and then show the conversations they were involved in. You get the previously shown sparkbars for the author’s activity. You also get sparkbars showing the conversation activity, with each author assigned a color and consistent stacking position in that sparkbar. Click on the screenshots for zoomed versions of the screenshots to see what I mean.

You can click on things yourself at http://clicky.visophyte.org:8080/. Please only go there if you’re okay with restarting your Firefox session (especially true if Firebug is on.) All tables/images are the real thing and not fetched on demand… which results in Firefox having to pull down a lot of images. Click on some rows in the contacts table to select them. Then, in the lower tab group, click on the “conversations” tab. This will then fetch all the conversations those selected users were involved in. The system will truncate more than 10 users, so don’t go crazy. The tabs are re-fetched on switch, so if you change your contact selections, in the lower tab group, click away to “HowTo”, then back to “Conversations”. The “Conversation” tab does nothing and is a big lie.  Great UI, I know.

visotank-shot-2.png

I think you will find that sparkbar visualizations of the conversation traffic with a weekly granularity are rather useless. I think a reasonable solution would be a ‘zoomed’ sparkbar with an indication of the actual uniform timeline scale included. Since the images currently show about 2 years of data, a thread that happened 1 year ago would be centered in the middle of the image, but with its actual horizontal scale being inconsistent with that position. Future work, as always.

I have used Pylon’s Beaker caching layer to attempt to make things reasonably responsive. While CouchDB view updates are sadly quite lengthy (many many minutes when dealing with 16k messages; python-dev from Jan 2006 through Nov 2007), that is thankfully a one-off sort of thing. (The data-set is immutable once imported and I don’t change schemas that often.) The main performance hit is that I can only issue one range of keys to query in a request, so if I am trying to snipe a subset of non-consecutive information, I have to issue multiple requests. (I don’t believe POSTed views can operate against views in the database…)

Regrettably, I think my conclusion about CouchDB is that it (or something like it) will be truly fantastic in the future, but it is not going to get there soon enough for anything that hopes to be ‘productized’ anytime soon. The next thing I want to look at is using a triple-store to model some of the email data schema; my efforts from the visterity hacking suggest it could be quite useful. Of course, even if triple stores work out, I suspect a more traditional SQL database will still be required for some things. Combined with a thin custom aggregation and caching layer, that could work out well.

Note: I should emphasize that my CouchDB schema could be more optimized, but part of the experiment is/was to see if the views saved me from having to jump through clever hoops.

CouchDB
Email
Software
Visualizing
clicky

Comments (0)

Permalink

first steps to interactive fun using CouchDB

visotank-first-python-dev-sparkbars1.png

First, let me say that Pylons with its Paste magic is delightful; lots of nice round edges helped me get something simple up and running in no time, and using genshi to boot.

The new tool, visotank, is ingesting the python-dev mailman archives (as previously visualized) and putting them into CouchDB. The near-term goal is to allow for interactive exploration/visualization of the archives. The current result, as pictured, is simply sparkline barcharts of people’s posting history. Left-to-right, present-to-past, weekly, one (vertical) pixel per message, truncating at the image height (12 pixels).

Although the input processing thus far is specific to mailing list archives, the couchdb schema in use is for generic e-mail traffic. The messages are even coerced into rfc2822 format for ‘raw’ storage.

The ability to use ‘map’ multiple times in couchdb views to spread information is delightful. What I really would like is more reduce functionality or, more specifically, just accumulate. The sparkbars get their data from statistics with keys [contact id, timestamp of time period] and value 1, one per message. I would love for couchdb to provide a way to aggregate all those values with identical keys into a single row with the sum as the value. I’ll look into this and the view implementation before writing any more on the subject, but if someone out there already knows a way to do this, please let me know.

visotank-first-python-dev-sparkbars2.png

CouchDB
Email
Software
Visualizing

Comments (0)

Permalink

radial (radar) email vis, with care factors

radial-care-factor-vis.png

It’s a radial e-mail visualization intended to be the basis for a “situational awareness” overview of your e-mail. I’ve added the beginnings of a ‘care factor’* (”do I care about this person/message?”) concept to messages and contacts, which is used to assist in focusing your attention only to messages/people you care about. Right now, the care factor is simply whether you have ever sent the contact/author of a message an e-mail directly (to = 1.0), indirectly (cc = 0.5), or not at all (nada/ninguno=0.0). That can obviously be expanded upon in many directions; involvement of people you care about in message threads (with that person), intensity of your communication with that person, explicit interest-levels via tags, social network propagation (Google’s OpenSocial) without the person previously having existed in your e-mail corpus, etc.

Some more details about the visualization:

  • Things close to the center happened more recently. Things further away happened in the past. This seems like the most reasonable ‘radar’ metaphor for e-mail. If we were dealing with to-do items with due dates, then it would make sense that they are moving inward. However, the reality of e-mail is that if you don’t deal with them soon, they ‘fall off your radar’. My first thought to fuse the two would be to have messages associated with to-do tasks stick out quite obviously, latch once they hit the ‘edge’, and generally grow more ominous and threatening as time goes by. Of course, it’s probably not helpful to make people’s to-do lists seem like something they can’t escape…
    • The central grey circle is a void to ensure that angle is still meaningful even when the time is at a minimum; otherwise things would stack up and be generally extra confusing.
  • The angle is mapped to a single author/contact. This is currently random, but my intent is to allow clustering of contacts and quasi-persistent angular locations. So messages from your family might tend to come from the North, your friends the East, mailing lists the West, and ads from the South. (Let’s assume you get no spam.)  Actual geographic relationships would be a neat trick, but practically foolish.
  • Messages with a low care-factor are made more subtle by having reduced opacities. I forgot to make the edges linking messages to their parent more subtle…
  • Contacts with a high care-factor get their (anonymized) name in a strong color and their slice of the pie highlighted with a color. Contacts with a low care-factor have their names displayed more subtly and just get a grey hue for their outer-ring marker/label. The intent with the slice coloring is mainly to be intensity based with only one or two hues in use; I think using more colors will quickly overwhelm the display.
  • Time markers are in use, but may not be obvious. The blue ring labeled ‘30′ (along the x-axis) indicates that’s October 30th. The inner white ring is November 1st, but I’m not clear on why it wasn’t labeled as such (aka bug). The time marker logic needs to be refactored to provide more usable single “ruler” labeling (the timeline use currently is biased towards 2 rulers, which is where the month and year went). See the test program output from below for a better example of time display, although the month/year are still AWOL in another ruler.

radial-blah-blah-blah.png

And there’s the test program. Note that edges connect a message to its parent, and currently always flow clock-wise for time. So the innermost red message is the parent of the inner-most green message. I’m a bit conflicted about this; the consistency is nice, but the relationship would probably be more obvious if we took the shortest path. Also, since e-mail reply relationships are causal, it’s not like there’s any doubt which message was a reply to the other.

* I say ‘care factor’ because I did this work on a red-eye flight where my tiredness overwhelmed my natural defense against puns, and since Halloween was recent, and there was that tv show called ’scare factor’, etc. etc.

Email
Posterity
Visualizing

Comments (0)

Permalink

some email analysis for some email visualization

An attempt to apply hidden topic markov models to e-mail to perform topic analysis has morphed into simply deriving (aggregate) word-frequency information for TF-IDF purposes. The e-mails I attempted to analyze from my corpus appear to simply have been too short and wanting for quantity to pull a rabbit out of the (algorithmic) hat. (I only threw e-mails amongst my ‘village’-tagged contacts, as previously visualized.)

poor-mans-themail-but-hey-first-pass.png

Luckily, there’s a lot you can do with such information. (And in fact, I ended up using the word frequency info to attempt to normalize out e-mail signatures since I didn’t feel like doing the right thing for signatures at the time.) The bad news is I am not doing anything polished or good with the info yet.

The above is a quick proof-of-it-kinda-works which apes Themail’s monthly words concept. If you’re not familiar with Themail, click the link, read the PDF. It is/was a covet-able research prototype that let ‘you’ explore your history, e-mail-wise. It’s not available for download, hence ‘was’, and was only available to participating subjects, hence ‘you’. The good news is that, as always, you can download my hacked-up version of posterity and my visterity plugin. I wouldn’t try using it if I were you, though.

conv-index-with-terms-as-databased.png

The second screenshot is my Inbox with the ‘best’ scoring keyword (using traditional tf-idf, not the themail revised metrics) displayed for each message where the histogram information is available. Since I only ran the processing code against a set of my contacts, only messages involving those people have a keyword displayed.

I’m going to try and pull in my old pre-gmail email into the system to try and get some more (personal) data to work with. Or, people who are not spammers, e-mail me so I have some more correspondence. sombrero@alum.mit.edu. Conversations about why the Pet Shop Boys are the greatest band ever are preferable. Eventually I’ll try and pull in my gaim/pidgin logs which would be more useful, but that’s arguably a different data case with special needs, and I’m already spread pretty thin focus-wise as is, so that will have to wait.

Email
Posterity
Visualizing

Comments (0)

Permalink

Anyone going to InfoVis 2007? And a dubious contact vis.

I’m going to be at Vis 2007/InfoVis 2007/VAST 2007 next week thanks to my employer, The PTR Group (who is hiring, especially embedded folk in the greater DC area). So, if anyone who reads this will be there, feel free to drop me a line in the comments below or via e-mail at sombrero@alum.mit.edu.

magneto-contact-vis-dubious.png

The above is an attempt to visualize the per-month message flows between contacts, originally drawing inspiration from iron filings in a magnetic field. It doesn’t really work, even filtering things so only messages between the contacts involving “Nerdlinger” (upper left quadrant) and “Celia” (lower right quadrant) are displayed.

The Celia one works out a lot better, as you can see. Please overlook the fact that somehow there are all sorts of messages from Celia during that blue month with neither her nor the other contacts having any meaningful message traffic per the pie-chart. (The filtering constraint is that the contacts of interest have to either be in the to/cc or the from part.) This is a result of a bug in my contact consolidation logic (now fixed), that resulted in the summary statistics not properly being updated in all cases. Recalculating the statistics isn’t going to happen tonight, so this screenshot shall forever be buggy.

What I take away from this visualization is that when attempting to indicate the relative proportion of messages between contacts across time, it would probably be better to use a single straight rainbow line. I define a rainbow line as a line made up of multiple colors; I’m sure there’s a better (pre-existing) term for the concept.

revised-me-relative-sparkbars-redblue.png

And these are some of the newer sparkline bars that only include messages to-me-from-the-contact (above: blue for to, light blue for cc), or from-me-to-the-contact (below: red for to, pink for cc). The color scheme is a red-shift (away from me)/blue-shift (towards me) sort of metaphor which no one (cooler than me) should ever guess. I think I’ll have to resort to text or icons, or just stop trying to cram so much information in there.

Email
Meta
Posterity
Visualizing

Comments (0)

Permalink

visualizing contact relationships

anonymized-contacts-village.png

First, there’s bad news on the shiny front. Although I made the SVG renderer capable of producing the gel-like circles, Mozilla’s SVG support currently lacks the features to do it as implemented. (Curse your full-featured-ness, Eye of GNOME!) Hopefully someday soon, it will support the synthetic BackgroundImage filter input (bug).

Our show-and-tell for the end of this weekend is a screenshot of the anonymized visualization from my Posterity contacts page for my ‘village’ tag (using my visterity plugin):

  • The nodes represent the contacts tagged with ‘village’.
    • The pie-slices represent the activity of the contact overall for the last year:
      • Each pie slice corresponds to a month. The current month is the slice just below due east, and we go back in time as we rotate around clockwise. The oldest month is just above due east.
      • The radius is a clamped linear scaling (over all nodes displayed and all months) of the number of messages sent by this person to anyone (including me).
      • The color is just a consistent hue-variation designed to look pretty. I was actually planning to use a single color whose saturation was at max for the current month and decayed to black for the oldest month. That arguably would reduce confusion as to what pie slice means what.
    • The labels are anonymized. There’s a mapper in there that maps people’s actual names (eventually nicknames) to what you see.
  • The edges represent e-mail messages sent from the displayed contacts to the other displayed contacts. Since I did not add my contact to this label group, this means I have no impact on things.
    • The edge widths are a linear scaling of the number of messages sent from the one contact to the other. Because the edges are actually directed, there may be two edges between two nodes which will overlap. Because the edges are not fully opaque, you should be able to figure out if the communication is at least symmetric or asymmetric.

Interestingly, you can see that the two major clusters of contacts are linked by ‘Mitt’ and ‘Celia’.

mailman-flow-whoknows.png

Er, so, every post gets two pictures, and there’s your second one. It’s the stacked line-chart being all curvy instead of not (curvy). The data is from the mailman feed, although I don’t think it’s actually supposed to look like that… the colors could also use some work.

Email
Posterity
Visualizing

Comments (0)

Permalink

contacts, tallies, sparklines, but no clever title.

sparkbar-contacts-others.png

I have hacked up my local posterity bzr branch to process messages to extract contacts (and mailing lists). These contacts result in synthetic tags (to, from, and cc) applied to each message. My changes also include maintaining per-time interval (day, ISO week, month, year, ever) sparse counts for each tag.

Visophyte (bzr trunk) has been augmented to create bar-graph style sparklines (as coined/created/etc. by Tufte). Visterity, my happy-go-visualizy posterity plugin (also in the visophyte repository), consumes the new posterity contact statistics and produces what you see above.

sparkbar-contacts-me.png

If you’re not sure what you’re seeing, each bar is a week. The grey-colored bars ‘above’ the invisible line are messages from that contact (to anyone). The red-colored bars ‘below’ the line are messages to that contact (from anyone), while the lighter-red-colored-bars below the line are messages cc’ed or bcc’ed to that contact (from anyone).

It is important to reiterate that, at this time, these from/to/cc relations have nothing to do with the person whose email repository it is (mine, in this case). If messages are red, that doesn’t mean I sent them the message; it just means someone sent them a message and it somehow passed through my account. Of course, when viewing a list of contacts, I’m only really going to care about that person’s interactions with me. So that is a must-have and up on the near-term to-do list. The current set of tags are really most useful in attempting to visualize messages sent through a mailing list or other broadcast medium, which is also of great interest to me.

I should probably also note that when writing the list-handling logic here, I forgot to have the code generate an implied ‘to’ when the person replied only to the list but the author of the previous message could be presumed to be the intended recipient of the message. Which explains why so many of the people in the example image up there do not end up receiving many messages. I would have fixed this, but re-processing my full downloaded gmail corpus of something like 68,000 messages takes a while.

Also, the blurred out guy in the example doesn’t need to be blurred out. I ended up eliding a useless column that I had decided to blur for no clear reason, but in my sleep-desiring state I also screwed up and blurred a column that really didn’t need to be blurred. (All the people shown have posted to public mailing lists, so their e-mail addresses are already out their, and their names aren’t exactly going to get them more spam. And OCR would be required anyways, as the sparklines are images rather than inline-SVG for caching reasons.)

UPDATE: Uh, as I quickly re-read the post and looked at the sparkline, I realized I completely flipped (both color-and above/below) the sparkline from what I had originally intended.  Thankfully my original intent didn’t make all that much sense either, at least color-wise, so I think I can sleep easy without correcting that.  Better color suggestions,etc. are appreciated.

Email
Posterity
Visualizing

Comments (0)

Permalink