teaser: code completion in skywriter/ajax.org code editor using jstut and narcissus

I’ve hooked up jstut’s (formerly narscribblus’) narcissus-based parser and jsctags-based abstract interpreter to the ajax.org code editor (ace, to be the basis for skywriter, the renamed and somewhat rewritten bespin).  Ace’s built-in syntax highlighters are based on the somewhat traditional regex-based state machine pattern and have no deep understanding of JS.  The tokenizers have very limited statefulness; they are line-centric and the only state is the state of the parser at the conclusion of tokenizing the previous line.  The advantage is that they will tend to be somewhat resilient in the face of syntax errors.

In contrast, narcissus is a recursive descent parser that explodes when it encounters a parse error (and takes all the state on the stack at the point of failure with it).  Accordingly, my jstut/narscribblus parser is exposed to ace through a hybrid tokenizer that uses the proper narcissus parser as its primary source of tokens and falls back to the regex state machine tokenizer for lines that the parser cannot provide tokens for.  I have thus far made some attempt at handling invalidation regions in a respectable fashion but it appears ace is pretty cavalier in terms of invalidating from the edit point to infinity, so it doesn’t really help all that much.
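The fallback idea can be sketched in a few lines of Python.  This is only an illustration of the approach, not the jstut code: the token shapes, the regex, and the names regex_tokenize and hybrid_tokenize are all assumptions made up for the example.

```python
import re

# Hypothetical stand-in for ace's regex state machine: a single pattern that
# chops a line into whitespace, identifiers, numbers and single characters.
TOKEN_RE = re.compile(r"\s+|[A-Za-z_$][\w$]*|\d+|.")


def regex_tokenize(line):
    """Fallback tokenizer: (type, text) pairs with no deep understanding of JS."""
    tokens = []
    for match in TOKEN_RE.finditer(line):
        text = match.group(0)
        if text.isspace():
            kind = "text"
        elif text[0].isalpha() or text[0] in "_$":
            kind = "identifier"
        elif text.isdigit():
            kind = "number"
        else:
            kind = "punctuation"
        tokens.append((kind, text))
    return tokens


def hybrid_tokenize(lines, parser_tokens):
    """Prefer parser tokens; fall back to the regex tokenizer line by line.

    parser_tokens is an assumed shape: a dict mapping line index to the token
    list the real parser produced for that line.  Lines the parser could not
    cover are simply absent from the dict.
    """
    result = []
    for index, line in enumerate(lines):
        if index in parser_tokens:
            result.append(parser_tokens[index])
        else:
            result.append(regex_tokenize(line))
    return result
```

The useful property is that a parse failure only degrades the affected lines to dumb highlighting instead of losing highlighting entirely.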

Whenever a successful parse occurs, the abstract interpreter is kicked off and attempts to evaluate the document.  This includes special support for CommonJS require() and CommonJS AMD define() operations.  The require(“wmsy/wmsy”) in the screenshot above actually retrieves the wmsy/wmsy module (using the RequireJS configuration), parses it using narcissus, parses the documentation blocks using jstut, performs abstract interpretation and follow-on munging, and then returns the contents of that namespace (asynchronously using promises) to the abstract interpreter for the body of the text editor.  The hybrid tokenizer does keep around a copy of the last good parse to deal with code completion in the very likely case where the intermediate stages of writing new code result in parse failures.  Analysis of the delta from the last good parse is used in conjunction with the last good parse to (attempt to) provide useful code completion.

The net result is that we have semantic information about many of the tokens on the screen and could do fancy syntax highlighting like Eclipse can do.  For example, global variables could be highlighted specially, types defined in third-party libraries could get their own color, etc.  For the purposes of code completion, we are able to determine the context surrounding the cursor and the appropriate data types to use as the basis for completion.  For example, in the first screenshot, we are able to determine that we are completing a child of “wy” which we know to be an instance of type WmsyDomain from the wmsy namespace.  We know the children of the prototype of WmsyDomain and are able to subsequently filter on the letter “d” which we know has been (effectively) typed based on the position of the cursor.  (Note: completion items are currently not sorted but rather shown in definition order.)
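A toy version of that lookup, in Python for brevity, might look like the following.  The two tables stand in for what the abstract interpreter would infer; apart from “wy”, WmsyDomain and defineWidget, every name in them is invented for the example.

```python
# Assumed output of the abstract interpreter.  Only "wy", "WmsyDomain" and
# "defineWidget" come from the post; the other member names are made up.
VARIABLE_TYPES = {"wy": "WmsyDomain"}
TYPE_MEMBERS = {
    "WmsyDomain": ["defineWidget", "hypotheticalMethod", "defineOther"],
}


def complete(text_before_cursor):
    """Complete 'object.prefix' using the inferred type of 'object'.

    Results keep definition order, matching the current (unsorted) behaviour.
    """
    obj, _, prefix = text_before_cursor.rpartition(".")
    type_name = VARIABLE_TYPES.get(obj.strip())
    members = TYPE_MEMBERS.get(type_name, [])
    return [name for name in members if name.startswith(prefix)]
```

Here complete("wy.d") yields the two members starting with “d”, while an unknown object yields nothing.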

In the second example, we are able to determine that the cursor is in an object initializer, that the object initializer is the first argument of a call to defineWidget on “wy” (which we know about as previously described).  We accordingly know the type constraint on the object initializer and thus know the legal/documented key names that can be used.

This is not working enough to point people at a live demo, but it is exciting enough to post teaser screenshots.  Of course, the code is always available for the intrepid: jstut/narscribblus, wmsy.  In a nutshell, you can hit “Alt-/” and the auto-completion code will try and do its thing.  It will display its results in a wmsy popup that is not unified with ace in terms of how focus is handled (wmsy’s bad).  Nothing you do will actually insert text, but if you click outside of the popup or hit escape it will at least go away.  The egregious deficiencies are likely to go away soon, but I am very aware and everyone else should be aware that getting this to a production-quality state you can use on multi-thousand line files with complex control flow would likely be quite difficult (although if people document their types/signatures, maybe not so bad).  And I’m not planning to pursue that (for the time being); the goal is still interactive, editable, tutorial-style examples.  And for these, the complexity is way down low.

My thanks to the ajax.org and skywriter teams; even at this early state of external and source documentation it was pretty easy to figure out how various parts worked so as to integrate my hybrid tokenizer and hook keyboard commands up.  (Caveat: I am doing some hacky things… :))  I am looking forward to the continued evolution and improvement of an already great text editor component!

non-infuriating indentation with emacs and js2-mode with require.def asynchronous module definition CommonJS boilerplate

Classic CommonJS modules assume a synchronous execution environment (for the purposes of “require”) with a specialized loader mechanism that evaluates the module in its proper context and takes care of namespacing it.  If you want to use CommonJS modules in the browser you can either:

  • Leave the source code as it is and use an XHR-based loader that uses eval to perform the namespacing trick.  In order to deal with the synchronous require assumption you can use some combination of deferring the evaluation of the module until you think you have all the dependencies and synchronous XHR.  Commonly, regular expressions are used to figure out the dependencies, but one could also use some form of static analysis.  Examples of browser-based CommonJS loaders supporting this are teleport and yabble.
  • Wrap your source code in boilerplate that takes care of the namespacing.  This can be done via a build system or done permanently in the source.  Pretty much every browser-based CommonJS loader supports this, with RequireJS being the only one I’m going to name-check because there are too many of these suckers as is.
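To make the regular-expression approach from the first option concrete, here is a small Python sketch of scanning module source for require() calls.  It is not how teleport or yabble actually do it; the pattern and the find_dependencies name are assumptions for illustration.

```python
import re

# Matches require("name") or require('name') with a literal string argument.
REQUIRE_RE = re.compile(r"""\brequire\(\s*(["'])([^"']+)\1\s*\)""")


def find_dependencies(source):
    """Return required module names in first-seen order, without duplicates."""
    seen = []
    for match in REQUIRE_RE.finditer(source):
        name = match.group(2)
        if name not in seen:
            seen.append(name)
    return seen
```

Being a regex, this will happily match a require() inside a comment or string and will miss computed module names, which is why static analysis is the more robust alternative.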

The synchronous idiom for module “foo” might look like this:

var bar = require("bar");
var baz = require("baz");
 
exports.doStuff = function() {
  return "awwww yeah.";
};

The asynchronous module definition for “foo” might look like this, noting that there are actually a couple of possible variations on this:

require.def("foo", ["exports", "bar", "baz"], function(exports, bar, baz) {
 
exports.doStuff = function() {
  return "awwww yeah.";
};
 
});

The thing that may jump out at you is that the asynchronous wrapping means that the body of our module actually lives inside a function definition within the argument list of a function call.  Let’s assume you enjoy the finer things in life and are using emacs and js2-mode for your javascript editing.  js2-mode will helpfully suggest indenting 14 characters because that puts us 2 characters in from the enclosing function call’s opening paren.

That indentation could drive a man crazy and was really my only reason for avoiding the asynchronous idiom.  Thankfully, emacs being what it is, I was able to make it do roughly what I want:

;; Check if the suggested indentation is due to require.def().  If it is, force
;;  the indentation down to zero.  We detect this case by checking whether the
;;  parse depth is 2 and the last top-level point was preceded by require.def.
(defun require-def-deindent (list index)
  (when (and (eq (nth 0 parse-status) 2)
             (save-excursion
               (let ((tl-point (syntax-ppss-toplevel-pos parse-status)))
                 (goto-char tl-point)
                 (backward-word 2)
                 (equal "require.def" (buffer-substring (point) tl-point))))
             ;; only intercede if they are suggesting what the sexprs suggest
             (let ((suggested-column (js-proper-indentation parse-status)))
               (eq (nth index list) suggested-column))
             )
    (indent-line-to 0)
    't
    ))
;; Uncomment the following to enable the hook if you want tab to always slam you
;;  to column 0 rather than doing the cycle thing.  (With the newline hook in
;;  place, I haven't seen the need yet.)
;(add-hook 'js2-indent-hook 'require-def-deindent)
 
;; Unfortunately, js2-enter-key turns off the bounce indent logic so we need to
;;  intentionally do something to get our helper invoked.  In this case, we use
;;  advice but we could also mess with the keybinding.
;; This assumes js2-enter-indents-newline is enabled / desired.
(defadvice js2-enter-key (around js2-enter-key-around)
  "Trigger require-def-deindent on enter for the newline."
  ad-do-it
  (let ((parse-status (save-excursion
                        (parse-partial-sexp (point-min) (point-at-bol))))
        positions)
    (push (current-column) positions)
    (require-def-deindent positions 0)))
(ad-activate 'js2-enter-key)

If you paste the above into your .emacs and have sufficient emacs karma, it will hopefully work for you too.

UPDATE (2011/1/1):

The AMD idiom has settled on using “define” instead of “require.def”, so here is the above code modified to this end:

;; --- CommonJS AMD define() compensation
 
;; Check if the suggested indentation is due to define().  If it is, force
;;  the indentation down to zero.  We detect this case by checking whether the
;;  parse depth is 2 and the last top-level point was preceded by define.
(defun require-def-deindent (list index)
  (when (and (eq (nth 0 parse-status) 2)
             (save-excursion
               (let ((tl-point (syntax-ppss-toplevel-pos parse-status)))
                 (goto-char tl-point)
                 (backward-word 1)
                 (equal "define" (buffer-substring (point) tl-point))))
             ;; only intercede if they are suggesting what the sexprs suggest
             (let ((suggested-column (js-proper-indentation parse-status)))
               (eq (nth index list) suggested-column))
             )
    (indent-line-to 0)
    't
    ))
;; Uncomment the following to enable the hook if you want tab to always slam you
;;  to column 0 rather than doing the cycle thing.  (With the newline hook in
;;  place, I haven't seen the need yet.)
;(add-hook 'js2-indent-hook 'require-def-deindent)
 
;; Unfortunately, js2-enter-key turns off the bounce indent logic so we need to
;;  intentionally do something to get our helper invoked.  In this case, we use
;;  advice but we could also mess with the keybinding.
;; This assumes js2-enter-indents-newline is enabled / desired.
(defadvice js2-enter-key (around js2-enter-key-around)
  "Trigger require-def-deindent on enter for the newline."
  ad-do-it
  (let ((parse-status (save-excursion
                        (parse-partial-sexp (point-min) (point-at-bol))))
        positions)
    (push (current-column) positions)
    (require-def-deindent positions 0)))
(ad-activate 'js2-enter-key)
 
;; (end define compensation)

(clicky.visophyte.org-hosted CouchDB services offline)

This means doccelerator and my Tinderbox scraper that chucked stuff into a CouchDB for exposure by a modified bugzilla jetpack.  Either couch went crazy or someone gave it a request that is ridiculously expensive to answer given the accumulated size of the tinderbox database.  I’m not aware of any trivial ways to contend with the latter.  Judging by the logs, it looks like 2 people other than myself used these services, so, my apologies to those cool, insightful, forward-looking individuals.

The specific services are probably not coming back.  doccelerator is being subsumed into something else that better meets documentation needs and will be more fully baked, more on that soon.  I got the impression at the summit that the tinderbox problem is in hand, or very close to someone’s hand; maybe one of those robot grabby-arm things is involved?  My reviewboard with bugzilla hacks instance is sticking around for the time being, but I think wheels are turning elsewhere in Mozilla on that front too, so hopefully my install is mooted before it falls over.

Lest there be any doubt, all clicky services are provided on a self-interested basis… I am happy when they benefit others, but I do these things for the benefit of my own productivity/sanity and they are hosted using my own resources.  (While I had higher hopes for doccelerator, the MoMo-resourced couchdb service provisioning never happened so doccelerator never got pushed public with nightly updates because of said clicky resource constraints.)

ediosk: an emacs buffer switcher for the rest of us

Emacs users and would-be emacs users, are you tired of those emacs developers in their ivory towers not supporting buffer switching via touch-screen on a computer that’s not running emacs and using modern web browser technology instead of disproven parentheses-based technology?  Be tired no more!*

Thanks to Christopher Wellons and Chunye Wang’s work on emacs-httpd it is a simple matter to expose a JSON representation of the current set of frames/windows/buffers in your emacs session and provide non-REST manipulation mechanisms via a webserver implemented in elisp.

Once you have exposed an API, it is a subsequent simple matter to implement some JavaScript that understands these things and presents a nice UI.  In this case, we have used the Jetpack SDK, wmsy (the Widget Manifesting SYstem, a widgeting framework I am developing), and protovis.

The screenshot basically captures the entire feature-set:

  • A protovis-based visualization that shows the location of all of the emacs “windows” (the things that show buffers).  Emacs reports to us the pixel-space coordinates/sizes of the “frames” (GUI windows) and “windows”, so this all comes magically for free.  The downside is your emacs windows need to be in the same coordinate space, so use of multiple X displays without use of DMX will likely lead to weird overlap.
    • The selected “window” in each “frame” gets a diamond.  The active frame’s diamond gets to be black instead of gray.
    • Clicking on a “window” focuses/raises the containing “frame” and selects the “window”.
  • Buffers grouped by the directory they live in (if they have a backing file).
    • Buffers visible in windows have their background composed of the colors for all the windows they are in.
    • Buffers that are modified have their text colored red.
    • Buffers that have not been freshly displayed in a window recently have their text colored grey.
    • Clicking on a buffer displays it in what the UI believes to be the currently selected frame’s currently selected window.
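The grouping step is simple enough to sketch.  The following Python is illustrative only: ediosk does this in JavaScript, and the buffer dictionaries with "name" and "file" keys are an assumed shape, not the actual JSON the elisp side emits.

```python
import posixpath
from collections import defaultdict


def group_buffers(buffers):
    """Group buffer names by the directory of their backing file.

    Buffers with no backing file (e.g. *scratch*) land under the key None.
    """
    groups = defaultdict(list)
    for buf in buffers:
        path = buf.get("file")
        directory = posixpath.dirname(path) if path else None
        groups[directory].append(buf["name"])
    return dict(groups)
```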

* This entire paragraph is a joke**; no flaming necessary.

** ediosk is not a joke though.  I seriously have a touch-screen monitor hooked up to my windows build machine to the right of my two monitors hooked up to my linux/primary development machine.  While c-x b (icicle mode) will still be my dominant buffer switching mechanism, I expect ediosk to prove useful/amusing for cases where the number of buffers greatly exceeds my mental stack, when I am switching contexts, or when I am working in multiple branches simultaneously.

Doccelerator: JavaScript documentation via JSHydra into CouchDB with an AJAX UI

doccelerator-1

About the name.  David Ascher picked it.  My choice was flamboydoc in recognition of my love of angry fruit salad color themes and because every remotely sane name has already been used by the 10 million other documentation tools out there.  Regrettably not only have we lost the excellent name, but the color scheme is at best mildly irritated at this point.

So why yet another JavaScript documentation tool:

  • JavaScript 1.8 support.  JSHydra (thanks jcranmer!) is built on spidermonkey.  In terms of existing JS documentation tools out there, they can be briefly lumped into “doesn’t even bother attempting to parse JavaScript” and “parses it to some degree, but gets really confused by JavaScript 1.8 syntax”.  By having the parser be the parser of our JS engine, parsing success is guaranteed.  And non-parsing tools tend to require too much hand labeling to be practical.
  • Doccelerator is not intended to be just a documentation tool.  While JSHydra is still in its infancy, it promises the ability to extract information from function bodies.  Its namesake, Dehydra, is a static analysis tool for C++ and has already given us great things (dxr, also in its infancy).

doccelerator-comment

  • Support community API docs contributions without forking the API docs or requiring source patches.  DevMo is a great place for documentation, but it is an iffy place for doxygen-style API docs.  Short of an exceedingly elaborate tool that round-trips doxygen/JSDoc comments to the wiki and user modifications back again, the documentation is bound to diverge.  By supporting comments directly on the semantic objects themselves[1], we eliminate having to try and determine what a given wiki change corresponds to.  (This would be annoying even if you could force the wiki users to obey a strict formatting pattern.)  This enables automatic patch generation, etc.
  • Mashable.  You post the JavaScript source file to a server running the doccelerator parser.  You get back a JSON set of documents.  You post those into a CouchDB couch.  The UI is a CouchApp; you can modify it.  Don’t like the UI, just want a service?  You can query the couch for things and get back JSON documents.  Want custom (CouchDB) views but are not in control of a documentation couch?  Replicate the couch to your own local couch and add some views.
  • Able to leverage data from dehydra/dxr.  Mozilla JS code lives in a world of XPCOM objects and their XPIDL-defined interfaces.  We want the JS documentation to be able to interact with that world.  Obviously, this raises some issues of where the boundary lies between dxr and Doccelerator.  I don’t think it matters at this point; we just need internal and API documentation for Thunderbird 3 now-ish.
  • A more ‘dynamic’ UI.  The UI is inspired by TiddlyWiki’s interface where all wiki “pages” open in the same document.  I often find myself only caring about a few methods of a class at any given time.  Documentation is generally either organized in monolithic pages or single pages per function.  Either way, I tend to end up with a separate tab for each thing of interest.  This usually ends in both confusion and way too many tabs.

1: Right now I only support commenting at the documentation display granularity which means you cannot comment on arguments individually, just the function/method/class/etc.

Example links which will eventually die because I’m not guaranteeing this couch instance will stay up forever:

The hg repo is here.  I tried to code the JS against the 1.5 standard and generally be cross-browser compatible, but I know at least Konqueror seems to get upset when it comes time to post (modified) comments.  I’m not sure what’s up with that.

Exciting potential taglines:

  • Doccelerator: Documentation from the future, because the documentation was doccelerated past the speed of light, and we all know how that turns out.
  • Doccelerator: It sounds like an extra pedal for your car and it’s just as easy to use… unless we’re talking about the clutch.
  • Doccelerator: Thankfully the name doesn’t demand confusingly named classes in the service of a stretched metaphor.  That’s good, right?

Adding stews (hackish destructive accumulation/reduction) to CouchDB

As all misguidedly-lazy programmers are wont to do, I decided that it would be easier to ‘enhance’ CouchDB to meet my needs rather than to rewrite visotank to use SQLAlchemy. Also, I wanted to understand what CouchDB was doing under the hood with views and try my hand at some Erlang.

This Has Nothing To Do With Anything

CouchDB as currently implemented maintains a lot of information for each mapped document. There is a B-tree associated with each View Group whose keys are Document Ids and whose Values are a list of {View Id, Actual-Key-You-Mapped-In-That-View} tuples for every key mapped from that document for every view in the view group. Next, each View has a B-tree associated with it whose keys are {Actual-Key-You-Mapped, Document Id} tuples and whose values are the Actual-Value-You-Mapped.
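As a rough mental model of those two structures, here is a Python sketch that uses plain dictionaries where CouchDB uses B-trees.  It is not the Erlang implementation; build_indexes and the shape of the map functions are assumptions for illustration.

```python
def build_indexes(view_maps, docs):
    """Model the per-group and per-view indexes with dictionaries.

    view_maps: {view id: map function}; each map function takes a document
    and returns an iterable of (key, value) pairs (an assumed shape).
    docs: {document id: document}.
    """
    # View group index: document id -> [(view id, mapped key), ...]
    group_index = {}
    # One index per view: (mapped key, document id) -> mapped value
    view_indexes = {view_id: {} for view_id in view_maps}
    for doc_id, doc in docs.items():
        entries = []
        for view_id, map_fn in view_maps.items():
            for key, value in map_fn(doc):
                entries.append((view_id, key))
                view_indexes[view_id][(key, doc_id)] = value
        group_index[doc_id] = entries
    return group_index, view_indexes
```

Because the document id is part of every view key, two documents that map the same key always produce two separate rows.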

This is all well and good, but is a poor fit for one of my key use-cases: reducing e-mail message traffic to date-binned summary statistics so I can render graphics. If I want the weekly-messages-sent count for a given ‘author’, map(message.author, blah) will allow me to filter only to messages sent by that author, but no matter what blah is, I will still get one row per message.

Long blog post short, I have implemented a hackish first-pass reduce/accumulate solution to my problem. The idea is that ‘stews’ allow you to aggregate mapped data that shares the same key. I’m a little fuzzy on exactly what the definition of ‘reduce’ is in the map/reduce papers (it’s been a while, if ever), so we’ll call this ‘accumulate’ (in the SICP/Scheme sense). It is a hack because:

  • It does not unify views and ‘stews’. Whereas views are defined under ‘_design’ and accessed via ‘_view’, stews are defined under ‘_pot’ and accessed via ‘_stew’.
  • Values can only be integers right now, and it’s assumed you want to add them. (No custom JavaScript logic!)
  • I have not yet dealt with modified/removed documents. Which is to say that if you modify or remove a stew-mapped document, your accumulated values will climb ever-skyward.
  • It is in no way, shape, or form intended to be anything other than a learning experiment. (It is my hope that Damien Katz magically solves my problems in the next release. Having said that, I’m not opposed to trying to actually implement a more solid feature along these lines; coding in Erlang is wicked awesome. (sounds better with a fake accent))
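What a stew computes can be modelled in a few lines of Python.  This is a description of the intended result, not of the Erlang patch; the stew function and the generator-style map function are assumptions for the sketch.

```python
from collections import defaultdict


def stew(docs, map_fn):
    """Sum the integer values emitted for each key across all documents.

    map_fn takes a document and yields (key, integer value) pairs.  As with
    the real hack, nothing here handles modified or removed documents.
    """
    totals = defaultdict(int)
    for doc in docs:
        for key, value in map_fn(doc):
            totals[key] += value
    return dict(totals)


def by_author(doc):
    """Emit one count per message, keyed by author."""
    yield doc["author"], 1
```

Feeding three messages from the same author through stew(docs, by_author) gives a single entry with value 3, rather than three rows of 1.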

It just so happens that these constraints are perfectly in line with visotank’s needs. Using stews and otherwise limiting my use of views, CouchDB is less ridiculous in its view-update times and the fully-populated (view/stew-wise) from-scratch ‘messages’ database tops out at 77M rather than 1.2G.

This also has nothing to do with anything

Anyways, if anyone is interested in the code (or the comments I added to the existing couch_view_group.erl logic), my bzr branch for CouchDB is at: http://www.visophyte.org/rev_control/bzr/couchdb/visbrero-couchdb/ . My bzr branch for couchdb-python, adding a simple unit test for stews is at: http://www.visophyte.org/rev_control/bzr/couchdb-python/visbrero/ .

Update!  The bzr repository is powerful messed up, so a better choice might be my changes in patch form:  http://www.visophyte.org/rev_control/patches/couchdb/visbrero-couchdb-stews-1.patch

Update 2! The bzr repository accessible at http://clicky.visophyte.org/rev_control/bzr/couchdb/visbrero-couchdb/ works and there’s a checkout with working copy (that you can browse) at http://clicky.visophyte.org/rev_control/bzr-checkouts/couchdb/visbrero-couchdb/ .   Note that these locations are not guaranteed to be valid for all time, but will be good for at least a month or two.

I fear my (sleepy) explanation may not be sufficient, so the unit test I added to couchdb-python may speak better to this end:

self.db['tom1'] = {'author': 'tom', 'subject': 'cheese'}
self.db['tom2'] = {'author': 'tom', 'subject': 'cats'}
self.db['tom3'] = {'author': 'tom', 'subject': 'mice'}
self.db['bob1'] = {'author': 'bob', 'subject': 'hats'}
self.db['jon1'] = {'author': 'jon', 'subject': 'hats'}
self.db['kim1'] = {'author': 'kim', 'subject': 'cats'}
self.db['kim2'] = {'author': 'kim', 'subject': 'cows'}
self.db['_pot/test'] = {'views': {
'authors': 'function(doc) { map(doc.author, 1) }',
'subjects': 'function(doc) { map(doc.subject, 1) }'
}}
authors = dict([(row.key, row.value) for row in self.db.view('_stew/test/authors')])
self.assertEqual(authors['tom'], 3)
self.assertEqual(authors['bob'], 1)
self.assertEqual(authors['jon'], 1)
self.assertEqual(authors['kim'], 2)
subjects = dict([(row.key, row.value) for row in self.db.view('_stew/test/subjects')])
self.assertEqual(subjects['cheese'], 1)
self.assertEqual(subjects['cats'], 2)
self.assertEqual(subjects['mice'], 1)

Uh, the spiral visualizations have nothing to do with the post. They are new insofar as I have never posted them before, but they are in fact rather quite old. They have a new aspect in that they now work with the cairo renderer, having relied upon ‘special’ (horrible) custom renderers in the old agg backend.

more (clicky!) mailing-list visualization a la visotank, couchdb

visotank-shot-1.png

Visotank now allows you to select some authors of interest from a sortable list of contacts, and then show the conversations they were involved in. You get the previously shown sparkbars for the author’s activity. You also get sparkbars showing the conversation activity, with each author assigned a color and consistent stacking position in that sparkbar. Click on the screenshots for zoomed versions of the screenshots to see what I mean.

You can click on things yourself at http://clicky.visophyte.org:8080/. Please only go there if you’re okay with restarting your Firefox session (especially true if Firebug is on.) All tables/images are the real thing and not fetched on demand… which results in Firefox having to pull down a lot of images. Click on some rows in the contacts table to select them. Then, in the lower tab group, click on the “conversations” tab. This will then fetch all the conversations those selected users were involved in. The system will truncate more than 10 users, so don’t go crazy. The tabs are re-fetched on switch, so if you change your contact selections, in the lower tab group, click away to “HowTo”, then back to “Conversations”. The “Conversation” tab does nothing and is a big lie.  Great UI, I know.

visotank-shot-2.png

I think you will find that sparkbar visualizations of the conversation traffic with a weekly granularity are rather useless. I think a reasonable solution would be a ‘zoomed’ sparkbar with an indication of the actual uniform timeline scale included. Since the images currently show about 2 years of data, a thread that happened 1 year ago would be centered in the middle of the image, but with its actual horizontal scale being inconsistent with that position. Future work, as always.

I have used Pylons’ Beaker caching layer to attempt to make things reasonably responsive. While CouchDB view updates are sadly quite lengthy (many many minutes when dealing with 16k messages; python-dev from Jan 2006 through Nov 2007), that is thankfully a one-off sort of thing. (The data-set is immutable once imported and I don’t change schemas that often.) The main performance hit is that I can only issue one range of keys to query in a request, so if I am trying to snipe a subset of non-consecutive information, I have to issue multiple requests. (I don’t believe POSTed views can operate against views in the database…)

Regrettably, I think my conclusion about CouchDB is that it (or something like it) will be truly fantastic in the future, but it is not going to get there soon enough for anything that hopes to be ‘productized’ anytime soon. The next thing I want to look at is using a triple-store to model some of the email data schema; my efforts from the visterity hacking suggest it could be quite useful. Of course, even if triple stores work out, I suspect a more traditional SQL database will still be required for some things. Combined with a thin custom aggregation and caching layer, that could work out well.

Note: I should emphasize that my CouchDB schema could be more optimized, but part of the experiment is/was to see if the views saved me from having to jump through clever hoops.

first steps to interactive fun using CouchDB

visotank-first-python-dev-sparkbars1.png

First, let me say that Pylons with its Paste magic is delightful; lots of nice round edges helped me get something simple up and running in no time, and using genshi to boot.

The new tool, visotank, is ingesting the python-dev mailman archives (as previously visualized) and putting them into CouchDB. The near-term goal is to allow for interactive exploration/visualization of the archives. The current result, as pictured, is simply sparkline barcharts of people’s posting history. Left-to-right, present-to-past, weekly, one (vertical) pixel per message, truncating at the image height (12 pixels).
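The bar heights boil down to weekly binning with a cap.  A minimal Python sketch, assuming message times are integer Unix timestamps (the function name and arguments are made up for the example, and this is not the visotank code):

```python
from collections import Counter

SECONDS_PER_WEEK = 7 * 24 * 60 * 60
MAX_HEIGHT = 12  # image height in pixels; taller bars are truncated


def sparkbar_heights(timestamps, now, weeks):
    """Return one bar height per week, most recent week first.

    Index 0 is the week ending at 'now' (the left edge of the image); each
    message contributes one vertical pixel, capped at MAX_HEIGHT.
    """
    counts = Counter(
        (now - ts) // SECONDS_PER_WEEK for ts in timestamps if ts <= now
    )
    return [min(counts.get(week, 0), MAX_HEIGHT) for week in range(weeks)]
```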

Although the input processing thus far is specific to mailing list archives, the couchdb schema in use is for generic e-mail traffic. The messages are even coerced into rfc2822 format for ‘raw’ storage.

The ability to use ‘map’ multiple times in couchdb views to spread information is delightful. What I really would like is more reduce functionality or, more specifically, just accumulate. The sparkbars get their data from statistics with keys [contact id, timestamp of time period] and value 1, one per message. I would love for couchdb to provide a way to aggregate all those values with identical keys into a single row with the sum as the value. I’ll look into this and the view implementation before writing any more on the subject, but if someone out there already knows a way to do this, please let me know.

visotank-first-python-dev-sparkbars2.png

radial (radar) email vis, with care factors

radial-care-factor-vis.png

It’s a radial e-mail visualization intended to be the basis for a “situational awareness” overview of your e-mail. I’ve added the beginnings of a ‘care factor’* (“do I care about this person/message?”) concept to messages and contacts, which is used to assist in focusing your attention only to messages/people you care about. Right now, the care factor is simply whether you have ever sent the contact/author of a message an e-mail directly (to = 1.0), indirectly (cc = 0.5), or not at all (nada/ninguno=0.0). That can obviously be expanded upon in many directions; involvement of people you care about in message threads (with that person), intensity of your communication with that person, explicit interest-levels via tags, social network propagation (Google’s OpenSocial) without the person previously having existed in your e-mail corpus, etc.
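The current rule is small enough to state as code.  A Python sketch, assuming each sent message is a dictionary with "to" and "cc" lists of addresses (that shape and the care_factor name are assumptions, not the actual implementation):

```python
def care_factor(contact, sent_messages):
    """Score a contact by how directly you have ever e-mailed them.

    1.0 if the contact was ever a 'to' recipient, 0.5 if only ever cc'd,
    0.0 if you have never sent them anything.
    """
    best = 0.0
    for message in sent_messages:
        if contact in message.get("to", []):
            return 1.0
        if contact in message.get("cc", []):
            best = 0.5
    return best
```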

Some more details about the visualization:

  • Things close to the center happened more recently. Things further away happened in the past. This seems like the most reasonable ‘radar’ metaphor for e-mail. If we were dealing with to-do items with due dates, then it would make sense that they are moving inward. However, the reality of e-mail is that if you don’t deal with them soon, they ‘fall off your radar’. My first thought to fuse the two would be to have messages associated with to-do tasks stick out quite obviously, latch once they hit the ‘edge’, and generally grow more ominous and threatening as time goes by. Of course, it’s probably not helpful to make people’s to-do lists seem like something they can’t escape…
    • The central grey circle is a void to ensure that angle is still meaningful even when the time is at a minimum; otherwise things would stack up and be generally extra confusing.
  • The angle is mapped to a single author/contact. This is currently random, but my intent is to allow clustering of contacts and quasi-persistent angular locations. So messages from your family might tend to come from the North, your friends the East, mailing lists the West, and ads from the South. (Let’s assume you get no spam.)  Actual geographic relationships would be a neat trick, but practically foolish.
  • Messages with a low care-factor are made more subtle by having reduced opacities. I forgot to make the edges linking messages to their parent more subtle…
  • Contacts with a high care-factor get their (anonymized) name in a strong color and their slice of the pie highlighted with a color. Contacts with a low care-factor have their names displayed more subtly and just get a grey hue for their outer-ring marker/label. The intent with the slice coloring is mainly to be intensity based with only one or two hues in use; I think using more colors will quickly overwhelm the display.
  • Time markers are in use, but may not be obvious. The blue ring labeled ’30’ (along the x-axis) indicates that’s October 30th. The inner white ring is November 1st, but I’m not clear on why it wasn’t labeled as such (aka bug). The time marker logic needs to be refactored to provide more usable single “ruler” labeling (the timeline use currently is biased towards 2 rulers, which is where the month and year went). See the test program output from below for a better example of time display, although the month/year are still AWOL in another ruler.
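The geometry described in the list above can be sketched as a polar mapping.  The Python below is an illustration under assumed parameters (a linear time scale, evenly spaced contacts, and made-up radius values), not the visualization’s actual code.

```python
import math


def place_message(age, max_age, contact_index, contact_count,
                  void_radius=0.2, outer_radius=1.0):
    """Return (x, y) for a message in the radial layout.

    age / max_age sets the radius: newer messages sit closer to the center
    but never inside the central void, so the angle stays readable.
    contact_index picks the angle, one evenly spaced slice per contact.
    """
    fraction = min(max(age / max_age, 0.0), 1.0)
    radius = void_radius + fraction * (outer_radius - void_radius)
    angle = 2 * math.pi * contact_index / contact_count
    return radius * math.cos(angle), radius * math.sin(angle)
```

Clustering contacts would only change how contact_index is assigned; the radius calculation stays the same.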

radial-blah-blah-blah.png

And there’s the test program. Note that edges connect a message to its parent, and currently always flow clock-wise for time. So the innermost red message is the parent of the inner-most green message. I’m a bit conflicted about this; the consistency is nice, but the relationship would probably be more obvious if we took the shortest path. Also, since e-mail reply relationships are causal, it’s not like there’s any doubt which message was a reply to the other.

* I say ‘care factor’ because I did this work on a red-eye flight where my tiredness overwhelmed my natural defense against puns, and since Halloween was recent, and there was that tv show called ‘scare factor’, etc. etc.

some email analysis for some email visualization

An attempt to apply hidden topic markov models to e-mail to perform topic analysis has morphed into simply deriving (aggregate) word-frequency information for TF-IDF purposes. The e-mails I attempted to analyze from my corpus appear to simply have been too short and wanting for quantity to pull a rabbit out of the (algorithmic) hat. (I only threw e-mails amongst my ‘village’-tagged contacts, as previously visualized.)

poor-mans-themail-but-hey-first-pass.png

Luckily, there’s a lot you can do with such information. (And in fact, I ended up using the word frequency info to attempt to normalize out e-mail signatures since I didn’t feel like doing the right thing for signatures at the time.) The bad news is I am not doing anything polished or good with the info yet.

The above is a quick proof-of-it-kinda-works which apes Themail’s monthly words concept. If you’re not familiar with Themail, click the link, read the PDF. It is/was a covet-able research prototype that let ‘you’ explore your history, e-mail-wise. It’s not available for download, hence ‘was’, and was only available to participating subjects, hence ‘you’. The good news is that, as always, you can download my hacked-up version of posterity and my visterity plugin. I wouldn’t try using it if I were you, though.

conv-index-with-terms-as-databased.png

The second screenshot is my Inbox with the ‘best’ scoring keyword (using traditional tf-idf, not the themail revised metrics) displayed for each message where the histogram information is available. Since I only ran the processing code against a set of my contacts, only messages involving those people have a keyword displayed.
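For reference, picking the ‘best’ keyword with traditional tf-idf looks roughly like this.  It is a Python sketch assuming messages are already split into lowercase word lists; it is not the visterity code, and it skips the signature normalization mentioned earlier.

```python
import math
from collections import Counter


def best_keywords(messages):
    """Return the highest tf-idf term for each message (None if empty).

    messages is a list of word lists.  tf is the term's share of the message;
    idf is log(number of messages / number of messages containing the term).
    """
    doc_freq = Counter()
    for words in messages:
        doc_freq.update(set(words))
    total = len(messages)
    best = []
    for words in messages:
        counts = Counter(words)
        scores = {
            word: (count / len(words)) * math.log(total / doc_freq[word])
            for word, count in counts.items()
        }
        best.append(max(scores, key=scores.get) if scores else None)
    return best
```

Words that appear in every message score zero, which is what keeps things like “the” from ever winning.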

I’m going to try and pull in my old pre-gmail email into the system to try and get some more (personal) data to work with. Or, people who are not spammers, e-mail me so I have some more correspondence. sombrero@alum.mit.edu. Conversations about why the Pet Shop Boys are the greatest band ever are preferable. Eventually I’ll try and pull in my gaim/pidgin logs which would be more useful, but that’s arguably a different data case with special needs, and I’m already spread pretty thin focus-wise as is, so that will have to wait.