some email analysis for some email visualization

An attempt to apply hidden topic markov models to e-mail to perform topic analysis has morphed into simply deriving (aggregate) word-frequency information for TF-IDF purposes. The e-mails I attempted to analyze from my corpus appear to simply have been too short and wanting for quantity to pull a rabbit out of the (algorithmic) hat. (I only threw e-mails amongst my ‘village’-tagged contacts, as previously visualized.)

poor-mans-themail-but-hey-first-pass.png

Luckily, there’s a lot you can do with such information. (And in fact, I ended up using the word frequency info to attempt to normalize out e-mail signatures since I didn’t feel like doing the right thing for signatures at the time.) The bad news is I am not doing anything polished or good with the info yet.

The above is a quick proof-of-it-kinda-works which apes Themail‘s monthly words concept. If you’re not familiar with Themail, click the link, read the PDF. It is/was a covet-able research prototype that let ‘you’ explore your history, e-mail-wise. It’s not available for download, hence ‘was’, and was only available to participating subjects, hence ‘you’. The good news is that, as always, you can download my hacked-up version of posterity and my visterity plugin. I wouldn’t try using it if I were you, though.

conv-index-with-terms-as-databased.png

The second screenshot is my Inbox with the ‘best’ scoring keyword (using traditional tf-idf, not the themail revised metrics) displayed for each message where the histogram information is available. Since I only ran the processing code against a set of my contacts, only messages involving those people have a keyword displayed.

I’m going to try and pull in my old pre-gmail email into the system to try and get some more (personal) data to work with. Or, people who are not spammers, e-mail me so I have some more correspondence. sombrero@alum.mit.edu. Conversations about why the Pet Shop Boys are the greatest band ever are preferable. Eventually I’ll try and pull in my gaim/pidgin logs which would be more useful, but that’s arguably a different data case with special needs, and I’m already spread pretty thin focus-wise as is, so that will have to wait.