(clicky.visophyte.org-hosted CouchDB services offline)

This means doccelerator and my Tinderbox scraper that chucked stuff into a CouchDB for exposure by a modified bugzilla jetpack.  Either couch went crazy or someone gave it a request that is ridiculously expensive to answer with a database the accumulated size of the tinderbox database.  I’m not aware of any trivial ways to contend with the latter.  Judging by the logs, it looks like 2 people other than myself used these services, so, my apologies to those cool, insightful, forward-looking individuals.

The specific services are probably not coming back.  doccelerator is being subsumed into something else that better meets documentation needs and will be more fully baked, more on that soon.  I got the impression at the summit that the tinderbox problem is in hand, or very close to someone’s hand; maybe one of those robot grabby-arm things is involved?  My reviewboard with bugzilla hacks instance is sticking around for the time being, but I think wheels are turning elsewhere in Mozilla on that front too, so hopefully my install is mooted before it falls over.

Lest there be any doubt, all clicky services are provided on a self-interested basis… I am happy when they benefit others, but I do these things for the benefit of my own productivity/sanity and they are hosted using my own resources.  (While I had higher hopes for doccelerator, the MoMo-resourced couchdb service provisioning never happened so doccelerator never got pushed public with nightly updates because of said clicky resource constraints.)

who is starring Firefox tinderbox failures and how long before they do?

Using the accumulated data in my CouchDB instance about the Firefox tinderbox and Protovis, we can see how long it takes before a testfailed/busted build gets starred and who stars it.  The horizontal axis is the number of minutes between the end of the build and when it gets starred.  If someone stars the build before it finishes running, a negative number is possible.  The vertical axis is a double-height histogram, which is to say the scaling is percentage based and don’t try to infer absolute numbers.  Colors tell us who did the starring.

The winner is philor.  I knew that going in, I just wanted to see it.  Thank you for reducing the pain of dealing with the tree!

partial: posting pymake data/results to a couch

What it does:

  • Adds a “–couch-json” option to bsmedberg‘s pymake that takes a URL that crams the state of (global, non-rule associated) variables into a couch document for each makefile context executed.  Variables are grouped by general origin as well as specific Makefile origin where relevant.  The code as it stands only tells you about make variables that were accessed because this cuts down on things you (I) don’t care about.  (For example, if FOO and BAR are defined but only FOO is ever expanded, then only FOO goes in the couch document.)

I think there’s a fair bit of potential in this (pushing “make -p” style data/make output into couch), at least in situations where there’s no escaping make, you don’t already have usable/appropriate tooling, and python/CouchDB/JSON is one of your religions.

I’m hoping to avoid touching the mozilla build system in any meaningful fashion from here on out, so I am unlikely to work on this much more, but it could be useful to others, hence this post.  The repo is here, check out the “understand” (p)branch.  (The changes are not suitable for upstreaming as things stand.)

Tinderbox results in bugzilla, jetpack times 2, CouchDB, review board

mstange‘s Tinderboxpushlog is awesome.  You know it, I know it, the many people whose sanity has been saved by it know it.  It is a fantastic improvement on checking the tinderbox; it lets you know the current state of the tree, the recent history of the tree, and how these things are correlated with recent commits.  What it is not good at (nor intended for) is to be a historical record.  Tinderboxpushlog ‘scrapes’ the tinderbox on-demand using tinderbox time windows and does not have the ability to key off anything but the time.

So if one refactored the scraper into a CommonJS module suitable for use with the newly rebooted jetpack, hooked it up to a cron job, and crammed its outputinto a CouchDB database, what would you get?

Exactly, something you can hook up to johnath and ehsan‘s magic bugzilla jetpack (last mentioned on the blog here).  You can install it from here.  (johnath/ehsan, please feel free to pull the changes into the upstream repo; I’m still wary of randomly pushing things into other people’s user repos…).  I know the presentation is ugly, not the least because the text labels are inconsistent with tinderboxpushlog; feel free to push into my user repo with improvements.  Oh, and for this to work you need to put an honest-to-goodness URL in your bugzilla comment or attachment description; there is no all-powerful regex if you type things out by hand.

You can find the thing that pushes thing into the couch using refactored tinderboxpushlog logic here.  Some cuddlefish runner tweaks (not all of which are likely advisable) can be found in my user jetpack-sdk repo.  (The new jetpack reboot is wicked awesome, by the way.)

Right now the cron job is running against the MozillaTry, Firefox, and Thunderbird3.1 trees on the tinderbox every 5 minutes.  While it should be pretty easy for me to keep the cron job and couch server online, I make no guarantees about the schema used in the couch, just that I will keep the jetpack in sync with it.  And if the service starts exceeding the resources of my (personal) linode, I may have to tweak polling rates or what not (‘what not’ meaning ‘up to and including turning it off’).

There is other work happening in this area and I am excited about that.  For example, I think brasstacks has an encompassing vision that should help provide historical data about which tests failed, etc.  With any luck, my efforts here will be mooted by buildbot and magic awesome dust.

The mention of reviewboard in the title is just that if you are using my review board stuff and put a link to your try server commit in the attachment description, we will use that to pull the official hg changeset as the basis for the diff.  The main benefit of this is that if your patch depends on other patches in your queue that are not yet in the trunk, the diff will still work out.  Specifically, if your queue had A, B, and C applied (where C is the tip) and you link to C, then we will provide a diff of C relative to B.  Please be aware that hg.mozilla.org is rather slow about providing the hg changeset diffs on demand so this will be at least an order of magnitude slower than the fetch of an already-available patch from bugzilla.  Repo with changes is here.

Doccelerator: JavaScript documentation via JSHydra into CouchDB with an AJAX UI

doccelerator-1

About the name.  David Ascher picked it.  My choice was flamboydoc in recognition of my love of angry fruit salad color themes and because every remotely sane name has already been used by the 10 million other documentation tools out there.  Regrettably not only have we lost the excellent name, but the color scheme is at best mildly irritated at this point.

So why yet another JavaScript documentation tool:

  • JavaScript 1.8 support.  JSHydra (thanks jcranmer!) is built on spidermonkey.  In terms of existing JS documentation tools out there, they can be briefly lumped into “doesn’t even both attempting to parse JavaScript” and “parses it to some degree, but gets really confused by JavaScript 1.8 syntax”.  By having the parser be the parser of our  JS engine, parsing success is guaranteed.  And non-parsing tools tend to require too much hand labeling to be practical.
  • Docceleterator is not intended to be just a documentation tool.  While JSHydra is still in its infancy, it promises the ability to extract information from function bodies.  Its namesake, Dehydra, is a static analysis tool for C++ and has already given us great things (dxr, also in its infancy).

doccelerator-comment

  • Support community API docs contributions without forking the API docs or requiring source patches.  DevMo is a great place for documentation, but it is an iffy place for doxygen-style API docs.  Short of an exceedingly elaborate tool that round-trips doxygen/JSDoc comments to the wiki and user modifications back again, the documentation is bound to diverge.  By supporting comments directly on the semantic objects themselves[1], we eliminate having to try and determine what a given wiki change corresponds to.  (This would be annoying even if you could force the wiki users to obey a strict formatting pattern.)  This enables automatic patch generation, etc.
  • Mashable.  You post the JavaScript source file to a server running the doccelerator parser.  You get back a JSON set of documents.  You post those into a CouchDB couch.  The UI is a CouchApp; you can modify it.  Don’t like the UI, just want a service?  You can query the couch for things and get back JSON documents.  Want custom (CouchDB) views but are not in control of a documentation couch?  Replicate the couch to your own local couch and add some views.
  • Able to leverage data from dehydra/dxr.  Mozilla JS code lives in a world of XPCOM objects and their XPIDL-defined interfaces.  We want the JS documentation to be able to interact with that world.  Obviously, this raises some issues of where the boundary lies between dxr and Doccelerator.  I don’t think it matters at this point; we just need internal and API documentation for Thunderbird 3 now-ish.
  • A more ‘dynamic’ UI.  The UI is inspired by TiddlyWiki‘s interface where all wiki “pages” open in the same document.  I often find myself only caring about a few methods of a class at any given time.  Documentation is generally either organized in monolithic pages or single pages per function.  Either way, I tend to end up with a separate tab for each thing of interest.  This usually ends in both confusion and way too many tabs.

1: Right now I only support commenting at the documentation display granularity which means you cannot comment on arguments individually, just the function/method/class/etc.

Example links which will eventually die because I’m not guaranteeing this couch instance will stay up forever:

The hg repo is here.  I tried to code the JS against the 1.5 standard and generally be cross-browser compatible, but I know at least Konqueror seems to get upset when it comes time to post (modified) comments.  I’m not sure what’s up with that.

Exciting potential taglines:

  • Doccelerator: Documentation from the future, because the documentation was doccelerated past the speed of light, and we all know how that turns out.
  • Doccelerator: It sounds like an extra pedal for your car and it’s just as easy to use… unless we’re talking about the clutch.
  • Doccelerator: Thankfully the name doesn’t demand confusingly named classes in the service of a stretched metaphor.  That’s good, right?

Adding stews (hackish destructive accumulation/reduction) to CouchDB

As all misguidedly-lazy programmers are wont to do, I decided that it would be easier to ‘enhance’ CouchDB to meet my needs rather than to rewrite visotank to use SQLAlchemy. Also, I wanted to understand what CouchDB was doing under the hood with views and try my hand at some Erlang.

This Has Nothing To Do With Anything

CouchDB as currently implemented maintains a lot of information for each mapped document. There is a B-tree associated with each View Group whose keys are Document Ids and whose Values are a list of {View Id, Actual-Key-You-Mapped-In-That-View} tuples for every key mapped from that document for every view in the view group. Next, each View has a B-tree associated with it whose keys are {Actual-Key-You-Mapped, Document Id} tuples and whose values are the Actual-Value-You-Mapped.

This is all well and good, but is a poor fit for one of my key use-cases: reducing e-mail message traffic to date-binned summary statistics so I can render graphics. If I want the weekly-messages-sent count for a given ‘author’, map(message.author, blah) will allow me to filter only to messages sent by that author, but no matter what blah is, I will still get one per message.

Long blog post short, I have implemented a hackish first-pass reduce/accumulate solution to my problem. The idea is that ‘stews’ allow you to aggregate mapped data that shares the same key. I’m a little fuzzy on exactly what the definition of ‘reduce’ is in the map/reduce papers (it’s been a while, if ever), so we’ll call this ‘accumulate’ (in the SICP/Scheme sense). It is a hack because:

  • It does not unify views and ‘stews’. Whereas views are defined under ‘_design’ and accessed via ‘_view’, stews are defined under ‘_pot’ and accessed via ‘_stew’.
  • Values can only be integers right now, and it’s assumed you want to add them. (No custom JavaScript logic!)
  • I have not yet dealt with modified/removed documents. Which is to say that if you modify or remove a stew-mapped document, your accumulated values will climb ever-skyward.
  • It is in no way, shape, or form intended to be anything other than a learning experiment. (It is my hope that Damien Katz magically solves my problems in the next release. Having said that, I’m not opposed to trying to actually implement a more solid feature along these lines; coding in Erlang is wicked awesome. (sounds better with a fake accent))

It just so happens that these constraints are perfectly in line with visotank’s needs. Using stews and otherwise limiting my use of views, CouchDB is less ridiculous in its view-update times and the fully-populated (view/stew-wise) from-scratch ‘messages’ database tops out at 77M rather than 1.2G.

This also has nothing to do with anything

Anyways, if anyone is interested in the code (or the comments I added to the existing couch_view_group.erl logic), my bzr branch for CouchDB is at: http://www.visophyte.org/rev_control/bzr/couchdb/visbrero-couchdb/ . My bzr branch for couchdb-python, adding a simple unit test for stews is at: http://www.visophyte.org/rev_control/bzr/couchdb-python/visbrero/ .

Update!  The bzr repository is powerful messed up, so a better choice might be my changes in patch form:  http://www.visophyte.org/rev_control/patches/couchdb/visbrero-couchdb-stews-1.patch

Update 2! The bzr repository accessible at http://clicky.visophyte.org/rev_control/bzr/couchdb/visbrero-couchdb/ works and there’s a checkout with working copy (that you can browse) at http://clicky.visophyte.org/rev_control/bzr-checkouts/couchdb/visbrero-couchdb/ .   Note that these locations are not guaranteed to be valid for all time, but will be good for at least a month or two.

I fear my (sleepy) explanation may not be sufficient, so the unit test I added to couchdb-python may speak better to this end:

self.db['tom1'] = {'author': 'tom', 'subject': 'cheese'}
self.db['tom2'] = {'author': 'tom', 'subject': 'cats'}
self.db['tom3'] = {'author': 'tom', 'subject': 'mice'}
self.db['bob1'] = {'author': 'bob', 'subject': 'hats'}
self.db['jon1'] = {'author': 'jon', 'subject': 'hats'}
self.db['kim1'] = {'author': 'kim', 'subject': 'cats'}
self.db['kim2'] = {'author': 'kim', 'subject': 'cows'}
self.db['_pot/test'] = {'views': {
'authors': 'function(doc) { map(doc.author, 1) }',
'subjects': 'function(doc) { map(doc.subject, 1) }'
}}
authors = dict([(row.key, row.value) for row in self.db.view('_stew/test/authors')])
self.assertEqual(authors['tom'], 3)
self.assertEqual(authors['bob'], 1)
self.assertEqual(authors['jon'], 1)
self.assertEqual(authors['kim'], 2)
subjects = dict([(row.key, row.value) for row in self.db.view('_stew/test/subjects')])
self.assertEqual(subjects['cheese'], 1)
self.assertEqual(subjects['cats'], 2)
self.assertEqual(subjects['mice'], 1)

Uh, the spiral visualizations have nothing to do with the post. They are new insofar as I have never posted them before, but they are in fact rather quite old. They have a new aspect in that they now work with the cairo renderer, having relied upon ‘special’ (horrible) custom renderers in the old agg backend.

SVG in visotank

visotank-conversation-timeline-snippet.png

visotank now has AJAX-loaded SVG graphics. The hooks are there to actually do something when you click on stuff, but it doesn’t do anything. The visualization is ripped from my visterity plugin for posterity; it’s not supposed to be new or exciting. The fact that the SVG is loaded via AJAX is new (visterity didn’t have that) and exciting. Pretty much everything else is simply legwork relating to using application/xhtml+xml instead of txt/html and the ramifications of that, especially with AJAX.

I’ve updated what is running at http://clicky.visophyte.org:8080/, but it looks like my VPS has some issues, so I wouldn’t be surprised if it things hang or are very slow when not yet cached.  (Specifically, I think it has very serious IO issues, but its absurd amounts of memory available avoid that problem from impacting things too much.)  (Normal slowness like its refusal to pipeline requests and there being hundreds of images is not a VM problem.)  Also, the SVG stuff is unlikely to work on anything but Firefox 2; at least 3.0a8 gets angry for me on gutsy.

To see the SVG graphs, the steps are: 1) select at least one contact in the contacts list, 2) click on the ‘conversations’ tab in the bottom half, select a conversation (you can only select one), and 3) click on the ‘conversation’ tab in the bottom half.  I should note that you might want to wait for all of the sparkbars to load before proceeding to the next step…

visotank-screenshot-conversation-timeline.png

more (clicky!) mailing-list visualization a la visotank, couchdb

visotank-shot-1.png

Visotank now allows you to select some authors of interest from a sortable list of contacts, and then show the conversations they were involved in. You get the previously shown sparkbars for the author’s activity. You also get sparkbars showing the conversation activity, with each author assigned a color and consistent stacking position in that sparkbar. Click on the screenshots for zoomed versions of the screenshots to see what I mean.

You can click on things yourself at http://clicky.visophyte.org:8080/. Please only go there if you’re okay with restarting your Firefox session (especially true if Firebug is on.) All tables/images are the real thing and not fetched on demand… which results in Firefox having to pull down a lot of images. Click on some rows in the contacts table to select them. Then, in the lower tab group, click on the “conversations” tab. This will then fetch all the conversations those selected users were involved in. The system will truncate more than 10 users, so don’t go crazy. The tabs are re-fetched on switch, so if you change your contact selections, in the lower tab group, click away to “HowTo”, then back to “Conversations”. The “Conversation” tab does nothing and is a big lie.  Great UI, I know.

visotank-shot-2.png

I think you will find that sparkbar visualizations of the conversation traffic with a weekly granularity are rather useless. I think a reasonable solution would be a ‘zoomed’ sparkbar with an indication of the actual uniform timeline scale included. Since the images currently show about 2 years of data, a thread that happened 1 year ago would be centered in the middle of the image, but with its actual horizontal scale being inconsistent with that position. Future work, as always.

I have used Pylon’s Beaker caching layer to attempt to make things reasonably responsive. While CouchDB view updates are sadly quite lengthy (many many minutes when dealing with 16k messages; python-dev from Jan 2006 through Nov 2007), that is thankfully a one-off sort of thing. (The data-set is immutable once imported and I don’t change schemas that often.) The main performance hit is that I can only issue one range of keys to query in a request, so if I am trying to snipe a subset of non-consecutive information, I have to issue multiple requests. (I don’t believe POSTed views can operate against views in the database…)

Regrettably, I think my conclusion about CouchDB is that it (or something like it) will be truly fantastic in the future, but it is not going to get there soon enough for anything that hopes to be ‘productized’ anytime soon. The next thing I want to look at is using a triple-store to model some of the email data schema; my efforts from the visterity hacking suggest it could be quite useful. Of course, even if triple stores work out, I suspect a more traditional SQL database will still be required for some things. Combined with a thin custom aggregation and caching layer, that could work out well.

Note: I should emphasize that my CouchDB schema could be more optimized, but part of the experiment is/was to see if the views saved me from having to jump through clever hoops.

first steps to interactive fun using CouchDB

visotank-first-python-dev-sparkbars1.png

First, let me say that Pylons with its Paste magic is delightful; lots of nice round edges helped me get something simple up and running in no time, and using genshi to boot.

The new tool, visotank, is ingesting the python-dev mailman archives (as previously visualized) and putting them into CouchDB. The near-term goal is to allow for interactive exploration/visualization of the archives. The current result, as pictured, is simply sparkline barcharts of people’s posting history. Left-to-right, present-to-past, weekly, one (vertical) pixel per message, truncating at the image height (12 pixels).

Although the input processing thus far is specific to mailing list archives, the couchdb schema in use is for generic e-mail traffic. The messages are even coerced into rfc2822 format for ‘raw’ storage.

The ability to use ‘map’ multiple times in couchdb views to spread information is delightful. What I really would like is more reduce functionality or, more specifically, just accumulate. The sparkbars get their data from statistics with keys [contact id, timestamp of time period] and value 1, one per message. I would love for couchdb to provide a way to aggregate all those values with identical keys into a single row with the sum as the value. I’ll look into this and the view implementation before writing any more on the subject, but if someone out there already knows a way to do this, please let me know.

visotank-first-python-dev-sparkbars2.png