{"id":161,"date":"2009-02-04T07:08:11","date_gmt":"2009-02-04T12:08:11","guid":{"rendered":"http:\/\/www.visophyte.org\/blog\/?p=161"},"modified":"2009-02-25T22:37:34","modified_gmt":"2009-02-26T03:37:34","slug":"visualization-of-control-flowdata-flow-analysis-of-sqlite-explain-ations","status":"publish","type":"post","link":"https:\/\/www.visophyte.org\/blog\/2009\/02\/04\/visualization-of-control-flowdata-flow-analysis-of-sqlite-explain-ations\/","title":{"rendered":"visualization of control-flow\/data-flow analysis of sqlite EXPLAIN-ations"},"content":{"rendered":"<div id=\"attachment_162\" style=\"width: 166px\" class=\"wp-caption alignleft\"><a href=\"http:\/\/www.visophyte.org\/blog\/wp-content\/uploads\/2009\/02\/blocks-before.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-162\" class=\"size-thumbnail wp-image-162\" title=\"control-flow before\" src=\"http:\/\/www.visophyte.org\/blog\/wp-content\/uploads\/2009\/02\/blocks-before-156x600.png\" alt=\"control-flow before\" width=\"156\" height=\"600\" srcset=\"https:\/\/www.visophyte.org\/blog\/wp-content\/uploads\/2009\/02\/blocks-before-156x600.png 156w, https:\/\/www.visophyte.org\/blog\/wp-content\/uploads\/2009\/02\/blocks-before-78x300.png 78w\" sizes=\"auto, (max-width: 156px) 100vw, 156px\" \/><\/a><p id=\"caption-attachment-162\" class=\"wp-caption-text\">control-flow before<\/p><\/div>\n<p>While doing some work on the gloda-based search targeted for beta 2, I came upon some query slowness and decided to look into it.\u00a0 Because gloda queries are (conceptually, at least) dynamic, we also generate dynamic SQL.\u00a0 Our schema is fairly normalized, and it&#8217;s not always obvious to me what I can expect to have going on.\u00a0 EXPLAIN QUERY PLAN is a good litmus test for massive insanity, but it&#8217;s a bit concise.\u00a0 For example, the troublesome SQL was as follows:<\/p>\n<p><tt>SELECT * FROM messages INNER JOIN messagesText ON messages.id = messagesText.rowid WHERE id IN (SELECT docid FROM messagesText WHERE subject MATCH \"sutherland\") AND deleted = 0 AND folderID IS NOT NULL AND messageKey IS NOT NULL ORDER BY date DESC LIMIT 100;<\/tt><\/p>\n<p>This nets us:<\/p>\n<p><tt>0|0|TABLE messages WITH INDEX deleted<br \/>\n1|1|TABLE messagesText VIRTUAL TABLE INDEX 1:<br \/>\n0|0|TABLE messagesText VIRTUAL TABLE INDEX 2:<\/tt><\/p>\n<p>Everything&#8217;s an index!\u00a0 We win!\u00a0 Sorta.\u00a0 The index it chose is somewhat worrying if you think about it.\u00a0 But really, it&#8217;s all quite nebulous.\u00a0 I have basically no idea what is happening in there.\u00a0 The good news is that EXPLAIN will tell us the actual opcodes in use.\u00a0 The bad news is that it is <a href=\"http:\/\/www.visophyte.org\/blog\/wp-content\/uploads\/2009\/02\/explained-before.txt\">quite hard to understand<\/a> (104 opcodes), at least for this query.<\/p>\n<p>Python, graphviz, pygraphviz, and compulsive tool-building to the rescue!\u00a0 My original plan was that data-flow analysis could make everything incredibly obvious as to what is going on.\u00a0 I&#8217;m not sure I reached that point.\u00a0 But in the process, the efforts required to make the system clever enough to do data analysis allow the control flow diagram to be quite pretty and have lots of useful information.\u00a0 The main sign of data-flow analysis is that all cursor write operations have the list of cursors that data flowed from in parens.\u00a0 Each cursor gets its own tango-ish color, and opcodes involving the cursor are colored by that cursor.<\/p>\n<p>The major flaw in the data-flow analysis that springs to mind right now is that it ignores control flow.\u00a0 An action that will only be taken because of a control-flow decision based on data retrieved from a cursor should ideally establish a relationship.\u00a0 This is important because gloda throws around a lot of set intersections (although not in this case) in its queries, and it would be nice to be able to concisely express that.\u00a0 The control-flow diagram is arguably orders of magnitude better than a raw EXPLAIN, but my hope is\/was that the data analysis can make it trivial to figure out how things are being hooked up.\u00a0 Given the amount of effort already expended and the present results, I figure I&#8217;m probably at the &#8220;control flow is good enough for the foreseeable future stage of things&#8221;.<\/p>\n<p>In any event, the control-flow graph makes it (more) obvious that the outer loop is using the &#8216;deleted&#8217; index to walk over *every deleted message in the database*.\u00a0 A one-off idiom computes the full-text search and stashes it in an intermediary table.\u00a0 As we then walk over every deleted message, we see if that message can be found in the full-text search result table.\u00a0 If it is, our previously queued lazy-seek against the main messages table happens and we search against the full-text table to do our join.\u00a0 And out pop results!<\/p>\n<p>My hope for this query was that the deleted condition would be executed as a constraint as we were walking our result set, just like the folderID and messageKey constraints.\u00a0 Making sure your queries are using the right index and helping the optimizer make the right call is a fairly textbook problem and the SQLite docs are actually pretty good on this.\u00a0 For the sake of the example, I have dropped the index entirely.\u00a0 (The &#8216;deleted&#8217; index was added so we can quickly mark messages as needing deletion processing without slowing down deletes for the user.\u00a0 It may turn out to be more beneficial to leave the field as-is, but use a separate table as our work queue.)<\/p>\n<div id=\"attachment_164\" style=\"width: 610px\" class=\"wp-caption alignnone\"><a href=\"http:\/\/www.visophyte.org\/blog\/wp-content\/uploads\/2009\/02\/dataflow-before.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-164\" class=\"size-thumbnail wp-image-164\" title=\"dataflow before\" src=\"http:\/\/www.visophyte.org\/blog\/wp-content\/uploads\/2009\/02\/dataflow-before-600x247.png\" alt=\"dataflow before\" width=\"600\" height=\"247\" srcset=\"https:\/\/www.visophyte.org\/blog\/wp-content\/uploads\/2009\/02\/dataflow-before-600x247.png 600w, https:\/\/www.visophyte.org\/blog\/wp-content\/uploads\/2009\/02\/dataflow-before-300x123.png 300w, https:\/\/www.visophyte.org\/blog\/wp-content\/uploads\/2009\/02\/dataflow-before-1024x421.png 1024w, https:\/\/www.visophyte.org\/blog\/wp-content\/uploads\/2009\/02\/dataflow-before.png 1076w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/><br \/>\n<\/a><p id=\"caption-attachment-164\" class=\"wp-caption-text\">dataflow before<\/p><\/div>\n<div id=\"attachment_167\" style=\"width: 240px\" class=\"wp-caption alignright\"><a href=\"http:\/\/www.visophyte.org\/blog\/wp-content\/uploads\/2009\/02\/blocks-after.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-167\" class=\"size-thumbnail wp-image-167\" title=\"control-flow after\" src=\"http:\/\/www.visophyte.org\/blog\/wp-content\/uploads\/2009\/02\/blocks-after-230x600.png\" alt=\"control-flow after\" width=\"230\" height=\"600\" srcset=\"https:\/\/www.visophyte.org\/blog\/wp-content\/uploads\/2009\/02\/blocks-after-230x600.png 230w, https:\/\/www.visophyte.org\/blog\/wp-content\/uploads\/2009\/02\/blocks-after-115x300.png 115w, https:\/\/www.visophyte.org\/blog\/wp-content\/uploads\/2009\/02\/blocks-after-392x1024.png 392w, https:\/\/www.visophyte.org\/blog\/wp-content\/uploads\/2009\/02\/blocks-after.png 1064w\" sizes=\"auto, (max-width: 230px) 100vw, 230px\" \/><\/a><p id=\"caption-attachment-167\" class=\"wp-caption-text\">control-flow after<\/p><\/div>\n<p>After the deletion, we get the revised diagram.\u00a0 The full-text search still happens completely before we produce any other results, but this is keeping in keeping with our query.\u00a0 (The query generation code is designed to handle the case where we have multiple constraints and must intersect the results of each.\u00a0 It can clearly be optimized for the case where no primary-key intersection is required.)\u00a0 The traversal of the full-text search result set is now the outer loop.\u00a0 Deleted is now just filtering like folderID and messageKey, as expected.<\/p>\n<p>The major lingering red flag is the block that calls Last() and Delete().\u00a0 The conditional that makes the decision to do that is using &#8220;r1&#8221;, which is apparently implementing our limit logic inside the code.\u00a0 This was confusing until I remembered that we&#8217;ve got an ORDER BY going on.\u00a0 The Delete is just culling the results that will never potentially be seen.\u00a0 The bad news is that we are doing our full-text search join prior to the ORDER BY culling.\u00a0 So, my original fear that we are doing the JOIN no matter what still holds.\u00a0 And that&#8217;s enough excitement for one day.<\/p>\n<p>Given that any full-text search with ordered results will require retrieval the entire result set and correlating it against at least one table, I&#8217;m tempted to create separate tables for different windows of time given the importance of recency.\u00a0 Thoughts\/input appreciated, as anyone who has read this far is probably very detail oriented&#8230;<\/p>\n<p><strong>UPDATE (Feb 25, 2009):<\/strong> The hg repo is <a href=\"http:\/\/hg.mozilla.org\/users\/bugmail_asutherland.org\/grokexplain\/\">http:\/\/hg.mozilla.org\/users\/bugmail_asutherland.org\/grokexplain\/<\/a>.\u00a0 It now does some actual value propagation (and has had various bugs fixed).\u00a0 By default it ignores &#8216;Yield&#8217; because the way in which they are used (basically generator\/continuation) wreaks havoc on the CFG with just straightforward static analysis.\u00a0 If you want the yields, pass &#8220;yields&#8221; on the command line.\u00a0 Also, passing &#8220;debug&#8221; gets you the states of the registers at the entrance\/exit of the &#8220;basic blocks&#8221;.\u00a0 (Not technically basic blocks, but close enough for my purposes.)<\/p>\n","protected":false},"excerpt":{"rendered":"<p>While doing some work on the gloda-based search targeted for beta 2, I came upon some query slowness and decided to look into it.\u00a0 Because gloda queries are (conceptually, at least) dynamic, we also generate dynamic SQL.\u00a0 Our schema is &hellip; <a href=\"https:\/\/www.visophyte.org\/blog\/2009\/02\/04\/visualization-of-control-flowdata-flow-analysis-of-sqlite-explain-ations\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"inline_featured_image":false,"footnotes":""},"categories":[7,8,4],"tags":[25,24],"class_list":["post-161","post","type-post","status-publish","format-standard","hentry","category-debugging","category-program-execution","category-visualizing","tag-graphviz","tag-sqlite"],"_links":{"self":[{"href":"https:\/\/www.visophyte.org\/blog\/wp-json\/wp\/v2\/posts\/161","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.visophyte.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.visophyte.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.visophyte.org\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.visophyte.org\/blog\/wp-json\/wp\/v2\/comments?post=161"}],"version-history":[{"count":9,"href":"https:\/\/www.visophyte.org\/blog\/wp-json\/wp\/v2\/posts\/161\/revisions"}],"predecessor-version":[{"id":172,"href":"https:\/\/www.visophyte.org\/blog\/wp-json\/wp\/v2\/posts\/161\/revisions\/172"}],"wp:attachment":[{"href":"https:\/\/www.visophyte.org\/blog\/wp-json\/wp\/v2\/media?parent=161"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.visophyte.org\/blog\/wp-json\/wp\/v2\/categories?post=161"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.visophyte.org\/blog\/wp-json\/wp\/v2\/tags?post=161"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}