This document is a personal history of Legion: the story of the project, told through the lens of the papers we wrote about it, and the journey that brought us to each one.
In research, we rarely talk about the struggle involved in doing research. We especially don’t like to talk about rejection. I think this does the community a disservice: it glosses over the challenges we face in getting to the results that eventually do get published, and it minimizes the (mundane but necessary) engineering work that makes the research possible in the first place, making the whole process seem simple and almost boring. The real story is (you guessed it) anything but.
Almost every paper we published along the way was rejected at least once. A majority were rejected multiple times. And some were rejected so many times that they eventually landed back at the same conferences that had initially rejected them. Contrary to what the cynics might say, these rejections made the papers stronger and clearer, and in some cases helped us sharpen our understanding of our own contributions. But of course that doesn’t mean that receiving those rejections was easy.
Legion is both a research project and also a living piece of software. So, while I focus here on papers about Legion, I could have equally written about the software itself. Sometimes, those paths lined up, and sometimes they diverged quite dramatically. For better or worse, not everything that is essential to maintaining a long-term software project leads to a publishable research article—something that I hope to remedy someday by writing a software history of Legion. But this is not that post.
Because this is a personal history, I am going to focus on the papers I had the most direct involvement with. This story therefore doesn’t start quite at the beginning, because my involvement began about 3–6 months after the project had started. And there are many important Legion papers that I won’t touch on due to my relative lack of involvement. There are certainly more stories that could be told, but others will have to tell them.
If you found your way here without knowing anything about Legion: Legion is a distributed, heterogeneous programming system for high-performance computing (HPC). The goal of Legion is to make it easy to program clusters of CPUs and GPUs (or, in the future, whatever hardware HPC ends up running on). Legion’s core value proposition is that it is implicitly parallel; that is, users do not have to worry about the tedious, error-prone, low-level details of how their programs run on the hardware. Among systems with comparable goals, Legion is distinguished by an exceptionally flexible data model, which is both our key advantage and the biggest challenge in making the system work and scale well.
I arrived at Stanford in 2011 for my Ph.D. program. First year students go through rotations, or temporary assignments with different professors, to get a feeling for research with various advisors. My second rotation was with Alex Aiken.
At the time Legion had just gotten off the ground, literally 3–6 months prior to my joining. I was entering almost (but not quite) on the ground floor. While the project itself was cool, what immediately sold me on it was the team. The two students running the project, Mike Bauer and Sean Treichler, were clearly driving the research and knew what they were doing. I was hooked.
My assignment in the beginning was nominally scaling. (Legion is a distributed programming system for HPC, so scaling means running a program on more nodes.) In retrospect, this was an utterly laughable goal. The software I had access to didn’t run. Sometimes it didn’t even build. The applications for our experiments hadn’t even been written yet, and when I started to write one, the software immediately fell over.
And so my role evolved into that of a glorified tester. I would write code, build it, and one of two things would happen: either it wouldn’t build, or it would build but then crash. I would turn to Mike Bauer, who (in a stroke of blessed administrative competence) was sitting next to me, and report the bug. Fifteen minutes later he’d push a fix that would unblock me, and I’d go back to coding until I hit the next issue. As I recall, in those early days I hit about one bug an hour, so somewhere between half a dozen and a dozen bugs a day.
Overall, I think this approach worked well. My being on the team allowed Mike to avoid thinking about testing at all in this early phase of the project, and just build. Because I was testing the code on a continuous basis, I usually found bugs within an hour of them being introduced. I was also kicking the tires of the API to make sure it was possible to write the necessary application code, which was also not always the case. In any event, splitting the responsibilities allowed each of us to iterate on our respective parts as fast as possible, and mostly without blocking each other.
We set a goal of sending the paper to SC 2012. Technically, this was the second submission for this paper, but both the software and the paper itself evolved so aggressively that it had little in common with the first submission. Alex Aiken, our advisor, took responsibility for the text of the paper, while the three students (Mike, Sean, and I) continued working full-time on the code, which was not ready until right before the deadline. As would become a trend for us, we finished our experiments at the last minute and squeaked in under the deadline.
As it turns out, we kept most of our papers under version control in a repository which has since been released to the public. Therefore, if you want to read any of our old papers (including submissions that were rejected), you can navigate to the corresponding source directory and type make.
Submission history:
My rotation ended and I went off to work with another professor for three months. While I would come back to work on Legion full-time after this, those three months were apparently a prolific time for the team because they managed to get off two paper submissions without me. Oof! These papers described a type system for Legion (focusing on data partitioning), and the low-level portability runtime Realm. Because I was away, I wasn’t really involved in these papers, so I won’t cover them in detail—but both went through a pretty long rejection cycle before finally finding a home.
Partitioning submission history:
Realm submission history:
To no one’s surprise whatsoever, writing software under intense time pressure does not lead to high quality production software. And so the history of Legion is one of qualitative, often step-function improvements to the software, through repeated rewrites of the code. In our defense, this wasn’t just because of shoddy engineering (though the first couple of versions of Legion were definitely rushed). As it turned out, we dramatically underestimated the complexity of the original problem we were trying to solve, and as we peeled away the layers, we kept finding more and more complexity beneath. Realistically, if Legion had been a commercial project, this surely would have killed it after the first couple of rewrites. But since we were academics, each new twist in the road provided us an excuse to publish—which laid the foundations of several Ph.D. careers.
Structure slicing was the third complete rewrite of the Legion runtime, and the first that changed the conceptual model of the problem we were trying to solve. (Two previous rewrites had focused on paying down technical debt and changing the runtime’s internal model of execution, but neither was user-facing.) The key observation that led us to structure slicing is that the user’s data frequently consists of objects with multiple fields. In any given task (or function), you might not need all of those fields. This might not make any significant difference in a classical system, but on distributed machines, you can save a huge amount of data movement (and thus greatly improve performance) by moving only the necessary fields. But doing so requires that Legion reason about those fields, which essentially adds an entire new dimension of complexity to the runtime.
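To make the field-level idea concrete, here is a toy back-of-the-envelope sketch in plain Python. The field names and sizes are invented for illustration, and this looks nothing like Legion’s actual interface; the point is just the arithmetic of moving a subset of fields.

```python
# Toy model: each element has several fields, but a given task only
# touches a subset of them. Shipping just those fields cuts data
# movement proportionally. (Illustration only -- not Legion's API.)

def bytes_moved(num_elements, field_sizes, fields):
    """Bytes transferred if we move only `fields` of each element."""
    return num_elements * sum(field_sizes[f] for f in fields)

# A hypothetical particle with a position, a velocity, and a large
# history buffer that most tasks never read.
field_sizes = {"position": 24, "velocity": 24, "history": 512}

n = 1_000_000
all_fields = bytes_moved(n, field_sizes, field_sizes.keys())
needed = bytes_moved(n, field_sizes, ["position"])  # task reads positions only

print(all_fields // needed)  # prints 23: whole objects cost ~23x more here
```

The savings scale with however much of each object a task can ignore, which for real simulation codes with dozens of fields can be substantial.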
My role in this paper was essentially the same as the last one: I prepared application codes for deployment with the updated runtime interfaces, and tested them aggressively to work out the bugs (of which there were many). But I was getting a bit restless and wanted to strike out on my own, which I’ll cover next.
This paper was also rejected once prior to acceptance, in what would become an ongoing trend for our research group. To this day, no core Legion technology paper has ever been accepted for publication at PLDI, the conference we almost always submitted to first. Instead (and with no small irony), the first Legion paper ever accepted to PLDI was about a library running on top of Legion—a great paper, but one that would never have been possible at all if it hadn’t been for all the (rejected) papers that came before it.
In retrospect, it was for the best that this paper got rejected. In the four months between the first and second submissions, we got one of our codes to run on over 1024 nodes (and then, in the next six months, to the entirety of Titan, then the #1 supercomputer in the world). This was truly a dramatic improvement over our first paper, where our highest achieved node count was only 16.
But there was a catch: we were only able to do this for one application. And that application, as impressive as it was, had to be hand-written to use new features introduced into Legion specifically for the purpose of improving scalability. This code was notoriously difficult to write because it punched holes in the nice, implicitly parallel Legion abstraction that we’d worked so hard to promote. And so while technically this was the first paper to present truly scalable Legion results, fixing the leaky abstraction would take us another three years.
Submission history:
Legion is sort of an odd project. Technically, the software is written in C++. You can (and we did) use it from C++. In the early years of the project, this was how every Legion application was written. And yet, being the programming language geeks we were, we designed Legion to have a type system, and users were expected to follow its rules. By 2013 it was pretty obvious that writing proper Legion code in C++ was hard, and that we needed to develop an actual language around the project.
At the tail end of my second year as a Ph.D. student, I began to work on the language that would eventually become Regent. According to our Git repository history, the very first version of Regent was written in Haskell. To be honest, that version was so short-lived that I forgot (until I went to look) that it ever existed. The Python rewrite that replaced it lived for about 12 months, until just 10 months before the final paper submission. But the version that would eventually be published was written in a new language also developed at Stanford, called Terra.
There were three compelling features that drove me to adopt Terra:
As you might expect, when you develop a compiler for the first time, it’s easy to generate really slow code. But the slowdown of the early versions of the Regent compiler was truly astonishing. My first application, developed during a summer internship at Los Alamos National Lab (LANL) in 2013, was a port of Pennant, itself a miniaturized version of a production hydro-shock simulation code used at the lab.
The initial version of Pennant in Regent was slower than the original code by a whopping factor of 10,000. Note: that wasn’t the distributed version of Pennant. It wasn’t even the multi-core version. It was the sequential version. My compiler generated code 10,000 times slower than some other code running on a single core.
Needless to say, this was so utterly unacceptable that it became my full-time job for the next six months to make it better. If I recall correctly, I improved performance by a factor of 100× in the first two weeks, leaving the code a mere 100× slower than the single-core original. These were mostly stupid bugs, the sort that occur naturally in newborn compilers, but which are pretty simple to fix. I recall I got a factor of 5× just by turning on optimizations in Legion, for example: literally the difference between passing -O0 and -O2 to the underlying C++ compiler. But the next 10× was more work, and the next 10× was harder still. By December of 2013, my code was only 50% slower than the sequential reference code, but I had no idea where that last 50% was hiding. I would eventually discover that I had missed a key cache-blocking optimization in the original code. In my initial reading I had (correctly) determined that it made no difference to the output of the program, and thus elided it; but it turned out to be essential to the computational intensity of the algorithm.
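For readers who haven’t run into it, cache blocking (also called loop tiling) restructures a loop nest so that each small block of data is reused while it is still resident in cache. Here is a generic sketch of the idea in Python (the actual Pennant fix lived in the application’s own loops and differed in the details):

```python
def transpose_blocked(a, n, block=64):
    """Transpose an n x n matrix (list of lists) one tile at a time.

    Each (block x block) tile is read and written while it is still hot
    in cache. The result is identical to a naive transpose; only the
    traversal order changes, which is what improves locality.
    """
    out = [[0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for jj in range(0, n, block):
            # Sweep one tile completely before moving to the next.
            for i in range(ii, min(ii + block, n)):
                for j in range(jj, min(jj + block, n)):
                    out[j][i] = a[i][j]
    return out

n = 100
a = [[i * n + j for j in range(n)] for i in range(n)]
b = transpose_blocked(a, n, block=16)
assert all(b[j][i] == a[i][j] for i in range(n) for j in range(n))
```

In Python the effect is invisible, but in a compiled hydro code the same reordering can be the difference between streaming data from DRAM on every pass and keeping the working set in cache.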
Meanwhile, there was another, even bigger issue. In a meeting in February 2015, after I described everything I had done up to that point and stated my goal of submitting to SC 2015 (with a deadline a mere two months away), my advisor informed me that nothing I had done so far was worthy of a paper submission. There simply wasn’t enough novelty. Building a new language wasn’t enough. Building a new compiler (in three different iterations, no less) wasn’t enough. I had to come up with something else.
Before you knock my advisor, note that the type system had already been published in a substantial (if not final) form, and that took quite a bit away from the novelty of what I was doing. While there were nontrivial aspects of the implementation, he was ultimately correct: what I had done up to that point wasn’t enough.
Though true, the diagnosis was also crushing. I was in my fourth year of my Ph.D. program—the time when, as Alex put it, “You wouldn’t want to get to the end of the year without having some sort of paper.” I couldn’t afford to throw everything away again.
In a state of semi-panic, I went back to my office after that meeting and immediately brainstormed every idea I could possibly think of for turning this into research-worthy work. In that process I sketched out notes for a series of optimizations, to be implemented in the compiler, that would automatically improve program performance—basically doing things that every Legion C++ programmer was forced to do, but that we really didn’t want Regent users to have to worry about.
In case you can’t read it, the document starts with the line: “I’ve been looking for things to make a paper more interesting.” Let me translate that for you: “Everything I’ve done up to this point is absolutely worthless and I’m throwing everything at the wall in an attempt to rescue my Ph.D. career.”
(You can insert the most appropriate swear word in your native language here.)
I have kept that set of notes ever since, as a reminder to myself that no matter how bleak the outlook is, you can always find ways to turn things around.
It would be easy to pretend that the next couple of months leading up to the paper submission were easy, since I was mostly just following the plan I’d laid out in February. They were not. They were a slog, a death march to the finish line. We were constantly behind schedule, and the benchmarks we found for our experiments all turned out to be much more work than we’d expected.
Two weeks before the deadline, only one benchmark could plausibly be considered to be in reasonable shape, and we needed three for the paper. (Three is apparently a magic number determined by the gods of academic publishing as the acceptable minimum number of benchmarks for any programming language.) I had only just diagnosed the root cause of the performance problems in Pennant, and the optimizations required were well beyond what I’d be able to pull off with the compiler. The last benchmark was simply missing—we hadn’t started writing it yet.
In those last two weeks there was literally more work than one person could do, and I wouldn’t have finished it at all if I hadn’t had help. I worked furiously to revise Pennant to add cache blocking. Wonchan Lee, a new Ph.D. student in the group, wrote the last benchmark completely from scratch, and also wrote a vectorizer (another optimization) for the compiler. Really, an entire vectorizer. Also from scratch. Wonchan is awesome.
And that still wasn’t enough. One week before the deadline, we had three benchmarks working correctly and performing well on a single node. But we had no multi-node capability whatsoever. And I had already written the paper (it couldn’t wait), under the assumption that we would get at least basic multi-node performance working. It turned out the Legion runtime itself also had bugs. And these were bugs that neither Wonchan nor I could fix: Mike Bauer was the maintainer of the Legion runtime and the only person in a position to fix any of this.
In the last week, things went down like this. On Saturday, we had zero benchmarks running on multiple nodes. Each day after that, we got one application working. On Tuesday, we had three out of three benchmarks working on multiple nodes… and zero out of three of those actually running efficiently. On Wednesday, we fixed one; Thursday, another. On Friday, Mike threw in the towel and said he wouldn’t be fixing any more bugs for us.
We ended up submitting what I thought was a weak paper. I’m pretty sure everyone in the group felt that way. When the reviews came in, everyone thought the paper would be rejected, and it felt like a chore to even write a rebuttal.
Somehow the paper got accepted. On its first submission. When every other Legion paper up to that point had been rejected at least once.
I was so overwhelmed when I got the acceptance notification that I spent an hour walking laps around NVIDIA HQ, where I was doing an internship at the time, to calm down. This was, in a very direct sense, the first concrete evidence that I could actually build a career as a researcher. Until you get there, you never really know whether you’ve got it in you or not.
Holy shit, I’d actually made it.
Yet again, it would be easy for the story to end here. Paper accepted. Yay, happy ending, right?
Ha, ha, ha, ha. Oh, boy. I wish.
It turns out that even when papers do get accepted, things can still go wrong.
Exactly one week before the camera ready deadline, I discovered that I had forgotten to validate one of the applications. (Camera ready is when the paper is supposed to be done. Like, done done. The paper is already accepted. It’s been reviewed and the reviewers all agreed that it is publication-worthy. While you can, in principle, make any changes you want prior to this final deadline, losing one entire experiment (out of a set of three) is not an acceptable change.)
I figured that it ought to be ok as long as I validated the application and confirmed the results.
Validating the application failed. The code produced the wrong answer.
At this point, all our performance results for the application got thrown out the window. If it didn’t compute the correct answer, it was meaningless how fast it ran.
In a flurry of debugging to discover the root cause of the issue, we found a fundamental flaw in the Legion runtime itself that caused it to execute the application code incorrectly. Regent was doing the right thing, but Legion was not. This flaw ran deep enough that it was, for all intents and purposes, impossible to fix. We were hosed.
Now, as it just so happened, Mike Bauer had been working for nearly six months at this point on an entirely new version of Legion—yet another complete rewrite. His new version promised to fix the issue we were seeing at its root. Mike had originally planned to have this new code ready in about a month, but he agreed to shoot for having a preliminary version of it ready for us to try by the end of the week.
So we bet all our hopes on getting this rewrite finished, and Mike worked furiously to get it done. But as we counted down the days, we had to face the real possibility that it wouldn’t arrive in time. And then what?
Obviously, we couldn’t publish the old numbers. That left two options. We had a version of the code that gave the correct answer, but was slow. We could use that in a new experiment, and just publish those numbers, such as they were. Or we could use the nuclear option: mail the conference organizers, apologize, and withdraw the paper from the conference.
A couple of days before the deadline, we decided not to use the nuclear option, despite still not having anything. The heart of the paper’s contributions did not actually hinge on these results, even though losing them would weaken the paper. I prepared a version of the paper worded carefully to be compatible with either outcome. Then, if we got the results at the last minute, I’d be able to add them, and if not, I could just stick with the minimal results we had.
With less than 24 hours remaining, Mike gave the new Legion his stamp of approval and had me start testing the code. It didn’t work. For several hours, I sat with him, debugging Legion’s internal runtime logic to figure out what was going wrong.
We started debugging at 9pm. At 3am, we found the root cause. Mike and I looked at it, agreed that we were sure this really was the root cause, but that it wasn’t going to get fixed in the next 2 hours. The paper deadline was 5am.
On a whim, 90 minutes prior to the paper deadline, I went back to the last working version of the application (and the old Legion) and tried a Hail Mary. The application (in the fixed version that passed validation) was too slow because of data movement. That data movement was causing the code not to scale. So I pinned the entire working set of the application in physical memory so that the NIC would have direct access to all of it. This was massive overkill, because the vast majority of the data was private and would never need to be communicated at all. But it sidestepped the particular corner case that was causing the old version of Legion to fail.
It worked; validation succeeded. I ran all the performance experiments, went back to the paper, hurried to update the figures and double check for inconsistent text, and uploaded the final version of the paper at 4:45am, 15 minutes before the deadline.
Submission history:
In the history of Legion, there have been a handful of cases where exploring the deficiencies of Legion has led to the discovery of a true gem. Dependent partitioning is one of those cases; the other is control replication (which I’ll cover below).
In distributed computing, it’s common to talk about partitioning data: basically, the process of splitting the data up so that it can be distributed among different memories. Legion is unusual among systems of its kind because it allows extremely general partitioning of data. Whereas it might be common for systems to let you chunk up data in various ways, those ways are typically limited to a fixed set of operators (like block or block-cyclic distributions) that are known to be efficient. Legion, on the other hand, allows you to pick any partitioning whatsoever, including ones where the chosen subsets of elements are noncontiguous, or even overlapping (i.e., a single element appears in multiple subsets).
This flexible partitioning was a key ingredient of our secret sauce, but it also meant that the process of partitioning was cumbersome and error prone: users had to go and manually select the correct subsets of elements, and if they messed this up, their entire program would run incorrectly.
The deficiency in our model was obvious, but the solution space very much was not, at least to me. In the early days of us formulating a research plan to address this problem, I remember Sean offering to let me take a crack at it, and I frankly just didn’t even know where to start. As I watched him work, it became clear that he was in a completely different head space from me, despite my having been on the project several years by this point.
Sean’s key insight—which drove the rest of the design—was to distinguish between independent and dependent partitions. Independent partitions are ones that are (usually) easily computable in a straightforward way: e.g., slicing data into N equal pieces, or coloring (i.e., assigning a number to) each element and collecting subsets with each respective color. Dependent partitions are ones that are computed from other partitions. The brilliance of Sean’s scheme is that in many applications, the structure of the data itself tells you what dependent partitions you need. All the tedious, error-prone partitioning code goes away and you’re left with something simple, obvious, and (perhaps surprisingly) performant.
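To give a flavor of what “computed from the structure of the data” means, here is a tiny Python model of one dependent operator: the image of a partition through a pointer field. The mesh, the names, and the dictionary encoding are all invented for illustration; Legion’s actual operators work on regions and fields, not Python dicts.

```python
# Hypothetical mini-model of dependent partitioning (not Legion's API).
# Cells point at the vertices they touch. We partition the cells
# independently, then *derive* the vertex partition from the pointer
# structure itself, rather than hand-selecting vertex subsets.

cell_vertices = {         # cell -> set of vertices it touches
    0: {0, 1}, 1: {1, 2}, 2: {2, 3}, 3: {3, 4},
}

# Independent partition: slice the cells into two equal pieces.
cell_part = {0: {0, 1}, 1: {2, 3}}

# Dependent partition: the image of each piece through the pointer field.
vertex_part = {
    piece: set().union(*(cell_vertices[c] for c in cells))
    for piece, cells in cell_part.items()
}

print(vertex_part)  # {0: {0, 1, 2}, 1: {2, 3, 4}}
# Vertex 2 lands in both pieces: the derived subsets overlap exactly
# where the mesh is shared, with no hand-written ghost-region code.
```

The mesh connectivity alone determined which vertices each piece needs, which is the heart of why the tedious manual partitioning code can go away.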
It would have been enough for a paper to define these new dependent partitioning operators. But Sean didn’t stop there. He created a small sublanguage for these operators and developed a type system that could prove deep, structural properties about the resulting partitions. Best of all, the fragment of logic he used was decidable, giving us a sound and complete solution—something very rare in type systems that powerful.
And of course he actually implemented the operators, showing that the new partitioning technique was not just more powerful, but also more efficient, and could even be parallelized and distributed automatically.
The paper was frankly a tour de force. And yet despite that, it went through the same rejection cycle as every other Legion paper.
Submission history:
Remember how I said exactly one of our applications was scaling? That was still true at this point, and it was beginning to stick out like a sore thumb. The kind of code that you needed to write in Legion to actually scale was really onerous, and it was obvious to everyone involved that we needed a solution.
Newly freed from my SC 2015 submission (and not yet burdened by my discovery of the validation failure), I set about fixing this. If I recall correctly, several of us came up with more-or-less the shape of the solution independently. It might not have been obvious to anyone outside our group, but to those of us who’d had our noses shoved in it for long enough, it was blatantly obvious that we needed the compiler (or the runtime) to automatically transform the code in the same way we had manually transformed our one application that actually scaled.
I spent a month or so hand-optimizing another application to make absolutely sure I knew what was involved. At that point, I launched into the design of the optimization itself.
In principle, the optimization isn’t that tricky. What you need to do is take a set of loops like:
for t = 0, T_final do
  for i = 0, N do
    some_task(...)
  end
  for i = 0, N do
    some_other_task(...)
  end
end
And transform it into code like:
for shard = 0, num_shards do
  for t = 0, T_final do
    for i = N_start[shard], N_final[shard] do
      some_task(...)
    end
    -- communicate if necessary
    for i = N_start[shard], N_final[shard] do
      some_other_task(...)
    end
    -- communicate if necessary
  end
end
That is, we’re basically slicing the inner loops into “shards” and establishing a new shard loop along the outside. At that point, we split each iteration of the outer loop into its own task, and voila! the code now scales magically.
Swapping the loops around makes this look easy, but it’s worth noting what we’re really doing here. Legion’s entire premise is that it provides implicit parallelism: users don’t have to think about where the data goes or how it is communicated between nodes. But the second code sample above is explicitly parallel: the shards are now long-running tasks that explicitly communicate data between them. So when I said above that the process of manually writing one of these codes is onerous, that was an understatement, because it sacrifices one of Legion’s most fundamental tenets in the name of performance.
Control replication is quite literally the optimization that saved Legion. If we hadn’t made it work, we would have lost the essence of the project.
Getting back to the implementation, the loop transformation is the easy part. The hard part is figuring out what data needs to go where.
This turns out to be one of the fundamentally intractable problems of computer science. The problem is taking a data structure like a mesh, say this one:
And figuring out which of these various solid or hashed regions are required at each point in the computation. Except we need to do this in the compiler, which doesn’t know: (a) how many pieces of the mesh there are, or (b) the Venn-diagram overlaps of the various pieces. All we see are symbolic references to entire subsets of the mesh, like zones_private[i], which refers to the ith piece of the mesh shown in the upper left corner.
As it turns out, this problem is tractable to solve in Regent because the language has first-class constructs that represent exactly the pieces of the data that we need. (If you read my notes from earlier closely enough, you’ll note that the same basic insight drove certain language changes that enabled other optimizations in the SC 2015 paper. But I hadn’t fully internalized what it meant at the time.)
Even then, I had to go design an entire new intermediate representation (what the compiler uses to represent code) and a compiler analysis for region data movement in programs. I was fortunate to do an internship with NVIDIA Research in the summer of 2015, and I convinced them to let me spend it building the optimization. That meant I got to sit near Mike again while designing the IR, a key benefit since it turns out that many of the algorithms are similar between the compiler’s IR construction and the runtime’s dependence analysis. Work continued into the fall after I returned to Stanford. To my surprise, once I got all the pieces into place, everything basically worked.
By this point you may have noticed a pattern: nothing ever really goes to plan, even when it’s going “well.” That was certainly the case for this paper.
We planned to submit to SC 2016. Around December of 2015, I remember sitting at a lunch with my advisor and some guests where we talked about the work we were doing, and Alex made a comment to the effect of (paraphrased): “Well, we’d have to really mess up the submission to not get accepted.”
You see, an early version of the optimization was already working at this point, and the preliminary results were amazing. In my entire career, I’d never seen results so good. The optimization had rough edges, corner cases it couldn’t handle, but when it worked, the performance was astonishing: nearly ideal weak scaling out to 512 nodes, where our previous (unoptimized) results were falling over at about 16 nodes.
Of course those pesky corner cases ended up being a royal pain to solve.
Two months before the deadline, I ramped up my effort to 10 hours a day, 5 days a week, to be sure I got my experiments done in time. One month before the deadline, I increased that to 12 hours a day, 5 days a week. Two weeks before the deadline, I started working 16 hour days, 6 days a week. In the last week, I just worked 16 hours every day.
In the end I was so wrung out that despite the fact that the empirical results were literally perfect, so perfect you couldn’t ask for anything better, I still couldn’t write the paper. I tried anyway, but it was one of the most frustrating writing experiences I’ve ever had, and when I finally passed the first draft to my advisor, he described it as “incomprehensible.” Hint: when your own advisor can’t make heads or tails of what you’ve written, you’re in deep trouble. We did what we could but there was so little time before the deadline that there just wasn’t much we could do.
I submitted anyway, but I remember it being one of the most palpable sensations of professional failure I’ve ever experienced, even worse than the “oh shit” moments with the 2015 paper. I was almost worried the paper would be accepted in its current form, because the empirical results were so outstanding, and then I’d have this incomprehensible paper on my record that I wouldn’t be able to properly fix before its camera ready deadline. (If that seems implausible, remember that the SC 2015 paper was in fact accepted on its first submission.)
But in the end I didn’t need to worry about that, and the paper would be rejected three times before finally finding a home.
In retrospect, the problem with the early versions of this paper was that I wrote the paper in exactly the way I had written the compiler. I had created this entirely new IR which I wanted to be the centerpiece of the paper. The problem was that no one understood it, or its implications, or why any of it was even necessary in the first place. In addition to that, I hadn’t fully internalized some of the realizations I mentioned above about why this problem was even possible to solve at all, and so the introduction simply failed to communicate what we’d done.
Lesson learned: if you can’t communicate your results, it doesn’t matter how spectacular they are.
Over the course of the next three submissions, I progressively refined the narrative of the paper until I finally threw in the towel with the last submission and just ditched the novel IR altogether. I discovered that the optimization could be performed with mostly-standard compiler technologies, so that’s how I wrote the paper, even though it bears very little resemblance to how any of it is actually implemented.
Submission history:
The entire point of the control replication paper was to improve performance. But it turned out that control replication only really improved performance along one axis: scale-out performance. Control replication allows you to take an application and keep scaling it indefinitely, as long as the problem size increases proportionally to fill the machine. In HPC we call this weak scaling. If instead you try to do strong scaling, where the problem size is fixed, then as you scale out you have less work to do on each node. Less work takes less time, and pretty soon, if your system has any overheads at all, those start to be really noticeable.
The problem was that Legion had some pretty high overheads.
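The strong-scaling squeeze is easy to see with a toy model: fix the total work, divide it across nodes, and add a constant per-node overhead. All the numbers below are made up for illustration; nothing here is a measurement of Legion.

```python
# Toy strong-scaling model (illustrative numbers only, not measurements):
# fixed total work divided across n nodes, plus a constant runtime
# overhead paid on every node regardless of how little work it has.

def strong_scaling_efficiency(total_work_s, overhead_s, nodes):
    ideal = total_work_s / nodes                # perfect speedup
    actual = total_work_s / nodes + overhead_s  # overhead doesn't shrink
    return ideal / actual

for n in [1, 16, 256]:
    print(n, round(strong_scaling_efficiency(10.0, 0.01, n), 3))
# On 1 node the overhead is invisible; by 256 nodes the fixed overhead
# has become a large fraction of the shrinking per-node runtime.
```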
At this point, Mike, Sean and I all had our hands full in our respective corners of the project, and none of us had the bandwidth to fix this. Wonchan, who had by this point contributed not just vectorization, but also a GPU code generator to Regent, took on the work to address overheads in Legion itself.
The basic idea is straightforward: many applications end up doing the same thing over and over again. In this case, it’s pretty pointless to keep doing work when we know the outcome will be the same. So we can save a lot of work by capturing a trace of what we did the first time around, and then replaying it.
This idea is as old as dirt in computer science, but again, what makes it tricky is applying it to a distributed computing setting where you have to reason about different pieces of data flowing around the system (remember the mesh diagram I showed earlier).
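The capture-and-replay idea can be sketched in a few lines. This is a hypothetical toy, not Legion’s actual tracing interface: the first iteration pays for the full analysis and records the result, and later iterations replay the recording directly.

```python
# Minimal sketch of trace capture and replay (hypothetical API, not
# Legion's actual tracing interface). The first iteration pays for the
# full dependence analysis and records the result; later iterations
# replay the recording without re-analyzing anything.

class Tracer:
    def __init__(self):
        self.trace = None

    def run_iteration(self, ops, analyze):
        if self.trace is None:
            # First time through: analyze each operation (expensive).
            self.trace = [analyze(op) for op in ops]
        for compiled in self.trace:  # Replay: no re-analysis needed.
            compiled()

executed = []
tracer = Tracer()
for _ in range(3):
    tracer.run_iteration(["copy", "task"],
                         analyze=lambda op: (lambda: executed.append(op)))
print(executed)  # the same two ops, replayed three times
```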
Following our classic Legion pattern, we submitted to PLDI first, got rejected, then sent the paper to SC.
Submission history:
Automatic parallelism is one of those ideas that’s been beaten to death in HPC. For various reasons regarding certain application codes we were committed to supporting, we decided to build it anyway. This was work again done by Wonchan, but this time on top of Regent instead of Legion.
The problem with this paper was therefore in figuring out if we’d actually done anything novel. The work was done—there was no question we were going to support it. But was there anything interesting about the way we’d done it?
As it turned out, the answer was yes. And the key idea goes back to what I was saying before about realizing the full implications of partitioning. Strictly speaking, partitioning is about dividing up the data. But we realized that you can also talk about partitioning the space of loop iterations. And when you have a piece of code like, say:
for elt in mesh do
  mesh[elt].x += mesh[mesh[elt].neighbor].y
end
We can determine, by analyzing this code, that we need a copy of each mesh element’s neighbor to perform this calculation. This is, in essence, a constraint on the valid partitionings of the mesh that would be safe to use in this program. By solving these constraints, we arrive at a set of valid partitionings. And dependent partitioning provides exactly the operators we need to do that—something that is often challenging in competing approaches.
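As a toy illustration of the “image” idea at work on the loop above (the data layout and function names here are simplified for illustration, not Legion’s actual interface):

```python
# Toy illustration of dependent partitioning's "image" operator: given a
# piece of the element partition and the neighbor field, the elements
# that piece must be able to read are the image of the piece under the
# neighbor map. (Simplified layout, not Legion's actual interface.)

def image(piece, neighbor):
    """Elements reachable from `piece` through the neighbor field."""
    return {neighbor[elt] for elt in piece}

neighbor = {0: 1, 1: 2, 2: 3, 3: 0}  # each element's neighbor
partition = [{0, 1}, {2, 3}]         # two pieces of the mesh

# Each piece needs read access to the image of its elements:
needed = [image(piece, neighbor) for piece in partition]
print(needed)  # [{1, 2}, {0, 3}]
```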
Wonchan really nailed this paper and it sailed through on its first submission. I was especially pleased to see the paper get nominated for best paper, though unfortunately we didn’t win the award.
Submission history:
This paper makes me unreasonably happy.
It is a paper that does everything “wrong.” During my Ph.D. training, I was taught to not submit to workshops. If anything was worth doing, it was worth doing well. And if it was worth doing well, you might as well try to get full credit for doing it, and submit to top-tier conferences. That was an approach I took to every other paper submission, sometimes insisting on continuing to submit to first-tier conferences long after my other collaborators had begun to waffle. But this paper I never even attempted to submit to a conference.
The paper concerns work that I honestly thought would never be published. At its most basic, Pygion is “just” a set of Python bindings over Legion. It was something I built because users needed it, not because I thought it was research-worthy at the time.
And unlike every other Legion paper ever written, the paper makes no pretense to generality whatsoever. Research venues are more interested in publishing results that are widely applicable—the more general, the better. But as a team creating a new programming system from scratch, this created a tension: the more we took full advantage of Legion’s unique features, the higher the risk that reviewers might complain that our techniques wouldn’t apply to any other system. This ended up being one of the more common reasons our papers got rejected. And so as a rule we tried to generalize our results as much as possible—a process that resulted in true gems like the “Dependent Partitioning” paper, but a process that also sometimes ironically diverted energy from actually pushing the results themselves forward.
This is the first paper where I got to say exactly what I wanted, exactly how I wanted to, without having to pander to the sensibilities of top-tier conference reviewers.
The paper is pretty simple. I had written a language called Regent, published in SC 2015. Regent does a lot of great stuff because it has a static type system that directly embodies the Legion programming model—and thus the compiler can figure out all the optimizations required to get excellent performance out of the system. The idea here was to see how far we could get with Python. Would it be possible at all? Or would performance suffer so much that it wouldn’t provide a meaningful replacement?
To my surprise, the answer was that it works and actually gets excellent performance (with some qualifiers). The syntax is a bit hacky, but through the magic of dynamic typing, and just enough runtime analysis of the program, we can recover most of the important optimizations that Regent would normally do. There are obviously limits to this approach—there is no whole program analysis, so truly global optimizations would be impossible. But we got far enough to take the classic suite of Regent benchmarks, rewrite the core logic in Python, and achieve the same weak scaling performance and scalability as the original codes.
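To give a flavor of the dynamic-typing trick (an illustrative sketch, emphatically not Pygion’s actual API): a task wrapper can inspect its arguments at launch time to discover which regions a call touches, information that Regent’s type system provides statically.

```python
# Illustrative sketch, NOT Pygion's actual API: a task decorator that
# inspects its arguments at launch time to discover which regions a
# call touches. Regent learns the same facts statically from its type
# system; dynamic typing lets Python recover them at runtime instead.

class Region:
    def __init__(self, name):
        self.name = name

def task(fn):
    def launch(*args):
        # Runtime analysis: which regions does this launch touch?
        launch.last_touched = [a.name for a in args if isinstance(a, Region)]
        # A real runtime would feed this into dependence analysis.
        return fn(*args)
    return launch

@task
def step(mesh):
    return mesh

step(Region("mesh"))
print(step.last_touched)  # ['mesh']
```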
Submission history:
This paper is very different from the others on this list. Almost everything else here is a technology paper, describing some fundamental innovation in some part of the system: either the language, or compiler, or runtime, etc. Task Bench was the first time I’d worked on a benchmark code.
The idea was suggested to me by way of our colleague George Bosilca. George had had a student develop a benchmark which he’d used to compare some different systems. The paper had been rejected, and George had mentioned to Alex that he thought it would be worth working on an improved version of this benchmark. Alex came to me and suggested that if I was interested, he could hire some summer students to work on the implementations with me. Basically he reasoned that if we wanted to target N systems with M different application patterns, we’d have the students write N×M codes.
Initially I was skeptical. The benchmark didn’t seem that interesting, and I wasn’t sure what we would learn (or whether it would be publishable).
There were two things that convinced me to go for it.
First, the issue of overheads in Legion was still nagging at me. I was facing internal resistance from some members of the team who said that the overheads were low enough, or at least acceptable given what Legion was doing. But were they really? Part of the problem here was that there was no way to compare, so I couldn’t go back and prove to anyone that the problem was real.
Second, as I thought about the problem, I realized that there was a way to finish this project in N+M effort instead of N×M. Basically, by building a framework to describe what was being executed, we could write in each system a code that would just run the pattern as efficiently as possible. If we did our job right, we’d write N programs, and then code up M patterns, but we wouldn’t need to write the full N×M by hand.
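A sketch of the N+M decomposition (names and structure here are hypothetical, not Task Bench’s actual code): each of the M patterns is written once as data, and each of the N systems implements one generic runner that can execute any pattern description.

```python
# Sketch of the N+M decomposition (hypothetical, not Task Bench's actual
# code): each of the M patterns is described once as data, and each of
# the N systems implements one generic runner for any such description.

def stencil_pattern(width):
    """Pattern as data: task (t, i) depends on tasks (t-1, i-1..i+1)."""
    def deps(t, i):
        if t == 0:
            return []
        return [(t - 1, j) for j in range(max(0, i - 1), min(width, i + 2))]
    return deps

def run_pattern(deps, steps, width, work):
    """One 'system implementation': executes any pattern description."""
    done = {}
    for t in range(steps):
        for i in range(width):
            inputs = [done[d] for d in deps(t, i)]
            done[(t, i)] = work(t, i, inputs)
    return done

results = run_pattern(stencil_pattern(3), steps=2, width=3,
                      work=lambda t, i, inputs: sum(inputs) + 1)
print(results[(1, 1)])  # 3 dependencies from the first step, plus 1 -> 4
```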
We hired 5 summer students through Stanford, and my collaborator Wei Wu hired another student at Los Alamos. George provided one of his students to help out as well. So between all of us, we had 7 people actively writing codes that summer, and we managed to finish implementations of 15 distributed programming systems.
Finishing the implementations is one thing; making sure they’re efficient is something else. Unless we were sure that we’d done our due diligence in optimizing each implementation to its fullest extent, the results wouldn’t really be representative, and worse, we’d probably offend someone in the process.
So, to be sure that we were really doing things properly, we reached out to the developers of each of the 15 systems, and solicited feedback on our approach. It took some prodding in a couple of cases, but we ultimately got in contact with at least one member of each project, and in several cases the developers were highly engaged in helping us optimize the codes.
That still left the problem of analyzing and presenting the results. I now had a tool capable of running an arbitrary number of application patterns on each of these 15 different systems. But what should I actually measure?
One thing I realized while working on this paper is that “overhead,” though it seems intuitive, is actually a slippery concept. One way to measure overhead is to run a bunch of empty tasks through the system. Empty tasks take no time, so anything left at that point is, by definition, overhead. But what are you actually measuring? Some systems reserve exactly one core for doing scheduling. Some systems schedule on every core. On some systems, this is configurable, and then how do you pick the number of scheduling cores to use?
I realized the only sane way to do this is to measure performance under some load. That is, the system has to do some amount of useful work. Suppose we set a goal of hitting 50% of the peak FLOPS of the system. Then it’s obvious that you can’t use 100% of your cores to schedule, at least not full-time, because then you would do no useful work. Thus, the useful work requirement provides us a way of keeping the systems honest, while also allowing them to use whatever settings happen to perform best for a given system.
This led to some interesting new terminology, the minimum effective task granularity (METG), which captures the smallest task you can run while achieving at least a stated level of performance. You can think of it as being like an “honest overhead” measurement, because you are forcing the system to still do useful work, while getting through as many tasks as possible (and thereby minimizing the overhead). While it wasn’t originally the expected contribution of the paper, it’s probably my favorite part, and something that I hope takes off.
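The METG idea can be sketched as a search (a toy model with made-up numbers, not Task Bench’s actual methodology): shrink the task granularity until efficiency falls below the target; the smallest granularity that still meets the target is the METG.

```python
# Toy METG search (made-up numbers, not Task Bench's methodology):
# shrink task granularity until the system can no longer hit the
# target efficiency; the smallest granularity that still meets the
# target is the minimum effective task granularity (METG).

def efficiency(task_work_s, overhead_s):
    # Fraction of wall-clock time spent on useful work per task.
    return task_work_s / (task_work_s + overhead_s)

def metg(overhead_s, target=0.5):
    gran = 1.0  # start with one-second tasks
    while efficiency(gran / 2, overhead_s) >= target:
        gran /= 2
    return gran

# With 100 microseconds of per-task overhead, the METG lands on the
# same order of magnitude as the overhead itself:
print(metg(overhead_s=1e-4))  # ~1.2e-4 seconds
```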
But this paper also had its share of issues. Reviewers didn’t understand what it meant to implement a “task” on systems that didn’t have tasks. The concept of METG was tricky and easy to misunderstand. Although we refined the paper with each submission, it still wasn’t enough, and after three failed submissions some of our collaborators were inclined to give up and take the paper to a second-tier conference. But I wasn’t ready to give up, and on the paper’s fourth submission, we made it through.
Despite the tribulations this paper went through, it is probably the one with the least evolution from the initial to the final submissions. The initial submission had it basically right; we just had to polish it over and over to get the narrative smooth enough that reviewers grasped what we were doing.
Submission history:
While Legion began as a research project, it was never our goal for it to remain purely in the academic realm: we really wanted to produce a piece of software that would last and that people would actually use. That meant, in part, transforming the research culture of the core project into an engineering culture. By 2017, when the work on dynamic control replication began in earnest, that transformation was already underway: we had moved from occasional “throw it over the fence” public software releases to doing everything in the open. We now had a real continuous integration (CI) infrastructure, instead of relying on me to manually hand-test everything. As for the software itself, we cared much more about production quality and paying down technical debt from the very beginning, as opposed to building the software quick and dirty and paying the debt over time. And it meant that we would build things that were important, even if there was no immediate paper payoff—though we certainly hoped to publish the pieces that were interesting enough to warrant it.
Dynamic control replication (DCR) was, at some level, just the dynamic version of the optimization that I had previously published in SC 2017. Dynamic in this case meant that it was built directly into the runtime system, instead of into the compiler. There were a couple reasons we decided to do this:
With basically zero uptake of CR in the rest of the programming system world, the odds of getting scooped were essentially nil, and thus we could afford to take our time, do the optimization properly, and publish when it was ready. But that was a cherry on top at this point.
Along the way to building DCR, Mike (who was still the exclusive owner of all Legion runtime code) discovered that there were other, even more foundational changes we would need to make. But those would end up turning into an entirely separate paper.
The interesting thing about this paper is that we wrote, finished, and deployed the optimization into production before even thinking about writing the paper. When Mike began to organize the effort to write the paper, DCR had already been the de facto production version of Legion for over a year, and he solicited everyone on the team to submit the experiments they wanted included in the paper. Compared to our previous papers, it was striking how orderly of a process this was. There was no frantic rush, no endless stream of showstopping bugs, and no last-minute catastrophic discoveries. One of the reasons why this paper has so many experiments is that basically everyone already had it working as their daily driver, so it was easy to gather a lot of data.
After the struggle to communicate the value of static control replication (SCR), Mike knew exactly how to write this paper, and it sailed through on its first submission.
(Note: from this point onward, our paper submissions moved to Overleaf, so the submission sources are not publicly available.)
Submission history:
This is a paper that I thought truly would never be published, and yet ironically got accepted on its first submission.
Index launches had been a part of Legion since very early in the project. The basic idea is that if you’re going to launch a bunch of tasks like:
for i = 0, N do
  my_task(some_data_partition[i])
end
There should be a way to rewrite the entire loop as a single operation. Regent has no syntax for this, so I can’t give you actual code here, but conceptually it’s something like the following (using Python pseudocode):
IndexLaunch(my_task, [0, N], some_data_partition, lambda i: i)
Because we can suck the entire loop into one operation, we can reduce a bunch of analysis costs from 𝒪(N) to 𝒪(1) or at least 𝒪(log N).
It’s a nice, clean, simple idea, and it’s also key to almost everything Legion does. For example, the spectacular success of DCR is only possible because index launches gave us a compact representation of the program (something that we argue in this paper).
For a paper, the idea was perhaps too simple. It was also just old. This was literally one of the first features that Mike and Sean came up with in the very early days of Legion, when I was still a neophyte researcher, and it had been present in every version of Legion publicly released.
And yet, index launches had never been properly described in any published research paper. They only got a passing mention in my SC 2015 paper, where I talk about an optimization Regent does to take advantage of this feature of the Legion runtime.
But, having worked recently on the Task Bench paper that made it into SC 2020, I was in a unique position to look around at the ecosystem to see what other projects were doing. After all, I had just led a team to develop efficient implementations of Task Bench in 15 different systems. And after doing that research, I realized that literally no one in the field was using this idea, except us. We’d been yammering at people about this for nearly 10 years, and no one had listened.
That was the point where I decided I wanted to make this a paper, if at all possible.
I pitched the idea to Rupanshu Soi, an undergraduate student in India who had joined the project initially as an open source contributor, and whose work I’d been advising. This was the perfect paper for Rupanshu to take on, since he was already actively working on improvements to Regent’s index launch optimization. I didn’t see any way to get that work published unless it was merged into a larger paper of some sort. Meanwhile, after the idea had sat idle for nearly ten years, it was obvious that if writing this paper fell to me or the other original contributors, it was simply never going to happen. The energy to push it over the finish line had to come from someone new.
But given that index launches had already been a feature of Legion for so long, we had to think carefully about what the contribution of the paper would actually be. After some discussion, we settled on a three-pronged plan:
This was also my first paper as an advisor (i.e., in the last author slot of the paper instead of the first or Nth). Rupanshu did a great job on this one and it got accepted on the first submission.
Submission history:
At its core, Legion has a deceptively simple premise: you can partition a distributed data structure any way you want, any number of times, with arbitrary hierarchy. This has turned out to be the single most powerful and also the single most challenging feature of the system, with implications that took us most of a decade to fully understand.
As our understanding evolved, so did the Legion implementation. And so, by 2021, we had iterated through three major versions of a certain core Legion algorithm: the algorithm that figures out what version of a piece of data a given task should have access to.
This was basically a paper waiting to happen. We had this very deep insight into how these systems work, on the basis of developing progressively refined implementations. Mike in particular thought he knew how to write a paper about it. But the challenge was that we’d developed the software over a period of multiple years. Was it even possible to run a set of applications on versions of Legion separated by so much time, let alone do experiments that allowed head-to-head comparison?
So after stepping off the index launch paper, with experiments fresh on my mind, I set out to see whether it was possible to do this. The core insight I had was that it might be possible to swap out just one box in the architecture stack diagram: Legion itself. That is, could I use an up-to-date Regent (the language), and an up-to-date Realm (the low-level portability layer), and somehow shove the (many years) old versions of Legion into the middle?
After a couple of days of furious Git hackery and papering over API differences, I had what I wanted. Brand new Regent, brand new Realm, and three different versions of Legion sitting in the middle. And the best part was, the applications mostly didn’t need to know that anything had happened. To the extent that there were compatibility hacks I needed to put in, they went into either the API shims or the compiler.
With that done, I ran all the experiments, and handed the results off to Mike to write the paper itself. The tricky part was figuring out how to communicate what we’d actually done. Because Legion is such a different system from many of the ones that share the “task-based” moniker with us, most reviewers come in with the wrong expectations about what we’re setting out to do. This caught us in multiple submissions of this paper until we polished the story enough to make sure people got it.
Submission history:
When I back up and look at all the work we’ve done over the last 12 years and change, despite all the ups and downs and moments of sheer existential terror, the biggest thing I have to say is that I’m really proud. We did a lot of great work. Not only did we push forward the frontier of HPC, but we did so while building a great piece of software and somehow managing to distill and communicate many important conceptual contributions of the work we were developing. I’m aware of projects that have done one or the other, but I’m not aware of many projects that managed to do both. In part that’s because the problems ended up being much deeper than anyone expected, giving us a lot to work on both in research and engineering.
I never would have expected, when we started the Legion project in 2011, that we would still be discovering core technological contributions in 2023. And yet, that has continued to be the case. As recently as the latter half of 2023, we’ve made substantial refinements in our algorithms that are critical to running at extreme scales—allowing us to run, as of only a few months ago, on the full scale of Frontier, the current #1 supercomputer.
From a people perspective, the reason all this worked out is because we kept the core contributors actively engaged post-graduation. Mike Bauer graduated in 2014 but continues to be the main (and usually sole) developer of Legion. (We are making a concerted effort to increase that bus number. Really, honest.) Sean Treichler graduated in 2016 but continued to work actively on Realm until fairly recently, and still helps out (albeit in a much reduced role). I graduated in 2017 and am still continuing in my various and sundry roles, including beating on our software when the circumstances require. And there have been many other people involved over the years, a surprising number of whom have continued to work on Legion in some capacity post-graduation.
From an academic perspective, as I have said, there were far more problems to solve in this space than any of us anticipated. I haven’t even covered all the papers here; there are more to be found on Legion’s publications page, but someone else will have to tell those stories.
From an engineering perspective, I think we managed to build the right kind of culture at the right time for the phase of the project we were in. In the beginning, I was our test suite and our CI. As the project grew and we needed something more rigorous, we evolved the approach to grow with us, adding more and more sophisticated tests as we went along. There is more to do, as some of our users on the bleeding edge unfortunately know all too well, but overall the level of completeness and robustness of our tests is dramatically better than it was.
From a product perspective, we now have a killer app: a product built on Legion that NVIDIA employs a nontrivial number of people to develop, and which serves in part to justify ongoing investment in the project. And this is one of several areas where we now have compelling products available.
Of course, all of this has only been possible because we had great leadership who kept us on track all along the way. It’s not easy managing a project that spans the gamut from academic to engineering, but Alex and Pat did it.
If there’s one thing I want to reiterate, it’s that this is one, small slice of what has happened in the project, and couldn’t possibly cover every contribution or every contributor. And even in the events I’ve described, this is only one perspective. Hopefully with time we’ll get to hear from others involved in the project as well.
if statements? You can thank Lisp!)
But the language landscape has changed a lot since then, and realistically no programmer today cares about what made a language stand out 50 years ago. Clearly, the good ideas have been copied into other languages. Paul Graham even suggests this convergence towards Lisp is inevitable. I wouldn’t go so far. But this raises the question: Is there anything left? Are there any features that couldn’t be copied so easily into the various descendants of Algol?
I’d say these features, particularly in combination, continue to be distinctive:
An “everything is available all the time” approach to system design. Lisp allows you to run code at compile time, compile code at runtime, run and compile code while debugging, iteratively compile and profile different sets of code, etc. Everything blurs together so that there are no obvious boundaries between different parts of the system. The way that Common Lisp systems produce executable binaries to be used as application deliverables is by literally dumping the contents of memory into a file, with a little header to start things back up again.
Pervasive interactivity, i.e. the REPL. This was more of a contrast to e.g. Fortran and other ahead-of-time compiled languages, but is still notable today. For example, the Common Lisp package manager is executed via the Lisp shell. Nearly all comparable alternatives I’m aware of are executed out of process via the system shell (/bin/bash, or what have you). Same with the debugger, profiler, IDE (if that’s your thing), even in some cases the OS.
A canonical representation of programs in terms of the literal syntax of the language’s core data structures, permitting a design where the program text is the literal representation of the program AST.
Let me unpack that.
Python has literal syntax for various core language data structures: [] for lists, {} for dictionaries, () for tuples, etc. Python programs are not represented in terms of those literals; the language itself has different syntax (def, class, etc.). But Lisp programs are.
Lisp programs are expressed using Lisp’s core list data structure. Therefore, the text of a program in Lisp consists simply of the serialization of this data structure into text via the literal syntax for lists (i.e. the infamous ()). Furthermore, this permits an implementation where the language AST (at least through the first stage of the compiler) also uses the same representation. This is what people mean when they say the language is homoiconic.
Homoiconicity has a surprising advantage. Now it is possible to manipulate program ASTs using only core language data structures. It is possible, in other words, to write code that produces code. And since the compiler is available at all times, you can not just build ASTs, but actually compile and run them. This is the basis of metaprogramming in Lisp.
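For contrast, here is what “code that produces code” looks like in Python, where the program representation lives in a dedicated ast module rather than in the language’s own list literals. The same capability exists; it just takes more machinery.

```python
# Building and running code at runtime in Python: possible, but the AST
# is a separate data type (the ast module), not the language's own
# lists, as it would be in Lisp.
import ast

# Construct the AST for `lambda x: x + 1` by hand...
tree = ast.Expression(
    ast.Lambda(
        args=ast.arguments(posonlyargs=[], args=[ast.arg(arg="x")],
                           vararg=None, kwonlyargs=[], kw_defaults=[],
                           kwarg=None, defaults=[]),
        body=ast.BinOp(left=ast.Name(id="x", ctx=ast.Load()),
                       op=ast.Add(),
                       right=ast.Constant(value=1)),
    )
)
ast.fix_missing_locations(tree)

# ...then compile and run it, since the compiler is available at runtime.
add_one = eval(compile(tree, "<generated>", "eval"))
print(add_one(41))  # 42
```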
It’s worth noting that macros are not on my list above. Macros are simply syntactic sugar around the capability you already have as a consequence of (1) and (3). Macros allow you to hide the fact that you’re doing code generation, but they don’t fundamentally give you any new capabilities when the compiler is already available at runtime anyway.
It’s also worth noting that code generation (and therefore metaprogramming itself) is not fundamentally an innovation of Lisp. For example, in C++, it is entirely possible to link your application against libclang, build Clang ASTs inside your C++ application code, and use the Clang compiler to emit and run that code. Before you laugh, know that people can and have done it. But unsurprisingly, it’s hard—hard enough that you wouldn’t bother unless you had a problem you couldn’t solve any other way, and so big that you couldn’t ignore it.
Lisp is easier. In fact, Lisp is so much easier that people do it in the course of their day-to-day programming tasks. This is the real secret sauce of macros. Macros, together with the key innovations above, make it possible to do metaprogramming easily enough that it’s actually a regular occurrence.
In case you missed it, I did a minor sleight of hand two paragraphs back: I used the word “metaprogramming”, but the links I included were all to domain-specific languages (DSLs). This was not a mistake. As metaprograms become more sophisticated, they become increasingly difficult to distinguish from full-on languages. Furthermore, in a language that makes metaprogramming easy, an application that consists of a series of layered libraries begins to look increasingly like a series of layered languages. To put it another way, libraries and languages are both abstraction layers, but languages are the more powerful one. This is, by the way, not a new idea, but it is one that the broader software community has yet to internalize.
As someone who makes a living by creating languages and compilers, it’s easy to downplay the significance that all of this has on writing code. Even in the programming languages community, Lisp probably gets less credit than it deserves. But with the recent resurgence of DSLs, we should really be taking a hard look at what Lisp provides. If the goal is to make it easier to build DSLs, Lisp already provides a lot of the starting blocks.
(I originally wrote this on February 27, 2017, but it remains as true today as it was at the time. I’m finally getting around to publishing it.)
This release recognizes what has already been true for many years: Terra is mature and is actively used in production environments to deliver best-in-class performance at a variety of scales. While the language has been stable for some time, this is the first release to introduce a stability policy that describes how we envision development proceeding in the future. This is a living document and we expect this policy to evolve as we gain experience with how this works for our users. As always, feedback is welcome.
A brief summary of what’s included in the release:
Support for dramatically more LLVM versions. Currently we support almost every LLVM version between 3.8 and 14, although we have plans to deprecate older versions to reduce the ongoing maintenance burden.
Substantially updated platform support, including CUDA (through version 11) and Microsoft MSVC (through version 2022).
Build system rewritten from scratch in CMake to improve portability, finally putting to rest a number of long-standing issues with the Makefile build.
Experimental support for a number of language features for efficient code generation such as atomics and switch statements that directly generate LLVM jump tables.
Very experimental support for AMD GPUs.
For a complete list of changes, see the release notes.
To get started with Terra, you can spin up an instance at Replit or download binaries for your platform. See the getting started guide for an introduction to the language.
Terra is a volunteer-run project, by and for its users. We always welcome new contributions. If you have any questions or want to hang out with other Terra users, come join our Zulip instance.
Finally, let me take this opportunity to thank Zach DeVito for the initial development of Terra. As someone who has been building applications successfully with Terra for many years, I’m grateful to be building on top of such a solid foundation.
Here’s to many more years of Terra!
In case you haven’t been following along, we’ve released a number of betas over the last couple of years. We are now quite close to releasing 1.0.0. If you want to help, please refer to this issue on the latest beta.
Some of the notable changes in recent betas include:
Support for more LLVM versions. We’re currently up to 14 (the most recent version as of this time of writing). I’ve been happy that we’ve been able to keep up to date; it’s not such a long time ago when we were many versions behind on the LLVM treadmill.
Support for AMD GPUs. This is very experimental, but has already been used in Regent to provide seamless GPU code generation for NVIDIA and AMD hardware. Preliminary performance looks good, and we’re pretty much set to run on the Frontier supercomputer when it arrives.
Fixes for performance regressions in CUDA. It used to be that you had to use LLVM 3.8 if you wanted to get the best performance on NVIDIA GPUs. Now, at least in my testing, we’re able to do this with LLVM 13.
Support for (experimental) concurrency primitives. Right now this includes the LLVM atomicrmw, fence, and cmpxchg instructions.
Much better cross-version compatibility for Linux. You can now build on Ubuntu 18.04 and expect it to work on 20.04, 22.04, etc. I have yet to test the cross-distro compatibility, but I’m optimistic that this may work as well now.
Some older changes include CMake support, substantially revamped Windows support, Nix support, a dramatically better CI and release infrastructure, and many, many bug fixes.
I’m personally most plugged into the Regent community. Regent users—by virtue of Regent being built on top of Terra—are (indirectly) Terra users.
Terra (via Regent) has been used on many of the top supercomputers (at least those in the USA, and some in Europe). Regent uses Terra to implement seamless support for generating efficient GPU kernels. We’ve been doing this for NVIDIA GPUs for many years now, and have recently started supporting AMD as well. Intel GPU support is also on the roadmap. The result is that, in a fairly short amount of time, Terra/Regent have grown into a powerful way to write performance-portable GPU code. That’s pretty awesome.
I’d love to hear more about what other users are doing with Terra. Swing by our Zulip instance and let us know what you’re working on!
Let me state the obvious up front: I’m not Zach. Zach moved on from his postdoc position a number of years ago, and I stepped into the gap to help keep Terra going.
Terra is a relatively small community. At this time, we have no corporate sponsors and no full-time (or even significant part-time) staff. I count myself lucky to be able to put about 5% of my time into the project, but it’s still not a lot. I’ve been focusing mainly on keeping Terra up to date with the ecosystem changes (e.g., LLVM versions) and making sure Terra is able to target the hardware that comes along (particularly for USA-based supercomputers). I also try to do what I can to mentor other users who are interested in working on Terra.
As usual, if you want to support Terra, the best way is to get involved in the community. There are plenty of things to work on, and we love community contributions! If you need ideas, feel free to pop over to Zulip and let us know you’re interested.
I believe strongly that Terra goes where the community wants it to go, where “community” opinion is weighted in favor of those who actually put in the effort to do something about it. Maybe some day we’ll be large enough to have a dedicated team with a more formal process, but right now that’s not where Terra is at. First and foremost, Terra has to serve the community it has.
However, as one of those people who are putting active effort into Terra, I suppose I should say something about my goals. To be clear, these are my personal goals. I do not claim to speak for anyone else on the project (or for any past, present or future employer).
In roughly decreasing order of priority, I want Terra to have:
Best in class code generation support for AMD, Intel and NVIDIA GPUs (and of course CPUs too). You should never get to a point where you have to switch to CUDA, HIP, SYCL, etc. to accomplish what you need to do, and performance should be at least as good as any of those platforms.
Best in class support for HPC platforms and machines (especially those in the USA). I am biased in this regard by my employer, which, as a U.S. DOE laboratory, has an interest in running on current and upcoming DOE supercomputers. But I also put effort into running on machines used by my collaborators, for example in Europe.
Best in class support for HPC codes and applications. That’s not to say I don’t or wouldn’t support other classes of applications, but HPC has needs that are often not well-represented in the popular programming languages. My usage of Terra is mostly HPC-focused, so that’s where my effort goes.
No unnecessary breakage. I am first and foremost a user of Terra. For me, this is not an exercise in language design, it’s about getting work done. If Terra breaks, it gets in the way of me doing that.
And lastly, I aim to support community efforts as best I can in whatever time is remaining after fulfilling my other obligations.
I think this is an exciting time to get involved with Terra, with some of the things that are around the corner. If you’re interested, we’d love to chat and discuss how to get started.
This entry has also been cross-posted on the Terra issue tracker.
One of the most remarkable things, looking back on the life of my Kindle, is how long-lived it was. It basically just worked, and continued working, for nearly 12 years. It got firmware updates (including one to enable Amazon’s KF8 format on the device), but those updates never changed how the Kindle looked and felt, or the way it worked. In contrast, I replaced my laptop three times over the same period—of which two were incentivized by major upgrades, and one by the hardware failing out from under me. Apple dragged the laptop world to high-resolution, wide-gamut displays. Internal changes, such as SSDs, swept over the scene, bringing dramatically higher performance. In the land of GPUs, desktops and gaming, we saw a dramatic increase in both compute performance and also the corresponding visual fidelity that could be achieved. In high-performance computing, we saw an increase of 230× in peak performance1. We even saw the possible start of the end of the dominance of the x86 ISA2.
Changes in the software realm were no less dramatic. All three major desktop OSes underwent massive UI overhauls. Many users of Microsoft Windows had their machines force-updated without their permission and woke up to a completely new interface. Mac OS X became OS X, and then macOS, and then stopped using the version number 10.
My Kindle never changed. It performed the same function, with the same interaction model, with the same interface in 2021 as it did in 2010. Until the charging circuit broke, and then it died.
I’m not saying there was nothing to be improved with my Kindle. In fact, I was pleased to see that there were nice improvements that had accrued over the years since I’d purchased my original device. It’s not all sunshine and roses—I’m still on the fence about the touch screen—but it’s unquestionably an upgrade. Still, for all that, my new Kindle has basically the same interaction model as the old one. Aside from moving from buttons to touch, you could nearly pretend it was the same device.
Much of the credit for this stability goes to Amazon, who steered the ship straight for over a decade. That may seem like the obvious (and possibly even easy) thing to do, but it’s really not, especially in a world with as much churn as we’ve seen in the last decade. Consider, for example, the saga of Microsoft, who closed their ebook store and deleted all of the files from customers’ systems. Or the number of perfectly good Google properties that have been shuttered.
But there’s also an extent to which I think this shows what is possible with appliance computing. A device with a well-defined, stable function doesn’t need to change. That doesn’t mean stability is inevitable—as the IOT market has disastrously demonstrated—but it is possible. Or perhaps to be more precise: stability is possible where there is a stable business model that doesn’t depend on selling devices or making them work in a radically different way. Kindle, apparently, is one of those business models.
From the Top 500 list, the #1 computer in 2010 was Jaguar, a CPU-only machine that achieved just over 2 Petaflops. In June 2021, Supercomputer Fugaku achieved 537 Petaflops.↩︎
We’ll see where it goes, but the Apple M1 SOC certainly looks impressive. See also AWS Graviton.↩︎
It refers, of course, to what you’d like your life’s work to be. But it also has, if you take it perhaps too literally, some other linguistic layers of meaning to it1.
Take the word “spend”, which conjures up the image of an economic transaction. Your life is a currency. Almost like going to a store to make a purchase, you get to choose what you spend your time on.
You could choose to spend it on aggrandizing yourself. On making yourself knowledgeable in some field. On making an absurd (or possibly less absurd) amount of money. You could move out into the middle of nowhere and try to isolate yourself from civilization. Or you could try to spend it doing something that you think will have an impact on your fellow human beings, either in the present, or in the future—a legacy.
Of course the analogy falls down at a certain point. Unlike a store, you don’t know, when you’re picking a product to buy, how your life will turn out if you go any given direction. You may not even see all the available options up front—or some may be so distant that you discount that they’re possible at all. Some options may only open up once you’ve gone a certain direction for long enough. Or you may discover once you get there that it wasn’t what you expected.
Still, I think the analogy is useful in more ways than one. For example, the spending analogy helps to emphasize the temporariness of life. Like a currency inflates until it’s eventually worthless, life will pass you by if you don’t do anything. Not spending is not an option—the only option is figuring out what direction you want to go, in this moment.
Let me just say up front that I’m well aware of the spuriousness of taking things too literally. Just like the first two letters of “God” are “go”, and the middle letter of “sin” is “I”, this is a coincidence that has no inherent meaning. I’m not presenting this because I think this is some deep linguistic insight, but because the accidental collision of meanings is fun to think about.↩︎
This much is obvious, but what isn’t is how to actually deal with it. In this regard, Alberts’ advice is surprisingly shallow. We’re told to avoid sentimentality. Ok, how do we do that? Alberts doesn’t really say.
That’s the question I want to address here. But first, I believe the problem needs to be reframed. I will attempt to define sentimentality in a way that I believe leads to greater insight about its nature, and then having done that, suggest a way forward to avoiding it in our writing.
Sentimentality, in my view, is most helpfully defined as unearned emotion. Let’s unpack those words, starting from the end and working backwards:
Emotion is pretty much what you’d expect: happiness, anger, sadness, fear, etc. But note that in the context of sentimentality, we usually mean the extreme forms of emotion. A hint of sadness does not qualify as sentimental. But extreme sadness, such as at the death of a loved one, might.
What does it mean for emotion to be unearned? I think the best way to show this is with an analogy.
Suppose I write a book that ends in a climactic fight scene. In it, the plucky protagonist defeats the evil overlord by using their agility and strength to (barely) outmaneuver the villain. Ok, fair enough. But now suppose that in the first 90% of the book our protagonist is a couch potato. They’ve never actually exercised a day in their life. They never lift weights. They never train. If this is the protagonist who goes on to defeat the villain, then the ending of the book will feel like a massive non sequitur. It will feel fake because it wasn’t earned.
Here’s the thing: whatever climax you choose for your book, you need to earn it. That means showing the protagonist making progress towards their goal. There can be backsliding. There can be setbacks. But for the most part, there should be some sort of regular progress, showing the character improving along the axis that you need for the ending to feel real.
I would suggest that the same basic principle applies to emotion. Emotion is the reward you give the reader for finishing the book. If it’s a tragedy, you show the character finally succumbing to their circumstances. If instead the book has a happy ending, you show their joy and elation. Both in the physical circumstances as well as the emotional ones, you can’t simply jump to the end. You have to show how a character gets there.
So for example, if someone dies at the end of the book, that death should be foreshadowed. Foreshadowing is the promise you make to readers about what’s going to happen later on. It allows readers to feel the continuity of the story, to feel that the story has structure and is not just a random sequence of unrelated events. If something tragic happens at the end of the book, readers generally like to be prepared for that so it doesn’t come on them (at least entirely) by surprise. That could be a smaller setback that occurs earlier (but is somehow representative of what is to come), or a reference to a tragedy that happened in the main character’s backstory, or perhaps an “unrelated” tragedy that occurs to a secondary character, or a “chance” discussion about a dangerous possibility that the main character fears.
It may be tempting to skip foreshadowing to make the ending “unexpected.” After all, you’re telegraphing what’s going to happen in advance. But I think this doesn’t actually work—instead, readers will think your ending comes out of nowhere and therefore feels like a non sequitur, rather than clever. Good authors manage to foreshadow, even when the ending feels unexpected1.
Foreshadowing is good but it’s not enough. The emotional ending, whatever it is, needs to be developed. Think of the way you’d develop the plot for the book. The character sets a goal, makes an attempt, and fails. Rinse and repeat. Each time, they get a little bit better, and build skills that contribute toward that final confrontation with the villain. In the same way, emotions can be developed. In Hamlet, Shakespeare did not simply jump from the initial foreshadowing to the end. He developed the plot, showing Hamlet’s struggle escalating before we reach the tragic ending.
And this brings me to a final point. It is ok to write highly emotional endings. Emotion is fine, even good, for an ending. But it needs to be earned. Hamlet works as well as it does because of the masterful skill Shakespeare uses to earn the emotional ending. Most of us are not, of course, anywhere near as skilled as Shakespeare, but we can still strive to earn the emotions in our books and deliver satisfaction to our readers.
To summarize, sentimentality is unearned emotion. Emotion is one element of the ending of a book that needs to be earned. As with other elements of a climax, it is earned by foreshadowing and development.
While I still need to do more research, I have a theory about how authors pull this off. The key is distraction. Like a good magic trick, a plot twist may get paraded out right in front of readers’ faces. But they don’t expect the twist because something else occurs to distract them at the moment the twist is initially revealed. This creates a deeper continuity later on when readers realize the twist, because it’s been there the entire time. The final reveal of the twist is more of a reminder to readers of what was already there, rather than a presentation of something entirely new.↩︎
Laurie Alberts helps to reframe the problem in her book on writing craft, Showing and Telling. Alberts tells us that in fiction (and other narrative prose) there are two primary modes of writing: scene and summary. Ultimately, we need both of these techniques, and few (if any) successful books use exclusively one or the other.
One important thing to note here is that scene (in the sense used by Alberts) is not a unit of organization of a book (a division smaller than a chapter). Similarly, summary is not synopsis. Instead, scene and summary are modes of writing, a bit like prose and poetry (though not so extreme). Any given “scene” (in the book organization sense) could contain both scene and summary interwoven, or one or the other exclusively. A book will typically be structured to alternate scene and summary, though it will sometimes flow from scene to scene or summary to summary.
Scene is the mode you’re probably most used to when you think of fiction. This mode is most accurately characterized by the perceived passing of real time. Think about a movie. The camera is watching the characters and you see what they’re doing as they do it, as if you were standing there yourself. Similarly, in creative writing, and especially in fiction, much of the narrative is spent with the characters, watching their lives unfold as if we were living alongside (or inside) them.
In contrast, summary is characterized by breaking the flow of real time, or perhaps more helpfully, unified by a common theme, mood or emotion. Time may be condensed, repetitive, or “general” (i.e., the state described is true in general and is not specific to a certain time frame). In any case, summary permits a description of state or events without pedantically describing everything as if it were happening in real time. This gives readers a breather between scenes, but it also serves to unify the narrative around a theme, mood or emotion. Often, mood changes (e.g., between scenes) can be accomplished with a summary that gradually transitions from one mood to another. Thus scene and summary fill complementary roles, focusing on the passage of time and thematic elements of a book, respectively.
Why is it important to use a mix of scene and summary? Because to use only one would be repetitive and tiring.
Using scene exclusively is exhausting. Scene happens in real time, and that means you need to write each and every event as it occurs. Worse, scenes often build or extend tension. To use scene exclusively is to run along from one event to the next without any breaks. This wears the reader down and ultimately causes them to put the book down, even if every scene is engaging on its own.
Similarly, using summary exclusively would be challenging. There are exceptions, books that use primarily summary, but they are very rare. In this day and age of movies and TV, readers are more visual than ever. They want to be able to see and feel what happens to the characters. This requires scene to be delivered effectively.
Coming back around to the original problem statement, why are we told to “show, don’t tell”? Because we’re not using scene effectively. But that doesn’t mean that we don’t need summary! Instead, we need to learn to use both scene and summary effectively.
Beyond the framing of the problem itself, the real value that Alberts delivers in this book is specific advice about how to effectively write both scene and summary, how to transition between them, and how to embed small parts of either one in the other. This is the heart of the technique that, if writers truly master it, will vastly improve their craft.
Though a certain amount of this borders on generic craft advice, Alberts’ framing and presentation make this book a particularly helpful resource. The book starts at a high level, describing the purposes, types, and structures of scenes and summary. Then it delves into more specific aspects of craft, such as beginnings, middles and ends of scenes and summary, and specific techniques related to time, setting, character, dialogue, and even specific verb or word choices. Alberts really digs deep into the nuts and bolts here, using frequent examples to show precisely how to effectively accomplish each technique described in the book. Little is left abstract, and Alberts includes both positive and negative examples (things to do, as well as things to avoid).
If the book has a weakness, it is that a small number of potential pitfalls are stated, but given little in the way of concrete suggestions for correction. Writers are told to avoid cliche, for example. This is obvious enough advice, but how precisely are we supposed to know what cliche is and how to avoid it? Alberts never really says. Or in another case, Alberts exhorts writers to avoid sentimentality. Again, there is a lack of insight here as to what sentimentality is exactly and how to avoid it.
Overall, if you’re looking for a fresh perspective on your writing craft, Showing and Telling is an excellent place to start. The framing is insightful, the examples are truly a joy to read, and the advice is direct and helpful.
For me, NieR:Automata felt like it came completely out of left field, going from complete obscurity to a cult hit seemingly overnight. I put the game on my wishlist, promptly forgot about it, and then recently dug it back out. And I have to say, it’s a gem.
NieR:Automata tells the story of androids 2B, 9S and A2 as they fight a proxy war against a machine invasion force. The game is set in a post-apocalyptic future in which the earth has been wrecked by war, and humans have fled to the moon, while androids continue to fight on their behalf on earth.
There are few times when I feel like I get to witness history being made: when I get to live through the day-1 launch of a game that I know is going to be iconic. Dishonored was one of those games (though I didn’t initially realize it): a combination of graphics, gameplay and story that somehow made for a truly compelling combination. Automata, while very different, holds a similar place in my heart. A Japanese RPG that combines a great story, satisfying gameplay mechanics, distinctive (if not hyper realistic) graphics, and a soundtrack that stands on its own, this is a game to be remembered.
What makes this even more strange is that the game is technically a sequel. Story wise, Automata doesn’t really have much of anything to do with the original NieR, despite being set in the same universe. You can jump right in without knowing anything about the previous game (or that there even was another game in the series). To be honest, it’s surprising that the game got made at all, given that the original NieR was not especially successful, and was widely criticized for some of its elements, including graphics. But it would seem that (contrary to what so often happens these days), the producers really learned their lessons with NieR, and were able to successfully address those issues to make a truly stellar sequel.
Automata is an open world RPG with a combination of RPG, fighting game, side scrolling, and bullet hell mechanics. In addition to the different mechanics, the game uses multiple perspectives and fluidly switches between these depending on where you are in the game.
One of the defining characteristics of Automata is variety. The game seems to go out of its way to avoid the monotony that is so common in RPGs. Just when you think you know what to expect from, say, a boss battle, the perspective will flip, completely changing the gameplay experience.
This theme plays out at multiple levels in Automata. In the soundtrack, for example, every “song” really consists of multiple tracks with different levels of vocals, instrumentals, etc., and the game again switches between these fluidly as you play.
This can also be seen in the story. Automata is organized into distinct “routes” that each have their own ending. Unlike a typical game where you stop when you reach the end, the expectation in Automata is that you keep playing (restarting after each ending) until you achieve all of the five main endings. Routes feature different characters, have different story (including different side quests), and a variety of world-shaking events occur that cause the different stages of the game to feel qualitatively different.
One thing that has to be said up front is that this is a console port. If you’re playing on PC, some of the aspects of the game are unfortunately not well done. Which is honestly pretty frustrating, because the console version of the game is very well executed.
The good news is that almost all of this can be fixed via mods. There are two main mods you’ll want, one to fix an issue with resolution (so the game doesn’t look fuzzy), and one that (massively) improves the mouse and keyboard controls.
NAIOM in particular is necessary in order to be able to access healing in a reasonable way. This is basically mandatory, unless you want to be stuck playing the entire game on easy mode, because most/all of the bosses in the game do >50% of your health in damage per hit and you really need the healing items to survive that. (There is an auto-healing upgrade, but it only triggers at <30% health, which makes it much less useful in boss battles.)
The game has four difficulty levels: easy, normal, hard, and very hard.
I recommend normal or hard. Actually, in my opinion, it doesn’t matter much which one you play. In hard mode, you’ll have to level up a bit more to survive, and you’ll be pushed a bit harder to optimize your equipment. But once you do, things will equalize out and you probably won’t notice much of a difference in overall difficulty. The main thing you lose in hard mode is the ability to lock on, but if you’re playing on a PC with mouse and keyboard (and the input mod above) this is honestly not an issue.
One of the nice things about the game is that you can switch difficulties at any time. So if a boss battle is too hard, you can always switch to normal (or easy) to get through it. I ended up having to do this in the opening sequence because I was still learning the game mechanics and wasn’t able to make it to the first save point. After that I switched back to hard and leveled up enough to make the game playable.
The other nice thing about easy mode is that the game has an “auto” mode where it will basically play itself for you. This is the mode to choose if you only care about the story and don’t want to be bothered by combat.
I don’t recommend very hard. This mode makes you die in one hit. That certainly sounds difficult, but it also makes many of the items in the game pointless (e.g., upgrades for defense, HP, invincibility after being hit). As a result, I think this mode actually robs you of many of the mechanics that make combat satisfying, and I think I would only consider playing it after finishing a first playthrough on normal or hard.
The game has three perspectives:
You don’t choose the perspective you’re in. Instead the game fluidly transitions from one perspective to another depending on where you are in the game. Most of the time, this is just a way to break the monotony and make the gameplay more interesting. But in some cases, the game seems to intentionally switch perspectives on you as a way to actively make the game more challenging.
The perspective also leads to some emergent behavior, like when you’re constrained to a plane (by either side scrolling or top down perspectives) while enemies are not. In these cases, you either need to wait for the enemies to enter your plane so you can fight them, or else use attacks that lock on so that you can attack outside of your plane.
In addition to perspectives, the game has three major sets of gameplay mechanics:
The “on foot” mechanics are what you’d expect from any modern RPG. You’re a soldier, and you have a variety of melee and ranged weapons. This is where you’ll spend most of the time in the game, and where you have access to the open world and main/side quests. While the main quest is mostly linear, there are relatively few constraints on where you can go, and a variety of side quests can be done in parallel to the main quest at any given time.
Occasionally, you’ll board a flight unit, which is when the flying mechanics enter in. Generally speaking these are fixed sequences, so once you’re done with all the enemies you’ll continue along in the story (generally returning to the ground).
Lastly, there is a hacking mechanic. In some cases you can get hacked (in which case you have to defend yourself), but generally speaking this is an ability that you use mostly with one specific character (9S).
The result is that, at least hypothetically, there are nine combinations of perspective and mechanics that you could experience during the game. In practice, I don’t think all of these are used—but probably at least half are.
In addition to all of this, you have the typical RPG mechanics like equipment and weapon upgrades, as well as some more unusual mechanics like “chipsets”, or sets of upgrades that you can install to boost your abilities (things like increased attack, defense, HP, auto-healing, auto-pickup-items, increased duration of invincibility after being hit, etc.). You also have “pods”, which are a bit like robotic animal companions, and can also be upgraded and have their own programs.
NieR:Automata is one of the few games where I listened to the soundtrack before I played the game. Look it up, it really is that good.
There are a couple things to keep in mind about the soundtrack. One is that, if you actually buy an official copy, it seems to be missing certain songs. Maybe those songs are exclusive to one platform or another, I’m not sure.
The other thing is that every “song” actually consists of multiple tracks. For example, one track might be instrumental only, one might have soft vocals, one medium vocals, etc. In the game itself, these tracks would seamlessly shift from one to another as you move between zones in the game, and potentially based on what you’re doing. The effect is subtle enough that you won’t necessarily notice it while you’re playing, unless you’re specifically listening for it. (Which is, to be honest, exactly as it should be.) But it’s just important to keep in mind since you can’t really replicate this effect when listening to the soundtrack.
I’ve split this section into two parts, in order to isolate the spoilers as much as possible.
You can go read the premise of the game on the Steam store page, so I won’t bother repeating that here.
What I will say is that the story feels substantial. In some games, you put in a lot of hours but get left with the feeling that you didn’t actually do that much. (I’m looking at you, Torment: Numenera.) Automata is the opposite of this. I spent about 40 hours on my initial playthrough, but it feels like I did way more than a typical 40-hour game. I think part of this is not just the twists and turns in the plot, but also the feeling that there are major, world-changing events going on. Most games would lead up to a world-changing event, and then stop. Automata keeps going.
WARNING: This section contains SPOILERS. If you haven’t played the game yet, I recommend doing so before you read this section.
When I think about what makes the game feel substantial, the key ingredient is change. The characters and world at the end of the game are not at all like what they were at the beginning of the game. It’s the exact opposite of Star Trek, where it seems like at the end of every episode the Federation goes back to being the way it was before the show started.
Change means consequences. And boy do we have those in this game:
The game doesn’t allow these deaths to feel easy. This isn’t, “oh yeah, I died again, whatever.” These are emotional, world-ending sorts of deaths. Excluding one set of deaths at the beginning (for 9S/2B), you don’t expect the characters to come back. And in fact, after 2B’s second death, 9S becomes increasingly irrational and goes into a rage where he decides to put an end to all machine life, period. (And this goes on for something like the last 25% of the game.)
The principle of consequences applies not only to the characters, but also to the world. If I were to summarize the set of conceptual states the game goes through, it would look something like:
YoRHa is an android army fighting the machines so that humans can return to Earth.
Huh. The aliens (who originally sent the machine invasion) are dead. The machines killed them??
Oh wait, humans are dead too—they were extinct before the machines ever arrived in the first place. We just made them up to give ourselves something to fight for.
Whatever, let’s just kill the machines anyway. (Insert massive battle.)
Oops, we lost. YoRHa is gone.
YoRHa was intentionally created with a backdoor, set on a timer to cause itself to self-destruct. And the YoRHa androids were intentionally created to be destroyed. What?!?
In comparison, if I were to map out Star Trek, Star Wars, etc. you’d get at most one or two conceptual states. That gives you a sense of how much more “twisty” the Automata plot is with respect to changes at the world level.
The large number of conceptual states—and how different they are—is part of what makes the change feel so deep.
Overall I enjoyed the game and highly recommend it. There are issues, but they are largely fixable with mods, and the mechanics, story, and music are a breath of fresh air compared to more typical commercial games these days.
My goal here is more descriptive than it is prescriptive: I’m not interested so much in what one “ought” to do (as if this could be distilled down into a set of rules), as in what actually happens in practice in the writings of successful, well-known authors. I present a qualitative and quantitative analysis based on the works of some authors I like. The best part about this is that you can repeat this analysis yourself! You can determine for yourself what authors you admire are doing in their own writing.
I’ve found in my study (both the qualitative and quantitative portions), that techniques that are often discouraged in writing guides are in fact in widespread use by well-known, successful authors. This indicates to me that perhaps we should re-evaluate how we teach the use of dialogue tags.
Having said that, none of this will directly answer the question when you should use a given technique in your own writing. Ultimately, this is up to you as the author to determine for your own book. But I hope the data presented here can be a starting point for thinking about your own writing style.
To start off, I’m going to first go over some terminology. For each style of dialogue tag, I give examples taken from the works of well-known authors. I’ve based these categories on what I find authors using in practice in their books. In some cases these are well-known techniques that you might find in other writing guides. In a couple cases, I’m finding things that I haven’t seen described elsewhere, at least not in detail; in a small number of these cases I found myself needing to create new terminology to describe what I see.
After this, I conclude with a quantitative analysis where I go through four books and report in detail the number of occurrences of the various forms of dialogue tags discussed in the initial sections.
In my qualitative analysis I use examples from the following authors. You could do this analysis with any books you want; these just happen to be a few of my favorites.
(In case you haven’t read these books before, I try to minimize spoilers by quoting mainly from the early chapters.)
My more detailed quantitative analysis is currently limited to British style dialogue tags due to challenges in analyzing arbitrary noun phrases that would be required for a robust analysis of American style tags. This limits the analysis to The Lord of the Rings and Pride and Prejudice. I also include the first Harry Potter book, as well as The President Is Missing—the latter chosen mostly because it is a recent New York Times #1 Best Seller that happens to use British style tags:
Similar studies could be conducted for other books, and at the bottom I provide links to the technical details required to reproduce my study. The main limitations are as noted above: the styles of dialogue tags that can be handled, and the time required to perform a handful of manual setup steps for the analysis.
The first thing to know about dialogue tags is that there’s a split between the British system and the American system. Of course this is not a hard and fast divide. I’m an American and I prefer the British system. Some authors mix both in the same book, but this seems to be uncommon. Most authors pick one and stick with it.
| | British | American |
|---|---|---|
| Pronoun | "he said", "she said" | "he said", "she said" |
| Noun | "said Alice", "said Bob" | "Alice said", "Bob said" |
Notice that the pronoun form is the same in both systems, while the noun (or proper noun) form differs.
It’s worth mentioning that there is also an older form where the pronoun is flipped: “said he” or “said she”, etc. This form is considered archaic and is very rare in contemporary writing, if it is used at all. For example, in The Lord of the Rings, the modern form occurs 764 times, while the archaic form occurs only 10 times.
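A crude way to measure the modern vs. archaic pronoun order is a pair of regular expressions. This is only a hypothetical sketch, not the actual scripts from my study (those are linked at the bottom of the post); a robust version would also check for adjacent quotation marks and filter out non-dialogue uses.

```python
import re

# Modern order: "he said" / "she said"; archaic order: "said he" / "said she".
MODERN = re.compile(r'\b(?:he|she)\s+said\b', re.IGNORECASE)
ARCHAIC = re.compile(r'\bsaid\s+(?:he|she)\b', re.IGNORECASE)

def tag_order_counts(text):
    """Return (modern, archaic) counts of pronoun 'said' tags."""
    return len(MODERN.findall(text)), len(ARCHAIC.findall(text))

sample = (
    '"Good morning," he said. "Quite so," said she. '
    '"Indeed," she said quietly.'
)
print(tag_order_counts(sample))  # (2, 1)
```

Running a pattern like this over a whole novel gives the kind of modern-vs.-archaic ratio quoted above.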
Examples:
Below, as I will do throughout, I have marked the verbs and other areas of interest in bold to make them easier to spot.
British:
“You’re right, Dad!” said the Gaffer. (The Lord of the Rings, 22)
“Kitty has no discretion in her coughs,” said her father; “she times them ill.” (Pride and Prejudice, 14)
American:
“Remember that you’re a duke’s son,” Jessica said. (Dune, 7)
It’s worth noting that even well-known authors are not always perfectly consistent in their choice of dialogue tags. For example, I have found examples of Tolkien using American dialogue tags, or Herbert using British tags, etc. However generally these are the (very) rare exception.
This distinction is slightly arbitrary, but I find it useful. Many guides censure the use of said-bookisms: dialogue tags with any verb other than “said.” This advice is well-meaning but, I believe, unhelpful. The intention, as I understand it, is to avoid drawing attention away from the dialogue and onto the tag verb. There is also a sense that verbs that are too “descriptive” can hide weak dialogue.
However, there are a number of verbs that are clearly useful in dialogue, not because they are descriptive, but because they indicate the function or role of a piece of dialogue. These include: asked, replied, answered, continued, added, repeated, etc. For lack of a better term I call these functional tags. (If anyone knows the proper name for these, please let me know.)
In practice these seem to occur less often than the basic “said”. But they still occur on a regular basis, and it’s easy to see why: they inform the reader about the functional role of the speech being presented.
Examples:
“But what about this Frodo that lives with him?” asked Old Noakes of Bywater. (The Lord of the Rings, 22)
“My dear Mr. Bennet,” replied his wife, “how can you be so tiresome! You must know that I am thinking of his marrying one of them.” (Pride and Prejudice, 12)
“‘I am Bene Gesserit: I exist only to serve,’” Jessica quoted. (Dune, 23)
And now for the much maligned descriptive tag verbs. By this I mean verbs that describe the manner, mood or inflection of the speech and not simply its function or form. Examples: shouted, whispered, yelled, murmured, cried, whimpered, observed, retorted, argued, snorted, stammered, exclaimed, demanded, snarled, growled, rumbled, breathed, roared, panted, blurted, etc.
Despite admonitions to the contrary, these verbs are clearly in wide use. I present hard numbers below; both functional and descriptive tags are less common than the basic “said” tag, but they are still quite common. I include a larger selection of samples below so you can see how they get used in practice.
Examples:
“Ah, but he has likely enough been adding to what he brought at first,” argued the miller, voicing common opinion. (The Lord of the Rings, 24)
“And you can say what you like, about what you know no more of than you do of boating, Mr. Sandyman,” retorted the Gaffer, disliking the miller even more than usual. (The Lord of the Rings, 24)
Note also the use of a simultaneous action (see below).
“Well, er, yes, I suppose so,” stammered Bilbo. (The Lord of the Rings, 33)
“I am afraid, Mr. Darcy,” observed Miss Bingley in a half whisper, “that this adventure has rather affected your admiration of her fine eyes.” (Pride and Prejudice, 41)
“I deserve neither such praise nor such censure,” cried Elizabeth; “I am NOT a great reader, and I have pleasure in many things.” (Pride and Prejudice, 42)
“Who are you?” he whispered. “How did you trick my mother into leaving me alone with you? Are you from the Harkonnens?” (Dune, 8)
“They were toys compared to me,” Piter snarled. (Dune, 18)
“Look down, lad,” Gurney panted. (Dune, 35)
This is one of the few cases that I would recommend avoiding outright. Some authors use tag verbs that do not denote speech at all. This is generally jarring, because assigning a non-speech verb to dialogue is contradictory.
Notably, out of all of the forms I’ve identified, this is the only one which I cannot find in widespread use in the writings of well-known authors.
Example:
This is an example from Gilbert the Great by Jane Clarke.
“You didn’t make Raymond leave,” Mom smiled. “Everyone fights sometimes.” (13)
This probably would have been better phrased as an action beat (i.e., by changing the comma to a period; see below).
This is another case that is often recommended against in writing guides. Again, it is clearly in common use in practice, though I think not as widely as what I call descriptive tags (see above). I have struggled to find an example of Herbert using an adverb in a dialogue tag, for example.
The rationale against these is similar to descriptive tags: it can potentially distract from the dialogue and may hide weak dialogue. However, the benefits are also clear: adverbs and other phrases can serve to strengthen dialogue.
Examples:
Adverbs:
“There wasn’t any permanent harm done, was there?” asked Frodo anxiously. (The Lord of the Rings, 48)
“Do you not want to know who has taken it?” cried his wife impatiently. (Pride and Prejudice, 11)
“We are not in a way to know WHAT Mr. Bingley likes,” said her mother resentfully, “since we are not to visit.” (Pride and Prejudice, 14)
Similarly, a prepositional phrase can follow the dialogue tag:
“My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?” (Pride and Prejudice, 11)
One thing to watch out for, if you use British style dialogue tags, is that the more complicated phrases can sound odd because of the unusual word order. The prepositional phrase almost makes it look like a normal sentence, but it’s not quite, since the dialogue tag still follows its own order. This is less of an issue in the American style since the dialogue tag follows the usual word order.
In some cases the tag is placed before the dialogue. Note that in all such cases I’m aware of, the tag follows standard English word order (i.e., no “said Bob”). Also note the use of the colon after the dialogue tag.
Examples:
He paused, and then said slowly in a deep voice: “This is the Master-ring, the One Ring to rule them all. […]” (The Lord of the Rings, 50)
She said: “He’s a cautious one, Jessica.” (Dune, 6)
This is a quotation with no accompanying dialogue tag. The benefit of this form is that it makes the text flow more smoothly, since it does not interrupt the dialogue with tags. The potential pitfall is that it may not be obvious who is speaking, or the speaker may come across as a disembodied voice.
Examples:
“Then what happened after Bilbo escaped from him? Do you know that?” (The Lord of the Rings, 56)
“YOU want to tell me, and I have no objection to hearing it.” (Pride and Prejudice, 11)
“What’s in the box?”
“Pain.” (Dune, 9)
It’s possible to have a dialogue tag where the actual dialogue is implied rather than directly quoted.
Example:
Mr. Bennet replied that he had not. (Pride and Prejudice, 11)
This form tends to be more common in contemporary writing. A dialogue tag can be replaced by an action beat: a short sentence describing an action taken by a character, typically one representative of their mood or of the action of the scene. Critically, the action beat is its own sentence. Action beats can occur before, after, or between fragments of dialogue.
A potential pitfall is to attempt to combine it into the same sentence as the dialogue in a way that implies the use of a non-speech verb (see above). Instead, either split the action into its own sentence (to create an action beat) or keep the tag and add a simultaneous action (see below).
Examples:
Paul sat up, hugged his knees. “What’s a gom jabbar?” (Dune, 6)
Some action beats are more descriptive. For example, in these passages the action beat describes the speaker’s tone or manner of speaking.
Gandalf laughed. “I hope he will. But nobody will read the book, however it ends.” (The Lord of the Rings, 32)
“I asked you a question, Jessica!” The old woman’s voice was snappish, demanding. (Dune, 22)
If an action is concurrent with the dialogue, it can be stated alongside the dialogue tag.
Examples:
“Who will laugh, I wonder?” said Gandalf, shaking his head. (The Lord of the Rings, 25)
“YOU are dancing with the only handsome girl in the room,” said Mr. Darcy, looking at the eldest Miss Bennet. (Pride and Prejudice, 19)
A pattern that has become more common in contemporary writing is the concatenation of several actions and/or dialogue. For example, in the following passage, we have an action (staring) concurrent with internal dialogue, followed by external dialogue.
Example:
The Reverend Mother stared at him, wondering: Did I hear criticism in his voice? “We carry a heavy burden,” she said. (Dune, 12-13)
Though less common in contemporary writing, it’s worth pointing out that dialogue tags can be involved in other, more complex sentence structures. It’s probably best to avoid this form, but it’s worth recognizing that some of the best authors do use it:
Examples:
“Which do you mean?” and turning round he looked for a moment at Elizabeth, till catching her eye, he withdrew his own and coldly said: "She is tolerable, but not handsome enough to tempt ME; […] (Pride and Prejudice, 19)
There is no hard and fast rule for internal dialogue (a.k.a. thought bubbles), but in contemporary writing this is often done with italics. In other respects this follows the form of other dialogue.
Examples:
How could this be a test? he wondered. (Dune, 9)
One odd aspect of internal dialogue is that there is sometimes an interplay between thoughts that are expressed as exposition versus those expressed as dialogue. For example, in the following passage, all sentences represent thoughts at some level, but only one is italicized (and therefore being presented directly as internal dialogue). Note that in the non-italicized text, references to the character’s mother are in the third person.
They spoke truth. His mother had undergone this test. There must be terrible purpose in it … the pain and fear had been terrible. (Dune, 11)
Often, it’s best to keep a mix of different forms. This can help to keep the dialogue fresh and to maintain good flow, while inserting just enough tags so that it’s clear who is speaking in each paragraph.
Example:
This passage has a dialogue tag followed by a prepositional phrase, a tag with implied dialogue, a functional tag, an action beat (no dialogue), a descriptive tag, no tag, and exposition (no dialogue). Also, keep reading in the book for a long section with no tags at all.
“My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?”
Mr. Bennet replied that he had not.
“But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.”
Mr. Bennet made no answer.
“Do you not want to know who has taken it?” cried his wife impatiently.
“YOU want to tell me, and I have no objection to hearing it.”
This was invitation enough.
(Pride and Prejudice, 11)
The qualitative analysis above is hopefully sufficient to demonstrate that all of the forms above are used in one way or another somewhere in the writings of famous authors. However, an equally important question is how often? After all, if a form is very rare, one might be better off avoiding it altogether.
I’ve done a (partially automated) analysis of a couple of the texts to determine exactly how often each of the forms is used. Hopefully this helps shed some additional light on exactly how common these forms are.
As above, keep in mind that these results are still specific to a particular author’s style, and possibly genre, time of writing, etc. But I am making the scripts for my analysis available so that others can reproduce my results and even perform the same analysis on their favorite authors or genres.
First up: verbs. What verbs are used and how often?
In order to determine this, I performed an analysis to find all verbs used for dialogue tags in The Lord of the Rings (LOTR), Pride and Prejudice (P&P), Harry Potter and the Philosopher’s Stone (HP1), and The President Is Missing (President). I found a total of 67 in LOTR, 19 in P&P, and 30 in President. I manually sorted the verbs into categories: basic (i.e., said), functional (11 verbs in LOTR, 10 in P&P, 7 in HP1, 9 in President), and descriptive (55 verbs in LOTR, 7 in P&P, 56 in HP1, 19 in President). Then I performed an analysis to determine how often each verb is used. Totaling these by category, we find:
| Category | LOTR | P&P | HP1 | President |
|---|---|---|---|---|
| Basic | 3487 (75%) | 302 (50%) | 715 (77%) | 464 (83%) |
| Functional | 475 (10%) | 195 (32%) | 45 (5%) | 57 (10%) |
| Descriptive | 676 (15%) | 107 (18%) | 165 (18%) | 41 (7%) |
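The verb-extraction step behind counts like these can be sketched roughly as follows. This is a hypothetical illustration, not my actual scripts (those are linked at the bottom of the post): the regex only handles British-style tags following a closing quote, and the category sets here contain just a few sample verbs rather than the full lists from the study.

```python
import re
from collections import Counter

# Match a verb followed by a (possibly "the"-prefixed) capitalized name
# immediately after a closing quotation mark, e.g. '" said the Gaffer'.
TAG = re.compile(r'[”"]\s*([a-z]+)\s+(?:the\s+)?[A-Z][a-z]+')

# Tiny illustrative category sets (the real study's lists are much longer).
FUNCTIONAL = {"asked", "replied", "answered", "continued", "added", "repeated"}
DESCRIPTIVE = {"cried", "shouted", "whispered", "muttered", "snarled"}

def verb_categories(text):
    """Count British-style tag verbs by category."""
    counts = Counter({"basic": 0, "functional": 0, "descriptive": 0, "other": 0})
    for verb in TAG.findall(text):
        if verb == "said":
            counts["basic"] += 1
        elif verb in FUNCTIONAL:
            counts["functional"] += 1
        elif verb in DESCRIPTIVE:
            counts["descriptive"] += 1
        else:
            counts["other"] += 1
    return counts

sample = ('"You’re right, Dad!" said the Gaffer. '
          '"Well?" asked Frodo. "No!" cried Elizabeth.')
print(verb_categories(sample))
```

A pattern like this inevitably produces some false positives and misses, which is why the study still involved manual sorting of the resulting verb lists.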
In case you were wondering what those verbs are, here are the top 10 most used verbs for each category (I skip the basic category below as it only contains a single verb).
Functional:
| Verb (LOTR) | Ct. | Verb (P&P) | Ct. | Verb (HP1) | Ct. | Verb (President) | Ct. |
|---|---|---|---|---|---|---|---|
| answered | 214 | replied | 90 | asked | 19 | asks | 37 |
| asked | 166 | added | 47 | told | 8 | adds | 6 |
| thought | 41 | continued | 23 | added | 7 | continues | 3 |
| added | 25 | thought | 11 | came | 4 | told | 2 |
| replied | 8 | repeated | 7 | began | 4 | tell | 2 |
| came | 6 | answered | 6 | repeated | 2 | responds | 2 |
| began | 5 | began | 5 | finished | 1 | begins | 2 |
| repeated | 4 | adding | 3 | answers | 2 | ||
| interrupted | 3 | returned | 2 | goes | 1 | ||
| continued | 2 | repeating | 1 | ||||
Descriptive:
| Verb (LOTR) | Ct. | Verb (P&P) | Ct. | Verb (HP1) | Ct. | Verb (President) | Ct. |
|---|---|---|---|---|---|---|---|
| cried | 303 | cried | 86 | shouted | 17 | whispers | 13 |
| muttered | 52 | observed | 9 | whispered | 13 | calls | 4 |
| shouted | 44 | exclaimed | 4 | snapped | 12 | comes | 3 |
| whispered | 32 | called | 4 | cried | 9 | whispered | 2 |
| laughed | 27 | exclaiming | 2 | heard | 8 | spits | 2 |
| called | 26 | whispered | 1 | yelled | 7 | mumbles | 2 |
| hissed | 23 | rejoined | 1 | muttered | 6 | hisses | 2 |
| growled | 21 | snarled | 5 | agrees | 2 | ||
| exclaimed | 15 | growled | 5 | yells | 1 | ||
| snarled | 11 | called | 5 | snaps | 1 | ||
Clearly, Tolkien and Rowling like to use descriptive tags, as there is a wide assortment of verbs in use throughout the text. Austen and Patterson make more minimal use. On the other hand, Tolkien and Austen have robust use of functional tags; this seems to be more minimal in Rowling and Patterson.
What about adverbs? In LOTR, I find 76 adverbs used in 186 dialogue tags; in P&P, 14 adverbs (in 19 dialogue tags); and in HP1, 81 adverbs (in 157 dialogue tags). This makes the use of adverbs much less common than functional or descriptive verbs, but still noticeably prevalent. Interestingly, I was not able to find adverbs in dialogue tags in President at all.
Among Tolkien, Austen and Rowling the ten most common adverbs are:
| Adverb (LOTR) | Count | Adverb (P&P) | Count | Adverb (HP1) | Count |
|---|---|---|---|---|---|
| softly | 14 | impatiently | 3 | suddenly | 10 |
| quietly | 13 | warmly | 2 | quietly | 7 |
| suddenly | 10 | drily | 2 | quickly | 6 |
| sadly | 10 | stoutly | 1 | loudly | 6 |
| slowly | 9 | resentfully | 1 | finally | 6 |
| sternly | 8 | melancholy | 1 | coldly | 4 |
| grimly | 7 | immediately | 1 | sleepily | 3 |
| sharply | 6 | hastily | 1 | shortly | 3 |
| gloomily | 5 | gravely | 1 | miserably | 3 |
| eagerly | 5 | coolly | 1 | irritably | 3 |
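Detecting these adverbs can be approximated with a simple "-ly" heuristic layered on top of the tag pattern. This is a hypothetical sketch, not the study's actual scripts: the "-ly" heuristic both over-matches (not every "-ly" word is an adverb) and under-matches (adverbs like "fast"), so some manual review is still needed.

```python
import re

# British-style tag followed by an "-ly" word, e.g. '" said Frodo anxiously'.
ADVERB_TAG = re.compile(r'[”"]\s*([a-z]+)\s+[A-Z][a-z]+\s+([a-z]+ly)\b')

def adverb_tags(text):
    """Return (verb, adverb) pairs for tags of the form 'verb Name adverb-ly'."""
    return ADVERB_TAG.findall(text)

sample = '"Be careful," said Frodo anxiously. "Fine," said Sam.'
print(adverb_tags(sample))  # [('said', 'anxiously')]
```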
What about different styles of tags? Standard tags (i.e., in suffix position) vs. tags with adverbs vs. prefix tags vs. no tag, etc. This analysis is harder due to the sheer number of variations, but I was able to get some numbers. Note that I don’t report numbers for tags with simultaneous actions below, since it is challenging to track these with a fully automated analysis.
| Style | LOTR | P&P | HP1 | President |
|---|---|---|---|---|
| Standard Tag | 4450 (83%) | 458 (37%) | 714 (48%) | 532 (29%) |
| Tag with Adverb | 188 (3%) | 19 (2%) | 157 (10%) | 0 (0%) |
| Prefix Tag | 138 (3%) | 127 (10%) | 54 (4%) | 30 (2%) |
| No Tag | 511 (9%) | 633 (51%) | 491 (33%) | 882 (48%) |
| Action Beat | 93 (2%) | 12 (1%) | 81 (5%) | 385 (21%) |
This table is interesting because there is a striking difference between the authors. Tolkien by far prefers standard tags, followed by much smaller use of the others. Austen prefers no tag, followed by standard tags, with a smaller number of prefix tags and almost none of the other forms. Rowling prefers standard tags, then no tag, with smaller amounts of the others. Patterson prefers no tag, then standard tags, then action beats.
Unfortunately this analysis isn’t able to reveal why these authors vary so substantially. Differences could be attributed to genre, time period, author’s personal style, or something else I haven’t thought of. Without more data it’s impossible to say. But at the least this analysis demonstrates that there is substantial variation among authors who are widely considered to be successful.
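The style-classification step can be sketched as a simple decision cascade. This is a hypothetical illustration only, not my actual scripts: it checks just a few verbs, and it lumps action beats in with untagged dialogue rather than separating them as the full analysis does.

```python
import re

# A few sample tag verbs; the real analysis used the full per-book verb lists.
VERBS = r'(?:said|asked|replied|cried|whispered)'

def classify(paragraph):
    """Very rough per-paragraph tag-style classifier (illustrative only)."""
    if not re.search(r'["“”]', paragraph):
        return "no dialogue"
    # Prefix tag: verb, then a colon, then an opening quote.
    if re.search(VERBS + r'[^"“”]*:\s*["“]', paragraph):
        return "prefix tag"
    # Standard (suffix) tag: a tag verb right after a closing quote.
    if re.search(r'["”],?\s*' + VERBS, paragraph):
        return "standard tag"
    return "no tag / action beat"

for p in ['"Pain," said Bob.', 'He said: "Hello."', '"Who goes there?"']:
    print(p, '->', classify(p))
```

Even a cascade this crude makes clear why simultaneous actions were excluded above: distinguishing an action beat from a tag with a trailing participial phrase requires parsing, not just pattern matching.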
If you’re interested in replicating the analysis above, or want to perform it on a new book, I have made an open-source release of the scripts used in creating this study, along with instructions for replicating my results. Please see the link below and submit an issue if you run into any trouble with the instructions: