<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.3.4">Jekyll</generator><link href="https://qipeng.me//feed.xml" rel="self" type="application/atom+xml"/><link href="https://qipeng.me//" rel="alternate" type="text/html" hreflang="en"/><updated>2026-03-06T23:06:55-08:00</updated><id>https://qipeng.me//feed.xml</id><title type="html">blank</title><subtitle>Peng Qi is an AI researcher working on natural language processing, machine learning, and multimodal agents. </subtitle><entry><title type="html">How to Tell If Your AI Truly Generalizes</title><link href="https://qipeng.me//blog/ai-generalization/" rel="alternate" type="text/html" title="How to Tell If Your AI Truly Generalizes"/><published>2026-03-06T05:00:00-08:00</published><updated>2026-03-06T05:00:00-08:00</updated><id>https://qipeng.me//blog/ai-generalization</id><content type="html" xml:base="https://qipeng.me//blog/ai-generalization/"><![CDATA[<figure> <center> <img src="/blog/ai-generalization/planet-volumes-4IrVnSpwk48-unsplash.jpg" width="90%"/> </center> <figcaption>Photo by <a href="https://unsplash.com/@planetvolumes?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Planet Volumes</a> on <a href="https://unsplash.com/photos/a-red-planet-with-a-black-background-4IrVnSpwk48?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Unsplash</a>. </figcaption> </figure> <p>Humanity developed the best AI model on Earth and brought it to Mars. How do we make sure that it still works when there’s no one around to fix it when it breaks?</p> <p>Questions similar to this are what NASA scientists and engineers solve for lunch every day — and so have applied machine learning researchers, including those who worked on natural language processing, for a long time. 
The underlying question is the same: if we build a system in Environment A and use it in Environment B, how do we make sure that it functions as flawlessly as possible in Environment B?</p> <p>In the context of modern AI, this problem takes the now well-popularized name of “<strong>generalization</strong>”. How do we make sure the AI systems we build actually work well when they are deployed, where they are usually exposed to new data? There is a striking parallel between how NASA technologists and AI practitioners approach this problem effectively, which we will come back to at the end of this post.</p> <p>To understand what “generalization” means and how to achieve it in AI systems, let us take a quick crash course in statistical machine learning.</p> <h2 id="part-i-what-do-we-mean-when-we-say-generalization">Part I: What Do We Mean When We Say “Generalization”?</h2> <h3 id="models-are-assumptions-plus-parameters">Models are assumptions plus parameters</h3> <p>Most prevalent modern AI systems, including large language models and other foundation models, are built on statistical learning approaches.</p> <p>The classic joke about machine learning — illustrated memorably in an <a href="https://xkcd.com/1838/">XKCD comic</a> — is that you pour the data into a pile of linear algebra and just stir until the answers start looking right. There is more truth to that than we sometimes like to admit.</p> <figure> <center> <img src="/blog/ai-generalization/xkcd-ml-comic.jpeg" width="50%"/> </center> <figcaption>Source: <a href="https://xkcd.com/1838/">xkcd.com/1838</a> </figcaption> </figure> <p>At its core, a machine learning <strong>model</strong> is the combination of two things: <strong>assumptions</strong> and <strong>parameters</strong>.</p> <p><strong>Assumptions</strong> encode what we believe about the structure of the problem before seeing any data.
For instance: “data points that are similar to each other tend to share similar labels” or “the pattern we’re trying to learn is not infinitely complex.” These beliefs let us narrow down the infinite space of possible explanations for what we observe.</p> <p><strong>Parameters</strong> are our best guesses, given the data we’ve actually seen, about how to fill in the details of those assumptions. They are what the training process actually optimizes. Once training is done, the model essentially says: <em>here is the best explanation of the world I can construct from the examples I was given.</em></p> <p>This is all well and good — until you realize that the observed data is always just a tiny, finite sample. There are infinitely many explanations that could fit any finite dataset perfectly well. The model picks one. Whether it picks the <em>right</em> one — meaning, whether its assumptions and parameters will continue to work on data it has never seen — is the entire question of generalization.</p> <figure> <center> <img src="/blog/ai-generalization/spiral-data.png" style="width:42%; display:inline-block; margin-right:4%"/><img src="/blog/ai-generalization/spiral-model.jpeg" style="width:42%; display:inline-block"/> </center> <figcaption>A model (right) tries to capture the structure of the data it was trained on (left). The decision boundary it learns reflects both its assumptions and what it inferred from the observed examples. </figcaption> </figure> <h3 id="the-limits-of-finite-observations">The limits of finite observations</h3> <p>This leads to three fundamental caveats that underpin everything discussed in this post:</p> <ol> <li><strong>Is the observed data representative of the entire problem?</strong> You’ve labeled thousands of cat and dog photos — but are you sure the real problem only ever involves cats and dogs? What if users also photograph owls? 
The data can only tell you about what it contains; it cannot tell you what you are missing.</li> <li><strong>Is the observed data representative enough to estimate parameters?</strong> Of your photos, 500 show running dogs but only 5 show running cats. You will learn what a running dog looks like — but your estimates for running cats will be unreliable.</li> <li><strong>Does the observed data faithfully represent the data the model will actually be used on?</strong> You trained on domestic cats and pet dogs. In production, users submit photos of lions and wolves. Same broad family — very different distribution.</li> </ol> <figure> <center> <img src="/blog/ai-generalization/spiral-three-class.png" style="width:42%; display:inline-block; margin-right:4%"/><img src="/blog/ai-generalization/spiral-alternative-fits.jpeg" style="width:42%; display:inline-block"/> </center> <figcaption>Given the same finite dataset (left), there are infinitely many models that fit it equally well. The true underlying distribution could be any of them. Assumptions are what narrow the field. </figcaption> </figure> <p>The gap between how your model performs on its training data and how it performs on new, unseen data is the <strong>generalization gap</strong>. Closing — or at least honestly measuring — that gap is what this post is about.</p> <h2 id="part-ii-how-can-we-think-about-generalization-in-the-real-world">Part II: How Can We Think About “Generalization” in the Real World?</h2> <h3 id="what-generalization-means-for-an-ai-product">What generalization means for an AI product</h3> <p>What does it actually mean for an AI product to “generalize”? 
In practice, it usually means all of the following:</p> <ul> <li>It works on customer data very well with minimal to no adaptation after being deployed.</li> <li>Any necessary adaptation can be performed quickly enough to retain customer trust.</li> <li>Sometimes you cannot even see customer data in production, so you have to get it right on the first try.</li> </ul> <p>This sets up the central challenge of AI product development: <strong>how can we best estimate the customer’s experience with our AI system when we cannot access their real production traffic ahead of time?</strong></p> <p>The answer sounds deceptively simple: <em>you need a held-out test set — and a good metric — that reflects the customer experience as closely as possible.</em> But getting both of those things right requires a lot of care.</p> <h3 id="held-out-test-sets-are-non-negotiable">Held-out test sets are non-negotiable</h3> <p>When we follow the “make assumptions → estimate parameters” training process, we find models that explain the observed training data very well. That is precisely the problem: the model has had a chance to improve <em>specifically on</em> the data it is now being tested on. Evaluating a model on its own training data tells you almost nothing about whether it will generalize.</p> <p>A model evaluated on its training data is essentially a student taking an exam with the question sheet, the answer key, and the worked solutions all in front of them. The outcome tells you nothing about how that student will perform when they walk into an unfamiliar exam.</p> <figure> <center> <img src="/blog/ai-generalization/friends-dont-let-friends-meme.jpeg" width="60%"/> </center> </figure> <p>This is why <strong>held-out test sets</strong> are non-negotiable. A held-out test set is a set of examples that the model has never seen — not during training, not during development, not even during informal spot checks. 
It is the only honest estimate you have of how the model will behave on new data. Once you create a held-out test set, it should be:</p> <ul> <li><strong>Used rarely.</strong> Evaluating on it too frequently allows its results to unconsciously influence decisions, which slowly corrupts its value.</li> <li><strong>Analyzed sparingly.</strong> If you dig into every failure on the test set and use those insights to improve the model, you are leaking information from the test set into your development process. Over time, this turns the “test” set into a disguised part of the training process.</li> <li><strong>Updated carefully.</strong> As your product requirements evolve, your test set should evolve with them. But every update warrants scrutiny.</li> </ul> <p>Customer-provided datasets are almost always best reserved as test data. They are pristine, real, and uncontaminated by your development choices — qualities that are hard to preserve once data gets pulled into the training pipeline.</p> <h3 id="the-metric-matters-as-much-as-the-data">The metric matters as much as the data</h3> <p>Many teams focus obsessively on collecting more data while choosing their evaluation metric almost as an afterthought. This is backwards. Your evaluation metric determines two extremely high-stakes things:</p> <ol> <li><strong>How you perceive the quality of your system</strong> — whether you feel confident shipping it or not.</li> <li><strong>How you make decisions about where to invest</strong> — whether to work on more data, better models, or improved post-processing.</li> </ol> <p>Getting the metric wrong can lead you confidently in the wrong direction for months.</p> <figure> <center> <img src="/blog/ai-generalization/metric-matters-chart.jpeg" width="55%"/> </center> <figcaption>The same underlying reality can look very different depending on which metric you use. 
</figcaption> </figure> <p>A few dimensions worth thinking through carefully:</p> <p><strong>Measurement granularity.</strong> Are you measuring what matters most to the customer? Consider a form-filling AI agent. Measuring accuracy at the individual field level (did we fill in the right value for each field?) feels natural. But if your customer cares whether a <em>whole form</em> was filled out correctly — because an incorrect form triggers an expensive manual review — then field-level accuracy can badly mask what the customer actually experiences. A form with 20 fields and one wrong field is a failure to the customer, even if it looks like 95% accuracy to you.</p> <p><strong>Level sets.</strong> When there are multiple ways to achieve the same aggregate performance number, do they feel the same to the customer? If your classifier is 80% accurate on a dataset where 80% of the examples belong to a single class, a naïve model that always predicts the majority class would score 80% — but it would be completely useless. Make sure your metric distinguishes performance that is genuinely useful from performance that just looks good on paper.</p> <p><strong>Aggregation.</strong> How would customers perceive quality in practice? Is their experience the average outcome across many interactions, or is it the worst-case outcome they cannot tolerate? If a customer runs your AI system in a batch job where all errors are equally bad, a macro average might be appropriate. If they care most about not having any catastrophic failures, you might want to track the tail of the error distribution instead.</p> <p>There is no universally correct metric. 
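</p> <p>To make the granularity point concrete, here is a tiny sketch that scores the same predictions at both granularities (the three-form dataset is made up purely for illustration):</p>

```python
# Sketch: the same predictions scored at two granularities.
# Each form is a list of (predicted_value, gold_value) pairs; the data is made up.
forms = [
    [("Jane", "Jane"), ("NY", "NY"), ("2024", "2024")],  # fully correct
    [("John", "John"), ("CA", "WA"), ("2023", "2023")],  # one field wrong
    [("Ava", "Ava"), ("TX", "TX"), ("2022", "2021")],    # one field wrong
]

fields = [pair for form in forms for pair in form]
field_acc = sum(p == g for p, g in fields) / len(fields)
form_acc = sum(all(p == g for p, g in form) for form in forms) / len(forms)

print(f"field-level accuracy: {field_acc:.0%}")  # 78%: looks healthy
print(f"form-level accuracy:  {form_acc:.0%}")   # 33%: two of three forms fail
```

<p>The system looks healthy at 78% field accuracy while failing two out of every three forms; which number you track determines which story you believe.</p> <p>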
The right metric is the one that best simulates the customer’s own judgment of your product’s quality.</p> <h3 id="test-set-size-and-statistical-power">Test set size and statistical power</h3> <p>Once you have a well-designed test set and a good metric, a question that often goes underappreciated is: <strong>how big does the test set need to be?</strong></p> <p>Consider a coin. If a coin is biased 99% toward heads, you can probably detect the bias after 10 flips. But if a coin is biased 51% toward heads, you need hundreds or thousands of flips to reliably detect that bias over the noise of random variation. The same principle applies to evaluating AI systems.</p> <p>The rough rule of thumb is that the smallest difference in performance you can reliably detect from a dataset scales with <strong>1/√n</strong>, where n is the number of examples in your test set. A test set of 100 examples can reliably distinguish improvements of roughly 10 percentage points. A test set of 10,000 examples can reliably distinguish improvements of roughly 1 percentage point. If you are trying to make fine-grained improvements to a model that is already performing well, you need a substantially larger test set to tell signal from noise.</p> <p>The same applies to <strong>stochasticity</strong> in model outputs. If your model has randomness in it (as most language models do), then evaluating the same example once and evaluating it five times with majority vote are very different propositions. The more variability there is in model outputs, the more repetitions you need to establish the true average performance.</p> <p>The practical upshot: <strong>prioritize putting your highest-quality data into your test set first</strong>, then into a development set, and only then into training. It is tempting to throw everything into training to maximize model performance. 
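</p> <p>The arithmetic behind that rule of thumb fits in a few lines: the 95% margin of error of an accuracy estimated on n examples is at most 1.96 × √(0.25/n), which is roughly 1/√n. A quick sketch (this is the standard binomial confidence interval, nothing model-specific):</p>

```python
import math

# Approximate 95% margin of error for an accuracy measured on n test examples.
# Worst case (p = 0.5): 1.96 * sqrt(0.25 / n), i.e., roughly 1 / sqrt(n).
def margin_of_error(n, p=0.5, z=1.96):
    return z * math.sqrt(p * (1 - p) / n)

for n in (100, 1_000, 10_000):
    print(f"n = {n:>6}: differences below ~{margin_of_error(n):.1%} are noise")
```

<p>This reproduces the numbers above: roughly 10 percentage points at 100 examples, and roughly 1 point at 10,000.</p> <p>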
But a small, unreliable test set will give you false confidence and lead to bad release decisions.</p> <h3 id="dependency-quietly-destroys-test-power">Dependency quietly destroys test power</h3> <p>There is a subtler issue that often goes completely unnoticed even by experienced practitioners: <strong>dependency between test examples undermines the statistical power of your test set.</strong></p> <p>Think about a product durability study. Suppose you want to know how long a product lasts before breaking. You could:</p> <ul> <li><strong>(a)</strong> Recruit 10 people, have them use the product for 10 months, and measure every 3 days. That gives you 1,000 measurements.</li> <li><strong>(b)</strong> Recruit 90 people, have them use the product for 10 months, and measure every month. That gives you 900 measurements.</li> <li><strong>(c)</strong> Recruit 500 people, have them use the product for 10 months, and measure only at the start and end. That gives you 1,000 total measurements.</li> </ul> <p>On paper, (a) and (c) give you the same number of data points. But (a) gives you 1,000 measurements from 10 people; those measurements are highly correlated with each other because they come from the same individuals in the same context. Option (c) gives you 500 independent perspectives. The effective sample size in (a) is much closer to 10 than to 1,000.</p> <p>The same applies to AI test sets. If your test set contains 10 Q&amp;A pairs generated from the same document, those 10 pairs are not 10 independent data points — they are 10 highly correlated samples from 1 document. If your model happens to struggle with the information in that document, all 10 examples fail together; if it happens to handle that document well, all 10 pass together. 
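</p> <p>This effect can be quantified with the standard “design effect” correction from survey statistics: with clusters of m correlated examples and within-cluster correlation ρ, the effective sample size is n / (1 + (m - 1)ρ). A sketch (the formula is standard; the numbers are made up):</p>

```python
# Effective sample size under within-cluster correlation ("design effect").
# n: total examples; m: examples per cluster (e.g., Q&A pairs per document);
# rho: within-cluster correlation (0 = independent, 1 = effectively identical).
def effective_sample_size(n, m, rho):
    return n / (1 + (m - 1) * rho)

n = 1_000
print(effective_sample_size(n, m=1, rho=0.0))   # 1000.0: fully independent
print(effective_sample_size(n, m=10, rho=0.5))  # ~182: heavily correlated
print(effective_sample_size(n, m=10, rho=1.0))  # 100.0: 10 copies per document
```

<p>A test set of 1,000 Q&amp;A pairs drawn 10-to-a-document, with perfectly correlated outcomes within each document, carries the statistical weight of only 100 independent examples.</p> <p>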
In terms of what the test set can actually tell you about generalization, you are much closer to having 1 example than 10.</p> <p>The rule of thumb is harsh but important: <strong>dependent examples do not count toward your total examples.</strong> When designing your test set, structure it so that your examples are as independent from each other as possible.</p> <h2 id="part-iii-what-can-we-do-to-develop-generalizable-ai-systems">Part III: What Can We Do to Develop “Generalizable” AI Systems?</h2> <h3 id="designing-your-devtest-split">Designing your dev/test split</h3> <p>Given all of the above, the question of <em>how to split your data</em> into training, development, and test sets turns out to be much more consequential than it first appears. A random split is almost never the right answer. Instead, the split should be designed to simulate the <strong>true generalization gap</strong> — the gap between the environment in which the model was built and the environment in which it will be deployed.</p> <p>The better you simulate that gap in your evaluation setup, the more reliably your test set performance predicts actual customer experience. If you artificially close the gap, you inflate your test numbers, ship with false confidence, and get an unpleasant surprise in production. Cheating on your own evaluation doesn’t help anybody.</p> <p>Let us walk through a few concrete cases that illustrate how to think about this.</p> <p><strong>Case 1: Wikipedia Q&amp;A.</strong> Imagine building an AI agent that answers questions about Wikipedia pages. You’ve collected a dataset of Q&amp;A pairs: for each of 5,000 Wikipedia pages, you have roughly 10 questions and answers about the content of that page.</p> <p>The naive approach is to randomly split the Q&amp;A pairs into train, dev, and test. But this means the same Wikipedia page can appear in all three splits. 
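</p> <p>The fix is mechanical once you see it: shuffle and hold out <em>pages</em>, not individual Q&amp;A pairs. A minimal sketch (the tuple layout and page counts here are hypothetical; scikit-learn’s <code>GroupShuffleSplit</code> does the same job at scale):</p>

```python
import random

# Sketch: hold out whole pages so no page appears on both sides of the split.
# Each example is a (page_id, question, answer) tuple; the data is made up.
def split_by_page(examples, test_frac=0.2, seed=0):
    pages = sorted({page for page, _, _ in examples})
    random.Random(seed).shuffle(pages)
    n_test = max(1, int(len(pages) * test_frac))
    test_pages = set(pages[:n_test])
    train = [ex for ex in examples if ex[0] not in test_pages]
    test = [ex for ex in examples if ex[0] in test_pages]
    return train, test

# 50 hypothetical pages, 10 questions each
examples = [(f"page_{i}", f"q{i}_{j}", f"a{i}_{j}")
            for i in range(50) for j in range(10)]
train, test = split_by_page(examples)
assert not {p for p, _, _ in train} & {p for p, _, _ in test}
```

<p>The page-level split guarantees that every test question is about a page the model has never seen.</p> <p>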
A model that has seen 8 of 10 Q&amp;A pairs from a page during training has essentially memorized the content of that page. The remaining 2 Q&amp;A pairs from the same page are not a fair test of generalization — they are a test of memorization. A correct split here is by <em>page</em>: train on some pages, develop on a held-out set of pages, test on yet another held-out set of pages that the model has never encountered.</p> <p><strong>Case 2: WikiData Q&amp;A.</strong> Now imagine the same setup, but for WikiData, which encodes relationships <em>between</em> entities across pages (e.g., “Jill Biden is the spouse of Joe Biden”). A Q&amp;A pair about Jill Biden’s relationship to Joe Biden involves information from both of their pages. Even if you split by page, a model that has seen Joe Biden’s page during training and Jill Biden’s page during testing has still been exposed to relevant training signal. The true generalization challenge requires splitting on entity clusters, so that no entity and its close relations appear on both sides of the split.</p> <p><strong>Case 3: Automated stock trading.</strong> An AI system trained on social media and SEC filings to trade stocks. The data spans 1990–2024. A random split would put examples from 1995 in training and examples from 1995 in the test set. But time is a fundamental dependency here — the market in 2010 is not independent of the market in 2009. More importantly, a model that has seen the future is not generalizing; it is cheating. The correct split is strictly temporal: train on data up to some year, develop on the following year or two, test on the most recent data. This simulates the actual deployment scenario.</p> <p><strong>Case 4: Form filling across jurisdictions.</strong> An AI agent that fills out credentialing forms for medical practitioners across many different states and jurisdictions. 
If you have 50 different form types with up to 10 examples each, the core question is whether the model can generalize to <em>unseen form layouts</em>. That means splitting by form type: some forms in training, held-out forms in development, held-out forms in testing. Doing a random split by example would let information from Form A leak into both training and test, making the test set much easier than the real deployment scenario.</p> <p>In each of these cases, the right split strategy requires thinking carefully about: <em>what is the source of variation in the deployment environment that the model has not seen during training?</em> The dev/test split should mirror that source of variation.</p> <h3 id="practical-tools-to-close-the-gap">Practical tools to close the gap</h3> <p>Building a high-quality, properly structured test set is necessary but not sufficient. You also need enough data to train a good model in the first place. Fortunately, there are several tools for augmenting your training data (not your test data):</p> <ul> <li><strong>Data synthesis</strong>: Generate synthetic examples using templates, heuristics, or other AI systems. 
This can be especially valuable for rare or hard edge cases that don’t appear often in naturally collected data.</li> <li><strong>Data augmentation</strong>: Create modified versions of existing training examples (e.g., paraphrasing, format variations, noise injection) to increase the diversity and volume of training data without collecting new examples from scratch.</li> <li><strong>Data annotation</strong>: Have human annotators label additional examples, particularly for the long-tail cases that matter most to customers but are underrepresented in naturally collected data.</li> <li><strong>Statistical comparison tools</strong>: Use significance testing to rigorously establish whether one model is actually better than another, rather than eyeballing metric differences.</li> </ul> <p>One note of caution: these tools should be applied to training and development data. Synthesizing or augmenting test data defeats the purpose of having a test set in the first place.</p> <h3 id="how-nasa-would-do-it">How NASA would do it</h3> <p>At the beginning of this post, I noted a striking parallel between how NASA engineers and AI practitioners approach the problem of ensuring a system works in an environment it wasn’t built in.</p> <p>NASA can’t test a Mars rover on Mars before sending it there. What they do instead is design testing environments on Earth that simulate Martian conditions as faithfully as possible: the right temperature, pressure, terrain, lighting, and gravity. They are deeply aware of the ways in which their simulation falls short of the real thing, and they invest heavily in closing that gap. The entire discipline of <em>systems validation and verification</em> is about building confidence that something will work in an environment you can’t directly access.</p> <p>Applied AI practitioners face exactly the same challenge. You can’t deploy your AI model to production and then decide whether to ship it. 
You need to make that decision in advance, based on evidence from a test environment you control. The quality of that test environment — how faithfully it mirrors the conditions the model will encounter in deployment — is what determines whether your confidence is justified.</p> <p>The teams that get this right are the ones that treat evaluation design with the same rigor as model design. A great model evaluated poorly will fail in ways that surprise you. A good model evaluated rigorously will give you exactly the confidence you need.</p> <h3 id="always-and-never">Always and never</h3> <p>To close, here is the simplest possible summary of the principles in this post:</p> <table> <thead> <tr> <th>Always</th> <th>Never</th> </tr> </thead> <tbody> <tr> <td>Use a held-out test set to estimate generalization performance</td> <td>Ship AI products without held-out testing</td> </tr> <tr> <td>Carefully design your test set and metrics to reflect the customer experience</td> <td>Use low-quality test sets, arbitrary metrics, or test sets that are too small or inflated by dependent examples</td> </tr> <tr> <td>Design your dev/test split to reflect the true generalization gap</td> <td>Do a random split or underestimate the generalization gap just to make your numbers look better</td> </tr> </tbody> </table> <p>The last row is worth emphasizing. It is always tempting to design your evaluation in a way that makes your model look good. But the only person you are fooling is yourself — and eventually, your customer will tell you what the honest evaluation would have shown all along. 
Better to find out now.</p> <hr/> <p><em>Based on an internal presentation at <a href="https://orby.ai">Orby AI</a> originally prepared for the entire team, to advocate good evaluation practices for the AI products we build.</em></p>]]></content><author><name></name></author><category term="Research"/><category term="AI"/><summary type="html"><![CDATA[Humanity developed the best AI model on Earth and brought it to Mars. How do we make sure that it still works when there's no one around to fix it when it breaks?]]></summary></entry><entry><title type="html">Why You Should Stop Using HotpotQA for AI Agents Evaluation in 2025</title><link href="https://qipeng.me//blog/stop-using-hotpotqa/" rel="alternate" type="text/html" title="Why You Should Stop Using HotpotQA for AI Agents Evaluation in 2025"/><published>2025-07-01T00:00:00-07:00</published><updated>2025-07-01T00:00:00-07:00</updated><id>https://qipeng.me//blog/stop-using-hotpotqa</id><content type="html" xml:base="https://qipeng.me//blog/stop-using-hotpotqa/"><![CDATA[<figure> <center> <img src="/blog/stop-using-hotpotqa/karsten-winegeart-01RYHGN-Zi0-unsplash.jpg" width="90%"/> </center> <figcaption>Photo by <a href="https://unsplash.com/@karsten116?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Karsten Winegeart</a> on <a href="https://unsplash.com/photos/a-traffic-light-with-a-red-heart-on-it-01RYHGN-Zi0?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Unsplash</a>. </figcaption> </figure> <blockquote> <p>That year, humanity launched their first functioning autonomous language agents into their collective cyberspace. Centuries later, historians of the Solar System will look back at this point in time, and only then, could they fully recognize the watershed moment that it was. 
What was formerly known as Year 2025 in the Common Era is now more colloquially known as Year 1 of the Agent Era.</p> <p><em>– An attempt at a Three Body-esque sci-fi narrative about AI agents</em></p> </blockquote> <p>Agents, agents everywhere. From countless startups to innumerable social media posts, AI agents are creeping into your IDEs, your browsers, and perhaps places you are not even looking at or consciously aware of. “Agents” are clearly the dominating narrative in artificial intelligence in 2025, and, given the share of the public discourse that AI itself takes up, firmly one of the most discussed topics in humanity’s collective consciousness this year. The fact that we do not even seem to have a shared definition for what AI agents are does not seem to deter us. We all seem to have developed a warm and buzzy feeling about them, and sincerely believe (or hope) that agents are “the next big thing that changes everything as we know it”.</p> <p>If you are following frontier AI agents research closely, you might have come across a benchmark called HotpotQA, which is often discussed in these research works. What is HotpotQA, and what does it have to do with the AI agents of today?</p> <h3 id="hotpotqa">HotpotQA</h3> <p>Back in 2018, some awesome collaborators (from CMU and Université de Montréal) and I published a paper called HotpotQA, which featured a novel (at the time) idea: question answering through multi-step reasoning in the wild.
Since its publication, HotpotQA has helped fuel some of the early ideas in modern AI agents.</p> <p>Consider the question <em>“What government position was held by the woman who portrayed Corliss Archer in the film Kiss and Tell?”</em> To answer this question, you’d first need to figure out <em>“who portrayed Corliss Archer in the film Kiss and Tell”</em> before you can answer what government position she held (it’s <a href="https://en.wikipedia.org/wiki/Shirley_Temple">Shirley Temple</a>, and her position was Chief of Protocol of the United States).</p> <p>This was a time before <a href="https://arxiv.org/abs/1810.04805">BERT</a> and any of the decoder-only breakthroughs in large language models we take for granted today (e.g., <a href="https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf">GPT‑2</a>, back when version numbers were still monotonically increasing integers). At that time, most AI systems that could answer questions from a large collection of text (like Wikipedia) followed a <strong>simple, scripted recipe</strong>: retrieve a handful of passages from the corpus using the question, and shovel them into a neural network alongside the question to get the answer.</p> <p>HotpotQA pioneered a class of AI tasks that requires the AI system to <strong>autonomously</strong> perform multiple steps of reasoning in an open-domain setting, where scripted approaches no longer worked well: each question in HotpotQA requires at least two passages from separate Wikipedia pages to answer, and the questions are designed so that at least one of those passages often won't surface if you search with the question itself.
What is more, different questions use different natural language reasoning patterns, which presents a challenge to any single script that aims to work for all of them.</p> <p>In its early days, researchers often challenged the motivation for HotpotQA by observing that the questions themselves are oftentimes artificial, and that users can still break their complex questions down into subquestions manually so that existing AI systems can handle them. Looking back after a few years, our answer to these very valid considerations seems to have stood the test of time:</p> <blockquote> <p>We would still like to build AI systems that <strong>can</strong> solve more challenging problems through more autonomous reasoning and acting, rather than always relying on humans to cope with the limitations of the AI systems that are currently available.</p> </blockquote> <p>HotpotQA, despite some of its limitations, was a step in the right direction to push the envelope of possible AI system capabilities, where we essentially over-sampled a group of small but challenging problems to stress test how well our systems perform.</p> <p>In my own follow-up work a year later, <a href="https://arxiv.org/pdf/1910.07000">GoldEn Retriever</a>, we explored a prototype of such an AI system. Instead of following a scripted procedure, GoldEn Retriever proposes its own search queries based on the original question and whatever passages have been retrieved in the previous iteration (initially empty). This allows the system to autonomously focus on information related to <em>“Kiss and Tell”</em> in the first iteration of information gathering, then use the retrieved information about the movie, combined with the original question, to redirect its focus to <em>“Shirley Temple”</em>, solving the question by going beyond the information immediately available in the original question.
This is a proof of concept for the ideas behind what would later become agentic retrieval and deep research, where the AI system is granted strong agency to determine what information to gather at each step, to gather information over multiple steps, and then to formulate a coherent answer given all the context it has collected.</p> <h3 id="why-you-should-stop-using-just-hotpotqa-for-agents-research">Why you should stop using (just) HotpotQA for agents research</h3> <p>Four score and seven months later, looking back on this journey, it is truly humbling to see the body of research work that this dataset has helped enable. I have constantly discovered new inspirations from the ingenuity of the community that works on AI agents, some of whom have found HotpotQA useful as a testbed.</p> <p>However, when used in isolation, the evaluation that HotpotQA provides is no longer adequate for today’s agentic systems.</p> <p>First, the task <strong>output format and evaluation metric</strong> are a little outdated. Like many question answering datasets of its day, HotpotQA is an extractive question answering dataset following the pioneering work in SQuAD, which means that each answer comes directly from a substring of the Wikipedia context that supports it. While this makes for a relatively easy-to-implement and objective evaluation metric, it results in a format that is no longer natural for today’s generative AI systems (or for real users), and is limited in its ability to evaluate semantically equivalent or otherwise feasible but unannotated answers.
For instance, in our example in the previous section, a system-predicted answer <em>“The actress in question is Shirley Temple, who held the position of Chief of Protocol of the United States”</em>, while perfectly sound to a human user, would not have received full credit under the exact match or F1 metrics, because it provides extra information that the specific answer format did not allow.</p> <p>Second, the <strong>task is not realistic enough</strong>, in the sense that it bakes in the assumption that <em>every single</em> question requires multiple steps of reasoning to arrive at the answer. In reality, this cannot be determined just by looking at the question itself, but also depends on the knowledge sources that are available, somewhat akin to the <a href="https://en.wikipedia.org/wiki/Halting_problem">halting problem</a> of programs (you cannot always determine whether a program will terminate on all possible inputs). In our example, it is plausible (though unlikely) that the Wikipedia page for Kiss and Tell mentions Shirley Temple’s government capacity as Chief of Protocol of the United States, for instance, if this position temporally overlapped with the film’s making, or had an effect on the artistic decision-making in its production. In that case, only a single step of retrieval and reasoning would have been sufficient, and doing more risks introducing more noise for the model to cut through. In my 2021 follow-up work, we attempted to address this benchmarking problem by creating a unified single- and multi-hop question answering dataset with the same underlying knowledge corpus, <a href="https://arxiv.org/pdf/2010.12527">BeerQA</a>,<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup> alongside an autonomous system that performs retrieve-and-read for as many steps as needed for the question at hand, given the information retrieved so far.
This more closely emulates the scenario where AI agents are being deployed: before exploring potential approaches to solve the given problem with the knowledge and tools provided, it is impossible to perfectly predict the complexity of the problem ahead of time.</p> <p>Third, the <strong>task is relatively contrived</strong>, and reflects only a small slice of what humans would like to delegate to intelligent systems. Question answering sits at the core of the human need to gather (and synthesize) knowledge, and undoubtedly improves upon the previous generation of technology that solved this problem, represented by powerful search engines. However, in many cases, knowledge gathering itself is not an end, but a means to an underlying one. One can gather knowledge to assist with more challenging decision-making, to educate oneself or others, to explore the boundary of adjacent knowledge, or even to serve as the basis of knowledge to help accomplish certain tasks, to name a few. As the capabilities of our AI models and systems improve, we should not stop at solving only parts of these problems, but should more aggressively explore their end-to-end solutions.</p> <p>Finally, as is the case with any benchmark, <strong>approaches overfit the benchmark over time</strong>. Almost everyone that has worked on HotpotQA is probably familiar with some of the first 10 dev set examples by now (our Shirley Temple example being one of them), and powerful large language models have also likely seen the entire development set with the correct answers in their training. One unfortunate fact about benchmarks, especially in an unchecked SOTA-or-reject publication culture, is that researchers are motivated to do whatever they need to hillclimb on the benchmarks that they work on. This has often led to the antipattern that a general-sounding approach is heavily tailored for different benchmarks to achieve the best results on those benchmarks.
This is especially troubling for works that claim to build general-purpose AI agents that function for a wide range of agentic tasks, because such customization clearly provides an unfair advantage based on information that is not derived by the agent, and therefore undermines the validity of any such generality claims.</p> <h3 id="what-to-do-instead">What to do instead</h3> <p>Having revisited the limitations that HotpotQA has, what can we do instead, as a community, to make true progress towards autonomous AI agents? Some of the answers lie in the four limitations I listed above. Instead of writing a carefully crafted research statement, I will just try to dump some of the ideas in a semi-consistent pile of text.</p> <p>First, <strong>avoid heavy customization of your agent design to each benchmark it is tested on</strong>. With the advances of modern foundation models, we might be able to get away without dataset-specific hyperparameter tuning, and really revisit the dream of artificial intelligence – a human-like algorithm that “just works” on any problem we throw at it, without excessive human involvement or problem-specific tuning. Like scientists in the natural sciences, perhaps we should set out to look for the “universal equations” where one set of constants governs everything in our observable universe. This also helps avoid the problem of working on unrealistic settings (e.g., all problems are exclusively multi-hop or complex), since now your AI system is held to the standard of solving a wide range of tasks with a single implementation.</p> <p>Second, <strong>work on agentic tasks beyond simple knowledge gathering</strong>. There has been a myriad of new agentic benchmarks that focus on evaluating complex reasoning behaviors and autonomous decision making.
Along similar lines to HotpotQA’s theme of context understanding and knowledge gathering are benchmarks like <a href="https://arxiv.org/pdf/2311.12983">GAIA</a> that stress-test agent capabilities in leveraging a diverse set of tools to gather correct information from the cyber world to answer questions. In the meantime, there has never been a better time to work on agentic tasks and benchmarks that evaluate agent capabilities to <strong>do</strong> various tasks and have real-world effects. Benchmarks like <a href="https://github.com/OSU-NLP-Group/Online-Mind2Web?tab=readme-ov-file">Online Mind2Web</a>, <a href="https://os-world.github.io/">OSWorld</a>, <a href="https://webarena.dev/">WebArena</a>, <a href="https://arxiv.org/pdf/2403.07718">WorkArena</a>, among others, evaluate AI agents’ ability to perform actions in a real or simulated digital environment to achieve goals that not only derive answers to questions, but also sometimes change the state of the world in desirable ways. A natural extension of these digital agents into the physical world is embodied AI agents, where, despite many impressive breakthroughs coming from computer vision and robotics, much of the exploration is still in its infancy.</p> <p>Third, <strong>with capable agents, there’s unprecedented space for discovery of new tasks that we couldn’t imagine working, new interactions we didn’t think mature enough, or those that they don’t do well yet</strong>. In the realm of things that we couldn’t imagine working are various creative works that leverage language-based AI agents’ role-playing capabilities to simulate a well-organized group of humans working towards a goal, or simply a loosely organized group going about their “lives” that exhibits interesting social and/or economic behaviors (<a href="https://arxiv.org/pdf/2304.03442">AI towns</a>, for example).
With maturing AI capabilities, there will also be user-agent and agent-agent interactions we didn’t think possible a few years ago, e.g., immersive multimodal collaboration between humans and AIs to solve problems in the real world, where the AI is not merely actuating human instructions, but more and more serving as an autonomous party, making decisions independent of its human counterparts. There will also always be tasks or capabilities that the emergent generalization of today’s AI models cannot cover very well yet, due to limited resource availability or fundamental limitations in the workhorse technologies we use today. It will be particularly exciting to piggyback on the AI capabilities that we have today to find the ones that are still lacking,<sup id="fnref:2"><a href="#fn:2" class="footnote" rel="footnote" role="doc-noteref">2</a></sup> where reliable and robust evaluation will always be at the forefront of these challenges (building datasets, developing sound evaluation techniques, and defining new categories of tasks are all important, but admittedly I am biased). I still stand by the predictions I made a little over two years ago in the tweet below, except that reality is a lot more exciting than I had imagined.</p> <center> <blockquote class="twitter-tweet"><p lang="en" dir="ltr">Could the fast development and wide adoption of <a href="https://twitter.com/hashtag/gpt4?src=hash&amp;ref_src=twsrc%5Etfw">#gpt4</a> and friends actually lead to a Linguistic Renaissance?</p>&mdash; Peng Qi (@qi2peng2) <a href="https://twitter.com/qi2peng2/status/1635816369509007363?ref_src=twsrc%5Etfw">March 15, 2023</a></blockquote> <script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> </center> <p>Last but not least, <strong>with great adoption come great responsibilities</strong>.
As AI agents improve in their capabilities to solve real-world tasks that people encounter in their everyday work and life, they will inevitably see wider and wider adoption. How can we make sure these agents do not do undesirable things when they are put in the hands of users? How do we make sure they are transparent enough that the user can understand how they do what they do, and controllable enough that should the user need to intervene, their behavior can be meaningfully influenced? How can we make them run more efficiently and robustly, reducing their power footprint and improving user experience at the same time? How do we make sure the interface that these agents expose to users provides the right amount of information and control, so that well-meaning users can accomplish their goals unhindered, while malevolent users cannot use them to cause outsized harm? A lot of these questions stand at the blurry boundary between cutting-edge research and real-world applications, which offers unprecedented opportunities for academic and industry researchers to collaborate and work on groundbreaking problems together (plug: <a href="https://www.orby.ai/">Orby</a> is doing a lot of these and beyond, consider joining us!).</p> <h3 id="final-thoughts">Final thoughts</h3> <p>The progress of science in artificial intelligence goes hand-in-hand with our ability to discover and clearly define new “intelligent” problems, however elusive this process might seem.<sup id="fnref:3"><a href="#fn:3" class="footnote" rel="footnote" role="doc-noteref">3</a></sup> Gone are the days when a few benchmarks occupied the attention of half of the NLP field for a decade, but the importance of identifying crucial, high-quality, and lasting research problems never fades. I try to ask myself this question all the time, and I would encourage every AI researcher to do the same from time to time: what is the problem that we are <strong>actually</strong> solving here?
It is probably not just a new model architecture, just hill-climbing on a bunch of datasets, or just designing and publishing new shiny benchmarks, despite what it looks like on the surface and what the buzz says. One answer that I inevitably stumble upon as I think in depth about this, is that we are fundamentally solving for the underlying <strong>human problems</strong>, and anything short of it is a detour or temporary milestone.</p> <h4 id="footnotes">Footnotes</h4> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:1"> <p>Because you get different flavors of beer by adding different levels of hops, among other things. And beer is a great companion when having hotpot in the summer. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:2"> <p>Pragmatics will always be one of my favorite examples, but I am <a href="https://arxiv.org/pdf/2004.14530">clearly</a> also <a href="https://aclanthology.org/2023.findings-acl.385.pdf">biased</a>. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:3"> <p>I have also tried to spell this out a bit more in a previous blog post: <a href="https://qipeng.me/blog/ai-is-the-new-rocket-science/">AI is the new rocket science</a> <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="Research"/><category term="AI"/><summary type="html"><![CDATA[We published HotpotQA, a groundbreaking multi-step question answering dataset in 2018, which has since motivated and facilitated numerous AI agent research works. 
But you should probably reconsider using it for your AI agents research in 2025.]]></summary></entry><entry><title type="html">AI is the New Rocket Science</title><link href="https://qipeng.me//blog/ai-is-the-new-rocket-science/" rel="alternate" type="text/html" title="AI is the New Rocket Science"/><published>2025-03-12T05:00:00-07:00</published><updated>2025-03-12T05:00:00-07:00</updated><id>https://qipeng.me//blog/ai-is-the-new-rocket-science</id><content type="html" xml:base="https://qipeng.me//blog/ai-is-the-new-rocket-science/"><![CDATA[<figure> <center> <img src="/blog/ai-is-the-new-rocket-science/spacex--p-KCm6xB9I-unsplash.jpg" width="90%"/> </center> <figcaption>Photo by <a href="https://unsplash.com/@spacex?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">SpaceX</a> on <a href="https://unsplash.com/photos/ray-of-light-near-body-of-water--p-KCm6xB9I?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Unsplash</a>. </figcaption> </figure> <p>“It’s not like it’s <em>rocket science</em>!” Exclaims a character in your favorite sitcom/drama/book, who then explains to another character how simple something they are asking is. 
“Rocket science” is widely used in everyday English to mean something incomprehensibly complex, more often than not with a sarcastic negative connotation, much like how “heavenly script” is used in Chinese (and, <a href="https://itsallgreektoanna.wordpress.com/2016/05/28/its-all-chinese-to-me/">by transitivity</a>, one of the most incomprehensible languages in the world).</p> <p>On the surface (pun intended), the appeal of rockets and space technology is easy to grasp for the general public, and has been sensationalized and romanticized by numerous artists in every shape and form imaginable, from <a href="https://en.wikipedia.org/wiki/Category:Films_set_in_outer_space">movies</a> (<a href="https://www.imdb.com/title/tt1454468/">Oscar-winning ones too</a>) and <a href="https://en.wikipedia.org/wiki/Category:Songs_about_outer_space">songs</a> (including this <a href="https://en.wikipedia.org/wiki/Space_Oddity">oldie-goodie</a>) to endless <a href="https://en.wikipedia.org/wiki/Category:Space_exploration_novels">books</a>, to name just a few categories. Even without extensive educational efforts from pop science and space agencies, it is not hard to see the appeal of space and space technology. After all, every launch marks the liftoff of a humongous space vehicle atop a massive ball of fire, into the eternal void full of undiscovered wonders of our physical universe. The nerve-racking challenges in weightlessness, the cool but memorable jargon (“extravehicular activity” sounds unfamiliarly cool with an easy acronym, while “natural language processing” sounds like a meat plant that turns books into dictionaries<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>), the smart-looking lead engineers and scientists in the huge mission control rooms, and the (oddly never soundless but) expansive black backdrop of the universe all contribute to the mystery and gravitas.
Add black holes and general relativity (and the Nolan brothers), and boom, <a href="https://en.wikipedia.org/wiki/Interstellar_(film)">instant Oscars</a>.</p> <p>On the other hand, the portrayal of AI (really mostly just robots) has largely fallen into two tropes: the bad AI deceiving humans and leading to our extinction, and the good, friendly, quirky, and often under-equipped AIs discovering their humanness and/or fighting these bad AIs at our side. Sometimes <a href="https://en.wikipedia.org/wiki/Terminator_(franchise)">a bit of both</a>. (AI) programmers on the silver screen are either staring at walls of JavaScript scrolling by as they smash their laptop keyboards randomly (which, can someone who knows someone in Hollywood talk them into at least replacing with Python?), or are token nerds that get killed off after offering the one critical smart comment to the muscular super-spy protagonist (<a href="https://www.quora.com/Why-does-Hollywood-portray-hackers-so-wrongly">apparently I’m not the only one wondering</a>).</p> <p>Despite their vastly different public images, rocket science and AI science actually share a lot in common. Some commonalities are more obvious, so let us get those out of the way first.</p> <p><strong>First, both are extremely capital-intensive and knowledge-intensive to build.</strong> While the proliferation of AI today might give a different impression, the AI market largely remains an oligarchy, where a few well-funded top players dominate most of the leading-edge work.
Even the Davids in the news are Goliaths compared to most of the rest of the world in terms of access to computational resources and/or top research talent, especially if they are building much of the breakthrough from scratch.</p> <p><strong>Second, both are applicable in a myriad of dual-use situations</strong>, which invites heavy government scrutiny if not also investment, media hype in the name of national security and pride, as well as (sometimes misguided) public interest. AI science increasingly resembles its rocket relative in its heyday 70 years ago, with DeepSeek actively compared to the “Sputnik moment” by the media, no less.<sup id="fnref:2"><a href="#fn:2" class="footnote" rel="footnote" role="doc-noteref">2</a></sup></p> <p><strong>Third, both are overly romanticized by the general public.</strong> This is manifested mainly in two ways – one where fiction works make interesting extrapolations of the technology into the future while grossly missing the mark on some things that we see as obvious decades later (e.g., the phone prediction below from 1956); another where we tend to amplify the voices of a few people that are associated with the technical breakthroughs, who are usually not the most versed in the technologies they represent. One example is early astronauts. While there is no doubt about their bravery in sitting in their tin cans (which is what they were by today’s standards), they were highly trained to follow a very detailed mission protocol designed and maintained by numerous engineers to account for all foreseeable issues during the mission, with a booster, a vehicle, electronics, etc. designed and built by hundreds if not thousands of highly capable and deeply knowledgeable scientists and engineers.
Yet we end up mostly remembering the “One small step” quote — pretty much the same thing is also happening today.</p> <figure> <center> <img src="https://external-preview.redd.it/TNYV66arLBJ0nONGtmMPGUgqQsgibtpuN8Ozz70aQ20.jpg?auto=webp&amp;s=7e63113e4cc618a2f9dfcd175c8e1b8d59e76250" width="70%"/> </center> <figcaption>Prediction of future phones in 1956. From <a href="https://www.reddit.com/r/funny/comments/1ty4z2/the_future_of_phones_1956/">Reddit</a>. Original source unknown. </figcaption> </figure> <p>In the meantime, there are some less-discussed commonalities between rocket science and today’s AI science as well, and you might not be expecting some of these if you haven’t paid close enough attention.</p> <p><strong>First, “science” is really a misnomer for both.</strong> To be clear, this is not to say that no science is happening within these disciplines: to the contrary, when there is abundant public interest, funding and potential applications to practical problems, scientific research and discoveries proliferate. But the term “science” clouds so much of the picture and creates an illusion that everything happens where people wear white lab coats, stare intensely at formulae and hand-drawn orbit diagrams on blackboards, or just debate the latest scientific findings in presentations in a crowded classroom.</p> <p>So much that is happening in both rocket science and AI science is <em>not</em> science, per se, in at least two ways. First, a lot of what we marvel at as scientific achievements are really a combination of scientific and technological breakthroughs, and at the time that they become a public sensation, deliver an astonishing demonstration, or generate tremendous economic value, what is known and well-understood from a scientific standpoint is often very primitive.
Less often do we see a piece of scientific theory published decades prior perfectly explain all the new empirical data we collect in experiments; more likely, we engineer things to work first with a ton of new empirical data, then patch up the theory here and there to explain what happened. It is worth noting that even in top academic labs, oftentimes the line between very cutting-edge scientific advances and very sophisticated engineering is blurred at best, since these disciplines can deliver so much practical value before everything is perfectly well-understood about them, as long as the good recipes can yield reproducible results and be built upon.</p> <p>Second, using just the term “science” underplays the amount of stellar systems engineering that goes into both of these disciplines. A great deal of program management, business administration, and architecture and engineering marvels go into these advances, which would have been impossible were it left to just a handful of smart scientists alone. This is thanks to the complex and multifaceted nature of these problems, spanning deeply into so many academic disciplines and application domains that it is impossible for any single human to know everything well enough to make meaningful contributions to all of them – and even if someone did, they would simply not have enough time and energy to do all the work that is required.
Organizing a team of individuals, each talented in their own ways, to move in unison towards a grander goal while keeping the quality of an entire delivered system up to a certain standard is a lot more difficult than it might sound – to get a glimpse of this challenge, one need look no further than group projects in their nearest school / university.</p> <p><strong>Second, we tend to take too narrow of a view of their potential (positive) implications.</strong> Both rocket science and AI science are complex enough to involve the collaboration of people from many different disciplines coming together, and this effect sometimes pays off in ways that are difficult to imagine when the technology is being developed to solve the problem at hand, even for the very experts that invented the technology. What do a LASIK surgeon, a newborn baby drinking formula milk, and an architect designing a new skyscraper have in common? They are likely using some technology that <a href="https://en.wikipedia.org/wiki/NASA_spin-off_technologies">originated from NASA’s space exploration</a>. Numerous such examples can be found also in life sciences, material science, and healthcare, just to name a few. While not all of these are directly related to the problems the technology was originally designed to solve, this is a great illustration of the far- and wide-reaching positive technological implications of solving complex engineering problems.</p> <p>In AI, this effect is likely going to be more pronounced.
While most of the technological advances that enabled human space flight are helping to improve the human condition for those on Earth who could potentially leverage these technologies, space flight itself is not a “product” that directly drives a positive impact on most people’s everyday life (except for the convenience that satellites provide, of course, which should not be underplayed and deserves a whole other article — here’s <a href="https://www.asc-csa.gc.ca/eng/satellites/everyday-lives/10-ways-that-satellites-helped.asp">an example</a>). This picture is vastly different for the development of AI. More so than ever before, the general public has direct access to some of the cutting-edge AI artifacts, and has knitted them deeply into the fabric of their everyday work and life. At the same time, not only does AI require technological advances to <em>build</em> and <em>use</em> efficiently (this is what seemingly disjoint technologies like crowd-sourcing, light-based chips, and massive yet efficient storage clusters have in common), but strong AI artifacts can also be leveraged to mimic the human experts out there to potentially uplift the productivity and innovation in many other disciplines (AI for science and AI for AI), catalyzing more emergent technological leaps — meanwhile, the <a href="https://en.wikipedia.org/wiki/V-2_sounding_rocket">V-2 rockets</a> you build would not automatically help you build a <a href="https://en.wikipedia.org/wiki/Saturn_V">Saturn V</a>.</p> <p><strong>Third, both are sometimes overly mystified by (some of) their practitioners.</strong> With great power sometimes come not great responsibilities, but great insecurities. Both rocket science and AI science can bring about significant societal changes, economic opportunities, and power shifts with their dual-use nature.
With vested (geo-)political, economic, financial, and sometimes reputational / socioeconomic status interests in maintaining the scarcity of the technology developed, the nations, organizations, or individual practitioners that work on rocket science and AI science are sometimes incentivized to unnecessarily complicate these technologies in the public narrative, and put up a mystified and grandiose facade. With extensive gilding and gatekeeping, they try to ensure that they are part of the few that are “in the know”, shunning anyone that comes near and offering misdirection to those who get close.</p> <p>While from a short-term perspective this might be beneficial for the individuals and groups involved, in the long term, this is usually a net loss for society and humanity as a whole. Science and technology did not win over the world by shipping around mystic elites that speak a language that they and only they can understand, but by developing a rigorous methodology that derives theories from a relentless belief in empiricism, where embracing a broader range of ideas that match all of the experimental findings as we know them has led to numerous rediscoveries of the empirical truths about our world many times over in history.</p> <p>So, as we recognize these commonalities between AI and rocket sciences, I believe there are also a number of lessons we can inherit from our space-exploring precursors back in the day:</p> <p><strong>Strong leadership and clear vision are key.</strong> Because of the capital-intensive and high-risk nature of these endeavors, it is easy to shy away from the challenge, to put in a half-hearted effort, or to divide the exploration into too many disjointed directions instead of focusing on solving the key challenges — all quick recipes for wasting a lot of resources and eventual failure.</p> <p>It is a Herculean organizational challenge to unite a large team of people around something as complex and as challenging as
rockets and modern AI. This is partly because, at scale, people will have to come from different disciplines with different backgrounds. They will not only have very different views on what is obvious, what is comprehensible, what is important in the entire project, and what is at risk of failure, but also have very different views from whoever is leading the team, because every person is limited to their own domain(s) of expertise. Meanwhile, although most would share the same mental mission of driving great achievements from the success of the overall project, each individual also comes with their personal agenda and goals. This is where vision-driven clear directions are key for the group to operate more autonomously, yet still move towards the same collective goal in unison. Most individuals with strong technical backgrounds might have at one point wondered why someone not as technically brilliant or versed as them is doing this thing called “management” while they are doing the “real work”, and wondered if they themselves could instantly do a much better job with their deeper expertise in the subject matter — the truth is, these are usually two very different job categories with very different sets of skills, and it takes serious time and effort to consistently do well in just one of them, let alone both. Working in a team of 5 is usually not that difficult, but imagine 5x that, then 5x that, and 5x again. What you would end up using is a very different set of skills than those you started out with.
Kennedy’s (JFK’s) 1962 speech actually emphasizes two key ingredients that I believe make moonshot goals work: a challenging goal that goes a few steps beyond the average person’s ambitions, and a somewhat tangible time frame allotted to achieve it.</p> <p>The goal-setting part is more intuitive and commonly talked about. A challenging yet meaningful goal can help attract the right kind of talent and keep them motivated working on it, which can typically lead to more exciting advances than what is imaginable from the outset. Like many things in science and engineering, solving problems at a larger scale, making things work at faster speeds, and increasing system reliability in more challenging conditions all require deep investigations and sometimes multiple breakthroughs. Once we know how to do these things for the first time, we have not only obtained good solutions for smaller scales, slower speeds, and lower reliability requirements along the way, but also built a world-class team for future challenges of this sort.</p> <p>Often less talked about is the second part, about the time frame. JFK did not say “We choose to go to the Moon in this year and do the other things”, probably because it was mid-September already (with the Holiday season coming up, there was practically no time left in the year), nor did he say “We choose to go to Mars in this decade and do the other things”, presumably because it was too Red for the American people’s taste at the time. Jokes aside, these would have been time frames incompatible with the technological leaps needed at the time, which I would sometimes call “physically impossible” as they would almost break the laws of physics and causality.
The risk with setting such goals is that people either would not trust leaders that set them, or would quickly lose trust when they realize that the timeline is entirely unrealistic after putting in a good amount of hard work and getting nowhere close.</p> <p><strong>Respect subject matter expertise with humility.</strong> Another effect of working on a complex problem / system is that no one will constantly have a perfect understanding simultaneously at the bird’s-eye level and down into the details that can sometimes make or break the entire system. As the problem we solve becomes more and more complicated, it is through simplifications that we make abstractions, reduce the complexity of the problem, identify patterns, and make decisions that involve larger and larger amounts of humanpower and resources. It is necessary to make assumptions and develop autonomous mechanisms for a large, complex organization to function, but it is also too easy for those in leading / decision-making positions of these organizations to forget that the people who are actually working first-hand on the subject matter are more likely the experts on the problems.</p> <p>For a large group of people to make progress on a sophisticated problem, it is important for those who lead to discern important signal from noise from the perspective of the project as a whole, which means not everyone’s voice will be heard at any given point in time. In the meantime, for the organization to stay healthy and the risk in the project to remain reasonably contained, it is also crucial that organizations have a good way of surfacing mission-critical concerns and considerations that often come from front-line contributors, and be decisive enough to take action accordingly, no matter how trivial these concerns might seem to an outsider.
On this point, I am constantly reminded of the <a href="https://en.wikipedia.org/wiki/Space_Shuttle_Challenger_disaster">O-Ring issue</a> that eventually led to the Space Shuttle Challenger disaster. In a project as intricate as the space shuttle program, layers of management would be inevitable, where people assigned to management roles are usually either more senior or more removed from the day-to-day, if not both. Their failure to trust the expertise and professional judgement of the technicians who actually worked on the O-Ring led to this information being buried in the giant organization, which eventually became the weakest link that decidedly caused the huge investment to fail catastrophically, with the loss of precious human lives.</p> <p><strong>Collaborating with openness we all win, competing with divisiveness we all lose.</strong> In the same 1962 speech at Rice University, JFK retold ancient philosophical wisdom eloquently in modern English: “The greater our knowledge increases, the greater our ignorance unfolds.” This highlights the seemingly paradoxical nature of knowledge exploration — as we learn and expand the volume of the body of our knowledge, the surface area where this body meets the vast vacuum of ignorance inevitably grows with it. With every question answered, two more seem to pop up on that answer and point into the further darkness of unknowns.</p> <p>In the meantime, as Sir Isaac Newton famously put it, “If I have seen further it is by standing on the shoulders of Giants.” This could not be more true in today’s rapidly developing world of artificial intelligence and information technology as a whole. Information technology has provided us with the infrastructure to replicate not only factual knowledge that others have produced or accumulated, but now also the operational knowledge about how to do things.
Instead of reinventing the wheel, in the digital world, one can literally make “infinite” copies of that exact same wheel wherever they want and use it, as long as this is permitted by the licensing of that wheel. We are constantly witnessing great breakthroughs at the scale of the entire humanity, rather than just one or two groups of really brilliant people, whose work is in turn built on countless digital wheels built by generations of their predecessors. It is with a largely open-source culture that this is possible, which has in turn often benefited the very people who decided to share their work publicly in the first place, because they are no longer obliged to have all the best ideas about how to improve their work — the community will do much of the work with them. From personal experience, five years and a few thousand GitHub stars later, <a href="https://stanfordnlp.github.io/stanza/index.html">Stanza</a> has grown into a much more robust, functionally rich, and well-maintained project thanks to the numerous contributors not only to this project (especially kudos to John Bauer!), but also to Universal Dependencies, various multilingual NLP datasets, and the various tools that Stanza depends on — none of this would have happened were we in a closed-source alternate universe.</p> <p>But, you might ask, this post has talked about so many parallels between rocket science and AI science — while it is good to learn from history and everything — <strong>is there anything unique to AI science that does not have a good parallel?</strong></p> <p>One of the defining characteristics that sets it apart from most natural sciences, in my opinion, is that its goals / references are so <strong>elusive</strong>. In physical sciences, the goal is to propose theories and formulations that can well model what the universe does and make reasonable predictions about it; the object of study is permanent.
That is, the universe does not move or suddenly change its rules of engagement as we discover more. On the contrary, the challenges have typically been that we either oversimplified in our modeling and found new evidence from new experiments that contradicts our theories and therefore helps refine them, or that the scale or complexity of practically meaningful simulations required too many resources, and we keep inventing better and better ways to approximate what the universe does. The universe does not care and does not play games.</p> <p>Intelligence, on the other hand, has been a constant, elusive dream since humans first started philosophizing about it. In each era, we set a seemingly impossible goal for the technology of the time as the “crown jewel” or “holy grail” of AI, only to quickly dismiss it after it has been achieved by saying “that was not what intelligence <em>TRULY</em> is anyway”. From being marveled by machines that can merely remember and do simple logical reasoning, we have gotten used to them doing massive amounts of computation at a rate our biological brains cannot hope to keep up with, then witnessed theorem provers solving complex symbolic problems, and saw computers perform interesting tasks in blocks worlds. From there, AI developments have fueled knowledge-based expert systems, driven DARPA’s cars, beaten humans at chess, Go, and Starcraft, engaged in very humanlike conversations on common topics, performed human-like tasks in digital and physical environments, and now it seems like some breakthrough is happening somewhere in AI every week, if not every day.</p> <p>Similar to how Yuval Harari described stock markets in his book “Sapiens”, it appears that intelligence is a second-order chaotic system, where mere observation and attempts to predict its behavior change the behavior of the system itself.
Humans and intelligence are like Achilles and the Tortoise in <a href="https://en.wikipedia.org/wiki/Zeno%27s_paradoxes#Dichotomy_paradox">Zeno’s Paradox</a> — whenever we catch up with what we deemed the peak of intelligence in our eyes, it has already moved on to a new point. While this is by no means trying to underplay the advances in AI we have witnessed in the decade leading up to today, I am cautiously optimistic that in another decade or two, we will look back at today’s AI systems and call them (and our current understanding of intelligence) primitive. There is still so much that our current AI systems struggle to do as well as humans, despite their ability to do a somewhat good enough job much faster and more scalably (sounds familiar?).</p> <p>Will humanity catch up with the intelligence Tortoise in this millennium, or ever? I do not know, because even the best predictive models rely on good reference data and a strong pattern to make accurate predictions, and as we have seen, AI science is anything but predictable itself. In the meantime, I do hope that in this case humanity gets to witness the day of artificial intelligence surpassing the entire human race on what we deem as intelligence, and with it, answer many more curious questions about our world than we have answers for today. Maybe at that time, instead of repeatedly realizing “but that’s not intelligent enough” in the pursuit of intelligence, we can pursue our true humanness instead, and ask, “Is this human enough? Am I human enough?”</p> <h4 id="footnotes">Footnotes</h4> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:1"> <p>I am allowed to make this joke because I have worked at that meat plant in their QA department, among other positions.
<a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:2"> <p><a href="https://www.npr.org/2025/01/28/g-s1-45061/deepseek-did-a-little-known-chinese-startup-cause-a-sputnik-moment-for-ai">DeepSeek: Did a little known Chinese startup cause a ‘Sputnik moment’ for AI?</a> NPR News, 2025. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="Research"/><category term="AI"/><summary type="html"><![CDATA[AI science of today has astonishing similarities to rocket science in its prime days, if one pays close attention to history. What are some of these, and what can the history of rocket science tell us about the fate of AI?]]></summary></entry><entry><title type="html">What do industry researchers do, anyway? Part 2 – What do they do when they are not publishing</title><link href="https://qipeng.me//blog/what-do-industry-researchers-do-part-2/" rel="alternate" type="text/html" title="What do industry researchers do, anyway? Part 2 – What do they do when they are not publishing"/><published>2025-01-02T04:00:00-08:00</published><updated>2025-01-02T04:00:00-08:00</updated><id>https://qipeng.me//blog/what-do-industry-researchers-do-part-2</id><content type="html" xml:base="https://qipeng.me//blog/what-do-industry-researchers-do-part-2/"><![CDATA[<figure> <center> <img src="/blog/what-do-industry-researchers-do-part-2/patrick-tomasso-Oaqk7qqNh_c-unsplash.jpg" width="90%"/> </center> <figcaption>Photo by <a href="https://unsplash.com/@impatrickt?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Patrick Tomasso</a> on <a href="https://unsplash.com/photos/open-book-lot-Oaqk7qqNh_c?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Unsplash</a>.</figcaption> </figure> <p>One day in 2020, I was with my PhD advisor Chris Manning in one of our regular meetings. 
At this point, most of my PhD work was wrapping up, so our conversations had gradually shifted towards me absorbing Chris’ wisdom on meta-research topics and career outlook. I remember asking, “Why does it seem like industry research people ‘disappear’ after they graduate and many stop publishing?” Chris thought for a while and jokingly said, “Because they got fat and lazy – or, <em>lean</em> and lazy, since many of them are more fit after working a few years.” I understood that Chris’ joking response was partly referring to the usually improved work-life balance of industry researchers compared to their former PhD selves, but surely, that cannot have explained <em>this</em> much of a difference, right?</p> <p>Fast forward about two years, and I was interviewing for jobs in early 2022. Until that point, I had been working at JD.com AI Research as a research scientist, where my team was almost solely focused on publishable research and did not feel much different from a university lab. With a focus on applied research roles for this move, I found myself asking my interviewers, “What does your day-to-day work look like?” Much to my dismay, some interviewers told me that their day-to-day mostly involved optimizing models for size, accuracy, and latency in solving specific machine learning tasks, which was not exactly the innovative work for products impacting real people that I was looking for. Is that really <em>it</em>? Is there nothing more to applied research work?
Looking back, I don’t really blame the interviewers for this answer when put on the spot like that, as I probably would not have been able to give a more comprehensive picture had I not thought about this question at length myself.</p> <p>Knowing what I know today, I hope to give my younger and less experienced self a better answer to these questions, and hope this benefits someone with similar questions at similar career stages today (<a href="/blog/what-do-industry-researchers-do-part-0/#why-might-you-want-to-continuestop-reading">obligatory disclaimer on what I know</a>).</p> <p>In case you were wondering: No, your research internship experience likely does not reflect your mentors’ experience as full-time employees (and should not be the sole source of your decision to join them).</p> <h2 id="why-dont-industry-researchers-publish-as-much">Why don’t industry researchers publish as much?</h2> <p>By and large, an average researcher publishes less when they move from academia to industry. Like every statistical story, there are notable exceptions to this average trend, but that is not what we are focusing on here. So what is it about industry research and its researchers that makes them produce fewer published research works?</p> <p>This is a perfect example where the <a href="https://en.wikipedia.org/wiki/Anna_Karenina_principle">Anna Karenina principle</a> is at play. Simply put, <strong>all published papers are alike, each unpublished work is unpublishable in its own way</strong>. More specifically, for (industry) researchers to publish papers, a number of factors need to be satisfied simultaneously to a reasonable degree to allow for the paper-writing work to take place – which is often harder to achieve in industry labs. A few of the key factors, in my mind, include: interest, incentive, team priorities, and time.</p> <p><strong>Interest</strong> is an important factor that is intrinsic to each researcher.
I believe interest in research is of paramount importance, because most of the time research projects are not smooth sailing – this was true even for some of my own work that seemingly went really smoothly from ideation to fruition. There will be research challenges that are much harder for the uninterested mind to comprehend and solve; for industry researchers, even if the research challenge itself is interesting and exciting, there are usually more hoops to jump through (e.g., dataset permissions/release review, various legal/business/budget/publicity approvals, to name a few) that will take the lead researchers some inner drive to overcome.</p> <p>One thing that is perhaps not anticipated or appreciated enough by folks in the thick of their PhDs is that people’s interests shift and change faster and more often than you might expect. People might fall out of love with the kinds of academic research they have been doing for years during their PhD, either by changing research topics or dropping out of doing research altogether. The industry is also a good source of a myriad of applied research problems that are specific to its own scale, the kinds of customer scenarios it needs to deal with, and how it operates. These can often be equally challenging and attractive to researchers who are looking to solve interesting problems, and they sometimes even require more holistic solutions than are typically needed for a publishable research prototype. This can often lead to very fulfilling research careers that do not lead to publications (or at least not as many of them).</p> <p><strong>Incentive</strong> is a longer-term extrinsic mechanism that influences how we work and how we work together. This is important because it determines how things are viewed and prioritized by you and your peers on a longer time horizon.
Incentive is complex and influences different individuals differently, but the overall culture and atmosphere that results from it forms an implicit contract for a group of people to coordinate and achieve shared goals together. There are a lot of different incentives industry research jobs provide, and different people might be incentivized by (a combination of) different things at different times. These might include: higher pay, a more comfortable life, work-life balance, solving unique problems that are not available elsewhere (in some cases, the solution might not be entirely novel-from-scratch or might not generalize elsewhere), making progress on the corporate career ladder (getting higher numerals behind the “Level” word or getting more people to report to you), building cool products for internal or broader social recognition, working on great industry-scale research works, etc. Having a diverse set of incentives means that you would need to pick and choose which ones matter to you the most to spend your time and energy on. Oftentimes these do not directly contribute to publishing, and publishing is not a notable contributor to many of these goals. That’s how you can easily have a few productive years as an industry researcher, but not necessarily many published papers to show for it.</p> <p>Of course, incentives from your employer are not necessarily what you are optimizing for. Some people are largely oblivious to the career incentives in front of them, and choose instead to focus on what they are truly passionate about, which for some people is published research. There is absolutely nothing wrong with that, because more often than not, if your employer’s incentives are pulling you away from what you enjoy the most at work, and if you do a great job at meeting the goals they incentivize, you will likely have even less time for your passion. 
Others publish to stay in touch with the research frontier, and to have something to talk about during their interviews should they decide to change jobs. A publication record could help a bit for your prospective employers to understand your skill set, and recent publications show that you are still familiar with the publishing game. But as long as you are not interviewing for a junior research role, publications are unlikely to be what takes up the most time in your interview discussions, in my experience.</p> <p><strong>Team priorities</strong> are how incentives and goals manifest in the more day-to-day decision-making processes within your team. Among other things, priorities can determine how resources are allocated, which is a key enabler for many endeavors, including publishable research works. Here, “resources” is an often overly generalized term, which can include physical resources like funding, equipment, computational resources, etc., as well as personal bandwidth, including the bandwidth of in-house data experts, that of your collaborating researchers, and even your own. These can sometimes make huge differences in the journey of your work targeting publication. While I am not one to like over-generalizations most of the time, here one might be useful: resources largely determine the time it takes to produce something meaningful, and the quality of what is produced. In the world of scientific publications, being slow to achieve a high enough quality in research works sometimes means ideas will be scooped or will no longer be relevant with new waves of technology.</p> <p>Besides resources, another way that team priorities influence publications is in how they set boundaries for acceptable behaviors from time to time. For instance, many industry researchers will be involved in somewhat secretive projects throughout their career, where not only is the timeline often tight, but the need to protect trade secrets is also often a high priority.
This sometimes means that even if resources are at your disposal, and the results you are generating are exactly the kind that is relevant in the scientific literature at this moment, you might not be allowed to publish them at your discretion until a much later date, if at all.</p> <p><strong>Time</strong> is a valuable thing.<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup> Here, I am not just referring to how people choose to spend the nominal 40 working hours of the work week, but also the full 168 hours of their typical week. While different people have wildly different working habits (some work 100-hour weeks, others 10-hour weeks), time is fair to everyone, in the sense that nobody gets more hours than others in any given day or week. Therefore, the main difference that accounts for the discrepancy in, for instance, working hours each week, is how we choose to allocate these hours differently.</p> <p>The profile of a typical PhD student is usually someone in their 20s or early 30s, childless, and overall in reasonable health.<sup id="fnref:2"><a href="#fn:2" class="footnote" rel="footnote" role="doc-noteref">2</a></sup> Most of them, especially junior students, do not have more than 8 hours of meetings, scheduled discussions, or classes. All of these factors contribute to their energy, ability to concentrate, flexibility of schedule, and their overall capacity to allocate uninterrupted hours to research-related (or personal health upkeep) activities. Life happens, and most of these factors will gradually (or abruptly) change as people near graduation from their PhD programs. They might start taking up more responsibility at work, start a family, experience the effects of early aging (boo!), decide to place more emphasis on work-life balance and long-term mental health; the list goes on and on.
Aside from the natural and gradual course of life, another salient factor that usually contributes to this change is that people often plan their lives to accommodate major milestones, e.g., “I’ll wait until I graduate to do such and such”, or “once I graduate, I’ll make sure to do so and so”. All of these can take time and focus away from what is needed for publications, and more often than not, people actually embrace and welcome these changes when they are going through these processes.</p> <h3 id="is-it-bad-if-you-dont-publish-as-much-when-working-as-an-industry-researcher">Is it bad if you don’t publish as much when working as an industry researcher?</h3> <p>We have covered the various reasons that might lead to reduced publications from industry researchers, but what does it mean for the researchers involved? Is it bad that they are not publishing as much?</p> <p><strong>The short answer is, it depends a lot on your interests and situation, but generally, it is probably not as bad as you would think</strong>. On the one hand, while many freshly minted PhDs feel a strong sense of FOMO (fear of missing out) when they join the industry for the first few years, they will also quickly learn that publications are just one of the many ways they can contribute to their employer or even the research community, and as a result they might choose to focus their time on fewer publications with more concentrated topics/flavors, or pause publishing altogether. On the other hand, if you are concerned about maintaining your competitiveness on the industry job market, stressing over publications is also often unnecessary, especially for experienced hires, because your prospective employers are also going to be mindful of all the other ways you can contribute aside from publishing research papers, and focus on assessing those aspects.
Sure, having some recent papers on your CV makes it easier for people to get a glimpse into your (current team’s) recent interests and gives you something you can more easily talk about, which is a plus, but I think mindfully designed interview processes can help you demonstrate your capabilities even if these are not available.</p> <p>Note that this statement might not hold as firmly if you are looking for opportunities that have a more pure research flavor (e.g., faculty positions at an R1 university). But even in those cases, the quantity of publications is likely far from the only key objective you should consider optimizing for.</p> <h2 id="what-are-industry-researchers-doing-when-they-are-not-publishing">What are industry researchers doing when they are not publishing?</h2> <p>Even when controlling for the actual number of working hours, I suspect that an average PhD student would still be more productive on publications than an average industry researcher. Why? What do industry researchers spend their time on instead?</p> <p>Increased communication takes a notable portion of this time. As I <a href="https://qipeng.me/blog/what-do-industry-researchers-do-part-0/#the-less-talked-about-similarities-between-academic-and-industry-research">have discussed in a previous post</a>, a necessary consequence of working in the “real world” (both industry and academia) is that you work a lot more with other people, who are usually experts in their own fields, but not <em>your</em> field. For more research-y jobs, these could be your research peers, junior/senior colleagues, managers, administrators, etc. For more product-y jobs, especially AI-centric products, researchers usually have the good fortune of very cross-cutting experiences, if you so wish.
Besides your immediate team of research people, you might also interact with</p> <ul> <li>Product Managers who oversee the requirements (usually including customer interviews) and progress of product development</li> <li>Engineers (both “traditional” software engineers as well as machine learning engineers in some cases)</li> <li>Designers for user interfaces/experience, which can sometimes make up for the deficiencies of the technology</li> <li>Data Annotators and people that help manage them</li> <li>Data Scientists that extract insights from the large datasets your product has accumulated</li> <li>Research people that are more on the research or applied end of the spectrum than yourself</li> <li>Sales/Marketing people trying to understand what language to use to communicate with potential customers</li> <li>Technical Writers that maintain the 100s of pages of documentation for your product</li> <li>Finance people that oversee your project’s budget</li> <li>…</li> </ul> <p>The list goes on and on.<sup id="fnref:3"><a href="#fn:3" class="footnote" rel="footnote" role="doc-noteref">3</a></sup> Most industry projects are at a scale that involves a significantly larger number of people from different disciplines working on different aspects of the same thing, so communication is necessary to make sure the group moves in unison towards a shared goal.</p> <p>That said, most industry researchers actually are not doing <em>that</em> much communication work in their day-to-day, especially if they are not managing teams or leading large (applied) research projects (which is particularly true for junior researchers). Where did their time go if not towards publishable research? Here are some typical ways industry researchers spend their time that often also involve meaningful research problems, but would not lead to publications (spoiler: it often has to do with product timelines):</p> <ul> <li> <p><strong>Internal prototyping and benchmarking</strong>.
Unlike research works that oftentimes end at putting numbers in a paper (which, kudos to the open-source culture in computer science, is a bit less of a problem), industry researchers cannot take these results at face value. Given a product requirement or technological goal, they often need to perform a series of prototyping and benchmarking exercises (especially on datasets that more closely reflect their customers’ use cases) to validate existing and new technology on data distributions that are more meaningful to the product. Gone are the days when you can simply cite someone’s work and report the numbers they reported, because a) reproducibility is very important, and b) standard, public benchmarks often won’t cut it. This sometimes also leads to unique contributions driven by internal needs that go beyond what the community already knows or has, and materialize in good research papers, albeit sometimes a bit later than the authors would like (e.g., <a href="https://arxiv.org/pdf/2407.13998">RAG-QA Arena</a>).</p> </li> <li> <p><strong>System design</strong>. You might have heard about this term in the context of industry interviews, or even learned some tricks to perform well in them. Simply put, a system design is typically a practical design for the core research-y components to work in a real product to fulfill its various functional and non-functional needs, including the behavior of the system in different settings, the robustness and reliability of such behaviors, the monetary cost of running the system, as well as the time it needs to process each user input, to name a few. These often require a lot of out-of-the-box thinking and practical consideration beyond just the “core algorithm”, and can be quite revealing of one’s experience, background, and thought process (which is why they often make it into interviews, especially for more senior technical hires).</p> </li> <li> <p><strong>Implement parts of the product</strong>.
This may or may not come as a surprise, but applied researchers in the industry are sometimes responsible for part of the production code/implementation for the final product. This is not uncommon, especially when the logic of the interaction with the “core technology” is not straightforward or highly standardized. This might entail more traditional engineering work like writing unit and integration tests, depending on what kind of production code you are contributing to. Why not have dedicated ML engineers do these all the time, you might ask? The next bullet point might shed some light.</p> </li> <li> <p><strong>Troubleshoot and resolve issues</strong>. Many products, especially software-as-a-service (SaaS) products, receive a constant stream of use from their customers, with which potentially comes a constant stream of issues that need to be resolved. Some of these might be attributed to the “core technology” part of the system, and the researchers that implemented it in the first place have the best know-how and expertise to identify and potentially resolve the issues. Oftentimes, providing applied researchers with more flexibility while maintaining the programmatic contract with the rest of the system can greatly improve the team’s ability to resolve issues, since, for instance, how machine learning models are interacted with can involve a number of pre- and post-processing steps that make sure the result is reliable. (This is also where good system design comes in – maintainability.)</p> </li> <li> <p><strong>Other issues with live systems</strong>. With a live system like SaaS comes great responsibility. In its design and subsequent troubleshooting and improvements, its provider should keep in mind maintaining (or sometimes enhancing) its scale, backwards compatibility (which is notoriously difficult for ML-based systems), future compatibility, latency (time), and cost (money), etc.
Since the research-y component is often one of the bottlenecks to many of these, it provides industry researchers with an interesting challenge as to how to uphold or improve these practical considerations.</p> </li> <li> <p><strong>Technical reports and survey papers/reports</strong>. Sometimes the product/company might require a certain form of written communication that is best carried out by researchers since they have the necessary training for consuming and writing technical content, most commonly seen today as technical reports for public releases of large foundation models. Other examples might include technical blog posts or survey papers (e.g., <a href="https://arxiv.org/pdf/2105.06457">how conversational AI can be used for social good</a>).</p> </li> </ul> <h2 id="closing-remarks">Closing Remarks</h2> <p>This marks the end of the series of posts I originally planned as a short note. I do hope this helps someone get a fuller picture of the life of industry researchers, and therefore make more informed decisions. Some of these might strike you as not immediately to your liking (about either academic or industry research jobs), to which I think it is important to <a href="/blog/what-do-industry-researchers-do-part-1/#closing-thought-keep-an-open-mind-and-embrace-change">keep an open mind</a>.
It is also important to remember that research jobs are not the entire universe of opportunities after obtaining a PhD, and there are plenty of resources online for this (e.g., <a href="https://profession.mla.org/ten-jobs-where-you-can-use-your-phd/">this</a> and <a href="https://www.usnews.com/education/articles/jobs-you-can-do-with-a-science-ph-d-beyond-academia">this</a> from a quick Google search).</p> <p>Regardless of which path you end up choosing, I hope you remember that <a href="/blog/what-do-industry-researchers-do-part-1/">life is like a box of boxes of chocolates</a>, and every flavor is potentially connected to everything else in those boxes.</p> <h4 id="footnotes">Footnotes</h4> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:1"> <p>If you got the lyrical reference here, I am happy to share that <a href="https://en.wikipedia.org/wiki/In_the_End">this song</a> was originally published in 2000, almost two and a half decades old now. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:2"> <p>This is by no means representative of all PhD students everywhere, but more derived from my personal experience with the majority of PhD students in the U.S. and China. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:3"> <p>This list is completely based in fact. However, you do not actually interact with all of these folks all the time – they each will come into focus in very distinct stages of the product development/deployment cycle. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="Research"/><category term="Industry"/><summary type="html"><![CDATA[Many people seem to "disappear" from the academic publication scene once they graduate from their PhD and join industry research labs. What are they actually doing when they are not publishing? 
What do they have to show for their research work if they don't publish?]]></summary></entry><entry><title type="html">What do industry researchers do, anyway? Part 1 – How to choose a team</title><link href="https://qipeng.me//blog/what-do-industry-researchers-do-part-1/" rel="alternate" type="text/html" title="What do industry researchers do, anyway? Part 1 – How to choose a team"/><published>2024-07-22T05:00:00-07:00</published><updated>2024-07-22T05:00:00-07:00</updated><id>https://qipeng.me//blog/what-do-industry-researchers-do-part-1</id><content type="html" xml:base="https://qipeng.me//blog/what-do-industry-researchers-do-part-1/"><![CDATA[<figure> <center> <img src="/blog/what-do-industry-researchers-do-part-1/monique-carrati-ONn4OfAnxZY-unsplash.jpg" width="90%"/> </center> <figcaption>Image by Monique Carrati (via Unsplash).</figcaption> </figure> <blockquote cite="https://www.imdb.com/title/tt0109830/quotes/?item=qt0373657&amp;ref_=ext_shr_lnk"> <p> My momma always said, "Life was like a box of chocolates. You never know what you're gonna get." </p> <footer> &mdash; <cite>Forrest Gump, 1994</cite> </footer> </blockquote> <p>The 1994 movie <a href="https://en.wikipedia.org/wiki/Forrest_Gump">Forrest Gump</a> is timeless, not only because it is full of comedic gems that reflect on the real world that we live in, but also because it is littered with soundbites of wisdom that continue to mesmerize the audience even after the movie ends – like the one quoted above.</p> <p>As one of the two food-related quotes I distinctly remember from the movie (the other being about <a href="https://www.imdb.com/title/tt0109830/quotes/?item=qt0373704&amp;ref_=ext_shr_lnk">a fruit company</a>), Mrs. Gump’s wisdom has helped me put in perspective how one makes work and life choices, and what situations they end up being in down the road. My current take on this quote is this: life is like a box of chocolates. 
You can pick and choose what you want, but you are only allowed to see the shape and color of the chocolate at the time of choosing. It is only when you actually put that piece of chocolate in your mouth and taste it that you understand the full layers of flavor of the chocolate you chose, which you cannot fully predict ahead of time.</p> <p>While I appreciate the wisdom behind this analogy, I do have a small problem with this view of the world, if I am allowed to geek out for a second. This “box of chocolates” can lead to two plausible but overly simplistic interpretations about our lives. In one interpretation, you are given a once-in-a-lifetime chance to choose a chocolate from a box, and you live with the consequences thereafter without being able to do much about it. In another, you are handed a (finite?) set of chocolates at birth, and the only agency you have in the process is to choose which order you’d like to pick and enjoy them. Your choices of chocolates are largely independent of each other except for the fact that you can’t get the chocolates you have already devoured again. Both of these seem to leave something to be desired.</p> <p>I think a more accurate description is something akin to the following, which I dub my <strong>life is a box of (boxes of) chocolates</strong> analogy.</p> <blockquote> <p>Life is like a box of chocolates, where each chocolate comes with a silver ticket.<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup> Once you choose, you leave your current box of chocolates behind, and enjoy the chocolate you just chose. You then redeem that silver ticket for another box of chocolates, which is somewhat related in flavor to the chocolate you just had. 
Instead of choosing just once or just getting a single box for your lifetime, this process repeats itself and you are always given brand new boxes of chocolates to choose from.</p> <p><br/> But you still never really know what you’re gonna get.</p> <footer> &mdash; <cite>Yours truly, 2024</cite> </footer> </blockquote> <p>Side note: If you are one of those (un)fortunate enough to have been inducted into the world of stochastic processes, you might have realized that this is exactly what I have described.</p> <p>This analogy has a few corollaries:</p> <ol> <li><strong>There is no perfect information.</strong> This part is consistent with the original box of chocolates wisdom. You can try your best to probe your chocolate candidates, learn from your past experience or other people’s chocolate-choosing experiences, or even bite off a small piece of your chocolates – none of these will give you the full picture until you actually eat it. You only determine part of your journey by making choices; the rest is sheer luck.</li> <li><strong>The world is not constant and your choices have effects on your future pool of choices.</strong> It is easy to lose sight of all those chocolate boxes in the future, and be tempted to leverage all the information you have at hand to find and devour the kinds of chocolate you enjoy the most today. Either positively or negatively, significantly or slightly, this is shaping your future chocolate boxes and what flavors will be available in those when they eventually get to your hands.</li> <li><strong>But this doesn’t mean you have no control, or can’t make predictions.</strong> The Mega-Chocolate Universe (not to be confused with the other <a href="https://en.wikipedia.org/wiki/Marvel_Cinematic_Universe">MCU</a>) is fortunately somewhat predictable, at least in the sense of likelihoods. 
So even if you still don’t fully control your chocolates and which boxes come next, you can nevertheless try to improve the probability of getting the chocolates you truly enjoy in the future.</li> </ol> <p>If you are unsure about what this box of (boxes of) chocolates, or stochastic process, looks like, the tweet quoted below has one of my favorite visualizations.</p> <center> <blockquote class="twitter-tweet"><p lang="en" dir="ltr">We think a lot about those black lines, forgetting that it’s all still in our hands. <a href="https://t.co/RSZ1d3W642">pic.twitter.com/RSZ1d3W642</a></p>&mdash; Tim Urban (@waitbutwhy) <a href="https://twitter.com/waitbutwhy/status/1367871165319049221?ref_src=twsrc%5Etfw">March 5, 2021</a></blockquote> <script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> </center> <p>If you have read this far, and are genuinely interested in what I have to talk about, this is a good time for an obligatory reminder of why you should or should not read on, <a href="/blog/what-do-industry-researchers-do-part-0/#why-might-you-want-to-continuestop-reading">written in the “0th” part of this series</a>. My experiences might not be useful or applicable to you, so take it with a pinch of salt.</p> <p>So, what should you think about when choosing a team in the industry to work on research-adjacent projects? Below are a few factors I wish someone had laid out clearly and explained the pros and cons to help me make decisions.</p> <h3 id="product-driven-or-curiosity-driven">Product-driven or curiosity-driven?</h3> <p>One of the first things that comes to mind for many people is this – does the team you are joining do more research driven by actual product needs, or by curiosity of the team members as long as they are vaguely related to the company’s goals? As mentioned in the <a href="/blog/what-do-industry-researchers-do-part-0/">previous post</a>, industry research groups and positions can be as heterogeneous as they come. 
That said, teams and projects can often be projected along the axis of product-driven vs curiosity-driven, and each end of the spectrum has its own pros and cons.</p> <p>These pros and cons follow some simple first principles:</p> <ol> <li>Products generally yield financial returns, which is generally welcomed by companies and necessary to in turn fund R&amp;D efforts.</li> <li>Product building often comes with timeline commitments (either from external customers or for the sake of coordinating a large team), and meeting these usually comes with prioritization and tradeoffs for everyone involved, including researchers.</li> <li>Your career progress (if that’s what you are after) and job security (especially in poor economic conditions<sup id="fnref:2"><a href="#fn:2" class="footnote" rel="footnote" role="doc-noteref">2</a></sup>) as an industry researcher are likely correlated with your financial contributions (direct or indirect). This might sometimes also affect the kinds of problems you get to work on.</li> </ol> <p>On the purely <strong>curiosity-driven</strong> end of the spectrum, you have a lot of research freedom, sometimes even more than in academic institutions because you’d typically have more resources readily available to support you. Your research will more likely be driven by publication timelines (paper/journal submission deadlines) rather than anything else. But when and if you need to earn your keep with the company, or seek your promotion up the ladder, it will typically be more of an uphill battle. 
You would need to “sell” your idea and results and try to convince someone who works on a product to adopt it, and work with them to find the right resources to implement it in production code – which is usually no small feat.</p> <p>At the other, the purely <strong>product-driven</strong>, end of the spectrum, you are often more bound by timelines that go hand in hand with product delivery schedules, which are usually more aggressive by curiosity-driven research standards. As a result, you cannot always wait for the “Aha!” moment for the most original ideas, but instead bias towards adopting and improving upon existing approaches to meet product needs most of the time. In the meantime, it is usually a lot easier to demonstrate that you have done something significant in the critical direction the company or larger team is heading in. You don’t need to convince your partner teams your work is valuable because they constantly rely on you and your expertise, and leveling up in the hierarchy is comparatively more straightforward.<sup id="fnref:3"><a href="#fn:3" class="footnote" rel="footnote" role="doc-noteref">3</a></sup></p> <p>In reality, industry research teams are usually organizationally aligned with one of these modes of operation, in that they either have strong affinity to product teams, or they don’t. I think the underlying logic is simple – as a product team, I would pay for an in-house research team only when I know I can count on them to prioritize product support needs in the form of time commitments and/or constant brainpower support. But this organizational affinity doesn’t mean industry research teams are at the extremes of the spectrum all the time. 
The situation actually varies a lot from team to team, and even for the same team, it can vary over time depending on where the product is, what the company’s financial situation is, how challenging the research project is, etc.</p> <p>For instance, when a curiosity-driven research team needs to take on a large project that requires a lot of researchers’ effort, it’s unlikely that everyone’s research interests will automatically fit tightly like puzzle pieces, or that a group of people will automatically function as a well-oiled machine. Furthermore, with more people working on the same project, there can be more structure, more timeline commitments that might be deemed aggressive to keep the larger group in sync, and less research freedom for the individuals involved on average. On the flip side, when a product-driven research team is not on hyperdrive towards product deadlines (and especially when research interns are around), there can be great opportunities to think about big research ideas and take actions on them. One nice thing is that if these research ideas originate from the product (which they quite often do), these teams are in the best position to test them out and help deploy these research solutions to benefit real people.</p> <p>It is usually useful to learn about how your prospective team’s time spreads on the spectrum, and how that matches your own expectations.</p> <h3 id="topics-and-fields">Topics and fields</h3> <p>Should you look for a job opportunity that exactly matches your current research interests, or one that expands them a little in a way you did not plan for? 
Should you take the opportunity of your full-time job to re-invent your research direction and learn everything from scratch?</p> <p>While I do not have the perfect answer since everyone’s situation is different at any given time, there are a few factors I would consider when it comes to research topics and fields.</p> <ol> <li><strong>Are these topics/fields entirely new to you, or way too familiar?</strong> If you need to work on something brand new, you will need more time to learn about the background knowledge, build intuition, and solve relatively straightforward problems before you can unleash your full research power as you would if you were working in your very own field of expertise. At the other extreme, when a topic is too familiar and/or you have been working on it for too long, you might encounter researcher fatigue and find less interest in the research work your team does.</li> <li><strong>What’s your runway to ramp up?</strong> If you are entering into something that isn’t immediately aligned with your expertise, what’s the gap in your knowledge/expertise, and what buffer is available to you to make up for this gap? I would like to think of full-time jobs as slightly less forgiving compared to the safety of being a PhD student, so you’d want to be able to produce something to contribute to your team’s goals within a reasonable amount of time (which varies from team to team, but typically would not exceed six months).</li> <li><strong>Could the team’s future direction lead to what you care about?</strong> This is difficult to predict, but not entirely unpredictable. Given the team’s size, current makeup, curiosity-driven or product-driven nature, the product(s) it currently supports, if any, and the technical assets it has accumulated over the years, you can make an educated guess about what the future might and might not hold for the team. 
It’s unlikely that a product-driven large language model (LLM) team suddenly reinvents itself to do medical imaging, but more likely that they will build more interesting things on top of LLMs. When joining Amazon, I knew I would be interested in knowledge-intensive conversational agents, which is where my research interests were leading me <a href="https://aclanthology.org/2023.findings-acl.385.pdf">at the time</a>. I joined a team that worked on product-driven research for <a href="https://aws.amazon.com/kendra/">an enterprise search engine</a>, which I reckoned would be a critical part of knowledge-intensive chatbots down the line. Fortunately for me, <a href="https://aws.amazon.com/q/">this turned out to be true</a>, in fact much earlier than I had anticipated, though it was never guaranteed to happen.</li> </ol> <h3 id="company-size">Company size</h3> <p>How does company size affect your experience as an industry researcher? The best analogy I can think of is ships at sea.</p> <p>Ships (or even fleets) with a larger crew are often larger in size and heavier, which makes them less vulnerable to rocks and waves. This crew probably didn’t all materialize onboard at the same time, so there are a lot of folks with a lot more experience with the ship, and everything you do likely has precedents and standard operating procedures to follow, for the safety of the ship and your fellow crew members. To effectively organize such a large crew, there are typically layers of command hierarchy. You typically have clear guidance on how to rise through the ranks, and everyone’s responsibilities are reasonably separated and confined. Unless you are already near the top of the chain of command, any individual’s effort (or lack thereof) has a very limited effect on the ship’s heading, course, or speed. But when you are bored with one post, there are usually plenty of other posts on the same ship you might be interested in.</p> <p>On a smaller ship, things are quite different. 
There is a lot less organization and hierarchy to speak of; everything is very new, and you are usually welcome to solve “other people’s problems” and be recognized for doing so. There’s a hole in the hull and nobody’s around to fix it? You can learn to fix it and help keep the ship afloat for longer. There’s no mast? You can build the mast and be the go-to mast person from then on. Rank progression usually comes with increased tenure and a growing crew – but even if you make Captain or First Officer, to put it in perspective, it’s still a small ship, and a tropical storm can still tear it into oblivion with a few wrong moves.</p> <h3 id="teamprojectproduct-maturity">Team/Project/Product maturity</h3> <p>The maturity of your team, project, and product functions very similarly to company size, where larger teams, longer-running projects, and more mature products can provide more stability and sometimes more research freedom in exchange for opportunities to potentially make bigger differences.</p> <p>Within the same company, the differences between mature and new teams can be, say, a 2-3x difference in risk/reward profile, where starting new things doesn’t usually work out, but when it does, it can come with raises and/or promotions. To put this in perspective, though, company size is a much larger multiplier on this profile, which can be as much as 10-100x.</p> <p>One note about mature teams and businesses is that they are also at higher risk of being leapfrogged by emerging technology, business models, alternatives, etc., because their vision can be fogged by their own early success and they end up spending every possible resource on doing the same thing 1% better. 
It doesn’t happen that often, but when it does, mature teams and businesses are usually too slow to turn and sometimes face catastrophic failures (see the story of American retailer <a href="https://en.wikipedia.org/wiki/Blockbuster_(retailer)">Blockbuster</a> in the face of Netflix’s rise).</p> <h3 id="business-direction">Business direction</h3> <p>I’m probably not the first or the last person to use the phrase “think like an investor” when it comes to choosing (private sector) jobs. Even if you can’t think like a seasoned investor, this prompt can help you think outside of your researcher box and view your job like, well, (part of) a business. Just like your job can’t exist outside of the context of your company’s products and business, your team’s product and your company can’t exist without the business and economic context it is in. Does your company’s business make sense? Is your team building a product that people understand and will want to buy? Is your team supporting a product that has sustained demand, or is it on its way out to be replaced in the market? Is your team supporting a research project that has significant financial value in its future? What’s the best or the worst that could happen to your company and your team’s work, and how would that impact the company financially? A lot of this ties directly back to the stability of your research environment and the security of your job, so it doesn’t hurt to think for a moment, especially if your professional insight as a researcher can make any difference.</p> <h3 id="people">People!</h3> <p>By joining an industry research team, you are typically joining a (small or large) group of other researchers who are working towards a shared goal. You will see these people all the time in your day-to-day, collaborate with them, and if you are lucky, even make friends with them. 
The sole researcher mode typically does not play out well in the industry, so you almost always have co-workers that can significantly influence your day-to-day experience (positively or negatively).</p> <p>Do you find them inspiring? Are they the people you’d like to spend time with on things you care about, be it deeply philosophical discussions, solving challenging real-world problems, overcoming difficulties in research, or just chatting about life in general?</p> <p>Would you find enough support where you need it? Do you have people to guide you through challenges when needed, to pose thought-provoking challenges to your ideas, to help you navigate the corporate environment, or to assist you in finding good projects to work on?</p> <p>Do your prospective team members have the relevant expertise, experience, methodologies, and/or knowledge it takes to achieve the team’s goals? Are they struggling on key problems they need to solve without new additions like you, or are the basics covered reasonably well, and they are looking to expand research horizons?</p> <p>Would you have enough room to grow? Are people helpful enough but not overwhelming, willing to give and share credit, easy to communicate with and low on ego? Is the team growing in size or hiring interns consistently, so that you can get the chance to potentially mentor others at your work?</p> <p>These are just some sample questions you can ask about the people on your team, and the list can go on. Not everyone will write down the same list, but you likely have a list of characteristics you find important in research teammates. 
If you stumble upon a team where a lot of people match your ideal descriptions, definitely consider it very seriously – the team is likely doing something right to keep all of these amazing (at least to you) people around, and that is not very easy to achieve in a sustainable manner.</p> <h3 id="culture">Culture</h3> <p>One of the intangible assets that plays a big role in attracting and retaining people is culture. Put in my own words, culture is a community’s agreement on how we should do/prioritize things, either in written rules or manifested by members of the community and accepted as a norm. Culture has a huge influence on how we make decisions, especially during hard times, when those decisions are more consequential.</p> <p>You might have heard conflicting information about the culture of companies, organizations, and the specific teams within them. These can all be true at the same time, because after all a community’s culture is determined by how an average person operates within that community – when you take averages over different populations, the result can be quite different. In this view, it is also unsurprising that culture will gravitate towards that of the underlying larger group, or <a href="https://en.wikipedia.org/wiki/Regression_toward_the_mean">regression toward the mean</a>. Small pockets of distinct and very different cultures can exist for a period of time, but they typically take someone (which could be all members of that small community) significant effort to maintain. 
When they come into contact with others in the larger community, it is usually easier for the larger population to influence the smaller one, not the other way around.</p> <p>Talking to prospective team members and learning what their experiences are like on the team and how they approach various situations can be a good way to help you gauge team/company culture to some extent.</p> <h3 id="closing-thought-keep-an-open-mind-and-embrace-change">Closing thought: keep an open mind and embrace change</h3> <p>With everything mentioned in this post, it is important to keep a <strong>growth mindset</strong>. As Heraclitus famously put it, “There is nothing permanent except change”. The environment around you can change and you cannot perfectly predict it. What’s perhaps surprising to many is that your own interests (both research-wise and career-wise) can change faster than you might anticipate. Always aim to gather as much information as you need, since as companies are screening you, you are also choosing among your options. With this information, make your best guess of which chocolates you might like in the future, and which choices today would most likely lead to boxes with those chocolates. It’s helpful to remember jobs don’t have to be permanent – so these decisions might not be as consequential as they seem. Don’t overthink it!</p> <h4 id="footnotes">Footnotes</h4> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:1"> <p>Not quite as powerful as the <a href="https://roalddahl.fandom.com/wiki/Golden_Ticket">Golden Ticket</a> when it comes to chocolates, thus silver. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:2"> <p>The US National Bureau of Economic Research publishes <a href="https://www.nber.org/research/business-cycle-dating">longitudinal studies on economic cycles</a>. If history is any guide, most people will encounter at least one of these down-periods in their career. 
<a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:3"> <p>This observation mostly applies to entry- to experienced-level jobs. It’s not uncommon for companies to appoint very accomplished academics to very senior research positions in the company, but that’s an entirely different topic. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="Research"/><category term="Industry"/><summary type="html"><![CDATA[You are about to embark on a journey as an industry researcher, or are ready for your next adventure. What can help you choose the most suitable team for yourself?]]></summary></entry><entry><title type="html">What do industry researchers do, anyway? Part 0 – Academia vs Industry</title><link href="https://qipeng.me//blog/what-do-industry-researchers-do-part-0/" rel="alternate" type="text/html" title="What do industry researchers do, anyway? Part 0 – Academia vs Industry"/><published>2024-06-26T05:00:00-07:00</published><updated>2024-06-26T05:00:00-07:00</updated><id>https://qipeng.me//blog/what-do-industry-researchers-do-part-0</id><content type="html" xml:base="https://qipeng.me//blog/what-do-industry-researchers-do-part-0/"><![CDATA[<figure> <center> <img src="/blog/what-do-industry-researchers-do-part-0/lauren-mancke-aOC7TSLb1o8-unsplash.jpg" width="90%"/> </center> <figcaption>Image by Lauren Mancke (via Unsplash).</figcaption> </figure> <p>You are embarking on a grand journey called pursuing a doctorate, have been immersed in it for a couple of years, or are almost Ph-Done and cannot wait to pursue your dreams after the program.</p> <p>Many would imagine themselves on the path that brought them into the doctorate program in the first place – continuing to pursue a research career in academia or the industry. 
You have a general notion of what they might entail to a first order of approximation (in fact, Google did a pretty good job summarizing it, see below). But in the back of your mind, you cannot help wondering, would <a href="https://psychcentral.com/blog/relationships-balance/2013/03/16/the-grass-is-greener-syndrome">the grass be greener on the other side</a>?</p> <figure> <center> <img src="/blog/what-do-industry-researchers-do-part-0/google-answer.png" width="75%"/> </center> <figcaption>How Google's Generative AI feature answered the query "differences between academic and industry researchers".</figcaption> </figure> <p>In this post, I hope to record some of my observations along the way to help highlight some factors of consideration I would have shared with my younger self.</p> <h3 id="why-might-you-want-to-continuestop-reading">Why might you want to continue/stop reading</h3> <p>Before sinking more valuable minutes into this post, you might want to ask, who is this for, and why should I trust the author’s experience to be relevant to my own? Both of these would be very relevant questions.</p> <p>A bit about myself at the time of writing: I have been an industry researcher for a little over 3.5 years. My time has been split between two companies and two very different roles, one where the primary goal was to publish academic research, the other more on the applied side with a focus on putting great technology into products. Before these “real jobs”, I finished a PhD in Computer Science at Stanford, did a research-focused Masters before that, and did some undergraduate research when people around me were still not sure what to make of “cloud computing” and “deep learning” (are they simply just larger compute clusters and deeper versions of the same neural networks that nobody teaches in undergrad courses any more since they didn’t work? 
Spoiler: the answers are no and no).</p> <p>In these experiences, I was fortunate to have worked with a good number of amazing research mentors and peers on very different research topics and problems, published <a href="/publications/">a few papers</a>, some of which – fortunately for me – you might have even heard about. I have worked on and witnessed many research projects that never saw the light of day, and seen my fair share of faculty interviews both as a student interviewer and a faculty interviewee. On the other hand, I have also seen how research is done in industry labs as a research intern, a many-a-time intern mentor, a researcher as part of a small research team that published on diverse topics, and a “founding” applied scientist on a more product-focused team that helped launch <a href="https://aws.amazon.com/q/">a brand new product</a> (almost) from scratch, while publishing some fun papers along the way.</p> <p>Now that’s out of the way, let’s get on with what this post is about.</p> <h2 id="the-less-talked-about-similarities-between-academic-and-industry-research">The less talked-about similarities between academic and industry research</h2> <p>Before covering some of the differences between industry and academic career paths, I wanted to first reflect on what feels quite similar between the two.</p> <p>Both academic jobs and industry jobs are just that, jobs. Freshly minted PhDs are usually equally surprised by the job-like nature of these, which typically comes with more responsibility and very real-world “consequences”. I think this stems from two common factors.</p> <p>The first one is that <strong>you don’t work alone any more</strong>. Gone are the days when you were the only one who depended on your PhD thesis and research progress, when you could spend hours and hours noodling over the same problem, and when, if you missed a paper deadline, there always seemed to be another one right after. 
In both academic and industry jobs, you work with a number of other people towards a shared goal, and they will sometimes depend on you for resources, output, time commitments, or all of the above. This means you spend a lot more energy on making predictions of your own output, as well as making sure you make progress as predicted or communicate early on otherwise. This also means you potentially spend a lot more time and energy on your responsibilities to budget, plan, and allocate resources (time, $$$, people) to make sure things go roughly as planned.</p> <p>The second factor is <strong>simple economics, where if something has high unit cost, you get nothing produced by splitting that investment <em>n</em> ways</strong>. Since most meaningful research projects are resource intensive, society/private companies simply cannot fund everything that comes to everyone’s mind. As a result, we usually gravitate to the simple heuristic that “past performance predicts future success”, and use people’s track record to establish our trust in them and support their research endeavors accordingly. This imperfect heuristic sometimes makes successes and failures in research jobs feel more consequential than the handful of papers you publish as a PhD student, since you are no longer shielded behind the great buffer that was your advisor for resources to do your research – you’re a little more on your own (though great communities exist in both jobs to soften this).</p> <p>A third common factor that might not be obvious to fresh PhDs is that regardless of whether you go to academia or industry, <strong>the chances you work with domain experts in your day-to-day likely diminish</strong>. This could be viewed as a corollary of not working alone any more, because as you work with more people, you are also typically exposed to people with different backgrounds. 
You get a glimpse of this during your job talk, where even the really smart faculty members of a department you are visiting probably didn’t fully understand everything you said, or where the senior researcher from a neighboring group is fascinated by details that are mundane to insiders and aren’t even your contribution. But there is always more. Great collaborations during your PhD are often cherished, because it’s rare to have someone understand a topic as deeply as you do and together make exciting contributions that inspire all of you, and because “you will eventually write different theses” so it won’t last forever. This is a bit more pronounced for academics, but both academic and industry jobs entail a lot more interactions with non-experts (your university admins, senior colleagues, grant reviewers, engineers, product managers, etc.). In both of these cases, the ability to communicate with someone outside the comfort of your scientific language and jargon becomes really important to get the point across, to manage expectations, and to meet them where they are – which is their professional area of expertise – to achieve your common goals.</p> <p>I should note that many of the things described in this section sound like <em>managerial duties</em>, because they are. These are typically a lot more pronounced for junior faculty compared to industry researchers right after their PhD, since industry folks typically don’t get to “run a lab/team” until much later (and many choose not/never to).</p> <h2 id="two-roads-diverging-in-a-wood-where-academic-and-industry-research-jobs-differ">Two roads diverging in a wood: where academic and industry research jobs differ</h2> <p>Having covered the commonalities, I also wanted to reflect on some of the underlying factors that set the two apart.</p> <p>First and foremost, academia and industry have <strong>different definitions of goals</strong>.
This is not the simplified argument that “industry is profit-driven and everything contributes to that”, because that usually elicits a very skewed and cynical image for junior researchers facing this choice at this juncture (or even seasoned ones that have been in academia or the industry for a while). I’d like to think of it more as: academia pursues original knowledge and that’s the end goal in and of itself, while industry is always trying to put things in the hands of actual people in the long run. This means academia has a lot more freedom to pursue things that help us understand things better (but not necessarily do them better), that do not have practical implications for decades or generations, or that are scrappy and do not extrapolate well. Industry labs, on the other hand, are typically a lot more focused on the real world, and sometimes even with a reduced emphasis on originality. As a result, the outcomes from research endeavors and their runways (time/resources taken to build the research projects) are perceived wildly differently between the two. In fact, it is likely that you’ll see goals defined a lot more homogeneously across different academic groups compared to industry research units, where the latter can be tasked with research of different time horizons and quality expectations to produce a usable product.</p> <p>Second, academia and industry often adopt <strong>different measurements of impact</strong>. Impact is a stamp of approval for the “goodness” of your work, and is often associated with good opportunities. While academia is more about “how much new research is enabled and inspired by your work” (often proxied with the problematic metric of citations), industry typically focuses more on “how many people or $$$ are affected, and how soon”, since that’s what keeps the endeavor afloat and pays the bills. As a result, academia and industry often take very different strategies to tackle similar problems.
For instance, if something’s not making progress soon enough and resources are the main bottleneck, it’s often much easier for industry research labs to “ramp up resources” and throw them at the problem, while academic researchers might spend more time on optimizing the process in place. In other cases, while academic researchers might firmly stick to challenging problems for years till fruition, the industry typically waits till the first inklings of practical implications from such original research before making a substantial move. While both are by and large really good at what they do, each of these measures of impact, when taken to the extreme as the sole focus, can lead to troubling outcomes.</p> <p>Last but not least, academia and industry have <strong>very different scales of operation and connection to real users</strong>. The first might be easier to understand. While most PhD students played “full-stack engineers” or equivalents in their own research projects, there is only so much a single person can do in 8-hour workdays, and users ask for a lot more than your scrappy prototype. In an actual product, you typically work with a number of folks that are more experienced and talented at various stages of that full stack than your bandwidth and expertise can ever hope to scale to. As a result, the industry has some of the most trusted products that real people use and often give feedback on, which no amount of lab surveys and simulations can replace – which is the second point. This also means that academics will never get to work on some of the “messy” yet challenging problems, both researchy and non-researchy ones, that are brought about by scale, or more often simply by the real-world distribution of data and problems.
Needless to say, this access is accompanied by great responsibility, and user needs usually demand more than careful research solutions allow time for, which industry researchers often need to grapple with.</p> <h2 id="so-what-does-this-mean-for-you">So, what does this mean for you?</h2> <p>If you’ve read this far, hopefully this post has provided some useful frameworks to think about your choice and helped you get a sense of what to expect. I have personally been fortunate that my research interest lies in a wide spectrum that spans from “this is an interesting idea for the idea’s sake” to “let’s make a thing that works”, which helped expose me to both paths somewhat. At the end of the day, you might ask what excites you more at the moment, what you will likely continue to enjoy the most, and make your choice accordingly. In the next post, I’ll try to cover how I think you can get the most out of selecting a team in the industry, which hopefully gives more practical suggestions on making this choice.</p>]]></content><author><name></name></author><category term="Research"/><category term="Industry"/><summary type="html"><![CDATA[What does an industry researcher actually do, and how does it differ from academia?
Should I go for academia or the industry after my PhD?]]></summary></entry><entry><title type="html">What does an area chair actually do, anyway?</title><link href="https://qipeng.me//blog/what-does-an-area-chair-do/" rel="alternate" type="text/html" title="What does an area chair actually do, anyway?"/><published>2021-04-05T05:00:00-07:00</published><updated>2021-04-05T05:00:00-07:00</updated><id>https://qipeng.me//blog/what-does-an-area-chair-do</id><content type="html" xml:base="https://qipeng.me//blog/what-does-an-area-chair-do/"><![CDATA[<figure> <center> <img src="/blog/what-does-an-area-chair-do/jon-tyson-kGUmNEYaSMY-unsplash.jpg" width="90%"/> </center> <figcaption>Image by Jon Tyson (via Unsplash).</figcaption> </figure> <p>In case you were wondering: no, it’s not <em>that</em> comfy to be an actual area chair, though you might get to enjoy a lot of reading!</p> <h2 id="what-is-an-area-chair">What is an area chair?</h2> <p><strong>Area chairs</strong> (AC) are volunteer service positions at academic conferences (or large workshops) that are tasked with overseeing a small part of the peer review process. These positions are typically associated with an area of expertise for the chairperson, thus the name.</p> <p>Despite the word “chair” in the name, in large academic conferences today there are often a large number of area chairs to share the responsibility of making sure thousands of papers are properly reviewed in time. Some conferences have introduced the role of <strong>senior area chairs</strong> in recent years, which are usually filled by seasoned researchers, to provide better uniformity in review outcome by aggregating information from area chairs. 
At the very top, a conference is usually run by a few <strong>program chairs</strong> and <strong>general chairs</strong> that are responsible for steering the conference and making sure every aspect of it happens as smoothly as possible (see <a href="https://www.cs.jhu.edu/~jason/advice/how-to-chair-a-conference.html">Jason Eisner’s post</a> for more details about what a program chair does – spoiler alert: it involves great power and great responsibility).</p> <p>In short, program chairs and general chairs oversee the entire review process via senior area chairs among other conference logistics; senior ACs gather information from ACs, who in turn are responsible for direct interactions with reviewers on individual papers.</p> <p>To the best of my knowledge, none of the chairpersons actually receives a piece of comfortable furniture for doing what they do.</p> <h2 id="what-does-an-area-chair-do">What does an area chair do?</h2> <p>In short, everything that <em>directly</em> interacts with authors, papers, and reviewers in the review process.</p> <p>If you search for “what does an area chair do”, the first few results will typically contain the following responsibilities, which vary slightly from event to event (sources: <a href="https://project.inria.fr/wifs2017/the-role-of-area-chairs/">WIFS 2017</a>, <a href="http://www.pamitc.org/cvpr13/ac_guidelines.php">CVPR 2013</a>, <a href="https://nips.cc/Conferences/2018/PaperInformation/ReviewerACSACGuidelines">NeurIPS 2018</a>):</p> <ul> <li><strong>Promote your conference/area/workshop.</strong> This is typically true for smaller/newer events like workshops.
Many larger conferences today have dedicated publicity/social media volunteers to help with this.</li> <li><strong>Recruit reviewers.</strong> A lot of this work has been shifted to senior ACs when the SAC role exists, who are also responsible for recruiting ACs.</li> <li><strong>Assign reviewers to papers.</strong> This is usually done with the help of automated assignment software that matches reviewers to papers that are suitable for their review interests and past experience (usually evidenced by their published work). For NAACL 2021, where I served as an area chair for the first time, this was done entirely at the SAC level. The review process is double-blind at the AC level.</li> <li><strong>Chase late reviews.</strong> Life happens and people forget. Sometimes assigned reviewers are not able to submit their reviews on time – it is the AC’s job to reach out and remind them to submit reviews. If that fails, conferences usually have a reserve pool of “emergency reviewers” they can contact for a quick-turnaround review.</li> <li> <p><strong>Facilitate discussion after author response.</strong> Once all initial reviews are finished, conferences typically have a period of “rebuttal” or “author response” so that authors can address reviewer questions and clarify confusions, if any. Once author responses are submitted, the area chair is responsible for engaging reviewers to read the author responses, incorporate them into their final review of the paper, and hopefully converge on a consensus regarding the paper’s acceptance recommendation (no, consensus doesn’t always happen).</p> <p>Area chairs also occasionally receive confidential feedback from authors on review quality, which they need to take into consideration when moderating discussion.</p> </li> <li> <p><strong>Write meta-reviews and make acceptance recommendations.</strong> Meta-reviews are summaries of highlights (and lowlights) in all reviews regarding a submission.
They are useful for senior area chairs to calibrate between AC recommendations, and for the authors to understand the high-level gist of the reviews if they are released.</p> <p>While area chairs cannot directly determine whether a paper is accepted or not, they do have a large influence through their initial aggregation of reviewer stances. ACs are typically tasked with making a recommendation for each submission, which is then calibrated by SACs before it is presented to the program chairs for a final decision.</p> </li> </ul> <p>In a typical conference, an AC will oversee 10-20 submissions, and most of the work is concentrated within a small window of a few weeks, when they are expected to be highly responsive.</p> <h2 id="what-does-an-area-chair-actually-do">What does an area chair <strong>actually</strong> do?</h2> <p>If you are still reading at this point, you are probably like my wide-eyed self a few months back, wondering what life will be like after you’ve earned the degree you have worked years towards, especially what “growing in seniority” in the academic world looks like.</p> <p>The list of responsibilities gives you a rough sense of <em>what</em> is going to happen, and turns out to be quite accurate. However, I find it insufficient in preparing aspiring ACs for <em>how</em> to best do their job, and what to expect in the process, which is usually expected to be picked up on the job.</p> <p>In the interest of transparency and to save future ACs and SACs the last-minute scramble, I’d like to share some of my personal experience about what things look like in terms of actual work done.</p> <ul> <li><strong>Promote the conference.</strong> I didn’t do much of this as an AC for NAACL, which has its dedicated publicity chairs (which I happen to be one of for NAACL, but that’s a topic for another day).
If you’re promoting a new-ish workshop, expect to ask around for access to post on large mailing lists (e.g., the one for ACL members and others for similar professional organizations), draft mass emails, and send them out. Social media is also helpful, but you would want to seek help from existing outlets that have a good number of followers.</li> <li><strong>Recruit reviewers.</strong> I didn’t have to do this either, and I understand that this falls largely on the SACs (who also need to recruit ACs). This usually involves the aforementioned large mailing lists with a survey form to gather initial reviewer information.</li> <li> <p><strong>Assign reviewers.</strong> ACs usually aren’t involved in this process if SACs are available in the review process, because the reviewing process is double-blind at the AC level (i.e., ACs don’t know who the authors are, and vice versa). For conferences with paper bidding, reviewer preference is considered jointly with their expertise. Then SACs solve a constrained bipartite matching problem to sort out conflicts of interest, reviewer capacity, etc., before handing papers with reviewer assignments to the ACs.</p> <p>Once reviewer assignment has been made, ACs have a job responsibility that is not explicitly covered in the bullet points: sanity-checking reviewer assignment. There is typically a window of a few days when ACs are tasked to ensure each paper has at least one experienced reviewer assigned, and that their general expertise is relevant to what the paper is about. If that were not the case, ACs should notify SACs to adjust the assignment accordingly (usually by exchanging a couple of emails).
This involves a lot of searching for people’s background and experience, where a quick link to someone’s Google Scholar (or similar) profile can really help.</p> <p>Review assignment is finalized before the official reviewing window begins for reviewers to read their assigned papers and write up reviews.</p> </li> <li> <p><strong>Chase late reviews.</strong> Once the official review window has passed, there is typically a short window of buffer time for ACs to chase late reviews, and find emergency reviewers to fill in if possible. Chasing late reviews involves sending <em>individual</em> emails to each late reviewer regarding the paper they are assigned, and gently nudging them to submit the review as soon as possible. Using individual emails is important because some conferences choose to be triple-blind (i.e., reviewers don’t know each other’s identity), and keeping different papers in different email threads helps you, the AC, better manage your pool of submissions. There will likely be more late reviews than you would expect, because (in part) life happens and people have unexpected obligations come up. I sent out my “chase” email the day after the official review window, which gives reviewers time to respond, and myself time to act upon their response (or lack thereof).</p> <p>After the AC’s chase email, some reviewers will respond that they will (a) submit late but still before author response, which is great, or (b) not be able to review due to unexpected conflicts, which is not so great, or (c) not respond at all. When the latter two happen, an AC would have to find alternatives – <em>emergency reviewers</em>. Each conference would have collected a list of emergency reviewers, and the AC’s job is to find them to (heroically) fill in when the initial reviewers cannot make it.
An AC would need to sift through the list of emergency reviewers, and essentially redo part of the reviewer assignment process but with less time – look for reviewers that are likely qualified and available (many will not be available or responsive, unfortunately).</p> <p>Once emergency reviewers are identified, ACs need to reach out to these emergency reviewers for their consent to help review. I also shared paper titles so emergency reviewers can judge for themselves whether they would be interested. If the reviewer agrees, ACs would then need to float the name to SACs to add them to the paper and check for conflicts of interest in the process (which ACs cannot check due to the blind process). Once that is done for all of the papers missing reviews, ACs will need to work closely with emergency reviewers to make sure they are able to submit reviews on time. Balancing workload for emergency reviewers is also key given the extremely short time frame.</p> </li> <li> <p><strong>Facilitate discussion.</strong> After the authors have had a chance to respond to initial reviews, ACs typically need to find consensus among reviewers regarding their recommendation of the paper. If there wasn’t enough time prior to sharing reviews with authors, this is also a great time for ACs to call out reviewers if their evaluation is not clear, or if their criteria are not compatible with reviewer guidelines.</p> <p>I personally encouraged all of my reviewers to read the author response, and update their final reviews accordingly to reflect the fact that they have read it even if the reviewers are already in agreement. If they are not, an AC can help reviewers elaborate their respective points by asking specific questions in the discussion, and helping refine their reviews to be more objective and provide constructive details that can help authors.</p> <p>It is not uncommon, however, for reviewers to fail to participate in the discussion or update their reviews.
In these scenarios, aside from gentle reminders in the discussion, an AC would also need to take a closer look at the reviews, author response, and even the paper itself to seek answers for how to objectively evaluate the submission.</p> </li> <li> <p><strong>Write meta-reviews.</strong> If an AC made good use of the discussion period to understand the paper and reviewer opinions, writing the meta-review should be a good opportunity to summarize the high-level takeaways for the SACs to make recommendations to the program chairs. In conferences where meta-reviews are shared with the authors, this is also a good place to give the authors a constructive tl;dr for the final reviews.</p> <p>Depending on the rules of the conference, the AC might still be able to discuss with reviewers during this period to clarify some points, but really most of the work is already done, and ACs are just writing up a coherent summary. I personally found maintaining a bullet point list of pros and cons for each paper in the reviews and the discussion to be helpful.</p> <p>Finally, ACs will need to calibrate among all the papers they are in charge of overseeing, and make initial acceptance/rejection recommendations for SACs to further aggregate and calibrate. There can still be a lot of “maybe”s at this stage except for very clear accepts/rejects.</p> </li> </ul> <h2 id="takeaways">Takeaways</h2> <p>Seeing that the post is already growing a lot longer than I had anticipated, I thought I would end with some short but concrete takeaways for future reviewers, ACs, SACs, and program chairs to consider.</p> <ul> <li> <p><strong>Reviewers.</strong> If you have research experience and especially have published, you are highly encouraged to sign up to review somewhat regularly!
If you haven’t reviewed much, you are encouraged to seek mentorship (ACL 2021 is taking a <a href="https://2021.aclweb.org/blog/reviewer-mentoring-program/">great step</a> in this direction).</p> <p>In the meantime, please try to anticipate potential bandwidth constraints when agreeing to review. Bottom line: please try to be responsive during the period you had agreed to review, so that ACs can at least look for your replacement with sufficient notice.</p> </li> <li> <p><strong>ACs.</strong> If you have been invested in the reviewing process and think you might be eligible, reach out to a senior AC in your area of expertise! You might be surprised by a warm welcome – we are all struggling to keep up with the growing number of submissions these days.</p> <p>Once you are serving as an AC, please try to be more responsive than if you were a reviewer, as this could really help improve the entire review process. But also keep in mind that you will likely stumble upon unresponsive reviewers or emergency reviewers. Give yourself a lot of lead time and send out 2x the emails at once if needed, because response takes time. Working with other ACs in your area might also help in the event of an emergency reviewer conflict of interest – you might end up being able to “exchange” emergency reviewers who are already responsive and willing to review.</p> </li> <li> <p><strong>SACs.</strong> Please consider sharing as many resources as possible ahead of time with ACs, including training material (if any) and emergency reviewer info.
This could help reduce the anxiety caused by unnecessary back-and-forth communication during the review process.</p> <p>My NAACL SACs did a great job at providing what is available, but I do wish there were a bit more training one could complete beforehand so I could start the job better prepared.</p> </li> <li> <p><strong>Program chairs.</strong> This is more of a comment about the reviewing software/features to consider, rather than about the chairs themselves, who I believe are doing a great job within their capacity. While the overall process was smooth for me, I found myself struggling from time to time with small issues that prevented an even smoother experience. I have personally experienced this in two places:</p> <p>The first involves transparency regarding withdrawn papers. The information about desk rejects and withdrawals during author response was either not shared until a few email exchanges later, or not available in a prominent interface to ACs/reviewers. A “simple” status change in the review system can easily result in hundreds of emails and the delay/confusion associated with them, given the size of our conferences nowadays.</p> <p>The second involves reviewer interaction and reviewer conflicts of interest. Never having been on the other end of the “chase” email, I had never anticipated having to send them one by one from my personal mailbox, filling paper information into the email template I drafted for myself. I was also surprised to find out that one of the reviewers never agreed to review in the first place. When it comes to emergency reviewers, there was no easy way to check for COI other than floating names to SACs, who are authorized to add them or tell us if there had been a COI.
While the need for a centralized management of sensitive information is very much understandable, it is at these times I couldn’t help but think how much time a semi-automated system could save everyone.</p> </li> </ul> <p>Overall, I am excited to have had the opportunity to serve as an area chair, and to be able to share my experience that hopefully helps others. I am also encouraged that the field is actively taking measures to improve the review process (shout out to <a href="https://aclrollingreview.org/">ACLRollingReview</a>). Can’t wait to see how we can collectively improve the reviewing/publishing process in the next few months/years!</p>]]></content><author><name></name></author><category term="Academic"/><category term="AreaChair"/><summary type="html"><![CDATA[What does serving as an AC actually entail, and what can be improved in our review process?]]></summary></entry><entry><title type="html">Teaching Conversational NLP Systems to Ask Informative and Specific Questions</title><link href="https://qipeng.me//blog/learning-to-ask/" rel="alternate" type="text/html" title="Teaching Conversational NLP Systems to Ask Informative and Specific Questions"/><published>2020-10-19T07:00:00-07:00</published><updated>2020-10-19T07:00:00-07:00</updated><id>https://qipeng.me//blog/learning-to-ask</id><content type="html" xml:base="https://qipeng.me//blog/learning-to-ask/"><![CDATA[<figure> <center> <img src="/blog/learning-to-ask/louis-renaudineau-z61wzvMqM0U-unsplash.jpg" width="90%"/> </center> <figcaption>Image by Louis Renaudineau (via Unsplash).</figcaption> </figure> <p>“Why is the sky blue?”</p> <p>Asking questions is an essential part of our process to learn about the world from those with more experience roaming the earth, like our parents, mentors, senior colleagues, etc. 
Asking the right questions in a conversation could mean the difference between a focused, in-depth, and meaningful exchange of information and an inefficient, shallow, and pointless discussion.</p> <p>How do we ask questions in a conversation to gather information effectively on a topic that we have limited knowledge of? More importantly, can we train computers to learn knowledge from conversations?</p> <p>Let us begin by understanding the desiderata for good questions to gather information in a human conversation. Among other qualities, good questions typically need to be both <strong>informative</strong> and <strong>specific</strong>.</p> <p><strong>Informativeness</strong> is an essential quality for questions to reveal information that is unknown to us from the dialogue thus far. For example, consider learning about falafels in a conversation, while only knowing that they are a kind of food. An informative question would be, for example, <em>“What are falafels typically made from?”</em> (the answer is <em>“chickpeas and fava beans”</em><sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>). After this question has been answered, an informative follow-up question might be <em>“How are they usually cooked?”</em>, which potentially reveals new information about falafels. In contrast, a question like <em>“Do they sometimes contain fava beans?”</em> would not be very informative since it has essentially been answered already.</p> <p>Besides being informative, good questions are often <strong>specific</strong> to the topic under discussion as the conversation progresses, which makes them more natural. For instance, informative questions can sometimes be very generic (e.g., <em>“What is interesting about falafels?”</em>), which are usually more difficult to answer (compared to, e.g., <em>“Where do falafels originate from?”</em>).
On the other hand, sharp shifts in topic like <em>“Where do fava beans originate from?”</em> are also less than ideal.</p> <table> <thead> <tr> <th style="text-align: left">Question</th> <th style="text-align: left">Answer</th> <th style="text-align: left">Informative?</th> <th style="text-align: left">Specific?</th> </tr> </thead> <tbody> <tr> <td style="text-align: left">Do they sometimes contain fava beans?</td> <td style="text-align: left">Yes</td> <td style="text-align: left"><span style="color:red">No</span></td> <td style="text-align: left"><span style="color:green">Yes</span></td> </tr> <tr> <td style="text-align: left">Does hummus contain ingredients other than chickpeas?</td> <td style="text-align: left">Yes, they are made from cooked, mashed chickpeas blended with tahini, lemon juice, and garlic.<sup id="fnref:2"><a href="#fn:2" class="footnote" rel="footnote" role="doc-noteref">2</a></sup></td> <td style="text-align: left"><span style="color:green">Yes</span></td> <td style="text-align: left"><span style="color:red"> No</span> (off-topic)</td> </tr> <tr> <td style="text-align: left">What is interesting about the origin of fava beans?</td> <td style="text-align: left">They are of uncertain origin.<sup id="fnref:3"><a href="#fn:3" class="footnote" rel="footnote" role="doc-noteref">3</a></sup></td> <td style="text-align: left"><span style="color:green">Yes</span></td> <td style="text-align: left"><span style="color:orange">Somewhat</span> (too generic)</td> </tr> <tr> <td style="text-align: left">How are they cooked?</td> <td style="text-align: left">They are deep-fried.</td> <td style="text-align: left"><span style="color:green">Yes</span></td> <td style="text-align: left"><span style="color:green">Yes</span></td> </tr> </tbody> </table> <figcaption>Illustration of informativeness and specificity for some example questions in a conversation that follows <em>"What are falafels typically made from?"</em>.</figcaption> <p>Can we train computers to ask 
informative and specific questions, so that they can learn about arbitrary topics by conversing with a human in an engaging dialogue?</p> <p>The natural language processing (NLP) research community has long sought to teach computer systems to ask questions. One line of successful research efforts is concerned with asking questions about given material that contains the answer, where questions are intended to examine the answerer’s understanding of the material (also known as “reading comprehension”). Roughly speaking, the goal is typically to transform statements into a question, e.g., from <em>“Falafel is a deep-fried ball or patty made from ground chickpeas, fava beans, or both.”</em> to <em>“What are falafels made from?”</em>. However, such approaches are not applicable when the answer is not already available to the asker in some form.</p> <p>Another prominent direction of research involves defining the template of information necessary to achieve an end goal, to ask the right questions about what is not already known to the asker (known as “goal-oriented dialogue”). Examples in this direction include ordering takeout from a restaurant or booking flight tickets for travel, where the typical set of information necessary to achieve the goal (the successful ordering or booking) can be well-defined. While this allows computer systems to inquire about facts, the need to define these information templates is non-trivial, if not impossible, for many applications (e.g., troubleshooting a complex system).</p> <p>In our new work, “<a href="https://arxiv.org/pdf/2004.14530.pdf">Stay Hungry, Stay Focused: Generating Informative and Specific Questions in Information-Seeking Conversations</a>,” we set out to tackle this challenge from a brand new angle. We study a challenging new setting for question-asking computer systems: open-domain curiosity-driven dialogues.
Specifically, we consider conversations between two agents (each a human or a computer), one a curious <strong>student</strong>, the other a knowledgeable <strong>teacher</strong>. Given a shared topic the student has limited knowledge of (e.g., falafels), the common goal of these agents is to help the student learn as much about the topic as possible by asking questions about it.</p> <p>This setting has some unique challenges that make it interesting:</p> <ul> <li><em>The correspondence between the input and output is very weak.</em> Specifically, the input (the conversation history and the shared topic) often bears very little resemblance to the desired output (the question), because, by definition, our goal is to gather new information that is not yet known. In contrast, NLP applications that have seen more success in recent years typically enjoy a much stronger input-output association (e.g., machine translation, text summarization, and generating questions from answer statements).</li> <li><em>The knowledge desired cannot be easily categorized exhaustively.</em> Compared to more closed-domain goal-driven conversations, the open-domain nature of our setting makes it much more difficult to rely on tabulating a few things to inquire about given each topic. As a result, the student must reason about what the teacher could potentially answer on the shared topic.</li> <li><em>There is typically more than one right question to ask, but the data cannot exhaust all valid options.</em> At any point in a curiosity-driven dialogue, there are usually many angles to ask questions that would reveal previously unknown information to the student. 
However, since dialogue datasets are collected with bounded human effort, it is impractical, if not impossible, to provide all potential questions that could have been asked naturally at any point in a conversation.</li> </ul> <p>This poses challenges in two directions: the lack of diverse signal to train the question-asking system, and the lack of a good evaluation metric. While most prior work on natural language generation can get away with evaluating the similarity between the generated output (the question, in our case) and the reference (the original question in the dataset asked by a human), questions that have virtually zero overlap can both be valid in our case (e.g., “How are they cooked?” vs “Are they typically served with other food?”, as follow-up questions to “What are falafels made from?”).</p> <p>In our work, we propose two new evaluation metrics to evaluate the <strong>informativeness</strong> and <strong>specificity</strong> of generated questions. In a nutshell, we consider a question more informative if its answer contains more information that has not been mentioned in other answers in the conversation so far, and more specific if it inquires about facts that are more unique to the topic under discussion and is raised in a coherent context.</p> <p>Taking these follow-up questions to “What are falafels made from?” (with its answer “chickpeas and fava beans”) as examples, we can intuitively evaluate how informative and specific they are, as illustrated in the table above.</p> <p>The <a href="https://quac.ai/">QuAC dataset</a> we use for this work contains human-human conversations on Wikipedia articles in an information-asymmetric setting similar to ours, but it provides little more than what human annotators asked and answered in each conversation.
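To make the informativeness idea concrete, here is a toy sketch in Python (our own illustration with a crude tokenizer and made-up function names; the actual system derives this signal from trained models, as described below):

```python
import re

def _tokens(text):
    """Lowercase word tokens (a crude tokenizer for illustration only)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def informativeness(candidate_answer, previous_answers):
    """Toy informativeness: the number of new words in the candidate answer
    relative to the most-overlapping answer seen earlier in the conversation."""
    cand = _tokens(candidate_answer)
    if not previous_answers:
        return len(cand)
    # Compare against the previous answer with the highest word overlap.
    most_overlapping = max(previous_answers,
                           key=lambda ans: len(cand & _tokens(ans)))
    return len(cand - _tokens(most_overlapping))

history = ["They are made from ground chickpeas, fava beans, or both."]
informativeness("They are deep-fried.", history)  # new facts score higher
informativeness("They are made from ground chickpeas.", history)  # nothing new
```

A question whose answer merely repeats known facts scores near zero, which is the behavior we want to discourage in the student.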
To provide our question-asking system (the student) with informativeness and specificity feedback similar to that in the table above, we propose to train two auxiliary models on the QuAC data to derive further supervision signal.</p> <figure> <center> <img src="/blog/learning-to-ask/quac-example.png" width="55%"/> </center> <figcaption>An example conversation between human annotators from the QuAC dataset. Figure taken from the <a href="https://arxiv.org/pdf/1808.07036.pdf">QuAC paper</a>.</figcaption> </figure> <p>For <strong>informativeness</strong>, since we are interested in knowing how much new information is revealed from the answer to each potential question, we took advantage of off-the-shelf question answering (QA) systems on this task. Specifically, we trained a QA system that, given a question and the conversation history, predicts answers from the teacher’s private source of knowledge that is not revealed to the student (usually a Wikipedia paragraph). At training time, we use this QA system to predict answers to potential questions that the student might ask, and define informativeness to be the number of new words this answer contains when compared to the answer with the highest textual overlap among all answers that the teacher previously provided. This discourages the student model from asking questions that would result in repeated information.</p> <p>For <strong>specificity</strong>, instead of relying on simple cues like word overlap, we take the approach of training a classifier to discern specific and non-specific follow-up questions, similar to the discriminator in the generative adversarial networks (GAN) literature.
Specifically, for each turn in the conversation where the student asks a question, we randomly sample frequent questions from other conversations in the dataset, as well as past or future questions in the same conversation (which we consider specific), and train a classifier to distinguish the actual follow-up question from the rest with high accuracy. Once this classifier is trained, we use the probability with which it predicts a potential question as specific as that question’s specificity score. This discourages the model from making jarring topic shifts or even digressing off-topic.</p> <p>To train our question-asking model (a sequence-to-sequence model) to generate questions that are both informative and specific, we first train it to maximize the probability of generating the reference questions in the QuAC dataset, then finetune the model with reinforcement learning to maximize our informativeness and specificity scores.</p> <p>The result? Our finetuned models fare better when evaluated by our informativeness and specificity metrics, while standard metrics like perplexity and ROUGE, which are commonly used in natural language generation evaluation, fail to tell these systems apart.</p> <p>To give you a more concrete idea of how these systems behave, here is an example where we list the original reference, the question from the baseline (non-finetuned) system, and the question from our system after incorporating informativeness and specificity. See for yourself which candidate questions you prefer (try to evaluate the questions before you read the answer, as the answer is only compatible with the reference question):</p> <table> <thead> <tr> <th>Background: Spandau Ballet (English band) <br/> Topic: 1983–1989: International success and decline <br/> Candidate Questions</th> </tr> </thead> <tbody> <tr> <td><strong>Question 1</strong></td> </tr> <tr> <td>Candidate 1.1: What happened in 1983?</td> </tr> <tr> <td>Candidate 1.2: What happened in 1983?</td> </tr> <tr> <td>Candidate 1.3: What was the first
indication of Spandau Ballet’s success at the international level?</td> </tr> <tr> <td><strong>Answer 1</strong><br/> The follow-up album, Parade, was released in June 1984, and its singles were again big successes in the charts in Europe, Oceania and Canada.</td> </tr> <tr> <td><strong>Question 2</strong></td> </tr> <tr> <td>Candidate 2.1: What was the most popular single from the album?</td> </tr> <tr> <td>Candidate 2.2: What were the notable songs from the album Parade?</td> </tr> <tr> <td>Candidate 2.3: What was the name of the album?</td> </tr> <tr> <td><strong>Answer 2</strong><br/> The album’s opening song, “Only When You Leave”.</td> </tr> <tr> <td><strong>Question 3</strong></td> </tr> <tr> <td>Candidate 3.1: How did the opening song do on the charts?</td> </tr> <tr> <td>Candidate 3.2: What other songs were on the album?</td> </tr> <tr> <td>Candidate 3.3: What was the name of the album that was released?</td> </tr> <tr> <td><strong>Answer 3</strong><br/> Became the band’s last American hit.</td> </tr> <tr> <td><strong>Question 4</strong></td> </tr> <tr> <td>Candidate 4.1: Are there any other interesting aspects about this article?</td> </tr> <tr> <td>Candidate 4.2: What was the last album that they released?</td> </tr> <tr> <td>Candidate 4.3: What other songs were on the album?</td> </tr> </tbody> </table> <p>We hope you agree with us that candidates 1.1, 2.3, 3.3, and 4.2 are usually of the poorest quality in their respective groups; these are the questions generated by the baseline system without finetuning for informativeness and specificity.
Our system is much better at inquiring about new information while staying on topic (1.2, 2.1, 3.2, and 4.3), arguably even more specific than the human reference (1.3, 2.2, 3.1, and 4.1) at times.</p> <p>To confirm our findings, we also conducted a larger-scale human evaluation, and after examining 200 groups of predicted questions in a blinded experiment, our annotators agreed that our system generates informative and specific questions more often than the baseline system, and with higher overall quality. All human evaluation results are validated with statistical testing, which demonstrates statistically significant differences between these systems on these metrics.</p> <p>We hope our work can help highlight the importance and challenges in building natural language systems to gather knowledge through a natural language interface, and, more importantly, provide a practical means to train these systems to reason about the unknown in the face of information asymmetry. If you are interested, please check out our <a href="https://arxiv.org/pdf/2004.14530.pdf">paper</a> for more technical details, analysis, and bibliographical notes. We have also released our <a href="https://github.com/qipeng/stay-hungry-stay-focused">code and models</a>, should you be interested in training and running these models for yourself!</p> <h4 id="footnotes">Footnotes</h4> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:1"> <p>See <a href="https://en.wikipedia.org/wiki/Falafel">Falafel</a> on the English Wikipedia. Every answer about falafels in this post is based on this article, unless otherwise specified. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:2"> <p>See <a href="https://en.wikipedia.org/wiki/Hummus">Hummus</a> on the English Wikipedia. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:3"> <p>See <a href="https://en.wikipedia.org/wiki/Vicia_faba">Vicia faba</a> on the English Wikipedia.
<a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="Conversations"/><category term="QuestionAnswering"/><category term="Research"/><summary type="html"><![CDATA[Asking good inquisitive questions in a conversation requires inferring about others' knowledge. Can we train NLP systems to do this?]]></summary></entry><entry><title type="html">Answering Complex Open-domain Questions at Scale</title><link href="https://qipeng.me//blog/answering-complex-open-domain-questions-at-scale/" rel="alternate" type="text/html" title="Answering Complex Open-domain Questions at Scale"/><published>2019-10-16T07:00:00-07:00</published><updated>2019-10-16T07:00:00-07:00</updated><id>https://qipeng.me//blog/answering-complex-open-domain-questions-at-scale</id><content type="html" xml:base="https://qipeng.me//blog/answering-complex-open-domain-questions-at-scale/"><![CDATA[<blockquote> <p><strong>TL;DR:</strong> The NLP community has made great progress on open-domain question answering, but our systems still struggle to answer complex questions over a large collection of text. We present an efficient and explainable method for enabling multi-step reasoning in these systems.</p> </blockquote> <p>From search engines to automatic question answering systems, natural language processing (NLP) systems have drastically improved our ability to access knowledge stored in text, saving us countless hours spent memorizing facts and looking things up.</p> <figure> <a data-flickr-embed="true" href="https://www.flickr.com/photos/reedinglessons/2239767394/" title="Card Catalog"><img src="https://live.staticflickr.com/2129/2239767394_bbd6cab970_z.jpg" width="640" height="425" alt="Card Catalog"/></a><script async="" src="//embedr.flickr.com/assets/client-code.js" charset="utf-8"></script> <figcaption> Who's old enough to remember these indexes and not just the search engine ones? 
</figcaption> </figure> <p>Today, whenever we have a question in mind, the answer is usually one Google/Bing search away. For instance, <em>“Which U.S. state is the largest by area?”</em></p> <figure> <img src="/blog/answering-complex-open-domain-questions-at-scale/google-alaska.png" width="90%"/> <figcaption> Alaska! But also, great job, Google! </figcaption> </figure> <p>Other questions, however, are less straightforward. For example, <em>“Who was the first to demonstrate that GPS could be used to detect seismic waves?”</em> Google isn’t of much help if we were to directly type this question as a search query. On the other hand, the Internet’s encyclopedia, Wikipedia, does have an answer for us:</p> <figure> <img src="/blog/answering-complex-open-domain-questions-at-scale/dr-larson.png" width="90%"/> <figcaption> Thank you, Dr. Larson! </figcaption> </figure> <p>Wouldn’t it be nice if an NLP system could answer this question for us, without us having to find the article ourselves? This problem, called <em>open-domain question answering (open-domain QA)</em>, is an active area of NLP research.</p> <h3 id="background-open-domain-qa">Background: Open-domain QA</h3> <p>Before diving into our new method for open-domain QA, let us first take a moment to understand the problem setup, challenges, and why existing solutions are not quite enough to answer complex questions.</p> <h4 id="open-domain-vs-closed-domain--restricted-context">Open-domain vs Closed-domain / Restricted-context</h4> <p>The first question answering systems built by NLP researchers, such as <a href="https://web.stanford.edu/class/linguist289/p219-green.pdf">BASEBALL</a> and <a href="https://www.semanticscholar.org/paper/Lunar-rocks-in-natural-english%3A-explorations-in-Woods/6390e2772c3359e4f3b5430423ac996473449ebb">LUNAR</a>, were highly domain-specific. 
They were adept at answering questions about US baseball players over the period of one specific year and about lunar rocks brought back to Earth, but not terribly helpful beyond the domains they were built for. In other words, they are <em>closed-domain</em>.</p> <p>Since then, researchers have moved towards tackling open-domain QA. In open-domain QA, the questions are not limited to predefined domains and domain knowledge; ideally, the system should be able to sift through a very large amount of text documents to find the answer for us.</p> <p>Single-document open-domain QA (also known as <em>reading comprehension</em>) is one of the research areas seeing recent breakthroughs in natural language processing, where an NLP system is given a single document (or just a paragraph) that might contain the answer to a question, and is asked to answer the question based on this context. Take our Dr. Larson question for an example (<em>“Who was the first to demonstrate that GPS could be used to detect seismic waves?”</em>). A single-document QA system might be trained to answer this question given only the Wikipedia page <em>“Kristine M. Larson”</em>. This is the format of many popular question answering datasets used in the NLP community today, e.g., <a href="https://rajpurkar.github.io/SQuAD-explorer/">SQuAD</a>.</p> <p>Question answering systems trained on SQuAD are able to generalize to answering questions about personal biographies.</p> <figure> <img src="/blog/answering-complex-open-domain-questions-at-scale/bio-peng.png" width="90%"/> <figcaption> Recent reading comprehension systems can answer our question, given appropriate context. Demo credit: <a href="https://demo.allennlp.org/reading-comprehension/OTk1OTky">AllenNLP</a>. </figcaption> </figure> <p>However, such systems cannot help us answer our question about Dr. 
Larson if we didn’t already know to look at her biography, which is quite limiting.</p> <p>To solve this, researchers are developing question answering systems over large text collections. Instead of being provided with the exact context necessary to answer the question, the system is required to sift through a collection of documents to arrive at the answer, much like how we search for answers on the web. This setting, called <em>open-context open-domain QA</em>, is much more challenging than reading comprehension. But it is also a lot more useful when we have a question in mind but don’t really have a good idea where the answer might be from. The main challenge, besides those of restricted-context QA, is to narrow down the large collection of texts to a manageable amount with scalable approaches, such that we can run reading comprehension models to arrive at the answer.</p> <h4 id="open-domain-qa-systems">Open-domain QA Systems</h4> <p>Inspired by the <a href="https://trec.nist.gov/data/qamain.html">series of question answering competitions at the Text REtrieval Conference</a> (TREC), researchers in recent years have started to look into adapting powerful neural-network-based QA models to the open-domain task.</p> <p><a href="https://www.cs.princeton.edu/~danqic/">Danqi Chen</a> and collaborators first combined traditional search engines with modern, neural question answering systems to attack this problem. Their approach to open-domain QA, named <a href="https://arxiv.org/pdf/1704.00051.pdf">DrQA</a>, is simple and powerful: given a question, the system uses it to search a collection of documents for context documents that may contain the answer. Then, this reduced context is the input to a reading comprehension system, which predicts the final answer.</p> <figure> <img src="/blog/answering-complex-open-domain-questions-at-scale/drqa.png" width="90%"/> <figcaption> Illustration of Chen et al.'s "DrQA" model, which was presented at ACL 2017.
Figure from the official <a href="https://github.com/facebookresearch/DrQA">Github repo</a>. </figcaption> </figure> <p>Most of the recent research in open-domain QA has followed this two-stage approach of retrieving and reading, with added features such as reranking (see, for example, <a href="https://arxiv.org/abs/1709.00023">(Wang et al., 2018)</a>), as well as neural retrieval systems and better joint training (see, for example, <a href="https://openreview.net/pdf?id=HkfPSh05K7">(Das et al., 2019)</a> and <a href="https://arxiv.org/pdf/1906.00300.pdf">(Lee et al., 2019)</a>).</p> <h4 id="the-challenge-of-complex-open-domain-questions">The Challenge of Complex Open-domain Questions</h4> <p>All systems that follow this retrieve-and-read paradigm are ill-equipped to handle complex questions. Let’s walk through an illustrative example of why that is.</p> <p>We all forget the names of celebrities from time to time. Suppose, one day, you are curious: <em>“What is the Aquaman actor’s next movie?”</em> To answer this question, you would probably first search for <em>“Aquaman”</em> or <em>“the Aquaman actor”</em> to find out who he/she is. Hopefully after scrolling through a few top search results, you will realize the answer is <em>“Jason Momoa”</em>, and then move on to finding out what his next movie is.</p> <p>In this simple example, not all of the supporting evidence needed to answer the question can be readily retrieved from the question alone, i.e., there’s a knowledge discovery problem to solve.<sup id="fnref:2"><a href="#fn:2" class="footnote" rel="footnote" role="doc-noteref">1</a></sup> This makes these questions difficult for retrieve-and-read open-domain QA systems, because there is usually some evidence that lacks a strong semantic overlap with the question itself.
Below is a sketch of the relations between the real-world entities involved, illustrating the multiple steps of reasoning required to answer this question.</p> <figure> <img src="/blog/answering-complex-open-domain-questions-at-scale/jason-momoa.png" width="90%"/> <figcaption> Reasoning required to answer the question "What is the Aquaman actor's next movie?". In this case, "Jason Momoa" is the missing link that connects the question to its answer, but cannot be easily retrieved based on the question. </figcaption> </figure> <p>One solution to this problem might be to train neural retrieval and reading comprehension models jointly to update queries to find more evidence (Das et al. (2019) set out to do just that). While this might also work in our setting, pretraining the neural retriever with distant supervision to promote documents that contain the answer string will likely fail because of the missing semantic overlap between the question and some of the necessary documents. End-to-end training will also be prohibitively expensive, because the search space for queries beyond the first step of reasoning is enormous. Even if one manages to train a neural system to accomplish this task, the resulting system is probably very computationally inefficient and not very explainable.</p> <p>So, can we build an open-domain QA system that is capable of handling complex, multi-step reasoning questions, and doing so in an efficient and explainable manner?
We present such a system in our new EMNLP-IJCNLP paper – <a href="https://nlp.stanford.edu/pubs/qi2019answering.pdf">Answering Complex Open-domain Questions Through Iterative Query Generation</a>.</p> <h3 id="answering-complex-open-domain-questions">Answering Complex Open-domain Questions</h3> <p>To introduce our system, we start with the overall strategy we use to address the problem of multi-step reasoning in open-domain QA, before moving on to the dataset we evaluate our system on and experimental results.</p> <h4 id="overall-strategy">Overall Strategy</h4> <p>As we have seen, retrieve-and-read systems can’t efficiently handle complex open-domain questions that require multiple steps of reasoning, because (a) these questions require multiple supporting facts to answer, and (b) it is usually difficult to find all supporting facts necessary with only the question. Ideally, we want a system to be able to iterate between “reading” the information retrieved and finding further supporting evidence if necessary, just like a human.</p> <p>That is exactly where the “iterative query generation” part of the paper title comes into play. We propose an open-domain QA system that iteratively generates natural language queries based on the currently retrieved context and retrieves more information if needed before finally answering the question. This allows us to (a) retrieve multiple supporting facts with different queries, and (b) make use of documents retrieved in previous iterations to generate queries that wouldn’t have been possible from the question alone. Moreover, because our system generates natural language queries, we can still leverage off-the-shelf information retrieval systems for efficient retrieval.
Furthermore, the steps our model follows are more explainable to a human, and allow for human intervention at any time to correct its course.</p> <p>Given the English Wikipedia as our source of textual knowledge, the full system operates as follows to answer the question <em>“Which novel by the author of ‘Armada’ will be adapted as a feature film by Steven Spielberg?”</em>:</p> <figure> <img src="/blog/answering-complex-open-domain-questions-at-scale/golden-retriever.png" width="90%"/> <figcaption> The proposed model answers the question "Which novel by the author of 'Armada' will be adapted as a feature film by Steven Spielberg?". The system first iterates between reading and retrieving to gather supporting facts, then concatenates all the top retrieval results and feeds them into a restricted-context QA model with the question to generate the final answer. </figcaption> </figure> <p>To answer this question, the model starts by generating a query to search Wikipedia to find information about the novel <em>Armada</em>. After “reading” the retrieved articles, it then attempts to search for <em>Ernest Cline</em> (the name of the author) for more information. Finally, when we have retrieved all the context necessary to answer the question, we concatenate the top retrieved articles from these retrieval steps, and feed them into a restricted-context QA system to predict the final answer.</p> <p>The main challenge in building this model lies in training the query generators collaboratively to generate useful queries for retrieving all the necessary information. Our main contribution is an efficient method for training these query generators with very limited supervision about which documents to retrieve, yielding a competitive system for answering complex and open-domain questions. 
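Schematically, the iterate-then-read loop described above can be sketched as follows (a sketch only: <code>generate_query</code>, <code>search</code>, and <code>read_answer</code> are hypothetical stand-ins for the trained query generator, the off-the-shelf IR engine, and the restricted-context QA model):

```python
def answer_complex_question(question, generate_query, search, read_answer,
                            num_hops=2, top_k=5):
    """Sketch of iterative retrieve-and-read for multi-step open-domain QA.
    The three callables are hypothetical interfaces, not the paper's API."""
    context = []  # documents retrieved across all hops so far
    for _ in range(num_hops):
        # Generate a natural language query from the question plus
        # everything retrieved so far, then retrieve more documents.
        query = generate_query(question, context)
        context.extend(search(query)[:top_k])
    # Concatenate the retrieved documents and read off the final answer.
    return read_answer(question, context)
```

With two hops, the first query can be generated from the question alone, while the second can mention entities (like the author's name in the Armada example) that only surface in the documents retrieved by the first hop.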
Our method is based on the crucial observation that, if the question can be answered with knowledge from the corpus, then there exists a progressive chain (or graph) of reasoning we can follow. In other words, we note that at any given time in the process of finding all supporting facts, there is some strong semantic overlap between <em>what we already know</em> (the question text, plus what we have found so far), and <em>what we are trying to find</em> (the remaining supporting facts).</p> <figure> <img src="/blog/answering-complex-open-domain-questions-at-scale/needle-haystack.png" width="90%"/> <figcaption> Finding the multiple supporting facts necessary to answer complex questions is much like finding multiple needles in a haystack. Instead of looking for them independently, we make use of the thread connecting these needles, which is the strong semantic overlap between what we know and what we are trying to find. </figcaption> </figure> <p>In the beginning, the question the system is asked is all the information <em>we already know</em>. We are <em>trying to find</em> any document that is part of the reasoning chain needed to answer this question. Based on our observation, at least one of the gold documents<sup id="fnref:4"><a href="#fn:4" class="footnote" rel="footnote" role="doc-noteref">2</a></sup> would have strong semantic overlap with the question, and our goal is to find one such document to bootstrap our chain of reasoning. In our Armada example, this document would be the Wikipedia page of Armada the novel, where the overlap is the name <em>“Armada”</em>, and the fact that it’s a novel. To find this document with the help of a text-based information retrieval (IR) system, we just need to identify this overlap and use it as the search query.</p> <p>After one step of retrieval, we have hopefully retrieved the <em>“Armada (novel)”</em> page from Wikipedia, among others.
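To make the overlap-as-query idea concrete, here is a word-level longest-common-substring sketch (our own simplified illustration with invented example sentences; the actual oracle also considers longest common subsequences and picks the candidate query that retrieves best):

```python
def longest_common_substring(known, target):
    """Longest contiguous run of words shared by `known` (question text plus
    retrieved documents) and `target` (a gold document we still need)."""
    best_len, best_end = 0, 0
    # dp[i][j] = length of the common run ending at known[i-1] and target[j-1]
    dp = [[0] * (len(target) + 1) for _ in range(len(known) + 1)]
    for i in range(1, len(known) + 1):
        for j in range(1, len(target) + 1):
            if known[i - 1] == target[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                if dp[i][j] > best_len:
                    best_len, best_end = dp[i][j], i
    return known[best_end - best_len:best_end]

# After the first hop, the retrieved Armada page joins "what we know" ...
known = ("Which novel by the author of Armada will be adapted by Steven "
         "Spielberg ? Armada is a 2015 novel by Ernest Cline").split()
# ... and its overlap with the still-missing page yields the next query.
target = "Ernest Cline is an American novelist and screenwriter".split()
longest_common_substring(known, target)  # -> ['Ernest', 'Cline']
```

Note that the second-hop query only becomes available once the first-hop document is in the retrieval context, which is exactly the point of iterating.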
If, at training time, we also know that the <em>“Ernest Cline”</em> page is the next missing link in our chain of reasoning, we can apply the same technique. Now, we use the semantic overlap between <em>what we know</em> (the question, the <em>“Armada (novel)”</em> page, plus some other Wikipedia pages) and <em>what we are trying to find</em> (<em>“Ernest Cline”</em>) to generate the desired query, <em>“Ernest Cline”</em>. To find this semantic overlap, we simply employ longest common substring or longest common subsequence algorithms between <em>the knowns</em> and <em>the wanted</em>.</p> <p>With the desired queries at each step of reasoning, we can then train a model to predict them from the retrieval context (question + already retrieved documents) at each step. We then use these query generators to complete the task of open-domain multi-step reasoning. We cast the query generation problem as one of restricted-context QA, since the goal is to map the given question and (retrieved) context to some target derived from the context.</p> <p>We name the full system GoldEn (Gold Entity) Retriever, because the model-retrieved Wikipedia pages are mostly entities, and it’s a fun name for a retrieval-oriented model! Below are some example questions and the desired queries we train the query generators with:</p> <figure> <table> <tr> <th>Question</th> <th>Step 1 Query</th> <th>Step 2 Query</th> </tr> <tr> <td>What government position was held by the woman who portrayed Corliss Archer in the film Kiss and Tell?</td> <td>Corliss Archer in the film Kiss and Tell</td> <td>Shirley Temple</td> </tr> <tr> <td>Are Giuseppe Verdi and Ambroise Thomas both Opera composers?</td> <td>Giuseppe Verdi</td> <td>Ambroise Thomas</td> </tr> </table> <figcaption> Example queries generated from our overlap-finding process to train the query generators in GoldEn Retriever.
As you can see in the first example, the query at Step 2 reveals information we can only find through iterative retrieval, and is not contained in the original question. </figcaption> </figure> <p>Two practical notes should be mentioned here. First, it is not difficult to see that our observation (that supervision signal for query generation can be derived from this semantic overlap) generalizes to any number of supporting documents. It also requires no additional knowledge about how the question can or should be decomposed into sub-questions to answer (which previous work has studied, e.g., <a href="https://www.aclweb.org/anthology/N18-1059">(Talmor and Berant, 2018)</a> and <a href="https://www.aclweb.org/anthology/P19-1613">(Min et al., 2019)</a>). As long as the gold supporting documents are known at training time, we can use this technique to construct the chain of reasoning in an open-domain setting very efficiently at scale. Second, we make no assumption about the order in which documents should be retrieved. At any given step of open-domain reasoning, one can enumerate all of the documents that have yet to be retrieved, find each document’s semantic overlap with the retrieval context, and launch searches with these generated queries. Documents that are in the immediate next step of reasoning will naturally be more discoverable, and we can choose the desired queries accordingly. In our Armada example, for instance, the overlap between the question and the Ernest Cline article is <em>“Steven Spielberg”</em>, <em>“film”</em>, etc., which leads us nowhere near the <em>“Ernest Cline”</em> page, so these are not chosen as the first-step query at training time.</p> <h4 id="dataset-hotpotqa">Dataset: HotpotQA</h4> <p>To test the performance of GoldEn Retriever, we evaluate it on <a href="https://hotpotqa.github.io/">HotpotQA</a>, a recent multi-hop question answering dataset presented at EMNLP 2018 (by me &amp; collaborators).
HotpotQA is a crowd-sourced QA dataset on English Wikipedia articles, in which crowd-workers are presented with the introductory paragraphs from two related Wikipedia articles and asked to generate questions that require reasoning with both paragraphs to answer. Our example question about the Armada novel is one such question from this dataset. To encourage the development of explainable QA systems, we also asked crowd workers to highlight the sentences from these paragraphs that support their answer (we call these “supporting facts”), and QA systems are asked to predict them at test time.</p> <p>HotpotQA features two evaluation settings: a few-document distractor setting, and an open-domain fullwiki setting, which we focus on, where the system is only given the question and the entire Wikipedia to predict the answer from. HotpotQA also features a diverse range of reasoning strategies, including questions involving missing entities (our Armada example, where Ernest Cline is not in the question), intersection questions (<em>What satisfies property A and property B?</em>), and comparison questions, where two entities are compared by a common attribute, among others.<sup id="fnref:3"><a href="#fn:3" class="footnote" rel="footnote" role="doc-noteref">3</a></sup></p> <p>QA systems on this dataset are evaluated on two aspects: answer accuracy and explainability. Answer accuracy is evaluated with answer exact matches (EM) and unigram F1, and explainability is similarly evaluated with EM and F1 by calculating the supporting fact overlap between predictions and annotations. These two aspects are unified by joint EM and F1 metrics, which encourage QA systems to work well on both.</p> <p>We make two simplifications to the GoldEn Retriever system on this dataset. First, we limit the number of retrieval steps to two to match the number of gold supporting documents for all questions in HotpotQA, and avoid having to learn an extra stopping criterion.
Second, we assume that all queries are contiguous spans of text in the retrieval context, and use the document reader in DrQA, an extractive question answering system, to predict them at test time. To derive the desired search queries, we employ longest common substring and longest common subsequence algorithms to find the semantic overlap between the retrieval context and the desired documents, and choose the one that results in the better IR performance. For the IR engine, we use <a href="https://elastic.co/">Elasticsearch</a> with a unigram-bigram index over the same Wikipedia dump HotpotQA was constructed on.</p> <p>For the final restricted-context QA component, we use a modified BiDAF++ model in this work. For more technical details, please refer to <a href="https://nlp.stanford.edu/pubs/qi2019answering.pdf">our paper</a>.</p> <h4 id="results">Results</h4> <p>We evaluate the effectiveness of our GoldEn Retriever model on two aspects: its performance on retrieving the gold supporting documents, and its end-to-end performance in question answering.</p> <p>For retrieval performance, we compare GoldEn Retriever to a retrieve-and-read QA system that just retrieves once with the question. We evaluate these approaches on the recall of the two gold paragraphs when a total of 10 paragraphs are retrieved by each system, because this metric reflects the ceiling performance of the entire QA system if the restricted-context QA component were perfect.</p> <figure> <img src="/blog/answering-complex-open-domain-questions-at-scale/ir-recall.png" width="90%"/> <figcaption> Retrieval performance of a retrieve-and-read system vs GoldEn Retriever on the gold paragraphs.
</figcaption> </figure> <p>As can be seen from the figure, while both systems achieve decent recall on the paragraph that is usually more connected to the question (“Paragraph 1” in the figure), GoldEn Retriever obtains a significant improvement through iterative retrieval with query generation on the other paragraph (~24% improvement). This means that for about 24% of the questions, GoldEn Retriever is able to find both gold supporting documents while the retrieve-and-read system can’t. Further analysis shows that this is mainly due to the improved recall for non-comparison questions (for which recall improved by about 25%), where the retrieval problem is less trivial.</p> <p>For end-to-end QA performance, we compare GoldEn Retriever against various retrieve-and-read baselines on the development set, as well as systems submitted to the public leaderboard on the hidden test set.</p> <figure> <img src="/blog/answering-complex-open-domain-questions-at-scale/fullwiki-joint-f1.png" width="90%"/> <figcaption> Comparing GoldEn Retriever against various other systems on HotpotQA's fullwiki setting. </figcaption> </figure> <p>We first contrast the performance of the QA component when using the IR system originally used in HotpotQA (as reflected by the released fullwiki dev set) and Elasticsearch in a retrieve-and-read setting. As can be seen from the leftmost two bars in the figure, a better search engine does improve end-to-end performance (from 22.75% F1 to 27.11%). However, this is still far from the best previously published system (34.92% F1 on the test set, which is empirically within ±2% of the model’s dev set performance). With GoldEn Retriever, we improve this state of the art to 39.13% F1, which is especially significant considering that the previous state-of-the-art model uses BERT and we don’t.
Although this doesn’t match the contemporaneous work that achieves 47.6% F1 with another BERT-based model, we see that if our query generators were able to faithfully reproduce the desired queries on the dev set, the performance of our system wouldn’t have been far off (“Oracle IR”).</p> <p>For explainability, aside from reporting supporting fact metrics that are part of HotpotQA’s evaluation, we can also look at the search queries GoldEn Retriever generates on the dev set. As can be seen in the examples below, the natural language queries generated by the model are very understandable. Furthermore, one can see where the model makes mistakes and correct them in the system if needed.</p> <figure> <table> <tr> <th>Question</th> <th>Step 1 Predicted</th> <th>Step 2 Predicted</th> </tr> <tr> <td>What video game character did the voice actress in the animated film Alpha and Omega voice?</td> <td>voice actress in the animated film Alpha and Omega <span class="green-italic">(animated film Alpha and Omega voice)</span></td> <td>Hayden Panettiere</td> </tr> <tr> <td>Yau Ma Tei North is a district of a city with how many citizens?</td> <td>Yau Ma Tei North</td> <td>Yau Tsim Mong District of Hong Kong <span class="green-italic">(Hong Kong)</span></td> </tr> </table> <figcaption> Examples of queries generated by GoldEn Retriever on dev set examples. The model-generated queries are shown in black, and the heuristic-generated "desired queries" are shown in parentheses in <span class="green-italic">green italic font</span> when they differ from the model-generated ones. In the first example, we see that the model actually generates a constituent whereas the heuristics largely ignore constituency structure; in the second example, however, the model generated a Step 2 query that is overly specific.
</figcaption> </figure> <h3 id="resources">Resources</h3> <p>To help facilitate future research in open-domain multi-step reasoning, we make the following resources publicly available:</p> <ul> <li>The code to reproduce our results and our pretrained models</li> <li>Generated “desired” query files and modified HotpotQA training and development files generated from the heuristics to train GoldEn Retriever models</li> <li>Predicted search queries and dev/test set input for our restricted-context QA model</li> </ul> <p>All of these can be found in our <a href="https://github.com/qipeng/golden-retriever">code repository on GitHub</a>.</p> <p><strong>Language Note:</strong> All datasets and most of the research mentioned in this post are collected/tested for the English language only, but our principle of semantic overlap is applicable to answering open-domain complex questions in languages other than English, if suitably augmented with lemmatization for highly inflected languages.</p> <h4 id="acknowledgements">Acknowledgements</h4> <p>I would like to thank my collaborators Xiaowen (Vera) Lin, Leo Mehr, Zijian Wang, and Chris Manning for their help in making this work possible. I would also like to thank Nelson Liu and Andrey Kurenkov, who provided helpful editing suggestions for earlier drafts of this blog post.</p> <h4 id="footnotes">Footnotes</h4> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:2"> <p>This is of course contingent on the fact that very few highly ranked articles on the Web mention Jason Momoa in his next movie in close proximity to stating that he’s the “Aquaman” star who played Aquaman in that movie. This is just an example to demonstrate that, as simple as this question seems, it’s not too difficult to construct questions that require information from more than one document to answer.
<a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:4"> <p>By “gold documents” we mean the documents needed in the chain of reasoning to answer the question. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:3"> <p>Comparison questions make up about 25% of the HotpotQA dataset. For more details please see <a href="https://arxiv.org/pdf/1809.09600.pdf">our HotpotQA paper</a>. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="QuestionAnswering"/><category term="Research"/><summary type="html"><![CDATA[The NLP community has made great progress on open-domain QA, but our systems still struggle to answer complex open-domain questions in a large collection of text. We present an efficient and explainable method for enabling multi-step reasoning in these systems.]]></summary></entry><entry><title type="html">Pinyin Cheatsheet for (Mostly American) English Speakers</title><link href="https://qipeng.me//blog/pinyin-cheatsheet/" rel="alternate" type="text/html" title="Pinyin Cheatsheet for (Mostly American) English Speakers"/><published>2019-07-26T10:00:00-07:00</published><updated>2019-07-26T10:00:00-07:00</updated><id>https://qipeng.me//blog/pinyin-cheatsheet</id><content type="html" xml:base="https://qipeng.me//blog/pinyin-cheatsheet/"><![CDATA[<p><a href="https://en.wikipedia.org/wiki/Pinyin">Pinyin</a> (Chinese: 拼音, <em>lit.</em> “spelling sounds”) is one of the most commonly used methods for romanizing Chinese characters from Mandarin Chinese, and is used for the names of the majority of the Chinese community.<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup> As many find it difficult to pronounce Chinese names written in Pinyin (especially at academic conferences), I put together this cheatsheet in the hope of helping (mostly American) English 
speakers pronounce them correctly with as little effort as possible.<sup id="fnref:2"><a href="#fn:2" class="footnote" rel="footnote" role="doc-noteref">2</a></sup> No training in the <a href="https://en.wikipedia.org/wiki/International_Phonetic_Alphabet">International Phonetic Alphabet (IPA)</a> is required.</p> <p><strong>Bottom line</strong>: Pinyin is a romanization method for denoting Chinese characters, and it largely provides approximations to the actual sounds used to pronounce the characters. There is <em>not</em> a one-to-one mapping between each Latin letter (or small group of letters) and the corresponding sound, and I will not pretend that that is the case. Nevertheless, we can focus on the most challenging ones first, and strive for better and better sound approximations.</p> <p>I will organize this cheatsheet in a hierarchical manner, focusing on the first-order approximations before offering more details and exceptions.</p> <h2 id="vowels">Vowels</h2> <ol> <li><strong>a</strong> sounds like <strong>a</strong> in <strong>c<u>a</u>r</strong> or <strong>sp<u>a</u></strong>, without the <strong>r</strong> sound in the tail. <ol> <li>Except in <strong>an</strong>, where you might be better off approximating it with <strong>b<u>a</u>n</strong> or <strong>c<u>a</u>t</strong> (the vowel sounds a bit more closed than the open-mouth <strong>a</strong>).</li> </ol> </li> <li><strong>e</strong> sounds like <strong>e</strong> in <strong>th<u>e</u></strong> (when it’s not pronounced like <strong>th<u>ee</u></strong>) or <strong>a</strong> in <strong><u>a</u>gain</strong> in most places. <ol> <li>Except in diphthongs or double vowels like <strong>ei</strong>, <strong>ie</strong>, <strong>ue</strong> (actually <strong>üe</strong>, see <strong>u</strong>), where it sounds like <strong>e</strong> in <strong>b<u>e</u>t</strong>.</li> </ol> </li> <li><strong>i</strong> sounds like <strong>t<u>ea</u></strong> or <strong>sh<u>ee</u>p</strong>. 
It almost never sounds like <strong>b<u>i</u>t</strong> or <strong>sh<u>i</u>p</strong> as rendered in most American accents (which drifts closer to <strong>b<u>e</u>t</strong>). <ol> <li>When <strong>i</strong> appears before another vowel, it can usually be well approximated as <strong><u>y</u>es</strong>.</li> <li>In <strong>zhi</strong>, <strong>chi</strong>, <strong>shi</strong>, <strong>ri</strong>, <strong>zi</strong>, <strong>ci</strong>, and <strong>si</strong>, <strong>i</strong> actually doesn’t sound like <strong>sh<u>ee</u>p</strong> and doesn’t have an equivalent in English. A rough approximation is a voiced version of the consonant (i.e., hold your mouth the way you’d pronounce the consonant, and voice it). If you speak Japanese, there are some rough correspondences: <strong>zi</strong> &lt;=&gt; <strong>ず (zu)</strong> and <strong>si</strong> &lt;=&gt; <strong>す (su)</strong>, where the correspondence is more pronounced when these syllables are at the end of a word in Japanese.</li> </ol> </li> <li><strong>o</strong> sounds like <strong>sh<u>o</u>rt</strong> or <strong>b<u>o</u>y</strong>, again without the trailing <strong>r</strong> sound.</li> <li><strong>u</strong> is usually the romanization for two different vowels in Mandarin, <strong>u</strong> and <strong>ü</strong>. The former sounds like <strong>f<u>oo</u>d</strong> or <strong>p<u>oo</u>l</strong>, and the latter doesn’t exist in English. A good way to approximate <strong>ü</strong> is pronouncing the letter <strong><u>u</u></strong> in English. 
<ol> <li><strong>u</strong> is actually <strong>ü</strong> if it comes after <strong>j</strong>, <strong>q</strong>, <strong>x</strong>, or <strong>y</strong>.</li> <li>When <strong>u</strong> is not <strong>ü</strong> and comes before other vowels, it can usually be well approximated as <strong><u>w</u>ait</strong>.</li> <li>Other than after <strong>j</strong>, <strong>q</strong>, <strong>x</strong>, and <strong>y</strong>, the vowel <strong>ü</strong> can only follow <strong>n</strong> and <strong>l</strong>. To disambiguate it from <strong>u</strong> in the latter cases (following <strong>n</strong> and <strong>l</strong>), it is sometimes romanized as <strong>v</strong>, and more recently, it is starting to be romanized as <strong>yu</strong> on Chinese passports.<sup id="fnref:3"><a href="#fn:3" class="footnote" rel="footnote" role="doc-noteref">3</a></sup></li> </ol> </li> </ol> <h3 id="some-less-compositional-diphthongs-and-special-vowels">Some Less Compositional Diphthongs and Special Vowels</h3> <ol> <li><strong>ao</strong> sounds like <strong>ab<u>ou</u>t</strong> or <strong>m<u>ou</u>ntain</strong> with a more closed ending than the normal <strong>o</strong>.</li> <li><strong>iu</strong> sounds like <strong>tr<u>io</u></strong> or <strong>grand<u>io</u>se</strong>.</li> <li><strong>er</strong> sounds like <strong>b<u>ir</u>d</strong> or <strong>sp<u>ur</u></strong>, with the trailing <strong>r</strong> sound in American English.</li> </ol> <h2 id="consonants">Consonants</h2> <ol> <li><strong>j</strong> is similar to <strong>zh</strong> and both sound like <strong><u>J</u>ohn</strong> or <strong><u>dr</u>eam</strong>, <strong>q</strong> is similar to <strong>ch</strong> and both sound like <strong><u>ch</u>eese</strong> or <strong><u>ch</u>owder</strong>, and <strong>x</strong> is similar to <strong>sh</strong> and both sound like <strong><u>sh</u>eep</strong> or <strong><u>sh</u>op</strong>. 
<ol> <li>One good approximation for <strong>j</strong>, <strong>q</strong>, and <strong>x</strong> is actually appending a <strong>y</strong> sound to <strong>zh</strong>, <strong>ch</strong>, and <strong>sh</strong>, because Mandarin pronunciation rules mandate a <strong>y</strong> sound after these most of the time and it’s not always marked explicitly (think <strong>xia</strong> vs <strong>sha</strong>, and <strong>xu</strong> vs <strong>shu</strong>).</li> </ol> </li> <li><strong>c</strong> sounds like <strong>pan<u>ts</u></strong> or <strong>boo<u>ts</u></strong>.</li> <li>All other consonants sound very much like their English counterparts, but note that <strong>h</strong> is never silent (in fact no consonant is silent in Pinyin). <ol> <li>To be more precise, <strong>r</strong> actually sounds more like <strong>vi<u>s</u>ion</strong> than <strong><u>r</u>oot</strong>, basically the former but with the tongue curled upward against the top of the mouth.</li> <li>A very subtle difference I realized while compiling this cheatsheet is that most leading consonants in Mandarin are voiceless but some of their counterparts in English are voiced. This distinction probably makes very little difference to the untrained ear, but might be of interest to the IPA-inducted.</li> </ol> </li> </ol> <h2 id="segmentation-and-sandhi">Segmentation and Sandhi</h2> <p>The Pinyin corresponding to each Chinese character is usually of the form <strong>CVC</strong>, where one or more <strong>v</strong>owels are surrounded by a leading <strong>c</strong>onsonant and a trailing one. Either or both consonants can be missing in some cases, and when that results in ambiguity in pronunciation, an apostrophe or a hyphen is usually used to delimit character boundaries (e.g., <strong>Tian’anmen</strong>). 
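<p>As a rough illustration, this syllable structure can be captured with a simple regular expression. This is a simplified sketch that ignores tone marks and many real exceptions, and its greedy segmentation relies on apostrophes to resolve ambiguous boundaries:</p>

```python
import re

# Rough sketch of Pinyin syllable structure: an optional leading
# consonant, one or more vowels ("v" stands in for the vowel ü), and an
# optional trailing "n"/"ng" ("r" covers the special vowel "er").
# Multi-letter initials must precede their single-letter prefixes.
INITIALS = "zh|ch|sh|b|p|m|f|d|t|n|l|g|k|h|j|q|x|r|z|c|s|y|w"
SYLLABLE = re.compile(rf"(?:{INITIALS})?[aeiouv]+(?:ng|n|r)?", re.IGNORECASE)


def segment(pinyin):
    """Greedily split romanized Mandarin into syllables; apostrophes
    (and spaces) are treated as hard character boundaries."""
    syllables = []
    for chunk in pinyin.replace("’", " ").replace("'", " ").split():
        syllables.extend(SYLLABLE.findall(chunk))
    return syllables
```

<p>For example, <em>Xi’an</em> (two characters) segments as <em>Xi-an</em> thanks to the apostrophe, while <em>Xian</em> (a single character) stays whole, which is exactly the ambiguity the apostrophe exists to resolve.</p>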
The trailing consonant can only be either <strong>n</strong> or <strong>ng</strong>.</p> <p>Some special cases in segmentation:</p> <ol> <li><strong>zhi</strong>, <strong>chi</strong>, <strong>shi</strong>, <strong>ri</strong>, <strong>zi</strong>, <strong>ci</strong>, and <strong>si</strong> cannot be followed by other vowels or trailing consonants.</li> </ol> <p>Although there is <a href="https://en.wikipedia.org/wiki/Tone_sandhi">tone sandhi</a> in Chinese, there isn’t as much <a href="https://en.wikipedia.org/wiki/Sandhi">sandhi</a> of other sorts in Chinese, meaning boundaries between Chinese characters are clearly reflected in speech. Take <strong>Tian’anmen</strong> as an example: while the first instinct of most English speakers might be to say <strong>tian-</strong>&lt;short pause&gt;<strong>-na-men</strong>, a native speaker would say <strong>tian-an-men</strong> without mushing character boundaries.<sup id="fnref:4"><a href="#fn:4" class="footnote" rel="footnote" role="doc-noteref">4</a></sup></p> <h2 id="tones">Tones</h2> <p>There are four tones (plus a neutral tone) in Mandarin, and I believe <a href="https://en.wikipedia.org/wiki/Pinyin#Tones">Wikipedia</a> is truly the best material on this one. Most Chinese speakers can recognize your speech even without perfect tones, though, as long as you map the sounds roughly correctly and are careful with segmentation and sandhi.</p> <h2 id="tying-it-all-together">Tying It All Together</h2> <p><a href="https://twitter.com/nlpmattg">Matt Gardner</a> suggested on Twitter that it would be helpful to include some examples, so I am including a few here that feature some of the awesome people I know (plus yours truly) whose names are romanized in pinyin when they appear in publications. 
Try pronouncing their names with what you’ve learned:</p> <table> <thead> <tr> <th>Pinyin</th> <th>Pronunciation</th> </tr> </thead> <tbody> <tr> <td>Peng Qi</td> <td><em>p-uh-ng ch-ee</em>, where <em>uh</em>=<strong><u>a</u>gain</strong>, and <em>ee</em>=<strong>b<u>ee</u></strong>.</td> </tr> <tr> <td>Yuhao Zhang</td> <td><em>y-u how zh-ah-ng</em>, where <em>y-u</em>=pronouncing the English letter <strong>u</strong>. The first name here is two characters and needs segmentation, and the common mistakes are to pronounce the <strong>a</strong> in <strong>Zhang</strong> as <strong>b<u>a</u>t</strong> or <strong>zh</strong> as <strong><u>z</u>oo</strong>.</td> </tr> <tr> <td>Danqi Chen</td> <td><em>dan ch-ee ch-uh-n</em>, where <em>uh</em>=<strong><u>a</u>gain</strong> and <strong>dan</strong> sounds just like the English name Dan. Again the first name here is two characters. I personally find the commonly used pronunciation of <strong>Chen</strong> where <strong>e</strong> is pronounced like <strong>b<u>e</u>t</strong> a bit too flat, and the enunciated version where <strong>D<u>a</u>n</strong> sounds like <strong>sp<u>a</u></strong> a bit too open.</td> </tr> <tr> <td>Ziang Xie</td> <td><em>zzz ah-ng sh-y-eh</em>, where <em>zzz</em>=prolonging the consonant a bit, <em>ah</em>=<strong>sp<u>a</u></strong>, and <em>eh</em>=<strong>b<u>e</u>t</strong>. 
This one is a bit more difficult to get right: the first name is actually two characters, and the last name has what sounds like a double-consonant leading sound.</td> </tr> </tbody> </table> <p><br/></p> <h2 id="other-useful-resources">Other Useful Resources</h2> <p><a href="http://www.grupolys.org/~cgomezr/">Carlos Gómez Rodríguez</a> recommended <a href="https://www.yoyochinese.com/chinese-learning-tools/Mandarin-Chinese-pronunciation-lesson/pinyin-chart-table">this tool</a> for quick lookup with example pronunciation audio files available.</p> <p>This <a href="https://en.wikipedia.org/wiki/Help:IPA/Mandarin">Wikipedia page</a> is also a good reference with quick examples for pronunciation.</p> <h2 id="references">References</h2> <ol> <li>Various Wikipedia pages linked throughout this cheatsheet.</li> <li><a href="https://www3.nd.edu/~dchiang/">David Chiang</a>’s <em>Mandarin Chinese Pronunciation</em>. <a href="https://www3.nd.edu/~dchiang/Pinyin.pdf">[pdf]</a> <a href="https://github.com/davidweichiang/pronunciation/blob/master/pinyin.md">[GitHub]</a></li> </ol> <h3 id="footnotes">Footnotes</h3> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:1"> <p>Others, like <a href="https://en.wikipedia.org/wiki/Tongyong_Pinyin">Tongyong Pinyin</a> and <a href="https://en.wikipedia.org/wiki/Wade%E2%80%93Giles">Wade-Giles</a>, are still in use for historical reasons. Also, note that not all Chinese names were romanized from Mandarin (which Pinyin is for) – some names come from Cantonese and other Chinese languages and it’s not always easy to tell. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:2"> <p>Focusing more on the “standard” American accent here because this is the accent that I personally get the most samples for (and thus am more confident about vowel/consonant mappings and common mistakes in). However, much of this is applicable to other accents as well, especially for consonant mapping. 
<a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:3"> <p>Xinhua report <a href="https://web.archive.org/web/20140714182037/http://wx.xinhuanet.com/2012-08/23/c_112822099.htm">in Chinese</a>, cited by Wikipedia. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:4"> <p>Interestingly, sandhi is actually one of the more challenging parts for many English learners whose first language is like Mandarin in this way. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="Languages"/><category term="Academic"/><summary type="html"><![CDATA[Cheatsheet to pronounce pinyin (especially names romanized in pinyin) for English speakers (mostly applicable to the American accent).]]></summary></entry></feed>