On Joyful Militancy https://ferd.ca/notes/on-joyful-militancy.html

From carla bergman and Nick Montgomery's Joyful Militancy, on their attempt to move from rigid militancy to a more joyful style; the passage uses the analogy of children learning to walk for people who are just starting to open up to militant ideas:

We take photos, tell friends, and record these moments because we want to share the joy in witnessing the emergence of a new increase in capacity: this kid is learning to walk! But if we take a perfectionist perspective, then why celebrate? The kid won’t usually walk for very long; they stumble and fall, and they certainly can’t run. But no one says “Why are you celebrating? They’re not really walking yet!”

If the kid learning to walk is just another kid walking, it’s no longer something worth celebrating. Those who celebrate it are naïve or getting a bit carried away: kids are learning to walk all the time. But in the moment, it doesn’t seem naïve, because we are part of the process of witnessing this kid walk, in this way, for the very first time.

We bring up this example because it seems obvious that it is nonsensical to impose external ideals of walking on little kids who are just learning, or to approach the situation with a detached and suspicious stance. It seems obvious (we hope) that a toddler’s increase in capacity—those first steps that mark the emergence of something new—is sufficient in itself. It is a joyful moment, worth celebrating, not because it’s part of some linear process of development but because it’s an emergent power for that kid, palpable to all present in those moments.

With this in mind, why is it so difficult sometimes to celebrate small victories or humble increases in collective power and capacity? What makes it so easy to dismiss transformation as too limited? What makes it so easy to find joy lacking? We see variants of this dynamic happen a lot: someone celebrates something joyful, while others offer up reminders of its insufficiency. We find ourselves doing the same thing sometimes. What allows for the constant imposition of external norms, criteria, and ideals for evaluation?

(italics in the original, bold emphasis is mine)

I do see the need to remind myself of this in multiple areas of life, not just militancy.

Fri, 13 Mar 2026 10:00:00 EDT https://ferd.ca/notes/on-joyful-militancy.html
Paper: Challenger: Fine-Tuning the Odds Until Something Breaks https://ferd.ca/notes/paper-challenger-fine-tuning-the-odds-until-something-breaks.html

Many of us have, at some point, heard about Diane Vaughan's Normalization of Deviance, a theory elaborated after the Challenger disaster of 1986. Deviance stands for a change from the norms, usually towards less safety. Normalization refers to the organization's overall acceptance of that change, which establishes new norms. Other related models developed at different times include the concepts of drift or practical sailing, all of which aim to model that behaviour.

I recently attended Lund's Human Factors and Systems Safety Learning Lab as part of their MSc program. The week covered a lot of theory, and one of the surprising bits I learned about was the theory of Fine-Tuning. The term was coined by William H. Starbuck and Frances J. Milliken in Challenger: Fine-Tuning the Odds Until Something Breaks, published in the Journal of Management Studies in 1988, also in the aftermath of Challenger.

I found it interesting because among the many models concerning themselves with the history of organizations, it is the one that most accurately reflects the dynamics I've encountered in the software industry, particularly in the startup space where safety is not necessarily a priority, but an openly negotiable property.

At a high level, fine-tuning is described as the process that results from engineers and managers pursuing partially inconsistent goals while trying to learn from experience. It relies on three theories, any of which might be adopted by people:

  1. Past successes or failures do not impact future successes or failures: this is the statistical view of independent processes. The previous coin flip has no impact on the next coin flip. If your risk analysis predicts certain rates of failure, you should expect outcomes in line with those rates.
  2. Past successes make future successes less probable, past failures make future failures less likely: this is an approach where you assume that being successful causes people to let up and stop making continuous adjustments. Failures, on the other hand, encourage people to make adjustments to prevent recurrence.
  3. Past success makes future success more likely, past failures make future failures more likely: this theory aligns with the idea that success comes from competence, and failures reveal deficiencies.
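
To make the contrast between the three theories concrete, here is a minimal sketch (my own illustration, not from the paper) of how each one would update an expected failure probability after a single outcome. The 10% step size and the starting probability are arbitrary assumptions, meant only to show the direction of each effect:

```python
def theory_1(p_failure: float, last_was_failure: bool) -> float:
    """Outcomes are independent: past results don't move the expected failure rate."""
    return p_failure

def theory_2(p_failure: float, last_was_failure: bool) -> float:
    """Success breeds complacency (risk goes up); failure triggers fixes (risk goes down)."""
    return max(0.0, p_failure * 0.9) if last_was_failure else min(1.0, p_failure * 1.1)

def theory_3(p_failure: float, last_was_failure: bool) -> float:
    """Success signals competence (risk goes down); failure reveals deficiencies (risk goes up)."""
    return min(1.0, p_failure * 1.1) if last_was_failure else max(0.0, p_failure * 0.9)

p0 = 0.05  # made-up initial expected failure probability
for update in (theory_1, theory_2, theory_3):
    print(update.__name__, "after a success:", round(update(p0, last_was_failure=False), 3))
```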

Which of these theories gets adopted can vary over time and across parts of an organization, and each has criticisms against it. For example, for Theory 1 to hold in a sociotechnical system, the hardware, procedures, and people's knowledge have to remain mostly unchanged; generally, they change. Theory 2 is often adopted after failures, but less so after successes, where tweaks are seen as improving efficiency rather than eroding safety. Theory 3 relies on learning mechanisms (which are not guaranteed to yield either good or bad safety results):

These learning mechanisms – buffers, slack resources, and programs – offer many advantages: they preserve some of the fruits of success, and they make success more likely in the future. They stabilize behaviors and enable organizations to operate to a great extent on the basis of habits and expectations instead of analyses and communications.

[...]

But these learning mechanisms also carry disadvantages. In fact, each of the advantages has a harmful aspect. People who are acting on the basis of habits and obedience are not reflecting on the assumptions underlying their actions. People who are behaving simply and predictably are not improving their behaviors or validating their behaviors' appropriateness. Organizations that do not pay careful attention to their environments' immediate demands tend to lose track of what is going on in those environments. Organizations that have discretion and autonomy with respect to their environments tend not to adapt to environmental changes; and successful organizations want to keep their worlds as they are, so they try to stop social and technological changes.

In short, for Theory 3, despite the ability to learn and adjust, the same mechanisms often result in people not seeing problems, threats or opportunities.

This is where the paper takes multiple pages describing the history of Challenger's o-rings, and how they have been handled and re-engineered over time. I'll skip it for the sake of focusing on fine-tuning itself. One thing the authors mention is that it appears people at NASA followed Theory 3 (their own past successes were seen as signs of ongoing successes), which ironically makes observers see Theory 2 (past successes increase risk of failure) as more realistic:

The organization's members grow more confident, of their own abilities, of their managers' skill, and of their organization's existing programmes and procedures. They trust the procedures to keep them apprised of developing problems, in the belief that these procedures focus on the most important events and ignore the least significant ones.

As it gained confidence in its technology, NASA went from an experimental to an operational mindset, reducing testing and maintenance while increasing payloads and efficiency. This is where fine-tuning is introduced.

Fine-tuning is an optimization process, based on negotiating tradeoffs:

Although an organization is supposed 'to solve problems and to achieve goals, it is also a conflict-resolution system that reconciles opposing interests and balances countervailing goals. [...] Further, every serious problem entails real-world contradictions, such that no action can produce improvement in all dimensions and please all evaluators.

[...]

Opposing interests and countervailing goals frequently express themselves in intraorganizational labour specializations, and they produce intraorganizational conflicts. An organization asks some members to enhance quality, some to reduce costs, and others to raise revenue; and these people find themselves arguing about the trade-offs between their specialized goals. The organization's members may seek to maintain internal harmony by expelling the conflicts to the organization's boundary, or even beyond it. [...] But conflicts between organizations destroy their compatibility, and an organization needs compatibility with its environment just as much as it needs internal cohesion. Intraorganizational conflict enables the organization to resolve some contradictions internally rather than letting them become barriers between the organization and its environment.

Basically, as conflicting goals get assigned to distinct people, those people adopt their goals and end up embodying the related conflict. These conflicts need to be resolved. The authors assert that NASA's issues around Challenger showed both cross-organizational conflict (between NASA and Thiokol) and conflict within organizations, as managers and engineers opposed each other: the engineers valued safety (with wide margins) while managers sought efficiency.

Fine-tuning is what happens when organizations navigate these conflicts and resolve them over time. Safety factors are wasteful if they truly are extra margin that never gets used (four spare tires might be overkill in a car), and they may also be a source of hazards and complexity that taxes other components. So if a design ships with large safety factors, they will eventually need to be reduced:

An initial design is only an approximation, probably a conservative one, to an effective operating system. Experience generates information that enables people to fine-tune the design: experience may demonstrate the actual necessity of design characteristics that were once thought unnecessary; it may show the danger, redundancy, or expense of other characteristics; and it may disclose opportunities to increase utilization. Fine-tuning compensates for discovered problems and dangers, removes redundancy, eliminates unnecessary expense, and expands capacities. Experience often enables people to operate a sociotechnical system for much lower cost or to obtain much greater output than the initial design assumed.

They note that because engineers expect managers to cut costs, they pad their numbers further in anticipation; and since managers tend to hold the responsibility of resolving goal conflicts, it is unsurprising that they often rule against their engineers. The authors add:

Formalized safety assessments do not resolve these arguments, and they may exacerbate them by creating additional ambiguity about what is truly important. Engineering caution and administrative defensiveness combine to proliferate formalized warnings and to make formalized safety assessments unusable as practical guidelines.

For example, Challenger included 8,000 critical components, 277 of which were involved at launch time. Paying "exceptional attention" to that many items is difficult, and the list is not stable either. Over time, elements such as these were in play:

  • The Thiokol o-ring design was inspired by one borrowed from the Titan solid rocket booster, which had shown no serious problems
  • Thiokol nevertheless added a second o-ring for redundancy
  • The criticality of the joints got reduced over time—managers thought a failure was roughly impossible by then
  • Changes were made to the system to trim the booster's weight by 2% and increase its thrust by 5%
  • Shuttle flights kept going fine even when they showed imperfect seals or o-ring damage
  • Improvements were planned for the following years, but managers thought the o-rings were less dangerous than engineers assumed

Some changes from this list might have gone back as far as 1982, meaning their contribution to an erosion of safety took more than 4 years to be undeniable.

However, the authors warn us not to jump to conclusions:

The most important lesson to learn from the Challenger disaster is not that some managers made the wrong decisions or that some engineers did not understand adequately how O-rings worked: the most important lesson is that fine-tuning makes failures very likely.

Fine-tuning changes always have plausible rationales, so they generate benefits most of the time. But fine-tuning is real-life experimentation in the face of uncertainty, and it often occurs in the context of very complex sociotechnical systems, so its outcomes appear partially random. [...]

Fine-tuning changes constitute experiments, but multiple, incremental experiments in uncontrolled settings produce confounded outcomes that are difficult to interpret. Thus, much of the time, people only discover the content and consequences of an unknown limitation by violating it and then analysing what happened in retrospect.

Fine-tuning can be seen as a series of experiments that probe the limits of knowledge, which means it will keep going so long as consequences remain acceptable. It is difficult, if not sometimes impossible, to detect the effects of this process outside of its clear benefits (it is obvious when you increase thrust and reduce weight, but far less visible how that may affect o-rings, even when tested). These discoveries happen across complex sociotechnical systems where goal conflicts are actively negotiated from varied perspectives. As such, only larger-scale failures can suspend the process and bring it to a temporary halt.
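
As a toy model of that dynamic (mine, not the authors'; every number is an arbitrary assumption), you can simulate fine-tuning as trimming a safety margin a little after every uneventful trial, and watch the run continue until an outcome finally exceeds the eroded margin:

```python
import random

random.seed(1)

margin = 4.0            # initial safety factor, in arbitrary units
trim_per_success = 0.9  # each uneventful trial "justifies" trimming the margin by 10%

for trial in range(1, 101):
    stress = random.gauss(1.0, 0.5)  # actual demand placed on the system this trial
    if stress > margin:
        print(f"trial {trial}: stress {stress:.2f} exceeded margin {margin:.2f} -> failure")
        break
    # Success: experience "shows" the margin was unnecessary, so it gets fine-tuned down.
    margin *= trim_per_success
else:
    print("no failure in 100 trials")
```

Every individual trim looks justified by the accumulated record of successes; only the eventual failure reveals where the limit actually was.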

The authors conclude that we must learn from disasters:

We may need disasters in order to halt erroneous progress. We have difficulty in distinguishing correct inferences from incorrect ones when we are making multiple, incremental experiments with incompletely understood, complex systems in uncontrolled settings; and sometimes we begin to interpret our experiments in erroneous, although plausible frameworks. Incremental experimentation also produces gradual acclimatization that dulls our sensitivities, both to phenomena and to costs and benefits. [...]

[...] We benefit from disasters only if we learn from them. Dramatic examples can make good teachers. They grab our attention and elicit efforts to discover what caused them, although few disasters receive as much attention as Challenger. In principle, by analysing disasters, we can learn how to reduce the costs of failures, to prevent repetitions of failures, and to make failures rarer.

But learning from disasters is neither inevitable nor easy. Disasters typically leave incomplete and minimal evidence. [...] Retrospection often creates an erroneous impression that errors should have been anticipated and prevented. [...] Effective learning from disasters may require looking beyond the first explanations that seem to work, and addressing remote causes as well as proximate ones

There are a lot of posts on here about what can help make learning effective, but as a personal comment, I found it very interesting to have this concept of fine-tuning introduced. As mentioned earlier, the dynamics of ongoing refinement, adding to or removing from a system to make it more efficient, economical, or reliable based on experiments, very much line up with my experience of the tech industry. Even the idea of error budgets in typical SRE speak fits this definition much better than parallel concepts such as Normalization of Deviance, Drift, or Practical Sailing. This doesn't mean that none of them can coexist within organizations or parts of the industry, but clearly stating that fine-tuning is how systems negotiate their conflicts, and that it may be behind disasters, was quite interesting. We may often start from a minimal version (an MVP) that improves rather than from a conservative design that gets trimmed, but the process is similar.

Tue, 03 Feb 2026 09:20:00 EST https://ferd.ca/notes/paper-challenger-fine-tuning-the-odds-until-something-breaks.html
Paper: The Failure Gap https://ferd.ca/notes/paper-the-failure-gap.html

Last week, Cat Hicks' Fight for the Human contained a reference to a paper written by Lauren Eskreis-Winkler, Kaitlin Woolley, Minhee Kim, and Eliana Polimeni titled The Failure Gap. As Cat described it, "In this series of studies, researchers tested people's estimation of failure rates across more than thirty domains and found a consistent 'failure gap.'" As an incident nerd, this instantly got my attention, and I decided to go through it.

The key concept here is what the authors term the failure gap, which is the idea that people vastly underestimate the actual number and rate of failures that happen in the world compared to successes. The term failure is broadly defined as undesirable outcomes or acts, so this will cover concepts such as medicine not working, medical workers not washing hands (as per hygiene rules), a team losing in sports, people not showing up to court, or people committing crimes, to name a few.

This is a bit of a tricky paper to summarize, since it contains a lot of studies:

  1. Establishing that people underestimate failures, but do not underestimate successes, at multiple levels, in 30+ domains, which include:
    1. national failures
    2. international failures
    3. individual failures
    4. sports failures
    5. education failures
    6. medication failures
  2. Establishing that in all 30+ domains where a failure gap exists, it is correlated with an under-reporting of the failures compared to successes in news, social media, and online reviews.
  3. By looking at the #MeToo movement, they establish an example of a type of failure ("men failing to treat women respectfully") that has gone from being under-reported to over-reported, and then compare it to still under-reported issues in women's health to isolate the effects.
  4. By exposing people to online reviews about medication, see whether this widens their failure gap, even when they are told reviews can be misleading.
  5. By exposing people to news at rates that match either actual media-reported rates or the real-world failure rate, see if the failure gap can be shrunk.
  6. See whether closing the failure gap in people reduces the suggested punishments in:
    1. an online sample
    2. educators
    3. managers in the workplace
  7. See whether closing the gap motivates people to fix the issues.

At a high level, the argument goes: the gap is real, it is related to information availability, varying exposure to information varies the gap, and closing the gap changes the reactions people prescribe and their motivation to act.

So let's start with the failure gap itself. The first question is whether people know the true rate at which failures (goals not being achieved) occur. Prior research had already established that people tend to underestimate bad outcomes and overestimate good outcomes for themselves, something related to motivated reasoning. So if we expect optimism there, what about optimism about things generally unrelated to the self? The authors want to know why and when it would occur.

A key supposition is that this might be due to lopsided information. Since negative emotions lead to topics not being discussed, maybe failures are likewise less likely to be given visibility than positive outcomes. Basically, information can be psychologically costly to share, particularly when it is ego-threatening, and even with negativity bias (people engage more with negative information), there is a disengagement if the negativity is personally threatening. In a nutshell: the self does more to avoid bad things than to chase good things:

This suggests that people closest to a failure—those with the clearest access to information about a failure that occurred—will tend to be those with the strongest motive not to share it. This could seriously impede the accessibility of information. For example, a reporter looking to write a story about an individual, a company, or an organization, may find that the most knowledgeable and powerful sources more freely share information on what is going right, versus what is going wrong.

Similar effects have been observed when the information has to do with other people, and might be behind why people sugarcoat critical feedback:

The disinclination to share failures is so strong that a recipient who directly states their desire for feedback partially—but not entirely—eliminates the problem. Thus, failure-related information tends to be inaccessible both when the “responsible” party hesitates to share (to avoid embarrassment), and when third party communicators go mum on these topics (to avoid discomfort).

A notable exception to the disinclination to share bad news occurs when the sharer is overcome by negative emotion (e.g., anger, anxiety)

In these latter cases, when sharing is perceived to relieve emotional pressure, keeping the information to yourself becomes more costly than sharing it.

All this being said, we're still considering whether we see more good news than bad news (lopsided information), and the authors have to tackle an obvious counter-argument: few people really feel the news is positive. They state that the news can still feel very negative while the information remains lopsided:

  • A lot of successes are seen as normal (a plane landing, a doctor washing their hands), and therefore neutral, rather than positive
  • The failures that get reported are often the very dramatic, high-stakes, or emotional ones

The end result is that if routine, mundane failures are more frequently omitted, you can very much end up in a situation where failures are under-reported compared to successes while the news still feels negative.

So to check that out, they got a bunch of lay people and picked 30+ topics covering the average American's daily encounters. For each of these, they asked people to estimate the failure rates, which the researchers could validate against reliable data. Success estimates convert directly: if you estimate that painkilling medication works 52% of the time, that implies a 48% failure rate. They tested from both vantage points (estimating successes and estimating failures) to distinguish underestimating failure from underestimating everything.
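
To illustrate the bookkeeping (the participant estimates below are invented; the true rates are roughly the figures cited in this post), the gap is just the difference between the true failure rate and the average estimated failure rate, with success estimates first converted to the failure rates they imply:

```python
def implied_failure_rate(success_estimate: float) -> float:
    """An estimated 52% success rate implies an estimated 48% failure rate."""
    return 1.0 - success_estimate

# Hypothetical participant estimates; true rates are the approximate figures cited in the post.
domains = {
    # domain: (true_failure_rate, participants' estimated failure rates)
    "hand-washing non-compliance": (0.50, [0.25, 0.30, 0.28]),
    "painkillers not working":     (0.50, [implied_failure_rate(0.52), 0.40, 0.35]),
}

for name, (true_rate, estimates) in domains.items():
    mean_estimate = sum(estimates) / len(estimates)
    gap = true_rate - mean_estimate  # positive gap: failures are underestimated
    print(f"{name}: true {true_rate:.0%}, estimated {mean_estimate:.0%}, gap {gap:+.0%}")
```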

Table 2 has a big list of all the questions they could ask:

Table 2 is a page-long list of all questions across all experiments.

One example of the results reported is:

people believed 28% of hospital personnel fail to comply with basic hand-washing hygiene, whereas the true percent of hospital personnel who do not wash their hands is approximately 50%

And there's a big table showing all of the gaps:

Figure 2, showing that all questions have estimates lower than the real rates

While not all gaps are the same size, people underestimate failures in every category:

Across 30+ domains, failure occurred an average of 61% of the time; yet people believed the failure rate was around 41%. Participants underestimated traditionally-labeled failures (e.g., restaurant closures), non-traditionally-labeled failures (e.g., miscarriages), individual failures (e.g., failed relationships), societal failures (e.g., crime), international failures (e.g., worldwide poverty), expensive failures (e.g., failing to complete college on time), and seemingly trite failures (e.g., consumer product returns). People believed teams in the National Hockey League collectively lose fewer than 50% of their games—a logical impossibility in a sport where each time one team loses, another wins.

With the failure gap demonstrated, the next question was to try to tie it to information lopsidedness. This was done using specific academic news search engines, where they did manage to confirm that negative outcomes tended to be under-reported compared to positive ones. This held even when trying to tilt the results towards negative outcomes by doing strict searches for successes and broader searches for failures.

An example given is that even in the most generous searches, business failures represented roughly 25% of the content, whereas the true failure rate of businesses is 80%.

The same held true for social media and online reviews. For this latter one, the authors give the example that over-the-counter medication (e.g. Tylenol) is ineffective roughly half the time according to actual studies, but fewer than 4% of reviews on CVS and Amazon go below 4 stars.

So overall: information accessible to lay people tends to be numerically much more positive, even if it is emotionally far more negative. This lines up with the failure gap.

The third question is next. Since they suppose that failures that are psychologically costlier to share drive the information lopsidedness, they could compare the gap on topics that are considered easier or harder to share (based on the number of failures reported compared to the true rate).

This is where they contrasted issues in women's health (under-reported) with #MeToo (over-reported), measuring the gap with three pairs of statements, each pair sharing the same true failure rate:

  • 40% of women live with heart disease, 40% of women have experienced sexual harassment in the workplace
  • 50% of women receive a cancer diagnosis, 50% of women have experienced unwanted sexual touching
  • 60% of women receive a UTI diagnosis, and 60% of women have received unwanted sexual advances.

The gap here is in line with the theory:

Q1 has 30% and 51% estimates; Q2 has 38% and 58% estimates; Q3 has 57% and 67% estimates

This lines up with the over- and under-reporting. The idea is that because #MeToo destigmatized sharing sexual misconduct stories, they became less costly to share and led to an overestimation of the failure rate since the information balance had changed. The connection to the reporting rate also lets the authors argue that optimism bias is not in play here, since reporting rates would not influence that bias.

The following question, covered in experiments 4 and 5, is whether we can influence how large the failure gap is by manipulating the exposure to information people will have.

The first one (Study 5 in the paper) is about medication: a group is asked to estimate the efficacy of painkillers, and the researchers looked at whether showing them lopsided reviews (mostly positive) would widen the gap. It did, even when people were told the reviews weren't trustworthy.

The second one (Study 4 in the paper) exposed people to more information about failure among students. They first asked participants to estimate the percentage of adults who graduated college, then gave them 10 Google News hits about college, and asked for a second estimate. One group received a true-rate news sample (4 stories about graduating, 6 about not graduating), and another group received a ratio reflective of the usual news coverage (9 graduating, 1 not graduating).

There was no change in the gap for the latter group, but the gap was reduced (it even disappeared) for the true-rate folks. They also checked with fake information that didn't look realistic, which had no effect (and removed concerns about priming). They add:

While we focus on the impact of shared information here, we do not expect a 1::1 correspondence between the rate at which failure is discussed in shared information and observers’ beliefs about the rate at which failure occurs. Rather, we expect that the rate at which failure is discussed to move the needle, shifting peoples’ estimates towards the ratio of success::failure in the information before them.

Moreover, information shown to participants ought to shift their estimates more when it violates expectations.

[...]

Is there a way to correct people’s underestimates of failure in a world where shared information is lopsided? Study 4 suggests an answer. When shared information reflects the actual rate at which failure occurs, the failure gap attenuates. In contrast, in Study 5, accompanying lopsided shared information with a bold disclaimer had no effect.
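
One toy way to picture those two quoted points (my own illustration, not the authors' model): treat the post-exposure estimate as shifting toward the failure ratio in the information shown, with a pull that grows when the sample violates expectations. The starting estimate and the weighting rule below are made up; only the 6-in-10 and 1-in-10 story ratios come from the study description above.

```python
def shifted_estimate(prior: float, sample_failure_ratio: float, persuasiveness: float = 1.0) -> float:
    """Move an estimate toward the ratio seen in shared information, more so when it surprises."""
    surprise = abs(sample_failure_ratio - prior)
    weight = min(1.0, persuasiveness * surprise)
    return (1 - weight) * prior + weight * sample_failure_ratio

prior = 0.20                # made-up initial estimate of the non-graduation rate
true_rate_sample = 6 / 10   # "true rate" condition: 6 of 10 stories about not graduating
usual_news_sample = 1 / 10  # usual media ratio: 1 of 10 stories about not graduating

print("after the true-rate sample: ", round(shifted_estimate(prior, true_rate_sample), 2))   # shifts up a lot
print("after the usual-news sample:", round(shifted_estimate(prior, usual_news_sample), 2))  # barely moves
```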

For the sixth experiment, they asked what the consequences of the failure gap are and what happens when it gets closed. They had 3 groups of people (an online sample, educators, managers) in 3 sessions, each divided into control and gap-closed groups:

  • One group is asked, "Do you think prison is an appropriate punishment for someone who fails to appear in court?", and a second group is asked the same question after being shown the true rate to close the gap (1 in 3 people fails to show up), with results compared across groups;
  • Another group is asked, "Do you support suspending students who commit infractions (hitting, damaging property, cyberbullying)?", then the true rate is shown to those who said yes (over 3 million students do it every year), and the question is asked again.
  • Every manager is asked "How hesitant would you be to hire a candidate with intrusive or obsessive thoughts?", then the true rate (94% of people have them) is shown to all of them and the outcome is measured.

Each group has a slight methodological variation to cover more ground, but in all cases, the desire for punishment decreases:

  • ~38% of people would punish people failing to appear in court, but ~29% would if told the true rate
  • 41% of educators would punish a student, but ~20% of them decided they'd rather not punish after being told the true rate
  • ~64% of managers would rather not hire someone, which went down to ~47% after being told the true rate

Basically, across the board, closing the gap reduced the desire for punishment.

The final study (7) asked whether exposing the gap would encourage people to act to fix the related issues.

In this one, they asked managers if they'd be willing to extend paid parental leave to new mothers. ~75% said they would, and this went up to ~77% after being told 94% of mothers experience health problems within 6 months.

In a second part, they checked whether participants would support an environmental initiative ("Do you think the government should channel taxpayer money away from other key city initiatives in order to update water-supply infrastructure that provides clean drinking water to Americans?") either by showing an environmental group's messaging or by just stating the true rate ("2 million Americans don't have access to clean drinking water"). The environmental group's messaging got 73% support, and the true-rate statement got 84.5%.

Covering all seven studies, the authors go over the results again:

Across seven studies, people were systematically unaware of the rate at which things go wrong. For every three species that go extinct, the public knows about one; for every five weapons undetected by airport security, the public thinks one sneaks by. People underestimated tens of thousands, and in some cases, millions, of failures. For example, they were unaware of millions of adults with poor educations, poor relationships, and declining mental health.

[...]

[...] Knowing the scale of a problem is so fundamental to motivating action that simply sharing the true rate at which various societal problems occur had the power to galvanize change.

They suspect that since negative information is psychologically costlier to share, contextual factors (humble leaders encouraging vulnerable sharing, psychologically safe environments) could help reduce that cost. They restate that their results are context-dependent (and may change with time and cultures).

They do a good tour of their work's limitations, revisit the moderating effects of various bias types, re-assert context-dependence, and state that further research would be required to fully confirm their theories around the cost of sharing and the factors shaping people's impressions of overall negativity. They conclude:

Encouragingly, closing the failure gap led lay citizens and global leaders to back needed change across issues as divisive and diverse as paid parental leave, criminal justice reform, and inclusive hiring practices in the workplace. Merely sharing the true rate at which things go wrong motivated change. Closing the failure gap reduced support for harsh punishment among educators in the field, reduced stigma among hiring managers, and promoted support for paid parental leave among global leaders. All things considered, the failure gap is common and crippling, yet likely, correctable.

Note: I checked, and the "global leaders" are the managers in 7a, recruited at industry conferences; the paper itself does not say at what level or in what size of organization they worked, so this might be somewhat misleading wording?

As a personal comment, this is an interesting framing when coupled with the idea that in many organizations, bad news sometimes does not travel very far: issues are either hidden, sugar-coated, or handled locally as normal work, which makes failures harder to see for other participants. The ideas around blame-awareness and psychological safety in learning from incidents, and the desire not to make incidents go away fully through carrots and sticks, can all impact the ability to surface true rates and the sort of reactions people have to these events.

Sun, 16 Nov 2025 07:20:00 EST https://ferd.ca/notes/paper-the-failure-gap.html
Paper: Behavioural science is unlikely to change the world without a heterogeneity revolution https://ferd.ca/notes/paper-behavioural-science-is-unlikely-to-change-the-world-without-a-heterogeneity-revolution.html

I was listening to a podcast about AI/LLMs and teams with Dr. Cat Hicks where she mentioned this article by Christopher J. Bryan, Elizabeth Tipton, and David S. Yeager titled Behavioural science is unlikely to change the world without a heterogeneity revolution. I'm always a fan of cleverly titled papers and the reference made me want to dive into it.

This article fits within the broader replication crisis, narrowed specifically to behavioral science used to set policies, where past studies have been shown not to always replicate nicely when re-attempted. The text argues that while this is sometimes due to bad methods or outright fraud, heterogeneity (effects not applying the same to every group or demographic) can often be in play. More than that, their argument goes that if we embrace that idea, it can actually be used to better test the models behind experiments.

A few things are stated as fueling the current crisis: a lot of the concern and effort went into controlling false positives ("type-I errors") rather than false negatives ("type-II errors"), but mainly there was too big a focus on finding main effects, treating the primary independent variables you're trying to isolate as always being involved whenever some phenomenon happens. In the context of policies, the impact could be pretty bad:

the current heterogeneity-naive, main-effect-focused approach could lead to policies that perpetuate or exacerbate group-based inequality by benefiting majority-group members and not others. [...] a narrow focus on main effects in the population as a whole almost necessarily means a focus on effects in the group with the greatest numerical representation [...] To the extent that members of minority groups are either benefitted less or harmed by an intervention that benefits the majority group, the result will be worsening inequality.

They point out that "nearly all phenomena occur under some conditions and not others" is sort of a scientific truism, so we shouldn't really be surprised by that: that's why you do carefully controlled experiments in laboratories. What they suggest then is to assume:

  1. That intervention effects depend on the context
  2. That experiments reporting 'true effects' while ignoring or downplaying heterogeneity should be doubted
  3. That variation in effects and estimates is to be expected even without false positives.

They then pair that up with recommendations that would change approaches to behavioral science and policy, but first they cover a good example case, the Opower experiment.

In short, the Opower company ran experiments to see if telling people how much power their household consumed, compared with their neighborhood's average, would nudge them into driving overall power consumption down. They ran randomized controlled trials (of more than 600,000 households) and found a 2% reduction in consumption, for what is a really cheap intervention. As they scaled the project up (to 8 million households), however, the impact got lower and lower.

The authors point out that because the study had a lot of data, they were able to analyze it and find that heterogeneity was a good explanation for most findings: the first studies turned out to have been run in a narrower set of communities, where people might have heated pools and be middle-class or above, and therefore either had strong environmentalist attitudes or plenty of low-hanging fruit to act on. As the project expanded to a more diverse set of households, it reached people who cared less about the environment, or, say, lower-class households where there was not a lot the owners could change to save energy without further investment.

What this showed, then, is not that this type of intervention is unreliable and useless, or that the policy was wrong, but that it will be more effective in some contexts and less so in others. The authors point out that this happens to many more studies, which unfortunately don't attend to heterogeneity enough to figure out these effects:

In a recent systematic review of behavioural intervention research in the choice-architecture or nudge tradition (154 studies), the overwhelming majority of behavioural intervention experiments (98%) relied on haphazard samples—convenient and willing institutional partners, anonymous crowdsourced online participants, university participant pools and the like. [...] only 18% of studies provided even minimal information about characteristics that might moderate effects.

Without measuring for heterogeneity, these studies cannot begin to frame what sort of context lets the effect happen, because the context is essentially unknown. The implicit assumption seems to be that if an effect is found, it will always be there and hold across groups, which is increasingly thought to be incorrect.

The authors argue that embracing this heterogeneity can instead make studies better: if you can figure out good moderators, you can likely identify more causal mechanisms when building your theory. With the Opower example, you might theorize that the nudge provides a small motivational push, and as such only yields savings where people find them easy to achieve. You can then probe the effect by manipulating those moderators: maybe reducing how much you heat the pool works fine, but swapping appliances in a working-class household doesn't:

Manipulating mechanism-specific moderators—referred to as ‘switches’—in this way allows researchers to test theories of causal mechanism by showing that a treatment effect is weakened or eliminated when a switch is ‘turned off’. The logic here is the same, for example, as that behind neuroscientists’ use of transcranial magnetic stimulation and related techniques to temporarily (and harmlessly) attenuate or intensify neural activity in specific brain structures in order to elucidate their causal role in particular cognitive or social functions.

The question they're asking, then, is how research could be designed differently if heterogeneity were seen as a tool for building better theories and finding more robust and predictable effects.

They mention this is starting already. Some journals ask authors to articulate limits to the generality of their findings; target populations are to be identified earlier in the process, while thinking of the role this serves; statisticians are working on better models for heterogeneous samples; scholars are trying to collect data in a way that better accounts for this pattern at every stage.

So far, the focus is procedural, with the idea that once procedures are better handled, you can start looking at more meaningful things like culture and demographics. There's a cool table listing common sources of these effects, with multiple examples as well:

Table 1. Examples of common sources of treatment-effect heterogeneity in behavioral intervention research

This isn't necessarily a surprise to behavioral science, but not all studies are designed for this. They give the example of the National Study of Learning Mindsets (NSLM), which aimed to identify how low-achieving students could be helped to improve their grades:

the NSLM was not designed to find large average effects. Instead, it aimed to study treatment-effect heterogeneity in order to learn about the theoretical mechanisms behind its effects. For instance, the NSLM over-sampled schools that were expected to have weaker effects, such as very low-achieving schools that were presumed to lack the resources to benefit from a simple motivational treatment, and very high-achieving schools that were expected not to need an intervention. This gave the study sufficient statistical power to test for interactions.

With a study design that sought out these effects, machine learning tools, Bayesian methods to estimate effect sizes, and independent statistician involvement, they were able to show that supportive contexts were required for online "growth-mindset" interventions to improve things. And even though they were able to show a large overall effect, their ability to show how heterogeneity applies paints a better picture and increases the odds of properly replicating the effects.

Figure 1: Relation of the study population to a hypothetical study's sample size and estimated treatment effect

This illustration shows the relationship: if you ran an early study under population a, you might find a weaker effect in b, fail to replicate it under c, and find that scaling the intervention to the whole population (d) provides no benefit:

interpreting this result only in terms of the main effect would miss the fact that there is a real and sizeable segment of the population for whom the average effect is substantially larger and perhaps more clearly important from a policy perspective.
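
A minimal sketch of that dynamic (all numbers made up, not taken from the NSLM or Opower): if only part of the population responds to an intervention, the estimated main effect shrinks as the sample broadens toward the whole population, even though the responsive subgroup's effect never changes.

```python
# Hypothetical subgroups with different true treatment effects (arbitrary outcome units).
subgroups = {
    "strong responders": {"share": 0.2, "effect": 2.0},
    "weak responders":   {"share": 0.3, "effect": 0.5},
    "non-responders":    {"share": 0.5, "effect": 0.0},
}

def average_effect(included: list[str]) -> float:
    """Population-weighted main effect across the included subgroups."""
    total_share = sum(subgroups[g]["share"] for g in included)
    weighted = sum(subgroups[g]["share"] * subgroups[g]["effect"] for g in included)
    return weighted / total_share

print(average_effect(["strong responders"]))                                       # narrow early study: 2.0
print(average_effect(["strong responders", "weak responders"]))                    # broader replication: 1.1
print(average_effect(["strong responders", "weak responders", "non-responders"]))  # whole population: 0.55
```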

Basically, proper tracking of heterogeneity could let you define a broader portfolio of targeted interventions, each effective in its context, without the cost of applying every intervention to every group. In the new heterogeneity revolution, the authors want research to consider:

  1. Intervention effects are expected to be context- and population-dependent
  2. Decline of effects in later studies isn't necessarily a sign of bad research but of needing to narrow down contexts

The main issue they identify is that this is logistically really difficult to do. Individual studies (and scholars) can't afford to build the infrastructure for all of this, and so it needs to be shared. They compare this to how physicists ran into some limits where everyone needed particle accelerators (and ever-larger ones) and so they more or less got together and created the Large Hadron Collider. Some efforts of that kind already existed, but the authors felt more needs to be done yet.

They close with:

What makes us so confident that a heterogeneity revolution is coming? Scientific revolutions emerge when it becomes clear that a field’s existing paradigm cannot explain its empirical findings.

Their point is that larger samples and pre-registration alone won't be enough, but that the systematic study of heterogeneity might be.

As a side comment, this article was published in 2021. Since then, the US government has fought any attempt at science that considers diversity. Considering the article's early warnings about how heterogeneity-naive approaches can lead to policies that perpetuate inequality, this newer development is likely to hurt a lot of science, and also to encourage a reinforcement of existing biases and dynamics at the policy level.

Tue, 26 Aug 2025 09:00:00 EDT https://ferd.ca/notes/paper-behavioural-science-is-unlikely-to-change-the-world-without-a-heterogeneity-revolution.html
Paper: Empirically derived evaluation requirements for responsible deployments of AI https://ferd.ca/notes/paper-empirically-derived-evaluation-requirements-for-responsible-deployments-of-ai.html

As if we weren't talking about AI enough these days, I stumbled upon this new paper by Dane A. Morey, Michael F. Rayo, and David D. Woods, titled Empirically derived evaluation requirements for responsible deployments of AI in safety-critical settings.

The paper covers a study where they ran experiments measuring the impact of AI augmentation on nurses' and nursing students' performance when doing remote monitoring of patients' vitals. I found this one interesting both for the results it got and for the experiment design and the rationale behind it. I'm going to play around with the order a bit and start by covering the experiment and its results, then some of the design elements I found compelling, and then their conclusions.

The charts used to do remote monitoring looked like the following:

Example patient case with AI recommendations and explanations. The display is split in 3 main areas, over a slider to select a time window to explore within the last 24h. On the left, there's a large red bar that shows an algorithmic prediction of emergency events predicted from 0 to 100% in the next 5 minutes. The central part is an oxygen saturation chart above a heart rate chart. On the right a series of smaller charts show patient information, labs, medication levels, blood saturation, blood pressure, respiration, and heart rate. As an AI explanation for the emergency percentage prediction, significant charts are overlaid with red color.

This variant is the one showing the basic data available to nurses, an AI's prediction, and the explanation of the AI's algorithmic choice. The big red bar on the left shows the AI's prediction of some emergency event happening in the next few minutes. The areas of the patient's health charts highlighted in red represent which data contributed to the AI predictions.

Other variants were used: without the explanations (only the prediction and chart), without the prediction (only red highlight explanations and chart), and the un-augmented approach (only basic charts).

These were shown to 450 nursing students and 12 licensed nurses, each seeing 10 randomized historical patient cases with randomized display variants. They had to score their own level of concern about the patient developing issues in the next five minutes, as quickly as possible, and the authors then analyzed the results to see the impact of AI augmentation on those judgments. They specifically chose their 10 cases to span the whole spectrum from "the AI is very right" to "the AI is very wrong".

Figure 2. Impacts of AI augmentation on human-AI performance. The magnitude of AI error was calculated as the absolute difference between AI recommendations and ground truth (0% for non-emergency patients or 100% for emergency patients). The impact of AI augmentation in the three AI-augmented conditions was calculated as the percentage difference between nurses’ concern with and without AI augmentation divided by the extent to which nurses differentiated emergency from non-emergency patients without AI augmentation. Positive and negative values correspond to changes in nurses’ concern which increased and decreased nurses’ differentiation of patient types, respectively. For emergency patients, results are separated by experimental condition: AI recommendation only (a), AI recommendation and explanation (b), and AI explanation only (c). For non emergency patients, results are similarly separated by experimental condition: AI recommendation only (d), AI recommendation and explanation (e), and AI explanation only (f). Each line represents the estimated marginal means from the generalized linear mixed effects model for nurses’ reported concern with 95% confidence intervals. Circles represent the mean of nurses’ concern in each case calculated from the unmodeled data, transformed with the same percentage difference calculation. Error bars correspond to 95% confidence intervals. Solid lines and closed circles represent the results of nurses using one of the three AI-augmented experimental conditions. Dashed lines and open circles represent the results of nurses using the baseline experimental condition without AI augmentation.
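
Reading that caption, the headline metric works roughly like the arithmetic below; all numbers are invented for illustration, and the paper's actual estimates come from a generalized linear mixed-effects model rather than raw means.

```python
# Hypothetical mean concern scores (0-100) for a single emergency patient case.
concern_with_ai    = 80.0  # mean concern when nurses saw an AI-augmented display
concern_without_ai = 65.0  # mean concern with only the basic charts

# How far apart nurses rated emergency vs. non-emergency patients without AI augmentation.
baseline_emergency     = 65.0
baseline_non_emergency = 25.0
baseline_differentiation = baseline_emergency - baseline_non_emergency

# Positive values mean AI augmentation pushed concern in the direction that sharpens the
# emergency/non-emergency split; negative values mean it blurred that split.
impact = (concern_with_ai - concern_without_ai) / baseline_differentiation * 100
print(f"impact of AI augmentation: {impact:+.0f}%")  # +38% with these made-up numbers
```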

They found multiple things. First, neither the nurses nor the AI were universally superior to the other. There were some cases where each did better or worse than the other. More significantly though, when the AI was right, the joint system where nurses were augmented by AI did better. When the AI was wrong, they did far worse.

The magnitude of maximum degradation was nearly twice the magnitude of maximum improvement. When AI recommendations were most misleading, nurses’ reports of concern for emergency patients were indistinguishable from non-emergency patients, and vice versa. We observed a similar effect of lesser magnitude when nurses were presented with only AI explanations. These results suggest all forms of AI augmentation significantly influenced nurses’ perceptions of patient risk.

The authors particularly emphasize this aspect of risk:

the strong influence of AI recommendations appeared to propagate AI vulnerabilities and induce dangerous confusions in cases which were otherwise routine for nurses.

They point out that a better algorithm does not necessarily yield better results. The critical part is how the nurses and the AI interact as a joint system. Optimizing algorithms may not optimize the joint system, and might instead harm it. They add that "human supervision may not fully mitigate the risk of all AI errors, even when provided with explanations".

Basically, if the AI recommendations can be persuasive and subject to error, so can the explanations. This is interesting because the theory generally goes that if AI can provide the explanation behind its outputs, people can better audit and double-check them. Empirically, this does not seem to happen; explanations may just make the recommendations sound more credible and degrade overall system performance.

Now, the 12 licensed nurses may seem like a low count next to the 450 students, but another finding here is that there was seemingly no strong relationship between nurse experience and joint human-AI performance. The authors also point to a study done with radiologists that similarly found no reliable effect of experience (years of experience, subspecialty, or familiarity with AI tools) on the impact of AI assistance. They caution that smaller differences may exist but may not be detectable by their study.

This is something I find interesting because there are direct parallels between these ideas and things we generally take for granted in software, such as people hoping to get better results out of chain-of-thought prompts or other model reasoning output as a similar "explainable AI" sort of mechanism. There's also a widely repeated belief that the risks of clumsy LLM results are best prevented or handled by senior engineers, and that junior engineers could be more error-prone with AI tools. Based on these studies, we should verify these claims for software, because neither might live up to expectations.

As the authors state:

These risks are not unique to AI algorithms but rather fundamental risks of misleading information. Our findings add to growing concerns about the effectiveness of explainable AI to reliably help people recognize and recover from AI errors.

Although different AI algorithms may produce different results, similar findings suggest susceptibility to AI errors might be a common feature of recommendation-centric human-AI architectures.

Put another way, if your AI interaction model is focused on recommendations, you may be stuck with this sort of joint-performance pitfall, where AI errors are amplified rather than corrected.

This brings us to one of my favorite parts of how they designed the experiment. The way they set it up was meant to cover results ranging from very good to very bad for the AI. A common objection to this design is that the frequency of events and errors shown should match reality: if you expect the AI to be right 95% of the time, it feels disingenuous to spend 50% of your scenarios on the 5% of cases where it is wrong!

There's a two-part counter to that. First, they state:

Improving the AI algorithm will likely reduce the frequency of poor performance but it may not reduce the range of possible performance or the consequences of poor performance. Therefore, improving the AI algorithm does not necessarily lessen the impact of AI errors. On the contrary, prior research has suggested recognizing and recovering from errors might be more difficult if the frequency of error is rare.

Or put another way: if you want to look at error recovery, you have to look at errors, and the rarer the errors are in the field, the harder recovery ought to be for practitioners who encounter them less often. The second part of the counter-argument is that aggregating values into an overall accuracy rate ignores the weight of the consequences:

If we had representatively sampled from the distribution of AI outputs or computed a weighted average of our results based on the probability of occurrence, the frequent but smaller benefits from correct AI recommendations might have outweighed the rare but larger costs from misleading AI recommendations. However, this aggregation misrepresents how healthcare practitioners are held accountable. Nurses are responsible for patients individually, not in the aggregate. Major harms are not excused by the accumulation of minor benefits.

Practitioners can be sued, may be penalized, have their license revoked, or end up jailed because of errors they are involved with. They are judged on the worst outcomes, in isolation. We should not, then, relax the criteria used for an AI to paper over that sort of responsibility, especially in a joint system where the human will likely still get to pay for these errors.
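
To make the aggregation argument concrete (the probabilities and values below are invented), a frequency-weighted average can look like a net win even when the rare outcome is the one a practitioner is individually held accountable for:

```python
# Hypothetical outcomes of AI-assisted monitoring, valued from the patient's perspective.
outcomes = [
    {"name": "AI correct, quicker assessment",      "probability": 0.95, "value": +1.0},
    {"name": "AI misleading, missed deterioration", "probability": 0.05, "value": -15.0},
]

expected_value = sum(o["probability"] * o["value"] for o in outcomes)
worst_case = min(o["value"] for o in outcomes)

print(f"expected value per case: {expected_value:+.2f}")  # +0.20: looks like a net win in aggregate
print(f"worst single case:       {worst_case:+.2f}")      # -15.00: what the practitioner answers for individually
```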

I initially thought this was a bit of a weird study design, less realistic or relevant than it could be, but in the end I really, really appreciate this argument that aims to maintain proper alignment of real-world accountability and tool evaluation criteria.

This lines up with their main points and conclusions: AI capabilities alone do not guarantee a safe and effective joint human-AI system. To know whether it is safe and effective, the two criteria they identify are:

  1. empirically measure the performance of people and AI together
  2. examine a range of challenging cases which produce a range of strong, mediocre, and poor AI performance.

As a conclusion, I like this little summary they had in the text:

The goal of developing augmentative AI technologies should not be to improve AI algorithm performance, but rather to enable people to effectively accomplish cognitive work.

Isolated benchmarks might show the AI doing great, but that doesn't mean the end result, where people use or are guided by AI as a joint system, is actually going to match the improvements seen in the AI itself.

Mon, 23 Jun 2025 10:30:00 EDT https://ferd.ca/notes/paper-empirically-derived-evaluation-requirements-for-responsible-deployments-of-ai.html