Eike Petersen, Senior Scientist at Fraunhofer Institute for Digital Medicine MEVIS (https://e-pet.github.io)

The path toward equal performance in medical machine learning
2023-07-23

What does it take to build medical machine learning models that work equally well for all patients? Or, inversely, why do models often work better for some patient groups than others? This is what we set out to answer in our recent Cell Press Patterns perspective.

More specifically, how can we level up performance instead of achieving equal performance by pulling down all groups to the lowest performance level? (See the excellent work by Sandra Wachter, Brent Mittelstadt, and Chris Russell on this.) We argue that there is a path toward equal performance in most medical applications. It may, however, be long and winding.

Representation in the training data is an obvious and important issue, but it is not the only one. Patient groups can (and often do) still underperform even if they are well-represented. There is another crucial factor: the difficulty of the prediction task may differ between patient groups. We mean this in a purely statistical sense - the relationship between prediction inputs and prediction targets can simply be less deterministic in some groups. Why would difficulty differ between patient groups? Measurements may be more noisy (e.g. abdominal ultrasound in obese patients), or there may be unobserved causes of the outcome that are more important in some groups than others (e.g. hormone fluctuations, comorbidities).

Complicating everything, the prediction targets may be noisy (think disease labels automatically inferred from EHRs) or biased against certain groups (think underdiagnosis), affecting both the model and its evaluation. Same for selection biases. Under label noise/bias and selection bias, the equality of performance metrics on a test set is neither necessary nor sufficient for “performance fairness”. Thus, investigating and addressing these should be the first priority!

To address differing task difficulty, the causes of the differences must be understood and resolved. This may require identifying appropriate additional (or alternative) measurements that resolve residual uncertainty in affected groups. In other words: we may not always need more data from underperforming groups, but rather better data. (Or both.) Importantly, standard algorithmic fairness solutions are strictly limited in what they can achieve: if the statistical relationship between inputs and outputs is simply more noisy in some group, no amount of “fair learning” can fix this!

In the paper (co-authored with Sune Holm, Melanie Ganz-Benjaminsen, Aasa Feragen), we discuss many more concrete medical examples of the different sources of bias, and we propose some tentative solution approaches. We also connect the different sources of performance differences with a group-conditional bias-variance-noise decomposition due to Irene Chen, Fredrik Johansson, and David Sontag, as well as with epistemic and aleatoric uncertainty.


Calibration by group, error rate parity, sufficiency, and separation
2022-06-06

TL;DR: Calibration by group and error rate parity across groups are compatible.

In the field of algorithmic fairness, it is well known that there are several definitions of fairness that are impossible to reconcile except in (practically irrelevant) corner cases. In this context, I have recently tried to wrap my head around why – intuitively – it is impossible for any classifier to achieve separation and sufficiency at the same time (unless either the classifier is a perfect classifier or there are no base rate differences between groups – we will get to these details in a minute). Since part of my troubles arose from a misunderstanding of what separation and sufficiency actually mean, let us start by revisiting their definitions.

In the following, assume that we have $n$ groups, $a=1$ , …, $a=n$, and a binary outcome, $y\in \lbrace \text{True}, \text{False} \rbrace$, and assume that we are analyzing a classifier that returns a risk score $r\in [0, 1].$

A classifier fulfills separation if $R \perp A \vert Y$, i.e., the risk score $R$ is independent of the group assignment $A$ given the (observed) outcome $Y$. In the binary outcome case, this translates¹ to balance of the average score in the positive/negative class (Kleinberg et al. 2016), i.e.,

\[\begin{gather} E_{a=i, y=T}[R] = E_{a=j, y=T}[R] \\ E_{a=i, y=F}[R] = E_{a=j, y=F}[R] \end{gather}\]

for all pairs of groups $(i,j)$. This seems like a reasonable requirement: we would like the classifier to be ‘equally sure’ about its predictions in all groups; otherwise, something must be off – right? (Wrong! Read on…)

A classifier fulfills sufficiency if $Y \perp A \vert R$, i.e., the observed outcome $Y$ is independent of the group assignment $A$ given the risk score $R$. A slightly stronger requirement (see Barocas et al. for a derivation that this is indeed a stronger requirement) is that the classifier be calibrated by group, i.e., that it satisfies

\[P(T \mid R=r, a=i) = r \quad \forall\, r \in \mathop{supp_i}(R), \quad i \in \{1, \ldots, n\}.\]

This is a property that certainly is very desirable for any model to be employed in a high-stakes environment, such as healthcare (my main field). Fortunately, it is also a property that is optimized for by most standard learning procedures, including maximum likelihood estimation / cross-entropy minimization.
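As a quick illustration (a minimal sketch; the data, the function name, and the binning scheme are all mine, not from any particular library), calibration by group can be checked empirically by binning the scores within each group and comparing the mean score to the observed positive rate per bin:

```python
import numpy as np

def calibration_by_group(scores, outcomes, groups, n_bins=10):
    """Within each group and score bin, compare the mean predicted score to
    the observed fraction of positives; for a score that is calibrated by
    group, the two should roughly agree in every bin of every group."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    report = {}
    for g in np.unique(groups):
        s, y = scores[groups == g], outcomes[groups == g]
        idx = np.clip(np.digitize(s, bins) - 1, 0, n_bins - 1)
        report[g] = [(s[idx == b].mean(), y[idx == b].mean())
                     for b in range(n_bins) if (idx == b).any()]
    return report

# Synthetic, perfectly calibrated scores: the labels are drawn from the
# scores themselves, so P(y=T | R=r, a=g) = r holds by construction.
rng = np.random.default_rng(0)
groups = rng.integers(0, 2, size=200_000)
scores = rng.beta(2 + groups, 2)  # different score distributions per group
outcomes = rng.random(scores.size) < scores

report = calibration_by_group(scores, outcomes, groups)
for rows in report.values():
    assert all(abs(mean_score - pos_rate) < 0.05 for mean_score, pos_rate in rows)
```

Note that the two groups here have different score distributions, and hence different base rates, yet both are calibrated.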

Now, unfortunately, it is a well-known result (due to Kleinberg et al. 2016) that these two properties cannot hold at the same time for any classifier – regardless of model class, how it is constructed, etc. – except if either

  1. the classifier is perfect, i.e., always returns the correct prediction for all samples, or
  2. the base rates $p(y \mid a)$ are the same in all groups.

This sounds like a real bummer, so let us try to understand why it is that the two conditions are incompatible.

Incompatibility of separation and sufficiency

This section is very closely based on Kleinberg et al. 2016. I take no credit whatsoever for the ideas presented below.

Given a group-wise calibrated risk score, we find the average risk score in group $i$ to be

\[\begin{align} E_{a=i}[R] &= \int r \cdot P(r \mid a=i) \,\mathrm dr \\ &= \int P(T|R=r, a=i) \cdot P(r \mid a=i) \,\mathrm dr \\ &= \int P(T, r \mid a=i) \,\mathrm dr \\ &= P(T \mid a=i), \end{align}\]

i.e., it is equal to that group’s base rate (as one might expect!), which we will denote by $P(T \mid a=i) =: p_i$. Equivalently, we can decompose the average risk score into two terms as

\[E_{a=i}[R] = p_i = p_i \cdot E_{a=i, y=T}[R] + (1-p_i) \cdot E_{a=i, y=F}[R].\]

Now set $x_i := E_{a=i, y=F}[R]$ and $y_i := E_{a=i, y=T}[R]$, and it becomes apparent that the average scores in the positive / negative classes of each group must satisfy the line equation

\[y_i = 1 - \frac{1-p_i}{p_i} x_i\]

if the risk score is calibrated by group.
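A small Monte Carlo check (synthetic data, for illustration only) confirms both the mean-equals-base-rate property and the decomposition above:

```python
import numpy as np

rng = np.random.default_rng(1)
scores = rng.beta(2, 3, size=200_000)        # score distribution within one group
outcomes = rng.random(scores.size) < scores  # labels drawn from the scores => calibrated

p = outcomes.mean()  # empirical base rate
assert abs(scores.mean() - p) < 0.01  # E[R] equals the group's base rate

# E[R] = p * E[R | y=T] + (1 - p) * E[R | y=F] (law of total expectation)
decomp = p * scores[outcomes].mean() + (1 - p) * scores[~outcomes].mean()
assert abs(decomp - scores.mean()) < 1e-6
```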

To achieve balance of the average scores, we would need $x_i = x_j$ and $y_i = y_j$ for all pairs of groups $(i, j)$, i.e., all of these lines would have to pass through a common point. This, however, can only happen if all the base rates $p_i$ are equal (in which case all lines are identical) or if the common point is $(x_i, y_i) = (0, 1)$ for all $i$, i.e., the classifier is perfect. In all other cases, separation and sufficiency are not compatible.

The essential intuition here is that the average score of a calibrated classifier within each group is equal to the base rate of that group. From this, it is already almost apparent that equal average risk scores in the positive/negative classes of each group cannot be achieved, if there are base rate differences (and the classifier is not perfect).
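To make the intersection argument concrete, here is the line equation in code (exact rational arithmetic; the setup is illustrative): all calibration lines pass through $(0, 1)$, and lines belonging to different base rates meet nowhere else.

```python
from fractions import Fraction

def calib_line(p):
    """The line y = 1 - (1-p)/p * x that the class-conditional average
    scores (x, y) of a group-wise calibrated classifier must satisfy,
    given base rate p."""
    return lambda x: 1 - (1 - p) / p * x

p1, p2 = Fraction(1, 4), Fraction(3, 4)  # two groups with different base rates
y1, y2 = calib_line(p1), calib_line(p2)

# Both lines pass through (0, 1) -- the perfect classifier...
assert y1(0) == y2(0) == 1
# ...and, having different slopes, they agree nowhere else.
assert y1(Fraction(1, 10)) != y2(Fraction(1, 10))
```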

(Limited) compatibility of calibration and error rate balance

Notice that it is possible to construct a classifier that is

  1. calibrated by group, and
  2. achieves error rate parity, i.e., equal TPR and FPR across groups.

The following example will illustrate this. Assume we have two groups, $a=1$ and $a=2$, and a binary outcome, $y\in \lbrace \text{True}, \text{False} \rbrace$. The two groups have different base rates:

| Group $a$ | $y=\text{True}$ | $y=\text{False}$ |
| --- | --- | --- |
| Group 1 ($a=1$) | 40 | 10 |
| Group 2 ($a=2$) | 10 | 40 |

Assume furthermore that we have a classifier with the following confusion table:

| Actual \ Predicted | True | False |
| --- | --- | --- |
| True | $\mathrm{TP}_1=40$, $\mathrm{TP}_2=10$ | $\mathrm{FN}_1=\mathrm{FN}_2=0$ |
| False | $\mathrm{FP}_1=2$, $\mathrm{FP}_2=8$ | $\mathrm{TN}_1=8$, $\mathrm{TN}_2=32$ |

Thus, the classifier has

\[\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{P}} = 1\]

for both groups and

\[\mathrm{FPR}_1 = \frac{\mathrm{FP}}{\mathrm{N}} = \frac{2}{10} = 0.2 = \frac{8}{40} = \mathrm{FPR}_2.\]

Thus, we have equal $\mathrm{TPR}$ and $\mathrm{FPR}$ across groups. This is not equivalent to achieving separation, though!

Moreover, we have the positive predictive values

\[\mathrm{PPV}_1 = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}} = \frac{40}{42} \neq \frac{10}{18} = \mathrm{PPV}_2\]

So far, we did not have to take the risk scores $R$ returned by the classifier into account at all! Up until this point, we only looked at $\mathrm{TPR}$ / $\mathrm{FPR}$ / $\mathrm{PPV}$ etc., all of which are functions of the risk score and the selected classification thresholds. To achieve calibration by group, we simply assign the correct classification probabilities as risk scores:

\[\begin{align} &P(T|R=40/42, a=1) = \mathrm{PPV}_1 = \frac{40}{42} \\ &P(T|R=0, a=1) = (1-\mathrm{NPV}_1) = 0 \\ &P(T|R=10/18, a=2) = \mathrm{PPV}_2 = \frac{10}{18} \\ &P(T|R=0, a=2) = (1-\mathrm{NPV}_2) = 0 \\ \end{align}\]

From Theorem 1.1 in Kleinberg et al. (2016) and our discussion above, we already know that this classifier cannot achieve separation. To check this, we simply observe that

\[\begin{gather} E_{a=1,y=T}[R] = \frac{40}{42} \neq E_{a=2,y=T}[R] = \frac{10}{18}\\ E_{a=1, y=F}[R] = \frac{2 \cdot 40/42}{10} = \frac{4}{21} \neq E_{a=2, y=F}[R] = \frac{8 \cdot 10/18}{40} = \frac{1}{9}. \end{gather}\]

So, in the example above, we have error rate ($\mathrm{TPR}$, $\mathrm{FPR}$) balance and calibration by group, but we do not have separation (which would indeed be impossible, per the result above).
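All of the numbers in this example can be verified mechanically; the following sketch recomputes them with exact rational arithmetic:

```python
from fractions import Fraction as F

# Confusion counts from the example above, per group.
TP = {1: F(40), 2: F(10)}; FN = {1: F(0), 2: F(0)}
FP = {1: F(2),  2: F(8)};  TN = {1: F(8), 2: F(32)}

TPR = {g: TP[g] / (TP[g] + FN[g]) for g in (1, 2)}
FPR = {g: FP[g] / (FP[g] + TN[g]) for g in (1, 2)}
PPV = {g: TP[g] / (TP[g] + FP[g]) for g in (1, 2)}

assert TPR[1] == TPR[2] == 1         # error rate parity: equal TPR...
assert FPR[1] == FPR[2] == F(1, 5)   # ...and equal FPR
assert PPV[1] == F(40, 42) and PPV[2] == F(10, 18)  # but unequal PPV

# Calibrated scores: predicted positives get score PPV_g, predicted negatives get 0.
# Average score among the actual negatives of each group:
neg_mean = {g: FP[g] * PPV[g] / (FP[g] + TN[g]) for g in (1, 2)}
assert neg_mean[1] == F(4, 21) and neg_mean[2] == F(1, 9)  # separation fails
```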

However, in the general setting, it is unlikely that it will be possible to achieve both perfect calibration by group and error rate balance: this requires that the ROC curves for the two groups intersect at these error rates. (And, if we want this to be achievable for all values of the error rates, then the ROC curves must be identical.)

It is highly unlikely that this will hold in any practical scenario: even disregarding the exact shape of the ROC curve, a weaker requirement would be that the AUROCs for the two groups should be equal, i.e., the discriminative power of the model should be identical for the two groups. There is no reason to expect this to be true, and the only way of enforcing this would be to actively reduce performance on the better-performing group. Of course, when there are large discrepancies in discriminative performance for different groups (as indicated, e.g., by widely differing AUROC values), one should, firstly, try to improve performance in the underperforming group (by, e.g., changing the model, gathering more data, or accounting for measurement biases) and, secondly, be transparent about these performance disparities as they will impact downstream applications of the model. Nevertheless, this will almost surely not lead to identical ROC curves for the different groups, and, thus, calibration by group and (exact) error rate balance will still not be achievable at the same time.
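To illustrate why equal discriminative power across groups should not be expected by default, here is a small simulation (the AUROC implementation via the rank-sum statistic and the data-generating process are both illustrative, not from any particular source): when one group's measurements are noisier, its AUROC is lower.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC computed via the Mann-Whitney U (rank-sum) formulation."""
    ranks = np.empty(len(scores), dtype=float)
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    pos = labels.astype(bool)
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(2)
n = 50_000
groups = rng.integers(0, 2, size=n)
labels = rng.random(n) < 0.3
noise = np.where(groups == 0, 1.0, 2.5)   # group 1's measurements are noisier
scores = labels + rng.normal(0.0, noise)  # latent "signal plus noise" score

auc = {g: auroc(scores[groups == g], labels[groups == g]) for g in (0, 1)}
assert auc[0] > auc[1]  # the noisier group has strictly lower AUROC
```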

Annotated references

  • Kleinberg, Mullainathan, Raghavan (2016), Inherent Trade-Offs in the Fair Determination of Risk Scores. arxiv link Their result is that calibration by group and equality of the average scores in the positive / negative groups cannot hold at the same time. The latter is not the same as error rate parity.
  • Barocas, Hardt, Narayanan (2021), Fairness and Machine Learning. https://fairmlbook.org/, specifically the chapter on classification. They show that separation and sufficiency are incompatible. Sufficiency implies calibration by group. If applied to a binary classifier, separation implies error rate parity. The issue is that we want error rate parity (~separation) for the classifier (output in $\{0,1\}$) and calibration (~sufficiency) for the underlying probability model (output in $[0,1]$). The incompatibility result only applies if we want both for the same object.
  • Alexandra Chouldechova (2017), Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. [Link] Their result is that error rate parity and positive predictive value (PPV) equality are not compatible. PPV equality is not the same as calibration, and calibration does not imply PPV equality.
  • Addendum, July 2023: Claire Lazar Reich and Suhas Vijaykumar (FORC 2021), A Possibility in Algorithmic Fairness: Can Calibration and Equal Error Rates Be Reconciled? [Link]. Very nice discussion of the compatibility of the two properties.

  1. Moritz Hardt very kindly pointed out to me that average score balance as defined in equations (1) and (2) is a necessary but not a sufficient condition for separation. 

Book review: Is AI good for the planet?
2022-05-20

I recently had the opportunity to write a book review essay for Prometheus – Critical Studies in Innovation. The book I reviewed is “Is AI good for the planet?” by Benedetta Brevini (2021), to be found here. I’d like to thank Steven Umbrello, the Prometheus book review editor, for entrusting me with this review!

Below you can find the full text of my review essay, which has been published by Prometheus here. Unfortunately, due to a mistake in the editing process, the journal version published under the above link is not the final, edited version. The text below represents the final version of the review as it should have been published.


Artificial Intelligence (AI) has been hailed as an essential solution to practically all critical challenges faced by humanity, including climate change and other natural disasters. In parallel with its proliferation over the course of the last decade, critical voices have grown stronger as well, however. These critics observe that today’s AI systems power a new type of surveillance capitalism and attention economy (Zuboff 2019), that they aggravate power disparities and cause new monopolies to emerge, and that they are based on the exploitation of cheap ‘data workers’ and extensive resource extraction that is damaging the environment (Crawford 2021). Benedetta Brevini’s short book adds to this growing body of critical literature, focusing on the relationship between AI and the climate crisis. Given the urgency of this crisis and the scarcity of detailed arguments in the public discourse on whether and how AI can contribute to its solution, Dr. Brevini’s book is certainly timely. A quick glance at the table of contents firmly establishes the author’s stance on the question posed in the title, with chapters on AI hype, data capitalism and “why AI worsens the climate crisis.” (Why) does it, indeed? Could it also be a force for planetary good?

The drivers of AI development

To understand its potential for environmental impact, Brevini first examines today’s main drivers of AI development. These are, first and foremost, economic and competitive (Bughin et al. 2018). As of early 2022, eight of the world’s ten most valuable companies are tech companies, all of them heavily invested in and profiting from AI. Brevini calls (a subset of) these companies the ‘digital lords’ (p. 26), referring to their far-reaching, near-monopolistic power over our digital lives and the lack of democratic oversight over their activities. Leaving aside discussions about the proper use of the term “AI” – current techniques might be more appropriately labelled “machine learning” (Pretz 2021) – and whether or how quickly the current approaches will bring us to “true” AI (Mitchell 2019), it is apparent that AI will have and is already having a tremendous impact across the economy. According to a 2021 McKinsey report, common business use cases of AI already span from service operations, AI-based product enhancement, marketing, supply-chain management and business analytics to manufacturing (Chui et al. 2021). An earlier McKinsey report estimates that AI may cause an additional increase in global GDP by 1.2 percent per year by 2030, putting its impact above that of earlier general-purpose technologies like the steam engine or robotics in manufacturing (Bughin et al. 2018). Comparing the development of AI to earlier breakthrough techniques like the steam engine and electricity thus does not appear unreasonable, at least if one considers these projections credible. As a result, leaders of democratic countries and CEOs of companies alike feel an intense pressure to adopt and heavily invest in AI technology to stay competitive in the global economy (pp. 16⁠–23).

Does this mean, however, that AI will help solve society’s pressing challenges, ranging from inequity to the climate crisis, rather than worsen them? That seems unlikely. Brevini, drawing on classical critiques of technological determinism, emphasizes that ‘technological “fixes” have historically been developed to remove barriers to capital accumulation, not to address inequalities’ (p. 26). In the same way, AI developments are primarily driven by the motive of profit maximization, not by societal needs, and there is little reason to believe that AI will ‘accidentally’ also solve societal and environmental problems (pp. 25⁠–29). (AI research, especially in the US, is overwhelmingly dependent on and driven by the big tech companies, a fact that has led to increasing criticism in recent years; see Whittaker 2021.) On the contrary, Brevini argues that by enabling efficiency gains, boosting productivity, increasing marketing effectiveness and powering product personalization, AI will encourage “uber-consumerism” (p. 22) and, thus, exacerbate the existing problems caused by boundless profit maximization regardless of social and environmental costs.

Environmental damage caused by AI

This, then, according to Brevini, is one of the main ways in which AI harms the planet: by acting as a catalyst for consumerism and thereby intensifying its environmental costs (p. 64). While indeed a plausible hypothesis, at least to this reader, it is not immediately evident that this is true. Brevini provides little scientific evidence, nor does there appear to be much available in the literature. Consumer demand has been increasing for a long time before the proliferation of AI; do we consume more now (or in the near future) because of AI? In this regard, the aforementioned McKinsey report mentions that “a sizable portion of innovation gains come as a result of competition that shifts market share from nonadopters to front-runners,” thus indicating that the projected economic gains do not exclusively originate in increased overall consumption (Bughin et al. 2018). It certainly appears plausible that AI adoption drives consumption in various ways (more effective marketing, product personalization, efficiency gains, cost reductions), but more research seems warranted to further substantiate this hypothesis.

The second main way in which, according to Brevini, AI contributes to the climate crisis is more direct: gathering the necessary data and training AI models consumes a large and rapidly growing amount of energy and natural resources. This occurs in various stages throughout the life cycle of an AI system. The training of large models itself is now (somewhat) well-known to have a very significant carbon footprint (Strubell et al. 2020) which will likely further explode considering the ever-increasing size of current ‘foundation models’. Some cause for hope in this regard is given by the fact that this carbon footprint is comparatively simple to track (Henderson et al. 2020; Wolff Anthony et al. 2020) and, thus, manage. For example, based on such carbon footprint tracking tools, some AI conferences are beginning to ask authors for information regarding their work’s carbon footprint. Moreover, with such carbon impact estimates now available, a multitude of potential techniques for reducing the carbon impact of model training can be explored, promising very significant impact reductions already with relatively minor changes (Gupta et al. 2021; Patterson et al. 2022). These are certainly promising steps in the right direction. Nevertheless, the carbon impact of model training represents a crucial challenge that should be more widely discussed, both publicly and in the academic community.

However, model training is only responsible for a part of the resource usage associated with the use of AI. While it does not seem justified to associate the environmental footprint of the entire global IT industry with AI as Brevini sometimes appears to do (pp. 82-87), it is undoubtedly true that a significant part of it is at least partially attributable to the use of AI (IEA 2021). Most prominently, this includes the infrastructure for gathering and processing vast amounts of data (required for training AI models) and the immense amounts of toxic and non-biodegradable E-waste associated with end-user devices exploiting AI capabilities. According to the IEA, data centres and data transmission networks each accounted for around 1% of global electricity usage in 2020 (IEA 2021), together roughly equalling the total electricity consumption of Germany. Cooling today’s huge datacentres is another significant driver of environmental impact. Motivated by energy-related expenses already making up a significant fraction of the cost of operating these systems, large gains in energy efficiency have been achieved in recent years, partly compensating for the steep increase in consumer demand and internet traffic (IEA 2021). Partly because of these efficiency gains, it has been estimated that by far the larger share of the carbon impact of today’s ICT technology stems from hardware manufacturing and infrastructure, resulting in a carbon footprint of the ICT industry that is still growing despite all efficiency gains and net-zero pledges (Gupta et al. 2021). As in many other areas of our globalized economy, the harms caused by the extraction of the required resources and the disposal of toxic waste are largely ‘out-sourced’ to poorer regions of the world, as has recently been explored in depth by Kate Crawford (Crawford 2021). 
However, it is crucial to realize that these harms are not symptoms of AI specifically; instead, they are symptoms of an economic environment that incentivizes profit maximization, consumerism and planned obsolescence at the cost of resource and energy consumption and waste production. The same can be said of Brevini’s criticism of the use of AI techniques in fossil resource extraction: the problem is not AI; the problem is that we still use and extract fossil resources.

How can AI be good for the planet?

Comparatively little space in the book is devoted to ways in which AI can be good for the planet – and these are, indeed, manifold (Rolnick et al. 2022). AI can help increase energy efficiency in various domains, optimize supply chains and develop new, sustainable materials or better batteries. It can be used to monitor greenhouse gas emissions, deforestation and wildlife conservation efforts (Tuia et al. 2022). AI can enable predictive maintenance (thus extending product lifetime), such as for wind turbines and trains, and improve the precision and efficiency of recycling plants. It can be used for precision agriculture, enabling optimal crop selection, reduced pesticide and water use, and optimal livestock health management.

In all these domains, it is essential to consider the risk of techno-solutionism, also discussed by Brevini (pp. 25⁠–35). Is an ‘AI fix’ really what is needed, or is it only a band-aid, often motivated by potential economic gains, distracting from a deeper problem? The use of harvesting drones in agriculture does not alleviate the need to switch away from intensively farmed monocultures to a more sustainable, regenerative and humane agriculture. Is our vision for a sustainable future to have large, drone-farmed fields and livestock equipped with physiological sensors and augmented reality goggles? Is this indeed sustainable, considering the environmental costs associated with the mass use of AI and drones or robots? All AI solutions come with an associated environmental cost – as discussed above – that must be outweighed by the reaped benefits. Moreover, proposed AI-based solutions must not distract us from less fashionable (and less profitable) low-tech solutions, many of which have been known for a long time. Finally, owing to the fundamental nature of this technology, AI solutions are typically associated with a risk of increased centralization and societal dependency on international profit-driven tech companies and technologies.

Containing the risks and aligning AI development efforts for maximum positive environmental impact will largely depend on society putting in place the right incentives. Brevini appears very pessimistic in this regard, writing about “the total abdication of strategic decisions and choices on the direction of AI research and development, from government to corporate boardrooms” (pp. 61⁠–62). This has certainly been true in the past, but appears to be changing now. The EU has recently put forth a whole series of far-reaching policy proposals, including the digital markets act, digital services act, and artificial intelligence act, all of which entail significant limitations to the power of the ‘digital lords’ and are meant to ensure compatibility with existing EU law. (In this regard, one should keep in mind the EU’s ongoing struggles to effectively enforce the GDPR; see Massé, 2021.) At the same time, tightening environmental regulations and rising carbon prices will also affect the AI industry, encouraging both less energy-intensive ways of operating AI systems and the development of AI-based technologies for reducing carbon emissions in other domains. As is known from the progressively worsening IPCC projections, much more stringent policy action is needed in this regard (IPCC WGIII 2022). Holding multinational companies effectively accountable for environmental harms committed along their supply chain in other parts of the world remains a crucial challenge, with direct consequences for the environmental impact of AI.

AI is a symptom

To conclude, it seems increasingly clear that AI indeed represents a new general-purpose technology that will permeate all aspects of society. Being general-purpose implies that it has the potential to both aggravate or help solve our pressing environmental problems, as has been widely emphasized. Whether the impact of AI on the climate and our natural environment more broadly will be net positive or negative will depend almost exclusively on the predominant social and economic incentives influencing AI developers and companies. As Brevini puts it in the introduction, “without challenging the current myths of limitless economic growth and boundless consumerism, without reconsidering the way in which the structures, the violence and the inequality of capitalism work, we won’t be able to achieve the radical change we need if we are to tackle the climate crisis” (p. 14). Juxtaposed with this insight, Brevini’s concluding call for action appears almost tame. She emphasizes the need for public discourse and increased tech literacy, transparency about the environmental costs of AI, green activism and more open and unbiased (by corporate influence) research about the environmental impact of AI. While these are all laudable aims, the fundamental challenge remains that societal and economic incentives are not aligned with societal and environmental needs. Arguably, the question posed in the title of the book could, thus, be reformulated as: Is the economy good for the planet?

How to transform our economy to one that is good for the planet has, of course, troubled ecological thinkers for many decades. Proposed solutions abound, from degrowth (Kallis et al.; 2012) and green growth (Hickel and Kallis, 2020) to doughnut economics (Raworth, 2017), cradle-to-cradle or regenerative design (McDonough and Braungart, 2002), Communalism (Bookchin, 2007) and stakeholder capitalism (Schwab and Vanham, 2021). So far, these proposals have seen little uptake, but one may hope that the urgency of the looming climate disaster may change this. If we indeed succeed in transitioning to an economic environment that incentivizes finding balance instead of growth at all cost, and if we do not let AI distract us from simple, low-tech, economically unattractive solutions, AI may indeed turn out to play an important role in solving the climate crisis. There are, after all, many ways in which it can help.

Brevini’s book provides neither a fully comprehensive analysis of the subject matter nor final answers or conclusions, but these also do not seem to be the book’s aims. Instead, it may serve as a spark for public discourse and an urgent call to action for more research, policy action and public advocacy on this subject. Given its brevity and its non-technical, opinionated and engaging writing style, it is well-positioned to achieve this aim.



Maximum likelihood, cross-entropy, risk minimization
2022-04-03

Really, yet another post about maximum likelihood (ML) estimation? Well – yes; I could not find a source that summarized exactly the things I needed to know, so here it is. What will you find?

  • A derivation of maximum likelihood estimation
  • A derivation of its equivalence to cross-entropy minimization, empirical risk minimization, and least squares estimation
  • A summary of some important properties of ML estimation, including under which circumstances it tends to produce well-calibrated estimators, as well as its robustness to model misspecification

The discussion is fully general and applies to both regression and classification settings, i.e., continuous or discrete variables.

Let’s go.

What is maximum likelihood estimation?

The idea is simple: given a model $q(y \vert x; \theta)$ of the conditional distribution of a target variable $y$ given input data $x$, find the set of parameters $\theta_{\text{ML}}$ that maximizes the likelihood of observing the data as they were observed:

\[\theta_{\text{ML}} = \arg\max_{\theta} q(Y \vert X; \theta),\]

where $Y$ and $X$ denote vectors (or matrices) representing a given dataset.

If the individual observations $(y_i, x_i)$ are assumed independent of one another, this can be rewritten as

\[\begin{align} \theta_{\text{ML}} &= \arg\max_{\theta} \prod_{i=1}^N q(y_i \vert x_i; \theta), \\ &= \arg\max_{\theta} \ln \left(\prod_{i=1}^N q(y_i \vert x_i; \theta) \right) \\ &= \arg\max_{\theta} \sum_{i=1}^N \ln q(y_i \vert x_i; \theta) \\ &= \arg\min_{\theta} - \sum_{i=1}^N \ln q(y_i \vert x_i; \theta). \end{align}\]

Thus, we arrive at the usual formulation of ML estimation as minimizing the negative log likelihood (NLL), sometimes also called the energy or the cross-entropy (the latter will be discussed in more detail below).
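As a minimal concrete example (illustrative only; a Bernoulli model with a single parameter), minimizing the NLL numerically recovers the closed-form ML estimate, which for the Bernoulli model is the sample mean:

```python
import numpy as np

def nll(theta, y):
    """Negative log-likelihood of i.i.d. Bernoulli(theta) observations y."""
    return -np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.3).astype(float)  # synthetic binary observations

# Brute-force NLL minimization over a parameter grid...
grid = np.linspace(0.01, 0.99, 981)
theta_ml = grid[np.argmin([nll(t, y) for t in grid])]

# ...agrees with the closed-form ML estimate, the sample mean.
assert abs(theta_ml - y.mean()) < 1e-2
```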

Maximum likelihood estimation as empirical risk minimization

Maximum likelihood estimation can be cast within the extremely broad framework of empirical risk minimization (ERM):

\[\begin{align} \theta_{\text{ML}} &= \arg\min_{\theta} - \sum_{i=1}^N \ln q(y_i \vert x_i; \theta)\\ &= \arg\min_{\theta} E_{p_{\text{emp}}}\left[-\ln q(y|x,\mathbf{\theta})\right], \end{align}\]

where $E_p$ is the expected value operator with respect to the distribution $p$, and $p_{\text{emp}}$ denotes the empirical measure defined by the observed dataset $(X, Y)$. Thus, likelihood maximization is identical to empirical risk minimization if the risk is defined as

\[\mathcal{R}(x, y, \theta) = -\ln q(y \vert x,\mathbf{\theta}).\]

Maximum likelihood estimation as cross-entropy minimization

The cross-entropy of a first distribution $q$ relative to a second distribution $p$ is defined as \(H(p, q) = -E_p[\ln q].\) Returning to our identification problem, if we choose $p=p_{\text{emp}}(y \vert x)$ and $q=q(y \vert x; \theta)$, we observe that maximizing the likelihood $q(Y \vert X; \theta)$ is identical to minimizing the cross-entropy of the distribution $q(y \vert x; \theta)$ relative to the empirical distribution $p_{\text{emp}}(y \vert x)$.

Maximum likelihood estimation as Kullback-Leibler divergence minimization

The definition of the cross-entropy above can be reformulated in terms of the Kullback–Leibler divergence (a measure of differences between distributions, also known as the relative entropy), since

\[\begin{align} H(p,q)&= -E_p\left[\ln q(x)\right] \\ &= -E_p\left[ \ln \frac{p(x) q(x)}{p(x)}\right] \\ &= -E_p\left[\ln p(x) + \ln \frac{q(x)}{p(x)}\right] \\ &= -E_p\left[\ln p(x) - \ln \frac{p(x)}{q(x)}\right] \\ &= -E_p\left[\ln p(x)\right] + E_p\left[\ln \frac{p(x)}{q(x)}\right] \\ &= H(p) + D_{\text{KL}}(p\vert\vert q), \end{align}\]

where $H(p)$ denotes the entropy of the distribution $p$ and $D_{\text{KL}}(p\vert\vert q)$ the Kullback-Leibler divergence.

Again choosing $p=p_{\text{emp}}(y \vert x)$ and $q=q(y \vert x; \theta)$, and noting that $H(p)$ is independent of our choice of model parameters $\theta$, we observe that maximizing the likelihood of the data is also identical to minimizing the Kullback-Leibler divergence between the empirical distribution $p_{\text{emp}}(y \vert x)$ and the model $q(y \vert x; \theta)$. (We would, of course, prefer to minimize the divergence with respect to the true, data-generating process $p(y \vert x)$ instead of the empirical distribution. However, this is obviously infeasible since $p(y \vert x)$ is unknown.)
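The decomposition $H(p,q) = H(p) + D_{\text{KL}}(p\vert\vert q)$ is easy to verify numerically; here is a small sketch using two made-up discrete distributions:

```python
import numpy as np

# Two hypothetical discrete distributions over three outcomes
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

cross_entropy = -np.sum(p * np.log(q))
entropy = -np.sum(p * np.log(p))
kl = np.sum(p * np.log(p / q))

# Verify H(p, q) = H(p) + D_KL(p || q)
assert np.isclose(cross_entropy, entropy + kl)
```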

Maximum likelihood and least squares

Known since the eighteenth century, least-squares estimation is possibly the single most famous parameter estimation paradigm. It turns out that under mild assumptions, least-squares estimation coincides with maximum likelihood estimation. For an arbitrary, possibly nonlinear regression model $f(x; \theta)$, we have

\[\theta_{\text{LS}} = \arg\min_\theta \sum_{i=1}^N || y_i - f(x_i; \theta) || ^2.\]

If we now assume a Gaussian noise model

\[q(y_i \vert x_i, \theta, \sigma_{\varepsilon}) = \mathcal{N}(f(x_i; \theta), \sigma_{\varepsilon}^2),\]

we obtain for the maximum likelihood estimator that

\[\begin{align} \theta_{\text{ML}}, \sigma_{\varepsilon, \text{ML}} &= \arg \min_{\theta, \sigma_{\varepsilon}} - \sum_{i=1}^N \ln q(y_i \vert x_i, \theta, \sigma_{\varepsilon}) \\ &= \arg \min_{\theta, \sigma_{\varepsilon}} - \sum_{i=1}^N \ln \frac{1}{\sqrt{2\pi \sigma_{\varepsilon}^2}} \mathrm{e}^{-\frac{1}{2 \sigma_{\varepsilon}^2} (y_i - f(x_i; \theta))^2} \\ &= \arg \min_{\theta, \sigma_{\varepsilon}} \frac{1}{2 \sigma_{\varepsilon}^2} \sum_{i=1}^N (y_i - f(x_i; \theta))^2 + \frac{N}{2} \ln 2 \pi \sigma_{\varepsilon}^2. \end{align}\]

Since the optimization with respect to the regression parameters $\theta$ can be carried out independently of the value of $\sigma_{\varepsilon}$, it follows that

\[\theta_{\text{ML}} = \theta_{\text{LS}}\]

for arbitrary functions $f(x; \theta)$. (Again, this relies on the assumption of a Gaussian noise model.)
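Here is a small numerical sketch confirming this (with synthetic data and a hypothetical one-parameter linear model $f(x; \theta) = \theta x$): the minimizer of the Gaussian NLL coincides with the closed-form least-squares solution.

```python
import numpy as np

# Synthetic regression data from a one-parameter linear model
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(scale=0.5, size=50)

# Closed-form least-squares slope for f(x; theta) = theta * x
theta_ls = (x @ y) / (x @ x)

# Gaussian NLL for a fixed sigma; the minimizer over theta
# does not depend on sigma's value
sigma = 0.5
def nll(theta):
    r = y - theta * x
    return np.sum(r**2) / (2 * sigma**2) + len(y) / 2 * np.log(2 * np.pi * sigma**2)

# Fine grid search around the least-squares solution
grid = np.linspace(theta_ls - 1, theta_ls + 1, 20001)
theta_ml = grid[np.argmin([nll(t) for t in grid])]

assert abs(theta_ml - theta_ls) < 1e-4
```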

Consistency, efficiency, calibration

Maximum likelihood estimation is asymptotically consistent: if there is a unique true value $\theta^{\ast}$ for which $p(y \vert x) = q(y | x; \theta^{\ast})$ (in other words, there is no model mismatch or model error), then a maximum likelihood estimator converges towards that value as the number of samples increases. (However, notice that even in the case where there is model mismatch, we retain the reassuring property that the ML estimator minimizes the KL divergence between the empirical data distribution and the identified model.)

Moreover, maximum likelihood estimation is asymptotically efficient, meaning that for large sample numbers, no consistent estimator achieves a lower mean squared parameter error than the maximum likelihood estimator. (In other words, it reaches the Cramér-Rao lower bound.)
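A quick Monte Carlo sketch (synthetic data; for a Bernoulli parameter, the sample mean is the ML estimator and is in fact efficient even at finite $N$) illustrates the Cramér-Rao bound $\theta(1-\theta)/N$ being attained:

```python
import numpy as np

# Monte Carlo check: the variance of the Bernoulli ML estimator
# (the sample mean) matches the Cramér-Rao bound theta*(1-theta)/N
rng = np.random.default_rng(42)
theta, N, trials = 0.3, 200, 20000
estimates = rng.binomial(N, theta, size=trials) / N
crlb = theta * (1 - theta) / N

# Empirical variance agrees with the bound up to Monte Carlo error
assert abs(estimates.var() / crlb - 1) < 0.05
```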

Finally, ML estimators also tend to be well-calibrated, meaning that

\[p(y = 1 \vert R = r) = r \quad \forall\, r,\]

where $R$ denotes the (risk score) output of the trained model. This is readily apparent from the fact that an ML-optimal model minimizes the KL divergence from the data-generating distribution, as discussed above: the optimum is only obtained if $p(y \vert x) = q(y \vert x; \theta^{\ast})$. For a more detailed discussion about how maximum likelihood estimation implies calibration, refer to Liu et al. (2018). For the same reason, the negative log likelihood has also been proposed as a calibration measure. Notice, however, that it is not a pure measure of calibration; instead, it measures a mixture of calibration and separation. Importantly, calibration of (maximum likelihood/cross-entropy-optimal) neural networks is usually only achieved for in-domain data, whereas out-of-distribution prediction typically suffers from extreme overconfidence. Various fixes have been proposed. (This phenomenon depends on the employed model: Gaussian process models, for example, typically do not suffer from asymptotic overconfidence.)

Properties of the optimization problem

The likelihood landscape (as a function of the parameters $\theta$ to be optimized) is, in general, non-convex. (It also depends on the way the model is parameterized.) Thus, global optimization strategies are required if local minima are to be escaped. (One of the various benefits of stochastic gradient descent is that it is capable of escaping local minima to some degree. It is, however, of course not a true global optimization strategy.)

On the positive side, however, the negative log likelihood represents a proper scoring rule – as opposed to, e.g., classification accuracy, which is an improper scoring rule and should never be used as an optimization loss function or to drive feature selection and parameter estimation.
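The following toy sketch (made-up labels and predictions) illustrates one symptom of accuracy's deficiency as a scoring rule: it is completely blind to the predicted probabilities themselves, whereas the log loss rewards the better-resolved forecast.

```python
import numpy as np

# Made-up binary labels and two competing probability forecasts
y = np.array([1, 1, 1, 0])
confident = np.array([0.9, 0.9, 0.9, 0.1])   # sharp, correct forecasts
hedged = np.array([0.51, 0.51, 0.51, 0.49])  # barely-informative forecasts

def accuracy(q):
    # Thresholded accuracy discards the probabilities entirely
    return np.mean((q > 0.5) == y)

def log_loss(q):
    # Negative log likelihood per sample (a proper scoring rule)
    return -np.mean(y * np.log(q) + (1 - y) * np.log(1 - q))

# Accuracy cannot tell the two forecasts apart...
assert accuracy(confident) == accuracy(hedged)
# ...but the log loss prefers the sharper one
assert log_loss(confident) < log_loss(hedged)
```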

Finally, an interesting remark on the potential for overfitting when doing maximum likelihood estimation, due to Bishop (2006), p. 206:

It is worth noting that maximum likelihood can exhibit severe over-fitting for data sets that are linearly separable. This arises because the maximum likelihood solution occurs when the hyperplane corresponding to $\sigma = 0.5$, equivalent to $w^T\phi=0$, separates the two classes and the magnitude of $w$ goes to infinity. In this case, the logistic sigmoid function becomes infinitely steep in feature space, corresponding to a Heaviside step function, so that every training point from each class k is assigned a posterior probability $p(C_k \vert x) = 1$. Furthermore, there is typically a continuum of such solutions because any separating hyperplane will give rise to the same posterior probabilities at the training data points. […] Maximum likelihood provides no way to favour one such solution over another, and which solution is found in practice will depend on the choice of optimization algorithm and on the parameter initialization. Note that the problem will arise even if the number of data points is large compared with the number of parameters in the model, so long as the training data set is linearly separable. The singularity can be avoided by inclusion of a prior and finding a MAP solution for $w$, or equivalently by adding a regularization term to the error function.

How does this not contradict all the nice properties of maximum likelihood estimation discussed above, such as consistency, efficiency, and calibration? Well, in the case discussed by Bishop, there simply is no unique optimum – instead, there is a manifold of possible solutions. As Bishop remarks, to obtain a specific solution, some prior information must be included about which of the infinitely many solutions of the estimation problem to prefer. Notice that the maximum likelihood solution discussed by Bishop is, in fact, calibrated: it correctly assigns high confidence to its predictions.
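This blow-up is easy to reproduce; the sketch below (made-up, linearly separable 1-D data, plain gradient descent on the logistic NLL) shows the weight magnitude growing without bound:

```python
import numpy as np

# Linearly separable 1-D data: gradient descent on the
# unregularized logistic NLL drives |w| toward infinity
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w = 0.0
norms = []
for step in range(5000):
    p = sigmoid(w * x)
    grad = np.sum((p - y) * x)  # gradient of the NLL w.r.t. w
    w -= 0.1 * grad
    norms.append(abs(w))

# The weight keeps growing; the loss never attains a finite minimum
assert norms[-1] > norms[1000] > norms[100]
```

(The growth is only logarithmic in the number of steps, which is why such models can still appear to "converge" in practice.)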

The special case of binary classification

For discrete probability distributions $p$ and $q$ with the same support $\mathcal{Y}=\lbrace 0, 1 \rbrace$, the (binary) cross-entropy simplifies (again assuming $p=p_{\text{emp}}$) to the often-used formulation

\[\begin{align} H(p,q) &= -E_p[\ln q] \\ &= -\sum_{y\in\mathcal{Y}} p(y \vert x) \ln q(y | x)\\ &= -\sum_{i=1}^N \left[ y_i \ln q(1 \vert x_i) + (1-y_i) \ln (1-q(1 \vert x_i)) \right]. \end{align}\]
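A short numerical sketch (made-up labels and probabilities, with `q1` denoting the model's hypothetical predicted probabilities of the positive class) confirms that this formula is just the summed NLL of the observed labels:

```python
import numpy as np

# Made-up binary labels and model probabilities q(y=1 | x_i)
y = np.array([1, 0, 1, 1, 0])
q1 = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# Negative log likelihood of the observed labels...
nll = -np.sum(np.log(np.where(y == 1, q1, 1 - q1)))

# ...equals the binary cross-entropy formula from above
bce = -np.sum(y * np.log(q1) + (1 - y) * np.log(1 - q1))

assert np.isclose(nll, bce)
```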

References

  • Bottou (1991), Stochastic Gradient Learning in Neural Networks. Link
  • Ljung (1999), System Identification: Theory for the User. Prentice Hall, second edition.
  • Bishop (2006), Pattern Recognition and Machine Learning. Springer.
  • Nowozin (2015), How good are your beliefs? Part 1: Scoring Rules. Link
  • Kull and Flach (2015), Novel Decompositions of Proper Scoring Rules for Classification: Score Adjustment as Precursor to Calibration. Link
  • Goodfellow, Bengio, Courville (2016), Deep Learning. Link
  • Guo, Pleiss, Sun, Weinberger (2017), On Calibration of Modern Neural Networks. Link
  • Liu et al. (2018), The implicit fairness criterion of unconstrained learning. Link
  • Hein, Andriushchenko, Bitterwolf (2019), Why ReLU networks yield high-confidence predictions far away from the training data and how to mitigate the problem. Link
  • Harrell (2020), Damage Caused by Classification Accuracy and Other Discontinuous Improper Accuracy Scoring Rules. Link
  • Ashukha et al. (2021), Pitfalls of in-domain uncertainty estimation and ensembling in deep learning. Link

]]>
Eike Petersen[email protected]
Beautiful boxplots in pgfplots2018-03-21T00:00:00+01:002018-03-21T00:00:00+01:00https://e-pet.github.io/posts/2018/pgfplots-boxplotRecently, I wanted to use pgfplots to create boxplots from a data file for a paper I was writing. Turns out that’s more difficult than expected, especially since the (otherwise very useful) documentation is a bit meager on this point.

To save others the hassle, here’s the result of my efforts (link to PDF version):

Some “features” of this plot that took me a while to set up correctly:

  • Box design with fill color and black border
  • Nice color scheme (based on ColorBrewer)
  • Clean plot without any unnecessary clutter
  • Getting pgfplots to work with data in row format (instead of column format, as expected by the boxplot function)

This is the code I used to generate the figure (link to code file):

\documentclass{standalone}
\usepackage{pgfplots}
% Nice color sets, see http://colorbrewer2.org/	
\usepgfplotslibrary{colorbrewer}
% initialize Set1-8 from colorbrewer (we're comparing 4 classes)
\pgfplotsset{compat = 1.15, cycle list/Set1-8} 
% Tikz is loaded automatically by pgfplots
\usetikzlibrary{pgfplots.statistics, pgfplots.colorbrewer} 
% provides \pgfplotstabletranspose
\usepackage{pgfplotstable}
\usepackage{filecontents}

\begin{filecontents*}{data.csv}
22, 26, 30, 17, 45
10, 15, 13, 12, 17
12, 30, 6,  57, 10
33, 38, 36, 25, 24
\end{filecontents*}

\begin{document}
\begin{tikzpicture}
	\pgfplotstableread[col sep=comma]{data.csv}\csvdata
	% Boxplot groups columns, but we want rows
	\pgfplotstabletranspose\datatransposed{\csvdata} 
	\begin{axis}[
		boxplot/draw direction = y,
		x axis line style = {opacity=0},
		axis x line* = bottom,
		axis y line = left,
		enlarge y limits,
		ymajorgrids,
		xtick = {1, 2, 3, 4},
		xticklabel style = {align=center, font=\small, rotate=60},
		xticklabels = {Apples, Oranges, Bananas, Melons},
		xtick style = {draw=none}, % Hide tick line
		ylabel = {Juiciness},
		ytick = {20, 40}
	]
		\foreach \n in {1,...,4} {
			\addplot+[boxplot, fill, draw=black] table[y index=\n] {\datatransposed};
		}
	\end{axis}
\end{tikzpicture}
\end{document}

]]>
Eike Petersen[email protected]