inverseprobability.com: Neil Lawrence’s Homepage

Laplace’s Gremlin and Irreducibility
2026-01-25

Stephen Hawking’s book[1], A Brief History of Time, is one of the most influential popular physics books to have been written. But it may also be the source of a confusion about the meaning of another great physicist’s idea, Laplace’s demon.

The demon appears in an 1814 essay. Here is the 1902 English translation:

We may regard the present state of the universe as the effect of its past and the cause of its future. An intellect which at a certain moment would know all forces that set nature in motion, and all positions of all items of which nature is composed, if this intellect were also vast enough to submit these data to analysis, it would embrace in a single formula the movements of the greatest bodies of the universe and those of the tiniest atom; for such an intellect nothing would be uncertain and the future just like the past would be present before its eyes.[2]

[Image: Laplace’s demon working away on the natural laws, data and computation necessary for deterministic prediction.]

Hawking was writing before the internet, and before the source was widely available, so he may be forgiven for not noticing that the demon is not Laplace’s point. It is a straw man. Within two pages Laplace goes on to explain:

The curve described by a simple molecule of air or vapour is regulated in a manner just as certain as the planetary orbits; the only difference between them is that which comes from our ignorance.[3]

Laplace’s real point is that we need probability to deal with this ignorance. I’ve come to think about this as Laplace’s gremlin, and wrote about it extensively in The Atomic Human. But as a short cut to the thinking, we might think about the irreducibility of each of the pillars of the demon. Laplace is reducing prediction to a combination of laws, data and compute. The reason that perfect prediction isn’t possible is the unavailability of the laws and the data, and the irreducibility of the compute.

The notion of computational irreducibility comes to us from Wolfram,[4] but I’m tempted to extend it here to the laws and the data. That gives me the following idea. If we can somehow prove that any or all of the components of Laplace’s demon are irreducible, does that give us an impossibility theorem for demon-like intelligences?
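As a toy illustration of why the data pillar might be irreducible even when the laws are known exactly (this example is mine, not Laplace’s or Wolfram’s), consider the chaotic logistic map: a demon holding the exact law still needs its data to unbounded precision.

```python
# Toy illustration: even with the exact law in hand, a chaotic system
# makes the demon's data requirement unbounded. Here the law is the
# logistic map x -> r * x * (1 - x) with r = 4 (fully chaotic).

def logistic_traj(x, r=4.0, steps=50):
    """Return the trajectory of the logistic map from initial condition x."""
    traj = []
    for _ in range(steps):
        x = r * x * (1.0 - x)
        traj.append(x)
    return traj

a = logistic_traj(0.3)             # the "true" initial data
b = logistic_traj(0.3 + 1e-12)     # the demon's data, off by 1 part in 10^12

gaps = [abs(x - y) for x, y in zip(a, b)]
print(max(gaps[:10]))   # still tiny over the first few steps
print(max(gaps))        # order one by step 50: prediction has failed
```

Each extra digit of precision in the initial condition buys only a few more steps of prediction, which is one way of reading Laplace’s remark about the molecule of air: the law is as certain as the planetary orbits, and the ignorance is ours.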

I don’t know, but I’m having fun playing with the idea!

The connection between these ideas and the inaccessible game may not yet be clear. But the problem I’ve encountered when looking at other approaches, such as the thermodynamics of information,[5] is that it’s very difficult to get the primitives of energy and time into the discussion (which I view as resources the demon may have access to) in a way that calibrates them with information (a third form of resource). This is based on an intuition that suggests that … to the extent that intelligence can be defined … it feels like it needs to be done in the context of these three resources … time, energy and information.

The idea in the inaccessible game is to try and better understand how they interrelate by starting with simple information-derived rules and seeing what might be emergent.
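A minimal sketch of what starting from simple information-derived rules can look like (my toy, not the construction in the papers, assuming only the idea that entropy can act as a clock): relax a distribution toward uniform with a doubly stochastic mixing step. Shannon entropy then increases monotonically, so its value can order events like a clock.

```python
import math

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def mix(p, eps=0.1):
    """One doubly stochastic relaxation step: move a fraction eps of
    each state's probability toward the uniform average."""
    avg = sum(p) / len(p)
    return [(1.0 - eps) * pi + eps * avg for pi in p]

p = [0.97, 0.01, 0.01, 0.01]       # near-deterministic start, low entropy
clock = [entropy(p)]
for _ in range(29):
    p = mix(p)
    clock.append(entropy(p))

# Entropy never decreases along the trajectory, so its value can
# serve as an internal time-stamp for the dynamics.
assert all(b >= a for a, b in zip(clock, clock[1:]))
print(clock[0], clock[-1])          # climbs toward log(4) ≈ 1.386
```

This is only the classical Shannon picture; the move to von Neumann entropy in the second paper is beyond a toy like this.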

The first paper hints at the way energy might emerge. The second paper, which came out on arXiv last week (https://arxiv.org/abs/2601.12576), suggests that the game’s clock might be entropy itself. It’s called “The Origin of the Inaccessible Game” and it builds on the first paper by looking at a ‘natural origin’ for the game where the multi-information is maximised and joint entropy is zero. This triggers some paradoxes and moves the game away from Shannon entropy to von Neumann entropy. But it does allow us to identify a distinguishable trajectory in the information dynamics and associate that trajectory with the internal entropy clock … access to external clocks is prohibited by the information isolation condition.

I’ve been very much enjoying studying these ideas and seeing the direction they take me. But for the moment I want to emphasise that they are only suggestive. I’ve had to learn maths I wasn’t deeply familiar with before and relate it to my own mathematical intuitions (that mainly come from working with information theory and stochastic processes in machine learning). I also have a sense of where I’m trying to take the maths. Together this all feels like a recipe for clumsy errors.

With that in mind I’ve tried to spot such errors, but would welcome suggestions for where I may have gone wrong.

  1. Stephen Hawking, A Brief History of Time: From Big Bang to Black Holes (London: Bantam, 1988), Chapter 12. 

  2. Pierre Simon Laplace, Essai philosophique sur les probabilités, 1814, translated as A Philosophical Essay on Probabilities, 1902, p. 4. 

  3. Pierre Simon Laplace, Essai philosophique sur les probabilités, 1814, translated as A Philosophical Essay on Probabilities, 1902, p. 6. 

  4. Wolfram, Stephen. “Computation Theory of Cellular Automata.” Communications in Mathematical Physics 96, 15–57 (1984). 

  5. I also had fun looking at this, and very much enjoyed this review paper as a way in to recent thinking: Parrondo J. M. R., Horowitz J. M. & Sagawa T. Thermodynamics of information. Nature Physics 11, 131–139 (2015). And the thermodynamics of information puts some really nice limits on the efficiency of heat engines when information reservoirs are involved. This is fascinating, but doesn’t give much intuition about how it composes to macroscopic scales, where Boltzmann’s constant means that efficiency improvements only come through burning an internet’s worth of data each second. It is clearly a significant driver at biochemical scales though. I gave a talk on this in Manchester last year but by then was already distracted by the inaccessible game idea. The notes for the talk contain some early speculations on directions and explorations of entropy flow. But I wasn’t then working with the information isolation constraint, so it’s just an entropy game or “Jaynes’ world”. 

By Neil D. Lawrence
Perpetual Motion, Superintelligence and the Inaccessible Game
2025-11-12

Imagine in 1925 a world where the automobile is already transforming society, but big promises are being made for things to come. The stock market is soaring, the 1918 pandemic is forgotten. And every major automobile manufacturer is investing heavily in the promise that they will each be the first to produce a car that needs no fuel. A perpetual motion machine.

[Image: An advert for a fuelless car, as rendered by ChatGPT after being fed the blog post.]

Well, of course that didn’t happen. But I sometimes wonder if what we’re seeing today, 100 years later, is the modern equivalent of that. This week it was announced that Yann LeCun is leaving the FAIR lab he founded. And one can’t help but wonder if the billions that Zuckerberg is now investing in superintelligence are the modern equivalent of arguing for perpetual motion.[1]

Why might this be the case? Well, that’s where entropy comes in. The second law of thermodynamics tells us that entropy always increases. So we can’t have motion without entropy production. How might we make an equivalent statement for the bizarre claims around superintelligence? Some inspiration comes from Maxwell’s demon, an “intelligent” entity which operates against the laws of thermodynamics. The inspiration comes because the demon suggests that for the second law to hold there must be a relationship between the demon’s decisions and thermodynamic entropy. One of the resolutions comes from Landauer’s principle, the notion that erasure of information requires heat dissipation.
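Landauer’s principle is easy to put numbers on. A quick sketch (room temperature and the data size here are my illustrative assumptions, not figures from the paper):

```python
import math

k_B = 1.380649e-23      # Boltzmann's constant in J/K (exact in the 2019 SI)
T = 300.0               # an assumed room temperature in kelvin

# Minimum heat dissipated to erase one bit of information.
per_bit = k_B * T * math.log(2)     # about 2.9e-21 joules
print(per_bit)

# Even erasing a terabyte at the Landauer limit is thermodynamically tiny.
per_terabyte = per_bit * 8e12       # a terabyte is 8e12 bits
print(per_terabyte)                 # on the order of 1e-8 joules
```

The smallness of Boltzmann’s constant is why the bound, while fundamental, says so little about the energy budget of macroscopic computation.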

I’ve been scratching my head about this for a few years now, but on Monday I shared an arXiv paper[2] that captures some of the directions I’ve been taking. By considering an axiomatic game where entropy is maximised I’ve been exploring how the dynamics emerge. I admit, it seems a long way from my original motivation, but it’s just been one of those threads that when you start pulling on it you end up having to go further back. I’m relieved to have the paper out, not because I know it’s all correct; I’m pretty sure I’ve misunderstood lots of things and made some clumsy mathematical errors. But I’m hoping the foundation is still solid enough to build on.

The idea is for an “inaccessible game” which is information-isolated from observers. The assumption is that inside this game the dynamics take the form of “information relaxation” which turns out to be equivalent to constrained entropy production. The nice thing is that resulting dynamics turn out to have a structure that resembles GENERIC, a formalism that is at the heart of modern non-equilibrium thermodynamics.
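For readers meeting it for the first time, the standard GENERIC form (this is the textbook statement due to Grmela and Öttinger, not the specific structure derived in the paper) splits the dynamics into a reversible part driven by an energy E and an irreversible part driven by an entropy S:

```latex
\frac{\mathrm{d}x}{\mathrm{d}t}
  = L(x)\,\frac{\delta E}{\delta x}
  + M(x)\,\frac{\delta S}{\delta x},
\qquad
L(x)\,\frac{\delta S}{\delta x} = 0,
\qquad
M(x)\,\frac{\delta E}{\delta x} = 0,
```

where L is antisymmetric (a Poisson structure generating the reversible dynamics) and M is symmetric positive semi-definite (generating dissipation). The two degeneracy conditions ensure that energy is conserved while entropy is non-decreasing.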

By building on the similarities to GENERIC one can show an energy-information equivalence that leads to a principle equivalent to Landauer’s.

All this gives me the feeling that the work is on the right track, but it’s clearly still a long way from its destination. My hope is that some friendly folks will be interested enough to take a look and help me tidy it up a bit!

I’ve no doubt that AI technologies will transform our world just as much as the automobile has. But I also have no doubt that the promise of superintelligence is just as silly as the promise of perpetual motion. Maybe the insights from the inaccessible game could provide one way of understanding that.

  1. Strictly speaking, it’s the investment in a “superintelligence singularity” that is the equivalent of perpetual motion. 

  2. The software to recreate the simulations from the paper is here: https://github.com/lawrennd/tig-code/

The Open Society and its AI
2024-02-09

This is a summary of some of the ideas presented in a talk for The State of Open 2024 on 7th February 2024. The piece is also included in the “State of Open” report on AI and Open Innovation.

In Goethe’s poem The Sorcerer’s Apprentice, a young apprentice learns one of his master’s spells and deploys it to assist in his chores. Unfortunately, he cannot control it. The poem was popularised by Paul Dukas’s musical composition; in 1940 Disney used the composition in the film Fantasia. Mickey Mouse plays the role of the hapless apprentice who deploys the spell but cannot control the results.

When it comes to our software systems, the same thing is happening. The Harvard Law professor Jonathan Zittrain calls the phenomenon intellectual debt. In intellectual debt, like the sorcerer’s apprentice, a software system is created but it cannot be explained or controlled by its creator. The phenomenon comes from the difficulty of building and maintaining large software systems: the complexity of the whole is too much for any individual to understand, so it is decomposed into parts. Each part is constructed by a smaller team. The approach is known as separation of concerns, but it has the unfortunate side effect that no individual understands how the whole system works. When this goes wrong, the effects can be devastating. We saw this in the recent Horizon scandal, where neither the Post Office nor Fujitsu was able to control the accounting system they had deployed, and we saw it when Facebook’s systems were manipulated to spread misinformation in the 2016 US election.

When Disney’s Fantasia was released, the philosopher Karl Popper was in exile in New Zealand. He wrote The Open Society and its Enemies when his hometown of Vienna was under Nazi rule. The book defends the political system of liberal democracy against totalitarianism. For Popper, the open society is one characterised by institutions that can engage in the pragmatic pursuit of solutions to social and political problems. Those institutions are underpinned by professions: lawyers, accountants, civil administrators. To Popper these “piecemeal social engineers” are the pragmatic solution to how a society solves political and social problems.

In 2019 Mark Zuckerberg wrote an op-ed in the Washington Post calling for regulation of social media. He was repeating the realisation of Goethe’s apprentice: he had released a technology he couldn’t control. In Goethe’s poem, the master returns, “Besen, besen! Seid’s gewesen” (“Broom, broom! Be as you were”) he calls, and order is restored, but back in the real world the role of the master is played by Popper’s open society. Unfortunately, those institutions have been undermined by the very spell that these modern apprentices have cast. The book, the letter, the ledger: each of these has been supplanted in our modern information infrastructure by the computer. The modern scribes are software engineers, and their guilds are the big tech companies. Facebook’s motto was to “move fast and break things”. Their software engineers have done precisely that, and the apprentice has robbed the master of his powers. This is a desperate situation, and it’s getting worse. The latest to reprise the apprentice’s role are Sam Altman and OpenAI, who dream of “general intelligence” solutions to societal problems which OpenAI will develop, deploy, and control. Popper worried about the threat of totalitarianism to our open societies; today’s threat is a form of information totalitarianism which emerges from the way these companies undermine our institutions.

So, what to do? If we value the open society, we must expose these modern apprentices to scrutiny. Open development processes are critical here: Fujitsu would never have got away with their claims of system robustness for Horizon if the software they were using was open source. We also need to re-empower the professions, equipping them with the resources they need to have a critical understanding of these technologies. That involves redesigning the interface between these systems and the humans who use them, so that civil administrators are empowered to query how the systems are functioning. This is a mammoth task. But recent technological developments, such as code generation from large language models, offer a route to delivery.

The open society is characterised by institutions that collaborate with each other in the pragmatic pursuit of solutions to social problems. The large tech companies that have thrived because of the open society are now putting that ecosystem in peril. For the open society to survive, it needs to embrace open development practices that enable Popper’s piecemeal social engineers to come back together and chant “Besen, besen! Seid’s gewesen”, before it is too late for the master to step in and deal with the mess the apprentice has made.

With thanks to Sylvie Delacroix and Jess Montgomery for comments and suggestions.

Boogeyman Diplomacy
2023-10-31

When I’m sat at my computer, on the shelf behind me, sometimes visible on my Zoom calls, is a 50-year-old stuffed toy panda. The toy was the first gift I received and it’s a legacy of Richard Nixon’s panda diplomacy with China. Détente with China was symbolised with a gift of two giant pandas; they arrived in Washington in the same month that I was born in New Jersey.

Forty-five years later, at a Parisian trilateral “International Cooperation on artificial intelligence” summit involving China, the US and France, I watched as a US State Department official made a calculated snub to their Chinese counterpart, declaring that the US would not co-operate with China on AI while it remained authoritarian, and unpicking the fruits of nearly half a century of diplomacy.

My son was given a brown stuffed bear when he was born, and by the time he was four years old we lived in a Victorian house in Sheffield, one big enough to easily accommodate my Italian in-laws on their visits to see him. The house had a large garage that had served as a billiard room, and behind that was a pile of garbage dumped by previous occupants as part of an unfinished remodelling. It contained glass, nails, lead-painted boards and goodness knows what else.

On one of my in-laws’ visits, when my son headed towards the back of the garage, my father-in-law called out to him in Italian.

"Don't go there there's a boogeyman behind the garage."

This put me in a difficult position. I also didn’t want my son to go behind the garage, but I’d have preferred him to understand the real dangers rather than be given boogeyman nightmares.

This story reminds me of the UK AI Summit and a new era of what I’ve started thinking of as boogeyman diplomacy. The way the summit is focussing on the long-term risks of “frontier models” draws to mind popular notions of Terminator robots as portrayed by Schwarzenegger. This is what I think of as the AI boogeyman.

Just like behind my garage, there are many dangers to AI, and some of those dangers are unknown. In the rubbish there could have been rats and asbestos. Similarly with today’s information technologies we already face challenges around power asymmetries, where a few companies are controlling access to information. Competition authorities on both sides of the Atlantic are addressing these. We also face challenges around automated decision making: the European GDPR is an attempt to regulate how and when algorithmic decision making can be used. The AI boogeyman is a frightening, extreme conflation of these two challenges, just as an asbestos-breathing rat would be a very disturbing conflation of the challenges behind my garage.

But just as the right approach to an asbestos-breathing rat would be to deal with the rats and the asbestos separately, so the AI boogeyman can be dealt with by addressing both power asymmetries and automated decision making. In both these areas many of us have already been supporting governments in developing new regulation to address these risks. But by combining them, boogeyman diplomacy runs the serious risk of highlighting the problems in a way that distracts us from the real dangers we face.

However, like Mao Zedong and Richard Nixon’s panda diplomacy, boogeyman diplomacy does promise a potential benefit. It is far easier to agree on an exchange of pandas than it is to bridge cultural and political divides between two great nations. Similarly, it will be far easier for China and the United States to agree on measures for imagined risks than it will be for them to agree on the significant challenges their different societies are facing. Ever since the calculated snub at that trilateral summit in Paris, the US and China have not talked on these important issues.

As Connected by Data’s open letter from yesterday shows, the UK Prime Minister has sacrificed a great deal of credibility with technical experts, civil society, and citizens across the world with his cry of AI boogeyman. But if the United States and China can be brought back around the table to begin talking again, it may be that the sacrifice is worth it.

Happy Halloween from the Boogeyman.

Storming the Castle: Data science for Covid-19 Policy
2020-10-13

This piece was written for the Bennett Institute Blog; the original version is available here.

In the classic film Monty Python and the Holy Grail, John Cleese, as Sir Lancelot the Brave, finds a note – a plea from a prisoner of Swamp Castle – beseeching the discoverer to help them escape an unwanted marriage. Responding to this distress call, Sir Lancelot singlehandedly storms Swamp Castle, slaying the guards, terrorising the wedding guests, and fighting his way to the Tall Tower. There, instead of the expected damsel in distress, cruelly imprisoned, Sir Lancelot is surprised to find a wayward groom, Prince Herbert, who sent the note after an argument with his father.

The United Kingdom is considered an international leader in science, and a pioneer in the provision of science advice. The Government has well-established structures for accessing scientific expertise in emergencies through its network of science advisers, the Scientific Advisory Group for Emergencies (SAGE) and departmental advisory committees, including the Science for Pandemic Influenza Groups that provide advice on covid-19 modelling and behavioural science. Together, these structures might call to mind a different Arthurian vision, evoking the works of Thomas Malory: the scientist as Merlin, giving wise counsel to Arthur and honing the Government’s decision-making through deep knowledge of the scientific arts.

Scientists are concerned citizens, and it is perhaps with this vision of adviser as trusted arbiter that many researchers entered into public and policy debates surrounding covid-19. While pursuing the wise Merlin, however, efforts to advise government can easily drift towards Monty Python’s Lancelot. Confident in his knowledge of castle-storming, his individual dedication and his skills in damsel-rescuing, Sir Lancelot enters the fray with only a partial understanding of the challenges and circumstances at hand.

Science policy has long sought ways of bridging the gaps between scientists and policymakers, helping each understand the ways in which evidence can inform policymaking. The UK’s response to the covid-19 pandemic has highlighted the importance of this work, and the long-standing cultural issues that make this mission so challenging. Driven by experiment and theory, scientists often seek definitive answers to a particular question, with each new study prompting more questions and illuminating areas for investigation that stretch into the future. In contrast, policy advice is often rooted to a moment in time. Events cannot wait for definitive scientific understanding. Instead, policymakers need access to high-quality advice that provides actionable insights, based on current understandings, while acknowledging areas of uncertainty.

But what constitutes the best current understanding? Any policy issue can be viewed through multiple lenses: the scientific principles at hand, the economic implications, public acceptability of potential responses, the values embedded in those responses, or operational considerations in policy delivery, amongst others. Each of these lenses is important in considering the evidence available to inform responses to the covid-19 pandemic: the complexity of the pandemic, our relatively limited understanding of the virus, and the practical difficulties of implementing public health policy demand a range of expertise.

These complex and uncertain challenges require a multi-disciplinary response.

The unprecedented nature of the pandemic has spurred multiple efforts to bring research expertise to bear on covid-19 policymaking. These have included a call for rapid assistance from the modelling community, a group providing rapid review and literature analysis, and an independently convened SAGE group. Our own experience is of another of these efforts – the Royal Society-convened DELVE Initiative.

In the early stage of the pandemic, as many governments struggled to implement policies that held back the first wave of infections, data scientists began to explore how advanced analytics could complement traditional forms of government science advice. Chaired by the President of the Royal Society and feeding into SAGE, the DELVE Initiative set out to analyse data from countries at a more advanced stage of the covid-19 pandemic, using these insights to inform UK policy responses.

Data science for ‘real world’ policy questions can only be done effectively with access to domain expertise: extracting insights from data is important, but applying these insights to policy development requires the contextual understanding brought by domain experts, including those embedded in the policymaking process. Mapping this onto the attack of Swamp Castle, Professor Lancelot is better advised to consult his colleagues at the Round Table before charging the Tall Tower – while Lancelot has the tools to break down the doors, other knights may know the residents, the routes in, and why the distress call was sent.

Bridging the ‘supply chain of ideas’ between researchers and policymakers has been core to DELVE’s approach. The breadth of covid-19’s impacts and potential policy responses has required that DELVE make connections across public health, economics, behavioural science, immunology, and more, and the value of collaboration can be seen across DELVE reports. For example, in an early report on testing and tracing systems: researchers from public health brought a wealth of insights about how to detect and manage disease outbreaks in communities, data scientists translated this to analysis that quantified the compliance rates needed to ensure testing and tracing efforts would be successful, and economists contextualised this with evidence about what interventions would encourage individuals and organisations to comply with a test, trace, isolate regime. DELVE’s remit became interdisciplinary by default, with data the focal point around which to convene domain experts.

This type of evidence synthesis would traditionally rely on collaborations developed with the luxury of time – time to understand how different disciplines frame an issue and to identify the different types of evidence that might be policy-relevant. Using data as a convenor has offered a short-cut through these discussions, by creating a common focal point from which different domain experts can explore their ideas. Arthur brought his knights together through the convening power of a sword, Excalibur; DELVE convened its scientists through data.

Despite its importance in enabling rapid evidence synthesis, in pursuing this ambitious research agenda a consistent barrier to further action has been access to data. Labouring the parallels to Arthurian legend, in many cases relevant data was so difficult to identify and access it may as well have been mythical. But in practice it was the idea that the data might exist and be accessible, rather than its actual availability, that was sufficient to convene expertise through DELVE.

Barriers to government data sharing – whether resulting from perceived legal issues, lack of capability in government, or technical barriers to data use – were well-characterised before the pandemic, but have been thrown into sharp relief in recent months. Where successful data sharing arrangements have been established to support the covid-19 response, these have tended to rely on pre-existing relationships between data scientists and policymakers that foster shared understandings of how to use data in research and policy. In some ways, other disciplines have already learned this lesson – sustained engagement between government and academia has played a central role in major policy changes across government, from the smoking ban to the Climate Change Act. If data science is to find a role in policymaking, it will need to build on these experiences.

A new model of open data science, which capitalises on the power of data in convening multidisciplinary exchanges, will be vital if we are to realise the potential of data science for research and policy. By building a community of researchers at the interface of data science and other disciplines, there is an opportunity to create exciting new research agendas that both advance data science methods and generate new insights for research and policy. Such a community would embed multidisciplinary engagement in its research culture, developing relationships and building capacity for rapid response to future policy challenges. It would seek to create a governance environment in which data can be used safely and rapidly, while ensuring that data analysis tools are made widely available, with clear information about how to use them.

This open data science model will be central to the work of the Accelerate Programme for Scientific Discovery, a new initiative from the Cambridge Computer Lab that will pursue research at the interface of machine learning and the sciences. By operating outside the traditional boundaries that separate disciplines, open data science could bridge the gap between ‘data science’ and the domains that would benefit from its tools and techniques, enabling ideas to spread rapidly and ultimately advancing scientific discovery for the benefit of society.

By Jessica Montgomery
A Primer on Decision Making with Uncertainty
2020-08-25

From 4th April 2020 I became involved in the Royal Society DELVE Initiative. The following piece was co-written with other DELVE members (but in personal capacities) as an aid to understanding the different nature of policy decision making that’s required when there is uncertainty about the science. It was originally written in April, but published on Peter Diggle’s web page at the end of August.

The current coronavirus epidemic has brought to very public attention the frequent need for policy decisions (most dramatically, the announcement of the UK government’s lockdown restrictions on 23 March 2020) to be made quickly, and on the basis of incomplete information.

The government and the general public receive scientific advice related to the coronavirus from a range of sources. One of these is the Royal Society’s DELVE initiative, whose particular focus is on data-driven methods (as opposed to modelling, which provides another very important perspective). An immediate difficulty with providing such advice is that there is a great deal that we do not know about the virus and how it is transmitted, and the kind of information we would like takes a long time to obtain, especially if one wishes to apply the usual standards of scientific rigour. But the virus will not wait for us, so governments are forced to make decisions despite many important facts being unknown. This situation is known as a ‘finite horizon’: there is a fixed time after which, if you have not acted, the decision is effectively made for you.

What constitutes good scientific advice when there is a finite horizon? With complete information, one can offer a cost-benefit analysis of the various options between which the government must choose. But when the information is only partial, probability comes into the picture, and this becomes a risk-benefit analysis.

This can be illustrated with a simple example. Suppose you are on holiday and you visit an island. At the end of the day you need to catch a ferry back to your hotel, which leaves at around 11pm, but you are not quite sure of the precise time. You are running a bit late, and as you arrive at the terminal, you see that a ferry is just about to leave. You do not speak the language and do not have time to check whether it is the right ferry.

One part of deciding what to do will be to weigh up the costs and benefits of the possible outcomes. If you do not get on the ferry, you will probably spend the night on the island, so you will consider what that would be like: whether it would be safe, whether there is anywhere to stay, etc. If you do get on the ferry, then you may end up back at your hotel, which is the ideal outcome, but you may perhaps be taken somewhere else, where you will arrive, late at night, not speaking the language.

Another part of the decision is assessing probabilities. For instance, if there are very few ferries, then you may judge that it is likely that the departing ferry is the right one, but if there are many, then you will be less sure. This assessment will have an important effect on your decision: the more likely the ferry is to be the right one, the lower the risk of ending up in the wrong place, and therefore the more sensible it is to jump on.
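The ferry decision can be sketched as an expected-cost calculation combining the two parts described above: the costs of the outcomes and their probabilities. All of the numbers here are hypothetical, invented purely for illustration.

```python
def expected_cost(outcomes):
    """outcomes: list of (probability, cost) pairs for one action."""
    return sum(p * cost for p, cost in outcomes)

# Board the ferry: suppose a 0.8 chance it is the right one (cost 0:
# back at the hotel) and a 0.2 chance it is not (cost 100: stranded
# somewhere unknown, late at night, not speaking the language).
board = expected_cost([(0.8, 0), (0.2, 100)])

# Stay on the island: a certain, moderate cost of a night's accommodation.
stay = expected_cost([(1.0, 30)])

# The action with the lower expected cost is the less risky choice.
decision = "board" if board < stay else "stay"
print(board, stay, decision)  # 20.0 30.0 board
```

Note how the decision flips as the probability assessment changes: if there were many ferries and the chance of the right one dropped to, say, 0.5, boarding would carry an expected cost of 50 and staying would become the sensible option.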

DELVE comprises a highly diverse group of experts from public health, epidemiology, economics, immunology, mathematics, statistics, machine learning and psychology. The group aims to offer the best policy advice it can, on the time frames where that advice is useful. Its reports assimilate evidence from these different fields and advise on the implications of this evidence for policy-making. Where scientific consensus is available it can give this advice on the basis of that consensus. But where there is no consensus, it can still offer advice based on our best understanding of the outcomes of any interventions that might be made. Often those outcomes will be uncertain: in such situations we try to assess the risks of the interventions and the likelihood of potential benefits, in order to offer the best possible advice given the evidence we have.

For example, DELVE recently reported on the use of face masks, concluding that more widespread face mask use can help reduce the risk of onward transmission from asymptomatic individuals and could play an important role in situations where physical distancing is not practical. The scientific evidence that this would make a significant difference to transmission rates is by no means conclusive, but neither is the scientific evidence that it would not make a significant difference. DELVE judged that there was a reasonable probability that masks would make a difference. For example, if they reduce the transmission rate, then they could potentially reduce the length of the lockdown, and each day of the lockdown is estimated to cost £2.4bn. On that basis, despite the uncertainties, we felt confident in our advice.

A few general points are worth bearing in mind.

  1. It may be that as decision-makers accumulate more evidence, they will come to realize that another decision is better: given how rapidly the evidence is changing, it is important to be flexible and ready to change our minds.

  2. In some situations maintaining the status quo until more evidence accumulates is a good default decision. But in an emergency such as the current pandemic, inaction can have serious adverse consequences, so its risks and potential benefits should be assessed along with those of possible interventions.

  3. While science has a very important part to play in decision-making, important aspects of decision-making are not scientific. For instance, science may be able to tell us that there is approximately a 1% chance of a certain catastrophe occurring, but science alone cannot determine how much money it would be worth spending on interventions to reduce that chance to 0.5%.

In summary, because there are many uncertainties associated with the situation in which we find ourselves, we will often have to make decisions in the absence of a scientific consensus. That need not prevent us from making good decisions. We should look at the evidence in order to assess the probabilities of the possible outcomes of any intervention and the likely costs and benefits of those outcomes. This will not always make the decisions easy, but it will give them a much better foundation.

Note. The authors are members of the DELVE Working Group, but are writing in their personal capacity. A good starting point for further reading is a blog-post by LSE Economists Matteo Galizzi, Benno Guenther, Maddie Quinlan and Jet Sanders: https://coronavirusandtheeconomy.com/question/risk-time-covid-19-what-do-we-know-andnot-know

Peter Diggle (https://www.lancaster.ac.uk/staff/diggle/)

The Ministry of Silly Models (2019-12-14)
https://inverseprobability.com/2019/12/14/ministry-of-silly-models

This is an imagined future, one that I believe could place the UK at the forefront of AI today. It builds upon our traditional strengths, but is forward looking.

I’ve set the scene below by imagining the application process. Imagine a 1970s style office just off Whitehall (with apologies to John Cleese and Michael Palin).

Whitehall where our action takes place

Minister: Good morning. I’m sorry to have kept you waiting, but I’m afraid my model has become rather sillier recently, and so it takes me rather longer to train it. (sits at desk) Now then, what was it again?

Mr. Pudey: Well sir, I have a silly model and I’d like to obtain a Government grant to help me develop it.

Minister: I see. May I see your silly model?

Mr. Pudey: Yes, certainly, yes.

(Shows laptop with PyTorch open in Jupyter notebook)

Minister: That’s it, is it?

Mr. Pudey: Yes, that’s it, yes.

Minister: It’s not particularly silly, is it? I mean, the first layer isn’t silly at all and the second layer merely does a forward pass with a partial recursion every alternate iteration.

Mr. Pudey: Yes, but I think that with Government backing I could make it very silly.

Minister: (rising) Mr. Pudey, (he walks about behind the desk) the very real problem is one of money. I’m afraid that the Ministry of Silly Models is no longer getting the kind of support it needs. You see there’s Defense, Social Security, Health, Housing, Education, Silly Models … they’re all supposed to get the same. But last year, the Government spent less on the Ministry of Silly Models than it did on National Defense! Now we get £348,000,000 a year, which is supposed to be spent on all our available products. (he sits down) Coffee?

Mr. Pudey: Yes please.

(Context: many of you may not yet have heard of the UK’s National Smart Assistant project, “Two-Lumps”, now widely deployed across our civil service)

Minister: Two-Lumps, would you bring us in two coffees please?

Disembodied Voice: Yes, Mr. Teabag.

Minister: … Out of its mind. Now the Japanese have a model which inverts each of its layers over its head and back again with every single iteration. While the Israelis… here’s the coffee.

(Enter a robot with tray with two cups on it. Its motor system has been trained via reinforcement learning through an evolutionary search strategy. This gives it a particularly jerky movement which means that by the time it reaches the minister there is no coffee left in the cups. The minister has a quick look in the cups, and smiles understandingly.)

Minister: Thank you - lovely. (robot exits still carrying tray and cups) You’re really interested in silly models, aren’t you?

Mr. Pudey: Oh rather. Yes.

Minister: Well take a look at this, then.

(He produces a projector from beneath his desk already spooled up and plugged in. He flicks a switch and it beams onto the opposite wall. The film shows a sequence of six old-fashioned silly modelers at the first Connectionist Models Summer School. Geoff Hinton, David Rumelhart, Michael Mozer, Terry Sejnowski, Mike Jordan, John Hopfield. The film is old silent-movie type, scratchy, jerky and 8mm quality. All the participants wear 1980’s type costume. One model has a huge input layer, one has no hidden units. Cut back to office. The minister hurls the projector away. Along with papers and everything else on his desk. He leans forward.)

Minister: Now Mr. Pudey. I’m not going to mince words with you. I’m going to offer you a Research Fellowship on the Anglo-French Silly Model.

Mr. Pudey: La Modèle Futile?

(Cut to two Frenchmen, wearing striped jerseys and berets, standing in a field with a third man who is entirely covered by a sheet.)

First Frenchman: Bonjour … et maintenant … comme d’habitude, au sujet du Le Modèle Commun. Et maintenant, je vous presente, encore une fois, mon ami, le pouf célèbre, Yann LeCun. (he removes his moustache and sticks it onto the other Frenchman)

Second Frenchman: Merci, mon petit chou-chou Leon Bottou. Et maintenant avec les couplages causales à droite, et les couplages dirigés au gauche, et maintenant l’Anglais-Française Modèle Futile, et voilà.

Neil D. Lawrence

The 3Ds of Machine Learning Systems Design (2018-11-05)
https://inverseprobability.com/2018/11/05/the-3ds-of-machine-learning-systems-design

There is a lot of talk about the fourth industrial revolution centered around AI. If we are at the start of the fourth industrial revolution, we also have the unusual honour of being the first to name our revolution before it’s occurred.

The technology that has driven the revolution in AI is machine learning. And when it comes to capitalising on the new generation of deployed machine learning solutions there are practical difficulties we must address.

In 1987 the economist Robert Solow quipped "You can see the computer age everywhere but in the productivity statistics". Thirty years later, we could equally apply that quip to the era of artificial intelligence.

From my perspective, the current era is merely the continuation of the information revolution. A revolution that was triggered by the wide availability of the silicon chip. But whether we are in the midst of a new revolution, or this is just the continuation of an existing revolution, it feels important to characterize the challenges of deploying our innovation and consider what the solutions may be.

There is no doubt that new technologies based around machine learning have opened opportunities to create new businesses. When home computers were introduced there were also new opportunities in software publishing, computer games and a magazine industry around it. The Solow paradox arose because despite this visible activity these innovations take time to percolate through to existing businesses.

Brownfield and Greenfield Innovation

Understanding how to make the best use of new technology takes time. I call this approach brownfield innovation. In the construction industry, a brownfield site is land with pre-existing infrastructure on it, whereas a greenfield site is where construction can start afresh.

The term brownfield innovation arises by analogy. Brownfield innovation is when you are innovating in a space where there is pre-existing infrastructure. This is the challenge of introducing artificial intelligence to existing businesses. Greenfield innovation, is innovating in areas where no pre-existing infrastructure exists.

One way we can make it easier to benefit from machine learning in both greenfield and brownfield innovation is to better characterise the steps of machine learning systems design. Just as software systems design required new thinking, so does machine learning systems design.

In this post we characterise the process for machine learning systems design, covering some of the issues we face, with the 3D process: Decomposition1, Data and Deployment.

We will consider each component in turn, although there is interplay between the components, particularly between the task decomposition and the data availability. We will first outline the decomposition challenge.

One of the most successful machine learning approaches has been supervised learning, so we will mainly focus on it: supervised learning is also, arguably, the technology that is best understood within machine learning.

Decomposition

Machine learning is not magical pixie dust: we cannot simply automate all decisions through data. We are constrained by our data (see below) and the models we use. Machine learning models are relatively simple function mappings that include characteristics such as smoothness. With some famous exceptions, e.g. speech and image data, inputs are constrained in the form of vectors and the model consists of a mathematically well behaved function. This means that some careful thought has to be put into choosing the right sub-process to automate with machine learning. This is the challenge of decomposition of the machine learning system.

Any repetitive task is a candidate for automation, but many of the repetitive tasks we perform as humans are more complex than any individual algorithm can replace. The selection of which task to automate becomes critical and has downstream effects on our overall system design.

Pigeonholing

The machine learning systems design process calls for separating a complex task into decomposable separate entities. A process we can think of as pigeonholing.

Some aspects to take into account are

  1. Can we refine the decision we need to a set of repetitive tasks where input information and output decision/value is well defined?
  2. Can we represent each sub-task we’ve defined with a mathematical mapping?

The representation necessary for the second aspect may involve massaging of the problem: feature selection or adaptation. It may also involve filtering out exception cases (perhaps through a pre-classification).

All else being equal, we’d like to keep our models simple and interpretable. If we can convert a complex mapping to a linear mapping through clever selection of sub-tasks and features, this is a big win.
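A toy illustration of this win (entirely my own, not from any production system): a quadratic relationship is a poor fit for a linear model on the raw input, but a single engineered feature makes the mapping exactly linear.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = 2.0 * x**2 + 1.0  # the true relationship is quadratic in x

# A linear fit on the raw input cannot capture the curve...
A_raw = np.stack([x, np.ones_like(x)], axis=1)
coef_raw, res_raw, *_ = np.linalg.lstsq(A_raw, y, rcond=None)

# ...but with the engineered feature x**2 the mapping is exactly linear,
# and a simple, interpretable model recovers the relationship.
A_feat = np.stack([x**2, np.ones_like(x)], axis=1)
coef_feat, res_feat, *_ = np.linalg.lstsq(A_feat, y, rcond=None)

print(coef_feat)  # recovers [2.0, 1.0] up to floating point error
```

The interpretable model is also the better one here: the residual of the feature-engineered fit is essentially zero, while the raw linear fit leaves most of the structure unexplained.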

For example, Facebook have feature engineers, individuals whose main role is to design features they think might be useful for one of their tasks (e.g. newsfeed ranking, or ad matching). They also have a training/testing pipeline called FBLearner. The sub-tasks Facebook are interested in are predefined, and they are tightly connected to the business model.

It is easier for Facebook to do this because their business model is heavily focused on user interaction. A challenge for companies that have a more diversified portfolio of activities driving their business is the identification of the most appropriate sub-task. A potential solution to feature and model selection is known as AutoML [1]. Or we can think of it as using Machine Learning to assist Machine Learning. It’s also called meta-learning. Learning about learning. The input to the ML algorithm is a machine learning task, the output is a proposed model to solve the task.

One trap that is easy to fall in is too much emphasis on the type of model we have deployed rather than the appropriateness of the task decomposition we have chosen.

Recommendation: Conditioned on task decomposition, we should automate the process of model improvement. Model updates should not be discussed in management meetings, they should be deployed and updated as a matter of course. Further details below on model deployment, but model updating needs to be considered at design time. This is the domain of AutoML.
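A minimal sketch of what automating model improvement can mean in practice: a validation loop that selects among candidate models without any human in the loop. The candidate family (polynomial degree), data and thresholds here are hypothetical simplifications, not a real AutoML system.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=300)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(300)

# Hold out a validation set: model choice is driven by data, not meetings.
x_train, x_val = x[:200], x[200:]
y_train, y_val = y[:200], y[200:]

def fit_poly(deg):
    """Fit one candidate model: a polynomial of the given degree."""
    return np.polyfit(x_train, y_train, deg)

def val_error(coefs):
    """Mean squared error of a fitted candidate on held-out data."""
    return float(np.mean((np.polyval(coefs, x_val) - y_val) ** 2))

# Candidate models: polynomial degrees 1..8, selected automatically.
candidates = {d: fit_poly(d) for d in range(1, 9)}
best_deg = min(candidates, key=lambda d: val_error(candidates[d]))
print(best_deg, val_error(candidates[best_deg]))
```

Rerunning this loop whenever new data arrives is, in miniature, the "model updates as a matter of course" pattern: the selection criterion is fixed at design time, and the winning model changes without discussion.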

The answer to the question which comes first, the chicken or the egg is simple, they co-evolve [2]. Similarly, when we place components together in a complex machine learning system, they will tend to co-evolve and compensate for one another.

To form modern decision making systems, many components are interlinked. We decompose our complex decision making into individual tasks, but the performance of each component is dependent on those upstream of it.

This naturally leads to co-evolution of systems, upstream errors can be compensated by downstream corrections.

To embrace this characteristic, end-to-end training could be considered. Why train individual systems by localized metrics when we can just optimize, globally, end-to-end for final system performance? End-to-end training can lead to improvements in performance, but it could also damage our system’s decomposability, its interpretability, and perhaps its adaptability.

The less human interpretable our systems are, the harder they are to adapt to different circumstances or diagnose when there's a challenge. The trade-off between interpretability and performance is a constant tension which we should always retain in our minds when performing our system design.

Data

It is difficult to overstate the importance of data. It is half of the equation for machine learning, but is often utterly neglected. We can speculate that there are two reasons for this. Firstly, data cleaning is perceived as tedious. It doesn’t seem to consist of the same intellectual challenges that are inherent in constructing complex mathematical models and implementing them in code. Secondly, data cleaning is highly complex, it requires a deep understanding of how machine learning systems operate and good intuitions about the data itself, the domain from which data is drawn (e.g. Supply Chain) and what downstream problems might be caused by poor data quality.

As a consequence of these two reasons, data cleaning seems difficult to formulate into a readily teachable set of principles. As a result it is heavily neglected in courses on machine learning and data science. Despite data being half the equation, most university courses spend little to no time on its challenges.

However, these challenges aren't new, they are merely taking a different form.

Anecdotally, talking to data modelling scientists, most say they spend 80% of their time acquiring and cleaning data. The “software crisis” was the inability to deliver software solutions due to increasing complexity of implementation. There was no single shot solution for the software crisis, it involved better practice (scrum, test orientated development, sprints, code review), improved programming paradigms (object orientated, functional) and better tools (CVS, then SVN, then git).

The Software Crisis

The major cause of the software crisis is that the machines have become several orders of magnitude more powerful! To put it quite bluntly: as long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a mild problem, and now we have gigantic computers, programming has become an equally gigantic problem.

Edsger Dijkstra (1930-2002), The Humble Programmer

In the late sixties early software programmers made note of the increasing costs of software development and termed the challenges associated with it as the "Software Crisis". Edsger Dijkstra referred to the crisis in his 1972 Turing Award winner's address.

The Data Crisis

Modern software is driven by data. That is the essence of machine learning. We can therefore see the software crisis as merely the first wave of a (perhaps) larger phenomenon, "the data crisis".

We can update Dijkstra's quote for the modern era.

The major cause of the data crisis is that machines have become more interconnected than ever before. Data access is therefore cheap, but data quality is often poor. What we need is cheap high quality data. That implies that we develop processes for improving and verifying data quality that are efficient.

There would seem to be two ways for improving efficiency. Firstly, we should not duplicate work. Secondly, where possible we should automate work.

The quantity of modern data, and the lack of attention paid to data as it is initially "laid down" and the costs of data cleaning are bringing about a crisis in data-driven decision making. This crisis is at the core of the challenge of technical debt in machine learning [3].

Just as with software, the crisis is most correctly addressed by 'scaling' the manner in which we process our data. Duplication of work occurs because the value of data cleaning is not correctly recognised in management decision making processes. Automation of work is increasingly possible through techniques in "artificial intelligence", but this will also require better management of the data science pipeline so that data about data science (meta-data science) can be correctly assimilated and processed. The Alan Turing institute has a program focussed on this area, AI for Data Analytics.

Data is the new software, and the data crisis is already upon us. It is driven by the cost of cleaning data, the paucity of tools for monitoring and maintaining our deployments, and the difficulty of tracking the provenance of our models (e.g. with respect to the data they’re trained on).

Three principal changes need to occur in response. They are cultural and infrastructural.

The Data First Paradigm

First of all, to excel in data driven decision making we need to move from a software first paradigm to a data first paradigm. That means refocusing on data as the product. Software is the intermediary to producing the data, and its quality standards must be maintained, but not at the expense of the data we are producing. Data cleaning and maintenance need to be prized as highly as software debugging and maintenance. Instead of software as a service, we should refocus around data as a service. This first change is a cultural change in which our teams think about their outputs in terms of data. Instead of decomposing our systems around the software components, we need to decompose them around the data generating and consuming components.2 Software first is only an intermediate step on the way to becoming data first. It is a necessary, but not a sufficient condition for efficient machine learning systems design and deployment. We must move from software orientated architecture to a data orientated architecture.

Data Quality

Secondly, we need to improve our language around data quality. We cannot assess the costs of improving data quality unless we generate a language around what data quality means. Data Readiness Levels3 are an assessment of data quality that is based on the usage to which data is put.
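As a hypothetical illustration of what a shared language of data quality might look like in code, here is a crude banding function. The band definitions below are my own simplification for illustration, not the published Data Readiness Levels framework.

```python
def readiness_band(records, required_fields):
    """Assign a crude readiness band to a dataset.

    Band C: the data is not yet accessible in machine-readable form.
    Band B: machine-readable, but required fields are incomplete.
    Band A: complete for the fields the task requires.
    (These bands are a hypothetical simplification for illustration.)
    """
    if records is None:
        return "C"
    complete = all(
        all(rec.get(field) is not None for field in required_fields)
        for rec in records
    )
    return "A" if complete else "B"

data = [
    {"id": 1, "age": 34, "outcome": "recovered"},
    {"id": 2, "age": None, "outcome": "recovered"},
]
print(readiness_band(data, ["id", "age", "outcome"]))  # "B": age missing
```

The point is not the code itself but that the band is defined relative to the usage: the same records would be Band A for a task that only required `id` and `outcome`.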

Move Beyond Software Engineering to Data Engineering

Thirdly, we need to improve our mental model of the separation of data science from applied science. A common trap in current thinking around data is to see data science (and data engineering, data preparation) as a sub-set of the software engineer’s or applied scientist’s skill set. As a result we recruit and deploy the wrong type of resource. Data preparation and question formulation is superficially similar to both because of the need for programming skills, but the day to day problems faced are very different.

Recommendation: Build a shared understanding of the language of data readiness levels for use in planning documents, the costing of data cleaning, and the benefits of reusing cleaned data.

Combining Data and Systems Design

One analogy I find helpful for understanding the depth of change we need is the following. Imagine as a software engineer, you find a USB stick on the ground. And for some reason you know that on that USB stick is a particular API call that will enable you to make a significant positive difference on a business problem. However, you also know on that USB stick there is potentially malicious code. The most secure thing to do would be to not introduce this code into your production system. But what if your manager told you to do so, how would you go about incorporating this code base?

The answer is very carefully. You would have to engage in a process more akin to debugging than regular software engineering. As you understood the code base, for your work to be reproducible, you should be documenting it, not just what you discovered, but how you discovered it. In the end, you typically find a single API call that is the one that most benefits your system. But more thought has been placed into this line of code than any line of code you have written before.

Even then, when your API code is introduced into your production system, it needs to be deployed in an environment that monitors it. We cannot rely on an individual’s decision making to ensure the quality of all our systems. We need to create an environment that includes quality controls, checks and bounds, tests, all designed to ensure that assumptions made about this foreign code base are remaining valid.

This situation is akin to what we are doing when we incorporate data in our production systems. When we are consuming data from others, we cannot assume that it has been produced in alignment with our goals for our own systems. Worst case, it may have been adversarially produced. A further challenge is that data is dynamic. So, in effect, the code on the USB stick is evolving over time.

Anecdotally, resolving a machine learning challenge requires 80% of the resource to be focused on the data and perhaps 20% to be focused on the model. But many companies are too keen to employ machine learning engineers who focus on the models, not the data.

A reservoir of data has more value if the data is consumable. The data crisis can only be addressed if we focus on outputs rather than inputs.

For a data first architecture we need to clean our data at source, rather than individually cleaning data for each task. This involves a shift of focus from our inputs to our outputs. We should provide data streams that are consumable by many teams without purification.

Recommendation: We need to share best practice around data deployment across our teams. We should make best use of our processes where applicable, but we need to develop them to become data first organizations. Data needs to be cleaned at output not at input.

Deployment

Continuous Deployment

Once the decomposition is understood, the data is sourced and the models are created, the model code needs to be deployed.

To extend our USB stick analogy further, how would we deploy that code if we thought it was likely to evolve in production? This is what data does. We cannot assume that the conditions under which we trained our model will be retained as we move forward; indeed, the only constant we have is change.

This means that when any data dependent model is deployed into production, it requires continuous monitoring to ensure the assumptions of design have not been invalidated. Software changes are qualified through testing; in particular, a regression test ensures that existing functionality is not broken by change. Since data is continually evolving, machine learning systems require 'continual regression testing': oversight by systems that ensure their existing functionality has not been broken as the world evolves around them. We refer to this as progression testing. Unfortunately, standards around ML model deployment have not yet been developed. The modern world of continuous deployment does rely on testing, but it does not recognize the continuous evolution of the world around us.
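One simple form such oversight could take is a monitor that compares a summary statistic of incoming data against the range seen at training time and flags when the world has drifted. The statistic and threshold below are illustrative assumptions, not a prescribed standard.

```python
import statistics

class DriftMonitor:
    """Flag when live data drifts away from the training distribution."""

    def __init__(self, train_values, tolerance=3.0):
        self.mean = statistics.fmean(train_values)
        self.std = statistics.stdev(train_values)
        self.tolerance = tolerance  # allowed distance in standard deviations

    def check(self, batch):
        """Return True if the batch mean is within tolerance of training."""
        z = abs(statistics.fmean(batch) - self.mean) / self.std
        return z <= self.tolerance

# Summary statistics captured at training time (hypothetical sensor data).
monitor = DriftMonitor([10.0, 11.0, 9.0, 10.5, 9.5])
print(monitor.check([10.2, 9.8, 10.4]))   # True: world still matches training
print(monitor.check([25.0, 26.0, 24.0]))  # False: design assumptions invalidated
```

A failing check would not silently degrade predictions; it would trigger retraining or escalation, which is the distinction between a one-off regression test and continual progression testing.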

Recommendation: We establish best practice around model deployment. We need to shift our culture from standing up a software service, to standing up data as a service. Data as a Service would involve continual monitoring of our deployed models in production. This would be regulated by 'hypervisor' systems4 that understand the context in which models are deployed and recognize when circumstances have changed and models need retraining or restructuring.

Recommendation: We should consider a major re-architecting of systems around our services. In particular we should scope the use of a streaming architecture (such as Apache Kafka) that ensures data persistence and enables asynchronous operation of our systems.5 This would enable the provision of QC streams and real-time dashboards, as well as hypervisors.

Importantly, a streaming architecture implies that the services we build are stateless: internal state is deployed on streams alongside external state. This allows for rapid assessment of other services' data.
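A toy sketch of this stateless, stream-based pattern, with a simple in-memory append-only log standing in for something like a Kafka topic (a real deployment would of course use a durable broker):

```python
from collections import defaultdict

class Log:
    """An append-only, replayable log: a toy stand-in for a stream topic."""

    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)

    def replay(self):
        return iter(self.events)

def qc_stream(log):
    """A stateless QC service: its working state is derived entirely by
    replaying the stream, so the service can be restarted at any time."""
    counts = defaultdict(int)
    for event in log.replay():
        key = "valid" if event.get("value") is not None else "missing"
        counts[key] += 1
    return dict(counts)

log = Log()
log.append({"sensor": "a", "value": 3.1})
log.append({"sensor": "b", "value": None})
log.append({"sensor": "a", "value": 2.9})
print(qc_stream(log))  # {'valid': 2, 'missing': 1}
```

Because `qc_stream` holds no state between calls, a dashboard, a hypervisor and an anomaly detector can all consume the same log independently and asynchronously.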

Outlook for Machine Learning

Machine learning has risen to prominence as an approach to scaling our activities. For us to continue to automate in the manner we have over the last two decades, we need to make more use of computer-based automation. Machine learning is allowing us to automate processes that were out of reach before.

Conclusion

We operate in a technologically evolving environment. Machine learning is becoming a key component in our decision making capabilities. But the evolving nature of data driven systems means that new approaches to model deployment are necessary. We have characterized three parts of the machine learning systems design process: decomposition of the problem into separate tasks that are addressable with a machine learning solution; collection and curation of appropriate data, with verification of data quality through data readiness levels; and deployment, using progression testing and continuously updating models as appropriate to ensure performance and quality are maintained.

References

[1] M. Feurer, A. Klein, K. Eggensperger, J. T. Springenberg, M. Blum, and F. Hutter, “Efficient and robust automated machine learning,” in Advances in neural information processing systems 28, 2015.

[2] K. R. Popper, Conjectures and refutations: The growth of scientific knowledge. London: Routledge, 1963.

[3] D. Sculley et al., “Hidden technical debt in machine learning systems,” in Advances in neural information processing systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 2503–2511.

[4] N. D. Lawrence, “Data readiness levels,” arXiv, May 2017.


  1. In earlier versions of the Three D process I've referred to this as the design stage, but decomposition feels more appropriate for what the stage involves and that preserves the word design for the overall process of machine learning systems design.

  2. This is related to challenges of machine learning and technical debt [3], although we are trying to frame the solution here rather than the problem.

  3. Data Readiness Levels [4] are an attempt to develop a language around data quality that can bridge the gap between technical solutions and decision makers such as managers and project planners. The are inspired by Technology Readiness Levels which attempt to quantify the readiness of technologies for deployment.

  4. Emulation, or surrogate modelling, is one very promising approach to forming such a hypervisor. Emulators are models we fit to other models, often simulations, but they could also be other machine learning models. These models operate at the meta-level, not on the systems directly. This means they can be used to model how the sub-systems interact. As well as emulators we should consider real-time dashboards, anomaly detection, multivariate analysis, data visualization and classical statistical approaches for hypervision of our deployed systems.

  5. These approaches are one area of focus for my own team's research. A data first architecture is a prerequisite for efficient deployment of machine learning systems.

Neil D. Lawrence (http://inverseprobability.com), Amazon Cambridge and University of Sheffield

Natural and Artificial Intelligence (2018-02-06)
https://inverseprobability.com/2018/02/06/natural-and-artificial-intelligence

How are we making computers do the things we used to associate only with humans? Have we made a breakthrough in understanding human intelligence?

While recent achievements might give the sense that the answer is yes, the short answer is that we are nowhere near. All we’ve achieved for the moment is a breakthrough in emulating intelligence.

Science on Holborn Viaduct, cradling Watt's Governor

Artificial intelligence is a badly defined term. Successful deployments of intelligent systems are common, but normally they are redefined to be non-intelligent. My favourite example is Watt’s governor. Immortalised in the arms of the statue of “Science” on the Holborn viaduct in London, Watt’s governor automatically regulated the speed of a steam engine, closing the inlet valve progressively as the engine ran faster. It did the job that an intelligent operator used to have to do, but few today would describe it as “artificial intelligence”.

A more recent example comes from the middle of the last century. A hundred years ago computers were human beings, often female, who conducted repetitive mathematical tasks for the creation of mathematical tables such as logarithms. Our modern digital computers were originally called automatic computers to reflect the fact that the intelligence of these human operators had been automated. But despite the efficiency with which they perform these tasks, very few think of their mobile phones or computers as intelligent.

Norbert Wiener launched last century’s first wave of interest in emulation of intelligence with his book “Cybernetics”. The great modern success that stemmed from that work is the modern engineering discipline of Automatic Control: the technology that allows fighter jets to fly. These ideas came out of the Second World War, when researchers explored the use of radar (automated sensing) and automatic computation for decryption of military codes (automated decision making). Post war a body of researchers, including Alan Turing, were seeing the potential for electronic emulation of what had up until then been the preserve of the animal nervous system.

So what of the modern revolution? Is this any different? Are we finally encroaching on the quintessential nature of human intelligence? Or is there a last bastion of our minds that remains out of reach from this latest technical wave?

My answer has two parts. Firstly, current technology is a long way from emulating all aspects of human intelligence: there are a number of technological breakthroughs that remain before we crack the fundamental nature of human intelligence. Secondly, and perhaps more controversially, I believe that there are aspects of human intelligence that we will never be able to emulate, a preserve that remains uniquely ours.

Before we explore these ideas though, we first need to give some sense of current technology.

Recent breakthroughs in artificial intelligence are being driven by advances in machine learning. Or more precisely, a subdomain of that field known as deep learning. So what is deep learning? Well, machine learning algorithms proceed in the following way. They observe data, often from humans, and they attempt to emulate that data creation process with a mathematical function. Let’s explain that in a different way. Whichever activities we are interested in emulating, we acquire data about them. Data is simply a set of numbers representing the activity. It might include location, or numbers representing past behaviour. Once we have the behaviour represented in a set of numbers, then we can emulate it mathematically.

Different mathematical functions have different characteristics. In trigonometry we learnt about sine and cosine functions. Those are functions that have a period. They repeat over time. They are useful for emulating behaviour that repeats itself, perhaps behaviour that reflects day/night cycles. But these functions on their own are too simple to emulate human behaviour in all its richness. So how do we make things more complex? In practice we can add functions together, scale them up and scale them down. All these things are done. Deep learning refers to the practice of creating a new function by feeding one function into another. So we create the output of a function, feed it into a new function, and take the output of that as our answer. In maths this is known as composition of functions. The advantage is that we can generate much more complex functions from simpler functions. The resulting functions allow us to take an image as an input and produce an output that includes numbers representing what the objects or people are in that image. Doing this accurately is achieved through composition of mathematical functions, otherwise known as deep learning.
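The idea of composition can be sketched in a few lines of code. This is a minimal illustration, not a real deep learning system: the two "layers" here are hypothetical functions chosen only to show how feeding one function into another produces a richer whole.

```python
import math

# Two simple "layers": each maps an input number to an output number.
def layer_one(x):
    # a periodic component, like the sine functions from trigonometry,
    # added to a scaled copy of the input
    return math.sin(3.0 * x) + 0.5 * x

def layer_two(x):
    # a squashing component that scales its input up before compressing it
    return math.tanh(2.0 * x)

# Deep learning rests on composition: the output of one function
# becomes the input of the next.
def composed(x):
    return layer_two(layer_one(x))

# The composed function is more complex than either part alone.
print(composed(0.5))
```

Real deep networks compose many such functions, each with millions of adjustable parameters, but the structural idea is exactly this nesting.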

To describe this process, I sometimes use the following analogy. Think of a pinball machine with rows of pins. A plunger is used to fire a ball into the top of the machine. Think of each row of pins as a function. The input to the function is the position of the ball along the left-to-right axis of the machine at the top. The output of the function is the position of the ball as it leaves each row of pins. Composition of functions is simply feeding the output of one row of pins into the next. If we do this multiple times, then although the machine has multiple functions in it, by feeding one function into another, the entire machine is in itself representing a single function. Just a more complex function than any individual row of pins can represent. Our modern machine learning systems are just like this machine. Except they take very high dimensional inputs. The pinball is only represented by its left-to-right position, a one-dimensional function. A real learning machine will often represent multiple dimensions, like playing pinball in a hyperspace.

So the intelligence we’ve created is just a (complex) mathematical function. But how do we make this function match what humans do? We use more maths. In fact we use another mathematical function. To avoid confusion let’s call the function represented by our pinball machine the prediction function. In the pinball machine, we can move the pins around, and any configuration of pins leads to a different answer for our functions. So how do we find the correct configuration? For that we use a separate function known as the objective function.

The objective function measures the dissimilarity between the output of our prediction function, and data we have from the humans. The process of learning in these machines is the process of moving the pins around in the pinball machine to make the prediction more similar to the data. The quality of any given pin configuration is assessed by quantifying the differences between what the humans have done, and what the machine is predicting using the objective function. Different objective functions are used for different tasks, but objective functions are typically much simpler than prediction functions.
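The relationship between the prediction function, the objective function and the data can be made concrete with a toy example. This is a sketch only: the single parameter `w` plays the role of one movable pin, and the squared-error objective is an assumed (and very common) choice of dissimilarity measure.

```python
# Toy learning setup: a prediction function with one adjustable "pin"
# (the parameter w) and an objective measuring dissimilarity to data.
data = [(0.0, 0.1), (1.0, 2.1), (2.0, 3.9)]  # (input, human output) pairs

def prediction(w, x):
    return w * x

def objective(w):
    # squared-error objective: how far predictions sit from the data
    return sum((prediction(w, x) - y) ** 2 for x, y in data)

# Learning is "moving the pins": nudge w in whichever direction
# lowers the objective (here via its gradient).
w, step = 0.0, 0.01
for _ in range(200):
    grad = sum(2 * (prediction(w, x) - y) * x for x, y in data)
    w -= step * grad

print(round(w, 2))  # settles near 1.98, where the objective is smallest
```

Deep learning systems do the same thing, but with millions of pins and much more elaborate prediction functions.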

So, we have some background. And some understanding of the challenges for the designer of machine learning algorithms. You need data, you need a prediction function, and you need an objective function. Choice of prediction function and objective function are in the hands of the algorithm designer. Deep learning is a way of making very complex, and flexible, classes of prediction functions that deal well with the type of data we get from speech, written language and images.

But what if the designer gets the choice of prediction function wrong? Or the choice of the objective function wrong? Or what if they choose the right class of prediction function (such as a convolutional neural network, used for classifying images) but get the location of all those pins in the machine wrong?

This is a major challenge for our current generation of AI solutions. They are fragile. They are sensitive to incorrect choices and when they fail, they fail catastrophically.

To understand why, let’s take a step back from artificial intelligence, and consider natural intelligence. Or even more generally, let’s consider the contrast between an artificial system and a natural system. The key difference between the two is that artificial systems are designed whereas natural systems are evolved.

Systems design is a major component of all Engineering disciplines. The details differ, but there is a single common theme: achieve your objective with the minimal use of resources to do the job. That provides efficiency. The engineering designer imagines a solution that requires the minimal set of components to achieve the result. A water pump has one route through the pump. That minimises the number of components needed. Redundancy is introduced only in safety critical systems, such as aircraft control systems. Students of biology, however, will be aware that in nature, redundancy is everywhere. Redundancy leads to robustness. For an organism to survive in an evolving environment it must first be robust, then it can consider how to be efficient. Indeed, organisms that evolve to be too efficient at a particular task, like those that occupy a niche environment, are particularly vulnerable to extinction.

So it is with natural vs artificial intelligences. Any natural intelligence that was not robust to changes in its external environment would not survive, and therefore not reproduce. In contrast the artificial intelligences we produce are designed to be efficient at one specific task: control, computation, playing chess. They are fragile.

The first criterion of a natural intelligence is don’t fail, not because it has a will or intent of its own, but because if it had failed it wouldn’t have stood the test of time. It would no longer exist. In contrast, the mantra for artificial systems is to be more efficient. Our artificial systems are often given a single objective (in machine learning it is encoded in a mathematical function) and they aim to achieve that objective efficiently. These are different characteristics. Even if we wanted to incorporate don’t fail in some form, it is difficult to design for. To design for “don’t fail” you have to consider every way in which things can go wrong; if you miss one, you fail. These cases are sometimes called corner cases. But in a real, uncontrolled environment, almost everything is a corner. It is difficult to imagine everything that can happen. This is why most of our automated systems operate in controlled environments, for example in a factory, or on a set of rails. Deploying automated systems in an uncontrolled environment requires a different approach to systems design. One that accounts for uncertainty in the environment and is robust to unforeseen circumstances.

The systems we produce today only work well when their tasks are pigeonholed, bounded in some way. To achieve robust artificial intelligences we need new approaches to both the design of the individual components, and the combination of components within our AI systems. We need to deal with uncertainty and increase robustness. Today, it is easy to make a fool of an artificially intelligent agent; technology needs to address the challenge of the uncertain environment to achieve robust intelligences.

However, even if we find technological solutions for these challenges, it may be that the essence of human intelligence remains out of reach. It may be that the most quintessential element of our intelligence is defined by limitations. Limitations that computers have never experienced.

That characteristic is our limited ability to communicate, in particular our limited ability to communicate versus our ability to compute. Not compute in terms of working out arithmetic or solving sudoku problems, but the amount of compute that underpins our higher thoughts. The computation that our neurons are doing.

It is estimated that to simulate the human mind would require a supercomputer close to the world’s fastest. The UK Met Office’s computer, which is used for simulating weather and climate across the world, might do it. At the time of writing it is the 11th fastest computer in the world, much faster than a typical desktop. But while a typical computer seems to perform fewer operations than our brain does, it can communicate much faster than we can.

Claude Shannon developed the idea of information theory: the mathematics of information. He defined the amount of information we gain when we learn the result of a coin toss as a “bit” of information. A typical computer can communicate with another computer with a billion bits of information per second. Equivalent to a billion coin tosses per second. So how does this compare to us? Well, we can also estimate the amount of information in the English language. Shannon estimated that the average English word contains around 12 bits of information, twelve coin tosses. This means our verbal communication rates are only of the order of tens to hundreds of bits per second. Computers communicate tens of millions of times faster than us; in relative terms we are constrained to a bit of pocket money, while computers are corporate billionaires.
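The comparison can be checked with rough arithmetic. The speaking rate below is an assumption (around two words per second of conversational speech); only the 12-bits-per-word figure comes from Shannon's estimate above.

```python
# Rough arithmetic behind the bandwidth comparison.
bits_per_word = 12            # Shannon's estimate for English words
words_per_second = 2          # assumed conversational speaking rate
human_rate = bits_per_word * words_per_second   # bits per second

computer_rate = 1e9           # a billion bits per second

ratio = computer_rate / human_rate
print(f"human: {human_rate} bits/s, computer/human ratio: {ratio:.0e}")
# the ratio lands in the tens of millions, as the text says
```

Varying the assumed speaking rate moves the human figure between tens and hundreds of bits per second, but the gulf of seven or so orders of magnitude remains.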

Our intelligence is not an island, it interacts, it infers the goals or intent of others, it predicts our own actions and how we will respond to others. We are social animals, and together we form a communal intelligence that characterises our species. For intelligence to be communal, our ideas must be shared somehow. We need to overcome this bandwidth limitation. The ability to share and collaborate, despite such constrained ability to communicate, characterises us. We must intellectually commune with one another. We cannot communicate all of what we saw, or the details of how we are about to react. Instead, we need a shared understanding. One that allows us to infer each other’s intent through context and a common sense of humanity. This characteristic is so strong that we anthropomorphise any object with which we interact. We apply moods to our cars, our cats, our environment. We seed the weather, volcanoes, trees with intent. Our desire to communicate renders us intellectually animist.

But our limited bandwidth doesn’t constrain us in our imaginations. Our consciousness, our sense of self, allows us to play out different scenarios. To internally observe how our self interacts with others. To learn from an internal simulation of the wider world. Empathy allows us to understand others’ likely responses without having the full detail of their mental state. We can infer their perspective. Self-awareness also allows us to understand our own likely future responses, to look forward in time, play out a scenario. Our brains contain a sense of self and a sense of others. Because our communication cannot be complete it is both contextual and cultural. When driving a car in the UK a flash of the lights at a junction concedes the right of way and invites another road user to proceed, whereas in Italy, the same flash asserts the right of way and warns another road user to remain.

Our main intelligence is our social intelligence, intelligence that is dedicated to overcoming our bandwidth limitation. We are individually complex, but as a society we rely on shared behaviours and oversimplification of our selves to remain coherent.

This nugget of our intelligence seems impossible for a computer to recreate directly, because it is a consequence of our evolutionary history. The computer, on the other hand, was born into a world of data, of high bandwidth communication. It was not there through the genesis of our minds and the cognitive compromises we made are lost to time. To be a truly human intelligence you need to have shared that journey with us.

Of course, none of this prevents us emulating those aspects of human intelligence that we observe in humans. We can form those emulations based on data. But even if an artificial intelligence can emulate humans to a high degree of accuracy it is a different type of intelligence. It is not constrained in the way human intelligence is. You may ask does it matter? Well, it is certainly important to us in many domains that there’s a human pulling the strings. Even in pure commerce it matters: the narrative story behind a product is often as important as the product itself. Handmade goods attract a price premium over factory made. Or alternatively in entertainment: people pay more to go to a live concert than for streaming music over the internet. People will also pay more to go to see a play in the theatre rather than a movie in the cinema.

In many respects I object to the use of the term Artificial Intelligence. It is poorly defined and means different things to different people. But there is one way in which the term is very accurate. The term artificial is appropriate in the same way we can describe a plastic plant as an artificial plant. It is often difficult to pick out from afar whether a plant is artificial or not. A plastic plant can fulfil many of the functions of a natural plant, and plastic plants are more convenient. But they can never replace natural plants.

In the same way, our natural intelligence is an evolved thing of beauty, a consequence of our limitations. Limitations which don’t apply to artificial intelligences and can only be emulated through artificial means. Our natural intelligence, just like our natural landscapes, should be treasured and can never be fully replaced.

]]>
Neil D. Lawrence
Decision Making and Diversity2017-11-15T00:00:00+00:002017-11-15T00:00:00+00:00https://inverseprobability.com/2017/11/15/decision-makingI’m planning to put this into arXiv, but thought I’d share via a blog post first in case there’s any feedback.

TL;DR Variation is important in decision making.

Introduction

It should be remembered that just as the Declaration of Independence promises the pursuit of happiness rather than happiness itself, so the iterative scientific model building process offers only the pursuit of the perfect model. For even when we feel we have carried the model building process to a conclusion some new initiative may make further improvement possible. Fortunately to be useful a model does not have to be perfect.

George E. Box (1979)

In the book “Justice: What’s The Right Thing to Do?” (Sandel 2010) Michael Sandel aims to help us answer questions about how to do the right thing by giving some context and background in moral philosophy. Sandel is a philosopher based at Harvard University who is renowned for his popular treatments of the subject. He starts by illustrating decision making through the ‘trolley’ problem.

Trolley problems seem to be a mainstay of moral philosophy: in the original variant (Foot 1967) there is a runaway trolley1 rolling at speed down a track, it is approaching a set of points beyond which is a group of five workers. The workers do not have time to get out of the way of the trolley. They will be killed. You have the opportunity to switch the track, saving the five workers, but the trolley will run onto another track, killing a single worker. You could shout a warning, but somehow the workers wouldn’t hear you, or if they did hear they wouldn’t have time to get out of the way. The moral question is should you switch the track? You would kill one worker, but save five. Apparently most of us think that in that situation we’d pull the switch.2

Sandel starts his book by looking at utilitarianism. Utilitarianism says that we should make decisions that lead to the greatest benefit to humanity. Under utility theory, Sandel explains, we pull the lever because we expect fewer people will be unhappy when one person dies than when five people die. Of course we can debate that, and we will come back to that in a moment.

Utility Theory and Utilitarianism

In moral philosophy utilitarianism is the idea that when considering a choice of actions we should choose the one that brings the most benefit to the population.

Utilitarianism is a philosophical variation of utility theory. It is due to Jeremy Bentham and John Stuart Mill. For their time the ideas in utilitarianism are very advanced. But it needs to be borne in mind that their era was far more limited mathematically and scientifically than ours. Jeremy Bentham predates Laplace and was born only 15 years after Newton. In Jeremy Bentham’s time differential calculus was only taught in advanced degrees in the most sophisticated universities. Probability was only just beginning to be understood.

Utility Theory

Utilitarianism is an example of a philosophy that suggests that the result of a decision can be evaluated mathematically. By quantifying the benefits of the decision and weighing them against the downsides, the idea is that decision making can be rendered mathematical. Such mathematical functions are known as utility functions. The basic principle is that we can encode the good and the bad in a mathematical formula.

Since Bentham and Stuart Mill, the idea of framing our actions by formulating them mathematically has become inextricably interlinked with the domain of mathematical modelling. One way we have of quantifying the value of a mathematical model is its ‘goodness of fit’. In particular, we validate our models by evaluating how well the model does at predicting a quantity of interest. Sometimes that prediction can be associated with a direct monetary gain (such as in a stock market) or sometimes that prediction is associated with an improvement in the quality or length of life (such as in the diagnosis of disease).

Sensitivity and Specificity

Consider a model that tries to predict, on the basis of a test, whether an individual has cancer or not. Most, if not all, clinical tests are unreliable, i.e. a positive test does not always mean that the disease is present. The proportion of diseased patients who test positive is known as the sensitivity of the test. It is a count of those people that are correctly diagnosed with the disease divided by the number who really have the disease. A test with 50% sensitivity will diagnose 50% of the people who really have the disease with the disease.

You might think it is a good thing to have a highly sensitive test, and you’d be right. But actually it is easy to increase the sensitivity of a test. Imagine a lazy lab technician, who rather than carrying out the test, just answers ‘disease present’ for every sample sent. The test will now have 100% sensitivity! Unfortunately all those who don’t have disease will also be categorised as having the disease. The technician has rendered the test non-specific. To account for a test that is weak in this way we also measure the quality of a test by its specificity: the number of patients who don’t have the disease that are correctly identified. For the lazy technician the specificity is 0%.
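These two definitions, and the lazy technician's degenerate test, can be written down directly. The counts below (90 of 100 diseased caught, 80 of 100 healthy cleared) are illustrative assumptions for an ordinary test.

```python
# Sensitivity and specificity from test outcome counts, as defined above.
def sensitivity(true_positives, false_negatives):
    # proportion of diseased patients the test correctly flags
    return true_positives / (true_positives + false_negatives)

def specificity(true_negatives, false_positives):
    # proportion of healthy patients the test correctly clears
    return true_negatives / (true_negatives + false_positives)

# An ordinary test: 90 of 100 diseased caught, 80 of 100 healthy cleared.
print(sensitivity(90, 10), specificity(80, 20))   # 0.9 0.8

# The lazy technician answers "disease present" for every sample:
# every diseased patient is "caught" ...
print(sensitivity(100, 0))    # 1.0 -- 100% sensitivity
# ... but no healthy patient is ever cleared.
print(specificity(0, 100))    # 0.0 -- 0% specificity
```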

For clinical tests there is a trade off between sensitivity and specificity. A test that is highly sensitive (captures all the diseased population) will often be non-specific (it will incorrectly diagnose a large portion of the non-diseased population).

When considering which tests to use we necessarily have to decide what the effect of wrong decisions is on individual people. If a test isn’t very sensitive, then patients will be given the all clear when they actually have a disease. But if a test is highly sensitive but non-specific, many patients will be faced with unnecessary worry that they have the disease when none is present. The number of people involved will also depend on the base rates of the disease in the population. There will also be monetary costs associated with processing the tests and following up those who have a positive result.

We want to know the utility of the test, and utility theory is one approach we have to assessing that. If we decide that a particular sensitivity and specificity is acceptable for a test, and if we make judgments about the number of worrying and unnecessary trips to the doctor that patients must endure, then we are placing value on these aspects of life.

To make a single decision we must weigh up these different aspects against one another. We can then decide which outcomes we value the most. Even if we don’t write it all down explicitly, by our actions we can see that we are making decisions that define a utility function: a mathematical function that weights how we value the competing factors. Utility functions are so common that they receive many different names: objective function, cost function, error function, fitness function, risk function.3 The motivation for any given utility function can be different but the end result is the same, a mathematical formula by which we can quantify the relative value of a set of predictions or decisions.
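One way to see what such a utility (here framed as a cost) function does is to write a toy version. Every number in this sketch is an illustrative assumption: the relative costs of a missed disease and a false alarm, and the 1% base rate, are invented for the example.

```python
# A toy cost function weighing the competing outcomes of a clinical test.
# All costs and the base rate are illustrative assumptions.
def expected_cost(sens, spec, base_rate,
                  cost_missed=100.0, cost_false_alarm=5.0):
    # missed disease: a diseased patient receives a negative test
    missed = base_rate * (1 - sens) * cost_missed
    # unnecessary worry: a healthy patient receives a positive test
    false_alarm = (1 - base_rate) * (1 - spec) * cost_false_alarm
    return missed + false_alarm

# Compare two tests on a disease affecting 1% of the population.
print(expected_cost(0.99, 0.80, 0.01))  # sensitive but non-specific
print(expected_cost(0.80, 0.99, 0.01))  # specific but less sensitive
```

Choosing the cost values is exactly the act the text describes: by writing them down we make explicit the price we place on worry, on missed disease, and implicitly on health and life itself.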

The more we desire accountability for our decisions, the more that we require that they are rationalised, the greater our tendency to be explicit about the mathematical form of our utility function. The more explicit we become the easier it is to see how much value we are associating with aspects of our lives that we might instinctively feel cannot be priced. Aspects that cannot be bought or sold: our health or our life itself.

How can we reconcile this drive for accountability with our natural instinct that we should not be placing a price on such things?

The first thing we have to do is acknowledge that the application of ideas about utility is much more sophisticated than is sometimes acknowledged in popular treatments of the subject. In Sandel’s book he criticises utilitarianism by framing it in the context of the historical arguments given in a specific era. But he doesn’t place those writings in context. The utility functions of Jeremy Bentham and John Stuart Mill provide a nascent theory, but there are a number of ways in which those ideas can be updated given the knowledge we have gained in the intervening 250 years.

The Push and the Trolley

In his book, Sandel follows up his initial trolley example with a more complex one.4 This time you are on a bridge. There is a rail line going under the bridge and there are workers on the line. This time there is a trolley going towards the five workers on the far side of the bridge. The workers will be killed by the runaway trolley. There is a man on the bridge with you, but there is nothing else to hand. As before you could shout, but the workers can’t get out of the way in time.

Apparently your only opportunity is to push the other man off the bridge onto the rails, thereby deflecting the trolley and saving the five men working on the line. You could jump onto the line yourself to deflect the trolley, but the man on the bridge is heavier than you, and you would be too light to deflect the trolley. Your only chance is to push the other man, the ‘fat man’, off.

It seems most people would choose not to push the fat man off. Sandel finds this difficult to explain from the perspective of utilitarianism and becomes dismissive of utilitarianism as a result.

We may be doing Sandel an injustice but it seems there is a significant oversimplification in this scenario. In an attempt to construct a counter example to illustrate a point, there is a key aspect to the second example, which is not so present in the first. It’s an aspect that humans are faced with every day, but we are not entirely sure how we deal with it. It is a domain where humans can outperform machines, that aspect is uncertainty.

In an effort to define a situation that primes us for decision making the trolley scenarios tell stories. The story is far more complex in the second example than the first. We are unable to help but imagine the scenario. We might picture that the bridge is a viaduct. We might picture that it is made of sandstone. We might imagine that the fat man is wearing a blue shirt. We can picture what it means to push him off. One of Sandel’s arguments is that the difference between the two scenarios is that people are unable to kill a man through direct interaction, and this aspect isn’t explained by utilitarianism. That is certainly a plausible explanation, but we don’t have to leave utilitarianism behind so quickly.

Evolution and Utilitarianism

Bentham was not aware of Darwin’s principle of natural selection, he predates Darwin. Bentham believed that we should take an action if it maximised the happiness of the population (thereby minimising pain). So his utility function was the sum of the happiness of the population. For its time this is an interesting concept, but Bentham’s disciple, John Stuart Mill, struggled with the idea that a night of debauchery was the same form of happiness as, for example, staying in and reading a good book.5 He argued that these different happinesses should not be valued the same. This is a clear flaw in the idea of a single utility function which Bentham had proposed. However, we should probably be allowed to update Bentham’s ideas a little in the light of what we’ve discovered since.

Natural Selection

Darwin’s idea was that species evolve through natural selection. Natural selection is a relatively simple concept, but it has some complex consequences. Natural selection just suggests that successful strategies will prevail. This is a somewhat self-verifying statement. The measure of success is that the strategy did prevail. However, there are some complex consequences. For a species to prevail it needs to survive. Natural selection forces us to think about why our behaviour might be helpful in determining our survival. Bentham’s idea was that we should maximize happiness. But given knowledge of Darwin, it may be that we’d prefer to identify happiness as an intermediate reward. Natural selection implies that one of our longer term goals is the survival of our species, with intermediate implications for ourselves, our societies and our ways of life.

Most of us wouldn’t accept that our happiness at any precise moment is an absolute assessment of where we feel we are in life. In fact, we often find our state of happiness much more sensitive to changes in our circumstances than any particular absolute measurement we can make. If our circumstances are good in the absolute sense, but not improving, then we may have to consciously remind ourselves of how lucky we are to gain pleasure from our position.

The children’s book “A Squash and a Squeeze” (Donaldson and Scheffler 2004) illustrates this idea nicely with a rendering of an old folk tale. An old lady lives in a house which she finds too small. She asks the advice of a “wise old man” who suggests she adds a chicken, a pig, a goat and a cow to her living quarters. The lady does as instructed, finding her accommodation increasingly cramped as she does so. Finally the man tells the lady to let all the animals out again. The lady is now happy, finding her house to be far more spacious without the animals. She is happy at that moment, despite her absolute circumstances not having changed since she first went to the wise old man for advice. This folk tale may not provide a practical solution to the housing crisis: its moral is a caricature of our sensibilities: our happiness is more sensitive to our changes of circumstance than our absolute positioning. From an evolutionary perspective this makes sense. If as organisms we became satiated at a particular stage of achievement then our species or societies could become complacent.

The mathematical foundations of the study of change are given by Newton and Leibniz’s work on differential calculus. In Bentham’s time these ideas were taught only in the most advanced University courses. The argument above suggests that actually happiness is some (monotonic) function of the gradient of whatever personal utility function we have. Not the absolute value. This idea also may deal nicely with the issue of different types of happiness. A night of debauchery may make us instantaneously very happy, implying a high rate of change. But it is over quickly, implying that the absolute change in our circumstances is small (the absolute improvement in the utility would be the level of happiness multiplied by the time we were happy for). Of course this improvement may be offset by whatever the consequences of the debauchery are the next day. John Stuart Mill’s variation on utilitarianism considered ‘higher pleasures’, such as the pleasure gained from literature and learning. See also Eleni Vasilaki’s perspective on Epicurus (Vasilaki 2017). Instantaneously we may experience less happiness when engaged in these activities compared to whatever our night of debauchery involved. However, these activities can be sustained for a very long period and allow us to achieve more. The absolute improvement in our circumstance would be given by the period of study multiplied by the pleasure given.

In the terminology of differential calculus, our happiness must be integrated to form our utility. Not instantaneously measured. It may be that Stuart Mill’s differentiating of happiness could have been reconciled with utility theory by realising he was actually distinguishing between sustainable forms of happiness and non-sustainable. Unfortunately we can’t revisit him to ask, but we can at least do him the justice of giving him the benefit of the doubt.
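The "happiness multiplied by duration" argument above amounts to a simple integral, which can be sketched numerically. Every rate, duration and cost below is an illustrative assumption invented for the comparison.

```python
# Utility as integrated happiness rather than instantaneous happiness.
# All rates, durations and costs here are illustrative assumptions.
def utility(happiness_rate, duration):
    # constant happiness rate integrated over time: rate x duration
    return happiness_rate * duration

# A night of debauchery: intense but brief, with a cost the next day.
debauchery = utility(happiness_rate=10.0, duration=1.0) - 5.0  # hangover

# Sustained study and reading: milder, but maintained far longer.
study = utility(happiness_rate=2.0, duration=30.0)

print(debauchery, study)  # the milder, sustained pleasure integrates to more
```

On this reading, Mill's 'higher pleasures' win not because their instantaneous happiness is greater, but because they can be integrated over much longer periods.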

Daniel Kahneman’s Nobel Memorial Prize in Economics was awarded for the idea of prospect theory. Kahneman describes the theory and its background in his book, “Thinking Fast and Slow” (Kahneman 2011). Prospect theory is not a mathematical theory, but a theory in behavioral economics based on empirical observations of human behavior. Empirical observation about how we value different alternatives that involve risk. A key observation of Kahneman’s is in line with our analysis above: people are responsive to change in circumstance, not absolute circumstance. Prospect theory goes on to identify asymmetries in our sensitivities. A negative change in circumstance weights upon us greater than the equivalent positive change in circumstance.

Subjective Utility

Bentham’s ideas focussed on a global utility, the maximisation of happiness across the population. Darwin’s principle of natural selection actually insists that there must be variation in the population, and therefore variation in our perception of our circumstances.

Natural selection relies on variation, because if there is no variation, then there can be no separation between effective and ineffective strategies. Different strategies arise from different value systems. If all organisms were to pursue the same strategy, then when the circumstantial judgment of the selection process fell upon the species, it would fall upon all members simultaneously: they would die or survive together.

A Cognitive Bias towards Variance

One of the themes that Kahneman explores is the tendency of humans to produce, through their System 2 thought processes (their ‘slow thinking brains’), overcomplicated explanations of observed data. There’s a tendency for people to focus on a detailed narrative as if it was pre-determined and within the control of all participants. In practice we cannot control events in such a regulated manner.

To predict we need data, a model and computation. Data is the information we are given, the model is our belief about the way the world works, and computation is required to assimilate the two. This is true for humans and computers. Ignoring the quality of the data for a moment, and focussing on the model, our predictive system can fail in one of two ways: it can either oversimplify or it can overcomplicate.

This binary choice may seem obvious, but it has some significant consequences. The phenomenon was studied in machine learning by Geman et al. (Geman, Bienenstock, and Doursat 1992), who referred to it as the ‘bias-variance dilemma’. They decomposed errors into those due to oversimplification (the bias error) and those due to insufficient data to underpin a complex model (the variance error).

Bias errors arise when your model is not rich enough to capture all the nuances of the world around us: the rich underlying phenomena behind an observation are ignored and a simpler explanation is given. An example of a bias error would be one that arises from the simple model “home teams always win sports games”. There is some truth to the home advantage: this model will do better than 50/50 guessing. But it is an oversimplification. It is biased. However, because we have a lot of data about sports games, two experts using this rule to predict outcomes would make consistent predictions.

An error due to variance is one which occurs when we go too far the other way. There are a myriad of factors that could affect the outcome of a sports event: the weather, balloons on the pitch, the mental and physical fitness of each of the players, the quality of the pitch. If we take them all into account we might hope for better predictions. But in reality, we haven’t seen enough data to determine how each of these factors affects the outcome. Badly determined parameters lead to high variance error. Two experts see slightly different data and weight these complex factors according to their perspective. As a result their predictions can vary, leading to variance error.

The important point is that these errors are fundamentally different in their characteristics. The error due to bias, the simplification error, comes about by not taking all the factors into account. In statistical models bias errors are very common: indeed they are often preferred because they are associated with simpler models whose parameters often have some explanatory power.

The type of error Kahneman is describing in human explanations would be termed an error due to variance. A variance error is different from a bias error. In a variance error you may have a model that is sufficient to describe the underlying system, but you don’t have enough data or information to pin down exactly how your observed outcome came to be. A characteristic of error due to variance is that different observers may have highly rich, but conflicting, explanations of what brought about the phenomenon. The sort of conflicting explanations that bring about lively debate in television studios during half-time breaks in football matches.

The bias-variance dilemma is a major challenge in machine learning. One widely accepted solution is to choose a model which exhibits larger bias, because even though it is known to be incorrect (too simple) for the data we have available, it will make better predictions. Many of Kahneman’s mechanisms and solutions for human irrationality actually do introduce simple statistical models to improve the quality of prediction. Kahneman relates how decision making can be rendered more consistent and higher quality in this manner.

An alternative and widely used solution in machine learning is to develop large families of complex models that exhibit variance, just as individual humans do. However, once these models are trained they are not relied on individually but are combined in ‘ensembles’ to predict together. They vote on the solution or their average prediction is taken. This idea is very similar to the ‘wisdom of the crowds’. Seen from this context there are very good reasons why a ‘population’ of intelligent beings should exhibit variance-error instead of bias-error. A characteristic of bias-error would be that we would all be consistent in our predictions. This is good for accounting for behaviour, but it is a serious problem if we are all consistently wrong. Variance-error implies that we all come up with different, over-complicated, reasons why events transpire as they did. Taken as a population we cover a wide range of alternatives. As a result we act in different ways, and make different decisions. It is clear that we will never be all correct, but when it comes to evolution, the important thing is that we are never all wrong.

Decision Making and Bias-Variance

A further advantage of choosing variance-error over bias-error for a population is that a consistent and robust prediction can always be achieved by averaging outcomes. In machine learning, approaches such as bagging and boosting (Breiman 1996) can be used to reduce variance-error in a population of models.6 Advocates of the “Wisdom of the Crowds” propose the same principle. By preferring bias-error in our population we have no recourse, but models that exhibit variance-error can always be combined to create a more stable prediction from the population as a whole: democratic decision making is one way to achieve this.
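A toy sketch of the bagging idea (not Breiman’s full algorithm; the data-generating process and model family are invented for illustration): each member of the ensemble is a deliberately over-flexible model fitted to a bootstrap resample of the data, and the ensemble predicts by averaging its members.

```python
import numpy as np

rng = np.random.default_rng(1)
true_f = lambda x: np.sin(2 * np.pi * x)

# One shared dataset of 30 noisy observations.
x = rng.uniform(0, 1, 30)
y = true_f(x) + rng.normal(0, 0.3, 30)
x_test = 0.25   # prediction point; the true value is 1.0

# Each member sees a bootstrap resample (sampling with replacement) and fits
# an over-flexible degree-9 polynomial: individually high-variance predictors.
member_preds = []
for _ in range(100):
    idx = rng.integers(0, len(x), len(x))
    coeffs = np.polyfit(x[idx], y[idx], 9)
    member_preds.append(np.polyval(coeffs, x_test))

# The members disagree, but their average is a single stable prediction.
bagged_prediction = np.mean(member_preds)
print("spread of members:", np.std(member_preds))
print("bagged prediction:", bagged_prediction)
```

The individual members are the over-complicated pundits; the average is the crowd. No member needs to be right, only the population as a whole.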

So the ‘rational’ behaviour of a population under natural selection is to sustain a variety of approaches to life. It follows, then, that if there is to be natural selection within our species, our ideas of achievement should vary. Our individual utility should be subjective. What brings me happiness may not bring you happiness. We probably have different ideas of debauchery, literature and learning. While we can disagree with each other’s tastes, natural selection tells us that our species is more robust if there is a diversity of approaches. Darwin’s principle tells us we should be like this.

When we see half-time football pundits debating their convoluted explanations of the way the match is evolving, we should remember that their arguments are all important and each one may have some validity. They are the result of overly complex models being applied to little data. The game of football is fundamentally stochastic, but the analysts treat it as deterministic.

This phenomenon is not new, and it is not constrained to football punditry. In 1954 the psychologist Paul Meehl wrote a book about how clinical experts can be outperformed by simple statistical models (Meehl 1954). Meehl suggested the experts err because they ‘try to be clever and think outside the box’. Kahneman addresses this challenge in Chapter 21 of his book.

complexity may work in the odd case, but more often than not it reduces validity

going on to say

humans are incorrigibly inconsistent in making summary judgments of complex information. When asked to evaluate the same information twice, they frequently give different answers.

and

Unreliable judgements cannot be predictors of anything.

The two approaches Kahneman proposes for dealing with this in human society are:

  1. Replace human punditry with simple statistical models. This also makes sense from a statistical point of view: if we desire consistency, we should replace the models that exhibit variance-error (the humans) with models that exhibit bias-error (simple statistical formulae).

  2. Exploit wisdom of the crowds. Wisdom of the crowds is a proposal that human opinion should be aggregated to improve predictions. This is consistent with the idea that humans tend to make variance-errors.

Punditry is widespread, and the bias-variance analysis shows us that there are good reasons why, under natural selection, we should value such diversity of opinions. What is also important is that we should develop mechanisms in society for these opinions to be properly represented when making a decision.

This error is dangerous: it is one of the intellectual failings of those who sought to put ideas from eugenics into political practice. Early philosophies based on natural selection were overly focussed on the average value of the ‘fitness’ of a population. Trying to increase this value whilst simultaneously reducing variation is a very dangerous game. It is artificial selection. It assumes that you have preordained what the future natural circumstances are going to be. It may be OK for racehorses, greyhounds, crops, sheep and cows, because in those circumstances we are aiming to control their environment. It is not OK for the human race.

We are right to express moral outrage at what the negative eugenicists tried to achieve. But it was also motivated by a flawed understanding of science. Their model was wrong. Populations that are capable of excelling in a particular environment, because they are highly tuned to it, are rapidly extinguished when circumstances change. If the animals in a species become too specialised then they may not be able to respond to changing circumstances. Think of cheetahs and eagles versus rats and pigeons.

Socially the same principle should hold. I may not agree with many people’s subjective approach to life, I may even believe it to be severely sub-optimal. But I should not presume to know better, even if prior experience shows that my own ‘way of being’ is effective. Variation is vitally important for robustness. There may be future circumstances where my approaches fail utterly, and other ways of being are better.

A Universal Utility

The quality of our subjective utilities at any given time is measured by their effectiveness in the world. Survival of the species indicates that it is the sustenance of the entire human species that should concern us in the long run, although there will be many intermediate effects that we will be looking to achieve in the medium term. Indeed, we may even question whether there are circumstances under which we would not wish the human species to survive.7 The universal utility by which we are judged is therefore difficult to define. We can pin down aspects of it, but perhaps the best we can do is seek compromise between our individual utilities, while maintaining awareness that there may be outside forces, for example climate change, that will have such a detrimental effect on all our lives that it is worth investing significant time and effort as a society to try and reduce our exposure.

Let’s get back to the trolleys.

The Real Ethical Dilemma

The trolley problem is an oversimplification, and one which is not useful in characterizing the moral dilemmas we are faced with in the modern era of computer decision making. Instead of the trolley problem, let’s propose a new dilemma, and let’s focus on driverless cars.

Most arguments for driverless cars focus on the overall reduction in human death rates we expect to result from their adoption. For example, if we introduce driverless cars and bring about a 90% reduction in deaths on the road, then that surely is a good thing. But what if the remaining 10% of deaths are focussed on a particular section of the population? For example, what if we find that the only people those cars continue to kill are cyclists?

Now there are ethical and moral questions about what we have developed. Even if we have reduced the total number of cyclist deaths,8 is it fair to disproportionately affect one section of the population? A simplistic utilitarian perspective would say yes, please proceed, although we individually might be uncomfortable with this. Let’s explore further.

Utilitarianism appears to be telling us to favour a set-up that could disproportionately affect a minority: in this case cyclists. Let’s develop a more sophisticated view of the right utility and see how it might affect our conclusions.

Uncertainty: The Tyger that Burns Bright

There are two principles we should take into account when considering how we should aim to influence the evolution of our society. The first we have discussed above: uncertainty. The second is Darwin’s principle of natural selection.

Natural selection is a simple idea although it’s had a difficult history. Unfortunately, when applied on its own it leads to some unsophisticated principles that don’t work in practice. First of all, we might naively assume from natural selection that there is a single best way of doing things. This idea is embedded in the statement “survival of the fittest”. But that is a naive reinterpretation of natural selection that gives the wrong connotation. The phrase is not due to Darwin, but due to Herbert Spencer, a Victorian philosopher who applied principles of evolution to sociology. The phrase encourages the idea that one individual will survive, the one that has the best approach. This is a damaging idea, and one that should be put to bed.

The marvel of evolution is its responsiveness. Success is defined by the environment, both at the individual level and at the level of communities and species. But the environment itself is complex and evolving. It is dynamic. Strategies for success are therefore not static. The criteria for success are also uncertain. While we might accept that at any given moment there is a ‘formula for success’, the dynamic nature of our environment means that that formula is itself evolving.

Such a formula for success is equivalent to our utility function, but we are now placed in the circumstance where there is uncertainty around the utility function itself, because it is rapidly evolving and we don’t have access to it.

How quickly is that utility evolving? In the modern world, for humans and animals, very quickly. Species are becoming extinct at an alarming rate, and our world is warming, potentially shifting our current relatively benign climate closer to the more uncomfortable climates of the past. This means the utility is evolving. The criteria for success for a lion born 200 years ago are very different from those for a lion born today. How would we redesign the lion to survive today?

In a rapidly evolving environment, the species that are most vulnerable are those that are specialised to fragile niche environments that will disappear as our global environments evolve. As humans we are lucky in that we can consciously change our practices to react to our evolving environment. But one thing is clear: the idea that there is a single particular solution, an ‘absolute’ principle by which we should progress a society is seriously flawed.

In fact, that is not quite true. There is one absolute policy we should follow. That policy is: “there is no single absolute policy that should be followed slavishly in all circumstances”. Our environment is evolving, and will continue to evolve through our lives, perhaps more so now than at any other time in human history. I’m not even referring to our climate, the temperature changes we might expect from global warming; I’m referring to our social circumstances, the rate at which new technologies are emerging, such that within one individual’s lifespan the modes of intercommunication in our society change so as to be almost unrecognisable, and certainly unenvisageable from decade to decade.

In these periods uncertainty about the right thing to do dominates. And the correct form of response to uncertainty is to value diversity. Every form of extremism that forms a threat to the aspects of life I value can be characterised as absolutist. Whether communist, fascist, islamist or christianist. The malignant characteristic of their extremist forms prohibits any other strategy for society. You don’t need the existence of gods to interpret this as extreme folly and hubris. Ironically these absolutist philosophies are the only behaviours that we can absolutely exclude.

Tigers and Trolleys

From this perspective Sandel’s use of the second trolley example takes on a new light. The first example, that of a simple switch in the points, is about as deterministic and mechanistic as a situation can get. You pull the lever, you expect the points to change. Even if they don’t, there is no downside associated with the consequences, other than perhaps an angry railworker who sees you just tried to send a quarter-ton railway wagon down his throat. The second example is largely contrived, and riddled with uncertainty. The story suggests that we know that the man we want to push off the bridge is heavy enough to divert the trolley and yet we ourselves aren’t. That strikes me as an absurd notion. We’ve had moments to assess the situation, and yet we can judge this with confidence? Next we have to heave this heavier man over the barrier at the side of the bridge (I’m assuming it has a barrier). Assuming we achieve this, we now have to ensure our aim is accurate enough to land the gentleman fully on the track. We’ve already determined his general heft is only just sufficient to divert the trolley, so we know he’ll have to lie very squarely on the track, and remain there. Finally, who’s to say what the path of the trolley will be after hitting him? The endangered workmen are clearly close enough to not be able to clear the track in time, so is it not possible that the diverted trolley will hurtle into them anyway?

My own belief is that, perhaps at a sub-conscious level, humans are very sensitive to the uncertainty in this scenario. We can vividly imagine attempting to explain our actions after the event, and I think the reaction of anyone we explained them to would be utter incredulity: “So the trolley was hurtling towards the men, and instead of yelling to them you attempted to push another man off the bridge?”. Only the most trivial simplification of the situation combined with a naive interpretation of utility theory would conclude that pushing the man was the correct action. Indeed, if it had not been phrased as a decision process, it would not even have entered our mind. Why then is it a mainstay of moral philosophy? This may be an example of what Daniel Kahneman refers to as theory-induced blindness, but what I think would be more correctly referred to as model-induced blindness. Here the model of reasoning is one that ignores uncertainty and our ability to subconsciously quantify it. That’s a good model for the original trolley example but an extremely poor one for the second. The model is wrong.

This brings to mind again George Box’s quote: after all, the model may be wrong, but is it useful? My colleague Richard Wilkinson pointed out to me that a better quote for the modern era (from the same paper) might be

Since all models are wrong the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad.

George E. P. Box (1976)

I use this quote in talks.9 Its intent is that it is important to worry about the manner in which a model is wrong. Models are abstractions, and it is important when modelling to decide upon the correct level of granularity for our abstraction. Unfortunately, it seems that there is a widespread propensity to take such a model (as applied to the first trolley question) and unquestioningly apply it to the next similar-seeming problem on the test paper. Kahneman also has something to say about this type of behaviour from a behavioural psychologist’s perspective: he categorises it as a consequence of the laziness of System 2 and the resulting dominance of System 1. Or to put it in plain terms: very often people don’t stop and think. And in this case it seems to me that failure to stop and think means that the moral philosopher was eaten by a tiger before he had a chance to get anywhere near the fat man.

Apologies for being trite and rather polemical, but these points emerge from an ongoing frustration about the extent to which our theorising about decision making ignores the almost omnipresent effects of uncertainty. Uncertainty about our individual values, our values as a society, our future circumstances, our present circumstances. Uncertainty is endemic, and explicitly accounting for it induces robust behaviour in the presence of changing circumstance. Paramount among these is respect for diversity. Diversity of opinion and behaviour. This is the reason for our cognitive bias towards variance. The importance of diversity in dealing with an uncertain environment.

Conclusion

Uncertainty about the correct utility and our wider values means that, in practice, the ‘correct’ decision is difficult to produce or verify. In these circumstances, there has been a tendency to seek proxies such as consistency in decision making. However, a consistent decision-making algorithm will tend to err on the side of over-simplifying circumstances, potentially missing particular nuances and more negatively affecting one sector of society than another.

In this paper we have argued that the right response to uncertainty is diversity. As studied by Paul Meehl and Daniel Kahneman, our own cognitive bias towards overcomplicating situations leads to a diversity of responses by experts when presented with the same data. Both Meehl and Kahneman characterize this diversity of response as a bad thing because, provably, if experts differ in opinion then at least one of them must be ‘wrong’.

However, obsession with consistency between experts, or algorithms, misses the point that such algorithms (or experts) could also be consistently wrong. This is particularly worrisome for algorithmic decision making, which can be deployed en masse with rapid detrimental effect on particular sectors of society. Such deployments are encouraged by a naive-utilitarian perspective, but a more sophisticated understanding of societal utility actually suggests that preservation of diversity could be much more important.

In an uncertain environment, we argue, society would be more robust if a diversity of solutions and opinions were sustained and respected. Consistency of opinion should not be substituted as a proxy for truth, and diversity should be understood as more robust in an evolving society such as ours, where there is uncertainty around our shared value system and around the nature of ‘fitness’ in a rapidly changing environment.

Bibliography

Box, George E. P. 1976. “Science and Statistics.” Journal of the American Statistical Association 71 (356): 791–99. http://www.jstor.org/stable/2286841.

———. 1979. “Robustness in the Strategy of Scientific Model Building.” Edited by R. L. Launer and G. N. Wilkinson. Academic Press, 201–36. http://www.dtic.mil/docs/citations/ADA070213.

Breiman, Leo. 1996. “Bagging Predictors.” Machine Learning 24 (2): 123–40. doi:10.1007/BF00058655.

Donaldson, Julia, and Axel Scheffler. 2004. A Squash and a Squeeze.

Foot, Philippa. 1967. “The Problem of Abortion and the Doctrine of the Double Effect in Virtues and Vices.” Oxford Review 5: 5–15. doi:10.1093/0199252866.003.0002.

Geman, Stuart, Elie Bienenstock, and René Doursat. 1992. “Neural Networks and the Bias/Variance Dilemma.” Neural Computation 4 (1): 1–58. doi:10.1162/neco.1992.4.1.1.

Kahneman, Daniel. 2011. Thinking Fast and Slow.

Meehl, Paul E. 1954. Clinical Versus Statistical Prediction: A Theoretical Analysis and a Review of the Evidence.

“Reported Road Casualties in Great Britain: Main Results 2015.” 2016. UK Department for Transport.

Sandel, Michael. 2010. Justice: What’s the Right Thing to Do?

Thomson, Judith Jarvis. 1976. “Killing, Letting Die, and the Trolley Problem.” The Monist 59 (2): 204–17.

Vasilaki, Eleni. 2017. “Is Epicurus the Father of Reinforcement Learning?” ArXiv E-Prints. https://arxiv.org/abs/1710.04582.


  1. In Philippa Foot’s original version the example refers to a runaway tram and you are the driver, but it has evolved to be referred to as a trolley, with you in control of the points.

  2. There is almost palpable excitement in the room among students of humanities when driverless cars are discussed, mainly in anticipation (I believe) of a real-world application of the trolley problem. The car has a choice between killing a grandma or a baby: which does it choose? What algorithm does it use? We will discuss a more sophisticated consequence of the algorithm below, but if the car gets into a situation where it’s about to kill someone, something will have gone seriously wrong. There will be a great deal of uncertainty in what happens next, and the main objective should be to reduce the energy of the car and minimize the result of the impact. In these circumstances, no car manufacturer will be calling upon moral subroutines to determine who should survive.

  3. Actually a risk function normally applies: we are looking at expected utility, i.e. utility that incorporates some notion of the probability of outcomes.

  4. This example is due to Judith Jarvis Thomson, whose original paper (Thomson 1976) is a very interesting read. She introduces three separate variations on Philippa Foot’s trolley scenario and impersonalizes them by associating them with three characters, Edward, Frank and George.

  5. Since I first drafted these ideas, my colleague Eleni Vasilaki began exploring Epicurus’s philosophy as an underpinning of reinforcement learning. From a machine learner’s perspective the ideas of Epicurus also seem related to those of John Stuart Mill.

  6. If you have an Xbox at home and use the Kinect, it is using exactly this technique to determine where you are in the video frame. The algorithm is called a “random forest”, and it averages across many ‘decision trees’ to create an output using a voting scheme. A decision tree is an approach to categorisation used in machine learning to classify objects. It uses a tree-like structure to decide what an object is according to its attributes. The algorithm typically takes one feature of the data at a time and ‘branches’ according to the value of the attribute; for example, one decision in a ‘vehicle classifier’ might be to go down one branch if the number of wheels is greater than 3 (car or truck) and another if it is less than 3 (motorbike or bicycle). A follow-on question might involve maximum engine power, sending it down a different branch. Each tree makes its own, detailed decision, and each individual tree may not be very good on average. Leo Breiman suggested an approach to aggregating these decisions to give the final answer; the resulting algorithm is the random forest. Decision trees are also a key component in Facebook’s advertising ranking algorithm.

  7. This idea emerges in the film “The Matrix”.

  8. Cyclist deaths in the UK currently make up around 6% of the total deaths. For example in 2015 of 1,730 total deaths on the road, 100 were cyclists (6%). Pedestrians made up a further 408 (24%), motorcyclists 365 (21%) (“Reported Road Casualties in Great Britain: Main Results 2015” 2016).

  9. I did once have to explain that it doesn’t mean the tigers are in foreign countries.

Neil D. Lawrence