<h1>Now is the Time for Reinforcement Learning on Real Robots</h1>
<p><em>Chaoqiang Zhao, 2019-04-23</em></p>
<p>The purpose of this post is to encourage more people to work on machine learning algorithms for real-world robots.
End-to-end deep learning has become ubiquitous across most of today’s challenging artificial intelligence problems, such as image recognition, natural language processing, protein-folding prediction and game-playing agents.
In fact, you would probably be laughed at for not using deep learning for these problems.</p>
<p>But the same is not yet true for mobile robotics. Why?</p>
<p>Deep learning has the potential to completely revolutionise mobile robotics.
It understands vast quantities of data, simplifies engineering and creates representations which generalise beyond anything we can hand-engineer.
We’ve seen this over and over again in other fields rich in data.
However, today deep learning is typically only seen in isolated parts of mobile robotics systems, such as computer vision front-ends.
Most leading robotics applications rely on hand-engineered control policies, such as <a href="https://waymo.com/">Waymo’s self-driving car</a> or <a href="https://www.bostondynamics.com/atlas">Boston Dynamics’ Atlas</a>.</p>
<p>So what are the world’s talented machine learning researchers doing about this?
Unfortunately, in my experience, few world-leading mobile robotics research labs also have outstanding machine learning people.
Many of the strategic leaders in the world’s biggest AI labs are reluctant; they are of the opinion that real-world experiments will hold back progress towards A.G.I. (artificial general intelligence).
They see hardware, and the problems (and opportunities) that come with it, as a distraction.
They’d rather iterate more quickly with game-playing agents in simulation from the comfort of their office.
Unfortunately in my experience it’s often not possible to transfer these algorithms developed in simulation to real-world robots.</p>
<p>I believe if we are going to see the positive effect of robotics on our society, we need to concentrate more effort on challenging real-world applications with machine learning.
While blogs like “<a href="https://www.alexirpan.com/2018/02/14/rl-hard.html">Deep Reinforcement Learning Doesn’t Work Yet</a>” have some truth today, I think robotics is about to go through its <em>2012 ImageNet</em> moment.
Yes, reinforcement learning may be the <a href="https://cdn-images-1.medium.com/max/1600/0*sQmcKODThlssh2V5.png">cherry on the cake</a>, but the critical component is end-to-end machine learning.
This is what will bring self-driving cars, smart manufacturing and domestic robotics to society before 2030.
Focusing the majority of the world’s talent on advancing A.I. with video games (such as StarCraft or DOTA) is coming at a large opportunity cost.</p>
<p><img src="https://zxcqlf.github.io/assets/images/blog/deep_learning_and_robotics1.gif" alt="image-center" class="align-center" />
<img src="https://zxcqlf.github.io/assets/images/blog/deep_learning_and_robotics2.gif" alt="image-center" class="align-center" /></p>
<p>There are some awesome teams working intensively on machine learning for robotics, but they are still largely in the minority.
To share a few in the figure above: <a href="https://www.skydio.com/">Skydio</a> for drones (top-left), <a href="https://openai.com/">OpenAI</a> with dexterous manipulation (top-right),
<a href="http://rail.eecs.berkeley.edu/">Berkeley’s</a> learning manipulation robots (bottom-left) and <a href="https://wayve.ai/">Wayve</a> with self-driving cars (bottom-right).
We are seeing some progress!
For example, in 2018 our team at <a href="https://wayve.ai/">Wayve</a> showed two world-firsts for mobile robotics, using deep learning:</p>
<ul>
<li>first example of <a href="https://wayve.ai/blog/learning-to-drive-in-a-day-with-reinforcement-learning">deep reinforcement learning on a self-driving car</a>, learning to lane-follow from 11 episodes of training data.</li>
<li><a href="https://wayve.ai/blog/sim2real">sim2real</a>, where we demonstrated that it is possible to train a robot in simulation, then transfer the policy to the real-world.
We drove a car for 3km+ on UK roads using a policy trained only from labelled simulated data using image-to-image translation techniques.</li>
</ul>
<p>Some of the most important things we learned from this work so far could not have come from studies on static computer vision datasets or simulation alone.
In fact, many times we found standard practices and algorithms in machine learning did not work for real-world robotics.
I’ll spend the rest of this blog discussing a few of these problems and the insights I learned.</p>
<h1 id="exploration-its-not-a-big-deal-for-robotics">Exploration: it’s not a big deal for robotics</h1>
<p>A large amount of research in reinforcement learning has focused on the problem of exploration: taking actions in order to learn about the world.
For example, <a href="https://en.wikipedia.org/wiki/Montezuma's_Revenge_(video_game)">Montezuma’s Revenge</a> is a simulation environment which is notoriously difficult to solve because it requires extreme exploration with infrequent feedback and sparse rewards.
This has challenged researchers, with solutions proposed in 2018, e.g. by <a href="https://blog.openai.com/learning-montezumas-revenge-from-a-single-demonstration/">a single demonstration</a> or by <a href="https://eng.uber.com/go-explore/">Go-Explore</a>.</p>
<p>However these ideas are of little use in robotics.
In robotics, we typically have data collected by a demonstrator available to us, e.g. by a human or other hand-coded system.
It is simply too dangerous to let a robot randomly explore the world with epsilon-greedy.
In fact, control policies must often be learned from predominantly off-policy data.</p>
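To make the epsilon-greedy point concrete, here is a minimal sketch (with made-up Q-values) of the standard selection rule: with probability epsilon the agent takes a uniformly random action, which is harmless in a video game but clearly unacceptable when the action is a steering command on a public road.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Standard epsilon-greedy action selection.

    With probability epsilon, take a uniformly random action (exploration);
    otherwise take the action with the highest estimated value (exploitation).
    """
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # random exploratory action
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Hypothetical value estimates for actions [left, straight, right]:
action = epsilon_greedy([0.2, 0.9, 0.1], epsilon=0.0)  # greedy: chooses 'straight'
```

Even a small epsilon means the robot periodically takes an arbitrary action, which is why learning off-policy from demonstrations is the more practical starting point on real hardware.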
<p>The more pressing problem is to be able to learn in a very data-efficient manner, retaining information and avoiding catastrophic forgetting of information.
When we first applied a model-free reinforcement learning algorithm, <a href="https://arxiv.org/abs/1509.02971">DDPG</a>, to our self-driving car, it was very difficult to get it to learn quickly and then to retain stable policies without forgetting them.
The randomness of exploration made experiments incredibly stochastic and unreliable.
We found <a href="https://wayve.ai/blog/dreaming-about-driving-imagination-rl">model-based reinforcement learning</a> to be more repeatable and effective.
This allowed us to learn off-policy with much more deterministic results, with policies which improved with more data.
Ultimately, to scale our driving policy to <a href="https://wayve.ai/blog/driving-like-human">public UK roads</a>, we needed to use expert demonstrations.</p>
<h1 id="what-is-the-reward-in-real-life">What is the reward in real life?</h1>
<p>In real life, there is no game score we need to optimise, unlike in Atari games.
Philosophers have debated for millennia what optimal behavior is, but the answer to key questions such as the <a href="http://moralmachine.mit.edu/">Trolley Problem</a> is still unclear.</p>
<p>I’ve thought a lot about how to design a reward for self-driving.
We can’t use privileged information like ‘distance to lane’ which is available in simulation.
If we want to be data-efficient it can’t be sparse, such as ‘did we get safely to the destination’.
We can’t learn by failure in the real-world as it is unsafe (well maybe sometimes we can… see <a href="https://arxiv.org/abs/1704.05588">Gandhi et al. 2017</a>).</p>
<p>It is important to note that for mobile robotics, train environment != test environment.
Unlike much academic literature, we should report the worst case, not the best case.
The reward should encourage generalisation, robustness and safety.</p>
<p>As an interesting anecdote on reward design: when our car was trained with a reward for driving as far as possible without safety-driver intervention, it learned to zig-zag down the road. By zig-zagging it never left the lane, so it avoided intervention, yet it covered a greater distance and therefore maximised the reward.
This is a phenomenon known as reward hacking, where the agent earns reward using unintended behavior.
For an excellent treatment of reward-hacking and other problems in AI safety, see <a href="https://arxiv.org/abs/1606.06565">Amodei et al. 2016</a>.</p>
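The zig-zag behaviour can be reproduced with a toy calculation (the geometry here is made up for illustration): within a lane of fixed width, a zig-zag path between two points is strictly longer than the straight one, so a reward based purely on distance driven without intervention prefers it.

```python
import math

def path_length(points):
    """Total length of a piecewise-linear path through (x, y) waypoints."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))

# Hypothetical 100 m road segment with waypoints every 10 m; the lane is
# wide enough that a lateral offset of +/-1 m never triggers an intervention.
straight = [(x, 0.0) for x in range(0, 101, 10)]
zigzag = [(x, 1.0 if i % 2 else -1.0) for i, x in enumerate(range(0, 101, 10))]

# The zig-zag path earns more "distance driven" reward despite staying in lane:
assert path_length(zigzag) > path_length(straight)
```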
<p>I believe ideas like reward-learning, inverse reinforcement learning, preference learning and imitation learning are going to be very important to real-life robotics.
Ultimately, the best reward will be learned from demonstration and feedback.</p>
<h1 id="combining-computer-vision-and-control">Combining computer vision and control</h1>
<p>There are not many reinforcement learning systems that work with computer vision.
There are also not many reinforcement learning systems that work with state-spaces with millions of dimensions.
Therefore if our robots are going to work with mega-pixel camera sensors, we are going to need to learn policies that combine RL and computer vision.</p>
<p>This is not as trivial as taking a semantic segmentation representation and passing it to a policy or value function.
The representation needed for control is different from recognition problems.
In contrast, most state-of-the-art semantic segmentation algorithms, like <a href="https://arxiv.org/abs/1612.01105">PSPNet</a> or <a href="https://arxiv.org/abs/1706.05587">DeepLab</a>, don’t compress the state.
They explicitly design architectures to retain features with large spatial resolution because this performs better at the IoU metric used for this task.
For example, with a 1000 x 1000 x 3 RGB pixel input (1 megapixel), the resulting feature representation from <a href="https://arxiv.org/abs/1706.05587">DeepLabV3</a> is a tensor with dimensions 125 x 125 x 256.
This hasn’t compressed the state - it has expanded it by a factor of 1.33!</p>
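The arithmetic behind that factor is worth spelling out (shapes taken from the paragraph above; a quick sanity check rather than anything profound):

```python
# A 1000 x 1000 RGB input versus a DeepLabV3-style feature map of
# 125 x 125 spatial resolution with 256 channels.
input_values = 1000 * 1000 * 3      # 3,000,000 numbers in
feature_values = 125 * 125 * 256    # 4,000,000 numbers out

expansion = feature_values / input_values
print(f"expansion factor: {expansion:.2f}")  # roughly 1.33
```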
<p>We need to design computer vision systems which compress the state and learn features relevant for the control task.
Fortunately supervised learning is very good at learning <a href="https://arxiv.org/abs/1705.07115">semantics, motion and geometry</a>.</p>
<p>But what are the metrics we need to optimise? Counter-intuitively, I’ve often seen systems which perform better at computer vision metrics be worse representations to use for learning control.
We don’t care about the pixel accuracy of segmentation masks in order to interact with an object.
Robust object detection is far more important.
We need to focus on the right semantics, geometry and motion representations and metrics.</p>
<p>This requires a new attitude towards computer vision metrics. Improving IoU or accuracy metrics on computer vision problems is often no longer a good proxy for improving the end-to-end control system.
When we get to 90%+ performance on a computer vision task, it now becomes more important to learn the right representation, rather than improving the metric.</p>
<h1 id="learning-with-noisy-data">Learning with noisy data</h1>
<p>Real data is noisy.
Datasets like <a href="http://www.cvlibs.net/datasets/kitti/">KITTI</a> and <a href="https://www.cityscapes-dataset.com/">CityScapes</a> are painstakingly labelled and cleaned.
But their scope is tiny: to build real-world robots we need to learn from far vaster datasets.
Perhaps the <a href="https://robotcar-dataset.robots.ox.ac.uk/">Oxford Robot Car dataset</a> is much more representative of what real data at scale looks like.</p>
<p>One interesting observation I’ve made is that on clean computer vision benchmark datasets, methods which increase the weighting of hard data points perform better.
Recent examples include <a href="https://arxiv.org/abs/1708.02002">focal loss</a> and other hard-mining or loss-weighting techniques.
However, these methods are less effective on real data, where hard data points are due to noise, not learning difficulty.</p>
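For reference, the focal loss of Lin et al. re-weights cross-entropy by the factor (1 - p)^gamma, which is exactly what makes it latch onto noisy, persistently "hard" labels on real data. A minimal sketch:

```python
import math

def focal_loss(p, gamma=2.0):
    """Focal loss for a true-class probability p:
    FL(p) = -(1 - p)**gamma * log(p).
    Relative to cross-entropy, easy examples (high p) are down-weighted
    and hard examples (low p) dominate the gradient."""
    return -((1.0 - p) ** gamma) * math.log(p)

def cross_entropy(p):
    return -math.log(p)

# Weighting relative to plain cross-entropy (gamma = 2):
easy_weight = focal_loss(0.9) / cross_entropy(0.9)  # (1 - 0.9)**2 = 0.01
hard_weight = focal_loss(0.1) / cross_entropy(0.1)  # (1 - 0.1)**2 = 0.81
```

On a clean benchmark the hard examples are genuinely informative; on noisy real-world data, that same 81x relative emphasis lands on mislabelled points.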
<p>In contrast, probabilistic deep learning (see my <a href="/computer_vision/bayesian_deep_learning_for_safe_ai/">previous blog here</a>) down-weights noisy data points.
These approaches perform worse on clean benchmark computer vision datasets.
However, in the real-world, probabilistic deep learning often performs better than hard-negative-mining approaches.
This is because it down-weights the noisiest examples, reducing errors.</p>
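One concrete form of this down-weighting, from the Bayesian deep learning work linked above, is learned loss attenuation: the network predicts a per-example log-variance s alongside its output, and the regression loss becomes 0.5 * exp(-s) * r^2 + 0.5 * s for residual r. A sketch of the loss term alone (the network producing s is omitted):

```python
import math

def attenuated_loss(residual, log_var):
    """Heteroscedastic (aleatoric) regression loss:
    L = 0.5 * exp(-s) * r**2 + 0.5 * s, with s = log(sigma**2).

    Predicting a large variance for a noisy example shrinks the residual
    term, while the 0.5 * s regulariser stops the model from claiming
    every example is noisy."""
    return 0.5 * math.exp(-log_var) * residual ** 2 + 0.5 * log_var

# A large residual costs far less when the model flags the point as noisy:
noisy = attenuated_loss(residual=5.0, log_var=3.0)      # ~2.12
confident = attenuated_loss(residual=5.0, log_var=0.0)  # 12.5
assert noisy < confident
```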
<p><img src="https://zxcqlf.github.io/assets/images/blog/wayve_twizy.jpg" alt="image-center" class="align-center" /></p>
<h1 id="summary">Summary</h1>
<p>If we take the approach of, ‘let’s solve A.G.I. in simulation first and only then let’s solve it in the real-world’, we are not going to see real benefits of intelligent robotics in the near future.
There is a huge opportunity to work on A.I. for robotics today.
Hardware is cheaper, more accessible and reliable than ever before.
I think mobile robotics is about to go through the revolution that computer vision, NLP and other data science fields have seen over the last five years.</p>
<p>Autonomous driving is the ideal application to work on. Here’s why: the action space is relatively simple.
Unlike difficult strategy games like <a href="https://openai.com/five/">DOTA</a>, driving does not require long term memory or strategy.
At a basic level, the decision is either left, right, straight or stop.
The counter point to this is that the input state space is very hard, but computer vision is making <a href="https://wayve.ai/blog/2018/10/8/vision-for-driving-with-deep-learning">remarkable progress</a> here.</p>
<p>If you share my excitement and want to work with me at <a href="https://wayve.ai/">Wayve</a> on deep learning for autonomous vehicles, please get in touch: <a href="https://wayve.ai/careers">wayve.ai/careers</a></p>
<hr />
<blockquote>
<p>Thank you to my team who make this work possible and to my friends who gave feedback on this post before publication.</p>
</blockquote>

<h1>PhD Thesis: Geometry and Uncertainty in Deep Learning for Computer Vision</h1>
<p><em>2018-10-07</em></p>
<p>Today I can share my final PhD thesis, which I submitted in November 2017. It was examined by Dr. Joan Lasenby and Prof. Andrew Zisserman in February 2018 and has just been approved for publication.
This thesis presents the main narrative of my research at the University of Cambridge, under the supervision of Prof Roberto Cipolla.
It contains 206 pages, 62 figures, 24 tables and 318 citations.
You can download <a href="/media/papers/alex_kendall_phd_thesis_compressed.pdf">the complete .pdf here</a>.</p>
<figure>
<img src="https://zxcqlf.github.io/assets/images/research/thesis.jpg" alt="PhD Thesis" />
</figure>
<p>My thesis presents contributions to the field of computer vision, the science which enables machines to see.
This blog post introduces the work and tells the story behind this research.</p>
<video autoplay="" loop="">
<source src="/assets/images/research/multitask_scene_understanding.mp4" type="video/mp4" />
</video>
<blockquote>
<p>This thesis presents deep learning models for an array of computer vision problems: <a href="https://arxiv.org/abs/1511.00561">semantic segmentation</a>, <a href="https://arxiv.org/abs/1705.07115">instance segmentation</a>, <a href="https://arxiv.org/abs/1705.07115">depth prediction</a>, <a href="https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Kendall_PoseNet_A_Convolutional_ICCV_2015_paper.pdf">localisation</a>, <a href="https://arxiv.org/pdf/1703.04309.pdf">stereo vision</a> and <a href="/media/papers/alex_kendall_phd_thesis_compressed.pdf">video scene understanding</a>.</p>
</blockquote>
<h2 id="the-abstract">The abstract</h2>
<p>Deep learning and convolutional neural networks have become the dominant tool for computer vision. These techniques excel at learning complicated representations from data using supervised learning. In particular, image recognition models now out-perform human baselines under constrained settings. However, the science of computer vision aims to build machines which can see. This requires models which can extract richer information than recognition, from images and video. In general, applying these deep learning models from recognition to other problems in computer vision is significantly more challenging.</p>
<p>This thesis presents end-to-end deep learning architectures for a number of core computer vision problems; scene understanding, camera pose estimation, stereo vision and video semantic segmentation. Our models outperform traditional approaches and advance state-of-the-art on a number of challenging computer vision benchmarks. However, these end-to-end models are often not interpretable and require enormous quantities of training data.</p>
<p>To address this, we make two observations: (i) we do not need to learn everything from scratch, we know a lot about the physical world, and (ii) we cannot know everything from data, our models should be aware of what they do not know. This thesis explores these ideas using concepts from geometry and uncertainty. Specifically, we show how to improve end-to-end deep learning models by leveraging the underlying geometry of the problem. We explicitly model concepts such as epipolar geometry to learn with unsupervised learning, which improves performance. Secondly, we introduce ideas from probabilistic modelling and Bayesian deep learning to understand uncertainty in computer vision models. We show how to quantify different types of uncertainty, improving safety for real world applications.</p>
<h2 id="the-story">The story</h2>
<p>I began my PhD in October 2014, joining the controls research group at Cambridge University Engineering Department.
Looking back at my original research proposal, I said that I wanted to work on the ‘engineering questions to control autonomous vehicles… in uncertain and challenging environments.’
I spent three months or so reading literature, and quickly developed the opinion that the field of robotics was most limited by perception.
If you could obtain a reliable state of the world, control was often simple.
However, at this time, computer vision was very fragile in the wild.
After many weeks of lobbying Prof. Roberto Cipolla (thanks!), I was able to join his research group in January 2015 and begin a PhD in computer vision.</p>
<p>When I began reading computer vision literature, deep learning had just become popular in image classification, following inspiring breakthroughs on the ImageNet dataset.
But it was yet to become ubiquitous in the field and be used in richer computer vision tasks such as scene understanding.
What excited me about deep learning was that it could learn representations from data that are too complicated to hand-design.</p>
<p>I initially focused on building end-to-end deep learning models for computer vision tasks which I thought were most interesting for robotics, such as <a href="http://mi.eng.cam.ac.uk/projects/segnet/">scene understanding (SegNet)</a> and <a href="http://mi.eng.cam.ac.uk/projects/relocalisation/">localisation (PoseNet)</a>.
However, I quickly realised that, while it was a start, applying end-to-end deep learning wasn’t enough.
In my thesis, I argue that we can do better than naive end-to-end convolutional networks.
Especially with limited data and compute, we can form more powerful computer vision models by leveraging our knowledge of the world.
Specifically, I focus on two ideas around geometry and uncertainty.</p>
<ul>
<li><em>Geometry</em> is all about leveraging structure of the world. This is useful for improving architectures and learning with self-supervision.</li>
<li><em>Uncertainty</em> understands what our model doesn’t know. This is useful for robust learning, safety-critical systems and active learning.</li>
</ul>
<p>Over the last three years, I have had the pleasure of working with some incredibly talented researchers, studying a number of core computer vision problems from localisation to segmentation to stereo vision.</p>
<figure class="third ">
<img src="https://zxcqlf.github.io/assets/images/research/input.png" alt="Input Image" />
<img src="https://zxcqlf.github.io/assets/images/research/segmentation.png" alt="Semantic Segmentation" />
<img src="https://zxcqlf.github.io/assets/images/research/uncertainty.jpg" alt="Uncertainty" />
<figcaption>Bayesian deep learning for modelling uncertainty in semantic segmentation.
</figcaption>
</figure>
<h2 id="the-science">The science</h2>
<p>This thesis consists of six chapters. Each of the main chapters introduces an end-to-end deep learning model and discusses how to apply the ideas of geometry and uncertainty.</p>
<p><em>Chapter 1 - Introduction.</em> Motivates this work within the wider field of computer vision.</p>
<p><em>Chapter 2 - Scene Understanding.</em> Introduces SegNet, modelling aleatoric and epistemic uncertainty and a method for learning multi-task scene understanding models for geometry and semantics.</p>
<p><em>Chapter 3 - Localisation.</em> Describes PoseNet for efficient localisation, with improvements using geometric reprojection error and estimating relocalisation uncertainty.</p>
<p><em>Chapter 4 - Stereo Vision.</em> Designs an end-to-end model for stereo vision, using geometry and shows how to leverage uncertainty and self-supervised learning to improve performance.</p>
<p><em>Chapter 5 - Video Scene Understanding.</em> Illustrates a video scene understanding model for learning semantics, motion and geometry.</p>
<p><em>Chapter 6 - Conclusions.</em> Describes limitations of this research and future challenges.</p>
<figure>
<img src="https://zxcqlf.github.io/assets/images/research/phd_demo.jpg" alt="PhD Overview" />
<figcaption>An overview of the models considered in this thesis.</figcaption>
</figure>
<h1 id="as-for-whats-next">As for what’s next?</h1>
<p>This thesis explains how to extract a robust state of the world – semantics, motion and geometry – from video.
I’m now excited about applying these ideas to robotics and learning to reason from perception to action.
I’m working with an amazing team on autonomous driving, bringing together the worlds of robotics and machine learning.
We’re using ideas from computer vision and reinforcement learning to build the most data-efficient self-driving car.
And, we’re hiring, come work with me! <a href="https://wayve.ai/careers">wayve.ai/careers</a></p>
<blockquote>
<p>I’d like to give a huge thank you to everyone who motivated, distracted and inspired me while writing this thesis.</p>
</blockquote>
<p>Here’s the bibtex if you’d like to cite this work.</p>
<div id="bibtex_phd">
<small><div class="highlighter-rouge"><pre class="highlight">
<code>@phdthesis{kendall2018phd,
title={Geometry and Uncertainty in Deep Learning for Computer Vision},
author={Kendall, Alex},
year={2018},
school={University of Cambridge}
}
</code></pre></div></small>
</div>
<p>And the source code for the latex document is <a href="https://github.com/alexgkendall/thesis">here</a>.</p>Chaoqiang Zhao[email protected]Reprojection Losses: Deep Learning Surpassing Classical Geometry in Computer Vision?2018-01-21T00:00:00+00:002018-01-21T00:00:00+00:00https://zxcqlf.github.io/computer_vision/Reprojection_losses_geometry_computer_vision<p><strong>2017 was an exciting year as we saw deep learning become the dominant paradigm for estimating geometry in computer vision.</strong></p>
<p>Learning geometry has emerged as one of the most influential topics in computer vision over the last few years.</p>
<blockquote>
<p>“Geometry is … concerned with questions of shape, size, relative position of figures and the properties of space” (<a href="https://en.wikipedia.org/wiki/Geometry">wikipedia</a>).</p>
</blockquote>
<p>We first saw end-to-end deep learning models for these tasks using supervised learning: for example, depth estimation (<a href="https://cs.nyu.edu/~deigen/depth/">Eigen et al. 2014</a>), relocalisation (<a href="http://mi.eng.cam.ac.uk/projects/relocalisation/">PoseNet 2015</a>), stereo vision (<a href="https://arxiv.org/pdf/1703.04309.pdf">GC-Net 2017</a>) and visual odometry (<a href="http://ieeexplore.ieee.org/abstract/document/7989236/?reload=true">DeepVO 2017</a>).
Deep learning excels at these applications for a few reasons.
Firstly, it is able to learn higher order features which reason over shapes and objects with larger context than point-based classical methods.
Secondly, it is very efficient for inference to simply run a forward pass of a convolutional neural network which approximates an exact geometric function.</p>
<p>Over the last year, I’ve noticed epipolar geometry and reprojection losses improving these models, allowing them to learn with unsupervised learning.
This means they can train without expensive labelled data by just observing the world.
Reprojection losses have contributed to a number of significant breakthroughs which now allow deep learning to outperform many traditional approaches to estimating geometry.
Specifically, photometric reprojection loss has emerged as the dominant technique for learning geometry with unsupervised (or self-supervised) learning.
We’ve seen this across a number of computer vision problems:</p>
<ul>
<li><strong>Monocular Depth</strong>: Reprojection loss for deep learning was first presented for monocular depth estimation by <a href="https://arxiv.org/abs/1603.04992">Garg et al.</a> in 2016.
In 2017, <a href="https://arxiv.org/abs/1609.03677">Godard et al.</a> show how to formulate left-right consistency checks to improve results.</li>
<li><strong>Optical Flow</strong>: this requires training reprojection disparities over 2D and has been demonstrated by <a href="https://arxiv.org/abs/1608.05842">Yu et al. 2016</a>, <a href="http://www.aaai.org/ocs/index.php/AAAI/AAAI17/paper/download/14388/13940">Ren et al. 2017</a> and <a href="https://arxiv.org/abs/1711.07837">Meister et al. 2018</a>.</li>
<li><strong>Stereo Depth</strong>: in my PhD thesis I show how to extend our stereo architecture, <a href="https://arxiv.org/abs/1703.04309">GC-Net</a>, to learn stereo depth with epipolar geometry & unsupervised learning.</li>
<li><strong>Localisation</strong>: I presented a paper at CVPR 2017 showing how to train relocalisation systems by learning to project 3D geometry from structure from motion models <a href="https://arxiv.org/abs/1704.00390">Kendall & Cipolla 2017</a>.</li>
<li><strong>Ego-motion</strong>: learning depth and ego-motion with reprojection loss now outperforms traditional methods like ORB-SLAM over short sequences under constrained settings (<a href="https://people.eecs.berkeley.edu/~tinghuiz/projects/SfMLearner/">Zhou et al. 2017</a>, <a href="https://arxiv.org/pdf/1709.06841.pdf">Li et al. 2017</a>).</li>
<li><strong>Multi-View Stereo</strong>: projection losses can also be used in a supervised setting to learn structure from motion, for example <a href="http://openaccess.thecvf.com/content_cvpr_2017/papers/Ummenhofer_DeMoN_Depth_and_CVPR_2017_paper.pdf">DeMoN</a> and <a href="https://arxiv.org/abs/1704.07804">SfM-Net</a>.</li>
<li><strong>3D Shape Estimation</strong>: projection geometry also aids learning 3D shape from images in <a href="https://arxiv.org/pdf/1704.06254.pdf">this work from Jitendra Malik’s group</a>.</li>
</ul>
<p>In this blog post I’d like to highlight the importance of epipolar geometry and how we can use it to learn representations of geometry with deep learning.</p>
<figure>
<img src="https://zxcqlf.github.io/assets/images/godard_depth.gif" alt="Foo" />
<figcaption>An example of state of the art monocular depth estimation with unsupervised learning using reprojection geometry (<a href="https://arxiv.org/abs/1609.03677">Godard et al. 2017</a>)</figcaption>
</figure>
<h1 id="what-is-reprojection-loss">What is reprojection loss?</h1>
<p>The core idea behind reprojection losses is using epipolar geometry to relate corresponding points in multi-view stereo imagery.
To dissect this jargon-filled sentence: epipolar geometry relates the projection of 3D points in space to 2D images.
This can be thought of as triangulation (see the figure below).
The relation between two 2D images is defined by the Fundamental matrix.
If we choose a point on one image and know the fundamental matrix, then this geometry tells us that the same point must lie on a line in the second image, called the epipolar line (the red line in the figure below).
The exact point of the correspondence on the epipolar line is defined by the 3D point’s depth in the scene.</p>
<p>If these two images are from a rectified stereo camera then this is a special type of multi-view geometry, and the epipolar line is horizontal.
We then refer to the corresponding point’s position on the epipolar line as disparity.
Disparity is inversely proportional to metric depth.</p>
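Numerically, the rectified-stereo relation is depth = f * B / d, for focal length f (pixels), baseline B (metres) and disparity d (pixels). A sketch with rig parameters chosen to be roughly KITTI-like (the exact numbers are illustrative):

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Metric depth from disparity for a rectified stereo pair:
    depth = f * B / d, so disparity is inversely proportional to depth."""
    return focal_px * baseline_m / disparity_px

# Illustrative rig: ~720 px focal length, 0.54 m baseline.
near = disparity_to_depth(50.0, focal_px=720.0, baseline_m=0.54)  # ~7.8 m
far = disparity_to_depth(5.0, focal_px=720.0, baseline_m=0.54)    # ~77.8 m
assert abs(far - 10 * near) < 1e-9  # a tenth of the disparity, ten times the depth
```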
<p>The standard reference for this topic is the textbook, “Multiple View Geometry in Computer Vision” <a href="http://www.robots.ox.ac.uk/~vgg/hzbook/">Hartley and Zisserman, 2004</a>.</p>
<figure>
<img src="https://zxcqlf.github.io/assets/images/epipolar_geometry.svg" alt="Foo" />
<figcaption>Epipolar geometry relates the same point in space seen by two cameras and can be used to learn 3D geometry from multi-view stereo (Image borrowed from <a href="https://en.wikipedia.org/wiki/Epipolar_geometry">Wikipedia</a>).</figcaption>
</figure>
<p>One way of exploiting this is learning to match correspondences between stereo images along this epipolar line.
This allows us to estimate pixel-wise metric depth.
We can do this using <em>photometric</em> reprojection loss (<a href="https://arxiv.org/abs/1603.04992">Garg et al. in 2016</a>).
The intuition behind reprojection loss is that pixels representing the same object in two different camera views look the same.
Therefore, if we relate pixels, or determine correspondences between two views, the pixels should have identical RGB pixel intensity values.
The better the estimate of geometry, the closer the photometric (RGB) pixel values will match.
We can optimise for values which provide matching pixel intensities between each image, known as minimising the <em>photometric</em> error.</p>
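A minimal NumPy sketch of this loss for a rectified pair, under two simplifying assumptions (integer disparities instead of the sub-pixel bilinear sampling used in practice, and a plain L1 penalty instead of the SSIM-weighted variants in the papers above):

```python
import numpy as np

def photometric_loss(left, right, disparity):
    """L1 photometric reprojection error for a rectified stereo pair.

    Reconstructs the left image by sampling the right image along the
    horizontal epipolar line at the predicted disparity, then compares
    pixel intensities.
    """
    h, w = left.shape
    xs = np.arange(w)[None, :] - disparity.astype(int)  # x - d(x, y) per pixel
    valid = xs >= 0                                     # pixels warped out of view carry no signal
    reconstructed = right[np.arange(h)[:, None], np.clip(xs, 0, w - 1)]
    return np.abs(left - reconstructed)[valid].mean()
```

On a synthetic pair where the left image is the right image shifted by a known disparity, this loss is zero at the true disparity and positive elsewhere; that difference is the training signal, with no human labels involved.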
<p>An important property of these losses is that they are unsupervised.
This means that we can learn these geometric quantities by observing the world, without expensive human-labelled training data.
This is also known as self-supervised learning.</p>
<p>The list of papers at the start of this post further extend this idea to optical flow, depth, ego-motion, localisation etc. — all containing forms of epipolar geometry.</p>
<h1 id="does-this-mean-learning-geometry-with-deep-learning-is-solved">Does this mean learning geometry with deep learning is solved?</h1>
<p>I think there are some short-comings to reprojection losses.</p>
<p>Firstly, photometric reprojection loss makes a <strong>photometric consistency assumption</strong>.
This means it assumes that the same surface has the same RGB pixel value between views.
This assumption is usually valid for stereo vision, because both images are taken at the same time.
However, this is not always the case for learning optical flow or multi-view stereo, because appearance and lighting changes over time.
This is because of occlusion, shadows and the dynamic nature of scenes.</p>
<p>Secondly, reprojection suffers from the <strong>aperture problem</strong>.
The aperture problem is unavoidable ambiguity of structure due to a limited field of view.
For example, if we try to learn depth by photometric reprojection, our model cannot learn from areas with no texture, such as sky or featureless walls.
This is because the reprojection loss is equal across areas of homogeneous texture. To resolve the correct reprojection we need context!
This problem is usually resolved by a smoothing prior, which encourages the output to be smooth where there is no training signal, but this also blurs correct structure.</p>
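One common form of such a smoothing prior penalises depth gradients except where the image itself has strong edges. A minimal numpy sketch (the exact weighting varies between papers; this form is only illustrative):

```python
import numpy as np

def edge_aware_smoothness(depth, image):
    """Penalise depth gradients, down-weighted where the image has edges.

    depth: (H, W) predicted depth map; image: (H, W, 3) input image.
    A hypothetical sketch of the kind of prior used for unsupervised
    depth learning, not any specific paper's exact loss.
    """
    d_dx = np.abs(np.diff(depth, axis=1))  # horizontal depth gradients
    d_dy = np.abs(np.diff(depth, axis=0))  # vertical depth gradients
    i_dx = np.abs(np.diff(image.mean(axis=2), axis=1))
    i_dy = np.abs(np.diff(image.mean(axis=2), axis=0))
    # full penalty in textureless regions, small penalty at image edges
    return (d_dx * np.exp(-i_dx)).mean() + (d_dy * np.exp(-i_dy)).mean()
```

This fills in textureless regions with smooth depth, but as noted above it also blurs genuine structure that the image gradients fail to flag.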
<p>Thirdly, we <strong>don’t need to reconstruct everything</strong>.
Learning to reproject pixels is similar to training an autoencoder: we learn to encode all parts of the world equally.
However, for many practical applications, attention-based reasoning has been shown to be more effective.
For example, in autonomous driving we don’t need to learn the geometry of building facades and the sky, we only care about the immediate scene in front of us.
However, reprojection losses will treat all aspects of the scene equally.</p>
<figure>
<img src="https://zxcqlf.github.io/assets/images/context.jpg" alt="Saliency of stereo depth estimation" />
<figcaption>State of the art stereo depth estimation from <a href="https://arxiv.org/abs/1703.04309">GC-Net</a>. This figure shows the saliency of the model with respect to the depth prediction for the point with the white cross. This demonstrates the model uses a wider context of the surrounding car and road to make its prediction. </figcaption>
</figure>
<h1 id="how-can-we-improve-performance">How can we improve performance?</h1>
<p>It is difficult to learn geometry alone; I think we need to incorporate semantics.
There is some evidence that deep learning models learn semantic representations implicitly <a href="https://arxiv.org/abs/1412.6856">from patterns in the data</a>.
Perhaps our models could more explicitly exploit this?</p>
<p>I think we need to reproject into a better space than RGB photometric space. We would like this latent space to solve the problems above.
It should have enough context to address the aperture problem, be invariant to small photometric changes and emphasise task-dependent importance.
Training on the projection error in this space should result in a better performing model.</p>
<p>After the flurry of exciting papers in 2017, I’m looking forward to further advances in 2018 in one of the hottest topics in computer vision right now.</p>
<p><em>I first presented the ideas in this blog post at the <a href="https://sites.google.com/site/deepgeometry2017/">Geometry in Deep Learning Workshop</a> at the International Conference on Computer Vision 2017.
Thank you to the organisers for a great discussion.</em></p>

<hr />
<h1 id="lets-talk-about-ethics-in-artificial-intelligence">Let’s Talk About Ethics in Artificial Intelligence</h1>
<p><em>2017-08-24, <a href="https://zxcqlf.github.io/artificial_intelligence/lets_talk_about_ethics_in_artificial_intelligence">https://zxcqlf.github.io/artificial_intelligence/lets_talk_about_ethics_in_artificial_intelligence</a></em></p>
<p>When I am in the pub and I tell people I am working on Artificial Intelligence (AI) research, the conversation that invariably comes up is, “Why are you building machines to take all our jobs?”
However, within AI research communities, this topic is rarely discussed.
My experience with colleagues is that it is often dismissed with off-hand arguments such as, “We’ll make more advanced jobs to replace those which are automated”.
I’d like to pose the question to all AI researchers: how long have you actually sat down and thought about ethics?
Unfortunately, the overwhelming opinion is that we just build the tools and ethics are for the end-users of our algorithms to deal with.
But, I believe we have a professional responsibility to build ethics into our systems from the ground up.
In this blog I’m going to discuss how to build ethical algorithms.</p>
<p>How do we implement ethics? There are a number of ethical frameworks which philosophers have designed;
<a href="https://en.wikipedia.org/wiki/Virtue_ethics">virtue ethics</a> (Aristotle), <a href="https://en.wikipedia.org/wiki/Deontological_ethics">deontological ethics</a> (Kant) and <a href="https://en.wikipedia.org/wiki/Teleology">teleological ethics</a> (Mill, Bentham).
However, Toby Walsh, Professor of AI at the University of New South Wales in Australia, has a different view:</p>
<blockquote>
<p>“For too long philosophers have come up with vague frameworks. Now that we have to write code to actually implement AI, it forces us to define a much more precise and specific moral code”</p>
</blockquote>
<p>Therefore, AI researchers may even have an opportunity to make fundamental contributions to our understanding of ethics!
I have found it interesting to think about what ethical issues such as trust, fairness and honesty mean for AI researchers.</p>
<h1 id="concrete-ethical-issues-for-machine-learners">Concrete ethical issues for machine learners</h1>
<p>In this section I am going to discuss concrete ways to implement trust, fairness and honesty in AI models.
I will try to translate these ethical topics into actual machine learning problems.</p>
<h2 id="trust">Trust</h2>
<p>It is critical that users trust AI systems, otherwise their acceptance in society will be jeopardised.
For example, if a self-driving car is not trustworthy, it is unlikely anyone will want to use it.
Building trustworthy algorithms means we must make them safe by:</p>
<ul>
<li>improving accuracy and performance of algorithms. We are more likely to trust something which is accurate.
This is something which most machine learning researchers do,</li>
<li>designing algorithms which are aware of their uncertainty and understand what they do not know.
This means they will not make ill-founded decisions.
See a previous blog I wrote on <a href="/computer_vision/bayesian_deep_learning_for_safe_ai/">Bayesian deep learning</a> for some ideas here,</li>
<li>making well-founded decision making or control policies which do not over-emphasise exploration over exploitation.</li>
</ul>
<h2 id="fairness">Fairness</h2>
<p>We have a responsibility to make AI fair. This means removing unwanted bias from algorithms.
Unfortunately, there are many examples of biased / unfair AI systems today. For example:</p>
<ul>
<li>systems used to estimate re-offending risk in the US are <a href="https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing">biased towards African-Americans</a>,</li>
<li>Google’s voice recognition system has been shown to <a href="https://makingnoiseandhearingthings.com/2016/07/12/googles-speech-recognition-has-a-gender-bias/">systematically perform better with male voices rather than female voices</a>,</li>
<li>most self-driving cars are biased towards training data collected in California.</li>
</ul>
<p>There are in fact many sources of bias in algorithms.
For an excellent taxonomy, see <a href="https://www.cmu.edu/dietrich/philosophy/docs/london/IJCAI17-AlgorithmicBias-Distrib.pdf">this paper</a>.
In some situations we even want bias, but this is something we must understand.
Concrete problems to improve fairness and reduce bias in AI systems include:</p>
<ul>
<li>improving data efficiency to better learn rare classes,</li>
<li>improving methodologies for collecting and labelling data to remove training data bias,</li>
<li>improving causal reasoning so we can remove an algorithm’s access to explanatory variables which we deem morally unusable (e.g. race).</li>
</ul>
<h2 id="honesty">Honesty</h2>
<p>Honesty requires algorithms to be transparent and interpretable.
We should expect algorithms to provide reasonable explanations for the decisions they make.
This is going to become a significant commercial concern in the EU when the <a href="https://en.wikipedia.org/wiki/General_Data_Protection_Regulation">new GDPR data laws</a> go into effect in 2018.
They grant users the right to a logical explanation of how an algorithm uses their personal data.
This is going to require advances in:</p>
<ul>
<li>saliency techniques which explain causality from input to output,</li>
<li>interpretability of black box models. Models must be able to explain their reasoning.
This may be by forward simulation of an internal state or by analysing human interpretable intermediate outputs.</li>
</ul>
<h1 id="will-we-be-accountable-for-the-ai-we-build">Will we be accountable for the AI we build?</h1>
<p>Perhaps of more immediate concern to AI researchers is if we will be accountable for the algorithms we build.
If a self-driving car fails to obey the road rules and kills pedestrians, is the computer vision engineer at fault?
Another interesting example is the venture capital firm Deep Knowledge Ventures, which <a href="http://www.bbc.com/news/technology-27426942">placed an AI on its board</a> of directors in 2014.
What happens if this AI fails in its responsibilities as company director?
Today, AI cannot be legally or criminally accountable as it has no legal entity and owns no assets.
This means liability lies with the manufacturer.</p>
<p>Unfortunately, it is unlikely we will be able to get insurance to cover liability of autonomous systems.
This is because there is a total lack of data on them.
Without these statistics, insurance firms are unable to estimate risk profiles.</p>
<p>Most professional bodies (law, medicine, engineering) regulate standards and values of their profession.
In my undergraduate engineering school we had compulsory courses on professional standards and ethics.
Why do most computer science courses not do the same?</p>
<p>Medical research trials involving humans have to pass several ethics committee processes before being admitted.
I don’t think we need the same approval before running a deep learning experiment, but I think that there needs to be more awareness from the tech world.
For example, in 2014 <a href="https://www.theguardian.com/technology/2014/jun/29/facebook-users-emotions-news-feeds">Facebook conducted human subject trials</a> which were widely condemned.
They wanted to see if they could modify human emotion by changing the types of content shown in their newsfeed.
Is this ethical? Did the 689,000 people involved willingly consent?
Why are tech companies exempt from the ethical procedures we place on other fields of research?</p>
<h1 id="is-ai-going-to-steal-our-jobs">Is AI going to steal our jobs?</h1>
<p>Going back to the original question I was asked in the pub, “will robots steal our jobs?” I think there are some real concerns here.</p>
<p>The AI revolution will be different to the industrial revolution. The industrial revolution lasted 100 years, which is 4 generations.
This was sufficient time for subsequent generations to be re-skilled in their jobs of the future.
The disruptive technology due to AI is likely to occur much more rapidly (perhaps a single generation).
We will need to re-skill within our lifetime, or else become jobless.</p>
<p>It is worth noting that in some situations, AI is going to increase employment.
For example, AI is drastically improving the efficiency of matching kidney donors to those with kidney disease, increasing the work for surgeons.
But these will certainly be isolated examples.
Disruptive AI technology is going to displace many of today’s jobs.</p>
<p>Hopefully automation will drastically reduce the cost of living.
Perhaps this will reduce the pressure to hold a job purely to earn money.
But it is well known that humans need a sense of self-worth to be happy.
Perhaps entertainment and education will be enough for many people?</p>
<p>For those for whom it is not, we need to shift to a new distribution of jobs, quickly. Here are some positive ideas I like:</p>
<ul>
<li>reactive retraining of those who have jobs displaced by automation. For example, Singapore has a <a href="http://www.skillsfuture.sg/">retraining fund</a> for people who have been replaced by automation,</li>
<li>proactive retraining, for example changing the way we teach accountants and radiologists today, because their jobs are being displaced,</li>
<li>allow automation to increase our leisure time,</li>
<li>redeploy labour into education and healthcare which require more human interaction than other fields,</li>
<li><a href="https://en.wikipedia.org/wiki/Massive_open_online_course">MOOCs</a> and other educational tools,</li>
<li>introducing a living wage or universal basic income.</li>
</ul>
<p>Eventually, perhaps more extreme regulation will be needed here to keep humans commercially competitive with robots.</p>
<hr />
<p><em>Acknowledgements: I’d like to thank Adrian Weller for first opening my eyes to these issues.
This blog was written while attending the International Joint Conference on Artificial Intelligence (IJCAI) 2017
where I presented a paper on <a href="https://www.ijcai.org/proceedings/2017/0661">autonomous vehicle safety</a>.
Thank you to the conference organisers for an excellent forum to discuss these topics.</em></p>
<p><em>Tags: Ethics in Algorithms, Trust, Fairness, Honesty</em></p>

<hr />
<h1 id="deep-learning-is-not-good-enough-we-need-bayesian-deep-learning-for-safe-ai">Deep Learning Is Not Good Enough, We Need Bayesian Deep Learning for Safe AI</h1>
<p><em>2017-05-23, <a href="https://zxcqlf.github.io/computer_vision/bayesian_deep_learning_for_safe_ai">https://zxcqlf.github.io/computer_vision/bayesian_deep_learning_for_safe_ai</a></em></p>
<p>Understanding what a model does not know is a critical part of many machine learning systems.
Unfortunately, today’s deep learning algorithms are usually unable to understand their uncertainty.
These models’ outputs are often trusted blindly and assumed to be accurate, which is not always the case.
For example, in two recent situations this has had disastrous consequences.</p>
<ol>
<li>
<p>In May 2016 we tragically experienced the first fatality from an assisted driving system.
According to the <a href="https://www.tesla.com/en_GB/blog/tragic-loss">manufacturer’s blog</a>,
“Neither Autopilot nor the driver noticed the white side of the tractor trailer against a brightly lit sky, so the brake was not applied.”</p>
</li>
<li>
<p>In July 2015, an image classification system erroneously identified two African American humans as gorillas, raising concerns of racial discrimination.
See the <a href="https://www.usatoday.com/story/tech/2015/07/01/google-apologizes-after-photos-identify-black-people-as-gorillas/29567465/">news report here</a>.</p>
</li>
</ol>
<p>And I’m sure there are many more interesting cases too!
If both these algorithms were able to assign a high level of uncertainty to their erroneous predictions, then each system may have been able to make better decisions and likely avoid disaster.</p>
<p>It is clear to me that understanding uncertainty is important. So why doesn’t everyone do it?
The main issue is that traditional machine learning approaches to understanding uncertainty, such as <a href="https://en.wikipedia.org/wiki/Gaussian_process">Gaussian processes</a>, do not scale to high dimensional inputs like images and videos.
To effectively understand this data, we need deep learning. But deep learning struggles to model uncertainty.</p>
<p>In this post I’m going to introduce a resurging field known as Bayesian deep learning (BDL), which provides a deep learning framework which can also model uncertainty.
BDL can achieve state-of-the-art results, while also understanding uncertainty.
I’m going to explain the different types of uncertainty and show how to model them.
Finally, I’ll discuss a recent result which shows how to use uncertainty to weight losses for multi-task deep learning.
The material for this blog post is mostly taken from my two recent papers:</p>
<ul>
<li>
<p><em>What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?</em> Alex Kendall and Yarin Gal, 2017. (<a href="https://arxiv.org/pdf/1703.04977.pdf">.pdf</a>)</p>
</li>
<li>
<p><em>Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics.</em> Alex Kendall, Yarin Gal and Roberto Cipolla, 2017. (<a href="https://arxiv.org/pdf/1705.07115.pdf">.pdf</a>)</p>
</li>
</ul>
<p>And, as always, more technical details can be found there!</p>
<figure class="third ">
<img src="https://zxcqlf.github.io/assets/images/blog_uncertainty/left_stereo.jpg" alt="Input Image" />
<img src="https://zxcqlf.github.io/assets/images/blog_uncertainty/disparity.png" alt="Depth Estimation" />
<img src="https://zxcqlf.github.io/assets/images/blog_uncertainty/uncertainty.png" alt="Uncertainty" />
<figcaption>An example of why it is really important to understand uncertainty for depth estimation. The first image is an example input into a Bayesian neural network which estimates depth, as shown by the second image. The third image shows the estimated uncertainty. You can see the model predicts the wrong depth on difficult surfaces, such as the red car’s reflective and transparent windows. Thankfully, the Bayesian deep learning model is also aware it is wrong and exhibits increased uncertainty.
</figcaption>
</figure>
<h1 id="types-of-uncertainty">Types of uncertainty</h1>
<p>The first question I’d like to address is what is uncertainty?
There are actually different types of uncertainty and we need to understand which types are required for different applications.
I’m going to discuss the two most important types – epistemic and aleatoric uncertainty.</p>
<h2 id="epistemic-uncertainty">Epistemic uncertainty</h2>
<p><em>Epistemic uncertainty</em> captures our ignorance about which model generated our collected data.
This uncertainty can be explained away given enough data, and is often referred to as <em>model uncertainty</em>.
Epistemic uncertainty is really important to model for:</p>
<ul>
<li>Safety-critical applications, because epistemic uncertainty is required to understand examples which are different from training data,</li>
<li>Small datasets where the training data is sparse.</li>
</ul>
<h2 id="aleatoric-uncertainty">Aleatoric uncertainty</h2>
<p><em>Aleatoric uncertainty</em> captures our uncertainty with respect to information which our data cannot explain.
For example, aleatoric uncertainty in images can be attributed to occlusions (because cameras can’t see through objects) or lack of visual features or over-exposed regions of an image, etc.
It cannot be explained away with more data, but it could be with the ability to observe all explanatory variables with increasing precision.
Aleatoric uncertainty is very important to model for:</p>
<ul>
<li>Large data situations, where epistemic uncertainty is mostly explained away,</li>
<li>Real-time applications, because we can form aleatoric models as a deterministic function of the input data, without expensive Monte Carlo sampling.</li>
</ul>
<p>We can actually divide aleatoric into two further sub-categories:</p>
<ul>
<li>Data-dependent or heteroscedastic uncertainty is aleatoric uncertainty which depends on the input data and is predicted as a model output.</li>
<li>Task-dependent or homoscedastic uncertainty is aleatoric uncertainty which does not depend on the input data.
It is not a model output; rather, it is a quantity which stays constant for all input data and varies between different tasks.
Later in the post I’m going to show how this is really useful for multi-task learning.</li>
</ul>
<figure>
<img src="https://zxcqlf.github.io/assets/images/blog_uncertainty/uncertainty_types.jpg" alt="Aleatoric and epistemic uncertainty for semantic segmentation" />
<figcaption>Illustrating the difference between aleatoric and epistemic uncertainty for semantic segmentation. You can notice that aleatoric uncertainty captures object boundaries where labels are noisy. The bottom row shows a failure case of the segmentation model, when the model is unfamiliar with the footpath, and the corresponding increased epistemic uncertainty.</figcaption>
</figure>
<p>Next, I’m going to show how to form models to capture this uncertainty using Bayesian deep learning.</p>
<h1 id="bayesian-deep-learning">Bayesian deep learning</h1>
<p>Bayesian deep learning is a field at the intersection between deep learning and Bayesian probability theory.
It offers principled uncertainty estimates from deep learning architectures.
These deep architectures can model complex tasks by leveraging the hierarchical representation power of deep learning, while also being able to infer complex multi-modal posterior distributions.
Bayesian deep learning models typically form uncertainty estimates by either placing distributions over model weights, or by learning a direct mapping to probabilistic outputs.
In this section I’m going to briefly discuss how we can model both epistemic and aleatoric uncertainty using Bayesian deep learning models.</p>
<p>Firstly, we can model heteroscedastic aleatoric uncertainty just by changing our loss functions.
Because this uncertainty is a function of the input data, we can learn to predict it using a deterministic mapping from inputs to model outputs.
For regression tasks, we typically train with something like a Euclidean/L2 loss:</p>
\[\begin{align}
Loss = \| y - \hat{y} \|_2
\end{align}\]
<p>To learn a heteroscedastic uncertainty model, we can simply replace the loss function with the following:</p>
\[\begin{align}
Loss = \frac{\| y - \hat{y} \|_2^2}{2 \sigma^2} + \frac{1}{2} \log \sigma^2
\end{align}\]
<p>where the model predicts a mean \(\hat{y}\) and variance \(\sigma^2\).
As you can see from this equation, if the model predicts something very wrong, then it will be encouraged to attenuate the residual term, by increasing uncertainty \(\sigma^2\).
However, the \(\log \sigma^2\) prevents the uncertainty term growing infinitely large. This can be thought of as learned loss attenuation.</p>
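As a sketch, this loss takes only a few lines if the model predicts \(s = \log \sigma^2\) rather than \(\sigma^2\) directly (a common numerical-stability trick; the parameterisation is a choice, not prescribed by the equation above):

```python
import numpy as np

def heteroscedastic_loss(y, y_hat, s):
    """Regression loss with learned attenuation.

    y, y_hat: targets and predictions; s: predicted log variance,
    so exp(-s) plays the role of 1 / sigma^2.
    """
    return np.mean(0.5 * np.exp(-s) * (y - y_hat) ** 2 + 0.5 * s)
```

For a fixed residual \(r\), minimising over \(s\) gives \(s^* = \log r^2\), i.e. the learned variance matches the squared error, which is the attenuation behaviour described above.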
<p>Homoscedastic aleatoric uncertainty can be modelled in a similar way; however, the uncertainty parameter will no longer be a model output, but a free parameter we optimise.</p>
<p>On the other hand, epistemic uncertainty is much harder to model. It requires us to place distributions over models and their parameters, which is difficult to achieve at scale.
A popular technique to model this is <a href="http://proceedings.mlr.press/v48/gal16.pdf">Monte Carlo dropout sampling</a> which places a Bernoulli distribution over the network’s weights.</p>
<p>In practice, this means we can train a model with dropout. Then, at test time, rather than performing model averaging, we can stochastically sample from the network with different random dropout masks.
The statistics of this distribution of outputs will reflect the model’s epistemic uncertainty.</p>
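A toy illustration of this procedure, using a single linear layer in place of a real network (names and sizes here are made up for the example):

```python
import numpy as np

def mc_dropout_predict(x, W, n_samples=200, p_drop=0.5, seed=0):
    """Epistemic uncertainty via test-time dropout on a toy linear layer.

    x: (N, D) inputs; W: (D, K) weights. Real networks apply dropout
    between layers and keep it active at test time, running many
    stochastic forward passes.
    """
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(n_samples):
        # inverted dropout: random Bernoulli mask, rescaled by keep prob
        mask = (rng.random(W.shape) >= p_drop) / (1.0 - p_drop)
        samples.append(x @ (W * mask))
    samples = np.stack(samples)
    # sample mean is the prediction; sample variance reflects
    # the model's epistemic uncertainty
    return samples.mean(axis=0), samples.var(axis=0)
```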
<p>In the previous section, I explained the properties that define aleatoric and epistemic uncertainty.
One of the exciting results in <a href="https://arxiv.org/pdf/1703.04977.pdf">our paper</a> was that we could show that this formulation gives results which satisfy these properties.
Here’s a quick summary of some results of a monocular depth regression model on two datasets:</p>
<table>
<thead>
<tr>
<th>Training Data</th>
<th>Testing Data</th>
<th>Aleatoric Variance</th>
<th>Epistemic Variance</th>
</tr>
</thead>
<tbody>
<tr>
<td>Trained on dataset #1</td>
<td>Tested on dataset #1</td>
<td>0.485</td>
<td>2.78</td>
</tr>
<tr>
<td>Trained on 25% dataset #1</td>
<td>Tested on dataset #1</td>
<td>0.506</td>
<td>7.73</td>
</tr>
<tr>
<td>Trained on dataset #1</td>
<td>Tested on dataset #2</td>
<td>0.461</td>
<td>4.87</td>
</tr>
<tr>
<td>Trained on 25% dataset #1</td>
<td>Tested on dataset #2</td>
<td>0.388</td>
<td>15.0</td>
</tr>
</tbody>
</table>
<p>These results show that when we train on less data, or test on data which is significantly different from the training set, then our epistemic uncertainty increases drastically.
However, our aleatoric uncertainty remains relatively constant – which it should – because it is tested on the same problem with the same sensor.</p>
<h1 id="uncertainty-for-multi-task-learning">Uncertainty for multi-task learning</h1>
<p>Next I’m going to discuss an interesting application of these ideas for multi-task learning.</p>
<p>Multi-task learning aims to improve learning efficiency and prediction accuracy by learning multiple objectives from a shared representation.
It is prevalent in many areas of machine learning, from NLP to speech recognition to computer vision.
Multi-task learning is of crucial importance in systems where long computation run-time is prohibitive, such as the ones used in robotics.
Combining all tasks into a single model reduces computation and allows these systems to run in real-time.</p>
<p>Most multi-task models train on different tasks using a weighted sum of the losses.
However, the performance of these models is strongly dependent on the relative weighting between each task’s loss.
Tuning these weights by hand is a difficult and expensive process, making multi-task learning prohibitive in practice.</p>
<p>In our <a href="https://arxiv.org/pdf/1705.07115.pdf">recent paper</a>, we propose to use homoscedastic uncertainty to weight the losses in multi-task learning models.
Since homoscedastic uncertainty does not vary with input data, we can interpret it as task uncertainty.
This allows us to form a principled loss to simultaneously learn various tasks.</p>
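Ignoring task-specific constant factors, the combined objective has roughly the following form, with one trainable log-variance per task (a simplified sketch, not the paper's exact loss):

```python
import numpy as np

def multi_task_loss(task_losses, log_vars):
    """Weight per-task losses by learned homoscedastic uncertainty.

    task_losses: per-task loss values; log_vars: one trainable
    s_i = log(sigma_i^2) per task. Each term is exp(-s) * loss + s,
    so a noisy task is automatically down-weighted, while the +s
    term stops every weight collapsing to zero.
    """
    total = 0.0
    for loss, s in zip(task_losses, log_vars):
        total += np.exp(-s) * loss + s
    return total
```

In training, the `log_vars` would be optimised jointly with the network weights, so the relative task weightings are learned rather than tuned by hand.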
<p>We explore multi-task learning within the setting of visual scene understanding in computer vision.
Scene understanding algorithms must understand both the geometry and semantics of the scene at the same time.
This forms an interesting multi-task learning problem because scene understanding involves joint learning of various regression and classification tasks with different units and scales.
Perhaps surprisingly, we show our model can learn multi-task weightings and outperform separate models trained individually on each task.</p>
<figure>
<img src="https://zxcqlf.github.io/assets/images/blog_uncertainty/multitask.jpg" alt="Multi-task learning for depth and segmentation" />
<figcaption>Multi-task learning improves the smoothness and accuracy for depth perception because it learns a representation that uses cues from other tasks, such as segmentation (and vice versa).</figcaption>
</figure>
<h1 id="some-challenging-research-questions">Some challenging research questions</h1>
<p>Why doesn’t Bayesian deep learning power all of our A.I. systems today? I think it should, but there are a few really tough research questions remaining.
To conclude this blog I’m going to mention a few of them:</p>
<ul>
<li>The computational cost of current epistemic uncertainty techniques prevents these models from being deployed in real-time robotics applications.
Either increasing sample efficiency, or new methods which don’t rely on Monte Carlo inference, would be incredibly beneficial.</li>
<li>Benchmarks for Bayesian deep learning models. It is incredibly important to quantify improvement to rapidly develop models – look at what benchmarks like <a href="http://www.image-net.org/">ImageNet</a> have done for computer vision.
We need benchmark suites to measure the calibration of uncertainty in BDL models too.</li>
<li>Better inference techniques to capture multi-modal distributions. For example, see <a href="http://htmlpreview.github.io/?https://github.com/yaringal/HeteroscedasticDropoutUncertainty/blob/master/demos/heteroscedastic_dropout_reg.html">the demo Yarin set up here</a> which shows some multi-modal data that MC dropout inference fails to model.</li>
</ul>
<p><em>Tags: Bayesian Deep Learning, Computer Vision, Uncertainty</em></p>

<hr />
<h1 id="have-we-forgotten-about-geometry-in-computer-vision">Have We Forgotten about Geometry in Computer Vision?</h1>
<p><em>2017-04-18, <a href="https://zxcqlf.github.io/computer_vision/have_we_forgotten_about_geometry_in_computer_vision">https://zxcqlf.github.io/computer_vision/have_we_forgotten_about_geometry_in_computer_vision</a></em></p>
<p><em>Deep learning</em> has revolutionised computer vision.
Today, there are not many problems where the best performing solution is not based on an end-to-end deep learning model.
In particular, convolutional neural networks are popular as they tend to work fairly well out of the box.
However, these models are largely big black boxes. There are a lot of things we don’t understand about them.</p>
<p>Despite this, we are getting some very exciting results with deep learning.
Remarkably, researchers are able to claim a lot of <em>low-hanging fruit</em> with some data and 20 lines of code using a basic deep learning API.
While these results are benchmark-breaking, I think they are often naive and missing a principled understanding.</p>
<p>In this blog post I am going to argue that people often apply deep learning models naively to computer vision problems – and that we can do better.
I think a really good example is with some of my own work from the first year of my PhD.
<a href="http://mi.eng.cam.ac.uk/projects/relocalisation/">PoseNet</a> was an algorithm I developed for learning camera pose with deep learning.
This problem has been studied for decades in computer vision, and has some really nice surrounding theory.
However, as a naive first-year graduate student, I applied a deep learning model to learn the problem end-to-end and obtained some nice results,
although I completely ignored the theory of the problem.
At the end of the post I will describe some recent follow-on work which looks at this problem from a more theoretical, geometry-based approach and vastly improves performance.</p>
<p>I think we’re running out of low-hanging fruit, or problems we can solve with a simple high-level deep learning API.
Specifically, I think many of the next advances in computer vision with deep learning will come from insights to <em>geometry</em>.</p>
<h2 id="what-do-i-mean-by-geometry">What do I mean by geometry?</h2>
<p>In computer vision, geometry describes the structure and shape of the world.
Specifically, it concerns measures such as depth, volume, shape, pose, disparity, motion or optical flow.</p>
<p>The dominant reason why I believe geometry is important in vision models is that it defines the structure of the world, and we understand this structure (e.g. from the many <a href="http://www.robots.ox.ac.uk/~vgg/hzbook/">prominent textbooks</a>).
Consequently, there are a lot of complex relationships, such as depth and motion, which do not need to be learned from scratch with deep learning.
By building architectures which use this knowledge, we can ground them in reality and simplify the learning problem.
Some examples at the end of this blog show how we can use geometry to improve the performance of deep learning architectures.</p>
<p>The alternative paradigm is using semantic representations.
Semantic representations use a language to describe relationships in the world. For example, we might describe an object as a ‘cat’ or a ‘dog’.
But, I think geometry has two attractive characteristics over semantics:</p>
<ol>
<li>
<p>Geometry can be directly observed. We see the world’s geometry directly using vision.
At the most basic level, we can observe motion and depth directly from a video by following corresponding pixels between frames.
Some other interesting examples include observing shape from shading or depth from stereo disparity.
In contrast, semantic representations are often tied to a human language, with labels corresponding to a limited set of nouns, which can’t be directly observed.</p>
</li>
<li>
<p>Geometry is based on continuous quantities. For example, we can measure depth in metres or disparity in pixels.
In contrast, semantic representations are largely discretised quantities or binary labels.</p>
</li>
</ol>
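For instance, disparity in pixels converts to depth in metres through the standard pinhole stereo relation (variable names here are illustrative):

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Pinhole stereo relation: depth = focal * baseline / disparity.

    disparity in pixels, focal length in pixels, baseline in metres;
    returns depth in metres.
    """
    return focal_px * baseline_m / disparity_px
```

Because the mapping is continuous, small changes in measured disparity translate smoothly into changes in depth, which is exactly the property that makes these quantities convenient training signals.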
<p>Why are these properties important? One reason is that they are particularly useful for <em>unsupervised learning</em>.</p>
<figure>
<img src="https://zxcqlf.github.io/assets/images/reconstruction_cambridge.jpg" alt="Structure from motion reconstruction of central Cambridge" />
<figcaption>A structure from motion reconstruction of the geometry around central Cambridge, UK - produced from my phone's video camera.</figcaption>
</figure>
<h2 id="unsupervised-learning">Unsupervised learning</h2>
<p>Unsupervised learning is an exciting area in artificial intelligence research which is about learning representation and structure without labelled data.
It is particularly exciting because getting large amounts of labelled training data is difficult and expensive.
Unsupervised learning offers a far more scalable framework.</p>
<p>We can use the two properties which I described above to form unsupervised learning models with geometry: observability and continuous representation.</p>
<p>For example, one of my <a href="https://arxiv.org/abs/1603.04992">favourite papers</a> last year showed how to use geometry to learn depth with unsupervised training.
I think this is a great example of how geometric theory and the properties described above can be combined to form an unsupervised learning model.
Other <a href="https://arxiv.org/abs/1505.01596">research papers</a> have also demonstrated similar ideas which use geometry for unsupervised learning from motion.</p>
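<p>The core idea behind these papers can be sketched in a few lines. A network predicts a disparity map; the right image is warped by that disparity to reconstruct the left image; and the photometric difference between the reconstruction and the real left image is the training loss. No depth labels are needed. The numpy sketch below uses nearest-neighbour sampling and synthetic images for illustration, whereas the actual papers use differentiable bilinear sampling inside a CNN:</p>

```python
import numpy as np

def photometric_loss(left, right, disparity):
    """Warp the right image by the predicted disparity to reconstruct
    the left image, and return the L1 photometric error."""
    h, w = left.shape
    xs = np.tile(np.arange(w), (h, 1))                     # pixel x-coordinates
    src = np.clip(xs - np.round(disparity).astype(int), 0, w - 1)
    reconstruction = np.take_along_axis(right, src, axis=1)
    return np.mean(np.abs(left - reconstruction))

# A synthetic rectified pair: the right view is shifted by 2 pixels.
rng = np.random.default_rng(0)
left = rng.random((8, 16))
right = np.roll(left, -2, axis=1)
correct = np.full((8, 16), 2.0)
wrong = np.zeros((8, 16))
print(photometric_loss(left, right, correct) <
      photometric_loss(left, right, wrong))  # True: the images supervise themselves
```

<p>The disparity that best explains the observed images minimises this loss, so a network predicting it can be trained from stereo pairs alone.</p>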
<h2 id="arent-semantics-enough">Aren’t semantics enough?</h2>
<p>Semantics often steal a lot of the attention in computer vision – many highly-cited breakthroughs are from <a href="https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf">image classification</a> or <a href="https://arxiv.org/pdf/1411.4038.pdf">semantic segmentation</a>.</p>
<p>One problem with relying solely on semantics to design a representation of the world is that semantics are defined by humans.
It is essential for an AI system to understand semantics to form an interface with humanity.
However, because semantics are defined by humans, it is also likely that these representations aren’t optimal.
Learning directly from the observed geometry in the world might be more natural.</p>
<p>It is also understood that low-level geometry is what we use to learn to see as human infants.
According to the <a href="http://www.aoa.org/patients-and-public/good-vision-throughout-life/childrens-vision/infant-vision-birth-to-24-months-of-age?sso=y">American Optometric Association</a>,
we spend the first nine months of our lives learning to coordinate our eyes to focus and to perceive depth, colour and geometry.
It is not until around 12 months that we learn to recognise objects and semantics.
This suggests that a grounding in geometry is important for learning the basics of human vision.
I think we would do well to take these insights into our computer vision models.</p>
<figure>
<img src="https://zxcqlf.github.io/assets/images/segmentation.png" alt="SegNet semantic segmentation of a street scene" />
<figcaption>A machine's semantic view of the world (a.k.a. <a href="http://mi.eng.cam.ac.uk/projects/segnet/">SegNet</a>). Each colour represents a different semantic class - such as road, pedestrian, sign, etc.</figcaption>
</figure>
<h2 id="examples-of-geometry-in-my-recent-research">Examples of geometry in my recent research</h2>
<p>I’d like to conclude this blog post by giving two concrete examples of how we can use geometry in deep learning from my own research:</p>
<h3 id="learning-to-relocalise-with-posenet">Learning to relocalise with PoseNet</h3>
<p>In the introduction to this blog post I gave the example of <a href="http://mi.eng.cam.ac.uk/projects/relocalisation/">PoseNet</a> which is a monocular 6-DOF relocalisation algorithm.
It solves what is known as the <em><a href="https://en.wikipedia.org/wiki/Kidnapped_robot_problem">kidnapped robot problem</a></em>.</p>
<p>In the initial paper from <a href="http://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Kendall_PoseNet_A_Convolutional_ICCV_2015_paper.pdf">ICCV 2015</a>, we solved this by learning an end-to-end mapping from input image to 6-DOF camera pose.
This naively treats the problem as a black box.
At <a href="https://arxiv.org/pdf/1704.00390.pdf">CVPR</a> this year, we are presenting an update to this method which considers the geometry of the problem.
In particular, rather than learning camera position and orientation values as separate regression targets, we learn them together using the geometric reprojection error.
This accounts for the geometry of the world and gives significantly improved results.</p>
<p><img src="https://zxcqlf.github.io/assets/images/research/localisation.jpg" alt="image-center" class="align-center" /></p>
<h3 id="predicting-depth-with-stereo-vision">Predicting depth with stereo vision</h3>
<p>The second example is in stereo vision – estimating depth from binocular vision.
I had the chance to work on this problem while spending a fantastic summer with <a href="https://www.skydio.com">Skydio</a>, working on the most <a href="https://www.technologyreview.com/s/604009/ai-powered-drone-will-follow-you-around-and-take-pictures/">advanced drones in the world</a>.</p>
<p>Stereo algorithms typically estimate the difference in the horizontal position of an object between a rectified pair of stereo images.
This is known as disparity, which is inversely proportional to the scene depth at the corresponding pixel location.
So, essentially it can be reduced to a matching problem - find the correspondences between objects in your left and right image and you can compute depth.</p>
<p>The top-performing stereo algorithms predominantly use deep learning, but only for building features for matching.
The matching and regularisation steps required to produce depth estimates are largely still done by hand.</p>
<p>We proposed the architecture <a href="https://arxiv.org/pdf/1703.04309.pdf">GC-Net</a> which instead looks at the problem’s fundamental geometry.
It is well known in stereo that we can estimate disparity by forming a cost volume across the 1-D disparity line.
The novelty in this paper was showing how to formulate the geometry of the cost volume in a differentiable way as a regression model.
More details can be found in the paper <a href="https://arxiv.org/pdf/1703.04309.pdf">here</a>.</p>
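<p>The key differentiable ingredient is what the paper calls the soft argmin: instead of picking the minimum-cost disparity (a non-differentiable argmin), the costs along the disparity axis are converted into a probability distribution and the expected disparity is taken. Below is a minimal numpy sketch of that operation alone; the full model builds the cost volume from learned features and regularises it with 3-D convolutions:</p>

```python
import numpy as np

def soft_argmin(cost_volume):
    """Differentiable disparity regression over a (D, H, W) cost volume:
    softmax over negated costs, then the expected disparity value."""
    shifted = cost_volume - cost_volume.min(axis=0, keepdims=True)  # numerical stability
    probs = np.exp(-shifted)
    probs /= probs.sum(axis=0, keepdims=True)
    disparities = np.arange(cost_volume.shape[0]).reshape(-1, 1, 1)
    return (probs * disparities).sum(axis=0)    # sub-pixel estimates, shape (H, W)

# Toy cost volume with an unambiguous minimum at disparity 3.
cost = np.full((8, 2, 2), 10.0)
cost[3] = 0.0
print(soft_argmin(cost))  # every entry close to 3.0
```

<p>Because the output is an expectation over a smooth distribution, it is differentiable with respect to every cost and can land between integer disparities, which is what enables end-to-end training with sub-pixel accuracy.</p>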
<figure>
<img src="https://zxcqlf.github.io/assets/images/research/deep_stereo.png" alt="Overview of the GC-Net stereo architecture" />
<figcaption>An overview of the <a href="https://arxiv.org/pdf/1703.04309.pdf">GC-Net architecture</a> which uses an explicit representation of geometry to predict stereo depth.</figcaption>
</figure>
<h2 id="conclusions">Conclusions</h2>
<p>I think the key messages to take away from this post are:</p>
<ul>
<li>it is worth understanding classical approaches to computer vision problems (especially if you come from a machine learning or data science background),</li>
<li>learning complicated representations with deep learning is easier and more effective if the architecture can be structured to leverage the geometric properties of the problem.</li>
</ul>