Michal Rolínek

RaMBO: Ranking Metric Blackbox Optimization

2020-04-08T00:00:00-07:00

Our CVPR 2020 paper (accepted as an oral) applies the blackbox differentiation theory (codename #blackboxbackprop) to optimizing rank-based metrics. As it turns out, it is all possible with a few simple twists…

In our newest paper [1] we deal with training deep neural networks by directly **optimizing rank-based metrics**. Our approach builds on the blackbox-backprop theory introduced in [2]. In the blackbox-backprop paper (see **#blackboxbackprop** on Twitter for the latest updates and the accompanying blogpost), we showed how to calculate “useful” gradients through combinatorial solvers in neural networks without hurting the optimality of the solvers themselves. The theory enables us to use **combinatorial solvers as plug-and-play modules** within complicated models, which we can train with the standard backpropagation algorithm.

In search of a practical application of the theory, we turn to computer vision. Concretely, we show that applying blackbox-backprop to **optimize recall and Average Precision** on retrieval and detection benchmarks consistently improves the performance of the underlying architectures. This is, by the way, a common theme in ML: it is always desirable to optimize for what you actually care about. If recall@K is the right measure of performance, then it makes sense to make the end-to-end architecture optimize exactly that and not some approximation of it. Both recall (more concretely, recall@K) and Average Precision are metrics based on rankings of the inputs, which essentially require sorting their scores. Here, multiple things pose a challenge. For one, using these metrics as loss functions results in non-decomposable losses (i.e. we cannot reliably estimate the loss from a subset of the inputs; we need the whole set). Additionally, the ranking operation that is used to calculate the metrics is non-differentiable.

Although many competing approaches have been proposed, practitioners have not adopted them, for different reasons: they are either computationally too expensive or lack practical implementations that are easy to use. With the blackbox-backprop theory, we apply the sorting operation directly to the output scores, which results in a low computational complexity of O(n log n) (the complexity of general sorting with **torch.argsort**). The questions that we need to answer are the following:

How to cast the ranking problem into the blackbox differentiation framework?
How to address the non-decomposability of the losses?
How to prevent rank-based losses from collapsing?

Fitting Ranking into the Blackbox Differentiation Framework

In order to cast ranking into the framework proposed in [2], we need to express it as an argmin over a dot product. We first define the vector of scores, y. The ranking of y, rk(y), is the result of an argmin of the dot product between the vector **y** and a permutation **π**, taken over the set of all permutations of {1, …, n}:

$$\mathbf{rk}(y) = \arg\min_{\pi \in \Pi_n} \; y \cdot \pi$$

The proof of this proposition is simple and based on the well-known permutation inequality, which states that given a decreasing sequence y_1 ≥ y_2 ≥ … ≥ y_n (the sorted vector y), for any integer n and any permutation π of {1, …, n}:

$$y_1 \cdot 1 + y_2 \cdot 2 + \dots + y_n \cdot n \;\le\; y_1 \, \pi(1) + y_2 \, \pi(2) + \dots + y_n \, \pi(n)$$
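The proposition can be checked by brute force on a small example. The following is a minimal sketch in plain Python, not the implementation from the paper; the helper names `rank` and `rank_by_argmin` are made up for illustration, and the convention that rank 1 goes to the largest score is taken from the inequality above.

```python
from itertools import permutations

def rank(scores):
    """Return rk(y) via sorting: rank 1 for the largest score, rank n for the smallest."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])  # O(n log n)
    ranks = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks

def rank_by_argmin(scores):
    """Brute-force argmin of the dot product y . pi over all permutations pi of 1..n."""
    n = len(scores)
    best = min(permutations(range(1, n + 1)),
               key=lambda pi: sum(y * p for y, p in zip(scores, pi)))
    return list(best)

y = [0.3, 2.0, -1.0, 0.7]
print(rank(y))            # [3, 1, 4, 2]
print(rank_by_argmin(y))  # [3, 1, 4, 2]
```

The brute-force version enumerates all n! permutations and is only there to confirm that the cheap sorting-based version computes the same argmin.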

Intuitively, the inequality says that the dot product is smallest when the least weight is put on the biggest scores, which happens exactly when the permutation is a sorting permutation. With this simple twist, we are able to apply the blackbox framework to the ranking problem. This means that we can simply use an efficient implementation of a fast sorting algorithm (torch.argsort in PyTorch, for instance) to calculate the ranking and **differentiate through it based on the blackbox-backprop theory**. An example of the optimization landscape resulting from applying the blackbox-backprop theory can be seen in the following figure:

Score Margin to Prevent Loss Collapse

Rank-based losses have a difficult time dealing with ties. To illustrate their instability, think of the situation where all scores in the dataset are tied. Every possible ranking can be obtained in a small neighborhood of such a tie, since the smallest change to the scores changes the ranking completely. This makes rank-based losses very unstable. We alleviate the problem by **introducing a margin α** that induces a positive shift on the negatively labeled scores and a negative shift on the positively labeled scores:

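$$\overleftrightarrow{y}_i = \begin{cases} y_i + \dfrac{\alpha}{2} & \text{if } y^*_i = 0 \\[4pt] y_i - \dfrac{\alpha}{2} & \text{if } y^*_i = 1 \end{cases}$$

where y* denotes the vector of binary labels.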
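As a small illustration of the margin, the sketch below shifts a fully tied set of scores. It assumes that the margin is split evenly, α/2 per side, and that labels are 1 for positives and 0 for negatives; the function name `apply_margin` is hypothetical.

```python
def apply_margin(scores, labels, alpha):
    """Shift negatives up and positives down by alpha / 2 each (assumed even split)."""
    return [s - alpha / 2 if label == 1 else s + alpha / 2
            for s, label in zip(scores, labels)]

scores = [1.0, 1.0, 1.0, 1.0]   # a complete tie
labels = [1, 0, 1, 0]
print(apply_margin(scores, labels, alpha=0.5))  # [0.75, 1.25, 0.75, 1.25]
```

After the shift, the tie is resolved in the least favorable way: every negative is ranked above every positive, so the loss only becomes small once the positives beat the negatives by at least α.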

Score Memory for Better Estimates

Because rank-based losses are non-decomposable, we would ideally optimize a dataset-wide loss. This is computationally intractable (it is limited by GPU memory, for example), so we want to train our models with minibatches. We allow for this by extending the scores of the current batch with the scores of a certain number of previous batches, which reduces the bias of the loss estimate.
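One way to picture the score memory is a fixed-length queue of past batches. The class below is a hedged sketch with a hypothetical name and interface (`ScoreMemory`, `extend`), not the paper's code. In a deep learning framework, the remembered scores would presumably be stored as constants, so that gradients flow only through the current batch.

```python
from collections import deque

class ScoreMemory:
    """Keeps the scores and labels of the last `num_batches` batches."""

    def __init__(self, num_batches):
        self.batches = deque(maxlen=num_batches)  # oldest batch is evicted first

    def extend(self, scores, labels):
        """Return the current scores/labels followed by the remembered ones."""
        all_scores, all_labels = list(scores), list(labels)
        for old_scores, old_labels in self.batches:
            all_scores.extend(old_scores)
            all_labels.extend(old_labels)
        self.batches.append((list(scores), list(labels)))
        return all_scores, all_labels

memory = ScoreMemory(num_batches=2)
memory.extend([0.9, 0.1], [1, 0])
memory.extend([0.8, 0.2], [1, 0])
scores, labels = memory.extend([0.7, 0.3], [1, 0])
print(scores)  # [0.7, 0.3, 0.9, 0.1, 0.8, 0.2]
```

The ranking and the loss are then computed on the extended score vector instead of on the current batch alone.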

The Algorithm

Combining the techniques above yields a method that we call Ranking Metric Blackbox Optimization (RaMBO). Again, the only additional computational overhead comes from the **sorting operation**, with complexity O(n log n), which is blazing fast when implemented efficiently. This puts us ahead of most approaches out there. The algorithm is summarized in the following listing:
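The forward and backward pass of the ranking module can be sketched in a few lines of plain Python. This is an illustrative sketch under the blackbox-backprop recipe from [2] (rank the scores, rank the perturbed scores, take the scaled difference), not the authors' listing; the names `rank` and `rambo_backward` and the toy loss are assumptions. The margin and the score memory would be applied to the scores before ranking.

```python
def rank(scores):
    """Forward pass: rank 1 for the largest score, computed by sorting."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = float(position)
    return ranks

def rambo_backward(scores, ranks, grad_ranks, lam):
    """Backward pass: gradient of the loss w.r.t. the scores.

    grad_ranks is dL/d(rank); lam is the interpolation hyperparameter lambda.
    """
    perturbed = [y + lam * g for y, g in zip(scores, grad_ranks)]
    ranks_lam = rank(perturbed)  # one extra sort
    # -(1 / lam) * (rk - rk_lam), written as (rk_lam - rk) / lam
    return [(r_lam - r) / lam for r, r_lam in zip(ranks, ranks_lam)]

# Toy loss (hypothetical): sum of the ranks of the positive items, so dL/d(rank) = label.
scores = [0.2, 0.9, 0.5]
labels = [1.0, 0.0, 0.0]
ranks = rank(scores)                                      # [3.0, 1.0, 2.0]
grad = rambo_backward(scores, ranks, labels, lam=0.5)
print(grad)                                               # [-2.0, 0.0, 2.0]
```

A gradient descent step subtracts this gradient from the scores, which raises the score of the positive item and lowers the score of the negative item it needs to overtake.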

We evaluated the performance of our approach on object detection (Pascal VOC) and several image retrieval benchmarks (CUB-200-2011, In-shop Clothes, Stanford Online Products). In each experiment, we take well-performing architectures and modify them with RaMBO. The method is on par with or beats the state-of-the-art results on these benchmarks.

Nevertheless, we should acknowledge that metric learning baselines are a mess, as a result of a fast-growing and enormous body of research in different directions (improving network architectures, better objective functions and so forth). This leads to irreproducible results, wrong conclusions and unfair comparisons. A great analysis of these difficulties can be found here.

Stanford Products dataset for retrieval.

References

[1] Rolínek, Michal et al. Optimizing Rank-based Metrics with Blackbox Differentiation, CVPR 2020

[2] Vlastelica, Marin et al. Differentiation of Blackbox Combinatorial Solvers, ICLR 2020

[3] Images taken from Pixabay

Acknowledgments

This is joint work from the Autonomous Learning Group of the Max Planck Institute for Intelligent Systems, Tuebingen, Germany, the University of Tuebingen and Universita degli Studi di Firenze, Italy.

The Fusion of Deep Learning and Combinatorics

2020-01-17T00:00:00-08:00

(authored by Marin Vlastelica)

Breakthrough: how we can seamlessly incorporate combinatorial solvers into deep neural networks. A summary of our ICLR 2020 spotlight paper.

The current landscape of machine learning research suggests that modern methods based on deep learning are at odds with good old-fashioned AI methods. Deep learning has proven to be a very powerful tool for feature extraction in various domains, such as computer vision, reinforcement learning, optimal control, natural language processing and so forth. Unfortunately, deep learning has an Achilles heel, the fact that it cannot deal with problems that require combinatorial generalization. An example is learning to predict quickest routes in Google Maps based on map input as an image, an instance of the Shortest Path Problem. A plethora of such problems exists like (Min,Max)-Cut, Min-Cost Perfect Matching, Travelling Salesman, Graph Matching and more.

But if such combinatorial problems are to be solved in isolation, we have an amazing toolbox of solvers available, ranging from efficient C implementations of algorithms to more general MIP (Mixed Integer Programming) solvers such as Gurobi. The problems that the solvers face are in the representation of the input space since the solvers require well-defined, structured input.

Although combinatorial problems have been a subject in the machine learning research community, the attention towards solving such problems has been lacking. This doesn’t mean that the problem of combinatorial generalization has not been identified as a crucial challenge on the path to intelligent systems. Ideally, one would be able to combine the rich feature extraction available through powerful function approximators, such as neural networks, with the efficient combinatorial solvers in an end-to-end manner without any compromises. This is exactly what we were able to achieve in our recent paper [1] for which we have received top review scores and are giving a spotlight talk at ICLR 2020.

For the following sections, it’s worth keeping in mind that we are not trying to improve the solvers themselves, but rather enable usage of existing solvers in synergy with function approximation.

We imagine a blackbox solver as an architectural module for deep learning that we can simply plug in.

Gradients of Blackbox Solvers

We think of a combinatorial solver as a mapping from a continuous input (e.g. weights of graph edges) to a discrete output (e.g. shortest path, selected graph edges), defined as

$$\omega \in W \subseteq \mathbb{R}^N \;\mapsto\; y(\omega) \in Y$$

The solver minimizes some kind of cost function c(ω,y), for instance, the length of the path. More concretely, the solver solves the following optimization problem:

$$y(\omega) = \arg\min_{y \in Y} \; c(\omega, y)$$

Now, imagine that ω is the output of a neural network, i.e. is some kind of representation that we learn. Intuitively, what does this ω mean? ω serves the purpose of defining the instance of the combinatorial problem. As an example, ω can be a certain vector that defines the edge weights of a graph. In this case, the solver can solve the Shortest Path Problem or the Travelling Salesman, or whichever problem we want to be solved for the specified edge costs. The thing that we want to achieve is the correct problem specification through ω.

Naturally, we want to optimize our representation such that it minimizes the loss which is a function of the output of the solver L(y). The problem that we are facing right away is the fact that the loss function is piecewise-constant, meaning the gradient of this function with respect to the representation ω is 0 almost everywhere and undefined on the jumps of the loss function. Put more bluntly, the gradient as-is is useless for minimizing the loss function.
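The zero-gradient problem is easy to reproduce on a toy instance. The sketch below uses a made-up "solver" that chooses the cheaper of two candidate paths, each given as an edge-indicator vector, and a Hamming loss against a target path. All names and numbers here are hypothetical and only serve as an illustration.

```python
# Two candidate "paths" over three edges, as edge-indicator vectors (toy instance).
CANDIDATES = [(1, 1, 0), (0, 0, 1)]

def solver(w):
    """Return the candidate with the smallest linear cost w . y."""
    return min(CANDIDATES, key=lambda y: sum(wi * yi for wi, yi in zip(w, y)))

def hamming_loss(y, target):
    return sum(a != b for a, b in zip(y, target))

target = (0, 0, 1)
w = [1.0, 1.0, 3.0]                      # path costs: 2.0 vs 3.0, the solver picks (1, 1, 0)
base = hamming_loss(solver(w), target)   # 3

# Finite differences: a small change of any edge weight does not change the loss.
eps = 1e-3
finite_diffs = []
for i in range(len(w)):
    w_eps = list(w)
    w_eps[i] += eps
    finite_diffs.append((hamming_loss(solver(w_eps), target) - base) / eps)
print(finite_diffs)  # [0.0, 0.0, 0.0]

# A large enough change makes the loss jump instead.
print(hamming_loss(solver([1.0, 1.0, 1.5]), target))  # 0
```

The loss is flat around w and drops abruptly from 3 to 0 once the third edge becomes cheap enough, which is exactly the piecewise-constant behavior described above.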

Until now, existing approaches have relied on solver relaxations, which sacrifice the optimality of the solver. In comparison, we have developed a method that does not compromise the optimality of the solver. We achieve this by defining a piecewise affine interpolation of the original objective function, where the interpolation itself is controlled by a hyperparameter λ, as shown in the following figure:

As we can see, f (black) is piecewise constant. Our interpolation (orange) connects the plateaus in a reasonable fashion. For instance, notice that the minimum is not changed.

The domain of f is, of course, multi-dimensional. As such, we can observe the set of inputs ω for which f obtains the same value as a polytope. Naturally, there are many such polytopes in the domain of f. What the hyperparameter λ effectively does is that it shifts the polytopes through the perturbation of the input of the solver, ω. The g interpolators that define the piecewise-affine objective connect the shifted boundary of the polytope to the original boundary. Such a situation is depicted in the lower figure, where the boundary of the polytope that obtains the value f(y2) is shifted to obtain the value of f(y1). This also intuitively explains why higher values of λ are preferable. The shifts have to be big enough to obtain the interpolator g which is going to provide us with an informative gradient. Proofs can be found in [1].

First, let us define a solution to the perturbed optimization problem, where the perturbation is controlled by the hyperparameter λ:

$$y_\lambda(\omega) = \arg\min_{y \in Y} \; \big\{ c(\omega, y) + \lambda f(y) \big\}$$

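If we assume that the cost function c(ω,y) is a dot product between y and ω, we can define the interpolated objective as follows:

$$f_\lambda(\omega) = f\big(y_\lambda(\omega)\big) - \frac{1}{\lambda} \Big[ c\big(\omega, y(\omega)\big) - c\big(\omega, y_\lambda(\omega)\big) \Big]$$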

Note that the linearity of the cost function is not as restrictive as it might seem at first glance. All problems involving edge selection, where the cost is the sum of the edge weights fall into this category. The Shortest Path Problem (SPP) and Travelling Salesman Problem (TSP) are examples that belong to this category of problems.
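For the linear cost c(ω,y) = ω · y, the gradient of the interpolation can be worked out in a few lines. The following derivation is a sketch that follows the definitions used in [1], with f denoting the loss as a function of the solver output and ŷ = y(ω) the current solution; see the paper for the precise statements and proofs.

```latex
% Interpolated objective for the linear cost c(\omega, y) = \omega \cdot y
f_\lambda(\omega) = f\big(y_\lambda(\omega)\big)
  - \frac{1}{\lambda}\Big[\omega \cdot y(\omega) - \omega \cdot y_\lambda(\omega)\Big]

% y(\omega) and y_\lambda(\omega) are piecewise constant in \omega,
% so wherever the gradient exists only the explicit \omega terms contribute:
\nabla_\omega f_\lambda(\omega) = -\frac{1}{\lambda}\Big[y(\omega) - y_\lambda(\omega)\Big]

% Linearizing f around \hat{y} = y(\omega) turns the perturbed problem
% into an ordinary solver call on a shifted input:
y_\lambda(\omega)
  = \arg\min_{y \in Y}\Big\{\omega \cdot y + \lambda\, \nabla f(\hat{y}) \cdot y\Big\}
  = y\big(\omega + \lambda\, \nabla f(\hat{y})\big)
```

In other words, the gradient only requires one additional call to the unmodified solver, on the input ω shifted by λ times the incoming gradient.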

The Algorithm

With our method, we were able to remove the rift between classical combinatorial solvers and deep learning with a simple modification to the backward pass to calculate the gradient.

The computational overhead of calculating the gradient of the interpolation depends on the solver: the solver is called once on the forward pass and once more on the backward pass.
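A minimal, framework-free sketch of the two passes is shown below. It treats the solver as an opaque function, assumes a linear cost, and uses a made-up toy solver that picks the cheaper of two candidate paths; the names `blackbox_forward` and `blackbox_backward` are hypothetical and this is not the reference implementation.

```python
def blackbox_forward(solver, w):
    """Forward pass: one call to the unmodified solver."""
    return solver(w)

def blackbox_backward(solver, w, y, grad_y, lam):
    """Backward pass: one more solver call on the perturbed input.

    grad_y is dL/dy, the gradient of the loss w.r.t. the solver output.
    Returns -(1 / lam) * (y - y_lam), written as (y_lam - y) / lam.
    """
    w_perturbed = [wi + lam * gi for wi, gi in zip(w, grad_y)]
    y_lam = solver(w_perturbed)
    return [(yl - yi) / lam for yi, yl in zip(y, y_lam)]

# Toy solver (hypothetical): the cheaper of two paths given as edge indicators.
CANDIDATES = [(1, 1, 0), (0, 0, 1)]

def toy_solver(w):
    return min(CANDIDATES, key=lambda y: sum(wi * yi for wi, yi in zip(w, y)))

w = [1.0, 1.0, 3.0]
target = (0, 0, 1)
y = blackbox_forward(toy_solver, w)          # (1, 1, 0)
grad_y = [1.0 - 2.0 * t for t in target]     # gradient of the Hamming loss w.r.t. y

print(blackbox_backward(toy_solver, w, y, grad_y, lam=1.0))  # [-1.0, -1.0, 1.0]
print(blackbox_backward(toy_solver, w, y, grad_y, lam=0.1))  # [0.0, 0.0, 0.0]
```

With λ = 1 the gradient makes the wrongly selected edges more expensive and the target edge cheaper. With λ = 0.1 the perturbation is too small to change the solver output, so the gradient is zero, which illustrates why λ must be large enough to be informative.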

Experiments

We have developed synthetic tasks that contain a certain level of combinatorial complexity to validate the method. In the following tasks, we have shown that our method is essential for combinatorial generalization since naive supervised learning approaches fail at generalizing to unseen data. Again, the goal is to learn the correct specification of the combinatorial problem.

For the Warcraft Shortest Path problem, the training set consists of Warcraft II maps and corresponding shortest paths on the maps as targets. The test set consists of unseen Warcraft II maps. The maps themselves encode a k × k grid. The maps are inputs to a convolutional neural network which outputs the vertex costs for the map that are fed to the solver. Finally, the solver, which is effectively Dijkstra’s shortest path algorithm, outputs the shortest path on the map in the form of an indicator matrix.
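To make the solver side concrete, here is a small sketch of Dijkstra's algorithm on a k × k grid of vertex costs that returns the path as an indicator matrix. Several details are assumptions made for illustration: the path runs from the top-left to the bottom-right cell, moves are allowed to all 8 neighbors, costs are positive, and the cost of a path is the sum of the costs of the visited cells. The function name is hypothetical.

```python
import heapq

def shortest_path_indicator(costs):
    """Dijkstra on a k x k grid of positive vertex costs (8-neighborhood, assumed).

    Returns a k x k 0/1 matrix marking the cells on the cheapest path
    from the top-left to the bottom-right cell.
    """
    k = len(costs)
    dist = [[float("inf")] * k for _ in range(k)]
    prev = {}
    dist[0][0] = costs[0][0]
    heap = [(costs[0][0], 0, 0)]
    while heap:
        d, r, c = heapq.heappop(heap)
        if d > dist[r][c]:
            continue  # stale heap entry
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (dr or dc) and 0 <= nr < k and 0 <= nc < k:
                    nd = d + costs[nr][nc]
                    if nd < dist[nr][nc]:
                        dist[nr][nc] = nd
                        prev[(nr, nc)] = (r, c)
                        heapq.heappush(heap, (nd, nr, nc))
    # Walk back from the goal to the start and mark the visited cells.
    path = [[0] * k for _ in range(k)]
    node = (k - 1, k - 1)
    while node != (0, 0):
        path[node[0]][node[1]] = 1
        node = prev[node]
    path[0][0] = 1
    return path

costs = [[1, 9, 9],
         [9, 1, 9],
         [9, 9, 1]]
print(shortest_path_indicator(costs))  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

During training, the cost matrix would come from the convolutional network, and the indicator matrix is what the loss compares against the target path.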

Naturally, at the beginning of training, the network doesn’t know how to assign correct costs to the tiles of the map, but with our method, we’re able to learn the correct tile costs and therefore the correct shortest path. The histogram plot shows how our method is able to generalize **significantly** better than traditional supervised training of the ResNet.

In the MNIST Min-Cost Perfect Matching problem, the goal is to output a min-cost perfect matching of a grid of MNIST digits. Concretely, in the min-cost perfect matching problem, we are supposed to select edges such that all vertices are contained in the selection exactly once and the sum of the edge costs is minimal. Each cell in the grid contains an MNIST digit that is a node in the graph having vertical and horizontal neighbors. The edge costs are determined by reading the two-digit number vertically downwards or horizontally to the right.

For this problem, a convolutional neural net (CNN) receives as input the image of the MNIST grid and outputs a grid of vertex costs that are transformed into edge costs. The edge formulation is then given to the Blossom V perfect matching solver.

The solver outputs an indicator vector of edges selected in the matching. The cost of the matching on the right is 348 (46 + 12 horizontally and 27 + 45 + 40 + 67 + 78 + 33 vertically).
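The cost of the example matching can be verified with a few lines of Python. The helper `edge_cost` is a hypothetical name that spells out the "two-digit number" rule described above.

```python
def edge_cost(first_digit, second_digit):
    """Two-digit number read left-to-right or top-to-bottom across an edge."""
    return 10 * first_digit + second_digit

horizontal = [46, 12]
vertical = [27, 45, 40, 67, 78, 33]
total = sum(horizontal) + sum(vertical)

print(edge_cost(4, 6))  # 46
print(total)            # 348
```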

Again, in the performance plot, we notice a clear advantage of embedding an actual perfect matching solver in the neural network.

We also looked at a formulation of the **Travelling Salesman Problem** where the network is supposed to output optimal TSP tours of country capitals. For this problem, it is important to learn the correct capital positions in the latent representation. Our dataset consisted of country flags (i.e. the raw representation) and optimal tours of the respective capitals. A training example consists of **k** countries. In this case, a convolutional neural network is shown a concatenation of the country flags and is supposed to output the optimal tour.


In the following animation, we can see the learned locations of the countries’ capitals on the globe during training time. In the beginning, the locations are scattered randomly, but after training the neural network not only learns to output the correct TSP tours, but also the correct representation, i.e. the correct 3D coordinates of the individual capitals. Notably, this follows from merely using the Hamming distance loss for supervision and a Mixed Integer Program in Gurobi on the outputs of the network.

Conclusion

We have shown that we can, in fact, propagate gradients through blackbox combinatorial solvers under certain assumptions about the cost function of the solver. This enables us to achieve combinatorial generalization of which standard neural network architectures are incapable based on traditional supervision.

We are in the process of showing that this method has a wide range of applications in tackling real-world problems that require combinatorial reasoning. We have already demonstrated one such application, addressing rank-based metric optimization [2]. The question remains, however, how far (in theory and in practice) we can move away from the linearity assumption on the solver’s cost. Another question for future work is whether we can learn the underlying constraints of the combinatorial problem, for example in a MIP formulation. The application spectrum of such approaches is broad and we welcome anyone who is willing to collaborate on extending this work to its full potential.

References

[1] Vlastelica, Marin, et al. “Differentiation of Blackbox Combinatorial Solvers” ICLR 2020.

[2] Rolínek, Michal, et al. “Optimizing Rank-based Metrics with Blackbox Differentiation” CVPR 2020 (oral).

Acknowledgment

This is joint work from the Autonomous Learning Group of the Max Planck Institute for Intelligent Systems and Universita degli Studi di Firenze, Italy.