Not so Great Ideas in Theoretical Computer Science

Estimating Transitive Closure via Sampling

mittheory — Tue, 04 Oct 2016 03:31:58 +0000

In this post, I describe an algorithm of Edith Cohen, which estimates the size of the transitive closure of a given directed graph in near-linear time. This simple but extremely clever algorithm uses ideas somewhat similar to the algorithm of Flajolet–Martin for estimating the number of distinct elements in a stream, and to the MinHash sketch of Broder¹.

Suppose we have a large directed graph with $n$ vertices and $m$ directed edges. For a vertex $v$, let us denote by $R_v$ the set of vertices that are reachable from $v$. There are two known ways to compute the sets $R_v$ (all at the same time):

Perform Depth-First Search (DFS) from each vertex. This takes time $O(nm)$, which is the best known bound for sparse graphs;
Use fast matrix multiplication, which takes time $O(n^{\omega})$, where $\omega < 2.38$ is the matrix multiplication exponent. This algorithm is better for dense graphs.

Can we do better? It turns out we can, if we are OK with merely approximating the size of every $R_v$. Namely, the following theorem was proved back in 1994:

Theorem 1. There exists a randomized algorithm that computes a $(1+\varepsilon)$-multiplicative approximation of $|R_v|$ for every vertex $v$, with running time $\varepsilon^{-2} \cdot m \cdot \mathrm{poly}(\log n)$.

Instead of spelling out the full proof, I will present it as a sequence of problems: each of them will likely be doable for a mathematically mature reader. Going through the problems should be fun, and besides, it will save me some typing.

Problem 1. Let $f \colon V \to [0,1]$ be a function that assigns independent uniformly random reals between 0 and 1 to every vertex. Let us define $g(v) = \min_{w \in R_v} f(w)$. Show how to compute the values of $g(v)$ for all vertices $v$ at once in time $m \cdot \mathrm{poly}(\log n)$.

Problem 2. For a positive integer $k$, denote by $U_k$ the distribution of the minimum of $k$ independent and uniform reals between 0 and 1. Suppose we receive several independent samples from $U_k$ with an unknown value of $k$. Show that we can obtain a $(1+\varepsilon)$-multiplicative approximation of $k$ with probability $1-\delta$ using as few as $O(\log(1/\delta)/\varepsilon^2)$ samples.

Problem 3. Combine the solutions for two previous problems and prove Theorem 1.
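Spoiler warning: for readers who want to check their answers against running code, here is a minimal Python sketch of one possible way the three problems fit together. It is an illustration under stated assumptions, not necessarily Cohen's exact algorithm: the function names, the trick of running pruned reverse searches in increasing order of rank, and the plain-average estimator `rounds / total - 1` (which relies on the minimum of $k$ uniform reals having expectation $1/(k+1)$) are all choices made here, and every vertex is counted as reachable from itself.

```python
import random
from collections import deque


def min_rank_reachable(n, edges, rank):
    """Return g, where g[v] is the minimum rank over vertices reachable from v.

    Vertices are visited in increasing order of rank. A reverse search from w
    labels every still-unlabelled vertex that can reach w, so each vertex and
    each edge is handled a constant number of times after the initial sort.
    """
    reverse_adj = [[] for _ in range(n)]
    for u, v in edges:
        reverse_adj[v].append(u)

    g = [None] * n
    for w in sorted(range(n), key=lambda x: rank[x]):
        if g[w] is not None:
            continue  # w already reaches a vertex of smaller rank
        g[w] = rank[w]
        queue = deque([w])
        while queue:
            x = queue.popleft()
            for y in reverse_adj[x]:
                if g[y] is None:
                    g[y] = rank[w]
                    queue.append(y)
    return g


def estimate_reach_sizes(n, edges, rounds=200, seed=0):
    """Estimate the number of vertices reachable from each vertex.

    Averages the minimum ranks over independent rounds and inverts the
    expectation 1 / (k + 1) of the minimum of k uniform reals.
    """
    rng = random.Random(seed)
    totals = [0.0] * n
    for _ in range(rounds):
        rank = [rng.random() for _ in range(n)]
        g = min_rank_reachable(n, edges, rank)
        for v in range(n):
            totals[v] += g[v]
    return [rounds / t - 1.0 for t in totals]


if __name__ == "__main__":
    path = [(0, 1), (1, 2), (2, 3)]  # 0 -> 1 -> 2 -> 3
    print(estimate_reach_sizes(4, path, rounds=2000))
```

On the path above the true sizes are 4, 3, 2 and 1. A plain average only gives a constant success probability; taking a median of several independent averages is one standard way to push the failure probability down to $\delta$.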

Footnotes

¹ These similarities explain my extreme enthusiasm towards the algorithm. Sketching-based techniques are useful for a problem covered in 6.006, yay!

—Ilya

Better Circuit Lower Bounds for Explicit Functions

mittheory — Tue, 20 Oct 2015 07:56:30 +0000

Below is a post by Ilya on a new circuit lower bound.

Let $f \colon \{0,1\}^n \to \{0,1\}$ be a Boolean function of $n$ arguments. What is the smallest circuit that can compute $f$ if the set of allowed gates consists of all the unary and binary functions? The size of a circuit is simply the number of gates in it.

It’s easy to show that a random function requires circuits of size $\Omega(2^n/n)$ (this is tight in the worst case), but a big question in computational complexity is to exhibit a simple enough function that requires large enough circuits to compute.
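The claim about random functions follows from a standard counting argument, sketched here with constants chosen for convenience rather than tightness. A circuit with $s$ gates is described by choosing, for each gate, one of the 16 binary functions (unary functions are degenerate cases) and two inputs among the $n$ variables and the $s$ gates.

```latex
\begin{align*}
\#\{\text{circuits with } s \text{ gates}\} &\le \bigl(16\,(n+s)^2\bigr)^{s}
  = 2^{\,s\,(4 + 2\log_2 (n+s))}, \\
\#\{\text{functions } \{0,1\}^n \to \{0,1\}\} &= 2^{2^n}. \\
\intertext{For $s = 2^n/(3n)$ and $n \ge 5$ we have $n + s \le 2^n$, hence}
s\,\bigl(4 + 2\log_2 (n+s)\bigr) &\le \frac{2^n}{3n}\,(4 + 2n)
  = 2^n\Bigl(\frac{2}{3} + \frac{4}{3n}\Bigr) < 2^n .
\end{align*}
```

So the fraction of functions computable with at most $2^n/(3n)$ gates is at most $2^{-2^n(1/3 - 4/(3n))}$, which tends to zero: almost every function needs more gates than that.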

Basically, if one comes up with a function that lies in $\mathrm{NP}$ and requires circuits of size super-polynomial in $n$, then $\mathrm{P} \ne \mathrm{NP}$, and one of the Millennium Problems is solved!

How far are we from accomplishing this? Well, until recently, the best lower bound for a function from NP was $3n - o(n)$ [Blum 1984] (remember that eventually we are aiming at super-polynomial lower bounds!).

I’m very excited to report that very recently a better lower bound has been proved by Magnus Gausdal Find, Sasha Golovnev, Edward Hirsch and Sasha Kulikov! Is it super-polynomial in $n$? Is it at least super-linear in $n$?

Well, not really. The new lower bound is $\left(3 + \frac{1}{86}\right)n - o(n)$! You may ask: why should one be excited about such a seemingly weak improvement (which, nevertheless, took the authors several years of hard work, as far as I know)?

One reason is that this is the first progress on one of the central questions in computational complexity in more than 30 years. In general, researchers like to invent new models and problems and tend to forget good old “core” questions very easily. I’m happy to see this not being the case here.

Besides, the new result is, in fact, much more general than, say, Blum’s. The authors prove that any function that is not constant on every sufficiently high-dimensional affine subspace requires large circuits. Then, they simply invoke the recent result of Ben-Sasson and Kopparty, who construct an explicit function with this property. That is, the paper shows a certain pseudo-randomness property to be enough for better lower bounds. Hopefully, there will be more developments in this direction.

Is this result one more step towards proving $\mathrm{P} \ne \mathrm{NP}$? Time will tell.

Purifying spoiled randomness with spoiled randomness

Henry Yuen — Sat, 15 Aug 2015 18:37:34 +0000

Recently a preprint was posted to ECCC, with the pulse-quickening title “Explicit Two-Source Extractors and Resilient Functions”. If the result is correct, then it really is (shall I say it) a breakthrough in theoretical computer science. One of my favorite things about TCS is how fresh it is. It seems like every other week, exciting results in all sorts of subareas are announced on arXiv or ECCC. I thought I’d use this as an opportunity to (briefly) explain what this breakthrough is about, for those who aren’t theoretical computer scientists.

Computer programs need sequences of random numbers all the time. Encryption, games, and scientific applications all need them. But these applications usually assume that they have access to completely unpredictable random number sequences that have no internal patterns or regularities.

But where in the world would we find such perfect, pristine sources of random numbers? The methods that people devise to generate sources of randomness are a fascinating subject in their own right, but long story short, it’s actually really, really hard to get our hands on such good random numbers. Pure randomness, as it turns out, is a valuable resource that is as rare as (perhaps even rarer than) gold or diamond.

On the other hand, it seems much easier to get access to weak randomness: number sequences that are pretty unpredictable, but that may contain subtle (or not-so-subtle) correlations or relations between the different numbers. For example, one way you might try to get random numbers is to take a digital thermometer that’s accurate to within, say, 2 decimal places, and always take the last digit of the temperature reading whenever you need a random number. Since the temperature of the air around the thermometer is always fluctuating, this will give some randomness, but it’s probably not perfectly random. If the temperature at this second is 74.32 F, in the next second, it’s probably more likely to be 74.34 F instead of 74.39 F. So there are patterns and correlations within these random numbers.

You might try to devise more elaborate ways to mix up the numbers, but it’s pretty difficult to actually get your scheme to produce perfect unpredictability. Like a pernicious disease, corrupting correlations and predictable patterns can run deep within a sequence of random digits, and removing these unwanted blemishes is a tough task confounded by the fact that you can never be sure if what you’re seeing is a pattern or merely an artifact of the randomness.

But randomness extractors do just that. They are careful surgeons that expertly cut out the malignant patterns, and leave fresh, pure unpredictability. More technically: randomness extractors are algorithms that are designed to read in weak random number sequences, and output much higher quality randomness; in fact, something that is very, very close to perfectly unpredictable.

Extractors are beautiful algorithms, powered by elegant mathematics and wonderful ideas that have taken decades to understand. And we’re still just beginning to understand them, as this breakthrough result shows.

For a long time now, we’ve known that extractors exist. Via nonconstructive mathematical arguments, we know these algorithms are out there, whose job it is to transform weak randomness into perfect randomness. The only problem was, for quite some time we didn’t know what these algorithms looked like! No one could actually write down a description of the algorithm!

It took a long while, but we finally learned how to construct a certain type of extractor algorithm — called a seeded extractor. Basically, what these algorithms do is to take in, say, 1000 weakly random numbers (for example, temperature readings), and then 10 perfectly random numbers (called the seed), and combine these two sources of numbers to produce 500 perfectly random numbers. So they give you a lot more perfectly unpredictable numbers than you started with.

That’s really nice, but where would you get these 10 perfectly random numbers from? There’s a chicken and the egg problem here (albeit with a smaller chicken and egg each time). Now what theoretical computer scientists have been deep in the hunt for are two-source extractors — the subject of this breakthrough work.

As you might guess, these are extractors that take two unrelated sources of weak randomness, and combine them together to produce ONE perfect random source. So you might imagine taking temperature readings, and then have another apparatus that reads in the direction of the wind, and combine these two weak sources of randomness, to get pristine randomness.
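To make "combine two weak sources" concrete, here is a minimal Python sketch of a much older and simpler two-source extractor: the inner product of the two bit strings modulo 2, a construction usually attributed to Chor and Goldreich. This is explicitly not the Chattopadhyay–Zuckerman construction, and it is only known to work when both sources are already fairly random (each with min-entropy rate above one half). The function name and the encoding of samples as Python integers are assumptions made for illustration.

```python
def inner_product_extractor(x: int, y: int) -> int:
    """Return one output bit from two samples, each encoded as an integer.

    The bit is the inner product modulo 2 of the two bit strings, that is,
    the parity of the number of positions where both samples have a 1.
    """
    return bin(x & y).count("1") % 2


if __name__ == "__main__":
    temperature_reading = 0b1011  # sample from the first weak source
    wind_reading = 0b0110         # sample from the second weak source
    print(inner_product_extractor(temperature_reading, wind_reading))
```

One sanity check on the design: for any fixed nonzero `x`, exactly half of all `y` of the same length give output 1, so if `y` were perfectly uniform the output bit would be perfectly unbiased. The hard part, and the subject of the research described here, is what happens when neither input is uniform.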

Again, we’ve known that two source extractors exist “in the wild” — but actual sightings of such magical beasts have been rare. Until recently, the best two source extractor we knew how to construct required that the two weak sources weren’t actually that weak — these weak sources needed to be incredibly random, possessing very very few correlations and patterns. Bourgain’s extractor, as it is known, takes these almost-perfect random sources and combines them to be perfect.

It might not sound like much, but Bourgain’s extractor was a tremendous achievement, involving some of the newest ideas from arithmetic combinatorics, a branch of mathematics. And for 10 years, this was the best known.

Until today. Chattopadhyay and Zuckerman have told us how to create a two-source extractor that works with sources that are quite weak. There can be a litany of cross-correlations and patterns and relations between the numbers, but as long as the random numbers have some unpredictability, their two-source extractor will crunch the numbers, clean up the patterns, clean up the correlations, and produce something that’s pristine, perfect, and unpredictable.

It’s not the end of the story, though. Their algorithm only produces ONE random bit. The next step — and I’m sure people at this very moment are working hard on this — is to improve their algorithm to produce MANY perfectly random bits.

Update (August 15, 2015): It didn’t take long for Xin Li, one of the world’s leading extractor designers, to extend Chattopadhyay and Zuckerman’s construction to output multiple bits. See his followup paper here. You’re watching science unfold, folks!

–Henry Yuen

Distribution Testing: Do It With Class!

mittheory — Sun, 09 Aug 2015 17:14:00 +0000

After a long break since the last time we touched upon distribution testing, we are back to discuss two papers that appeared this month on the arXiv — and hopefully some of their implications.

[ADK15] Optimal testing for properties of distributions.
[CDGR15] Testing Shape Restrictions of Discrete Distributions.

As a warmup, recall that in distribution testing (“property testing of probability distributions,” but shorter), one is given access to independent samples from an unknown distribution $D$ over some fixed domain $\Omega$ (say, the set of integers $[n] = \{1, \dots, n\}$), and wishes to learn one bit of information about $D$:

Does $D$ have the property I’m interested in, or is it far from every distribution that has this property?

In particular, the question is how to do things more efficiently (in terms of the number of samples required) than by fully learning the distribution (which would “cost” $\Theta(n)$ samples): surely, one bit of information is not that expensive, is it? Initiated by the seminal work of Batu, Fortnow, Rubinfeld, Smith, and White [BFRSW01], the field of distribution testing has since then aimed at answering that question by a resounding “No, it isn’t! Mostly. Sort of?” (And has had a fair amount of success in doing so.)

For more on property testing and distribution testing in general, we now take the liberty to refer the interested reader to [Ron08] and [Rub12,Can15] respectively.

Testing Membership: why, how, err… what?

Let $\mathcal{P}$ be a property of distributions, which is a fancy way of saying a subset or class of probability distributions. Think for instance of $\mathcal{P}$ as “the single uniform distribution on $n$ elements,” or “the class of all Binomial distributions.”

We mentioned above that the goal was, given a distance parameter $\varepsilon > 0$ as well as sample access to an arbitrary, unknown distribution $D$ over $[n]$, to determine the (asymptotic) minimum number of samples needed from $D$ to distinguish with high probability between the following two cases: (a) $D \in \mathcal{P}$ (it has the property); versus (b) $\ell_1(D, \mathcal{P}) > \varepsilon$ ($D$ is at distance at least $\varepsilon$ from every distribution that has the property).

The metric here is the $\ell_1$ distance, or equivalently (up to a factor of 2) the total variation distance. Now, many results, spanning the last decade, have culminated in pinpointing the exact sample complexity for various properties, such as uniformity (being the uniform distribution), identity (being equal to a fixed distribution, specified in advance), or even being a Poisson Binomial Distribution (a generalization of the Binomial distributions). There has also been work in testing if a joint distribution is a product distribution (“independence”), or if a distribution has a monotone probability mass function.

Note that in the above, two trends appear: the first is to consider a property that is a singleton, that is, to test if the distribution is equal to something fixed in advance; while the second looks at a bigger set of distributions, typically a “class” of distributions that exhibits some structure, and amounts to testing if $D$ belongs to this class. (To be consistent with this change of terminology, we switch from now on from $\mathcal{P}$, for property, to $\mathcal{C}$, for class.)

“We’re Binomials. Are you one of us?”

This latter type of result has many applications, e.g. for hypothesis testing, or model selection (“Is my statistical model completely off if I assume it’s Gaussian?”, or “I know how to deal with histograms. Histograms are nice. Can we say it’s a histogram?”). But for many classes of interest, the optimal sample complexity (as a function of $n$ and $\varepsilon$) remained either completely open, or some gap subsisted between the known upper and lower bounds. (For instance, the cases of monotone distributions and histograms were addressed in [BKR04] and [ILR12], but the results left room for improvement.)

Up until now, roughly. Two very recent papers tackle this question, and simultaneously solve it for many such distribution classes at once. The first, [CDGR15], gives a generic “meta-algorithm” that does ‘quite well’ for a whole range of classes: that is, a one-fits-all tester which has nearly-tight sample complexity (with regard to $n$). Their main theorem has the following flavor:

Let $\mathcal{C}$ be a class of distributions that all exhibit some nice structure. If it can be tested, there is a good chance this algorithm can test it: maybe not optimally, but close enough.

The second, [ADK15], describes and instantiates a general method of “testing-by-learning” for distributions, which yields optimal sample complexity for several of these classes (with respect to both $n$ and $\varepsilon$, for a wide range of parameters). Interestingly, any such testing-by-learning approach was, until today, thought to be a dead end for distribution testing: strong known impossibility results seemed to doom it. By coming up with an elegant twist, [ADK15] shows that, well, impossibility can only doom so much. Roughly and not quite accurately speaking, here is what they manage to show:

Let $\mathcal{C}$ be a class of distributions which all exhibit some nice structure. If one can efficiently $\chi^2$-learn it, then one can optimally test it.

A Unified Approach to Things

One Tester to Rule Them All

At the root of the generic tester of [CDGR15] is the observation that the testing algorithm of [BKR04], specifically tailored, built, and analyzed for testing whether a distribution is monotone, actually need not be. Namely, one can abstract and generalize the core idea of Batu, Kumar, and Rubinfeld; and by modifying it carefully obtain an algorithm that can apply to any class that satisfies a structural assumption:

Definition 1. (Structural Criterion) Any distribution in $\mathcal{C}$ is close (in a specific strong sense) to some piecewise-constant distribution on $L$ ‘pieces’, where $L$ is small (say $\mathrm{poly}(\log n)$).

By “close,” here we mean that on each of the $L$ pieces the distribution either puts very little probability weight, or stays within a $(1 + \varepsilon)$ factor of the piecewise-constant distribution. Moreover, and quite crucially, one does not need to be able to compute this decomposition in $L$ pieces explicitly: it is sufficient to prove existence, for each $D \in \mathcal{C}$, of a good decomposition.

Theorem 2. ([CDGR15, Main Theorem]) There exists an algorithm which, given sampling access to an unknown distribution $D$ over $[n]$ and parameter $\varepsilon \in (0, 1]$, can distinguish with probability 2/3 between (a) $D \in \mathcal{C}$ versus (b) $\ell_1(D, \mathcal{C}) > \varepsilon$, for any class $\mathcal{C}$ that satisfies the above structural criterion. Moreover, for many such properties this algorithm is computationally efficient, and its sample complexity is optimal (up to logarithmic factors and the exact dependence on $\varepsilon$).

As it turns out, this assumption is satisfied by many interesting classes of distributions: monotone, $k$-modal, log-concave, monotone hazard rate, Poisson Binomials, histograms, to cite only a few. Moreover, it’s easy to see that any mixture of distributions from these “structural” classes will also satisfy the above criterion (for a corresponding value of $L$).

To apply the tester to some class $\mathcal{C}$ you may fancy testing, it then only remains to find the best possible $L$ for the structural criterion (i.e., to prove an existential result for the class, independent of any algorithmic aspect), and plug this into Theorem 2. This will immediately yield an upper bound: maybe not optimal, but maybe not far from it.

For an intuition of how the algorithm works, we paraphrase the authors’ description:

The algorithm proceeds in 3 stages:

The first stage, the decomposition step, attempts to recursively construct a partition of the domain in a small number of intervals [that is, $L$], with a very strong guarantee [the distribution is almost uniform on each piece]. If this succeeds, then the unknown distribution $D$ will be close (in $\ell_1$) to its “flattening” on the partition; while if it fails (too many intervals have to be created), this serves as evidence that $D \notin \mathcal{C}$ and we can reject.

The second stage, the approximation step, learns this flattening, which can be done with few samples since by construction we do not have many intervals.

The last stage is purely computational, the projection step: we verify that the learned flattening is indeed close to $\mathcal{C}$.

If all three stages succeed, then by the triangle inequality $D$ is close to $\mathcal{C}$; and by the structural assumption on the class, if $D \in \mathcal{C}$ then it will admit succinct enough partitions, and all three stages will go through.

Upper bounds are fine, but lower bounds?

As a counterpart to generically proving testing is not too hard, it would be nice to have an equally generic way of establishing it is actually not very easy either… Canonne, Diakonikolas, Gouleakis, and Rubinfeld also tackle this question, and provide a general “testing-by-narrowing” reduction that enables one to do just that. We won’t delve into the details here (the reader is encouraged to read, or skim, the paper to learn more, including slightly more general statements and extensions to tolerant testing), but roughly speaking the theorem goes as follows:

Theorem 3. ([CDGR15, Lower Bound Theorem]) Let $\mathcal{C}$ be a class of distributions that can be efficiently (agnostically) learned. Then, testing membership to $\mathcal{C}$ is at least as hard as testing identity to any fixed distribution in $\mathcal{C}$.

While this may seem intuitive, there is actually no obvious reason why it should hold, and it actually fails if one removes the “efficiently learnable” assumption. (As an extreme case, consider the class $\mathcal{C}$ of all distributions (which is hard to learn). Testing if a distribution is in there can trivially be done with no sample at all, yet there are elements in $\mathcal{C}$ for which identity testing requires $\Omega(\sqrt{n})$ samples.)

As examples, [CDGR15] then instantiates Theorem 3 to show lower bounds for all the classes considered, choosing a suitable “hard instance” in each case.

So… we’re done?

Well, not quite. As said above, this paper provides “nearly-tight” results for many classes, basically a Swiss army knife for class testing. Depending on the situation, it may be more than enough; but sometimes more fine-grained tools are needed… what about an optimal testing algorithm for these classes? What about getting the sample complexity right? And what about testing in higher dimensions?

What about all of these? There is a next section, you know.

Testing by Learning: “Yes, we can!”

A naive first approach

As mentioned before, learning an arbitrary distribution on $[n]$ is expensive, requiring $\Theta(n)$ samples. However, if we assume the distribution belongs to some class $\mathcal{C}$, we may be able to do better. In particular, for many classes $\mathcal{C}$, we can efficiently solve the proper learning problem, in which we output a nearby distribution in the same class. Formally stated:

Given sample access to an unknown $p \in \mathcal{C}$, output $q \in \mathcal{C}$ such that $\ell_1(p, q) \le \varepsilon$.

This leads to the following naive algorithm for testing. First, regardless of whether or not $p$ is actually in the class, attempt to properly learn it up to accuracy $\varepsilon/2$. If $p$ is in the class, then we will learn a distribution $q$ which is actually $\varepsilon/2$-close. If it is not, since $q \in \mathcal{C}$, by assumption, $p$ and $q$ are $\varepsilon$-far. Thus, we have reduced to the following testing problem:

Given sample access to $p$ and a description of a distribution $q$, identify whether they are $\varepsilon/2$-close in $\ell_1$-distance or $\varepsilon$-far in $\ell_1$-distance.

Unfortunately, this is where we hit a brick wall: Valiant and Valiant showed that this problem requires $\Omega(n/\log n)$ samples, even for the simplest distribution possible, the uniform distribution [VV10]. All this work, only to achieve a barely sublinear tester?

The solution: relax

It turns out that one can circumvent this lower bound by considering a relaxation of the previous problem. For this purpose, we shall use the $\chi^2$-distance (which can be seen as a pointwise nonuniform rescaling of $\ell_2$): $d_{\chi^2}(p, q) = \sum_{i \in [n]} \frac{(p_i - q_i)^2}{q_i}$. The main feature of this distance we require is that $\ell_1(p, q)^2 \le d_{\chi^2}(p, q)$. As such, we can now consider the following (easier) testing problem:

Given sample access to $p$ and a description of a distribution $q$, identify whether they are $\varepsilon^2/4$-close in $\chi^2$-distance or $\varepsilon$-far in $\ell_1$-distance.

Now, the upshot: surprisingly, this new problem can be solved using $O(\sqrt{n}/\varepsilon^2)$ samples, significantly bypassing the previous lower bound. The actual test is a slight modification of the $\chi^2$-test by Pearson, a statistician so far ahead of his time that he casually introduced techniques which lead to optimal algorithms for major problems over a century later (and this isn’t the first time, either).
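As a quick numerical illustration of why being close in $\chi^2$-distance is the stronger requirement, here is a small Python sketch. It assumes the usual definitions, $\ell_1(p, q) = \sum_i |p_i - q_i|$ and $d_{\chi^2}(p, q) = \sum_i (p_i - q_i)^2 / q_i$; the function names are made up for the example. With these definitions, Cauchy–Schwarz gives $\ell_1(p, q)^2 \le d_{\chi^2}(p, q)$.

```python
import math


def l1_distance(p, q):
    """Sum of absolute differences between two probability vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def chi2_distance(p, q):
    """Chi-squared distance of p from the reference distribution q.

    Returns infinity when p puts mass on a point where q has none.
    """
    total = 0.0
    for a, b in zip(p, q):
        if b == 0:
            if a > 0:
                return math.inf
            continue
        total += (a - b) ** 2 / b
    return total


if __name__ == "__main__":
    p = [0.5, 0.5]
    q = [0.25, 0.75]
    print(l1_distance(p, q) ** 2, "<=", chi2_distance(p, q))
```

Note that the $\chi^2$-distance is not symmetric: swapping `p` and `q` changes the weights in the denominator.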

[Image: “Take that, brick wall.” (Image from penningtonhennessy.com)]

Another brick off the wall!

The fruits of our labor

Given this powerful primitive, we have a framework, which is a slight modification of the naive approach described above. Instead of “proper learn, then test,” we proper learn in $\chi^2$-distance, then test:

1. Use samples to obtain an explicit $q \in \mathcal{C}$ s.t.
– if $p \in \mathcal{C}$, $d_{\chi^2}(p, q)$ is small
– if $p$ is $\varepsilon$-far from $\mathcal{C}$, $\ell_1(p, q)$ is large
2. Perform the “$\chi^2$-close vs. $\ell_1$-far” test.
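For intuition about step 2, here is a minimal Python sketch of a Pearson-style statistic of the kind such a test can be built on. It is an illustration, not a faithful copy of the tester in [ADK15]: the function name, the choice to simply skip points where the hypothesis has zero mass, and the absence of Poissonized sampling and of any acceptance threshold are all simplifications made here.

```python
def chi2_statistic(counts, q, m):
    """Modified chi-squared statistic for m samples against hypothesis q.

    counts[i] is how many of the m samples landed on point i. Subtracting
    counts[i] in the numerator (compared with Pearson's classical statistic)
    removes most of the sampling noise, so that the expectation is roughly
    m times the chi-squared distance between the true distribution and q.
    """
    z = 0.0
    for n_i, q_i in zip(counts, q):
        if q_i > 0:  # simplification: ignore points outside the support of q
            z += ((n_i - m * q_i) ** 2 - n_i) / (m * q_i)
    return z


if __name__ == "__main__":
    q = [0.5, 0.5]
    print(chi2_statistic([5, 5], q, 10))   # samples look like q
    print(chi2_statistic([10, 0], q, 10))  # samples look nothing like q
```

A tester would compare this value to a threshold of order $m \varepsilon^2$: a small value is evidence for the “close in $\chi^2$” case, a large one for the “far in $\ell_1$” case.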

The learning problem is now slightly harder than before, but still requires no more than $O(\sqrt{n}/\varepsilon^2)$ samples for a large number of classes in the parameter regime of interest. As a result, this leads to optimal algorithms for testing many classes, including monotonicity, unimodality, log-concavity, and monotone hazard rate.

The story doesn’t end here: this framework also applies for testing multivariate discrete distributions. While the previous literature on testing one-dimensional distributions was quite mature (i.e., for the previously studied classes, we sort of “knew” the right answer to be $\Theta(\sqrt{n})$, up to log factors and the precise dependence on $\varepsilon$), fairly little was known about higher dimensional testing problems. This work manages to give optimal algorithms for testing monotonicity and independence of marginals over the $d$-dimensional hypergrid $[n]^d$. In particular, for monotonicity, it gives a quadratic improvement in the sample complexity, reducing it from $\tilde{O}(n^{d - 1/2})$ to the optimal $O(n^{d/2})$.

Conclusions

The hope is that these frameworks may find applications for many other problems, in distribution testing and beyond. The world is yours…

— Clément Canonne and Gautam Kamath

References

[ADK15] Optimal testing for properties of distributions. J. Acharya, C. Daskalakis, and G. Kamath. arXiv, 2015.
[BFRSW01] Testing that distributions are close. T. Batu, L. Fortnow, R. Rubinfeld, W. D. Smith, and P. White, FOCS’01.
[BKR04] Sublinear algorithms for testing monotone and unimodal distributions. T. Batu, R. Kumar, and R. Rubinfeld, STOC’04.
[Can15] A Survey on Distribution Testing: Your Data is Big. But is it Blue?. C. Canonne. ECCC, 2015.
[CDGR15] Testing Shape Restrictions of Discrete Distributions. C. Canonne, I. Diakonikolas, T. Gouleakis, and R. Rubinfeld. arXiv, 2015.
[ILR12] Approximating and Testing $k$-Histogram Distributions in Sub-linear Time. P. Indyk, R. Levi, and R. Rubinfeld, PODS’12.
[Ron08] Property Testing: A Learning Theory Perspective. D. Ron. FTML, 2008.
[Rub12] Taming Big Probability Distributions. R. Rubinfeld. XRDS, 2012.
[VV10] A CLT and tight lower bounds for estimating entropy. G. Valiant and P. Valiant. ECCC, 2010.

Sublinear Day at MIT

Gautam — Mon, 02 Mar 2015 21:14:53 +0000

On Friday, April 10th, MIT will be hosting the second Sublinear Algorithms Day. This event will bring together researchers in the northeast for a day of interaction and discussion.

Sublinear Day will feature talks by five experts in the areas of sublinear and streaming algorithms: Costis Daskalakis, Robert Krauthgamer, Jelani Nelson, Shubhangi Saraf, and Paul Valiant — each giving a 45-minute presentation on the hot and latest developments in their fields.

Additionally, for the first time this year, we will have a poster session! We strongly encourage young researchers (particularly students and postdocs) to present work related to sublinear algorithms. Abstract submission details are available here.

So what are you waiting for? Registration is available here, and we hope to see you at the event!

Website: http://tinyurl.com/sublinearday2
Poster: http://tinyurl.com/sublinearday2poster
Contact [email protected] with any questions.

Gautam Kamath

Sketching and Embedding are Equivalent for Norms

mittheory — Fri, 06 Feb 2015 05:54:53 +0000

Summary

In this post I will show that any normed space that allows good sketches is necessarily embeddable into an $\ell_p$ space with $p$ close to $1$. This provides a partial converse to a result of Piotr Indyk, who showed how to sketch metrics that embed into $\ell_p$ for $0 < p \le 2$. A cool bonus of this result is that it gives a new technique for obtaining sketching lower bounds.

This result appeared in a recent paper of mine that is a joint work with Alexandr Andoni and Robert Krauthgamer. I am pleased to report that it has been accepted to STOC 2015.

Sketching

One of the exciting relatively recent paradigms in algorithms is that of sketching. The high-level idea is as follows: if we are interested in working with a massive object $x$, let us start with compressing it to a short sketch $\mathrm{sk}(x)$ that preserves the properties of $x$ we care about. One great example of sketching is the Johnson-Lindenstrauss lemma: if we work with $n$ high-dimensional vectors and are interested in Euclidean distances between them, we can project the vectors on a random $O(\varepsilon^{-2} \log n)$-dimensional subspace, and this will preserve with high probability all the pairwise distances up to a factor of $1 + \varepsilon$.
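Here is a minimal Python sketch of the Johnson-Lindenstrauss idea, using only the standard library. One assumption to flag: instead of projecting onto a uniformly random subspace, it multiplies by a matrix of independent Gaussian entries scaled by $1/\sqrt{k}$, a common variant with the same kind of guarantee. The function names and the dimensions in the demo are arbitrary choices for illustration.

```python
import math
import random


def random_projection(vectors, k, seed=0):
    """Map each d-dimensional vector to k dimensions with a Gaussian matrix.

    Each entry of the k x d matrix is an independent N(0, 1/k) sample, so
    squared Euclidean lengths are preserved in expectation.
    """
    rng = random.Random(seed)
    d = len(vectors[0])
    matrix = [[rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(d)]
              for _ in range(k)]
    return [[sum(r * x for r, x in zip(row, v)) for row in matrix]
            for v in vectors]


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    rng = random.Random(1)
    points = [[rng.uniform(-1, 1) for _ in range(300)] for _ in range(3)]
    sketches = random_projection(points, k=200)
    for i in range(3):
        for j in range(i + 1, 3):
            ratio = euclidean(sketches[i], sketches[j]) / euclidean(points[i], points[j])
            print(i, j, round(ratio, 3))
```

The printed ratios should be close to 1. The same matrix must be applied to every vector, which is the role shared randomness plays in the communication game described below.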

It would be great to understand for which computational problems sketching is possible, and how efficient it can be made. There are quite a few nice results (both upper and lower bounds) along these lines (see, e.g., graph sketching or a recent book about sketching for numerical linear algebra), but a general understanding has yet to emerge.

Sketching for metrics

One of the main motivations to study sketching is fast computation and indexing of similarity measures between two objects $x$ and $y$. Oftentimes similarity between objects is modeled by some metric (but not always! think KL divergence): for instance the above example of the Euclidean distance falls into this category. Thus, instantiating the above general question one can ask: for which metric spaces do there exist good sketches? That is, when is it possible to compute a short sketch $\mathrm{sk}(x)$ of a point $x$ such that, given two sketches $\mathrm{sk}(x)$ and $\mathrm{sk}(y)$, one is able to estimate the distance $d(x, y)$?

The following communication game captures the question of sketching metrics. Alice and Bob each have a point from a metric space $X$ (say, $x$ and $y$, respectively). Suppose, in addition, that either $d_X(x, y) \le r$ or $d_X(x, y) > D \cdot r$ (where $r$ and $D$ are the parameters known from the beginning). Both Alice and Bob send messages $\mathrm{sk}(x)$ and $\mathrm{sk}(y)$ that are $s$ bits long to Charlie, who is supposed to distinguish the two cases (whether $d_X(x, y)$ is small or large) with constant success probability bounded away from $1/2$ (say, at least $2/3$). We assume that all three parties are allowed to use shared randomness. Our main goal is to understand the trade-off between $D$ (approximation) and $s$ (sketch size).

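Arguably, the most important metric spaces are $\ell_p$ spaces. Formally, for $1 \le p \le \infty$ we define $\ell_p^d$ to be a $d$-dimensional space equipped with distance

$$\|x - y\|_p = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}$$

(when $p = \infty$ this expression should be understood as $\max_{1 \le i \le d} |x_i - y_i|$). One can similarly define $\ell_p$ spaces for $0 < p < 1$; even if the triangle inequality does not hold for this case, it is nevertheless a meaningful notion of distance.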

It turns out that $\ell_p$ spaces exhibit very interesting behavior when it comes to sketching. Indyk showed that for $0 < p \le 2$ one can achieve approximation $D = 1 + \varepsilon$ and sketch size $s = O(1/\varepsilon^2)$ for every $\varepsilon > 0$ (for $p = 2$ this was established before by Kushilevitz, Ostrovsky and Rabani). It is quite remarkable that these bounds do not depend on the dimension of a space. On the other hand, for $\ell_p^d$ with $p > 2$ the dependence on the dimension is necessary. It turns out that for constant approximation $D = O(1)$ the optimal sketch size is $\tilde{\Theta}(d^{1 - 2/p})$.

Are there any other examples of metrics that admit efficient sketches (say, with constant $D$ and $s$)? One simple observation is that if a metric embeds well into $\ell_p$ for $0 < p \le 2$, then one can sketch this metric well. Formally, we say that a map $f \colon X \to Y$ between metric spaces is an embedding with distortion $D'$, if

$$d_X(x_1, x_2) \le C \cdot d_Y(f(x_1), f(x_2)) \le D' \cdot d_X(x_1, x_2)$$

for every $x_1, x_2 \in X$ and for some $C > 0$. It is immediate to see that if a metric space $X$ embeds into $\ell_p$ for $0 < p \le 2$ with distortion $O(1)$, then one can sketch $X$ with $s = O(1)$ and $D = O(1)$. Thus, we know that any metric that embeds well into $\ell_p$ with $0 < p \le 2$ is efficiently sketchable. Are there any other examples? The amazing answer is that we don’t know!

Our results

Our result shows that for a very important class of metrics (normed spaces), embedding into $\ell_p$ is the only possible way to obtain good sketches. Formally, if a normed space $X$ allows sketches of size $s$ for approximation $D$, then for every $\varepsilon > 0$ the space $X$ embeds into $\ell_{1 - \varepsilon}$ with distortion $O(sD/\varepsilon)$. This result together with the above upper bound by Indyk provides a complete characterization of normed spaces that admit good sketches.

Taking the above result in the contrapositive, we see that non-embeddability implies lower bounds for sketches. This is great, since it potentially allows us to employ many sophisticated non-embeddability results proved by geometers and functional analysts. Specifically, we prove two new lower bounds for sketches: for the planar Earth Mover’s Distance (building on a non-embeddability theorem by Naor and Schechtman) and for the trace norm (non-embeddability was proved by Pisier). In addition, we are able to unify certain known results: for instance, classify $\ell_p$ spaces and the cascaded norms in terms of “sketchability”.

Overview of the proof

Let me outline the main steps of the proof of the implication “good sketches imply good embeddings”. The following definition is central to the proof. Let us call a map between two metric spaces -threshold, if for every :

implies ,
implies .

One should think of threshold maps as very weak embeddings that merely preserve certain distance scales.

The proof can be divided into two parts. First, we prove that for a normed space that allows sketches of size and approximation there exists a -threshold map to a Hilbert space. Then, we prove that the existence of such a map implies the existence of an embedding into with distortion .

The first half goes roughly as follows. Assume that there is no -threshold map from to a Hilbert space. Then, by convex duality, this implies certain Poincaré-type inequalities on . This, in turn, implies sketching lower bounds for (the direct sum of copies of , where the norm is defined as the maximum of norms of the components) by a result of Andoni, Jayram and Pătrașcu (which is based on a very important notion of information complexity). Then, crucially using the fact that is a normed space, we conclude that itself does not have good sketches (this step follows from the fact that every normed space is of type and is of cotype ).

The second half uses tools from nonlinear functional analysis. First, building on an argument of Johnson and Randrianarivony, we show that for normed spaces the existence of such a threshold map into a Hilbert space implies a uniform embedding into a Hilbert space, that is, a map $f \colon X \to H$, where $H$ is a Hilbert space, such that

$$L(\|x_1 - x_2\|_X) \le \|f(x_1) - f(x_2)\|_H \le U(\|x_1 - x_2\|_X),$$

where $L, U \colon \mathbb{R}_+ \to \mathbb{R}_+$ are non-decreasing functions such that $L(t) > 0$ for every $t > 0$ and $U(t) \to 0$ as $t \to 0$. Both $L$ and $U$ are allowed to depend only on $s$ and $D$. This step uses a certain Lipschitz extension-type theorem and averaging via bounded invariant means. Finally, we conclude the proof by applying theorems of Aharoni-Maurey-Mityagin and Nikishin and obtain a desired (linear) embedding of $X$ into $\ell_{1 - \varepsilon}$.

Open problems

Let me finally state several open problems.

The first obvious open problem is to extend our result to as large a class of general metric spaces as possible. Two notable examples one should keep in mind are the Khot-Vishnoi space and the Heisenberg group. In both cases, a space admits good sketches (since both spaces are embeddable into $\ell_2$-squared), but neither of them is embeddable into $\ell_1$. I do not know if these spaces are embeddable into $\ell_{1 - \varepsilon}$, but I am inclined to suspect so.

The second open problem deals with linear sketches. For a normed space, one can require that a sketch is of the form , where is a random matrix generated using shared randomness. Our result then can be interpreted as follows: any normed space that allows sketches of size and approximation allows a linear sketch with one linear measurement and approximation (this follows from the fact that for there are good linear sketches). But can we always construct a linear sketch of size and approximation , where and are some (ideally, not too quickly growing) functions?

Finally, the third open problem is about spaces that allow essentially no non-trivial sketches. Can one characterize -dimensional normed spaces, where any sketch for approximation must have size ? The only example I can think of is a space that contains a subspace that is close to . Is this the only case?

—Ilya

Insensitive Intersections of Halfspaces – STOC 2014 Recaps (Part 11)

Gautam — Fri, 11 Jul 2014 13:54:35 +0000

In the eleventh and final installment of our STOC 2014 recaps, Jerry Li tells us about a spectacularly elegant result by Daniel Kane. It’s an example of what I like to call a “one-page wonder”. This is a bit of a misnomer, since Kane’s paper is slightly more than five pages long, but the term refers to any beautiful paper which is (relatively) short and readable.

We hope you’ve enjoyed our coverage of this year’s STOC. We were able to write about a few of our favorite results, but there’s a wealth of other interesting papers that deserve your attention. I encourage you to peruse the proceedings and discover some favorites of your own.

Jerry Li on The Average Sensitivity of an Intersection of Half Spaces, by Daniel Kane

The Monday afternoon sessions kicked off with Daniel Kane presenting his work on the average sensitivity of an intersection of halfspaces. Usually, FOCS/STOC talks can’t even begin to fit all the technical details from the paper, but unusually, Daniel’s talk included a complete proof of his result, without omitting any details. Amazingly, his result is very deep and general, so something incredible is clearly happening somewhere.

At the highest level, Daniel deals with the study of a certain class of Boolean functions. When we classically think about Boolean functions, we think of things such as CNFs, DNFs, decision trees, etc., which map into things like or , but today, and often in the study of Boolean analysis, we will think of functions as mapping to , which is roughly equivalent for many purposes (O’Donnell has a nice rule of thumb as when to use one convention or the other here). Given a function , we can define two important measures of sensitivity. The first is the average sensitivity (or, for those of you like me who grew up with O’Donnell’s book, the total influence) of the function, namely,

where is simply with its th coordinate set to . The second is the noise sensitivity of the function, which is defined similarly: for a parameter , it is the probability that if we sample uniformly at random from , then independently flip each of its bits with probability , the value of at these two inputs is different. We denote this quantity . When we generate a string from a string in this way we say they are -correlated. The weird function of in that expression is because often we equivalently think of being generated from by independently keeping each coordinate of fixed with probability , and uniformly rerandomizing that bit with probability .
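For small $n$ both quantities can be computed by brute force, which I find helpful for building intuition. The sketch below assumes the usual definitions: average sensitivity is the expected number of coordinates whose flip changes the value of the function on a uniform input, and noise sensitivity at rate `eps` is the probability that the value changes when every bit is flipped independently with probability `eps`. The function names are mine.

```python
from itertools import product


def average_sensitivity(f, n):
    """Expected number of coordinates whose flip changes f(x), x uniform."""
    total = 0
    for x in product((-1, 1), repeat=n):
        for i in range(n):
            flipped = x[:i] + (-x[i],) + x[i + 1:]
            total += f(x) != f(flipped)
    return total / 2 ** n


def noise_sensitivity(f, n, eps):
    """Pr[f(x) != f(y)], where y flips each bit of a uniform x w.p. eps."""
    total = 0.0
    for x in product((-1, 1), repeat=n):
        for z in product((1, -1), repeat=n):  # -1 marks a flipped bit
            flips = z.count(-1)
            weight = eps ** flips * (1 - eps) ** (n - flips)
            y = tuple(xi * zi for xi, zi in zip(x, z))
            total += weight * (f(x) != f(y))
    return total / 2 ** n


def majority(x):
    return 1 if sum(x) > 0 else -1


def parity(x):
    p = 1
    for xi in x:
        p *= xi
    return p


if __name__ == "__main__":
    print(average_sensitivity(majority, 3))  # 1.5
    print(average_sensitivity(parity, 3))    # 3.0
    print(noise_sensitivity(parity, 3, 0.1))
```

Parity is as sensitive as a function can be (every flip matters), while a dictator $x \mapsto x_1$ has average sensitivity 1 and noise sensitivity exactly `eps`.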

Why are these two measures important? If we have a concept class of functions , then it turns out that bounds on these two quantities can often be translated directly into learning algorithms for these classes. By Fourier analytic arguments, good bounds on the noise sensitivity of a function immediately imply that the function has good Fourier concentration on low degree terms, which in turn implies that the so-called “low-degree algorithm” can efficiently learn the class of functions in the PAC model, with random access. Unfortunately, I can’t say more here without a lot more technical setup; see [2] for a good introduction to the topic.

Now why is the average sensitivity of a Boolean function important? First of all, trust me when I say that it’s a fundamental measure of the robustness of the function. If we identify with , then the average sensitivity is how many edges cross from one subset into another (over ), so it is fundamentally related to the surface area of subsets of the hypercube, which comes up all over the place in Boolean analysis. Secondly, in some cases, we can translate between one measure and the other by considering restrictions of functions. To the best of my knowledge, this appears to be a technique first introduced by Peres in 1999, though his paper is from 2004 [3]. Let . We wish to bound the noise sensitivity of , so we need to see how it behaves when we generate uniformly at random, then as -correlated to . Suppose for some integer (if not, just round it). Fix a partition of the coordinates into bins , and a . Then, for any string , we associate it with the string whose th coordinate is the th coordinate of times the th coordinate of , if . Why are we doing this? Well, after some thought, it’s not too hard to convince yourself that if we choose the bins and the strings uniformly at random, then we get a uniformly random string . Moreover, to generate a string which is -correlated with , it suffices to, after having already randomly chosen the bins, , and , to randomly pick a coordinate of and flip its sign to produce a new string , and produce a new string with these choices of the bins, and . Thus, importantly, we can reduce the process of producing -correlated strings to the process of randomly flipping one bit of some weird new function–but this is the process we consider when we consider the average sensitivity! Thus noise sensitivity of is exactly equal to the expected (over the random choice of the bins and ) average sensitivity of this weird restricted thing. Why this is useful will (hopefully) become clear later.

Since the title of the paper includes the phrase “intersection of halfspaces,” at some point I should probably define what an intersection of halfspaces is. First of all, a halfspace (or linear threshold function) is a Boolean function of the form where and for concreteness let’s say (however, it’s not too hard to see that any halfspace has a representation so that the linear function inside the sign is never zero on the hypercube). Intuitively, take the hyperplane in with normal vector , then assign to all points which are on the same side as of the hyperplane the value , and the rest . Halfspaces are an incredibly rich family of Boolean functions which include arguably some of the most important objects in Boolean analysis, such as the dictator functions, the majority function, etc. There is basically a mountain of work on halfspaces, due to their importance in learning theory, and as elementary objects which capture a surprising amount of structure.

Secondly, the intersection of functions is the function which is at if and only if for all , and otherwise. If we think of each as a predicate on the Boolean cube, then their intersection is simply their AND (or NOR, depending on your map from to ).
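Here is a small sketch of these two definitions. Two conventions in it are my assumptions: $+1$ plays the role of “true”, and the sign of $0$ is taken to be $+1$. The helper `is_unate` checks by brute force whether a function is monotone (increasing or decreasing) in every coordinate.

```python
from itertools import product


def halfspace(w, theta):
    """x -> sign(w . x - theta), with sign(0) taken to be +1 (assumption)."""
    def f(x):
        return 1 if sum(wi * xi for wi, xi in zip(w, x)) - theta >= 0 else -1
    return f


def intersection(fs):
    """+1 exactly when every f in fs is +1, i.e. an AND of predicates."""
    def g(x):
        return 1 if all(f(x) == 1 for f in fs) else -1
    return g


def is_unate(f, n):
    """True if f is monotone (up or down) in each of its n coordinates."""
    for i in range(n):
        directions = set()
        for x in product((-1, 1), repeat=n):
            if x[i] == -1:
                up = x[:i] + (1,) + x[i + 1:]
                if f(up) != f(x):
                    directions.add(f(up) - f(x))  # +2 rising, -2 falling
        if len(directions) > 1:
            return False
    return True


if __name__ == "__main__":
    lower = halfspace((1, 1, 1), -1)     # x1 + x2 + x3 >= -1
    upper = halfspace((-1, -1, -1), -1)  # x1 + x2 + x3 <= 1
    slab = intersection([lower, upper])
    print(is_unate(lower, 3), is_unate(upper, 3), is_unate(slab, 3))
```

The slab in the usage example is worth a second look: each halfspace is unate, but their intersection is not, since raising a coordinate can move a point into the slab or out of it.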

Putting these together gives us the family of functions that Daniel’s work concerns. I don’t know what else to say other than they are a natural algebraic generalization of halfspaces. Hopefully you think these functions are interesting, but even if you don’t, it’s (kinda) okay, because, amazingly, it turns out Kane’s main result barely uses any properties of halfspaces! In fact, it only uses the fact that halfspaces are unate, that is, they are either monotone increasing or decreasing in each coordinate. In fact, he proves the following, incredibly general, theorem:

Theorem. [Kane14]
Let $f \colon \{-1, 1\}^n \to \{-1, 1\}$ be an intersection of $k$ unate functions. Then the average sensitivity of $f$ is $O(\sqrt{n \log k})$.

I’m not going to go into too much detail about the proof; unfortunately, despite my best efforts there’s not much intuition I can compress out of it (in his talk, Daniel himself admitted that there was a lemma which was mysterious even to him). Plus it’s only roughly two pages of elementary (but extremely deep) calculations, so just read it yourself! At a very, very high level, the idea is that intersecting an intersection of halfspaces with one more can only increase the average sensitivity by a small factor.

The really attentive reader might have also figured out why I gave that strange reduction between noise sensitivity and average sensitivity. This is because, importantly, when we apply this weird process of randomly choosing bins to an intersection of halfspaces, the resulting function is still an intersection of halfspaces, just over fewer coordinates (besides their unate-ness, this is the only property of intersections of halfspaces that Daniel uses). Thus, since we now know how to bound the average sensitivity of intersections of halfspaces, we also know tight bounds for the noise sensitivity of intersections of halfspaces, namely, the following:

Theorem. [Kane14]
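Let $f \colon \{-1, 1\}^n \to \{-1, 1\}$ be an intersection of $k$ halfspaces. Then the noise sensitivity of $f$ at noise rate $\varepsilon$ is $O(\sqrt{\varepsilon \log k})$.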

Finally, this gives us good learning algorithms for intersections of halfspaces.

The paper is remarkable; there had been previous work by Nazarov (see [4]) proving optimal bounds for sensitivities of intersections of halfspaces in the Gaussian setting, which is a more restricted setting than the Boolean setting (intuitively because we can simulate Gaussians by sums of independent Boolean variables), and there were some results in the Boolean setting, but they were fairly suboptimal [5]. Furthermore, all these proofs were scary: they were incredibly involved, used powerful machinery from real analysis, drew heavily on the properties of halfspaces, etc. On the other hand, Daniel’s proof of his main result (which I would say builds on past work in the area, except it doesn’t use anything besides elementary facts), well, I think Erdős would say this proof is from “the Book”.

[1] The Average Sensitivity of an Intersection of Half Spaces, Kane, 2014.
[2] Analysis of Boolean Functions, O’Donnell, 2014.
[3] Noise Stability of Weighted Majority, Peres, 2004.
[4] On the maximal perimeter of a convex set in $\mathbb{R}^n$ with respect to a Gaussian measure, Nazarov, 2003.
[5] An Invariance Principle for Polytopes, Harsha, Klivans, Meka, 2010.

So Alice and Bob want to flip a coin… – STOC 2014 Recaps (Part 10)

mittheory — Wed, 09 Jul 2014 21:15:06 +0000

A major use case for coin flipping (with actual coins) is when you’re with friends, and you have to decide where to eat. This agonizing decision process can be elegantly avoided when randomness is used. But who’s doing the coin flipping? How do you know your friend isn’t secretly choosing which restaurant to go to? Cryptography offers a solution to this, and Sunoo will tell us about how this solution is actually equivalent to one way functions!

Sunoo Park on Coin Flipping of Any Constant Bias Implies One-Way Functions, by Itay Berman, Iftach Haitner, and Aris Tentes

In this post, we look at a fundamental problem that has plagued humankind since long before theoretical computer science: if I don’t trust you and you don’t trust me, how can we make a fair random choice? This problem was once solved in the old-fashioned way of flipping a coin, but theoretical computer science has made quite some advances since.

What are the implications of this? In their recent STOC paper, Berman, Haitner, and Tentes show that the ability for two parties to flip a (reasonably) fair coin means that one-way functions exist. This, in turn, has far-reaching cryptographic implications.

A function $f$ is one-way if it is “easy” to compute $f(x)$ given any input $x$, and it is “hard”, given the image $y = f(x)$ of a random input, to find a preimage $x'$ such that $f(x') = y$. The existence of one-way functions implies a wide range of fundamental cryptographic primitives, including pseudorandom generation, pseudorandom functions, symmetric-key encryption, bit commitments, and digital signatures. The converse also holds: the seminal work of Impagliazzo and Luby [1] showed that the existence of cryptography based on complexity-theoretic hardness assumptions, encompassing all of the aforementioned primitives, implies that one-way functions exist.

About coin-flipping protocols, however, only somewhat more restricted results were known. Coin-flipping has long been a subject of interest in cryptography, since the early work of Blum [2] which described the following problem:

“Alice and Bob want to flip a coin by telephone. (They have just divorced, live in different cities, want to decide who gets the car.)”

More formally, a coin-flipping protocol is a protocol in which two players interact by exchanging messages, which upon completion outputs a single bit interpreted as the outcome of a coin flip. The aim is that the coin should be (close to) unbiased, even if one of the players “cheats” and tries to bias the outcome towards a certain value. We say that a protocol has constant bias if the probability that the outcome is equal to 0 is constant (in a security parameter).

Impagliazzo and Luby’s original paper showed that coin-flipping protocols achieving negligible bias (that is, they are very close to perfect!) imply one-way functions. Subsequently, Maji, Prabhakaran and Sahai [3] proved that coin-flipping protocols with a constant number of rounds (and any non-trivial bias, i.e. ) also imply one-way functions. Yet more recently, Haitner and Omri [4] showed that the same holds for any coin-flipping protocol with a constant bias (namely, a bias of ). Finally, Berman, Haitner and Tentes proved that coin-flipping of any constant bias implies one-way functions. The remainder of this post will give a flavor of the main ideas behind their proof.

The high-level structure of the argument is as follows: given any coin-flipping protocol between players and , we first define a (sort of) one-way function, then show that an adversary capable of efficiently inverting that function must be able to achieve a significant bias in . The one-way function used is the transcript function which maps the players’ random coinflips to a protocol transcript. The two main neat insights are these:

Consider the properties of a coin-flipping protocol when interpreted as a zero-sum game between two players: Alice wins if the outcome is 1, and Bob wins otherwise. If the players play optimally, who wins? It turns out that from the winner, we can deduce that there is a set of protocol transcripts where the outcome is bound to be the winning outcome, no matter what the losing player does: that is, transcripts that are “immune” to the losing player.
The paper proposes a recursive variant of the biasing attack of Haitner and Omri [4]. The adversary can use the new attack to generate a transcript that lies in the “immune” set with probability close to 1, so this adversary (who has access to an inverter for the transcript function) can bias the protocol outcome with high probability.

The analysis is rather involved, and there are some fiddly details to resolve, such as how a bounded adversary can simulate the recursive attack efficiently, and how to ensure that the inverter will work for the particular query distribution of the adversary. Without going into all those details, the last part of this post describes the optimal recursive attack.

Let be the function that takes as input a pair , where is a random transcript of a partial (incomplete) honest execution of and , and outputs a random pair of random coinflips for the players, satisfying the following two conditions: (1) they are consistent with , that is, the coinflips could be plausible next coinflips given the partial transcript ; and (2) there exists a continuation of the protocol after and the next-coinflips that leads to the protocol outcome .

It seems that an effective way for the adversary to use might be to call for each partial transcript at which the relevant player has to make a move, and then to behave according to the outputted coins. We call this strategy the biased-continuation attack, which is the crucial attack underlying the result of [4].

The new paper proposes a recursive biased-continuation attack that adds an additional layer of sophistication. Let be the honest first player’s strategy. Now, define to be the attacker which, rather than sampling a random 1-continuation among all the possible honest continuations of the protocol , instead samples a random 1-continuation among all continuations of . Note that is the biased-continuation attacker described above! It turns out that as the number of recursions grows, the probability that the resulting transcript will land in the “immune” set approaches 1, meaning a successful attack! Naïvely, this attack may require exponential time due to the many recursions required; however, the paper circumvents this by analyzing the probability that the transcript will land in a set which is “almost immune”, and finding that this probability approaches 1 significantly faster.
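To see the recursion in action, here is a toy model, not the construction from the paper. I assume a three-round protocol in which each message is a single bit, the first player speaks in rounds 0 and 2, an honest player sends a fair coin, and the outcome is the majority of the three bits. The attacker has unbounded power here (it enumerates continuations exactly instead of calling an inverter), and the names `a_move`, `win_prob` and `depth` are mine. Depth 0 is the honest player, depth 1 is the biased-continuation attacker, and depth $i$ samples 1-continuations of the game played by the depth $i - 1$ attacker.

```python
from fractions import Fraction
from functools import lru_cache

N = 3  # rounds; the first player speaks in even rounds, the second in odd


def outcome(t):
    """Toy protocol: the coin is the majority of the N broadcast bits."""
    return 1 if sum(t) * 2 > N else 0


@lru_cache(maxsize=None)
def a_move(depth, t):
    """Probability that the depth-`depth` attacker sends bit 1 after t."""
    if depth == 0:
        return Fraction(1, 2)  # honest play: a fair coin
    prev = a_move(depth - 1, t)
    win = win_prob(depth - 1, t)
    if win == 0:
        return prev  # no 1-continuation exists, so copy the previous level
    # Sample a 1-continuation of the previous level's game (Bayes' rule)
    # and send the bit it prescribes for this round.
    return prev * win_prob(depth - 1, t + (1,)) / win


@lru_cache(maxsize=None)
def win_prob(depth, t=()):
    """Pr[outcome = 1 | transcript t], attacker at `depth` vs. honest opponent."""
    if len(t) == N:
        return Fraction(outcome(t))
    p1 = a_move(depth, t) if len(t) % 2 == 0 else Fraction(1, 2)
    return p1 * win_prob(depth, t + (1,)) + (1 - p1) * win_prob(depth, t + (0,))


if __name__ == "__main__":
    for depth in range(5):
        print(depth, win_prob(depth))
```

In this toy the probability of outcome 1 climbs from 1/2 (honest) to 7/8 and then 13/14, heading towards 1. The “immune” set here is the set of transcripts that begin with a 1: once the attacker's first bit is 1, it can always force the majority with its second message.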

[1] One-Way Functions Are Essential for Complexity Based Cryptography. Impagliazzo, Luby (FOCS 1989).
[2] Coin Flipping by Telephone: A Protocol for Solving Impossible Problems. Blum (CRYPTO 1981).
[3] On the Computational Complexity of Coin Flipping. Maji, Prabhakaran, Sahai (FOCS 2010).
[4] Coin Flipping with Constant Bias Implies One-Way Functions. Haitner, Omri (FOCS 2011).

Faster, I say! The race for the fastest SDD linear system solver – STOC 2014 Recaps (Part 9)

mittheory — Tue, 08 Jul 2014 02:46:37 +0000

In the next post in our series of STOC 2014 recaps, Adrian Vladu tells us about some of the latest and greatest in Laplacian and SDD linear system solvers. There’s been a flurry of exciting results in this line of work, so we hope this gets you up to speed.

The Monday morning session was dominated by a nowadays popular topic, symmetric diagonally dominant (SDD) linear system solvers. Richard Peng started by presenting his work with Dan Spielman, the first parallel solver with near linear work and poly-logarithmic depth! This is exciting, since parallel algorithms are used for large scale problems in scientific computing, so this is a result with great practical applications.

The second talk was given by Jakub Pachocki and Shen Chen Xu from CMU, which was a result of merging two papers. The first result is a new type of trees that can be used as preconditioners. The second one is a more efficient solver, which together with the trees shaved one more factor in the race for the fastest solver.

Before getting into more specific details, it might be a good idea to provide a bit of background on the vast literature of Laplacian solvers.

Typically, linear systems $Ax = b$ are easier to solve whenever $A$ has some structure. A particular class we care about are positive semidefinite (PSD) matrices. They work nicely because the solution is the minimizer of the quadratic form $\frac{1}{2} x^\top A x - b^\top x$, which happens to be a convex function due to the PSD-ness of $A$. Hence we can use various versions of gradient descent, the convergence of which depends usually on the condition number of $A$.

A subset of PSD matrices are Laplacian matrices, which are nothing else but graph Laplacians; using an easy reduction, one can show that any SDD system can be reduced to a Laplacian system. Laplacians are great because they carry a lot of combinatorial structure. Instead of having to suffer through a lot of scary algebra, this is the place where we finally get to solve some fun problems on graphs. The algorithms we aim for have running time close to linear in the sparsity of the matrix.

One reason why graphs are nice is that we know how to approximate them with other simpler graphs. More specifically, when given a graph , we care about finding a sparser graph such that ^a, for some small (the smaller, the better). The point is that whenever you do gradient descent in order to minimize , you can take large steps by solving a system in the sparser . Of course, this requires another linear system solve, only that this only needs to be done on a sparser graph. Applying this idea recursively eventually yields efficient solvers. A lot of combinatorial work is spent on understanding how to compute these sparser graphs.

In their seminal work, Spielman and Teng used ultrasparsifiers^b as their underlying combinatorial structure, and after many pages of work they obtained a near linear algorithm with a large polylogarithmic factor in the running time. Eventually, Koutis, Miller and Peng came up with a much cleaner construction, and showed how to construct a chain of sparser and sparser graphs, which yielded a solver that was actually practical. Subsequently, people spent a lot of time trying to shave log factors from the running time, see [9], [6], [7], [12] (the last of which was presented at this STOC), and the list will probably continue.

After this digression, we can get back to the conference program and discuss the results.

An Efficient Parallel Solver for SDD Linear Systems by Richard Peng and Daniel Spielman

How do we solve a system ? We need to find a way to efficiently apply the operator to . Even Laplacians are not easy to invert, and what’s worse, their pseudoinverses might not even be sparse. However, we can still represent as a product of sparse matrices which are easy to compute.

We can gain some inspiration from trying to numerically approximate the inverse of for some small real . Taking the Taylor expansion we get that . Notice that in order to get precision, we only need to take the product of the first factors. It would be great if we could approximate matrix inverses the same way. Actually, we can, since for matrices of norm less than we have the identity . At this point we’d be tempted to think that we’re almost done, since we can just write , and try to invert . However we would still need to compute matrix powers, and those matrices might again not even be sparse, so this approach needs more work.
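The product identity is easy to check numerically. The sketch below uses the fact that $(I - A)(I + A)(I + A^2) \cdots (I + A^{2^{k-1}}) = I - A^{2^k}$, so with $k$ factors the error is exactly $A^{2^k}$, which shrinks doubly exponentially when the norm of $A$ is below 1. The helper names and the example matrix are my own.

```python
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def add(X, Y):
    return [[a + b for a, b in zip(r, s)] for r, s in zip(X, Y)]


def identity(n):
    return [[float(i == j) for j in range(n)] for i in range(n)]


def approx_inverse_of_I_minus(A, num_factors):
    """Return (I + A)(I + A^2)(I + A^4)... using `num_factors` factors."""
    n = len(A)
    result = identity(n)
    power = A  # holds A^(2^j) at the start of iteration j
    for _ in range(num_factors):
        result = matmul(result, add(identity(n), power))
        power = matmul(power, power)  # repeated squaring
    return result


if __name__ == "__main__":
    A = [[0.2, 0.1], [0.1, 0.3]]
    I_minus_A = [[0.8, -0.1], [-0.1, 0.7]]
    for k in range(1, 6):
        P = matmul(I_minus_A, approx_inverse_of_I_minus(A, k))
        err = max(abs(P[i][j] - (i == j)) for i in range(2) for j in range(2))
        print(k, f"{err:.2e}")
```

Note that the repeated squaring in `power = matmul(power, power)` is exactly the step that destroys sparsity for large matrices, which is the difficulty described above.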

Richard presented a variation of this idea that is more amenable to SDD matrices. He writes

The only hard part of applying this inverse operator to a vector consists of left multiplying by . How to do this? One crucial ingredient is the fact that is also SDD! Therefore we can recurse, and solve a linear system in . You might say that we won’t be able to do it efficiently, since is not sparse. But with a little bit of work and the help of existing spectral sparsification algorithms can be approximated with a sparse matrix.
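One identity of this shape, stated here as my assumption about the setup rather than as a quote from the talk, splits the SDD matrix as $D - A$ (diagonal part minus off-diagonal part) and reads $(D - A)^{-1} = \frac{1}{2}\big[D^{-1} + (I + D^{-1}A)(D - AD^{-1}A)^{-1}(I + AD^{-1})\big]$. The only expensive piece is the inverse of $D - AD^{-1}A$, which matches the discussion above. Below is a small dense check of that formula; the Gauss-Jordan `inverse` is only there to keep the example self-contained.

```python
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def add(X, Y, scale=1.0):
    """Entrywise X + scale * Y."""
    return [[a + scale * b for a, b in zip(r, s)] for r, s in zip(X, Y)]


def inverse(M):
    """Gauss-Jordan elimination with partial pivoting (small dense input)."""
    n = len(M)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def inverse_via_identity(D, A):
    """0.5 * [D^-1 + (I + D^-1 A) (D - A D^-1 A)^-1 (I + A D^-1)]."""
    n = len(D)
    eye = [[float(i == j) for j in range(n)] for i in range(n)]
    d_inv = inverse(D)
    left = add(eye, matmul(d_inv, A))
    right = add(eye, matmul(A, d_inv))
    middle = inverse(add(D, matmul(matmul(A, d_inv), A), scale=-1.0))
    total = add(d_inv, matmul(matmul(left, middle), right))
    return [[0.5 * v for v in row] for row in total]


if __name__ == "__main__":
    D = [[2, 0, 0], [0, 3, 0], [0, 0, 2]]   # diagonal part
    A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]   # adjacency of a 3-vertex path
    direct = inverse(add(D, A, scale=-1.0))
    via = inverse_via_identity(D, A)
    gap = max(abs(u - v) for r, s in zip(direct, via) for u, v in zip(r, s))
    print(f"largest entrywise difference: {gap:.2e}")
```

In the example, $D - AD^{-1}A$ is again diagonally dominant with non-positive off-diagonal entries, so the same trick can be applied to it in turn.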

Notice that at the level of recursion, the operator we need to apply is . A quick calculation shows that if the condition number of is , then the condition number of is . This means that after iterations, the eigenvalues of are close to , so we can just approximate the operator with without paying too much for the error.

There are a few details left out. Sparsifying requires a bit of understanding of its underlying structure. Also, in order to do this in parallel, the authors originally employed the spectral sparsification algorithm of Spielman and Teng, combined with a local clustering algorithm of Orecchia, Sachdeva and Vishnoi. Blackboxing these two sophisticated results might call into question the practicality of the algorithm. Fortunately, Koutis recently produced a simple self-contained spectral sparsification algorithm, which parallelizes and can replace all the heavy machinery in Richard and Dan’s paper.

Solving SDD Linear Systems in Nearly $m \log^{1/2} n$ Time by Michael B. Cohen, Rasmus Kyng, Gary L. Miller, Jakub W. Pachocki, Richard Peng, Anup B. Rao, and Shen Chen Xu

Jakub Pachocki and Shen Chen Xu talked about two results, which together yield the fastest SDD system solver to date. The race is still on!

Let me go through a bit more background. Earlier on I mentioned that graph preconditioners are used to take long steps while doing gradient descent. A dual of gradient descent on the quadratic function is the Richardson iteration. This is yet another iterative method, which refines a coarse approximation to the solution of a linear system. Let $L_G$ be the Laplacian of our given graph, and $L_H$ be the Laplacian of its preconditioner. Let us assume that we have access to the inverse of $L_H$. The Richardson iteration computes a sequence $x_0, x_1, x_2, \dots$, which converges to the solution of the system. It starts with a weak estimate for the solution, and iteratively attempts to decrease the norm of the residual $b - L_G x_k$ by updating the current solution with a coarse approximation to the solution of the system $L_G y = b - L_G x_k$. That coarse approximation is computed using $L_H^{-1}$. Therefore steps are given by

$$x_{k+1} = x_k + \alpha L_H^{-1} (b - L_G x_k),$$

where $\alpha$ is a parameter that adjusts the length of the step. The better $L_H$ approximates $L_G$, the fewer steps we need to make.
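A minimal sketch of the preconditioned Richardson iteration follows. In it `A` is the system matrix, `apply_B_inverse` solves a system in the preconditioner, and `alpha` is the step length. As a stand-in for a real graph preconditioner I use the diagonal of `A` (the Jacobi preconditioner), which is my simplification and not what the paper does; the matrix is a small positive definite SDD example rather than a true (singular) Laplacian.

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def richardson(A, b, apply_B_inverse, alpha=1.0, steps=100):
    """Iterate x <- x - alpha * B^{-1} (A x - b), starting from x = 0."""
    x = [0.0] * len(b)
    for _ in range(steps):
        residual = [ax - bi for ax, bi in zip(matvec(A, x), b)]
        correction = apply_B_inverse(residual)  # coarse solve in B
        x = [xi - alpha * ci for xi, ci in zip(x, correction)]
    return x


def jacobi_preconditioner(A):
    """Stand-in for B: the diagonal of A, which is trivial to invert."""
    diag = [A[i][i] for i in range(len(A))]
    return lambda r: [ri / d for ri, d in zip(r, diag)]


if __name__ == "__main__":
    A = [[2.0, -1.0, 0.0], [-1.0, 3.0, -1.0], [0.0, -1.0, 2.0]]
    b = [1.0, 0.0, 1.0]
    x = richardson(A, b, jacobi_preconditioner(A))
    print([round(v, 6) for v in x])  # close to [0.75, 0.5, 0.75]
```

Swapping `jacobi_preconditioner` for a solver in a better approximation of `A` changes nothing else in the loop; it only reduces the number of `steps` needed.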

The problem that Jakub and Shen talked about was finding these good preconditioners. The way they do it is by looking more closely at the Richardson iteration, and weakening the requirements. Instead of having the preconditioner approximate spectrally, they only impose some moment bounds. I will not describe them here, but feel free to read the paper. Proving that these moment bounds can be satisfied using a sparser preconditioner than those that have been so far used in the literature constitutes the technical core of the paper.

Just like in the past literature, these preconditioners are obtained by starting with a good tree, and sampling extra edges. Traditionally, people used low stretch spanning trees. The issue with them is that the number of edges in the preconditioner is determined by the average stretch of the tree, and we can easily check that for the square grid this is . Unfortunately, in general we don’t know how to achieve this bound yet; the best known result is off by a factor. It turns out that we can still get preconditioners by looking at a different quantity, called the stretch (), which can be brought down to . This essentially eliminates the need for computing optimal low stretch spanning trees. Furthermore, these trees can be computed really fast, time in the RAM model, and the algorithm parallelizes.

This result consists of a careful combination of existing algorithms on low stretch embeddings of graphs into tree metrics and low stretch spanning trees. I will talk more about these embedding results in a future post.

^{a. $\preceq$ is also known as the Löwner partial order. $A \preceq B$ is equivalent to $0 \preceq B - A$, which says that $B - A$ is PSD.}
^{b. A -ultrasparsifier of is a graph with edges such that . It turns out that one is able to efficiently construct ultrasparsifiers. So by adding a few edges to a spanning tree, you can drastically reduce the relative condition number with the initial graph.}

[1] Approaching optimality for solving SDD systems, Koutis, Miller, Peng (2010).
[2] Nearly-Linear Time Algorithms for Graph Partitioning, Graph Sparsification, and Solving Linear Systems, Spielman, Teng (2004).
[3] A Local Clustering Algorithm for Massive Graphs and its Application to Nearly-Linear Time Graph Partitioning, Spielman, Teng (2013).
[4] Spectral Sparsification of Graphs, Spielman, Teng (2011).
[5] Nearly-Linear Time Algorithms for Preconditioning and Solving Symmetric, Diagonally Dominant Linear Systems, Spielman, Teng (2014).
[6] A Simple, Combinatorial Algorithm for Solving SDD Systems in Nearly-Linear Time, Kelner, Orecchia, Sidford, Zhu (2013).
[7] Efficient Accelerated Coordinate Descent Methods and Faster Algorithms for Solving Linear Systems, Lee, Sidford (2013).
[8] A nearly-$m \log n$ time solver for SDD linear systems, Koutis, Miller, Peng (2011).
[9] Improved Spectral Sparsification and Numerical Algorithms for SDD Matrices, Koutis, Levin, Peng (2012).
[10] Simple parallel and distributed algorithms for spectral graph sparsification, Koutis (2014).
[11] Preconditioning in Expectation, Cohen, Kyng, Pachocki, Peng, Rao (2014).
[12] Stretching Stretch, Cohen, Miller, Pachocki, Peng, Xu (2014).
[13] Solving SDD Linear Systems in Nearly $m \log^{1/2} n$ Time, Cohen, Kyng, Miller, Pachocki, Peng, Rao, Xu (2014).
[14] Approximating the Exponential, the Lanczos Method and an $\tilde{O}(m)$-Time Spectral Algorithm for Balanced Separator, Orecchia, Sachdeva, Vishnoi (2012).

An Encore: More Learning and Testing – STOC 2014 Recaps (Part 8)

Gautam — Thu, 03 Jul 2014 14:43:01 +0000

Thought you were rid of us? Not quite: in a last hurrah, Clément and I come back with a final pair of distribution estimation recaps — this time on results from the actual conference!

Gautam Kamath on Efficient Density Estimation via Piecewise Polynomial Approximation by Siu-On Chan, Ilias Diakonikolas, Rocco A. Servedio, and Xiaorui Sun

Density estimation is the question on everyone’s mind. It’s as simple as it gets – we receive samples from a distribution and want to figure out what the distribution looks like. The problem rears its head in almost every setting you can imagine — fields as diverse as medicine, advertising, and compiler design, to name a few. Given its ubiquity, it’s embarrassing to admit that we didn’t have a provably good algorithm for this problem until just now.

Let’s get more precise. We’ll deal with the total variation distance metric (AKA statistical distance). Given distributions with PDFs $f$ and $g$, their total variation distance is $d_{\mathrm{TV}}(f,g) = \frac{1}{2}\int |f(x) - g(x)|\,dx = \frac{1}{2}\|f - g\|_1$. Less formally but more intuitively, it upper bounds the difference in probabilities for any given event. With this metric in place, we can define what it means to learn a distribution: given sample access to a distribution $p$, we would like to output a distribution $\hat{p}$ such that $d_{\mathrm{TV}}(p, \hat{p}) \leq \varepsilon$.
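To make the definition concrete, here is a minimal sketch for the discrete case (an assumption made for illustration; the post itself works with PDFs). The function names and the toy distributions are hypothetical. It also checks the "upper bounds the difference in probabilities for any given event" reading by evaluating the event on which the gap is largest.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions.

    p and q are dicts mapping outcomes to probabilities (discrete stand-ins
    for the PDFs in the text).
    """
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


def max_event_gap(p, q):
    """Largest value of P(A) - Q(A) over events A.

    The maximum is attained by A = {x : p(x) > q(x)}.
    """
    support = set(p) | set(q)
    return sum(
        p.get(x, 0.0) - q.get(x, 0.0)
        for x in support
        if p.get(x, 0.0) > q.get(x, 0.0)
    )


# Toy distributions, chosen so that all arithmetic is exact in floating point.
p = {"a": 0.5, "b": 0.5}
q = {"a": 0.25, "b": 0.25, "c": 0.5}

tv = total_variation(p, q)
gap = max_event_gap(p, q)
print(tv, gap)
```

The two quantities coincide, which is the sense in which total variation distance bounds how differently the two distributions can treat any single event.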

This paper presents an algorithm for learning $t$-piecewise degree-$d$ polynomials. Wow, that’s a mouthful. What does it mean? A $t$-piecewise degree-$d$ polynomial is a function whose domain can be partitioned into $t$ intervals, such that the function is a degree-$d$ polynomial on each of these intervals. The main result says that a distribution with a PDF described by a $t$-piecewise degree-$d$ polynomial can be learned to accuracy $\varepsilon$ using $\tilde{O}(t(d+1)/\varepsilon^2)$ samples and polynomial time. Moreover, the sample complexity is optimal up to logarithmic factors.

A 4-piecewise degree-3 polynomial.
Lifted from Ilias’ slides.
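For readers who prefer code to definitions, here is one possible way to represent and evaluate a piecewise polynomial. The representation (a sorted list of breakpoints plus one coefficient list per piece) and the tent-shaped example density are assumptions made for illustration, not anything taken from the paper.

```python
from bisect import bisect_right


def eval_piecewise(breakpoints, coeffs, x):
    """Evaluate a piecewise polynomial at x.

    breakpoints: sorted endpoints b_0 < b_1 < ... < b_t of the t pieces.
    coeffs[i]:   coefficients (lowest degree first) of the polynomial used
                 on the half-open interval [b_i, b_{i+1}).
    Outside [b_0, b_t) the function is taken to be 0.
    """
    if not breakpoints[0] <= x < breakpoints[-1]:
        return 0.0
    piece = bisect_right(breakpoints, x) - 1
    value = 0.0
    for c in reversed(coeffs[piece]):  # Horner's rule
        value = value * x + c
    return value


# A 2-piecewise degree-1 "tent" density on [0, 2):
#   x      on [0, 1)
#   2 - x  on [1, 2)
tent_breakpoints = [0.0, 1.0, 2.0]
tent_coeffs = [[0.0, 1.0], [2.0, -1.0]]

print(eval_piecewise(tent_breakpoints, tent_coeffs, 0.5))
print(eval_piecewise(tent_breakpoints, tent_coeffs, 1.5))
```

In this representation the number of pieces is `len(coeffs)` and the degree is one less than the longest coefficient list.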

Now this is great and all, but what good are piecewise polynomials? How many realistic distributions are described by something like “one polynomial on this interval, but another polynomial on the next interval, and …”? The answer turns out to be a ton of distributions, as long as you squint at them hard enough.

The wonderful thing about this result is that it’s semi-agnostic. Many algorithms in the literature are God-fearing subroutines, and will sacrifice their first-born child to make sure they receive samples from the class of distributions they’re promised — otherwise, you can’t make any guarantees about the quality of their output. But our friend here is a bit more skeptical. He deals with a funny class of distributions, and knows true piecewise polynomial distributions are few and far between — if you get one on the streets, who knows if it’s pure? Our friend is resourceful: no matter the quality, he makes it work.

Let’s elaborate, in slightly less blasphemous terms. Suppose you’re given sample access to a distribution $p$ which is at total variation distance $\tau$ from some $t$-piecewise degree-$d$ polynomial (you don’t need to know which one). Then the algorithm will output an $O(t)$-piecewise degree-$d$ polynomial which is at distance $O(\tau) + \varepsilon$ from $p$. In English: even if the algorithm isn’t given a piecewise polynomial, it’ll still produce something that’s (almost) as good as you could hope for.

With this insight under our belt, let’s ask again: where do we see piecewise polynomials? They’re everywhere: this algorithm can handle distributions which are log-concave, bounded monotone, Gaussian, $k$-modal, monotone hazard rate, and Poisson Binomial. And the kicker is that it can handle mixtures of these distributions too. Usually, algorithms fail catastrophically when considering mixtures, but this algorithm keeps chugging and handles them all, and near optimally, most of the time.

The analysis is tricky, but I’ll try to give a taste of some of the techniques. One of the key tools is the Vapnik-Chervonenkis (VC) inequality. Without getting into the details, the punchline is that if we output a piecewise polynomial which is “close” to the empirical distribution (under a weaker metric than total variation distance), it’ll give us our desired learning result. In this setting, “close” means (roughly) that the CDFs don’t stray too far from each other (though in a sense that is stronger than the Kolmogorov distance metric).

Let’s start with an easy case: what if the distribution is a $1$-piecewise polynomial, that is, a single polynomial? By the VC inequality, we just have to match the empirical CDF. We can do this by setting up a linear program which outputs a linear combination of the Chebyshev polynomials, constrained to resemble the empirical distribution.

It turns out that this subroutine is the hardest part of the algorithm. In order to deal with multiple pieces, we first discretize the support into small intervals which are roughly equal in probability mass. Next, in order to discover a good partition of these intervals, we run a dynamic program. This program uses the subroutine from the previous paragraph to compute the best polynomial approximation over each contiguous set of the intervals. Then, it stitches the solutions together in the minimum cost way, with the constraint that it uses fewer than $2t$ pieces.
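The stitching step can be sketched as a standard interval-partition dynamic program. The sketch below is only meant to show the shape of the recursion: `piece_cost` is a deliberately crude stand-in (squared error of the best constant fit to the cell masses) for the paper’s linear-programming subroutine, and all names and the toy input are hypothetical.

```python
def piece_cost(masses, i, j):
    """Stand-in for the LP subroutine: squared error of the best constant
    (degree-0) fit to the empirical masses of cells i, ..., j-1."""
    segment = masses[i:j]
    mean = sum(segment) / len(segment)
    return sum((m - mean) ** 2 for m in segment)


def best_partition(masses, max_pieces):
    """Split cells 0..n-1 into at most max_pieces contiguous pieces so that
    the total per-piece cost is minimal. Returns (cost, list of (start, end))."""
    n = len(masses)
    inf = float("inf")
    # best[k][j]: minimum cost of covering cells 0..j-1 with exactly k pieces.
    best = [[inf] * (n + 1) for _ in range(max_pieces + 1)]
    back = [[None] * (n + 1) for _ in range(max_pieces + 1)]
    best[0][0] = 0.0
    for k in range(1, max_pieces + 1):
        for j in range(1, n + 1):
            for i in range(k - 1, j):  # the last piece covers cells i..j-1
                candidate = best[k - 1][i] + piece_cost(masses, i, j)
                if candidate < best[k][j]:
                    best[k][j] = candidate
                    back[k][j] = i
    # Pick the cheapest number of pieces (ties go to fewer pieces).
    k = min(range(1, max_pieces + 1), key=lambda pieces: best[pieces][n])
    cost = best[k][n]
    # Walk the back-pointers to recover the pieces, last piece first.
    pieces, j = [], n
    while k > 0:
        i = back[k][j]
        pieces.append((i, j))
        j, k = i, k - 1
    return cost, pieces[::-1]


# Empirical masses of seven cells; three flat regions are hidden in them.
cell_masses = [0.125, 0.125, 0.125, 0.25, 0.25, 0.0625, 0.0625]
cost, pieces = best_partition(cell_masses, max_pieces=3)
print(cost, pieces)
```

With a per-piece oracle that takes polynomial time, the triple loop calls it a polynomial number of times, which matches the “polynomial time” flavor of the result; the actual algorithm’s cost function and analysis are considerably more delicate than this toy version.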

In short, this result essentially closes the problem of density estimation for an enormous class of distributions: the authors turn existential approximations (by piecewise polynomials) into approximation algorithms. But there’s still a lot more work to do. While this result gives us improper learning, we yearn for proper learning algorithms. For example, this algorithm lets us approximate a mixture of Gaussians using a piecewise polynomial, but can we output a mixture of Gaussians as our hypothesis instead? Looking at the sample complexity, the answer is yes, but we don’t know of any computationally efficient way to solve this problem yet. Regardless, there are many exciting directions to go, and I’m looking forward to where the authors will take this line of work!

-G

Clément Canonne on $L_p$-Testing, by Piotr Berman, Sofya Raskhodnikova, and Grigory Yaroslavtsev [1]

Almost all work in property testing of functions, if not all of it, is concerned with the Hamming distance between functions, that is, the fraction of inputs on which they disagree. Very natural when we deal for instance with Boolean functions $f\colon \{0,1\}^d \to \{0,1\}$, this distance becomes highly arguable when the codomain is, say, the real line: sure, two functions whose values differ by a minuscule amount at every point technically disagree on almost every single input, but should they be considered two completely different functions?

Grigory answered this question in the negative, and presented (joint work with Piotr Berman and Sofya Raskhodnikova [2]) a new framework for testing real-valued functions $f\colon D \to [0,1]$, less sensitive to this sort of annoying “technicalities” (i.e., noise). Instead of the usual Hamming/$L_0$ distance between functions, they suggest the more robust $L_p$ ($p \geq 1$) distance

$$d_p(f,g) = \frac{\|f-g\|_p}{\|\mathbf{1}\|_p} = \left( \frac{\sum_{x \in D} |f(x)-g(x)|^p}{|D|} \right)^{1/p}$$

(think of $D$ as being the hypercube $\{0,1\}^d$ or the hypergrid $[n]^d$, and $p$ as being 1 or 2. In this case, the denominator is just a normalizing factor $|D|$ or $\sqrt{|D|}$)
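A quick numerical illustration of the contrast between the two distances, as a sketch only: functions are represented as plain lists of values on a small domain, and the function names and the perturbation size are assumptions chosen for the example.

```python
def hamming_distance(f, g):
    """Fraction of inputs on which f and g disagree."""
    return sum(a != b for a, b in zip(f, g)) / len(f)


def lp_distance(f, g, p):
    """Normalized L_p distance: (sum_x |f(x) - g(x)|^p / |D|)^(1/p)."""
    if len(f) != len(g):
        raise ValueError("f and g must be defined on the same domain")
    total = sum(abs(a - b) ** p for a, b in zip(f, g))
    return (total / len(f)) ** (1.0 / p)


# f is defined on a domain of 8 points; g is f plus a tiny perturbation.
f_values = [x / 8 for x in range(8)]
g_values = [v + 2 ** -20 for v in f_values]

print(hamming_distance(f_values, g_values))  # they disagree everywhere
print(lp_distance(f_values, g_values, 1))    # yet they are very close in L_1
print(lp_distance(f_values, g_values, 2))    # and in L_2
```

Under Hamming distance the two functions are as far apart as possible, while under either $L_p$ distance they are nearly identical, which is the robustness the framework is after.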

Now, erm… why?

because it is much more robust to noise in the data;
because it is much more robust to outliers;
because it plays well (as a preprocessing step for model selection) with existing variants of PAC-learning under $L_p$ norms;
because $L_1$ and $L_2$ are pervasive in (machine) learning;
because they can.

Their results and methods turn out to be very elegant: to outline only a few, they

give the first example of a monotonicity testing problem (de facto, for the $L_1$ distance) where adaptivity provably helps; that is, a testing algorithm that selects its future queries as a function of the answers it previously got can outperform any tester that commits in advance to all its queries. This settles a longstanding question for testing monotonicity with respect to Hamming distance;
improve several general results for property testing, also applicable to Hamming testing (e.g. Levin’s investment strategy [3]);
provide general relations between sample complexity of testing (and tolerant testing) for various norms ($L_0, L_1, L_2, L_p$);
have quite nice and beautiful algorithms (e.g., testing via partial learning) for testing monotonicity and Lipschitz property;
give close-to-tight bounds for the problems they consider;
have slides in which the phrase “Big Data” and a mention of stock markets appear (!);
have an incredibly neat reduction between $L_1$ and Hamming testing of monotonicity.

I will hereafter only focus on the last of these bullets, one which really tickled my fancy (gosh, my fancy is so ticklish) — for the other ones, I urge you to read the paper. It is a cool paper. Really.

Here is the last bullet, in a slightly more formal fashion. Recall that a function $f$ defined on a partially ordered set is monotone if for all comparable inputs $x, y$ such that $x \preceq y$, one has $f(x) \leq f(y)$; and that a one-sided tester is an algorithm which will never reject a “good” function: it can only err on “bad” functions (that is, it may sometimes accept, with small probability, a function far from monotone, but will never reject a monotone function).

Theorem.
Suppose one has a one-sided, non-adaptive tester $T$ for monotonicity of Boolean functions $f\colon D \to \{0,1\}$ with respect to Hamming distance, with query complexity $q$. Then the very same $T$ is also a tester for monotonicity of real-valued functions $f\colon D \to [0,1]$ with respect to $L_1$ distance.

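Almost too good to be true: we can recycle testers! How? The idea is to express our real-valued $f$ as some “mixture” of Boolean functions, and use $T$ as if we were accessing these. More precisely, let $f\colon D \to [0,1]$ be a function which one intends to test for monotonicity. For all thresholds $t \in [0,1]$, the authors define the Boolean function $f_{(t)}\colon D \to \{0,1\}$ by

$$f_{(t)}(x) = \begin{cases} 1 & \text{if } f(x) \geq t, \\ 0 & \text{otherwise.} \end{cases}$$

All these $f_{(t)}$ are Boolean; and one can verify that for all $x \in D$, $f(x) = \int_0^1 f_{(t)}(x)\,dt$. Here comes the twist: one can also show that the $L_1$ distance of $f$ to monotone satisfies

$$d_1(f, \mathcal{M}) = \int_0^1 d_0(f_{(t)}, \mathcal{M})\,dt$$

(writing $\mathcal{M}$ for the set of monotone functions), i.e. the $L_1$ distance of $f$ to monotone is the integral of the Hamming distances of the $f_{(t)}$’s to monotone. And by a very simple averaging argument, if $f$ is far from monotone, then at least one of the $f_{(t)}$’s must be…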
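
The pointwise identity (a function equals the integral of its thresholds) is easy to check by machine. The sketch below assumes, purely for exactness, that every value of the function is a multiple of $1/m$, so that the integral over thresholds becomes a finite average; the helper names and the toy function are hypothetical.

```python
from fractions import Fraction


def threshold(f, t):
    """Boolean threshold function: 1 where f(x) >= t, 0 elsewhere."""
    return [1 if value >= t else 0 for value in f]


def integrate_thresholds(f, m):
    """Integral over t in [0, 1] of the threshold functions of f.

    Assumes every value of f is a multiple of 1/m. Then the threshold
    function is the same for all t in ((j-1)/m, j/m], so the integral is
    exactly the average of the m threshold functions at t = j/m.
    """
    total = [Fraction(0)] * len(f)
    for j in range(1, m + 1):
        f_t = threshold(f, Fraction(j, m))
        total = [acc + Fraction(bit, m) for acc, bit in zip(total, f_t)]
    return total


# A toy function on a 5-point line with values in {0, 1/4, 1/2, 3/4, 1}.
toy_f = [Fraction(1, 4), Fraction(3, 4), Fraction(1, 2), Fraction(0), Fraction(1)]
recovered = integrate_thresholds(toy_f, 4)
print(recovered == toy_f)
```

The stronger statement about distances to monotonicity is not checked here, since it requires computing the closest monotone function.
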
How does that help? Well, take your favorite Boolean, Hamming one-sided (non-adaptive) tester for monotonicity, $T$: being one-sided, it can only reject a function if it has some “evidence” it is not monotone, that is, if it sees some violation: a pair $x, y$ with $x \preceq y$ but $f(x) > f(y)$.

Feed this tester, instead of the Boolean function it expected, our real-valued $f$; as one of the $f_{(t)}$’s is far from monotone, say $f_{(t^\ast)}$, our tester would reject $f_{(t^\ast)}$; so it would find a violation of monotonicity by $f_{(t^\ast)}$ if it were given access to $f_{(t^\ast)}$. But being non-adaptive, the tester does exactly the same queries on $f$ as it would have done on this $f_{(t^\ast)}$! And it is not difficult to see that a violation for $f_{(t^\ast)}$ is still a violation for $f$ (if $f_{(t^\ast)}(x) = 1$ and $f_{(t^\ast)}(y) = 0$, then $f(x) \geq t^\ast > f(y)$): so the tester finds a proof that $f$ is not monotone, and rejects.
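Here is a toy rendition of that argument on the line, as a sketch under stated assumptions: the “tester” is a naive one that fixes random pairs in advance and looks for a violated pair (a hypothetical stand-in, not any particular tester from the literature), and the function and seed are arbitrary. The point is only that the queries do not depend on the function, and that every violation seen on a threshold function is also a violation for the real-valued function.

```python
import random
from itertools import combinations


def choose_queries(n, num_pairs, seed):
    """Non-adaptive: the pairs x < y are fixed before any value is read."""
    rng = random.Random(seed)
    return [tuple(sorted(rng.sample(range(n), 2))) for _ in range(num_pairs)]


def violations(f, pairs):
    """One-sided evidence: queried pairs x < y with f(x) > f(y)."""
    return [(x, y) for x, y in pairs if f[x] > f[y]]


def threshold(f, t):
    """Boolean threshold function: 1 where f(x) >= t, 0 elsewhere."""
    return [1 if value >= t else 0 for value in f]


# A real-valued function on a 6-point line that is far from monotone.
real_f = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7]
queries = choose_queries(len(real_f), num_pairs=10, seed=0)

# Whatever the threshold, the same queries are made, and anything that
# convicts the Boolean function also convicts the real-valued one.
for t in (0.25, 0.5, 0.75):
    boolean_evidence = violations(threshold(real_f, t), queries)
    real_evidence = violations(real_f, queries)
    print(t, set(boolean_evidence) <= set(real_evidence))
```

The converse inclusion can fail: a pair may violate monotonicity for the real-valued function without being separated by a given threshold, which is why the argument goes through one well-chosen threshold rather than all of them.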

Wow.

— Clément.

Final, small remark: one may notice a similarity between $L_1$ testing of functions $f\colon [n] \to [0,1]$ and the “usual” testing (with relation to total variation distance, $d_{\mathrm{TV}} \propto L_1$) of distributions $D\colon [n] \to [0,1]$. There is actually a quite important difference, as in the latter the distance is not normalized by $n$ (because distributions have to sum to $1$ anyway). In this sense, there is no direct relation between the two, and the work presented here is indeed novel in every respect.

Edit: thanks to Sofya Raskhodnikova for spotting an imprecision in the original review.

[1] Slides available here: http://grigory.github.io/files/talks/BRY-STOC14.pptm
[2] http://dl.acm.org/citation.cfm?id=2591887
[3] See e.g. Appendix A.2 in “On Multiple Input Problems in Property Testing”, Oded Goldreich. 2013. http://eccc-preview.hpi-web.de/report/2013/067/