Efficient Distribution Estimation 4: Greek Afternoon – STOC 2014 Recaps (Part 7)

This post is the finale on our series on the Efficient Distribution Estimation workshop. See also, part 1, part 2, and part 3.

After one last caffeination, we gathered once more for the exciting finale to the day — we were in for a very Greek afternoon, with back-to-back tutorials by Ilias Diakonikolas and Costis Daskalakis. As a wise man once said a few blog posts ago, “if you can’t beat them… change what ‘them’ means”. That was the running theme of this session – they showed us how to exploit the structure of a distribution to design more efficient algorithms.

Gautam Kamath on Beyond Histograms: Exploiting Structure in Distribution Estimation

Ilias focused on the latter of these – in particular, $k$ -flat distributions. A $k$ -flat distribution can be described by $k$ intervals, over which the probability mass function is constant – it literally looks like $k$ flat pieces. Warm up: what if we knew where each piece started and ended? Then the learning problem would be easy: by grouping samples that fall into each interval together, we reduce the support size from $n$ to $k$ . Now we can use the same algorithm as for the general case – only this time, it takes $O(k/\varepsilon^2)$ samples.

But we’re rarely lucky enough to know the correct intervals. What should we do? Guess the intervals? Too expensive. Guess the intervals approximately? A bit better, but still pricey. Make the problem a bit easier and allow ourselves to use $k/\varepsilon$ intervals, instead of only $k$ ? This actually works — while lesser mortals would be satisfied with this solution and call it a day, Ilias refused to leave us with anything less than the grand prize. Can we efficiently output a $k$ -flat distribution using only $O(k/\varepsilon^2)$ samples?

Yes, we can, using some tools from Vapnik-Chervonenkis (VC) theory. The VC inequality is a useful tool which allows us to relate the empirical distribution and the true distribution, albeit under a weaker metric than the total variation distance. Skipping some technical details, the key idea is that we want to output a $k$ -flat distribution that is close to the empirical distribution under this weaker metric. Using the triangle inequality, specific properties of this metric, and the fact that we’re comparing two $k$ -flat distributions, this will give us a $k$ -flat distribution which is close to the target in total variation distance, as desired. Computing a $k$ -flat distribution that’s close to the empirical one isn’t actually too tough – a careful dynamic program works.

We can learn $k$ -flat distributions – so what? This class might strike you as rather narrow, but this result leads to algorithms for a variety of classes of distributions, including monotone, $t$ -modal, monotone hazard rate, and log-concave distributions. These classes are all close to $k$ -flat, and this algorithm is fine with that. In this sense, this tool captures all these classes at the same time — One algorithm to rule them all, so to speak. This algorithm even directly generalizes to mixtures of these distributions, which is huge — studying mixtures usually makes the problem much more difficult.

Alright, but what’s the catch? Not all distributions are that close to $k$ -flat. For example, this algorithm requires $\tilde O(1/\varepsilon^3)$ samples to learn a log-concave distribution, even though the optimal sample-complexity is $O(1/\varepsilon^{5/2})$ . It turns out that log-concave distributions are close to $k$ -linear, rather than $k$ -flat, and we must use a $k$ -linear approximation if we want a near-optimal sample complexity.

“Please sir, I want some more.”

You can have it, Oliver — it’s actually possible to learn distributions which are close to mixtures of piecewise polynomials! But this topic is juicy enough that it deserves its own blog post.

Open problems

The perennial question – what can we do in high dimensions?
Most of these results are fundamentally improper – they approximate a distribution with a distribution which may be from a different class. Can these techniques lead to computationally efficient proper learning algorithms, where we output a hypothesis from the same class? We already know sample-efficient algorithms, the trouble is the running time.

Gautam Kamath on Beyond Berry Esseen: Structure and Learning of Sums of Random Variables

Finally, we had Costis, who talked about sums of random variables. We would like to understand a class of distributions $\mathcal{D}$ in three (highly related) ways:

Structure: What “simpler” class of distributions approximates $\mathcal{D}$ ?
Covering: Can we generate a small set of distributions $S$ , such that for any distribution $D \in \mathcal{D}$ , we have a distribution $D^* \in S$ such that $D^*$ is $\varepsilon$ -close to $D$ ?
Learning: Given sample access to a distribution $D \in \mathcal{D}$ , can we output a distribution $D^*$ such that $D^*$ is $\varepsilon$ -close to $D$ ?

Understanding the structure often implies covering results, since we can usually enumerate the simpler class of distributions. And covering implies learning, since generic results allow us to find a “good” distribution from $S$ at the cost of $O(\log |S|)$ samples. It’s not that easy though, since either the details are not obvious, or we’d like to go beyond these generic results.

Consider Poisson Binomial Distributions*, or PBDs for short. A PBD is the sum $X = \sum_i X_i$ of $n$ independent Bernoulli random variables, where $X_i$ is $\mathrm{Bernoulli}(p_i)$ . This is like a Binomial distribution on steroids: the Binomial distribution is the special case when all $p_i$ are equal.

PBDs pop up everywhere in math and computer science, so there’s a plethora of classical structural results – with a few catches. For example, check out a result by Le Cam, which approximates a PBD by a Poisson distribution with the same mean:

$d_{TV}\left(\sum_i X_i, \mathrm{Poisson}\left(\sum_i p_i\right)\right) \leq \sum_i p_i^2$

Catch one: this structural result describes only a small fraction of PBDs – it gives a good approximation when the $p_i$ values are really small. Catch two: we’re approximating a PBD with a Poisson, a distribution from a different family – we’d ideally like to have a proper approximation, that is, approximate the PBD with another PBD.

Recent results from our community show that you can get around both of these issues at once (woo, chalk one up for the computer scientists!). Structurally, we know the following: any PBD is $\varepsilon$ -close to either a Binomial distribution or a shifted PBD with the parameter $n$ replaced by $O(1/\varepsilon^3)$ — a much smaller number of “coin flips”. While the original distribution had $n$ different $p_i$ parameters, the approximating distribution has only $O(1/\varepsilon^3)$ parameters. Since $n$ is usually way bigger than $1/\varepsilon$ , this is a huge reduction in complexity. By carefully enumerating the distributions in this structural result, we can get a small cover. Specifically, the size of the cover is polynomial in $n$ and quasi-polynomial in $1/\varepsilon$ , while the naive cover is exponential in $n$ . Now, using the generic reduction mentioned before, we can learn the distribution with $\mathrm{polylog}(n)$ samples. Again though, let’s pretend that $n$ is so big that it makes André the Giant feel insecure – can we take it out of the picture? If you’ve been paying any attention to my rhetorical style, you might predict the answer is yes (and you’d be right): $O(1/\varepsilon^2)$ samples suffice for learning.

Is that the end of the story? No, we need to go deeper — let’s generalize PBDs once more: Sums of Independent Integer Random Variables (SIIRVs)**. A $k$ -SIIRV is the sum $X = \sum_i X_i$ of $n$ independent random variables ( $k$ -IRVs), where each $X_i \in \{1, \dots, k\}$ . Note that when $k = 2$ , this is essentially equivalent to a PBD. PBDs are actually quite tame, and have many nice properties – for example, they’re unimodal and log-concave. However, even a $3$ -SIIRV is far from being $k$ -modal or log-concave. After some careful analysis and modular arithmetic, it turns out that a $k$ -SIIRV with sufficiently large variance is approximated by $cZ + Y$ , where $c$ is some integer $\in \{1, \dots, k-1\}$ , $Z$ is a discretized Gaussian, $Y$ is a $c$ -IRV, and $Y$ and $Z$ are independent. On the other hand, if the variance is not large, the distribution has a sparse support – it can be approximated by a $(k/\varepsilon)$ -IRV. This structural observation leads to a learning result with sample complexity which is polynomial in $k$ and $1/\varepsilon$ , but once again independent of $n$ .

A particularly sinister 3-SIIRV, far from being 3-modal or log-concave. Image taken from Costis’ slides.

Open problems

Current results for PBDs are quasi-polynomial in $1/\varepsilon$ , for both the cover size and the running time for the proper learning result. Can we make this polynomial?
For $k$ -SIIRVs, our current understanding is fundamentally improper. Can we obtain proper covers or learning algorithms for this class?
What about higher dimensions? We have some preliminary results on the multidimensional generalization of PBDs (joint work with Costis and Christos Tzamos), but we’re totally in the dark when it comes to generalizing $k$ -SIIRVs.

-G

*For maximum confusion, Poisson Binomial Distributions have nothing to do with Poisson Distributions, except that they were both named after our good friend Siméon.
**For another perspective on the SIIRV result, see this post by my partner in crime, Clément.

[1] Learning k-modal distributions via testing, Daskalakis, Diakonikolas, Servedio (2012).
[2] Approximating and Testing k-Histogram Distributions in Sub-Linear time, Indyk, Levi, Rubinfeld (2012).
[3] Learning Poisson Binomial Distributions, Daskalakis, Diakonikolas, Servedio (2012).
[4] Learning Mixtures of Structured Distributions over Discrete Domains, Chan, Diakonikolas, Servedio, Sun (2013).
[5] Testing k-modal Distributions: Optimal Algorithms via Reductions, Daskalakis, Diakonikolas, Servedio, Valiant, Valiant (2013).
[6] Learning Sums of Independent Integer Random Variables, Daskalakis, Diakonikolas, Servedio (2013).
[7] Efficient Density Estimation via Piecewise Polynomial Approximation, Chan, Diakonikolas, Servedio, Sun (2013).
[8] Sparse Covers for Sums of Indicators, Daskalakis, Papadimitriou (2013).
[9] Faster and Sample Near-Optimal Algorithms for Proper Learning Mixtures of Gaussians, Daskalakis, Kamath (2014).
[10] Near-optimal-sample estimators for spherical Gaussian mixtures, Acharya, Jafarpour, Orlitsky, Suresh (2014).

Spaghetti code crypto – STOC 2014 Recaps (Part 6)

We return to our STOC recap programming with Justin Holmgren, who will tell us about an exciting new result in program obfuscation. (However, he doesn’t tell us how to say “indistinguishability obfuscation” 10 times fast).

Justin Holmgren on How to Use Indistinguishability Obfuscation: Deniable Encryption, and More, by Amit Sahai and Brent Waters

One very recent topic of study in cryptography is program obfuscation – that is, how to make a program maximally unintelligible, while still preserving the functionality. We will represent programs as circuits, which for most purposes is equivalent to any other representation. At some level, obfuscation of a program seems like it should be an easy task – any programmer knows that it’s easy to end up with unreadable spaghetti code, but theoretical treatment of the issue was lacking, and finding a construction with provable security properties has proven quite difficult.

In 2001, Barak et al. [1] precisely formulated black-box obfuscation, which says that if you can compute something given the obfuscated program, you could also compute it given black-box access to the program. This is a natural and very strong notion of obfuscation. If we could obfuscate programs according to this definition, we could easily solve many difficult problems in cryptography. For example, a public-key fully homomorphic encryption scheme could be constructed from an ordinary symmetric-key encryption scheme by obfuscating a program which decrypts two input ciphertexts, and outputs the encryption of their NAND. Unfortunately, [1] also showed that black-box obfuscation is impossible to achieve.

One weaker definition of obfuscation that [1] did not rule out is called indistinguishability obfuscation (iO). It requires that if two circuits $C_1$ and $C_2$ are functionally equivalent (for all $x$ , $C_1(x) = C_2(x)$ ), then their obfuscations are indistinguishable. In 2013, Garg et al. [2] gave the first candidate construction for iO. This construction is not proven secure under standard assumptions (e.g. the hardness of factoring), but no attacks on the construction are currently known, and it has been proven that certain types of generic attacks cannot possibly work.

It isn’t immediately clear whether iO is useful. After all, what information is the obfuscation really hiding if it only guarantees indistinguishability from functionally equivalent programs? “How to use Indistinguishability Obfuscation: Deniable Encryption and More”, by Amit Sahai and Brent Waters [3], shows several applications of iO to building cryptographic primitives, and proposes iO as a “hub of cryptography”, meaning that constructions which use iO as a building block would be able to benefit from improvements in the efficiency or underlying computational assumptions of iO constructions.

As a first example of how to use iO, they show how to construct an IND-CPA public key cryptosystem based on one-way functions and iO. This uses a few other basic and well-studied cryptographic primitives (it is known that all of them exist if one-way functions exist). First, a pseudorandom generator is a deterministic function which takes as input a random string $s$ , and computes a longer string $r$ which is indistinguishable from truly random. The length of $r$ can be polynomially larger than the length of $s$ , but we will let $r$ be twice the length of $s$ . A pseudorandom function family is parameterized by a secret key $k$ and an input $x$ , and produces outputs of some fixed length. Oracle access to $f_k(\cdot)$ for some random unknown $k$ is computationally indistinguishable from oracle access to a truly random function. The main special property that [2] require is a puncturable pseudorandom function family. That is, for a secret key $k$ and an input $x$ , a “punctured” key $k\{x\}$ can be computed. $k\{x\}$ allows one to compute $f_k(y)$ for all $y \neq x$ , but essentially reveals no information about $f_k(x)$ .

The power of iO comes from combining its indistinguishability guarantees with those of other cryptographic primitives, in what is known as the hybrid method. We will start with the security game played with a real scheme, and give a sequence of other games, each indistinguishable from the previous one to the adversary, such that the adversary obviously has no advantage in the last game. This will prove that the scheme is secure.

For example, a simple public key encryption scheme for one-bit messages (let $G$ be a PRG, and $F$ be a puncturable pseudorandom function family) would be the following.

The secret key $SK$ is just a puncturable PRF key $k$ .
The public key $PK$ is the obfuscation of an Encrypt program. Encrypt takes a message $m$ and randomness $r$ , and outputs $(G(r), F_k(G(r)) \oplus m)$ .
To decrypt a ciphertext $(c_1, c_2)$ given $SK$ , one just computes $F_k(c_1) \oplus c_2$ .

In the IND-CPA security game, there is a challenger and an adversary. The challenger generates the public and secret key, and gives the public key to the adversary. The challenger also generates an encryption of a random bit $b$ , and gives this to the adversary. The adversary wins if he can guess the bit that the challenger chose.

In the first hybrid, the challenger encrypts by choosing a truly random $\tilde{r}$ , and generating the ciphertext $(\tilde{r}, F_k(\tilde{r}) \oplus b)$ . This is indistinguishable by the security property of a pseudorandom generator.

In the next hybrid, the program Encrypt which is obfuscated and given to the adversary uses a punctured key $k\{\tilde{r}\}$ instead of the full PRF key $k$ . This is with high probability functionally equivalent, because $\tilde{r}$ is most likely not in the range of $G$ . So by the security of iO, this hybrid is indistinguishable from the previous one.

Next, the ciphertext is generated by choosing a truly random bit $p$ , and generating the ciphertext $(\tilde{r}, p \oplus b)$ . This is also indistinguishable to the adversary, because the punctured key reveals nothing about $F_k(\tilde{r})$ . That is, if these two hybrids were distinguishable, then given only $k\{\tilde{r}\}$ , one could distinguish between a truly random bit $p$ and the true value $F_k(\tilde{r})$ .

In this hybrid, it is clear (by analogy to a one-time pad) that no adversary can distinguish a ciphertext encrypting $0$ from a ciphertext encrypting $1$ . So the original scheme is also secure.

Sahai and Waters achieve more than just public-key encryption – they also achieved the long-standing open problem of publicly deniable encryption. This is an encryption system for which if $Encrypt(m ; r) = c$ , one can also compute a “fake” randomness $r'$ for any other message $m'$ such that $Encrypt(m'; r')$ is also equal to c, and $r'$ is indistinguishable from $r$ . The motivation is that after a message is encrypted, coercing the encrypter into revealing his message is futile – there is no way of verifying whether what he says is the truth. Note that of course the decrypter could be coerced to reveal his secret key, but there’s nothing that can be done about that.

The basic idea of the deniable encryption scheme is that the encryption public key is again an obfuscated program – this program takes a message and randomness string. If the randomness string has some rare special structure (it is a hidden sparse trigger), it is interpreted as encoding a particular ciphertext, and the program will output that ciphertext. Otherwise, the program encrypts normally. There is also a public obfuscated Explain program, which allows anybody to sample a random hidden sparse trigger, but not test whether or not a string is a hidden sparse trigger.

At present, iO obfuscation is known to imply some very powerful primitives, including for example multi-input functional encryption. It is still open to construct fully homomorphic encryption for iO, or even collision resistant hash functions from iO, or to show that no such reduction is possible.

[1] On the (Im)possibility of Obfuscating Programs, Barak, Goldreich, Impagliazzo, Rudich, Sahai, Vadhan, Yang, 2001.
[2] Candidate Indistinguishability Obfuscation, Garg, Gentry, Halevi, Raykova, Sahai, Waters, 2013.
[3] How to Use Indistinguishability Obfuscation: Deniable Encryption, and More, Sahai, Waters, 2014.

Efficient Distribution Estimation 3: Rocco’s Modern Life – STOC 2014 Recaps (Part 5)

The road twists, as Clément takes us down a slightly different path in the penultimate post of our Efficient Distribution Estimation workshop summary series. See also: Part 1 and Part 2.

Clément Canonne on A complexity theoretic perspective on unsupervised learning

And now, for something completely different… Rocco Servedio presented a (joint with Ilias Diakonikolas and Anindya De [1]) work, not dealing with distribution testing. Showing there is a world outside distribution testing, and that this world is filled with wonder, probabilities and Boolean functions. A world of inverse problems.

I will only try to give the main idea, maybe the main ideas, of the problem they consider, why it is hard, and how they overcame the difficulties. To start, fix an unknown Boolean function $f\colon\{0,1\}^n\to\{0,1\}$ in some class $\mathcal{C}$ — think of the class of linear threshold functions, for instance (LTFs, a.k.a. weighted majorities, or halfspaces). $f$ is unknown, but it brings with it the set of its satisfying inputs, $I\stackrel{\rm{}def}{=} f^{-1}(\{1\})$ ; and you can get uniform samples from this $I$ . You press a button, you get $x\in I$ .

Can you do this a few time, and then output a new device with its own button, that will generate on demand i.i.d. samples from some distribution $D$ close (in total variation distance, as usual) to the uniform distribution on $I$ ? In other terms, can you approximate the uniform distribution on $I$ , by observing i.i.d. samples from $I$ (but without knowing $f$ )?

Of course, one way to do this would be to use a learning algorithm for $\mathcal{C}$ (but using only what we get: only positive examples): learn an approximate version $\hat{f}$ of $f$ , and then study it “offline” to find out exactly what $\hat{f}^{-1}(\{1\})$ is, before spitting uniform samples from it. Clearly, if $\hat{f}$ is close enough to ${f}$ , this will work!

But does it always? Recall that here, we want to be close to uniform on $I$ , that is we only consider the error with relation to the positive examples of the function; while the general results for (PAC-)learning of functions under the uniform distribution only care about the overall error. This means $\hat{f}$ may need to be very close to $f$ for this approach to succeed… As an example, take $f$ to be $\mathrm{AND}_n$ , i.e. 0 everywhere but on the single point $x=1^n$ . The function $\hat{f}= \lnot\mathrm{OR}_n$ is a very good approximation: it has error only $1/2^{n-1}$ ; and yet, this would lead to a distribution supported on the single point $0^n$ , which is thus 1-far from the actual distribution (supported on $1^n$ )!

Intuitively, the problem here is that PAC-learning gives a sort of additive guarantee:

$\mathbb{P}\{\hat{f}(x) \neq f(x)\} \leq \varepsilon$

while we need some some of multiplicative one:

$\mathbb{P}\{\hat{f}(x) \neq f(x)\} \leq \varepsilon\cdot \mathbb{P}\{f(x) = 1\}$

which is much harder to obtain when $f$ has very few satisfying assignments.

This being said, the discussion above shows that when we are in the dense case, things look simpler: i.e., when $f$ has “many” (an inverse polynomial fraction of the hypercube) satisfying inputs. In this case, the idea is to combine two steps — and show they can be performed:
(1) SQ Learning: get a hypothesis $\hat{f}$ . Upshot: one has to show that it’s possible to use a SQ learner, even from positive examples only.
(2) Hypothesis testing: test if $\hat{f}$ is good enough (if not, try again)
before then using the now explicitly known “good” $\hat{f}$ to approximate the set of satisfying inputs of $f$ .

But then, once this “simple” (erm) case is dealt with, how to handle the general case, where $f$ could have an exponentially small fraction of satisfying inputs? Rocco explained one of their main ingredient — what they called a “densifier”, i.e. an algorithm that will start from positive examples from $f$ and output a “densified function” $g$ which agrees with $f$ on most of $f$ ‘s satisfying inputs, and does not have many more (so that $f^{-1}(\{1\})$ is “dense” in $g^{-1}(\{1\})$ ). Provided such a densifier exists (and can be efficiently simulated) for a class $\mathcal{C}$ , they show how to combine it with (1) and (2) above to iteratively “densify” the sparse set $f^{-1}(\{1\})$ and reduce to the “easy”, dense case.

Now that all the bricks are in place, the next step in order to play Lego is, of course, to show that such densifiers do exist… for interesting classes, and in particular classes who also have SQ learning algorithms (otherwise, brick (1) is still missing!). In the paper, the authors describe how to build such a densifier for (for instance) the class of LTFs. Combining this with the ingredients sketched above, this gives them a polynomial time algorithm for learning the uniform distribution over satisfying assignments to an unknown LTF. As another instantiation of their approach, the authors also give a densifier for DNF formulas and a resulting quasi-polynomial time distribution learning algorithm.

Two remarks here:

maybe surprisingly, the hardness of this problem stems from the Boolean setting — indeed, the same question, dealing with functions $f\colon\mathbb{R}^n\to\{0,1\}$ on the Gaussian space, becomes way easier;
the related “direct” problem (on input the full specification of $f$ , generate close-to-uniform samples from $f^{-1}(\{1\})$ ), is neither harder nor simpler: the authors provide examples where (under some suitable well-believed assumption) the former is easy and the latter hard, and vice-versa. In particular, if $f^{-1}(\{1\})$ is a finite cyclic group, getting even one positive example allows the algorithm to get easily the whole $f^{-1}(\{1\})$ “for free”.

[1] Inverse Problems in Approximate Uniform Generation, De, Diakonikolas, Servedio, 2012.

Efficient Distribution Estimation 2: The Valiant Sequel – STOC 2014 Recaps (Part 4)

This is a continuation of Efficient Distribution Estimation STOC workshop summary series by G and Clément (see the Part 1 here).

Gautam Kamath on Distributional Property Testing and Estimation: Past, Present, and Future

After lunch, we were lucky enough to witness back-to-back talks by the Valiant brothers, Greg and Paul. If you aren’t familiar with their work, first of all, how have you been dealing with all of your everyday property testing needs? But second of all, over the past few years, they’ve touched a number of classic statistical problems and left most of them with matching optimal upper and lower bounds.

Greg kicked off the show and focused on property estimation. Given a property of interest and access to independent draws from a fixed distribution $\mathcal{D}$ over $\{1, \dots, n\}$ , how many draws are necessary to estimate the property accurately?

Of course, this is an incredibly broad problem. For example, properties of a distribution which could be of interest include the mean, the $L^{3}$ norm, or the probability that the samples will spell out tomorrow’s winning lottery numbers. In order to tame this mighty set of problems, we focus on a class of properties which are slightly easier to understand — symmetric properties. These are properties which are invariant to permuting the labels on the support — for instance, if we decided to say 3 was now called 12 and vice-versa, the property wouldn’t change. Notable examples of symmetric properties include the support size*, entropy, and distance between two distributions. Still a broad class of properties, but surprisingly, we can make statements about all of them at once.

What’s the secret? For symmetric properties, it turns out that the fingerprint of a set of samples contains all the relevant information. The fingerprint of a set of samples $X$ is a vector $\mathcal{F}_X$ whose $i$ th component is the number of elements in the domain which occur exactly $i$ times in $X$ . The CliffNotes version of the rest of this post is that using linear programming on the fingerprint gives us an unreasonable amount of power**.

First off, Greg talked about a surprising dichotomy which demonstrates the power of linear estimators. A linear estimator of a property is an estimator which outputs a linear combination of the fingerprint. Given a property $\pi$ , a number of samples $k$ , and a distance $\varepsilon$ , exactly one of the following holds:

There exist distributions $y_1$ and $y_2$ such that $|\pi(y_1) - \pi(y_2)| > \varepsilon$ but no estimator (linear or not) that uses only $k$ samples can distinguish the two with probability $\geq 51\%$ .
There exists a linear estimator which requires $k(1 + o(1))$ samples and estimates the property to within $\varepsilon(1 + o(1))$ with probability $\geq 99\%$ .

In other words, if there exists an estimator for a property, there is a linear estimator which is almost as good. Even better, their result is constructive, for both cases! A linear program is used to construct the estimator, where the variables are the estimator coefficients. Roughly, this linear program attempts to minimize the bias of the estimator while keeping the variance relatively small. On the other hand, when we take the dual of this linear program, we construct two distributions which have similar fingerprint expectations, but differ in their value for the property — we construct an explicit counter-example to any estimator which claims to use $k$ samples and estimate the property to accuracy $\varepsilon$ .

But here’s the $64 question – do these estimators work in practice? Sadly, the answer is no, which actually isn’t that surprising here. The estimator is defined in terms of the worst-case instance for each property — in other words, it’s oblivious to the particular distribution it receives samples from, which can be wasteful in many cases.

Let’s approach the problem again. What if we took the fingerprint of our samples and computed the property for the empirical distribution? This actually works great – the empirical distribution optimally estimates the portion of the distribution which we have seen. The downside is that we have to see the entire distribution, which takes $\Omega(n)$ samples. If we want to break past this linear barrier, we must estimate the unseen portion. It turns out that, once again, linear programming saves the day. We solve a program which describes all distributions whose expected fingerprint is close to the observed fingerprint. It can be shown that the diameter of this set is small, meaning that any distribution in this space will approximate the target distribution for all symmetric properties at the same time! Furthermore, this estimator takes only $O(n/\log n)$ samples, which turns out to be optimal.

Again, we ask the same question – do these estimators work in practice? This time, the answer is yes! While performance plots are often conspicuously missing from theory papers, Greg was happy to compare their results to estimators used in practice — their approach outperformed the benchmarks in almost all instances and regimes.

Finally, Greg mentioned a very recent result on instance-by-instance optimal identity testing. Recall that in identity testing, given a known distribution $P$ and an unknown distribution $Q$ , we wish to test if they are equal or $\varepsilon$ -far (in total variation distance). As mentioned before, when $P$ is uniform, $\Theta(\sqrt{n}/\varepsilon^2)$ are necessary and sufficient for this problem. But what about when $P$ is $\mathrm{Bin}(n,\frac12)$ , or when $P$ assigns a probability to $i$ which is proportional to the $i$ th digit of $\pi$ — do we have to design an algorithm by scratch for every $P$ ?

Thankfully, the answer is no. The Valiants provide an algorithm which is optimal up to constant factors for all but the most pathological instances of $P$ . Somewhat unexpectedly, the optimal number of samples turns out to be $\Theta(\|P\|_{2/3}/\varepsilon^2)$ – aside from the fact it matches the uniform case, it’s not obvious why the $2/3$ -norm is the “correct” norm here. Their estimator is reasonably simple – it involves a small tweak to Pearson’s classical $\chi$ -squared test.

-G

Clément Canonne on Lower Bound Techniques for Statistical Estimation Tasks

At this point, Greg tagged out, allowing Paul to give us a primer in “Lower Bound Techniques for Statistical Estimation Tasks”: roughly speaking, the other side of Greg’s coin. Indeed, all the techniques used to derive upper bounds (general, systematic testers for symmetric properties) can also been turned into their evil counterparts: how to show no algorithm supposed to test a given property can be “too efficient”.

The dish is very different, yet the ingredients similar: without going into too much details, let’s sketch the recipe. In the following, keep in mind that we will mostly cook against symmetric properties, that is properties $\mathcal{P}$ invariant by relabeling of the domain elements: if $D\in\mathcal{P}$ , then for any permutation $\pi$ of $[n]$ one has $D\circ\pi\in\mathcal{P}$ .

elements do not matter: all information any tester can use can be found in the fingerprint of the samples it obtained. In other terms, we can completely forget the actual samples for the sake of our lower bound, and only prove things about their fingerprint, fingerprint of fingerprint, and so on.
Poissonization is our friend: if we take $m$ samples from a distribution, then the fingerprint, a tuple of $m$ integer random variables $X_1,\dots X_m$ , lacks a very nice feature: the $X_i$ ‘s are not independent (they do kind of have to sum to $m$ , for instance). But if instead of this, we were to take $m^\prime\sim\mathrm{Poisson}(m)$ samples, then the individual components of the fingerprint will be independent! Even better, many nice properties of the Poisson distribution will apply: the random variable $m^\prime$ will be tightly concentrated around its mean $m$ , for instance. (this applies whether the property is symmetric or not. When in doubt, Poissonize. Independence!)
Central Limit Theorems: the title ain’t witty here, but the technique is. As a very high-level, imagine we get to random variables $T, S$ corresponding to the “transcript” of what a tester/algorithm sees from taking $m$ samples from two possible distributions it should distinguish. We would like to argue that the distributions of $T$ and $S$ are close, so close (in a well-defined sense) that it is information-theoretically impossible to do so. This can be very difficult; a general idea would be to prove that there exists another distribution $G$ , depending on very few parameters — think of a Gaussian: only the mean and (co)variance (matrix) — such that both $S$ and $T$ are close to $G$ . Then, by the triangle inequality, they must be close, and we are done.Sweeping a lot of details under the rug, what Paul presented is this very elegant idea of proving new Central Limit Theorems that exactly give that: theorems that state that asymptotically (with quantitative bounds on the convergence rate), some class of distributions of interest will “behave like” a multivariate Gaussian or Multinomial with the “right parameters” — where “behave like” refers to a convient metric on distributions (e.g. Earthmover or Total Variation), and “right parameters” means “as few as possible”, in order to keep as many degrees of freedom when designing the distributions of $S$ and $T$ ).Proving such CLT’s is no easy task, but — on the bright side — they have many applications, even outside the field of distribution testing.

There would be much, much more to cover here, but I’m past 600 words already, and was only allowed $\mathrm{Poisson}(500)$ .

-Clément

*In order to avoid pathological cases for estimating support size, we assume none of the probability are in the range $(0,\frac{1}{n})$ .

**At least, when channeled by sufficiently talented theoreticians.

[1] Estimating the unseen: A sublinear-sample canonical estimator of distributions, Valiant, Valiant (2010).
[2] A CLT and tight lower bounds for estimating entropy, Valiant, Valiant (2010).
[3] Estimating the Unseen: An n/log(n)-sample Estimator for Entropy and Support Size, Shown Optimal via New CLTs, Valiant, Valiant (2011).
[4] Estimating the Unseen: Improved Estimators for Entropy and other Properties, Valiant, Valiant (2013).
[5] An Automatic Inequality Prover and Instance Optimal Identity Testing, Valiant, Valiant (2014).
[6] Testing Symmetric Properties of Distributions, Valiant (2011).
See also, Greg’s TCS+ talk, which covers [1,2,3,4].

Algebraic circuit lower bounds bonanza – STOC 2014 Recaps (Part 3)

Pritish Kamath has kindly agreed to survey the exciting flurry of algebraic complexity theory results in STOC, which, to many of us, is in a permanent state of mystery. But Pritish did some valiant depth-reduction on the state of affairs in the area, and I hope you’ll appreciate his determined effort.

Pritish Kamath on the latest and greatest in algebraic circuit lower bounds

The most natural and intuitive way to compute a polynomial is via an arithmetic circuit. In this model the inputs are variables $x_1, x_2, \cdots, x_n$ and the computation is performed using the operations $+$ and $\times$ (where we allow the $+$ gates to take arbitrary linear combinations of its inputs). The output of the circuit is a formal polynomial over the base field $\mathbb{F}$ (as opposed to functions) computed by boolean circuits).

The size of an arithmetic circuit is defined as the number of $+$ and $\times$ gates in the circuit, and is typically measured with respect to the number of variables. Proving super-polynomial lower bounds on the size of arithmetic circuits computing any explicit polynomial is one of the most important open problems in the area of algebraic complexity theory. The best known lower bound as of today is a modest $\Omega(n \log n)$ due to Baur-Strassen [BS83] from 30 years ago!

Depth-reduction results

Since proving lower bounds for general arithmetic circuits is such a daunting task, it is natural to focus on proving lower bounds for restricted models. One such restriction is of constant-depth circuits. Although this seems like a severe restriction, a surprising result due to Agrawal-Vinay [AV08], subsequently strengthened by Koiran [Koi12] and Tavenas [Tav13] showed that if a polynomial is computed by poly-sized circuits, then they are in fact computed by not-so-large-sized depth-4 circuits. In particular, they showed the following theorem,

Theorem.[AV08, Koi12, Tav13]
Let $p_n(\mathbf{x}) \in \mathbb{F}[\mathbf{x}]$ be a polynomial over $n$ -variables, of degree $d = n^{O(1)}$ . If there is a $\mathrm{poly}(n)$ sized arithmetic circuit computing $p_n(\mathbf{x})$ , then there is a hom- $\Sigma\Pi\Sigma\Pi[\sqrt{d}]$ -circuit of size $2^{O(\sqrt{d} \log n)}$ computing $p_n$ .

hom- $\Sigma\Pi\Sigma\Pi[\sqrt{d}]$ denotes homogeneous depth-4 circuits with fan-in of bottom product gates bounded by $\sqrt{d}$ . The contrapositive of the above depth-reduction result says that proving a $2^{\omega(\sqrt{d} \log n)}$ lower bound on the size of hom- $\Sigma\Pi\Sigma\Pi[\sqrt{d}]$ -circuits computing $p_n(\mathbf{x})$ , suffices to prove that $p_n(\mathbf{x})$ cannot be computed by $n^{O(1)}$ -sized arithmetic circuits!

Results from STOC 2014, before and after…

Proving lower bounds for hom- $\Sigma\Pi\Sigma\Pi[\sqrt{d}]$ -circuits seems like a conceptually much simpler task than proving lower bounds for general circuits. Using a new technique of the shifted partial derivatives, Gupta et al [GKKS13] showed that hom- $\Sigma\Pi\Sigma\Pi[\sqrt{d}]$ -circuits computing the $d \times d$ determinant ( $\mathrm{Det}_d$ ) requires size $2^{\Omega(\sqrt{d})}$ , which is only a $\omega(\log n)$ factor away (in the exponent), from the depth-reduction result!

Thus, after the above result was established, there were two possible approaches for proving super-polynomial lower bounds for general circuits,

Improve the lower bound on hom- $\Sigma\Pi\Sigma\Pi[\sqrt{d}]$ -circuits
Improve the depth-reduction result from general circuits to hom- $\Sigma\Pi\Sigma\Pi[\sqrt{d}]$ -circuits

Post-[GKKS13], there was a lot of excitement and a flurry of activity followed, proving stronger lower bounds (in fact the improved depth reduction of Tavenas [Tav13] came after [GKKS13]). The magnitude of this activity is well illustrated by the fact that there were a sequence of four papers in STOC 2014, each showing significant improvements over the previous ones, all produced in a span of just 6 months or so!

[KSS14] A super-polynomial lower bound for regular arithmetic formulas

In this paper, Kayal-Saha-Saptharishi prove a lower bound of $2^{\Omega(\sqrt{d} \log n)}$ on size of hom- $\Sigma\Pi\Sigma\Pi[\sqrt{d}]$ -circuits computing an explicit polynomial (neatly constructed using Nisan-Wigderson designs), denoted by $\mathrm{NW}_{n,d}$ . After this result, we are only a $\omega(1)$ factor away (in the exponent) from our goal!

[FLMS14] Lower bounds for depth 4 formulas computing iterated matrix multiplication

In this paper, Fournier-Limaye-Malod-Srinivasan prove a lower bound of $2^{\Omega(\sqrt{d} \log n)}$ on size of hom- $\Sigma\Pi\Sigma\Pi[\sqrt{d}]$ -circuits computing the iterated matrix product of $d$ generic $n \times n$ matrices, denoted by $\mathrm{IMM}_{n,d}$ . This result improved upon [KSS14] in the sense that $\mathrm{IMM}_{n,d}$ is in $\mathrm{VP}$ , whereas $\mathrm{NW}_{n,d}$ was only in $\mathrm{VNP}$ . This shows that the depth-reduction of [Tav13] is in fact tight, and hence rules out approach (b) stated above!

[KS14a] The Limits of Depth Reduction for Arithmetic Formulas: It’s all about the top fan-in

Taking it one more step further than [FLMS14], Kumar-Saraf prove a $2^{\Omega(\sqrt{d} \log n)}$ lower bound for an explicit polynomial that can be computed by $\mathrm{poly}(n)$ -sized formulas, thus showing that the depth-reduction of Tavenas is in fact tight even for formulas (formulas are weaker models than circuits, in that, the fan-out of any gate in a formula is at most one; thus, formulas are representable by trees, whereas circuits are represented by directed acyclic graphs).

Another question that had been tantalizingly open for a long time was to prove super-polynomial lower bounds on the size of general hom- $\Sigma\Pi\Sigma\Pi$ -circuits (that is, homogenous depth-4 circuits without the bottom fan-in restriction). Towards this direction, Kumar-Saraf considered the problem of proving lower bounds on size of hom- $\Sigma\Pi\Sigma\Pi$ -circuits with bounded top fan-in, and in this paper, they also prove a super-polynomial lower bound on size of such circuits where the top fan-in is $o(\log n)$ .

[KLSS14] Super-polynomial lower bounds for depth-4 homogeneous arithmetic formulas

And finally, Kayal-Limaye-Saha-Srinivasan prove a super-polynomial lower bound of $n^{\Omega(\log n)}$ on the size of general hom- $\Sigma\Pi\Sigma\Pi$ -circuits computing $\mathrm{IMM}_{n,d}$ . Interestingly, Kumar-Saraf in a different paper [KS13] had independently proved a weaker lower bound of $n^{\Omega(\log \log n)}$ on the same circuit model computing a Nisan-Wigderson type polynomial (similar to the one in [KSS14]).

Subsequent results

Now that super-polynomial lower bounds for hom- $\Sigma\Pi\Sigma\Pi$ -circuits had been shown, it remained open to prove exponential lower bound (such as $2^{\Omega(\sqrt{d} \log n)}$ ) on such circuits. Since the STOC deadline in November 2013, more improvements have come out!

Kayal-Limaye-Saha-Srinivasan [KLSS14] have proved a lower bound of $2^{\Omega(\sqrt{d} \log n)}$ on the size of hom- $\Sigma\Pi\Sigma\Pi$ -circuits computing an explicit Nisan-Wigderson type polynomial (similar to the ones in [KSS14] and [KS13]).

Completing this picture, Kumar-Saraf [KS14b] have proved a similar lower bound of $2^{\Omega(\sqrt{d} \log n)}$ on the size of hom- $\Sigma\Pi\Sigma\Pi$ -circuits computing $\mathrm{IMM}_{n,d}$ , which shows that the depth-reduction of [Tav13] is tight even if the reduction was to general $\Sigma\Pi\Sigma\Pi$ -circuits (without bottom fan-in condition).

Perhaps we will see these papers in FOCS 2014. ;)

Bird’s eye view of the techniques

To prove lower bounds on any class $\mathcal{C}$ of circuits, a natural idea (pun intended!) used in most lower bound techniques has been to introduce a measure $\Gamma : \mathbb{F}[\mathbf{x}] \rightarrow \mathbb{Z}_+$ such that $\Gamma(p)$ is large for a certain polynomial $p$ , but $\Gamma(C)$ is small for any small-sized $C \in \mathcal{C}$ , and hence no small-sized circuit in $\mathcal{C}$ can compute $p$ .

The main technique used in all the papers above are certain variants of the measure of shifted-partial derivatives technique introduced in [Kay12] and [GKKS13], inspired by the measure of partial derivatives introduced by Nisan-Wigderson [NW95]. The measure is defined as follows,

$\Gamma_{k, \ell}(f) = \mathrm{dim}(\mathbf{x}^{\le \ell} \partial^{=k} f)$

Essentially, this measure looks at the dimension of the sub-space spanned by the polynomials obtained by multiplying monomials of degree at most $\ell$ to all $k$ -th order partials derivatives of $f$ (where we think of polynomials as a vector of coefficients).

The other new ideas used in the papers described has been the idea of random restrictions (somewhat seen in the context of Håstad’s switching lemma), and of looking at novel variants of shifted partial derivatives by means of projections onto a smaller space of polynomials. In the interest of length of this post, we won’t go into those details here.

Efficient Distribution Estimation – STOC 2014 Recaps (Part 2)

The STOC 2014 saga continues (check out Part 1 here), with Clément Canonne and Gautam “G” Kamath covering the day-long workshop on Efficient Distribution Estimation. There’s a wealth of interesting stuff from the workshop, so we’ve broken it up into episodes. Enjoy this one, and stay tuned for the next installment!

Clément Canonne on Taming distributions over big domains

Talk slides here

Before STOC officially began, before the talks and the awe and the New Yorker Hotel and all the neat and cool results the reviews on this blog will try to give an idea and a sense of, before even this very long and cumbersome and almost impossible — almost, only? to understand sentence, before all this came Saturday.
Workshops.

Three simultaneous sessions, all three very tempting: one covering “Recent Advances on the Lovasz Local Lemma”, one giving tools and insight on how to “[overcome] the intractability bottleneck in unsupervised machine learning”, and finally one — yay! — on “Efficient Distribution Estimation“. I will hereby focus on the third one, organized by Ilias Diakonikolas and Gregory Valiant, as it is the one I attended, and in particular 3 of the 6 talks of the workshop; G (Gautam) will review the other ones, I bet with shorter sentences and more relevance.

It all started with Ronitt Rubinfeld, from MIT and Tel Aviv University, describing the flavour of problems this line of work was dealing with, what exactly were the goals, and why should one care about this. Imagine we have samples from a space of data — a Big space of Data; whether it be records from car sales, or the past ten years of results for the New Jersey lottery, or even logs of network traffic; and we want to know something about the underlying distribution: has the trend in cars changed recently? Is the lottery fixed? Is someone doing unusual requests to our server?

How can we tell?

Let us rephrase slightly the problem: there exists a probability distribution $D$ on a set of $n$ elements $[n]=\{1,\dots,n\}$ ; given i.i.d. (independent, identically distributed) samples from $D$ , can we decide with as few of these samples as possible if, for instance,

$D$ is uniform, or “far”* from it?
$D$ is equal to a known, fully specified distribution (say, the distribution of car sales from 10 years ago), or far from it?
has $D$ big entropy, or not?

Of course, there exist many tools in statistics for this type of tasks: the Chi-Square test, for instance. Yet, they usually have a fatal flaw: namely, to give good guarantees they require a number of samples at least linear in $n$ . And $n$ is big. Like, huge. (and anyway, one could always actually learn $D$ to high accuracy with $O(n)$ samples) No, the question is really can one do anything interesting with $o(n)$ samples?

As Ronitt explained: yes, one can. Testing uniformity can be done with (only!) $O(\sqrt{n})$ samples**; one can test identity to a known distribution with $O(\sqrt{n})$ as well; and testing whether two unknown distributions are equal or far from each other only requires $O(n^{2/3})$ samples from each. All these results (see e.g. [1] for a survey of this huge body of research) are quite recent; and come from works in TCS spanning the last 15 years or so.

“And they saw they could test distributions in $o(n)$ samples; and they saw that this was good, and they also saw that most of these results were tight”. But as always, the same question arose: can we do better? This may sound strange: if the sample complexities are tight, how could we even hope to do better? If there is a lower bound, how can we even ask to beat it?

This was the second part of Ronitt’s talk: bypassing impossibility results. She presented two different approaches, focusing on the second:

“if you can’t beat them… change what ‘them’ means”. Indeed, getting better algorithms for general, arbitrary distributions may be impossible; but in practice, things are rarely arbitrary. What if we knew something more about this unknown $D$ — for instance, it is a mixture of simple distributions? Or it has a particular “shape” (e.g., its probability mass function is monotone)? Or it is obtained by summing many “nice” and simple random variables? Under these assumptions, can the structure of $D$ leveraged to get better, more sample efficients algorithms? (in short: yes. See for instance [2], where the sample complexity, from $n^c$ in general, becomes $\mathrm{poly}(\log n)$ )
“if playing by the rules is too hard, change the rules”. All the results and lower bound above suppose the algorithms have (only) access to i.i.d. samples from the distribution $D$ . But in many cases, they could get more: for instance, there are scenarii where a tester can focus on some subset of the domain, and only get samples from it (think about “stratified sampling”, or just the situation where a pollster targets in particular voters between age 25 and 46). Given this new type of access to the “conditional distributions” induced by $D$ , can algorithms be more efficient? And if the data comes from a known, preprocessed (even sorted!) dataset, as it is the case for example with Google and the $n$ -grams (frequencies and statistics of words in a vast body of English literature), the testing algorithm can actually ask the exact frequency (equivalently, the probability value) of any given element of its choosing, in addition to just sampling a random word! This type of access, with additionial or modified types of query to $D$ , has been studied for instance in [3,4,5,6]; and, in a nutshell, the answer is positive.

Yes, Virginia, there is a Santa Claus. And he has constant sample complexity!

Open problems

Study more types of structures or assumptions on the distribution: namely, what type of natural assumptions on $D$ can we get, for which the testing algorithms get a significant speedup? (in terms of sample complexity)
What is the relation between the new type of access? Are they related? Can one simulate the other? What other natural sorts of access can one consider?
What about learning with these new types of access? (so far, we were interested in testing whether $D$ had some property of interest; but — maybe — learning the distribution also became easier?)

Such excitement — and this was only the beginning!

—
Clément

* Far in total variation (or statistical) distance: this is essentially (up to a factor 2) the $\ell_1$ distance between two probability distributions.
** and this is tight.

(this bibliography is far from comprehensive; if any reader wants additional pointers, I’ll be happy to provide more!)

[1] Taming big probability distributions, R. Rubinfeld (2012).
[2] Testing k-Modal Distributions: Optimal Algorithms via Reductions, Daskalakis, Diakonikolas, Servedio, Valiant, Valiant (2011).
[3] On the Power of Conditional Samples in Distribution Testing, Chakraborty, Fischer, Goldhirsh, Matsliah (2012).
[4] Testing probability distributions using conditional samples, Canonne, Ron, Servedio (2012).
[5] Streaming and Sublinear Approximation of Entropy and Information Distances, Guha, McGregor, Venkatasubramanian (2005).
[6] Testing probability distributions underlying aggregated data, Canonne, Rubinfeld (2014).

TCS in the Big Apple – STOC 2014 Recaps (Part 1)

courtesy of frontpagemag.com Ah, summertime. For many NotSoGITCSers, it means no classes, warm weather, ample research time… and a trip to STOC. This year the conference was held in New York City, only four hours away from Cambridge via sketchy Chinatown tour bus. Continuing in the great tradition of conference blogging by TCS blogs (see here, and here), NotSoGITCS will share some of our favorite papers from the conference. I’ll start this off with my recap on Rothvoss’s paper on extension complexity. Over the next few posts, we’ll hear from Clément Canonne and Gautam Kamath on Efficient Distribution Estimation, Pritish Kamath on the latest results in algebraic circuit complexity, Justin Holmgren on indistinguishability obfuscation, and Sunoo Park on coin flipping!

Henry Yuen on The matching polytope has exponential extension complexity, by Thomas Rothvoss

The Best Paper award for this year’s STOC went to this paper, which is another milestone result in the recent flurry of activity in lower bounds for extension complexity of combinatorial problems. This area represents a fruitful “meeting in the middle” between complexity theory and algorithms. Although a proof of P ≠ NP — a conjecture about the limitation of all polynomial time algorithms — seems astronomically distant, we can instead shoot for a closer goal: show that polynomial time algorithms we know of now cannot solve NP-complete problems. The area of extension complexity is about understanding the limitations of some of the finest weapons in our polynomial-time toolbox: LP and SDP solvers. As many of you know, these algorithms are tremendously powerful and versatile, in both theory and practice alike.

A couple years ago, STOC 2012 (also held in New York!) saw the breakthrough result of Fiorini, et al. titled Linear vs. Semideﬁnite Extended Formulations: Exponential Separation and Strong Lower Bounds. This paper showed that while powerful, LP solvers cannot naturally solve NP-complete problems in polynomial time. What do I mean by naturally solve? Let’s take the Traveling Salesman Problem for a moment (which was the subject of the aforementioned paper). One way to try to solve TSP on an n-city instance using linear programming is to try to optimize over the polytope P_TSP where each vertex corresponds to a tour of the complete graph on n nodes. Turns out, this polytope has exponentially many facets, and so is a rather unwieldy object to deal with — simply writing down a system of linear inequalities defining P_TSP takes exponential time! Instead, people would like to optimize over another polytope in higher dimension Q_TSP that projects down to P_TSP (such a Q_TSP would be called a linear extension of P_TSP) except now Q_TSP has a polynomial number of facets. Does such a succinct linear extension exist? If so, that would imply P = NP! The Fiorini, et al. paper showed that one can’t solve the Traveling Salesman Problem this way: the TSP polytope has exponential extension complexity — meaning that any linear extension of Q_TSP of P_TSP necessarily has exponentially many facets ^[1].

In a sense, one shouldn’t be too surprised by this result: after all, we all know that P ≠ NP, so of course the TSP polytope has exponential extension complexity. But we’ve learned more than just that LP solvers can’t solve NP-complete problems: we’ve gained some insight into the why. However, after the Fiorini et al. result, the natural question remained: is this exponential complexity of polytopes about NP-hardness, or is it something else? In particular, what is the extension complexity of the perfect matching polytope P_PM (the convex hull of all perfect matchings in $K_n$ )? This is interesting question for a number of reasons: this polytope, like P_TSP, has an exponential number of facets. However, unlike the TSP polytope, optimizing linear objective functions over P_PM is actually polynomial-time solvable! This is due to the seminal work of Jack Edmonds, who, in the 1960’s, effectively brought the notion of “polynomial time computability” into the (computer science) public consciousness by demonstrating that the maximum matching problem admits an efficient solution.

As you might have guessed from the title, Thomas Rothvoss showed that actually the matching polytope cannot be described as the projection of a polynomially-faceted polytope! So, while we have LP-based algorithms for optimizing over P_PM, these algorithms must nontrivially rely on the structure of the matching problem, and cannot be expressed as generically optimizing over some succinct linear extension. I’ll say a few words about how he shows this. To lower bound the extension complexity of a polytope, one starts by leveraging the Yannakakis’s Factorization Theorem, proved in his 1991 paper that started the whole area of extension complexity. His theorem says that, for a polytope $P$ , $\mathrm{xc}(P) = \mathrm{rank}_{+}(S_P)$ , where the left-hand side denotes the extension complexity of $P$ ^[2], and the right-hand side denotes the non-negative rank of the slack matrix of $P$ ^[3]. Now, instead of worrying about every possible linear extension of $P$ , we “only” have to focus our attention on the non-negative rank of $S_P$ . It turns out that fortuitously, we have techniques of lower bounding the non-negative rank of matrices from communication complexity (a connection that was also pointed out by Yannakakis). Roughly speaking, communication complexity tells us that the non-negative rank of a matrix is high if the matrix cannot be “described” as a small collection of combinatorial rectangles.

This idea is captured in what Rothvoss calls the “Hyperplane separation lower bound”: let $S$ be an arbitrary non-negative $m\times n$ matrix; and $W$ be an arbitrary matrix of the same dimensions (not necessarily non-negative). Then $\mathrm{rank}_+(S) \geq \langle W, S \rangle/ (\|S \|_\infty \alpha_W)$ , where $\langle W, S \rangle$ is the Frobenius inner product between $W$ and $S$ , $\|S \|_\infty$ is the largest-magnitude entry of $S$ , and $\alpha_W := \max \{ \langle W, R \rangle : R \in \{0,1\}^{m \times n}\text{ is rank 1} \}$ . Intuitively, in order to show that the non-negative rank of $S$ is large, you want to exhibit a matrix $W$ (a “hyperplane”) that has large correlation with $S$ , but has small correlation with any rectangle $R$ . Rothvoss presents such a hyperlane $W$ that proves the matching polytope has exponential extension complexity; the hard part is showing that $\langle W, R \rangle \leq 2^{-\Omega(n)}$ for all rectangles $R$ . To do so, Rothvoss follows a similar strategy to Razborov’s proof of the $\Omega(n)$ lower bound on the randomized communication complexity of DISJOINTNESS, with substantial modifications to fit the matching problem.

There are a number of reasons why I like these papers in extension complexity: they’re concrete, solid steps towards the dream of proving P ≠ NP. They’re short and elegantly written: the Fiorini, et al. paper is 24 pages; the Rothvoss paper is less than 20. They also demonstrate the unity of theoretical computer science: drawing upon ideas from algorithms, combinatorics, communication complexity, and even quantum information.

-Henry Yuen

[1] Well, now we know that these guys couldn’t have used a succinct extended formulation.

[2] $\mathrm{xc}(P)$ is defined to be the minimum number of facets of any polytope that linearly projects to $P$ .

[3] Let $S$ be a non-negative matrix of size $m\times n$ . Then $\mathrm{rank}_+(S) = k$ if there exists a factorization $S = TU$ where $T$ and $U$ are non-negative matrices, and $T$ and $U$ have dimensions $m\times k$ and $k\times n$ , respectively. Let $P = \{x : Ax \leq b \}$ is a polytope defined by $m$ inequalities, with vertices $v_1,\ldots,v_n$ . Then the $(i,j)$ th entry of the slack matrix $S_P$ of $P$ is defined to be $b_j - A_j v_i$ , with $A_j$ the $j$ th row of $A$ .

Can you tell if a bit is random?

Avi Wigderson has this wonderful talk about the power and limitations of randomness in computation, surveying the beautiful area of randomized algorithms, and the surprising connection between pseudorandomness and computational intractability. If you haven’t seen it already, I highly recommend it!

There is a part of the talk when Avi fearlessly engages the age-old question of the meaning of randomness. What does it mean for something to be random? We theoretical computer scientists often speak of “coin tosses”: many of our algorithms toss coins to decide what to do, and Alice and Bob toss coins to foil their mortal enemy Eve. Avi asks, “Is a coin toss random?”

Imagine that I hold in my hand a (fair) coin, and a second after I toss it high in the air, you, as you are watching me, are supposed to guess the outcome when it lands on the floor. What is the probability that you will guess correctly? 50-50 you say? I agree!

(Reproduced from the essay version of his talk). But now, Avi says, imagine that the coin toss was repeated, but this time you had some help: you were equipped with a powerful computing device, and accurate sensors, like an array of high speed video recorders. The coin suddenly starts to look a lot less random now! If your computer and sensors were good enough, you’d likely be able to predict the outcome of the coin toss nearly perfectly. In both cases the coin toss was the same, but the observer changed, and so did the amount of randomness in the situation. Thus, he contends, randomness is in the eye of the beholder.

From here, Avi launches into an excellent discussion of pseudorandomness and other topics. However, I’m going to forge onwards — foolishly, perhaps — with the original theme: what is randomness? As he pointed out, the randomness in the coin toss situation depended on the capabilities of the observer. Thus, one could argue that a coin toss is not truly random, because in principle you could predict the outcome. Is there a situation that is intrinsically random, no matter how powerful or omniscient the observer was? Put another way: is there a process by which I can generate a bit that no external super-being could possibly predict? Can you tell if a bit is random?

Enter the physics

Do you believe that the universe is governed by quantum mechanics? If so, then it seems that the matter is settled: quantum mechanics guarantees the existence of intrinsically random events. For those who know basic quantum, generating a truly random bit is easy: simply prepare a qubit state $\frac{1}{\sqrt{2}}(|0\rangle + |1 \rangle)$ and measure in the computational basis — done! In fact, there are hardware random number generators that operate on this principle.

But how do you actually prepare such a qubit, and perform the measurement? If you were a skilled experimentalist with access to perfectly manufactured mirrors and laser sources and detectors, then maybe. But then, how would you know these mirrors and lasers and detectors were built correctly?

For the rest of us down on Earth who don’t have time to learn experimental quantum physics and personally guarantee that all our laboratory equipment was built to specification, we’d like to be able to take some apparatus like a hardware random number generator and test whether its outputs were genuinely random — without having to open up the device and inspect its insides. Can we test randomness in a black box fashion?

It is easy to see that, if you only had input/output access to a black box device, there’s no test you can perform to reliably tell whether the device’s outputs are random. Intuitively, this is because for every test, you can construct a black box device that deterministically produces outputs that pass the test! Here’s a proof-by-Dilbert-cartoon:

What about statistical tests for randomness? Pseudorandom number generators are usually subject to an extensive battery of tests that check for suspicious patterns or regularities. Again, these are not fool-proof: one can design a fixed string that “diagonalizes” against all these heuristics and passes them all!

The power of non-locality

Black-box randomness testing seems like a hopeless task. But it turns out that, with a tiny extra assumption, it becomes possible! The added ingredient is non-locality.

Imagine that we had a hardware random number generator (HRNG) that we wanted to test, and suppose that the HRNG consisted of two separate pieces — call them Piece A and Piece B. Prevent A and B from communicating, either by surrounding them with Faraday cages or putting them on opposite sides of the galaxy. Now play the following game with them: generate a random bits x, y, and give x to A and y to B. Collect output bits a from A and b from B, and check if a + b = x Λ y (where a + b indicates the parity of the two bits). If so, the devices win this curious little game.

Suppose A and B were completely deterministic devices: their outputs a and b were fixed deterministic functions of their inputs x and y, respectively. Then it is not hard to see that, over the random choices of x and y, they can only win this game with probability 3/4.

What if you somehow knew that the devices were winning with probability greater than 3/4? What must you conclude? These devices — and hence their output bits — must be acting randomly. The killer fact is, it’s actually possible to win this game with probability strictly greater than 3/4! This can happen if part A and part B, though isolated, utilize something called quantum entanglement — a phenomenon that Einstein derisively termed “spooky action at a distance”.

We’ve made some headway on randomness testing, it seems. First, this game I’ve described does not require you to peer inside the HRNG. Second, if the HRNG wins the game with high enough probability, you’ve certified that its outputs must contain some randomness. Third, quantum mechanics says it is possible to build a HRNG to win the game. Finally, as a bonus: you don’t need to believe in quantum mechanics in order to trust the conclusion that “If the devices win with probability greater than 3/4, they must be producing randomness”!

What I’ve described to you is a distillation of the famous Bell’s Theorem from physics, which says that quantum mechanics is inconsistent with any so-called “hidden variable” theory of nature that respects locality (i.e. you can separate systems by distances so they cannot communicate or influence each other). This is why people say that quantum mechanics exhibits “non-local” behavior, because games like the one above can be won with probability greater than 3/4 (which has been demonstrated experimentally). Bell’s Theorem has been historically important in establishing a dividing line between classical and quantum theories of nature; but today we have a new interpretation: one can view Bell’s Theorem — published half a century ago — as saying that we can meaningfully perform randomness testing.

Randomness expansion — to infinity and beyond!

“Wait a minute,” you say. “While you’ve presented a test for randomness, this isn’t so useful — to even run the test you need 2 bits of randomness to generate x and y! Why test a HRNG if you already have randomness to begin with?”

True. A HRNG test isn’t very useful when it requires more randomness than is produced. It’s like the current state of fusion reactors: they produce less energy than is required to run them. But researchers in quantum cryptography realized that you could get a randomness efficient test, using less randomness than the amount that is produced by the HRNG. For example, one can start with 100 random bits, run one of these randomness efficient tests, and end up with, say, 200 random bits, a net gain of 100! In a sense, the amount of randomness you now possess has been expanded — hence these tests are called randomness expansion protocols.

Roger Colbeck, in his Ph.D. thesis from 2006, came up with a protocol that uses m bits of initial seed randomness, and certifies cm bits of output randomness using a HRNG that consists of 3 isolated parts, where c is some constant (approximately 3/2). Later in 2010, Pironio et al. gave a protocol that expands m bits of seed randomness to m² certified bits. In 2011, two simultaneous works — one by Vazirani and Vidick, and one by Fehr, et al. — demonstrated a protocol that attained exponential expansion: starting with m bits, one can certify that a 2-part HRNG produces a whopping 2^m certified bits of randomness!

While perhaps 2¹⁰⁰ random bits starting from a mere 100 is probably more than you’d ever need, the tantalizing question still stands: is there a limit to how much randomness you can certify, starting with a finite amount of seed randomness? In the upcoming Quantum Information Processing conference, Matt Coudron and I will show that there is no upper limit, and that infinite randomness expansion is possible. We give a protocol where, starting with say 1000 random bits, you can command a finite number of devices (which is eight in our protocol) to produce an unbounded amount of randomness. Furthermore, this protocol is immune to any secret backdoors installed in your HRNG (e.g. by a certain security agency) — the output of your HRNG is guaranteed to be secure against any external eavesdropper.

Classical control of quantum systems

Let’s take a step back from this frenetic expansion of randomness, and appreciate the remarkable nature of these protocols. Note that, this entire time, we know almost nothing about the HRNG being tested. It could have been specially designed by some malicious adversary to try to fool your test as much as possible. It could have at its disposal enormous computational power, like access to the Halting Problem oracle. It could even use quantum entanglement in some yet-unimagined way to try to defeat your test. Yet, by isolating parts of the HRNG from each other, and employing a simple check on their input and outputs, a limited classical being can force the HRNG to produce a long stream of random bits.

In a time when random number generators can be stealthily compromised by altering the dopants in a microchip, or by the intentional weakening of cryptographic standards by government agencies, the ability to rigorously and soundly test randomness has outgrown its philosophical origins in the nature of physical law, and become something of operational significance.

To return to our original question, though: can we tell if a bit is random? We can, but only by employing some some of the deepest ideas drawn from theoretical computer science and quantum physics. More broadly, this paradigm of classically controlling untrusted quantum devices has given rise to what’s called device independent quantum information processing. Not only can we classically manipulate quantum devices to produce genuine randomness, we can do a whole lot more: today we know how to securely exchange secret keys and even perform general quantum computations in a device independent manner! The scope and potential of device independent QIP is enormous, and I think we’ve only gotten started.

— Henry Yuen

MIT Theory Retreat 2013

This past Columbus Day weekend, over 30 of us gathered together at the Danny Lewin retreat — named in honor of former MIT graduate student Danny Lewin (about whom a recent book was written). This was the second year it was held; you can read about the inaugural retreat from last year here.

This was an opportunity for theory students new and old to get to know each other and spend 3 days together talking about their research and taking part in various fun activities (more details to follow).

The concept is simple:
1. We find a place (hopefully) within reasonable driving distance preferably in an area very close to awesome hiking paths.
2. Rent houses able to sleep dozens of people and and vans (one per dozen) to get us there.
3. Spend 3 days full of both academic and recreational activities. Academic activities (in the morning) include talks on a particular predetermined subject by predetermined speakers and short research introductions (i.e 5 minute talks about our research) by each one of us. Recreational ones include of course hiking and anything else people feel like doing in the free time (e.g ultimate frisbee, soccer, board games, making campfire etc).
4. Return to Boston full of new experiences and having met people, faster than we might have met otherwise. I definitely recommend that for all first year students!

This year the chosen place was at Adirondack Mountains in upstate New York close to Lake George, where we went for hiking the first time and gave us some awesome views as you will see.

We’ll just let the pictures do the rest of the talking (okay – we’ll let the captions do the talking too.

: Beautiful, isn’t it?

: This is G, our fearless leader.

: Each morning began with an hour long talk about learning theory (thanks to G, Ilya, and Eric for volunteering), followed by an hour and a half of rapid-fire research introductions.

: As a first year, I particularly enjoyed these as it gave me a chance to listen to 37 different perspectives on what theoretical computer science is all about.

: The academics weren’t confined to the research talks. Here’s G explaining a variant of the Dining Philosopher’s problem. And some discovered that beer makes random matrix theory that much more intelligible.

: But we did manage to tear ourselves away from the whiteboard for long enough to explore the amazing surroundings. Here’s the view from the bottom…

: …here’s the view from the top…

: …and from the bottom again!

: We also played ultimate frisbee, which (unfortunately) dominated soccer that day.

: There was lots of fire. Lighting fire. Marshmallows on fire. Molding glass in the fire. Poking the fire. Blowing on the fire. Incidentally, it’s very difficult to capture good pictures of fire.

: And lots of coffee. Here’s Eric’s own mountaintop brew. As Alfréd Rényi said, “a mathematician is a machine for turning coffee into theorems.”

: And the leaves were beautiful.

All in all, it was a fantastic weekend: I learned a ton, played my first game of mental chess (I lost), and escaped Cambridge for the great outdoors. And the leaves were beautiful.

–Aloni Cohen and Themis Gouleakis

Beyond Locality-Sensitive Hashing

This is an extended version of (the only) post in my personal blog.

In this post I will introduce locality-sensitive hashing (for which Andrei Broder, Moses Charikar and Piotr Indyk have been recently awarded Paris Kanellakis Theory and Practice Award) and sketch recent developments by Alexandr Andoni, Piotr Indyk, Nguyen Le Huy and myself (see video by Alexandr Andoni, where he explains the same result from somewhat different perspective).

One problem that one encounters a lot in machine learning, databases and other areas is the near neighbor search problem (NN). Given a set of points $P$ in a $d$ -dimensional space and a threshold $r > 0$ the goal is to build a data structure that given a query $q$ reports any point from $P$ within distance at most $r$ from $q$ .

Unfortunately, all known data structures for NN suffer from the so-called “curse of dimensionality”: if the query time is $o(n)$ (hereinafter we denote $n$ the number of points in our dataset $P$ ), then either space or query time is $2^{\Omega(d)}$ .

To overcome this obstacle one can consider the approximate near neighbor search problem (ANN). Now in addition to $P$ and $r$ we are also given an approximation parameter $c > 1$ . The goal is given a query $q$ report a point from $P$ within distance $cr$ from $q$ , provided that the neighborhood of radius $r$ is not empty.

It turns out that one can overcome the curse of dimensionality for ANN (see, for example, this paper and its references). If one insists on having near-linear (in $n$ ) memory and being subexponential in the dimension, then the only known technique for ANN is locality-sensitive hashing. Let us give some definitions. Say a hash family $\mathcal{H}$ on a metric space $\mathcal{M} = (X, D)$ is $(r, cr, p_1, p_2)$ -sensitive, if for every two points $x,y \in X$

if $D(x, y) \leq r$ , then $\mathrm{Pr}_{h \sim \mathcal{H}}[h(x) = h(y)] \geq p_1$ ;
if $D(x, y) \geq cr$ , then $\mathrm{Pr}_{h \sim \mathcal{H}}[h(x) = h(y)] \leq p_2$ .

Of course, for $\mathcal{H}$ to be meaningful, we should have $p_1 > p_2$ . Informally speaking, the closer two points are, the larger probability of their collision is.

Let us construct a simple LSH family for hypercube $\{0, 1\}^d$ , equipped with Hamming distance. We set $\mathcal{H} = \{h_1, h_2, \ldots, h_d\}$ , where $h_i(x) = x_i$ . It is easy to check that this family is $(r, cr, 1 - r / d, 1 - cr / d)$ -sensitive.

In 1998 Piotr Indyk and Rajeev Motwani proved the following theorem. Suppose we have a $(r, cr, p_1, p_2)$ -sensitive hash family $\mathcal{H}$ for the metric we want to solve ANN for. Moreover, assume that we can sample and evaluate a function from $\mathcal{H}$ relatively quickly, store it efficiently, and that $p_1 = 1 / n^{o(1)}$ . Then, one can solve ANN for this metric with space roughly $O(n^{1 + \rho})$ and query time $O(n^{\rho})$ , where $\rho = \ln(1 / p_1) / \ln(1 / p_2)$ . Plugging the family from the previous paragraph, we are able to solve ANN for Hamming distance in space around $O(n^{1 + 1 / c})$ and query time $O(n^{1/c})$ . More generally, in the same paper it was proved that one can achieve $\rho \leq 1 / c$ for the case of $\ell_p$ norms for $1 \leq p \leq 2$ (via an embedding by William Johnson and Gideon Schechtman). In 2006 Alexandr Andoni and Piotr Indyk proved that one can achieve $\rho \leq 1 / c^2$ for the $\ell_2$ norm.

Thus, the natural question arises: how optimal are the abovementioned bounds on $\rho$ (provided that $p_1$ is not too tiny)? This question was resolved in 2011 by Ryan O’Donnell, Yi Wu and Yuan Zhou: they showed a lower bound $\rho \geq 1/c - o(1)$ for $\ell_1$ and $\rho \geq 1/c^2-o(1)$ for $\ell_2$ matching the upper bounds. Thus, the above simple LSH family for the hypercube is in fact, optimal!

Is it the end of the story? Not quite. The catch is that the definition of LSH families is actually too strong. The real property that is used in the ANN data structure is the following: for every pair of points $x \in P, y \in X$ we have

if $D(x, y) \leq r$ , then $\mathrm{Pr}_{h \sim \mathcal{H}}[h(x) = h(y)] \geq p_1$ ;
if $D(x, y) \geq cr$ , then $\mathrm{Pr}_{h \sim \mathcal{H}}[h(x) = h(y)] \leq p_2$ .

The difference with the definition of $(r,cr,p_1,p_2)$ -sensitive family is that we now restrict one of the points to be in a prescribed set $P$ . And it turns out that one can indeed exploit this dependency on data to get a slightly improved LSH family. Namely, we are able to achieve $\rho \leq 7 / (8c^2) + O(1 / c^3) + o(1)$ for $\ell_2$ , which by a simple embedding of $\ell_1$ into $\ell_2$ -squared gives $\rho \leq 7 / (8c) + O(1 / c^{3/2}) + o(1)$ for $\ell_1$ (in particular, Hamming distance over the hypercube). This is nice for two reasons. First, we are able to overcome the natural LSH barrier. Second, this result shows that what “practitioners” have been doing for some time (namely, data-dependent space partitioning) can give advantage in theory, too.

In the remaining text let me briefly sketch the main ideas of the result. From now on, assume that our metric is $\ell_2$ . The first ingredient is an LSH family that simplifies and improves upon $\rho = 1 / c^2$ for the case, when all data points and queries lie in a ball of radius $O(cr)$ . This scheme has strong parallels with an SDP rounding scheme of David Karger, Rajeev Motwani and Madhu Sudan.

The second (and the main) ingredient is a two-level hashing scheme that leverages the abovementioned better LSH family. First, let us recall, how the standard LSH data structure works. We start from a $(r, cr, p_1, p_2)$ -sensitive family $\mathcal{H}$ and then consider the following simple “tensoring” operation: we sample $k$ functions $h_1, h_2, \ldots, h_k$ from $\mathcal{H}$ independently and then we hash a point $x$ into a tuple $(h_1(x), h_2(x), \ldots, h_k(x))$ . It is easy to see that the new family is $(r, cr, p_1^k, p_2^k)$ -sensitive. Let us denote this family by $\mathcal{H}^k$ . Now we choose $k$ to have the following collision probabilities:

$1/n$ at distance $cr$ ;
$1/n^{\rho}$ at distance $r$

(actually, we can not set $k$ to achieve these probabilities exactly, since $k$ must be integer, that’s exactly why we need the condition $p_1 = 1 / n^{o(1)}$ ). Now we hash all the points from the dataset using a random function from $\mathcal{H}^k$ , and to answer a query $q$ we hash $q$ and enumerate all the points in the corresponding bucket, until we find anything within distance $cr$ . To analyze this simple data structure, we observe that the average number of “outliers” (points at distance more than $cr$ ) we encounter is at most one due to the choice of $k$ . On the other hand, for any near neighbor (within distance at most $r$ ) we find it with probability at least $n^{-\rho}$ , so, to boost it to constant, we build $O(n^{\rho})$ independent hash tables. As a result, we get a data structure with space $O(n^{1 + \rho})$ and query time $O(n^{\rho})$ .

Now let us show how to build a similar two-level data structure, which achieves somewhat better parameters. First, we apply the LSH family $\mathcal{H}$ for $\ell_2$ with $\rho \approx 1/c^2$ , but only partially. Namely, we choose a constant parameter $\tau > 1$ and $k$ such that the collision probabilities are as follows:

$1/n$ at distance $\tau cr$ ;
$1/n^{1/\tau^2}$ at distance $cr$ ;
$1/n^{1/(\tau c)^2}$ at distance $r$ .

Now we hash all the data points with $\mathcal{H}^k$ and argue that with high probability every bucket has diameter $O(cr)$ . But now given this “bounded buckets” condition, we can utilize the better family designed above! Namely, we hash every bucket using our new family to achieve the following probabilities:

$1/n^{1 - 1/\tau^2}$ at distance $cr$ ;
$1/n^{(1 - \Omega_{\tau}(1))(1 - 1 / \tau^2) / c^2}$ at distance $r$ .

Overall, the data structure consists of an outer hash table that uses the LSH family of Andoni and Indyk, and then every bucket is hashed using the new family. Due to independence, the collision probabilities multiply, and we get

$1/n$ at distance $cr$ ;
$1/n^{(1 - \Omega_{\tau}(1)) / c^2}$ at distance $r$ .

Then we argue as before and conclude that we can achieve $\rho \leq (1 - \Omega(1)) / c^2$ .

After carefully optimizing all the parameters, we achieve, in fact, $\rho \approx 7 / (8c^2)$ . Then we go further, and consider a multi-level scheme with several distance scales. Choosing these scales carefully, we achieve $\rho \approx 1 / (2 c^2 \ln 2)$ .

— Ilya Razenshteyn

Not so Great Ideas in Theoretical Computer Science

A student blog of MIT CSAIL Theory of Computation Group