<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Differential Privacy</title>
    <description>Website for the differential privacy research community</description>
    <link>https://differentialprivacy.org</link>
    <atom:link href="https://differentialprivacy.org/feed.xml" rel="self" type="application/rss+xml" />
    
      <item>
        <title>Call for Papers - TPDP 2026</title>
        <description>&lt;p&gt;&lt;a href=&quot;https://tpdp.journalprivacyconfidentiality.org/2026/&quot;&gt;The 12th Workshop on the Theory and Practice of Differential Privacy (TPDP 2026)&lt;/a&gt; will take place on June 1 and 2 in Boston, MA.
The deadline to submit a 4-page abstract is February 18, 2026 AoE, with notifications by April 2, 2026.
The call for papers is copied below.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Differential privacy (DP) is the leading framework for data analysis with rigorous privacy guarantees. In the last two decades, it has transitioned from the realm of pure theory to large-scale, real-world deployments.&lt;/p&gt;

&lt;p&gt;Differential privacy is an inherently interdisciplinary field, drawing researchers from a variety of academic communities including machine learning, statistics, security, theoretical computer science, databases, and law. The combined effort across a broad spectrum of computer science is essential for differential privacy to realize its full potential. To this end, this workshop aims to stimulate discussion among participants about both the state-of-the-art in differential privacy and the future challenges that must be addressed to make differential privacy more practical.&lt;/p&gt;

&lt;p&gt;Specific topics of interest for the workshop include (but are not limited to):&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Theory of DP&lt;/li&gt;
  &lt;li&gt;DP and security&lt;/li&gt;
  &lt;li&gt;Privacy preserving machine learning&lt;/li&gt;
  &lt;li&gt;DP and statistics&lt;/li&gt;
  &lt;li&gt;DP and data analysis&lt;/li&gt;
  &lt;li&gt;Trade-offs between privacy protection and analytic utility&lt;/li&gt;
  &lt;li&gt;DP and surveys&lt;/li&gt;
  &lt;li&gt;Programming languages for DP&lt;/li&gt;
  &lt;li&gt;Relaxations of DP&lt;/li&gt;
  &lt;li&gt;Relation to other privacy notions and methods&lt;/li&gt;
  &lt;li&gt;Experimental studies using DP&lt;/li&gt;
  &lt;li&gt;DP implementations&lt;/li&gt;
  &lt;li&gt;DP and policy making&lt;/li&gt;
  &lt;li&gt;Applications of DP&lt;/li&gt;
  &lt;li&gt;Reconstruction attacks and memorization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Submissions:&lt;/strong&gt; Authors are invited to submit a short abstract of new work or work published since June 2025 (the most recent TPDP submission deadline). Submissions must be 4 pages maximum, not including references. Submissions may also include appendices, but these are read only at the reviewers’ discretion. There is no prescribed style file, but authors should ensure a minimum of 1-inch margins and 10pt font. Submissions are not anonymized and should include author names and affiliations.&lt;/p&gt;

&lt;p&gt;Submissions will undergo a lightweight review process and will be judged on originality, relevance, interest, and clarity. Based on the volume of submissions to TPDP 2025 and the workshop’s capacity constraints, we expect that the review process will be somewhat more competitive than in years past. Accepted abstracts will be presented at the workshop either as a talk or a poster.&lt;/p&gt;

&lt;p&gt;The workshop will not have formal proceedings and is not intended to preclude later publication at another venue. In-person attendance is encouraged, though authors of accepted abstracts who cannot attend in person will be invited to submit a short video to be linked on the TPDP website.&lt;/p&gt;

&lt;p&gt;Selected papers from the workshop will be invited to submit a full version of their work for publication in a &lt;a href=&quot;https://tpdp.journalprivacyconfidentiality.org/2026/&quot;&gt;special issue&lt;/a&gt; of the &lt;a href=&quot;https://journalprivacyconfidentiality.org/index.php/jpc&quot;&gt;Journal of Privacy and Confidentiality&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The submission server is live: &lt;a href=&quot;https://tpdp26.cs.uchicago.edu/&quot;&gt;https://tpdp26.cs.uchicago.edu/&lt;/a&gt;&lt;/p&gt;
</description>
        <author>
        
            <name>Thomas Steinke</name>
        
        </author>
        <pubDate>Thu, 01 Jan 2026 12:00:00 -0800</pubDate>
        <link>https://differentialprivacy.org/tpdp2026/</link>
        <guid isPermaLink="true">https://differentialprivacy.org/tpdp2026/</guid>
      </item>
    
      <item>
        <title>Open Problem: Selection via Low-Sensitivity Queries</title>
        <description>&lt;p&gt;Two of the basic tools for building differentially private algorithms are noise addition for answering low-sensitivity queries and the exponential mechanism for selection. 
Could we do away with the exponential mechanism and simply use low-sensitivity queries to perform selection?&lt;/p&gt;

&lt;h2 id=&quot;formal-problem-statement&quot;&gt;Formal Problem Statement&lt;/h2&gt;

&lt;p&gt;Recall that the exponential mechanism is a differentially private algorithm that takes a private dataset \(x \in \mathcal{X}^n\) and a public loss function \(\ell : \mathcal{X}^n \times \mathcal{Y} \to \mathbb{R}\) and returns \(Y \in \mathcal{Y}\) such that \(\mathbb{E}_Y[\ell(x, Y)] \le \min_{y\in\mathcal{Y}} \ell(x, y) + O(\frac{1}{\varepsilon} \log |\mathcal{Y}|)\), where \(\varepsilon\) is the differential privacy parameter. 
(We will suppress the privacy parameter for simplicity.) 
The question is whether we can replace the exponential mechanism with an algorithm that is based only on adding noise to low-sensitivity queries.&lt;/p&gt;
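&lt;p&gt;As a concrete reference point, here is a minimal sketch of the exponential mechanism over a finite candidate set. This is illustrative code only (the function name is mine, and a hardened implementation would sample in a numerically robust, side-channel-aware way):&lt;/p&gt;

```python
import math
import random

def exponential_mechanism(losses, eps):
    """Sample index i with probability proportional to exp(-eps * losses[i] / 2).

    For a sensitivity-1 loss this is eps-DP; the 1/2 in the exponent accounts
    for that sensitivity. Illustrative sketch only.
    """
    m = min(losses)  # shift by the minimum for numerical stability
    weights = [math.exp(-eps * (loss - m) / 2.0) for loss in losses]
    r = random.random() * sum(weights)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if acc > r:
            return i
    return len(weights) - 1  # guard against floating-point round-off
```

&lt;p&gt;Sampling \(y\) with probability proportional to \(\exp(-\varepsilon \ell(x,y)/2)\) is what yields the \(O(\frac{1}{\varepsilon} \log |\mathcal{Y}|)\) excess-loss guarantee recalled above.&lt;/p&gt;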

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Problem 1:&lt;/strong&gt; &lt;a id=&quot;prob1&quot;&gt;&lt;/a&gt;
There is a (private) dataset \(x \in \mathcal{X}^n\) and a (public) loss function \(\ell : \mathcal{X}^n \times \mathcal{Y} \to \mathbb{R}\) that has sensitivity-\(1\) in its first argument. That is, for all \(x,x' \in \mathcal{X}^n\) differing in a single entry and all \(y \in \mathcal{Y}\), we have \(|\ell(x,y) - \ell(x',y)| \le 1\).&lt;/p&gt;

  &lt;p&gt;The goal is to construct an algorithm that outputs \(Y \in \mathcal{Y}\) such that \[\mathbb{E}_Y[\ell(x, Y)] \le \min_{y\in\mathcal{Y}} \ell(x, y) + O(\log |\mathcal{Y}|).\tag{1}\]
However, the algorithm cannot access \(x\) directly. Instead there is an oracle which provides noisy answers to \(k\) sensitivity-\(1\) queries. Specifically, each query is specified by a sensitivity-\(1\) function \(q : \mathcal{X}^n \to \mathbb{R}\), which is submitted to the oracle, and the oracle returns a sample from \(\mathcal{N}(q(x),k)\). The number of queries \(k\) may be chosen arbitrarily and the queries may be specified adaptively.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Obviously, if the algorithm had direct access to \(x\) or if the oracle didn’t add any noise, this problem would be trivial (just query \(q(x)=\ell(x,y)\) for all \(y\in\mathcal{Y}\) – i.e., \(k=|\mathcal{Y}|\) queries – and output the minimizing \(y\)).&lt;/p&gt;

&lt;p&gt;The noise added by the oracle ensures that the algorithm is differentially private. Thus the goal of this algorithm is directly comparable with the guarantee of the exponential mechanism.&lt;/p&gt;

&lt;h2 id=&quot;partial-solution&quot;&gt;Partial Solution&lt;/h2&gt;
&lt;p&gt;&lt;a id=&quot;partsoln&quot;&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a starting point, the following binary-tree-based algorithm attains expected excess loss \(O((\log|\mathcal{Y}|)^{3/2})\) instead of the desired \(O(\log|\mathcal{Y}|)\).&lt;/p&gt;

&lt;p&gt;Construct a complete binary tree with the leaves corresponding to the elements of \(\mathcal{Y}\). Walk down the tree from the root to a leaf as follows and output the leaf’s element. 
At each node, query the oracle with &lt;a id=&quot;eq2&quot;&gt;&lt;/a&gt;\[q(x) = \frac{1}{2} \big(\min_{y\in\text{ left subtree}} \ell(x, y)\big) - \frac{1}{2} \big(\min_{y\in\text{ right subtree}} \ell(x, y)\big).\tag{2}\]  If the oracle’s answer is positive, move to the right child; otherwise, move left.&lt;/p&gt;

&lt;p&gt;The number of queries for this algorithm is \(k=\lceil \log_2|\mathcal{Y}| \rceil\). And it’s easy to check that &lt;a href=&quot;#eq2&quot;&gt;Equation 2&lt;/a&gt; has sensitivity-\(1\).&lt;/p&gt;
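&lt;p&gt;Here is a short simulation of this binary-tree algorithm. It is a sketch under the assumptions of &lt;a href=&quot;#prob1&quot;&gt;Problem 1&lt;/a&gt;: the oracle is modeled by adding \(\mathcal{N}(0,k)\) noise (variance \(k\)) via &lt;code&gt;random.gauss&lt;/code&gt;, and the function name is mine:&lt;/p&gt;

```python
import math
import random

def tree_select(losses):
    """Binary-tree selection from the partial solution above.

    At each internal node, query half the difference between the left and
    right subtree minima; the oracle answer is the true value plus
    N(0, k) noise (variance k, standard deviation sqrt(k)).
    Positive answer: descend right; otherwise descend left. Sketch only.
    """
    k = max(1, math.ceil(math.log2(len(losses))))
    noise_std = math.sqrt(k)
    lo, hi = 0, len(losses)  # current subtree is losses[lo:hi]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        q = 0.5 * min(losses[lo:mid]) - 0.5 * min(losses[mid:hi])
        if q + random.gauss(0.0, noise_std) > 0:
            lo = mid  # noisy answer says the minimizer is on the right
        else:
            hi = mid
    return lo
```

&lt;p&gt;For a loss vector with one clearly separated minimizer, the walk reaches the correct leaf with high probability, in line with the utility analysis that follows.&lt;/p&gt;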

&lt;p&gt;For the utility analysis, let \(A_1,A_2,\cdots,A_k\) denote the nodes on the path from the root \(A_1\) to the leaf \(A_k\) that we output.
We track the minimum loss on the subtree rooted at the current node – i.e., \(B_i := \min_{y \text{ in subtree rooted at } A_i} \ell(x,y)\).&lt;/p&gt;

&lt;p&gt;Initially, we have \(B_1 = \min_{y\in\mathcal{Y}} \ell(x, y) \), which is the desired quantity. And \(B_k\) is the loss of the final output.
We also have \(B_1 \le B_2 \le \cdots \le B_k\), since each successive subtree is a subset of the previous one. 
To complete the analysis we need only show that \(\mathbb{E}[B_{i+1}] \le B_i + O(\sqrt{\log|\mathcal{Y}|})\) for all \(i\).&lt;/p&gt;

&lt;p&gt;If the (noiseless) value of the query in &lt;a href=&quot;#eq2&quot;&gt;Equation 2&lt;/a&gt; is positive, then the minimizer is in the right subtree and vice versa. 
If at step \(i\) the algorithm chooses the “correct” child, then \(B_{i+1}=B_i\).
But, if the algorithm chooses the “incorrect” child, we have \(B_{i+1} = B_i + 2|q_i(x)|\), where \(q_i\) is the query (given in &lt;a href=&quot;#eq2&quot;&gt;Equation 2&lt;/a&gt;) that was asked to the oracle in step \(i\).&lt;/p&gt;

&lt;p&gt;What is the probability of choosing the wrong child? Well, it’s the probability that the noise added to \(q_i(x)\) flips the sign – i.e., \(\mathbb{P}[\mathsf{sign}(\mathcal{N}(q_i(x),k)) \ne \mathsf{sign}(q_i(x))]\). Putting these together and doing a bit of algebraic manipulation, we have
\[ \mathbb{E}[B_{i+1}] = B_i + 2|q_i(x)| \cdot \mathbb{P}[\mathcal{N}(0,k)\ge|q_i(x)|] \le B_i + 2\sqrt{k} \cdot \max_{v \ge 0} v \cdot \mathbb{P}[\mathcal{N}(0,1) \ge v].\tag{3}\]
The quantity \(\max_{v \ge 0} v \cdot \mathbb{P}[\mathcal{N}(0,1) \ge v] \in [0.169,0.17]\) is a constant (attained at \(v \approx 0.75\)).
Thus we have \(\mathbb{E}[B_k] \le B_1 + 0.34 k^{3/2} = \min_{y\in\mathcal{Y}} \ell(x, y) + O((\log|\mathcal{Y}|)^{3/2})\).&lt;/p&gt;
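&lt;p&gt;That constant is easy to verify numerically. A quick grid-search sketch using the Gaussian tail \(\mathbb{P}[\mathcal{N}(0,1) \ge v] = \frac{1}{2}\mathrm{erfc}(v/\sqrt{2})\):&lt;/p&gt;

```python
import math

def gauss_tail(v):
    """P[N(0,1) at least v], via the complementary error function."""
    return 0.5 * math.erfc(v / math.sqrt(2.0))

# Grid search for the max over v of v * gauss_tail(v), claimed to lie in
# [0.169, 0.17] and to be attained near v = 0.75.
best_v, best_val = 0.0, 0.0
for i in range(30001):
    v = i * 1e-4  # v ranges over [0, 3]
    val = v * gauss_tail(v)
    if val > best_val:
        best_v, best_val = v, val
```

&lt;p&gt;The search lands inside the interval \([0.169,0.17]\) stated above, near \(v \approx 0.75\).&lt;/p&gt;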

&lt;h2 id=&quot;who-cares&quot;&gt;Who Cares?&lt;/h2&gt;

&lt;p&gt;Oh man, tough crowd. &lt;em&gt;I&lt;/em&gt; care. But it’s a fair question – why is this open problem interesting?&lt;/p&gt;

&lt;p&gt;A positive solution to this open problem would demonstrate the power of low-sensitivity queries and illustrate how almost all differentially private tasks can be boiled down to noise addition. 
Note that the reverse reduction is trivial: We can use the exponential mechanism to answer low-sensitivity queries. Namely, we can set \(\ell(x,y)=|q(x)-y|\). Thus a positive solution to this problem would show an &lt;em&gt;equivalence&lt;/em&gt; between selection and low-sensitivity queries.&lt;/p&gt;

&lt;p&gt;In practice, the exponential mechanism works fine, so we don’t really &lt;em&gt;need&lt;/em&gt; this algorithm.
Nevertheless, I think this could lead to something insightful, and maybe even useful: There are situations where we can do better than the exponential mechanism or at least better than the standard &lt;em&gt;analysis&lt;/em&gt; of the exponential mechanism. An alternative algorithm might open up more avenues for improving on the exponential mechanism.&lt;/p&gt;

&lt;p&gt;To give some examples where we know how to beat the standard analysis of the exponential mechanism: First, suppose the loss function can be decomposed as \[\ell(x,y) = \ell(x,(y_1,y_2,\cdots,y_d)) = \ell_1(x,y_1) + \ell_2(x,y_2) + \cdots + \ell_d(x,y_d). \tag{4}\]
Then the analysis of the exponential mechanism can also be decomposed into the composition of \(d\) independent exponential mechanisms, which yields better asymptotic results via the advanced composition theorem.
A second example is when there is one option \(y_* \in \mathcal{Y}\) that stands out from the other options – i.e., \(\ell(x,y_*) \le \min_{y \in \mathcal{Y} \setminus \{y_*\}} \ell(x,y) - c\), where \(c\) is sufficiently large. In this case we can privately output \(y_*\) with an improved dependence on the number of options \(|\mathcal{Y}|\) [&lt;a href=&quot;https://arxiv.org/abs/1409.2177&quot; title=&quot;Kamalika Chaudhuri, Daniel Hsu, Shuang Song. The Large Margin Mechanism for Differentially Private Maximization. NIPS 2014.&quot;&gt;CHS14&lt;/a&gt;,&lt;a href=&quot;https://dl.acm.org/doi/10.1145/3188745.3188946&quot; title=&quot;Mark Bun, Cynthia Dwork, Guy N. Rothblum, Thomas Steinke. Composable and versatile privacy via truncated CDP. STOC 2018.&quot;&gt;BDRS18&lt;/a&gt;,&lt;a href=&quot;https://arxiv.org/abs/1905.13229&quot; title=&quot;Mark Bun, Gautam Kamath, Thomas Steinke, Zhiwei Steven Wu. Private Hypothesis Selection. NeurIPS 2019.&quot;&gt;BKSW19&lt;/a&gt;].&lt;/p&gt;
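&lt;p&gt;A sketch of the first example: when the loss decomposes coordinate-wise as in Equation 4, we can run one exponential mechanism per coordinate. The helper names are mine, and the per-coordinate budget of \(\varepsilon/\sqrt{d}\) assumes advanced-composition-style (e.g. zCDP) accounting rather than basic composition:&lt;/p&gt;

```python
import math
import random

def exp_mech(losses, eps):
    """Exponential mechanism over a finite list of sensitivity-1 losses."""
    m = min(losses)
    weights = [math.exp(-eps * (loss - m) / 2.0) for loss in losses]
    r = random.random() * sum(weights)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if acc > r:
            return i
    return len(weights) - 1

def decomposed_select(loss_tables, eps_total):
    """Select each coordinate with its own exponential mechanism.

    loss_tables[j] lists ell_j(x, y_j) for every candidate y_j, as in the
    decomposition of Equation 4. The per-coordinate budget eps_total/sqrt(d)
    reflects advanced-composition-style accounting (an assumption of this
    sketch, not a self-contained privacy proof).
    """
    d = len(loss_tables)
    eps_j = eps_total / math.sqrt(d)
    return tuple(exp_mech(table, eps_j) for table in loss_tables)
```

&lt;p&gt;Each coordinate then pays excess loss \(O(\frac{\sqrt{d}}{\varepsilon}\log|\mathcal{Y}_j|)\) rather than the \(O(\frac{d}{\varepsilon}\log|\mathcal{Y}_j|)\) that basic composition would give.&lt;/p&gt;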

&lt;p&gt;A negative solution – that is, an impossibility result – would show that selection is a fundamental and indivisible primitive of differentially private algorithms. This would be surprising and thus interesting. The proof technique would presumably also be novel.&lt;/p&gt;

&lt;h2 id=&quot;remarks&quot;&gt;Remarks&lt;/h2&gt;

&lt;p&gt;This open problem was first &lt;a href=&quot;https://dataprivacyopenpro.wixsite.com/mysite/forum&quot;&gt;published in 2019&lt;/a&gt;. (And I asked a &lt;a href=&quot;https://cstheory.stackexchange.com/questions/39254/find-an-approximate-argmax-using-only-approximate-max-queries&quot;&gt;related question&lt;/a&gt; back in 2017.) I’m reposting it because, well, it’s still open. (There are a few other open problems in the 2019 list, although some have been solved by now.)&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;#prob1&quot;&gt;Problem 1&lt;/a&gt; is stated in terms of Gaussian noise addition (and implicitly performs optimal/advanced composition).
The problem also makes sense with Laplace noise addition (and &lt;a href=&quot;/composition-basics/&quot;&gt;basic composition&lt;/a&gt;).
Let’s state that formally:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Problem 2:&lt;/strong&gt; &lt;a id=&quot;prob2&quot;&gt;&lt;/a&gt;
There is a (private) dataset \(x \in \mathcal{X}^n\) and a (public) loss function \(\ell : \mathcal{X}^n \times \mathcal{Y} \to \mathbb{R}\) that has sensitivity-\(1\) in its first argument. That is, for all \(x,x' \in \mathcal{X}^n\) differing in a single entry and all \(y \in \mathcal{Y}\), we have \(|\ell(x,y) - \ell(x',y)| \le 1\).&lt;/p&gt;

  &lt;p&gt;The goal is to construct an algorithm that outputs \(Y \in \mathcal{Y}\) such that \[\mathbb{E}_Y[\ell(x, Y)] \le \min_{y\in\mathcal{Y}} \ell(x, y) + O(\log |\mathcal{Y}|).\tag{5}\]
However, the algorithm cannot access \(x\) directly. Instead there is an oracle which provides noisy answers to \(k\) sensitivity-\(1\) queries. Specifically, each query is specified by a sensitivity-\(1\) function \(q : \mathcal{X}^n \to \mathbb{R}\), which is submitted to the oracle, and the oracle returns a sample from \(q(x)+\mathsf{Lap}(k)\). The number of queries \(k\) may be chosen arbitrarily and the queries may be specified adaptively.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For pure DP, the &lt;a href=&quot;#partsoln&quot;&gt;binary tree algorithm&lt;/a&gt; would achieve excess loss \(O((\log |\mathcal{Y}|)^2)\) instead of \(O(\log |\mathcal{Y}|)\).&lt;/p&gt;

&lt;p&gt;The contrast between pure DP and Gaussian DP is interesting because the exponential mechanism satisfies pure DP and &lt;a href=&quot;https://arxiv.org/abs/1704.03024&quot;&gt;relaxing to approximate DP doesn’t allow us to do any better&lt;/a&gt;. But, comparing &lt;a href=&quot;#prob1&quot;&gt;Problem 1&lt;/a&gt; and &lt;a href=&quot;#prob2&quot;&gt;Problem 2&lt;/a&gt;, it seems like the Gaussian case &lt;em&gt;should&lt;/em&gt; be easier.
I can’t quite put my finger on it, but I feel like there’s something interesting to say about this distinction and I hope resolving this open problem would shed light on it.&lt;/p&gt;
</description>
        <author>
        
            <name>Thomas Steinke</name>
        
        </author>
        <pubDate>Fri, 02 May 2025 07:00:00 -0700</pubDate>
        <link>https://differentialprivacy.org/open-problem-selection/</link>
        <guid isPermaLink="true">https://differentialprivacy.org/open-problem-selection/</guid>
      </item>
    
      <item>
        <title>Limits of Privacy Amplification by Subsampling</title>
        <description>&lt;p&gt;In &lt;a href=&quot;/subsampling&quot;&gt;our previous post&lt;/a&gt; we gave a brief introduction to privacy amplification by subsampling.
The high-level story is that we can make differentially private algorithms faster by running them on a subsample of the dataset instead of the whole dataset, and this comes at essentially no cost in privacy and accuracy. 
That story is pretty good. But now we’ll take a closer look at the details of this story.&lt;/p&gt;

&lt;h2 id=&quot;setting&quot;&gt;Setting&lt;/h2&gt;

&lt;p&gt;Recall that we’re comparing the standard Laplace mechanism \(M(x) := \frac{1}{n}\sum_{x_i \in x} q(x_i) + \mathsf{Laplace}\left(\frac{1}{\varepsilon n}\right)\) to the subsampled Laplace mechanism \(\widetilde{M}_{p}(x) := \frac{1}{pn} \sum_{x_i \in S_p(x)} q(x_i) + \mathsf{Laplace}\left(\frac{1}{\varepsilon_p p n}\right)\), where \(S_p(x)\subseteq x\) is a random Poisson subsample that includes each person’s data independently with probability \(p\).
Both algorithms satisfy the same \(\varepsilon\)-differential privacy guarantee.
The respective mean squared error guarantees are
&lt;a id=&quot;eq1&quot;&gt;&lt;/a&gt;\[\mathbb{E}\left[\left(M(x) - \frac{1}{n}\sum_{x_i \in x} q(x_i)\right)^2\right] = \frac{2}{\varepsilon^2 n^2}. \tag{1}\]
and
&lt;a id=&quot;eq2&quot;&gt;&lt;/a&gt;
\[ \mathbb{E}\left[\left(\widetilde{M}_p(x) - \frac{1}{n}\sum_{x_i \in x} q(x_i) \right)^2\right]  \le \frac{|x|}{p n^2} + \frac{2}{\varepsilon_p^2 p^2 n^2} \approx \frac{1}{p n} + \frac{2}{\varepsilon^2 n^2},\tag{2}\]
where 
&lt;a id=&quot;eq3&quot;&gt;&lt;/a&gt;\[\varepsilon_p = \log\left(1 + \frac{1}{p} \big( e^{\varepsilon}-1 \big)\right) \approx \frac{\varepsilon}{p}. \tag{3} \]&lt;/p&gt;

&lt;p&gt;Comparing &lt;a href=&quot;#eq1&quot;&gt;Equation 1&lt;/a&gt; with &lt;a href=&quot;#eq2&quot;&gt;Equation 2&lt;/a&gt;, there are two differences: The non-private statistical error \(\frac{1}{p n}\) and the approximation from &lt;a href=&quot;#eq3&quot;&gt;Equation 3&lt;/a&gt;. 
We’ll ignore the non-private statistical error \(\frac{1}{p n}\) in this post, since it isn’t the dominant error term for reasonable parameter regimes and, well, this is &lt;em&gt;DifferentialPrivacy.org&lt;/em&gt; not &lt;em&gt;Statistics.org&lt;/em&gt;.&lt;/p&gt;

&lt;h2 id=&quot;how-good-is-the-approximation&quot;&gt;How good is the approximation?&lt;/h2&gt;

&lt;p&gt;So let’s talk about the approximation in &lt;a href=&quot;#eq3&quot;&gt;Equation 3&lt;/a&gt;, which directly affects the scale of the Laplace noise added by the subsampled mechanism \(\widetilde{M}_p\): &lt;a id=&quot;eq4&quot;&gt;&lt;/a&gt;\[\text{noise_scale}(\widetilde{M}_p) = \frac{1}{\varepsilon_p p n} = \frac{1}{pn\log\left(1 + \frac{1}{p} \big( e^{\varepsilon}-1 \big)\right)} \approx \frac{1}{\varepsilon n} = \text{noise_scale}(M). \tag{4} \]
The approximation in &lt;a href=&quot;#eq3&quot;&gt;Equation 3&lt;/a&gt; comes from the Taylor series around \(\varepsilon=0\): 
&lt;a id=&quot;eq5&quot;&gt;&lt;/a&gt;\[\varepsilon_p = \log\left(1 + \frac{1}{p} \big( e^{\varepsilon}-1 \big)\right) = \frac{\varepsilon}{p} - \frac{(1-p)\varepsilon^2}{2p^2} + \frac{(2-p)(1-p)\varepsilon^3}{6p^3} \pm O(\varepsilon^4)\tag{5}.\]
The approximation in &lt;a href=&quot;#eq3&quot;&gt;Equation 3&lt;/a&gt; is just the first term in this Taylor series.&lt;sup id=&quot;fnref:taylor&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:taylor&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; 
We can make the approximation precise with some inequalities:&lt;sup id=&quot;fnref:ineq&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:ineq&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;
&lt;a id=&quot;eq6&quot;&gt;&lt;/a&gt;\[\frac{\varepsilon}{p+\varepsilon} \le \log\left(1 + \frac{\varepsilon}{p}\right) \le \varepsilon_p = \log\left(1 + \frac{1}{p} \big( e^{\varepsilon}-1 \big)\right) \le \frac{\varepsilon}{p}. \tag{6} \]&lt;/p&gt;
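&lt;p&gt;Equations 3 and 6 are straightforward to check numerically. An illustrative sketch (the function name is mine):&lt;/p&gt;

```python
import math

def eps_sub(eps, p):
    """Equation 3: the privacy parameter the subsampled mechanism may use
    so that, after amplification, the overall guarantee is eps-DP."""
    return math.log(1.0 + (math.exp(eps) - 1.0) / p)

# Verify the sandwich of Equation 6 on a small grid:
# eps/(p+eps)  is at most  log(1+eps/p)  is at most  eps_p  is at most  eps/p.
for eps in (0.01, 0.1, 1.0, 2.0):
    for p in (0.001, 0.01, 0.1, 0.5):
        e_p = eps_sub(eps, p)
        assert math.log(1.0 + eps / p) >= eps / (p + eps)
        assert e_p >= math.log(1.0 + eps / p)
        assert eps / p >= e_p
```

&lt;p&gt;Already at \(\varepsilon=1\) and \(p=0.1\) we get \(\varepsilon_p \approx 2.9\), far below the naive approximation \(\varepsilon/p = 10\), which foreshadows the plots below.&lt;/p&gt;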

&lt;p&gt;To get an idea of how good this approximation actually is, let’s plot the approximation ratio &lt;a id=&quot;eq7&quot;&gt;&lt;/a&gt;\[\frac{\text{noise_scale}(M)}{\text{noise_scale}(\widetilde{M}_p)} = \frac{p\varepsilon_p}{\varepsilon} = \frac{p}{\varepsilon} \log\left(1 + \frac{1}{p} \big( e^{\varepsilon}-1 \big)\right) \approx 1:\tag{7}\]
(Per &lt;a href=&quot;#eq6&quot;&gt;Equation 6&lt;/a&gt;, this ratio is bounded: \(\frac{p}{p+\varepsilon} \le \frac{p\varepsilon_p}{\varepsilon} \le 1\).)&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;&lt;img src=&quot;/images/subsampling-ratio-p.png&quot; alt=&quot;Plot of p*eps\_p/eps as a function of p for eps=0.01,0.1,1,2&quot; width=&quot;768&quot; height=&quot;576&quot; /&gt;&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt;&lt;img src=&quot;/images/subsampling-ratio-eps.png&quot; alt=&quot;Plot of p*eps\_p/eps as a function of eps for p=0.001,0.01,0.1,0.5&quot; width=&quot;768&quot; height=&quot;576&quot; /&gt;&lt;/p&gt;
&lt;p&gt;This doesn’t look so good!
The approximation we made in &lt;a href=&quot;#eq3&quot;&gt;Equation 3&lt;/a&gt; tells us that all of the plotted lines should be close to 1.
But this seems to only be accurate when the subsampling probability \(p\) is large or when the privacy parameter \(\varepsilon\) is &lt;em&gt;very&lt;/em&gt; small.
Large subsampling probability \(p\) doesn’t make much sense for subsampling; we don’t get much speedup. So the question is &lt;em&gt;how small does the privacy parameter \(\varepsilon\) need to be?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Roughly, if we want the approximation in &lt;a href=&quot;#eq3&quot;&gt;Equation 3&lt;/a&gt; to be good within constant factors, then  the privacy parameter \(\varepsilon\) needs to scale linearly with the subsampling probability \(p\). I.e., \(\varepsilon=cp\) for a constant \(c\). Let’s see what the ratio looks like for various constants:&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt;&lt;img src=&quot;/images/subsampling-ratio-c.png&quot; alt=&quot;Plot of p*eps\_p/eps as a function of p for eps=p*const where const=0.2,0.5,2,5&quot; width=&quot;768&quot; height=&quot;576&quot; /&gt;&lt;/p&gt;
&lt;p&gt;This looks slightly better. In particular, if \(\varepsilon \le 2p\), then \(\frac{p\varepsilon_p}{\varepsilon} \ge \frac{1}{2}\), which means the subsampled Laplace mechanism \(\widetilde{M}_p\) adds at most twice as much noise as the standard Laplace mechanism \(M\).&lt;/p&gt;

&lt;p&gt;In general, if we set \(\varepsilon \le cp\), then the ratio in &lt;a href=&quot;#eq7&quot;&gt;Equation 7&lt;/a&gt; is lower bounded by
&lt;a id=&quot;eq8&quot;&gt;&lt;/a&gt;\[\inf_{p\in(0,1],\varepsilon \in (0,cp]}\frac{p}{\varepsilon} \log\left(1 + \frac{1}{p} \big( e^{\varepsilon}-1 \big)\right) = \frac{1}{c} \log\big( 1+c\big).\tag{8}\] In other words, if \(\varepsilon \le cp\), then the subsampled Laplace mechanism \(\widetilde{M}_p\) adds at most \(\frac{c}{\log(1+c)}\) times as much noise as the standard Laplace mechanism \(M\). 
Here’s what this function looks like:&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt;&lt;img src=&quot;/images/subsampling-ratio-lim.png&quot; alt=&quot;Plot of c/log(1+c) as a function of c&quot; width=&quot;768&quot; height=&quot;576&quot; /&gt;&lt;/p&gt;
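&lt;p&gt;A quick numerical check of &lt;a href=&quot;#eq8&quot;&gt;Equation 8&lt;/a&gt;’s lower bound (an illustrative sketch; the function name is mine):&lt;/p&gt;

```python
import math

def ratio(eps, p):
    """Equation 7: noise_scale(M) / noise_scale(M_p) = p * eps_p / eps."""
    return (p / eps) * math.log(1.0 + (math.exp(eps) - 1.0) / p)

# Equation 8: with eps = c*p, the ratio never falls below log(1+c)/c,
# i.e. the subsampled mechanism adds at most c/log(1+c) times the noise.
c = 2.0
bound = math.log(1.0 + c) / c
for i in range(1, 1000):
    p = i / 1000.0
    assert ratio(c * p, p) >= bound
```

&lt;p&gt;For \(c=2\) the bound is \(\frac{1}{2}\log 3 \approx 0.549\), consistent with the factor-of-two noise claim above.&lt;/p&gt;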

&lt;p&gt;This bound on the ratio seems reasonable as long as \(c\) isn’t large. 
However, assuming \(\varepsilon \le cp\) is a pretty strong assumption!
This is the big limitation of privacy amplification by subsampling – &lt;em&gt;subsampling is free only when the privacy parameter is tiny&lt;/em&gt;.&lt;/p&gt;

&lt;h2 id=&quot;is-varepsilon-le-c-p--a-reasonable-parameter-regime&quot;&gt;Is \(\varepsilon \le c p \) a reasonable parameter regime?&lt;/h2&gt;
&lt;p&gt;It depends…&lt;/p&gt;

&lt;p&gt;Let’s think about the machine learning application that is the biggest motivation for studying privacy amplification by subsampling.&lt;/p&gt;

&lt;p&gt;In machine learning applications we want to answer many queries \(q_1,q_2,\cdots,q_k\). (These queries are actually high-dimensional gradients that we want to estimate, but that’s not important right now.)
Suppose we have some overall privacy budget \(\varepsilon_*\). Then this needs to be divided among the \(k\) queries. Using advanced composition, we get a per-query budget of \(\varepsilon = \Theta\left(\frac{\varepsilon_*}{\sqrt{k}}\right)\).&lt;sup id=&quot;fnref:advcomp&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:advcomp&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;The overall privacy budget \(\varepsilon_*\) is a constant. So as the number of queries \(k\) increases, the per-query privacy budget shrinks; \(\varepsilon = \Theta(1/\sqrt{k})\). That’s good for subsampling; we are in the small \(\varepsilon\) regime.&lt;/p&gt;

&lt;p&gt;Now we want \(\varepsilon \le cp\) for privacy amplification by subsampling, where \(c\) is a small constant. Thus we need \(p \ge \Omega(1/\sqrt{k})\) in the machine learning application. Is this reasonable?&lt;/p&gt;

&lt;p&gt;The quantity \(pk\) is the expected number of times each datapoint will be sampled over the \(k\) queries.
In machine learning parlance, \(pk\) is the number of training epochs and \(k\) is the number of steps.
Thus \(p \ge \Omega(1/\sqrt{k})\) implies that the number of epochs is \(pk \ge \Omega(\sqrt{k})\), which is a lot. It’s common to train with as little as one epoch.&lt;/p&gt;

&lt;p&gt;The expected size of each subsample (a.k.a. the batch size) is \(p|x|\), where \(|x|\) is the overall dataset size. We typically want the batch size to be a moderate constant – e.g., &lt;a href=&quot;https://xcancel.com/ylecun/status/989610208497360896&quot;&gt;32&lt;/a&gt;.&lt;sup id=&quot;fnref:parallel&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:parallel&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; So we want \(p \le O(1/|x|)\), but privacy amplification by subsampling would need us to set \(\varepsilon \le cp \le O(1/|x|)\). As before, with \(\varepsilon = \Theta(1/\sqrt{k})\), this would correspond to \(k \ge \Omega(|x|^2)\) steps and \(kp \ge \Omega(|x|)\) epochs. The number of steps being quadratic in the dataset size and the number of epochs being linear in the dataset size is &lt;em&gt;a lot&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The takeaway from this back-of-the-envelope calculation is that \(\varepsilon \le cp\) is well outside the typical parameter regime for machine learning applications.
We have to set the hyperparameters differently for private machine learning.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;To summarize, in &lt;a href=&quot;/subsampling&quot;&gt;our previous post&lt;/a&gt; the story was that privacy amplification by subsampling can be used to make differentially private algorithms faster and this comes at essentially no cost in privacy and accuracy. But, in this post, we observe that this is free only if the privacy parameter \(\varepsilon\) is tiny. Specifically, the privacy parameter needs to be on the order of the subsampling probability – i.e., \(\varepsilon\le O(p)\) – for the claim to hold up to constant factors.&lt;/p&gt;

&lt;p&gt;In these posts, we’ve looked at univariate queries with Laplace noise. 
In the machine learning application, we would instead have high-dimensional queries (i.e., model gradients) with Gaussian noise.
This adds a fair bit of complexity, but the moral of the story remains the same.&lt;sup id=&quot;fnref:complex&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:complex&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Practitioners of differentially private machine learning have observed that larger batch sizes yield better results. The purpose of this post is to make this folklore knowledge more widely accessible.&lt;/p&gt;

&lt;p&gt;To be clear, the limits of privacy amplification by subsampling are a very real problem in practice. 
Increasing the batch size mitigates the problem, but often comes at a high computational cost.&lt;sup id=&quot;fnref:parallel:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:parallel&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;
Thus, in recent years, there has been a lot of research that seeks to &lt;em&gt;avoid&lt;/em&gt; the limits of privacy amplification by subsampling.&lt;sup id=&quot;fnref:dpftrl&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:dpftrl&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:taylor&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Looking at the second- and third-order terms in the Taylor series in &lt;a href=&quot;#eq5&quot;&gt;Equation 5&lt;/a&gt;, we can already see that this approximation may be problematic when the subsampling probability \(p\) is small, since these terms include factors of \(1/p^2\) and \(1/p^3\) respectively. &lt;a href=&quot;#fnref:taylor&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:ineq&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;To prove the inequalities in &lt;a href=&quot;#eq6&quot;&gt;Equation 6&lt;/a&gt;: Since \(\log\) is concave, Jensen’s inequality gives \(\log(1-p+pe^{\varepsilon/p}) \ge (1-p)\log(1) + p \log(e^{\varepsilon/p}) = \varepsilon\); rearranging yields the upper bound \(\varepsilon/p \ge \log(1+(e^\varepsilon-1)/p)\). On the other hand \(\varepsilon \le e^\varepsilon-1\), which yields the first inequality on the lower bound side. Finally, we have \(\log(1+x) = \int_0^x \frac{1}{1+t} \mathrm{d}t \ge \int_0^x \frac{1}{(1+t)^2} \mathrm{d}t = \frac{x}{1+x}\) for all \(x\ge0\); substituting \(x=\varepsilon/p\) yields the second inequality on the lower bound side. &lt;a href=&quot;#fnref:ineq&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:advcomp&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;We’re being a bit imprecise here. We can’t apply advanced composition with pure \((\varepsilon,0)\)-differential privacy. So the overall privacy budget \(\varepsilon_*\) needs to be quantified in terms of approximate \((\varepsilon_*,\delta_*)\)-differential privacy, concentrated differential privacy, or something like that. To make things formal we could set the overall privacy budget constraint as \(\frac{1}{2}\varepsilon_*^2\)-&lt;a href=&quot;https://arxiv.org/abs/1605.02065&quot;&gt;zCDP&lt;/a&gt;, which gives a per-query budget of pure \((\varepsilon=\frac{\varepsilon_*}{\sqrt{k}},0)\)-differential privacy. &lt;a href=&quot;#fnref:advcomp&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:parallel&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;The ideal batch size (in non-private machine learning) is determined by many factors – ultimately, you try a few settings and use whatever works best. Some very rough intuition: A major factor in determining the right batch size is hardware parallelism/pipelining (and memory constraints). Absent parallelism, smaller batch size is typically better – right down to batch size 1; generally, you make faster progress by updating the model parameters after each gradient computation. However, batch size 1 doesn’t exploit the fact that the computer hardware can usually compute multiple gradients at the same time. Larger batch sizes allow you to get more work out of the hardware in the same amount of time. But once you saturate the hardware, there’s little benefit (non-privately) to larger batch sizes. &lt;a href=&quot;#fnref:parallel&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:parallel:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:complex&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;The main added complexity of working with Gaussian noise and high-dimensional queries comes from the fact that we can’t use pure \((\varepsilon,0)\)-differential privacy for the analysis. And, if we use approximate \((\varepsilon,\delta)\)-differential privacy for the analysis, we incur superfluous \(\sqrt{\log(1/\delta)}\) factors. To get a sharper analysis we need to work with Rényi differential privacy or numerically compute the privacy loss distribution. There is a lot of very interesting work on this topic, but the high-level conclusion remains the same. &lt;a href=&quot;#fnref:complex&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:dpftrl&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;For example, &lt;a href=&quot;https://arxiv.org/abs/2103.00039&quot;&gt;DP-FTRL&lt;/a&gt; adds negatively correlated noise instead of independent noise to the queries/gradients. Since DP-FTRL doesn’t rely on privacy amplification by subsampling, the noise added to each query/gradient needs to be large. Instead, DP-FTRL relies on the fact that, when you sum up the noisy values, the noise can be made to partially cancel out. In practice, DP-FTRL often works better than relying on privacy amplification by subsampling. Another alternative is to forgo privacy amplification by subsampling entirely: compute gradients on the full dataset and accelerate the computation using &lt;a href=&quot;https://arxiv.org/abs/2305.13209&quot;&gt;second-order methods&lt;/a&gt; so that fewer iterations are required. &lt;a href=&quot;#fnref:dpftrl&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <author>
        
            <name>Thomas Steinke</name>
        
        </author>
        <pubDate>Mon, 21 Apr 2025 12:00:00 -0700</pubDate>
        <link>https://differentialprivacy.org/subsampling-limits/</link>
        <guid isPermaLink="true">https://differentialprivacy.org/subsampling-limits/</guid>
      </item>
    
      <item>
        <title>Privacy Amplification by Subsampling</title>
        <description>&lt;p&gt;Privacy Amplification by Subsampling is an important property of differential privacy. 
It is key to making many algorithms efficient – particularly in machine learning applications.
Thus a lot of work has gone into analyzing this phenomenon.
In this post we will give a quick introduction to privacy amplification by subsampling and its applications. 
In a &lt;a href=&quot;/subsampling-limits&quot;&gt;follow-up post&lt;/a&gt;, we’re going to look at the limitations of privacy amplification by subsampling – i.e., when it doesn’t quite live up to the promises.&lt;/p&gt;

&lt;h2 id=&quot;what-is-privacy-amplification-by-subsampling&quot;&gt;What is Privacy Amplification by Subsampling?&lt;/h2&gt;

&lt;p&gt;The premise of privacy amplification by subsampling is that we start with a (large) dataset \(x\) and we pick a (small) random subset \(S(x) \subseteq x\) and run a DP algorithm \(M\) on that subset.&lt;sup id=&quot;fnref:notation&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:notation&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;
The question is: &lt;em&gt;What are the privacy properties of the combined algorithm \(M \circ S\)?&lt;/em&gt;
The answer depends on both the privacy properties of base algorithm \(M\) and the subsampling procedure \(S\).&lt;/p&gt;

&lt;p&gt;Intuitively, there are two reasons why the combined algorithm \(M \circ S\) should have better privacy properties than the base algorithm \(M\):
First, there is some probability \(p\) that your data \(x_i\) is included in the subsample – i.e. \(p = \mathbb{P}[x_i\in S(x)]\).&lt;sup id=&quot;fnref:up&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:up&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; But, with probability \(1-p\), your data is &lt;em&gt;not&lt;/em&gt; included. And, when your data is not included, you have perfect privacy.
Second, the privacy adversary does not know whether or not your data is included in the subsample. This ambiguity enhances your privacy even in the case where your data is included.&lt;sup id=&quot;fnref:amb&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:amb&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;There are different possible subsampling procedures \(S\).
A natural subsampling scheme is for the subsample \(S(x)\) to be a fixed-size subset of the dataset \(x\) that is otherwise uniformly random.
However, it turns out to work better if each person’s data is included &lt;em&gt;independently&lt;/em&gt;. This subsampling procedure is known as Poisson subsampling.&lt;sup id=&quot;fnref:poisson&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:poisson&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; We denote Poisson subsampling by \(S_p\), where \(p\in[0,1]\) is the probability of inclusion. 
In this case, the size of the subsample is not fixed. Assuming each person’s data is included with the same probability \(p\), the size is binomially distributed: &lt;a id=&quot;eq1&quot;&gt;&lt;/a&gt;\[|S_p(x)| \sim \mathsf{Binomial}(|x|,p).\tag{1}\]
It also turns out to be easier to analyze differential privacy with respect to addition or removal of one person’s data, rather than with respect to replacement.&lt;/p&gt;
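Poisson subsampling is straightforward to implement: draw an independent inclusion flag for each record. A minimal NumPy sketch, with an illustrative dataset and inclusion probability:

```python
import numpy as np

def poisson_subsample(data, p, rng):
    """Include each record independently with probability p (Poisson subsampling)."""
    # Independent Bernoulli(p) inclusion flag per record.
    mask = rng.binomial(1, p, size=len(data)).astype(bool)
    return data[mask]

rng = np.random.default_rng(0)
x = np.arange(10_000)
s = poisson_subsample(x, 0.05, rng)
# Per Equation 1, the subsample size is Binomial(10000, 0.05), concentrated near 500.
print(len(s))
```

Note that the subsample size is random, matching the binomial distribution in Equation 1.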

&lt;p&gt;There are &lt;em&gt;many&lt;/em&gt; privacy amplification by subsampling results in the literature. The gist of them is pretty much the same; the differences are about the specific assumptions they make and how tight they are. Next we’ll state and prove a very simple version.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Theorem 1 (Privacy Amplification by Subsampling for Poisson-Subsampled Approximate DP).&lt;/strong&gt; &lt;a id=&quot;thm1&quot;&gt;&lt;/a&gt;
Let \(S_p : \mathcal{X}^* \to \mathcal{X}^*\) be the Poisson subsampling operation with probability \(p\).&lt;sup id=&quot;fnref:notation:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:notation&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; That is, for all inputs \(x\), we have \(S_p(x) \subseteq x\) where each \(x_i \in x\) is included in \(S_p(x)\) independently with probability \(p\).
Let \(M : \mathcal{X}^* \to \mathcal{Y}\) satisfy \((\varepsilon,\delta)\)-differential privacy with respect to addition or removal of one person’s data.
Let \(M \circ S_p : \mathcal{X}^* \to \mathcal{Y}\) denote the combined algorithm that first subsamples and then runs \(M\) – i.e., \(M \circ S_p (x) = M(S_p(x))\) for all \(x\).
Then \(M \circ S_p\) satisfies \((\varepsilon’,\delta’)\)-differential privacy with respect to addition or removal of one person’s data for &lt;a id=&quot;eq2&quot;&gt;&lt;/a&gt;\[\varepsilon’ = \log\big(1+p(\exp(\varepsilon)-1)\big) ~~~~ \text{ and } ~~~~ \delta’ = p \delta. \tag{2}\]&lt;/p&gt;
&lt;/blockquote&gt;
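In code, the guarantee in Equation 2 is a one-line computation; a sketch with illustrative parameters:

```python
import math

def amplify(eps, delta, p):
    """Equation 2: (eps', delta') for M composed with Poisson subsampling S_p."""
    eps_prime = math.log1p(p * math.expm1(eps))  # log(1 + p(exp(eps) - 1))
    return eps_prime, p * delta

# With p = 0.01, eps = 1 shrinks to roughly p * (exp(1) - 1), i.e. about 0.017.
print(amplify(1.0, 1e-6, 0.01))
```

Using `log1p` and `expm1` avoids loss of precision when `eps` or `p` is small.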

&lt;p&gt;&lt;em&gt;Proof.&lt;/em&gt;
Let \(x \in \mathcal{X}^*\) and \(x_i \in x \) be arbitrary. Let \(x’=x\setminus\{x_i\}\) be \(x\) with \(x_i\) removed. Let \(T \subseteq \mathcal{Y}\) be arbitrary.
We have &lt;br /&gt;
\(\mathbb{P}[M(S_p(x)) \in T ] = (1-p) \mathbb{P}[M(S_p(x)) \in T \mid x_i \notin S_p(x)] + p \mathbb{P}[M(S_p(x)) \in T \mid x_i \in S_p(x)] \)
\(~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ = (1-p) \mathbb{P}[M(S_p(x’)) \in T] + p \mathbb{P}[M(S_p(x’)\cup\{x_i\}) \in T]\)&lt;br /&gt;
\(~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ \le (1-p) \mathbb{P}[M(S_p(x’)) \in T] + p (e^\varepsilon \mathbb{P}[M(S_p(x’)) \in T] + \delta ) \)&lt;br /&gt;
\(~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ = (1-p + p e^\varepsilon ) \mathbb{P}[M(S_p(x’)) \in T] + p \delta. \) &lt;br /&gt;
Here we are using the fact that \(S_p(x)\) conditioned on \(x_i \notin S_p(x)\) is just \(S_p(x’)\) and the fact that \(S_p(x)\) conditioned on \(x_i \in S_p(x)\) is just \(S_p(x’)\cup\{x_i\}\). (This relies on the independence of Poisson sampling.)
This establishes half of the result. The other direction is similar:&lt;br /&gt;
\(\mathbb{P}[M(S_p(x)) \in T ] = (1-p) \mathbb{P}[M(S_p(x)) \in T \mid x_i \notin S_p(x)] + p \mathbb{P}[M(S_p(x)) \in T \mid x_i \in S_p(x)] \)
\(~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ = (1-p) \mathbb{P}[M(S_p(x’)) \in T] + p \mathbb{P}[M(S_p(x’)\cup\{x_i\}) \in T]\)&lt;br /&gt;
\(~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ \ge (1-p) \mathbb{P}[M(S_p(x’)) \in T] + p e^{-\varepsilon}( \mathbb{P}[M(S_p(x’)) \in T] - \delta ) \)&lt;br /&gt;
\(~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ = (1-p+p e^{-\varepsilon}) \mathbb{P}[M(S_p(x’)) \in T] - p e^{-\varepsilon} \delta \)&lt;br /&gt;
This rearranges to&lt;br /&gt;
\( \mathbb{P}[M(S_p(x’)) \in T] \le \frac{\mathbb{P}[M(S_p(x)) \in T ]+p e^{-\varepsilon}\delta}{1-p+p e^{-\varepsilon}} \le (1-p+pe^\varepsilon)\mathbb{P}[M(S_p(x)) \in T ] + p\delta,\)&lt;br /&gt;
as required. (The inequalities \(\frac{1}{1-p+pe^{-\varepsilon}} \le 1-p+pe^\varepsilon\) and \(\frac{e^{-\varepsilon}}{1-p+pe^{-\varepsilon}} \le 1\) are left as exercises for the reader.)
∎&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;#thm1&quot;&gt;Theorem 1&lt;/a&gt; is exactly tight. That’s because the proof really only has one inequality. In particular, it is tight when the algorithm is randomized response applied to the bit indicating whether or not your data is included in the subsample.&lt;/p&gt;

&lt;h2 id=&quot;why-is-privacy-amplification-by-subsampling-useful&quot;&gt;Why is Privacy Amplification by Subsampling Useful?&lt;/h2&gt;

&lt;p&gt;Let’s work out a simplified illustrative example for why privacy amplification by subsampling is useful.
Let’s assume we have a large dataset \(x\in\mathcal{X}^*\) and a query \(q:\mathcal{X}\to[0,1]\) and our goal is to privately estimate the average value of the query on the dataset \(\frac{1}{n}\sum_{x_i \in x} q(x_i)\).&lt;/p&gt;

&lt;p&gt;The obvious solution is the Laplace mechanism: &lt;a id=&quot;eq3&quot;&gt;&lt;/a&gt;\[M(x) := \frac{1}{n}\sum_{x_i \in x} q(x_i) + \mathsf{Laplace}\left(\frac{1}{\varepsilon n}\right).\tag{3}\]
This is \(\varepsilon\)-differentially private and has mean squared error &lt;a id=&quot;eq4&quot;&gt;&lt;/a&gt;\[\mathbb{E}\left[\left(M(x) - \frac{1}{n}\sum_{x_i \in x} q(x_i)\right)^2\right] = \frac{2}{\varepsilon^2 n^2}. \tag{4}\]
However, this takes time linear in the size of the dataset \(x\); that may be OK for one query, but, if we need to answer \(k\) queries \(q_1,q_2,\cdots,q_k\), this would take \(\Omega(k|x|)\) time.&lt;/p&gt;
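The Laplace mechanism of Equation 3 is a few lines of NumPy; a sketch in which the query values are illustrative stand-ins:

```python
import numpy as np

def laplace_mean(qvals, eps, rng):
    """Equation 3: eps-DP estimate of the mean of q over the dataset.
    Each q(x_i) lies in [0, 1], so the mean has sensitivity 1/n."""
    n = len(qvals)
    return float(np.mean(qvals)) + rng.laplace(0.0, 1.0 / (eps * n))

rng = np.random.default_rng(1)
qvals = rng.random(100_000)          # stand-in values of q(x_i)
est = laplace_mean(qvals, 1.0, rng)
# Per Equation 4, the mean squared error is 2 / (eps^2 n^2): noise of order 1e-5 here.
print(est)
```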

&lt;p&gt;Suppose we can subsample from the dataset in sublinear time.&lt;sup id=&quot;fnref:supp&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:supp&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt; Ideally, suppose we can compute \(S_p(x)\) in \(O(p|x|)\) time (on average).
Then we can run the Laplace mechanism on the subsample: &lt;a id=&quot;eq5&quot;&gt;&lt;/a&gt;
\[\widetilde{M}_{p}(x) := \frac{1}{pn} \sum_{x_i \in S_p(x)} q(x_i) + \mathsf{Laplace}\left(\frac{1}{\varepsilon_p p n}\right) .\tag{5}\]
This is faster, but how does it compare in terms of privacy and accuracy?&lt;/p&gt;
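A sketch of the subsampled mechanism in Equation 5, again with illustrative query values; Poisson subsampling is implemented with independent Bernoulli inclusion flags:

```python
import numpy as np

def subsampled_laplace_mean(qvals, p, eps_p, rng):
    """Equation 5: Poisson-subsample, then run the Laplace mechanism on the subsample."""
    n = len(qvals)                                 # treating n = |x| (see footnote 6)
    mask = rng.binomial(1, p, size=n)              # independent inclusion flags
    subtotal = float(np.dot(mask, qvals))          # sum of q over the subsample only
    return subtotal / (p * n) + rng.laplace(0.0, 1.0 / (eps_p * p * n))

rng = np.random.default_rng(2)
qvals = rng.random(200_000)
est = subsampled_laplace_mean(qvals, p=0.1, eps_p=1.0, rng=rng)
print(est)
```

Only the roughly `p * n` records in the subsample contribute to the sum, which is where the speedup comes from.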

&lt;p&gt;Before privacy amplification by subsampling, \(\widetilde{M}_p\) satisfies \(\varepsilon_p\)-differential privacy.
Applying &lt;a href=&quot;#thm1&quot;&gt;Theorem 1&lt;/a&gt; we conclude that it satisfies \(\varepsilon’\)-differential privacy with \(\varepsilon’ = \log(1+p(e^{\varepsilon_p}-1))\).
If we want to set \(\varepsilon_p\) to achieve \(\varepsilon’=\varepsilon\), we can invert this formula to get &lt;a id=&quot;eq6&quot;&gt;&lt;/a&gt;\[\varepsilon_p = \log\left(1 + \frac{1}{p} \big( e^{\varepsilon}-1 \big)\right) \approx \frac{\varepsilon}{p}. \tag{6} \]
The approximation comes from the first order Taylor series: \(\log(1+v) = v+O(v^2)\) and \(e^v-1 = v+O(v^2)\) for \(v\to0\).&lt;/p&gt;
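Inverting the amplification bound as in Equation 6 is again a one-line computation; a sketch with an illustrative target budget and sampling rate:

```python
import math

def per_query_budget(eps_target, p):
    """Equation 6: the budget eps_p we can afford on the subsample so that
    amplification brings the overall guarantee back to eps_target."""
    return math.log1p(math.expm1(eps_target) / p)

eps_p = per_query_budget(0.001, 0.01)
# For small eps_target, this is close to the approximation eps_target / p = 0.1.
print(eps_p)
```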

&lt;p&gt;On the accuracy front, we have &lt;a id=&quot;eq7&quot;&gt;&lt;/a&gt;\[ \mathbb{E}\big[\widetilde{M}_p(x)\big] = \frac{1}{n}\sum_{x_i \in x} q(x_i) .\tag{7}\] That is, \(\widetilde{M}_p\) is unbiased.
In terms of variance, we have &lt;a id=&quot;eq8&quot;&gt;&lt;/a&gt;
\[ \mathbb{E}\left[\left(\widetilde{M}_p(x) - \frac{1}{n}\sum_{x_i \in x} q(x_i) \right)^2\right] = \frac{p(1-p)}{p^2 n^2} \sum_{x_i \in x} q(x_i)^2 + \frac{2}{\varepsilon_p^2 p^2 n^2}\]
\[~~~~~~~~~~~~~~~~~~~~~~~~~~~~ \le \frac{|x|}{p n^2} + \frac{2}{\varepsilon_p^2 p^2 n^2}\]
\[~~~~~~~~~~~~~~~~~~~~~~~~~~ \approx \frac{1}{p n} + \frac{2}{\varepsilon^2 n^2}.\tag{8}\]
In the last step we substitute in the approximation from &lt;a href=&quot;#eq6&quot;&gt;Equation 6&lt;/a&gt;.&lt;sup id=&quot;fnref:nx&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:nx&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Now let’s compare the linear-time mechanism \(M\) with the subsampled mechanism \(\widetilde{M}_p\): 
We have the same privacy guarantee.
Comparing the accuracy guarantee in &lt;a href=&quot;#eq4&quot;&gt;Equation 4&lt;/a&gt; with that in &lt;a href=&quot;#eq8&quot;&gt;Equation 8&lt;/a&gt; we see two differences – the approximation (more on that shortly) and the extra \(\frac{1}{pn}\) term.
This extra term is a low order term when &lt;a id=&quot;eq9&quot;&gt;&lt;/a&gt;
 \[\frac{1}{pn} \le \frac{1}{\varepsilon^2 n^2} \iff p \ge \varepsilon^2 n \iff \varepsilon \le \sqrt{\frac{p}{n}}.\tag{9}\]
 In other words, when \(\varepsilon\) is sufficiently small, the statistical error \(\frac{1}{\sqrt{pn}}\) is dominated by the scale of the noise added for privacy \(\frac{1}{\varepsilon_p p n}\approx\frac{1}{\varepsilon n}\). 
 The statistical error is unrelated to privacy; it is something people are used to and we don’t need to worry about it too much.&lt;sup id=&quot;fnref:errdim&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:errdim&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;7&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;The upshot is that, for sufficiently small values of \(\varepsilon\), the error of the subsampled Laplace mechanism \(\widetilde{M}_p\) is approximately the same as the standard Laplace mechanism \(M\).
 Thus we get a faster algorithm with essentially no cost in privacy and accuracy.&lt;/p&gt;

&lt;p&gt;This is very useful in machine learning applications, where the query \(q\) computes a gradient.
However, gradients are usually high-dimensional, rather than one-dimensional.
This adds some complexity, but doesn’t fundamentally alter the story; essentially we need to analyze Gaussian noise rather than Laplace noise.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;To summarize, we showed that privacy amplification by subsampling can be used to make differentially private algorithms faster. 
This comes at essentially no cost in privacy and accuracy, which is why it’s a really valuable tool.&lt;/p&gt;

&lt;p&gt;In &lt;a href=&quot;/subsampling-limits&quot;&gt;the next post&lt;/a&gt;, we’re going to look a little deeper at when the story above breaks down. When do we need to pay in privacy or accuracy for privacy amplification by subsampling?&lt;/p&gt;

&lt;p&gt;If you want to dig deeper into privacy amplification by subsampling, see, e.g., &lt;a href=&quot;https://arxiv.org/abs/2210.00597&quot;&gt;this survey&lt;/a&gt; and the references therein.&lt;/p&gt;

&lt;hr /&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:notation&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This post uses set notation \(x_i \in S(x) \subseteq x\) somewhat informally. Things become a bit imprecise if there are duplicates – i.e., \(x_i=x_j\) for \(i \ne j\), so we assume this issue doesn’t arise. To make things formal we could define the index set \(S\) of the subsample separate from the subsample \(S(x)\); then we would condition on \(i \in S\) instead of \(x_i \in S(x)\). We use \(\mathcal{X}^* = \bigcup_{n=0}^\infty \mathcal{X}^n\) to denote the set of all finite tuples/multisets with elements in \(\mathcal{X}\). &lt;a href=&quot;#fnref:notation&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:notation:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:up&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;For simplicity, we assume that the probability of inclusion \(\mathbb{P}[x_i\in S(x)]\) is the same for all individuals \(i\). In general, it can be different, in which case we would work with the largest probability \(p = \max_i \mathbb{P}[x_i\in S(x)]\). &lt;a href=&quot;#fnref:up&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:amb&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Under pure differential privacy, there is no privacy amplification by subsampling when the adversary knows whether or not your data was included in the subsample. (However, under approximate or Rényi differential privacy there is some amplification, but less than when the subsample remains secret.) &lt;a href=&quot;#fnref:amb&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:poisson&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Intuitively, the reason independent inclusion is better than having a fixed-size subsample is that, if the size of the subsample is known, then knowing whether other people’s data is included or excluded reveals information about whether your data is included or excluded. I have no idea why it’s called Poisson subsampling instead of Binomial subsampling. &lt;a href=&quot;#fnref:poisson&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:supp&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This is a nontrivial supposition. Often different subsampling schemes are used in practice because they are easier to implement than Poisson subsampling. &lt;a href=&quot;#fnref:supp&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:nx&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Sweeping details under the rug: Since we’re defining differential privacy with respect to addition or removal of one person’s data, the size of the dataset \(|x|\) is itself private. Thus we only assume that \(n \approx |x|\). &lt;a href=&quot;#fnref:nx&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:errdim&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;For simplicity, we’re looking at one-dimensional estimation. In higher dimensions, there’s an additional reason why the statistical error term isn’t a big deal: The error due to privacy grows with the dimension, while the statistical error doesn’t. &lt;a href=&quot;#fnref:errdim&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <author>
        
            <name>Thomas Steinke</name>
        
        </author>
        <pubDate>Sun, 13 Apr 2025 7:00:00 -0700</pubDate>
        <link>https://differentialprivacy.org/subsampling/</link>
        <guid isPermaLink="true">https://differentialprivacy.org/subsampling/</guid>
      </item>
    
      <item>
        <title>Differentially Private Algorithms that Never Fail</title>
        <description>&lt;p&gt;Most differentially private algorithms fail with some nonzero probability. For example, when adding Gaussian or Laplace noise, there is some chance that the noise deviates significantly from its mean. But, fortunately, large deviations are unlikely.
In this post we’re going to take a closer look at failure modes of DP algorithms and we’ll present some generic methods for reducing – or even eliminating – the failure probability.&lt;/p&gt;

&lt;p&gt;Let’s be precise about what we mean by failure probability:
Let’s assume we have an \((\varepsilon,\delta)\)-differentially private algorithm \(M : \mathcal{X}^n \to \mathcal{Y}\) and we have a loss function \(\ell : \mathcal{Y} \times \mathcal{X}^n \to \mathbb{R}\).&lt;sup id=&quot;fnref:loss&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:loss&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;
The (worst-case)&lt;sup id=&quot;fnref:fail&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:fail&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; &lt;em&gt;failure probability&lt;/em&gt; \(\beta\) of \(M\) is \[\beta := \max_{x\in\mathcal{X}^n} \mathbb{P}[\ell(M(x),x)&amp;gt;\alpha],\tag{1}\] where \(\alpha\) is some target value for the loss.&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;For example, if \(M(x)=q(x)+\mathsf{Laplace}(1/\varepsilon)\) is the Laplace mechanism and \(\ell(y,x)=|y-q(x)|\) is the absolute error, then the failure probability is the tail probability \(\beta = \exp(-\varepsilon\alpha)\).
If we want to eliminate the failure probability, we could use &lt;em&gt;truncated&lt;/em&gt; Laplace noise instead of regular Laplace noise.&lt;sup id=&quot;fnref:tlap&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:tlap&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; And – spoiler alert – that’s the kind of method we’re going to look at.&lt;/p&gt;
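As a hedged aside, truncated Laplace noise can be sampled directly by inverting its CDF, with no rejection loop. This sketch shows only the sampler; calibrating the truncation bound so that the resulting mechanism remains differentially private is a separate step (see the footnote):

```python
import math
import random

def truncated_laplace(scale, bound, rng):
    """Sample Laplace(scale) conditioned on the noise having magnitude at most
    bound, by inverting the truncated CDF (no rejection loop needed)."""
    v = 2.0 * rng.random() - 1.0                  # uniform on (-1, 1)
    w = -math.expm1(-bound / scale)               # 1 - exp(-bound/scale)
    return -scale * math.copysign(1.0, v) * math.log1p(-abs(v) * w)

rng = random.Random(0)
print(truncated_laplace(1.0, 3.0, rng))           # always lands in [-3, 3]
```

Rescaling the uniform variate by `w` compresses the standard inverse Laplace CDF so that all the mass lands inside the truncation interval.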

&lt;p&gt;To be clear, in this post we’re talking about failures of &lt;em&gt;utility&lt;/em&gt;, which are different from failures of &lt;em&gt;privacy&lt;/em&gt;.
In &lt;a href=&quot;/flavoursofdelta/&quot;&gt;a previous post&lt;/a&gt;, we talked about privacy failures; roughly, the \(\delta\) in \((\varepsilon,\delta)\)-DP captures the probability of a privacy failure. Privacy failures are a lot harder to fix than utility failures (which is kinda the point of this post).&lt;/p&gt;

&lt;p&gt;Here’s our problem: We’re given a DP algorithm \(M\) with failure probability \(\beta\), and we want to modify the algorithm to get a new DP algorithm \(\widetilde{M}\) with failure probability \(\widetilde{\beta}&amp;lt;\beta\). Ideally, we want \(\widetilde{\beta}=0\).&lt;/p&gt;

&lt;h2 id=&quot;warmup-absorbing-the-failure-probability-into-delta&quot;&gt;Warmup: Absorbing the failure probability into \(\delta\)&lt;/h2&gt;

&lt;p&gt;Let’s start with a simple trick to get zero failure probability. This trick should hopefully give you some intuition for why it’s even possible to have zero failure probability under DP.&lt;/p&gt;

&lt;p&gt;Suppose that, in addition to the \((\varepsilon,\delta)\)-DP algorithm \(M\) with failure probability \(\beta=\max_x\mathbb{P}[\ell(M(x),x)&amp;gt;\alpha]\), we have a non-private algorithm \(\check{M} : \mathcal{X}^n \to \mathcal{Y}\) that &lt;em&gt;never&lt;/em&gt; fails – i.e., \(\max_x \mathbb{P}[\ell(\check{M}(x),x)&amp;gt;\alpha]=0\).&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Now let’s define \(\widetilde{M}(x)\) as follows. First, compute \(y=M(x)\). If \(\ell(y,x)\le\alpha\), return \(y\). If \(\ell(y,x)&amp;gt;\alpha\), compute \(\check{y}=\check{M}(x)\) and return \(\check{y}\).&lt;sup id=&quot;fnref:loss:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:loss&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Clearly \(\widetilde{M}\) now has zero failure probability. What about privacy?&lt;/p&gt;

&lt;p&gt;Fix arbitrary neighbouring \(x,x’\in\mathcal{X}^n\) and a measurable \(S\subset\mathcal{Y}\).
Define \(S^* := \{ y \in S : \ell(y,x)\le\alpha \}\). 
Now we have
\[ ~~~~~~~~~~~~~~~~~~~~~\mathbb{P}[\widetilde{M}(x)\in S] = \mathbb{P}[M(x)\in S^*] + \mathbb{P}[\ell(M(x),x)&amp;gt;\alpha] \cdot \mathbb{P}[\check{M}(x)\in S]\]
\[ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\le e^\varepsilon \mathbb{P}[M(x’)\in S^*] + \delta + \mathbb{P}[\ell(M(x),x)&amp;gt;\alpha] \cdot 1 \]
\[ \le e^\varepsilon \mathbb{P}[\widetilde{M}(x’)\in S] + \delta + \beta. \tag{2} \]
Thus \(\widetilde{M}\) is \((\varepsilon,\delta+\beta)\)-DP.
In other words, we’ve absorbed the utility failure probability \(\beta\) into the privacy failure probability \(\delta\).&lt;sup id=&quot;fnref:fail:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:fail&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
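The warmup construction is only a few lines of code. A sketch, where `M`, `M_check`, and `loss` are stand-ins for the private algorithm, the non-private fallback, and the loss function:

```python
import operator

def absorb_failure(M, M_check, loss, alpha, x):
    """Warmup: return M(x) when it succeeds, else fall back to the
    non-private M_check, absorbing beta into delta."""
    y = M(x)
    if operator.le(loss(y, x), alpha):   # success: loss(y, x) is at most alpha
        return y
    return M_check(x)

# Toy stand-ins: estimate a mean; this "private" M just fails on purpose.
data = [0.2, 0.4, 0.6]
bad_M = lambda x: 100.0
fallback = lambda x: sum(x) / len(x)
err = lambda y, x: abs(y - sum(x) / len(x))
print(absorb_failure(bad_M, fallback, err, alpha=0.1, x=data))
```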

&lt;p&gt;This trick is neat since it lets us eliminate one of the parameters (\(\beta\)), but, in practice, you might not want to do this. We’re swapping a utility failure for a privacy failure and that often isn’t a great trade.&lt;/p&gt;

&lt;p&gt;This trick only works if you already have a small failure probability \(\beta\).&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt; What if we start with a large failure probability, say, \(\beta=0.1\) or even \(\beta=0.9\)?
We can amplify the probability of getting a successful result by running the algorithm multiple times. Naïvely, the privacy cost increases according to composition; plus we need to select one of the runs to output, which requires looking at the input. This is roughly what we will do next, but we will avoid composition (sort of).&lt;/p&gt;

&lt;h2 id=&quot;avoiding-silent-failures&quot;&gt;Avoiding silent failures&lt;/h2&gt;

&lt;p&gt;Above, we non-privately checked the failure condition \(\ell(M(x),x)&amp;gt;\alpha\).
Intuitively, using a non-private test &lt;em&gt;must&lt;/em&gt; cost us a lot in terms of privacy.
Thus, to do better, we have to rely on a private test of the failure condition.&lt;/p&gt;

&lt;p&gt;We can’t do much with an arbitrary loss function, so we need to make some assumptions.
First, we will assume the loss has sensitivity \(\le1\).
Second, we will assume that there is some wiggle room in the loss threshold \(\alpha\). Specifically, while the original algorithm \(M\) guarantees loss \(\le\alpha\) with probability \(\ge1-\beta\), our modified algorithm will guarantee loss \(\le\widetilde{\alpha}:=\alpha+2\tau\), where \(\tau=O(\log(1/\delta)/\varepsilon)\).&lt;/p&gt;

&lt;p&gt;Are these assumptions reasonable? 
First, if the loss is high-sensitivity, then we can apply tricks like &lt;a href=&quot;/inverse-sensitivity/&quot;&gt;inverse sensitivity&lt;/a&gt; to get a low-sensitivity loss. 
Second, we can contrast with the exponential mechanism, which guarantees loss \(\le\mathsf{OPT}+O(\log|\mathcal{Y}|/\varepsilon)\).
Thus the wiggle room we’re asking for is comparable to (or better than) what we’d get from the exponential mechanism, assuming \(\delta\) isn’t super tiny – specifically, \(\log(1/\delta) \le O(\log|\mathcal{Y}|)\).&lt;/p&gt;

&lt;p&gt;Now we can modify the algorithm \(M : \mathcal{X}^n \to \mathcal{Y}\) to also output an estimate of the loss using truncated Laplace noise. Call this new algorithm \(\overline{M} : \mathcal{X}^n \to \mathcal{Y} \times \mathbb{R}\). If \(M\) is \((\varepsilon,\delta)\)-DP, we can make \(\overline{M}\) satisfy \((\overline{\varepsilon}=2\varepsilon,\overline{\delta}=2\delta)\)-DP and guarantee that the error of the loss estimate is \(\le \tau = O(\log(1/\delta)/\varepsilon)\) with probability 1.&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;7&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;The benefit of this modified DP algorithm \(\overline{M}\) is that it won’t fail silently.
If the loss is high, we will know about it.&lt;/p&gt;

&lt;h2 id=&quot;conditioning-on-success&quot;&gt;Conditioning on success&lt;/h2&gt;

&lt;p&gt;To recap, we have a \((\overline{\varepsilon}=2\varepsilon,\overline{\delta}=2\delta)\)-DP algorithm \(\overline{M} : \mathcal{X}^n \to \mathcal{Y} \times \mathbb{R}\) with the following properties. Let \(x \in \mathcal{X}^n\) be arbitrary. Then, for \( (Y,Z) \gets \overline{M}(x)\), we have
\[\mathbb{P}[\ell(Y,x) \le \alpha]\ge 1-\beta ~~~~~\text{ and }~~~~~ \mathbb{P}[|Z-\ell(Y,x)|\le \tau]=1,\tag{3}\] where \(\tau=O(\log(1/\delta)/\varepsilon)\).
It follows that \(\mathbb{P}[Z \le \alpha + \tau] \ge 1-\beta\) and that \(Z \le \alpha + \tau \implies \ell(Y,x) \le \alpha + 2\tau\) with probability 1.&lt;/p&gt;

&lt;p&gt;Now we define our final algorithm \(\widetilde{M} : \mathcal{X}^n \to \mathcal{Y}\):&lt;/p&gt;

&lt;blockquote&gt;
  &lt;ol&gt;
    &lt;li&gt;Function \(\widetilde{M}(x)\):&lt;/li&gt;
    &lt;li&gt;    Repeat as long as necessary:&lt;/li&gt;
    &lt;li&gt;        Compute \((y,z) \gets \overline{M}(x)\).&lt;/li&gt;
    &lt;li&gt;        If \(z \le \alpha + \tau\), return \(y\) and halt. Otherwise continue.&lt;/li&gt;
  &lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;In other words, the output of \(\widetilde{M}\) is the output of \(\overline{M}\) conditioned on the reported loss being \(\le \alpha + \tau\). In symbols, \(\mathbb{P}[\widetilde{M}(x)=y]=\mathbb{P}[Y=y\mid Z \le \alpha + \tau]\) for \( (Y,Z) \gets \overline{M}(x)\).
The number of times the loop in \(\widetilde{M}\) runs is geometrically distributed with mean \(\le\frac{1}{1-\beta}\).
Note that \(\widetilde{M}\) needs to know&lt;sup id=&quot;fnref:alpha&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:alpha&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;8&lt;/a&gt;&lt;/sup&gt; the utility threshold \(\alpha\) and, if for some reason this threshold is wrong, we could get an infinite loop!&lt;sup id=&quot;fnref:fail:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:fail&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
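The loop above can be sketched in a few lines, where `M_bar`, `alpha`, and `tau` are stand-ins (note the caveat about infinite loops if the threshold is set wrongly):

```python
import operator

def condition_on_success(M_bar, alpha, tau, x):
    """Rerun M_bar(x) until the reported loss estimate z is at most
    alpha + tau, then return the accompanying output y."""
    while True:
        y, z = M_bar(x)
        if operator.le(z, alpha + tau):
            return y

# Toy stand-in: the first two runs report too-high loss, the third succeeds.
runs = iter([(0, 9.0), (1, 9.0), (2, 0.5)])
print(condition_on_success(lambda x: next(runs), alpha=0.0, tau=1.0, x=None))
```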

&lt;p&gt;By construction, we have \(\mathbb{P}[\ell(\widetilde{M}(x),x) \le \alpha + 2\tau ] = 1\). That is, we have zero failure probability.
Now, what about privacy?&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Theorem 1.&lt;/strong&gt; &lt;a id=&quot;thm1&quot;&gt;&lt;/a&gt;
Let \(\widetilde{M} : \mathcal{X}^n \to \mathcal{Y}\) be defined as above.
Assume \(\overline{M} : \mathcal{X}^n \to \mathcal{Y} \times \mathbb{R}\) is \((\overline\varepsilon,\overline\delta)\)-differentially private and, for all inputs \(x\in\mathcal{X}^n\), if \((Y,Z)\gets\overline{M}(x)\), then \(\mathbb{P}[Z\le\alpha+\tau]\ge1-\beta&amp;gt;0\).&lt;sup id=&quot;fnref:fail:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:fail&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;
Then \(\widetilde{M}\) satisfies \((\widetilde{\varepsilon},\widetilde{\delta})\)-differential privacy for
\[\widetilde{\varepsilon}=2\overline{\varepsilon} - \log(1-\overline{\delta}/(1-\beta)) ~~~~\text{ and }~~~~ \widetilde{\delta}=\frac{\overline{\delta}}{1-\beta} .\tag{4}\]&lt;/p&gt;
&lt;/blockquote&gt;
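Evaluating the guarantee of Equation 4 numerically is a one-line computation; a sketch with illustrative parameters:

```python
import math

def conditioned_privacy(eps_bar, delta_bar, beta):
    """Equation 4: (eps~, delta~) for the algorithm conditioned on success."""
    delta_tilde = delta_bar / (1.0 - beta)
    eps_tilde = 2.0 * eps_bar - math.log1p(-delta_tilde)  # -log(1 - delta~)
    return eps_tilde, delta_tilde

# Even with a large failure probability beta = 0.9, eps~ is about 2 * eps_bar.
print(conditioned_privacy(0.2, 1e-6, 0.9))
```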

&lt;p&gt;&lt;em&gt;Proof.&lt;/em&gt;
Let \(x,x’\in\mathcal{X}^n\) be neighbouring inputs. 
Let \((Y,Z) \gets \overline{M}(x)\) and \((Y’,Z’) \gets \overline{M}(x’)\).
The distribution of \(\widetilde{M}(x)\) is that of \(Y\) conditioned on \(Z \le \alpha + \tau\). Similarly, the distribution of \(\widetilde{M}(x’)\) is that of \(Y’\) conditioned on \(Z’ \le \alpha + \tau\).
Let \(S \subset \mathcal{Y}\) be arbitrary but measurable.
It suffices to show that \[\mathbb{P}[Y \in S \mid Z \le \alpha + \tau] \le e^{\widetilde{\varepsilon}} \mathbb{P}[Y’ \in S \mid Z’ \le \alpha + \tau] + \widetilde{\delta}.\]
We have &lt;br /&gt; 
\( \mathbb{P}[Y \in S \mid Z \le \alpha + \tau] \) &lt;br /&gt;  
\(= \frac{\mathbb{P}[Y \in S ~\&amp;amp;~ Z \le \alpha + \tau]}{\mathbb{P}[Z \le \alpha + \tau]}\) &lt;br /&gt;  
\( \le \frac{e^{\overline{\varepsilon}}\mathbb{P}[Y’ \in S ~\&amp;amp;~ Z’ \le \alpha + \tau] + \overline{\delta}}{\mathbb{P}[Z \le \alpha + \tau]}\) &lt;br /&gt;  
\( \le \frac{e^{\overline{\varepsilon}}\mathbb{P}[Y’ \in S ~\&amp;amp;~ Z’ \le \alpha + \tau]}{e^{-\overline{\varepsilon}}(\mathbb{P}[Z’ \le \alpha + \tau]-\overline{\delta})} + \frac{\overline{\delta}}{\mathbb{P}[Z \le \alpha + \tau]}\) &lt;br /&gt;  
\( = \frac{e^{2\overline{\varepsilon}}\mathbb{P}[Y’ \in S ~\&amp;amp;~ Z’ \le \alpha + \tau]}{\mathbb{P}[Z’ \le \alpha + \tau]}\frac{\mathbb{P}[Z’ \le \alpha + \tau]}{\mathbb{P}[Z’ \le \alpha + \tau]-\overline{\delta}} + \frac{\overline{\delta}}{\mathbb{P}[Z \le \alpha + \tau]}\) &lt;br /&gt;  
\( = e^{2\overline{\varepsilon}}\mathbb{P}[Y’ \in S \mid Z’ \le \alpha + \tau] \frac{1}{1-\overline{\delta}/\mathbb{P}[Z’ \le \alpha + \tau]} + \frac{\overline{\delta}}{\mathbb{P}[Z \le \alpha + \tau]} \) &lt;br /&gt;  
\( \le \frac{e^{2\overline{\varepsilon}}}{1-\overline{\delta}/(1-\beta)}\mathbb{P}[Y’ \in S \mid Z’ \le \alpha + \tau] + \frac{\overline{\delta}}{1-\beta} \) &lt;br /&gt;  
\( = e^{\widetilde{\varepsilon}} \mathbb{P}[Y’ \in S \mid Z’ \le \alpha + \tau] + \widetilde{\delta}.\)
∎&lt;/p&gt;

&lt;p&gt;As long as \(1-\beta \ge \Omega(1)\), &lt;a href=&quot;#thm1&quot;&gt;Theorem 1&lt;/a&gt; gives \(\widetilde{\varepsilon} = 2\overline{\varepsilon} + O(\overline{\delta}) = 4\varepsilon + O(\delta)\) and \(\widetilde{\delta}=O(\overline{\delta})=O(\delta)\).  Putting the pieces together, we have the following result.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Theorem 2.&lt;/strong&gt; &lt;a id=&quot;thm2&quot;&gt;&lt;/a&gt;
Let \(\ell : \mathcal{Y} \times \mathcal{X}^n \to \mathbb{R}\) have sensitivity 1 in its second argument and let \(\alpha,\beta,\varepsilon,\delta\in\mathbb{R}\) with \(0\le\beta&amp;lt;1-2\delta&amp;lt;1\).
Let \(M : \mathcal{X}^n \to \mathcal{Y}\) satisfy \((\varepsilon,\delta)\)-DP and assume \(\mathbb{P}[\ell(M(x),x) \le \alpha ] \ge 1-\beta \) for all \(x \in \mathcal{X}^n\). &lt;br /&gt;
Then there exists \(\widetilde{M} : \mathcal{X}^n \to \mathcal{Y}\) that is \((4\varepsilon-\log\left(1-\frac{2\delta}{1-\beta}\right),\frac{2\delta}{1-\beta})\)-DP and \(\mathbb{P}[\ell(\widetilde{M}(x),x) \le \alpha + 2\tau ] = 1\) for all \(x\in\mathcal{X}^n\), where \(\tau=O(\log(1/\delta)/\varepsilon)\) is the truncation threshold for \((\varepsilon,\delta)\)-DP truncated Laplace noise.&lt;sup id=&quot;fnref:tlap:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:tlap&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Note that our utility failure probability \(\beta\) appears in the privacy parameters of &lt;a href=&quot;#thm2&quot;&gt;Theorem 2&lt;/a&gt;.&lt;sup id=&quot;fnref:fail:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:fail&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; This is a bit unintuitive, but we saw earlier how it can happen with the trick of absorbing the utility failure as a privacy failure. The dependence here is milder than before: e.g., we can start with a high utility failure probability such as \(\beta=0.5\) and still get a low final privacy failure probability \(\widetilde{\delta}\le10^{-6}\).&lt;/p&gt;
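&lt;p&gt;To make the parameter accounting concrete, here is a minimal numeric sketch (not from the original post) of the final parameters in &lt;a href=&quot;#thm2&quot;&gt;Theorem 2&lt;/a&gt;; the starting values \(\varepsilon=1\), \(\delta=2.5\times10^{-7}\), and \(\beta=0.5\) are illustrative assumptions.&lt;/p&gt;

```python
import math

def theorem2_params(eps, delta, beta):
    # Final DP parameters after eliminating the utility failure probability beta,
    # starting from an (eps, delta)-DP mechanism (Theorem 2).
    eps_bar, delta_bar = 2 * eps, 2 * delta  # composition with truncated Laplace
    eps_final = 2 * eps_bar - math.log(1 - delta_bar / (1 - beta))
    delta_final = delta_bar / (1 - beta)
    return eps_final, delta_final
```

&lt;p&gt;With these assumed inputs, the final privacy failure probability is exactly \(\widetilde{\delta}=10^{-6}\), even though \(\beta=0.5\).&lt;/p&gt;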

&lt;p&gt;Overall we pay a constant factor in the privacy parameters and suffer an additive increase in the loss in order to eliminate the failure probability.
And (unlike the earlier trick) this is true even if the initial failure probability was quite large.&lt;/p&gt;
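&lt;p&gt;The construction can be sketched in a few lines of code. The following is a toy illustration, not the original post’s implementation: the base mechanism \(M\) (a Laplace mean estimate), the loss, and all parameter choices are assumptions made for the example. The truncated Laplace sampler conditions ordinary Laplace noise on having magnitude at most \(\tau\), following footnote 4.&lt;/p&gt;

```python
import math
import random

def lap(scale):
    # Inverse-CDF sample from the Laplace(0, scale) distribution.
    u = random.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def trunc_lap(scale, tau):
    # Laplace noise conditioned on magnitude at most tau (rejection sampling).
    while True:
        z = lap(scale)
        if tau >= abs(z):
            return z

eps, delta = 1.0, 1e-6
tau = (1 + math.log(1 / (2 * delta))) / eps  # truncation threshold, footnote 4

def M(x):
    # Toy eps-DP base mechanism: Laplace mean estimate (assumes data in [0, 1]).
    n = len(x)
    return sum(x) / n + lap(1 / (n * eps))

def loss(y, x):
    # Empirical loss with sensitivity 1 in x: scaled absolute error of the mean.
    return abs(y - sum(x) / len(x)) * len(x)

def M_bar(x):
    # Augmented mechanism: release the output plus a noisy loss estimate.
    y = M(x)
    return y, loss(y, x) + trunc_lap(1 / eps, tau)

def M_tilde(x, alpha):
    # Rerun M_bar until the reported loss is at most alpha + tau.
    while True:
        y, z = M_bar(x)
        if alpha + tau >= z:
            return y
```

&lt;p&gt;By construction, the returned output always satisfies \(\ell(\widetilde{M}(x),x) \le \alpha + 2\tau\): the accepted noisy loss is at most \(\alpha+\tau\), and the noise is at least \(-\tau\).&lt;/p&gt;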

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;We’ve presented two methods for eliminating the failure probability from DP algorithms.
The first method simply moves the failure from utility to privacy; this has obvious downsides.
The second method avoids these downsides and is applicable even when the initial failure probability is large, but it blows up the privacy parameters by a multiplicative factor and requires some wiggle room in the loss. It is based on a result by Gupta, Ligett, McSherry, Roth, &amp;amp; Talwar [&lt;a href=&quot;https://arxiv.org/abs/0903.4510&quot; title=&quot;Anupam Gupta, Katrina Ligett, Frank McSherry, Aaron Roth, Kunal Talwar. Differentially Private Combinatorial Optimization. SODA 2010.&quot;&gt;GLMRT10&lt;/a&gt; Theorem 10.2].&lt;/p&gt;

&lt;p&gt;In both cases, we crucially exploit the nonzero \(\delta\) in approximate \((\varepsilon,\delta)\)-DP. And one of the high-level take-home messages of this post is simply that \(\delta\) can absorb utility failures, in addition to privacy failures.&lt;/p&gt;

&lt;p&gt;For simplicity, this post has focused on fully eliminating the failure probability.
What if, instead, we just want to reduce it? 
Is \(\delta\) still crucial? No!
The second method can be adapted to give \(\widetilde{\delta}=0\), or even Rényi DP guarantees; however, then we can only reduce the failure probability, not eliminate it entirely.
The math gets messier, but the high-level idea is pretty simple: Instead of using truncated Laplace noise, we use regular Laplace noise (to avoid nonzero \(\delta\)).
This means there’s a chance that \(\overline{M}\) falsely reports a low loss, which means there’s a chance of failure. But, as long as the chance of falsely reporting a low loss is much smaller than the chance of correctly reporting a low loss, the overall failure probability is low.&lt;/p&gt;

&lt;p&gt;If you want to learn more about extensions of the second method, read the papers of Liu &amp;amp; Talwar [&lt;a href=&quot;https://arxiv.org/abs/1811.07971&quot; title=&quot;Jingcheng Liu, Kunal Talwar. Private Selection from Private Candidates. STOC 2019.&quot;&gt;LT19&lt;/a&gt;], Papernot &amp;amp; Steinke [&lt;a href=&quot;https://arxiv.org/abs/2110.03620&quot; title=&quot;Nicolas Papernot, Thomas Steinke. Hyperparameter Tuning with Renyi Differential Privacy. ICLR 2022.&quot;&gt;PS22&lt;/a&gt;], and Cohen, Lyu, Nelson, Sarlós, &amp;amp; Stemmer [&lt;a href=&quot;https://arxiv.org/abs/2211.12063&quot; title=&quot;Edith Cohen, Xin Lyu, Jelani Nelson, Tam&amp;aacute;s Sarl&amp;oacute;s, Uri Stemmer. Generalized Private Selection and Testing with High Confidence. ITCS 2023.&quot;&gt;CLNSS23&lt;/a&gt;].
These methods are particularly useful in settings where the initial success probability is low, e.g. \(1-\beta=0.01\), such as when there is some element of random guessing involved.&lt;/p&gt;

&lt;p&gt;The other key take-home message of this post is that the failure probability shouldn’t be a first-order concern, at least from a theoretical perspective. 
In particular, if we obtain bounds on the expected error, then we can obtain high-probability bounds via this method.&lt;sup id=&quot;fnref:markov&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:markov&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;9&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;In many cases the reductions we presented are not practical; it’s usually easier to directly modify the algorithm to reduce the failure probability.
However, the fact that these generic methods exist offers an explanation for why, in practice, failure probabilities are relatively easy to manage.&lt;/p&gt;

&lt;hr /&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:loss&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Note that we’re implicitly assuming that the loss \(\ell\) is known – i.e., it is something we can compute when designing algorithms. In particular, the loss must be an empirical loss, rather than a population loss. &lt;a href=&quot;#fnref:loss&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:loss:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:fail&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;It’s important that we have a provable worst-case failure probability bound for the original algorithm \(M\), since we want a provable privacy guarantee. In particular, if we only have a heuristic that works for most inputs \(x\), but fails badly on other inputs, then we cannot get a provable DP guarantee using these methods. It is possible that heuristics can be modified to fail gracefully and thus these methods can be salvaged, but that’s beyond the scope of this post. &lt;a href=&quot;#fnref:fail&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:fail:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt; &lt;a href=&quot;#fnref:fail:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt; &lt;a href=&quot;#fnref:fail:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;4&lt;/sup&gt;&lt;/a&gt; &lt;a href=&quot;#fnref:fail:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;5&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;For simplicity, in this post we will (mostly) talk about failure as a boolean event; i.e., there is a hard utility threshold at \(\alpha\). Of course, in most cases, there is not a hard threshold and it makes sense to talk about the tail probability \(\beta\) as a function of the threshold \(\alpha\), rather than a single value. Note that we look at the worst-case over inputs \(x\); that is, we aren’t in a statistical setting where inputs are random and we aren’t considering (non-private) statistical errors. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:tlap&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;To achieve \((\varepsilon,\delta)\)-DP with \(0 &amp;lt; \delta \le \frac{1}{2}\), we can use Laplace noise truncated to magnitude \(\tau = \frac{1+\log(1/2\delta)}{\varepsilon} = O(\log(1/\delta)/\varepsilon)\). Truncated Laplace noise is folklore [&lt;a href=&quot;https://arxiv.org/abs/1607.08554&quot; title=&quot;Fang Liu. Statistical Properties of Sanitized Results from Differentially Private Laplace Mechanism with Univariate Bounding Constraints. 2016.&quot;&gt;L16&lt;/a&gt;]; Holohan et al. [&lt;a href=&quot;https://arxiv.org/abs/1808.10410&quot; title=&quot;Naoise Holohan, Spiros Antonatos, Stefano Braghin, Pól Mac Aonghusa. The Bounded Laplace Mechanism in Differential Privacy. 2018.&quot;&gt;HABA18&lt;/a&gt;] give a sharp analysis. &lt;a href=&quot;#fnref:tlap&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:tlap:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Note that we can always just define \(\check{M}(x) = \mathsf{argmin}_y \ell(y,x)\) or we can re-run \(M\) until we achieve the desired loss. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Recall that the privacy failure probability should be tiny – e.g., \(\delta \le 10^{-6}\) – for the privacy guarantee to be compelling. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;For simplicity, we’re setting the privacy parameters of the truncated Laplace to be the same as for \(M\). In practice, this might be excessive and a different balance would work better. Also, some algorithms naturally output an estimate of their error and so this modification may not be necessary. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:alpha&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;If the utility threshold \(\alpha\) isn’t known (e.g., if it depends on the input \(x\)), then there are other methods that can be used [&lt;a href=&quot;https://arxiv.org/abs/1811.07971&quot; title=&quot;Jingcheng Liu, Kunal Talwar. Private Selection from Private Candidates. STOC 2019.&quot;&gt;LT19&lt;/a&gt;,&lt;a href=&quot;https://arxiv.org/abs/2110.03620&quot; title=&quot;Nicolas Papernot, Thomas Steinke. Hyperparameter Tuning with Renyi Differential Privacy. ICLR 2022.&quot;&gt;PS22&lt;/a&gt;,&lt;a href=&quot;https://arxiv.org/abs/2211.12063&quot; title=&quot;Edith Cohen, Xin Lyu, Jelani Nelson, Tam&amp;aacute;s Sarl&amp;oacute;s, Uri Stemmer. Generalized Private Selection and Testing with High Confidence. ITCS 2023.&quot;&gt;CLNSS23&lt;/a&gt;], but this is beyond the scope of this blog post. &lt;a href=&quot;#fnref:alpha&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:markov&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;To be precise, if we have \(\mathbb{E}[\ell(M(x),x)]\le \alpha_*\) and \(\ell(y,x)\ge0\), then Markov’s inequality gives \(\mathbb{P}[\ell(M(x),x)\le\alpha]\ge1-\frac{\alpha}{\alpha_*}\) for all \(\alpha\ge\alpha_*\). We can plug this bound into &lt;a href=&quot;#thm2&quot;&gt;Theorem 2&lt;/a&gt; to get a high-probability bound. &lt;a href=&quot;#fnref:markov&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <author>
        
            <name>Xin Lyu</name>
        
            <name>Thomas Steinke</name>
        
        </author>
        <pubDate>Sun, 09 Mar 2025 07:00:00 -0700</pubDate>
        <link>https://differentialprivacy.org/fail-prob/</link>
        <guid isPermaLink="true">https://differentialprivacy.org/fail-prob/</guid>
      </item>
    
      <item>
        <title>Tight RDP &amp; zCDP Bounds from Pure DP</title>
        <description>&lt;p&gt;There are multiple ways to quantify differential privacy, including pure DP [&lt;a href=&quot;https://journalprivacyconfidentiality.org/index.php/jpc/article/view/405&quot; title=&quot;Cynthia Dwork, Frank McSherry, Kobbi Nissim, Adam Smith. Calibrating Noise to Sensitivity in Private Data Analysis. 2006.&quot;&gt;DMNS06&lt;/a&gt;], approximate DP [&lt;a href=&quot;https://link.springer.com/chapter/10.1007/11761679_29&quot; title=&quot;Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, Moni Naor. Our Data, Ourselves: Privacy Via Distributed Noise Generation. 2006.&quot;&gt;DKMMN06&lt;/a&gt;], Concentrated DP [&lt;a href=&quot;https://arxiv.org/abs/1603.01887&quot; title=&quot;Cynthia Dwork, Guy N. Rothblum. Concentrated Differential Privacy. 2016.&quot;&gt;DR16&lt;/a&gt;,&lt;a href=&quot;https://arxiv.org/abs/1605.02065&quot; title=&quot;Mark Bun, Thomas Steinke. Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds. 2016.&quot;&gt;BS16&lt;/a&gt;], Rényi DP [&lt;a href=&quot;https://arxiv.org/abs/1702.07476&quot; title=&quot;Ilya Mironov. R&amp;eacute;nyi Differential Privacy. 2017.&quot;&gt;M17&lt;/a&gt;], Gaussian DP [&lt;a href=&quot;https://arxiv.org/abs/1905.02383&quot; title=&quot;Jinshuo Dong, Aaron Roth, Weijie J. Su. Gaussian Differential Privacy. 2019.&quot;&gt;DRS19&lt;/a&gt;], &amp;amp; function-DP [&lt;a href=&quot;https://arxiv.org/abs/1905.02383&quot; title=&quot;Jinshuo Dong, Aaron Roth, Weijie J. Su. Gaussian Differential Privacy. 2019.&quot;&gt;DRS19&lt;/a&gt;].
Fortunately, these definitions are similar enough that we can convert between most of them (with some loss in parameters).&lt;/p&gt;

&lt;p&gt;In this post, we consider converting from pure DP to Rényi DP and Concentrated DP. In particular, we will provide optimal results, which are an improvement on what is currently in the literature.
But first, let’s recap the relevant definitions.&lt;/p&gt;

&lt;h2 id=&quot;definitions-pure-dp-rényi-dp--zcdp&quot;&gt;Definitions: Pure DP, Rényi DP, &amp;amp; zCDP&lt;/h2&gt;

&lt;p&gt;For notational simplicity, we will assume the output space of the algorithms is discrete and that the algorithms’ output distributions have full support.&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Definition 1 (Pure DP):&lt;/strong&gt;
A randomized algorithm \(M : \mathcal{X}^n \to \mathcal{Y}\) satisfies \(\varepsilon\)-differential privacy if, for all pairs of inputs \(x, x’ \in \mathcal{X}^n\) differing only on the data of a single individual, we have \[\forall y \in \mathcal{Y} ~~~~~ \log\left(\frac{\mathbb{P}[M(x)=y]}{\mathbb{P}[M(x’)=y]}\right) \le \varepsilon.\]&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Pure DP is the simplest (and first) definition and is very convenient for analysis. 
Pure DP can also be called pointwise DP because the guarantee holds for all points \(y\), whereas all the other definitions either bound some quantity averaged over \(y\) or quantify over sets \(S \subseteq \mathcal{Y}\).&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Definition 2 (Rényi DP):&lt;/strong&gt;
A randomized algorithm \(M : \mathcal{X}^n \to \mathcal{Y}\) satisfies \((\alpha,\widehat\varepsilon)\)-Rényi differential privacy if, for all pairs of inputs \(x, x’ \in \mathcal{X}^n\) differing only on the data of a single individual, we have \[ \frac{1}{\alpha-1} \log \left( \underset{Y \gets M(x’)}{\mathbb{E}}\left[ \left( \frac{\mathbb{P}[M(x)=Y]}{\mathbb{P}[M(x’)=Y]} \right)^\alpha \right] \right) \le \widehat\varepsilon.\]&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Rényi DP is a more flexible definition than pure DP. But this flexibility comes at the cost of complexity.
The definition has two parameters, but we can usually trade one off against the other. Thus it is often better to think of it as being parameterized by a function \(\widehat\varepsilon(\alpha)\), which gives us an \((\alpha,\widehat\varepsilon(\alpha))\)-RDP bound for all \(\alpha&amp;gt;1\) simultaneously.
However, in many cases – such as the Gaussian mechanism – the function is linear, or can be bounded by a linear function.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Definition 3 (zero-Concentrated DP (zCDP)):&lt;/strong&gt;
A randomized algorithm \(M : \mathcal{X}^n \to \mathcal{Y}\) satisfies \(\rho\)-zCDP if, for all pairs of inputs \(x, x’ \in \mathcal{X}^n\) differing only on the data of a single individual, we have \[ \forall \alpha &amp;gt; 1 ~~~~~ \frac{1}{\alpha-1} \log \left( \underset{Y \gets M(x’)}{\mathbb{E}}\left[ \left( \frac{\mathbb{P}[M(x)=Y]}{\mathbb{P}[M(x’)=Y]} \right)^\alpha \right] \right) \le \alpha\rho.\]&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This definition is equivalent to satisfying \((\alpha,\rho\alpha)\)-RDP for all \(\alpha&amp;gt;1\); zCDP can be thought of as a single-parameter version of RDP, which gives us many of the benefits of RDP without the complexity.&lt;/p&gt;

&lt;h2 id=&quot;converting-pure-dp-to-rényi-dp&quot;&gt;Converting Pure DP to Rényi DP&lt;/h2&gt;

&lt;p&gt;It is immediate from the definitions that \(\varepsilon\)-DP implies \((\alpha,\varepsilon)\)-RDP for all \(\alpha&amp;gt;1\).&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;
This is just saying that the average value is at most the maximum value.
We can do better than this:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Theorem 4 (Pure DP to Rényi DP):&lt;/strong&gt;
Let \(M : \mathcal{X}^n \to \mathcal{Y}\) be a randomized algorithm satisfying \(\varepsilon\)-differential privacy.
Then \(M\) satisfies \((\alpha,\widehat\varepsilon(\alpha))\)-Rényi DP for all \(\alpha&amp;gt;1\), where
\[ \widehat\varepsilon(\alpha) = \frac{1}{\alpha-1} \log \left( \frac{1}{e^\varepsilon+1} e^{\alpha \varepsilon} +  \frac{e^\varepsilon}{e^\varepsilon+1} e^{-\alpha \varepsilon} \right) \]\[ = \varepsilon - \frac{1}{\alpha-1} \log \left( \frac{1+e^{-\varepsilon}}{1 + e^{-(2\alpha-1)\varepsilon}} \right). \]
Furthermore, this bound is tight.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Proof.&lt;/em&gt;&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;
Fix neighbouring inputs \(x, x’ \in \mathcal{X}^n\) and fix \(\alpha&amp;gt;1\).&lt;/p&gt;

&lt;p&gt;First note that this bound is tight when \(M\) corresponds to randomized response.
That is, if \(M(x) = \mathsf{Bernoulli}(\tfrac{e^\varepsilon}{e^\varepsilon+1})\) and \(M(x’) = \mathsf{Bernoulli}(\tfrac{1}{e^\varepsilon+1})\), then the expression in the theorem statement is simply the expression in the definition of Rényi DP. Since this is consistent with \(M\) satisfying \(\varepsilon\)-DP, this proves tightness of the result.
To prove the result it only remains to show that randomized response is indeed the worst case \(M\).&lt;/p&gt;

&lt;p&gt;We make two additional observations: 
(1) The definition of pure DP implies \( \frac{\mathbb{P}[M(x)=y]}{\mathbb{P}[M(x’)=y]} \le e^\varepsilon \) for all \(y \in \mathcal{Y}\).
But the definition of pure DP is symmetric in \(x\) and \(x’\), so we can swap them and obtain a two-sided bound: \[ \forall y \in \mathcal{Y} ~~~~~ e^{-\varepsilon} \le \frac{\mathbb{P}[M(x)=y]}{\mathbb{P}[M(x’)=y]} \le e^\varepsilon.\]
(2) Since \(\sum_y \mathbb{P}[M(x)=y] = 1\), we have \[ \underset{Y \gets M(x’)}{\mathbb{E}}\left[ \frac{\mathbb{P}[M(x)=Y]}{\mathbb{P}[M(x’)=Y]} \right] = \sum_y \mathbb{P}[M(x’)=y] \cdot \frac{\mathbb{P}[M(x)=y]}{\mathbb{P}[M(x’)=y]} = 1. \]&lt;/p&gt;

&lt;p&gt;Now we define a randomized rounding function \(A : [e^{-\varepsilon},e^\varepsilon] \to \{e^{-\varepsilon},e^\varepsilon\}\) by \(\mathbb{E}_A [A(z)] = z \).
That is, for all \( z \in [e^{-\varepsilon},e^\varepsilon] \), we have \[\underset{A}{\mathbb{P}}[A(z)=e^\varepsilon]=\frac{z-e^{-\varepsilon}}{e^\varepsilon-e^{-\varepsilon}} ~~~ \text{ and } ~~~ \underset{A}{\mathbb{P}}[A(z)=e^{-\varepsilon}]=\frac{e^\varepsilon-z}{e^\varepsilon-e^{-\varepsilon}}.\]
Since \( v \mapsto v^\alpha \) is convex, by Jensen’s inequality, for all \( z \in [e^{-\varepsilon},e^\varepsilon] \), we have \[z^\alpha = \mathbb{E}_A[A(z)]^\alpha \le \mathbb{E}_A[A(z)^\alpha] = \frac{z-e^{-\varepsilon}}{e^\varepsilon-e^{-\varepsilon}} \cdot e^{\varepsilon\alpha} + \frac{e^\varepsilon-z}{e^\varepsilon-e^{-\varepsilon}} e^{-\alpha\varepsilon}. \]
Applying this inequality to the quantity of interest with \(z = \frac{\mathbb{P}[M(x)=Y]}{\mathbb{P}[M(x’)=Y]} \), we get
\[ \underset{Y \gets M(x’)}{\mathbb{E}}\left[ \left( \frac{\mathbb{P}[M(x)=Y]}{\mathbb{P}[M(x’)=Y]} \right)^\alpha \right] \le \underset{Y \gets M(x’) }{\mathbb{E}}\left[ \frac{\frac{\mathbb{P}[M(x)=Y]}{\mathbb{P}[M(x’)=Y]}-e^{-\varepsilon}}{e^\varepsilon-e^{-\varepsilon}} \cdot e^{\varepsilon\alpha} + \frac{e^\varepsilon-\frac{\mathbb{P}[M(x)=Y]}{\mathbb{P}[M(x’)=Y]}}{e^\varepsilon-e^{-\varepsilon}} e^{-\alpha\varepsilon} \right] .\]
Observation 1 tells us that this is valid, since \(z \in [e^{-\varepsilon},e^\varepsilon]\). Observation 2 and linearity of expectation give
\[ \underset{Y \gets M(x’) }{\mathbb{E}}\left[ \frac{\frac{\mathbb{P}[M(x)=Y]}{\mathbb{P}[M(x’)=Y]}-e^{-\varepsilon}}{e^\varepsilon-e^{-\varepsilon}} \cdot e^{\varepsilon\alpha} + \frac{e^\varepsilon-\frac{\mathbb{P}[M(x)=Y]}{\mathbb{P}[M(x’)=Y]}}{e^\varepsilon-e^{-\varepsilon}} e^{-\alpha\varepsilon} \right] = \frac{1-e^{-\varepsilon}}{e^\varepsilon-e^{-\varepsilon}} \cdot e^{\varepsilon\alpha} + \frac{e^\varepsilon-1}{e^\varepsilon-e^{-\varepsilon}} e^{-\alpha\varepsilon}.\] 
We have \(\frac{1-e^{-\varepsilon}}{e^\varepsilon-e^{-\varepsilon}} = \frac{e^\varepsilon-1}{e^{2\varepsilon}-1} = \frac{e^\varepsilon-1}{(e^\varepsilon-1)(e^\varepsilon+1)} = \frac{1}{e^\varepsilon+1}\) and, similarly, \(\frac{e^\varepsilon-1}{e^\varepsilon-e^{-\varepsilon}} = \frac{e^\varepsilon}{e^\varepsilon+1}\).
Combining the equalities and inequalities gives \[ e^{(\alpha-1)\widehat\varepsilon(\alpha)} = \underset{Y \gets M(x’)}{\mathbb{E}}\left[ \left( \frac{\mathbb{P}[M(x)=Y]}{\mathbb{P}[M(x’)=Y]} \right)^\alpha \right] \le \frac{1}{e^\varepsilon+1} e^{\alpha\varepsilon} + \frac{e^\varepsilon}{e^\varepsilon+1} e^{-\alpha\varepsilon},\] which establishes the result.
The equivalence of the two expressions in the theorem statement is a matter of algebraic manipulation; the second expression is more suitable for numerical computation.
∎&lt;/p&gt;
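&lt;p&gt;As a sanity check (an illustrative script, not part of the original post), we can numerically confirm that the two expressions for \(\widehat\varepsilon(\alpha)\) in Theorem 4 agree, and that randomized response attains the bound exactly:&lt;/p&gt;

```python
import math

def eps_hat(eps, alpha):
    # First form of the Theorem 4 bound.
    return math.log((math.exp(alpha * eps) + math.exp(eps - alpha * eps))
                    / (math.exp(eps) + 1)) / (alpha - 1)

def eps_hat_stable(eps, alpha):
    # Second form, better suited to numerical computation.
    return eps - math.log((1 + math.exp(-eps))
                          / (1 + math.exp(-(2 * alpha - 1) * eps))) / (alpha - 1)

def renyi_rr(eps, alpha):
    # Order-alpha Renyi divergence between the two randomized-response outputs
    # Bernoulli(e^eps / (e^eps + 1)) and Bernoulli(1 / (e^eps + 1)).
    p = math.exp(eps) / (math.exp(eps) + 1)
    q = 1 / (math.exp(eps) + 1)
    return math.log(q * (p / q) ** alpha
                    + (1 - q) * ((1 - p) / (1 - q)) ** alpha) / (alpha - 1)
```

&lt;p&gt;For every tested \(\varepsilon\) and \(\alpha\), the three quantities coincide and sit strictly below the trivial bound \(\widehat\varepsilon(\alpha)\le\varepsilon\).&lt;/p&gt;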

&lt;h2 id=&quot;converting-pure-dp-to-zcdp&quot;&gt;Converting Pure DP to zCDP&lt;/h2&gt;

&lt;p&gt;The RDP bound in Theorem 4 is tight, but a bit unwieldy. Now we look at zCDP bounds, which are looser but simpler.
The trivial bound gives that \(\varepsilon\)-DP implies \(\varepsilon\)-zCDP.
In &lt;a href=&quot;/exponential-mechanism-bounded-range&quot;&gt;a previous post&lt;/a&gt; we proved that \(\varepsilon\)-DP implies \(\frac12\varepsilon^2\)-zCDP.&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;
Now we prove a tight bound:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Theorem 5 (Pure DP to zCDP):&lt;/strong&gt;
Let \(M : \mathcal{X}^n \to \mathcal{Y}\) be a randomized algorithm satisfying \(\varepsilon\)-differential privacy.
Then \(M\) satisfies \(\rho\)-zCDP, where
\[ \rho = \frac{e^\varepsilon-1}{e^\varepsilon+1} \varepsilon \le \frac12 \varepsilon^2. \]
Furthermore, this bound is tight.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To prove this result, we use the following inequality, which is a tighter version of &lt;a href=&quot;https://en.wikipedia.org/wiki/Hoeffding%27s_lemma&quot;&gt;Hoeffding’s lemma&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Proposition 6 (Kearns-Saul inequality [&lt;a href=&quot;https://arxiv.org/abs/1301.7392&quot; title=&quot;Michael Kearns, Lawrence Saul. Large Deviation Methods for Approximate Probabilistic Inference. 2013.&quot;&gt;KS13&lt;/a&gt;,&lt;a href=&quot;https://doi.org/10.1214/ECP.v18-2359&quot; title=&quot;Daniel Berend, Aryeh Kontorovich. On the concentration of the missing mass. 2013.&quot;&gt;BK13&lt;/a&gt;,&lt;a href=&quot;https://arxiv.org/abs/1901.09188&quot; title=&quot;Julyan Arbel, Olivier Marchal, Hien D. Nguyen. On strict sub-Gaussianity, optimal proxy variance and symmetry for bounded random variables. 2019.&quot;&gt;AMN19&lt;/a&gt;]):&lt;/strong&gt;
For all \(p \in [0,1]\) and all \(t\in\mathbb{R}\), we have \[1-p + p \cdot e^t \le \exp\left(t \cdot p + t^2 \cdot \frac{1-2p}{4\log((1-p)/p)}\right).\]&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Proof of Theorem 5.&lt;/em&gt;
By Theorem 4, \(M\) satisfies \((\alpha,\widehat\varepsilon(\alpha))\)-Rényi DP for all \(\alpha&amp;gt;1\), where \[ e^{(\alpha-1)\widehat\varepsilon(\alpha)} = \frac{1}{e^\varepsilon+1} e^{\alpha \varepsilon} +  \frac{e^\varepsilon}{e^\varepsilon+1} e^{-\alpha \varepsilon} .\]
We need to show \(\widehat\varepsilon(\alpha) \le \rho\alpha\) for all \(\alpha&amp;gt;1\). Fix \(\alpha&amp;gt;1\).&lt;/p&gt;

&lt;p&gt;Let \(p = \tfrac{1}{e^\varepsilon+1}\). Then
\[ \frac{1}{e^\varepsilon+1} e^{\alpha \varepsilon} +  \frac{e^\varepsilon}{e^\varepsilon+1} e^{-\alpha \varepsilon} = e^{-\alpha\varepsilon} \cdot \left( 1-p + p e^{2\alpha\varepsilon} \right) .\]
By the Kearns-Saul inequality, \[  e^{-\alpha\varepsilon} \cdot \left( 1-p + p e^{2\alpha\varepsilon} \right) \le \exp\left((2p-1)\alpha\varepsilon + ( 2 \alpha \varepsilon)^2 \cdot \frac{1-2p}{4\log((1-p)/p)}\right) .\]
Since \(2p-1 = - \tfrac{e^\varepsilon-1}{e^\varepsilon + 1}\) and \( \frac{1-p}{p} = e^\varepsilon \), this simplifies to \[ \exp\left((2p-1)\alpha\varepsilon + ( 2 \alpha \varepsilon)^2 \cdot \frac{1-2p}{4\log((1-p)/p)}\right) = \exp\left( -\alpha\varepsilon\frac{e^\varepsilon-1}{e^\varepsilon+1} + 4 \alpha^2 \varepsilon^2 \frac{\frac{e^\varepsilon-1}{e^\varepsilon+1}}{4\varepsilon} \right) = \exp\left( (\alpha-1) \alpha \varepsilon \frac{e^\varepsilon-1}{e^\varepsilon+1} \right). \]
Combining the inequalities yields \( \widehat\varepsilon(\alpha) \le \alpha \varepsilon \frac{e^\varepsilon-1}{e^\varepsilon+1} \), which gives the result.&lt;/p&gt;

&lt;p&gt;Tightness is witnessed by randomized response and by taking the limit \(\alpha \to 1\).
∎&lt;/p&gt;
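&lt;p&gt;Again as an illustrative sanity check (not from the original post), we can numerically verify that \(\rho = \varepsilon\frac{e^\varepsilon-1}{e^\varepsilon+1}\) is dominated by both the trivial bound \(\varepsilon\) and the quadratic bound \(\frac12\varepsilon^2\), and that \(\alpha\rho\) dominates the tight Rényi DP bound of Theorem 4:&lt;/p&gt;

```python
import math

def rho(eps):
    # Tight zCDP parameter from Theorem 5; equals eps * tanh(eps / 2).
    return eps * (math.exp(eps) - 1) / (math.exp(eps) + 1)

def eps_hat(eps, alpha):
    # Tight pure-DP-to-RDP bound from Theorem 4, for comparison.
    return math.log((math.exp(alpha * eps) + math.exp(eps - alpha * eps))
                    / (math.exp(eps) + 1)) / (alpha - 1)
```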

&lt;h2 id=&quot;numerical-comparison&quot;&gt;Numerical Comparison&lt;/h2&gt;

&lt;p&gt;Let’s see what these improved bounds look like:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/pdp2zcdp-purerenyi.png&quot; width=&quot;700&quot; alt=&quot;Plot showing the bound from Theorem 4 compared to the trivial bound and the bound implied by Theorem 5 for epsilon=0.5,1,2.&quot; style=&quot;margin:auto;display: block;&quot; /&gt;
This first plot compares the tight Rényi DP bound from Theorem 4 (solid line) with the trivial bound (\(\widehat\varepsilon(\alpha)\le\varepsilon\), dotted line) and the bound implied by zCDP (\(\widehat\varepsilon(\alpha)\le\alpha\rho\), dashed line) via Theorem 5. We consider \(\varepsilon=\frac12\) (&lt;font color=&quot;red&quot;&gt;red&lt;/font&gt; lines, bottom), \(\varepsilon=1\) (&lt;font color=&quot;green&quot;&gt;green&lt;/font&gt; lines, middle), and \(\varepsilon=2\) (&lt;font color=&quot;blue&quot;&gt;blue&lt;/font&gt; lines, top).&lt;/p&gt;

&lt;p&gt;We see that the trivial bound is tight as the Rényi order \(\alpha\) becomes large, while the zCDP bound is tight for small Rényi orders (i.e., \(\alpha\to1\)).
The smaller \(\varepsilon\) is, the later this transition occurs.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/pdp2zcdp-purezcdp.png&quot; width=&quot;700&quot; alt=&quot;Plot showing the bound from Theorem 5 compared to rho=epsilon^2/2 and rho=epsilon.&quot; style=&quot;margin:auto;display: block;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This second plot compares the tight zCDP bound from Theorem 5 (solid &lt;font color=&quot;magenta&quot;&gt;magenta&lt;/font&gt; line) against the trivial bound (dotted &lt;font color=&quot;yellow&quot;&gt;yellow&lt;/font&gt; line) and the quadratic bound (dashed &lt;font color=&quot;cyan&quot;&gt;cyan&lt;/font&gt; line).&lt;/p&gt;

&lt;p&gt;We see that, for small values of \(\varepsilon\), the quadratic bound is tight, while for large values of \(\varepsilon\), the trivial bound is tight.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;In this post, we have given improved bounds for converting from pure DP to Rényi DP and zCDP.
Numerically, these bounds are a modest improvement over the standard bounds.&lt;/p&gt;

&lt;p&gt;The bounds are tight when the algorithm corresponds to randomized response. 
However, in many cases we can prove better bounds for specific algorithms.
For example, in &lt;a href=&quot;/exponential-mechanism-bounded-range&quot;&gt;a previous post&lt;/a&gt;, we proved better zCDP bounds for the exponential mechanism.&lt;/p&gt;

&lt;p&gt;Another popular pure DP mechanism is Laplace noise addition. Mironov [&lt;a href=&quot;https://arxiv.org/abs/1702.07476&quot; title=&quot;Ilya Mironov. R&amp;eacute;nyi Differential Privacy. 2017.&quot;&gt;M17&lt;/a&gt;, Proposition 6] computed a tight Rényi DP bound specifically for the Laplace mechanism:
Adding Laplace noise with scale \(1/\varepsilon\) to a sensitivity-1 function guarantees \(\varepsilon\)-DP and also \((\alpha,\widehat\varepsilon_{\text{Lap}}(\alpha))\)-RDP for all \(\alpha&amp;gt;1\) and \[\widehat\varepsilon_{\text{Lap}}(\alpha) = \frac{1}{\alpha-1}\log\left( \frac{\alpha}{2\alpha-1} e^{(\alpha-1)\varepsilon} + \frac{\alpha-1}{2\alpha-1} e^{-\alpha\varepsilon} \right).\]&lt;/p&gt;
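&lt;p&gt;This closed-form bound is straightforward to evaluate numerically. A minimal Python sketch (function name ours) implementing the expression above, with a sanity check that it never exceeds the trivial bound \(\widehat\varepsilon(\alpha)\le\varepsilon\):&lt;/p&gt;

```python
import math

def renyi_dp_laplace(alpha, eps):
    """Tight RDP bound for the Laplace mechanism [M17, Proposition 6].

    Valid for Renyi order alpha > 1, for a sensitivity-1 function with
    Laplace noise of scale 1/eps.
    """
    a = alpha
    return (1.0 / (a - 1.0)) * math.log(
        a / (2 * a - 1) * math.exp((a - 1) * eps)
        + (a - 1) / (2 * a - 1) * math.exp(-a * eps)
    )

# The bound is always at most the pure-DP level eps, approaching it
# as alpha grows.
assert all(renyi_dp_laplace(a, 1.0) <= 1.0 for a in (1.5, 2.0, 5.0, 50.0))
```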

&lt;h3 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;/h3&gt;

&lt;p&gt;Thanks to Damien Desfontaines for prompting this post.
To the best of my knowledge this improved conversion first appeared in &lt;a href=&quot;https://x.com/yuxiangw_cs/status/1565765508950999041&quot;&gt;a Tweet by Yu-Xiang Wang&lt;/a&gt;.&lt;/p&gt;

&lt;hr /&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;In general, we can replace \(\frac{\mathbb{P}[M(x)=y]}{\mathbb{P}[M(x’)=y]}\) with the Radon-Nikodym derivative of the probability distribution given by \(M(x)\) with respect to the probability distribution given by \(M(x’)\) evaluated at \(y\). If the output distributions do not have full support, we must handle division by zero; to do this we take \(\frac{0}{0} = 1\) and \(\frac{\eta}{0} = \infty\) for \(\eta&amp;gt;0\). &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;To be more precise, we have \[\underset{Y \gets M(x’)}{\mathbb{E}}\left[ \left( \frac{\mathbb{P}[M(x)=Y]}{\mathbb{P}[M(x’)=Y]} \right)^\alpha \right] \le \underset{Y \gets M(x’)}{\mathbb{E}}\left[ \frac{\mathbb{P}[M(x)=Y]}{\mathbb{P}[M(x’)=Y]}  \right] \cdot \max_y \left( \frac{\mathbb{P}[M(x)=y]}{\mathbb{P}[M(x’)=y]} \right)^{\alpha-1} \le 1 \cdot \left( e^\varepsilon \right)^{\alpha-1},\] which yields the trivial conversion. Here we use Observation 2 from the proof of Theorem 4. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This proof technique is due to Bun &amp;amp; Steinke [&lt;a href=&quot;https://arxiv.org/abs/1605.02065&quot; title=&quot;Mark Bun, Thomas Steinke. Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds. 2016.&quot;&gt;BS16&lt;/a&gt;, Proposition 3.3]. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Bun &amp;amp; Steinke [&lt;a href=&quot;https://arxiv.org/abs/1605.02065&quot; title=&quot;Mark Bun, Thomas Steinke. Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds. 2016.&quot;&gt;BS16&lt;/a&gt;, Proposition 3.3] first established this bound, although with a more involved proof. Earlier papers [&lt;a href=&quot;https://guyrothblum.wordpress.com/wp-content/uploads/2014/11/drv10.pdf&quot; title=&quot;Cynthia Dwork, Guy N. Rothblum, Salil Vadhan. Boosting and Differential Privacy. 2010.&quot;&gt;DRV10&lt;/a&gt;,&lt;a href=&quot;https://arxiv.org/abs/1603.01887&quot; title=&quot;Cynthia Dwork, Guy N. Rothblum. Concentrated Differential Privacy. 2016.&quot;&gt;DR16&lt;/a&gt;] proved slightly weaker bounds. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <author>
        
            <name>Thomas Steinke</name>
        
        </author>
        <pubDate>Mon, 27 May 2024 10:00:00 -0700</pubDate>
        <link>https://differentialprivacy.org/pdp-to-zcdp/</link>
        <guid isPermaLink="true">https://differentialprivacy.org/pdp-to-zcdp/</guid>
      </item>
    
      <item>
        <title>NeurIPS 2023 Outstanding Paper&amp;#58; Privacy auditing in just one run</title>
        <description>&lt;p&gt;NeurIPS 2023 just wrapped up, and one of the two &lt;a href=&quot;https://blog.neurips.cc/2023/12/11/announcing-the-neurips-2023-paper-awards/&quot;&gt;outstanding paper awards&lt;/a&gt; went to &lt;a href=&quot;https://arxiv.org/abs/2305.08846&quot;&gt;Privacy Auditing with One (1) Training Run&lt;/a&gt;, by &lt;a href=&quot;http://www.thomas-steinke.net/&quot;&gt;Thomas Steinke&lt;/a&gt;, &lt;a href=&quot;https://scholar.google.com/citations?user=k6-nvDAAAAAJ&quot;&gt;Milad Nasr&lt;/a&gt;, and &lt;a href=&quot;https://jagielski.github.io/&quot;&gt;Matthew Jagielski&lt;/a&gt;. 
The main result of this paper is a method for auditing the (differential) privacy guarantees of an algorithm that is much faster and more practical than previous methods. 
In this post, we’ll dive into what this all means.&lt;/p&gt;

&lt;p&gt;In case you’re new to this: by now, it has been well established that ML models can leak information about their training data.
This has recently been demonstrated in a spectacular fashion for &lt;a href=&quot;https://arxiv.org/abs/2012.07805&quot;&gt;large language models&lt;/a&gt; and &lt;a href=&quot;https://arxiv.org/abs/2301.13188&quot;&gt;diffusion models&lt;/a&gt;, showing that these models are prone to &lt;em&gt;regurgitating&lt;/em&gt; elements from their training dataset verbatim. 
Beyond these models, training data leakage can occur to a variety of degrees in &lt;a href=&quot;http://www.gautamkamath.com/CS860notes/lec1.pdf&quot;&gt;other statistical settings&lt;/a&gt;. 
This can of course be problematic if the training data contains sensitive personal information that we do not wish to disclose. 
It may also be relevant to other adjacent considerations, including copyright infringement, which we don’t delve into here.&lt;/p&gt;

&lt;p&gt;While there have been a number of heuristic proposals for how to deal with such problems, only one method has stood the test of time: differential privacy (DP).
Roughly speaking, an algorithm (e.g., a model’s training procedure) is differentially private if its output has limited dependence (in some precise sense) on any single datapoint. 
This has many convenient implications: if a training procedure is differentially private, the resulting model is very unlikely to spit out training data, it is hard to predict whether a particular datapoint was in its training dataset, etc.
This strong notion of privacy has been adopted by a number of organizations, including &lt;a href=&quot;https://arxiv.org/abs/2305.18465&quot;&gt;Google&lt;/a&gt;, &lt;a href=&quot;https://arxiv.org/abs/1712.01524&quot;&gt;Microsoft&lt;/a&gt;, and the US Census Bureau in the &lt;a href=&quot;https://arxiv.org/abs/2204.08986&quot;&gt;2020 US Census&lt;/a&gt;.
Differential privacy is a quantitative guarantee, parameterized by a value \(\varepsilon \geq 0\): the smaller \(\varepsilon\) is, the stronger the privacy protection (albeit at the cost of utility).&lt;/p&gt;

&lt;p&gt;In order to say an algorithm is differentially private, we have to &lt;em&gt;prove&lt;/em&gt; it.
By analyzing the algorithm, we obtain an &lt;em&gt;upper bound&lt;/em&gt; on the value of \(\varepsilon\), i.e., a guarantee that the algorithm satisfies &lt;em&gt;at least&lt;/em&gt; some prescribed level of privacy. 
And we can be confident in this guarantee without running a single line of code!
A rich &lt;a href=&quot;https://arxiv.org/abs/1607.00133&quot;&gt;line&lt;/a&gt; &lt;a href=&quot;https://arxiv.org/abs/1908.10530&quot;&gt;of&lt;/a&gt; &lt;a href=&quot;https://arxiv.org/abs/2106.02848&quot;&gt;work&lt;/a&gt; studies a differentially private analogue of stochastic gradient descent (which includes per-example gradient clipping followed by Gaussian noise addition), providing tighter and tighter upper bounds on the value of \(\varepsilon\).&lt;/p&gt;
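&lt;p&gt;To make this concrete, here is a minimal sketch (our own simplification, not any paper’s exact implementation) of the clip-and-noise primitive at the heart of DP-SGD, in plain Python on toy per-example gradients:&lt;/p&gt;

```python
import math
import random

def dp_sgd_step(per_example_grads, clip_norm, noise_multiplier):
    """One DP-SGD gradient step: clip each per-example gradient to L2
    norm clip_norm, average, then add Gaussian noise whose standard
    deviation is noise_multiplier * clip_norm / batch_size."""
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    summed = [0.0] * dim
    for g in per_example_grads:
        norm = math.sqrt(sum(x * x for x in g))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i in range(dim):
            summed[i] += g[i] * scale
    sigma = noise_multiplier * clip_norm / n
    return [s / n + random.gauss(0.0, sigma) for s in summed]
```

The noise scale is calibrated to the clipping norm: clipping bounds each example’s contribution, which is what makes the Gaussian noise sufficient for a privacy guarantee.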

&lt;p&gt;Is there any way to empirically &lt;em&gt;audit&lt;/em&gt; the privacy of an algorithm?
Given a purportedly private procedure, is there an algorithm we can run to &lt;em&gt;lower bound&lt;/em&gt; the value of \(\varepsilon\)? 
Such an audit would establish that the procedure enjoys privacy no better than some particular level. 
There are many reasons one might want to audit an algorithm’s privacy guarantees:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;We can see if our privacy proof is tight: if we prove and audit matching values of \(\varepsilon\), then we know that neither can be improved.&lt;/li&gt;
  &lt;li&gt;We can see if our privacy proof is &lt;em&gt;wrong&lt;/em&gt;: if we audit a value of \(\varepsilon\) that is &lt;em&gt;greater&lt;/em&gt; than the value we prove, then we know there was a bug in our privacy proof.&lt;/li&gt;
  &lt;li&gt;If we’re unable to rigorously prove an algorithm is private, auditing gives some heuristic measure of how private the algorithm is (though this is not considered best practice in settings where privacy is paramount: auditing only lower bounds \(\varepsilon\), the true value may be much higher).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is a &lt;a href=&quot;https://arxiv.org/abs/1902.08874&quot;&gt;long&lt;/a&gt; &lt;a href=&quot;https://arxiv.org/abs/2006.07709&quot;&gt;line&lt;/a&gt; &lt;a href=&quot;https://arxiv.org/abs/2101.04535&quot;&gt;of&lt;/a&gt; &lt;a href=&quot;https://arxiv.org/abs/2302.07956&quot;&gt;work&lt;/a&gt; on this question from the perspective of &lt;em&gt;membership inference attacks&lt;/em&gt;.
In a membership inference attack, we consider training a model on either a) some training dataset, or b) the same training dataset but with the inclusion of one extra datapoint (sometimes called a &lt;em&gt;canary&lt;/em&gt;).
If we can correctly guess whether the canary was or was not in the training set, then we say the membership inference attack was successful. 
However, recall that differential privacy limits the dependence on individual datapoints: if an algorithm is private, it means that membership inference attacks should not be very successful.
Conversely, if an attack &lt;em&gt;is&lt;/em&gt; very successful, then the algorithm is quantitatively &lt;em&gt;not&lt;/em&gt; so private.
In other words, membership inference attacks serve as a way of auditing the privacy of the algorithm.&lt;/p&gt;

&lt;p&gt;An important technical point is that differential privacy is a &lt;em&gt;probabilistic&lt;/em&gt; guarantee.
A single membership inference attack success or failure may happen by chance: in order to make conclusions about the privacy level of a procedure, we need to run the attack several times in order to estimate the &lt;em&gt;rate&lt;/em&gt; of success. 
Since for machine learning models, each attack corresponds to one training run, this can quickly result in prohibitive overheads. 
As one extreme example, &lt;a href=&quot;https://arxiv.org/abs/2202.12219&quot;&gt;one work&lt;/a&gt; trains 250,000 models to audit a proposed private training algorithm, revealing a bug in its privacy proof. 
While these are small models (CNNs trained on MNIST), and the authors admit their auditing was overkill (they &lt;em&gt;only&lt;/em&gt; needed to train 1,000 models), in modern settings, even a &lt;em&gt;single&lt;/em&gt; extra training run is prohibitively expensive, thus rendering such privacy auditing methods impractical.&lt;/p&gt;
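&lt;p&gt;To give a flavor of the statistics involved, here is a simplified sketch (our own, for pure \(\varepsilon\)-DP only) of turning membership-inference outcomes over many runs into an \(\varepsilon\) lower bound: an \(\varepsilon\)-DP mechanism caps the attacker’s accuracy in a balanced guessing game at \(e^\varepsilon/(1+e^\varepsilon)\), so a confidently-estimated accuracy \(p\) above \(\frac12\) certifies \(\varepsilon\ge\log\frac{p}{1-p}\).&lt;/p&gt;

```python
import math

def audit_epsilon_lower_bound(correct, trials, beta=0.05):
    """Epsilon lower bound from a balanced membership-inference game.

    Shrinks the empirical accuracy by a Hoeffding confidence margin
    (failure probability beta), then inverts the accuracy cap
    e^eps / (1 + e^eps) that pure eps-DP imposes on the attacker.
    """
    acc = correct / trials
    p = acc - math.sqrt(math.log(1 / beta) / (2 * trials))
    if p <= 0.5:
        return 0.0  # no better-than-chance evidence of leakage
    return math.log(p / (1 - p))
```

Note how the confidence margin shrinks with the number of trials: this is exactly why naive auditing needs so many training runs to certify a meaningful bound.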

&lt;p&gt;Here’s where the work of Steinke, Nasr, and Jagielski comes in: it performs privacy auditing with just one (1) training run.
This could even be the same as your actual training run, thus incurring minimal overhead with respect to the standard training pipeline. 
Their method does this by randomly inserting &lt;em&gt;multiple&lt;/em&gt; canaries into the dataset rather than just a single one, and privacy is audited by trying to guess which canaries were and were not trained on. 
If one can correctly guess the status of many canaries, this implies that the procedure is not very private. 
The analysis of this framework is the tricky part, and gets quite technical.
While textbook analysis of the addition/removal of multiple canaries would rely on a property of differential privacy known as “group privacy,” this turns out to be lossy.
Instead, the authors appeal to connections between differential privacy and generalization: they show that if you add multiple canaries i.i.d. for a single run, this behaves similarly to having multiple runs each with a single canary.&lt;/p&gt;

&lt;p&gt;In short, this work is a breakthrough in privacy auditing. 
It allows us to substantially reduce the computational overhead, from prohibitive to essentially negligible. 
Up to this point, privacy auditing has mostly been employed by those with a surplus of compute: I’m excited to see how this work will make it more accessible to the GPU-poor.
Congratulations to Thomas, Milad, and Matthew on their fantastic result!&lt;/p&gt;

</description>
        <author>
        
            <name>Gautam Kamath</name>
        
        </author>
        <pubDate>Tue, 02 Jan 2024 12:00:00 -0400</pubDate>
        <link>https://differentialprivacy.org/neurips23-op/</link>
        <guid isPermaLink="true">https://differentialprivacy.org/neurips23-op/</guid>
      </item>
    
      <item>
        <title>Open problem(s) - How generic can composition results be?</title>
        <description>&lt;p&gt;The composition theorem is a cornerstone of differential privacy literature. 
In its most basic formulation, it states that if two mechanisms \(\mathcal{M}_1\) and \(\mathcal{M}_2\) are respectively \(\varepsilon_1\)-DP and \(\varepsilon_2\)-DP, then the mechanism \(\mathcal{M}\) defined by \(\mathcal{M}(D)=\left(\mathcal{M}_1(D),\mathcal{M}_2(D)\right)\) is \((\varepsilon_1+\varepsilon_2)\)-DP.
A large body of work has focused on proving extensions of this composition theorem.
These extensions are of two kinds.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Some composition results apply to different &lt;em&gt;settings&lt;/em&gt; than fixed mechanisms.&lt;/li&gt;
  &lt;li&gt;Others extend known results to &lt;em&gt;variants&lt;/em&gt; of differential privacy.&lt;/li&gt;
&lt;/ul&gt;
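&lt;p&gt;In code, the basic composition theorem is nothing more than an additive budget accountant; a minimal sketch (class name ours):&lt;/p&gt;

```python
class BasicAccountant:
    """Sequential composition for pure DP: the epsilons simply add up."""

    def __init__(self):
        self.spent = 0.0

    def compose(self, epsilon):
        """Record an epsilon-DP mechanism; return the total so far."""
        self.spent += epsilon
        return self.spent

acct = BasicAccountant()
acct.compose(0.5)          # M1 is 0.5-DP
total = acct.compose(1.0)  # composing a 1-DP mechanism: 1.5-DP overall
```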

&lt;p&gt;In this blog post, we review existing results, and outline natural open questions appearing on both fronts.
We stumbled upon these open questions while building general-purpose differential privacy infrastructure, and we believe that solving them could have a positive impact on the usability and privacy/accuracy trade-offs provided by such tools.&lt;/p&gt;

&lt;h3 id=&quot;different-settings-for-composition&quot;&gt;Different settings for composition&lt;/h3&gt;

&lt;p&gt;First, let’s discuss what it means to compose two DP mechanisms.&lt;/p&gt;

&lt;h4 id=&quot;sequential-composition&quot;&gt;Sequential composition&lt;/h4&gt;

&lt;p&gt;In the original composition result [&lt;a href=&quot;https://link.springer.com/chapter/10.1007/11681878_14&quot;&gt;DMNS06&lt;/a&gt;], all mechanisms \(\mathcal{M}_1\), \(\mathcal{M}_2\), etc., are fixed in advance, and have a predetermined privacy budget (resp. \(\varepsilon_1\), \(\varepsilon_2\), etc.).
They only take the sensitive data \(D\) as input: \(\mathcal{M}_2\) can neither see nor depend on \(\mathcal{M}_1(D)\).
This setting is typically called &lt;em&gt;sequential composition&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../images/sequential-composition.svg&quot; width=&quot;80%&quot; alt=&quot;A diagram representing sequential composition. A database icon is on the left. Arrows go from it to three boxes labeled M1, M2, and M3, each labeled with ε1, ε2, ε3; these ε values are labeled &apos;fixed budgets&apos;.&quot; style=&quot;margin:auto;display: block;&quot; /&gt;&lt;/p&gt;

&lt;h4 id=&quot;adaptive-composition&quot;&gt;Adaptive composition&lt;/h4&gt;

&lt;p&gt;Shortly afterwards, the result was extended to a setting called &lt;em&gt;adaptive composition&lt;/em&gt; [&lt;a href=&quot;https://link.springer.com/chapter/10.1007/11761679_29&quot;&gt;DKMMN06&lt;/a&gt;].
In this context, each mechanism can access the outputs of previous mechanisms: for example, \(\mathcal{M}_2\) takes as input not only the sensitive data \(D\), but also \(\mathcal{M}_1(D)\).
However, the privacy budget associated with each mechanism is still fixed.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../images/adaptive-composition.svg&quot; width=&quot;80%&quot; alt=&quot;A diagram representing adaptive composition. It&apos;s the same diagram as sequential composition, except there are arrows going from M1 to M2, and from M2 to M3.&quot; style=&quot;margin:auto;display: block;&quot; /&gt;&lt;/p&gt;

&lt;h4 id=&quot;fully-adaptive-composition&quot;&gt;Fully adaptive composition&lt;/h4&gt;

&lt;p&gt;A natural extension of adaptive composition is to allow the privacy budget of each mechanism to depend on previous outputs.
This setting is called &lt;em&gt;fully adaptive composition&lt;/em&gt; [&lt;a href=&quot;https://proceedings.neurips.cc/paper_files/paper/2016/hash/58c54802a9fb9526cd0923353a34a7ae-Abstract.html&quot;&gt;RRUV16&lt;/a&gt;].
It captures a setting in which a single analyst is interacting with a DP interface, and can change which queries to run and their budget based on past results.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../images/fully-adaptive-composition.svg&quot; width=&quot;80%&quot; alt=&quot;A diagram representing fully adaptive composition. It&apos;s the same diagram as adaptive composition, except the &apos;fixed budgets&apos; label is gone, and there are arrows going from M1 to ε2, and from M2 to ε3.&quot; style=&quot;margin:auto;display: block;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Composition theorems in the fully adaptive setting are of two types.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Privacy filters&lt;/em&gt; assume that the DP interface has a fixed, total budget, and will refuse to answer queries once that budget is exhausted.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Privacy odometers&lt;/em&gt;, by contrast, allow the analyst to run arbitrarily many queries using as much budget as they want, and quantify the privacy loss over time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Somewhat surprisingly, there are separation results between the two types: one can obtain tighter composition theorems with privacy filters than with privacy odometers.&lt;/p&gt;
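&lt;p&gt;For pure DP, a privacy filter can be sketched in a few lines, since the \(\varepsilon\)s compose additively; the hard part addressed in the literature is doing this tightly under advanced composition, which this toy version (names ours) ignores:&lt;/p&gt;

```python
class PrivacyFilter:
    """A pure-DP privacy filter: answers queries until the fixed total
    budget would be exceeded, then refuses (fully adaptive setting)."""

    def __init__(self, total_budget):
        self.total = total_budget
        self.spent = 0.0

    def request(self, epsilon):
        """Try to spend epsilon; return whether the query is allowed."""
        if self.spent + epsilon > self.total:
            return False  # budget exhausted: refuse the query
        self.spent += epsilon
        return True

f = PrivacyFilter(1.0)
assert f.request(0.6)
assert not f.request(0.6)  # would exceed the budget: refused
assert f.request(0.4)      # a smaller request still fits
```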

&lt;h4 id=&quot;concurrent-composition&quot;&gt;Concurrent composition&lt;/h4&gt;

&lt;p&gt;This is, however, not the end of the story.
Fully adaptive composition captures a setting in which a &lt;em&gt;single&lt;/em&gt; analyst interacts with a DP interface.
What if &lt;em&gt;multiple&lt;/em&gt; analysts have access to this interface, each with their own budget?
&lt;em&gt;Concurrent composition&lt;/em&gt; [&lt;a href=&quot;https://arxiv.org/abs/2105.14427&quot;&gt;VW21&lt;/a&gt;] captures this idea.
In this setting, the mechanisms that are being composed are &lt;em&gt;interactive&lt;/em&gt; (we denote them by IM in the diagram below), and the analysts interacting with each mechanism can share results with each other, and adaptively decide which queries to run.
The goal is to quantify the total privacy budget cost, across analysts: do existing results extend to the composition of interactive mechanisms?&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../images/concurrent-composition.svg&quot; width=&quot;80%&quot; alt=&quot;A diagram representing concurrent composition. A database icon on the left has two-sided arrows going from two boxes labeled IM1 and IM2, respectively labeled ε1 and ε2. The first box has two pairs of arrows going back and forth between it and a smiley face. The second one has the same, with a different smiley face.&quot; style=&quot;margin:auto;display: block;&quot; /&gt;&lt;/p&gt;

&lt;h4 id=&quot;fully-concurrent-composition&quot;&gt;Fully concurrent composition?&lt;/h4&gt;

&lt;p&gt;In concurrent composition as defined in [&lt;a href=&quot;https://arxiv.org/abs/2105.14427&quot;&gt;VW21&lt;/a&gt;], the number of analysts and their respective privacy budgets are fixed upfront.
This means that concurrent composition and fully adaptive composition results are incomparable.
This suggests an even more generic setting, which (to the best of our knowledge) has not been studied in the literature: a kind of concurrent composition, where the number of analysts and their budget is &lt;em&gt;not&lt;/em&gt; predefined.
Let’s call this &lt;em&gt;fully concurrent composition&lt;/em&gt;.
In this setting, an analyst with a certain privacy budget would be able to spin off a new interactive mechanism, with an adaptively-chosen privacy budget, that can also be interacted with concurrently.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../images/fully-concurrent-composition.svg&quot; width=&quot;80%&quot; alt=&quot;A diagram representing fully concurrent composition. It&apos;s the same as the diagram for concurrent composition, except one of the pairs of arrows going to and from IM1 goes to a smaller box labeled IM3, labeled ε3, and there is also an arrow from IM1 to ε3. IM3 also has a pair of arrows going back and forth towards a third smiley face.&quot; style=&quot;margin:auto;display: block;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This setting might seem pointless — why would analysts want to do this? — but proving composition results in this context would help build DP interfaces that combine expressivity and conceptual simplicity.
To understand why, let’s take a look at how &lt;a href=&quot;https://tmlt.dev&quot;&gt;Tumult Analytics&lt;/a&gt;&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; allows users to use its parallel composition feature.&lt;/p&gt;

&lt;p&gt;Tumult Analytics has a concept of a &lt;em&gt;Session&lt;/em&gt;, which is initialized on some sensitive data with a given privacy budget.
Users can submit queries to this Session using a query language implemented in Python.
Each query executed by the Session will consume part of the overall privacy budget, and return DP results.
The user can then examine these results to decide which queries to submit to the Session next, and with which privacy budget.
So far, this matches the fully adaptive setting, in its privacy filter formulation.&lt;/p&gt;

&lt;p&gt;But Tumult Analytics also allows users to split their sensitive data depending on the value of an attribute, and perform different operations in each partition of the data.
With this feature, users can write algorithms that use &lt;em&gt;parallel composition&lt;/em&gt;, which is very useful.
This partitioning operation takes a fraction of the privacy budget, and spins off &lt;em&gt;sub-Sessions&lt;/em&gt; that each have access to a subset of the original data.
The following diagram visualizes an example of this process.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../images/parallel-composition-analytics.svg&quot; width=&quot;80%&quot; alt=&quot;A diagram visualizing an example of parallel composition in Tumult Analytics. At the top is a database icon labeled &apos;Data&apos;. A double-sided arrow goes from it to a box labeled &apos;Session 1, ε1 = 3&apos;. Under this box is a differently-colored box labeled &apos;Parallel partitioning using ε2 = 1&apos;, three dotted-line arrows go through this box towards boxes labeled &apos;Session 2a, ε2 = 1&apos;, &apos;Session 2b, ε2 = 1&apos;, and &apos;Session 1, ε1 = 2&apos;. Session 2a and 2b have arrows going to and from the database icon, cut in two parts (one for each Session).&quot; style=&quot;margin:auto;display: block;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;At the beginning, there is one Session with a privacy budget of \(\varepsilon_1=3\).
After the partitioning operation, there are now &lt;em&gt;three&lt;/em&gt; Sessions: the original Session that has access to all the data and has a leftover privacy budget of \(\varepsilon_1=2\), and two sub-Sessions that each have access to a partition of the data and have a privacy budget of \(\varepsilon_2=1\).
The analyst using this interface can interact with any of these three Sessions, and interleave queries between each, in a fully interactive manner.
This means that even though there is a single user interacting with the data, the setting is similar to concurrent composition: each Session is an interactive object with a maximum privacy budget.
However, note that the privacy budget associated with each of the sub-Sessions could, in principle, depend on the result of past queries.
This suggests that we need composition results that take this into account, and capture the fully concurrent setting suggested above.&lt;/p&gt;

&lt;h3 id=&quot;composition-for-variants-of-differential-privacy&quot;&gt;Composition for variants of differential privacy&lt;/h3&gt;

&lt;h4 id=&quot;existing-results-and-natural-questions&quot;&gt;Existing results and natural questions&lt;/h4&gt;

&lt;p&gt;A large number of variants and extensions of differential privacy have been proposed in the literature.
In many cases, a benefit of these alternative definitions is to improve the privacy analysis of mechanisms that compose a large number of simpler primitives.
For example, the \(n\)-fold composition of \(\varepsilon\)-DP mechanisms is \(n\varepsilon\)-DP, but the \(n\)-fold composition of \((\varepsilon,\delta)\)-DP mechanisms is also \((\varepsilon’,\delta’)\)-DP, with \(\varepsilon’\approx\sqrt{n}\varepsilon\) and \(\delta’\approx n\delta\).
Machine learning applications often use the moments accountant to perform privacy accounting, relying on the composition property of Rényi DP [&lt;a href=&quot;https://ieeexplore.ieee.org/abstract/document/8049725&quot;&gt;Mir17&lt;/a&gt;, &lt;a href=&quot;https://research.google/pubs/pub45428/&quot;&gt;ACGMMTZ16&lt;/a&gt;].
Gaussian DP and its generalization \(f\)-DP [&lt;a href=&quot;https://academic.oup.com/jrsssb/article/84/1/3/7056089&quot;&gt;DRS22&lt;/a&gt;] are also used in this context [&lt;a href=&quot;https://arxiv.org/abs/1911.11607&quot;&gt;BDLS20&lt;/a&gt;].
Meanwhile, statistical use cases using the Gaussian mechanism often use zero-concentrated DP [&lt;a href=&quot;https://link.springer.com/chapter/10.1007/978-3-662-53641-4_24&quot;&gt;BS16&lt;/a&gt;] (zCDP) for their privacy analysis [&lt;a href=&quot;https://desfontain.es/privacy/real-world-differential-privacy.html&quot;&gt;Des21&lt;/a&gt;]; the approximate version of this definition is also useful when queries are grouped by an unknown domain [&lt;a href=&quot;https://arxiv.org/abs/2301.01998&quot;&gt;SDH23&lt;/a&gt;].&lt;/p&gt;
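&lt;p&gt;To make the \(\sqrt{n}\) behavior above concrete, here is a sketch (function name ours) evaluating one standard formula for the \(n\)-fold composition of \((\varepsilon,\delta)\)-DP mechanisms, the advanced composition theorem:&lt;/p&gt;

```python
import math

def advanced_composition(eps, delta, n, delta_prime):
    """n-fold composition of (eps, delta)-DP mechanisms is
    (eps_prime, n*delta + delta_prime)-DP, where eps_prime is given by
    the advanced composition theorem (for any slack delta_prime > 0)."""
    eps_prime = (math.sqrt(2 * n * math.log(1 / delta_prime)) * eps
                 + n * eps * (math.exp(eps) - 1))
    return eps_prime, n * delta + delta_prime

# For small eps and large n this scales roughly like sqrt(n)*eps,
# far below the n*eps of basic composition.
```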

&lt;p&gt;It is thus natural to study the composition of these variants under the settings described in the previous section.
For many variants and composition settings, &lt;em&gt;optimal&lt;/em&gt; composition results have been proven.
We give an overview in the following table.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt; &lt;/th&gt;
      &lt;th&gt;&lt;strong&gt;Sequential&lt;/strong&gt;&lt;/th&gt;
      &lt;th&gt;&lt;strong&gt;Adaptive&lt;/strong&gt;&lt;/th&gt;
      &lt;th&gt;&lt;strong&gt;Fully adaptive&lt;/strong&gt;&lt;/th&gt;
      &lt;th&gt;&lt;strong&gt;Concurrent&lt;/strong&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;\(\varepsilon\)-DP&lt;/td&gt;
      &lt;td&gt;[&lt;a href=&quot;https://link.springer.com/chapter/10.1007/11681878_14&quot;&gt;DMNS06&lt;/a&gt;]&lt;/td&gt;
      &lt;td&gt;[&lt;a href=&quot;https://link.springer.com/chapter/10.1007/11761679_29&quot;&gt;DKMMN06&lt;/a&gt;]&lt;/td&gt;
      &lt;td&gt;[&lt;a href=&quot;https://proceedings.neurips.cc/paper_files/paper/2016/hash/58c54802a9fb9526cd0923353a34a7ae-Abstract.html&quot;&gt;RRUV16&lt;/a&gt;]&lt;/td&gt;
      &lt;td&gt;[&lt;a href=&quot;https://arxiv.org/abs/2105.14427&quot;&gt;VW21&lt;/a&gt;]&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;\((\varepsilon,\delta)\)-DP&lt;/td&gt;
      &lt;td&gt;[&lt;a href=&quot;https://proceedings.mlr.press/v37/kairouz15.html&quot;&gt;KOV15&lt;/a&gt;]&lt;/td&gt;
      &lt;td&gt;[&lt;a href=&quot;https://proceedings.mlr.press/v37/kairouz15.html&quot;&gt;KOV15&lt;/a&gt;]&lt;/td&gt;
      &lt;td&gt;[&lt;a href=&quot;https://proceedings.mlr.press/v202/whitehouse23a.html&quot;&gt;WRRW22&lt;/a&gt;]*&lt;/td&gt;
      &lt;td&gt;[&lt;a href=&quot;https://proceedings.mlr.press/v202/whitehouse23a.html&quot;&gt;WRRW22&lt;/a&gt;, &lt;a href=&quot;https://proceedings.neurips.cc/paper_files/paper/2022/hash/3f52b555967a95ee850fcecbd29ee52d-Abstract-Conference.html&quot;&gt;Lyu22&lt;/a&gt;]&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Gaussian DP&lt;/td&gt;
      &lt;td&gt;[&lt;a href=&quot;https://academic.oup.com/jrsssb/article/84/1/3/7056089&quot;&gt;DRS22&lt;/a&gt;]&lt;/td&gt;
      &lt;td&gt;[&lt;a href=&quot;https://academic.oup.com/jrsssb/article/84/1/3/7056089&quot;&gt;DRS22&lt;/a&gt;]&lt;/td&gt;
      &lt;td&gt;[&lt;a href=&quot;https://arxiv.org/abs/2210.17520&quot;&gt;ST22&lt;/a&gt;]&lt;/td&gt;
      &lt;td&gt;[&lt;a href=&quot;https://arxiv.org/abs/2207.08335&quot;&gt;VZ22&lt;/a&gt;]&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;\(f\)-DP&lt;/td&gt;
      &lt;td&gt;[&lt;a href=&quot;https://academic.oup.com/jrsssb/article/84/1/3/7056089&quot;&gt;DRS22&lt;/a&gt;]&lt;/td&gt;
      &lt;td&gt;[&lt;a href=&quot;https://academic.oup.com/jrsssb/article/84/1/3/7056089&quot;&gt;DRS22&lt;/a&gt;]&lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;[&lt;a href=&quot;https://arxiv.org/abs/2207.08335&quot;&gt;VZ22&lt;/a&gt;]&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;\((\alpha,\varepsilon)\)-Rényi DP&lt;/td&gt;
      &lt;td&gt;[&lt;a href=&quot;https://ieeexplore.ieee.org/abstract/document/8049725&quot;&gt;Mir17&lt;/a&gt;]&lt;/td&gt;
      &lt;td&gt;[&lt;a href=&quot;https://ieeexplore.ieee.org/abstract/document/8049725&quot;&gt;Mir17&lt;/a&gt;]&lt;/td&gt;
      &lt;td&gt;[&lt;a href=&quot;https://proceedings.neurips.cc/paper/2021/hash/ec7f346604f518906d35ef0492709f78-Abstract.html&quot;&gt;FZ21&lt;/a&gt;]&lt;/td&gt;
      &lt;td&gt;[&lt;a href=&quot;https://proceedings.neurips.cc/paper_files/paper/2022/hash/3f52b555967a95ee850fcecbd29ee52d-Abstract-Conference.html&quot;&gt;Lyu22&lt;/a&gt;]&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;\(\rho\)-zero-concentrated DP&lt;/td&gt;
      &lt;td&gt;[&lt;a href=&quot;https://link.springer.com/chapter/10.1007/978-3-662-53641-4_24&quot;&gt;BS16&lt;/a&gt;]&lt;/td&gt;
      &lt;td&gt;[&lt;a href=&quot;https://link.springer.com/chapter/10.1007/978-3-662-53641-4_24&quot;&gt;BS16&lt;/a&gt;]&lt;/td&gt;
      &lt;td&gt;[&lt;a href=&quot;https://proceedings.neurips.cc/paper/2021/hash/ec7f346604f518906d35ef0492709f78-Abstract.html&quot;&gt;FZ21&lt;/a&gt;]&lt;/td&gt;
      &lt;td&gt;[&lt;a href=&quot;https://proceedings.neurips.cc/paper_files/paper/2022/hash/3f52b555967a95ee850fcecbd29ee52d-Abstract-Conference.html&quot;&gt;Lyu22&lt;/a&gt;]&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;\(\delta\)-approx. \(\rho\)-zCDP&lt;/td&gt;
      &lt;td&gt;[&lt;a href=&quot;https://link.springer.com/chapter/10.1007/978-3-662-53641-4_24&quot;&gt;BS16&lt;/a&gt;]&lt;/td&gt;
      &lt;td&gt;[&lt;a href=&quot;https://link.springer.com/chapter/10.1007/978-3-662-53641-4_24&quot;&gt;BS16&lt;/a&gt;]&lt;/td&gt;
      &lt;td&gt;[&lt;a href=&quot;https://proceedings.mlr.press/v202/whitehouse23a.html&quot;&gt;WRRW22&lt;/a&gt;]&lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;center&gt;&lt;small&gt;

* Only asymptotically optimal for small ε.

&lt;/small&gt;&lt;/center&gt;

&lt;p&gt;This summary already suggests a few natural open questions: it is not known whether the fully adaptive composition results for \((\varepsilon,\delta)\)-DP can be improved, there is no fully adaptive composition theorem for \(f\)-DP, and there is no concurrent composition theorem for \(\delta\)-approximate \(\rho\)-zCDP.&lt;/p&gt;

&lt;h4 id=&quot;reordering-mechanisms-during-the-privacy-analysis&quot;&gt;Reordering mechanisms during the privacy analysis&lt;/h4&gt;

&lt;p&gt;Let’s assume for a moment that the table above is completed, and that we have optimal composition theorems for all the variants of interest and all settings.
Consider an analyst using a differential privacy framework, and performing multiple operations in a fully adaptive way.
Some of these operations use \(\rho\)-zCDP, others use \((\varepsilon,\delta)\)-DP, possibly alternating between the two and with varying parameters.
How should the privacy accounting be done in such a scenario?&lt;/p&gt;

&lt;p&gt;In the context of sequential composition, it would be natural to &lt;em&gt;reorder&lt;/em&gt; those mechanisms: consider the equivalent situation where all \(\rho\)-zCDP mechanisms occur first, and all \((\varepsilon,\delta)\)-DP mechanisms occur afterwards.
In this setting, the zCDP mechanisms can first be composed using the zCDP composition rule.
The overall zCDP guarantee can then be converted to \((\varepsilon,\delta)\)-DP, and composed with the other \((\varepsilon,\delta)\)-DP guarantees.
This will lead to a tighter privacy analysis than converting every individual \(\rho\)-zCDP mechanism to \((\varepsilon,\delta)\)-DP, and composing those guarantees.&lt;/p&gt;
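&lt;p&gt;As a concrete (toy) illustration of why reordering helps, here is a small numeric sketch. It assumes the standard conversion from \(\rho\)-zCDP to \(\big(\rho + 2\sqrt{\rho\ln(1/\delta)},\delta\big)\)-DP [&lt;a href=&quot;https://link.springer.com/chapter/10.1007/978-3-662-53641-4_24&quot;&gt;BS16&lt;/a&gt;]; the parameter values are arbitrary, and the accounting uses plain basic composition:&lt;/p&gt;

```python
import math

def zcdp_to_approx_dp(rho, delta):
    # Standard conversion [BS16]: rho-zCDP implies
    # (rho + 2*sqrt(rho*ln(1/delta)), delta)-DP.
    return rho + 2 * math.sqrt(rho * math.log(1 / delta))

rhos = [0.01] * 20          # twenty rho-zCDP mechanisms (arbitrary parameters)
eps_dp = [(0.1, 1e-7)] * 5  # five (eps, delta)-DP mechanisms
delta_conv = 1e-7           # delta budgeted for the zCDP-to-DP conversion

# Reordered analysis: compose all zCDP guarantees first, convert once.
eps_reordered = zcdp_to_approx_dp(sum(rhos), delta_conv) + sum(e for e, _ in eps_dp)

# Naive analysis: convert each zCDP mechanism on its own (splitting delta
# evenly), then apply basic composition to the resulting (eps, delta) pairs.
eps_naive = sum(zcdp_to_approx_dp(r, delta_conv / len(rhos)) for r in rhos) \
            + sum(e for e, _ in eps_dp)

assert eps_reordered < eps_naive
```

&lt;p&gt;With these numbers, the reordered analysis yields a much smaller overall \(\varepsilon\), essentially because \(\sqrt{\sum_i \rho_i}\) is much smaller than \(\sum_i \sqrt{\rho_i}\).&lt;/p&gt;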

&lt;p&gt;However, we would need an additional theoretical result to perform this kind of reordering operation in a fully adaptive context: the fact that composition results exist for \((\varepsilon,\delta)\)-DP and \(\rho\)-zCDP does not mean they can be combined.
How can we resolve this problem, and make it possible to use the same privacy accounting techniques in the sequential setting as in the fully adaptive or fully concurrent settings?
This leads to a natural open question: when performing the privacy analysis of a privacy filter, can one “reorder” the mechanisms when composing them?
Answering this positively would allow DP frameworks to implement tighter privacy accounting at a relatively low cost in complexity.
It might very well be that the answer to this open question is negative.
In that case, proving such a separation result would be of significant theoretical interest in the study of DP composition.&lt;/p&gt;

&lt;h4 id=&quot;composing-privacy-loss-distributions&quot;&gt;Composing privacy loss distributions&lt;/h4&gt;

&lt;p&gt;When we say that a mechanism is \((\varepsilon,\delta)\)-DP, or \(\rho\)-zCDP, we are giving a “global” bound on the privacy loss random variable, defined by:
\[
    \mathcal{L}_{D,D’}(o) =
       \ln\left(\frac{\mathbb{P}\left[\mathcal{M}(D)=o\right]}{\mathbb{P}\left[\mathcal{M}(D’)=o\right]}\right)
\]
for all neighboring inputs \(D\) and \(D’\).&lt;/p&gt;

&lt;p&gt;An alternative approach to privacy accounting consists in &lt;em&gt;fully&lt;/em&gt; describing this random variable.
One approach to do this uses the formalism of &lt;em&gt;privacy loss distributions&lt;/em&gt; (PLDs) [&lt;a href=&quot;https://petsymposium.org/popets/2019/popets-2019-0029.php&quot;&gt;SMM18&lt;/a&gt;].
The PLD of a mechanism is defined as:
\[
    \omega(y) = \mathbb{P}_{o\sim\mathcal{M}(D)}\left[\mathcal{L}_{D,D’}(o)=y\right].
\]&lt;/p&gt;

&lt;p&gt;In the sequential composition setting, PLDs can be used for tight privacy analysis. 
This relies on a conceptually simple result: if \(\omega\) is the PLD of \(\mathcal{M}\) and \(\omega’\) is the PLD of \(\mathcal{M}’\) on neighboring databases \(D\), \(D’\), then the PLD of the composition of \(\mathcal{M}\) and \(\mathcal{M}’\) is \(\omega\ast\omega’\), where \(\ast\) is the convolution operator.
Of course, when doing privacy accounting, we don’t want \(\omega\) and \(\omega’\) to depend on the pair of databases, so we replace them with &lt;em&gt;worst-case&lt;/em&gt; PLDs that are “larger” than all possible PLDs for neighboring databases.&lt;/p&gt;
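&lt;p&gt;To make this concrete, here is a minimal numeric sketch of PLD composition for a toy mechanism, \(\varepsilon_0\)-randomized response, whose PLD takes only two values. The dictionary representation and helper names are ours; the \(\delta(\varepsilon)\) computation uses the standard hockey-stick formula \(\delta(\varepsilon) = \mathbb{E}\big[(1-e^{\varepsilon-\mathcal{L}})_+\big]\):&lt;/p&gt;

```python
import math
from itertools import product
from collections import defaultdict

def rr_pld(eps0):
    # PLD of eps0-randomized response: the privacy loss equals +eps0 with
    # probability e^eps0 / (1 + e^eps0), and -eps0 otherwise.
    p = math.exp(eps0) / (1 + math.exp(eps0))
    return {eps0: p, -eps0: 1 - p}

def convolve(pld_a, pld_b):
    # PLD of the sequential composition: losses add, probabilities multiply.
    out = defaultdict(float)
    for (la, pa), (lb, pb) in product(pld_a.items(), pld_b.items()):
        out[la + lb] += pa * pb
    return dict(out)

def delta_for(pld, eps):
    # Tight delta(eps) from a PLD: E[(1 - e^(eps - L))_+].
    return sum(p * (1 - math.exp(eps - l)) for l, p in pld.items() if l > eps)

pld2 = convolve(rr_pld(0.5), rr_pld(0.5))
# Composing two copies of 0.5-DP randomized response is 1.0-DP...
assert delta_for(pld2, 1.0) < 1e-12
# ...and already satisfies a nontrivial (0.5, delta)-DP guarantee with delta < 1.
assert 0 < delta_for(pld2, 0.5) < 1
```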

&lt;p&gt;Using PLDs for privacy accounting can be done numerically [&lt;a href=&quot;https://eprint.iacr.org/2017/1034&quot;&gt;MM18&lt;/a&gt;, &lt;a href=&quot;https://arxiv.org/abs/2102.12412&quot;&gt;KJH20&lt;/a&gt;, &lt;a href=&quot;http://proceedings.mlr.press/v130/koskela21a.html&quot;&gt;KJPH21&lt;/a&gt;, &lt;a href=&quot;https://proceedings.neurips.cc/paper_files/paper/2021/hash/6097d8f3714205740f30debe1166744e-Abstract.html&quot;&gt;GLW21&lt;/a&gt;, &lt;a href=&quot;https://proceedings.mlr.press/v162/ghazi22a.html&quot;&gt;GKKM22&lt;/a&gt;, &lt;a href=&quot;https://arxiv.org/abs/2207.04380&quot;&gt;DGKKM22&lt;/a&gt;] or analytically [&lt;a href=&quot;https://proceedings.mlr.press/v151/zhu22c.html&quot;&gt;ZDW22&lt;/a&gt;].
This family of approaches is convenient because it is very generic: DP frameworks can use a tight upper bound PLD when known, and fall back to a worst-case PLD corresponding to \(\varepsilon\)-DP or \((\varepsilon,\delta)\)-DP when the mechanism is too complex.
Unfortunately, the composition result mentioned above has only been proven in the sequential composition setting [&lt;a href=&quot;https://eprint.iacr.org/2017/1034&quot;&gt;MM18&lt;/a&gt;].
Extending it to adaptive composition is straightforward, but extending it to the fully adaptive setting (with privacy filters) or the concurrent setting does not seem trivial.&lt;/p&gt;

&lt;p&gt;This leads us to our last open question: can these privacy accounting techniques be used in the fully adaptive or concurrent settings?&lt;/p&gt;

&lt;h3 id=&quot;summary&quot;&gt;Summary&lt;/h3&gt;

&lt;p&gt;In this blog post, we gave a high-level overview of different settings and variants of composition theorems.
Along the way, we listed a number of natural open questions.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Can we define a setting that generalizes both fully adaptive composition and concurrent composition? What composition results hold in that setting?&lt;/li&gt;
  &lt;li&gt;Can we “fill in the blanks” among existing composition results? Namely, can we prove optimal composition results for \((\varepsilon,\delta)\)-DP and \(f\)-DP in the fully adaptive setting, and for \(\delta\)-approximate \(\rho\)-zCDP in the concurrent setting?&lt;/li&gt;
  &lt;li&gt;In the fully adaptive setting with privacy filters, can one reorder mechanisms when computing their cumulative privacy loss, to optimize the privacy accounting?&lt;/li&gt;
  &lt;li&gt;Can we prove fully adaptive and concurrent composition results for privacy accounting based on privacy loss distributions?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Progress on these open questions would either uncover surprising additional separation results, or enable usability and utility improvements to general-purpose DP infrastructure.
We’re excited about both prospects!&lt;/p&gt;

&lt;hr /&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;&lt;a href=&quot;https://tmlt.dev&quot;&gt;Tumult Analytics&lt;/a&gt; is a differential privacy framework used by institutions such as the U.S. Census Bureau, the IRS, or the Wikimedia Foundation. It is developed by &lt;a href=&quot;https://tmlt.io&quot;&gt;Tumult Labs&lt;/a&gt;, the employer of the author of this blog post. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <author>
        
            <name>Damien Desfontaines</name>
        
        </author>
        <pubDate>Mon, 18 Sep 2023 21:00:00 -0400</pubDate>
        <link>https://differentialprivacy.org/open-problems-how-generic-can-composition-be/</link>
        <guid isPermaLink="true">https://differentialprivacy.org/open-problems-how-generic-can-composition-be/</guid>
      </item>
    
      <item>
        <title>Beyond Local Sensitivity via Down Sensitivity</title>
        <description>&lt;p&gt;In &lt;a href=&quot;/inverse-sensitivity/&quot;&gt;our previous post&lt;/a&gt;, we discussed local sensitivity and how we can get accuracy guarantees that scale with local sensitivity, which can be much better than the global sensitivity guarantees attained via standard noise addition mechanisms.
In this post, we will look at what we can do when even the local sensitivity is unbounded. This is obviously a challenging setting, but it turns out that not all hope is lost.&lt;/p&gt;

&lt;p&gt;As a motivating example, suppose we have a dataset \(x=(x_1,x_2,\cdots,x_n)\) and we want to approximate \(\max_i x_i \) in a differentially private manner.
The difficulty is that adding a single element to \(x\) can increase the maximum arbitrarily. That is, if \(x’=(x_1,x_2,\cdots,x_n,\infty)\), then \(\max_i x’_i=\infty\). Differential privacy requires us to make the outputs \(M(x)\) and \(M(x’)\) indistinguishable, which seems to directly contradict our accuracy goal \(M(x) \approx \max_i x_i\).&lt;/p&gt;

&lt;p&gt;One solution to the problem of unbounded sensitivity is to clip the inputs, so that the sensitivity becomes bounded. But this requires knowing a good a priori approximate upper bound on the \(x_i\)s. Trying to find such an upper bound is probably the very reason we want to approximate the maximum in the first place!&lt;/p&gt;

&lt;p&gt;Another solution is to “aim lower:” Instead of aiming to approximate the largest element \(x_{(n)} := \max_i x_i\), we can aim to approximate the \(k\)-th largest element \(x_{(n-k+1)}\).
The \(k\)-th largest element has bounded local sensitivity, which means we can apply &lt;a href=&quot;/inverse-sensitivity/&quot;&gt;the inverse sensitivity mechanism&lt;/a&gt; or similar tools.
And – spoiler alert – this is essentially what we will do. However, we will present an algorithm that is more general than just for approximating the maximum.&lt;/p&gt;

&lt;p&gt;The algorithm we present is due to Fang, Dong, and Yi [&lt;a href=&quot;https://cse.hkust.edu.hk/~yike/ShiftedInverse.pdf&quot; title=&quot;Juanru Fang, Wei Dong, Ke Yi. Shifted Inverse: A General Mechanism for Monotonic Functions under User Differential Privacy. CCS 2022.&quot;&gt;FDY22&lt;/a&gt;].
In terms of applications, a natural setting where we may need to approximate functions of unbounded local sensitivity is when each person can contribute multiple items to the dataset. This setting is often referred to as “user-level differential privacy” or “user DP.”&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;
For example, if we have a collection of web browsing histories, we may wish to estimate the total number of webpages visited; this has unbounded local sensitivity because a single person could visit an arbitrary number of webpages.&lt;/p&gt;

&lt;h2 id=&quot;down-sensitivity&quot;&gt;Down Sensitivity&lt;/h2&gt;

&lt;p&gt;Observe that, while &lt;em&gt;adding&lt;/em&gt; one element to the input can increase the maximum arbitrarily, &lt;em&gt;removing&lt;/em&gt; one element can only decrease it by the gap between the largest and second-largest elements \(x_{(n)}-x_{(n-1)}\). In other words, the maximum satisfies some kind of one-sided local sensitivity bound. This is the general property we will rely on.&lt;/p&gt;

&lt;p&gt;We define the \(k\)-&lt;em&gt;down sensitivity&lt;/em&gt;&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; of the function \(f : \mathcal{X}^* \to \mathbb{R}\) at the input \(x\in\mathcal{X}^*\) as
&lt;a id=&quot;downsensitivity&quot;&gt;&lt;/a&gt;\[\mathsf{DS}^k_f(x) := \sup_{x’ \subseteq x : \mathrm{dist}(x,x’) \le k} |f(x)-f(x’)|. \tag{1}\]
Here \(\mathrm{dist} : \mathcal{X}^* \times \mathcal{X}^* \to \mathbb{R}\) is the size of the symmetric difference between the two input tuples/multisets \(\mathrm{dist}(x,x’) = |x \setminus x’| + | x’ \setminus x |\), which defines a metric. In other words, it measures how many people’s data must be added or removed to get from one dataset to the other.
For comparison, the local sensitivity is
&lt;a id=&quot;localsensitivity&quot;&gt;&lt;/a&gt;\[\mathsf{LS}^k_f(x) := \sup_{x’\in\mathcal{X}^* : \mathrm{dist}(x,x’) \le k} |f(x)-f(x’)|. \tag{2}\]
The difference between Equations 1 and 2 is simply that down sensitivity only considers removing elements from \(x\), while local sensitivity considers both addition and removal.
Thus, the down sensitivity is at most the local sensitivity, which is, in turn, upper bounded by the global sensitivity: \(\mathsf{DS}^k_f(x) \le \mathsf{LS}^k_f(x) \le k \cdot \mathsf{GS}_f\).&lt;/p&gt;
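&lt;p&gt;For the running example \(f(x)=\max_i x_i\), the down sensitivity has a simple closed form: removing at most \(k\) elements can decrease the maximum to the \((k+1)\)-th largest value, and no further. A short illustrative sketch (our own code, not taken from the referenced papers):&lt;/p&gt;

```python
def down_sensitivity_max(x, k):
    # k-down sensitivity of f(x) = max(x): the worst case removes the k
    # largest elements, so DS^k equals max(x) minus the (k+1)-th largest value.
    s = sorted(x, reverse=True)
    return s[0] - s[min(k, len(s) - 1)]

x = [1, 3, 7, 7, 50]
assert down_sensitivity_max(x, 1) == 43  # removing the 50 drops the max to 7
assert down_sensitivity_max(x, 2) == 43  # the duplicate 7 keeps the max at 7
assert down_sensitivity_max(x, 3) == 47  # removing 50, 7, 7 leaves max 3
# By contrast, the local sensitivity is unbounded: a single added element
# can raise the maximum arbitrarily.
```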

&lt;p&gt;Intuitively, what is nice about down sensitivity is that it only considers the actual data we have at hand. It doesn’t consider any hypothetical people’s data that could be added to the dataset. It is appealing to only have to deal with “real” data.&lt;/p&gt;

&lt;p&gt;Our goal now is to estimate \(f(x)\) in a differentially private manner, where the accuracy guarantee scales with the down sensitivity.&lt;/p&gt;

&lt;h2 id=&quot;monotonicity-assumption&quot;&gt;Monotonicity Assumption&lt;/h2&gt;

&lt;p&gt;In order to do anything, we need some assumptions about the function \(f : \mathcal{X}^* \to \mathcal{Y}\) that we are trying to approximate.
First we will assume that \(\mathcal{Y} \subseteq \mathbb{R}\) is finite and \(f\) is surjective.&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;
The main assumption is monotonicity:
&lt;a id=&quot;monotonicity&quot;&gt;&lt;/a&gt;\[\forall x’ \subseteq x \in \mathcal{X}^* ~~~ f(x’) \le f(x). \tag{3}\]
The maximum and many other example functions satisfy this assumption.&lt;/p&gt;

&lt;p&gt;Intuitively, we need this assumption to ensure that the down sensitivity is well-behaved. 
Specifically, Lemma 1 below requires monotonicity.&lt;/p&gt;

&lt;p&gt;As an example of &lt;a id=&quot;weirdnonmonotonicity&quot;&gt;&lt;/a&gt;what could happen if we don’t make this assumption, consider the function \(\mathrm{sum}(x) := \sum_i x_i\) and the pair of neighboring inputs \(x=(1,1,\cdots,1)\in\mathcal{X}^n,x’=(1,1,\cdots,1,-100n)\in\mathcal{X}^{n+1}\). Then, for all \(1 \le k\le n\), we have \(\mathsf{DS}_{\mathrm{sum}}^k(x)=k\), but \(\mathsf{DS}_{\mathrm{sum}}^k(x’)=100n\).&lt;/p&gt;

&lt;p&gt;Note that the sum is monotone if we restrict to non-negative inputs. In general, we can take any function \(g\) and convert it into a monotone function \(f\) by defining \(f(x) = \max\{ g(\check{x}) : \check{x} \subseteq x \}\). Depending on the context, this \(f\) may or may not be a good proxy for \(g\).&lt;/p&gt;
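&lt;p&gt;For the sum, this monotone envelope has a simple closed form: the maximizing subset \(\check{x} \subseteq x\) keeps exactly the positive elements (and the empty subset gives \(0\)). A quick sketch:&lt;/p&gt;

```python
def monotone_sum(x):
    # Monotone envelope f(x) = max{ sum(x_check) : x_check a subset of x }
    # of the sum: the best subset keeps exactly the positive elements.
    return sum(v for v in x if v > 0)

assert monotone_sum([1, -2, 3]) == 4  # drop the -2
assert monotone_sum([-1, -5]) == 0    # the empty subset is best
# Removing elements can only shrink the set of positive elements, so this
# f satisfies the monotonicity property of Equation 3.
```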

&lt;h2 id=&quot;a-loss-with-bounded-global-sensitivity&quot;&gt;A Loss With Bounded Global Sensitivity&lt;/h2&gt;

&lt;p&gt;Given a monotone function \(f : \mathcal{X}^* \to \mathbb{R}\), we define a loss function \(\ell : \mathcal{X}^* \times \mathbb{R} \to \mathbb{Z}_{\ge 0}\) by
&lt;a id=&quot;loss&quot;&gt;&lt;/a&gt;\[\ell(x,y) := \min\{ \mathrm{dist}(x,\tilde{x}) : \tilde{x} \subseteq x, f(\tilde{x}) \le y \}. \tag{4}\]
In other words, \(\ell(x,y)\) measures how many entries we need to remove from \(x\) to obtain a subset \(\tilde{x}\) with \(f(\tilde{x}) \le y\). 
Yet another way to think of it is that \(\ell(x,y)\) is the distance from the point \(x\) to the set \(f^{-1}((-\infty,y]) \cap \{ \tilde{x} : \tilde{x} \subseteq x \} \).&lt;/p&gt;

&lt;table&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;img src=&quot;../images/shiftedinverseloss.png&quot; alt=&quot;Plot of the loss corresponding to the maximum where the dataset exactly matches Binomial(5,0.5). This is a decreasing function with steps. There is also a vertical line indicating the true maximum value.&quot; /&gt; Figure 1: Visualization of the loss \(\ell(x,y)\) corresponding to \(f(x)=\max_i x_i\) for a dataset representing the distribution \(\mathrm{Binomial}(5,1/2)\) i.e. the true maximum is \(5\) and the dataset is \(x=(0,\underbrace{1,1,1,1,1}_{5\times},\underbrace{2,2,\cdots,2}_{10\times},\underbrace{3,3,\cdots,3}_{10\times},\underbrace{4,4,4,4,4}_{5\times},5)\).&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
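&lt;p&gt;For the maximum, the loss in &lt;a href=&quot;#loss&quot;&gt;Equation 4&lt;/a&gt; is easy to compute: \(\ell(x,y)\) is simply the number of elements strictly greater than \(y\). The following sketch (our own illustration) checks a few values on the Figure 1 dataset:&lt;/p&gt;

```python
def loss_max(x, y):
    # Loss from Equation 4 for f(x) = max(x): to bring the maximum down to
    # at most y, we must remove exactly the elements exceeding y.
    return sum(1 for v in x if v > y)

# The Figure 1 dataset: Binomial(5, 1/2)-shaped counts 1, 5, 10, 10, 5, 1.
x = [0] + [1] * 5 + [2] * 10 + [3] * 10 + [4] * 5 + [5]
assert loss_max(x, 5) == 0   # already satisfied: nothing to remove
assert loss_max(x, 4) == 1   # remove the single 5
assert loss_max(x, 3) == 6   # remove the 5 and the five 4s
assert loss_max(x, 2) == 16
```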

&lt;p&gt;The key property we need is that this loss has bounded sensitivity. We split the proof into Lemmas 1 and 2.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Lemma 1.&lt;/strong&gt;
Let \(f : \mathcal{X}^* \to \mathbb{R}\) satisfy the monotonicity property in &lt;a href=&quot;#monotonicity&quot;&gt;Equation 3&lt;/a&gt;.
Define \(\ell : \mathcal{X}^* \times \mathbb{R} \to \mathbb{Z}_{\ge 0}\) as in &lt;a href=&quot;#loss&quot;&gt;Equation 4&lt;/a&gt;. &lt;br /&gt;
Let \(x’ \subseteq x \in \mathcal{X}^*\).
Then \(\ell(x’,y)\le\ell(x,y)\) for all \(y \in \mathbb{R}\).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;Proof.&lt;/em&gt;
Fix \(y \in \mathbb{R}\) and \(x’ \subseteq x \in \mathcal{X}^*\).
Let \(x_\Delta = x \setminus x’ \subseteq x\), so that \(x’ = x \setminus x_\Delta \).&lt;/p&gt;

  &lt;p&gt;Let \(\widehat{x} \subseteq x\) satisfy \(f(\widehat{x})\le y\) and \(\mathrm{dist}(x,\widehat{x})=\ell(x,y)\).
Define \(\widehat{x}’ = \widehat{x} \setminus x_\Delta\). This ensures \(\widehat{x}’ \subseteq x’\) and \[\mathrm{dist}(x’,\widehat{x}’) = \mathrm{dist}(x  \setminus x_\Delta , \widehat{x}  \setminus x_\Delta ) \le \mathrm{dist}(x,\widehat{x}).\]&lt;/p&gt;

  &lt;p&gt;By monotonicity, \(f(\widehat{x}’) \le f(\widehat{x}) \le y\). 
Thus \[\ell(x’,y) = \min\{ \mathrm{dist}(x’,\tilde{x}’) : \tilde{x}’ \subseteq x’, f(\tilde{x}’) \le y \}\]\[ \le \mathrm{dist}(x’,\widehat{x}’)  \le \mathrm{dist}(x,\widehat{x}) = \ell(x,y).\]
∎&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Lemma 2.&lt;/strong&gt;
Let \(f : \mathcal{X}^* \to \mathbb{R}\).
Define \(\ell : \mathcal{X}^* \times \mathbb{R} \to \mathbb{Z}_{\ge 0}\) as in &lt;a href=&quot;#loss&quot;&gt;Equation 4&lt;/a&gt;. &lt;br /&gt;
Let \(x’ \subseteq x \in \mathcal{X}^*\).
Then \(\ell(x,y)\le\ell(x’,y)+\mathrm{dist}(x,x’)\) for all \(y \in \mathbb{R}\).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;Proof.&lt;/em&gt;
Fix \(y \in \mathbb{R}\) and \(x’ \subseteq x \in \mathcal{X}^*\).&lt;/p&gt;

  &lt;p&gt;Let \(\widehat{x}’ \subseteq x’\) satisfy \(f(\widehat{x}’)\le y\) and \(\mathrm{dist}(x’,\widehat{x}’)=\ell(x’,y)\).
Since \(\widehat{x}’ \subseteq x’ \subseteq x\), we have 
\[\ell(x,y) = \min\{ \mathrm{dist}(x,\tilde{x}) : \tilde{x} \subseteq x, f(\tilde{x}) \le y \}  \le \mathrm{dist}(x,\widehat{x}’) \]\[ \le \mathrm{dist}(x,x’) + \mathrm{dist}(x’,\widehat{x}’) = \ell(x’,y)+\mathrm{dist}(x,x’),\]
by the triangle inequality, as required.
∎&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Note that we only needed the monotonicity assumption for Lemma 1. 
Combining the two lemmas gives \[ \forall  x’ \subseteq x ~ \forall y ~~~~~ \ell(x’,y) \le \ell(x,y) \le \ell(x’,y) + \mathrm{dist}(x,x’).\]
Overall we have the following guarantee.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Proposition 3. (Global Sensitivity of the Loss)&lt;/strong&gt;
Let \(f : \mathcal{X}^* \to \mathbb{R}\) satisfy the monotonicity property in &lt;a href=&quot;#monotonicity&quot;&gt;Equation 3&lt;/a&gt;.
Define \(\ell : \mathcal{X}^* \times \mathbb{R} \to \mathbb{Z}_{\ge 0}\) as in &lt;a href=&quot;#loss&quot;&gt;Equation 4&lt;/a&gt;. &lt;br /&gt;
Then, for all \(x, x’ \in \mathcal{X}^*\) and all \(y \in \mathbb{R}\), we have \[|\ell(x,y)-\ell(x’,y)| \le \mathrm{dist}(x,x’).\]&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;Proof.&lt;/em&gt;
Fix \(x, x’ \in \mathcal{X}^*\) and \(y \in \mathbb{R}\).
Let \(x’’ = x \cap x’\). 
Since \(x’’ \subseteq x’\) and \(f\) is assumed to be monotone, Lemma 1 gives \(\ell(x’’ ,y) \le \ell(x’,y)\).
Also \(x’’ \subseteq x\), whence Lemma 2 gives \(\ell(x,y) \le \ell(x’’ , y) + \mathrm{dist}(x , x’’ )\).
Note that \( \mathrm{dist}(x , x’’ ) = | x \setminus x’’ | = | x \setminus x’ | \le \mathrm{dist}(x , x’ ).\)
Combining inequalities gives \(\ell(x,y) \le \ell(x’ , y) + \mathrm{dist}(x , x’ )\). The other direction is symmetric.
∎&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;the-shifted-inverse-sensitivity-mechanism&quot;&gt;The Shifted Inverse Sensitivity Mechanism&lt;/h2&gt;

&lt;p&gt;Let’s recap where we are: We have a monotone function \(f : \mathcal{X}^* \to \mathcal{Y}\), where \(\mathcal{Y} \subseteq \mathbb{R}\) is finite. We want to approximate \(f(x)\) privately. &lt;a href=&quot;#loss&quot;&gt;Equation 4&lt;/a&gt; gives us a loss \(\ell\) that is low-sensitivity.
We have \(\ell(x,f(x))=0\) and, as \(y\) decreases below \(f(x)\), the loss \(\ell(x,y)\) increases (at a rate governed by the down sensitivity of \(f\)).
So far, so good. The problem is that, as \(y\) increases above \(f(x)\), the loss \(\ell(x,y)\) stays at zero. This means we can’t just throw this loss into the exponential mechanism.&lt;/p&gt;

&lt;p&gt;Intuitively, the way we get around this problem is by looking for a value \(y\) such that the loss \(\ell(x,y)\) is greater than zero, but not too large. That is, we “shift” our goal from trying to minimize \(\ell(x,y)\) to minimizing something like \(|\ell(x,y)-\tau|\) for some integer \(\tau&amp;gt;0\).
Going back to the example of the maximum, this corresponds to aiming for the \((\tau+1)\)-th largest value instead of the largest value.
The hope is that we get an output with \(|\ell(x,y)-\tau|&amp;lt;\tau\), which for the maximum example corresponds roughly to getting a value between the largest value and the \(2\tau\)-th largest value.&lt;/p&gt;

&lt;p&gt;Fang, Dong, and Yi [&lt;a href=&quot;https://cse.hkust.edu.hk/~yike/ShiftedInverse.pdf&quot; title=&quot;Juanru Fang, Wei Dong, Ke Yi. Shifted Inverse: A General Mechanism for Monotonic Functions under User Differential Privacy. CCS 2022.&quot;&gt;FDY22&lt;/a&gt;] directly apply the exponential mechanism [&lt;a href=&quot;https://ieeexplore.ieee.org/document/4389483&quot; title=&quot;Frank McSherry, Kunal Talwar. Mechanism Design via Differential Privacy. FOCS 2007.&quot;&gt;MT07&lt;/a&gt;] with a loss of the form \(|\ell(x,y)-\tau|\).&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;
This yields the following guarantee.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Theorem 4. (Shifted Inverse Sensitivity Mechanism)&lt;/strong&gt;
Let \(f : \mathcal{X}^* \to \mathcal{Y}\) be monotone (&lt;a href=&quot;#monotonicity&quot;&gt;Equation 3&lt;/a&gt;), where \(\mathcal{Y} \subseteq \mathbb{R}\) is finite. Let \(\varepsilon&amp;gt;0\) and \(\beta \in (0,1)\).
Then there exists an \(\varepsilon\)-differentially private \(M : \mathcal{X}^* \to \mathcal{Y}\) with the following accuracy guarantee.
For all \(x \in \mathcal{X}^*\), we have
\[\mathbb{P}\left[ f(x) \ge M(x) \ge f(x) - \mathsf{DS}_f^{2\tau}(x)  \right] \ge 1 - \beta,\]
where \(\tau=\left\lceil\frac{2}{\varepsilon}\log\left(\frac{|\mathcal{Y}|}{\beta}\right)\right\rceil\).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is exactly the kind of guarantee we were aiming for; the accuracy scales with the down sensitivity, which could be much smaller than either the local sensitivity or the global sensitivity.
Note that the guarantee gives an &lt;i&gt;under&lt;/i&gt;estimate: \(M(x) \le f(x)\). This is inherent. If the function has infinite “up sensitivity,” then we cannot give an upper bound in a differentially private manner.&lt;/p&gt;

&lt;p&gt;The shifted inverse sensitivity mechanism has the same limitations as the inverse sensitivity mechanism that we discussed in &lt;a href=&quot;/inverse-sensitivity/&quot;&gt;our previous post&lt;/a&gt;. Namely, computing the loss can be computationally intractable for general functions and we have a \(\log|\mathcal{Y}|\) dependence. (We will discuss how to improve this next.)
An additional limitation is that we need the monotonicity assumption. But, as &lt;a href=&quot;#weirdnonmonotonicity&quot;&gt;discussed earlier&lt;/a&gt;, down sensitivity behaves weirdly without this assumption.&lt;/p&gt;
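&lt;p&gt;For concreteness, the core sampling step of the shifted inverse sensitivity mechanism can be sketched in a few lines. This is an illustrative, non-optimized rendering (the helper names and the demo choice \(\tau=2\) are ours): since \(\ell\) has global sensitivity \(1\) by Proposition 3, the exponential mechanism with score \(-|\ell(x,y)-\tau|\) samples each \(y\) with probability proportional to \(\exp(-\varepsilon|\ell(x,y)-\tau|/2)\) and satisfies \(\varepsilon\)-DP:&lt;/p&gt;

```python
import math
import random

def shifted_inverse(loss, x, Y, eps, tau):
    # Exponential mechanism over candidate outputs y with score -|loss(x,y) - tau|.
    # The loss has global sensitivity 1 (Proposition 3), so these weights
    # yield eps-differential privacy [MT07].
    weights = [math.exp(-eps * abs(loss(x, y) - tau) / 2) for y in Y]
    return random.choices(Y, weights=weights, k=1)[0]

# Demo on the Figure 1 dataset, with the loss for f = max.
random.seed(0)
x = [0] + [1] * 5 + [2] * 10 + [3] * 10 + [4] * 5 + [5]
Y = list(range(6))
ell = lambda data, y: sum(1 for v in data if v > y)
y_hat = shifted_inverse(ell, x, Y, eps=50.0, tau=2)
# With eps this large the output concentrates on y = 4, whose loss is
# closest to tau = 2: an underestimate of the true maximum 5.
```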

&lt;h2 id=&quot;beyond-the-exponential-mechanism&quot;&gt;Beyond the Exponential Mechanism&lt;/h2&gt;

&lt;p&gt;Applying the exponential mechanism to find \(y\) with \(\ell(x,y)\approx\tau\) yields a clean guarantee in Theorem 4. However, there are other methods we can apply which may be simpler&lt;sup id=&quot;fnref:4:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; and give better asymptotic guarantees.&lt;/p&gt;

&lt;p&gt;Observe that the loss \(\ell(x,y)\) is a decreasing function of \(y\). The exponential mechanism does not exploit this structure.
A very natural alternative algorithm is to perform binary search.&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;We describe the algorithm in pseudocode and briefly analyze it: The input is the loss \(\ell\) defined in &lt;a href=&quot;#loss&quot;&gt;Equation 4&lt;/a&gt;, the dataset \(x\), an ordered enumeration of the set of outputs \(\mathcal{Y} = \{y_0 \le y_1 \le \cdots \le y_{|\mathcal{Y}|-1} \}\), and parameters \(\sigma,\tau&amp;gt;0\).&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;noisy_binary_search&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;loss&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sigma&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tau&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
     &lt;span class=&quot;n&quot;&gt;i_min&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;
     &lt;span class=&quot;n&quot;&gt;i_max&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;
     &lt;span class=&quot;k&quot;&gt;while&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_min&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_min&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;//&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;
          &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;loss&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;laplace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sigma&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
          &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tau&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
               &lt;span class=&quot;n&quot;&gt;i_max&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;
          &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
               &lt;span class=&quot;n&quot;&gt;i_min&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;
     &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Since each iteration satisfies \(\frac1\sigma\)-differential privacy and there are at most \(\lceil \log_2 |\mathcal{Y}| \rceil\) iterations, the algorithm satisfies \(\varepsilon\)-differential privacy for \(\varepsilon = \frac{\lceil \log_2 |\mathcal{Y}| \rceil}{\sigma} \) by &lt;a href=&quot;/composition-basics/&quot;&gt;basic composition&lt;/a&gt;. 
Alternatively, using advanced composition, we see that the algorithm satisfies \(\rho\)-zCDP [&lt;a href=&quot;https://arxiv.org/abs/1605.02065&quot; title=&quot;Mark Bun, Thomas Steinke. Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds. TCC 2016.&quot;&gt;BS16&lt;/a&gt;] for \(\rho = \frac{\lceil \log_2 |\mathcal{Y}| \rceil}{2\sigma^2} \).&lt;/p&gt;

&lt;p&gt;By a union bound, all of the noise samples have magnitude at most \(\tau\) with probability at least \(1 - \exp(-\tau/\sigma) \cdot \log_2|\mathcal{Y}|\).&lt;sup id=&quot;fnref:b&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:b&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;
Assuming the noise magnitudes are \(\le\tau\), the binary search maintains the invariants \(\ell(x,y_{i_\min})&amp;gt;0\) and \(\ell(x,y_{i_\max})\le 2\tau\).
These invariants imply \(y_{i_\min} &amp;lt; f(x)\) and \(y_{i_\max} \ge f(x) - \mathsf{DS}_f^{2\tau}(x)\) respectively. 
At the end of the binary search, \(i_\min+1 \ge i_\max\) and thus \(y_{i_\min} &amp;lt; f(x)\) implies \(y_{i_\max} \le f(x)\).&lt;/p&gt;
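&lt;p&gt;For completeness, here is a self-contained, runnable rendering of the pseudocode above on the Figure 1 dataset, with a small inverse-CDF Laplace sampler (the &lt;code&gt;laplace&lt;/code&gt; helper and the parameter choices are ours):&lt;/p&gt;

```python
import math
import random

def laplace(scale):
    # Inverse-CDF sampling of a centered Laplace random variable.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def noisy_binary_search(loss, x, Y, sigma, tau):
    # Binary search over the sorted outputs Y, comparing the noisy loss
    # at the midpoint against the shift tau (as in the pseudocode above).
    i_min, i_max = 0, len(Y) - 1
    while i_min + 1 < i_max:
        k = (i_min + i_max) // 2
        v = loss(x, Y[k]) + laplace(sigma)
        if v <= tau:
            i_max = k
        else:
            i_min = k
    return Y[i_max]

# Demo on the Figure 1 dataset, with the loss for f = max.
random.seed(0)
x = [0] + [1] * 5 + [2] * 10 + [3] * 10 + [4] * 5 + [5]
Y = list(range(6))
ell = lambda data, y: sum(1 for v in data if v > y)
y_hat = noisy_binary_search(ell, x, Y, sigma=0.1, tau=2)
```

&lt;p&gt;With this little noise, the search returns an underestimate of the true maximum \(5\) within the \(\mathsf{DS}_f^{2\tau}\) window, matching the invariants above.&lt;/p&gt;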

&lt;p&gt;Setting \(\tau = \sigma \cdot \log\left(\frac{\log_2|\mathcal{Y}|}{\beta}\right)\) and \(\sigma = \frac{\log_2|\mathcal{Y}|}{\varepsilon}\) yields a result similar to Theorem 4.&lt;/p&gt;

&lt;p&gt;Setting \(\tau = \sigma \cdot \log\left(\frac{\log_2|\mathcal{Y}|}{\beta}\right)\) and \(\sigma = \sqrt{\frac{\log_2|\mathcal{Y}|}{2\rho}}\) yields the following result for concentrated differential privacy [&lt;a href=&quot;https://arxiv.org/abs/1603.01887&quot; title=&quot;Cynthia Dwork, Guy N. Rothblum. Concentrated Differential Privacy. 2016.&quot;&gt;DR16&lt;/a&gt;,&lt;a href=&quot;https://arxiv.org/abs/1605.02065&quot; title=&quot;Mark Bun, Thomas Steinke. Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds. TCC 2016.&quot;&gt;BS16&lt;/a&gt;]. 
Note that setting \(\rho = \frac{\varepsilon^2}{4\log(1/\delta)+4\varepsilon}\) suffices to give \((\varepsilon,\delta)\)-differential privacy [e.g. &lt;a href=&quot;https://arxiv.org/abs/2210.00597v4&quot; title=&quot;Thomas Steinke. Composition of Differential Privacy &amp;amp; Privacy Amplification by Subsampling. 2022.&quot;&gt;S22&lt;/a&gt; Remark 15].&lt;/p&gt;
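&lt;p&gt;Putting these settings together, a hypothetical helper (not from the original paper) for choosing the noise scale \(\sigma\) and shift \(\tau\) from a target \((\varepsilon,\delta)\) might look as follows; this is a sketch of the parameter arithmetic only:&lt;/p&gt;

```python
import math

def theorem5_params(num_outputs, eps, delta, beta):
    """Parameter arithmetic for the zCDP binary search (Theorem 5):
    choose rho so that rho-zCDP implies (eps, delta)-DP [S22, Remark 15],
    then set the Laplace scale sigma and threshold tau accordingly."""
    rho = eps ** 2 / (4.0 * math.log(1.0 / delta) + 4.0 * eps)
    log_y = math.log2(num_outputs)        # log_2 |Y|
    sigma = math.sqrt(log_y / (2.0 * rho))
    tau = sigma * math.log(log_y / beta)
    return sigma, tau
```

&lt;p&gt;For example, with \(|\mathcal{Y}|=2^{20}\), \(\varepsilon=1\), \(\delta=10^{-6}\), and \(\beta=0.1\), this gives \(\sigma \approx 24\) and \(\tau \approx 129\).&lt;/p&gt;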

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Theorem 5. (Shifted Inverse Sensitivity Mechanism with Concentrated Differential Privacy)&lt;/strong&gt;
Let \(f : \mathcal{X}^* \to \mathcal{Y}\) be monotone (&lt;a href=&quot;#monotonicity&quot;&gt;Equation 3&lt;/a&gt;), where \(\mathcal{Y} \subseteq \mathbb{R}\) is finite. Let \(\rho&amp;gt;0\) and \(\beta \in (0,1)\).
Then there exists a \(\rho\)-zCDP mechanism \(M : \mathcal{X}^* \to \mathcal{Y}\) with the following accuracy guarantee.
For all \(x \in \mathcal{X}^*\), we have
\[\mathbb{P}\left[ f(x) \ge M(x) \ge f(x) - \mathsf{DS}_f^{2\tau}(x)  \right] \ge 1 - \beta,\]
where \(\tau = \sqrt{\frac{\log_2|\mathcal{Y}|}{2\rho}} \cdot \log\left(\frac{\log_2|\mathcal{Y}|}{\beta}\right) \).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Comparing Theorems 4 and 5 we see an asymptotic improvement in the dependence on the size of the output space \(|\mathcal{Y}|\). (This improvement is the benefit of advanced composition.) Theorem 4 gives \(\tau = \Theta(\log|\mathcal{Y}|)\), while Theorem 5 gives \(\tau = \Theta(\sqrt{\log|\mathcal{Y}|} \cdot \log \log |\mathcal{Y}|)\).&lt;sup id=&quot;fnref:6&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;7&lt;/a&gt;&lt;/sup&gt;
In exchange, Theorem 4 gives a pure differential privacy guarantee (i.e. \((\varepsilon,\delta)\)-DP with \(\delta=0\)), while Theorem 5 gives a concentrated differential privacy guarantee, which can be translated to approximate differential privacy (i.e. \((\varepsilon,\delta)\)-DP with \(\delta&amp;gt;0\)).&lt;/p&gt;

&lt;p&gt;We can actually do even better than binary search!
The problem we’re solving with binary search is an instance of the &lt;em&gt;generalized interior point problem&lt;/em&gt; [&lt;a href=&quot;http://www.thomas-steinke.net/tcdp.pdf&quot; title=&quot;Mark Bun, Cynthia Dwork, Guy N. Rothblum, Thomas Steinke. Composable and Versatile Privacy via Truncated CDP. STOC 2018.&quot;&gt;BDRS18&lt;/a&gt;] (which is essentially the same as &lt;em&gt;quasi-concave optimization&lt;/em&gt; [&lt;a href=&quot;https://arxiv.org/abs/2211.06387&quot; title=&quot;Edith Cohen, Xin Lyu, Jelani Nelson, Tamás Sarlós, Uri Stemmer. Õptimal Differentially Private Learning of Thresholds and Quasi-Concave Optimization. STOC 2023.&quot;&gt;CLNSS23&lt;/a&gt;]).
This problem and its variants have been extensively studied in the context of private learning [&lt;a href=&quot;https://arxiv.org/abs/1407.2674&quot; title=&quot;Amos Beimel, Kobbi Nissim, Uri Stemmer. Private Learning and Sanitization: Pure vs. Approximate Differential Privacy. APPROX/RANDOM 2013.&quot;&gt;BNS13&lt;/a&gt;,&lt;a href=&quot;https://arxiv.org/abs/1504.07553&quot; title=&quot;Mark Bun, Kobbi Nissim, Uri Stemmer, Salil Vadhan. Differentially Private Release and Learning of Threshold Functions. FOCS 2015.&quot;&gt;BNSV15&lt;/a&gt;,etc.].
The upshot is that, under \((\varepsilon,\delta)\)-differential privacy, we can achieve the same result as Theorems 4 and 5 with \(\tau = \frac{\log(1/\delta)}{\varepsilon} \cdot 2^{O(\log^* |\mathcal{Y}|)}\), where \(\log^*\) denotes the &lt;a href=&quot;https://en.wikipedia.org/wiki/Iterated_logarithm&quot;&gt;iterated logarithm&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Theorem 6. (Shifted Inverse Sensitivity Mechanism with Approximate Differential Privacy)&lt;/strong&gt;
Let \(f : \mathcal{X}^* \to \mathcal{Y}\) be monotone (&lt;a href=&quot;#monotonicity&quot;&gt;Equation 3&lt;/a&gt;), where \(\mathcal{Y} \subseteq \mathbb{R}\) is finite. Let \(\varepsilon&amp;gt;0\) and \(\delta \in (0,0.1)\).
Then there exists an \((\varepsilon,\delta)\)-differentially private mechanism \(M : \mathcal{X}^* \to \mathcal{Y}\) with the following accuracy guarantee.
For all \(x \in \mathcal{X}^*\), we have
\[\mathbb{P}\left[ f(x) \ge M(x) \ge f(x) - \mathsf{DS}_f^{2\tau}(x)  \right] \ge \frac{9}{10},\]
where \(\tau = \frac{\log(1/\delta)}{\varepsilon} \cdot 2^{O(\log^* |\mathcal{Y}|)}\).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The iterated logarithm is an unbelievably slow-growing function. Thus Theorem 6 improves on Theorems 4 and 5 in terms of the dependence on \(|\mathcal{Y}|\). However, the dependence on \(\delta\) is worse than Theorem 5 (\(\tau=\Theta(\log(1/\delta))\) versus \(\tau=\Theta(\sqrt{\log(1/\delta)})\)). (Theorem 4 achieves \(\delta=0\).)&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;In this post we’ve covered the shifted inverse sensitivity mechanism of Fang, Dong, and Yi [&lt;a href=&quot;https://cse.hkust.edu.hk/~yike/ShiftedInverse.pdf&quot; title=&quot;Juanru Fang, Wei Dong, Ke Yi. Shifted Inverse: A General Mechanism for Monotonic Functions under User Differential Privacy. CCS 2022.&quot;&gt;FDY22&lt;/a&gt;], as well as some extensions.&lt;/p&gt;

&lt;p&gt;The key takeaway is that we can privately approximate a monotone function with error scaling with the down sensitivity. This is particularly interesting in settings where the local and global sensitivities are large.
Down sensitivity is an appealing notion because it is entirely defined by the “real” dataset; its definition (&lt;a href=&quot;#downsensitivity&quot;&gt;Equation 1&lt;/a&gt;) does not consider hypothetical data items that aren’t in the dataset.&lt;/p&gt;

&lt;p&gt;Fang, Dong, and Yi [&lt;a href=&quot;https://cse.hkust.edu.hk/~yike/ShiftedInverse.pdf&quot; title=&quot;Juanru Fang, Wei Dong, Ke Yi. Shifted Inverse: A General Mechanism for Monotonic Functions under User Differential Privacy. CCS 2022.&quot;&gt;FDY22&lt;/a&gt;] show that the shifted inverse sensitivity mechanism attains strong instance optimality guarantees. In other words, up to logarithmic factors, no differentially private mechanism can achieve better error guarantees.&lt;/p&gt;

&lt;p&gt;We can view the shifted inverse sensitivity mechanism as a reduction. It reduces the task of approximating a monotone function to a problem akin to approximating the median. (More precisely, it reduces it to a generalized interior point problem.) We think this is a neat addition to the toolkit of differentially private algorithms.&lt;/p&gt;

&lt;hr /&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;We emphasize that user-level differential privacy is not an alternative privacy definition, rather it is the standard definition of differential privacy with a data schema allowing multiple data items per person. In contrast, most of the differential privacy literature assumes a one-to-one correspondence between people and data items. Note that we prefer the terminology “person”/”people” rather than “user”/”users.” The “user” terminology is specific to the tech industry and may be confusing in other contexts; e.g., in the context of the US Census Bureau, “users” are the entities (such as government agencies) that use data provided by the bureau, rather than the people whose data the bureau collects. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;The name “down sensitivity” is due to Raskhodnikova and Smith [&lt;a href=&quot;https://cs-people.bu.edu/sofya/pubs/GraphPrivacyEncyclopedia.pdf&quot; title=&quot;Sofya Raskhodnikova, Adam Smith. Differentially Private Analysis of Graphs. Encyclopedia of Algorithms, 2015.&quot;&gt;RS15&lt;/a&gt;]. The name &lt;em&gt;local empirical sensitivity&lt;/em&gt; has also been used [&lt;a href=&quot;https://arxiv.org/abs/1304.4795&quot; title=&quot;Shixi Chen, Shuigeng Zhou. Recursive mechanism: towards node differential privacy and unrestricted joins. SIGMOD 2013.&quot;&gt;CZ13&lt;/a&gt;]. The \(k\)-down sensitivity should not be confused with the down sensitivity at distance \(k\), which is defined by \(\mathsf{DS}_f^{(k)}(x) := \sup \{ \mathsf{DS}_f^1(x’) : \mathrm{dist}(x,x’) \le k \}\). Note that \(\mathsf{DS}_f^k(x) \le k \cdot \mathsf{DS}_f^{(k-1)}(x)\). &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;The finiteness assumption can be relaxed somewhat, but we do need some kind of constraint on the output space to ensure utility. The surjectivity assumption simply ensures that the loss is always finite; alternatively we could allow the loss to take the value infinity. Note that we define \(\mathcal{X}^* := \bigcup_{n=0}^\infty \mathcal{X}^n\) to be the set of all finite tuples of elements in \(\mathcal{X}\); we use subset notation \(x’ \subseteq x \) to denote that \(x’\) can be obtained by removing elements from \(x\) (and potentially permuting). &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Alas, there is a technical issue we need to deal with in order to apply the exponential mechanism: The loss function is far from continuous, so there may not exist any \(y\) such that \(|\ell(x,y)-\tau|&amp;lt;\tau\). For example, computing the maximum of the dataset \(x=(1,1,\cdots,1)\) gives a loss function with \(\ell(x,y)=0\) for all \(y \ge 1\) and \(\ell(x,y)=n\) for all \(y &amp;lt; 1\); i.e., no \(y\) gives \(0&amp;lt;\ell(x,y)&amp;lt;n\). The way we fix this issue is as follows. Observe that we can decompose \(|\ell(x,y)-\tau|=\max\{\ell(x,y)-\tau,\tau-\ell(x,y)\}\). Now we define a slightly different loss function: \[\overline{\ell}(x,y) := \min\{ \mathrm{dist}(x,\tilde{x}) : \tilde{x} \subseteq x, f(\tilde{x}) &amp;lt; y \}. \tag{A}\] Equation A defining \(\overline{\ell}(x,y)\) differs from &lt;a href=&quot;#loss&quot;&gt;Equation 4&lt;/a&gt; defining \(\ell(x,y)\) only in that we replace “\(\le\)” with “\(&amp;lt;\)”. The modified loss \(\overline\ell\) still has low sensitivity; the proof is identical to that of Proposition 3. Now we can run the exponential mechanism with the loss \[\ell^*(x,y) := \max\{\ell(x,y)-\tau,\tau-\overline{\ell}(x,y)\}. \tag{B}\] This loss has low sensitivity and, for \(\hat{y} = \min\{f(\tilde{x}):\tilde{x}\subseteq x, \mathrm{dist}(x,\tilde{x})\le\tau\}\), we have \(\ell(x,\hat{y})\le\tau\) and \(\overline{\ell}(x,\hat{y})&amp;gt;\tau\), which implies \(\ell^*(x,\hat{y}) \le 0\). Thus we can use \(\ell^*(x,y)\) in place of \(|\ell(x,y)-\tau|\) to fix this technical issue. Setting \(\tau=\left\lceil\frac{2}{\varepsilon}\log\left(\frac{|\mathcal{Y}|}{\beta}\right)\right\rceil\) and running the exponential mechanism with loss \(\ell^*\) yields Theorem 4. Specifically, the guarantee of the exponential mechanism is \(\mathbb{P}\left[ \ell^*(x,M(x)) &amp;lt; \frac{2}{\varepsilon}\log\left(\frac{|\mathcal{Y}|}{\beta}\right)\right]\ge 1-\beta\). 
Then \(\tau-\overline{\ell}(x,M(x))&amp;lt; \frac{2}{\varepsilon}\log\left(\frac{|\mathcal{Y}|}{\beta}\right)\) implies \(\overline{\ell}(x,M(x))&amp;gt;0\), which implies \(M(x)\le f(x)\). Similarly, \(\ell(x,M(x))-\tau &amp;lt; \frac{2}{\varepsilon}\log\left(\frac{|\mathcal{Y}|}{\beta}\right)\) implies \(\ell(x,M(x))&amp;lt;2\tau\), which implies that \(M(x) \ge f(\tilde{x})\) for some \(\tilde{x}\subseteq x\) with \(\mathrm{dist}(x,\tilde{x})&amp;lt;2\tau\); by the definition of down sensitivity, \(|f(x)-f(\tilde{x})| \le \mathsf{DS}_f^{2\tau}(x)\) and so \(M(x) \ge f(\tilde{x}) \ge f(x) - \mathsf{DS}_f^{2\tau}(x)\), as required. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:4:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
&lt;p&gt;To the best of our knowledge, differentially private binary search was first proposed by Blum, Ligett, and Roth [&lt;a href=&quot;https://arxiv.org/abs/1109.2229&quot; title=&quot;Avrim Blum, Katrina Ligett, Aaron Roth. A Learning Theory Approach to Non-Interactive Database Privacy. STOC 2008.&quot;&gt;BLR08&lt;/a&gt;]. This algorithmic idea has been used in various other papers [e.g., &lt;a href=&quot;https://arxiv.org/abs/1604.04618&quot; title=&quot;Mark Bun, Thomas Steinke, Jonathan Ullman. Make Up Your Mind: The Price of Online Queries in Differential Privacy. SODA 2017.&quot;&gt;BSU17&lt;/a&gt;,&lt;a href=&quot;https://arxiv.org/abs/1706.05069&quot; title=&quot;Vitaly Feldman, Thomas Steinke. Generalization for Adaptively-chosen Estimators via Stable Median. COLT 2017.&quot;&gt;FS17&lt;/a&gt;,&lt;a href=&quot;https://arxiv.org/abs/2106.10333&quot; title=&quot;Joerg Drechsler, Ira Globus-Harris, Audra McMillan, Jayshree Sarathy, Adam Smith. Non-parametric Differentially Private Confidence Intervals for the Median. 2021.&quot;&gt;DGMSS21&lt;/a&gt;]. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:b&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Note that we can also use Gaussian noise instead of Laplace noise. This would yield a slightly better accuracy guarantee for the same concentrated differential privacy guarantee. Specifically, this would give \(\tau = O\left(\sqrt{\frac1\rho \cdot \log |\mathcal{Y}| \cdot \log \left( \frac{\log | \mathcal{Y} |}{\beta}\right)}\right)\). &lt;a href=&quot;#fnref:b&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;We can shave the loglog term in Theorem 5 to get \(\tau = \Theta(\sqrt{\log|\mathcal{Y}|})\) either by using a noise-tolerant version of binary search [&lt;a href=&quot;https://www.cs.cornell.edu/~rdk/papers/karpr2.pdf&quot; title=&quot;Richard M. Karp, Robert Kleinberg. Noisy binary search and its applications. SODA 2007.&quot;&gt;KK07&lt;/a&gt;] or by using non-independent noise [&lt;a href=&quot;https://journalprivacyconfidentiality.org/index.php/jpc/article/view/648/631&quot; title=&quot;Thomas Steinke, Jonathan Ullman. Between Pure and Approximate Differential Privacy. JPC 2016&quot;&gt;SU15&lt;/a&gt;,&lt;a href=&quot;https://arxiv.org/abs/2010.01457&quot; title=&quot;Arun Ganesh, Jiazheng Zhao. Privately Answering Counting Queries with Generalized Gaussian Mechanisms. 2020.&quot;&gt;GZ20&lt;/a&gt;,&lt;a href=&quot;https://arxiv.org/abs/2012.09116&quot; title=&quot;Badih Ghazi, Ravi Kumar, Pasin Manurangsi. On Avoiding the Union Bound When Answering Multiple Differentially Private Queries. COLT 2021.&quot;&gt;GKM21&lt;/a&gt;,&lt;a href=&quot;https://arxiv.org/abs/2012.03817&quot; title=&quot;Yuval Dagan, Gil Kur. A bounded-noise mechanism for differential privacy. COLT 2022.&quot;&gt;DK22&lt;/a&gt;]. &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <author>
        
            <name>Thomas Steinke</name>
        
        </author>
        <pubDate>Tue, 12 Sep 2023 10:00:00 -0700</pubDate>
        <link>https://differentialprivacy.org/down-sensitivity/</link>
        <guid isPermaLink="true">https://differentialprivacy.org/down-sensitivity/</guid>
      </item>
    
      <item>
        <title>Beyond Global Sensitivity via Inverse Sensitivity</title>
        <description>&lt;p&gt;The most well-known and widely-used method for achieving differential privacy is to compute the true function value \(f(x)\) and then add Laplace or Gaussian noise scaled to the &lt;em&gt;global sensitivity&lt;/em&gt; of \(f\). 
This may be overly conservative. In this post we’ll show how we can do better.&lt;/p&gt;

&lt;p&gt;The global sensitivity of a function \(f : \mathcal{X}^* \to \mathbb{R}\) is defined by \[ \mathsf{GS}_f := \sup_{x,x’\in\mathcal{X}^* : \mathrm{dist}(x,x’) \le 1} |f(x)-f(x’)|, \tag{1}\] where \(\mathrm{dist}(x,x’)\le 1\) denotes that \(x\) and \(x’\) are neighbouring datasets (i.e. they differ only by the addition, removal, or replacement of one person’s data); more generally, \(\mathrm{dist}(\cdot,\cdot)\) is the corresponding metric on datasets (i.e., Hamming distance).&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;The global sensitivity considers datasets that have nothing to do with the dataset at hand and which could be completely unrealistic.
Many functions have infinite global sensitivity, but, on reasonably nice datasets, their &lt;em&gt;local sensitivity&lt;/em&gt; is much lower.&lt;/p&gt;

&lt;h2 id=&quot;local-sensitivity&quot;&gt;Local Sensitivity&lt;/h2&gt;

&lt;p&gt;The \(k\)-local sensitivity&lt;sup id=&quot;fnref:a&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:a&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; of a function \(f : \mathcal{X}^* \to \mathbb{R}\) at \(x \in \mathcal{X}^*\) is defined by \[\mathsf{LS}^k_f(x) := \sup_{x’\in\mathcal{X}^* : \mathrm{dist}(x,x’) \le k} |f(x)-f(x’)|. \tag{2}\]
Often, we fix \(k=1\) and we may drop the superscript: \(\mathsf{LS}_f(x) := \mathsf{LS}_f^1(x)\).
Note that the local sensitivity is always at most the global sensitivity: \(\mathsf{LS}_f^k(x) \le k \cdot \mathsf{GS}_f\).&lt;/p&gt;

&lt;p&gt;As a concrete example, the median has infinite global sensitivity, but for realistic data the local sensitivity is quite reasonable. 
Specifically, \[\mathsf{LS}^k_{\mathrm{median}}(x_1, \cdots, x_n) = \max\left\{ \left|x_{(\tfrac{n+1}{2})}-x_{(\tfrac{n+1}{2}+k)}\right|, \left|x_{(\tfrac{n+1}{2})}-x_{(\tfrac{n+1}{2}-k)}\right| \right\},\tag{3}\] where \( x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}\) denotes the input in &lt;a href=&quot;https://en.wikipedia.org/wiki/Order_statistic&quot;&gt;sorted order&lt;/a&gt; and \(n\) is assumed to be odd, so, in particular, \(\mathrm{median}(x_1, \cdots, x_n) = x_{(\tfrac{n+1}{2})}\).
For example, if \(X_1, \cdots, X_n\) are i.i.d. samples from a standard Gaussian and \(k \ll n\), then \(\mathsf{LS}^k_{\mathrm{median}}(X_1, \cdots, X_n) \le O(k/n)\) with high probability.&lt;/p&gt;
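&lt;p&gt;Equation 3 is straightforward to evaluate; here is a small sketch (assuming \(n\) is odd and \(k\) is small enough that both order statistics exist):&lt;/p&gt;

```python
def local_sensitivity_median(xs, k):
    """k-local sensitivity of the median (Equation 3). Assumes len(xs)
    is odd and k is small enough that the indices stay in range."""
    xs = sorted(xs)
    m = (len(xs) + 1) // 2 - 1  # 0-indexed position of the median
    return max(xs[m + k] - xs[m], xs[m] - xs[m - k])
```

&lt;p&gt;On the dataset \((1,2,3,4,100)\) the 1-local sensitivity is \(1\), even though the outlier makes the 2-local sensitivity \(97\).&lt;/p&gt;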

&lt;h2 id=&quot;using-local-sensitivity&quot;&gt;Using Local Sensitivity&lt;/h2&gt;

&lt;p&gt;Intuitively, the local sensitivity is the “real” sensitivity of the function and the global sensitivity is only a worst-case upper bound.
Thus it seems natural to add noise scaled to the local sensitivity instead of the global sensitivity.&lt;/p&gt;

&lt;p&gt;Unfortunately, naïvely adding noise scaled to local sensitivity doesn’t satisfy differential privacy. 
The problem is that the local sensitivity itself can reveal information.
For example, consider the median on the inputs \(x=(1,2,2),x’=(2,2,2)\). The output distributions of the algorithm on these two inputs must be similar.
In both cases the median is \(2\), so that is a good start for ensuring that the distributions are similar. 
But the local sensitivity is different: \(\mathsf{LS}^1_{\mathrm{median}}(x)=1\) versus \(\mathsf{LS}^1_{\mathrm{median}}(x’)=0\). 
So, if we add noise scaled to local sensitivity, then, on input \(x’\), we deterministically output \(2\), while, on input \(x\), we output a random number. If we use continuous Laplace or Gaussian noise, then the random number will be a non-integer almost surely. Thus the output perfectly distinguishes the two inputs, which is a catastrophic violation of differential privacy.&lt;/p&gt;
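&lt;p&gt;This failure is easy to demonstrate concretely. The following sketch (hypothetical, and emphatically &lt;em&gt;not&lt;/em&gt; differentially private) adds Laplace noise scaled to the 1-local sensitivity of the median:&lt;/p&gt;

```python
import random
import statistics

def naive_mechanism(xs, eps):
    """NOT differentially private: the noise scale itself depends on the
    data. Assumes len(xs) is odd."""
    ys = sorted(xs)
    m = (len(ys) + 1) // 2 - 1
    local_sensitivity = max(ys[m + 1] - ys[m], ys[m] - ys[m - 1])
    scale = local_sensitivity / eps
    # Laplace(scale) noise as a difference of two exponentials.
    noise = scale * (random.expovariate(1.0) - random.expovariate(1.0))
    return statistics.median(xs) + noise

# On x' = (2,2,2) the local sensitivity is 0, so the output is exactly 2.
# On the neighbouring x = (1,2,2) it is 1, so the output is 2 plus
# continuous noise, a non-integer almost surely: observing whether the
# output is an integer perfectly distinguishes the two inputs.
```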

&lt;p&gt;The good news is that we can exploit local sensitivity; we just need to do a bit more work.
In fact, there are many methods in the differential privacy literature to exploit local sensitivity.&lt;/p&gt;

&lt;p&gt;The best-known methods for exploiting local sensitivity are &lt;em&gt;smooth sensitivity&lt;/em&gt; [&lt;a href=&quot;https://cs-people.bu.edu/ads22/pubs/NRS07/NRS07-full-draft-v1.pdf&quot; title=&quot;Kobbi Nissim, Sofya Raskhodnikova, Adam Smith. Smooth Sensitivity and Sampling in Private Data Analysis. STOC 2007.&quot;&gt;NRS07&lt;/a&gt;]&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; and &lt;em&gt;propose-test-release&lt;/em&gt; [&lt;a href=&quot;https://www.stat.cmu.edu/~jinglei/dl09.pdf&quot; title=&quot;Cynthia Dwork, Jing Lei. Differential Privacy and Robust Statistics. STOC 2009.&quot;&gt;DL09&lt;/a&gt;]&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;In this post we will cover a different general-purpose technique. This technique is folklore.&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt; It was first systematically studied by Asi and Duchi [&lt;a href=&quot;https://arxiv.org/abs/2005.10630&quot; title=&quot;Hilal Asi, John Duchi. Near Instance-Optimality in Differential Privacy. 2020.&quot;&gt;AD20&lt;/a&gt;,&lt;a href=&quot;https://papers.nips.cc/paper/2020/hash/a267f936e54d7c10a2bb70dbe6ad7a89-Abstract.html&quot; title=&quot;Hilal Asi, John Duchi. Instance-optimality in differential privacy via approximate inverse sensitivity mechanisms. NeurIPS 2020.&quot;&gt;AD20&lt;/a&gt;], who also named the method the &lt;em&gt;inverse sensitivity mechanism&lt;/em&gt;.&lt;/p&gt;

&lt;h2 id=&quot;the-inverse-sensitivity-mechanism&quot;&gt;The Inverse Sensitivity Mechanism&lt;/h2&gt;

&lt;p&gt;Consider a function \(f : \mathcal{X}^* \to \mathcal{Y}\).
Our goal is to estimate \(f(x)\) in a differentially private manner.
But we do not make any assumptions about the global sensitivity of the function.&lt;/p&gt;

&lt;p&gt;For simplicity we will assume that \(\mathcal{Y}\) is finite and that \(f\) is &lt;a href=&quot;https://en.wikipedia.org/wiki/Surjective_function&quot;&gt;surjective&lt;/a&gt;.&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Now we define a loss function \(\ell : \mathcal{X}^* \times \mathcal{Y} \to \mathbb{Z}_{\ge0}\) by \[\ell(x,y) := \min\left\{ \mathrm{dist}(x,\tilde{x}) : \tilde{x}\in\mathcal{X}^*, f(\tilde{x})=y \right\}.\tag{4}\]
In other words, \(\ell(x,y)\) measures how many entries of \(x\) we need to add or remove to reach a dataset \(\tilde{x}\) with \(f(\tilde{x})=y\). 
Yet another way to think of it is that \(\ell(x,y)\) is the distance from the point \(x\) to the set \(f^{-1}(y)\). (Hence the name inverse sensitivity.)&lt;/p&gt;

&lt;p&gt;The loss is minimized by the desired answer: \(\ell(x,f(x))=0\). Intuitively, the loss \(\ell(x,y)\) increases as \(y\) moves further from \(f(x)\). So approximately minimizing this loss should produce a good approximation to \(f(x)\), as desired.&lt;/p&gt;

&lt;p&gt;The trick is that this loss always has bounded global sensitivity – i.e., \(\mathsf{GS}_\ell \le 1\) – no matter what the sensitivity of \(f\) is!&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Lemma 1.&lt;/strong&gt; Let \(f : \mathcal{X}^* \to \mathcal{Y}\) be arbitrary and define \(\ell : \mathcal{X}^* \times \mathcal{Y} \to \mathbb{Z}_{\ge0}\) as in Equation 4. Then, for all \(x,x’\in\mathcal{X}^*\) with \(\mathrm{dist}(x,x’)\le 1\) and all \(y \in \mathcal{Y}\), we have \(|\ell(x,y)-\ell(x’,y)|\le 1\).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;Proof.&lt;/em&gt; 
Fix \(x,x’\in\mathcal{X}^*\) with \(\mathrm{dist}(x,x’)\le 1\) and \(y \in \mathcal{Y}\).
Let \(\widehat{x} \in\mathcal{X}^*\) satisfy \(\ell(x,y)=\mathrm{dist}(x,\widehat{x})\) and \(f(\widehat{x})=y\).
By definition, \[\ell(x’,y) = \min\left\{ \mathrm{dist}(x’,\tilde{x}) : f(\tilde{x})=y \right\} \le \mathrm{dist}(x’,\widehat{x}).\]
By the triangle inequality, \[\mathrm{dist}(x’,\widehat{x}) \le \mathrm{dist}(x’,x)+\mathrm{dist}(x,\widehat{x}) \le 1 + \ell(x,y).\]
Thus \(\ell(x’,y) \le \ell(x,y)+1\) and, by symmetry, \(\ell(x,y) \le \ell(x’,y)+1\), as required. ∎&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This means that we can run the exponential mechanism [&lt;a href=&quot;https://ieeexplore.ieee.org/document/4389483&quot; title=&quot;Frank McSherry, Kunal Talwar. Mechanism Design via Differential Privacy. FOCS 2007.&quot;&gt;MT07&lt;/a&gt;] to select from \(\mathcal{Y}\) using the loss \(\ell\).&lt;sup id=&quot;fnref:6&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;7&lt;/a&gt;&lt;/sup&gt; That is, the inverse sensitivity mechanism is defined by 
\[\forall y \in \mathcal{Y} ~~~~~ \mathbb{P}[M(x)=y] := \frac{\exp\left(-\frac{\varepsilon}{2}\ell(x,y)\right)}{\sum_{y’\in\mathcal{Y}}\exp\left(-\frac{\varepsilon}{2}\ell(x,y’)\right)}.\tag{5}\] 
By the properties of the exponential mechanism and Lemma 1, \(M\) satisfies differential privacy:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Theorem 2. (Privacy of the Inverse Sensitivity Mechanism)&lt;/strong&gt; Let \(M : \mathcal{X}^* \to \mathcal{Y}\) be as defined in Equation 5 with the loss from Equation 4. Then \(M\) satisfies \(\varepsilon\)-differential privacy (&lt;a href=&quot;/exponential-mechanism-bounded-range/&quot;&gt;and \(\frac18\varepsilon^2\)-zCDP&lt;/a&gt;).&lt;/p&gt;
&lt;/blockquote&gt;
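&lt;p&gt;Sampling from Equation 5 is straightforward to sketch; here &lt;code&gt;loss&lt;/code&gt; stands for an oracle computing \(\ell(x,\cdot)\):&lt;/p&gt;

```python
import math
import random

def inverse_sensitivity_mechanism(loss, ys, eps):
    """Sample y with probability proportional to exp(-(eps/2) * loss(y)),
    i.e. Equation 5. Subtracting the minimum loss before exponentiating
    keeps the weights numerically stable; it cancels in the normalization,
    so the sampled distribution is unchanged."""
    losses = [loss(y) for y in ys]
    lo = min(losses)
    weights = [math.exp(-0.5 * eps * (l - lo)) for l in losses]
    r = random.uniform(0.0, sum(weights))
    for y, w in zip(ys, weights):
        r -= w
        if r <= 0.0:
            return y
    return ys[-1]  # guard against floating-point round-off
```

&lt;p&gt;Note that this sketch touches every element of \(\mathcal{Y}\); as discussed below, computing the loss efficiently is the hard part in general.&lt;/p&gt;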

&lt;h2 id=&quot;utility-guarantee&quot;&gt;Utility Guarantee&lt;/h2&gt;

&lt;p&gt;The privacy guarantee of the inverse sensitivity mechanism is easy and, in particular, it doesn’t depend on the properties of \(f\).
This means that the utility will need to depend on the properties of \(f\).&lt;/p&gt;

&lt;p&gt;By the standard properties of the exponential mechanism, we can guarantee that the output has low loss:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Lemma 3.&lt;/strong&gt; Let \(M : \mathcal{X}^* \to \mathcal{Y}\) be as defined in Equation 5 with the loss from Equation 4. For all inputs \(x \in \mathcal{X}^*\) and all \(\beta\in(0,1)\), we have \[\mathbb{P}\left[\ell(x,M(x)) &amp;lt; \frac2\varepsilon\log\left(\frac{|\mathcal{Y}|}{\beta}\right) \right] \ge 1-\beta.\tag{6}\]&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;Proof.&lt;/em&gt;
Let \(B_x = \left\{ y \in \mathcal{Y} : \ell(x,y) \ge \frac2\varepsilon\log\left(\frac{|\mathcal{Y}|}{\beta}\right) \right\}\) be the subset of \(\mathcal{Y}\) with high loss.
Then \[ \mathbb{P}[M(x)\in B_x] = \frac{\sum_{y \in B_x} \exp\left(-\frac{\varepsilon}{2}\ell(x,y)\right)}{\sum_{y’\in\mathcal{Y}}\exp\left(-\frac{\varepsilon}{2}\ell(x,y’)\right)} \]\[ \le \frac{|B_x| \cdot \exp\left(-\frac{\varepsilon}{2}\frac2\varepsilon\log\left(\frac{|\mathcal{Y}|}{\beta}\right) \right)}{\exp\left(-\frac{\varepsilon}{2}\ell(x,f(x))\right)}\]\[= \frac{|B_x| \cdot \frac{\beta}{|\mathcal{Y}|}}{1} \le \beta, \] as required. ∎&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now we need to translate this loss bound into something easier to interpret – local sensitivity.&lt;/p&gt;

&lt;p&gt;Suppose \(y \gets M(x)\). Then we have some loss \(k=\ell(x,y)\). What this means is that there exists \(\tilde{x}\in\mathcal{X}^*\) with \(f(\tilde{x})=y\) and \(\mathrm{dist}(x,\tilde{x})\le k\). By the definition of local sensitivity, \(|f(x)-y| = |f(x)-f(\tilde{x})| \le \mathsf{LS}_f^k(x)\). This means we can translate the loss guarantee of Lemma 3 into an accuracy guarantee in terms of local sensitivity:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Theorem 4. (Utility of the Inverse Sensitivity Mechanism)&lt;/strong&gt; Let \(M : \mathcal{X}^* \to \mathcal{Y}\) be as defined in Equation 5 with the loss from Equation 4. For all inputs \(x \in \mathcal{X}^*\) and all \(\beta\in(0,1)\), we have \[\mathbb{P}\left[\left|M(x)-f(x)\right| \le \mathsf{LS}_f^k(x) \right] \ge 1-\beta,\tag{7}\] where \(k=\left\lfloor\frac2\varepsilon\log\left(\frac{|\mathcal{Y}|}{\beta}\right)\right\rfloor\).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We can tie this back to our concrete example of the median. Per Equation 3, \[\mathsf{LS}^k_{\mathrm{median}}(x_1, \cdots, x_n) \le \left|x_{(\tfrac{n+1}{2}+k)}-x_{(\tfrac{n+1}{2}-k)}\right| .\]
Thus the error guarantee of Theorem 4 for the median would scale with the spread of the data. E.g., if \(k=\tfrac{n+1}{4}\), then  \(\mathsf{LS}^k_{\mathrm{median}}(x_1, \cdots, x_n)\) is at most the interquartile range of the data.&lt;/p&gt;

&lt;p&gt;How does this compare with the usual global sensitivity approach?
The \(\varepsilon\)-differentially private Laplace mechanism is given by \(\widehat{M}(x):=f(x)+\mathsf{Laplace}(\mathsf{GS}_f/\varepsilon)\). For all \(x \in \mathcal{X}^*\) and all \(\beta\in(0,1)\), we have the utility guarantee \[\mathbb{P}\left[\left|\widehat{M}(x)-f(x)\right| \le \mathsf{GS}_f \cdot \frac1\varepsilon \log\left(\frac{1}{\beta}\right) \right] \ge 1-\beta.\tag{8}\]
Comparing Equations 7 and 8, we see that neither guarantee dominates the other. On one hand, the local sensitivity can be much smaller than the global sensitivity. On the other hand, we pick up a dependence on \(\log|\mathcal{Y}|\). In particular, in the worst case where the local sensitivity matches the global sensitivity \(\mathsf{LS}_f^k(x)=k\cdot\mathsf{GS}_f\), the inverse sensitivity mechanism is worse by a factor of \[\frac{\mathsf{LS}_f^k(x)}{\mathsf{GS}_f \cdot \frac1\varepsilon \log\left(\frac{1}{\beta}\right)} \approx 2\frac{\log|\mathcal{Y}|}{\log(1/\beta)}+2.\tag{9}\]
Hence the inverse sensitivity mechanism is most useful in situations where the local sensitivity is significantly smaller than the global sensitivity.&lt;/p&gt;
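&lt;p&gt;For reference, the baseline can be sketched as follows; the error radius comes from the exact two-sided Laplace tail \(\mathbb{P}[|\mathsf{Laplace}(b)|&amp;gt;t]=e^{-t/b}\):&lt;/p&gt;

```python
import math
import random

def laplace_mechanism(value, global_sensitivity, eps):
    """Standard eps-DP Laplace mechanism: add Laplace(GS_f/eps) noise,
    sampled as a difference of two exponentials."""
    scale = global_sensitivity / eps
    return value + scale * (random.expovariate(1.0) - random.expovariate(1.0))

def laplace_error_radius(global_sensitivity, eps, beta):
    """Radius t with P(|Laplace(GS_f/eps)| > t) = beta, using the exact
    two-sided tail P(|Laplace(b)| > t) = exp(-t/b)."""
    return (global_sensitivity / eps) * math.log(1.0 / beta)
```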

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;In this post we’ve covered the inverse sensitivity mechanism. We showed that it is private regardless of the sensitivity of the function \(f\), and that its error guarantees scale with the local sensitivity of \(f\) rather than its global sensitivity.&lt;/p&gt;

&lt;p&gt;The inverse sensitivity mechanism is a simple demonstration that there is more to differential privacy than simply adding noise scaled to global sensitivity; there are many more techniques in the literature.&lt;/p&gt;

&lt;p&gt;The inverse sensitivity mechanism has two main limitations. First, it is, in general, not computationally efficient. Computing the loss function is intractable for an arbitrary \(f\) (but can be done efficiently for several examples like the median and variants of principal component analysis and linear regression [&lt;a href=&quot;https://papers.nips.cc/paper/2020/hash/a267f936e54d7c10a2bb70dbe6ad7a89-Abstract.html&quot; title=&quot;Hilal Asi, John Duchi. Instance-optimality in differential privacy via approximate inverse sensitivity mechanisms. NeurIPS 2020.&quot;&gt;AD20&lt;/a&gt;]). Second, the \(\log|\mathcal{Y}|\) term in the accuracy guarantee is problematic when the output space is large, such as when we have high-dimensional outputs. 
While there are other techniques that can be used instead of inverse sensitivity, they suffer from some of the same limitations. Thus finding ways around these limitations is an &lt;a href=&quot;/colt23-bsp/&quot;&gt;active research topic&lt;/a&gt; [&lt;a href=&quot;https://arxiv.org/abs/1905.13229&quot; title=&quot;Mark Bun, Gautam Kamath, Thomas Steinke, Zhiwei Steven Wu. Private Hypothesis Selection. NeurIPS 2019.&quot;&gt;BKSW19&lt;/a&gt;,&lt;a href=&quot;https://cse.hkust.edu.hk/~yike/ShiftedInverse.pdf&quot; title=&quot;Juanru Fang, Wei Dong, Ke Yi. Shifted Inverse: A General Mechanism for Monotonic Functions under User Differential Privacy. CCS 2022.&quot;&gt;FDY22&lt;/a&gt;,&lt;a href=&quot;https://arxiv.org/abs/2212.05015&quot; title=&quot;Samuel B. Hopkins, Gautam Kamath, Mahbod Majid, Shyam Narayanan. Robustness Implies Privacy in Statistical Estimation. STOC 2023.&quot;&gt;HKMN23&lt;/a&gt;,&lt;a href=&quot;https://arxiv.org/abs/2301.07078&quot; title=&quot;John Duchi, Saminul Haque, Rohith Kuditipudi. A Fast Algorithm for Adaptive Private Mean Estimation. COLT 2023.&quot;&gt;DHK23&lt;/a&gt;,&lt;a href=&quot;https://arxiv.org/abs/2301.12250&quot; title=&quot;Gavin Brown, Samuel B. Hopkins, Adam Smith. Fast, Sample-Efficient, Affine-Invariant Private Mean and Covariance Estimation for Subgaussian Distributions. COLT 2023.&quot;&gt;BHS23&lt;/a&gt;,&lt;a href=&quot;https://arxiv.org/abs/2302.01855&quot; title=&quot;Hilal Asi, Jonathan Ullman, Lydia Zakynthinou. From Robustness to Privacy and Back. 2023.&quot;&gt;AUZ23&lt;/a&gt;].&lt;/p&gt;

&lt;p&gt;The inverse sensitivity mechanism’s accuracy can be shown to be instance-optimal up to logarithmic factors [&lt;a href=&quot;https://arxiv.org/abs/2005.10630&quot; title=&quot;Hilal Asi, John Duchi. Near Instance-Optimality in Differential Privacy. 2020.&quot;&gt;AD20&lt;/a&gt;,&lt;a href=&quot;https://papers.nips.cc/paper/2020/hash/a267f936e54d7c10a2bb70dbe6ad7a89-Abstract.html&quot; title=&quot;Hilal Asi, John Duchi. Instance-optimality in differential privacy via approximate inverse sensitivity mechanisms. NeurIPS 2020.&quot;&gt;AD20&lt;/a&gt;]. That is, up to logarithmic factors, no differentially private mechanism can achieve better error guarantees. 
Up to logarithmic factors, the inverse sensitivity mechanism outperforms other methods for exploiting local sensitivity, namely smooth sensitivity [&lt;a href=&quot;https://cs-people.bu.edu/ads22/pubs/NRS07/NRS07-full-draft-v1.pdf&quot; title=&quot;Kobbi Nissim, Sofya Raskhodnikova, Adam Smith. Smooth Sensitivity and Sampling in Private Data Analysis. STOC 2007.&quot;&gt;NRS07&lt;/a&gt;]&lt;sup id=&quot;fnref:2:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; and propose-test-release [&lt;a href=&quot;https://www.stat.cmu.edu/~jinglei/dl09.pdf&quot; title=&quot;Cynthia Dwork, Jing Lei. Differential Privacy and Robust Statistics. STOC 2009.&quot;&gt;DL09&lt;/a&gt;]&lt;sup id=&quot;fnref:3:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;We leave you with a riddle: What can we do if even the local sensitivity of our function is unbounded? For example, suppose we want to approximate \(f(x) = \max_i x_i\). Surprisingly, there are still things we can do; see &lt;a href=&quot;/down-sensitivity/&quot;&gt;our follow-up post&lt;/a&gt;.&lt;/p&gt;

&lt;hr /&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;We define \(\mathcal{X}^* = \bigcup_{n = 0}^\infty \mathcal{X}^n\) to be the set of all input tuples of arbitrary size. The metric \(\mathrm{dist} : \mathcal{X}^* \times \mathcal{X}^* \to \mathbb{R}\) can be arbitrary. E.g. we can allow addition, removal, and/or replacement of an individual’s data. For simplicity, we consider univariate functions here. But the definitions of global and local sensitivity easily extend to vector-valued functions by taking a norm: \[ \mathsf{GS}_f := \sup_{x,x’\in\mathcal{X}^* : \mathrm{dist}(x,x’) \le 1} \|f(x)-f(x’)\|.\] If we use the 2-norm, then this cleanly corresponds to adding spherical Gaussian noise. The 1-norm corresponds to adding independent Laplace noise to the coordinates. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
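&lt;p&gt;A minimal sketch of this calibration (our illustration; the caller supplies the global sensitivity bound, and the Gaussian scale uses the textbook \(\sigma = \mathsf{GS}_f\sqrt{2\ln(1.25/\delta)}/\varepsilon\) constant, valid for \(\varepsilon \le 1\)):&lt;/p&gt;

```python
import numpy as np

def add_noise(f_x, gs, eps, delta=None, rng=None):
    """Noise calibrated to a global sensitivity bound gs (supplied by the caller).

    delta=None: pure eps-DP via Laplace noise, with gs measured in the 1-norm.
    delta given: (eps, delta)-DP via spherical Gaussian noise, with gs in the
    2-norm, using sigma = gs * sqrt(2 ln(1.25/delta)) / eps (valid for eps at
    most 1).
    """
    rng = np.random.default_rng() if rng is None else rng
    f_x = np.asarray(f_x, dtype=float)
    if delta is None:
        return f_x + rng.laplace(scale=gs / eps, size=f_x.shape)
    sigma = gs * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return f_x + rng.normal(scale=sigma, size=f_x.shape)
```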
    &lt;/li&gt;
    &lt;li id=&quot;fn:a&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;The local sensitivity is also known as the &lt;em&gt;local modulus of continuity&lt;/em&gt; [&lt;a href=&quot;https://arxiv.org/abs/2005.10630&quot; title=&quot;Hilal Asi, John Duchi. Near Instance-Optimality in Differential Privacy. 2020.&quot;&gt;AD20&lt;/a&gt;,&lt;a href=&quot;https://papers.nips.cc/paper/2020/hash/a267f936e54d7c10a2bb70dbe6ad7a89-Abstract.html&quot; title=&quot;Hilal Asi, John Duchi. Instance-optimality in differential privacy via approximate inverse sensitivity mechanisms. NeurIPS 2020.&quot;&gt;AD20&lt;/a&gt;]. Note that this should not be confused with the local sensitivity at distance \(k\) [&lt;a href=&quot;https://cs-people.bu.edu/ads22/pubs/NRS07/NRS07-full-draft-v1.pdf&quot; title=&quot;Kobbi Nissim, Sofya Raskhodnikova, Adam Smith. Smooth Sensitivity and Sampling in Private Data Analysis. STOC 2007.&quot;&gt;NRS07&lt;/a&gt;], which is defined by \(\sup \{ \mathsf{LS}_f^1(x’) : \mathrm{dist}(x,x’) \le k \}\). &lt;a href=&quot;#fnref:a&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Briefly, smooth sensitivity is an upper bound on the local sensitivity which itself has low sensitivity in a multiplicative sense. That is, \(\mathsf{LS}_f^1(x) \le \mathsf{SS}_f^t(x)\) and \(\mathsf{SS}_f^t(x) \le e^t \cdot \mathsf{SS}_f^t(x’) \) for neighbouring \(x,x’\). This suffices to ensure that we can add noise scaled to \(\mathsf{SS}_f^t(x)\). However, that noise usually needs to be more heavy-tailed than for global sensitivity [&lt;a href=&quot;https://proceedings.neurips.cc/paper/2019/hash/3ef815416f775098fe977004015c6193-Abstract.html&quot; title=&quot;Mark Bun, Thomas Steinke. Average-Case Averages: Private Algorithms for Smooth Sensitivity and Mean Estimation. NeurIPS 2019.&quot;&gt;BS19&lt;/a&gt;]. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:2:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Roughly, the propose-test-release framework computes an upper bound on the local sensitivity in a differentially private manner and then uses this upper bound as the noise scale. (We hope to give more detail about both propose-test-release and smooth sensitivity in future posts.) &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:3:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Properly attributing the inverse sensitivity mechanism is difficult. The earliest published instances of the inverse sensitivity mechanism of which we are aware are from 2011 and 2013 [&lt;a href=&quot;https://www.cs.columbia.edu/~rwright/Publications/pods11.pdf&quot; title=&quot;Darakhshan Mir, S. Muthukrishnan, Aleksandar Nikolov, Rebecca N. Wright. Pan-private algorithms via statistics on sketches. PODS 2011.&quot;&gt;MMNW11&lt;/a&gt;§3.1,&lt;a href=&quot;https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4681528/&quot; title=&quot;Aaron Johnson, Vitaly Shmatikov. Privacy-preserving data exploration in genome-wide association studies. KDD 2013.&quot;&gt;JS13&lt;/a&gt;§5]; but this was not novel even then. Asi and Duchi [&lt;a href=&quot;https://arxiv.org/abs/2005.10630&quot; title=&quot;Hilal Asi, John Duchi. Near Instance-Optimality in Differential Privacy. 2020.&quot;&gt;AD20&lt;/a&gt;§1.2] state that McSherry and Talwar [&lt;a href=&quot;https://ieeexplore.ieee.org/document/4389483&quot; title=&quot;Frank McSherry, Kunal Talwar. Mechanism Design via Differential Privacy. FOCS 2007.&quot;&gt;MT07&lt;/a&gt;] considered it in 2007. In any case, the name we use was coined in 2020 [&lt;a href=&quot;https://arxiv.org/abs/2005.10630&quot; title=&quot;Hilal Asi, John Duchi. Near Instance-Optimality in Differential Privacy. 2020.&quot;&gt;AD20&lt;/a&gt;]. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Assuming that the output space \(\mathcal{Y}\) is finite is a significant assumption. While it can be relaxed a bit [&lt;a href=&quot;https://papers.nips.cc/paper/2020/hash/a267f936e54d7c10a2bb70dbe6ad7a89-Abstract.html&quot; title=&quot;Hilal Asi, John Duchi. Instance-optimality in differential privacy via approximate inverse sensitivity mechanisms. NeurIPS 2020.&quot;&gt;AD20&lt;/a&gt;], it is to some extent an unavoidable limitation [&lt;a href=&quot;https://arxiv.org/abs/1504.07553&quot; title=&quot;Mark Bun, Kobbi Nissim, Uri Stemmer, Salil Vadhan. Differentially Private Release and Learning of Threshold Functions. FOCS 2015.&quot;&gt;BNSV15&lt;/a&gt;,&lt;a href=&quot;https://arxiv.org/abs/1806.00949&quot; title=&quot;Noga Alon, Roi Livni, Maryanthe Malliaris, Shay Moran. Private PAC learning implies finite Littlestone dimension. STOC 2019.&quot;&gt;ALMM19&lt;/a&gt;]. For example, to apply the inverse sensitivity mechanism to the median, we must discretize and bound the inputs; bounding the inputs does impose a finite global sensitivity, but the dependence on the bound is logarithmic, so the bound can be fairly large. Assuming that the function is surjective is a minor assumption that ensures that the loss in Equation 4 is always well-defined; otherwise we can define the loss to be infinite for points that are not in the range of the function. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Note that we can use other selection algorithms, such as permute-and-flip [&lt;a href=&quot;https://arxiv.org/abs/2010.12603&quot; title=&quot;Ryan McKenna, Daniel Sheldon. Permute-and-Flip: A new mechanism for differentially private selection. NeurIPS 2020.&quot;&gt;MS20&lt;/a&gt;] or report-noisy-max [&lt;a href=&quot;https://arxiv.org/abs/2105.07260&quot; title=&quot;Zeyu Ding, Daniel Kifer, Sayed M. Saghaian N. E., Thomas Steinke, Yuxin Wang, Yingtai Xiao, Danfeng Zhang. The Permute-and-Flip Mechanism is Identical to Report-Noisy-Max with Exponential Noise. 2021.&quot;&gt;DKSSWXZ21&lt;/a&gt;] or gap-max [&lt;a href=&quot;https://arxiv.org/abs/1409.2177&quot; title=&quot;Kamalika Chaudhuri, Daniel Hsu, Shuang Song. The Large Margin Mechanism for Differentially Private Maximization. NIPS 2014.&quot;&gt;CHS14&lt;/a&gt;,&lt;a href=&quot;https://dl.acm.org/doi/10.1145/3188745.3188946&quot; title=&quot; Mark Bun, Cynthia Dwork, Guy N. Rothblum, Thomas Steinke. Composable and versatile privacy via truncated CDP. STOC 2018.&quot;&gt;BDRS18&lt;/a&gt;,&lt;a href=&quot;https://arxiv.org/abs/1905.13229&quot; title=&quot;Mark Bun, Gautam Kamath, Thomas Steinke, Zhiwei Steven Wu. Private Hypothesis Selection. NeurIPS 2019.&quot;&gt;BKSW19&lt;/a&gt;]. &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
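&lt;p&gt;For concreteness, here is a minimal sketch (ours; parameter names assumed) of one such alternative, report-noisy-max with exponential noise, which by [DKSSWXZ21] is identical in distribution to permute-and-flip:&lt;/p&gt;

```python
import numpy as np

def report_noisy_max(scores, eps, sensitivity=1.0, rng=None):
    """Report-noisy-max with exponential noise: add i.i.d. exponential noise of
    scale 2 * sensitivity / eps to each score and release the argmax. This is
    eps-DP for sensitivity-1 scores and, per DKSSWXZ21, identical in
    distribution to permute-and-flip.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.exponential(scale=2.0 * sensitivity / eps, size=len(scores))
    return int(np.argmax(np.asarray(scores, dtype=float) + noise))
```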
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <author>
        
            <name>Thomas Steinke</name>
        
        </author>
        <pubDate>Tue, 05 Sep 2023 09:00:00 -0700</pubDate>
        <link>https://differentialprivacy.org/inverse-sensitivity/</link>
        <guid isPermaLink="true">https://differentialprivacy.org/inverse-sensitivity/</guid>
      </item>
    
  </channel>
</rss>
