tcs math

Lower bounds for MTS

Fri, 20 Apr 2018 00:00:00 +0000

Consider a rooted tree $T=(V,E)$ with leaf set $\mathcal L\subseteq V$ and positive vertex weights $w : V\setminus \mathcal L\to \mathbb R_+$ that are non-increasing along root-leaf paths. We recall the ultrametric $d_w$ defined on $\mathcal L$ by

$$ d_w(\ell,\ell') \mathrel{\vcenter{:}}= w(\mathrm{lca}(\ell,\ell')). $$

Say that $(\mathcal L,d_w)$ is a $\tau$-HST metric for some $\tau \geq 1$ if $\mathcal L$ is the leaf set of some weighted tree $T=(V,E)$, and $d_w$ is the ultrametric corresponding to a weight $w$ on $T$ with the property that $w(y) \leq w(x)/\tau$ whenever $y$ is a child of $x$. (So for a finite metric space, the notions of $1$-HST metric and ultrametric are equivalent.) Say that $(\mathcal L,d)$ is a strict $\tau$-HST metric if we require the stronger property that $w(y)=w(x)/\tau$ whenever $x,y \in V$ are internal nodes and $y$ is a child of $x$.
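As a small illustration of these definitions, the following sketch computes $d_w$ on a toy tree and checks the $\tau$-HST condition on internal nodes. The tree, its weights, and all function names are hypothetical choices made for the example, not part of the lecture.

```python
# Hypothetical toy tree: parent pointers for every node, weights on internal nodes only.
parent = {"r": None, "a": "r", "b": "a",
          "l1": "a", "l2": "b", "l3": "b", "l4": "r"}
weight = {"r": 8.0, "a": 2.0, "b": 0.5}
leaves = ["l1", "l2", "l3", "l4"]

def ancestors(v):
    """Path from v up to the root, starting at v itself."""
    path = []
    while v is not None:
        path.append(v)
        v = parent[v]
    return path

def lca(u, v):
    """Lowest common ancestor of u and v."""
    seen = set(ancestors(u))
    for x in ancestors(v):
        if x in seen:
            return x
    raise ValueError("nodes are not in the same tree")

def d_w(u, v):
    """Distance between leaves: weight of their lca, and 0 on the diagonal."""
    return 0.0 if u == v else weight[lca(u, v)]

def is_tau_hst(tau):
    """Check w(y) <= w(x)/tau for every internal node y with internal parent x."""
    return all(weight[y] <= weight[parent[y]] / tau
               for y in weight if parent[y] is not None)

def is_ultrametric():
    """Check d(x,z) <= max(d(x,y), d(y,z)) over all triples of leaves."""
    return all(d_w(x, z) <= max(d_w(x, y), d_w(y, z))
               for x in leaves for y in leaves for z in leaves)

print(d_w("l2", "l3"), d_w("l1", "l2"), d_w("l1", "l4"))
print(is_ultrametric(), is_tau_hst(4), is_tau_hst(5))
```

In this toy tree the weights drop by a factor of exactly $4$ at each level, so the metric is a strict $4$-HST but not a $5$-HST.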

Two metrics $d$ and $d'$ on a set $X$ are $K$-equivalent if there is a constant $c > 0$ such that \[d(x,y) \leq c\, d'(x,y) \leq K d(x,y)\qquad \forall x,y \in X\,.\] We leave the following statement as an exercise: For every $\tau > 1$, every $1$-HST is $\tau$-equivalent to some strict $\tau$-HST.

Theorem 1. There is a constant $C \geq 1$ such that if $(\mathcal L,d_w)$ is a $C (\log n)^2$-HST metric, where $n=|\mathcal L|$, then the competitive ratio for MTS on $(\mathcal L,d_w)$ is $\Omega\left(\log n\right)$.

While the preceding theorem applies only to $\tau$-HSTs with $\tau$ sufficiently large, the next lemma shows how we can improve the separation parameter by passing to a subset of leaves.

If $(\mathcal L,d)$ is a strict $2$-HST, then for every $k \in \mathbb N$, there is a subset $\mathcal L' \subseteq \mathcal L$ with

$$ |\mathcal{L}'| \geq |\mathcal{L}|^{1/k}, $$

and such that $(\mathcal L', d)$ is a $2^k$-HST metric.

Combining this with Theorem 1 yields:

If $(\mathcal L,d)$ is a $1$-HST metric with $n=|\mathcal L|$, then the competitive ratio for MTS on $(\mathcal L,d)$ is at least $\Omega\left(\frac{\log n}{\log \log n}\right)$.
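One way to fill in the parameter choice behind this bound, as a sketch: first replace $d$ by a $2$-equivalent strict $2$-HST (the exercise above), which changes competitive ratios only by a constant factor. Then take $k$ minimal with $2^k \geq C(\log n)^2$, where $C$ is the constant from Theorem 1, so that $k = O(\log \log n)$. The lemma yields $\mathcal L' \subseteq \mathcal L$ with $n' \mathrel{\vcenter{:}}= |\mathcal L'| \geq n^{1/k}$ on which the metric is a $2^k$-HST, and since $n' \leq n$ we also have $2^k \geq C(\log n')^2$. Theorem 1 applied to $\mathcal L'$ then gives a competitive ratio of at least

\[ \Omega(\log n') \geq \Omega\left(\frac{\log n}{k}\right) = \Omega\left(\frac{\log n}{\log \log n}\right), \]

and a lower bound for MTS on the subset $\mathcal L'$ is also a lower bound on all of $\mathcal L$.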

Recall that the Metric Ramsey theorem from the preceding lecture implies that every $n$-point metric space contains a subset of size $\sqrt{n}$ that is $O(1)$-equivalent to a $1$-HST. This yields:

For every $n$-point metric space, the competitive ratio for MTS is at least $\Omega(\log n/\log \log n)$.

We will not prove Theorem 1, but we will prove a lower bound for two special cases that together capture the essential elements of the full argument. The general case is discussed at the end.

The complete d-ary tree

The first case we’ll consider is when $T$ is a $d$-ary tree of height $h$, so that $|\mathcal L|=d^h$. Define $w(x)=\tau^{-\mathrm{dist}_T(x,r)}$, where $r$ is the root of $T$ and $\mathrm{dist}_T$ denotes the combinatorial distance on $T$. Then $(\mathcal L,d_w)$ is a $\tau$-HST.

Our goal is to establish a lower bound of $\Omega(h \log d)$ on the competitive ratio for MTS when $\tau$ is sufficiently large. Note that when $\tau = \Theta(1)$ and $d=2$, it is an open problem to exhibit an $\Omega(h)$ lower bound, and proving such a bound is likely the most difficult obstacle in obtaining an $\Omega(\log n)$ lower bound for every $n$-point ultrametric.

Our cost functions will be of the form $\{\epsilon\cdot c_\ell : \ell \in \mathcal L\}$, where $\epsilon> 0$ is a number, and \[c_\ell(\ell') = \begin{cases} 0 & \ell=\ell' \\ 1 & \ell \neq \ell'\,. \end{cases}\] More succinctly: $c_\ell \mathrel{\vcenter{:}}= \mathbf{1}-\mathbf{1}_\ell$.

Let $\alpha_h$ be the competitive ratio for height $h$. We will prove inductively that $\alpha_h \geq \Omega(h \log d)$.

Consider first the case $h=1$. We define the height-$1$ cost sequence. The root $r$ has $d$ leaves beneath it. We do the following $d \log d$ times: Choose a random leaf $\ell$ and play the cost function $c_\ell$. It is straightforward that if $\mathrm{alg}_1$ denotes the cost of an online algorithm, then: \[\mathbb{E}[\mathrm{alg}_1] \geq \log d\,.\] (Note that if a cost comes at $\ell$, then the algorithm must either move and pay movement cost $1$, or stay in place and pay service cost $1$.)

Consider the following offline algorithm: Move to the leaf that will be chosen the smallest number of times, and stay there. The movement cost is $1$, and a standard coupon collecting argument shows that the service cost is $O(1)$, hence: \[\mathbb{E}[\mathrm{opt}_1] \leq O(1)\,.\] We conclude that the competitive ratio satisfies \[\alpha_1 \geq c \log d\] for some $c > 0$.

Consider now the case of $h \geq 2$. We define the height-$h$ cost sequence. The root has beneath it $d$ subtrees $T_1, T_2, \ldots, T_d$ of height $h-1$. For some number $\mu_h$ to be chosen shortly, we do the following $\mu_h d$ times: Choose $i \in \{1,2,\ldots,d\}$ uniformly at random and play in $T_i$ the height-$(h-1)$ cost sequence scaled by $1/\tau$.

Consider first what an online algorithm can do. If the algorithm’s state lies in $T_i$ when a height-$(h-1)$ cost sequence is imposed on $T_i$, it can either leave $T_i$ at some point during the sequence, or it can remain in $T_i$. Hence the algorithm pays at least $\min\left(1,\tau^{-1} \mathbb{E}[\mathrm{alg}_{h-1}]\right)$. It follows that:

\[\mathbb{E}[\mathrm{alg}_h] \geq \mu_h \min\left(1,\tau^{-1} \mathbb{E}[\mathrm{alg}_{h-1}]\right) \geq \mu_h \min\left(1,\tau^{-1} \alpha_{h-1} \mathbb{E}[\mathrm{opt}_{h-1}]\right)\,.\]

Thus if

\begin{equation}\label{eq:tauchoice} \tau \geq \alpha_{h-1} \mathbb{E}[\mathrm{opt}_{h-1}]\end{equation}

it holds that \begin{equation}\label{eq:alg} \mathbb{E}[\mathrm{alg}_h] \geq \mu_h \tau^{-1} \alpha_{h-1} \mathbb{E}[\mathrm{opt}_{h-1}]\,.\end{equation} We will choose $\tau$ so that \eqref{eq:tauchoice} holds.

Let us now analyze an offline algorithm. Let $\rho_i$ denote the number of times that $T_i$ is chosen. The offline algorithm first moves to some $T_i$ with $\rho_i$ minimal, and then plays optimally against level-$(h-1)$ cost sequences arriving in $T_i$.

If $\mu_h \geq \Omega(\log d)$, then the normal approximation to a binomial distribution shows that there is a constant $0 < c' < 1$ such that \[\mathbb{E}\left[\min(\rho_1,\ldots,\rho_d)\right] \leq \mu_h - c'\sqrt{\mu_h \log d}\,.\] Hence incorporating the movement cost yields: \[\mathbb{E}[\mathrm{opt}_h] \leq 1 + \left(\mu_h - c'\sqrt{\mu_h \log d}\right) \tau^{-1} \mathbb{E}[\mathrm{opt}_{h-1}]\,.\]

We now choose $\mu_h$ so that \[c' \sqrt{\mu_h \log d} \tau^{-1} \mathbb{E}[\mathrm{opt}_{h-1}] = 2\,,\] i.e., \[\mu_h \mathrel{\vcenter{:}}= \frac{4}{\log d} \left(\frac\tau{c' \mathbb{E}[\mathrm{opt}_{h-1}]}\right)^2\,.\] From \eqref{eq:tauchoice}, we have \begin{equation}\label{eq:muh} \mu_h \geq \frac{4}{(c')^2\log d} \alpha_{h-1}^2.\end{equation} Since $\alpha_{h-1} \geq \alpha_1 \geq \Omega(\log d)$ it follows that $\mu_h \geq \log d$ for $c'$ chosen small enough. (In particular, the normal approximation applies.)

Our choice of $\mu_h$ ensures that \begin{align} \mathbb{E}[\mathrm{opt}_h] &\leq \left(\mu_h-\frac{c'}{2} \sqrt{\mu_h \log d}\right) \tau^{-1} \mathbb{E}[\mathrm{opt}_{h-1}] \label{eq:opt}\end{align} Combining this with \eqref{eq:alg} yields

$$ \alpha_h \geq \frac{\mathbb{E}[\mathrm{alg}_h]}{\mathbb{E}[\mathrm{opt}_h]} \geq \frac{\alpha_{h-1}}{1-\frac{c'}{2} \sqrt{\frac{\log d}{\mu_h}}} \geq \alpha_{h-1} \left(1+\frac{c'}{2} \sqrt{\frac{\log d}{\mu_h}}\right), $$

where the latter inequality holds because $c' < 1$ and $\mu_h \geq \log d$. Using \eqref{eq:muh}, this gives \[\alpha_h \geq \alpha_{h-1} + \frac{c'}{2} \log d\,,\] completing the argument.

Note that in \eqref{eq:tauchoice}, we required $\tau \geq \alpha_{h-1} \mathbb{E}[\mathrm{opt}_{h-1}]$. So let us use \eqref{eq:opt} to compute: \[\mathbb{E}[\mathrm{opt}_h] \leq \mu_h \tau^{-1} \mathbb{E}[\mathrm{opt}_{h-1}] \leq \frac{O(1)}{\log d} \frac\tau{\mathbb{E}[\mathrm{opt}_{h-1}]}\,.\] Since \[\mathbb{E}[\mathrm{opt}_h] \cdot \mathbb{E}[\mathrm{opt}_{h-1}] \leq O(\tau/\log d)\,,\] and $\mathbb{E}[\mathrm{opt}_h] \geq \mathbb{E}[\mathrm{opt}_{h-1}] \geq 1$ for all $h$, it follows that $\mathbb{E}[\mathrm{opt}_h] \leq O(\sqrt{\tau/\log d})$, meaning that $\tau \geq \alpha_{h-1} \mathbb{E}[\mathrm{opt}_{h-1}]$ will be satisfied for $\tau \geq C(\alpha_{h-1}/\log d)^2 \asymp h^2$. Since $h \asymp \frac{\log n}{\log d}$, this completes the proof of Theorem 1 in the special case of a $d$-regular HST.

The “superincreasing” metric

We will now consider a highly unbalanced family of HSTs. These were analyzed in a paper of Karloff, Rabani, and Ravid.

Let us define the trees $\{T_n\}$ as follows: The root $r_{n}$ of $T_n$ has two children. The left child is a copy of $T_{n-1}$ rooted at $r_{n-1}$, and the right child is a single leaf. We denote $w_n(r_n)=1$, and $w_n(r_{n-1}) = 1/\tau_n$, where $\tau_n > \tau_{n-1} > \cdots > 0$ is a sequence of positive weights we will choose soon, and such that $\tau_n \leq O(\log n)$.

The basic structure of our argument will be similar to the $d$-regular case. We do the following $2 \mu_n$ times: Choose $i \in \{L,R\}$ uniformly at random. If $i=L$, we play a height-$(n-1)$ cost sequence on the left subtree (scaled by $1/\tau_n$). Otherwise, we play $c_\ell$, where $\ell$ is the leaf constituting the right subtree.

Consider an online algorithm. If the algorithm sits in the left subtree when $i=L$, then either the algorithm moves out of the subtree (movement cost $1$), or it incurs the cost of a height-$(n-1)$ inductive sequence scaled by $1/\tau_n$. If it sits in the right subtree, then it pays cost $1$ (either by moving or staying and incurring the service cost). If we choose $\tau_n \mathrel{\vcenter{:}}=\alpha_{n-1} \mathbb{E}[\mathrm{opt}_{n-1}]$, then such an algorithm always pays at least $\tau_n^{-1} \alpha_{n-1} \mathbb{E}[\mathrm{opt}_{n-1}]$ in any of these cases. Therefore:

\begin{equation}\label{eq:algn} \mathbb{E}[\mathrm{alg}_n] \geq \mu_n \tau_n^{-1} \alpha_{n-1} \mathbb{E}[\mathrm{opt}_{n-1}]. \end{equation}

We now bound the cost of the optimal offline algorithm. Let $\rho_L$ be the number of times $i=L$ and let $\rho_R \mathrel{\vcenter{:}}= 2\mu_n - \rho_L$. The algorithm will always sit by default in the left subtree. If $\rho_R=0$, it will move to the right subtree, suffer zero service cost there, and then move back to the left subtree (paying total cost $2$). If $\rho_R > 0$, the algorithm will remain in the left subtree and play an optimal strategy for $T_{n-1}$.

This yields: \[\begin{aligned} \mathbb{E}[\mathrm{opt}_n] &\leq \left(1-{\mathbb{P}}[\rho_R=0]\right) \mathbb{E}\left[\rho_L \tau_n^{-1} \mathrm{opt}_{n-1} \mid \rho_R > 0\right] + 2 {\mathbb{P}}[\rho_R=0] \\ &\leq (1-4^{-\mu_n}) \mu_n \tau_n^{-1} \mathbb{E}[\mathrm{opt}_{n-1}] + 2 \cdot 4^{-\mu_n}.\end{aligned}\] Now choose: \[\mu_n \mathrel{\vcenter{:}}=\frac{4 \tau_n}{\mathbb{E}[\mathrm{opt}_{n-1}]} = 4 \alpha_{n-1}\,.\] This yields: \[\mathbb{E}[\mathrm{opt}_n] \leq \left(1-\tfrac12 4^{-4 \alpha_{n-1}}\right) \mu_n \tau_n^{-1} \mathbb{E}[\mathrm{opt}_{n-1}].\]

Combined with \eqref{eq:algn}, we have \[\alpha_n \geq \frac{\mathbb{E}[\mathrm{alg}_n]}{\mathbb{E}[\mathrm{opt}_n]} \geq \alpha_{n-1} + \tfrac12 4^{-4\alpha_{n-1}}.\] This implies that $\alpha_n \geq \beta \log n$ for some positive $\beta < 1$. (To see this, observe that if $f(x)=\beta \log x$, then $f'(x) = \beta/x < \frac12 4^{-4 \beta \log x}$ for $\beta$ chosen small enough.)
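To get a feel for the logarithmic growth, here is a minimal numerical sketch that iterates the recurrence with equality and compares it to the solution of the matching differential equation $a'(n) = \tfrac12 e^{-c\,a(n)}$ with $c = 4 \ln 4$. The starting value $\alpha_1 = 1$ and the function names are assumptions made for the illustration.

```python
import math

def iterate_recurrence(n_max, alpha_1=1.0):
    """Iterate alpha_n = alpha_{n-1} + (1/2) * 4**(-4 * alpha_{n-1}) (equality assumed)."""
    alphas = [alpha_1]
    for _ in range(2, n_max + 1):
        a = alphas[-1]
        alphas.append(a + 0.5 * 4.0 ** (-4.0 * a))
    return alphas  # alphas[n - 1] holds alpha_n

def ode_solution(n, alpha_1=1.0):
    """Solution of a'(n) = (1/2) * exp(-c * a) with a(1) = alpha_1 and c = 4 ln 4."""
    c = 4.0 * math.log(4.0)
    return math.log(math.exp(c * alpha_1) + 0.5 * c * (n - 1)) / c

alphas = iterate_recurrence(10_000)
for n in (10, 100, 1000, 10_000):
    # Columns: n, iterated value, ODE solution (which grows like log(n) / c).
    print(n, round(alphas[n - 1], 4), round(ode_solution(n), 4))
```

The closed form grows like $(\log n)/c$, which suggests that any admissible $\beta$ is at most about $1/(4 \ln 4)$ when the recurrence is tight.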

Note furthermore that \[\mathbb{E}[\mathrm{opt}_n] \leq 4\,,\] and therefore $\tau_n \leq O(\log n)$.

The general case

We have demonstrated lower bounds in two cases: When the underlying tree $T$ is regular, and when it is unbalanced and binary. The general case can be proved by combining these two strategies along any $\Omega((\log n)^2)$-HST. The next lemma contains the basic idea. We leave it to the reader as a basic exercise. (Hint: Use the concavity of $x \mapsto \sqrt{x}$.)

Suppose that $n=n_1+n_2+\cdots+n_m$ where $n_1 \geq n_2 \geq \cdots \geq n_m \geq 1$. Then either $\sqrt{n_1}+\sqrt{n_2} \geq \sqrt{n}$, or there is some $\ell \geq 3$ such that $\ell \cdot \sqrt{n_\ell} \geq \sqrt{n}$.
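The dichotomy in the lemma is easy to test on concrete partitions. The sketch below (function name hypothetical) returns which alternative holds, using integer arithmetic to avoid rounding issues: $\sqrt{n_1}+\sqrt{n_2} \geq \sqrt{n}$ is equivalent to $4 n_1 n_2 \geq (n-n_1-n_2)^2$, and $\ell \sqrt{n_\ell} \geq \sqrt{n}$ is equivalent to $\ell^2 n_\ell \geq n$. For $m=1$ it treats $n_2$ as $0$, which is a convention of the sketch.

```python
def lemma_witness(sizes):
    """Return ("pair", None) if sqrt(n1) + sqrt(n2) >= sqrt(n); otherwise return
    ("prefix", l) for the first l >= 3 with l * sqrt(n_l) >= sqrt(n).
    Returns None if neither alternative holds."""
    ns = sorted(sizes, reverse=True)
    n = sum(ns)
    n1 = ns[0]
    n2 = ns[1] if len(ns) > 1 else 0
    rest = n - n1 - n2  # total size of parts 3, 4, ..., m (always >= 0)
    # Squaring twice: sqrt(n1) + sqrt(n2) >= sqrt(n)  <=>  4 * n1 * n2 >= rest**2.
    if 4 * n1 * n2 >= rest * rest:
        return ("pair", None)
    for l in range(3, len(ns) + 1):
        if l * l * ns[l - 1] >= n:
            return ("prefix", l)
    return None

print(lemma_witness([50, 50]))          # two balanced parts
print(lemma_witness([1] * 16))          # many equal parts
print(lemma_witness([100] + [1] * 20))  # one dominant part
```

The first alternative corresponds to the binary, possibly unbalanced situation, and the second to a regular tree of degree $\ell$ after pruning.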

Suppose now that $T$ is a rooted tree with subtrees $T_1, T_2, \ldots, T_m$ beneath the root, and such that $T_i$ has $n_i$ leaves with $n_1 \geq n_2 \geq \cdots \geq n_m \geq 1$. The lemma says we can either consider only the first two subtrees (the binary, possibly “unbalanced” case), or we can take the first $\ell$ subtrees for some $\ell \geq 3$, prune them so they all have exactly $n_\ell$ leaves, and then prove a lower bound. Analyzing these two cases corresponds roughly to the two lower bound arguments above.

Approximation by ultrametrics

Thu, 12 Apr 2018 00:00:00 +0000

We now take a short detour away from mirror descent, and instead examine how a special class of metric spaces called ultrametrics control the competitive ratio of MTS in general metric spaces. Hold onto your seats; we’re about to compress two decades and 10 papers into a few paragraphs. For the sake of continuity, bibliographic remarks are held to the end of the post. $\def\e{\varepsilon}$

Metric approximations

For a metric space $(X,d)$, let $\alpha_{\mathrm{mts}}(X,d)$ denote the best (randomized) competitive ratio for MTS on $(X,d)$. Suppose $D$ is some other metric on $X$ that satisfies

\begin{equation}\label{eq:distortion} \frac{D(x,y)}{K} \leq d(x,y) \leq D(x,y) \qquad \forall x,y \in X\,. \end{equation}

Then it is straightforward to verify that $\alpha_{\mathrm{mts}}(X,d) \leq K \cdot \alpha_{\mathrm{mts}}(X,D)$.

But there is a weaker form of approximation that still allows for such a conclusion. Suppose that $\mathbf{D}$ is a random metric that satisfies, for every $x,y \in X$:

(1) With probability one, $\mathbf{D}(x,y) \geq d(x,y)$,
(2) $\mathbb{E}\left[\mathbf{D}(x,y)\right] \leq K \cdot d(x,y)$.

If $\alpha_{\mathrm{mts}}(X,\mathbf{D}) \leq \alpha$ with probability one, we claim that $\alpha_{\mathrm{mts}}(X,d) \leq K \alpha$.

The algorithm achieving this is as follows: Sample the random metric $\mathbf{D}$. Now given a cost sequence $\left\langle c_t : X \to \mathbb{R}_+ \mid t \geq 1\right\rangle$, let $\left\langle x_0, x_1, x_2, \ldots \right\rangle$ be the random sequence of points produced by an $\alpha$-competitive randomized algorithm for $(X,\mathbf{D})$. Then:

\[ \mathbb{E}\left[\sum_{t=1}^T \left.\vphantom{\bigoplus} \mathbf{D}(x_t, x_{t-1}) \ \right| \mathbf{D}\right] \leq \alpha \sum_{t=1}^T \mathbf{D}\!\left(x_t^{*}, x_{t-1}^{*}\right) + O(1)\,, \]

where $\left\langle x_t^* : t \geq 0\right\rangle$ is an optimal offline sequence for $(X,d)$. (This may not be the optimal sequence for $(X,\mathbf{D})$, but the algorithm is certainly competitive against non-optimal sequences as well.)

Note that the expectation here is taken only with respect to the randomness in the online algorithm. If we take expectation with respect to $\mathbf{D}$ as well, it follows that

\[ \mathbb{E}\left[\sum_{t=1}^T d(x_t, x_{t-1})\right] \leq \alpha \mathbb{E}\left[\sum_{t=1}^T \mathbf{D}\left(x_t^{*}, x_{t-1}^{*}\right)\right] + O(1) \leq K \alpha \sum_{t=1}^T d\!\left(x_t^{*}, x_{t-1}^{*}\right) + O(1)\,, \]

where we have used property (1) of $\mathbf{D}$ for the LHS and property (2) to bound the RHS.

Ultrametrics

Let $T=(V,E)$ be a finite, rooted tree, and let $\mathcal{L}$ denote the leaves of $T$. Suppose $w : V\setminus \mathcal{L} \to \mathbb{R}_+$ is a function that assigns positive weights to the internal vertices of $T$ such that the vertex weights are non-increasing along root-leaf paths. Then one can define a distance on $\mathcal{L}$ by \[ d_w\left(\ell,\ell'\right) \mathrel{\vcenter{:}}= w\left(\mathrm{lca}(\ell,\ell')\right). \] This is an ultrametric on $\mathcal{L}$ (and all finite ultrametrics arise in this way).

It turns out that ultrametrics essentially control the competitive ratio for metrical task systems on finite metric spaces. This follows from the next two facts that hold for an arbitrary $n$-point metric space $(X,d)$.

Theorem 1. There is a random ultrametric $\mathbf{D}$ on $(X,d)$ such that (1) and (2) are satisfied with $K \leq O(\log n)$.

By our earlier remarks, this implies that the competitive ratio for MTS on $(X,d)$ is at most $O(\log n)$ times the competitive ratio for $n$-point ultrametrics.

Theorem 2. There is a subset $X' \subseteq X$ with $|X'| \geq \sqrt{n}$ and an ultrametric $D$ on $X'$ such that \eqref{eq:distortion} is satisfied with $K \leq O(1)$.

Since $\alpha_{\mathrm{mts}}(X,d) \geq \alpha_{\mathrm{mts}}(X',d) \geq \Omega(\alpha_{\mathrm{mts}}(X',D))$, lower bounds on the competitive ratio for ultrametrics yield lower bounds for $(X,d)$ as well. Finally, we remark that MTS on ultrametrics is now well-understood.

Theorem 3. If $(X,D)$ is an $n$-point ultrametric, then \[ \Omega\left(\frac{\log n}{\log \log n}\right) \leq \alpha_{\mathrm{mts}}(X,D) \leq O(\log n)\,. \]

There are ultrametrics (e.g., as we have seen already, when $(X,D)$ is the uniform metric) for which the $O(\log n)$ upper bound in Theorem 3 is tight. Whether the LHS can be made $\Omega(\log n)$ is an intriguing open problem; we will address it in the next lecture. In conjunction with Theorem 1 and Theorem 2, this yields:

For any $n$-point metric space $(X,d)$: \[ \Omega\left(\frac{\log n}{\log \log n}\right) \leq \alpha_{\mathrm{mts}}(X,d) \leq O((\log n)^2)\,. \]

Perhaps the central remaining open question for MTS on general metric spaces is whether the upper bound can be improved to $O(\log n)$.

Ultrametrics from partitions

Consider a partition $P$ of $X$. Say that $P$ is $\Delta$-bounded if $S \in P \implies \mathrm{diam}_X(S) \leq \Delta$. For a point $x \in X$, we write $P(x)$ for the unique set in $P$ that contains $x$.

Suppose now that $\mathcal{P} = \{ P_j : j \in \mathbb{Z} \}$ is a sequence of partitions of $X$ such that $P_j$ is $8^j$-bounded for every $j \in \mathbb{Z}$, and define the metric

\[ D_{\mathcal{P}}(x,y) \mathrel{\vcenter{:}}= \max \left\{ 8^{j+1} : P_j(x) \neq P_j(y) \right\}. \]

One can check that $D_{\mathcal{P}}$ is an ultrametric on $X$ (imagine the tree structure induced by the partitions), and moreover

\begin{equation}\label{eq:exp} D_{\mathcal{P}}(x,y) \geq d(x,y) \qquad \forall x,y \in X\,. \end{equation}

This follows from: $D_{\mathcal{P}}(x,y) \leq 8^j \implies P_j(x)=P_j(y) \implies d(x,y) \leq 8^j$.
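A quick way to see the construction in action is to take points on the integer line and let $P_j$ be the partition into half-open intervals of length $8^j$, which is $8^j$-bounded. This grid partition, the point set, and the function names below are assumptions made for the sketch. For integer points, levels $j < 0$ never change the maximum and levels with $8^j$ larger than every coordinate consist of a single cell, so a finite range of levels suffices.

```python
def grid_label(x, j):
    """Index of the half-open interval of length 8**j containing the integer x."""
    return x // (8 ** j)

def D_P(x, y, levels):
    """max{ 8**(j+1) : P_j(x) != P_j(y) } over the given levels, and 0 if x == y."""
    vals = [8 ** (j + 1) for j in levels if grid_label(x, j) != grid_label(y, j)]
    return max(vals, default=0)

points = [0, 3, 7, 8, 20, 63, 64, 100]
levels = range(0, 4)  # 8**3 = 512 > 100, so P_3 is a single cell

for x, y in [(0, 3), (7, 8), (63, 64), (0, 100)]:
    print(x, y, abs(x - y), D_P(x, y, levels))
```

Note how nearby points such as $7$ and $8$ can be far apart in $D_{\mathcal{P}}$ when a cell boundary falls between them; this is exactly the distortion that the random partitions below are designed to control.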

Define $B(x,r) \mathrel{\vcenter{:}}= \{ y \in X : d(x,y) \leq r \}$ to be the ball of radius $r$ about $x \in X$.

Suppose $(X,d)$ is an $n$-point metric space. Then for every $\Delta > 0$, there is a random $\Delta$-bounded partition $P$ of $X$ such that for every $r \leq \Delta/8$:

\begin{equation}\label{eq:rp} \mathbb{P}\left[\vphantom{\bigoplus} B(x,r) \subseteq P(x)\right] \geq \exp\left(\frac{-8r}{\Delta} \log \frac{|B(x,\Delta)|}{|B(x,\Delta/8)|}\right). \end{equation}

Remarkably, the random partitioning lemma can be used to prove both Theorem 1 and Theorem 2. We will first establish these consequences and then prove the lemma.

Approximation by a random ultrametric

Let’s prove Theorem 1. Let $\mathcal{P} = \{ P_j : j \in \mathbb{Z} \}$ be the random sequence where $P_j$ results from the random partitioning lemma applied with $\Delta = 8^j$ and we take the partitions to be mutually independent. Then from \eqref{eq:exp}, we know $\mathbf{D}_{\mathcal{P}}(x,y) \geq d(x,y)$ for all $x,y \in X$. Now fix $x \neq y \in X$, and let $j_0 \mathrel{\vcenter{:}}= \min \{ j : 8^{j} \geq d(x,y) \}$. Then:

\[ \mathbb{E}\left[\mathbf{D}_{\mathcal{P}}(x,y)\right] \leq 8^{j_0+1} + \sum_{j > j_0} \mathbb{P}[P_j(x) \neq P_j(y)] 8^{j+1}, \]

and using $\mathbb{P}[P_j(x) = P_j(y)] \geq \mathbb{P}[B(x, d(x,y)) \subseteq P_j(x)]$ yields

\begin{align*} \sum_{j > j_0} \mathbb{P}[P_j(x) \neq P_j(y)] 8^{j+1} &\leq \sum_{j \in \mathbb{Z}} 8^{j+2} \frac{d(x,y)}{8^{j}} \log \frac{|B(x, 8^j)|}{|B(x,8^{j-1})|} \\ &\leq 64\, d(x,y) \sum_{j \in \mathbb{Z}} \log \frac{|B(x, 8^j)|}{|B(x,8^{j-1})|} \\ &= 64 \,d(x,y) \log n\,. \end{align*}

where the first inequality follows from \eqref{eq:rp} and the fact that $e^{-x} \geq 1-x$, and in the last step we evaluate a telescoping sum.
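To spell out the telescoping sum: since $X$ is finite, there are integers $a < b$ such that $|B(x,8^j)| = 1$ for all $j \leq a$ and $|B(x,8^j)| = n$ for all $j \geq b$ (the names $a$ and $b$ are introduced only for this remark). All terms with $j \leq a$ or $j > b$ vanish, and the rest collapse:

\[ \sum_{j \in \mathbb{Z}} \log \frac{|B(x,8^j)|}{|B(x,8^{j-1})|} = \sum_{j=a+1}^{b} \left(\log |B(x,8^j)| - \log |B(x,8^{j-1})|\right) = \log |B(x,8^b)| - \log |B(x,8^a)| = \log n\,. \]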

Finding a large approximate ultrametric inside $X$

Now we prove Theorem 2. Let $\mathcal{P}$ be the same random partition sequence chosen above and fix some $0 < \e < 1/8$. Define the random subset:

\[ \mathbf{S} \mathrel{\vcenter{:}}= \left\{ x \in X : B(x, \e 8^j) \subseteq P_j(x) \ \forall j \in \mathbb{Z}\right\}\,. \]

We claim that

\[ D_{\mathcal{P}}(x,y) \geq d(x,y) \geq \frac{\e}{8} D_{\mathcal{P}}(x,y) \qquad \forall x,y \in \mathbf{S}\,. \]

The LHS is simply from \eqref{eq:exp}. For the RHS, observe that if $D_{\mathcal{P}}(x,y)=8^{j+1}$, then $P_j(x) \neq P_j(y)$, hence $B(x,\e 8^j) \subseteq P_j(x) \implies d(x,y) \geq \e 8^j$.

Now the random partitioning lemma gives, for any $x \in X$,

\[ \mathbb{P}[x \in \mathbf{S}] \geq \prod_{j \in \mathbb{Z}} \exp\left(- 8\e \log \frac{|B(x,8^j)|}{|B(x,8^{j-1})|}\right) = \exp\left(-8 \e \log n\right) = n^{-8\e}\,. \]

By linearity of expectation: $\mathbb{E}[|\mathbf{S}|] \geq n^{1-8\e}$. Taking $\e \mathrel{\vcenter{:}}= 1/16$ completes the proof.

Proof of the random partitioning lemma

Let $\mathbf{R} \in [\Delta/4,\Delta/2]$ be chosen uniformly, and let $X = \{x_1, x_2, \ldots, x_n\}$ be a uniformly random ordering of the points in $X$. Our random partitioning is formed by iteratively carving out balls:

\begin{equation}\label{eq:ckr} P \mathrel{\vcenter{:}}= \left\{ B(x_i, \mathbf{R}) \setminus \bigcup_{j < i} B(x_j, \mathbf{R}) : i=1,2,\ldots,n\right\}. \end{equation}

Clearly $P$ is $\Delta$-bounded by construction.
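The carving procedure in \eqref{eq:ckr} translates almost directly into code. The following is a minimal sketch for a finite point set with a user-supplied distance function; the function name and the toy metric (integers on a line) are assumptions for the example, and empty clusters are simply dropped.

```python
import random

def ckr_partition(points, dist, delta, rng):
    """Carve out balls of a common random radius R in [delta/4, delta/2],
    visiting the centers in a uniformly random order."""
    radius = rng.uniform(delta / 4, delta / 2)
    order = list(points)
    rng.shuffle(order)
    unassigned = set(points)
    clusters = []
    for center in order:
        # B(center, R) minus everything already claimed by earlier centers.
        ball = {y for y in unassigned if dist(center, y) <= radius}
        if ball:
            clusters.append(ball)
            unassigned -= ball
    return clusters

rng = random.Random(0)
pts = list(range(50))
line_dist = lambda a, b: abs(a - b)
clusters = ckr_partition(pts, line_dist, 8.0, rng)
print(len(clusters), max(max(c) - min(c) for c in clusters))
```

Every cluster is contained in a ball of radius at most $\Delta/2$, so its diameter is at most $\Delta$ by the triangle inequality, and every point is eventually claimed because it is the center of its own ball.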

Fix $r \leq \Delta/8$ and observe first that

\[ \mathbb{P}\left[B(x,r) \subseteq P(x) \mid \mathbf{R}\right] \geq \frac{|B(x,\mathbf{R}-r)|}{|B(x,\mathbf{R}+r)|}. \]

This follows because if we condition on $\mathbf{R}$, then only centers in $B(x,\mathbf{R}+r)$ can decide the fate of $B(x,r)$, and the corresponding cluster will contain all of $B(x,r)$ if the center lies in $B(x,\mathbf{R}-r)$.

Thus we have:

\begin{align*} \mathbb{P}\left[B(x,r) \subseteq P(x)\right] &\geq \mathbb{E}\left[\frac{|B(x,\mathbf{R}-r)|}{|B(x,\mathbf{R}+r)|}\right] \\ &= \mathbb{E}\left[\exp\left(- \log \frac{|B(x,\mathbf{R}+r)|}{|B(x,\mathbf{R}-r)|}\right)\right] \\ &\geq \exp \left(\mathbb{E}\left[- \log \frac{|B(x,\mathbf{R}+r)|}{|B(x,\mathbf{R}-r)|}\right]\right) \\ &\geq \exp\left(\frac{- 8 r}{\Delta} \log \frac{|B(x,\Delta)|}{|B(x,\Delta/8)|}\right)\,, \end{align*}

where the second inequality uses convexity of $e^{-x}$ and the last inequality comes from the calculation

\begin{align*} \mathbb{E}\left[ \log \frac{|B(x,\mathbf{R}+r)|}{|B(x,\mathbf{R}-r)|}\right] &= \frac{4}{\Delta} \int_{\Delta/4}^{\Delta/2} \log \frac{|B(x,R+r)|}{|B(x,R-r)|}\,dR \\ &\leq \frac{8r}{\Delta} \log \frac{|B(x,\Delta/2+r)|}{|B(x,\Delta/4-r)|} \\ &\leq \frac{8r}{\Delta} \log \frac{|B(x,\Delta)|}{|B(x,\Delta/8)|}\,, \end{align*}

using $r \leq \Delta/8$.

Historical remarks

The random embedding theorem (Theorem 1) is due to Fakcharoenphol, Rao, and Talwar. The first such result was proved by Bartal, and he later obtained a near-optimal bound of $O(\log n \log \log n)$ on the distortion. The use of random tree approximations of arbitrary metric spaces for online algorithms arose somewhat earlier in the work of Alon, Karp, Peleg, and West, specifically in relation to the $k$-server problem.

The metric Ramsey theorem (Theorem 2) is a result of Bartal, Linial, Mendel, and Naor, improving over an earlier bound of Bartal, Bollobas, and Mendel who established that one can take $|X'| \geq \exp(c \sqrt{\log n})$ for some $c > 0$.

The upper bound in Theorem 3 is from a forthcoming paper with Bubeck, Cohen, and Y. T. Lee, and improves the $O(\log n \log \log n)$ upper bound of Fiat and Mendel to the optimal value (up to a constant factor). The lower bound in Theorem 3 is from the aforementioned paper of Bartal, Bollobas, and Mendel; we will discuss it in the next lecture.

The random partitioning lemma and the proof of Theorem 2 presented here come from a paper of Mendel and Naor. This proof of the random partitioning lemma is somewhat cleaner than the one presented there. The (now famous) distribution on random partitions described in \eqref{eq:ckr} is from Calinescu, Karloff, and Rabani.

Metrical task systems on a weighted star

Mon, 09 Apr 2018 00:00:00 +0000

Let’s recall the definition of MTS. $\def\K{\mathsf{K}} \def\R{\mathbb{R}} \def\seteq{\mathrel{\vcenter{:}}=} \def\cE{\mathcal{E}} \def\argmin{\mathrm{argmin}} \def\llangle{\left\langle} \def\rrangle{\right\rangle} \def\1{\mathbf{1}} \def\e{\varepsilon} \def\Lip{\mathrm{Lip}}$ There is a metric space $(X,d)$. At each time, we receive a cost function $c_t : X \to \R_+$, and need to respond with a point $x_t \in X$. The cost we pay at time $t$ is the sum of the service cost and the movement cost:

\[ c_t(x_t) + d(x_{t-1},x_t). \]

Our goal is to be competitive against the best offline algorithm: It should hold that for every $T \geq 1$,

\[ \sum_{t=1}^T c_t(x_t) + d(x_{t-1},x_t) \leq \alpha \left(\sum_{t=1}^T c_t(x^*_t) + d(x^*_{t-1},x^*_t)\right) + C \]

where $C > 0$ is some constant independent of the cost sequence, and $\llangle x^*_t : t=0,1,\ldots,T\rrangle$ is some optimal offline sequence for $\llangle c_t : t=1,2,\ldots,T\rrangle$. Such an online algorithm is said to be $\alpha$-competitive.
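To make the two sides of this inequality concrete, here is a small sketch that evaluates the cost of a given trajectory and computes the offline optimum by dynamic programming over (time, state) pairs. The function names, the representation of cost functions as dictionaries, and the toy instance are assumptions for the example.

```python
def trajectory_cost(states, costs, d):
    """Service plus movement cost of states = [x_0, x_1, ..., x_T] against c_1, ..., c_T."""
    return sum(costs[t - 1][states[t]] + d(states[t - 1], states[t])
               for t in range(1, len(states)))

def offline_opt(x0, costs, d, X):
    """Optimal offline cost starting from x0, by dynamic programming."""
    # best[x] = cheapest cost of serving the requests so far and ending at x.
    best = {x: (0.0 if x == x0 else float("inf")) for x in X}
    for c in costs:
        best = {x: c[x] + min(best[y] + d(y, x) for y in X) for x in X}
    return min(best.values())

X = [0, 1, 2]
uniform = lambda a, b: 0 if a == b else 1
costs = [{0: 5, 1: 0, 2: 0}, {0: 0, 1: 5, 2: 0}]

print(trajectory_cost([0, 1, 0], costs, uniform))  # a greedy-looking trajectory
print(offline_opt(0, costs, uniform, X))           # best trajectory in hindsight
```

The dynamic program takes $O(T |X|^2)$ time, which is fine for small examples; the difficulty of MTS lies entirely in the online setting, where $c_{t+1}$ is unknown when $x_t$ is chosen.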

It is a long-standing conjecture that there is an $O(\log n)$-competitive randomized algorithm for MTS on every $n$-point metric space. Actually, it is conjectured that the competitive ratio is $\Theta(\log n)$ for every $n$-point metric space. We will see the known $\Omega\left(\frac{\log n}{\log \log n}\right)$ lower bound of Bartal, Bollobas, and Mendel in a few lectures.

Let us remark that a lower bound of $\Omega(\log n)$ for the $n$-point uniform metric is straightforward. At every point in time, the adversary chooses a uniformly random $z_t \in X$ and defines the cost function by $c_t(z_t)=+\infty$ and $c_t(z)=0$ for $z \neq z_t$. Clearly any online algorithm incurs movement cost $\asymp t/n$ in expectation after $t$ steps. On the other hand, an offline algorithm can break the request sequence into phases $0 = t_0 < t_1 < t_2 < \cdots$ such that the times $\{t_i\}$ are minimal subject to the constraint $\{1,2,\ldots,n\} = \{z_{t_i+1}, \ldots, z_{t_{i+1}}\}$. Clearly the offline algorithm only has to move once per phase, and by a standard coupon collector argument, the expected length of each phase is $\asymp n \log n$. Thus the offline algorithm only incurs movement cost $\asymp t/(n \log n)$.
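The phase length in this argument is a coupon collector time, whose expectation is exactly $n(1 + \tfrac12 + \cdots + \tfrac1n)$. A minimal sketch (function names are hypothetical) that computes this value and samples a phase:

```python
import random
from fractions import Fraction

def expected_phase_length(n):
    """Exact coupon-collector expectation n * (1 + 1/2 + ... + 1/n)."""
    return n * sum(Fraction(1, k) for k in range(1, n + 1))

def sample_phase_length(n, rng):
    """Number of uniform requests until every point of {0, ..., n-1} has been hit."""
    seen, steps = set(), 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        steps += 1
    return steps

rng = random.Random(1)
n = 100
samples = [sample_phase_length(n, rng) for _ in range(200)]
print(float(expected_phase_length(n)), sum(samples) / len(samples))
```

Comparing the online movement cost $\asymp t/n$ with the offline cost $\asymp t/(n \log n)$ gives the $\Omega(\log n)$ ratio.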

In this lecture, we’ll establish the conjectured upper bound for the special case when $(X,d)$ is the path metric on an $n$-point weighted star: $X=\{1,2,\ldots,n\}$ and $d(i,j)=w_i+w_j$ for some collection $w_1,w_2,\ldots,w_n > 0$ of positive weights. This will be a building block of our eventual algorithm for trees, which will then be used to obtain an $O((\log n)^2)$-competitive algorithm for any metric space.

The transportation distance

As in the first lecture, instead of a randomized strategy, we will play a fractional strategy: At every time $t$, we play a probability distribution $p_t : X \to [0,1]$. In this case, our service cost is $\sum_{x \in X} p_t(x) c_t(x)$, and our movement cost is the transportation distance $W_1(p_{t-1},p_t)$. Also known as the Earthmover distance, $W_1(p,q)$ is the cost of the minimal transport plan between probability distributions $p$ and $q$.

A primary reason for looking first at weighted star metrics is that their transportation distance can be described by a weighted $\ell_1$ norm: $W_1(p,q) = \|p-q\|_{\ell_1(w)},$ where

\[ \|v\|_{\ell_1(w)} = \sum_{x \in X} w_x |v(x)|. \]
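As a sanity check on this identity, the sketch below builds an explicit transport plan on a weighted star that ships surplus mass directly to deficit locations and compares its cost with the weighted $\ell_1$ norm. The function names and the toy data are assumptions for the example; the code exhibits a plan achieving the formula and does not by itself certify that no cheaper plan exists.

```python
from fractions import Fraction

def weighted_l1(p, q, w):
    """The norm of p - q in l_1(w)."""
    return sum(w[x] * abs(p[x] - q[x]) for x in range(len(w)))

def greedy_transport_cost(p, q, w):
    """Cost of shipping surplus of p to deficits of q, with d(i, j) = w[i] + w[j]."""
    surplus = [[x, p[x] - q[x]] for x in range(len(w)) if p[x] > q[x]]
    deficit = [[x, q[x] - p[x]] for x in range(len(w)) if q[x] > p[x]]
    cost, i, j = 0, 0, 0
    while i < len(surplus) and j < len(deficit):
        m = min(surplus[i][1], deficit[j][1])  # mass moved from surplus[i] to deficit[j]
        cost += m * (w[surplus[i][0]] + w[deficit[j][0]])
        surplus[i][1] -= m
        deficit[j][1] -= m
        if surplus[i][1] == 0:
            i += 1
        if deficit[j][1] == 0:
            j += 1
    return cost

w = [1, 2, 3]
p = [Fraction(1, 2), Fraction(1, 2), Fraction(0)]
q = [Fraction(0), Fraction(1, 4), Fraction(3, 4)]
print(weighted_l1(p, q, w), greedy_transport_cost(p, q, w))
```

The matching lower bound comes from observing that every unit of mass leaving $i$ pays at least $w_i$ and every unit arriving at $j$ pays at least $w_j$.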

The online algorithm

As in the first lecture, we will design our algorithm in continuous time (and this is without loss of generality). Given some continuous trajectory of cost functions $c_t : X \to \R_+$ with $t \geq 0$, our algorithm will be a trajectory $p_t$ of distributions, and our instantaneous cost is described by

\[ \|\partial_t p_t\|_{\ell_1(w)} + \langle c_t,p_t\rangle, \]

where $\langle f,g\rangle = \sum_{x \in X} f(x) g(x)$.

We will design the algorithm in the mirror descent framework of the previous lecture. The underlying convex body will be the probability simplex on $X$:

\[ \K \seteq \left\{ p \in [0,1]^X : \sum_{x \in X} p(x)=1\right\}. \]

The second object we need is the mirror map $\Phi : \K \to \R$. Our control will be $F(t,\cdot)=- c_t$. Recall from the preceding lecture that if we do continuous-time mirror descent on $\K$ using $\Phi$, it will hold that the corresponding Bregman divergence $D_{\Phi}$ acts as a Lyapunov functional: For every fixed distribution $q \in \K$:

\begin{equation}\label{eq:lya} \partial_t D_{\Phi}(q; p_t) \leq \langle c_t, q-p_t\rangle. \end{equation}

In other words, if our algorithm is paying more cost than $q$, then we are getting “closer” to $q$ in the Bregman distance.

This allows us to control our service cost against a fixed target $q \in \K$. We need to choose $\Phi$ with two additional things in mind: We also want the movement cost to be controlled, and we want to compare ourselves to a possibly moving target $q_t^* = \1_{x_t^*}$.

Controlling the movement cost by the service cost

In order to control the movement cost, we will simply try to make it comparable to the current instantaneous service cost. Let us recall the trajectory of mirror descent from the preceding lecture:

\begin{equation}\label{eq:evo} \partial_t p_t = (\nabla^2 \Phi(p_t))^{-1} \left(- c_t - \lambda_t\right), \end{equation}

where $\lambda_t \in N_{\K}(p_t)$. Recall that $\lambda_t$ represents the set of normal forces that are constraining us to lie in $\K$.

Let’s ignore $\lambda_t$ for the moment, and calculate the instantaneous movement cost induced by the cost function alone:

\begin{equation}\label{eq:compare} \|\partial_t p_t\|_{\ell_1(w)} = \sum_{x \in X} w_x \left((\nabla^2 \Phi(p_t))^{-1} c_t\right)(x). \end{equation}

Thus if we want $\|\partial_t p_t\|_{\ell_1(w)} \asymp \langle c_t,p_t\rangle$, it makes sense to choose $\Phi$ to be a weighted entropy:

\[ \Phi_w(p) \seteq \sum_{x \in X} w_x p(x) \log p(x). \]

In this case, the Hessian $\nabla^2 \Phi_w(p)$ is a diagonal matrix with:

\[ \left(\nabla^2 \Phi_w(p)\right)_{x,x} = \frac{w_x}{p(x)}, \]

and \eqref{eq:compare} becomes

\[ \|\partial_t p_t\|_{\ell_1(w)} = \sum_{x \in X} p_t(x) c_t(x), \]

as desired.
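For completeness, here is the short computation behind this identity. Differentiating $\Phi_w$ coordinate-wise gives

\[ \frac{\partial \Phi_w}{\partial p(x)} = w_x \left(1 + \log p(x)\right), \qquad \frac{\partial^2 \Phi_w}{\partial p(x)^2} = \frac{w_x}{p(x)}, \]

with all mixed second derivatives equal to zero. Inverting the diagonal Hessian therefore yields

\[ \left((\nabla^2 \Phi_w(p_t))^{-1} c_t\right)(x) = \frac{p_t(x)\, c_t(x)}{w_x}, \qquad \text{so} \qquad \sum_{x \in X} w_x \cdot \frac{p_t(x)\, c_t(x)}{w_x} = \langle c_t, p_t\rangle\,. \]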

Tracking a moving target

The second thing we need to address is that \eqref{eq:lya} compares our service cost against that of a fixed distribution $q$. In order to analyze how we track a moving target, let’s take a step back and compute $\partial_t D_{\Phi}(z_t; y)$ for a general strictly convex function $\Phi$. Recall that

\[ D_{\Phi}(z;y) = \Phi(z) - \Phi(y) - \langle \nabla \Phi(y), z-y\rangle, \]

and therefore

\begin{align} \partial_t D_{\Phi}(z_t; y) &= \langle \nabla \Phi(z_t) - \nabla \Phi(y), \partial_t z_t \rangle \nonumber \\ &\leq \left(\|\nabla \Phi(y)\|_* + \|\nabla \Phi(z_t)\|_*\right) \|\partial_t z_t\| \nonumber \\ &\leq 2 \Lip_{\K, \|\cdot\|_*} (\Phi) \cdot \|\partial_t z_t\|.\label{eq:move} \end{align}

The first inequality holds for any norm $\|\cdot\|$, where $\|\cdot\|_*$ denotes the dual norm, and we write

\[ \Lip_{\K,\|\cdot\|_*}(f) = \sup_{z \in \K} \|\nabla f(z)\|_*. \]

Returning from the abstract to our current setting, it makes sense to choose $\|\cdot\| = \|\cdot\|_{\ell_1(w)}$, and then \eqref{eq:move} exactly says that when a point moves away from us in the Bregman divergence, it must pay for this with its own movement cost! Using the chain rule, we now have

\begin{equation}\label{eq:breg2} \partial_t D_{\Phi}(q_t; p_t) \leq \langle c_t, q_t-p_t\rangle + 2 \Lip_{\K, \|\cdot\|_{*}} (\Phi) \cdot \|\partial_t q_t\|_{\ell_1(w)} \end{equation}

This looks great, except that we need to calculate the Lipschitz constant of $\Phi$. If we try this for the weighted entropy $\Phi_w$, we immediately run into a problem: Since $(\nabla \Phi_w(p))(x) = w_x (1+\log p(x))$, the $\ell_{\infty}$ norm blows up as $p(x) \to 0$. This is a manifestation of the fact, demonstrated in the first lecture, that exponential weights is too conservative to be competitive against a moving target.

The exploration shift, revisited

To fix this problem, we will consider the shifted entropy:

\[ \Phi \seteq \Phi_{w,\delta}(p) \seteq \sum_{x \in X} w_x (p(x) + \delta) \log (p(x)+\delta). \]

Observe that now:

\[ \Lip_{\K,\|\cdot\|_*} (\Phi) \leq \log (1/\delta). \]
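The bound above can be checked numerically. The following sketch assumes that the norm dual to $\|\cdot\|_{\ell_1(w)}$ is $\max_x |g(x)|/w_x$, so that the weights cancel against those in $\nabla \Phi_{w,\delta}$. The function name and test points are illustrative, and the comparison with $\log(1/\delta)$ is only evaluated for one small value of $\delta$.

```python
import math


def dual_norm_of_gradient(p, delta):
    """Dual norm of the gradient of the shifted weighted entropy at p.

    The gradient has coordinates w_x * (1 + log(p(x) + delta)); the norm dual
    to the weighted l1 norm divides coordinate x by w_x, so the weights cancel.
    Setting delta = 0 recovers the unshifted entropy (requires p(x) > 0).
    """
    return max(abs(1.0 + math.log(px + delta)) for px in p)


# At a vertex of the simplex, with shift delta = 0.01 (illustrative value).
shifted = dual_norm_of_gradient([1.0, 0.0, 0.0], 0.01)

# Without the shift, a coordinate close to 0 makes the norm large.
unshifted = dual_norm_of_gradient([1.0 - 1e-9, 1e-9], 0.0)
```

Here `shifted` stays below $\log(1/\delta) \approx 4.6$, whereas `unshifted` grows like $\log(1/p(x))$ as $p(x) \to 0$.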

Plugging this into \eqref{eq:breg2} gives

\begin{equation}\label{eq:move2} \partial_t D_{\Phi}(q_t; p_t) \leq \langle c_t, q_t-p_t\rangle + 2 \log(1/\delta) \|\partial_t q_t\|_{\ell_1(w)}. \end{equation}

Note that rearranging yields

\begin{equation}\label{eq:anal} \langle c_t, p_t \rangle \leq \langle c_t,q_t\rangle + 2 \log(1/\delta) \|\partial_t q_t\|_{\ell_1(w)} - \partial_t D_{\Phi}(q_t;p_t). \end{equation}

If $q_0=p_0$, then $D_{\Phi}(q_0;p_0) = 0$ and $D_{\Phi} \geq 0$ always, so integrating would seem to give us an $O(\log (1/\delta))$-competitive algorithm.

But this is only true if $\|\partial_t p_t\|_{\ell_1(w)} \asymp \langle c_t, p_t\rangle$, and our previous calculations toward this end do not necessarily hold once we incorporate the shift by $\delta$. This will be the fundamental tension in the course: Making $\delta$ larger encourages “exploration” that is necessary to respond quickly enough to changes in the cost function. But exploration comes at the cost of movement.

In the setting where $\K$ is the simplex, it is possible to control things by hand, but for more complicated convex bodies, the normal forces described by $\lambda_t$ in \eqref{eq:evo} will be substantially more complicated. In the next lecture, we will use this approach to derive an $O(\log k)$-competitive algorithm for the weighted $k$-paging problem, and we will implement a different strategy for the exploration shift.

Analyzing the movement of $p_t$

So now let us describe the algorithm $p_t$ using our shifted entropy map and \eqref{eq:evo}:

\begin{equation}\label{eq:evo2} \partial_t p_t(x) = \frac{p_t(x)+\delta}{w_x} \left(-c_t(x) - \lambda_t(x)\right). \end{equation}

Toward this end, let us write

\[ \lambda_t = \xi_t + \mu_t, \]

where $\xi_t : X \to \R_+$ are the multipliers corresponding to the positivity constraints $p(x) \geq 0$ and $\mu_t : X \to \R$ is the multiplier corresponding to the constraint $\sum_{x \in X} p(x) = 1$. Note that this decomposition is not necessarily unique, but what we want is that the complementary slackness conditions hold: $\xi_t(x) > 0 \implies p_t(x)=0$. (See the discussion of the normal cone for a polytope from the preceding lecture.)

As in the first lecture, it will suffice to bound only one direction of the movement. In this case, we will consider the negative coordinates in $\partial_t p_t$. It is a straightforward calculation to check that $\mu_t \leq 0$ by summing over $x$ in \eqref{eq:evo2} (we are reducing the probability mass in response to a cost function, so $\mu_t$ is keeping the total probability mass from dropping). Therefore \eqref{eq:evo2} gives

\[ \left\|\left(\partial_t p_t\right)_-\right\|_{\ell_1(w)} \leq \llangle p_t+\delta \1, c_t + \xi_t\rrangle. \]

Note first that $\langle p_t, c_t+\xi_t\rangle = \langle p_t, c_t\rangle$ by complementary slackness.

Finally, consider any fixed $r_0 \in \K$ and use $\langle \mu_t, r_0- p_t\rangle=0$ to write

\[ \langle c_t + \xi_t, r_0 - p_t\rangle = \langle c_t + \lambda_t, r_0 - p_t\rangle = \llangle \nabla^2 \Phi(p_t) \partial_t p_t, p_t-r_0 \rrangle = \partial_t D_{\Phi}(r_0; p_t). \]

Putting this all together and using $r_0 = \frac{1}{n} \1$ yields

\[ \left\|\left(\partial_t p_t\right)_-\right\|_{\ell_1(w)} \leq \left((1+\delta n) \langle c_t,p_t\rangle + \delta n \partial_t D_{\Phi}(\tfrac{1}{n} \1; p_t)\right). \]

Thus if we set $\delta = 1/n$, our movement cost becomes proportional to our service cost, and \eqref{eq:anal} shows that our algorithm is $O(\log n)$-competitive.

Navigating a convex body online

Fri, 06 Apr 2018 00:00:00 +0000

In the last lecture, we saw some algorithms that, while simple and appealing, were somewhat unmotivated. We now try to derive them from general principles, and in a setting that will allow us to attack other problems in competitive analysis. $\def\K{\mathsf{K}} \def\R{\mathbb{R}} \def\seteq{\mathrel{\vcenter{:}}=} \def\cE{\mathcal{E}} \def\argmin{\mathrm{argmin}} \def\llangle{\left\langle} \def\rrangle{\right\rangle} \def\1{\mathbf{1}} \def\e{\varepsilon}$

Gradient descent: The proximal view

Let us first recall the upper bound we derived for the regret in the last lecture:

\begin{equation}\label{eq:regret} R_T \leq \sum_{t=1}^T \left[ \|p_t-p_{t+1}\|_1 + \left\langle p_{t+1}, \ell_t - \ell_t(x_T^*) \right\rangle\right]. \end{equation}

Trying to minimize this expression leads to the question of how we should update our probability distribution $p_t \to p_{t+1}$ to simultaneously be stable (control the first term) and competitive (the second term).

A very natural algorithm in this setting is gradient descent. Indeed, suppose that $\ell : \R^n \to \R$ is differentiable, and consider the optimization

\[ \min \left\{ \frac12 \|x-x_0\|^2 + \eta \ell(x) : x \in \R^n \right\}, \]

where $\eta > 0$ is a small constant and $\|\cdot\|$ denotes the Euclidean norm. Then first-order optimality conditions dictate that the optimizer satisfies \[ x^* = x_0 - \eta \nabla \ell(x_0) + O(\eta^2)\,. \]
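To see the $O(\eta^2)$ error concretely, consider a one-dimensional sketch with the quadratic loss $\ell(x) = \frac{a}{2} x^2$, for which the proximal problem can be solved in closed form. The loss, the function names, and the parameter values are assumptions made only for this illustration.

```python
def proximal_step(x0, eta, a):
    """Exact minimizer of 0.5*(x - x0)**2 + eta*(a/2)*x**2 over the real line.

    Setting the derivative (x - x0) + eta*a*x to zero gives x = x0 / (1 + eta*a).
    """
    return x0 / (1.0 + eta * a)


def gradient_step(x0, eta, a):
    """Explicit gradient step x0 - eta * l'(x0) for l(x) = (a/2) * x**2."""
    return x0 - eta * a * x0


def gap(x0, eta, a):
    """Distance between the proximal point and the gradient step."""
    return abs(proximal_step(x0, eta, a) - gradient_step(x0, eta, a))
```

For this loss the gap equals $x_0 \eta^2 a^2/(1+\eta a)$, so shrinking $\eta$ by a factor of $10$ shrinks the gap by roughly a factor of $100$.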

Two questions immediately arise: Why do we use the Euclidean norm when our reference problem \eqref{eq:regret} refers to the $\ell_1$ norm, and if $x$ is meant to encode a probability distribution, how do we maintain this constraint for $x^*$?

Projected gradient descent

Let’s address the feasibility problem first. Suppose $\K \subseteq \R^n$ is a closed convex set and $F : \R^n \to \R^n$ is a sufficiently smooth vector field (think of $F = -\nabla \ell$). How should we move in the direction of $F$ while simultaneously remaining inside $\K$?

The unconstrained flow along $F$ can be described as a trajectory $x : [0,\infty) \to \R^n$ given by \[ x'(t) = F(x(t))\,. \] The most natural way to keep this flow inside $\K$ is to project back into the body whenever we leave. Define the Euclidean projection

\[ P_{\K}(y) \seteq \argmin \left\{ \|y-z\|^2 : z \in \K \right\}, \]

and the result of taking an infinitesimal step in direction $v$ and then projecting: \[ \Pi_{\K}(x,v) \seteq \lim_{\e \to 0} \frac{P_{\K}(x+\e v) -x}{\e}\,. \] Then the projected dynamics looks like \[ x'(t) = \Pi_{\K} \left(x(t), F(x(t))\right)\,. \] This is an example of a projected dynamical system. Having now addressed feasibility, we are left to consider the role of the Euclidean norm.
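For a concrete feel for $P_{\K}$ and $\Pi_{\K}$, here is a small sketch for the special case where $\K$ is the probability simplex. It uses the standard sort-based projection onto the simplex and approximates the limit by a finite difference with a small positive $\e$. The function names and the choice `eps=1e-6` are assumptions of this illustration.

```python
def project_to_simplex(y):
    """Euclidean projection of y onto {z >= 0, sum(z) = 1} (sort-based method)."""
    u = sorted(y, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, uj in enumerate(u, start=1):
        cumulative += uj
        candidate = (cumulative - 1.0) / j
        # Keep the threshold from the last index where the test succeeds.
        if uj - candidate > 0:
            theta = candidate
    return [max(yi - theta, 0.0) for yi in y]


def projected_direction(x, v, eps=1e-6):
    """Finite-difference approximation of Pi_K(x, v) when K is the simplex."""
    moved = project_to_simplex([xi + eps * vi for xi, vi in zip(x, v)])
    return [(mi - xi) / eps for mi, xi in zip(moved, x)]


# In the interior, the projection only removes the component along (1,...,1).
interior = projected_direction([0.5, 0.3, 0.2], [1.0, 0.0, 0.0])

# At a vertex, a push out of the body through a facet is cancelled entirely.
boundary = projected_direction([1.0, 0.0, 0.0], [0.0, -1.0, 0.0])
```

In the first case the result is approximately $(2/3, -1/3, -1/3)$; in the second it is the zero vector, since the step tries to make a coordinate negative.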

A Riemannian version

One can view $\Pi_{\K}(x, \cdot)$ as a function on the tangent space at $x$. To specify such a projection, we only need a local Euclidean structure. An inner product $\langle \cdot,\cdot\rangle_x$ that varies smoothly over $x \in \K$ is precisely a Riemannian metric.

Equivalently, we specify at every point $x \in \K$, a smoothly varying positive-definite matrix $M(x)$ so that

\begin{align*} \langle u,v\rangle_{M,x} &= \langle u, M(x) v\rangle \\ \|u\|^2_{M,x} &= \langle u,u\rangle_{M,x}. \end{align*}

The associated projection operator is then given by

\begin{align*} P_{\K}^M(y; x) &\seteq \argmin \left\{ \left\|y-z\right\|_{M,x}^2 : z \in \K \right\} \\ \Pi_{\K}^M(x,v) &\seteq \lim_{\e \to 0} \frac{P_{\K}^M(x+\e v;x)-x}{\e}\,. \end{align*}

This leads to the dynamical system:

\begin{align*} x'(t) &= \Pi^M_{\K}\left(x(t),F(x(t))\right) \\ x(0) &= x_0 \in \K\,. \end{align*}

Lyapunov functions

The problem with stating things at this level of generality is that even when $F = -\nabla \ell$ is the negative gradient of a convex function $\ell : \R^n \to \R$, we don’t have a global way of controlling convergence of $\ell(x(t))$ to $\min \{ \ell(x) : x \in \K \}$. In the Euclidean setting ($M(x) \equiv \mathrm{Id}$), there is a natural Lyapunov function: If $\ell$ is convex and $\ell(x^*) = \min \{ \ell(x) : x \in \K \}$, then for every $x \in \K$:

\[ \langle - \nabla \ell(x), x^* - x\rangle \geq 0\,. \]

In other words, gradient descent always makes progress toward $x^*$.

If $x'(t) = \Pi_{\K}\left(x(t), -\nabla \ell(x(t))\right)$, then in the language of competitive analysis, the quantity $\frac12 \|x(t)-x^*\|^2$ acts as a potential function (a global measure of progress).

We will consider geometries that come equipped with such a Lyapunov function. In a sense that can be formalized in various ways, these are the Hessian structures on $\R^n$, i.e., those arising when $M(x) = \nabla^2 \Phi(x)$ for some strictly convex function $\Phi : \K \to \R$.

Mirror descent dynamics

Consider now a compact, convex set $\K \subseteq \R^n$, a strictly convex function $\Phi : \K \to \R$, and a continuous time-varying vector field $F : [0,\infty) \times \K \to \R^n$. We will refer to continuous-time mirror descent as the dynamics specified by

\begin{align*} x'(t) &= \Pi_{\K}^{\nabla^2 \Phi}\left(\vphantom{\bigoplus} x(t), F(t, x(t))\right) \\ x(0) &= x_0 \in \K. \end{align*}

We will sometimes refer to $\Phi$ as the mirror map.

As one might expect, we can decompose $x'(t)$ into two components: One flowing in the direction $F(t,x(t))$, and the other component arising from the normal forces that are keeping $x(t)$ inside $\K$. We recall the normal cone to $\K$ at $x$ is given by

\[ N_{\K}(x) = \left\{ p \in \R^n : \langle p,y -x \rangle \leq 0 \textrm{ for all } y \in \K \right\}. \]

This is the set of directions that point out of the body $\K$. The next theorem is proved in the paper k-server via multiscale entropic regularization.

If $\nabla^2 \Phi(x)^{-1}$ is continuous on $\K$, then for any $x_0 \in \K$, there is an absolutely continuous trajectory $x : [0,\infty) \to \K$ satisfying \begin{align} \nabla^2 \Phi(x(t)) x'(t) &\in F(t,x(t)) - N_{\K}(x(t)), \label{eq:inclusion}\\ x(0) &= x_0.\nonumber \end{align} Moreover, if $\nabla^2 \Phi(x)$ is Lipschitz on $\K$ and $F$ is locally Lipschitz, then the solution is unique.

Note that \eqref{eq:inclusion} is a differential inclusion: We only require that the derivative lies in the specified set.

Lagrangian multipliers

If $\K$ is a polyhedron, then one can write

\begin{equation}\label{eq:polyhedron} \K = \{ x \in \R^n : Ax \leq b \}, \qquad A \in \R^{m \times n}, b \in \R^m\,. \end{equation}

In this case, the normal cone at $x$ is the cone spanned by the normals of the tight constraints at $x$:

\begin{equation}\label{eq:cone-poly} N_{\K}(x) = \left\{ A^T y : y \geq 0 \textrm{ and } y^T(b-Ax)=0 \right\}. \end{equation}
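The description of the normal cone above can be sketched in code: the generators of $N_{\K}(x)$ are exactly the rows of $A$ whose constraints are tight at $x$. The function name, the tolerance `tol`, and the unit-square example below are hypothetical choices for illustration.

```python
def tight_constraint_normals(A, b, x, tol=1e-9):
    """Return the rows A_i with <A_i, x> = b_i (within tol).

    By the formula for the normal cone of a polyhedron, N_K(x) is the set of
    nonnegative combinations of these rows.
    """
    normals = []
    for row, bi in zip(A, b):
        slack = bi - sum(aij * xj for aij, xj in zip(row, x))
        if slack < -tol:
            raise ValueError("x violates a constraint, so it is not in K")
        if slack <= tol:
            normals.append(row)
    return normals


# The unit square {0 <= x_1 <= 1, 0 <= x_2 <= 1} written as Ax <= b.
A = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b = [1.0, 1.0, 0.0, 0.0]
```

At an interior point the list is empty (the normal cone is $\{0\}$), on an edge it contains one row, and at a corner it contains two.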

Consider now the application of Theorem MD to a polyhedron and a solution $x : [0,\infty) \to \K$, $\lambda : [0,\infty) \to \R^n$ such that

\begin{equation}\label{eq:traj} \nabla^2 \Phi(x(t)) x'(t) = F(t,x(t)) - \lambda(t), \end{equation}

and $\lambda(t) \in N_{\K}(x(t))$ for $t \geq 0$.

Let us consider the dual variables to the constraints \eqref{eq:polyhedron}: We can fix a measurable $\hat{\lambda} : [0,\infty) \to \R^m_+$ such that \[ A^T \hat{\lambda}(t) = \lambda(t), \quad t \geq 0. \] Now \eqref{eq:cone-poly} and $\lambda(t) \in N_{\K}(x(t))$ yield the complementary-slackness conditions: For all $i=1,2,\ldots,m$ and $t \geq 0$: \[ \hat{\lambda}_i(t) > 0 \implies \langle A_i,x(t)\rangle = b_i, \] where $A_i$ is the $i$th row of $A$.

The Bregman divergence as a Lyapunov function

We promised earlier the existence of a functional to control the dynamics, and this is provided by the Bregman divergence associated to $\Phi$:

\[ D_{\Phi}(y; x) \seteq \Phi(y) - \Phi(x) - \langle \nabla \Phi(x), y-x\rangle\,. \]
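As a familiar special case, when $\Phi$ is the (unshifted) negative entropy and $x,y$ are strictly positive probability vectors, $D_{\Phi}(y;x)$ is the relative entropy of $y$ with respect to $x$. The sketch below checks this numerically; the function names and test vectors are illustrative assumptions.

```python
import math


def neg_entropy(x):
    """Phi(x) = sum_i x_i log x_i, for strictly positive x."""
    return sum(xi * math.log(xi) for xi in x)


def grad_neg_entropy(x):
    """Gradient of the negative entropy: coordinates 1 + log x_i."""
    return [1.0 + math.log(xi) for xi in x]


def bregman(phi, grad_phi, y, x):
    """D_Phi(y; x) = Phi(y) - Phi(x) - <grad Phi(x), y - x>."""
    g = grad_phi(x)
    return phi(y) - phi(x) - sum(gi * (yi - xi) for gi, yi, xi in zip(g, y, x))


def relative_entropy(y, x):
    """sum_i y_i log(y_i / x_i)."""
    return sum(yi * math.log(yi / xi) for yi, xi in zip(y, x))
```

The linear terms cancel because both vectors sum to one, which is why the two quantities agree on the simplex.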

Let $x(t)$ be a trajectory satisfying \eqref{eq:traj}. Then for any $y \in \K$:

\begin{align} \partial_t D_{\Phi}(y; x(t)) &= - \langle \nabla \Phi(x(t)), x'(t)\rangle + \langle \nabla \Phi(x(t)), x'(t) \rangle -\langle \partial_t \nabla \Phi(x(t)), y-x(t)\rangle \nonumber \\ &= - \langle \nabla^2 \Phi(x(t)) x'(t), y - x(t) \rangle \nonumber \\ &= - \langle F(t,x(t)) - \lambda(t), y-x(t)\rangle \nonumber \\ &\leq - \langle F(t,x(t)), y-x(t)\rangle \label{eq:div}\,, \end{align}

where the last inequality used that $y \in \K$ and $\lambda(t) \in N_{\K}(x(t))$.

If $F(t,x(t)) = - c(t)$ is a cost function, say, then this inequality aligns with a goal stated at the beginning of the first lecture: As long as the algorithm $x(t)$ is suffering more cost than some feasible point $y \in \K$, we would like to be “learning” about $y$.

The algorithm from last time

In the next lecture, we will use this framework to derive and analyze algorithms for metrical task systems (MTS) and the $k$-server problem. For now, let us show that the algorithm and analysis from last time (for MTS on uniform metrics) fit precisely into our framework.

Suppose that $\K = \{ x \in \R_+^n : \sum_{i=1}^n x_i = 1 \}$ is the probability simplex and

\[ \Phi(x) = \sum_{i=1}^n (x_i+\delta) \log (x_i+\delta) \]

is the (negative) entropy with some shift by $\delta > 0$. In the next lecture, we will see why the negative entropy arises naturally as a mirror map.

Then $\nabla^2 \Phi(x)$ is a diagonal matrix with $\left(\nabla^2 \Phi(x)\right)_{ii} = \frac{1}{x_i+\delta}$. Let $F(t,\cdot) = -c(t)$ be a time-varying cost vector with $c(t) \geq 0$.

Therefore \eqref{eq:traj} gives

\begin{equation}\label{eq:shifted} x_i'(t) = (x_i(t)+\delta) \left(-c_i(t) + \hat{\mu}(t) - \hat{\lambda}_i(t)\right). \end{equation} Here, $\hat{\lambda}_i(t)$ is the Lagrangian multiplier corresponding to the constraint $x_i \geq 0$, and $\hat{\mu}(t)$ is the multiplier corresponding to $\sum_{i=1}^n x_i = 1$.

This is precisely the algorithm described before (as an exercise, one might try rewriting it to match exactly), and \eqref{eq:div} constitutes half of the analysis. In the next lecture, we will discuss some general methods for the other half: Tracking the movement cost.

Regret minimization and competitive analysis

Sun, 01 Apr 2018 00:00:00 +0000

These are notes for the first lecture of a course I am co-teaching with Seb Bubeck on Competitive analysis via convex optimization.

I want to set the groundwork by reviewing the bandits model in online learning and the standard exponential weights algorithm, and then trying to extend it to the setting of competitive analysis where the analogous problem goes by the name of metrical task systems. Seeing where things go wrong will give us a plan to follow for much of the course. Some of the objects and algorithms in this lecture may appear unmotivated; that will be addressed soon.

Regret minimization

Here we quickly review the basic multi-armed bandits model in online learning. For more background see, e.g., this survey of Bubeck and Cesa-Bianchi. $\def\seteq{\mathrel{\vcenter{:}}=} \def\cE{\mathcal{E}} \def\R{\mathbb{R}} \def\argmin{\mathrm{argmin}} \def\llangle{\left\langle} \def\rrangle{\right\rangle} \def\1{\mathbf{1}} \def\e{\varepsilon}$

In the standard bandits model, we have a set of experts $\cE = \{1,2,\ldots,N\}$, and loss functions $\ell_1,\ell_2,\ldots$ arriving over time, where $\ell_t : \cE \to [0,1]$.

For the sake of simplicity, we will work in the full information model. At time $t \geq 1$, we have seen $\ell_1,\ldots,\ell_{t-1}$. We choose some strategy $x_t \in \cE$ and incur cost $\ell_t(x_t)$. Our goal is to minimize the total cost incurred: $\sum_{t=1}^T \ell_t(x_t)$. In the setting of adversarial bandits, we will allow ourselves to employ a randomized strategy, playing instead a distribution $p_t : \cE \to [0,1]$ at time $t$.

In the regret minimization framework, we compare our expected loss to that of the optimal fixed strategy: The regret incurred up to time $T$ is

$$R_T \seteq \sum_{t=1}^T \langle p_t, \ell_t\rangle - \min_{x \in \cE} \sum_{t=1}^T \ell_t(x)\,,$$

where for $f,g : \cE \to \R$, we write $\langle f,g\rangle = \sum_{y \in \cE} f(y) g(y)$.

We will bound the regret in two steps. Denote $x_T^* \seteq \argmin_{x \in \cE} \sum_{t=1}^T \ell_t(x)$. Then:

\begin{align} R_T &= \sum_{t=1}^T \langle p_t - p_{t+1},\ell_t\rangle + \sum_{t=1}^T \langle p_{t+1}, \ell_t-\ell_t(x_T^*)\rangle \nonumber \\ &\leq \sum_{t=1}^T \|p_t-p_{t+1}\|_1 + \sum_{t=1}^T \langle p_{t+1}, \ell_t-\ell_t(x_T^*)\rangle,\label{eq:pseudo-regret} \end{align}

where the inequality uses that the losses lie in $[0,1]$.

Continuous-time dynamics

For a number of reasons, the analysis will be much cleaner in continuous time. We can think about the losses as a trajectory $\left\{ \ell_t : \cE \to [0,1] \mid t \geq 0 \right\}$, where the instantaneous loss incurred is $\ell_t\,dt$. As long as we are bounding the expression \eqref{eq:pseudo-regret}, our continuous-time model will be more general than the discrete-time model. The latter can be recovered by considering a trajectory $\{ \ell_t : t \geq 0 \}$ that is piecewise-constant. If we were bounding the regret itself, the continuous-time model would have a time advantage, but once we shift time by one as in \eqref{eq:pseudo-regret}, this advantage disappears.

Now we can analogously bound the regret:

\begin{equation}\label{eq:cont-regret} R_T \leq \int_0^T \|\partial_t p_t\|_1 \,dt + \int_0^T \langle p_t, \ell_t-\ell_t(x_T^*)\rangle\,dt\,, \end{equation}

where $x_T^* \seteq \argmin_{x \in \cE} \int_0^T \ell_t(x)\,dt$.

We will start with $p_0(x) = 1/N$ for every $x \in \cE$, and employ the following exponential-weights strategy for updating $p_t$: \begin{equation}\label{eq:dynamics} \partial_t \log p_t(x) = \eta\left( - \ell_t(x) + \langle p_t, \ell_t\rangle\right), \end{equation} where $\eta > 0$ is a parameter called the learning rate that we will choose soon. This algorithm will be motivated more in the next lecture, but for now one can note that we are simply doing continuous-time gradient descent on the vector $\log p_t(x)$: We are moving in the direction $- \eta \ell_t dt$. The additional additive term in \eqref{eq:dynamics} is there to maintain the constraint that $p_t$ is a probability distribution.

One can see this clearly by rewriting \eqref{eq:dynamics} as: \begin{equation}\label{eq:exp-form} \partial_t p_t(x) = p_t(x) \, \eta \left(- \ell_t(x) + \langle p_t,\ell_t\rangle\right), \end{equation} so that $\sum_{x \in \cE} \partial_t p_t(x) = 0$.
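A discretized version makes the mass conservation easy to check. The sketch below takes explicit Euler steps of the last display; the step size `dt`, the function name, and the loss vector used in the check are assumptions of this illustration, and the discretization is only a stand-in for the continuous-time dynamics.

```python
def exp_weights_step(p, loss, eta, dt):
    """One explicit Euler step of dp(x)/dt = eta * p(x) * (-loss(x) + <p, loss>)."""
    avg = sum(px * lx for px, lx in zip(p, loss))
    return [px + dt * eta * px * (avg - lx) for px, lx in zip(p, loss)]


def run_exp_weights(p0, loss, eta, dt, steps):
    """Iterate the Euler step for a loss vector that is constant in time."""
    p = list(p0)
    for _ in range(steps):
        p = exp_weights_step(p, loss, eta, dt)
    return p
```

Since the increments sum to $\eta\,dt \left(\langle p_t,\ell_t\rangle \sum_x p_t(x) - \langle p_t,\ell_t\rangle\right) = 0$ whenever $\sum_x p_t(x) = 1$, the total mass is preserved by each step up to rounding.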

The Lyapunov functional, aka the potential

In order to bound the regret $R_T$, we will employ the philosophy underlying the entire course: If we are incurring more cost than $x_T^*$, we would like to be learning about $x_T^*$ in a suitable sense.

Define: \[ D(x; p) \seteq - \log p(x)\,. \] Note that for any $x \in \cE$, we have $D(x; p_0) = \log N$, and $D(x; p_t) \geq 0$ for all $t \geq 0$.

For any $x \in \cE$: \begin{equation}\label{eq:lya} \partial_t D(x;p_t) = - \partial_t \log p_t(x) = \eta \left(\ell_t(x) - \langle p_t,\ell_t\rangle \right). \end{equation} If we think of $D(x;p_t)$ as the “distance” from $p_t$ to $x$, then our distance to $x$ is decreasing in proportion to the advantage $x$ has over our strategy $p_t$. Integrating \eqref{eq:lya} over $[0,T]$ gives \[ D(x; p_0) - D(x; p_T) = \eta \int_0^T \langle p_t, \ell_t-\ell_t(x)\rangle\,dt\,. \]

Applying this with $x=x_T^*$ and recalling \eqref{eq:cont-regret}, we have

\begin{align*} R_T &\leq \int_0^T \|\partial_t p_t\|_1\,dt + \frac{D(x_T^*;p_0)-D(x_T^*;p_T)}{\eta} \\ &\leq \int_0^T \|\partial_t p_t\|_1\,dt + \frac{\log N}{\eta} \\ &\leq \eta T + \frac{\log N}{\eta}\,, \end{align*}

where the last inequality uses \eqref{eq:exp-form} and the fact that the losses are in $[0,1]$ to write $\|\partial_t p_t\|_1 \leq \eta \|p_t\|_1 = \eta$.

Setting $\eta = \sqrt{\frac{\log N}{T}}$ yields the standard regret bound \[ R_T \leq 2 \sqrt{T \log N}\,. \] If one does not know the final time $T$, it is not difficult to see that one can choose a time-dependent learning rate $\eta=\eta(t)=\sqrt{\frac{\log N}{t}}$ to obtain a similar result.
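The choice of $\eta$ simply balances the two terms of the bound $\eta T + \frac{\log N}{\eta}$. A tiny sketch, with hypothetical function names and sample values of $T$ and $N$, confirms the arithmetic:

```python
import math


def regret_bound(eta, T, N):
    """The upper bound eta * T + log(N) / eta derived above."""
    return eta * T + math.log(N) / eta


def balanced_learning_rate(T, N):
    """The learning rate sqrt(log(N) / T) that equalizes the two terms."""
    return math.sqrt(math.log(N) / T)
```

At the balanced rate both terms equal $\sqrt{T \log N}$, and moving $\eta$ in either direction increases the bound.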

Competitive analysis

The competitive analysis analog of the bandits framework goes by the name of metrical task systems (MTS). This problem was introduced in 1992 by Borodin, Linial, and Saks.

This setting has three major differences:

At each time $t \geq 1$, we receive a cost function $c_t : \cE \to [0,\infty)$, and we choose an action $x_t \in \cE$ after receiving $c_t$.
There is a metric $d$ on $\cE$ that makes $(\cE,d)$ into a metric space, and in addition to the service cost $c_t(x_t)$, we pay a switching cost $d(x_{t-1},x_t)$ for playing $x_t$.
We compare ourselves against an offline optimum that has the same ability to switch strategies.

Say that an online algorithm $\llangle x_1, x_2, \ldots,x_T\rrangle$ is $\alpha$-competitive if, for every $x_0 \in \cE$ and every cost sequence $c_1,c_2,\ldots,c_T$, it holds that

$$ \sum_{t=1}^T c_t(x_t) + d(x_{t-1},x_t) \leq \alpha\left(\sum_{t=1}^T c_t(x^*_t) + d(x_{t-1}^*, x_t^*)\right) + O(1)\,, $$

where $\llangle x_0^*, x_1^*, x_2^*, \ldots, x_T^*\rrangle$ is the optimal offline strategy with $x_0^*=x_0$, i.e., the optimal strategy in hindsight. The additive $O(1)$ term is a constant that should be independent of the request sequence.

It is conjectured that the competitive ratio is $O(\log N)$ for every $N$-point metric space. In the coming weeks, we will see an $O((\log N)^2)$-competitive algorithm based on upcoming joint work with Bubeck, Cohen, and Y. T. Lee. This improves slightly on the $O((\log N)^2 \log \log N)$ bound of Fiat and Mendel.

Attempting exponential weights

The simplest setting for MTS is when the metric $(\cE,d)$ is uniform, i.e., $d(x,y)=\1_{\{x\neq y\}}$ for $x,y \in \cE$.

Just as in the bandits setting, we can consider a continuous-time trajectory $\left\{ c_t : \cE \to [0,\infty) \mid t \geq 0\right\}$ on cost functions. And we might try to employ a similar strategy: \begin{equation}\label{eq:cost-dynamics} \partial_t \log p_t(x) = - c_t(x) + \langle p_t, c_t\rangle. \end{equation}

But now we run into a major hurdle: Such an algorithm cannot be $\alpha$-competitive for any $\alpha < \infty$.

Suppose we arrange that for some $x_0 \in \cE$ and $t_0 > 0$ and small $\epsilon > 0$, it holds that $p_{t_0}(x_0) = 1-\epsilon$. We can do this by having an adversary play the cost function $c(x)=\1_{\cE \setminus \{x_0\}}(x)$ for a long enough period of time so that any competitive algorithm must put almost all the probability mass on $x_0$.

Now the adversary changes the cost function to $c(x)=\1_{x_0}(x)$. The dynamics specified by \eqref{eq:cost-dynamics} will move much too slowly! Indeed, since $\langle p_t, c\rangle = p_t(x_0)$: \[ \partial_t \log p_t(x_0) = -1+p_t(x_0) = - (1-p_t(x_0)), \] i.e., \[ \partial_t \left(1-p_t(x_0)\right) = p_t(x_0) \left(1-p_t(x_0)\right) \leq 1-p_t(x_0), \] so the mass $1-p_t(x_0)$ outside $x_0$ grows at most exponentially from its initial value $\epsilon$, and it will take time at least $\log \frac{1}{2\epsilon}$ before $p_t(x_0) < 1/2$.

Thus our algorithm will incur cost $\gtrsim \log(1/\epsilon)$, which is unbounded as $\epsilon \to 0$, while the optimal offline algorithm incurs cost $O(1)$. In the bandits setting, this was fine because every fixed strategy also incurs cost $\gtrsim \log(1/\epsilon)$. But in the setting of competitive analysis, our algorithm needs to be a lot more nimble to keep up with an offline algorithm that can switch strategies.
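One can watch this sluggishness in a simulation. The sketch below discretizes \eqref{eq:cost-dynamics} with explicit Euler steps on a two-point space and measures how long it takes for $p_t(x_0)$ to drop below $1/2$. The step size, the two-point space, and the function names are assumptions made for illustration only.

```python
def cost_dynamics_step(p, c, dt):
    """One explicit Euler step of d/dt log p(x) = -c(x) + <p, c>."""
    avg = sum(px * cx for px, cx in zip(p, c))
    return [px + dt * px * (avg - cx) for px, cx in zip(p, c)]


def time_to_leave(eps, dt=1e-3, max_time=100.0):
    """Time until p(x0) < 1/2, starting from p(x0) = 1 - eps.

    The space has two points, x0 and one alternative, and the cost
    is 1 on x0 and 0 elsewhere.
    """
    p = [1.0 - eps, eps]
    c = [1.0, 0.0]
    t = 0.0
    while p[0] >= 0.5 and t < max_time:
        p = cost_dynamics_step(p, c, dt)
        t += dt
    return t
```

The smaller the initial mass $\epsilon$ outside $x_0$, the longer the algorithm lingers on the expensive point, paying service cost at rate at least $1/2$ the whole time.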

The exploration shift

To fix this, we will design an algorithm that devotes a constant fraction of the service cost it is currently incurring to exploring the strategy space. Essentially, this can be achieved by pretending that $p_t(x) \geq 1/(2N)$ for every $x \in \cE$. (Recall that $N = |\cE|$.) This transformation (mixing with the uniform distribution) is not uncommon in the bandit literature. In the setting of metrical task systems, I saw it for the first time in this paper of Bansal, Buchbinder, and Naor on the weighted paging problem.

Define $p_0(x)=\1_{x_0}(x)$ where $x_0 \in \cE$ is the starting point. Let $\delta > 0$ be a number we will choose soon, and consider the dynamics: \begin{equation}\label{eq:mts-dynamics} \partial_t \log (p_t(x)+\delta) = - \hat{c}_t(x) + \llangle \frac{p_t+\delta}{1+\delta N}, \hat{c}_t\rrangle\,, \end{equation} which can be written equivalently as \begin{equation}\label{eq:mts-move} \partial_t p_t(x) = (p_t(x)+\delta) \left(- \hat{c}_t(x) + \llangle \frac{p_t+\delta}{1+\delta N}, \hat{c}_t\rrangle\right). \end{equation} A natural choice is to take $\hat{c}_t(x)=c_t(x)$, but this presents a problem: Unlike \eqref{eq:cost-dynamics}, these dynamics no longer enforce that $p_t(x) \geq 0$.
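Both features of \eqref{eq:mts-move} can be seen in a small computation: the total mass is conserved, but a coordinate sitting at $0$ can be pushed negative when $\hat{c}_t = c_t$. The function name and the numerical values below are hypothetical choices for this sketch.

```python
def mts_velocity(p, c_hat, delta):
    """Right-hand side of the shifted dynamics for dp(x)/dt."""
    n = len(p)
    # The inner product <(p + delta) / (1 + delta * N), c_hat>.
    avg = sum((px + delta) * cx for px, cx in zip(p, c_hat)) / (1.0 + delta * n)
    return [(px + delta) * (avg - cx) for px, cx in zip(p, c_hat)]


# All mass on the first point, cost on the second point only.
velocity = mts_velocity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], delta=0.1)
```

The entries of `velocity` sum to zero, yet the second entry is negative even though $p(x) = 0$ there, which is exactly the issue the reduced costs are meant to repair.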

Lagrangian multipliers

In this relatively simple setting, we can consider reduced costs $\hat{c}_t(x) \seteq c_t(x)-\lambda_t(x)$ satisfying:

$\lambda_t(x) \geq 0$ for all $t \geq 0$,
$p_t(x) = 0 \implies \partial_t p_t(x) \geq 0$,
$\lambda_t(x) > 0 \implies p_t(x)=0$.

With these constraints in place, the corresponding trajectory $\{p_t : t \geq 0\}$ will always be a probability measure, and we can charge ourselves $\hat{c}_t(x)$ without worrying about cheating, since $p_t(x) \hat{c}_t(x) = p_t(x) c_t(x)$ will always hold. The existence of functions $\{ \lambda_t : \cE \to [0,\infty) \mid t \geq 0\}$ is a slightly subtle issue that will be addressed formally in the coming lectures. These are Lagrangian multipliers corresponding to the constraints $p_t(x) \geq 0$ for $x \in \cE$.

As we will see in future lectures, the Lagrangian multipliers will not adversely affect the potential analysis, but they could cause our algorithm to incur movement cost. In the present setting, things are going in the beneficial direction: The multipliers correspond to reduced costs, and therefore they actually slow down our movement.

The potential analysis

Define now the potential \[ D_{\delta}(x; p) \seteq - \log (p(x)+\delta). \] We are interested in $\partial_t D_{\delta}(x_t^*; p_t)$. Let's first consider the derivative with respect to $x_t^*$. For any $x,y \in \cE$: \[ \left|D_{\delta}(x; p) - D_{\delta}(y ;p)\right| \leq \log(1/\delta)\,. \] Thus for any $T \geq 0$: \begin{equation}\label{eq:first-mts} D_{\delta}(x_{T}^*; p_T) - D_{\delta}(x_0^*, p_0) \leq \log(1/\delta) \sum_{t=1}^{\lfloor T\rfloor} d(x_t^*, x_{t-1}^*) + \int_0^T \partial_t D_{\delta}\left(x^*_{\lfloor t\rfloor}; p_t\right)\,dt\,. \end{equation} To analyze the latter term, use \eqref{eq:mts-dynamics} to observe that for any $x \in \cE$, \[ \partial_t D_{\delta}(x;p_t) = \hat{c}_t(x) - \llangle \frac{p_t+\delta}{1+\delta N}, \hat{c}_t\rrangle, \] hence \[ \int_0^T \partial_t D_{\delta}\left(x_{\lfloor t\rfloor}^*; p_t\right)\,dt = \int_0^T \hat{c}_t(x_t^*)\,dt - \int_0^T \llangle \frac{p_t+\delta}{1+\delta N}, \hat{c}_t\rrangle\,dt\,. \] Plugging this into \eqref{eq:first-mts} and rearranging gives \begin{align}\nonumber (1+& \delta N)^{-1} \int_0^T \langle p_t + \delta, \hat{c}_t\rangle\,dt \\ &\leq \left[ D_{\delta}(x_{0}^*; p_0) - D_{\delta}(x_T^*, p_T) \right] + \log(1/\delta) \sum_{t=1}^{\lfloor T\rfloor} d(x_t^*, x_{t-1}^*) + \int_0^T \hat{c}_t(x_t^*)\,dt \nonumber \\ &\leq \log(1/\delta) \sum_{t=1}^{\lfloor T\rfloor} d(x_t^*, x_{t-1}^*) + \int_0^T c_t(x_t^*)\,dt\,,\label{eq:second-mts} \end{align} where we have used the fact that the term in brackets is nonpositive, and $c_t \geq \hat{c}_t$ pointwise. This looks great: We have bounded the service cost of our algorithm by the movement and service costs of the optimal algorithm. We are left to consider the movement cost of the algorithm.

The movement cost

Here we use a trick from online algorithms: Instead of bounding the total movement cost $\int_0^T \|\partial_t p_t\|_1\,dt,$ we will bound only the incoming movement $\int_0^T \|\left(\partial_t p_t\right)_+\|_1\,dt$.

Since “what goes in must come out (unless it stays there forever),” we have:

$$ \int_0^T \|\partial_t p_t\|_1\,dt \leq 2 \int_0^T \|\left(\partial_t p_t\right)_+\|_1\,dt + 1\,. $$ And now \eqref{eq:mts-move} gives $$ \left\|\left(\partial_t p_t\right)_+\right\|_1 \leq \langle p_t+\delta,\hat{c}_t\rangle\,. $$

Combining this with \eqref{eq:second-mts} and using the fact that $\langle p_t, c_t\rangle = \langle p_t, \hat{c}_t\rangle$ yields \begin{align*} \int_0^T \left(\langle p_t,c_t\rangle + \|\partial_t p_t\|_1\right)\,dt &\leq 1 + 3 \int_0^T \langle p_t + \delta ,\hat{c}_t\rangle\,dt \\ & \leq 1 + 3(1+\delta N) \left[\log(1/\delta) \sum_{t=1}^{\lfloor T\rfloor} d(x_t^*, x_{t-1}^*) + \int_0^T c_t(x_t^*)\,dt\right]. \end{align*} Thus setting $\delta = 1/N$ yields an $O(\log N)$-competitive algorithm for MTS on a uniform metric space.

A comment about absolute continuity

Note that in \eqref{eq:first-mts}, we have used the fundamental theorem of calculus to integrate a derivative. In order for this to be valid, it must be that $\log (p_t(x)+\delta)$ is absolutely continuous as a function of $t$. When we argue formally about the existence of the Lagrangian multipliers $\lambda_t$, we will need to ensure that the resulting trajectory is absolutely continuous.

tcsmath relaunch

Sat, 31 Mar 2018 00:00:00 +0000

I am relaunching tcsmath on a new platform in anticipation of resuming regular posting. The old pages should still be available here tcsmath, and some of them will be slowly migrated to the new format. There may be some DNS/https hiccups. I hope those are resolved soon.

An entropy optimal drift

Sat, 21 Nov 2015 00:00:00 +0000

Construction of Föllmer's drift

In a previous post, we saw how an entropy-optimal drift process could be used to prove the Brascamp-Lieb inequalities. Our main tool was a result of Föllmer that we now recall and justify. Afterward, we will use it to prove the Gaussian log-Sobolev inequality.

Consider $f : \mathbb R^n \to \mathbb R_+$ with $\int f \,d\gamma_n = 1$, where $\gamma_n$ is the standard Gaussian measure on $\mathbb R^n$. Let $\{B_t\}$ denote an $n$-dimensional Brownian motion with $B_0=0$. We consider all processes of the form \begin{equation}\label{eq:drift} W_t = B_t + \int_0^t v_s\,ds\,, \end{equation} where $\{v_s\}$ is a progressively measurable drift and such that $W_1$ has law $f\,d\gamma_n$.

It holds that \[ D(f d\gamma_n \,\|\, d\gamma_n) = \min D(W_{[0,1]} \,\|\, B_{[0,1]}) = \min \frac12 \int_0^1 \mathbb{E}\,\|v_t\|^2\,dt\,, \] where the minima are over all processes of the form \eqref{eq:drift}.

In the preceding post (Lemma 2), we have already seen that for any drift of the form \eqref{eq:drift}, it holds that \[ D(f d\gamma_n \,\|\,d\gamma_n) \leq \frac12 \int_0^1 \mathbb{E}\,\|v_t\|^2\,dt \leq D(W_{[0,1]} \,\|\, B_{[0,1]})\,, \] thus we need only exhibit a drift $\{v_t\}$ achieving equality.

We define \[ v_t = \nabla \log P_{1-t} f(W_t) = \frac{\nabla P_{1-t} f(W_t)}{P_{1-t} f(W_t)}\,, \] where $\{P_t\}$ is the Brownian semigroup defined by \[ P_t f(x) = \mathbb{E}[f(x + B_t)]\,. \]

Note that $v_t$ is almost surely constant conditioned on the past, hence the chain rule yields \begin{equation}\label{eq:chain} D(W_{[0,1]} \,\|\, B_{[0,1]}) = \frac12 \int_0^1 \mathbb{E}\,\|v_t\|^2\,dt\,. \end{equation} (See line (7) of Lemma 2 in the previous post. Note that $h(v_t)=0$ since $v_t$ is deterministic given the past.) We are left to show that $W_1$ has law $f \,d\gamma_n$ and $D(W_{[0,1]} \,\|\, B_{[0,1]}) = D(f d\gamma_n \,\|\,d\gamma_n)$.

We will prove the first fact using Girsanov's theorem to argue about the change of measure between $\{W_t\}$ and $\{B_t\}$. As in the previous post, we will argue somewhat informally using the heuristic that the law of $dB_t$ is a Gaussian random variable in $\mathbb R^n$ with covariance $dt \cdot I$. Itô's formula states that this heuristic is justified (see our use of the formula below).

The following lemma says that, given any sample path $\{W_s : s \in [0,t]\}$ of our process up to time $t$, the probability that Brownian motion (without drift) would have “done the same thing” is $\frac{1}{M_t}$.

I chose to present various steps in the next proof at varying levels of formality. The arguments have the same structure as corresponding formal proofs, but I thought (perhaps naïvely) that this would be instructive.

Let $\mu_t$ denote the law of $\\{W_s : s \in [0,t]\\}$. If we define \[ M_t = \exp\left(-\int_0^t \langle v_s,dB_s\rangle - \frac12 \int_0^t \|v_s\|^2\,ds\right)\,, \] then under the measure $\nu_t$ given by \[ d\nu_t = M_t \,d\mu_t\,, \] the process $\\{W_s : s \in [0,t]\\}$ has the same law as $\\{B_s : s \in [0,t]\\}$.

We argue by analogy with the discrete proof. First, let us define the infinitesimal "transition kernel" of Brownian motion using our heuristic that $dB_t$ has covariance $dt \cdot I$: \[ p(x,y) = \frac{e^{-\|x-y\|^2/2dt}}{(2\pi dt)^{n/2}}\,. \]

We can also compute the (time-inhomogeneous) transition kernel $q_t$ of $\\{W_t\\}$: \[ q_t(x,y) = \frac{e^{-\|v_t dt + x - y\|^2/2dt}}{(2\pi dt)^{n/2}} = p(x,y) e^{-\frac12 \|v_t\|^2 dt} e^{-\langle v_t, x-y\rangle}\,. \] Here we are using that $dW_t = dB_t + v_t\,dt$ and $v_t$ is deterministic conditioned on the past, thus the law of $dW_t$ is a normal with mean $v_t\,dt$ and covariance $dt \cdot I$.
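The factorization of $q_t$ is just an expansion of the square, and it holds exactly for any step size, not only infinitesimally. A minimal numerical sketch of this check follows; the function names, the finite step `dt`, and the random test points are all illustrative assumptions.

```python
import numpy as np

def p(x, y, dt):
    """Gaussian density of a driftless Brownian step from x to y over time dt."""
    n = x.shape[0]
    return np.exp(-np.sum((x - y) ** 2) / (2 * dt)) / (2 * np.pi * dt) ** (n / 2)

def q(x, y, v, dt):
    """Density of the drifted step y = x + v*dt + N(0, dt*I)."""
    n = x.shape[0]
    return np.exp(-np.sum((v * dt + x - y) ** 2) / (2 * dt)) / (2 * np.pi * dt) ** (n / 2)

def q_factored(x, y, v, dt):
    """The same density written as p(x,y) times the two exponential correction factors."""
    return p(x, y, dt) * np.exp(-0.5 * np.dot(v, v) * dt) * np.exp(-np.dot(v, x - y))

rng = np.random.default_rng(0)
dt = 0.01                       # a small but finite step (assumption for the demo)
x = rng.normal(size=3)          # current position
v = rng.normal(size=3)          # drift, frozen over the step
y = x + v * dt + np.sqrt(dt) * rng.normal(size=3)  # one drifted step

lhs = q(x, y, v, dt)
rhs = q_factored(x, y, v, dt)
print(lhs, rhs)
```

The two printed values should agree up to floating-point rounding.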

To avoid confusion of derivatives, let's use $\alpha_t$ for the density of $\mu_t$ and $\beta_t$ for the density of Brownian motion (recall that these are densities on paths). Now let us relate the density $\alpha_{t+dt}$ to the density $\alpha_{t}$. We use here the notations $\\{\hat W_t, \hat v_t, \hat B_t\\}$ to denote a (non-random) sample path of $\\{W_t\\}$: \begin{align*} \alpha_{t+dt}(\hat W_{[0,t+dt]}) &= \alpha_t(\hat W_{[0,t]}) q_t(\hat W_t, \hat W_{t+dt}) \\ &= \alpha_t(\hat W_{[0,t]}) p(\hat W_t, \hat W_{t+dt}) e^{-\frac12 \|\hat v_t\|^2\,dt-\langle \hat v_t,\hat W_t-\hat W_{t+dt}\rangle} \\ &= \alpha_t(\hat W_{[0,t]}) p(\hat W_t, \hat W_{t+dt}) e^{-\frac12 \|\hat v_t\|^2\,dt+\langle \hat v_t,d \hat W_t\rangle} \\ &= \alpha_t(\hat W_{[0,t]}) p(\hat W_t, \hat W_{t+dt}) e^{\frac12 \|\hat v_t\|^2\,dt+\langle \hat v_t, d \hat B_t\rangle}\,, \end{align*} where the last line uses $d\hat W_t = d\hat B_t + \hat v_t\,dt$.

Now by "heuristic" induction, we can assume $\alpha_t(\hat W_{[0,t]})=\frac{1}{M_t} \beta_t(\hat W_{[0,t]})$, yielding \begin{align*} \alpha_{t+dt}(\hat W_{[0,t+dt]}) &= \frac{1}{M_t} \beta_t(\hat W_{[0,t]}) p(\hat W_t, \hat W_{t+dt}) e^{\frac12 \|\hat v_t\|^2\,dt+\langle \hat v_t, d \hat B_t\rangle} \\ &= \frac{1}{M_{t+dt}} \beta_t(\hat W_{[0,t]}) p(\hat W_t, \hat W_{t+dt}) \\ &= \frac{1}{M_{t+dt}} \beta_{t+dt}(\hat W_{[0,t+dt]})\,. \end{align*} The second line uses the definition of $M_t$, which gives $\frac{1}{M_{t+dt}} = \frac{1}{M_t}\, e^{\frac12 \|\hat v_t\|^2\,dt+\langle \hat v_t, d \hat B_t\rangle}$. In the last line, we used the fact that $p$ is the infinitesimal transition kernel for Brownian motion.
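The change of measure can also be tested by simulation. The sketch below is only an illustration under stated assumptions: it works in one dimension, uses the hypothetical drift $v_t = \tanh(W_t)$ (chosen because it is bounded and depends only on the past), and discretizes time with an Euler scheme. It accumulates $M_1$ along each path and compares the reweighted average of $W_1^2$ with $\mathbb{E}[B_1^2] = 1$.

```python
import numpy as np

def simulate(num_paths=200_000, num_steps=50, seed=0):
    """Euler scheme for dW = dB + v dt with the illustrative drift v = tanh(W).

    Returns the endpoints W_1 and the weights
    M_1 = exp(-sum v dB - 0.5 * sum v^2 dt) accumulated along each path.
    """
    rng = np.random.default_rng(seed)
    dt = 1.0 / num_steps
    W = np.zeros(num_paths)
    log_M = np.zeros(num_paths)
    for _ in range(num_steps):
        v = np.tanh(W)                                  # drift, determined by the past
        dB = np.sqrt(dt) * rng.standard_normal(num_paths)
        log_M += -v * dB - 0.5 * v ** 2 * dt            # increment of log M_t
        W += dB + v * dt                                # dW = dB + v dt
    return W, np.exp(log_M)

W1, M1 = simulate()
plain_second_moment = np.mean(W1 ** 2)            # under the law of W
reweighted_second_moment = np.mean(M1 * W1 ** 2)  # under d(nu) = M d(mu)
print(np.mean(M1), plain_second_moment, reweighted_second_moment)
```

If the lemma holds, the average of $M_1$ and the reweighted second moment should both be close to $1$ up to Monte Carlo error, while the unweighted second moment is visibly larger because the drift pushes the process away from the origin.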

## The Gaussian log-Sobolev inequality

Consider again a measurable $f : \mathbb R^n \to \mathbb R_+$ with $\int f\,d\gamma_n=1$. Let us define $\mathrm{Ent}_{\gamma_n}(f) = D(f\,d\gamma_n \,\\|\,d\gamma_n)$. Then the classical log-Sobolev inequality in Gaussian space asserts that \begin{equation}\label{eq:logsob} \mathrm{Ent}_{\gamma_n}(f) \leq \frac12 \int \frac{\|\nabla f\|^2}{f}\,d\gamma_n\\,. \end{equation}

First, we discuss the correct way to interpret this. Define the Ornstein-Uhlenbeck semi-group $\\{U_t\\}$ by its action \\[ U_t f(x) = \mathbb{E}[f(e^{-t} x + \sqrt{1-e^{-2t}} B_1)]\\,. \\] This is the natural stationary diffusion process on Gaussian space. For every measurable $f$, we have \\[ U_t f \to \int f d\gamma_n \quad \textrm{ as $t \to \infty$}\,, \\] or equivalently \\[ \mathrm{Ent}_{\gamma_n}(U_t f) \to 0 \quad \textrm{ as $t \to \infty$}\,. \\]
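To make the convergence concrete, here is a small numerical sketch in dimension one. It is an illustration under stated assumptions: the expectation defining $U_t$ is approximated by Gauss-Hermite quadrature, and the test function $f(x) = x^2$ (which satisfies $\int f\,d\gamma_1 = 1$) is a hypothetical choice, as are all function names.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Quadrature rule for the weight exp(-x^2/2), rescaled so the weights sum to 1,
# i.e. so that sum(weights * g(nodes)) approximates the integral of g against gamma_1.
nodes, weights = hermegauss(60)
weights = weights / np.sqrt(2.0 * np.pi)

def ou(f, t, x):
    """Approximate U_t f(x) = E[f(e^{-t} x + sqrt(1 - e^{-2t}) Z)] with Z ~ N(0,1)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    pts = np.exp(-t) * x[:, None] + np.sqrt(1.0 - np.exp(-2.0 * t)) * nodes[None, :]
    return f(pts) @ weights

def entropy_after(f, t):
    """Approximate Ent(U_t f), the integral of g log g against gamma_1 for g = U_t f."""
    g = ou(f, t, nodes)
    return np.sum(weights * g * np.log(g))

f = lambda x: x ** 2   # illustrative nonnegative f with integral 1 against gamma_1

value_far_out = ou(f, 20.0, 5.0)[0]                          # U_t f(5) for large t
entropies = [entropy_after(f, t) for t in (0.1, 1.0, 3.0)]   # Ent(U_t f) along the flow
print(value_far_out, entropies)
```

For this $f$ one can compute by hand that $U_t f(x) = e^{-2t} x^2 + 1 - e^{-2t}$, so the first printed value should be essentially $1$, and the entropies should decrease toward $0$.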

The log-Sobolev inequality yields quantitative convergence in the relative entropy distance as follows: Define the Fisher information \[ I(f) = \int \frac{\|\nabla f\|^2}{f} \,d\gamma_n\,. \]

One can check that $$ \frac{d}{dt} \mathrm{Ent}_{\gamma_n} (U_t f)\Big|_{t=0} = - I(f)\,, $$ thus the Fisher information describes the instantaneous decay of the relative entropy of $f$ under diffusion.

So we can rewrite the log-Sobolev inequality as: \[ - \frac{d}{dt} \mathrm{Ent}_{\gamma_n}(U_t f)\Big|_{t=0} \geq 2\, \mathrm{Ent}_{\gamma_n}(f)\,. \] This expresses the intuitive fact that when the relative entropy is large, its rate of decay toward equilibrium is faster.
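To see the rate this gives, here is a short sketch. The decay bound and the test function below are standard, but they are not spelled out in the post. Applying the derivative identity to $U_t f$ in place of $f$ (using the semigroup property) and then \eqref{eq:logsob}, one gets
\begin{align*}
\frac{d}{dt} \mathrm{Ent}_{\gamma_n}(U_t f) = - I(U_t f) \leq -2\, \mathrm{Ent}_{\gamma_n}(U_t f)\,, \qquad \textrm{hence} \qquad \mathrm{Ent}_{\gamma_n}(U_t f) \leq e^{-2t}\, \mathrm{Ent}_{\gamma_n}(f)\,.
\end{align*}
The constant cannot be improved: for $f(x) = e^{\langle a,x\rangle - \|a\|^2/2}$ with $a \in \mathbb R^n$, a direct computation gives
\begin{align*}
U_t f(x) &= e^{e^{-t}\langle a,x\rangle - e^{-2t}\|a\|^2/2}\,, \\
\mathrm{Ent}_{\gamma_n}(U_t f) &= \frac{e^{-2t}\|a\|^2}{2}\,, \\
I(U_t f) &= e^{-2t}\|a\|^2\,,
\end{align*}
so \eqref{eq:logsob} holds with equality along the whole flow, and the entropy decays exactly like $e^{-2t}$.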