<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>tcs math</title>
    <description>some mathematics &amp; computation</description>
    <link>http://tcsmath.github.io/</link>
    <atom:link href="http://tcsmath.github.io/feed.xml" rel="self" type="application/rss+xml" />
    
      <item>
        <title>Lower bounds for MTS</title>
        <description>&lt;p&gt;Consider a rooted tree &lt;span class=&quot;math inline&quot;&gt;\(T=(V,E)\)&lt;/span&gt; with leaf set &lt;span class=&quot;math inline&quot;&gt;\(\mathcal L\subseteq V\)&lt;/span&gt; and positive vertex weights &lt;span class=&quot;math inline&quot;&gt;\(w : V\setminus \mathcal L\to \mathbb R_+\)&lt;/span&gt; that are non-increasing along root-leaf paths. We recall the ultrametric &lt;span class=&quot;math inline&quot;&gt;\(d_w\)&lt;/span&gt; defined on &lt;span class=&quot;math inline&quot;&gt;\(\mathcal L\)&lt;/span&gt; by 
&lt;p&gt;
$$
  d_w(\ell,\ell') \mathrel{\vcenter{:}}= w(\mathrm{lca}(\ell,\ell')).
$$
&lt;/p&gt;

&lt;p&gt;
Say that &lt;span class=&quot;math inline&quot;&gt;\((\mathcal L,d_w)\)&lt;/span&gt; is a &lt;span&gt;&lt;em&gt;&lt;span class=&quot;math inline&quot;&gt;\(\tau\)&lt;/span&gt;-HST metric&lt;/em&gt;&lt;/span&gt; for some &lt;span class=&quot;math inline&quot;&gt;\(\tau \geq 1\)&lt;/span&gt; if &lt;span class=&quot;math inline&quot;&gt;\(\mathcal L\)&lt;/span&gt; is the leaf set of some weighted tree &lt;span class=&quot;math inline&quot;&gt;\(T=(V,E)\)&lt;/span&gt;, and &lt;span class=&quot;math inline&quot;&gt;\(d_w\)&lt;/span&gt; is the ultrametric corresponding to a weight &lt;span class=&quot;math inline&quot;&gt;\(w\)&lt;/span&gt; on &lt;span class=&quot;math inline&quot;&gt;\(T\)&lt;/span&gt; with the property that &lt;span class=&quot;math inline&quot;&gt;\(w(y) \leq w(x)/\tau\)&lt;/span&gt; whenever &lt;span class=&quot;math inline&quot;&gt;\(y\)&lt;/span&gt; is a child of &lt;span class=&quot;math inline&quot;&gt;\(x\)&lt;/span&gt;. (So for a finite metric space, the notions of &lt;span class=&quot;math inline&quot;&gt;\(1\)&lt;/span&gt;-HST metric and ultrametric are equivalent.) Say that &lt;span class=&quot;math inline&quot;&gt;\((\mathcal L,d)\)&lt;/span&gt; is a &lt;span&gt;&lt;em&gt;strict &lt;span class=&quot;math inline&quot;&gt;\(\tau\)&lt;/span&gt;-HST metric&lt;/em&gt;&lt;/span&gt; if we require the stronger property that &lt;span class=&quot;math inline&quot;&gt;\(w(y)=w(x)/\tau\)&lt;/span&gt; whenever &lt;span class=&quot;math inline&quot;&gt;\(x,y \in V\)&lt;/span&gt; are internal nodes and &lt;span class=&quot;math inline&quot;&gt;\(y\)&lt;/span&gt; is a child of &lt;span class=&quot;math inline&quot;&gt;\(x\)&lt;/span&gt;.&lt;/p&gt;
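&lt;p&gt;To make these definitions concrete, here is a minimal Python sketch that evaluates &lt;span class=&quot;math inline&quot;&gt;\(d_w\)&lt;/span&gt; on a small tree, checks the strong triangle inequality, and reports the largest &lt;span class=&quot;math inline&quot;&gt;\(\tau\)&lt;/span&gt; for which the tree is a &lt;span class=&quot;math inline&quot;&gt;\(\tau\)&lt;/span&gt;-HST. The tree, its node names, its weights, and all helper names are hypothetical choices made only for illustration.&lt;/p&gt;

```python
# Hypothetical toy tree: parent pointers, plus weights on the internal nodes only.
parent = {"a": "r", "b": "r", "x": "a", "y": "a", "z": "b", "u": "b"}
weight = {"r": 1.0, "a": 0.25, "b": 0.5}
leaves = ["x", "y", "z", "u"]

def ancestors(v):
    """Return the path from v up to the root, starting with v itself."""
    path = [v]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def lca(u, v):
    """Least common ancestor: the first ancestor of v that is also an ancestor of u."""
    above_u = set(ancestors(u))
    return next(a for a in ancestors(v) if a in above_u)

def d_w(u, v):
    """The distance d_w: the weight of the least common ancestor (0 on the diagonal)."""
    return 0.0 if u == v else weight[lca(u, v)]

def is_ultrametric(points):
    """Check the strong triangle inequality on every triple of points."""
    return all(max(d_w(x, y), d_w(y, z)) >= d_w(x, z)
               for x in points for y in points for z in points)

def hst_parameter():
    """Largest tau such that w(child) is at most w(parent) / tau on internal edges."""
    return min(weight[parent[v]] / weight[v] for v in weight if v in parent)

print(d_w("x", "y"), d_w("x", "z"), is_ultrametric(leaves), hst_parameter())
# prints: 0.25 1.0 True 2.0
```

&lt;p&gt;In this toy example the two internal edges have weight ratios 4 and 2, so the tree is a &lt;span class=&quot;math inline&quot;&gt;\(2\)&lt;/span&gt;-HST but not a strict one.&lt;/p&gt;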
&lt;p&gt;Two metrics &lt;span class=&quot;math inline&quot;&gt;\(d\)&lt;/span&gt; and &lt;span class=&quot;math inline&quot;&gt;\(d'\)&lt;/span&gt; on a set &lt;span class=&quot;math inline&quot;&gt;\(X\)&lt;/span&gt; are &lt;span&gt;&lt;em&gt;&lt;span class=&quot;math inline&quot;&gt;\(K\)&lt;/span&gt;-equivalent&lt;/em&gt;&lt;/span&gt; if there is a constant &lt;span class=&quot;math inline&quot;&gt;\(c &amp;gt; 0\)&lt;/span&gt; such that &lt;span class=&quot;math display&quot;&gt;\[d(x,y) \leq c\, d'(x,y) \leq K d(x,y)\qquad \forall x,y \in X\,.\]&lt;/span&gt; We leave the following statement as an exercise:
For every &lt;span class=&quot;math inline&quot;&gt;\(\tau &amp;gt; 1\)&lt;/span&gt;, every &lt;span class=&quot;math inline&quot;&gt;\(1\)&lt;/span&gt;-HST is &lt;span class=&quot;math inline&quot;&gt;\(\tau\)&lt;/span&gt;-equivalent to some strict &lt;span class=&quot;math inline&quot;&gt;\(\tau\)&lt;/span&gt;-HST.&lt;/p&gt;

&lt;a name=&quot;thm:bbm&quot;&gt;&lt;/a&gt;
&lt;p class=&quot;theorem&quot; text=&quot;Bartal-Bollobas-Mendel&quot; ord=&quot;1&quot;&gt;
There is a constant &lt;span class=&quot;math inline&quot;&gt;\(C \geq 1\)&lt;/span&gt; such that if &lt;span class=&quot;math inline&quot;&gt;\((\mathcal L,d_w)\)&lt;/span&gt; is a &lt;span class=&quot;math inline&quot;&gt;\(C (\log n)^2\)&lt;/span&gt;-HST metric, where &lt;span class=&quot;math inline&quot;&gt;\(n=|\mathcal L|\)&lt;/span&gt;, then the competitive ratio for MTS on &lt;span class=&quot;math inline&quot;&gt;\((\mathcal L,d_w)\)&lt;/span&gt; is &lt;span class=&quot;math inline&quot;&gt;\(\Omega\left(\log n\right)\)&lt;/span&gt;.&lt;/p&gt;

While the preceding theorem applies only to $\tau$-HSTs with $\tau$ sufficiently large,
the next lemma shows how we can improve the separation parameter by passing
to a subset of leaves.

&lt;p class=&quot;lemma&quot;&gt;
If &lt;span class=&quot;math inline&quot;&gt;\((\mathcal L,d)\)&lt;/span&gt; is a strict &lt;span class=&quot;math inline&quot;&gt;\(2\)&lt;/span&gt;-HST, then for every &lt;span class=&quot;math inline&quot;&gt;\(k \in \mathbb N\)&lt;/span&gt;, there is a subset &lt;span class=&quot;math inline&quot;&gt;\(\mathcal L' \subseteq \mathcal L\)&lt;/span&gt; with
&lt;p&gt;
$$
|\mathcal{L}'| \geq |\mathcal{L}|^{1/k},
$$
&lt;/p&gt;
and such that &lt;span class=&quot;math inline&quot;&gt;\((\mathcal L', d)\)&lt;/span&gt; is a &lt;span class=&quot;math inline&quot;&gt;\(2^k\)&lt;/span&gt;-HST metric.
&lt;/p&gt;


&lt;p&gt;Combining this with &lt;a href=&quot;#thm:bbm&quot;&gt;Theorem 1&lt;/a&gt; yields:&lt;/p&gt;
&lt;p class=&quot;corollary&quot;&gt;
If &lt;span class=&quot;math inline&quot;&gt;\((\mathcal L,d)\)&lt;/span&gt; is a &lt;span class=&quot;math inline&quot;&gt;\(1\)&lt;/span&gt;-HST metric with &lt;span class=&quot;math inline&quot;&gt;\(n=|\mathcal L|\)&lt;/span&gt;, then the competitive ratio for MTS on &lt;span class=&quot;math inline&quot;&gt;\((\mathcal L,d)\)&lt;/span&gt; is at least
&lt;span class=&quot;math inline&quot;&gt;\(\Omega\left(\frac{\log n}{\log \log n}\right)\)&lt;/span&gt;.
&lt;/p&gt;

&lt;p&gt;Recall that the Metric Ramsey theorem from the preceding lecture implies that every &lt;span class=&quot;math inline&quot;&gt;\(n\)&lt;/span&gt;-point metric space contains a subset of size &lt;span class=&quot;math inline&quot;&gt;\(\sqrt{n}\)&lt;/span&gt; that is &lt;span class=&quot;math inline&quot;&gt;\(O(1)\)&lt;/span&gt;-equivalent to a &lt;span class=&quot;math inline&quot;&gt;\(1\)&lt;/span&gt;-HST. This yields:&lt;/p&gt;

&lt;p class=&quot;theorem&quot;&gt;
For every &lt;span class=&quot;math inline&quot;&gt;\(n\)&lt;/span&gt;-point metric space, the competitive ratio for MTS is at least &lt;span class=&quot;math inline&quot;&gt;\(\Omega(\log n/\log \log n)\)&lt;/span&gt;.
&lt;/p&gt;

&lt;p&gt;We will not prove &lt;a href=&quot;#thm:bbm&quot;&gt;Theorem 1&lt;/a&gt;, but we will prove a lower bound for two special cases that together capture the essential elements of the full argument. The general case is discussed at the end.&lt;/p&gt;

&lt;h2&gt;The complete d-ary tree&lt;/h2&gt;

&lt;p&gt;The first case we’ll consider is when &lt;span class=&quot;math inline&quot;&gt;\(T\)&lt;/span&gt; is a &lt;span class=&quot;math inline&quot;&gt;\(d\)&lt;/span&gt;-ary tree of height &lt;span class=&quot;math inline&quot;&gt;\(h\)&lt;/span&gt;, so that &lt;span class=&quot;math inline&quot;&gt;\(|\mathcal L|=d^h\)&lt;/span&gt;. Define &lt;span class=&quot;math inline&quot;&gt;\(w(x)=\tau^{-\mathrm{dist}_T(x,r)}\)&lt;/span&gt;, where &lt;span class=&quot;math inline&quot;&gt;\(r\)&lt;/span&gt; is the root of &lt;span class=&quot;math inline&quot;&gt;\(T\)&lt;/span&gt; and &lt;span class=&quot;math inline&quot;&gt;\(\mathrm{dist}_T\)&lt;/span&gt; denotes the combinatorial distance on &lt;span class=&quot;math inline&quot;&gt;\(T\)&lt;/span&gt;. Then &lt;span class=&quot;math inline&quot;&gt;\((\mathcal L,d_w)\)&lt;/span&gt; is a &lt;span class=&quot;math inline&quot;&gt;\(\tau\)&lt;/span&gt;-HST.&lt;/p&gt;
&lt;p&gt;Our goal is to establish a lower bound of &lt;span class=&quot;math inline&quot;&gt;\(\Omega(h \log d)\)&lt;/span&gt; on the competitive ratio for MTS when &lt;span class=&quot;math inline&quot;&gt;\(\tau\)&lt;/span&gt; is sufficiently large. Note that when &lt;span class=&quot;math inline&quot;&gt;\(\tau = \Theta(1)\)&lt;/span&gt; and &lt;span class=&quot;math inline&quot;&gt;\(d=2\)&lt;/span&gt;, it is an open problem to exhibit an &lt;span class=&quot;math inline&quot;&gt;\(\Omega(h)\)&lt;/span&gt; lower bound, and proving such a bound is likely the most difficult obstacle in obtaining an &lt;span class=&quot;math inline&quot;&gt;\(\Omega(\log n)\)&lt;/span&gt; lower bound for every &lt;span class=&quot;math inline&quot;&gt;\(n\)&lt;/span&gt;-point ultrametric.&lt;/p&gt;
&lt;p&gt;Our cost functions will be of the form &lt;span class=&quot;math inline&quot;&gt;\(\{\epsilon\cdot c_\ell : \ell \in \mathcal L\}\)&lt;/span&gt;, where &lt;span class=&quot;math inline&quot;&gt;\(\epsilon&amp;gt; 0\)&lt;/span&gt; is a number, and &lt;span class=&quot;math display&quot;&gt;\[c_\ell(\ell') = \begin{cases}
                        1 &amp;amp; \ell=\ell' \\
                        0 &amp;amp; \ell \neq \ell'\,.
   \end{cases}\]&lt;/span&gt; More succinctly: &lt;span class=&quot;math inline&quot;&gt;\(c_\ell \mathrel{\vcenter{:}}= \mathbf{1}_\ell\)&lt;/span&gt;, so that a unit cost is charged only at the leaf &lt;span class=&quot;math inline&quot;&gt;\(\ell\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;Let &lt;span class=&quot;math inline&quot;&gt;\(\alpha_h\)&lt;/span&gt; be the competitive ratio for height &lt;span class=&quot;math inline&quot;&gt;\(h\)&lt;/span&gt;. We will prove inductively that &lt;span class=&quot;math inline&quot;&gt;\(\alpha_h \geq \Omega(h \log d)\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;Consider first the case &lt;span class=&quot;math inline&quot;&gt;\(h=1\)&lt;/span&gt;. We define the height-&lt;span class=&quot;math inline&quot;&gt;\(1\)&lt;/span&gt; cost sequence. The root &lt;span class=&quot;math inline&quot;&gt;\(r\)&lt;/span&gt; has &lt;span class=&quot;math inline&quot;&gt;\(d\)&lt;/span&gt; leaves beneath it. We do the following &lt;span class=&quot;math inline&quot;&gt;\(d \log d\)&lt;/span&gt; times: Choose a random leaf &lt;span class=&quot;math inline&quot;&gt;\(\ell\)&lt;/span&gt; and play the cost function &lt;span class=&quot;math inline&quot;&gt;\(c_\ell\)&lt;/span&gt;. It is straightforward that if &lt;span class=&quot;math inline&quot;&gt;\(\mathrm{alg}_1\)&lt;/span&gt; denotes the cost of an online algorithm, then:
&lt;span class=&quot;math display&quot;&gt;\[\mathbb{E}[\mathrm{alg}_1] \geq \log d\,.\]&lt;/span&gt; (Note that if a cost comes at &lt;span class=&quot;math inline&quot;&gt;\(\ell\)&lt;/span&gt;, then the algorithm must either move and pay movement cost &lt;span class=&quot;math inline&quot;&gt;\(1\)&lt;/span&gt;, or stay in place and pay service cost &lt;span class=&quot;math inline&quot;&gt;\(1\)&lt;/span&gt;.)&lt;/p&gt;
&lt;p&gt;Consider the following offline algorithm: Move to the leaf that will be chosen the smallest number of times, and stay there. The movement cost is &lt;span class=&quot;math inline&quot;&gt;\(1\)&lt;/span&gt;, and a standard coupon collecting argument shows that the expected service cost is &lt;span class=&quot;math inline&quot;&gt;\(O(1)\)&lt;/span&gt;, hence: &lt;span class=&quot;math display&quot;&gt;\[\mathbb{E}[\mathrm{opt}_1] \leq O(1)\,.\]&lt;/span&gt; We conclude that the competitive ratio satisfies &lt;span class=&quot;math display&quot;&gt;\[\alpha_1 \geq c \log d\]&lt;/span&gt; for some &lt;span class=&quot;math inline&quot;&gt;\(c &amp;gt; 0\)&lt;/span&gt;.&lt;/p&gt;
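&lt;p&gt;The height-&lt;span class=&quot;math inline&quot;&gt;\(1\)&lt;/span&gt; argument can be explored numerically. The following Python sketch draws the random sequence, charges the offline strategy described above, and compares it with one particular online strategy that never moves. The function names, the fixed seed, the choice &lt;span class=&quot;math inline&quot;&gt;\(d=64\)&lt;/span&gt;, and the use of the natural logarithm are assumptions of this sketch; the lazy strategy is only one example of an online algorithm, so the simulation illustrates the gap rather than proving the lower bound.&lt;/p&gt;

```python
import math
import random

def play_height_one(d, rng):
    """Draw the height-1 cost sequence: about d*log(d) uniformly random leaves."""
    rounds = int(d * math.log(d))  # natural log is an assumption of this sketch
    hits = [0] * d
    for _ in range(rounds):
        hits[rng.randrange(d)] += 1
    return rounds, hits

def offline_cost(hits):
    """Move once (cost 1) to a least-hit leaf, then pay 1 per hit on that leaf."""
    return 1 + min(hits)

def lazy_online_cost(hits, start=0):
    """One simple online strategy: never move, pay 1 for every hit on the start leaf."""
    return hits[start]

rng = random.Random(0)  # fixed seed so repeated runs agree
d, trials = 64, 200
samples = [play_height_one(d, rng)[1] for _ in range(trials)]
avg_offline = sum(offline_cost(h) for h in samples) / trials
avg_lazy = sum(lazy_online_cost(h) for h in samples) / trials
print(f"d={d}: average offline cost {avg_offline:.2f}, average lazy online cost {avg_lazy:.2f}")
```

&lt;p&gt;By pigeonhole, the least-hit leaf receives at most the average number of hits, so the offline cost in this sketch never exceeds &lt;span class=&quot;math inline&quot;&gt;\(1+\log d\)&lt;/span&gt;; the coupon collecting argument is what brings its expectation down to &lt;span class=&quot;math inline&quot;&gt;\(O(1)\)&lt;/span&gt;.&lt;/p&gt;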
&lt;p&gt;Consider now the case of &lt;span class=&quot;math inline&quot;&gt;\(h \geq 2\)&lt;/span&gt;. We define the height-&lt;span class=&quot;math inline&quot;&gt;\(h\)&lt;/span&gt; cost sequence. The root has beneath it &lt;span class=&quot;math inline&quot;&gt;\(d\)&lt;/span&gt; subtrees &lt;span class=&quot;math inline&quot;&gt;\(T_1, T_2, \ldots, T_d\)&lt;/span&gt; of height &lt;span class=&quot;math inline&quot;&gt;\(h-1\)&lt;/span&gt;. For some number &lt;span class=&quot;math inline&quot;&gt;\(\mu_h\)&lt;/span&gt; to be chosen shortly, we do the following &lt;span class=&quot;math inline&quot;&gt;\(\mu_h d\)&lt;/span&gt; times: Choose &lt;span class=&quot;math inline&quot;&gt;\(i \in \{1,2,\ldots,d\}\)&lt;/span&gt; uniformly at random and play in &lt;span class=&quot;math inline&quot;&gt;\(T_i\)&lt;/span&gt; the height-&lt;span class=&quot;math inline&quot;&gt;\((h-1)\)&lt;/span&gt; cost sequence scaled by &lt;span class=&quot;math inline&quot;&gt;\(1/\tau\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;Consider first what an online algorithm can do. If the algorithm’s state lies in &lt;span class=&quot;math inline&quot;&gt;\(T_i\)&lt;/span&gt; when a height-&lt;span class=&quot;math inline&quot;&gt;\((h-1)\)&lt;/span&gt; cost sequence is imposed on &lt;span class=&quot;math inline&quot;&gt;\(T_i\)&lt;/span&gt;, it can either leave &lt;span class=&quot;math inline&quot;&gt;\(T_i\)&lt;/span&gt; at some point during the sequence, or it can remain in &lt;span class=&quot;math inline&quot;&gt;\(T_i\)&lt;/span&gt;. Hence the algorithm pays at least
$\min\left(1,\tau^{-1} \mathbb{E}[\mathrm{alg}_{h-1}]\right)$.
It follows that:

&lt;p&gt;
\[\mathbb{E}[\mathrm{alg}_h] \geq \mu_h \min\left(1,\tau^{-1} \mathbb{E}[\mathrm{alg}_{h-1}]\right)
   \geq \mu_h \min\left(1,\tau^{-1} \alpha_{h-1} \mathbb{E}[\mathrm{opt}_{h-1}]\right)\,.\]
&lt;/p&gt;
Thus if 
   &lt;p&gt;
   \begin{equation}\label{eq:tauchoice}
   \tau \geq \alpha_{h-1} \mathbb{E}[\mathrm{opt}_{h-1}]\end{equation}
   &lt;/p&gt;
it holds that &lt;span class=&quot;math display&quot;&gt;\begin{equation}\label{eq:alg}
   \mathbb{E}[\mathrm{alg}_h] 
   \geq \mu_h \tau^{-1} \alpha_{h-1} \mathbb{E}[\mathrm{opt}_{h-1}]\,.\end{equation}&lt;/span&gt;
We will choose &lt;span class=&quot;math inline&quot;&gt;\(\tau\)&lt;/span&gt; so that
\eqref{eq:tauchoice} holds.&lt;/p&gt;

&lt;p&gt;
Let us now analyze an offline algorithm. Let &lt;span class=&quot;math inline&quot;&gt;\(\rho_i\)&lt;/span&gt; denote the number of times that &lt;span class=&quot;math inline&quot;&gt;\(T_i\)&lt;/span&gt; is chosen. The offline algorithm first moves to some &lt;span class=&quot;math inline&quot;&gt;\(T_i\)&lt;/span&gt; with &lt;span class=&quot;math inline&quot;&gt;\(\rho_i\)&lt;/span&gt; minimal, and then plays optimally against the height-&lt;span class=&quot;math inline&quot;&gt;\((h-1)\)&lt;/span&gt; cost sequences arriving in &lt;span class=&quot;math inline&quot;&gt;\(T_i\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;If &lt;span class=&quot;math inline&quot;&gt;\(\mu_h \geq \Omega(\log d)\)&lt;/span&gt;, then the normal approximation to a binomial distribution shows that there is a constant &lt;span class=&quot;math inline&quot;&gt;\(0 &amp;lt; c' &amp;lt; 1\)&lt;/span&gt; such that &lt;span class=&quot;math display&quot;&gt;\[\mathbb{E}\left[\min(\rho_1,\ldots,\rho_d)\right] \leq \mu_h - c'\sqrt{\mu_h \log d}\,.\]&lt;/span&gt; Hence incorporating the movement cost yields: &lt;span class=&quot;math display&quot;&gt;\[\mathbb{E}[\mathrm{opt}_h] \leq 1 + \left(\mu_h - c'\sqrt{\mu_h \log d}\right) \tau^{-1} \mathbb{E}[\mathrm{opt}_{h-1}]\,.\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;We now choose &lt;span class=&quot;math inline&quot;&gt;\(\mu_h\)&lt;/span&gt; so that &lt;span class=&quot;math display&quot;&gt;\[c' \sqrt{\mu_h \log d} \tau^{-1} \mathbb{E}[\mathrm{opt}_{h-1}] = 2\,,\]&lt;/span&gt; i.e., &lt;span class=&quot;math display&quot;&gt;\[\mu_h \mathrel{\vcenter{:}}= \frac{4}{\log d} \left(\frac\tau{c' \mathbb{E}[\mathrm{opt}_{h-1}]}\right)^2\]&lt;/span&gt; Note from \eqref{eq:tauchoice}, we have &lt;span class=&quot;math display&quot;&gt;\begin{equation}\label{eq:muh}
   \mu_h \geq \frac{4}{(c')^2\log d} \alpha_{h-1}^2.\end{equation}&lt;/span&gt; Since &lt;span class=&quot;math inline&quot;&gt;\(\alpha_{h-1} \geq \alpha_1 \geq \Omega(\log d)\)&lt;/span&gt; it follows that &lt;span class=&quot;math inline&quot;&gt;\(\mu_h \geq \log d\)&lt;/span&gt; for &lt;span class=&quot;math inline&quot;&gt;\(c'\)&lt;/span&gt; chosen small enough. (In particular, the normal approximation applies.)&lt;/p&gt;
&lt;p&gt;Our choice of &lt;span class=&quot;math inline&quot;&gt;\(\mu_h\)&lt;/span&gt; ensures that
&lt;span class=&quot;math display&quot;&gt;
\begin{align}
   \mathbb{E}[\mathrm{opt}_h] &amp;amp;\leq \left(\mu_h-\frac{c'}{2} \sqrt{\mu_h \log d}\right) \tau^{-1} \mathbb{E}[\mathrm{opt}_{h-1}]
\label{eq:opt}\end{align}&lt;/span&gt;
Combining this with \eqref{eq:alg} yields
&lt;p&gt;
   $$
   \alpha_h \geq \frac{\mathbb{E}[\mathrm{alg}_h]}{\mathbb{E}[\mathrm{opt}_h]}
   \geq \frac{\alpha_{h-1}}{1-\frac{c'}{2} \sqrt{\frac{\log d}{\mu_h}}}
   \geq \alpha_{h-1} \left(1+\frac{c'}{2} \sqrt{\frac{\log d}{\mu_h}}\right),
   $$
&lt;/p&gt;
&lt;/p&gt;
where the latter inequality holds because &lt;span class=&quot;math inline&quot;&gt;\(c' &amp;lt; 1\)&lt;/span&gt; and &lt;span class=&quot;math inline&quot;&gt;\(\mu_h \geq \log d\)&lt;/span&gt;. Using \eqref{eq:muh}, this gives &lt;span class=&quot;math display&quot;&gt;\[\alpha_h \geq \alpha_{h-1} + \frac{c'}{2} \log d\,,\]&lt;/span&gt; completing the argument.&lt;/p&gt;
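&lt;p&gt;To spell out the step leading to \eqref{eq:opt}: by the choice of &lt;span class=&quot;math inline&quot;&gt;\(\mu_h\)&lt;/span&gt;, the additive movement cost can be rewritten as
\[
   1 = \frac{c'}{2} \sqrt{\mu_h \log d}\; \tau^{-1} \mathbb{E}[\mathrm{opt}_{h-1}]\,,
\]
and substituting this into the bound on &lt;span class=&quot;math inline&quot;&gt;\(\mathbb{E}[\mathrm{opt}_h]\)&lt;/span&gt; gives
\[
   1 + \left(\mu_h - c'\sqrt{\mu_h \log d}\right) \tau^{-1} \mathbb{E}[\mathrm{opt}_{h-1}]
   = \left(\mu_h - \frac{c'}{2}\sqrt{\mu_h \log d}\right) \tau^{-1} \mathbb{E}[\mathrm{opt}_{h-1}]\,.
\]
In other words, half of the savings term pays for the single move to the least-chosen subtree.&lt;/p&gt;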
&lt;p&gt;Note that in \eqref{eq:tauchoice}, we required &lt;span class=&quot;math inline&quot;&gt;\(\tau \geq \alpha_{h-1} \mathbb{E}[\mathrm{opt}_{h-1}]\)&lt;/span&gt;. So let us use \eqref{eq:opt} to compute: &lt;span class=&quot;math display&quot;&gt;\[\mathbb{E}[\mathrm{opt}_h] \leq \mu_h \tau^{-1} \mathbb{E}[\mathrm{opt}_{h-1}] \leq \frac{O(1)}{\log d} \frac\tau{\mathbb{E}[\mathrm{opt}_{h-1}]}\]&lt;/span&gt; Since &lt;span class=&quot;math display&quot;&gt;\[\mathbb{E}[\mathrm{opt}_h] \cdot \mathbb{E}[\mathrm{opt}_{h-1}] \leq O(\tau/\log d)\,,\]&lt;/span&gt; and &lt;span class=&quot;math inline&quot;&gt;\(\mathbb{E}[\mathrm{opt}_h] \geq \mathbb{E}[\mathrm{opt}_{h-1}] \geq 1\)&lt;/span&gt; for all &lt;span class=&quot;math inline&quot;&gt;\(h\)&lt;/span&gt;,
it follows that &lt;span class=&quot;math inline&quot;&gt;\(\mathbb{E}[\mathrm{opt}_h] \leq O(\sqrt{\tau/\log d})\)&lt;/span&gt;, meaning that &lt;span class=&quot;math inline&quot;&gt;\(\tau \geq \alpha_{h-1} \mathbb{E}[\mathrm{opt}_{h-1}]\)&lt;/span&gt; will be satisfied for &lt;span class=&quot;math inline&quot;&gt;\(\tau \geq C(\alpha_{h-1}/\log d)^2 \asymp h^2\)&lt;/span&gt;. Since &lt;span class=&quot;math inline&quot;&gt;\(h \asymp \frac{\log n}{\log d}\)&lt;/span&gt;, this completes the proof of &lt;a href=&quot;#thm:bbm&quot;&gt;Theorem 1&lt;/a&gt; in the special case of a &lt;span class=&quot;math inline&quot;&gt;\(d\)&lt;/span&gt;-regular HST.&lt;/p&gt;

&lt;h2&gt;The “superincreasing” metric&lt;/h2&gt;

&lt;p&gt;We will now consider a highly unbalanced family of HSTs.  These were analyzed
in a paper of &lt;a href=&quot;https://epubs.siam.org/doi/pdf/10.1137/S0097539792224838&quot;&gt;Karloff, Rabani, and Ravid&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let us define the trees &lt;span class=&quot;math inline&quot;&gt;\(\{T_n\}\)&lt;/span&gt; as follows: The root &lt;span class=&quot;math inline&quot;&gt;\(r_{n}\)&lt;/span&gt; of &lt;span class=&quot;math inline&quot;&gt;\(T_n\)&lt;/span&gt; has two children. The left child is a copy of &lt;span class=&quot;math inline&quot;&gt;\(T_{n-1}\)&lt;/span&gt; rooted at &lt;span class=&quot;math inline&quot;&gt;\(r_{n-1}\)&lt;/span&gt;, and the right child is a single leaf. We set &lt;span class=&quot;math inline&quot;&gt;\(w_n(r_n)=1\)&lt;/span&gt;, and &lt;span class=&quot;math inline&quot;&gt;\(w_n(r_{n-1}) = 1/\tau_n\)&lt;/span&gt;, where &lt;span class=&quot;math inline&quot;&gt;\(\tau_n &amp;gt; \tau_{n-1} &amp;gt; \cdots &amp;gt; 0\)&lt;/span&gt; is a sequence of positive weights we will choose soon, and such that &lt;span class=&quot;math inline&quot;&gt;\(\tau_n \leq O(\log n)\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;The basic structure of our argument will be similar to the &lt;span class=&quot;math inline&quot;&gt;\(d\)&lt;/span&gt;-regular case. We do the following &lt;span class=&quot;math inline&quot;&gt;\(2 \mu_n\)&lt;/span&gt; times: Choose &lt;span class=&quot;math inline&quot;&gt;\(i \in \{L,R\}\)&lt;/span&gt; uniformly at random. If &lt;span class=&quot;math inline&quot;&gt;\(i=L\)&lt;/span&gt;, we play a height-&lt;span class=&quot;math inline&quot;&gt;\((n-1)\)&lt;/span&gt; cost sequence on the left subtree (scaled by &lt;span class=&quot;math inline&quot;&gt;\(1/\tau_n\)&lt;/span&gt;). Otherwise, we play &lt;span class=&quot;math inline&quot;&gt;\(c_\ell\)&lt;/span&gt;, where &lt;span class=&quot;math inline&quot;&gt;\(\ell\)&lt;/span&gt; is the leaf constituting the right subtree.&lt;/p&gt;
&lt;p&gt;Consider an online algorithm. If the algorithm sits in the left subtree when &lt;span class=&quot;math inline&quot;&gt;\(i=L\)&lt;/span&gt;, then either the algorithm moves out of the subtree (movement cost &lt;span class=&quot;math inline&quot;&gt;\(1\)&lt;/span&gt;), or it incurs the cost of a height-&lt;span class=&quot;math inline&quot;&gt;\((n-1)\)&lt;/span&gt; inductive sequence scaled by &lt;span class=&quot;math inline&quot;&gt;\(1/\tau_n\)&lt;/span&gt;. If it sits in the right subtree, then it pays cost &lt;span class=&quot;math inline&quot;&gt;\(1\)&lt;/span&gt; (either by moving or staying and incurring the service cost). If we choose
&lt;span class=&quot;math inline&quot;&gt;\(\tau_n \mathrel{\vcenter{:}}=\alpha_{n-1} \mathbb{E}[\mathrm{opt}_{n-1}]\)&lt;/span&gt;, then such an algorithm always pays at least &lt;span class=&quot;math inline&quot;&gt;\(\tau_n^{-1} \alpha_{n-1} \mathbb{E}[\mathrm{opt}_{n-1}]\)&lt;/span&gt; in any of these cases. Therefore:
&lt;/p&gt;
&lt;p&gt;
\begin{equation}\label{eq:algn}
\mathbb{E}[\mathrm{alg}_n] \geq \mu_n \tau_n^{-1} \alpha_{n-1} \mathbb{E}[\mathrm{opt}_{n-1}].
\end{equation}
&lt;/p&gt;
&lt;p&gt;We now bound the cost of the optimal offline algorithm. Let &lt;span class=&quot;math inline&quot;&gt;\(\rho_L\)&lt;/span&gt; be the number of times &lt;span class=&quot;math inline&quot;&gt;\(i=L\)&lt;/span&gt; and let &lt;span class=&quot;math inline&quot;&gt;\(\rho_R \mathrel{\vcenter{:}}= 2\mu_n - \rho_L\)&lt;/span&gt;. The algorithm will always sit by default in the left subtree. If &lt;span class=&quot;math inline&quot;&gt;\(\rho_R=0\)&lt;/span&gt;, it will move to the right subtree, suffer zero service cost there, and then move back to the left subtree (paying total cost &lt;span class=&quot;math inline&quot;&gt;\(2\)&lt;/span&gt;). If &lt;span class=&quot;math inline&quot;&gt;\(\rho_R &amp;gt; 0\)&lt;/span&gt;, the algorithm will remain in the left subtree and play an optimal strategy for &lt;span class=&quot;math inline&quot;&gt;\(T_{n-1}\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;This yields: &lt;span class=&quot;math display&quot;&gt;\[\begin{aligned}
   \mathbb{E}[\mathrm{opt}_n] &amp;amp;\leq \left(1-{\mathbb{P}}[\rho_R=0]\right) \mathbb{E}\left[\rho_L \tau_n^{-1} \mathrm{opt}_{n-1} \mid \rho_R &amp;gt; 0\right]
   + 2 {\mathbb{P}}[\rho_R=0]  \\
   &amp;amp;\leq (1-4^{-\mu_n}) \mu_n \tau_n^{-1} \mathbb{E}[\mathrm{opt}_{n-1}] + 2 \cdot 4^{-\mu_n}.\end{aligned}\]&lt;/span&gt; Now choose: &lt;span class=&quot;math display&quot;&gt;\[\mu_n \mathrel{\vcenter{:}}=\frac{4 \tau_n}{\mathbb{E}[\mathrm{opt}_{n-1}]} = 4 \alpha_{n-1}\,.\]&lt;/span&gt; This yields: &lt;span class=&quot;math display&quot;&gt;\[\mathbb{E}[\mathrm{opt}_n] \leq \left(1-\tfrac12 4^{-4 \alpha_{n-1}}\right) \mu_n \tau_n^{-1} \mathbb{E}[\mathrm{opt}_{n-1}].\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Combined with \eqref{eq:algn}, we have
&lt;span class=&quot;math display&quot;&gt;\[\alpha_n \geq \frac{\mathbb{E}[\mathrm{alg}_n]}{\mathbb{E}[\mathrm{opt}_n]} \geq \alpha_{n-1} + \tfrac12 4^{-4\alpha_{n-1}}.\]&lt;/span&gt; This implies that &lt;span class=&quot;math inline&quot;&gt;\(\alpha_n \geq \beta \log n\)&lt;/span&gt; for some positive &lt;span class=&quot;math inline&quot;&gt;\(\beta &amp;lt; 1\)&lt;/span&gt;. (To see this, observe that if &lt;span class=&quot;math inline&quot;&gt;\(f(x)=\beta \log x\)&lt;/span&gt;, then &lt;span class=&quot;math inline&quot;&gt;\(f'(x) = \beta/x &amp;lt; \frac12 4^{-4 \beta \log x}\)&lt;/span&gt; for &lt;span class=&quot;math inline&quot;&gt;\(\beta\)&lt;/span&gt; chosen small enough.)&lt;/p&gt;
&lt;p&gt;Note furthermore that &lt;span class=&quot;math display&quot;&gt;\[\mathbb{E}[\mathrm{opt}_n] \leq 4\,,\]&lt;/span&gt; and therefore &lt;span class=&quot;math inline&quot;&gt;\(\tau_n \leq O(\log n)\)&lt;/span&gt;.&lt;/p&gt;
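&lt;p&gt;The logarithmic growth implied by the recurrence can also be observed numerically. The sketch below iterates the recurrence with the inequality replaced by an equality, starting from an assumed value &lt;span class=&quot;math inline&quot;&gt;\(\alpha_1 = 0\)&lt;/span&gt;; the function name, the starting value, and the use of the natural logarithm are choices made for this illustration only.&lt;/p&gt;

```python
import math

def iterate_lower_bound(n_max, alpha_1=0.0):
    """Iterate a_n = a_{n-1} + 0.5 * 4**(-4 * a_{n-1}); alpha_1 is an assumed start."""
    values = [alpha_1]
    for _ in range(n_max - 1):
        prev = values[-1]
        values.append(prev + 0.5 * 4.0 ** (-4.0 * prev))
    return values  # values[n - 1] plays the role of the lower bound on alpha_n

values = iterate_lower_bound(10_000)
for n in (10, 100, 1_000, 10_000):
    # third column: ratio to log(n), which should stay bounded away from zero
    print(n, round(values[n - 1], 3), round(values[n - 1] / math.log(n), 3))
```

&lt;p&gt;Comparing with the differential equation &lt;span class=&quot;math inline&quot;&gt;\(a' = \tfrac12 4^{-4a}\)&lt;/span&gt; suggests that the iterates grow like &lt;span class=&quot;math inline&quot;&gt;\(\frac{\log n}{4 \log 4}\)&lt;/span&gt;, consistent with a constant &lt;span class=&quot;math inline&quot;&gt;\(\beta\)&lt;/span&gt; well below &lt;span class=&quot;math inline&quot;&gt;\(1\)&lt;/span&gt;.&lt;/p&gt;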

&lt;h2&gt;The general case&lt;/h2&gt;

&lt;p&gt;We have demonstrated lower bounds in two cases: When the underlying tree &lt;span class=&quot;math inline&quot;&gt;\(T\)&lt;/span&gt; is regular, and when it is unbalanced and binary. The general case can be proved by combining these two strategies along any &lt;span class=&quot;math inline&quot;&gt;\(\Omega((\log n)^2)\)&lt;/span&gt;-HST. The next lemma contains the basic idea. We leave it to the reader as a basic exercise. (Hint: Use the concavity of &lt;span class=&quot;math inline&quot;&gt;\(x \mapsto \sqrt{x}\)&lt;/span&gt;.)&lt;/p&gt;
&lt;p class=&quot;lemma&quot;&gt;
Suppose that &lt;span class=&quot;math inline&quot;&gt;\(n=n_1+n_2+\cdots+n_m\)&lt;/span&gt; where &lt;span class=&quot;math inline&quot;&gt;\(n_1 \geq n_2 \geq \cdots \geq n_m \geq 1\)&lt;/span&gt;. Then either &lt;span class=&quot;math inline&quot;&gt;\(\sqrt{n_1}+\sqrt{n_2} \geq \sqrt{n}\)&lt;/span&gt;, or there is some &lt;span class=&quot;math inline&quot;&gt;\(\ell \geq 3\)&lt;/span&gt; such that &lt;span class=&quot;math inline&quot;&gt;\(\ell \cdot \sqrt{n_\ell} \geq \sqrt{n}\)&lt;/span&gt;.&lt;/p&gt;
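&lt;p&gt;Before attempting the exercise, one can sanity-check the dichotomy by brute force. The following Python sketch enumerates every non-increasing decomposition of &lt;span class=&quot;math inline&quot;&gt;\(n\)&lt;/span&gt; for small &lt;span class=&quot;math inline&quot;&gt;\(n\)&lt;/span&gt; and tests both alternatives in exact integer arithmetic. The function names, the range of &lt;span class=&quot;math inline&quot;&gt;\(n\)&lt;/span&gt;, and the convention &lt;span class=&quot;math inline&quot;&gt;\(n_2 = 0\)&lt;/span&gt; when &lt;span class=&quot;math inline&quot;&gt;\(m=1\)&lt;/span&gt; are assumptions of this sketch, and a finite check is of course not a proof.&lt;/p&gt;

```python
def partitions(n, largest=None):
    """Yield every non-increasing tuple of positive integers that sums to n."""
    if largest is None:
        largest = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, largest), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest

def lemma_holds(parts):
    """Check the dichotomy exactly, using squared integers instead of square roots."""
    n = sum(parts)
    n1 = parts[0]
    n2 = parts[1] if len(parts) >= 2 else 0  # convention for m = 1 (an assumption)
    rest = n - n1 - n2
    # sqrt(n1) + sqrt(n2) >= sqrt(n)  is equivalent to  4 * n1 * n2 >= rest ** 2
    if 4 * n1 * n2 >= rest * rest:
        return True
    # otherwise look for l >= 3 with l * sqrt(n_l) >= sqrt(n), i.e. l * l * n_l >= n
    return any(l * l * parts[l - 1] >= n for l in range(3, len(parts) + 1))

checked = sum(1 for n in range(1, 26) for p in partitions(n))
assert all(lemma_holds(p) for n in range(1, 26) for p in partitions(n))
print(f"dichotomy verified on {checked} decompositions with n up to 25")
```

&lt;p&gt;Squaring both alternatives avoids floating-point comparisons, which matters because equality cases such as &lt;span class=&quot;math inline&quot;&gt;\(n_1=n_2=n_3=n_4=1\)&lt;/span&gt; do occur.&lt;/p&gt;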

&lt;p&gt;Suppose now that &lt;span class=&quot;math inline&quot;&gt;\(T\)&lt;/span&gt; is a rooted tree with subtrees &lt;span class=&quot;math inline&quot;&gt;\(T_1, T_2, \ldots, T_m\)&lt;/span&gt; beneath the root, and such that &lt;span class=&quot;math inline&quot;&gt;\(T_i\)&lt;/span&gt; has &lt;span class=&quot;math inline&quot;&gt;\(n_i\)&lt;/span&gt; leaves with &lt;span class=&quot;math inline&quot;&gt;\(n_1 \geq n_2 \geq \cdots \geq n_m \geq 1\)&lt;/span&gt;. The lemma says we can either consider only the first two subtrees (the binary, possibly “unbalanced” case), or we can take the first &lt;span class=&quot;math inline&quot;&gt;\(\ell\)&lt;/span&gt; subtrees for some &lt;span class=&quot;math inline&quot;&gt;\(\ell \geq 3\)&lt;/span&gt;, prune them so they all have exactly &lt;span class=&quot;math inline&quot;&gt;\(n_\ell\)&lt;/span&gt; leaves, and then prove a lower bound. The analysis of these two cases corresponds roughly to the two lower bound arguments above.&lt;/p&gt;
</description>
        <pubDate>Fri, 20 Apr 2018 00:00:00 +0000</pubDate>
        <link>http://tcsmath.github.io/online/2018/04/20/mts-lower-bounds/</link>
        <guid isPermaLink="true">http://tcsmath.github.io/online/2018/04/20/mts-lower-bounds/</guid>
      </item>
    
      <item>
        <title>Approximation by ultrametrics</title>
        <description>&lt;p&gt;We now take a short detour away from mirror descent, and
instead examine how a special class of metric spaces called &lt;em&gt;ultrametrics&lt;/em&gt;
control the competitive ratio of MTS in general metric spaces.
Hold onto your seats; we’re about to compress two decades and 
10 papers into a few paragraphs.
For the sake of continuity, bibliographic remarks are held to the end of the post.
&lt;script type=&quot;math/tex&quot;&gt;\def\e{\varepsilon}&lt;/script&gt;&lt;/p&gt;

&lt;h2 id=&quot;metric-approximations&quot;&gt;Metric approximations&lt;/h2&gt;

&lt;p&gt;For a metric space $(X,d)$, let $\alpha_{\mathrm{mts}}(X,d)$ denote
the best (randomized) competitive ratio for MTS on $(X,d)$.
Suppose $D$ is some other metric on $X$ that satisfies&lt;/p&gt;
&lt;p&gt;
\begin{equation}\label{eq:distortion}
   \frac{D(x,y)}{K} \leq d(x,y) \leq D(x,y) \qquad \forall x,y \in X\,.
\end{equation}
&lt;/p&gt;
&lt;p&gt;Then it is straightforward to verify that $\alpha_{\mathrm{mts}}(X,d) \leq K \cdot \alpha_{\mathrm{mts}}(X,D)$.&lt;/p&gt;
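&lt;p&gt;As a small illustration of \eqref{eq:distortion}, the following Python sketch computes the smallest admissible $K$ for a pair of finite metrics. The three-point example, the toy ultrametric $D$, and the function name &lt;code&gt;distortion&lt;/code&gt; are hypothetical and serve only to show the definition in action.&lt;/p&gt;

```python
def distortion(points, d, D):
    """Smallest K with D(x, y) / K no larger than d(x, y), assuming D dominates d."""
    pairs = [(x, y) for i, x in enumerate(points) for y in points[i + 1:]]
    if any(d(x, y) > D(x, y) for x, y in pairs):
        raise ValueError("D must dominate d on every pair")
    return max(D(x, y) / d(x, y) for x, y in pairs)

# Hypothetical example: three points on the real line.
points = [0, 1, 3]

def d(x, y):
    """The usual distance on the line."""
    return abs(x - y)

def D(x, y):
    """A toy ultrametric: 0 and 1 sit at distance 1, and 3 is at distance 3 from both."""
    if x == y:
        return 0
    return 1 if {x, y} == {0, 1} else 3

print(distortion(points, d, D))  # 1.5, attained by the pair (1, 3)
```

&lt;p&gt;Here $D$ dominates $d$ and stretches only the pair $(1,3)$, from $2$ to $3$, so a competitive algorithm for $(X,D)$ loses at most a factor $1.5$ when run on $(X,d)$.&lt;/p&gt;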

&lt;p&gt;But there is a weaker form of approximation that still allows for such a conclusion.
Suppose that $\mathbf{D}$ is a &lt;em&gt;random&lt;/em&gt; metric that satisfies, for every $x,y \in X$:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;With probability one, $\mathbf{D}(x,y) \geq d(x,y)$,&lt;/li&gt;
  &lt;li&gt;$\mathbb{E}\left[\mathbf{D}(x,y)\right] \leq K \cdot d(x,y)$.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If $\alpha_{\mathrm{mts}}(X,\mathbf{D}) \leq \alpha$ with probability one, we claim that
$\alpha_{\mathrm{mts}}(X,d) \leq K \alpha$.&lt;/p&gt;

&lt;p&gt;The algorithm achieving this is as follows:  Sample the random metric $\mathbf{D}$.
Now given a cost sequence $\left\langle c_t : X \to \mathbb{R}_+ \mid t \geq 1\right\rangle$,
let $\left\langle x_0, x_1, x_2, \ldots \right\rangle$ be the random sequence of points produced
by an $\alpha$-competitive randomized algorithm for $(X,\mathbf{D})$.  Then:&lt;/p&gt;
&lt;p&gt;
\[
   \mathbb{E}\left[\sum_{t=1}^T \left.\vphantom{\bigoplus}
         \mathbf{D}(x_t, x_{t-1}) \ \right| \mathbf{D}\right] \leq
         \alpha \sum_{t=1}^T \mathbf{D}\!\left(x_t^{*}, x_{t-1}^{*}\right)
         + O(1)\,,
\]
&lt;/p&gt;
&lt;p&gt;where $\left\langle x_t^* : t \geq 0\right\rangle$
is an optimal offline sequence for $(X,d)$.
(This may not be the optimal sequence for $(X,\mathbf{D})$, but the algorithm
is certainly competitive against non-optimal sequences as well.)&lt;/p&gt;

&lt;p&gt;Note that the expectation here is taken only with respect to the randomness in the online algorithm.
If we take expectation with respect to $\mathbf{D}$ as well, it follows that&lt;/p&gt;
&lt;p&gt;
\[
   \mathbb{E}\left[\sum_{t=1}^T d(x_t, x_{t-1})\right] \leq
   \alpha \mathbb{E}\left[\sum_{t=1}^T \mathbf{D}\left(x_t^{*}, x_{t-1}^{*}\right)\right]
         + O(1)
         \leq K \alpha
   \sum_{t=1}^T d\!\left(x_t^{*}, x_{t-1}^{*}\right) + O(1)\,,
\]
&lt;/p&gt;
&lt;p&gt;where we have used property (1) of $\mathbf{D}$ for the LHS and property (2) to bound the RHS.&lt;/p&gt;

&lt;h2 id=&quot;ultrametrics&quot;&gt;Ultrametrics&lt;/h2&gt;

&lt;p&gt;Let $T=(V,E)$ be a finite, rooted tree, and let $\mathcal{L}$ denote the leaves of $T$.
Suppose $w : V\setminus \mathcal{L} \to \mathbb{R}_+$ is a function that assigns
positive weights to the internal vertices of $T$ such that the vertex
weights are non-increasing along root-leaf paths.
Then one can define a distance on $\mathcal{L}$ by
\[
   d_w\left(\ell,\ell'\right) \mathrel{\vcenter{:}}= w\left(\mathrm{lca}(\ell,\ell')\right).
\]
This is an &lt;em&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Ultrametric_space&quot;&gt;ultrametric&lt;/a&gt;&lt;/em&gt; on $\mathcal{L}$ (and all finite ultrametrics
arise in this way).&lt;/p&gt;

&lt;p&gt;It turns out that ultrametrics essentially control the competitive ratio
for metrical task systems on finite metric spaces.
This follows from the next two facts that
hold for an arbitrary $n$-point metric space $(X,d)$.&lt;/p&gt;

&lt;p&gt;&lt;a name=&quot;thm1&quot;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p class=&quot;theorem&quot; text=&quot;Random embedding&quot; ord=&quot;1&quot;&gt;
There is a random ultrametric $\mathbf{D}$ on $(X,d)$ such that (1) and (2) are satisfied with $K \leq O(\log n)$.
&lt;/p&gt;

&lt;p&gt;By our earlier remarks, this implies that the competitive ratio for MTS on $(X,d)$
is at most $O(\log n)$ times the competitive ratio for $n$-point ultrametrics.&lt;/p&gt;

&lt;p&gt;&lt;a name=&quot;thm2&quot;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p class=&quot;theorem&quot; text=&quot;Metric Ramsey&quot; ord=&quot;2&quot;&gt;
There is a subset $X' \subseteq X$ with $|X'| \geq \sqrt{n}$ and
an ultrametric $D$ on $X'$ such that 
\eqref{eq:distortion} is satisfied with $K \leq O(1)$.
&lt;/p&gt;

&lt;p&gt;Since $\alpha_{\mathrm{mts}}(X,d) \geq \alpha_{\mathrm{mts}}(X',d) \geq \Omega(\alpha_{\mathrm{mts}}(X',D))$,
lower bounds on the competitive ratio for ultrametrics yield lower bounds for $(X,d)$ as well.
Finally, we remark that MTS on ultrametrics is now well-understood.&lt;/p&gt;

&lt;p&gt;&lt;a name=&quot;thm3&quot;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p class=&quot;theorem&quot; text=&quot;MTS on ultrametrics&quot; ord=&quot;3&quot;&gt;
   If $(X,D)$ is an $n$-point ultrametric, then
   \[
      \Omega\left(\frac{\log n}{\log \log n}\right) \leq \alpha_{\mathrm{mts}}(X,D) \leq O(\log n)\,.
   \]
&lt;/p&gt;

&lt;p&gt;There are ultrametrics (e.g., as we have seen already, when $(X,D)$ is the uniform metric)
for which the $O(\log n)$ upper bound in &lt;a href=&quot;#thm3&quot;&gt;Theorem 3&lt;/a&gt; is tight.
Whether the lower bound can be improved to $\Omega(\log n)$ is an intriguing open problem;
we will address it in the next lecture.
In conjunction with &lt;a href=&quot;#thm1&quot;&gt;Theorem 1&lt;/a&gt; and &lt;a href=&quot;#thm2&quot;&gt;Theorem 2&lt;/a&gt;,
&lt;a href=&quot;#thm3&quot;&gt;Theorem 3&lt;/a&gt; yields:&lt;/p&gt;

&lt;p&gt;&lt;a name=&quot;cor4&quot;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p class=&quot;corollary&quot; text=&quot;MTS&quot; ord=&quot;4&quot;&gt;
For any $n$-point metric space $(X,d)$:
\[
   \Omega\left(\frac{\log n}{\log \log n}\right) \leq \alpha_{\mathrm{mts}}(X,d) \leq O((\log n)^2)\,.
\]
&lt;/p&gt;
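
&lt;p&gt;For completeness, here is a sketch of how the two bounds combine (with $m$ denoting $|X'|$, a name used only here).
The upper bound multiplies the $O(\log n)$ distortion from Theorem 1 by the $O(\log n)$ upper bound from Theorem 3.
For the lower bound, Theorem 2 provides $m \geq \sqrt{n}$, and since $s \mapsto \log s/\log\log s$ is increasing for large $s$,&lt;/p&gt;
&lt;p&gt;
\[
   \alpha_{\mathrm{mts}}(X,d) \geq \Omega\left(\frac{\log m}{\log \log m}\right) \geq \Omega\left(\frac{\tfrac12 \log n}{\log \log n}\right) = \Omega\left(\frac{\log n}{\log\log n}\right).
\]
&lt;/p&gt;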

&lt;p&gt;Perhaps the central remaining open question for MTS on general metric spaces
is whether the upper bound can be improved to $O(\log n)$.&lt;/p&gt;

&lt;h2 id=&quot;ultrametrics-from-partitions&quot;&gt;Ultrametrics from partitions&lt;/h2&gt;

&lt;p&gt;Consider a partition $P$ of $X$.  Say that $P$ is &lt;em&gt;$\Delta$-bounded&lt;/em&gt;
if $S \in P \implies \mathrm{diam}_X(S) \leq \Delta$.
For a point $x \in X$, we write $P(x)$ for the unique set in $P$ that contains $x$.&lt;/p&gt;

&lt;p&gt;Suppose now that $\mathcal{P} = \{ P_j : j \in \mathbb{Z} \}$ is a sequence of partitions of $X$
such that $P_j$ is $8^j$-bounded for every $j \in \mathbb{Z}$, and define the metric&lt;/p&gt;
&lt;p&gt;
\[
   D_{\mathcal{P}}(x,y) \mathrel{\vcenter{:}}= \max \left\{ 8^{j+1} : P_j(x) \neq P_j(y) \right\}.
\]
&lt;/p&gt;
&lt;p&gt;One can check that $D_{\mathcal{P}}$ is an ultrametric on $X$ (imagine
the tree structure induced by the partitions), and moreover&lt;/p&gt;
&lt;p&gt;
\begin{equation}\label{eq:exp}
   D_{\mathcal{P}}(x,y) \geq d(x,y) \qquad \forall x,y \in X\,.
\end{equation}
&lt;/p&gt;
&lt;p&gt;This follows from:  $D_{\mathcal{P}}(x,y) \leq 8^j \implies P_j(x)=P_j(y) \implies d(x,y) \leq 8^j$.&lt;/p&gt;
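
&lt;p&gt;The ultrametric inequality for $D_{\mathcal{P}}$ can also be checked directly, without the tree picture.
A sketch, assuming the maximum in the definition is attained (i.e., the partitions stop separating points for large $j$):
if $P_j(x) \neq P_j(z)$, then $P_j(y)$ cannot equal both sets, so $P_j(x) \neq P_j(y)$ or $P_j(y) \neq P_j(z)$.
Applying this to the largest $j$ with $P_j(x) \neq P_j(z)$ gives&lt;/p&gt;
&lt;p&gt;
\[
   D_{\mathcal{P}}(x,z) \leq \max\left\{ D_{\mathcal{P}}(x,y), D_{\mathcal{P}}(y,z) \right\} \qquad \forall x,y,z \in X\,.
\]
&lt;/p&gt;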

&lt;p&gt;Define $B(x,r) \mathrel{\vcenter{:}}= \{ y \in X : d(x,y) \leq r \}$ to be the ball of radius $r$ about $x \in X$.&lt;/p&gt;

&lt;p class=&quot;lemma&quot; text=&quot;Random partition lemma&quot;&gt;
   Suppose $(X,d)$ is an $n$-point metric space.  Then for every $\Delta &amp;gt; 0$,
   there is a random $\Delta$-bounded partition $P$ of $X$ such that for every $r \leq \Delta/8$:
   &lt;p&gt;
   \begin{equation}\label{eq:rp}
      \mathbb{P}\left[\vphantom{\bigoplus}
         B(x,r) \subseteq P(x)\right] \geq \exp\left(\frac{-8r}{\Delta} \log \frac{|B(x,\Delta)|}{|B(x,\Delta/8)|}\right).
   \end{equation}
   &lt;/p&gt;
&lt;/p&gt;

&lt;p&gt;Remarkably, the random partitioning lemma can be used to prove both &lt;strong&gt;Theorem 1&lt;/strong&gt; and &lt;strong&gt;Theorem 2&lt;/strong&gt;.
We will first establish these consequences and then prove the lemma.&lt;/p&gt;

&lt;h2 id=&quot;approximation-by-a-random-ultrametric&quot;&gt;Approximation by a random ultrametric&lt;/h2&gt;

&lt;p&gt;Let’s prove &lt;a href=&quot;#thm1&quot;&gt;Theorem 1&lt;/a&gt;.
Let $\mathcal{P} = \{ P_j : j \in \mathbb{Z} \}$ be the random sequence where $P_j$ results from the
random partitioning lemma applied with $\Delta = 8^j$ and we take the partitions to be mutually independent.
Then from \eqref{eq:exp}, we know $\mathbf{D}_{\mathcal{P}}(x,y) \geq d(x,y)$ for all $x,y \in X$.
Now fix $x \neq y \in X$, and let $j_0 \mathrel{\vcenter{:}}= \min \{ j : 8^{j} \geq d(x,y) \}$.  Then:&lt;/p&gt;
&lt;p&gt;
\[
   \mathbb{E}\left[\mathbf{D}_{\mathcal{P}}(x,y)\right] \leq 8^{j_0+1} + \sum_{j &amp;gt; j_0} \mathbb{P}[P_j(x) \neq P_j(y)] 8^{j+1},
\]
&lt;/p&gt;
&lt;p&gt;and using $\mathbb{P}[P_j(x) = P_j(y)] \geq \mathbb{P}[B(x, d(x,y)) \subseteq P_j(x)]$ yields&lt;/p&gt;
&lt;p&gt;
\begin{align*}
\sum_{j &amp;gt; j_0} \mathbb{P}[P_j(x) \neq P_j(y)] 8^{j+1}
&amp;amp;\leq \sum_{j \in \mathbb{Z}} 8^{j+2} \frac{d(x,y)}{8^{j}} \log \frac{|B(x, 8^j)|}{|B(x,8^{j-1})|} \\
&amp;amp;\leq 64\, d(x,y) \sum_{j \in \mathbb{Z}} \log \frac{|B(x, 8^j)|}{|B(x,8^{j-1})|} \\
&amp;amp;= 64 \,d(x,y) \log n\,.
\end{align*}
&lt;/p&gt;
&lt;p&gt;where the first inequality follows from \eqref{eq:rp} and the fact that $e^{-x} \geq 1-x$,
and in the final step we evaluate a telescoping sum.&lt;/p&gt;
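
&lt;p&gt;To spell out the remaining step (a sketch; the constant is not optimized): minimality of $j_0$ gives
$8^{j_0-1} &amp;lt; d(x,y)$, hence $8^{j_0+1} &amp;lt; 64\, d(x,y)$, and combining the two displays above yields&lt;/p&gt;
&lt;p&gt;
\[
   \mathbb{E}\left[\mathbf{D}_{\mathcal{P}}(x,y)\right] \leq 64\, d(x,y) + 64\, d(x,y) \log n = 64\,(1+\log n)\, d(x,y),
\]
&lt;/p&gt;
&lt;p&gt;which is an expected stretch of $O(\log n)$, as claimed in Theorem 1.&lt;/p&gt;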

&lt;h2 id=&quot;finding-a-large-approximate-ultrametric-inside-x&quot;&gt;Finding a large approximate ultrametric inside $X$&lt;/h2&gt;

&lt;p&gt;Now we prove &lt;a href=&quot;#thm2&quot;&gt;Theorem 2&lt;/a&gt;.
Let $\mathcal{P}$ be the same random partition sequence chosen above and fix some $0 &amp;lt; \e &amp;lt; 1/8$.
Define the random subset:&lt;/p&gt;
&lt;p&gt;
\[
   \mathbf{S} \mathrel{\vcenter{:}}= \left\{ x \in X : B(x, \e 8^j) \subseteq P_j(x) \ \forall j \in \mathbb{Z}\right\}\,.
\]
&lt;/p&gt;
&lt;p&gt;We claim that&lt;/p&gt;
&lt;p&gt;
\[
   D_{\mathcal{P}}(x,y) \geq d(x,y) \geq \frac{\e}{8} D_{\mathcal{P}}(x,y) \qquad \forall x,y \in \mathbf{S}\,.
\]
&lt;/p&gt;
&lt;p&gt;The LHS is simply from \eqref{eq:exp}.
For the RHS, observe that if $D_{\mathcal{P}}(x,y)=8^{j+1}$, then $P_j(x) \neq P_j(y)$,
hence $B(x,\e 8^j) \subseteq P_j(x) \implies d(x,y) \geq \e 8^j$.&lt;/p&gt;

&lt;p&gt;Now the random partitioning lemma gives, for any $x \in X$,&lt;/p&gt;
&lt;p&gt;
\[
   \mathbb{P}[x \in \mathbf{S}] \geq \prod_{j \in \mathbb{Z}} \exp\left(- 8\e \log \frac{|B(x,8^j)|}{|B(x,8^{j-1})|}\right)
   = \exp\left(-8 \e \log n\right) = n^{-8\e}\,.
\]
&lt;/p&gt;
&lt;p&gt;By linearity of expectation: $\mathbb{E}[|\mathbf{S}|] \geq n^{1-8\e}$.  Taking $\e \mathrel{\vcenter{:}}= 1/16$ completes the proof.&lt;/p&gt;
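
&lt;p&gt;Concretely, here is how the constants work out under this choice (a sketch; we assume the distortion condition in Theorem 2 has the form $d \leq D \leq K d$ on $X'$).
With $\e = 1/16$ we get $\mathbb{E}[|\mathbf{S}|] \geq n^{1/2}$, so some realization of $\mathcal{P}$ has $|\mathbf{S}| \geq \sqrt{n}$.
Taking $X' = \mathbf{S}$ and $D = D_{\mathcal{P}}$ for that realization, the claim above gives&lt;/p&gt;
&lt;p&gt;
\[
   d(x,y) \leq D_{\mathcal{P}}(x,y) \leq \frac{8}{\e}\, d(x,y) = 128\, d(x,y) \qquad \forall x,y \in X',
\]
&lt;/p&gt;
&lt;p&gt;so one may take $K \leq 128$.&lt;/p&gt;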

&lt;h2 id=&quot;proof-of-the-random-partitioning-lemma&quot;&gt;Proof of the random partitioning lemma&lt;/h2&gt;

&lt;p&gt;Let $\mathbf{R} \in [\Delta/4,\Delta/2]$ be chosen uniformly,
and let $X = \{x_1, x_2, \ldots, x_n\}$ be a uniformly random
ordering of the points in $X$.
Our random partitioning is formed by iteratively carving out balls:&lt;/p&gt;
&lt;p&gt;
\begin{equation}\label{eq:ckr}
   P \mathrel{\vcenter{:}}= \left\{ B(x_i, \mathbf{R}) \setminus \bigcup_{j &amp;lt; i} B(x_j, \mathbf{R}) : i=1,2,\ldots,n\right\}.
\end{equation}
&lt;/p&gt;
&lt;p&gt;Clearly $P$ is $\Delta$-bounded by construction.&lt;/p&gt;

&lt;p&gt;Fix $r \leq \Delta/8$ and observe first that&lt;/p&gt;
&lt;p&gt;
\[
   \mathbb{P}\left[B(x,r) \subseteq P(x) \mid \mathbf{R}\right] \geq \frac{|B(x,\mathbf{R}-r)|}{|B(x,\mathbf{R}+r)|}.
\]
&lt;/p&gt;
&lt;p&gt;This follows because if we condition on $\mathbf{R}$, then only centers in $B(x,\mathbf{R}+r)$
can decide the fate of $B(x,r)$, and the corresponding cluster will contain
all of $B(x,r)$ if the center lies in $B(x,\mathbf{R}-r)$.&lt;/p&gt;

&lt;p&gt;Thus we have:&lt;/p&gt;
&lt;p&gt;
\begin{align*}
   \mathbb{P}\left[B(x,r) \subseteq P(x)\right] &amp;amp;\geq \mathbb{E}\left[\frac{|B(x,\mathbf{R}-r)|}{|B(x,\mathbf{R}+r)|}\right] \\
                                         &amp;amp;= \mathbb{E}\left[\exp\left(- \log \frac{|B(x,\mathbf{R}+r)|}{|B(x,\mathbf{R}-r)|}\right)\right]  \\
                                         &amp;amp;\geq 
   \exp \left(\mathbb{E}\left[- \log \frac{|B(x,\mathbf{R}+r)|}{|B(x,\mathbf{R}-r)|}\right]\right)  \\
&amp;amp;\geq \exp\left(\frac{- 8 r}{\Delta} \log \frac{|B(x,\Delta)|}{|B(x,\Delta/8)|}\right)\,,
\end{align*}
&lt;/p&gt;
&lt;p&gt;where the second inequality uses convexity of $e^{-x}$ and
the last inequality comes from the calculation&lt;/p&gt;
&lt;p&gt;
\begin{align*}
\mathbb{E}\left[ \log \frac{|B(x,\mathbf{R}+r)|}{|B(x,\mathbf{R}-r)|}\right] 
&amp;amp;= \frac{4}{\Delta} \int_{\Delta/4}^{\Delta/2} 
\log \frac{|B(x,R+r)|}{|B(x,R-r)|}\,dR \\
&amp;amp;\leq \frac{8r}{\Delta} \log \frac{|B(x,\Delta/2+r)|}{|B(x,\Delta/4-r)|} \\
&amp;amp;\leq \frac{8r}{\Delta} \log \frac{|B(x,\Delta)|}{|B(x,\Delta/8)|}\,,
\end{align*}
&lt;/p&gt;
&lt;p&gt;using $r \leq \Delta/8$.&lt;/p&gt;
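
&lt;p&gt;The middle inequality in the last display can be unpacked as follows; this is a sketch in which
$h(s) \mathrel{\vcenter{:}}= \log |B(x,s)|$ is shorthand introduced only here.
Shifting the variable of integration and using that $h$ is non-decreasing gives&lt;/p&gt;
&lt;p&gt;
\begin{align*}
\int_{\Delta/4}^{\Delta/2} \left(h(R+r) - h(R-r)\right) dR
&amp;amp;= \int_{\Delta/2-r}^{\Delta/2+r} h(s)\,ds - \int_{\Delta/4-r}^{\Delta/4+r} h(s)\,ds \\
&amp;amp;\leq 2r \left( h(\Delta/2+r) - h(\Delta/4-r)\right),
\end{align*}
&lt;/p&gt;
&lt;p&gt;and multiplying by $4/\Delta$ produces the factor $8r/\Delta$.&lt;/p&gt;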

&lt;h2 id=&quot;historical-remarks&quot;&gt;Historical remarks&lt;/h2&gt;

&lt;p&gt;The random embedding theorem (&lt;a href=&quot;#thm1&quot;&gt;Theorem 1&lt;/a&gt;)
is due to &lt;a href=&quot;https://www.sciencedirect.com/science/article/pii/S0022000004000637&quot;&gt;Fakcharoenphol, Rao, and Talwar&lt;/a&gt;.
The first such result was &lt;a href=&quot;http://ieeexplore.ieee.org/iel3/4141/11790/00548477.pdf&quot;&gt;proved by Bartal&lt;/a&gt;, and he later
&lt;a href=&quot;https://dl.acm.org/citation.cfm?id=276725&quot;&gt;obtained a near-optimal bound&lt;/a&gt;
of $O(\log n \log \log n)$ on the distortion.
The use of random tree approximations of arbitrary metric spaces
for online algorithms arose somewhat earlier in
the work of &lt;a href=&quot;http://www.icsi.berkeley.edu/ftp/pub/techreports/1991/tr-91-066.pdf&quot;&gt;Alon, Karp, Peleg, and West&lt;/a&gt;,
specifically in relation to the $k$-server problem.&lt;/p&gt;

&lt;p&gt;The metric Ramsey theorem (&lt;a href=&quot;#thm2&quot;&gt;Theorem 2&lt;/a&gt;)
is a result of &lt;a href=&quot;https://arxiv.org/abs/math/0406353&quot;&gt;Bartal, Linial, Mendel, and Naor&lt;/a&gt;,
improving over an earlier bound of &lt;a href=&quot;https://arxiv.org/abs/cs/0406028&quot;&gt;Bartal, Bollobas, and Mendel&lt;/a&gt;
who established that one can take $|X'| \geq \exp(c \sqrt{\log n})$ for some $c &amp;gt; 0$.&lt;/p&gt;

&lt;p&gt;The upper bound in &lt;a href=&quot;#thm3&quot;&gt;Theorem 3&lt;/a&gt;
is from a forthcoming paper with Bubeck, Cohen, and Y. T. Lee,
and improves the
$O(\log n \log \log n)$ upper bound of &lt;a href=&quot;https://arxiv.org/abs/cs/0406034&quot;&gt;Fiat and Mendel&lt;/a&gt;
to the optimal value (up to a constant factor).
The lower bound in &lt;a href=&quot;#thm3&quot;&gt;Theorem 3&lt;/a&gt;
is from the aforementioned paper of Bartal, Bollobas, and Mendel;
we will discuss it in the next lecture.&lt;/p&gt;

&lt;p&gt;The random partitioning lemma
and the proof of &lt;a href=&quot;#thm2&quot;&gt;Theorem 2&lt;/a&gt;
presented here come from a paper of &lt;a href=&quot;https://arxiv.org/abs/cs/0511084&quot;&gt;Mendel and Naor&lt;/a&gt;.
This proof of the random partitioning lemma is somewhat
cleaner than the one presented there.
The (now famous) distribution on random partitions 
described in \eqref{eq:ckr} is from &lt;a href=&quot;https://epubs.siam.org/doi/abs/10.1137/S0097539701395978&quot;&gt;Calinescu, Karloff, and Rabani&lt;/a&gt;.&lt;/p&gt;

</description>
        <pubDate>Thu, 12 Apr 2018 00:00:00 +0000</pubDate>
        <link>http://tcsmath.github.io/online/2018/04/12/metric-approx/</link>
        <guid isPermaLink="true">http://tcsmath.github.io/online/2018/04/12/metric-approx/</guid>
      </item>
    
      <item>
        <title>Metrical task systems on a weighted star</title>
        <description>&lt;p&gt;Let’s recall the definition of MTS.
&lt;script type=&quot;math/tex&quot;&gt;\def\K{\mathsf{K}}
\def\R{\mathbb{R}}
\def\seteq{\mathrel{\vcenter{:}}=}
\def\cE{\mathcal{E}}
\def\argmin{\mathrm{argmin}}
\def\llangle{\left\langle}
\def\rrangle{\right\rangle}
\def\1{\mathbf{1}}
\def\e{\varepsilon}
\def\Lip{\mathrm{Lip}}&lt;/script&gt;
There is a metric space $(X,d)$.
At each time, we receive a cost function $c_t : X \to \R_+$,
and need to respond with a point $x_t \in X$.
The cost we pay at time $t$ is the sum of the &lt;em&gt;service cost&lt;/em&gt;
and the &lt;em&gt;movement cost:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;
\[
   c_t(x_t) + d(x_{t-1},x_t).
\]
&lt;/p&gt;
&lt;p&gt;Our goal is to be competitive against the best &lt;em&gt;offline&lt;/em&gt; algorithm:
It should hold that for every $T \geq 1$,&lt;/p&gt;

&lt;p&gt;
   \[
   \sum_{t=1}^T c_t(x_t) + d(x_{t-1},x_t) \leq \alpha \left(\sum_{t=1}^T c_t(x^*_t) + d(x^*_{t-1},x^*_t)\right) + C
\]
&lt;/p&gt;

&lt;p&gt;where $C &amp;gt; 0$ is some constant independent of the cost sequence, and $\llangle x^*_t : t=0,1,\ldots,T\rrangle$
is some optimal offline sequence for $\llangle c_t : t=1,2,\ldots,T\rrangle$.
Such an online algorithm is said to be &lt;em&gt;$\alpha$-competitive.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It is a long-standing conjecture that there is an $O(\log n)$-competitive randomized algorithm
for MTS on every $n$-point metric space.
Actually, it is conjectured that the competitive ratio is $\Theta(\log n)$ for every $n$-point metric space.
We will see the known $\Omega\left(\frac{\log n}{\log \log n}\right)$ lower bound of
&lt;a href=&quot;https://arxiv.org/abs/cs/0406028&quot;&gt;Bartal, Bollobas, and Mendel&lt;/a&gt; in a few lectures.&lt;/p&gt;

&lt;p&gt;Let us remark that a lower bound of $\Omega(\log n)$ for the $n$-point uniform metric is straightforward.
At every point in time, the adversary chooses a uniformly random $z_t \in X$ and defines
the cost function by $c_t(z_t)=+\infty$ and $c_t(z)=0$ for $z \neq z_t$.
Clearly any online algorithm incurs movement cost $\asymp t/n$ in expectation after $t$ steps.
On the other hand, an offline algorithm can break the request sequence into phases
$0 = t_0 &amp;lt; t_1 &amp;lt; t_2 &amp;lt; \cdots$ such that the times $\{t_i\}$
are minimal subject to the constraint $\{1,2,\ldots,n\} = \{z_{t_i+1}, \ldots, z_{t_{i+1}}\}$.
Clearly the offline algorithm only has to move once per phase, and by a standard coupon collector
argument, the expected length of each phase is $\asymp n \log n$.  Thus the offline algorithm
only incurs movement cost $\asymp t/(n \log n)$.&lt;/p&gt;
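
&lt;p&gt;For reference, the coupon collector estimate can be sketched as follows (the index $k$ is introduced only here).
If $k$ distinct points have been requested so far in a phase, the number of steps until a new point is requested
is geometric with mean $n/(n-k)$, so the expected phase length is&lt;/p&gt;
&lt;p&gt;
\[
   \sum_{k=0}^{n-1} \frac{n}{n-k} = n \left(1 + \frac12 + \cdots + \frac1n\right) \asymp n \log n.
\]
&lt;/p&gt;
&lt;p&gt;Comparing the two movement costs then gives a ratio of order $\frac{t/n}{t/(n \log n)} = \log n$.&lt;/p&gt;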

&lt;p&gt;In this lecture, we’ll establish
the conjectured upper bound for the special case
when $(X,d)$ is the path metric on an $n$-point weighted star:  $X=\{1,2,\ldots,n\}$
and $d(i,j)=w_i+w_j$ for some collection $w_1,w_2,\ldots,w_n &amp;gt; 0$
of positive weights.
This will be a building block of our eventual algorithm
for trees, which will then be used to obtain
an $O((\log n)^2)$-competitive algorithm for any metric space.&lt;/p&gt;

&lt;h2 id=&quot;the-transportation-distance&quot;&gt;The transportation distance&lt;/h2&gt;

&lt;p&gt;As in the first lecture, instead of a randomized strategy,
we will play a fractional strategy:  At every time $t$, we maintain a probability
distribution $p_t : X \to [0,1]$.
In this case, our service cost is $\sum_{x \in X} p_t(x) c_t(x)$, and our movement cost is
the &lt;em&gt;transportation distance&lt;/em&gt; $W_1(p_{t-1},p_t)$.  Also known as the earthmover distance,
$W_1(p,q)$ is the cost of a minimal transport plan between the probability distributions
$p$ and $q$.&lt;/p&gt;

&lt;p&gt;A primary reason for looking first at weighted star metrics is that their transportation
distance can be described by a weighted $\ell_1$ norm:
$W_1(p,q) = \|p-q\|_{\ell_1(w)},$ where&lt;/p&gt;
&lt;p&gt;
\[
   \|v\|_{\ell_1(w)} = \sum_{x \in X} w_x |v(x)|.
\]
&lt;/p&gt;
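
&lt;p&gt;Here is a sketch of why this identity holds; the coupling $\pi$ is notation introduced only for this illustration.
Let $\pi$ be any transport plan, i.e., a distribution on $X \times X$ with marginals $p$ and $q$.
Since $d(x,y) = w_x + w_y$ for $x \neq y$, its cost is&lt;/p&gt;
&lt;p&gt;
\[
   \sum_{x \neq y} \pi(x,y) (w_x + w_y) = \sum_{x \in X} w_x \left(p(x) + q(x) - 2 \pi(x,x)\right).
\]
&lt;/p&gt;
&lt;p&gt;Every plan satisfies $\pi(x,x) \leq \min\{p(x),q(x)\}$, and equality can be achieved for all $x$ simultaneously
(leave the common mass in place and route the excess arbitrarily), so the minimum cost is
$\sum_{x \in X} w_x \left(p(x)+q(x) - 2\min\{p(x),q(x)\}\right) = \sum_{x \in X} w_x |p(x)-q(x)|$.&lt;/p&gt;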

&lt;h2 id=&quot;the-online-algorithm&quot;&gt;The online algorithm&lt;/h2&gt;

&lt;p&gt;As in the first lecture, we will design our algorithm in continuous time
(and this is without loss of generality).
Given some continuous trajectory of cost functions $c_t : X \to \R_+$ with $t \geq 0$,
our algorithm will be a trajectory $p_t$ of distributions, and our instantaneous
cost is described by&lt;/p&gt;
&lt;p&gt;
\[
   \|\partial_t p_t\|_{\ell_1(w)} + \langle c_t,p_t\rangle,
\]
&lt;/p&gt;
&lt;p&gt;where $\langle f,g\rangle = \sum_{x \in X} f(x) g(x)$.&lt;/p&gt;

&lt;p&gt;We will design the algorithm in the mirror descent framework of the previous lecture.
The underlying convex body will be the probability simplex on $X$:&lt;/p&gt;
&lt;p&gt;
\[
   \K \seteq \left\{ p \in [0,1]^X : \sum_{x \in X} p(x)=1\right\}.
\]
&lt;/p&gt;

&lt;p&gt;The second object we need is the &lt;em&gt;mirror map&lt;/em&gt; $\Phi : \K \to \R$.
Our control will be $F(t,\cdot)=- c_t$.
Recall from the preceding lecture that if we do continuous-time mirror descent on $\K$ using $\Phi$,
it will hold
that the corresponding Bregman divergence $D_{\Phi}$ acts as a Lyapunov functional:
For every fixed distribution $q \in \K$:&lt;/p&gt;
&lt;p&gt;
\begin{equation}\label{eq:lya}
   \partial_t D_{\Phi}(q; p_t) \leq \langle c_t, q-p_t\rangle.
\end{equation}
&lt;/p&gt;
&lt;p&gt;In other words, if our algorithm is paying more cost than $q$, then we are getting “closer” to $q$
in the Bregman distance.&lt;/p&gt;

&lt;p&gt;This allows us to control our service cost against a &lt;em&gt;fixed&lt;/em&gt; target $q \in \K$.
We need to choose $\Phi$ with two additional things in mind:  We also want the movement cost
to be controlled, and we want to compare ourselves to a possibly moving target $q_t^* = \1_{x_t^*}$.&lt;/p&gt;

&lt;h2 id=&quot;controlling-the-movement-cost-by-the-service-cost&quot;&gt;Controlling the movement cost by the service cost&lt;/h2&gt;

&lt;p&gt;In order to control the movement cost, we will simply try to make it comparable
to the current instantaneous service cost.
Let us recall the trajectory of mirror descent from the preceding lecture:&lt;/p&gt;
&lt;p&gt;
\begin{equation}\label{eq:evo}
   \partial_t p_t = (\nabla^2 \Phi(p_t))^{-1} \left(- c_t - \lambda_t\right),
\end{equation}
&lt;/p&gt;
&lt;p&gt;where $\lambda_t \in N_{\K}(p_t)$.  Recall that $\lambda_t$ represents the set of normal forces
that are constraining us to lie in $\K$.&lt;/p&gt;

&lt;p&gt;Let’s ignore $\lambda_t$ for the moment, and calculate the instantaneous movement cost induced by the cost function alone:&lt;/p&gt;

&lt;p&gt;
\begin{equation}\label{eq:compare}
   \|\partial_t p_t\|_{\ell_1(w)} = \sum_{x \in X} w_x \left((\nabla^2 \Phi(p_t))^{-1} c_t\right)(x).
\end{equation}
&lt;/p&gt;

&lt;p&gt;Thus if we want $\|\partial_t p_t\|_{\ell_1(w)} \asymp \langle c_t,p_t\rangle$, it makes sense to choose $\Phi$
to be a weighted entropy:&lt;/p&gt;
&lt;p&gt;
\[
   \Phi_w(p) \seteq \sum_{x \in X} w_x p(x) \log p(x).
\]
&lt;/p&gt;
&lt;p&gt;In this case, the Hessian $\nabla^2 \Phi_w(p)$ is a diagonal matrix with:&lt;/p&gt;
&lt;p&gt;
\[
   \left(\nabla^2 \Phi_w(p)\right)_{x,x} = \frac{w_x}{p(x)},
\]
&lt;/p&gt;
&lt;p&gt;and \eqref{eq:compare} becomes&lt;/p&gt;
&lt;p&gt;
\[
   \|\partial_t p_t\|_{\ell_1(w)} = \sum_{x \in X} p_t(x) c_t(x),
\]
&lt;/p&gt;
&lt;p&gt;as desired.&lt;/p&gt;
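
&lt;p&gt;The substitution can be spelled out as follows (a sketch of the routine calculation).
Differentiating $w_x\, p(x) \log p(x)$ once in $p(x)$ gives $w_x (1 + \log p(x))$ and twice gives $w_x/p(x)$,
so the inverse Hessian is diagonal with entries $p(x)/w_x$, and each term of \eqref{eq:compare} becomes&lt;/p&gt;
&lt;p&gt;
\[
   w_x \left((\nabla^2 \Phi_w(p_t))^{-1} c_t\right)(x) = w_x \cdot \frac{p_t(x)}{w_x} \cdot c_t(x) = p_t(x)\, c_t(x).
\]
&lt;/p&gt;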

&lt;h2 id=&quot;tracking-a-moving-target&quot;&gt;Tracking a moving target&lt;/h2&gt;

&lt;p&gt;The second thing we need to address is that \eqref{eq:lya} compares our service cost
against that of a fixed distribution $q$.  In order to analyze how we track a moving target,
let’s take a step back and compute $\partial_t D_{\Phi}(z_t; y)$ for a general 
strictly convex function $\Phi$.  Recall that&lt;/p&gt;
&lt;p&gt;
\[
   D_{\Phi}(z;y) = \Phi(z) - \Phi(y) - \langle \nabla \Phi(y), z-y\rangle,
\]
&lt;/p&gt;
&lt;p&gt;and therefore&lt;/p&gt;
&lt;p&gt;
\begin{align}
   \partial_t D_{\Phi}(z_t; y) &amp;amp;= \langle \nabla \Phi(z_t) - \nabla \Phi(y), \partial_t z_t \rangle \nonumber \\
                               &amp;amp;\leq \left(\|\nabla \Phi(y)\|_* + \|\nabla \Phi(z_t)\|_*\right) \|\partial_t z_t\| \nonumber \\
                               &amp;amp;\leq 2 \Lip_{\K, \|\cdot\|_*} (\Phi) \cdot \|\partial_t z_t\|.\label{eq:move}
   \end{align}
&lt;/p&gt;
&lt;p&gt;The first inequality holds for any norm $\|\cdot\|$, where $\|\cdot\|_*$ denotes the dual norm, and
we write&lt;/p&gt;
&lt;p&gt;
\[
   \Lip_{\K,\|\cdot\|_*}(f) = \sup_{z \in \K} \|\nabla f(z)\|_*.
\]
&lt;/p&gt;

&lt;p&gt;Returning from the abstract to our current setting, it makes sense to choose $\|\cdot\| = \|\cdot\|_{\ell_1(w)}$,
and then \eqref{eq:move} exactly says that when a point moves away
from us in the Bregman divergence, it must pay for this with its own movement cost!
Using the chain rule, we now have&lt;/p&gt;
&lt;p&gt;
\begin{equation}\label{eq:breg2}
   \partial_t D_{\Phi}(q_t; p_t) \leq \langle c_t, q_t-p_t\rangle +
2 \Lip_{\K, \|\cdot\|_{*}} (\Phi) \cdot \|\partial_t q_t\|_{\ell_1(w)}
\end{equation}
&lt;/p&gt;

&lt;p&gt;This looks great, except that we need to calculate the Lipschitz constant of $\Phi$.
If we try
this for the weighted entropy $\Phi_w$, we immediately run into a problem:
Since $(\nabla \Phi_w(p))(x) = w_x (1+\log p(x))$, the $\ell_{\infty}$ norm blows up as $p(x) \to 0$.
This is a manifestation of the fact, demonstrated in the first lecture,
that exponential weights is too conservative to be competitive against a moving target.&lt;/p&gt;
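
&lt;p&gt;To make the blow-up explicit, here is a sketch of the dual norm computation (a standard fact, stated here as an aside).
The norm dual to $\|\cdot\|_{\ell_1(w)}$ is a weighted $\ell_{\infty}$ norm:&lt;/p&gt;
&lt;p&gt;
\[
   \|v\|_* = \sup \left\{ \langle u,v\rangle : \|u\|_{\ell_1(w)} \leq 1 \right\} = \max_{x \in X} \frac{|v(x)|}{w_x},
\]
&lt;/p&gt;
&lt;p&gt;so that $\|\nabla \Phi_w(p)\|_* = \max_{x \in X} |1+\log p(x)|$, which is unbounded as any coordinate $p(x) \to 0$.&lt;/p&gt;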

&lt;h2 id=&quot;the-exploration-shift-revisited&quot;&gt;The exploration shift, revisited&lt;/h2&gt;

&lt;p&gt;To fix this problem, we will consider the shifted entropy:&lt;/p&gt;
&lt;p&gt;
\[
   \Phi \seteq \Phi_{w,\delta}(p) \seteq \sum_{x \in X} w_x (p(x) + \delta) \log (p(x)+\delta).
\]
&lt;/p&gt;

&lt;p&gt;Observe that now:&lt;/p&gt;
&lt;p&gt;
\[
   \Lip_{\K,\|\cdot\|_*} (\Phi) \leq \log (1/\delta).
\]
&lt;/p&gt;
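&lt;p&gt;A sketch of this bound, assuming the dual norm is the weighted $\ell_{\infty}$ norm $\|v\|_* = \max_{x \in X} |v(x)|/w_x$
and (for concreteness, an assumption made only here) that $\delta \leq 1/4$:
we have $(\nabla \Phi(p))(x) = w_x \left(1 + \log (p(x)+\delta)\right)$, and for $p(x) \in [0,1]$,&lt;/p&gt;
&lt;p&gt;
\[
   1 + \log \delta \leq 1 + \log(p(x)+\delta) \leq 1 + \log (1+\delta).
\]
&lt;/p&gt;
&lt;p&gt;For $\delta \leq 1/4$, both $|1+\log \delta| = \log(1/\delta) - 1$ and $1 + \log(1+\delta)$ are at most $\log(1/\delta)$.&lt;/p&gt;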
&lt;p&gt;Plugging this into \eqref{eq:breg2} gives&lt;/p&gt;
&lt;p&gt;
\begin{equation}\label{eq:move2}
   \partial_t D_{\Phi}(q_t; p_t) \leq \langle c_t, q_t-p_t\rangle + 2 \log(1/\delta) \|\partial_t q_t\|_{\ell_1(w)}.
\end{equation}
&lt;/p&gt;
&lt;p&gt;Note that rearranging yields&lt;/p&gt;
&lt;p&gt;
\begin{equation}\label{eq:anal}
   \langle c_t, p_t \rangle \leq \langle c_t,q_t\rangle + 2  \log(1/\delta) \|\partial_t q_t\|_{\ell_1(w)} - \partial_t
   D_{\Phi}(q_t;p_t).
\end{equation}
&lt;/p&gt;
&lt;p&gt;If $q_0=p_0$, then $D_{\Phi}(q_0;p_0) = 0$ and $D_{\Phi} \geq 0$ always, so integrating would seem
to give us an $O(\log (1/\delta))$-competitive algorithm.&lt;/p&gt;
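
&lt;p&gt;Explicitly, integrating \eqref{eq:anal} over $[0,T]$ gives the following (a sketch of the routine step):&lt;/p&gt;
&lt;p&gt;
\begin{align*}
   \int_0^T \langle c_t, p_t\rangle\,dt
   &amp;amp;\leq \int_0^T \langle c_t, q_t\rangle\,dt + 2 \log(1/\delta) \int_0^T \|\partial_t q_t\|_{\ell_1(w)}\,dt + D_{\Phi}(q_0;p_0) - D_{\Phi}(q_T; p_T) \\
   &amp;amp;\leq \int_0^T \langle c_t, q_t\rangle\,dt + 2 \log(1/\delta) \int_0^T \|\partial_t q_t\|_{\ell_1(w)}\,dt\,.
\end{align*}
&lt;/p&gt;
&lt;p&gt;That is, our service cost is at most the service cost of $q_t$ plus $2\log(1/\delta)$ times its movement cost.&lt;/p&gt;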

&lt;p&gt;But this is only true if $\|\partial_t p_t\|_{\ell_1(w)} \asymp \langle c_t, p_t\rangle$,
and our previous calculations toward this end do not necessarily hold
once we incorporate the shift by $\delta$.
This will be the fundamental tension in the course:
Making $\delta$ larger encourages “exploration” that is necessary
to respond quickly enough to changes in the cost function.
But exploration comes at the cost of movement.&lt;/p&gt;

&lt;p&gt;In the setting where $\K$ is the simplex, it is possible to control things by hand,
but for more complicated convex bodies, the normal forces described by $\lambda_t$
in \eqref{eq:evo} will be substantially more complicated.
In the next lecture, we will use this approach to 
derive an $O(\log k)$-competitive algorithm for the weighted $k$-paging problem,
and we will implement a different strategy for the exploration shift.&lt;/p&gt;

&lt;h2 id=&quot;analyzing-the-movement-of-p_t&quot;&gt;Analyzing the movement of $p_t$&lt;/h2&gt;

&lt;p&gt;So now let us describe the algorithm $p_t$ using our shifted entropy map and \eqref{eq:evo}:&lt;/p&gt;
&lt;p&gt;
\begin{equation}\label{eq:evo2}
   \partial_t p_t(x) = \frac{p_t(x)+\delta}{w_x} \left(-c_t(x) - \lambda_t(x)\right).
\end{equation}
&lt;/p&gt;

&lt;p&gt;Toward this end, let us write&lt;/p&gt;
&lt;p&gt;
\[
   \lambda_t = \xi_t + \mu_t,
\]
&lt;/p&gt;
&lt;p&gt;where $\xi_t : X \to \R_+$ are the multipliers corresponding to the positivity constraints $p(x) \geq 0$
and $\mu_t : X \to \R$ is the multiplier corresponding to the constraint $\sum_{x \in X} p(x) = 1$.
Note that this decomposition is not necessarily unique, but what we want is that
the complementary slackness conditions hold:  $\xi_t(x) &amp;gt; 0 \implies p_t(x)=0$.
(See the discussion of the normal cone for a polytope from the preceding lecture.)&lt;/p&gt;

&lt;p&gt;As in the first lecture, it will suffice to bound only one direction of the movement.
In this case, we will consider the negative coordinates in $\partial_t p_t$.
It is a straightforward calculation to check that $\mu_t \leq 0$ by summing over $x$ in \eqref{eq:evo2}
(we are reducing the probability mass in response to a cost function, so $\mu_t$ is keeping
the total probability mass from dropping).  Therefore \eqref{eq:evo2} gives&lt;/p&gt;
&lt;p&gt;
\[
   \left\|\left(\partial_t p_t\right)_-\right\|_{\ell_1(w)} \leq \llangle p_t+\delta \1, c_t + \xi_t\rrangle.
\]
&lt;/p&gt;
&lt;p&gt;Note first that $\langle p_t, c_t+\xi_t\rangle = \langle p_t, c_t\rangle$
by complementary slackness.&lt;/p&gt;

&lt;p&gt;Finally, consider any fixed $r_0 \in \K$ and
use $\langle \mu_t, r_0- p_t\rangle=0$ to write&lt;/p&gt;
&lt;p&gt;
\[
   \langle c_t + \xi_t, r_0 - p_t\rangle = \langle c_t + \lambda_t, r_0 - p_t\rangle
   = \llangle \nabla^2 \Phi(p_t) \partial_t p_t, p_t-r_0 \rrangle = \partial_t D_{\Phi}(r_0; p_t).
\]
&lt;/p&gt;

&lt;p&gt;Putting this all together and using $r_0 = \frac{1}{n} \1$ yields&lt;/p&gt;
&lt;p&gt;
\[
   \left\|\left(\partial_t p_t\right)_-\right\|_{\ell_1(w)}
   \leq \left((1+\delta n) \langle c_t,p_t\rangle + \delta n \partial_t D_{\Phi}(\tfrac{1}{n} \1; p_t)\right).
\]
&lt;/p&gt;
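&lt;p&gt;The algebra behind this display, as a sketch: with $r_0 = \frac{1}{n} \1$ we have $\delta \1 = \delta n\, r_0$, and therefore&lt;/p&gt;
&lt;p&gt;
\begin{align*}
   \langle p_t + \delta \1, c_t + \xi_t \rangle
   &amp;amp;= \langle p_t, c_t\rangle + \delta n \langle r_0, c_t + \xi_t\rangle \\
   &amp;amp;= \langle p_t, c_t\rangle + \delta n \left(\langle c_t+\xi_t, p_t\rangle + \langle c_t + \xi_t, r_0 - p_t\rangle\right) \\
   &amp;amp;= (1+\delta n) \langle c_t, p_t\rangle + \delta n\, \partial_t D_{\Phi}(r_0; p_t),
\end{align*}
&lt;/p&gt;
&lt;p&gt;where the first and last lines use complementary slackness, and the last line also uses the preceding identity for $\partial_t D_{\Phi}(r_0;p_t)$.&lt;/p&gt;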
&lt;p&gt;Thus if we set $\delta = 1/n$, our movement cost becomes proportional to our service cost, and \eqref{eq:anal}
shows that our algorithm is $O(\log n)$-competitive.&lt;/p&gt;

</description>
        <pubDate>Mon, 09 Apr 2018 00:00:00 +0000</pubDate>
        <link>http://tcsmath.github.io/online/2018/04/09/mts-on-star/</link>
        <guid isPermaLink="true">http://tcsmath.github.io/online/2018/04/09/mts-on-star/</guid>
      </item>
    
      <item>
        <title>Navigating a convex body online</title>
        <description>&lt;p&gt;In the last lecture, we saw some algorithms that, while simple and appealing,
were somewhat unmotivated.  We now try to derive them from general
principles, and in a setting that will allow us to attack other
problems in competitive analysis.
&lt;script type=&quot;math/tex&quot;&gt;\def\K{\mathsf{K}}
\def\R{\mathbb{R}}
\def\seteq{\mathrel{\vcenter{:}}=}
\def\cE{\mathcal{E}}
\def\argmin{\mathrm{argmin}}
\def\llangle{\left\langle}
\def\rrangle{\right\rangle}
\def\1{\mathbf{1}}
\def\e{\varepsilon}&lt;/script&gt;&lt;/p&gt;

&lt;h2 id=&quot;gradient-descent--the-proximal-view&quot;&gt;Gradient descent:  The proximal view&lt;/h2&gt;

&lt;p&gt;Let us first recall the upper bound we derived for the regret in the last lecture:&lt;/p&gt;

&lt;p&gt;
\begin{equation}\label{eq:regret}
   R_T \leq \sum_{t=1}^T \left[ \|p_t-p_{t+1}\|_1 + \left\langle p_{t+1}, \ell_t - \ell_t(x_T^*) \right\rangle\right].
\end{equation}
&lt;/p&gt;

&lt;p&gt;Trying to minimize this expression leads to the question of how we should update our probability distribution $p_t \to p_{t+1}$
to simultaneously be stable (control the first term) and competitive (the second term).&lt;/p&gt;

&lt;p&gt;A very natural algorithm in this setting is gradient descent.
Indeed, suppose that $\ell : \R^n \to \R$ is differentiable, and consider the optimization&lt;/p&gt;
&lt;p&gt;
\[
   \min \left\{  \frac12 \|x-x_0\|^2 + \eta \ell(x) : x \in \R^n \right\},
\]
&lt;/p&gt;
&lt;p&gt;where $\eta &amp;gt; 0$ is a small constant and $\|\cdot\|$ denotes the Euclidean norm.  Then first-order optimality
conditions dictate that the optimizer satisfies
\[
   x^* = x_0 - \eta \nabla \ell(x_0) + O(\eta^2)\,.
\]&lt;/p&gt;
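
&lt;p&gt;A sketch of where this comes from, assuming (for the error term) that $\nabla \ell$ is Lipschitz near $x_0$:
setting the gradient of the objective to zero at the optimizer gives&lt;/p&gt;
&lt;p&gt;
\[
   x^* - x_0 + \eta \nabla \ell(x^*) = 0 \quad \implies \quad x^* = x_0 - \eta \nabla \ell(x^*).
\]
&lt;/p&gt;
&lt;p&gt;In particular $\|x^*-x_0\| = O(\eta)$, so $\nabla \ell(x^*) = \nabla \ell(x_0) + O(\eta)$, and substituting
this back accounts for the $O(\eta^2)$ error.&lt;/p&gt;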

&lt;p&gt;Two questions immediately arise:  Why do we use the Euclidean norm when our reference problem \eqref{eq:regret}
refers to the $\ell_1$ norm, and if $x$ is meant to encode a probability distribution,
how do we maintain this constraint for $x^*$?&lt;/p&gt;

&lt;h2 id=&quot;projected-gradient-descent&quot;&gt;Projected gradient descent&lt;/h2&gt;

&lt;p&gt;Let’s address the feasibility problem first.  Suppose $\K \subseteq \R^n$ is a closed convex set
and $F : \R^n \to \R^n$ is a sufficiently smooth vector field (think of $F = -\nabla \ell$).
How should we move in the direction of $F$ while simultaneously remaining inside $\K$?&lt;/p&gt;

&lt;p&gt;The unconstrained flow along $F$ can be described as
a trajectory $x : [0,\infty) \to \R^n$ given by
\[
   x'(t) = F(x(t))\,.
\]
The most natural way to keep this flow inside $\K$ is to project back into the body whenever we leave.
Define the Euclidean projection&lt;/p&gt;
&lt;p&gt;
\[
   P_{\K}(y) \seteq \argmin \left\{ \|y-z\|^2 : z \in \K \right\},
\]
&lt;/p&gt;
&lt;p&gt;and the result of taking an infinitesimal step in direction $v$ and then projecting:
\[
   \Pi_{\K}(x,v) \seteq \lim_{\e \to 0^+} \frac{P_{\K}(x+\e v) -x}{\e}\,.
\]
Then the projected dynamics looks like
\[
   x'(t) = \Pi_{\K} \left(x(t), F(x(t))\right)\,.
\]
This is an example of a &lt;a href=&quot;https://en.wikipedia.org/wiki/Projected_dynamical_system&quot;&gt;projected dynamical system&lt;/a&gt;.
Having now addressed feasibility,
we are left to consider the role of the Euclidean norm.&lt;/p&gt;
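
&lt;p&gt;As a small illustration (a hypothetical example, not taken from the lecture), consider the half-space
$\K = \{ x \in \R^n : x_1 \geq 0 \}$, for which $P_{\K}(y) = (\max\{y_1,0\}, y_2, \ldots, y_n)$.
At a boundary point $x$ with $x_1 = 0$, taking $\e &amp;gt; 0$ in the definition gives&lt;/p&gt;
&lt;p&gt;
\[
   \Pi_{\K}(x,v) = \left(\max\{v_1, 0\}, v_2, \ldots, v_n\right),
\]
&lt;/p&gt;
&lt;p&gt;so the projection removes only the component of $v$ that points out of the body, while at an interior point
($x_1 &amp;gt; 0$) one has $\Pi_{\K}(x,v) = v$.&lt;/p&gt;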

&lt;h2 id=&quot;a-riemannian-version&quot;&gt;A Riemannian version&lt;/h2&gt;

&lt;p&gt;One can view $\Pi_{\K}(x, \cdot)$ as a function on the tangent space at $x$.
To specify such a projection,
we only need a local Euclidean structure.
An inner product $\langle \cdot,\cdot\rangle_x$ that varies smoothly
over $x \in \K$ is precisely a Riemannian metric.&lt;/p&gt;

&lt;p&gt;Equivalently, we specify at every point $x \in \K$, a smoothly varying positive-definite matrix $M(x)$
so that&lt;/p&gt;
&lt;p&gt;
\begin{align*}
   \langle u,v\rangle_{M,x} &amp;amp;= \langle u, M(x) v\rangle \\
   \|u\|^2_{M,x} &amp;amp;= \langle u,u\rangle_{M,x}.
\end{align*}
&lt;/p&gt;
&lt;p&gt;The associated projection operator is then given by&lt;/p&gt;
&lt;p&gt;
\begin{align*}
   P_{\K}^M(y; x) &amp;amp;\seteq \argmin \left\{ \left\|y-z\right\|_{M,x}^2 : z \in \K \right\} \\
   \Pi_{\K}^M(x,v) &amp;amp;\seteq \lim_{\e \to 0^+} \frac{P_{\K}^M(x+\e v; x)-x}{\e}\,.
\end{align*}
&lt;/p&gt;
&lt;p&gt;This leads to the dynamical system:&lt;/p&gt;
&lt;p&gt;
\begin{align*}
   x'(t) &amp;amp;= \Pi^M_{\K}\left(x(t),F(x(t))\right) \\
   x(0) &amp;amp;= x_0 \in \K\,.
\end{align*}
&lt;/p&gt;

&lt;h2 id=&quot;lyapunov-functions&quot;&gt;Lyapunov functions&lt;/h2&gt;

&lt;p&gt;The problem with stating things at this level of generality is that even when
$F = -\nabla \ell$ is the negative gradient of a convex function $\ell : \R^n \to \R$,
we don’t have a global way of controlling convergence of $\ell(x(t))$ to $\min \{ \ell(x) : x \in \K \}$.
In the Euclidean setting ($M(x) \equiv \mathrm{Id}$), there is a natural &lt;a href=&quot;https://en.wikipedia.org/wiki/Lyapunov_function&quot;&gt;Lyapunov function&lt;/a&gt;:
If $\ell$ is convex and $\ell(x^*) = \min \{ \ell(x) : x \in \K \}$,
then for every $x \in \K$:&lt;/p&gt;
&lt;p&gt;
\[
   \langle - \nabla \ell(x), x^* - x\rangle \geq 0\,.
\]
&lt;/p&gt;
&lt;p&gt;In other words,
gradient descent always makes progress toward $x^*$.&lt;/p&gt;

&lt;p&gt;If $x'(t) = \Pi_{\K}\left(x(t), -\nabla \ell(x(t))\right)$, then
in the language of competitive analysis, the quantity $\frac12 \|x(t)-x^*\|^2$
acts as a potential function (a global measure of progress).&lt;/p&gt;
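
&lt;p&gt;Here is a sketch of the computation for the descent flow. We assume the standard decomposition
$x'(t) = -\nabla \ell(x(t)) - \lambda(t)$, where $\lambda(t)$ is a normal force satisfying
$\langle \lambda(t), y - x(t)\rangle \leq 0$ for all $y \in \K$. Then&lt;/p&gt;
&lt;p&gt;
\[
   \frac{d}{dt} \frac12 \|x(t)-x^*\|^2 = \langle x(t) - x^*, x'(t)\rangle
   = - \langle -\nabla \ell(x(t)), x^* - x(t)\rangle + \langle \lambda(t), x^* - x(t)\rangle \leq 0\,,
\]
&lt;/p&gt;
&lt;p&gt;where the first term is non-positive by the inequality above and the second because $x^* \in \K$.&lt;/p&gt;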

&lt;p&gt;We will consider geometries that come equipped with such a
Lyapunov function.  In a sense that can be formalized in various ways,
these are the &lt;em&gt;Hessian structures on $\R^n$&lt;/em&gt;, i.e., those
arising when $M(x) = \nabla^2 \Phi(x)$ for some strictly convex function $\Phi : \K \to \R$.&lt;/p&gt;

&lt;h2 id=&quot;mirror-descent-dynamics&quot;&gt;Mirror descent dynamics&lt;/h2&gt;

&lt;p&gt;Consider now a compact, convex set $\K \subseteq \R^n$,
a strictly convex function $\Phi : \K \to \R$,
and a continuous time-varying vector field $F : [0,\infty) \times \K \to \R^n$.
We will refer to &lt;strong&gt;continuous-time mirror descent&lt;/strong&gt;
as the dynamics specified by&lt;/p&gt;
&lt;p&gt;
\begin{align*}
   x'(t) &amp;amp;= \Pi_{\K}^{\nabla^2 \Phi}\left(\vphantom{\bigoplus} x(t), F(t, x(t))\right) \\
   x(0)  &amp;amp;= x_0 \in \K.
\end{align*}
&lt;/p&gt;
&lt;p&gt;We will sometimes refer to $\Phi$ as the &lt;em&gt;mirror map.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As one might expect, we can decompose $x'(t)$ into two components:  One flowing in the direction $F(t,x(t))$,
and the other component arising from the normal forces that are keeping $x(t)$ inside $\K$.
We recall that the &lt;strong&gt;normal cone to $\K$ at $x$&lt;/strong&gt; is given by&lt;/p&gt;
&lt;p&gt;
\[
   N_{\K}(x) = \left\{ p \in \R^n : \langle p,y -x \rangle \leq 0 \textrm{ for all } y \in \K \right\}.
\]
&lt;/p&gt;
&lt;p&gt;This is the set of directions that point out of the body $\K$.
The next theorem is proved in 
the paper
&lt;a href=&quot;https://homes.cs.washington.edu/~jrl/papers/pdf/kserver.pdf&quot;&gt;k-server via multiscale entropic regularization&lt;/a&gt;.&lt;/p&gt;

&lt;p class=&quot;theorem&quot; text=&quot;MD&quot;&gt;
&lt;a name=&quot;thm:md&quot;&gt;&lt;/a&gt;
   If $\nabla^2 \Phi(x)^{-1}$ is continuous on $\K$, then for any $x_0 \in \K$, there
   is an absolutely continuous trajectory $x : [0,\infty) \to \K$ satisfying
   \begin{align}
      \nabla^2 \Phi(x(t)) x'(t) &amp;amp;\in F(t,x(t)) - N_{\K}(x(t)), \label{eq:inclusion}\\
      x(0) &amp;amp;= x_0.\nonumber
   \end{align}
   Moreover, if $\nabla^2 \Phi(x)$ is Lipschitz on $\K$ and $F$ is locally Lipschitz, then the solution is unique.
&lt;/p&gt;

&lt;p&gt;Note that \eqref{eq:inclusion} is a &lt;a href=&quot;https://en.wikipedia.org/wiki/Differential_inclusion&quot;&gt;differential inclusion&lt;/a&gt;:
We only require that the derivative lies in the specified set.&lt;/p&gt;

&lt;h2 id=&quot;lagrangian-multipliers&quot;&gt;Lagrangian multipliers&lt;/h2&gt;

&lt;p&gt;If $\K$ is a polyhedron, then one can write&lt;/p&gt;
&lt;p&gt;
\begin{equation}\label{eq:polyhedron}
   \K = \{ x \in \R^n : Ax \leq b \}, \qquad A \in \R^{m \times n}, b \in \R^m\,.
\end{equation}
&lt;/p&gt;
&lt;p&gt;In this case, the normal cone at $x$ is the cone spanned by the normals of the tight constraints at $x$:&lt;/p&gt;
&lt;p&gt;
\begin{equation}\label{eq:cone-poly}
   N_{\K}(x) = \left\{ A^T y : y \geq 0 \textrm{ and } y^T(b-Ax)=0 \right\}.
\end{equation}
&lt;/p&gt;
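
&lt;p&gt;As a small worked example (the specific set is chosen here only for illustration), let $\K = [0,1]^2$ be the unit square, described by the four constraints $-x_1 \leq 0$, $-x_2 \leq 0$, $x_1 \leq 1$, $x_2 \leq 1$. At the boundary point $x = (0,\tfrac12)$ only the first constraint is tight, and at the corner $x=(0,0)$ the first two are tight, so \eqref{eq:cone-poly} gives&lt;/p&gt;
&lt;p&gt;
\[
   N_{\K}\left((0,\tfrac12)\right) = \left\{ (-y_1, 0) : y_1 \geq 0 \right\}, \qquad
   N_{\K}\left((0,0)\right) = \left\{ (-y_1, -y_2) : y_1, y_2 \geq 0 \right\},
\]
&lt;/p&gt;
&lt;p&gt;while $N_{\K}(x) = \{0\}$ at every interior point, where no constraint is tight.&lt;/p&gt;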

&lt;p&gt;Consider now the application of &lt;a href=&quot;#thm:md&quot;&gt;Theorem MD&lt;/a&gt; to a polyhedron and a solution $x : [0,\infty) \to \K$,
$\lambda : [0,\infty) \to \R^n$ such that&lt;/p&gt;
&lt;p&gt;
\begin{equation}\label{eq:traj}
   \nabla^2 \Phi(x(t)) x'(t) = F(t,x(t)) - \lambda(t),
\end{equation}
&lt;/p&gt;
&lt;p&gt;and $\lambda(t) \in N_{\K}(x(t))$ for $t \geq 0$.&lt;/p&gt;

&lt;p&gt;Let us consider the dual variables to the constraints \eqref{eq:polyhedron}:
We can fix a measurable $\hat{\lambda} : [0,\infty) \to \R^m_+$ such that
\[
   A^T \hat{\lambda}(t) = \lambda(t), \quad t \geq 0.
\]
Now \eqref{eq:cone-poly} and $\lambda(t) \in N_{\K}(x(t))$ yield the complementary-slackness
conditions:  For all $i=1,2,\ldots,m$ and $t \geq 0$:
\[
   \hat{\lambda}_i(t) &amp;gt; 0 \implies \langle A_i,x(t)\rangle = b_i,
\]
where $A_i$ is the $i$th row of $A$.&lt;/p&gt;

&lt;h2 id=&quot;the-bregman-divergence-as-a-lyapunov-function&quot;&gt;The Bregman divergence as a Lyapunov function&lt;/h2&gt;

&lt;p&gt;We promised earlier the existence of a functional to control the dynamics,
and this is provided by the &lt;em&gt;Bregman divergence associated to $\Phi$&lt;/em&gt;:&lt;/p&gt;
&lt;p&gt;
\[
   D_{\Phi}(y; x) \seteq \Phi(y) - \Phi(x) - \langle \nabla \Phi(x), y-x\rangle\,.
\]
&lt;/p&gt;
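&lt;p&gt;Two standard examples may help to fix ideas (these are well-known computations, included here only as illustration). For $\Phi(x) = \frac12 \|x\|^2$, we have $\nabla \Phi(x) = x$ and&lt;/p&gt;
&lt;p&gt;
\[
   D_{\Phi}(y; x) = \tfrac12 \|y\|^2 - \tfrac12 \|x\|^2 - \langle x, y-x\rangle = \tfrac12 \|y-x\|^2\,,
\]
&lt;/p&gt;
&lt;p&gt;which is the Euclidean potential from above. For the negative entropy $\Phi(x) = \sum_{i=1}^n x_i \log x_i$ and probability vectors $x,y$ with $x$ having strictly positive coordinates, we have $\nabla \Phi(x)_i = 1 + \log x_i$ and, using $\sum_i y_i = \sum_i x_i = 1$,&lt;/p&gt;
&lt;p&gt;
\[
   D_{\Phi}(y; x) = \sum_{i=1}^n y_i \log y_i - \sum_{i=1}^n x_i \log x_i - \sum_{i=1}^n (1+\log x_i)(y_i - x_i) = \sum_{i=1}^n y_i \log \frac{y_i}{x_i}\,,
\]
&lt;/p&gt;
&lt;p&gt;the relative entropy (Kullback-Leibler divergence) between $y$ and $x$.&lt;/p&gt;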
&lt;p&gt;Let $x(t)$ be a trajectory satisfying \eqref{eq:traj}.  Then for any $y \in \K$:&lt;/p&gt;
&lt;p&gt;
\begin{align}
   \partial_t D_{\Phi}(y; x(t)) &amp;amp;= - \langle \nabla \Phi(x(t)), x'(t)\rangle  + \langle \nabla \Phi(x(t)), x'(t) \rangle
   -\langle \partial_t \nabla \Phi(x(t)), y-x(t)\rangle \nonumber \\
                                &amp;amp;= - \langle \nabla^2 \Phi(x(t)) x'(t), y - x(t) \rangle \nonumber \\
                                &amp;amp;= - \langle F(t,x(t)) - \lambda(t), y-x(t)\rangle \nonumber \\
                                &amp;amp;\leq - \langle F(t,x(t)), y-x(t)\rangle \label{eq:div}\,,
\end{align}
&lt;/p&gt;
&lt;p&gt;where the last inequality used that $y \in \K$ and $\lambda(t) \in N_{\K}(x(t))$.&lt;/p&gt;

&lt;p&gt;If $F(t,x(t)) = - c(t)$ is a &lt;em&gt;cost function&lt;/em&gt;, say, then this inequality
aligns with a goal stated at the beginning of the first lecture:
As long as the algorithm $x(t)$ is suffering more cost than some feasible point $y \in \K$,
we would like to be “learning” about $y$.&lt;/p&gt;

&lt;h2 id=&quot;the-algorithm-from-last-time&quot;&gt;The algorithm from last time&lt;/h2&gt;

&lt;p&gt;In the next lecture, we will use this framework to derive and analyze
algorithms for metrical task systems (MTS) and the $k$-server problem.
For now, let us show that the algorithm and analysis
from last time (for MTS on uniform metrics) fit precisely into our framework.&lt;/p&gt;

&lt;p&gt;Suppose that $\K = \{ x \in \R_+^n : \sum_{i=1}^n x_i = 1 \}$ is the probability simplex and&lt;/p&gt;
&lt;p&gt;
\[
   \Phi(x) = \sum_{i=1}^n (x_i+\delta) \log (x_i+\delta)
\]
&lt;/p&gt;
&lt;p&gt;is the (negative) entropy with some shift by $\delta &amp;gt; 0$.
In the next lecture, we will see why the negative entropy
arises naturally as a mirror map.&lt;/p&gt;

&lt;p&gt;Then $\nabla^2 \Phi(x)$ is a diagonal matrix with $\left(\nabla^2 \Phi(x)\right)_{ii} = \frac{1}{x_i+\delta}$.
Let $F(t,\cdot) = -c(t)$ be a time-varying cost vector with $c(t) \geq 0$.&lt;/p&gt;

&lt;p&gt;Therefore \eqref{eq:traj} gives&lt;/p&gt;
&lt;p&gt;
\begin{equation}\label{eq:shifted}
   x_i'(t) = (x_i(t)+\delta) \left(-c_i(t) + \hat{\mu}(t) - \hat{\lambda}_i(t)\right).
\end{equation}
Here, $\hat{\lambda}_i(t)$ is the Lagrangian multiplier corresponding to the constraint $x_i \geq 0$,
and $\hat{\mu}(t)$ is the multiplier corresponding to $\sum_{i=1}^n x_i = 1$.
&lt;/p&gt;

&lt;p&gt;This is precisely the algorithm described before (as an exercise,
one might try rewriting it to match exactly), and \eqref{eq:div} constitutes half of the analysis.
In the next lecture, we will discuss some general methods for the other half:  Tracking the movement cost.&lt;/p&gt;

</description>
        <pubDate>Fri, 06 Apr 2018 00:00:00 +0000</pubDate>
        <link>http://tcsmath.github.io/online/2018/04/06/navigating/</link>
        <guid isPermaLink="true">http://tcsmath.github.io/online/2018/04/06/navigating/</guid>
      </item>
    
      <item>
        <title>Regret minimization and competitive analysis</title>
        <description>&lt;p&gt;These are notes for the first lecture of a course I am co-teaching with &lt;a href=&quot;https://sbubeck.com/&quot;&gt;Seb Bubeck&lt;/a&gt; on
&lt;a href=&quot;https://homes.cs.washington.edu/~jrl/teaching/cse599I-spring-2018/&quot;&gt;Competitive analysis via convex optimization&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I want to set the groundwork by reviewing the bandits model in online learning
and the standard exponential weights algorithm, and then trying to extend it
to the setting of competitive analysis where the analogous
problem goes by the name of &lt;em&gt;metrical task systems&lt;/em&gt;.
Seeing where things go wrong will give us a plan to follow for much of the course.
Some of the objects and algorithms in this lecture may appear
unmotivated; that will be addressed soon.&lt;/p&gt;

&lt;h2 id=&quot;regret-minimization&quot;&gt;Regret minimization&lt;/h2&gt;

&lt;p&gt;Here we quickly review the basic &lt;em&gt;multi-armed bandit model&lt;/em&gt; in online learning.
For more background see, e.g., this &lt;a href=&quot;http://sbubeck.com/SurveyBCB12.pdf&quot;&gt;survey of Bubeck and Cesa-Bianchi&lt;/a&gt;.
&lt;script type=&quot;math/tex&quot;&gt;\def\seteq{\mathrel{\vcenter{:}}=}
\def\cE{\mathcal{E}}
\def\R{\mathbb{R}}
\def\argmin{\mathrm{argmin}}
\def\llangle{\left\langle}
\def\rrangle{\right\rangle}
\def\1{\mathbf{1}}
\def\e{\varepsilon}&lt;/script&gt;&lt;/p&gt;

&lt;p&gt;In the standard bandits model, we have a set of
experts $\cE = \{1,2,\ldots,N\}$, and
loss functions
$\ell_1,\ell_2,\ldots$ arriving over time,
where $\ell_t : \cE \to [0,1]$.&lt;/p&gt;

&lt;p&gt;For the sake of simplicity, we will work in the full
information model.  At time $t \geq 1$, we have seen $\ell_1,\ldots,\ell_{t-1}$.
We choose some strategy $x_t \in \cE$ and incur cost $\ell_t(x_t)$.
Our goal is to minimize the total cost incurred:
$\sum_{t=1}^T \ell_t(x_t)$.
In the setting of adversarial bandits, we will allow ourselves
to employ a randomized strategy, playing instead a distribution $p_t : \cE \to [0,1]$ at time $t$.&lt;/p&gt;

&lt;p&gt;In the regret minimization framework, we compare our expected loss to
that of the optimal fixed strategy:  The &lt;em&gt;regret incurred up to time $T$&lt;/em&gt; is&lt;/p&gt;
&lt;p&gt;
$$R_T \seteq \sum_{t=1}^T \langle p_t, \ell_t\rangle - \min_{x \in \cE} \sum_{t=1}^T \ell_t(x)\,,$$
&lt;/p&gt;
&lt;p&gt;where for $f,g : \cE \to \R$, we write $\langle f,g\rangle = \sum_{y \in \cE} f(y) g(y)$.&lt;/p&gt;

&lt;p&gt;We will bound the regret in two steps.
Denote $x_T^* \seteq \argmin_{x \in \cE} \sum_{t=1}^T \ell_t(x)$.  Then:&lt;/p&gt;
&lt;p&gt;
\begin{align}
   R_T &amp;amp;= \sum_{t=1}^T \langle p_t - p_{t+1},\ell_t\rangle + 
   \sum_{t=1}^T \langle p_{t+1}, \ell_t-\ell_t(x_T^*)\rangle \nonumber \\
   &amp;amp;\leq \sum_{t=1}^T \|p_t-p_{t+1}\|_1 + 
   \sum_{t=1}^T \langle p_{t+1}, \ell_t-\ell_t(x_T^*)\rangle,\label{eq:pseudo-regret}
\end{align}
&lt;/p&gt;
&lt;p&gt;where the inequality uses that the losses lie in $[0,1]$.&lt;/p&gt;

&lt;h3 id=&quot;continuous-time-dynamics&quot;&gt;Continuous-time dynamics&lt;/h3&gt;

&lt;p&gt;For a number of reasons, the analysis will be much cleaner in continuous time.
We can think about the losses as a trajectory $\left\{ \ell_t : \cE \to [0,1] \mid t \geq 0 \right\}$,
where the instantaneous loss incurred is $\ell_t\,dt$.
As long as we are bounding the expression \eqref{eq:pseudo-regret},
our continuous-time model will be &lt;em&gt;more general&lt;/em&gt;
than the discrete-time model.  The latter can be recovered by
considering a trajectory $\{ \ell_t : t \geq 0 \}$
that is piecewise-constant.
If we were bounding the regret itself, the continuous-time model would have a time advantage,
but once we shift time by one as in \eqref{eq:pseudo-regret},
this advantage disappears.&lt;/p&gt;

&lt;p&gt;Now we can analogously bound the regret:&lt;/p&gt;
&lt;p&gt;
\begin{equation}\label{eq:cont-regret}
   R_T \leq \int_0^T \|\partial_t p_t\|_1 \,dt + \int_0^T \langle p_t, \ell_t-\ell_t(x_T^*)\rangle\,dt\,,
\end{equation}
&lt;/p&gt;
&lt;p&gt;where $x_T^* \seteq \argmin_{x \in \cE} \int_0^T \ell_t(x)\,dt$.&lt;/p&gt;

&lt;p&gt;We will start with $p_0(x) = 1/N$ for every $x \in \cE$, and
employ the following exponential-weights strategy for updating $p_t$:
\begin{equation}\label{eq:dynamics}
\partial_t \log p_t(x) = \eta\left( - \ell_t(x) + \langle p_t, \ell_t\rangle\right),
\end{equation}
where $\eta &amp;gt; 0$ is a parameter called the &lt;em&gt;learning rate&lt;/em&gt; that we will choose soon.
This algorithm will be motivated more in the next lecture, but for now one can note that
we are simply doing continuous-time gradient descent on the vector $\log p_t(x)$:  We are moving
in the direction $- \eta \ell_t dt$.  The additional additive term in \eqref{eq:dynamics}
is there to maintain the constraint that $p_t$ is a probability distribution.&lt;/p&gt;

&lt;p&gt;One can see this clearly by rewriting \eqref{eq:dynamics} as:
\begin{equation}\label{eq:exp-form}
   \partial_t p_t(x) = p_t(x) \, \eta \left(- \ell_t(x) + \langle p_t,\ell_t\rangle\right),
\end{equation}
so that $\sum_{x \in \cE} \partial_t p_t(x) = 0$.&lt;/p&gt;

&lt;h3 id=&quot;the-lyapunov-functional-aka-the-potential&quot;&gt;The Lyapunov functional, aka the potential&lt;/h3&gt;

&lt;p&gt;In order to bound the regret $R_T$, we will employ the philosophy underlying
the entire course:  If we are incurring more cost than $x_T^*$, we would like to be &lt;em&gt;learning&lt;/em&gt;
about $x_T^*$ in a suitable sense.&lt;/p&gt;

&lt;p&gt;Define:
\[
   D(x; p) \seteq - \log p(x)\,.
\]
Note that for any $x \in \cE$, we have $D(x; p_0) = \log N$, and $D(x; p_t) \geq 0$ for all $t \geq 0$.&lt;/p&gt;

&lt;p&gt;For any $x \in \cE$:
\begin{equation}\label{eq:lya}
   \partial_t D(x;p_t) = - \partial_t \log p_t(x) = \eta \left(\ell_t(x) - \langle p_t,\ell_t\rangle \right).
\end{equation}
If we think of $D(x;p_t)$ as the “distance” from $p_t$ to $x$, then our distance to $x$
is decreasing at a rate proportional to the advantage $x$ has over our strategy $p_t$.
Integrating \eqref{eq:lya} over $[0,T]$ gives
\[
   D(x; p_0) - D(x; p_T) = \eta \int_0^T \langle p_t, \ell_t-\ell_t(x)\rangle\,dt\,.
\]&lt;/p&gt;

&lt;p&gt;Applying this with $x=x_T^*$ and recalling \eqref{eq:cont-regret}, we have&lt;/p&gt;
&lt;p&gt;
\begin{align*}
   R_T  &amp;amp;\leq \int_0^T \|\partial_t p_t\|_1\,dt + \frac{D(x_T^*;p_0)-D(x_T^*;p_T)}{\eta} \\
        &amp;amp;\leq \int_0^T \|\partial_t p_t\|_1\,dt + \frac{\log N}{\eta} \\
        &amp;amp;\leq \eta T + \frac{\log N}{\eta}\,,
\end{align*}
&lt;/p&gt;
&lt;p&gt;where the last inequality uses \eqref{eq:exp-form} and the fact that the losses are in $[0,1]$
to write $\|\partial_t p_t\|_1 \leq \eta \|p_t\|_1 = \eta$.&lt;/p&gt;

&lt;p&gt;Setting $\eta = \sqrt{\frac{\log N}{T}}$ yields the standard regret bound
\[
   R_T \leq 2 \sqrt{T \log N}\,.
\]
If one does not know the final time $T$, it is not difficult to see that one can choose
a time-dependent learning rate $\eta=\eta(t)=\sqrt{\frac{\log N}{t}}$ to obtain a similar result.&lt;/p&gt;
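
&lt;p&gt;For readers who want to experiment, here is a minimal Python sketch of the discrete-time analog of the exponential-weights strategy, i.e., the multiplicative update $p_{t+1}(x) \propto p_t(x) e^{-\eta \ell_t(x)}$. The function name &lt;code&gt;hedge_regret&lt;/code&gt;, the random loss sequence, and the parameters $N=10$, $T=2000$ are illustrative assumptions, not part of the lecture. The code does not literally run the continuous-time dynamics analyzed above, but the standard discrete-time analysis is known to give a bound of at most $2\sqrt{T \log N}$ for this learning rate as well.&lt;/p&gt;

```python
import math
import random


def hedge_regret(losses, eta):
    """Run discrete-time exponential weights on a list of loss vectors.

    Each loss vector has entries in [0, 1].  Returns the regret against
    the best fixed expert in hindsight.
    """
    n = len(losses[0])
    weights = [1.0] * n          # unnormalized weights; p_0 is uniform
    totals = [0.0] * n           # cumulative loss of each expert
    expected_loss = 0.0          # cumulative expected loss of the algorithm
    for loss in losses:
        z = sum(weights)
        p = [w / z for w in weights]
        expected_loss += sum(pi * li for pi, li in zip(p, loss))
        for i, li in enumerate(loss):
            totals[i] += li
            weights[i] *= math.exp(-eta * li)   # multiplicative update
    return expected_loss - min(totals)


# Hypothetical experiment: i.i.d. uniform losses (an assumption for illustration).
random.seed(0)
N, T = 10, 2000
losses = [[random.random() for _ in range(N)] for _ in range(T)]
eta = math.sqrt(math.log(N) / T)
regret = hedge_regret(losses, eta)
bound = 2 * math.sqrt(T * math.log(N))
print(regret, bound)
```

&lt;p&gt;Random losses are a friendly case, so the observed regret is typically far below the bound; the bound itself holds for every loss sequence with values in $[0,1]$.&lt;/p&gt;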

&lt;h2 id=&quot;competitive-analysis&quot;&gt;Competitive analysis&lt;/h2&gt;

&lt;p&gt;The competitive analysis analog of the bandits framework goes by the name of
&lt;em&gt;metrical task systems (MTS)&lt;/em&gt;.  This problem was introduced in 1992 by
&lt;a href=&quot;http://www.cs.huji.ac.il/~nati/PAPERS/bls_online.pdf&quot;&gt;Borodin, Linial, and Saks&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This setting has three major differences:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;At each time $t \geq 1$, we receive a &lt;em&gt;cost function&lt;/em&gt; $c_t : \cE \to [0,\infty)$, and we choose
an action $x_t \in \cE$ &lt;em&gt;after&lt;/em&gt; receiving $c_t$.&lt;/li&gt;
  &lt;li&gt;There is a metric $d$ on $\cE$ that makes $(\cE,d)$ into a metric space,
and in addition to the &lt;em&gt;service cost&lt;/em&gt; $c_t(x_t)$, we pay a &lt;em&gt;switching cost&lt;/em&gt; $d(x_{t-1},x_t)$
for playing $x_t$.&lt;/li&gt;
  &lt;li&gt;We compare ourselves against an offline optimum that has the &lt;em&gt;same ability to switch strategies&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Say that an online algorithm $\llangle x_1, x_2, \ldots,x_T\rrangle$ is $\alpha$-competitive
if, for every $x_0 \in \cE$ and every cost sequence $c_1,c_2,\ldots,c_T$, it holds that&lt;/p&gt;
&lt;p&gt;
$$
   \sum_{t=1}^T c_t(x_t) + d(x_{t-1},x_t) \leq \alpha\left(\sum_{t=1}^T c_t(x^*_t) + d(x_{t-1}^*, x_t^*)\right) + O(1)\,,
$$
&lt;/p&gt;
&lt;p&gt;where $\llangle x_0^*, x_1^*, x_2^*, \ldots, x_T^*\rrangle$ is the optimal &lt;em&gt;offline&lt;/em&gt; strategy
with $x_0^*=x_0$, i.e., the optimal strategy in hindsight.  The additive $O(1)$ term is a constant
that should be independent of the request sequence.&lt;/p&gt;

&lt;p&gt;It is conjectured that the competitive ratio is $O(\log N)$ for every $N$-point metric space.
In the coming weeks, we will see an $O((\log N)^2)$-competitive algorithm
based on upcoming joint work with Bubeck, Cohen, and Y. T. Lee.
This improves slightly on the $O((\log N)^2 \log \log N)$ bound of &lt;a href=&quot;https://arxiv.org/abs/cs/0406034&quot;&gt;Fiat and Mendel&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;attempting-exponential-weights&quot;&gt;Attempting exponential weights&lt;/h3&gt;

&lt;p&gt;The simplest setting for MTS is when the metric $(\cE,d)$ is uniform, i.e., $d(x,y)=\1_{\{x\neq y\}}$ for $x,y \in \cE$.&lt;/p&gt;

&lt;p&gt;Just as in the bandits setting, we can consider a continuous-time trajectory
$\left\{ c_t : \cE \to [0,\infty) \mid t \geq 0\right\}$ of cost
functions.  And we might try to employ a similar strategy:
\begin{equation}\label{eq:cost-dynamics}
   \partial_t \log p_t(x) = - c_t(x) + \langle p_t, c_t\rangle.
\end{equation}&lt;/p&gt;

&lt;p&gt;But now we run into a major hurdle:  Such an algorithm cannot be $\alpha$-competitive for any $\alpha &amp;lt; \infty$.&lt;/p&gt;

&lt;p&gt;
Suppose we arrange that for some $x_0 \in \cE$ and $t_0 &amp;gt; 0$ and small $\epsilon &amp;gt; 0$,
it holds that $p_{t_0}(x_0) = 1-\epsilon$.
We can do this by having an adversary play the cost function $c(x)=\1_{\cE \setminus \{x_0\}}(x)$
for a long enough period of time
so that any competitive algorithm must put almost all the probability mass on $x_0$.
&lt;/p&gt;

&lt;p&gt;
Now the adversary changes the cost function to $c(x)=\1_{x_0}(x)$.
The dynamics specified by \eqref{eq:cost-dynamics} will move much too slowly!
Indeed, since $\langle p_t, c\rangle = p_t(x_0)$:
\[
   \partial_t \log p_t(x_0) = -1+p_t(x_0) = - (1-p_t(x_0)),
\]
i.e.,
\[
   \partial_t \left(1-p_t(x_0)\right) = p_t(x_0) \left(1-p_t(x_0)\right) \leq 1-p_t(x_0),
\]
so $1-p_t(x_0) \leq \epsilon\, e^{t-t_0}$, and thus it will take roughly $\log(1/\epsilon)$ time before $p_t(x_0) &amp;lt; 1/2$.
&lt;/p&gt;

&lt;p&gt;Thus our algorithm will incur cost $\asymp \log(1/\epsilon)$ while the optimal offline algorithm incurs cost $O(1)$.
In the bandits setting, this was fine because every fixed strategy incurs cost $\asymp \log(1/\epsilon)$.
But in the setting of competitive analysis, our algorithm needs to be a lot more nimble
to keep up with an offline algorithm that can switch strategies.&lt;/p&gt;

&lt;h3 id=&quot;the-exploration-shift&quot;&gt;The exploration shift&lt;/h3&gt;

&lt;p&gt;To fix this, we will design an algorithm that devotes a constant fraction of the service cost
it is currently incurring to exploring the strategy space.
Essentially, this can be achieved by pretending that $p_t(x) \geq 1/(2N)$ for every $x \in \cE$.
(Recall that $N = |\cE|$.)
This transformation (mixing with the uniform distribution)
is not uncommon in the bandit literature.
In the setting of metrical task systems, I saw it for the first time
in this paper of &lt;a href=&quot;/assets/papers/pot.pdf&quot;&gt;Bansal, Buchbinder, and Naor&lt;/a&gt; on the weighted paging problem.&lt;/p&gt;

&lt;p&gt;Define $p_0(x)=\1_{x_0}(x)$ where $x_0 \in \cE$ is the starting point.
Let $\delta &amp;gt; 0$ be a number we will choose soon,
and consider the dynamics:
\begin{equation}\label{eq:mts-dynamics}
   \partial_t \log (p_t(x)+\delta) = - \hat{c}_t(x) + \llangle \frac{p_t+\delta}{1+\delta N}, \hat{c}_t\rrangle\,,
\end{equation}
which can be written equivalently as
\begin{equation}\label{eq:mts-move}
   \partial_t p_t(x) = (p_t(x)+\delta) \left(- \hat{c}_t(x) + 
   \llangle \frac{p_t+\delta}{1+\delta N}, \hat{c}_t\rrangle\right).
\end{equation}
A natural choice is to take $\hat{c}_t(x)=c_t(x)$, but this presents a problem:
Unlike \eqref{eq:cost-dynamics}, these dynamics no longer enforce
that $p_t(x) \geq 0$.&lt;/p&gt;
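
&lt;p&gt;They do, however, preserve total mass for any choice of $\hat{c}_t$. Here is the short check (not spelled out in the lecture): summing \eqref{eq:mts-move} over $x \in \cE$ and using $\sum_{x \in \cE} (p_t(x)+\delta) = 1+\delta N$, which holds as long as $p_t$ has total mass one,
\[
   \sum_{x \in \cE} \partial_t p_t(x) = - \langle p_t+\delta, \hat{c}_t\rangle + (1+\delta N) \llangle \frac{p_t+\delta}{1+\delta N}, \hat{c}_t\rrangle = 0\,.
\]
This is the reason for the normalization by $1+\delta N$ in the second term.&lt;/p&gt;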

&lt;h4 id=&quot;lagrangian-multipliers&quot;&gt;Lagrangian multipliers&lt;/h4&gt;

&lt;p&gt;In this relatively simple setting,
we can consider reduced costs
$\hat{c}_t(x) \seteq c_t(x)-\lambda_t(x)$ satisfying:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;$\lambda_t(x) \geq 0$ for all $t \geq 0$,&lt;/li&gt;
  &lt;li&gt;$p_t(x) = 0 \implies \partial_t p_t(x) \geq 0$,&lt;/li&gt;
  &lt;li&gt;$\lambda_t(x) &amp;gt; 0 \implies p_t(x)=0$.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With these constraints in place,
the corresponding trajectory $\{p_t : t \geq 0\}$
will always be a probability measure,
and we can charge ourselves $\hat{c}_t(x)$ without
worrying about cheating, since $p_t(x) \hat{c}_t(x) = p_t(x) c_t(x)$
will always hold.
The existence of functions $\{ \lambda_t : \cE \to [0,\infty) \mid t \geq 0\}$
is a slightly subtle issue that will be addressed formally
in the coming lectures.  These are Lagrangian multipliers corresponding
to the constraints $p_t(x) \geq 0$ for $x \in \cE$.&lt;/p&gt;

&lt;p&gt;As we will see in future lectures, the Lagrangian multipliers
will not adversely affect the potential analysis,
but they could cause our algorithm to incur movement cost.
In the present setting, things are going in the beneficial direction:
The multipliers correspond to &lt;em&gt;reduced&lt;/em&gt; costs, and therefore
they actually slow down our movement.&lt;/p&gt;

&lt;h4 id=&quot;the-potential-analysis&quot;&gt;The potential analysis&lt;/h4&gt;

&lt;p&gt;
Define now the potential
\[
   D_{\delta}(x; p) \seteq - \log (p(x)+\delta).
\]
We are interested in $\partial_t D_{\delta}(x_t^*; p_t)$.
Let's first consider the derivative with respect to $x_t^*$.
For any $x,y \in \cE$:
\[
   \left|D_{\delta}(x; p) - D_{\delta}(y ;p)\right| \leq \log(1/\delta)\,.
\]
Thus for any $T \geq 0$:
\begin{equation}\label{eq:first-mts}
   D_{\delta}(x_{T}^*; p_T) - D_{\delta}(x_0^*; p_0)
   \leq \log(1/\delta) \sum_{t=1}^{\lfloor T\rfloor} d(x_t^*, x_{t-1}^*) + \int_0^T \partial_t D_{\delta}\left(x^*_{\lfloor t\rfloor}; p_t\right)\,dt\,.
\end{equation}
To analyze the latter term, use \eqref{eq:mts-dynamics}
to observe that for any $x \in \cE$,
\[
   \partial_t D_{\delta}(x;p_t) = \hat{c}_t(x) - \llangle \frac{p_t+\delta}{1+\delta N}, \hat{c}_t\rrangle,
\]
hence
\[
   \int_0^T \partial_t D_{\delta}\left(x_{\lfloor t\rfloor}^*; p_t\right)\,dt =
   \int_0^T \hat{c}_t(x_t^*)\,dt - \int_0^T \llangle \frac{p_t+\delta}{1+\delta N}, \hat{c}_t\rrangle\,dt\,.
\]
Plugging this into \eqref{eq:first-mts} and rearranging gives
\begin{align}\nonumber
   (1+&amp;amp; \delta N)^{-1} \int_0^T \langle p_t + \delta, \hat{c}_t\rangle\,dt \\
&amp;amp;\leq 
   \left[ D_{\delta}(x_{0}^*; p_0) - D_{\delta}(x_T^*; p_T) \right]
    + \log(1/\delta) \sum_{t=1}^{\lfloor T\rfloor} d(x_t^*, x_{t-1}^*) + 
   \int_0^T \hat{c}_t(x_t^*)\,dt  \nonumber \\
   &amp;amp;\leq
    \log(1/\delta) \sum_{t=1}^{\lfloor T\rfloor} d(x_t^*, x_{t-1}^*) + 
    \int_0^T c_t(x_t^*)\,dt\,,\label{eq:second-mts}
\end{align}
where we have used the fact that the term in brackets is nonpositive,
and $c_t \geq \hat{c}_t$ pointwise.
This looks great:  We have
bounded the service cost of our algorithm by the movement and service
costs of the optimal algorithm.
We are left to consider the movement cost of the algorithm.
&lt;/p&gt;

&lt;h3 id=&quot;the-movement-cost&quot;&gt;The movement cost&lt;/h3&gt;

&lt;p&gt;
Here we use a trick from online algorithms:
Instead of bounding the total movement cost
$\int_0^T \|\partial_t p_t\|_1\,dt,$
we will bound only the &lt;em&gt;incoming movement&lt;/em&gt;
$\int_0^T \|\left(\partial_t p_t\right)_+\|_1\,dt$.
&lt;/p&gt;

&lt;p&gt;Since “what goes in must come out (unless it stays there forever),” we have:&lt;/p&gt;
&lt;p&gt;
$$
   \int_0^T \|\partial_t p_t\|_1\,dt \leq
   2 \int_0^T \|\left(\partial_t p_t\right)_+\|_1\,dt + 1\,.
$$
And now \eqref{eq:mts-move} gives
$$
   \left\|\left(\partial_t p_t\right)_+\right\|_1 \leq \langle p_t+\delta,\hat{c}_t\rangle\,.
$$
&lt;/p&gt;

&lt;p&gt;
Combining this with \eqref{eq:second-mts} and using the fact that $\langle p_t, c_t\rangle = \langle p_t, \hat{c}_t\rangle$
yields
\begin{align*}
   \int_0^T \left(\langle p_t,c_t\rangle + \|\partial_t p_t\|_1\right)\,dt &amp;amp;\leq
   1 + 3 \int_0^T \langle p_t + \delta ,\hat{c}_t\rangle\,dt \\
   &amp;amp; \leq 1 + 3(1+\delta N) \left[\log(1/\delta) \sum_{t=1}^{\lfloor T\rfloor} d(x_t^*, x_{t-1}^*) + 
\int_0^T c_t(x_t^*)\,dt\right].
\end{align*}
Thus setting $\delta = 1/N$ yields an $O(\log N)$-competitive algorithm for MTS on a uniform metric space.
&lt;/p&gt;
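
&lt;p&gt;To spell out the constants (this is only arithmetic on the display above): with $\delta = 1/N$ we have $1+\delta N = 2$ and $\log(1/\delta) = \log N$, so
\[
   \int_0^T \left(\langle p_t,c_t\rangle + \|\partial_t p_t\|_1\right)\,dt \leq
   1 + 6 \left[\log N \sum_{t=1}^{\lfloor T\rfloor} d(x_t^*, x_{t-1}^*) + \int_0^T c_t(x_t^*)\,dt\right].
\]
For $N \geq 3$ we have $\log N \geq 1$, so the right-hand side is at most $1 + 6 \log N$ times the total (movement plus service) cost of the offline optimum,
with the additive $1$ playing the role of the $O(1)$ term in the definition of $\alpha$-competitiveness.&lt;/p&gt;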

&lt;h4 id=&quot;a-comment-about-absolute-continuity&quot;&gt;A comment about absolute continuity&lt;/h4&gt;

&lt;p&gt;Note that in \eqref{eq:first-mts}, we have used the fundamental theorem of calculus
to integrate a derivative.
In order for this to be valid, it must be that $\log (p_t(x)+\delta)$ is absolutely continuous
as a function of $t$.
When we argue formally about the existence of the Lagrangian multipliers $\lambda_t$,
we will need to ensure that the resulting trajectory is absolutely continuous.&lt;/p&gt;

</description>
        <pubDate>Sun, 01 Apr 2018 00:00:00 +0000</pubDate>
        <link>http://tcsmath.github.io/online/2018/04/01/competitive-analysis/</link>
        <guid isPermaLink="true">http://tcsmath.github.io/online/2018/04/01/competitive-analysis/</guid>
      </item>
    
      <item>
        <title>tcsmath relaunch</title>
        <description>&lt;p&gt;I am relaunching tcsmath on a new platform in anticipation
of resuming regular posting.  The old pages
should still be available here &lt;a href=&quot;https://tcsmath.wordpress.com/&quot;&gt;tcsmath&lt;/a&gt;,
and some of them will be slowly migrated to the new format.
There may be some DNS/https hiccups.  I hope those are resolved soon.&lt;/p&gt;

</description>
        <pubDate>Sat, 31 Mar 2018 00:00:00 +0000</pubDate>
        <link>http://tcsmath.github.io/admin/2018/03/31/relaunch/</link>
        <guid isPermaLink="true">http://tcsmath.github.io/admin/2018/03/31/relaunch/</guid>
      </item>
    
      <item>
        <title>An entropy optimal drift</title>
        <description>## Construction of F&amp;ouml;llmer's drift

In a previous post, we saw how an entropy-optimal drift process
could be used to prove the Brascamp-Lieb inequalities.
Our main tool was a result of F&amp;ouml;llmer that we now recall and justify.
Afterward, we will use it to prove the Gaussian log-Sobolev inequality.

Consider $f : \mathbb R^n \to \mathbb R_+$ with $\int f \,d\gamma_n = 1$,
where $\gamma_n$ is the standard Gaussian measure on $\mathbb R^n$.
Let $\\{B_t\\}$ denote an $n$-dimensional Brownian motion with $B_0=0$.
We consider all processes of the form
\begin{equation}\label{eq:drift}
W_t = B_t + \int_0^t v_s\,ds\,,
\end{equation}
where $\\{v_s\\}$ is a progressively measurable drift
and such that $W_1$ has law $f\,d\gamma_n$.

&lt;div class=&quot;theorem&quot; text=&quot;Energy-entropy&quot;&gt;
It holds that
\[
D(f d\gamma_n \,\|\, d\gamma_n) = \min D(W_{[0,1]} \,\|\, B_{[0,1]}) = \min \frac12 \int_0^1 \mathbb{E}\,\|v_t\|^2\,dt\,,
\]
where the minima are over all processes of the form \eqref{eq:drift}.
&lt;/div&gt;

&lt;div class=&quot;proof&quot;&gt;
In the preceding post (Lemma 2), we have already seen that
for any drift of the form \eqref{eq:drift}, it holds that
\[
D(f d\gamma_n \,\|\,d\gamma_n) \leq D(W_{[0,1]} \,\|\, B_{[0,1]}) \leq \frac12 \int_0^1 \mathbb{E}\,\|v_t\|^2\,dt\,,
\]
thus we need only exhibit a drift $\\{v_t\\}$ achieving equality.

&lt;p&gt;
We define
\[
v_t = \nabla \log P_{1-t} f(W_t) = \frac{\nabla P_{1-t} f(W_t)}{P_{1-t} f(W_t)}\,,
\]
where $\\{P_t\\}$ is the Brownian semigroup defined by
\[
P_t f(x) = \mathbb{E}[f(x + B_t)]\,.
\]
&lt;/p&gt;
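
&lt;p&gt;
For a concrete instance (an illustrative example added here; the choice of $f$ is an assumption made only for this computation), fix $a \in \mathbb R^n$ and take $f(x) = e^{\langle a,x\rangle - \|a\|^2/2}$, so that $f\,d\gamma_n$ is the Gaussian with mean $a$ and identity covariance. Since $\mathbb{E}\,e^{\langle a, B_s\rangle} = e^{s\|a\|^2/2}$, we get
\[
P_{1-t} f(x) = e^{\langle a,x\rangle - t \|a\|^2/2}\,, \qquad v_t = \nabla \log P_{1-t} f(W_t) = a\,.
\]
Thus $W_t = B_t + ta$, so $W_1$ indeed has law $f\,d\gamma_n$, and $\frac12 \int_0^1 \mathbb{E}\,\|v_t\|^2\,dt = \frac12 \|a\|^2 = D(f d\gamma_n \,\|\, d\gamma_n)$.
&lt;/p&gt;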

&lt;p&gt;
Note that $v_t$ is almost surely constant conditioned on the past, hence the chain rule yields
\begin{equation}\label{eq:chain}
D(W_{[0,1]} \,\|\, B_{[0,1]}) =
\frac12 \int_0^1 \mathbb{E}\,\|v_t\|^2\,dt\,.
\end{equation}
(See line (7) of Lemma 2 in the previous post.  Note that $h(v_t)=0$ since $v_t$ is deterministic given the past.)
We are left to show that $W_1$ has law $f \,d\gamma_n$ and $D(W_{[0,1]} \,\|\, B_{[0,1]}) = D(f d\gamma_n \,\|\,d\gamma_n)$.
&lt;/p&gt;

&lt;p&gt;
We will prove the first fact using Girsanov's theorem to argue about
the change of measure between $\{W_t\}$ and $\{B_t\}$.
As in the previous post, we will argue somewhat informally
using the heuristic that the law of $dB_t$ is a Gaussian
random variable in $\mathbb R^n$ with covariance $dt \cdot I$.
It&amp;ocirc;'s formula states that this heuristic is justified (see our use
of the formula below).
&lt;/p&gt;

The following lemma says that, given any sample path $\{W_s : s \in [0,t]\}$
of our process up to time $t$, the likelihood that Brownian motion (without drift)
would have
&quot;done the same thing&quot; is $M_t$ times the likelihood under our process.
&lt;/div&gt;

&lt;div class=&quot;remark&quot;&gt;
I chose to present various steps in the next proof at varying levels of formality.
The arguments have the same structure as corresponding formal proofs,
but I thought (perhaps na&amp;iuml;vely) that this would be instructive.
&lt;/div&gt;

&lt;div class=&quot;lemma&quot; text=&quot;Heuristic Girsanov&quot;&gt;
Let $\mu_t$ denote the law of $\\{W_s : s \in [0,t]\\}$.
If we define
\[
M_t = \exp\left(-\int_0^t \langle v_s,dB_s\rangle - \frac12 \int_0^t \|v_s\|^2\,ds\right)\,,
\]
then under the measure $\nu_t$ given by
\[
d\nu_t = M_t \,d\mu_t\,,
\]
the process $\\{W_s : s \in [0,t]\\}$ has the same law as $\\{B_s : s \in [0,t]\\}$.
&lt;/div&gt;

&lt;div class=&quot;proof&quot;&gt;
We argue by analogy with the discrete proof.
First, let us define the infinitesimal &quot;transition kernel&quot; of Brownian motion
using our heuristic that $dB_t$ has covariance $dt \cdot I$:
\[
p(x,y) = \frac{e^{-\|x-y\|^2/2dt}}{(2\pi dt)^{n/2}}\,.
\]
&lt;p&gt;
We can also compute the (time-inhomogeneous) transition kernel $q_t$ of $\\{W_t\\}$:
\[
q_t(x,y) =  \frac{e^{-\|v_t dt + x - y\|^2/2dt}}{(2\pi dt)^{n/2}} = p(x,y) e^{-\frac12 \|v_t\|^2 dt} e^{-\langle v_t, x-y\rangle}\,.
\]
Here we are using that $dW_t = dB_t + v_t\,dt$ and $v_t$ is deterministic conditioned on the past, thus
the law of $dW_t$ is a normal with mean $v_t\,dt$ and covariance $dt \cdot I$.
&lt;/p&gt;
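
&lt;p&gt;
For completeness, the second equality in the formula for $q_t$ is just the expansion of the square in the exponent (a routine step spelled out here):
\[
\frac{\|v_t\,dt + x - y\|^2}{2\,dt} = \frac{\|x-y\|^2}{2\,dt} + \langle v_t, x-y\rangle + \frac12 \|v_t\|^2\,dt\,.
\]
&lt;/p&gt;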

&lt;p&gt;
To avoid confusion of derivatives, let's use $\alpha_t$ for the density of $\mu_t$ and $\beta_t$ for the density of
Brownian motion (recall that these are densities on paths).
Now let us relate the density $\alpha_{t+dt}$ to the density $\alpha_{t}$.
We use here the notations $\\{\hat W_t, \hat v_t, \hat B_t\\}$ to denote
a (non-random) sample path of $\\{W_t\\}$:
\begin{align*}
\alpha_{t+dt}(\hat W_{[0,t+dt]}) &amp;= \alpha_t(\hat W_{[0,t]})  q_t(\hat W_t, \hat W_{t+dt}) \\
&amp;=  \alpha_t(\hat W_{[0,t]}) p(\hat W_t, \hat W_{t+dt}) e^{-\frac12 \|\hat v_t\|^2\,dt-\langle \hat v_t,\hat W_t-\hat W_{t+dt}\rangle} \\
&amp;=
\alpha_t(\hat W_{[0,t]})  p(\hat W_t, \hat W_{t+dt}) e^{-\frac12 \|\hat v_t\|^2\,dt+\langle \hat v_t,d \hat W_t\rangle} \\
&amp;=
\alpha_t(\hat W_{[0,t]})  p(\hat W_t, \hat W_{t+dt}) e^{\frac12 \|\hat v_t\|^2\,dt+\langle \hat v_t, d \hat B_t\rangle}\,,
\end{align*}
where the last line uses $d\hat W_t = d\hat B_t + \hat v_t\,dt$.
&lt;/p&gt;

Now by &quot;heuristic&quot; induction, we can assume $\alpha_t(\hat W_{[0,t]})=\frac{1}{M_t} \beta_t(\hat W_{[0,t]})$, yielding
\begin{align*}
\alpha_{t+dt}(\hat W_{[0,t+dt]}) &amp;= \frac{1}{M_t} \beta_t(\hat W_{[0,t]})  p(\hat W_t, \hat W_{t+dt}) e^{\frac12 \|\hat v_t\|^2\,dt+\langle \hat v_t, d \hat B_t\rangle} \\
&amp;=
\frac{1}{M_{t+dt}}  \beta_t(\hat W_{[0,t]}) p(\hat W_t, \hat W_{t+dt}) \\
&amp;=
\frac{1}{M_{t+dt}}  \beta_{t+dt}(\hat W_{[0,t+dt]})\,.
\end{align*}
In the last line, we used the fact that $p$ is the infinitesimal transition kernel for Brownian motion.
&lt;/div&gt;

## The Gaussian log-Sobolev inequality

Consider again a measurable $f : \mathbb R^n \to \mathbb R_+$ with $\int f\,d\gamma_n=1$.
Let us define $\mathrm{Ent}_{\gamma_n}(f) = D(f\,d\gamma_n \,\\|\,d\gamma_n)$.
Then the classical log-Sobolev inequality in Gaussian space asserts that

\begin{equation}\label{eq:logsob}
\mathrm{Ent}_{\gamma_n}(f) \leq \frac12 \int \frac{\|\nabla f\|^2}{f}\,d\gamma_n\\,.
\end{equation}

First, we discuss the correct way to interpret this.
Define the Ornstein-Uhlenbeck semi-group $\\{U_t\\}$ by its action
\\[
U_t f(x) = \mathbb{E}[f(e^{-t} x + \sqrt{1-e^{-2t}} B_1)]\\,.
\\]
This is the natural stationary diffusion process on Gaussian space.  For every measurable $f$, we have
\\[
U_t f \to \int f d\gamma_n \quad \textrm{ as $t \to \infty$}\,,
\\]
or equivalently
\\[
\mathrm{Ent}_{\gamma_n}(U_t f) \to 0 \quad \textrm{ as $t \to \infty$}\,.
\\]
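
To get a feel for the semigroup, here is a minimal Monte Carlo sketch in one dimension. It is an illustration under stated assumptions, not part of the argument: the function names `ou_semigroup` and `square`, the test function $f(x)=x^2$, the starting point $x=2$, and the sample size are all choices made here. For this $f$ a direct computation gives $U_t f(x) = e^{-2t} x^2 + 1 - e^{-2t}$, which tends to $\int f\,d\gamma_1 = 1$.

```python
import math
import random


def ou_semigroup(f, x, t, samples=100000, seed=0):
    """Monte Carlo estimate of U_t f(x) = E[f(e^{-t} x + sqrt(1 - e^{-2t}) Z)],
    where Z is a standard Gaussian (one-dimensional case)."""
    rng = random.Random(seed)
    a = math.exp(-t)
    b = math.sqrt(1.0 - math.exp(-2.0 * t))
    total = 0.0
    for _ in range(samples):
        total += f(a * x + b * rng.gauss(0.0, 1.0))
    return total / samples


def square(y):
    return y * y


x = 2.0
for t in (0.0, 0.5, 2.0, 8.0):
    estimate = ou_semigroup(square, x, t)
    # Closed form for f(x) = x^2, used only to compare against the estimate.
    exact = math.exp(-2.0 * t) * x * x + 1.0 - math.exp(-2.0 * t)
    print(t, round(estimate, 3), round(exact, 3))
```

The estimates should track the closed form up to sampling error, moving from $f(x)=4$ at $t=0$ toward the Gaussian average $1$ for large $t$.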

&lt;p&gt;
The log-Sobolev inequality yields quantitative convergence in the relative entropy
distance as follows:
Define the &lt;em&gt;Fisher information&lt;/em&gt;
\[
I(f) = \int \frac{\|\nabla f\|^2}{f} \,d\gamma_n\,.
\]
&lt;/p&gt;

&lt;p&gt;
One can check that
$$
\frac{d}{dt} \mathrm{Ent}_{\gamma_n} (U_t f)\Big|_{t=0} = - I(f)\,,
$$
thus the Fisher information describes the instantaneous decay of the relative entropy of $f$
under diffusion.
&lt;/p&gt;

&lt;p&gt;
So we can rewrite the log-Sobolev inequality as:
\[
- \frac{d}{dt} \mathrm{Ent}_{\gamma_n}(U_t f)\Big|_{t=0} \geq 2\,\mathrm{Ent}_{\gamma_n}(f)\,.
\]
This expresses the intuitive fact that when the relative entropy is large,
its rate of decay toward equilibrium is faster.
&lt;/p&gt;
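
&lt;p&gt;
To make the quantitative convergence explicit (a standard consequence, sketched here using the fact that the derivative formula holds at every time $t \geq 0$, by the semigroup property $U_{s+t} = U_s U_t$): applying \eqref{eq:logsob} to $U_t f$ gives
\[
\frac{d}{dt} \mathrm{Ent}_{\gamma_n}(U_t f) = - I(U_t f) \leq - 2\, \mathrm{Ent}_{\gamma_n}(U_t f)\,,
\]
and integrating this differential inequality yields
\[
\mathrm{Ent}_{\gamma_n}(U_t f) \leq e^{-2t}\, \mathrm{Ent}_{\gamma_n}(f)\,.
\]
&lt;/p&gt;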
</description>
        <pubDate>Sat, 21 Nov 2015 00:00:00 +0000</pubDate>
        <link>http://tcsmath.github.io/entropy/2015/11/21/follmer-drift/</link>
        <guid isPermaLink="true">http://tcsmath.github.io/entropy/2015/11/21/follmer-drift/</guid>
      </item>
    
  </channel>
</rss>
