Curvature for Markov Chains

Recently, Yann Ollivier developed a nice theory of Ricci curvature for Markov chains. In many ways, this can be seen as a geometric language giving another view on the notion of path coupling, developed at the end of the {90}‘s by Martin Dyer and co-workers. It has to be noted that this new notion of curvature is very general and does not need the state space where the Markov chain evolves to have any differential structure, as can be expected at first sight. Any state space endowed with a metric suffices.

Let {P} be a Markov kernel on a metric state space {(S,d)}. We would like to quantify how long it takes for two different particles evolving according to the Markovian dynamic given by {P} to meet. If the first particle starts at {x \in S} and the second at {y \in S}, the initial distance between them is {d(x,y)}. At time {t>0}, what is the average distance between these two particles. For example, if {W^x} and {W^y} are two Brownian motions in {{\mathbb R}^n} started from {x} and {y} respectively, there is no reason why {W^x_t} and {W^y_t} should be closer from each other than {x=W^x_0} and {y=W^y_0}. Indeed, one can even show that whatever the coupling of these two Brownian motions we have {\mathop{\mathbb E}[d(W^x_t, W^y_t)] \geq d(x,y)}: this is roughly speaking because the Euclidean space {{\mathbb R}^n} has no curvature. The situation is quite different if we were instead considering Brownian motions on a sphere: in this case, trajectories tend to coalesce.

1. Wasserstein distance

In the sequel, we will need to use a notion of distance between probability distributions on the metric space {(S,d)}. The usual total variation distance {d(\mu,\nu)} defined by

\displaystyle  d(\mu,\nu) \;=\; \sup_{A \subset S} \; |\mu(A)-\nu(A)| \ \ \ \ \ (1)

is not adapted to our purpose since the metric structure of the space is not exploited. Instead, in order to take into account the distance {d(\cdot,\cdot)} of the space {E} and develop a notion of curvature, we use the Wasserstein distance {W(\mu,\nu)} between probability measures. It is defined as

\displaystyle  W(\mu,\nu) \;=\; \sup\Big\{ \mu(f) - \nu(f) \;:\; \text{Lip}(f) \leq 1\Big\}. \ \ \ \ \ (2)

The distance {d(\cdot,\cdot)} is crucial to this definition: a change of distance implies a change of the class of {1}-Lipschitz functions. Since {\mu(f) - \nu(f) = \mathop{\mathbb E}[f(X) - f(Y)]} for any coupling {(X,Y)} of {\mu} and {\nu}, and since the function {f} is {1}-Lipschitz, it follows that {\mathop{\mathbb E}[f(X) - f(Y)] \leq \mathop{\mathbb E}[d(X,Y)]}. Consequently, for any coupling {(X,Y)} we have {W(\mu,\nu) \leq \mathop{\mathbb E}[d(X,Y)]}. Taking the infimum over all the couplings {(X,Y)} leads to the inequality

\displaystyle  W(\mu,\nu) \;=\; \sup_{\text{Lip}(f) \leq 1} \; |\mu(f) - \nu(f)| \;\leq\; \inf_{(X,Y)} \; \mathop{\mathbb E}[d(X,Y)]. \ \ \ \ \ (3)

This is a deep result that on any reasonable space {(S,d)} the inequality is in fact an equality. Indeed, Kantorovich duality states that on any Radon space {(S,d)} we have

\displaystyle   W(\mu,\nu) \;=\; \sup_{\text{Lip}(f) \leq 1} \; |\mu(f) - \nu(f)| \;=\; \inf_{(X,Y)} \; \mathop{\mathbb E}[d(X,Y)]. \ \ \ \ \ (4)

It is interesting to note that under mild conditions on the state space {(S,d)} one can always find a coupling that achieves the infimum of (4): this is an easy compactness argument.

2. Notion of Curvature

Denoting by {m_x = \delta_x P} the one step distribution of the Markov chain started from {x} in the sense that {m_x(A) = \mathop{\mathbb P}[X_1 \in A \;| X_0 = x ]}, we define the local (Ricci) curvature {\kappa(x,y) \in {\mathbb R}} between {x} and {y} as

\displaystyle   W(m_x, m_y) = d(x,y) \cdot (1-\kappa(x,y)). \ \ \ \ \ (5)

The closer to {1} is {\kappa(x,y)}, the more the trajectories started at {x} tend to meet the trajectories started at {y}.

Trajectories tend to coalesce

The interesting case is when the infimum {\inf_{x,y} \, \kappa(x,y)} is strictly positive,

\displaystyle   \inf_{x,y \in E} \kappa(x,y) \;=\; \kappa > 0. \ \ \ \ \ (6)

In this case we say that the Markov kernel {P} is positively curved on {(S,d)}. It should be noted that in many natural spaces it suffices to ensure that {\kappa(x,y) \;\geq\; \kappa} for all neighbouring states {x} and {y} to ensure that {\kappa(x,y) \;\geq\; \kappa} for any pair {x,y \in S}. This can be proved thanks to the so called Gluing Lemma. A space without curvature correspond to the case {\kappa=0}: for example, a symmetric random walk on {\mathbb{Z}^d} and a Brownian motion on {{\mathbb R}^d} have both zero curvature. The curvature {\kappa} is a property of both the metric space {(S,d)} and the Markov kernel {P}: indeed, different Markov chain on the same metric space {(S,d)} have generally different associated curvature. Given a metric space {(S,d)} carrying a probability distribution {\pi}, this is an interesting problem to construct a {\pi}-invariant Markov chain with the highest possible curvature {\kappa}.

Indeed, the notion of curvature readily generalizes to continuous time Markov processes by taking a limiting case of (5). For example, one can define the curvature of the continuous time Markov process {\{X_t\}_{t \geq 0}} as the largest real number {\kappa} such that for any {x,y \in (S,d)} and {\kappa' < \kappa} we have

\displaystyle  W(m_x^{\delta}, m_y^{\delta}) \;\leq\; (1-\delta \kappa') \; d(x,y) \ \ \ \ \ (7)

for every {\delta} small enough. The quantity {m_x^{\delta}} is the distribution of {X_{\delta}} when started from {x} in the sense that {m_x^{\delta}(A) = \mathop{\mathbb P}[X_{\delta} \in A \; |X_0=x]}.

3. Contraction property

We now show that a positive curvature implies a contraction property. Equation (5) shows that {W(\delta_x P, \delta_y P) \leq W(\delta_x,\delta_y) \cdot (1-\kappa)} for any {x,y \in S}. A simple argument shows that one can indeed generalize the situation to any two distributions {\mu,\nu} in the sense that

\displaystyle   W(\mu P, \nu P) \leq W(\mu,\nu) \cdot (1-\kappa). \ \ \ \ \ (8)

Proof: For any pair {x,y \in S} consider a coupling {(U_{x,y}, V_{x,y})} of {m_x} and {m_y} such that {W(m_x,m_y)=\mathop{\mathbb E}[d(U_{x,y}, V_{x,y})]}. Now, choose an optimal coupling {(X,Y)} of {\mu} and {\nu}. This is straightforward to check that {(U_{X,Y}, V_{X,Y})} is a coupling (in general not optimal) of {\mu P} and {\nu P} so that

\displaystyle  \begin{array}{rcl}  W(\mu P, \nu P) &\leq& \mathop{\mathbb E}[d(U_{X,Y}, V_{X,Y})] = \mathop{\mathbb E}[\; \mathop{\mathbb E}[d(U_{x,y}, V_{x,y}) \;|X=x, Y=y] ] \\ &=& \mathop{\mathbb E}[ W(m_X, m_Y) ] = \mathop{\mathbb E}[ d(X,Y) \cdot (1-\kappa(X,Y)) ]\\ &\leq& (1-\kappa) \; \mathop{\mathbb E}[ d(X,Y) ] = (1-\kappa) \; W(\mu,\nu). \end{array}

\Box

Equation (8) is extremely powerful since it immediately shows that

\displaystyle  W(\mu P^t, \pi) \leq (1-\kappa)^{t} \; W(\mu,\pi). \ \ \ \ \ (9)

In other words, there is exponential convergence (in the Wasserstein metric) to the invariance distribution {\pi} at rate {(1-\kappa)^t}. In continuous time, this reads

\displaystyle  W(\mu P^t, \pi) \;\leq\; e^{-\kappa t} \; W(\mu,\pi). \ \ \ \ \ (10)

In other words, the higher the curvature, the faster the convergence to equilibrium.

4. Examples

Let us give examples of positively curved Markov chains.

  1. Langevin diffusion with convex potential: consider a convex potential {\Psi:{\mathbb R} \rightarrow {\mathbb R}} that is uniformly elliptic in the sense {\Psi^{''}(x) \geq \lambda > 0}. The Langevin diffusion {dz = -\frac{1}{2} \Psi'(z) \, dt + dW} has invariant distribution {\pi} with density proportional to {e^{-\Psi(x)}}. Given a time step {\delta}, the Euler discretization of this diffusion reads
    \displaystyle  x^{k+1} = x^k - \frac{1}{2} \Psi'(x^k) \, \delta + \sqrt{\delta} \; \xi \ \ \ \ \ (11) 

    where {\xi \sim {\mathcal N}(0,1)}. Given two starting points {x^0=x} and {y^0=y}, using the same noise {\xi} to define {x^1} and {y^1} it immediately follows that

    \displaystyle  \begin{array}{rcl}  W(x^1, y^1) &\leq& (x-y) \; \Big(1 - \frac{\delta}{2} \frac{\Psi'(x)-\Psi'(y)}{x-y} \Big)\\ &\leq& (x-y) \; (1-\frac{\lambda}{2} \delta). \end{array}

    In other words, the Langevin diffusion {\{z_t\}_{t \geq 0}} is positively curved with curvature (at least) equal to {\kappa = \frac{\lambda}{2}}.

     

  2. Brownian motion on a sphere: consider a Brownian motion on the unit sphere of {{\mathbb R}^n}. Consider two points {X,Y} on this unit sphere: by symmetry, one can always rotate the coordinates so that that {X=(\sqrt{1-h^2},0,h)} and {X=(\sqrt{1-h^2},0,-h)} for some {h \in [0,1]}. For {h \ll 1} the (geodesic) distance {d(X,Y)} is approximated by {d(X,Y) \approx 2h}. One can couple two Brownian motions {W^X} and {W^Y}, one started at {X} and the other one started at {Y}, by the usual symmetry with respect to the plane {\mathcal{P} = \{(x,y,z): z=0\}}: in other words, {W^Y} is the reflexion of {W^X} with respect to {\mathcal{P}}. One can check (good exercise!) that the diffusion followed by the {z}-coordinate of a Brownian motion on the unit sphere of {{\mathbb R}^n} is simply given by
    \displaystyle  dz = -\frac{1}{2}(n-1)z \, dt + \sqrt{1-z^2} \, dW. \ \ \ \ \ (12) 

    With this coupling, for small time {\delta \ll 1}, it follows that

    \displaystyle  \begin{array}{rcl}  z^X_{\delta} &\approx& h - \frac{1}{2} (n-1) h \, \delta + \sqrt{1-h^2} \sqrt{\delta} \; \xi\\ z^Y_{\delta} &\approx& -h + \frac{1}{2} (n-1) h \, \delta - \sqrt{1-h^2} \sqrt{\delta} \; \xi \end{array}

    where {\xi \sim {\mathcal N}(0,1)} is used as the same source of randomness for {z^X_{\delta}} and {z^Y_{\delta}} since {W^Y} is the reflexion of {W^X}. Since {d(W^X_{\delta}, W^X_{\delta}) \approx |z^X_{\delta} - z^Y_{\delta}|} it readily follows that

    \displaystyle  \begin{array}{rcl}  d(W^X_{\delta}, W^X_{\delta}) \; \leq \; \big(1- \frac{1}{2}(n-1)\delta \big)\; d(x,y). \end{array}

    In other words, the curvature of a Brownian motion on the unit sphere of {{\mathbb R}^n} is equal to {\frac{1}{2}(n-1)}. Maybe surprisingly, the higher the dimension, the faster the convergence to equilibrium. This is not so unreal if one notices that the Brownian increment satisfies {\mathop{\mathbb E} \|W_{t+\delta}-W_t\|^2 \approx n \delta}.

  3. Other examples: see the original text for many other examples.