SMC sampler

Sequential Monte Carlo (SMC) methods are usually designed to deal with probability distributions of increasing dimension. The canonical example is the sequence of distributions {\pi_n(X_{1:n}) = \mathbb{P}(X_{1:n} \, | Y_{1:n})} where {\{ X_k \}_{k \geq 0}} is a hidden Markov chain and {\{ Y_k \}_{k \geq 0}} is an observation process. The notation {X_{1:n}} is a shortcut for {(X_1, X_2, \ldots, X_n)}. This is usually called a Hidden Markov Model in the literature. If the Markov chain {X_k} evolves in {\mathbb{R}^d} then the distribution {\pi_n} is a probability distribution on {\big( \mathbb{R}^d \big)^n}.

It turns out that one can use the same kind of ideas to study a single distribution {\pi}. If {\pi} is difficult to sample from directly and traditional MCMC methods are difficult to use (because of dimensionality, multi-modality, poor mixing properties, etc.), one can try to use SMC methods to study the distribution {\pi}. To this end, one can introduce a sequence of bridging distributions

\displaystyle \begin{array}{rcl} \pi_0, \pi_1, \ldots, \pi_n \end{array}

where {\pi_0} is typically very easy to sample from and {\pi_n=\pi}. A traditional choice is {\pi_k(x) \propto \pi(x)^{\alpha_k}} where {0 \leq \alpha_1 < \alpha_2 < \ldots <\alpha_n=1}. The bigger the index {k}, the more intricate the distribution {\pi_k}. The problem is that each distribution {\pi_k} has the same dimensionality as the original distribution {\pi}, so that it is difficult to use all the machinery of SMC methods. For the sake of concreteness, let us suppose that {\pi} is a distribution on {\mathbb{R}^d}. The fundamental remark at the basis of what is now known as Sequential Monte Carlo samplers is that {\pi_k(x_k)} can be seen as the marginal distribution of the probability

\displaystyle \begin{array}{rcl} \Pi_k(x_1, \ldots, x_k) = \pi_k(x_k) L_k(x_k, x_{k-1})L_{k-1}(x_{k-1}, x_{k-2}) \ldots L_2(x_2,x_1) \end{array}

where {L_j(\cdot, \cdot)} are Markov kernels in the sense that for any {x} we have {\int_y L_j(x, y) \, dy = 1}. The distribution {\Pi_k} lives on {\big( \mathbb{R}^d \big)^k} and one can use the traditional SMC methods to explore the sequence of distributions {\Pi_1, \ldots, \Pi_n}. This algorithm was devised a few years ago by Pierre Del Moral, Arnaud Doucet and Ajay Jasra. It is extremely powerful, and several other algorithms can be seen as particular cases of it. Because I had never implemented the algorithm before, I tried yesterday to see how this worked on a very simple example. My target distribution was an equal mixture of seven Gaussian distributions in {\mathbb{R}^2} with unit covariance matrices, centred at the vertices of a regular heptagon with radius {20}. I started with {N=3000} particles at the centre of the heptagon and let them evolve according to a random walk. When the Effective Sample Size (ESS) was too small (less than {N/2}), the whole particle system was regenerated. In order to obtain a sweet movie, I used {100} bridge distributions {\pi_k \propto \pi^{k/100}}.
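
For the curious, here is a minimal sketch of that experiment (my reconstruction, not the original script; details such as the random-walk step size are my own choices):

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, radius, step = 3000, 100, 20.0, 2.0
angles = 2 * np.pi * np.arange(7) / 7
centres = radius * np.column_stack([np.cos(angles), np.sin(angles)])

def log_pi(x):
    # log-density (up to a constant) of the equal mixture of 7 unit Gaussians
    m = -0.5 * ((x[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    mx = m.max(axis=1)
    return mx + np.log(np.exp(m - mx[:, None]).sum(axis=1))

x = np.zeros((N, 2))              # all particles start at the centre
logw = np.zeros(N)
for k in range(1, K + 1):
    logw += log_pi(x) / K         # incremental weight pi_k / pi_{k-1} = pi^(1/K)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    if 1.0 / np.sum(w ** 2) < N / 2:      # regenerate when ESS < N/2
        x = x[rng.choice(N, size=N, p=w)]
        logw = np.zeros(N)
    # one random-walk Metropolis move targeting pi_k proportional to pi^(k/K)
    prop = x + step * rng.standard_normal((N, 2))
    accept = np.log(rng.random(N)) < (k / K) * (log_pi(prop) - log_pi(x))
    x[accept] = prop[accept]

print(x.mean(axis=0))             # close to (0, 0) by symmetry of the heptagon
```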

The result is clear and quite impressive. Indeed, this could have been made much more efficient by using far fewer bridge distributions. Some recent theory has been developed to analyse the stability of the algorithm in high dimensional situations. The analysis described in the article uses a scaling limit approach very similar to some ideas used to analyse MCMC algorithms and some more recent developments along the same lines that we have done with Andrew and Natesh. Very exciting stuff!

[EDIT: 2 October 2012]
I received several emails asking me to describe how to obtain the short animation showing the evolution of the particle system. The code is available here. This is a quick Python script producing a bunch of .png files. It then suffices to stick them together to obtain an animated .gif file. This can for example easily be done using the free (and very good) photoshop-like software GIMP. Open all the png files as different layers, then export the result as a .gif file with the option ‘animated gif’ checked. Voila!

An army of gamblers and a martingale

coin toss
Start flipping a coin. How long should you wait (on average) before observing the pattern {HHTHHTT}? Easy, you might say: that's simply the expected hitting time of state {H} starting from state {A} of the following Markov chain.

Indeed, one can solve this kind of problem using a first step analysis. In other words, if {a} is the expected number of steps before reaching {H} starting from {A}, {b} is the expected number of steps before reaching {H} starting from {B}, {c} is the expected number of steps before reaching {H} starting from {C}, etc… one can readily see that {a,b, \ldots, h} verify the system of equations

\displaystyle \begin{array}{rcl} a &=& 1 + \frac12(a+b), \quad b = 1 + \frac12(a+c), \quad c = 1 + \frac12(c+d) \\ d &=& 1 + \frac12(a+e), \quad e = 1 + \frac12(a+f), \quad f = 1 + \frac12(c+g) \\ g &=& 1 + \frac12(e+h), \quad h = 0. \end{array}

After a finite amount of time and coffee one then comes up with the answer {a=128}. That was a little bit tedious, and I don't really want to know how long one should wait on average before observing the sequence {HHTHTTTHTHHHHTTHTHTTH}.
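
As a quick check, a few lines of numpy confirm the value {a=128} (the state encoding is the one described above):

```python
import numpy as np

# unknowns v = (a, b, c, d, e, f, g, h); each relation like a = 1 + (a+b)/2
# becomes a row of the linear system M v = rhs, and h = 0 closes the system
pairs = [(0, 1), (0, 2), (2, 3), (0, 4), (0, 5), (2, 6), (4, 7)]
M = np.eye(8)
for row, (i, j) in enumerate(pairs):
    M[row, i] -= 0.5
    M[row, j] -= 0.5
rhs = np.ones(8)
rhs[7] = 0.0                          # last equation: h = 0
print(np.linalg.solve(M, rhs)[0])     # 128.0
```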

It turns out that there is a very neat way to approach this kind of problem without solving any linear system. Apparently, this trick was discovered in the 80s by Shuo-Yen Robert Li. As often in probability, it boils down to introducing a clever martingale. Suppose that one wants to know how long it takes on average before observing the magic coin pattern {x_1x_2 \ldots x_n} of length {n \geq 1} with {x_i \in \{H,T\}} for all {1 \leq i \leq n}. To this end, one introduces an army of gamblers, a casino, a game of chance and a martingale. The martingale {M} describes the amount of money earned by the casino. At each round (starting at round {t=1}) the casino tosses a coin and the gamblers can make bets. If a gambler starts betting just before round {k}, he wins if the coin shows {x_1}. If he decides to bet again at round {k+1}, he wins if the coin shows {x_2}, etc. Suppose that there are infinitely many gamblers {G_0, G_1, G_2, \ldots} and that gambler {G_k} starts betting just before round {k+1}. The game is fair in the sense that if a gambler bets an amount {A}, with probability {1/2} he gets {2A} and with probability {1/2} he loses his bet. One supposes that gambler {G_k} starts betting {1} dollar just before round {k+1} and keeps reinvesting his total capital until he either loses everything (total loss equal to {1} dollar) or observes the magic coin pattern {x_1x_2 \ldots x_n} (total earnings equal to {2^n-1}). The amount of money earned by the casino is a martingale and {M_0=0}. For example, {M_1=1} if the first gambler {G_0} loses his bet and {M_1=-1} if he wins. Consider the time {\tau} when the magic coin pattern first appears. This is a stopping time, and the optional stopping theorem for martingales shows that {\mathbb{E}[M_{\tau}] = 0}. But one can also compute this quantity another way. By time {\tau}, all the losers have given {1} dollar to the casino. Also, if the gambler {G_k} is still in the game, he has won a total amount {2^{\tau-k}-1}. In other words,

\displaystyle M_{\tau} = \tau - \sum_{j=0}^{\tau-1} 2^{\tau-j} \, \delta_j

where {\delta_j=1} if the gambler {G_j} is still in the game after round {\tau} and {\delta_j=0} otherwise. The fundamental remark is that the quantity {\sum_{j=0}^{\tau-1} 2^{\tau-j} \, \delta_j} is not random, so that {\mathbb{E}[\tau] = \sum_{j=0}^{\tau-1} 2^{\tau-j} \, \delta_j}. Indeed, right after round {\tau} one knows with certainty whether {G_{j}} is still in the game or not: gambler {G_{\tau-k}} is in the game only if the {k} last letters of the magic word {x_1x_2 \ldots x_n} are the same as the first {k} letters. For example, if the magic word is {HHTHHTH} then the gamblers who have seen the patterns {H}, {HHTH} and {HHTHHTH} at time {\tau} are still in the game when the magic word first appears. In this case this shows that {\mathbb{E}[\tau] = 2^1+2^4+2^7}.

gambler

For example, this immediately shows that we have to wait on average {2^{21}+2^1} coin tosses before observing the pattern {HHTHTTTHTHHHHTTHTHTTH}.
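
In code, Li's formula is a one-liner, and a quick simulation confirms it for short patterns (a sketch):

```python
import random

def expected_wait(pattern):
    # Li's formula: sum 2^k over the lengths k at which a prefix of the
    # pattern equals a suffix (the full length k = n always counts)
    n = len(pattern)
    return sum(2 ** k for k in range(1, n + 1) if pattern[:k] == pattern[n - k:])

def simulate(pattern, runs=5000):
    total = 0
    for _ in range(runs):
        s, t = "", 0
        while s != pattern:
            s = (s + random.choice("HT"))[-len(pattern):]  # keep last n tosses
            t += 1
        total += t
    return total / runs

print(expected_wait("HHTHHTT"))                # 128
print(expected_wait("HHTHTTTHTHHHHTTHTHTTH"))  # 2**21 + 2**1 = 2097154
print(simulate("HHTHHTT"))                     # approximately 128
```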

A sweet probabilistic proof that zeta(2)=…

Here is a delightful probabilistic proof that {\zeta(2)=\frac{\pi^2}{6}}, extracted from a recent article by Greg Markowsky. It starts with a pretty lemma (which can originally be found in this article) on the exit time of a {2} dimensional Brownian motion {B_t}. Suppose that {f:\mathbb{C} \rightarrow \mathbb{C}} is an analytic function on a neighbourhood of the unit disk. This function maps the unit disk {\mathbb{D}} to {f(\mathbb{D})} with boundary {\partial f(\mathbb{D}) = f(\partial \mathbb{D})} where {\partial \mathbb{D}= \big\{e^{i\theta} : 0 \leq \theta \leq 2 \pi \big\}}. A two dimensional Brownian motion started at {f(0)} takes on average

\displaystyle \mathbb{E}[ \, \tau \, ] \; = \; \frac{1}{2} \sum_{k \geq 1} |a_k|^2

to exit the domain {f(\mathbb{D})}, where {f(z) = \sum_{k \geq 0} a_k z^k} and {\tau = \inf \big\{ \, t>0: B_t \in \partial f(\mathbb{D}) \, \big\}} is the hitting time of the boundary {\partial f(\mathbb{D})}. Since the situation is invariant under translation, one can always suppose that {f(0)=a_0=0}. The proof of this result is a simple martingale argument. Since {M_t = \|B_t\|^2 - 2t} is a martingale, the optional stopping theorem shows that

\displaystyle \mathbb{E}[ \, \|B_{\tau}\|^2 \, ] \; = \; 2 \mathbb{E}[\, \tau \, ].

Then, since {f(\cdot)} is an analytic function, Itô's formula shows that the trajectories of {B_t} are the same as the trajectories of {f(W_t)}, up to a time change, where {W_t} is another {2} dimensional Brownian motion started at {z=0}. Consequently, {B_{\tau} = f(W_{\rho})} where {\rho} is the hitting time {\rho = \inf \big\{ \, t>0: W_t \in \partial \mathbb{D} \, \big\}}. This shows that

\displaystyle \mathbb{E}[ \, \|B_{\tau}\|^2 \, ] = \mathbb{E}[ \, \|f(W_{\rho})\|^2 \, ] = \frac{1}{2\pi} \int_{\theta=0}^{2\pi} |f(e^{i\theta})|^2 \, d\theta.

Since Parseval‘s theorem (or a direct calculation) shows that the last quantity also equals {\sum_{k \geq 0} |a_k|^2 = \sum_{k \geq 1} |a_k|^2} (recall that {a_0=0}), the conclusion follows.


To find a nice identity, it thus suffices to find an easy domain where one can compute the Brownian exit time {\mathbb{E}[ \, \tau \, ]} explicitly. The first thing that comes to mind is a strip {S_a = \big\{ x+iy \, : \, -a < y < a\big\}} since it is classical that a Brownian motion started at {z=0} takes on average {T=a^2} to exit the strip {S_a}. Since {f(z) = \log\big( \frac{1-z}{1+z} \big) = -2(z+z^3/3+z^5/5 + z^7/7 + \ldots)} maps the unit disk {\mathbb{D}} to the strip {S_{\frac{\pi}{2}}} it follows that

\displaystyle \frac{\pi^2}{4} = 2(1+3^{-2} + 5^{-2} + 7^{-2} + \ldots),

which is indeed equivalent to the celebrated identity {\zeta(2) = \frac{\pi^2}{6}}.
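
As a sanity check (my addition, not in the original argument), one can estimate the exit time of the strip by a crude Euler simulation; only the imaginary part of the Brownian motion matters:

```python
import numpy as np

rng = np.random.default_rng(0)
dt, runs = 1e-3, 10000
y = np.zeros(runs)                 # imaginary part of the Brownian motion
t = np.zeros(runs)
alive = np.ones(runs, dtype=bool)
while alive.any():
    y[alive] += np.sqrt(dt) * rng.standard_normal(alive.sum())
    t[alive] += dt
    alive &= np.abs(y) < np.pi / 2
print(t.mean(), np.pi ** 2 / 4)    # both around 2.47, up to discretisation bias
```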

Travelling between cities …

Consider the standard plane lattice {\mathbb{Z}^2} and suppose that each point {(x,y) \in \mathbb{Z}^2} represents a city. Each city is connected to its {4} neighbors only and each edge {(AB)} carries a weight that represents the time it takes to go from city {A} to city {B}. A natural question is the following:

Given two (non-neighbouring) cities on the map, how long does it take to travel between them?

To make the problem more tractable, mathematicians usually assume that the times it takes to travel between neighbouring cities are independent from each other and identically distributed according to a distribution {\mu(dt)}. In other words, each edge of the lattice {\mathbb{Z}^2} carries a random variable {T_{e}}. The random variables {\big\{ T_e \big\}_e} are {i.i.d} and distributed according to {\mu(dt)}. It takes

\displaystyle \begin{array}{rcl} \tau(A,B)\;=\; \inf \; \Big\{ \; \sum_{\text{path}}T_{e} \;: \text{paths between A and B} \; \Big\} \end{array}

to travel on the optimal route between city {A} and city {B}. The optimal route is usually called a geodesic. If one starts from the origin, one may wonder which cities can be reached in less than {30} hours, say. On the next picture, each red dot represents a city that can be reached in less than {30} hours when the travel time between any two neighbouring cities is exponentially distributed with mean {1} hour.
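
The picture can be reproduced with a Dijkstra computation on the grid (a sketch, not the original script; edge weights are memoised so that each road keeps the same travel time):

```python
import heapq, random

random.seed(0)
L, horizon = 100, 30.0            # grid cut-off and time budget (hours)
weights = {}
def w(a, b):                      # Exp(1) travel time, one draw per edge
    e = (min(a, b), max(a, b))
    if e not in weights:
        weights[e] = random.expovariate(1.0)
    return weights[e]

dist, heap = {(0, 0): 0.0}, [(0.0, (0, 0))]
while heap:
    t, u = heapq.heappop(heap)
    if t > dist.get(u, float("inf")) or t > horizon:
        continue
    x, y = u
    for v in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
        if max(abs(v[0]), abs(v[1])) > L:
            continue
        nt = t + w(u, v)
        if nt < dist.get(v, float("inf")):
            dist[v] = nt
            heapq.heappush(heap, (nt, v))

ball = [p for p, t in dist.items() if t <= horizon]
print(len(ball))                  # cities reachable from the origin in 30 hours
```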

One may also wonder what the optimal routes look like. On the next picture I have plotted {4 \times 10^4} cities (the origin is the blue dotted city) and highlighted the optimal routes between the origin and each one of the cities on the border of the map.

Is it very different from the highway system around a typical city?

For those who want to play with this model, you can modify the quick and dirty Python code that can be found here. The study of these models is an active area of research.

A new record

Alexandros Marinos has just found a new solution to the Challenge 17 problem. His solution only contains 3 rectangles. This is the best known solution so far (6 November 2011). Congratulations to Alexandros for this amazing achievement!

Conditioned Brownian motion, 1st Laplacian eigenfunction, etc…

Brownian Motion

A few days ago, a student asked the following question:

Consider a bounded domain {D \subset \mathbb{R}^d} and a Brownian motion conditioned to stay inside {D}. What does it look like? What is the invariant distribution?

This is very simple, and surprisingly interesting: it involves the first eigenfunction of the Laplacian on {D}. To make everything simple, and because this does not change anything essential, it suffices to study the situation where {D=[0,1] \subset \mathbb{R}}. In other words, what does a real Brownian motion conditioned to stay inside the segment {[0,1]} look like?

Discretisation

One could directly do the computations in a continuous setting but, as is often the case, it is simpler to consider the usual random walk discretisation of a Brownian motion. For this purpose, consider a time discretisation {\delta>0} and a standard random walk {X_k} with increments {\pm \sqrt{\delta}}: with probability {\frac12} the random walk goes up {+\sqrt{\delta}} and with probability {\frac12} it goes down {-\sqrt{\delta}}. For clarity, assume that there exists an integer {m_{\delta} \in \mathbb{N}} such that {m_{\delta} \sqrt{\delta} = 1}. Suppose as well that the walk is conditioned on the event {X_k \in [0,1]} for {k=1, \ldots, N}. Later, we will consider the limiting case {N \rightarrow \infty} and {\delta \rightarrow 0}, in this order, to recover the Brownian motion case.

Conditioning

First, let us compute the transition probabilities of the random walk {\{X_k\}_{k=0}^N} conditioned on the event {X_k \in [0,1]} for {k=1, \ldots, N}. For clarity, let us denote the conditioned random walk by {\hat{X} = \big\{ \hat{X}_k \big\}_{k \geq 0}}. This is a Doob h-transform, and the resulting process is a non-homogeneous Markov chain. In this simple case the computations are straightforward. The conditioned Markov chain has transition probabilities given by {\hat{P}(k,x,y) = \mathbb{P}[\hat{X}_{k+1}=y \,|\hat{X}_k=x]} with {x,y \in D_{\delta}} where {D_{\delta} = \sqrt{\delta} \mathbb{N} \; \cap \; [0,1]}. One can compute the probability that the conditioned Markov chain follows a given trajectory {(x_0, \ldots, x_N)} of {D_{\delta}},

\displaystyle  \begin{array}{rcl}  \mathbb{P}[\hat{X}_0 = x_0, \ldots, \hat{X}_N=x_N] = \frac{1}{Z(N,\delta)} P(x_0,x_1) \times \ldots \times P(x_{N-1}, x_N) \end{array}

where {P(x,y)} is the transition kernel of the unconditioned Markov chain and {Z(N,\delta) = \mathbb{P}[X_k \in D_{\delta} \; \text{for} \; k=0, \ldots, N]} is a normalisation constant. The Doob h-transform simply consists in noticing that this also reads

\displaystyle  \begin{array}{rcl}  \mathbb{P}[\hat{X}_0 = x_0, \ldots, \hat{X}_N=x_N] = \hat{P}(0,x_0,x_1) \times \ldots \times \hat{P}(N-1,x_{N-1}, x_N) \end{array}

where the conditioned Markov kernel is {\hat{P}(k,x,y) = \frac{P(x,y) \, h(k+1,y)}{h(k,x)}} and the function {h(\cdot, \cdot)} is defined by

\displaystyle  \begin{array}{rcl}  h(k,x) = \mathbb{P}[X_{j} \in D_{\delta} \; \text{for} \; k+1 \leq j \leq N \, |X_k=x]. \end{array}

Of course we have {h(N,x)=1} for all {x \in D_{\delta}}. The quantity {h(k,x)} is the probability that a random walk starting at {x \in D_{\delta}} at time {k} remains inside {D_{\delta}} at times {k,k+1, \ldots, N}. Consequently, in order to find the transition probabilities of the conditioned kernel, it suffices to compute the quantities {h(k, x)} for all {x \in D_{\delta}}. Since we are interested in the limiting case {N \rightarrow \infty}, it actually suffices to consider the case {k=0}. The function can be computed recursively since {h(k,x) = \frac12 \, h(k+1, x+\sqrt{\delta}) + \frac12 \, h(k+1, x-\sqrt{\delta})} with the appropriate boundary conditions. In other words, adopting the obvious matrix notations, the vector {h(k,\cdot)} satisfies

\displaystyle  \begin{array}{rcl}  h(k,\cdot) = A h(k+1, \cdot) \end{array}

where {\big( A_{i,j} \big)_{0 \leq i,j \leq m_{\delta}}} is the usual tridiagonal matrix given by {A_{i,j} = \frac12} if, and only if, {|i-j|=1} and {A_{i,j} = 0} otherwise. It is related to the discrete Laplacian operator. Because all the eigenvalues of {A} are real with modulus strictly less than {1}, it follows that {h(0,\cdot) = A^{N} \textbf{1} \approx \lambda_{\delta}^N \, \varphi_{\delta}(\cdot)} where {\lambda_{\delta}} is the largest eigenvalue of {A} and {\varphi_{\delta}(\cdot)} the associated eigenfunction. The eigenvalues of {A} are well-known, and as {\delta \rightarrow 0}, the largest eigenvalue converges to {1} and the associated eigenfunction converges to the first eigenfunction of the Laplacian on the domain {D=[0,1]} with Dirichlet boundary conditions. In our case it is {\varphi(u) = \sin(\pi \, u)} and

\displaystyle  \begin{array}{rcl}  \lim_{N \rightarrow \infty} \frac{ h(1,x \pm \sqrt{\delta})}{h(0,x)} = \lambda_{\delta}^{-1} \frac{ \varphi_{\delta}(x \pm \sqrt{\delta})}{\varphi_{\delta}(x)}. \end{array}

In other words, the random walk with increments {\pm \sqrt{\delta}} conditioned to stay inside {D_{\delta}} has probability transitions given by

\displaystyle  \begin{array}{rcl}  \hat{P}(x, x \pm \sqrt{\delta}) = \frac{{\lambda_{\delta}}^{-1}}{2} \frac{ \varphi_{\delta}(x \pm \sqrt{\delta})}{\varphi_{\delta}(x)}. \end{array}

The next section investigates the limiting case {\delta \rightarrow 0}.

Conclusion

We have computed the dynamics of the conditioned random walk with space-increments {\sqrt{\delta}}. To obtain the dynamics of the conditioned Brownian motion it suffices to consider the limiting case {\delta \rightarrow 0}. The drift of the resulting diffusion is given by

\displaystyle  \begin{array}{rcl}  \text{(drift)} &=& \lim_{\delta \rightarrow 0} \frac{1}{\delta} \Big\{ \hat{P}(0,x,x+\sqrt{\delta})\sqrt{\delta} - \hat{P}(0,x,x-\sqrt{\delta})\sqrt{\delta} \Big\} \\ &=& \lim_{\delta \rightarrow 0} \frac{{\lambda_{\delta}}^{-1}}{2 \sqrt{\delta}} \Big\{ \frac{ \varphi_{\delta}(x + \sqrt{\delta})}{\varphi_{\delta}(x)} - \frac{ \varphi_{\delta}(x - \sqrt{\delta})}{\varphi_{\delta}(x)} \Big\} = \frac{\varphi'(x)}{\varphi(x)}, \end{array}

since {\varphi_{\delta}(x+\sqrt{\delta}) - \varphi_{\delta}(x-\sqrt{\delta}) \approx 2\sqrt{\delta}\,\varphi'(x)} and {\lambda_{\delta} \rightarrow 1}: the factor {2} from the centred difference cancels the {2} in the denominator, so that no factor {\frac12} survives in the limit.

The same computation gives the volatility of the resulting diffusion. It is given by

\displaystyle  \begin{array}{rcl}  \text{(volatility)} &=& \lim_{\delta \rightarrow 0} \frac{1}{\delta} \Big\{ \hat{P}(0,x,x+\sqrt{\delta})\delta + \hat{P}(0,x,x-\sqrt{\delta})\delta \Big\} \\ &=& \lim_{\delta \rightarrow 0} \frac{{\lambda_{\delta}}^{-1}}{2} \Big\{ \frac{ \varphi_{\delta}(x + \sqrt{\delta})}{\varphi_{\delta}(x)} + \frac{ \varphi_{\delta}(x - \sqrt{\delta})}{\varphi_{\delta}(x)} \Big\} = 1. \end{array}

As a consequence, this shows that a Brownian motion conditioned to stay inside {D=[0,1]} follows the stochastic differential equation {dX = \frac{\varphi'(X)}{\varphi(X)} \, dt + dW_t} where {\varphi(\cdot)} is the first eigenfunction of the Laplacian on {D} with Dirichlet boundary conditions. More generally, the same argument shows that a Brownian motion in {\mathbb{R}^d} conditioned to stay inside a nice bounded domain {D \subset \mathbb{R}^d} evolves according to the stochastic differential equation

\displaystyle  \begin{array}{rcl}  dX = \frac{\nabla \varphi(X)}{\varphi(X)} \, dt + dW_t \end{array}

where {\varphi:D \rightarrow \mathbb{R}} is the first eigenfunction of the Laplacian on {D}. This is a Langevin diffusion with drift {\frac12 \nabla \log \varphi^2}, and one can immediately see that the invariant distribution of this diffusion is given by

\displaystyle  \begin{array}{rcl}  \pi_{\infty}(x) \, \propto \varphi(x)^2. \end{array}
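
One can check this numerically with the conditioned random walk itself (a sketch): on the grid {x = j\sqrt{\delta}} with {\varphi(u)=\sin(\pi u)} and {\lambda_{\delta} = \cos(\pi\sqrt{\delta})}, the transition probabilities above sum exactly to one, and the occupation measure of a long trajectory should be close to {\sin^2(\pi x)}, properly normalised.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100                               # sqrt(delta) = 1/m, grid j = 1, ..., m-1
h = 1.0 / m
lam = np.cos(np.pi * h)               # largest eigenvalue of A
phi = lambda u: np.sin(np.pi * u)

j, steps = m // 2, 10 ** 6
counts = np.zeros(m + 1)
for _ in range(steps):
    p_up = phi((j + 1) * h) / (2 * lam * phi(j * h))  # P(x -> x + sqrt(delta))
    j += 1 if rng.random() < p_up else -1
    counts[j] += 1

xs = np.arange(1, m) * h
empirical = counts[1:m] / steps
exact = 2 * np.sin(np.pi * xs) ** 2 / m    # normalised phi^2 on the grid
print(np.abs(empirical - exact).max())     # small, up to Monte Carlo error
```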

For example, the following plot depicts the first eigenfunction of the Laplacian on the domain {D=[0,1]^2 \subset \mathbb{R}^2}.

First Eigenfunction

Uncertainty Principle and Robust Reconstruction

1. Heisenberg Uncertainty Principle

The Heisenberg Uncertainty principle states that a signal cannot be both highly concentrated in time and highly concentrated in frequency. For example, consider a square-integrable function {f:{\mathbb R} \rightarrow {\mathbb R}} normalised so that {\|f\|_2 =1}. In this case, {|f(x)|^2 \, dx} defines a probability distribution on the real line. The Fourier isometry shows that its Fourier transform {\hat{f}} also defines a probability distribution {|\hat{f}(x)|^2 \, dx}. In order to measure how spread-out these two probability distributions are, one can use their variances. The uncertainty principle states that

\displaystyle  \textbf{var}(|f|^2) \cdot \textbf{var}(|\hat{f}|^2) \geq \frac{1}{16\pi^2} \ \ \ \ \ (1)

where {\textbf{var}(g)} designates the variance of the distribution {g(x) \, dx}. This shows for example that {\textbf{var}(|f|^2)} and {\textbf{var}(|\hat{f}|^2)} cannot be simultaneously less than {\frac{1}{4 \pi}}.

We can play the same game with a vector {x \in {\mathbb R}^N} and its (discrete) Fourier transform {\hat{x} = Hx}. The Fourier matrix {H} is defined by

\displaystyle  H_{p,q} = \frac{1}{\sqrt{N}} \, \exp(2i\pi\frac{pq}{N}). \ \ \ \ \ (2)

where the normalisation coefficient {\frac{1}{\sqrt{N}}} ensures that {HH_*=I}. The Fourier inversion formula reads {x = H_*\hat{x}}. To measure how spread-out the coefficients of a vector {v \in {\mathbb C}^N} are, one can look at the size of its support

\displaystyle  \mu(v) \, := \, \textbf{Card} \Big\{ k \in [1,N]: \; v_k \neq 0 \Big\}. \ \ \ \ \ (3)

If an uncertainty principle holds, one should be able to bound {\mu(x) \cdot \mu(\hat{x})} from below. There is no universal constant {\alpha>0} such that

\displaystyle  \frac{\mu(x)}{N} \cdot \frac{\mu(\hat{x})}{N} \geq \alpha \ \ \ \ \ (4)

for any {N \geq 1} and {x \in {\mathbb R}^N}. Indeed, if {N=pq} and {x = \sum_{j=1}^{\frac{N}{p}} e_{pj}} one readily checks that {\mu(x) = q} and {\mu(\hat{x})=p} so that {\frac{\mu(x)}{N} \cdot \frac{\mu(\hat{x})}{N} = \frac{1}{N}}. Nevertheless, in this case this gives {\mu(x) \cdot \mu(\hat{x}) = N}. Indeed, a beautiful result of D. Donoho and P. Stark shows that

\displaystyle   \mu(x) \cdot \mu(\hat{x}) \geq N \qquad \text{for all } \qquad x \in {\mathbb R}^N. \ \ \ \ \ (5)

Maybe more surprisingly, and crucial to applications, they even show that this principle can be made robust by taking noise and approximations into account. This is described in the next section.

2. Robust Uncertainty Principle

Consider a subset {T \subset \{1,2, \dots, N\}} of indices and the orthogonal projection {P_T} on {\text{span}\big\{ e_t : t \in T\big\}}. In other words, if {x = \sum_{j=1}^N x_j e_j}, then

\displaystyle  P_T x = \sum_{t \in T} x_t \, e_t. \ \ \ \ \ (6)

We say that a vector {x \in {\mathbb C}^N} is {\epsilon_T}-concentrated on {T} if {\|x-P_T x\|_2 \leq \epsilon_T}. The {\ell^2}-robust uncertainty principle states that if {x} is {\epsilon_T}-concentrated on {T} and {\hat{x}} is {\epsilon_W}-concentrated on {W} then

\displaystyle   |T| \cdot |W| \geq N \, \big( 1 - \epsilon_2\big) \qquad \text{with} \qquad 1 - \epsilon_2 = \big(1-(\epsilon_W+\epsilon_T)\big)^2. \ \ \ \ \ (7)

Indeed, the case {\epsilon_T=\epsilon_W=0} gives Equation (5). The proof is surprisingly easy. One introduces the reconstruction operator {R} defined by

\displaystyle  R = H_* \circ P_W \circ H \circ P_T. \ \ \ \ \ (8)

In words: take a vector, delete the coordinates outside {T}, take the Fourier transform, delete the coordinates outside {W}, and finally take the inverse Fourier transform. The special case {T=W=\{1,\ldots, N\}} simply gives {R=\text{(identity)}}. The proof consists in bounding the operator norm of {R: \ell^2 \rightarrow \ell^2} from above and below. The existence of a vector {x} that is {\epsilon_T}-concentrated and such that {\hat{x}} is {\epsilon_W}-concentrated gives a lower bound. In details:

  • Upper bound: it is obvious that {R} is a contraction in the sense that {\|Rx\| \leq \|x\|} since {H,H_*} are isometries and {P_T,P_W} are orthogonal projections. Moreover, the smaller {W} and {T}, the smaller the norm of {R}. Since {H_*} is an isometry we have {\|R\| = \|P_W \circ H \circ P_T\| = \sup \|P_W \circ H x\|}, where the supremum is over all {\|x\| \leq 1} that are supported by {T}. Cauchy-Schwarz shows that

    \displaystyle  \begin{array}{rcl}  \|P_W \circ H x\|^2 &=& \sum_{w \in W} \big| \sum_{t \in T} H_{w,t} x_t\big|^2 \\ &\leq& \sum_{w \in W} \sum_{t \in T} |H_{w,t}|^2 = \frac{1}{N}\,|W| \, |T|. \end{array}

    In other words, the reconstruction operator {R} satisfies {\|R\| \leq \frac{1}{\sqrt{N}}\, \sqrt{|W| \, |T|}}. This is a non-trivial bound only if {|W| \, |T| < N}.

  • Lower bound: consider {x \in {\mathbb R}^N} with {\|x\|_2=1} that satisfies {\|x-P_Tx\| = \epsilon_T} and {\|\hat{x} - P_W \hat{x}\| = \epsilon_W}. The Fourier transform is an {\ell^2} isometry so that the triangle inequality easily shows that {x \approx Rx} in the sense that

    \displaystyle  \|x-R x\| \leq \epsilon_W + \epsilon_T. \ \ \ \ \ (9)

    This proves the lower bound {\|R\| \geq 1-(\epsilon_W+\epsilon_T)}.

In summary, the reconstruction operator {R=H_* \circ P_W \circ H \circ P_T} always satisfies {\|R\| \leq \frac{1}{\sqrt{N}}\, \sqrt{|W| \, |T|}}. Moreover, the existence of {x \in {\mathbb R}^N} satisfying {\|x-P_Tx\|_2 = \epsilon_T} and {\|\hat{x} - P_W \hat{x}\|_2 = \epsilon_W} implies the lower bound {\|R\| \geq 1-(\epsilon_W+\epsilon_T)}. Therefore, the sets {W} and {T} must verify the uncertainty principle

\displaystyle  |W| \cdot |T| \geq N \, (1-\epsilon_2) \qquad \text{with}\qquad 1-\epsilon_2 = \big( 1-(\epsilon_W+\epsilon_T)\big)^2. \ \ \ \ \ (10)

Notice that we have only used the fact that the entries of the Fourier matrix are bounded in absolute value by {\frac{1}{\sqrt{N}}}. We could have used, for example, any other unitary matrix {U \in M_N({\mathbb C})}. The bound {|U_{r,s}| \leq \frac{1}{\sqrt{\alpha N}}} for all {1 \leq r,s \leq N} gives the uncertainty principle {|W| \cdot |T| \geq \alpha N \, \big( 1-(\epsilon_W+\epsilon_T)\big)^2}. In general {\alpha \leq 1} since {U} is an isometry.

One can work out an upper bound and a lower bound for the reconstruction operator {R} using the {\ell^1}-norm instead of the {\ell^2}-norm. Doing so, one can prove that if {\|x-P_Tx\|_1 \leq \epsilon_T} and {\|\hat{x}-P_W \hat{x}\|_1 \leq \epsilon_W} then

\displaystyle  |W| \cdot |T| \geq N \, (1-\epsilon_1) \qquad \text{with}\qquad 1-\epsilon_1 = \frac{1-(\epsilon_W+\epsilon_T)}{1+\epsilon_W}. \ \ \ \ \ (11)

3. Stable {\ell^2} Reconstruction

Let us see how the uncertainty principle can be used to reconstruct a corrupted signal. For example, consider a discrete signal {\big\{ s_k \big\}_{k=1}^N} corrupted by some noise {\big\{ n_k \big\}_{k=1}^N}. On top of that, let us suppose that on the set {T \subset \{1,\ldots, N\}} the receiver does not receive any information. In other words, the receiver can only observe

\displaystyle  y_k = s_k + n_k \qquad \text{for} \qquad k \in \{1,\ldots, N\} \setminus T. \ \ \ \ \ (12)

In general, it is impossible to recover the original {s_k}, or an approximation of it, since the data {s_k} for {k \in T} are lost forever. Even if the received signal {y_k} is very weak, the deleted data {\{s_k\}_{k \in T}} might be huge. Nevertheless, under the assumption that the frequencies of {s} are supported by a set {W} satisfying {|T| \cdot |W| < N}, the reconstruction becomes possible. This is not very surprising since the hypothesis {|T| \cdot |W| < N} implies that the signal {\{s_k\}_1^N} is sufficiently smooth, so that the knowledge of {y_k} on {k \in \{1,\ldots, N\} \setminus T} is enough to construct an approximation of {s_k} for {k \in T}. We assume that the set {W} is known to the observer. Since the components on {T} are not observed, one can suppose without loss of generality that there is no noise on these components ({i.e.} we have {n_k=0} for {k \in T}): the observation {y} can be described as

\displaystyle  y = (I-P_T)s + n. \ \ \ \ \ (13)

It is possible to have a stable reconstruction of the original signal {s} if the operator {(I-P_T)} can be inverted: there exists a linear operator {Q} such that {Q \circ (I-P_T) s = s} for every {s \in B(W)}, where {B(W)} denotes the space of band-limited signals whose Fourier transform is supported by {W}. If this is the case, the reconstruction

\displaystyle  \hat{s} = Q y \ \ \ \ \ (14)

satisfies {\hat{s} = s + Qn} so that {\|\hat{s}-s\|_2 \leq \|Q\| \, \|n\|_2}. A sufficient condition for {Q} to exist is that {\|P_T s\| < \|s\|} for every {s \in B(W)}. This is equivalent to

\displaystyle  \mu_2(W,T) \;:=\; \sup\Big\{ \frac{\|P_T s\|_2}{\|s\|_2}: s \in B(W)\Big\} < 1. \ \ \ \ \ (15)

Since for {s \in B(W)} we have {s = (H_* \circ P_W \circ H) s} it follows that {P_T s = R^* \, s} where {R=H_* \circ P_W \circ H \circ P_T} is the reconstruction operator. The condition {\mu_2(W,T) < 1} thus follows from the bound {\|R\| \leq \sqrt{\frac{|T| \cdot |W|}{N}}<1}. Consequently the operator {(I-P_T)} is invertible on {B(W)} since {Q=(1-R^*)^{-1}} satisfies

\displaystyle  Q \circ (1-P_T) s=s \qquad \text{for all} \qquad s \in B(W). \ \ \ \ \ (16)

Since {\|Q\| = \|(1-R^*)^{-1}\| \leq (1-\|R\|)^{-1}} we have the bound {\|Q\| \leq C = \big(1-\sqrt{\tfrac{|W| \cdot |T|}{N}}\big)^{-1}}. Therefore

\displaystyle  \text{(error reconstruction)} = \| \hat{s}-s\|_2 \leq C \, \|n\|_2. \ \ \ \ \ (17)

Notice also that {\hat{s} = Q y} can easily and quickly be approximated using the expansion

\displaystyle  Q = (1-R^*)^{-1}= 1 + R^* + (R^*)^2 + (R^*)^3 + \ldots \ \ \ \ \ (18)
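
Here is a numerical sketch of this reconstruction with numpy (sizes and index sets are my choices; note that {|W| \cdot |T| = 128 < N = 256}, so the series converges):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 256
W = np.arange(8)                                  # frequency support, |W| = 8
T = rng.choice(N, size=16, replace=False)         # erased samples, |T| = 16

coeffs = np.zeros(N, dtype=complex)               # random band-limited signal
coeffs[W] = rng.standard_normal(8) + 1j * rng.standard_normal(8)
s = np.fft.ifft(coeffs, norm="ortho")

y = s.copy()
y[T] = 0.0                                        # observation: data on T lost

mask = np.zeros(N, dtype=bool)
mask[W] = True
def R_star(x):
    # R* = P_T H* P_W H : band-limit the vector, then keep the T components
    xhat = np.fft.fft(x, norm="ortho")
    xhat[~mask] = 0.0
    out = np.zeros(N, dtype=complex)
    out[T] = np.fft.ifft(xhat, norm="ortho")[T]
    return out

est = y.copy()
for _ in range(100):              # iterate est -> y + R* est, which converges
    est = y + R_star(est)         # to (1 - R*)^{-1} y = Q y
print(np.abs(est - s).max())      # tiny: exact recovery in the noiseless case
```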

Nevertheless, there are two things that are not very satisfying:

  • the bound {|W| \cdot |T| < N} is extremely restrictive. For example, if {10 \%} of the components are corrupted, this imposes that the signal has fewer than {10} (and not {10\%}) non-zero Fourier coefficients.
  • in general, the sets {W} and {T} are not known by the receiver.

The theory of compressed sensing addresses these two questions. More on that in forthcoming posts, hopefully … Emmanuel Candes recently gave extremely interesting lectures on compressed sensing.

4. Exact {\ell^1} Reconstruction: Logan Phenomenon

Let us consider a slightly different situation. A small fraction of the components of a signal {\{s_k\}_{k=1}^N} are corrupted. The noise is not assumed to be small. More precisely, suppose that one observes

\displaystyle  y_k = s_k + P_T n_k \ \ \ \ \ (19)

where {T} is the set of corrupted components and {n} is an arbitrary noise that can potentially have very high intensity. Moreover, the set {T} is unknown to the receiver. In general, it is of course impossible to recover the original signal {s}. Surprisingly, if one assumes that the signal {s} is band-limited, {i.e.} the Fourier transform of {s} is supported by a known set {W} that satisfies

\displaystyle   |W| \cdot |T| < \frac{N}{2}, \ \ \ \ \ (20)

it is possible to recover exactly the original {s}. The main argument is that the condition (20) implies that the signal {s} has more than {50\%} of its energy on {U=T^c} in the sense that {\|P_U s\|_1> \|P_{T} s\|_1}. Consequently, since {U} is the set where the signal {s} is perfectly known, enough information is available to recover the original signal {s}. To prove the inequality {\|P_U s\|_1> \|P_{T} s\|_1}, it suffices to check that

\displaystyle  \mu_1(W,T) \;:=\; \sup\Big\{ \frac{\|P_T s\|_1}{\|s\|_1}: s \in B(W)\Big\} < \frac{1}{2}. \ \ \ \ \ (21)

In this case we have {\|P_U s\|_1 = \|s\|_1 - \|P_T s\|_1 > \frac{1}{2} \|s\|_1 > \|P_T s\|_1}, which proves that {s} has more than {50\%} of its energy on {U}. Indeed, the same conclusion holds with the {\ell^2}-norm if the condition {\mu_2(W,T) < \frac12} holds. The magic of the {\ell^1}-norm is that the condition {\mu_1(W,T) < \frac12} implies that the signal {s} is the solution of the optimisation problem

\displaystyle   s \;=\; \textbf{argmin} \big\{ \|\tilde{s} - y\|_1 : \tilde{s} \in B(W) \Big\}. \ \ \ \ \ (22)

This would not be true if one were using the {\ell^2} norm instead. This idea was first discovered in a slightly different context in Logan’s thesis (1965). More details here. To prove (22), notice that any {\tilde{s} \in B(W)} can be written as {\tilde{s}= s+\xi} with {\xi \in B(W)}. Therefore, since {\tilde{s} - y = \xi - P_T n}, it suffices to show that

\displaystyle  0 = \textbf{argmin} \big\{ \| \xi - P_T n\|_1 : \xi \in B(W) \Big\}. \ \ \ \ \ (23)

This is equivalent to proving that for any non-zero {\xi \in B(W)} we have

\displaystyle  \| \xi - P_T n\|_1 > \|P_T n\|_1. \ \ \ \ \ (24)

It is now that the {\ell^1}-norm is crucial. Indeed, we have

\displaystyle  \begin{array}{rcl}  \| \xi - P_T n\|_1 &=& \|P_U \xi\|_1 + \| P_T \xi - P_T n\|_1 \\ &\geq& \|P_T n\|_1 + \big( \|P_U \xi\|_1- \|P_{T} \xi\|_1\big)\\ &>& \|P_T n\|_1, \end{array}

which gives the conclusion. This would not work for any {\ell^p}-norm with {p>1} since in general we only have {\| \xi - P_T n\|_p \leq \|P_U \xi\|_p + \| P_T \xi - P_T n\|_p}: the first equality above, which relies on the additivity of the {\ell^1}-norm over disjoint supports, breaks down. In summary, as soon as the condition {\mu_1(W,T) < \frac{1}{2}} is satisfied, one can recover the original signal by solving the {\ell^1} minimisation problem (22).

Because it is not hard to check that in general we always have

\displaystyle  \mu_1(W,T) \leq \frac{|W| \cdot |T|}{N}, \ \ \ \ \ (25)

this shows that the condition {|W| \cdot |T| < \frac{N}{2}} implies that exact recovery is possible! Nevertheless, the condition {\mu_1(W,T) \leq \frac{|W| \cdot |T|}{N}} is far from satisfying. The inequality is tight, but in practice it happens very often that {\mu_1(W,T) < \frac{1}{2}} even if {|W| \cdot |T|} is much bigger than {\frac{N}{2}}.
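
To see the Logan phenomenon in action, here is a sketch that solves (22) as a linear program; for simplicity it uses a small real trigonometric dictionary {B} instead of the complex Fourier matrix, with sizes for which recovery is typically exact:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
N, freqs = 128, [1, 2, 3, 4]
k = np.arange(N)
B = np.column_stack([np.ones(N)]
                    + [np.cos(2 * np.pi * f * k / N) for f in freqs]
                    + [np.sin(2 * np.pi * f * k / N) for f in freqs])  # N x 9
s = B @ rng.standard_normal(B.shape[1])           # band-limited signal

y = s.copy()
T = rng.choice(N, size=5, replace=False)          # unknown corrupted samples
y[T] += 10 * rng.standard_normal(5)               # gross, arbitrary errors

# minimise ||B c - y||_1 : variables (c, t), constraints |B c - y| <= t
m = B.shape[1]
A_ub = np.block([[B, -np.eye(N)], [-B, -np.eye(N)]])
b_ub = np.concatenate([y, -y])
cost = np.concatenate([np.zeros(m), np.ones(N)])
res = linprog(cost, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * m + [(0, None)] * N)
print(np.abs(B @ res.x[:m] - s).max())            # about 0: exact recovery
```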

The take away message might be that perfect recovery is often possible if the signal has a sparse representation in an appropriate basis (in the example above, the usual Fourier Basis): the signal can then be recovered by {\ell^1}-minimisation. For this to be possible, the representation basis (here the Fourier basis) must be incoherent with the observation basis in the sense that the coefficients of the change of basis matrix (here, Fourier matrix) must be small {i.e.} if {(e_1, \ldots, e_N)} is the observation basis and {(\hat{e}_1, \ldots, \hat{e}_N)} is the representation basis then {\langle e_i, \hat{e}_j \rangle} must be small for every {1 \leq i,j \leq N}. For example, for the Fourier basis we have {|\langle e_i, \hat{e}_j \rangle| = \frac{1}{\sqrt{N}}} for every {1 \leq i,j \leq N}.

black: observation basis, green: representation basis

Curvature for Markov Chains

Recently, Yann Ollivier developed a nice theory of Ricci curvature for Markov chains. In many ways, this can be seen as a geometric language giving another view on the notion of path coupling, developed at the end of the {90}‘s by Martin Dyer and co-workers. It has to be noted that this notion of curvature is very general and does not need the state space where the Markov chain evolves to have any differential structure, as one might expect at first sight: any state space endowed with a metric suffices.

Let {P} be a Markov kernel on a metric state space {(S,d)}. We would like to quantify how long it takes for two different particles evolving according to the Markovian dynamic given by {P} to meet. If the first particle starts at {x \in S} and the second at {y \in S}, the initial distance between them is {d(x,y)}. At time {t>0}, what is the average distance between these two particles? For example, if {W^x} and {W^y} are two Brownian motions in {{\mathbb R}^n} started from {x} and {y} respectively, there is no reason why {W^x_t} and {W^y_t} should be closer to each other than {x=W^x_0} and {y=W^y_0}. Indeed, one can even show that whatever the coupling of these two Brownian motions we have {\mathop{\mathbb E}[d(W^x_t, W^y_t)] \geq d(x,y)}: this is, roughly speaking, because the Euclidean space {{\mathbb R}^n} has no curvature. The situation is quite different if we were instead considering Brownian motions on a sphere: in this case, trajectories tend to coalesce.

1. Wasserstein distance

In the sequel, we will need to use a notion of distance between probability distributions on the metric space {(S,d)}. The usual total variation distance {d(\mu,\nu)} defined by

\displaystyle  d(\mu,\nu) \;=\; \sup_{A \subset S} \; |\mu(A)-\nu(A)| \ \ \ \ \ (1)

is not adapted to our purpose since the metric structure of the space is not exploited. Instead, in order to take into account the distance {d(\cdot,\cdot)} of the space {S} and develop a notion of curvature, we use the Wasserstein distance {W(\mu,\nu)} between probability measures. It is defined as

\displaystyle  W(\mu,\nu) \;=\; \sup\Big\{ \mu(f) - \nu(f) \;:\; \text{Lip}(f) \leq 1\Big\}. \ \ \ \ \ (2)

The distance {d(\cdot,\cdot)} is crucial to this definition: a change of distance implies a change of the class of {1}-Lipschitz functions. Since {\mu(f) - \nu(f) = \mathop{\mathbb E}[f(X) - f(Y)]} for any coupling {(X,Y)} of {\mu} and {\nu}, and since the function {f} is {1}-Lipschitz, it follows that {\mathop{\mathbb E}[f(X) - f(Y)] \leq \mathop{\mathbb E}[d(X,Y)]}. Consequently, for any coupling {(X,Y)} we have {W(\mu,\nu) \leq \mathop{\mathbb E}[d(X,Y)]}. Taking the infimum over all the couplings {(X,Y)} leads to the inequality

\displaystyle  W(\mu,\nu) \;=\; \sup_{\text{Lip}(f) \leq 1} \; |\mu(f) - \nu(f)| \;\leq\; \inf_{(X,Y)} \; \mathop{\mathbb E}[d(X,Y)]. \ \ \ \ \ (3)

It is a deep result that on any reasonable space {(S,d)} the inequality is in fact an equality. Indeed, Kantorovich duality states that on any Radon space {(S,d)} we have

\displaystyle   W(\mu,\nu) \;=\; \sup_{\text{Lip}(f) \leq 1} \; |\mu(f) - \nu(f)| \;=\; \inf_{(X,Y)} \; \mathop{\mathbb E}[d(X,Y)]. \ \ \ \ \ (4)

It is interesting to note that under mild conditions on the state space {(S,d)} one can always find a coupling that achieves the infimum of (4): this is an easy compactness argument.

2. Notion of Curvature

Denoting by {m_x = \delta_x P} the one step distribution of the Markov chain started from {x} in the sense that {m_x(A) = \mathop{\mathbb P}[X_1 \in A \;| X_0 = x ]}, we define the local (Ricci) curvature {\kappa(x,y) \in {\mathbb R}} between {x} and {y} as

\displaystyle   W(m_x, m_y) = d(x,y) \cdot (1-\kappa(x,y)). \ \ \ \ \ (5)

The closer {\kappa(x,y)} is to {1}, the more the trajectories started at {x} tend to meet the trajectories started at {y}.

Trajectories tend to coalesce

The interesting case is when the infimum {\inf_{x,y} \, \kappa(x,y)} is strictly positive,

\displaystyle   \inf_{x,y \in S} \kappa(x,y) \;=\; \kappa > 0. \ \ \ \ \ (6)

In this case we say that the Markov kernel {P} is positively curved on {(S,d)}. It should be noted that in many natural spaces it suffices to check that {\kappa(x,y) \;\geq\; \kappa} for all neighbouring states {x} and {y} to ensure that {\kappa(x,y) \;\geq\; \kappa} for any pair {x,y \in S}. This can be proved thanks to the so-called Gluing Lemma. A space without curvature corresponds to the case {\kappa=0}: for example, a symmetric random walk on {\mathbb{Z}^d} and a Brownian motion on {{\mathbb R}^d} both have zero curvature. The curvature {\kappa} is a property of both the metric space {(S,d)} and the Markov kernel {P}: different Markov chains on the same metric space {(S,d)} generally have different associated curvatures. Given a metric space {(S,d)} carrying a probability distribution {\pi}, it is an interesting problem to construct a {\pi}-invariant Markov chain with the highest possible curvature {\kappa}.

The notion of curvature readily generalizes to continuous time Markov processes by taking a limiting case of (5). For example, one can define the curvature of the continuous time Markov process {\{X_t\}_{t \geq 0}} as the largest real number {\kappa} such that for any {x,y \in (S,d)} and {\kappa' < \kappa} we have

\displaystyle  W(m_x^{\delta}, m_y^{\delta}) \;\leq\; (1-\delta \kappa') \; d(x,y) \ \ \ \ \ (7)

for every {\delta} small enough. The quantity {m_x^{\delta}} is the distribution of {X_{\delta}} when started from {x} in the sense that {m_x^{\delta}(A) = \mathop{\mathbb P}[X_{\delta} \in A \; |X_0=x]}.

3. Contraction property

We now show that a positive curvature implies a contraction property. Equation (5) shows that {W(\delta_x P, \delta_y P) \leq W(\delta_x,\delta_y) \cdot (1-\kappa)} for any {x,y \in S}. A simple argument shows that one can indeed generalize the situation to any two distributions {\mu,\nu} in the sense that

\displaystyle   W(\mu P, \nu P) \leq W(\mu,\nu) \cdot (1-\kappa). \ \ \ \ \ (8)

Proof: For any pair {x,y \in S} consider a coupling {(U_{x,y}, V_{x,y})} of {m_x} and {m_y} such that {W(m_x,m_y)=\mathop{\mathbb E}[d(U_{x,y}, V_{x,y})]}. Now, choose an optimal coupling {(X,Y)} of {\mu} and {\nu}. It is straightforward to check that {(U_{X,Y}, V_{X,Y})} is a coupling (in general not optimal) of {\mu P} and {\nu P} so that

\displaystyle  \begin{array}{rcl}  W(\mu P, \nu P) &\leq& \mathop{\mathbb E}[d(U_{X,Y}, V_{X,Y})] = \mathop{\mathbb E}[\; \mathop{\mathbb E}[d(U_{x,y}, V_{x,y}) \;|X=x, Y=y] ] \\ &=& \mathop{\mathbb E}[ W(m_X, m_Y) ] = \mathop{\mathbb E}[ d(X,Y) \cdot (1-\kappa(X,Y)) ]\\ &\leq& (1-\kappa) \; \mathop{\mathbb E}[ d(X,Y) ] = (1-\kappa) \; W(\mu,\nu). \end{array}

\Box

Equation (8) is extremely powerful since it immediately shows that

\displaystyle  W(\mu P^t, \pi) \leq (1-\kappa)^{t} \; W(\mu,\pi). \ \ \ \ \ (9)

In other words, there is exponential convergence (in the Wasserstein metric) to the invariant distribution {\pi} at rate {(1-\kappa)^t}. In continuous time, this reads

\displaystyle  W(\mu P^t, \pi) \;\leq\; e^{-\kappa t} \; W(\mu,\pi). \ \ \ \ \ (10)

In other words, the higher the curvature, the faster the convergence to equilibrium.

4. Examples

Let us give examples of positively curved Markov chains.

  1. Langevin diffusion with convex potential: consider a convex potential {\Psi:{\mathbb R} \rightarrow {\mathbb R}} that is uniformly elliptic in the sense {\Psi^{''}(x) \geq \lambda > 0}. The Langevin diffusion {dz = -\frac{1}{2} \Psi'(z) \, dt + dW} has invariant distribution {\pi} with density proportional to {e^{-\Psi(x)}}. Given a time step {\delta}, the Euler discretization of this diffusion reads
    \displaystyle  x^{k+1} = x^k - \frac{1}{2} \Psi'(x^k) \, \delta + \sqrt{\delta} \; \xi \ \ \ \ \ (11) 

    where {\xi \sim {\mathcal N}(0,1)}. Given two starting points {x^0=x} and {y^0=y}, using the same noise {\xi} to define {x^1} and {y^1} it immediately follows that

    \displaystyle  \begin{array}{rcl}  |x^1 - y^1| &=& |x-y| \; \Big(1 - \frac{\delta}{2} \frac{\Psi'(x)-\Psi'(y)}{x-y} \Big)\\ &\leq& |x-y| \; (1-\frac{\lambda}{2} \delta). \end{array}

    In other words, for {\delta} small enough, the Langevin diffusion {\{z_t\}_{t \geq 0}} is positively curved with curvature (at least) equal to {\kappa = \frac{\lambda}{2}}. A quick numerical check of this contraction appears after this list.

     

  2. Brownian motion on a sphere: consider a Brownian motion on the unit sphere of {{\mathbb R}^n}. Consider two points {X,Y} on this unit sphere: by symmetry, one can always rotate the coordinates so that {X=(\sqrt{1-h^2},0,h)} and {Y=(\sqrt{1-h^2},0,-h)} for some {h \in [0,1]}. For {h \ll 1} the (geodesic) distance {d(X,Y)} is approximated by {d(X,Y) \approx 2h}. One can couple two Brownian motions {W^X} and {W^Y}, one started at {X} and the other one started at {Y}, by the usual symmetry with respect to the plane {\mathcal{P} = \{(x,y,z): z=0\}}: in other words, {W^Y} is the reflection of {W^X} with respect to {\mathcal{P}}. One can check (good exercise!) that the diffusion followed by the {z}-coordinate of a Brownian motion on the unit sphere of {{\mathbb R}^n} is simply given by
    \displaystyle  dz = -\frac{1}{2}(n-1)z \, dt + \sqrt{1-z^2} \, dW. \ \ \ \ \ (12) 

    With this coupling, for small time {\delta \ll 1}, it follows that

    \displaystyle  \begin{array}{rcl}  z^X_{\delta} &\approx& h - \frac{1}{2} (n-1) h \, \delta + \sqrt{1-h^2} \sqrt{\delta} \; \xi\\ z^Y_{\delta} &\approx& -h + \frac{1}{2} (n-1) h \, \delta - \sqrt{1-h^2} \sqrt{\delta} \; \xi \end{array}

    where {\xi \sim {\mathcal N}(0,1)} is used as the same source of randomness for {z^X_{\delta}} and {z^Y_{\delta}} since {W^Y} is the reflection of {W^X}. Since {d(W^X_{\delta}, W^Y_{\delta}) \approx |z^X_{\delta} - z^Y_{\delta}|} it readily follows that

    \displaystyle  \begin{array}{rcl}  d(W^X_{\delta}, W^Y_{\delta}) \; \leq \; \big(1- \frac{1}{2}(n-1)\delta \big)\; d(X,Y). \end{array}

    In other words, the curvature of a Brownian motion on the unit sphere of {{\mathbb R}^n} is equal to {\frac{1}{2}(n-1)}. Maybe surprisingly, the higher the dimension, the faster the convergence to equilibrium. This is not so unreasonable if one notices that the Brownian increment satisfies {\mathop{\mathbb E} \|W_{t+\delta}-W_t\|^2 \approx n \delta}.

  3. Other examples: see the original text for many other examples.
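
As a quick numerical illustration of the first example (a sketch; {\Psi(x) = x^2 + \sin(x)} is my choice, so that {\Psi'' \geq 1} and {\lambda = 1}), one can run the synchronously coupled Euler scheme and watch the mean distance contract at rate at least {e^{-t/2}}:

```python
import numpy as np

rng = np.random.default_rng(0)
delta, steps, n = 0.01, 500, 10000
dpsi = lambda u: 2 * u + np.cos(u)    # Psi(x) = x^2 + sin(x), so Psi'' >= 1
x = np.full(n, -3.0)
y = np.full(n, 3.0)
for _ in range(steps):
    xi = rng.standard_normal(n)       # the SAME noise drives both copies
    x += -0.5 * dpsi(x) * delta + np.sqrt(delta) * xi
    y += -0.5 * dpsi(y) * delta + np.sqrt(delta) * xi
t = steps * delta
print(np.abs(x - y).mean(), 6.0 * np.exp(-t / 2))  # distance below 6 e^{-t/2}
```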

Joe’s Pyramid

Yesterday, while reading the latest issue of the NewScientist, I came across the following very cute riddle:

Feeling lazy, I asked myself whether it was possible to write a {10}-line-long Python program to solve this innocent looking enigma. The whole pyramid is entirely determined by the {6} numbers lying at the bottom, and each one of them is an integer between {1} and {99}: these numbers must be different, so that there are at most {\binom{99}{6} \approx 1.1 \times 10^9} possibilities to test! Brute force won't work my friend!

When stupid brute force does not work, one can still try annealing/probabilistic methods: this works pretty well for Sudoku (which is NP-hard) as is brilliantly described here and there. The principle is simple: if one can find a good energy function {E:\{1, \ldots, 99\}^6 \rightarrow \mathbb{R}} such that a solution to the problem corresponds to a low energy configuration, one can do MCMC, simulated annealing, etc. on the target distribution

\displaystyle \pi(\text{configuration}) \propto e^{-\beta \cdot E(\text{configuration})}.

The issue is that it might be very difficult to choose a sensible energy function {E(\cdot)}. Foolishly, I first tried the following energy function, and then ran a random walk Metropolis algorithm with {\pi} as target probability:

\displaystyle  \pi(\text{configuration}) \propto e^{\beta \cdot {\text{Height}}(\text{configuration})}

where {\text{Height}(\text{configuration})} is the number of levels that one can fill, starting from the bottom, without encountering any problem, {i.e.} no repetition and no number greater than {99}. With different values of {\beta} and letting the algorithm run for a few million iterations ({5} min on my crappy laptop), one can easily produce configurations that are {5} levels high: but the algorithm never found any real solution, {i.e.} a configuration with height equal to {6}. A sketch of this attempt follows.
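
Here is my reconstruction of that attempt (a sketch; it assumes, as in the usual number-pyramid puzzle, that each brick is the sum of the two bricks beneath it and that all numbers must be distinct and below {100}):

```python
import math, random

def height(base):
    # number of levels one can fill from the bottom without a repetition
    # or a number >= 100 (a full solution has height 6)
    seen, level, h = set(), list(base), 0
    while level:
        if any(v >= 100 for v in level) \
           or len(set(level) | seen) != len(seen) + len(level):
            return h
        seen |= set(level)
        h += 1
        level = [a + b for a, b in zip(level, level[1:])]
    return h

random.seed(0)
beta = 2.0
base = [random.randint(1, 99) for _ in range(6)]
for _ in range(10 ** 5):
    prop = list(base)
    prop[random.randrange(6)] = random.randint(1, 99)  # random-walk proposal
    if random.random() < math.exp(beta * (height(prop) - height(base))):
        base = prop
print(base, height(base))      # height 6 would solve the riddle
```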

Now I am curious whether it is possible to produce a non-stupid energy function {E(\cdot)} so that this riddle is solvable in a reasonable amount of time by standard MCMC – annealing methods.

As a conclusion, I should mention that with a pen and a cup of coffee, one can easily find a solution: I will not spoil the fun, but just say that the configuration space is not that big if one thinks more carefully about it…

Sampling conditioned Markov chains, and diffusions

In many situations it might be useful to know how to sample a Markov chain (or a diffusion process) {X_k} between time {0} and {T}, conditioned on the knowledge that {X_0=a} and {X_T=b}. This conditioned process {X^{\star}} is still a Markov chain, but in general it is not time homogeneous. Moreover, it is generally very difficult to compute the transition probabilities of this conditioned Markov chain since they depend on the knowledge of the transition probabilities {p(t,x,y) = \mathop{\mathbb P}(X_t=y | X_0=x)} of the unconditioned Markov chain, which are usually not available: this has been discussed in this previous post on Doob h-transforms. Perhaps surprisingly, this article by Michael Sorensen and Mogens Bladt shows that it is sometimes quite easy to sample good approximations of such a conditioned Markov chain, or diffusion.

1. Reversible Markov chains

Remember that a Markov chain {X_k} on the state space {S} with transition operator {p(x,y) = \mathop{\mathbb P}(X_{k+1}=y \,|X_k=x)} is reversible with respect to the probability {\pi} if for any {x,y \in S}

\displaystyle  \pi(x) p(x,y) = p(y,x) \pi(y). \ \ \ \ \ (1)

In words, this means that looking at a trajectory of this Markov chain, it is impossible to tell whether time is running forward or backward: indeed, the probability of observing {y_1, y_2, \ldots, y_n} is equal to {\pi(y_1)p(y_1, y_2) \ldots p(y_{n-1}, y_n)}, which is also equal to {\pi(y_n)p(y_n,y_{n-1}) \ldots p(y_2, y_1)} if the chain is reversible.

It is precisely this property of invariance under time reversal that allows one to sample from a conditioned reversible Markov chain. Since under mild conditions a one dimensional diffusion is also reversible, this also shows that (ergodic) one dimensional conditioned diffusions are sometimes quite easy to sample from!

2. One dimensional diffusions are reversible!

The other day someone told me that almost any one dimensional diffusion is reversible! I did not know that, and I must admit that I still find this result rather surprising. Indeed, this is not true for multidimensional diffusions, and it is very easy to construct counter-examples. What makes the result work for real diffusions is that there is only one way to go from {a} to {b}, namely the segment {[a,b]}. The situation is completely different in higher dimensions.

First, let us remark that any Markov chain on {\mathbb{Z}} that has {\pi} as invariant distribution and that can only make jumps of size {+1} or {-1} is reversible: such a Markov chain is usually called skip-free in the literature. This is extremely easy to prove, and since skip-free Markov chains have been studied a lot, I am sure that this result is somewhere in the literature (any reference for that?). To show the result, it suffices to show that {u_k-d_{k+1}=0} for any {k \in \mathbb{Z}} where {u_k = \pi(k) p(k,k+1)} is the upward flux at level {k} and {d_k = \pi(k)p(k,k-1)} is the downward flux. Because {\pi} is invariant it satisfies the usual balance equations,

\displaystyle  \begin{array}{rcl}  u_k+d_k &=& \pi(k) = \pi(k-1)p(k-1,k)+\pi(k+1)p(k+1,k)\\ &=& u_{k-1} + d_{k+1} \end{array}

so that {u_{k}-d_{k+1} = u_{k-1}-d_{k}}. Iterating, we get that for any {n \in \mathbb{Z}} we have {u_k-d_{k+1} = u_{k-n}-d_{k-n+1}}: the conclusion follows since {\lim_{m \rightarrow \pm \infty} u(m) = \lim_{m \rightarrow \pm \infty} d(m) = 0}.

This simple result on skip-free Markov chains also gives the result for many one dimensional diffusions {dX_t = \alpha(X_t) \, dt + \sigma(X_t) \, dW_t} since, under regularity assumptions on {\alpha} and {\sigma}, they can be seen as limits of skip-free one dimensional Markov chains on {\epsilon \mathbb{Z}}. I guess that the usual proof of this result goes through introducing the scale function and the speed measure of the diffusion, but I would be very glad if anyone had another pedestrian approach that gives more intuition into this.

3. How to sample conditioned reversible Markov chains

From what has been said before, it should be quite clear how a conditioned reversible Markov chain can be sampled. Suppose that {X} is a reversible Markov chain and that we would like to sample a path conditioned on the event {X_0=a} and {X_T=b}:

  1. sample the path {X^{(a)}} between {t=0} and {t=T}, starting from {X^{(a)}_0=a}
  2. sample the path {X^{(b)}} between {t=0} and {t=T}, starting from {X^{(b)}_0=b}
  3. if there exists {t \in [0,T]} such that {X^{(a)}_t = X^{(b)}_{T-t}} then define the path {X^{\star}} by\displaystyle  X^{\star}_s = \left\{ \begin{array}{ll} X^{(a)}_s & \text{ for } s \in [0,t]\\ X^{(b)}_{T-s} & \text{ for } s \in [t,T], \end{array} \right. \ \ \ \ \ (2)and otherwise go back to step {1}.

Indeed, the resulting path {X^{\star}} is an approximation of a realisation of the conditioned Markov chain: this is not hard to prove, playing around with the definition of time reversibility. It is not hard at all to adapt this idea to reversible diffusions, though the result is again only an approximation. The interesting question is to discuss how good this approximation is (see the paper by Michael Sorensen and Mogens Bladt).

For example, here is a sample from a Birth-Death process, conditioned on the event {X_0=0} and {X_{1000}=50}, with parameters {\mathop{\mathbb P}(X_{t+1}=k+1 | X_t=k) = 0.4 = 1-\mathop{\mathbb P}(X_{t+1}=k-1 | X_t=k)}.

Conditioned Birth-Death process
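
A sketch of the procedure for this example (I patch the paths at the first meeting time; the description above leaves the choice open):

```python
import random

def path(start, T, p=0.4):
    # forward simulation of the birth-death chain: +1 w.p. 0.4, -1 w.p. 0.6
    xs = [start]
    for _ in range(T):
        xs.append(xs[-1] + (1 if random.random() < p else -1))
    return xs

random.seed(1)
T = 1000
while True:
    a = path(0, T)                        # forward path started at a = 0
    b = path(50, T)                       # forward path started at b = 50
    meet = [t for t in range(T + 1) if a[t] == b[T - t]]
    if meet:
        t = meet[0]                       # patch at the first meeting time
        star = a[:t] + [b[T - s] for s in range(t, T + 1)]
        break
print(star[0], star[-1], len(star))       # 0 50 1001
```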

4. Final remark

It might be interesting to notice that this method is especially inefficient for multidimensional processes: the probability of finding an instant {t} such that {X^{(a)}_t = X^{(b)}_{T-t}} is extremely small, and in many cases equal to {0} for diffusions! The method works pretty well for one dimensional diffusions thanks to the continuity of the paths and the intermediate value property. Nevertheless, even for one dimensional diffusions this method does not work well at all when trying to sample conditioned paths between two meta-stable positions: this is precisely the situation that is of interest in many physics problems, when one wants to study the evolution of a particle in a double well potential, for example. In short, sampling conditioned (multidimensional) diffusions is still a very difficult problem.
