A sweet probabilistic proof that zeta(2)=…

Here is a delightful probabilistic proof that {\zeta(2)=\frac{\pi^2}{6}} extracted from a recent article by Greg Markowsky. This starts with a pretty lemma (which can originally be found in this article) on the exit time of a {2} dimensional Brownian motion {B_t}. Suppose that {f:\mathbb{C} \rightarrow \mathbb{C}} is an analytic and injective function on a neighbourhood of the unit disk. This function maps the unit disk {\mathbb{D}} to {f(\mathbb{D})} with boundary {\partial f(\mathbb{D}) = f(\partial \mathbb{D})} where {\partial \mathbb{D}= \big\{e^{i\theta} : 0 \leq \theta \leq 2 \pi \big\}}. A two dimensional Brownian motion started at {f(0)} takes on average

\displaystyle \mathbb{E}[ \, \tau \, ] \; = \; \frac{1}{2} \sum_{k \geq 1} |a_k|^2

to exit the domain {f(\mathbb{D})} where {f(z) = \sum_{k \geq 0} a_k z^k} and {\tau = \inf \big\{ \, t>0: B_t \in \partial f(\mathbb{D}) \, \big\}} is the hitting time of the boundary {\partial f(\mathbb{D})}. Indeed, since the situation is invariant by translation one can always suppose that {f(0)=a_0=0}. The proof of this result is a simple martingale argument. Indeed, since {M_t = \|B_t\|^2 - 2t} is a martingale, the optional stopping theorem shows that

\displaystyle \mathbb{E}[ \, \|B_{\tau}\|^2 \, ] \; = \; 2 \mathbb{E}[\, \tau \, ].

Then, since {f(\cdot)} is an analytic function, Ito’s formula shows that the trajectories of {B_t} are the same as the trajectories of {f(W_t)}, up to a time change, where {W_t} is another {2} dimensional Brownian motion started at {z=0}. Consequently, {B_{\tau} = f(W_{\rho})} where {\rho} is the hitting time {\rho = \inf \big\{ \, t>0: W_t \in \partial \mathbb{D} \, \big\}}. This shows that

\displaystyle \mathbb{E}[ \, \|B_{\tau}\|^2 \, ] = \mathbb{E}[ \, \|f(W_{\rho})\|^2 \, ] = \frac{1}{2\pi} \int_{\theta=0}^{2\pi} |f(e^{i\theta})|^2 \, d\theta.

Since Parseval‘s theorem (or a direct calculation) shows that the last quantity also equals {\sum_{k \geq 0} |a_k|^2}, the conclusion follows.


To find a nice identity, it thus suffices to find an easy domain where one can explicitly compute the Brownian exit time {\mathbb{E}[ \, \tau \, ]}. The first thing that comes to mind is a strip {S_a = \big\{ x+iy \, : \, -a < y < a\big\}} since it is classical that a Brownian motion started at {z=0} takes on average {T=a^2} to exit the strip {S_a}. Since {f(z) = \log\big( \frac{1-z}{1+z} \big) = -2(z+z^3/3+z^5/5 + z^7/7 + \ldots)} maps the unit disk {\mathbb{D}} to the strip {S_{\frac{\pi}{2}}} it follows that

\displaystyle \frac{\pi^2}{4} = 2(1+3^{-2} + 5^{-2} + 7^{-2} + \ldots),

which is indeed equivalent to the celebrated identity {\zeta(2) = \frac{\pi^2}{6}}.
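As a quick numerical sanity check, here is a minimal sketch (the function names are mine, not from the article). It uses the relation {\mathbb{E}\|B_{\tau}\|^2 = 2 \mathbb{E}[\tau]} together with Parseval: the expected exit time of the strip should equal half of {\sum_k |a_k|^2}, with {a_k = -2/k} for odd {k} and {a_k=0} for even {k}.

```python
import math

def expected_exit_time_from_coeffs(n_terms):
    """Half of sum |a_k|^2 for f(z) = log((1-z)/(1+z)), truncated to n_terms odd k.

    The Taylor coefficients are a_k = -2/k for odd k and 0 for even k.
    """
    return 0.5 * sum((2.0 / k) ** 2 for k in range(1, 2 * n_terms, 2))

def zeta2_from_odd_terms(n_terms):
    """zeta(2) recovered from the odd terms: zeta(2) = (4/3) * sum_{k odd} 1/k^2."""
    odd_sum = sum(1.0 / k ** 2 for k in range(1, 2 * n_terms, 2))
    return 4.0 * odd_sum / 3.0

if __name__ == "__main__":
    n = 100_000
    # Exit time of the strip of half-width pi/2 is (pi/2)^2 = pi^2/4.
    print(expected_exit_time_from_coeffs(n), math.pi ** 2 / 4)
    print(zeta2_from_odd_terms(n), math.pi ** 2 / 6)
```

The factor {4/3} comes from splitting {\zeta(2)} into odd and even terms: the even terms sum to {\zeta(2)/4}.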

Curvature for Markov Chains

Recently, Yann Ollivier developed a nice theory of Ricci curvature for Markov chains. In many ways, this can be seen as a geometric language giving another view on the notion of path coupling, developed at the end of the {90}‘s by Martin Dyer and co-workers. It has to be noted that this new notion of curvature is very general and, contrary to what might be expected at first sight, does not need the state space where the Markov chain evolves to have any differential structure. Any state space endowed with a metric suffices.

Let {P} be a Markov kernel on a metric state space {(S,d)}. We would like to quantify how long it takes for two different particles evolving according to the Markovian dynamic given by {P} to meet. If the first particle starts at {x \in S} and the second at {y \in S}, the initial distance between them is {d(x,y)}. At time {t>0}, what is the average distance between these two particles? For example, if {W^x} and {W^y} are two Brownian motions in {{\mathbb R}^n} started from {x} and {y} respectively, there is no reason why {W^x_t} and {W^y_t} should be closer to each other than {x=W^x_0} and {y=W^y_0}. Indeed, one can even show that whatever the coupling of these two Brownian motions we have {\mathop{\mathbb E}[d(W^x_t, W^y_t)] \geq d(x,y)}: this is roughly speaking because the Euclidean space {{\mathbb R}^n} has no curvature. The situation is quite different if we instead consider Brownian motions on a sphere: in this case, trajectories tend to coalesce.

1. Wasserstein distance

In the sequel, we will need to use a notion of distance between probability distributions on the metric space {(S,d)}. The usual total variation distance {d(\mu,\nu)} defined by

\displaystyle  d(\mu,\nu) \;=\; \sup_{A \subset S} \; |\mu(A)-\nu(A)| \ \ \ \ \ (1)

is not adapted to our purpose since the metric structure of the space is not exploited. Instead, in order to take into account the distance {d(\cdot,\cdot)} of the space {S} and develop a notion of curvature, we use the Wasserstein distance {W(\mu,\nu)} between probability measures. It is defined as

\displaystyle  W(\mu,\nu) \;=\; \sup\Big\{ \mu(f) - \nu(f) \;:\; \text{Lip}(f) \leq 1\Big\}. \ \ \ \ \ (2)

The distance {d(\cdot,\cdot)} is crucial to this definition: a change of distance implies a change of the class of {1}-Lipschitz functions. Since {\mu(f) - \nu(f) = \mathop{\mathbb E}[f(X) - f(Y)]} for any coupling {(X,Y)} of {\mu} and {\nu}, and since the function {f} is {1}-Lipschitz, it follows that {\mathop{\mathbb E}[f(X) - f(Y)] \leq \mathop{\mathbb E}[d(X,Y)]}. Consequently, for any coupling {(X,Y)} we have {W(\mu,\nu) \leq \mathop{\mathbb E}[d(X,Y)]}. Taking the infimum over all the couplings {(X,Y)} leads to the inequality

\displaystyle  W(\mu,\nu) \;=\; \sup_{\text{Lip}(f) \leq 1} \; |\mu(f) - \nu(f)| \;\leq\; \inf_{(X,Y)} \; \mathop{\mathbb E}[d(X,Y)]. \ \ \ \ \ (3)

It is a deep result that on any reasonable space {(S,d)} the inequality is in fact an equality. Indeed, Kantorovich duality states that on any Radon space {(S,d)} we have

\displaystyle   W(\mu,\nu) \;=\; \sup_{\text{Lip}(f) \leq 1} \; |\mu(f) - \nu(f)| \;=\; \inf_{(X,Y)} \; \mathop{\mathbb E}[d(X,Y)]. \ \ \ \ \ (4)

It is interesting to note that under mild conditions on the state space {(S,d)} one can always find a coupling that achieves the infimum of (4): this is an easy compactness argument.
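To make the two sides of (4) concrete, here is a small sketch on the real line with {d(x,y)=|x-y|}, for two empirical measures with the same number of equally weighted atoms. It relies on the standard fact that in one dimension the monotone (sorted) coupling is optimal; the function names and the toy data are illustrative assumptions.

```python
def wasserstein_1d(xs, ys):
    """W distance between two empirical measures on the line with equal sample sizes.

    Uses the monotone coupling (i-th smallest x paired with i-th smallest y),
    which is optimal for the cost |x - y| in one dimension.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def coupling_cost(pairs):
    """Average distance E[d(X, Y)] for an explicit coupling given as a list of pairs."""
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

def dual_value(f, xs, ys):
    """mu(f) - nu(f) for a test function f (to be chosen 1-Lipschitz)."""
    return sum(f(x) for x in xs) / len(xs) - sum(f(y) for y in ys) / len(ys)

if __name__ == "__main__":
    mu, nu = [0.0, 3.0], [1.0, 2.0]
    print(wasserstein_1d(mu, nu))                          # optimal coupling
    print(coupling_cost([(0.0, 2.0), (3.0, 1.0)]))         # a worse coupling
    print(dual_value(lambda x: abs(x - 1.5), mu, nu))      # a 1-Lipschitz witness
```

In this toy example the 1-Lipschitz function {x \mapsto |x-1.5|} attains the supremum and the sorted coupling attains the infimum, so both sides of (4) can be read off by hand.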

2. Notion of Curvature

Denoting by {m_x = \delta_x P} the one step distribution of the Markov chain started from {x} in the sense that {m_x(A) = \mathop{\mathbb P}[X_1 \in A \;| X_0 = x ]}, we define the local (Ricci) curvature {\kappa(x,y) \in {\mathbb R}} between {x} and {y} as

\displaystyle   W(m_x, m_y) = d(x,y) \cdot (1-\kappa(x,y)). \ \ \ \ \ (5)

The closer to {1} is {\kappa(x,y)}, the more the trajectories started at {x} tend to meet the trajectories started at {y}.
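As an illustration of definition (5), consider the autoregressive chain {X_{k+1} = a X_k + \sigma \xi_k} on {{\mathbb R}} with {0<a<1}. This chain is my own choice of example and does not come from the original text. Here {m_x} and {m_y} are translates of the same Gaussian, so {W(m_x,m_y) = a|x-y|} and {\kappa(x,y) = 1-a}. The sketch below recovers this from samples, using the same noise for both chains and the sorted coupling on the line.

```python
import random

def w1_sorted(xs, ys):
    """W distance between two equal-size empirical measures on the line."""
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def ar1_curvature(x, y, a=0.8, sigma=1.0, n_samples=2000, seed=0):
    """Estimate kappa(x, y) = 1 - W(m_x, m_y) / d(x, y) for X_{k+1} = a X_k + sigma xi.

    The same noise samples are used for both starting points (an assumption of
    this sketch), so the empirical measures are exact translates of each other.
    """
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, sigma) for _ in range(n_samples)]
    samples_x = [a * x + e for e in noise]  # samples from m_x
    samples_y = [a * y + e for e in noise]  # samples from m_y
    return 1.0 - w1_sorted(samples_x, samples_y) / abs(x - y)

if __name__ == "__main__":
    print(ar1_curvature(0.0, 3.0))  # close to 1 - a
```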

[Figure: trajectories tend to coalesce]

The interesting case is when the infimum {\inf_{x,y} \, \kappa(x,y)} is strictly positive,

\displaystyle   \inf_{x,y \in S} \kappa(x,y) \;=\; \kappa > 0. \ \ \ \ \ (6)

In this case we say that the Markov kernel {P} is positively curved on {(S,d)}. It should be noted that in many natural spaces it suffices to check that {\kappa(x,y) \;\geq\; \kappa} for all neighbouring states {x} and {y} to ensure that {\kappa(x,y) \;\geq\; \kappa} for any pair {x,y \in S}. This can be proved thanks to the so called Gluing Lemma. A space without curvature corresponds to the case {\kappa=0}: for example, a symmetric random walk on {\mathbb{Z}^d} and a Brownian motion on {{\mathbb R}^d} both have zero curvature. The curvature {\kappa} is a property of both the metric space {(S,d)} and the Markov kernel {P}: indeed, different Markov chains on the same metric space {(S,d)} generally have different associated curvatures. Given a metric space {(S,d)} carrying a probability distribution {\pi}, it is an interesting problem to construct a {\pi}-invariant Markov chain with the highest possible curvature {\kappa}.

Indeed, the notion of curvature readily generalizes to continuous time Markov processes by taking a limiting case of (5). For example, one can define the curvature of the continuous time Markov process {\{X_t\}_{t \geq 0}} as the largest real number {\kappa} such that for any {x,y \in (S,d)} and {\kappa' < \kappa} we have

\displaystyle  W(m_x^{\delta}, m_y^{\delta}) \;\leq\; (1-\delta \kappa') \; d(x,y) \ \ \ \ \ (7)

for every {\delta} small enough. The quantity {m_x^{\delta}} is the distribution of {X_{\delta}} when started from {x} in the sense that {m_x^{\delta}(A) = \mathop{\mathbb P}[X_{\delta} \in A \; |X_0=x]}.

3. Contraction property

We now show that a positive curvature implies a contraction property. Equation (5) shows that {W(\delta_x P, \delta_y P) \leq W(\delta_x,\delta_y) \cdot (1-\kappa)} for any {x,y \in S}. A simple argument shows that one can indeed generalize the situation to any two distributions {\mu,\nu} in the sense that

\displaystyle   W(\mu P, \nu P) \leq W(\mu,\nu) \cdot (1-\kappa). \ \ \ \ \ (8)

Proof: For any pair {x,y \in S} consider a coupling {(U_{x,y}, V_{x,y})} of {m_x} and {m_y} such that {W(m_x,m_y)=\mathop{\mathbb E}[d(U_{x,y}, V_{x,y})]}. Now, choose an optimal coupling {(X,Y)} of {\mu} and {\nu}. It is straightforward to check that {(U_{X,Y}, V_{X,Y})} is a coupling (in general not optimal) of {\mu P} and {\nu P} so that

\displaystyle  \begin{array}{rcl}  W(\mu P, \nu P) &\leq& \mathop{\mathbb E}[d(U_{X,Y}, V_{X,Y})] = \mathop{\mathbb E}[\; \mathop{\mathbb E}[d(U_{x,y}, V_{x,y}) \;|X=x, Y=y] ] \\ &=& \mathop{\mathbb E}[ W(m_X, m_Y) ] = \mathop{\mathbb E}[ d(X,Y) \cdot (1-\kappa(X,Y)) ]\\ &\leq& (1-\kappa) \; \mathop{\mathbb E}[ d(X,Y) ] = (1-\kappa) \; W(\mu,\nu). \end{array}


Equation (8) is extremely powerful since it immediately shows that

\displaystyle  W(\mu P^t, \pi) \leq (1-\kappa)^{t} \; W(\mu,\pi). \ \ \ \ \ (9)

In other words, there is exponential convergence (in the Wasserstein metric) to the invariant distribution {\pi} at rate {(1-\kappa)^t}. In continuous time, this reads

\displaystyle  W(\mu P^t, \pi) \;\leq\; e^{-\kappa t} \; W(\mu,\pi). \ \ \ \ \ (10)

In other words, the higher the curvature, the faster the convergence to equilibrium.
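The bound (9) can be checked numerically on a small chain. The sketch below uses the lazy Ehrenfest urn on {\{0,\ldots,n\}} with the metric {|i-j|}, an example of my own choosing rather than one from the original text. A short computation with the monotone coupling of neighbouring states suggests {\kappa = 1/n} for this chain, and its invariant distribution is Binomial{(n,\frac12)}. On the integer line the Wasserstein distance is the sum of the absolute differences of the cumulative distribution functions.

```python
from math import comb

def lazy_ehrenfest_kernel(n):
    """Transition matrix: stay w.p. 1/2, i -> i-1 w.p. i/(2n), i -> i+1 w.p. (n-i)/(2n)."""
    P = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        P[i][i] = 0.5
        if i > 0:
            P[i][i - 1] = i / (2.0 * n)
        if i < n:
            P[i][i + 1] = (n - i) / (2.0 * n)
    return P

def step(mu, P):
    """One step of the chain on distributions: mu -> mu P."""
    m = len(mu)
    return [sum(mu[i] * P[i][j] for i in range(m)) for j in range(m)]

def w1_on_line(mu, nu):
    """Wasserstein distance on {0,...,n} with d(i,j) = |i-j|, via CDF differences."""
    total, cdf_gap = 0.0, 0.0
    for a, b in zip(mu, nu):
        cdf_gap += a - b
        total += abs(cdf_gap)
    return total

def contraction_check(n=10, t_max=30):
    """Return rows (t, W(mu P^t, pi), (1 - 1/n)^t W(mu, pi)) starting from mu = delta_0."""
    P = lazy_ehrenfest_kernel(n)
    pi = [comb(n, k) / 2.0 ** n for k in range(n + 1)]
    mu = [1.0] + [0.0] * n
    w0 = w1_on_line(mu, pi)
    rows = []
    for t in range(t_max + 1):
        rows.append((t, w1_on_line(mu, pi), (1.0 - 1.0 / n) ** t * w0))
        mu = step(mu, P)
    return rows

if __name__ == "__main__":
    for t, w, bound in contraction_check()[:6]:
        print(t, round(w, 6), round(bound, 6))
```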

4. Examples

Let us give examples of positively curved Markov chains.

  1. Langevin diffusion with convex potential: consider a potential {\Psi:{\mathbb R} \rightarrow {\mathbb R}} that is uniformly convex in the sense {\Psi^{''}(x) \geq \lambda > 0}. The Langevin diffusion {dz = -\frac{1}{2} \Psi'(z) \, dt + dW} has invariant distribution {\pi} with density proportional to {e^{-\Psi(x)}}. Given a time step {\delta}, the Euler discretization of this diffusion reads
    \displaystyle  x^{k+1} = x^k - \frac{1}{2} \Psi'(x^k) \, \delta + \sqrt{\delta} \; \xi \ \ \ \ \ (11) 

    where {\xi \sim {\mathcal N}(0,1)}. Given two starting points {x^0=x} and {y^0=y}, using the same noise {\xi} to define {x^1} and {y^1} it immediately follows that

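    \displaystyle  \begin{array}{rcl}  W(m_x, m_y) &\leq& |x^1-y^1| \;=\; |x-y| \; \Big(1 - \frac{\delta}{2} \frac{\Psi'(x)-\Psi'(y)}{x-y} \Big)\\ &\leq& |x-y| \; (1-\frac{\lambda}{2} \delta), \end{array}

    provided {\delta} is small enough for the factor in parentheses to be nonnegative (which requires {\Psi''} to be bounded above).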

    In other words, the Langevin diffusion {\{z_t\}_{t \geq 0}} is positively curved with curvature (at least) equal to {\kappa = \frac{\lambda}{2}}.


  2. Brownian motion on a sphere: consider a Brownian motion on the unit sphere of {{\mathbb R}^n}. Consider two points {X,Y} on this unit sphere: by symmetry, one can always rotate the coordinates so that {X=(\sqrt{1-h^2},0,h)} and {Y=(\sqrt{1-h^2},0,-h)} for some {h \in [0,1]}. For {h \ll 1} the (geodesic) distance {d(X,Y)} is approximated by {d(X,Y) \approx 2h}. One can couple two Brownian motions {W^X} and {W^Y}, one started at {X} and the other one started at {Y}, by the usual symmetry with respect to the plane {\mathcal{P} = \{(x,y,z): z=0\}}: in other words, {W^Y} is the reflexion of {W^X} with respect to {\mathcal{P}}. One can check (good exercise!) that the diffusion followed by the {z}-coordinate of a Brownian motion on the unit sphere of {{\mathbb R}^n} is simply given by
    \displaystyle  dz = -\frac{1}{2}(n-1)z \, dt + \sqrt{1-z^2} \, dW. \ \ \ \ \ (12) 

    With this coupling, for small time {\delta \ll 1}, it follows that

    \displaystyle  \begin{array}{rcl}  z^X_{\delta} &\approx& h - \frac{1}{2} (n-1) h \, \delta + \sqrt{1-h^2} \sqrt{\delta} \; \xi\\ z^Y_{\delta} &\approx& -h + \frac{1}{2} (n-1) h \, \delta - \sqrt{1-h^2} \sqrt{\delta} \; \xi \end{array}

    where {\xi \sim {\mathcal N}(0,1)} is used as the same source of randomness for {z^X_{\delta}} and {z^Y_{\delta}} since {W^Y} is the reflexion of {W^X}. Since {d(W^X_{\delta}, W^Y_{\delta}) \approx |z^X_{\delta} - z^Y_{\delta}|} it readily follows that, to first order in {\delta} and after averaging out the noise {\xi},

    \displaystyle  \begin{array}{rcl}  d(W^X_{\delta}, W^Y_{\delta}) \; \leq \; \big(1- \frac{1}{2}(n-1)\delta \big)\; d(X,Y). \end{array}

    In other words, the curvature of a Brownian motion on the unit sphere of {{\mathbb R}^n} is equal to {\frac{1}{2}(n-1)}. Maybe surprisingly, the higher the dimension, the faster the convergence to equilibrium. This is not so surprising if one notices that the Brownian increment satisfies {\mathop{\mathbb E} \|W_{t+\delta}-W_t\|^2 \approx n \delta}.

  3. Other examples: see the original text for many other examples.
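The synchronous coupling used in the Langevin example above can be tried out directly. The sketch below takes the hypothetical potential {\Psi(x) = \frac{x^2}{2} + \log\cosh(x)}, chosen by me because {1 \leq \Psi'' \leq 2}, so that {\lambda = 1}. It drives two Euler chains (11) with the same noise and records the worst one-step ratio {|x^{k+1}-y^{k+1}| / |x^k - y^k|}, which should not exceed {1-\frac{\lambda}{2}\delta} when {\delta \leq 1}.

```python
import math
import random

LAMBDA = 1.0  # lower bound on Psi'' for the potential chosen below

def grad_psi(x):
    """Psi'(x) for the illustrative potential Psi(x) = x^2/2 + log(cosh(x))."""
    return x + math.tanh(x)

def euler_step(x, delta, xi):
    """One step of the Euler scheme (11) with time step delta and noise xi."""
    return x - 0.5 * grad_psi(x) * delta + math.sqrt(delta) * xi

def worst_gap_ratio(x, y, delta=0.1, n_steps=50, seed=0):
    """Largest one-step contraction ratio when both chains share the same noise."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(n_steps):
        xi = rng.gauss(0.0, 1.0)  # common noise: synchronous coupling
        x_new, y_new = euler_step(x, delta, xi), euler_step(y, delta, xi)
        worst = max(worst, abs(x_new - y_new) / abs(x - y))
        x, y = x_new, y_new
    return worst

if __name__ == "__main__":
    delta = 0.1
    print(worst_gap_ratio(2.0, -1.0, delta), 1.0 - 0.5 * LAMBDA * delta)
```

Because the noise cancels in the difference {x^{k+1}-y^{k+1}}, the ratio is deterministic given the current pair of states; the random seed only affects which states are visited.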

Doob H-transforms

I read today about Doob h-transforms in the Rogers-Williams … It is done quite quickly in the book so that I decided to practice on some simple examples to see how this works.

So we have a Markov process {X_t} living in the state space {S}, and we want to see what this process looks like if we condition on the event {X_T \in A} where {A} is a subset of the state space. To fix the notations we define {p(t,t+s,x,y) = P(X_{t+s}=y|X_t=x)} and {h(t,x)=P(X_T \in A \, | X_t=x)}. The conditioned semi-group {\hat{p}(t,t+s,x,y)=P(X_{t+s}=y|X_t=x, X_T \in A)} is quite easily computed from {p} and {h}. Indeed, by the Markov property, this also equals

\displaystyle \hat{p}(t,t+s,x,y) = \frac{P(X_{t+s}=y; X_T \in A\;|X_t=x)}{P(X_T \in A \,|X_t=x)} = p(t,t+s,x,y) \frac{h(t+s,y)}{h(t,x)}.

Notice also that {\hat{p}(t,t+s,x,y) = p(t,t+s,x,y) \frac{h(t+s,y)}{h(t,x)}} is indeed a Markov kernel in the sense that {\int_{y} \hat{p}(t,t+s,x,y) \, dy = 1}: the only property needed for that is

\displaystyle  h(t,x) = \int_{y} p(t,t+s,x,y)h(t+s,y)\,dy = E\left[ h(t+s,X_{t+s}) \, |X_t=x\right].

In fact, we could take any function {h} that satisfies this equality and define a new Markovian kernel {\hat{p}} and study the associated Markov process. That’s what people usually do by the way.
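Here is a minimal sketch of this recipe for a discrete-time chain on a finite state space, which is a simplification of the setting above with {s=1}. The three-state matrix in the example is an arbitrary illustration. The function {h} is computed by backward recursion from {h(T,x) = 1_{x \in A}}, and the conditioned kernel at time {t} is {\hat{p}_t(x,y) = p(x,y) h(t+1,y)/h(t,x)}.

```python
def h_functions(P, target, T):
    """h[t][x] = P(X_T in target | X_t = x), computed backward from time T."""
    n = len(P)
    h = [[0.0] * n for _ in range(T + 1)]
    for x in target:
        h[T][x] = 1.0
    for t in range(T - 1, -1, -1):
        for x in range(n):
            h[t][x] = sum(P[x][y] * h[t + 1][y] for y in range(n))
    return h

def conditioned_kernel(P, h, t):
    """Doob h-transform of P at time t; rows with h[t][x] = 0 are left at zero."""
    n = len(P)
    K = [[0.0] * n for _ in range(n)]
    for x in range(n):
        if h[t][x] > 0.0:
            for y in range(n):
                K[x][y] = P[x][y] * h[t + 1][y] / h[t][x]
    return K

if __name__ == "__main__":
    P = [[0.5, 0.5, 0.0],
         [0.25, 0.5, 0.25],
         [0.0, 0.5, 0.5]]
    T = 4
    h = h_functions(P, target={2}, T=T)
    for t in range(T):
        print(t, conditioned_kernel(P, h, t))
```

Rows corresponding to states from which the target cannot be reached in the remaining time are left equal to zero, since the conditioning is not defined there.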

Remark 1: we almost never know the quantity {h(t,x)} explicitly, except in some extremely simple cases!

Before trying these ideas on some simple examples, let us see what this says on the generator of the process:

  1. continuous time Markov chains, finite state space: let us suppose that the intensity matrix is {Q} and that we want to know the dynamic on {[0,T]} of this Markov chain conditioned on the event {X_T=z}. Indeed {p(t,t+s,i,j) = [\exp(sQ)]_{i,j}} so that {\hat{p}(t,t+s,i,j) = [\exp(sQ)]_{i,j} \frac{p(t+s,T,j,z)}{p(t,T,i,z)}}. Letting {s \rightarrow 0}, we see that at time {t}, the intensity of the jump from {i} to {j \neq i} of the conditioned Markov chain is
    \displaystyle  Q(i,j) \frac{p(t,T,j,z)}{p(t,T,i,z)}.

    Notice how this behaves as {t \rightarrow T}: if at {t=T-\epsilon} the Markov chain is in state {i \neq z} (with {Q(i,z)>0}) then the intensity of jump from {i} to {z} is equivalent to {\frac{1}{\epsilon}}.

  2. diffusion processes: this time consider a {1}-dimensional diffusion {dX_t = \mu(X_t) \, dt + \sigma(X_t) \, dW_t} on {[0,T]} conditioned on the event {X_T \in A} and define as before {h(t,x)=P(X_T \in A \,|X_t=x)}. The generator of the (non-homogeneous) conditioned diffusion is defined at time {t} by
    \displaystyle  \begin{array}{rcl}  \mathcal{G}^{(t)} f(x) &=& \lim_{s \rightarrow 0} \frac{1}{s} \Big( E\left[f(X_{t+s}) \,| X_t=x, X_T \in A\right]-f(x) \Big)\\ &=& \lim_{s \rightarrow 0} \frac{E\left[ f(X_{t+s}) h(t+s, X_{t+s}) \,| X_t=x\right]-f(x) }{s\,h(t,x)} \end{array}

    so that if {\mathcal{L} = \mu \partial_x + \frac{1}{2} \sigma^2 \partial^2_{xx}} is the generator of the original diffusion we get

    \displaystyle  \mathcal{G}^{(t)} f = \frac{1}{h} \big(\partial_t + \mathcal{L}\big)(hf).

    Because {(\partial_t + \mathcal{L})h=0}, this also reads

    \displaystyle  \mathcal{G}^{(t)} f = \mathcal{L}f + \sigma^2 \frac{\partial_x h}{h} \partial_x f.

    This means that the conditioned diffusion {Z} follows the SDE:

    \displaystyle  dZ_t = \Big( \mu(Z_t) + \sigma(Z_t)^2 \frac{\partial_x h(t,Z_t)}{h(t,Z_t)} \Big) \, dt+ \sigma(Z_t) dW_t.

    The volatility function remains the same while an additional drift shows up.
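To spell out the step from the conjugated generator {\frac{1}{h}(\partial_t + \mathcal{L})(hf)} to the extra drift term, here is the product-rule computation, assuming that {h} is smooth and strictly positive and that {f} does not depend on time:

\displaystyle \begin{array}{rcl} (\partial_t + \mathcal{L})(hf) &=& f \, \partial_t h + \mu \, \big(f \, \partial_x h + h \, \partial_x f\big) + \frac{1}{2} \sigma^2 \, \big(f \, \partial^2_{xx} h + 2 \, \partial_x h \, \partial_x f + h \, \partial^2_{xx} f\big)\\ &=& f \, (\partial_t + \mathcal{L})h \;+\; h \, \mathcal{L} f \;+\; \sigma^2 \, \partial_x h \, \partial_x f. \end{array}

The first term vanishes because {(\partial_t + \mathcal{L})h=0}, and dividing the remaining two terms by {h} gives {\mathcal{L}f + \sigma^2 \frac{\partial_x h}{h} \partial_x f}.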

We will try these ideas on some examples where the probability densities are extremely simple. Notice that in the case of diffusions, if we take {A=\{x^+\}}, the function {(t,x) \mapsto P(X_T = x^+ \, |X_t=x)} is identically equal to {0} (except degenerate cases): to condition on the event {X_T=x^+} we need instead to take {h(t,x)} to be the transition density {p(t,T,x,x^+)}. This follows from the approximation {P(X_T \in (x^+,x^+ + dx) \,|X_t=x ) \approx p(t,T,x,x^+) \, dx + o(dx)}. Let’s do it:

  • Brownian Bridge on {[0,T]}: in this case {p(t,T,x,0) \propto e^{-\frac{|x|^2}{2(T-t)}}} so that the additional drift reads {\frac{-x}{T-t}}: a Brownian bridge follows the SDE
    \displaystyle  dX_t = -\frac{X_t}{T-t} \, dt + dW_t.

    This might not be the best way to simulate a Brownian bridge though!

  • Poisson Bridge on {[0,T]}: we condition a Poisson process of rate {\lambda} on the event {X_T=N}. The intensity matrix is simply {Q(k,k+1)=\lambda=-Q(k,k)} and {0} everywhere else while the transition probabilities are given by {p(t,T,k,N) = e^{-\lambda (T-t)} \frac{(\lambda (T-t) )^{N-k}}{(N-k)!}}. This is why at time {t}, the intensity from {k} to {k+1} is given by
    \displaystyle  \lambda(t,k,k+1) = \frac{N-k}{T-t}.

    Again, that might not be the most efficient way to simulate a Poisson Bridge ! Notice how the intensity {\lambda} has disappeared …

  • Ornstein-Uhlenbeck Bridge: Let’s consider the usual OU process given by the dynamic {dX_t = -X_t \, dt + \sqrt{2} \, dW_t}: the invariant probability is the usual centred Gaussian distribution. Say that we want to know how such an OU process behaves if we condition on the event {X_T = z}. Because {X_T} given {X_t=x} is Gaussian with mean {e^{-(T-t)}x} and variance {1-e^{-2(T-t)}}, we have {p(t,T,x,z) \propto \exp(-\frac{|z-e^{-(T-t)}x|^2}{2(1-e^{-2(T-t)})})} and we find that the conditioned O-U process follows the SDE
    \displaystyle dX_t = \Big(\frac{2 \, (z-e^{-(T-t)}X_t)}{e^{T-t}(1-e^{-2(T-t)})} - X_t \Big)\, dt+ \sqrt{2} \, dW_t.

    If we Taylor expand the additional drift as {t \rightarrow T}, it can be seen that this term behaves like {\frac{z-X_t}{T-t}}, exactly as in the case of the Brownian bridge. Below is a plot of an O-U process conditioned on the event {X_{10} = 10}, starting from {X_0=0}.

    [Figure: conditioned O-U process]
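The Brownian bridge SDE above is easy to try out with a naive Euler-Maruyama scheme. The sketch below is only an illustration of the extra drift {-\frac{X_t}{T-t}} at work (the step count and seed are arbitrary choices of mine), not a recommended way to sample a bridge. The drift is evaluated at the left endpoint of each step, so {T-t} never vanishes.

```python
import math
import random

def brownian_bridge_euler(T=1.0, n_steps=1000, x0=0.0, seed=0):
    """Euler-Maruyama path for dX = -X/(T-t) dt + dW on [0, T], started at x0."""
    rng = random.Random(seed)
    dt = T / n_steps
    x = x0
    path = [x0]
    for k in range(n_steps):
        t = k * dt                      # left endpoint, so T - t >= dt > 0
        drift = -x / (T - t)            # extra drift coming from the h-transform
        x = x + drift * dt + math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append(x)
    return path

if __name__ == "__main__":
    path = brownian_bridge_euler()
    print(path[0], path[len(path) // 2], path[-1])
```

In the last step the drift term cancels the current position (up to rounding), so the endpoint of the discretized path is of order {\sqrt{\Delta t}}: the bridge is pinned at {0} up to the discretization error.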

Brownian particles on a circle

Today, James Norris gave a talk related to Diffusion-limited aggregation processes and mentioned, in passing, the following amusing fact: put {N} equidistant Brownian particles {W_1, \ldots, W_N} on the circle with unit circumference and let them evolve. When two of them collide they get stuck to each other and continue together afterwards: after a certain amount of time {T_N}, only one particle remains. Perhaps surprisingly, it is extremely easy to obtain the first few properties of {T_N}. For example, {\lim_N \mathop{\mathbb E}\left[ T_N \right] = \frac{1}{6}}.

To see that, define {D_k} as the distance between {W_k} and {W_{k+1}} (modulo {N}) so that initially {D_k = \frac{1}{N}} for {k=1,2 \ldots, N}. Notice then (Itô’s formula) that

\displaystyle  M_t = e^{\lambda t} \sum_{k=1}^N \sin(\sqrt{\lambda} D_k)

is a (local) martingale that starts from {M_0 = N \sin(\frac{\sqrt{\lambda}}{N})}. Also, at time {T_N}, exactly {N-1} of the distances {D_1, D_2, \ldots, D_N} are equal to {0} while one of them is equal to {1}: this is why {M_{T_N} = e^{\lambda T_N} \sin(\sqrt{\lambda})}. The end is clear: apply the optional sampling theorem (to be rigorous, take {\lambda} not too big, or do some kind of truncation to be sure that the optional sampling theorem applies) to conclude that

\displaystyle  \mathop{\mathbb E}\left[ e^{\lambda T_N} \right] = \frac{N \sin(\frac{\sqrt{\lambda}}{N})}{\sin(\sqrt{\lambda})}.

This gives for example {\mathop{\mathbb E}\left[ T_N \right] = \frac{1}{6}(1-\frac{1}{N^2})}. I just find it cute!
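The expectation can be read off the moment generating function numerically: since {\mathop{\mathbb E}[e^{\lambda T_N}] = 1 + \lambda \mathop{\mathbb E}[T_N] + O(\lambda^2)}, a finite difference at a small {\lambda} should be close to {\frac{1}{6}(1-\frac{1}{N^2})}. The sketch below does exactly that; the step size {\lambda = 10^{-4}} is an arbitrary choice.

```python
import math

def mgf_coalescence_time(lam, n):
    """E[exp(lam * T_N)] = N sin(sqrt(lam)/N) / sin(sqrt(lam)), for small lam > 0."""
    r = math.sqrt(lam)
    return n * math.sin(r / n) / math.sin(r)

def mean_from_mgf(n, lam=1e-4):
    """Finite-difference approximation of E[T_N] = d/d(lam) of the mgf at lam = 0."""
    return (mgf_coalescence_time(lam, n) - 1.0) / lam

if __name__ == "__main__":
    for n in (2, 5, 50):
        print(n, mean_from_mgf(n), (1.0 - 1.0 / n ** 2) / 6.0)
```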

So what if we do that on a segment ?

On the Wiener space

I would like to discuss in this post the importance of the heuristic functional

\displaystyle I(W) := \frac{1}{2} \int_0^T |\frac{d}{dt} W|^2 \, dt \ \ \ \ \ (1)


that often shows up when doing analysis on the Wiener space {S = (C([0,T], {\mathbb R}), \|\cdot\|_{\infty})}: an element of the Wiener space is traditionally denoted by {\omega} – this is a continuous function and {\omega(t)} is its value at time {t \in [0,T]}. For a (nice) subset {A} of {S}, the Wiener measure of {A} is nothing else than the probability that a Brownian path belongs to {A}. Having said that, the quantity {I(W)} hardly makes sense since a Brownian path {(W_t: t \in [0,T])} is (almost surely) non differentiable anywhere: still, this is a very useful heuristic in many situations. A review of probability can be found on the excellent blog of Terry Tao.

Where does it come from ?

As often, it is very instructive to come back to the discrete setting. Consider a time interval {[0,T]} and a discretization parameter {\Delta t = \frac{T}{N}}: a discrete Brownian path is represented by the {N}-tuple

\displaystyle W^{(N)} = (W_{\Delta t},W_{2\Delta t},\ldots,W_{T}) \in {\mathbb R}^{N}.

The random variables {\Delta W_k := W_{k \Delta t} - W_{(k-1) \Delta t}} are independent centred Gaussian variables with variance {\Delta t} so that the random vector {W^{(N)}} has a density {\mathop{\mathbb P}^{(N)}} with respect to the {N}-dimensional Lebesgue measure

\displaystyle \begin{array}{rcl} \frac{d \mathop{\mathbb P}^{(N)}}{d \lambda^{\textrm{Leb}}}(W^{(N)}) &\propto& \exp\{-\frac{1}{2 \Delta t} \sum_{k=1}^N |W_{k \Delta t} - W_{(k-1) \Delta t}|^2\} \\ &=& \exp\{-\frac{1}{2} \sum_{k=1}^N |\frac{W_{k \Delta t} - W_{(k-1) \Delta t}}{\Delta t}|^2 \, \Delta t\} \\ &=& \exp\{- I^{(N)}(W^{(N)})\}. \end{array}

The functional {I^{(N)} := \frac{1}{2} \sum_{k=1}^N |\frac{W_{k \Delta t} - W_{(k-1) \Delta t}}{\Delta t}|^2 \, \Delta t} (with the convention {W_0=0}) is indeed a discretization of {I}. Informally, the Wiener measure has a density proportional to {\exp\{-I(W)\}} with respect to the “infinite dimensional Lebesgue measure”: this does not make much sense because there is no such thing as the infinite dimensional Lebesgue measure. This should be understood as the limiting case {N \rightarrow \infty} of the discretization procedure presented above. Indeed, it is not complete nonsense to say that
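A small sketch of the discretized functional {I^{(N)}} may help (the function names are mine; unlike the {N}-tuple above, the path passed in includes the starting value {W_0}). For a smooth path such as {f(t)=t^2} the discrete action converges to {\frac{1}{2}\int_0^T \dot{f}^2 \, dt}, whereas for a sampled Brownian path it grows like {N/2}, which is the discrete counterpart of the remark that {I(W)} hardly makes sense for a Brownian path.

```python
import math
import random

def discrete_action(path, T):
    """I^(N) for a path given as [W_0, W_dt, ..., W_T] (N+1 values on a regular grid)."""
    n = len(path) - 1
    dt = T / n
    return 0.5 * sum(((path[k] - path[k - 1]) / dt) ** 2 * dt for k in range(1, n + 1))

def smooth_path(f, n, T):
    """Values of a deterministic function f on the regular grid of [0, T]."""
    return [f(k * T / n) for k in range(n + 1)]

def brownian_path(n, T, seed=0):
    """Discrete Brownian path: independent N(0, dt) increments, started at 0."""
    rng = random.Random(seed)
    dt = T / n
    path = [0.0]
    for _ in range(n):
        path.append(path[-1] + math.sqrt(dt) * rng.gauss(0.0, 1.0))
    return path

if __name__ == "__main__":
    T = 1.0
    print(discrete_action(smooth_path(lambda t: t * t, 1000, T), T))  # near 2/3
    for n in (100, 1000, 10000):
        print(n, discrete_action(brownian_path(n, T), T) / n)         # near 1/2
```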

\displaystyle \mathop{\mathbb P}[ \omega = g] \sim e^{-I(g)}. \ \ \ \ \ (2)


because we will see that if {f,g} are two nice functions then

\displaystyle \lim_{ \epsilon \rightarrow 0} \frac{P[ \|W-f\|_{\infty} < \epsilon]}{P[ \|W-g\|_{\infty} < \epsilon]} = \frac{e^{-I(f)}}{e^{-I(g)}}.

It is then very convenient to write

\displaystyle \mathop{\mathbb P}[\omega \in A] = \int_{A} e^{-I(W)} \, d\lambda(W)

where {\lambda} is a fictional infinite dimensional Lebesgue measure (ie: translation invariant).

Translations in the Wiener space

As a first illustration of the heuristic {\mathop{\mathbb P}[\omega = g] \sim e^{-I(g)}}, let us see how the Wiener measure behaves under translations. If we choose a nice continuous function {f} such that {I(f)} is well defined (ie: {\dot{f} \in L^2([0,T])}), a translated probability measure {\mathop{\mathbb P}^{f}} can be defined through the relation

\displaystyle \mathop{\mathbb P}^f(A) := \mathop{\mathbb P}(f+\omega \in A). \ \ \ \ \ (3)


It is not clear a priori that {\mathop{\mathbb P}^f} is absolutely continuous with respect to the Wiener measure {\mathop{\mathbb P}}. Of course, we impose that {f(0)=0}. For a set {A \subset S}, the heuristic says that

\displaystyle \begin{array}{rcl} \mathop{\mathbb P}^f( \omega \in A) &=& \mathop{\mathbb P}(\omega \in A-f) = \int_{A-f} e^{-I(W)} \, d\lambda(W)\\ &\stackrel{\textrm{(trans. inv.)}}{=}& \int_{A} e^{-I(W-f)} \, d\lambda(W)\\ &=& \int_{A} e^{-\frac{1}{2} \int_0^T |\dot{W}-\dot{f}|^2 \, dt} \, d\lambda(W)\\ &=& \int_{A} e^{\int_0^T \dot{f} \dot{W} \, dt -\frac{1}{2} \int_0^T \dot{f}^2 \, dt} e^{-I(W)}\, d\lambda. \end{array}

This is why, writing {\dot{W} \, dt = dW}, we obtain the following change of probability formula

Proposition 1 Cameron-Martin-Girsanov change of probability formula:

for any continuous function {f} such that {f(0)=0} and {\dot{f} \in L^2([0,T])},

\displaystyle \frac{d \mathop{\mathbb P}^f}{d \mathop{\mathbb P}}(W) = Z^f(W) := \exp\{ \int_0^T \dot{f} dW_t -\frac{1}{2} \int_0^T \dot{f}^2 \, dt \} \ \ \ \ \ (4)


This change of probability formula is extremely useful since it is typically much more convenient to work with a Brownian motion {W_t} than with a drifted Brownian motion {f(t) + W_t}. In many situations, we can get rid of the annoying stochastic integral {\int_0^T \dot{f} dW_t} by integrating by parts: if {f} is regular enough ({f \in C^3([0,T])}, say) we have

\displaystyle \begin{array}{rcl} \frac{d \mathop{\mathbb P}^f}{d \mathop{\mathbb P}}(W) &=& Z^f(W)\\ &:=& \exp\{ \dot{f}(T)W(T) - \int_0^T \ddot{f}(t) W(t)\, dt -\frac{1}{2} \int_0^T \dot{f}^2 \, dt\}. \end{array}
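The integration by parts {\int_0^T \dot{f} \, dW_t = \dot{f}(T)W(T) - \int_0^T \ddot{f}(t) W(t) \, dt} behind this second expression can be checked on a discretized path. The sketch below uses the hypothetical choice {f(t) = \sin(t)} (so that {f(0)=0}), a left-endpoint sum for the stochastic integral and a Riemann sum for the ordinary one; the two sides agree up to a discretization error of order {\Delta t}.

```python
import math
import random

def check_integration_by_parts(T=1.0, n=20000, seed=0):
    """Compare the Ito sum of fdot dW with fdot(T) W(T) - integral of fddot W dt.

    Illustrative choice: f(t) = sin(t), so fdot = cos and fddot = -sin.
    """
    rng = random.Random(seed)
    dt = T / n
    w = 0.0
    ito_sum = 0.0
    riemann_sum = 0.0
    for k in range(1, n + 1):
        dw = math.sqrt(dt) * rng.gauss(0.0, 1.0)
        ito_sum += math.cos((k - 1) * dt) * dw      # fdot at the left endpoint
        w += dw                                     # w is now W at time k * dt
        riemann_sum += -math.sin(k * dt) * w * dt   # fddot(t_k) W(t_k) dt
    rhs = math.cos(T) * w - riemann_sum
    return ito_sum, rhs

if __name__ == "__main__":
    lhs, rhs = check_integration_by_parts()
    print(lhs, rhs, abs(lhs - rhs))
```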


The next section is a straightforward application of this change of probability formula.

Probability to be in an {\epsilon}-tube

Suppose that {f,g} are two nice functions (smooth, say): for small {\epsilon \ll 1}, what is a good approximation of the quotient

\displaystyle Q(f,g,\epsilon) = \frac{\mathop{\mathbb P}(\|W-f\|_{\infty}<\epsilon)}{\mathop{\mathbb P}(\|W-g\|_{\infty}<\epsilon)}.

In words, this basically asks the question: how much more probable is the event {\{ W \textrm{ looks like } f\}} than the event {\{ W \textrm{ looks like } g\}}? Of course, since this can also be read as

\displaystyle Q(f,g,\epsilon) = \frac{Q(f,0,\epsilon)}{Q(g,0,\epsilon)}

where {0} indicates the function identically equal to zero, it suffices to consider the case {g=0}. If we introduce the event

\displaystyle A_{\epsilon} = \{ \omega \in S: \|\omega\|_{\infty} < \epsilon\},

the quotient {Q(f,0,\epsilon)} is equal to {\frac{\mathop{\mathbb P}^f(A_{\epsilon})}{\mathop{\mathbb P}(A_{\epsilon})}} (the set {A_{\epsilon}} being symmetric, the sign of the translation does not matter). This is why, using the change of probability formula (4),

\displaystyle \begin{array}{rcl} Q(f,0,\epsilon) &=& \frac{\mathop{\mathbb P}^f(A_{\epsilon})}{\mathop{\mathbb P}(A_{\epsilon})} = \frac{\mathop{\mathbb E}\left[ 1_{A_{\epsilon}}(W) Z^f(W) \right]}{\mathop{\mathbb E}\left[ 1_{A_{\epsilon}}(W) \right]} \end{array}

with {Z^f(W) = \exp\{\dot{f}(T)W(T) - \int_0^T \ddot{f}(t) W(t)\, dt -\frac{1}{2} \int_0^T \dot{f}^2 \, dt\}}. If {\|\dot{f}\|_{\infty}, \|\ddot{f}\|_{\infty} \leq C}, it is clear that for {W \in A_{\epsilon}},

\displaystyle -(C\epsilon + \epsilon C T) < \dot{f}(T)W(T) - \int_0^T \ddot{f}(t) W(t)\, dt < C\epsilon + \epsilon C T.

Both sides going to zero when {\epsilon} goes to zero, this is enough to conclude that

\displaystyle \lim_{\epsilon \rightarrow 0} Q(f,0,\epsilon) = \exp\{-\frac{1}{2} \int_0^T \dot{f}^2 \, dt\} = \exp\{ -I(f)\}. \ \ \ \ \ (5)

In short, for any two reasonably nice functions {f,g} (for example {f,g \in C^3([0,T])}) that satisfy {f(0)=g(0)=0},

\displaystyle \lim_{\epsilon \rightarrow 0} Q(f,g,\epsilon) = \lim_{\epsilon \rightarrow 0} \frac{\mathop{\mathbb P}(\|W-f\|_{\infty}<\epsilon)}{\mathop{\mathbb P}(\|W-g\|_{\infty}<\epsilon)} = \frac{ \exp\{ -I(f)\}}{\exp\{ -I(g)\} }. \ \ \ \ \ (6)

Large deviation result

Take a subset {A} of {S} (it might be useful to think of sets like {A_{f,\epsilon,\alpha} = \{\omega: |\omega(u)-f(u)| < \alpha \, \forall u \in U\}}). We are interested in the probability that the rescaled (in space) Brownian motion

\displaystyle W^{(\epsilon)}(t) = \epsilon W(t)

belongs to {A} when {\epsilon} goes to {0}. Typically, if the null function does not belong to (the closure of) {A}, the probability {\mathop{\mathbb P}( W^{(\epsilon)} \in A)} is exponentially small. It turns out that if {A} is regular enough

\displaystyle \ln \mathop{\mathbb P}( \epsilon W \in A) \sim -\frac{1}{\epsilon^2} \inf_{f \in A} I(f) =: -\frac{1}{\epsilon^2} I(A).

Again, the usual heuristic gives this result in no time if we accept not to be too rigorous:

\displaystyle \begin{array}{rcl} \epsilon^2 \, \ln \mathop{\mathbb P}( \epsilon W \in A) &=& \epsilon^2 \, \ln \int_{\epsilon W \in A} e^{-I(W)} \, d\lambda \\ &=& \epsilon^2 \, \ln \int_{W \in A} e^{-I(\frac{W}{\epsilon})} \textrm{(Jacobian)}\, d\lambda \\ &=& \epsilon^2 \, \ln \int_{W \in A} e^{-\frac{I(W)}{\epsilon^2}} \textrm{(Jacobian)}\, d\lambda \\ &\stackrel{\epsilon \rightarrow 0}{\rightarrow}& -\inf \{I(f): f \in A\}. \end{array}

This is very fishy since the Jacobian should behave very badly (actually the measures {\mathop{\mathbb P}[W \in \cdot]} and {\mathop{\mathbb P}[\epsilon W \in \cdot]} are mutually singular). Nevertheless, the basic idea is there and all this mess can be made perfectly rigorous: it can be proved (Freidlin-Wentzell theory) that for any open set {G},

\displaystyle \liminf \epsilon^2 \, \ln \mathop{\mathbb P}( \epsilon W \in G) \geq -\inf \{I(f): f \in G\}

while for any closed set {F},

\displaystyle \limsup \epsilon^2 \, \ln \mathop{\mathbb P}( \epsilon W \in F) \leq -\inf \{I(f): f \in F\}.
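A case where everything is explicit is the closed set {F = \{\omega : \omega(T) \geq 1\}}, an example I am adding for illustration. By Cauchy-Schwarz the infimum of {I} over {F} is {\frac{1}{2T}}, attained by the straight line {f(t) = t/T}, while {\mathop{\mathbb P}(\epsilon W_T \geq 1)} is a Gaussian tail. The sketch below evaluates {\epsilon^2 \ln \mathop{\mathbb P}(\epsilon W \in F)} with the complementary error function; {\epsilon} cannot be taken too small here, otherwise the probability underflows to zero in floating point.

```python
import math

def scaled_log_prob(eps, T=1.0, level=1.0):
    """eps^2 * ln P(eps * W_T >= level), using the Gaussian tail of W_T ~ N(0, T)."""
    p = 0.5 * math.erfc(level / (eps * math.sqrt(2.0 * T)))
    return eps ** 2 * math.log(p)

if __name__ == "__main__":
    T = 1.0
    for eps in (0.5, 0.2, 0.1, 0.05):
        print(eps, scaled_log_prob(eps, T), -1.0 / (2.0 * T))
```

The convergence is slow because of the polynomial prefactor in the Gaussian tail, which contributes a correction of order {\epsilon^2 \ln(1/\epsilon)}.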

One cleaner way to prove this is to use the usual Cramér theorem of large deviations for sums of i.i.d random variables (in Banach space) and notice that for {\epsilon(N) = \frac{1}{\sqrt{N}} } we have, in distribution,

\displaystyle W^{\epsilon(N)} = \frac{W_1+W_2+\ldots+W_N}{N}

where {(W_i:i=1,2,\ldots,N)} are independent standard Brownian motions. Cramér's theorem states that

\displaystyle \mathop{\mathbb P}[ \frac{W_1+W_2+\ldots+W_N}{N} \in A] \sim \exp\{-N \inf\{I(f): f \in A\} \}

where the rate function {I} is the Legendre transform

\displaystyle I(f) = \sup\{ \int_{0}^T f(t)g(t)\, dt - \ln \, \mathop{\mathbb E} e^{ \int_0^T g(t) W(t) \, dt } : g \in L^2([0,T])\}.

It is not very hard to see that the supremum is indeed {\frac{1}{2} \int_0^T \dot{f}^2 \, dt}.
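Here is a sketch of that computation, assuming {f(0)=0} and {\dot{f} \in L^2([0,T])}. Set {G(t) = \int_t^T g(s) \, ds}, so that {G(T)=0} and {\dot{G} = -g}. Integrating by parts, {\int_0^T g(t) W(t) \, dt = \int_0^T G(t) \, dW_t} is a centred Gaussian variable with variance {\int_0^T G^2 \, dt}, and {\int_0^T f g \, dt = \int_0^T \dot{f} G \, dt}. Consequently

\displaystyle \begin{array}{rcl} \int_0^T f g \, dt - \ln \mathop{\mathbb E} e^{\int_0^T g(t) W(t) \, dt} &=& \int_0^T \dot{f} G \, dt - \frac{1}{2} \int_0^T G^2 \, dt\\ &=& \frac{1}{2} \int_0^T \dot{f}^2 \, dt - \frac{1}{2} \int_0^T (\dot{f} - G)^2 \, dt \;\leq\; \frac{1}{2} \int_0^T \dot{f}^2 \, dt. \end{array}

The upper bound is approached by choosing {g} such that {G} is close to {\dot{f}} in {L^2([0,T])}.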