SMC sampler

Sequential Monte Carlo (SMC) methods are usually designed to deal with probability distributions of increasing dimensions. The canonical example is the sequence of distributions {\pi_n(X_{1:n}) = \mathbb{P}(X_{1:n} \, | Y_{1:n})} where {\{ X_k \}_{k \geq 0}} is a hidden Markov chain and {\{ Y_k \}_{k \geq 0}} is an observation process. The notation {X_{1:n}} is a shortcut for {(X_1, X_2, \ldots, X_n)}. This is usually called a Hidden Markov Model in the literature. If the Markov chain {X_k} evolves on in {\mathbb{R}^d} then the distribution {\pi_n} is a probability on {\big( \mathbb{R}^d \big)^n}.

It turns out that one can use the same kind of ideas to study a single distribution {\pi}. If {\pi} is difficult to sample directly from and that it is difficult to use traditional MCMC methods (because of the dimensionality, multi-modality, poor mixing properties, etc…), one can try to use SMC methods to study the distribution {\pi}. To this end, one can introduce a sequence of bridging distributions

\displaystyle \begin{array}{rcl} \pi_0, \pi_1, \ldots, \pi_n \end{array}

where {\pi_0} is typically very easy to sample from and {\pi_n=\pi}. A traditional choice is {\pi_k(x) \propto \pi(x)^{\alpha_k}} where {0 \leq \alpha_1 < \alpha_2 < \ldots <\alpha_n=1}. The bigger the index {k}, the more intricate the distribution {\pi_k} is. The problem is that each distribution {\pi_k} has the same dimensionality as the original distribution {\pi} so that it is difficult to use all the machinery of SMC methods. For the sake of concreteness, let us suppose that {\pi} is a distribution on {\mathbb{R}^d}. A fundamental remark that is the base of what is now known as Sequential Monte Carlo samplers is that {\pi_k(x_k)} can be seen as the marginal distribution of the probability

\displaystyle \begin{array}{rcl} \Pi_k(x_1, \ldots, x_k) = \pi_k(x_k) L_k(x_k, x_{k-1})L_{k-1}(x_{k-1}, x_{k-2}) \ldots L_2(x_2,x_1) \end{array}

where {L_j(\cdot, \cdot)} are Markov kernels in the sense that for any {x} we have {\int_y L_j(x, y) \, dy = 1}. The distribution {\Pi_k} lives on {\big( \mathbb{R}^d \big)^k} and one can use the traditional SMC methods to explore the sequence of distributions {\Pi_1, \ldots, \Pi_n}. This algorithm was devised a few years ago by Pierre Del Moral, Arnaud Doucet and Ajay Jasra. It is extremely powerful and several other algorithms can be seen as particular cases of it. Because I had never implemented the algorithm before, I tried yesterday to see how this worked on a very simple example. My target distribution was an equal mixture of seven Gaussian distributions in {\mathbb{R}^2} with unit covariance matrices and centred at the vertices of a regular heptagon with radius {20}. I started with {N=3000} particles at the centre of the heptagon and let them evolved according to a random walk. When the Effective Sampling Size (ESS) was too small (less than {N/2}), the whole particle system was regenerated. In order to obtain a sweet movie, I used {100} bridge distributions {\pi_k \propto \pi^{k/100}}.

The result is clear and quite impressive. Indeed, this could have been made much more efficient by taking much less bridge distributions. Some recent theory has been developed to analyse the stability of the algorithm in high dimensional situations. The analysis described in the article uses a scaling limit approach very similar to some ideas used to analyse MCMC algorithms and some more recent developements along the same lines that we have done with Andrew and Natesh. Very exciting stuff!

[EDIT: 2 October 2012]
I received several emails asking me to describe how to obtain the short animation showing the evolution of the particles system. The code is available here. This is a quick python script producing a bunch of .png files. It then suffices to stick them together to obtain an animated .gif file. This can for example easily be done using the free (and very good) photoshop-like software GIMP. Open all the png files as different layers. Export the result as a .gif file with the option ‘animated gif’ checked. Voila!

Joe’s Pyramid

Yesterday, while reading the last issue of the NewScientist, I came across the following very cute riddle:

Lazy, I asked myself if it were possible to write {10} lines long Python code to solve this innocent looking enigma. The whole pyramid is entirely determined by the {6} numbers lying at the bottom, and each one of them is an integer between {1} and {99}: these numbers must be different so that there are at most {\binom{99}{6} \approx 1.1 \times 10^9} possibilities to test! Brute force won’t work my friend!

When stupid brute force does not work, one can still try annealing/probabilist methods: this works pretty well for Sudoku (which is NP-hard) as this is brilliantly described here and there. The principle is simple: if one can find a good energy function {E:\{1, \ldots, 99\}^6 \rightarrow \mathbb{R}} such that a solution to the problem corresponds to a low energy configuration, one can do MCMC-simulating annealing-etc on the target distribution

\displaystyle \pi(\text{configuration}) \propto e^{-\beta \cdot E(\text{configuration})}.

The issue is that it might be very difficult to choose a sensible energy function {E(\cdot)}. Foolishly, I first tried the following energy function, and then ran a random walk Metropolis algorithm with {\pi} as target probability:

\displaystyle  \pi(\text{configuration}) \propto e^{\beta \cdot {\text{Height}}(\text{configuration})}

where {\text{Height}(\text{configuration})} is the numbers of levels that one can fill, starting from the bottom, without encountering any problem {i.e} no repetition and no number greater than {100}. With different values of {\beta} and letting run the algorithm for a few millions iterations ({5} min on my crappy laptop), one can easily produce configurations that are {5}-levels high: but the algorithm never found any real solution {i.e} a configuration with height equal to {6}.

Now I am curious wether this is possible to produce a non-stupid energy function {E(\cdot)} so that this riddle is solvable in a reasonable amount of time by standard MCMC – annealing methods.

As a conclusion, I should mention that with a pen and a cup of coffee, one can easily find a solution: I will not spoil the fun, but just say that the configuration space is not that big if one think more carefully about it…

Potts model and Monte Carlo Slow Down

A simple model of interacting particles

The mean field Potts model is extremely simple: there are {N} interacting particles {x_1, \ldots, x_N} and each one of them can be in {q} different states {1,2, \ldots, q}. Define the Hamiltonian

\displaystyle  H_N(x) = -\frac{1}{N} \sum_{i,j} \delta(x_i, x_j)

where {x=(x_1, \ldots, x_N)} and {\delta} is the Kronecker symbol. The normalization {\frac{1}{N}} ensures that the energy is an extensive quantity so that the mean energy per particle {h_N(x) = \frac{1}{N} H_N(x)} does no degenerate to {0} or {+\infty} for large values of {N}. The sign minus is here to favorize configurations that have a lot of particles in the same state. The Boltzman distribution at inverse temperature {\beta} on {\{1, \ldots, q\}^N} is given by

\displaystyle P_{N,\beta} = \frac{1}{Z_N(\beta)} e^{-\beta H_N(x)}

where {Z_N(\beta)} is a normalization constant. Notice that if we choose a configuration uniformly at random in {\{1, \ldots, q\}^N}, with overwhelming probability the ratio of particles in state {k} will be close to {\frac{1}{q}}. Also it is obvious that if we define

\displaystyle  L^{(N)}_k(x) = \frac{1}{N} \, \Big( \textrm{Number of particles in state }k \Big)

then {L=(L^{(N)}_1, \ldots, L^{(N)}_q)} will be close to {(\frac{1}{q}, \ldots, \frac{1}{q})} for a configuration taken uniformly at random. Stirling formula even says that the probability that {L} is close to {\nu = (\nu_1, \ldots, \nu_q)} is close to {e^{-N \, R(\nu)}} where

\displaystyle R(\nu) = \nu_1 \ln(q\nu_1) + \ldots + \nu_q \ln(q\nu_q).

Indeed {(\frac{1}{q}, \ldots, \frac{1}{q}) = \textrm{argmin} \, R(\nu)}. The situation is quite different under the Boltzman distribution since it favorizes the configurations that have a lot of particles in the same state: this is because the Hamiltonian {H_N(x)} is minimized for configurations that have all the particles in the same state. In short there is a competition between the entropy (there are a lot of configurations close to the ratio {(\frac{1}{q}, \ldots, \frac{1}{q})}) and the energy that favorizes the configurations where all the particles are in the same state.

With a little more work, one can show that there is a critical inverse temperature {\beta_c} such that:

  • for {\beta < \beta_c} the entropy wins the battle: the most probable configurations are close to the ratio {(\frac{1}{q}, \ldots, \frac{1}{q})}
  • for {\beta > \beta_c} the energy effect shows up: there are {q} most probable configurations that are the permutations of {(a_{\beta},b_{\beta},b_{\beta}, \ldots, b_{\beta})} where {a_{\beta}} and {b_{\beta}} are computable quantities.

The point is that above {\beta_c} the system has more than one stable equilibrium point. Maybe more important, if we compute the energy of these most probable states

\displaystyle  h(\beta) = \lim \frac{1}{N} H_N(\textrm{most probable state})

then this function has a discontinuity at {\beta=\beta_c}. I will try to show in the weeks to come how this behaviour can dramatically slow down usual Monte-Carlo approach to the study of these kind of models.

Hugo Touchette has a very nice review of statistical physics that I like a lot and a good survey of the Potts model. Also T. Tao has a very nice exposition of related models. The blog of Georg von Hippel is dedicated to similar models on lattices, which are far more complex that this mean field approximation presented here.

MCMC Simulations

These is extremely easy to simulate this mean field Potts model since we only need to keep track of the ratio {L=(L_1, \ldots, L_q)} to have an accurate picture of the system. For example, a typical Markov Chain Monte Carlo approach would run as follows:

  • choose a particle {x_i} uniformly at random in {\{1,2, \ldots, N\}}
  • try to switch its value uniformly in {\{1,2, \ldots, q\} \setminus \{x_i\}}
  • compute the Metropolis ratio
  • update accordingly.

If we do that {10^5} times for {q=3} states at inverse temperature {\beta=1.5} and for {100} particles (which is fine since we only need to keep track of the {3}-dimensional ratio vector) and plot the result in barycentric coordinates we get a picture that looks like:
Potts Model MCMC

Here I started with a configuration where all the particles were in the same states i.e ratio vector equal to {(1,0,0)}. We can see that even with {10^5} steps, the algorithm struggles to go from one most probable position {(a,b,b)} to the other two {(b,a,b)} and {(b,b,a)} – in this simulation, one of the most probable state has even not been visited! Indeed, this approach was extremely naive, and this is quite interesting to try to come up with better algorithms. Btw, Christian Robert’s blog has tons of interesting stuffs related to MCMC and how to boost up the naive approach presented here.

Challenge – Announcement

Have you already heard about the 17×17 Challenge ? It all started 7 months ago, and still no solution has been found. I have the feeling though that no really serious attempt at solving the problem has been done, even if very encouraging results have been found. It could be very interesting to keep track of the progress made so that results will be published here. Send what you have reached so far!

In the next few weeks, I will try to write about the Monte Carlo approaches described in the excellent book “Monte Carlo Strategies in Scientific Computing” by Jun Liu: I do not think that one can easily solve the problem with a direct implementation of these methods, but I want to know more about it. These are all more or less Markov Chain Monte Carlo Methods: I have tried the “parallel tempering” method, which gives results that are not that bad, and was planning to check the “flat histogram” method. “Simulated annealing” methods have succesfully been used to solve the Sudoku problem though: see Christian Robert blog for example. I am also planning to continue my reading of the fascinating book “Information, Physics, and Computation” by Marc Mezard: methods described in this book (belief propagation, k-SAT problem, etc…) are more adapted to the 17×17 challenge I think, and I am quite optimistic about it.



Best Solution so far, by Rohan Puttagunta: only 4 monochromatic rectangles!

Read also here,here,here,here and there – and have a try with this on-line simulator !