AlgLab: An Open Laboratory for Experiments On Algorithms


MarkovLab: Markov Chain Monte Carlo Text Generation

The Algorithm

(This is a synopsis of a discussion in Chapter 3: What to Measure, of A Guide to Experimental Algorithmics, referred to here as the Guide.) The Markov Chain Monte Carlo (MCMC) Text Generation algorithm takes the following parameters as input:

  T: the sample input text, containing n words.
  m: the number of output words to generate.
  k: the phrase size, a small integer.

The algorithm generates a "real-looking" random text of m words, according to word frequencies calculated from T. Parameter k is a small integer specifying the phrase size, explained below.

The algorithm starts by building a table of all k-word phrases in the text. For example, suppose k=2 and the text is

This is a test, this is only a test. This is a test of the emergency 
broadcasting system. 

In this text the phrases "this is" and "a test" each appear 3 times, "is a" appears 2 times, and so forth.

Once the table is built, the algorithm sets a string variable phrase equal to the first k words in the text, and prints them. It then generates the next m-k words as follows:

  1. Find all copies of phrase in the table. For example, "this is" has three copies in the table.
  2. Select one of the copies uniformly at random. Print the successor word that comes after that copy of the phrase in the text. For example, the three successor words for "this is" are: a, only, and a. With probability 2/3, a is printed next, and with probability 1/3, only is printed next.
  3. Remove the first word from phrase and append the selected successor word. For example, with probability 2/3 the new phrase is "is a".
  4. Go to step 1, and repeat until m words have been printed.
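The steps above can be sketched in Python as follows. This is a minimal illustration, not Bentley's C implementation from the book; all names are ours, and case is folded so that "This is" and "this is" count as the same phrase, as in the example.

```python
import random

def markov_generate(text, m, k, seed=None):
    """Generate up to m words of 'real-looking' text from the k-word
    phrase statistics of text.  Illustrative sketch of the algorithm
    described above."""
    rng = random.Random(seed)
    words = text.lower().split()          # fold case, as in the example
    n = len(words)
    # Build the table of all k-word phrases and their successor words.
    successors = {}
    for i in range(n - k):
        successors.setdefault(tuple(words[i:i + k]), []).append(words[i + k])
    # Start with the first k words of the text, then generate m - k more.
    phrase = tuple(words[:k])
    output = list(phrase)
    for _ in range(m - k):
        choices = successors.get(phrase)
        if not choices:                   # phrase occurs only at the very end
            break
        word = rng.choice(choices)        # steps 1-2: pick a copy at random,
        output.append(word)               # print its successor word
        phrase = phrase[1:] + (word,)     # step 3: slide the phrase window
    return " ".join(output)
```

Running it on the example text with k=2 always starts with "this is", and the third word is "a" with probability 2/3 and "only" with probability 1/3, matching the calculation above.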

Modeling computation time. Chapter 3 surveys several choices of performance indicators for measuring the time performance of this algorithm, including the number of word comparisons; the number of character comparisons; the number of instructions executed in the program; profiling information from gprof; CPU times; and wall-clock times.

A performance model is a function that is built to predict performance in terms of the parameters T, n, m, and k. The accuracy and precision of the performance model depend on the choice of performance indicator.


Resources and Links

In AlgLab

External Resources


Projects

Here are some suggestions for experimental projects using the MCMC algorithm.
  1. Build and evaluate new models, starting from the C implementation of the algorithm, written by Jon Bentley, that is described in the book.
    1. Run experiments to extend the word comparison model to cases where k>1.
    2. Add another term p(k,T,n,m) to the word-comparison or character-comparison model that gives a tighter prediction of the cost of the Random Selection step. For example, it might be based on the average (or maximum?) number of phrase duplicates in T. You will have to write a program that measures this property in T (preferably in O(n) time or less).
    3. Develop predictive models for non-English prose, poetry or song lyrics, or files of source code.
    4. Use the procedure in Chapter 3 to develop models to predict CPU time on your favorite operating system. How closely can you predict times for the target experiment?
    5. Develop a character-comparison cost model that takes input parameters n, m and k in units of characters instead of words. How does it compare to the word-comparison and character-comparison models?
    Evaluate the accuracy and precision of your models: first, using the same text but different values of n, m, k within the range of your experiments; second, using values of n, m, k outside the range of your experiments; third, using different text files T. How do these variations affect the accuracy and precision of your model? How does your choice of performance indicator affect the accuracy and precision of your model? What would you do next to improve it?
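The phrase-duplicate property suggested in project 1.2 can be measured in O(n) expected time with a hash table. A sketch (the function name and the choice of average and maximum counts are ours, not from the Guide):

```python
from collections import Counter

def phrase_duplicate_stats(text, k):
    """Average and maximum number of occurrences of each distinct
    k-word phrase in text -- one candidate basis for the extra term
    p(k,T,n,m).  Runs in O(n) expected time using a hash table."""
    words = text.lower().split()
    counts = Counter(tuple(words[i:i + k])
                     for i in range(len(words) - k + 1))
    avg = sum(counts.values()) / len(counts)
    return avg, max(counts.values())
```

On the example text with k=2 the maximum is 3 (the phrase "this is"), while the average over the 14 distinct phrases is 17/14.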

  2. Evaluate performance indicators.
    1. Compare CPU times and elapsed times (clock ticks and instruction cycles) on your favorite system. How much variation can you observe?
    2. Compare CPU times and elapsed times on two different systems. Can you map measurements on one system to predictions on another?
    3. Use tools like gprof and Cachegrind to build a link between instruction counts and CPU times. How much variation do you observe on different systems?
    4. Is the memory hierarchy a factor here? Does time-per-instruction depend on the memory footprint of the program?
    5. Does the cost of the Random Selection operation depend on the fact that the caches were previously "warmed up" by binary search? Devise experiments to measure random selection times, with and without binary search. Is there a difference?
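For project 2.1, Python's clock functions make the CPU-versus-elapsed-time comparison easy to sketch: time.process_time measures CPU time, while time.perf_counter measures wall-clock time. The workload below is a synthetic phrase-table build of our own devising, not code from the book.

```python
import time

def time_workload(work, trials=5):
    """Run work() several times; return lists of CPU and wall-clock
    times so run-to-run variation can be observed."""
    cpu, wall = [], []
    for _ in range(trials):
        c0, w0 = time.process_time(), time.perf_counter()
        work()
        cpu.append(time.process_time() - c0)
        wall.append(time.perf_counter() - w0)
    return cpu, wall

# Example workload: build a 2-word phrase table over a synthetic text.
words = ["w%d" % (i % 997) for i in range(200_000)]
table = {}

def work():
    table.clear()
    for i in range(len(words) - 2):
        table.setdefault((words[i], words[i + 1]), []).append(words[i + 2])

cpu, wall = time_workload(work)
# One simple measure of variation: spread relative to the fastest run.
spread = (max(wall) - min(wall)) / min(wall)
```

Comparing the cpu and wall lists (and the spread statistic) across repeated runs, loaded and idle machines, and different systems gives a first look at how much these indicators vary.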

  3. Compare implementations. Download an alternative implementation of the MCMC algorithm (see the resource list below).
    1. Build new performance models for various performance indicators for this implementation. Is one implementation easier to predict than another?
    2. Which performance indicators would be best for comparing the sorted-array and hash-table implementations? Which performance indicators would be best for tuning one of these implementations? Which performance indicators would you use to compare implementations in different languages?