We now want to learn the structure of the sequence of sub-trajectory groups. To do so, we use the variable-length Markov model introduced by Ron et al. [11] in the context of learning letter sequences in English texts. The idea of this model is to use a memory whose length varies with the context. In particular, when a trajectory splits into two trajectories, we need to know where the sub-trajectories come from in order to decide which sub-trajectory to infer next; a memory of a long sequence of sub-trajectory groups is needed in that case. On the other hand, if group $g_1$ is always followed by group $g_2$, a short memory suffices (stating that if the current group is $g_1$, then $g_2$ is next).
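The variable-length memory idea can be illustrated with a minimal sketch (the class and symbol names here are invented for illustration, not the authors' implementation): contexts of different lengths map to next-symbol counts, and prediction uses the longest context that matches the recent history.

```python
from collections import defaultdict

class VLMMSketch:
    """Toy variable-length Markov predictor (illustrative only)."""

    def __init__(self, max_order=4):
        self.max_order = max_order
        # context (tuple of past symbols) -> {next symbol: count}
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, sequence):
        # Record, for every position, which symbol followed each
        # context of length 0..max_order ending at that position.
        for i in range(len(sequence)):
            for k in range(self.max_order + 1):
                if k > i:
                    break
                context = tuple(sequence[i - k:i])
                self.counts[context][sequence[i]] += 1

    def predict(self, history):
        # The longest suffix of the history seen during training decides
        # the next symbol: an ambiguous split needs a long context, while
        # a deterministic "g1 is always followed by g2" needs only length 1.
        for k in range(min(self.max_order, len(history)), -1, -1):
            context = tuple(history[len(history) - k:])
            if context in self.counts:
                dist = self.counts[context]
                return max(dist, key=dist.get)
        return None

model = VLMMSketch(max_order=3)
model.train(list("ABCABDABCABD"))
print(model.predict(list("ABC")))  # "ABC" was always followed by "A"
```

Here the symbols play the role of sub-trajectory group labels; the sketch omits the probability threshold and the statistical test on the tree discussed next.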
The probabilities of sequences of sub-trajectory groups are stored in a tree, each branch of which corresponds to a sequence. The tree is constructed recursively by adding sequences. A sequence $s$ is added to the tree if and only if its probability exceeds a previously chosen threshold and adding $s$ yields a statistically different tree. Two trees are said to be statistically different if the probability distributions they encode, weighted by the probability of the sequence $s$, differ. We therefore need a measure of distance between probability distributions. The Kullback-Leibler divergence is usually used with the variable-length Markov model. However, when testing the model on the generation of English texts, we found that the Matusita measure gives better results on long texts (about 33% of letters were successfully predicted, against 26% for the Kullback-Leibler divergence). The Matusita measure is given by: