The idea behind the learning algorithm is that different sequences require memories of different lengths to predict the next element. For instance, if the learning has been done on many instances of sequences such as {a,c,b,d,e,f}, {e,c,b,d,e,f} or {d,c,b,d,e,f}, then we would like to use a memory of four letters, {c,b,d,e}, in order to predict the next letter "f". If only sequences like {a,l,k,j,e,f}, {s,d,l,l,e,f} or {s,l,o,a,e,f} occur, then we want to use only the information necessary to predict "f", namely the one-letter sequence {e}. The prediction is thus made with a context of variable length, which allows us not to model the letters of the sequence that are not decisive for the prediction.
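To make this idea concrete, here is a minimal sketch in Python; it is not the algorithm developed in this chapter, only an illustration under assumptions of our own. It counts, over a set of training sequences, how often each letter follows each context up to a bounded length, and then predicts from the shortest context whose most frequent successor is already decisive enough. The function names, the depth bound max_depth and the threshold are illustrative choices, not taken from the thesis.

from collections import defaultdict

def count_contexts(sequences, max_depth=4):
    # For every context of length 0..max_depth, count how often each letter follows it.
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(len(seq)):
            for d in range(max_depth + 1):
                if i - d < 0:
                    break
                context = tuple(seq[i - d:i])     # the d letters immediately before seq[i]
                counts[context][seq[i]] += 1
    return counts

def predict(counts, history, max_depth=4, threshold=0.8):
    # Use the shortest suffix of `history` whose most frequent successor reaches
    # `threshold`; the threshold is an assumed criterion, not the one of the thesis.
    guess, used = None, ()
    for d in range(min(max_depth, len(history)) + 1):
        context = tuple(history[len(history) - d:])
        dist = counts.get(context)
        if not dist:
            continue
        total = sum(dist.values())
        letter, n = max(dist.items(), key=lambda kv: kv[1])
        guess, used = letter, context
        if n / total >= threshold:
            break
    return guess, used

sequences = [list("acbdef"), list("ecbdef"), list("dcbdef"),
             list("alkjef"), list("sdllef"), list("sloaef")]
counts = count_contexts(sequences)
print(predict(counts, list("xcbde")))   # ('f', ('e',)): one letter of memory is enough here
print(predict(counts, list("xxcbd")))   # ('e', ('b', 'd')): a longer memory is needed here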
More formally, we want to learn a model of the probability distribution of strings in the sequence, as suggested by figure 5.1(b), but with a minimum number of nodes. The idea is to build a tree that assigns the same probability values on the learning sequence as a previously chosen probability measure. Section 5.3.3 describes the probability measures that we have tried on a simplified example and how they performed. For now, we simply fix a notation to denote this probability measure.
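The following sketch, again only illustrative and not the construction of section 5.3.3, builds such a context tree from a learning sequence and then prunes every leaf whose next-letter distribution is (nearly) identical to that of its parent, so that the remaining tree reproduces the same predictions with fewer nodes. The total-variation criterion and the tolerance eps are assumed stand-ins for the probability measures compared later.

from collections import defaultdict

class Node:
    def __init__(self):
        self.children = {}               # preceding letter -> Node (context grows toward the past)
        self.counts = defaultdict(int)   # next-letter counts observed for this context

    def dist(self):
        total = sum(self.counts.values())
        return {a: n / total for a, n in self.counts.items()} if total else {}

def build_tree(sequence, max_depth=4):
    # Insert every context of length <= max_depth seen in the learning sequence.
    root = Node()
    for i, letter in enumerate(sequence):
        node = root
        node.counts[letter] += 1
        for d in range(1, max_depth + 1):
            if i - d < 0:
                break
            node = node.children.setdefault(sequence[i - d], Node())
            node.counts[letter] += 1
    return root

def prune(node, eps=0.05):
    # Remove leaves whose next-letter distribution adds nothing over their parent's.
    for past, child in list(node.children.items()):
        prune(child, eps)
        p, q = node.dist(), child.dist()
        tv = 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in set(p) | set(q))
        if not child.children and tv < eps:
            del node.children[past]
    return node

def size(node):
    return 1 + sum(size(c) for c in node.children.values())

seq = list("acbdefecbdefdcbdef")
print(size(build_tree(seq)), "nodes before pruning,", size(prune(build_tree(seq))), "after")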