We repeated the experiment using the KL divergence to compare probability densities. Figures 6.7, 6.8 and 6.9 show the resulting trees.
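For reference, the comparison uses the standard KL divergence between two discrete distributions $P$ and $Q$ over the same alphabet (the definition is stated here for convenience; the notation is an assumption, as the text does not repeat it):

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$

It is non-negative and equals zero only when $P = Q$, which makes it a natural measure of how much a longer context changes the predicted distribution.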
The probability-estimation method appears to have less influence on the result here. With the maximum-likelihood estimate, the tree does not change significantly compared with the corresponding case in the previous section, whereas the other two trees have grown.
The aim of the variable-length Markov model is to reduce the number of links needed to model the probability distribution. The size of the tree directly affects learning: the more nodes the tree contains, the more nodes the learning algorithm must examine. It is therefore possible that the KL divergence yields a less efficient tree.
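The trade-off above can be made concrete with a minimal sketch of how a KL-based criterion might decide whether a longer context earns its own node. The function names, the smoothing constant, and the threshold are all illustrative assumptions, not the thesis's actual implementation:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """Discrete KL divergence D(P || Q) over a shared alphabet.

    `eps` guards against zero probabilities in Q; this smoothing
    choice is an assumption, not taken from the text.
    """
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)

def keep_child(child_dist, parent_dist, threshold=0.05):
    """Hypothetical pruning rule: keep the longer context only if
    its next-symbol distribution diverges enough from the parent's.
    A low threshold keeps more nodes, giving a larger, slower tree."""
    return kl_divergence(child_dist, parent_dist) > threshold

parent = [0.5, 0.3, 0.2]   # P(next symbol | shorter context)
child = [0.7, 0.2, 0.1]    # P(next symbol | longer context)
print(keep_child(child, parent))
```

Under this sketch, a lower threshold admits more contexts into the tree, which is one way a divergence-based criterion can produce the larger, less efficient trees described above.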
Given the small amount of data used to construct these trees, the KL divergence may have produced such trees in the last two cases simply because the text has been over-learnt. To verify that this is not the case, a further experiment has been carried out.