To obtain a fairer qualitative comparison of the Matusita measure and the KL divergence, we trained a VLMM tree on a much larger text. The training text is 70000 characters long and is a compilation of journalistic-style news articles.
*(Figure 6.8: VLMM tree learned with the KL divergence)*

*(Figure 6.9: VLMM tree learned with the Matusita distance)*
Figures 6.8 and 6.9 show the trees learned by a VLMM using the KL divergence and the Matusita distance, respectively.
We can see that the two trees look similar. However, the tree learned with the Matusita distance has a depth of 6, whereas the KL divergence gives a tree of depth 3. The Matusita distance is therefore able to encode more history in the tree without using many more nodes. It encodes not only fragments of words but also short words such as "said", "and", "the", "of" or "to". The KL divergence does not find whole words in the text; it only encodes parts of words.
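To make the comparison concrete, the two measures can be computed on a pair of next-character distributions. The sketch below is illustrative only (the function names and the toy distributions are ours, not taken from the implementation described here); it shows the standard definitions: the KL divergence $\sum_i p_i \log(p_i/q_i)$ and the Matusita distance $\sqrt{\sum_i (\sqrt{p_i} - \sqrt{q_i})^2}$.

```python
import math

def kl_divergence(p, q):
    # KL divergence D(p || q) between two discrete distributions.
    # Terms with p_i = 0 contribute nothing by convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def matusita_distance(p, q):
    # Matusita distance: Euclidean distance between the
    # square-rooted probability vectors.
    return math.sqrt(sum((math.sqrt(pi) - math.sqrt(qi)) ** 2
                         for pi, qi in zip(p, q)))

# Toy next-character distributions for a context and its extension.
p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

print(kl_divergence(p, q))      # asymmetric, unbounded
print(matusita_distance(p, q))  # symmetric, bounded by sqrt(2)
```

Note that the Matusita distance is a symmetric, bounded metric (at most $\sqrt{2}$), while the KL divergence is asymmetric and unbounded; this difference in behaviour affects which candidate contexts pass the splitting threshold during tree growth.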
The fact that humans split letters into words suggests that a word is a coherent unit of information: the letters at the boundaries of a word are less strongly linked to the neighbouring words than to the word itself. This suggests that the Matusita measure is better at detecting the natural segmentation of a sequence, and hence better suited to learning variable-length Markov models.