Comparison of trees built using the Matusita measure with trees built using the Kullback-Leibler divergence for large texts

To obtain a sounder qualitative comparison of the Matusita measure and the KL divergence, we trained a VLMM tree on a much larger text: a compilation of journalistic-style news articles, 70000 characters long.
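For reference, the two measures being compared can be written in their standard textbook forms as below. This is only a reminder under stated assumptions: the symbols $P$, $Q$ and $x$ are introduced here for illustration (two probability distributions over the next symbol $x$), and the exact form and normalisation used for the VLMM may differ.

$$D_{\mathrm{KL}}(P\,\|\,Q)=\sum_{x}P(x)\,\log\frac{P(x)}{Q(x)},\qquad d_{M}(P,Q)=\sqrt{\sum_{x}\left(\sqrt{P(x)}-\sqrt{Q(x)}\right)^{2}}$$

The KL divergence is asymmetric and unbounded, whereas the Matusita distance is symmetric and never exceeds $\sqrt{2}$.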

**Figure 6.10:** VLMM tree learnt using the maximum likelihood estimation of probability and the KL divergence. $\epsilon$ is set to 0.003 and only the probabilities that are greater than 0.003 are shown on the graph.

**Figure 6.11:** VLMM tree learnt using the maximum likelihood estimation of probability and the Matusita distance. $\epsilon$ is set to 0.003 and only the probabilities that are greater than 0.003 are shown on the graph.

Figures 6.10 and 6.11 show the trees learnt for a VLMM using the KL divergence and the Matusita distance, respectively.

The two trees look broadly similar, but they differ in depth: the tree learnt with the Matusita distance has a depth of six, whereas the KL divergence gave a tree with a depth of three. The method using the Matusita distance is therefore able to encode more history in the tree without using many nodes. It encodes small parts of words, but also short words such as "said", "and", "the", "of" or "to". The method using the KL divergence is not able to find words in the text; it only encodes parts of words.
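To make the comparison concrete, here is a minimal Python sketch of how the two measures could be evaluated between the next-character distribution of a context and that of its parent (the same context with its oldest character dropped). It assumes maximum-likelihood estimates and the standard textbook definitions of the KL divergence and the Matusita distance. The function names, the toy text and the child-versus-parent comparison are illustrative assumptions, not the procedure actually used to grow the trees in Figures 6.10 and 6.11.

```python
import math
from collections import Counter, defaultdict


def next_char_distributions(text, max_order):
    """Maximum-likelihood next-character distributions for every
    context of length 0..max_order observed in `text` (hypothetical helper)."""
    counts = defaultdict(Counter)
    for i in range(len(text)):
        for k in range(max_order + 1):
            if i - k < 0:
                break
            # Context = the k characters immediately preceding position i.
            counts[text[i - k:i]][text[i]] += 1
    dists = {}
    for ctx, cnt in counts.items():
        total = sum(cnt.values())
        dists[ctx] = {c: n / total for c, n in cnt.items()}
    return dists


def kl_divergence(p, q):
    """KL(p || q) in nats; requires q[c] > 0 wherever p[c] > 0."""
    return sum(pi * math.log(pi / q[c]) for c, pi in p.items() if pi > 0)


def matusita_distance(p, q):
    """Matusita distance: Euclidean distance between square-root probabilities."""
    symbols = set(p) | set(q)
    return math.sqrt(sum(
        (math.sqrt(p.get(c, 0.0)) - math.sqrt(q.get(c, 0.0))) ** 2
        for c in symbols
    ))


def compare_to_parent(dists, context):
    """Return (KL, Matusita) between a context and its parent context.

    Assumption: the parent is the context with its oldest character removed.
    Every continuation seen after the child is also seen after the parent,
    so the KL divergence is always finite here.
    """
    p, q = dists[context], dists[context[1:]]
    return kl_divergence(p, q), matusita_distance(p, q)


if __name__ == "__main__":
    toy_text = "the cat said that the dog and the cat sat"  # illustrative only
    dists = next_char_distributions(toy_text, max_order=3)
    kl, mat = compare_to_parent(dists, "th")
    print(f"context 'th': KL = {kl:.4f}, Matusita = {mat:.4f}")
```

In a sketch like this, a context would be kept as a separate node only if its measure with respect to the parent exceeds a threshold such as $\epsilon$. Because the two measures scale differently, the same threshold value can retain different sets of contexts.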

The fact that humans group letters into words suggests that a word is a coherent unit of information, and that the letters at the ends of a word are less strongly linked to the neighbouring words than to the word itself. This suggests that the method using the Matusita measure is better at representing natural groupings within sequences.

franck 2006-10-01