Creating a convincing synthetic talking head is a very ambitious project, well beyond the scope of a single PhD. This thesis demonstrates some progress towards such a goal, but much further work remains to be done.
The psychophysical experiment suggests that, although our method generally outperforms a relatively straightforward alternative such as an autoregressive process, it is not yet good enough to be indistinguishable from the original video; clearly, further work is required to close this gap. We would also like to compare our approach with recently described methods such as that of Hack and Taylor [40], who model facial behaviour using various modifications of hidden Markov models.
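For concreteness, the kind of autoregressive baseline referred to above can be sketched as a first-order linear process on the appearance parameter vectors, fitted by least squares. This is an illustrative sketch only: the function names are ours, and the actual baseline may differ in model order and noise model.

```python
import numpy as np

def fit_ar1(X):
    """Fit x_t ~ A x_{t-1} by least squares; X has shape (T, d).

    Returns the transition matrix A and the covariance of the
    fitting residuals, used as the driving-noise covariance.
    """
    past, future = X[:-1], X[1:]
    # Solve past @ W = future in the least-squares sense; A = W.T
    W, *_ = np.linalg.lstsq(past, future, rcond=None)
    residuals = future - past @ W
    return W.T, np.cov(residuals.T)

def synthesise(A, cov, x0, steps, rng=None):
    """Generate a new trajectory by iterating the fitted process."""
    rng = rng if rng is not None else np.random.default_rng(0)
    out = [np.asarray(x0, dtype=float)]
    for _ in range(steps):
        noise = rng.multivariate_normal(np.zeros(len(x0)), cov)
        out.append(A @ out[-1] + noise)
    return np.array(out)
```

Such a model captures only short-range, linear temporal structure, which is one plausible reason it is outperformed by a method that models longer-range behaviour explicitly.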
One of the major drawbacks of our approach is the number of thresholds that must be set by the user. Suitable threshold values usually depend on the data to be processed: for instance, as explained in section 5.2.2, if the parameter is too small or too large for the dataset, we risk not selecting enough nodes to split the trajectory in the appearance parameter space.
A method for selecting these parameters automatically would remove this reliance on manual tuning.
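One possible direction is to sweep candidate thresholds and keep the value that best trades reconstruction error against model size. The sketch below assumes a simple stand-in node-selection rule (local direction change) and a hypothetical penalty weight `lam`; both are illustrative, not the criterion actually used in section 5.2.2.

```python
import numpy as np

def select_nodes(traj, threshold):
    """Keep points where the local direction change exceeds threshold."""
    nodes = [0]
    for i in range(1, len(traj) - 1):
        turn = np.linalg.norm((traj[i + 1] - traj[i]) - (traj[i] - traj[i - 1]))
        if turn > threshold:
            nodes.append(i)
    nodes.append(len(traj) - 1)
    return nodes

def reconstruction_error(traj, nodes):
    """RMS error of linear interpolation between the selected nodes."""
    approx = np.empty_like(traj)
    for a, b in zip(nodes[:-1], nodes[1:]):
        t = np.linspace(0.0, 1.0, b - a + 1)[:, None]
        approx[a:b + 1] = (1 - t) * traj[a] + t * traj[b]
    return np.sqrt(np.mean((traj - approx) ** 2))

def auto_threshold(traj, candidates, lam=0.01):
    """Pick the threshold minimising error + lam * (number of nodes)."""
    def cost(th):
        nodes = select_nodes(traj, th)
        return reconstruction_error(traj, nodes) + lam * len(nodes)
    return min(candidates, key=cost)
```

The penalty term discourages degenerate solutions in which a tiny threshold selects almost every point; more principled alternatives, such as cross-validation on held-out sequences, could also be investigated.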
So far, our model has only been used to synthesise faces. In principle, a good generative model should also aid tracking by predicting likely facial movements; adapting our model to this task remains to be done.
Clearly, a convincing talking head would have to integrate speech synthesis, and it would also need to observe a human user and, in some way, understand them so that it could react appropriately. Combining results from speech synthesis, speech recognition, behaviour modelling and, eventually, more complex artificial intelligence methods could be investigated to create a talking head that displays emotions believably.