


Time Varying Active Appearance Model



Author: Franck Bettinger
Supervisor: Tim Cootes
Communication plays an important role in our society. It takes many forms, but face-to-face interaction remains the most important. Human-computer interfaces should therefore mimic face-to-face communication in order to achieve natural interaction with a human user.

Building a human-computer interface based on visual cues requires several stages. The first stage of an ideal human-computer interface is a tracking system able to locate the face of the user whenever it is required. Tracking is a challenging problem because of the variability of the expressions a face can show. Furthermore, in order to provide useful information to the later stages, it has to be precise and robust. Facial hair, glasses, occlusions and sensor noise all make this task difficult.

The second stage of a human-computer interface is the analysis of the user's face sequence in order to extract information such as facial expression or the direction in which the user is looking. This information can then be combined to deduce the state of the user and decide how the computer should react. In a third stage, the computer has to synthesise a virtual face that appears to react in accordance with that decision.

The aim of this project is to construct a model of face sequences that can reproduce natural behaviours. Such a model can be used to create a realistic moving face that reacts to the user in a human-computer interface.

The model that we have developed is an extension of the active appearance model (AAM) designed by Cootes, Edwards and Taylor [1]. The AAM is a statistical model used to describe an object in a still image by learning its shape and appearance from a database of hand-labelled images of similar objects. It is built from two sub-models.

The first sub-model is the statistical shape model. A principal component analysis (PCA) is applied to the set of hand-labelled shapes (see figure 1) to extract the mean shape and its main modes of variation (see figure 2).

Figure 1: Example of a hand labelled face. The labels placed by hand are represented by dark points.
[Figure: franck_marked.eps]

Figure 2: Example of the first mode of variation of a shape of a face.
[Figure: mode1.eps]
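
As a rough sketch of how such a shape model can be built (illustrative code, not the thesis implementation), assume each hand-labelled shape has been aligned and flattened into a vector (x1, y1, ..., xn, yn); the function names and the retained-variance threshold are assumptions.

import numpy as np

def build_shape_model(shapes, var_kept=0.95):
    # PCA on a (num_shapes, 2 * num_points) array of landmark vectors.
    mean_shape = shapes.mean(axis=0)
    cov = np.cov(shapes - mean_shape, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]               # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Keep enough modes to explain the requested fraction of the variance.
    n_modes = int(np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), var_kept)) + 1
    return mean_shape, eigvecs[:, :n_modes], eigvals[:n_modes]

def synthesise_shape(mean_shape, modes, b):
    # Reconstruct a shape from model parameters b: x = mean + P b.
    return mean_shape + modes @ b

Varying one component of b between plus and minus a few standard deviations (the square roots of the corresponding eigenvalues) produces modes of variation like the one shown in figure 2.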

The second sub-model is a shape-free model of the texture. The textures of the objects in the training images are warped to fit the mean shape of the first sub-model. A PCA is then applied to find the mean and the modes of variation of the set of shape-free textures. The two sub-models are then combined, and a further PCA is used to reduce the dimensionality of the model.
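
As a sketch of how the sub-models might be combined, the same PCA routine can be reused on the warped texture vectors; here the warping is assumed to have been done already, and the weight w stands in for the scaling that makes shape parameters commensurate with texture intensities.

import numpy as np

def build_combined_model(shapes, warped_textures, w=1.0):
    # Shape and texture sub-models (build_shape_model as sketched above).
    s_mean, s_modes, _ = build_shape_model(shapes)
    g_mean, g_modes, _ = build_shape_model(warped_textures)
    # Project every training example into the two parameter spaces.
    b_s = (shapes - s_mean) @ s_modes
    b_g = (warped_textures - g_mean) @ g_modes
    # Weight the shape parameters, concatenate, and apply a further PCA.
    combined = np.hstack([w * b_s, b_g])
    return build_shape_model(combined)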

An iterative search algorithm based on the differences between synthesised and real images can be applied to the resulting model in order to fit it to new images (see figure 3). Thus each face can be approximated by a set of AAM parameters and, conversely, each set of AAM parameters can be synthesised back into a face. The drawback of this search is that it requires a good initial estimate of the location and shape of the face.

Figure 3: Example of application of the AAM search algorithm.
[Figure: aamsearch.eps]
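
In the published AAM scheme, a matrix mapping texture residuals to parameter updates is learned offline by perturbing the model on the training set. The sketch below assumes such a matrix R and hypothetical helpers sample_texture and model_texture; the damping steps are illustrative.

import numpy as np

def aam_search(image, c, R, n_iters=30):
    # Iteratively refine AAM parameters c to fit the face in image.
    for _ in range(n_iters):
        r = sample_texture(image, c) - model_texture(c)    # texture residual
        improved = False
        for step in (1.0, 0.5, 0.25):                      # damped update steps
            c_new = c - step * (R @ r)
            r_new = sample_texture(image, c_new) - model_texture(c_new)
            if np.linalg.norm(r_new) < np.linalg.norm(r):
                c, improved = c_new, True
                break
        if not improved:
            break                                          # no further improvement
    return c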

In order to track faces, we developed a semi-automatic framework. We first mark up faces in chosen frames of the training video sequence. In particular, we mark up the first frame so that the algorithm knows where to begin the tracking. We use these marked-up frames to build an AAM of the face. For each subsequent frame, we use the iterative search algorithm to find the location of the face, using its location in the previous frame as the initial estimate. If the search fails, we mark up the face in that frame, add it to the training set, and restart the tracking.
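
The tracking loop itself can then be sketched as follows, reusing aam_search from above; fit_quality and its threshold are assumptions standing in for whatever failure test is used.

import numpy as np

def track(frames, c_first, R, quality_threshold):
    params = [c_first]                   # fit of the hand-marked first frame
    for frame in frames[1:]:
        # The fit from the previous frame serves as the initial estimate.
        c = aam_search(frame, params[-1], R)
        if fit_quality(frame, c) < quality_threshold:
            raise RuntimeError("search failed: mark up this frame, add it "
                               "to the training set and restart the tracker")
        params.append(c)
    return np.array(params)              # trajectory in AAM parameter space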

The tracking returns a stream of AAM parameters that can be seen as a trajectory in the AAM parameter space. The idea of the model is to split this trajectory into sub-trajectories, group the sub-trajectories, and finally learn the temporal relationships between those sub-trajectory groups. The process can be reversed by using the learned model to generate a new sequence of sub-trajectory groups, which is then used to synthesise a new trajectory in the AAM parameter space that can in turn be synthesised back into a video sequence of a face (see figure 4).

Figure 4: Overview of the model. P is the AAM parameter space. Arrows from left to right represent the learning and arrows from right to left represent the generation of a new video sequence.
[Figure: frameworkoverview.eps]

In order to split the trajectory into meaningful sub-trajectories, we adopt the following method. First we segment the trajectory by selecting nodes that will represent the beginnings and ends of the sub-trajectories. These nodes are chosen to be points of high density in the AAM parameter space.
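
One plausible reading of this step, using a Gaussian kernel density estimate over the trajectory points; the choice of density estimator and the quantile threshold are assumptions, not the thesis method.

import numpy as np
from scipy.stats import gaussian_kde

def split_trajectory(traj, quantile=0.9):
    # Cut a (num_frames, dim) trajectory at points of high density.
    density = gaussian_kde(traj.T)(traj.T)
    cuts = np.flatnonzero(density >= np.quantile(density, quantile))
    # Each consecutive pair of high-density nodes delimits one sub-trajectory.
    return [traj[a:b + 1] for a, b in zip(cuts[:-1], cuts[1:]) if b > a]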

We then cluster the resulting set of sub-trajectories to obtain the sub-trajectory groups, using a greedy algorithm we developed. Initially, each sub-trajectory forms a group by itself. We then iteratively merge the pair of groups whose merge keeps the variance of the resulting group lowest, and stop when a given number of clusters is reached.
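
A minimal sketch of the greedy merging, assuming the sub-trajectories have been resampled to a common length so that the variance of a group is well defined:

import numpy as np

def greedy_cluster(subtrajs, n_clusters):
    groups = [[s] for s in subtrajs]         # one group per sub-trajectory
    while len(groups) > n_clusters:
        best = None
        # Find the pair whose merge gives the lowest total variance.
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                var = np.stack(groups[i] + groups[j]).var(axis=0).sum()
                if best is None or var < best[0]:
                    best = (var, i, j)
        _, i, j = best
        groups[i].extend(groups[j])          # merge the cheapest pair
        del groups[j]
    return groups

Re-evaluating every pair at each step is simple but expensive; caching pairwise merge costs would make this practical for longer sequences.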

Sequences of sub-trajectory groups are learned with a variable length Markov model (VLMM). The VLMM models the probability distribution of the prototypes given the preceding prototypes in a sequence. To do this efficiently, the probabilities are stored in a tree; probabilities that are too small, or that belong to prototypes that do not bring enough information, are not stored.
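
A much-simplified sketch of the idea: count contexts up to a maximum depth and discard those seen too rarely. A real VLMM would also prune contexts that add too little information (for example via a Kullback-Leibler criterion), which is omitted here.

from collections import defaultdict

def train_vlmm(sequence, max_depth=4, min_count=2):
    counts = defaultdict(lambda: defaultdict(int))
    for t in range(len(sequence)):
        for d in range(1, min(max_depth, t) + 1):
            context = tuple(sequence[t - d:t])   # the d preceding prototypes
            counts[context][sequence[t]] += 1
    # Keep only contexts seen often enough; normalise counts to probabilities.
    model = {}
    for context, nxt in counts.items():
        total = sum(nxt.values())
        if total >= min_count:
            model[context] = {s: c / total for s, c in nxt.items()}
    return model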

A new face sequence can be generated from this model by sampling a sequence of sub-trajectory groups according to the probabilities stored in the VLMM. A sub-trajectory is then sampled from each sub-trajectory group model, and the concatenation of these samples forms the generated trajectory in the AAM parameter space. A face can then be synthesised from the AAM parameters of each frame. Figure 5(b) shows a face sequence generated by this framework.
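
Generation can be sketched as follows, always backing off to the longest stored context matching the recent history; sample_subtrajectory, which draws a sub-trajectory from a group model, is assumed rather than taken from the source.

import random
import numpy as np

def generate(model, seed, n_steps, max_depth=4):
    labels = list(seed)
    for _ in range(n_steps):
        dist = None
        for d in range(min(max_depth, len(labels)), 0, -1):
            dist = model.get(tuple(labels[-d:]))
            if dist:
                break                        # longest matching context wins
        if not dist:
            break                            # no stored context applies
        groups, probs = zip(*dist.items())
        labels.append(random.choices(groups, weights=probs)[0])
    # One sampled sub-trajectory per group label, concatenated in time.
    return np.concatenate([sample_subtrajectory(g) for g in labels])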

Figure 5: Figure 5(a) shows frames extracted from the video sequence used to train the model; figure 5(b) shows frames extracted from the sequence generated by the model.
[Figure 5(a): training sequence of a face gesturing "no" (tfmovall.eps)]
[Figure 5(b): generated sequence (gfmovall.eps)]

Our plan for the next year is to improve this model (for instance, smooth links between generated sub-trajectories are required, and the model should take this into account). We also want to use this behavioural model to simulate the interaction between two people in conversation; to do so, we will model the joint probability distribution of two AAMs, one for each speaker. Finally, we will evaluate the performance of the model.




