Statistical shape models

Statistical shape models represent the structure of an object using a set of landmark points. They are trained on manually annotated images; the manual labelling is a way of incorporating human knowledge into the learning mechanism. Figure 3.2 shows an example of a hand-labelled image. The shape is described by a vector ${\bf x}$ that contains the coordinates of each point of the shape.

**Figure 3.2:** Example of a hand-labelled face. The points are placed manually on features such as the corners of the eyes and mouth, and along boundaries.
$\includegraphics[width=145mm,keepaspectratio]{anot.eps}$

The first step in building the statistical shape model is to align the shapes found in the training set. This is done with an approach called Procrustes analysis [29]. The algorithm is iterative and minimises the sum of squared distances from each shape to the mean shape. On completion, all the shapes have the same centre of gravity, scale and orientation.
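The alignment can be illustrated with a minimal sketch for 2-D shapes. It assumes each shape is a list of complex numbers (one `x + iy` value per landmark), which makes rotation a single complex multiplication. The function names `normalise`, `rotate_onto` and `procrustes`, and the fixed iteration count, are illustrative choices and not necessarily the procedure of [29].

```python
import math

def normalise(shape):
    """Move the centre of gravity to the origin and scale to unit norm."""
    centroid = sum(shape) / len(shape)
    centred = [p - centroid for p in shape]
    norm = math.sqrt(sum(abs(p) ** 2 for p in centred))
    return [p / norm for p in centred]

def rotate_onto(shape, reference):
    """Rotate `shape` to minimise its squared distance to `reference`."""
    # With complex landmarks, the optimal rotation is the phase of
    # sum(conj(shape_k) * reference_k).
    corr = sum(p.conjugate() * q for p, q in zip(shape, reference))
    rotation = corr / abs(corr)
    return [p * rotation for p in shape]

def procrustes(shapes, iterations=10):
    """Return the aligned shapes and their mean shape."""
    shapes = [normalise(s) for s in shapes]
    reference = shapes[0]  # assumption: first shape fixes the orientation
    mean = reference
    for _ in range(iterations):
        shapes = [rotate_onto(s, mean) for s in shapes]
        mean = [sum(pts) / len(shapes) for pts in zip(*shapes)]
        mean = rotate_onto(normalise(mean), reference)
    return shapes, mean
```

Re-normalising the mean and rotating it back onto a fixed reference at every iteration keeps the mean from shrinking or drifting, which is one common way to make the iteration well defined.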

The variation in shape is then estimated by applying principal component analysis (PCA) to the vectors representing the aligned shapes. The mean of these vectors is computed, where $s$ is the number of training shapes:

$\begin{displaymath} \overline{\bf x} = \frac{1}{s} \sum_{i=1}^s {\bf x}_i \end{displaymath}$

(2)

The covariance matrix of the aligned shapes is then computed:

$\begin{displaymath} {\bf S}=\frac{1}{s-1} \sum_{i=1}^s ({\bf x}_i-\overline{\bf x})({\bf x}_i-\overline{\bf x})^T \end{displaymath}$

(3)
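Equations (2) and (3) can be written out directly, as in the sketch below. It assumes the aligned shapes are stored as a list `X` of equal-length coordinate lists; the names `mean_shape` and `covariance` are illustrative. Plain Python is used to keep the example dependency-free, whereas a numerical library would normally be preferred for realistic shape sizes.

```python
def mean_shape(X):
    """Equation (2): component-wise mean of the s shape vectors."""
    s = len(X)
    return [sum(column) / s for column in zip(*X)]

def covariance(X):
    """Equation (3): unbiased covariance matrix of the shape vectors."""
    s = len(X)
    x_bar = mean_shape(X)
    n = len(x_bar)
    S = [[0.0] * n for _ in range(n)]
    for x in X:
        d = [xi - mi for xi, mi in zip(x, x_bar)]  # deviation from the mean
        for j in range(n):
            for k in range(n):
                S[j][k] += d[j] * d[k] / (s - 1)
    return S
```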

In order to decrease the dimensionality of the data, the eigenvectors $\Phi_i$ of ${\bf S}$ corresponding to the $t$ largest eigenvalues $\lambda_i$ (sorted in decreasing order) are chosen, as they explain most of the variation in the dataset. A threshold $f_v$ on the proportion of the total variance to retain is chosen beforehand (usually $95\%$ or $98\%$). We then compute $t$ as the smallest integer satisfying:

$\begin{displaymath} \sum_{i=1}^t\lambda_i\geq f_v \sum_{i=1}^s\lambda_i \end{displaymath}$

(4)
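The choice of $t$ in equation (4) amounts to a cumulative sum over the sorted eigenvalues. The sketch below assumes the eigenvalues have already been obtained from an eigen-decomposition of ${\bf S}$; the function name `choose_t` is illustrative.

```python
def choose_t(eigenvalues, f_v=0.95):
    """Smallest t whose leading eigenvalues explain a fraction f_v of the variance."""
    lams = sorted(eigenvalues, reverse=True)  # largest eigenvalue first
    total = sum(lams)
    running = 0.0
    for t, lam in enumerate(lams, start=1):
        running += lam
        if running >= f_v * total:
            return t
    return len(lams)
```

For instance, with eigenvalues `[6.0, 2.0, 1.0, 1.0]` the first mode alone explains 60% of the variance, so a threshold of 0.5 gives `t = 1`, while a threshold of 0.95 requires all four modes.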

If we define $\Phi=\left(\Phi_1,\ldots,\Phi_t\right)$ as the matrix whose columns are the $t$ selected eigenvectors, each vector ${\bf x}$ in the training set can be approximated by:

$\begin{displaymath} {\bf x} \approx \overline{\bf x} + \Phi {\bf b} \end{displaymath}$

(5)

where ${\bf b}$ is a $t$-dimensional vector of shape parameters given by:

$\begin{displaymath} {\bf b}=\Phi^T \left({\bf x}-\overline{\bf x}\right) \end{displaymath}$

(6)
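Equations (5) and (6) can be sketched as a pair of functions. The sketch assumes `modes` is a list holding the $t$ columns of $\Phi$ (orthonormal eigenvectors, each as a list of coordinates); the names `project` and `reconstruct` are illustrative.

```python
def project(x, x_bar, modes):
    """Equation (6): b = Phi^T (x - x_bar)."""
    d = [xi - mi for xi, mi in zip(x, x_bar)]
    return [sum(pj * dj for pj, dj in zip(phi, d)) for phi in modes]

def reconstruct(b, x_bar, modes):
    """Equation (5): x is approximated by x_bar + Phi b."""
    x = list(x_bar)
    for b_i, phi in zip(b, modes):
        for j, pj in enumerate(phi):
            x[j] += b_i * pj
    return x
```

Projecting a shape and reconstructing it returns the original only up to the variation carried by the discarded eigenvectors, which is why (5) is an approximation.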

The vector ${\bf b}$ describes the shape ${\bf x}$. An approximation of ${\bf x}$ can be reconstructed from ${\bf b}$ alone, provided the model (that is, $\overline{\bf x}$ and $\Phi$) is known. Constraining ${\bf b}$ to small variations ensures that the model generates only shapes similar to the training shapes. This can be done either by restricting each element $b_i$ of ${\bf b}$ to vary between bounds (for instance $\pm 3 \sqrt{\lambda_i}$) or by constraining ${\bf b}$ to lie within a hyper-ellipsoid defined by a threshold $M_t$:

$\begin{displaymath} \sum_{i=1}^t\frac{b_i^2}{\lambda_i}\leq M_t \end{displaymath}$

(7)
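Both constraints can be sketched in a few lines. The names `clamp_box` and `clamp_ellipsoid` are illustrative, and rescaling ${\bf b}$ towards the origin when it falls outside the hyper-ellipsoid is an assumption: it is one simple way to satisfy (7), not necessarily the one used here.

```python
import math

def clamp_box(b, eigenvalues, k=3.0):
    """Limit each b_i to the interval [-k*sqrt(lambda_i), +k*sqrt(lambda_i)]."""
    limited = []
    for b_i, lam in zip(b, eigenvalues):
        limit = k * math.sqrt(lam)
        limited.append(max(-limit, min(limit, b_i)))
    return limited

def clamp_ellipsoid(b, eigenvalues, M_t):
    """Rescale b so that sum(b_i^2 / lambda_i) does not exceed M_t."""
    d2 = sum(b_i ** 2 / lam for b_i, lam in zip(b, eigenvalues))
    if d2 <= M_t:
        return list(b)  # already inside the hyper-ellipsoid
    scale = math.sqrt(M_t / d2)
    return [b_i * scale for b_i in b]
```

The box constraint treats each mode independently, whereas the hyper-ellipsoid bounds all modes jointly, so a shape that is moderately unusual in many modes at once is still pulled back.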

Figure 3.3a shows the first mode of variation of a model built from annotated face images.

**Figure 3.3:** Example of the first mode of variation of the shape of a face. The element of ${\bf b}$ corresponding to the largest eigenvalue varies from $-3\sqrt{\lambda_1}$ to $3\sqrt{\lambda_1}$.
$\begin{figure}\begin{center} \epsfbox{mode1.eps} \end{center} \end{figure}$