


Video-based detection and tracking

Hunke [31] has developed a face tracker based on a neural network [3,49]. The network searches for the largest object exhibiting skin color and appropriate motion. A face-color classifier, obtained by low-pass filtering, distinguishes face color from background color. Combined with a simple thresholded difference between two successive images, which provides a motion feature, this yields a binary image of the face. This binary image is fed to a multi-layer perceptron composed of a retina (an array of inputs the size of the image), a single hidden layer and an output layer. The output layer gives the position and size of the face for each frame. This system was extended by Yang et al. [58] to track facial features such as the eyes, lip corners and nostrils. It has been successfully applied to eye-gaze monitoring, head-pose tracking and lip-reading using a time-delay neural network [57].
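As an illustration, the construction of the binary input image described above can be sketched as follows. The threshold value and the function name are assumptions for the sake of the example, not details taken from [31]:

```python
import numpy as np

def binary_face_image(frame, prev_frame, skin_mask, motion_threshold=20):
    """Sketch of the binary input image fed to the perceptron:
    combine a skin-color mask with a thresholded frame difference.
    The threshold value is an illustrative assumption."""
    # Motion feature: absolute difference between successive frames,
    # thresholded to a binary map.
    diff = np.abs(frame.astype(int) - prev_frame.astype(int))
    motion = diff > motion_threshold
    # Keep only pixels that are both skin-colored and moving.
    return (motion & skin_mask).astype(np.uint8)
```

The resulting array would then be flattened into the retina layer of the perceptron.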

Colmenarez and Thomas [14] used a set of nine features to track faces: the four corners of the eyebrows, the two outer corners of the eyes, the nose and the two corners of the lips. The system uses fast low-level processing, such as horizontal and vertical edge detection, to find these features. The features are learned using information-based maximum discrimination (IBMD) [13]. This learning method optimizes the correct-answer/false-alarm trade-off by rearranging the elements of a Markov chain [12]. A hierarchical matching procedure was implemented to deal with fast motion, and a full-frame search based on the IBMD algorithm is performed regularly. Since the hierarchical matching algorithm compensates for the time spent in the IBMD algorithm, the system can be considered real-time. It is also person-independent and able to track multiple faces.
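The hierarchical matching idea can be sketched as a coarse-to-fine search: a candidate location is first found on a subsampled image, then refined locally at full resolution. The sum-of-squared-differences cost, the subsampling factor and the function names below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def coarse_to_fine_match(image, template, factor=4):
    """Coarse-to-fine template search in the spirit of hierarchical
    matching: locate the template on a subsampled image first, then
    refine the estimate in a small window at full resolution."""
    def ssd_search(img, tpl, rows, cols):
        # Exhaustive sum-of-squared-differences search over the
        # given candidate positions.
        th, tw = tpl.shape
        best, best_pos = None, (0, 0)
        for r in rows:
            for c in cols:
                patch = img[r:r + th, c:c + tw]
                if patch.shape != tpl.shape:
                    continue
                d = np.sum((patch.astype(float) - tpl) ** 2)
                if best is None or d < best:
                    best, best_pos = d, (r, c)
        return best_pos

    # Coarse level: subsample image and template.
    small_img = image[::factor, ::factor]
    small_tpl = template[::factor, ::factor].astype(float)
    r0, c0 = ssd_search(small_img, small_tpl,
                        range(small_img.shape[0]), range(small_img.shape[1]))
    r0, c0 = r0 * factor, c0 * factor
    # Fine level: refine within a small window around the coarse hit.
    rows = range(max(0, r0 - factor), min(image.shape[0], r0 + factor + 1))
    cols = range(max(0, c0 - factor), min(image.shape[1], c0 + factor + 1))
    return ssd_search(image, template.astype(float), rows, cols)
```

The coarse pass keeps the per-frame cost low, which is what leaves time for the periodic full-frame IBMD search.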

McKenna and Gong [44] propose a tracker that combines motion-based and model-based tracking. The model-based tracking relies on a neural-network face detector, while the motion-based tracking reduces the search area. The motion-based tracking applies a temporal convolution to the images, which allows a clustering algorithm to find the different moving objects. A Kalman filter is then used to track these objects (human bodies in this case). The locations of the heads can then be estimated from the bounding boxes found. The detected heads are in turn used to resolve ambiguities that appear in the motion-tracking process. Thus the motion-based tracking helps the model-based tracking, and the model-based tracking provides feedback that helps the motion-based tracking.
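A minimal sketch of the Kalman filtering step used to track the moving-object clusters is given below, assuming a 2D constant-velocity model with noisy position measurements; the state layout and noise values are assumptions for illustration, not taken from [44]:

```python
import numpy as np

class ConstantVelocityKalman:
    """Minimal 2D constant-velocity Kalman filter of the kind used to
    track object clusters; state is (x, y, vx, vy)."""
    def __init__(self, x, y, q=1e-2, r=1.0):
        self.s = np.array([x, y, 0.0, 0.0])   # state estimate
        self.P = np.eye(4)                    # state covariance
        self.F = np.array([[1, 0, 1, 0],      # constant-velocity dynamics
                           [0, 1, 0, 1],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],      # we observe position only
                           [0, 1, 0, 0]], dtype=float)
        self.Q = q * np.eye(4)                # process noise (assumed)
        self.R = r * np.eye(2)                # measurement noise (assumed)

    def predict(self):
        self.s = self.F @ self.s
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.s[:2]                     # predicted position

    def update(self, z):
        y = np.asarray(z, float) - self.H @ self.s      # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)        # Kalman gain
        self.s = self.s + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
```

The predicted position gives the reduced search area in which the model-based detector is then applied.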

Duta [24] used a modified version of information-based maximum discrimination to track the corpus callosum in brain images in order to predict dyslexia. The corpus callosum is modeled by a point distribution model. The structures are located using an algorithm similar to the active shape model [18], which makes use of prior knowledge about neuroanatomic structures and is able to deal with occlusion and outliers [23,22]. A range of windows placed at different positions and scales is tested. The correct-answer/false-alarm trade-off is maximized using simulated annealing to decide whether or not a detected object is a corpus callosum. This detection process has also been applied to detection in cardiac magnetic resonance video sequences.
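The simulated annealing step can be sketched generically as follows: worse candidates are accepted with a probability that decays with temperature, so the search can escape local maxima of the correct-answer/false-alarm criterion. The cooling schedule and parameter values are assumptions for illustration:

```python
import math
import random

def simulated_annealing(score, neighbour, x0, t0=1.0, cooling=0.95, steps=200):
    """Generic simulated-annealing maximisation: accept a worse
    candidate with probability exp(delta / T), then cool down."""
    random.seed(0)                 # deterministic run for illustration
    x, t = x0, t0
    best, best_score = x0, score(x0)
    while t > 1e-3 and steps > 0:
        cand = neighbour(x)
        delta = score(cand) - score(x)
        if delta > 0 or random.random() < math.exp(delta / t):
            x = cand               # accept (always if uphill)
        if score(x) > best_score:
            best, best_score = x, score(x)
        t *= cooling               # geometric cooling schedule
        steps -= 1
    return best
```

In the setting above, `score` would be the correct-answer/false-alarm criterion evaluated over the training windows, and `neighbour` would perturb the decision parameters.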

Jebara and Pentland [35] built a 3D face tracker based on two modules: detection and tracking. The detection module uses several features to detect faces:

-
Skin-color classification: an expectation-maximization algorithm [20,45] finds the parameters of a mixture of Gaussians modeling the probability distribution of skin color.
-
Within the skin-colored window found, the eyes, mouth and nose are detected using dark regions and gradient information. The combination of features that gives the maximum likelihood is then selected.
-
The locations of the eyes, mouth and nose are mapped to an average 3D face previously obtained using a 3D range scanner. A mug-shot of a frontal view of this 3D model is then computed, using the fact that faces are symmetrical. Its illumination is corrected by adjusting the histogram of the mug-shot so that it fits the histogram of a well-illuminated face.
-
A distance from face space is then used to assess the resulting mug-shot. The nose position is adjusted so as to minimize this distance.
The tracking module then uses the four detected features, motion information and an extended Kalman filter [56] to estimate the 3D state of the face for the next frame. From this information an estimate of the features for the next frame can be extracted. If the distance from face space using these features is below a threshold, tracking continues with these new features. Otherwise, a global detection is needed and is performed by the detection module.
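The distance-from-face-space test that decides between continued tracking and global re-detection can be sketched as a PCA reconstruction error. The orthonormal basis, mean vector and function names below are assumptions for illustration, not the authors' code:

```python
import numpy as np

def distance_from_face_space(x, mean, basis):
    """Project a candidate onto a PCA face subspace (basis rows are
    assumed orthonormal) and measure the reconstruction error."""
    x = np.asarray(x, float) - mean
    coeffs = basis @ x                  # projection onto the subspace
    reconstruction = basis.T @ coeffs   # back-projection
    return np.linalg.norm(x - reconstruction)

def keep_tracking(x, mean, basis, threshold):
    """Continue tracking only while the candidate stays close to the
    face subspace; otherwise signal that global detection is needed."""
    return distance_from_face_space(x, mean, basis) < threshold
```

A small distance means the candidate still "looks like a face", so the local tracker can be trusted for one more frame.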

Baumberg and Hogg [2] model pedestrians by a spline controlled by a point distribution model, as described by Cootes et al. [17]. A Kalman filter is used to track the normalized control points and the alignment (scale and rotation) separately. An iterative refinement method is then applied to correct the prediction and obtain a more accurate contour of the pedestrian in the next frame.
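A point distribution model of the kind used here can be sketched as PCA on aligned training shapes: new shapes are generated as the mean shape plus a weighted sum of the main modes of variation. The function names are assumptions; the technique itself is that of Cootes et al. [17]:

```python
import numpy as np

def fit_point_distribution_model(shapes, n_modes=2):
    """Fit a point distribution model: PCA on aligned training shapes,
    each shape a flattened vector of landmark coordinates."""
    X = np.asarray(shapes, float)
    mean = X.mean(axis=0)
    # Eigen-decomposition of the shape covariance matrix.
    cov = np.cov(X - mean, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    order = np.argsort(vals)[::-1][:n_modes]   # keep the largest modes
    return mean, vecs[:, order].T              # modes as rows

def generate_shape(mean, modes, b):
    """Instantiate a shape from the model: x = mean + P^T b."""
    return mean + modes.T @ np.asarray(b, float)
```

In tracking, the Kalman filter operates on the shape parameters `b` (plus scale and rotation), rather than on the raw control points.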

Edwards et al. [25] use active appearance models to improve recognition while tracking faces. Linear discriminant analysis is used to separate the identity subspace and the non-identity subspace of the face space. A correction scheme is applied to improve identity recognition using several frames of the same individual. Three Kalman filters are then used to track the pose, the corrected identity subspace and the non-identity subspace separately through the video sequence [26]. Given this framework, a low-resolution video sequence of the face can be reconstructed in high resolution using the synthesis facility of the active appearance model.
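The separation of identity from non-identity variation can be sketched as a standard linear discriminant analysis: the leading generalised eigenvectors of the within-class and between-class scatter matrices span the directions in which between-person variation dominates within-person variation. The function name and the regularisation term are assumptions for this toy sketch:

```python
import numpy as np

def identity_subspace(samples, labels, n_dims=1):
    """LDA sketch: the leading eigenvectors of Sw^-1 Sb span the
    'identity' subspace (between-person variation dominates)."""
    X = np.asarray(samples, float)
    labels = np.asarray(labels)
    mean = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))    # within-class (non-identity) scatter
    Sb = np.zeros((d, d))    # between-class (identity) scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mean, mc - mean)
    # Regularise Sw slightly so it is invertible in this toy setting.
    vals, vecs = np.linalg.eig(np.linalg.inv(Sw + 1e-6 * np.eye(d)) @ Sb)
    order = np.argsort(vals.real)[::-1][:n_dims]
    return vecs[:, order].real.T     # identity directions as rows
```

Projecting appearance parameters onto this subspace (and its complement) gives the identity and non-identity coordinates that the separate Kalman filters track.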

Isard and Blake [33] developed an extension of the Kalman filter to deal with non-linear predictions of movement. The algorithm, called CONDENSATION, uses sampling of distributions, Bayes' rule, diffusion of distributions and reactive reinforcement to provide a robust estimate of the next position of an object. Both object properties and motion are used in the model. To apply it in practice, the distributions have to be estimated. The shape of an object is modeled by splines, and the distribution of possible shapes for the object to be tracked is collected from a set of manually annotated images. The motion is first estimated by tracking the object with a Kalman filter until this tracking fails. The motion information collected can then be used with the CONDENSATION algorithm to track the object again. Isard and Blake observed improvements between the two trackings. Eventually, a new motion distribution can be derived from the second tracking and used in a third tracking to improve the result further.
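One CONDENSATION iteration can be sketched on a one-dimensional toy problem: resample particles by weight (factored sampling), diffuse them with the stochastic dynamics, and reweight them against the new measurement. The Gaussian dynamics and likelihood below are assumptions chosen for the sketch, not the spline-based model of [33]:

```python
import numpy as np

def condensation_step(particles, weights, measurement, rng,
                      process_noise=1.0, meas_noise=1.0):
    """One CONDENSATION (particle filter) iteration for a 1-D
    position model."""
    n = len(particles)
    # 1. Factored sampling: resample according to current weights.
    idx = rng.choice(n, size=n, p=weights)
    particles = particles[idx]
    # 2. Diffusion: apply the stochastic dynamics model.
    particles = particles + rng.normal(0.0, process_noise, n)
    # 3. Reweight: Gaussian measurement likelihood (Bayes' rule).
    weights = np.exp(-0.5 * ((particles - measurement) / meas_noise) ** 2)
    weights /= weights.sum()
    return particles, weights

# Usage: track a point drifting towards x = 10.
rng = np.random.default_rng(0)
particles = rng.uniform(-20, 20, 500)
weights = np.full(500, 1 / 500)
for z in np.linspace(0, 10, 30):
    particles, weights = condensation_step(particles, weights, z, rng)
estimate = np.sum(particles * weights)   # weighted posterior mean
```

Because the posterior is carried as a sample set rather than a single Gaussian, the filter can represent the multi-modal distributions that defeat the Kalman filter.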

Magee and Boyle [40] use a CONDENSATION-like algorithm to track active shape models of livestock. The set of training shapes is split into prototypes in the same way as we will describe in section 4.3. They first separate the inter-class and intra-class variations using linear discriminant analysis. Two mixtures of Gaussians are then used to model the inter-class and intra-class variations in the shape parameter space. After initialization, shape, scale and position are sampled from the estimated probability distributions. A fitness function is used to assess the sampled prototypes, and the probability distributions are then updated. This approach suits their needs because the self-occlusion that can occur when a cow is walking pushed them to adopt a multi-model approach: three active shape models were used, two of which were specific to occlusion situations where only one leg is visible in the image instead of two. The sampling approach thus allows sampling from different models.

Modelling motion is the first step towards the modelling and analysis of behaviours, so the next section presents a literature review of this second aspect.



franck 2006-10-16