Action Recognition Using Log-Covariance Matrices
Team: K. Guo, P. Ishwar, J. Konrad
Funding: National Science Foundation (CISE-SNC-NOSS)
Status: Ongoing (2007-…)
Background: A 5-year-old child has no problem recognizing different types of human actions, such as walking, running, jumping, and waving, from video footage captured by a surveillance camera. Give the same footage to a computer, however, and obstacles soon emerge. One source of difficulty is that different actions, such as walking and running, may look very similar depending on the viewing angle and the camera frame rate. Another is action variability: the same action performed by different people may look quite different, e.g., people have different walking gaits. Although this problem has been tackled with limited success over the last decade or so, no satisfactory system has been developed to date for settings where storage and processing resources are severely limited and decisions must be made in real time. Many application areas could greatly benefit from a fast, accurate, and robust solution: homeland security (e.g., to detect thefts and assaults), healthcare (e.g., to monitor the actions of elderly patients and detect life-threatening symptoms), ecological monitoring (e.g., to analyze the behavior of animals in their natural habitats), and automatic sign-language recognition for assisting the speech-impaired.
Summary: In this project, a new class of action recognition algorithms has been developed with performance on par with or better than that of state-of-the-art methods. Furthermore, the proposed algorithms have low storage and computational requirements, making them suitable for real-time implementation. Central to the developed framework is the use of a covariance-matrix representation to characterize actions in a video, which leads to a family of algorithms based on different features and classification methods. In one of the early variants of the approach, background subtraction is applied to all video frames in order to extract the moving object’s silhouette tunnel. The tunnel’s shape, which captures action dynamics, is then described by a covariance matrix computed from 13-dimensional feature vectors. Combined with nearest-neighbor classification, this early approach produced very promising results.
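As a rough illustration of the covariance-matrix representation, the sketch below turns a set of per-pixel feature vectors into a single covariance descriptor and classifies it with a nearest-neighbor rule. The function names and the Frobenius distance are assumptions made for this example; the 13 silhouette-tunnel features themselves and the distance actually used are described in the papers listed below and are not reproduced here.

```python
import numpy as np

def covariance_descriptor(features):
    """Summarize an (N, d) array of feature vectors by its (d, d) covariance."""
    features = np.asarray(features, dtype=float)
    # rowvar=False: each row is one observation, each column one feature.
    return np.cov(features, rowvar=False)

def frobenius_distance(a, b):
    """Placeholder distance between two descriptors (an assumption)."""
    return float(np.linalg.norm(a - b, ord="fro"))

def nearest_neighbor_label(query, train_descriptors, train_labels,
                           distance=frobenius_distance):
    """Return the label of the training descriptor closest to the query."""
    dists = [distance(query, c) for c in train_descriptors]
    return train_labels[int(np.argmin(dists))]
```

Whatever the number of feature vectors N, the descriptor for d = 13 is a single 13 x 13 symmetric matrix, which is one way to see why the storage requirements stay low.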
Marrying the covariance-matrix representation with a recently developed classification framework based on sparse linear representation, related to compressive sampling, has led to improved action recognition accuracy. The covariance-matrix representation is discriminative, simple to compute, and has low storage requirements. However, covariance matrices do not form a linear space, so for effective classification the action representation needs to be mapped into a linear space by means of the matrix logarithm (log-covariance). The resulting method consistently achieved such high performance on several databases that the authors were invited to enter the “Aerial View Activity Classification Challenge” in the Semantic Description of Human Activities (SDHA) contest at the 2010 International Conference on Pattern Recognition (ICPR). The goal of the challenge was to test methodologies on realistic surveillance-type videos, particularly from low-resolution, far-away cameras. Eight teams entered the contest, and in the final stage the BU team “BU Action Covariance Manifolds” won by edging out an Italian team from the University of Modena.
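The log-covariance mapping can be sketched as follows. This is a minimal example, not the authors' implementation: the eigenvalue floor `eps` and the helper `vectorize_symmetric` are assumptions added so that the example handles rank-deficient matrices and produces a plain vector that a linear classifier could consume.

```python
import numpy as np

def log_covariance(cov, eps=1e-6):
    """Matrix logarithm of a symmetric positive semi-definite matrix."""
    cov = np.asarray(cov, dtype=float)
    cov = 0.5 * (cov + cov.T)               # enforce exact symmetry
    eigvals, eigvecs = np.linalg.eigh(cov)  # cov = V diag(eigvals) V^T
    eigvals = np.maximum(eigvals, eps)      # assumed floor to avoid log(0)
    return (eigvecs * np.log(eigvals)) @ eigvecs.T

def vectorize_symmetric(mat):
    """Stack the upper triangle into a vector of length d(d+1)/2.

    Off-diagonal entries are scaled by sqrt(2) so that the Euclidean norm
    of the vector equals the Frobenius norm of the matrix.
    """
    rows, cols = np.triu_indices(mat.shape[0])
    weights = np.where(rows == cols, 1.0, np.sqrt(2.0))
    return weights * mat[rows, cols]
```

For a 13 x 13 covariance matrix the resulting vector has 91 entries, and ordinary linear operations such as sums and weighted combinations are meaningful on these vectors.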
The very same framework was also used with another set of features, namely optical flow instead of silhouette tunnels. 12-dimensional feature vectors combining optical flow with its gradient, divergence, vorticity, etc. were computed and aggregated in a covariance matrix. The resulting matrix was used as the action representation in the same sparse-linear classification, with excellent performance at low memory requirements and low computational cost. The elegance of the method and its excellent performance were recognized at the 7th IEEE International Conference on Advanced Video and Signal-Based Surveillance, where the method won the best paper award.
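Two of the flow-derived quantities named above, divergence and vorticity, can be computed from a dense flow field with finite differences. The sketch below assumes the flow is given as two arrays `u` (horizontal) and `v` (vertical) of shape (H, W); how the flow is estimated and what the remaining feature components are is not specified here.

```python
import numpy as np

def flow_divergence_vorticity(u, v):
    """Divergence and vorticity of a 2-D flow field (u, v), each (H, W)."""
    # np.gradient returns derivatives along axis 0 (rows, y) then axis 1 (cols, x).
    du_dy, du_dx = np.gradient(np.asarray(u, dtype=float))
    dv_dy, dv_dx = np.gradient(np.asarray(v, dtype=float))
    divergence = du_dx + dv_dy   # local expansion or contraction
    vorticity = dv_dx - du_dy    # local rotation
    return divergence, vorticity
```

A purely expanding field such as u = x, v = y has constant divergence 2 and zero vorticity, while a purely rotating field u = -y, v = x has zero divergence and constant vorticity 2.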
Although the above methods proved successful, they all rely on knowledge of action boundaries, i.e., when actions start and stop. However, a surveillance camera produces continuous video that includes different types of actions following one another. In order to find action boundaries, we developed a non-parametric statistical framework to learn the distribution of the distance between covariance descriptors. Action changes are then detected as covariance-distance outliers.
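One simple way to picture outlier-based change detection is to threshold the descriptor distance at a high empirical quantile of distances observed within a single action. This is only a sketch under that assumption: the quantile rule, the `false_alarm_rate` parameter, and both function names are illustrative and are not taken from the published method.

```python
import numpy as np

def learn_threshold(within_action_distances, false_alarm_rate=0.01):
    """Empirical quantile of distances seen when no action change occurs."""
    return float(np.quantile(within_action_distances, 1.0 - false_alarm_rate))

def detect_changes(distances, threshold):
    """Indices at which the descriptor distance is an outlier."""
    return [t for t, d in enumerate(distances) if d > threshold]
```

Here `distances[t]` would be the distance between covariance descriptors of consecutive video segments, so a flagged index marks a candidate action boundary.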
Results: The proposed action recognition algorithms were tested on various databases, including the Weizmann Human Action Database (see an example in the figure above), the KTH database, the recent UT-Tower dataset, and the YouTube dataset. Individual videos were processed using overlapping segments. The correct classification rate (CCR) attained under leave-one-out cross-validation (LOOCV) varies between datasets: up to 100% on the Weizmann dataset, up to 98.5% on the KTH dataset, up to 97.2% on the UT-Tower low-resolution video, and up to 78.5% on the YouTube dataset. Detailed results can be found in the papers below. Importantly, all of the proposed algorithms are lightweight in terms of memory and CPU requirements, and have been implemented to run at video rates on a modern CPU under Matlab.
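The evaluation protocol can be sketched as a leave-one-out loop: each sample is classified using all remaining samples, and the CCR is the fraction classified correctly. The example below leaves out one descriptor vector at a time and uses a Euclidean nearest-neighbor rule; both choices are assumptions for illustration, since the unit that is held out (segment, video, or actor) and the classifier differ between the experiments reported in the papers.

```python
import numpy as np

def loocv_ccr(descriptors, labels):
    """Correct classification rate under leave-one-out nearest neighbor."""
    X = np.asarray(descriptors, dtype=float)
    n = len(X)
    correct = 0
    for i in range(n):
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf                      # exclude the held-out sample
        if labels[int(np.argmin(dists))] == labels[i]:
            correct += 1
    return correct / n
```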
Promising results have also been obtained for the proposed action change detection. In a ground-truth experiment, where a video sequence was constructed by concatenating different actions by the same individual (61 action changes), a false negative error of 1.64% and a false positive error of 0.19% were attained.
The above framework has three important traits: conceptual simplicity, state-of-the-art performance, and low computational complexity. Its relative simplicity, as compared to some of the top methods in the literature, allows rapid deployment of robust action recognition operating in real time. This opens new application areas outside the surveillance/security arena, for example in sports video annotation and human-computer interaction. The videos below show three examples of the application of our action recognition framework to the automatic annotation of sports videos. First, we created a dictionary of manually annotated short video segments (e.g., forehand, backhand) based on 2-3 minutes of sample tennis or pommel-horse videos. Then, we automatically classified each frame of a new video using optical flow and a nearest-neighbor classifier with respect to the dictionary; each frame was labeled based on a majority vote from all overlapping video segments that include that frame. The detected actions are denoted by yellow and red dots on the screen (see the legends below).
Legend for tennis videos: red dot = serve, yellow dot = side on which player hits the ball (forehand versus backhand)
Legend for pommel horse video: left dot = spindle on back, middle dot = spindle on chest, right dot = scissors
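The per-frame majority vote over overlapping segments described above can be sketched as follows. The segment layout (fixed `segment_length`, fixed `step` between segment starts) and the function name are assumptions for this example; the segment labels would come from the nearest-neighbor classifier.

```python
from collections import Counter

def label_frames(num_frames, segment_length, step, segment_labels):
    """Label each frame by majority vote over the segments that contain it.

    Segment k is assumed to cover frames [k * step, k * step + segment_length).
    Ties go to the label that was seen first; frames covered by no segment
    get None.
    """
    votes = [Counter() for _ in range(num_frames)]
    for k, label in enumerate(segment_labels):
        start = k * step
        for t in range(start, min(start + segment_length, num_frames)):
            votes[t][label] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]
```

With a step smaller than the segment length, most frames receive several votes, which smooths out isolated misclassified segments.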
 K. Guo, P. Ishwar, and J. Konrad, “Action recognition from video by covariance matching of silhouette tunnels,” in Proc. Brazilian Symp. on Computer Graphics and Image Proc., pp. 299-306, Oct. 2009.
 K. Guo, P. Ishwar, and J. Konrad, “Action change detection in video by covariance matching of silhouette tunnels,” in Proc. IEEE Int. Conf. Acoustics Speech Signal Processing, pp. 1110-1113, Mar. 2010.
 K. Guo, P. Ishwar, and J. Konrad, “Action recognition in video by sparse representation on covariance manifolds of silhouette tunnels,” in Proc. Int. Conf. Pattern Recognition (Semantic Description of Human Activities Contest), Aug. 2010, [SDHA contest web site], Winner of Aerial View Activity Classification Challenge.
 K. Guo, P. Ishwar, and J. Konrad, “Action recognition using sparse representation on covariance manifolds of optical flow,” in Proc. IEEE Int. Conf. Advanced Video and Signal-Based Surveillance, pp. 188-195, Aug. 2010, AVSS 2010 Best Paper Award.
 K. Guo, Action recognition using log-covariance matrices of silhouette and optical-flow features. PhD thesis, Boston University, Sept. 2011.