Silhouettes versus Skeletons: Can less data be as good yet more robust?

Team: J.Wu, K. Lai, P. Ishwar, J. Konrad
Funding: National Science Foundation (CISE-SATC)
Status: Completed (2011-2017)

Summary: This project focuses on two issues in gesture-based authentication. First, which data produced by a Kinect camera performs better in gesture-based authentication: silhouettes or skeletons? A person’s silhouette can be easily derived from a depth map produced by Kinect camera (see the picture above). However, due to illumination, clothing, occlusions, etc., silhouettes may be locally erroneous and may potentially increase authentication errors. Alternatively, a person’s skeleton produced by Kinect SDK can serve as the input to an authentication algorithm. Skeletons produced by Kinect SDK are quite robust to illumination, clothing and occlusions, but contain fewer bits of information than silhouettes. The second issue we tackle is the impact of some real-world degradations, such as user memory of gestures, and of various personal-effects, such as backpacks and heavy clothing, on the performance of gesture-based authentication. We provide a careful experimental study of our algorithms considering both of these issues.

A sample of a user performing a gesture — A sample depth map sequence of a single user gesture captured using a Kinect

In order to achieve these goals, we acquired gesture data from 40 users. Gestures with various degradations, namely user-memory and personal effects (heavy coats, bags, etc.), were recorded. Users were asked to perform two gestures. One gesture was a shared gesture (S gesture), and another was a gesture selected by the users on their own (user-defined gesture). Details of the acquired data can be found here.

To generate silhouette features, multiple steps were required. First, we segmented full-body silhouettes (a silhouette tunnel) from a sequence of depth maps produced by Kinect. For each silhouette pixel in the tunnel, we computed the Euclidean distance from that pixel to the farthest silhouette boundary in 10 directions (8 spatial directions, 2 temporal directions). The figure below provides a visualization of these directions. All distances were combined into a feature vector, one per pixel. From this collection of per-pixel features, an empirical covariance matrix was formed. In order to authenticate, the nearest-neighbor was found by evaluating either a Euclidean or log-Euclidean distance between covariance matrices of a query user and enrolled users.

sil_tunnel_gesture — Each silhouette pixel n₀ has an associated 13-dimensional feature vector which consists of the pixel position (x,y,t) and 10 directional distances to the silhouette boundary.

To generate skeletal features, skeletal pose-estimates (x-y-z coordinates — see figure below) were extracted from the Kinect v1 Windows SDK. Three distances between skeletal features of a query user and enrolled users were considered. The first two distances mirrored the kind applied to silhouette features above; empirical covariance matrices were formed from per-frame skeletal pose-estimates, and nearest-neighbor Euclidean or log-Euclidean distances between matrices were considered. The last distance was derived from dynamic time warping (DTW), which computes a non-linear alignment distance between two skeletal sequences.

Sample user swing sequence — Skeletal time-snapshots of a user swinging his/her left arm. Red and green indicate the left and right arms, respectively, and blue indicates the center of the body: head to spine and both legs. All gestures are arm based.

Results: The aforementioned features were both evaluated on our acquired dataset. We tested the performance of both silhouette and skeleton features under various scenarios. For example, the first row in the tables below shows the results when both training and testing data included no degradations. The subsequent rows show results when no degradations were present at training but individual or all degradations were present during testing. The tables below show all results for our 2 considered gestures.

Equal Error Rate (EER) for the problem of individual user authentication for the S Gesture. Smaller is better. The best results for skeletons and for silhouettes are highlighted. SvS denotes the error difference between the best performing result from skeletons vs. silhouettes (best skeleton - best silhouette). Everything* contains samples both with and without degradations. — Equal Error Rate (EER) for the problem of individual user authentication for the S Gesture. Smaller is better. The best results for skeletons and for silhouettes are highlighted. SvS denotes the error difference between the *best* performing result from skeletons vs. silhouettes (best skeleton – best silhouette). Everything* contains samples both with and without degradations.

AVSS_user_table — Equal Error Rate (EER) for the problem of individual user authentication for the User Defined Gesture. Please refer to the caption of the last table for explanations.

For additional results (identification and group level authentication) and more in-depth details on our methodology, please refer to the papers below and, in particular to the AVSS-2014 paper [4].

To date, we obtained overall Equal Error Rates (EER — equal false rejection and false acceptance rates) showing that skeletal features outperform silhouette features on average by 4.89%. Although our conclusions about the improved performance with skeletal features cannot be drawn in every scenario, our results indicate that skeletal features are always preferable when training data lacks degradations that are present during testing. Thus, in situations where large quantities of varied training data cannot be obtained, skeletal features may be preferable.

Publications:

K. Lai, J. Konrad, and P. Ishwar, “Towards gesture-based user authentication,” in Proc. IEEE Int. Conf. Advanced Video and Signal-Based Surveillance (AVSS), Sep. 2012, pp. 282-287.
J. Wu, J. Konrad, and P. Ishwar “Dynamic Time Warping for Gesture-Based User Identification and Authentication with Kinect”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), May 2013, pp. 2371-2375.
Article in BU Today: http://www.bu.edu/today/2013/opening-movements
J. Wu, P. Ishwar, and J. Konrad, “Silhouettes versus skeletons in gesture-based authentication with Kinect,” in Proc. IEEE Int. Conf. Advanced Video and Signal-Based Surveillance (AVSS), pp. 99-106, Aug. 2014.