Privacy-Preserving Action Recognition using Deep Convolutional Neural Networks

Team: Jiawei Chen, Jonathan Wu, Janusz Konrad, Prakash Ishwar
Funding: National Science Foundation (Lighting Enabled Systems and Applications ERC)
Status: Ongoing (2016-…)

Background: Human action and gesture recognition has received significant attention in the computer vision and signal processing communities. Recently, various ConvNet models have been applied in this context and have achieved substantial performance gains over traditional methods based on hand-crafted features. As promising as ConvNet-based models are, they typically rely on data at about 200 × 200-pixel resolution, which is likely to reveal an individual’s identity. As more and more sensors are deployed in our homes and offices, the concern for privacy only grows. Reliable methods for human activity analysis at privacy-preserving resolutions are therefore urgently needed.

Summary: We proposed multiple end-to-end ConvNet architectures for action recognition from extremely low resolution (eLR) videos (e.g., 16 × 12 pixels), each leveraging and fusing spatial and temporal information. Further, in order to leverage high resolution (HR) videos in training, we incorporated eLR-HR coupling to learn a mapping between the eLR and HR feature spaces. The effectiveness of this architecture has been validated on two public datasets, on which our algorithms outperformed state-of-the-art methods.

Technical Approach:

Fusion of the two-stream networks: Multiple works have extended two-stream ConvNets by combining the spatial and temporal cues so that only a single, combined network is trained. This is most frequently done by fusing the outputs of the spatial and temporal networks’ convolutional layers, with the purpose of learning a correspondence between activations at the same pixel location. In this project, we explored and implemented three fusion methods (concatenation, summation, convolution) in the context of eLR videos.
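The three fusion methods can be illustrated with a minimal NumPy sketch. It is not the project's implementation: the feature-map layout (channels, height, width), the sizes, and the function names are hypothetical, and convolution fusion is assumed here to mean stacking the two maps and applying a 1 × 1 convolution, which is one common interpretation.

```python
import numpy as np

def fuse_concat(spatial, temporal):
    """Stack the two maps along the channel axis: (C, H, W) x 2 -> (2C, H, W)."""
    return np.concatenate([spatial, temporal], axis=0)

def fuse_sum(spatial, temporal):
    """Add activations at the same channel and pixel location."""
    return spatial + temporal

def fuse_conv(spatial, temporal, weights, bias):
    """Stack the maps, then mix channels with a 1x1 convolution (assumed form).

    weights has shape (C_out, 2C) and bias has shape (C_out,).
    """
    stacked = fuse_concat(spatial, temporal)
    # A 1x1 convolution is a per-pixel linear map over the channel axis.
    mixed = np.einsum("oc,chw->ohw", weights, stacked)
    return mixed + bias[:, None, None]

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8  # hypothetical feature-map size
spatial = rng.standard_normal((C, H, W))   # spatial-stream activations
temporal = rng.standard_normal((C, H, W))  # temporal-stream activations

fused_cat = fuse_concat(spatial, temporal)
fused_sum = fuse_sum(spatial, temporal)
weights = 0.1 * rng.standard_normal((C, 2 * C))  # learned in a real network
fused_conv = fuse_conv(spatial, temporal, weights, np.zeros(C))
```

Under this interpretation, summation is a special case of convolution fusion: setting the 1 × 1 weights to two side-by-side identity matrices reproduces the element-wise sum.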

Semi-coupled networks: Applying recognition directly to eLR video is not robust, as visual features tend to carry little information at such low resolutions. However, it is possible to augment ConvNet training with an auxiliary HR version of the eLR video, while using only the eLR video during testing. In this context, we proposed semi-coupled networks, which share filters between the eLR and HR fused two-stream ConvNets. The eLR two-stream ConvNet takes an eLR RGB frame and its corresponding eLR optical flow frames as input. The HR two-stream ConvNet takes an HR RGB frame and its corresponding HR optical flow frames as input. In layer n of the network (n = 1, …, 5), the eLR and HR two-stream ConvNets share k(n) filters. During training, we leverage both eLR and HR information and update the filter weights of both networks in tandem. During testing, we decouple the two networks and use only the eLR network, which includes the shared weights.
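The filter sharing can be sketched as follows. This is a toy NumPy illustration rather than the published code: the class name, the filter counts and shapes, the values chosen for k(n), and the rule of summing the two networks' gradients on the shared filters are all assumptions made for the example.

```python
import numpy as np

class SemiCoupledLayer:
    """Filters for one layer: a shared block plus a private block per network."""

    def __init__(self, num_filters, num_shared, filter_shape, rng):
        if not 0 <= num_shared <= num_filters:
            raise ValueError("num_shared must lie in [0, num_filters]")
        num_private = num_filters - num_shared
        self.shared = 0.01 * rng.standard_normal((num_shared, *filter_shape))
        self.elr_private = 0.01 * rng.standard_normal((num_private, *filter_shape))
        self.hr_private = 0.01 * rng.standard_normal((num_private, *filter_shape))

    def elr_filters(self):
        """Filters seen by the eLR network: shared first, then eLR-only."""
        return np.concatenate([self.shared, self.elr_private], axis=0)

    def hr_filters(self):
        """Filters seen by the HR network: shared first, then HR-only."""
        return np.concatenate([self.shared, self.hr_private], axis=0)

    def sgd_step(self, grad_elr, grad_hr, lr=0.1):
        """Update both networks together (assumed rule: shared grads are summed)."""
        k = self.shared.shape[0]
        self.shared -= lr * (grad_elr[:k] + grad_hr[:k])
        self.elr_private -= lr * grad_elr[k:]
        self.hr_private -= lr * grad_hr[k:]

rng = np.random.default_rng(0)
# Hypothetical k(n) for n = 1..5: deeper layers share more of their 8 filters.
shared_per_layer = [2, 3, 4, 6, 8]
layers = [SemiCoupledLayer(8, k, (3, 3, 3), rng) for k in shared_per_layer]

# Training: one update with stand-in gradients from the eLR and HR losses.
for layer in layers:
    grad_elr = rng.standard_normal(layer.elr_filters().shape)
    grad_hr = rng.standard_normal(layer.hr_filters().shape)
    layer.sgd_step(grad_elr, grad_hr)

# Testing: drop the HR network and keep only the eLR filters.
elr_test_filters = [layer.elr_filters() for layer in layers]
```

Because the shared block is stored once, HR training signal reaches the eLR network through those filters, while the private blocks let each network adapt to its own resolution.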

Visualization of the proposed semi-coupled networks of two fused two-stream ConvNets for video recognition. We feed HR RGB and optical flow frames (32 × 32 pixels) to the HR ConvNet (colored in blue). We feed eLR RGB (16 × 12 interpolated to 32 × 32 pixels) and optical flow frames (computed from the interpolated 32 × 32 pixel RGB frames) to the eLR ConvNet (colored in red). In training, the two ConvNets share k(n) (n = 1, …, 5) filters (gray shaded) between corresponding convolutional and fully-connected layers. Note that the deeper the layer, the more filters are shared. In testing, we decouple the two ConvNets and use only the eLR network (the red network, which includes the shared filters).

Experimental Results: To confirm the effectiveness of our proposed method, we conducted experiments on two publicly available video datasets. The results below demonstrate that our method outperforms state-of-the-art methods on both datasets.

Performance of different ConvNet architectures against baseline on the eLR-IXMAS dataset. “Spatial & Temp avg” is obtained by averaging the temporal and spatial stream predictions. The best performing method is highlighted in bold.
Performance of different ConvNet architectures and current state-of-the-art method on the eLR-HMDB dataset. The two-stream networks are all fused after the “Conv3” layer. The best method is highlighted in bold.
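For reference, the “Spatial & Temp avg” entry corresponds to late fusion of two separately trained streams. A minimal sketch follows, assuming that the averaged quantities are softmax class probabilities (whether the original averages probabilities or raw scores is not stated here); the function names and the example scores are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def average_streams(spatial_logits, temporal_logits):
    """Average per-class probabilities of the two streams, then pick a label."""
    probs = 0.5 * (softmax(spatial_logits) + softmax(temporal_logits))
    return probs, probs.argmax(axis=-1)

# Made-up scores for one clip and three action classes.
spatial_logits = np.array([[2.0, 1.0, 0.1]])   # spatial stream favors class 0
temporal_logits = np.array([[0.1, 3.0, 0.2]])  # temporal stream favors class 1
probs, labels = average_streams(spatial_logits, temporal_logits)
```

Unlike the fused networks, this baseline combines the streams only at the prediction level, so it cannot learn correspondences between spatial and temporal activations at the same pixel location.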

Source Code (with ConvNet models):


1. J. Chen, J. Wu, J. Konrad, and P. Ishwar, “Semi-Coupled Two-Stream Fusion ConvNets for Action Recognition at Extremely Low Resolutions,” in Proc. 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Mar. 2017.