Automatic 2D-to-3D image conversion
Team: J. Konrad, G. Brown, M. Wang, P. Ishwar
Collaborators: D. Mukherjee, C. Wu (Google)
Funding: National Science Foundation
Status: Completed (2011-2015)
Video recording of the presentation by J. Konrad entitled “Automatic 2D-to-3D image conversion using 3D examples from the Internet,” given at the SPIE Stereoscopic Displays and Applications Symposium, January 2012, San Francisco:
Background: Among 2D-to-3D image conversion methods, those involving human operators have been the most successful, but they are also time-consuming and costly. Fully-automatic methods typically make strong assumptions about the 3D scene; for example, faster-moving or larger objects are assumed to be closer to the viewer, while higher-frequency texture is assumed to belong to objects located farther away. Although such methods may work well in some cases, it is in general very difficult, if not impossible, to construct a deterministic scene model that covers all possible background and foreground combinations. In practice, such methods have not achieved the same level of quality as semi-automatic ones.
Summary: In this project, we explore a radically different approach inspired by our work on saliency detection in images [1]. Instead of relying on a deterministic scene model for the input 2D image, we propose to “learn” the model from a large dictionary of 3D images, such as YouTube 3D [2] or the NYU Kinect dataset [3]. Our approach is built upon a key observation and a key assumption. The observation is that, among the millions of 3D images available on-line, there likely exist many whose 3D content matches that of the 2D input (query). The assumption is that two 3D images (e.g., stereopairs) whose left images are photometrically similar are likely to have similar depth fields. In our approach, we first find a number of on-line 3D images that are close photometric matches to the 2D query and then extract depth information from them. If the 3D images are provided as stereopairs, this requires disparity/depth estimation; in the case of the NYU dataset, depth fields are provided directly, since all images were captured indoors with a Kinect camera. Because the depth/disparity fields of the best photometric matches differ due to differences in underlying image content, noise level, distortions, etc., we combine them using a pixel-wise median and post-process the result with cross-bilateral filtering. We then apply the resulting depth/disparity field to the 2D query to render the corresponding right image, handling occlusions and newly-exposed areas in the usual way.
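To make the pipeline concrete, below is a minimal Python sketch of the steps just described (photometric search, median fusion, cross-bilateral filtering, and rendering of the right view). It is an illustration under simplifying assumptions rather than the implementation used in [2, 3]: the 3D dictionary is assumed to be a list of (left image, depth/disparity) pairs already resampled to the query resolution, a coarse grayscale thumbnail stands in for the photometric descriptor, the fused field is treated directly as a disparity in pixels, and all function names and parameter values are illustrative.

```python
import numpy as np

def global_descriptor(img, grid=(8, 8)):
    """Coarse grayscale thumbnail used here as a stand-in photometric descriptor."""
    gray = np.asarray(img, dtype=float).mean(axis=2)
    h, w = gray.shape
    gh, gw = grid
    gray = gray[:h - h % gh, :w - w % gw]
    return gray.reshape(gh, h // gh, gw, w // gw).mean(axis=(1, 3)).ravel()

def k_nearest(query_img, dictionary, k=10):
    """Return the k dictionary entries photometrically closest to the query."""
    q = global_descriptor(query_img)
    dists = [np.linalg.norm(q - global_descriptor(img)) for img, _ in dictionary]
    return [dictionary[i] for i in np.argsort(dists)[:k]]

def fuse_depths(matches):
    """Pixel-wise median of the matched depth/disparity fields
    (all assumed resampled to the query resolution)."""
    return np.median(np.stack([d for _, d in matches], axis=0), axis=0)

def cross_bilateral(depth, guide, radius=9, sigma_s=9.0, sigma_r=0.1):
    """Smooth the fused depth while respecting intensity edges of the guide
    (query) image -- a plain, unoptimized cross-bilateral filter."""
    depth = np.asarray(depth, dtype=float)
    gray = np.asarray(guide, dtype=float).mean(axis=2) / 255.0
    h, w = depth.shape
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(xs ** 2 + ys ** 2) / (2 * sigma_s ** 2))
    pad_d = np.pad(depth, radius, mode='edge')
    pad_g = np.pad(gray, radius, mode='edge')
    out = np.empty_like(depth)
    for y in range(h):
        for x in range(w):
            pd = pad_d[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            pg = pad_g[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            wgt = spatial * np.exp(-((pg - gray[y, x]) ** 2) / (2 * sigma_r ** 2))
            out[y, x] = (wgt * pd).sum() / wgt.sum()
    return out

def render_right_view(left, disparity):
    """Naive depth-image-based rendering: shift each pixel of the left (query)
    image by its disparity and fill newly exposed areas from the nearest
    rendered pixel on the same row."""
    h, w, _ = left.shape
    right = np.zeros_like(left)
    filled = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            xr = int(round(x - disparity[y, x]))
            if 0 <= xr < w:
                right[y, xr] = left[y, x]   # later writes overwrite earlier ones (crude occlusion rule)
                filled[y, xr] = True
        last = None
        for xr in range(w):                 # simple hole filling along the row
            if filled[y, xr]:
                last = right[y, xr]
            elif last is not None:
                right[y, xr] = last
    return right

def convert_2d_to_3d(query_img, dictionary, k=10):
    """dictionary: list of (left_image, depth_or_disparity) pairs, e.g. built
    off-line from YouTube 3D stereopairs or NYU Kinect frames (hypothetical format)."""
    matches = k_nearest(query_img, dictionary, k)
    disparity = cross_bilateral(fuse_depths(matches), query_img)
    return query_img, render_right_view(query_img, disparity)
```

In practice, the photometric search would rely on stronger global descriptors and an approximate nearest-neighbor index over the on-line repository, and the rendering step would use a more careful occlusion and hole-filling strategy than the row-wise rule sketched here.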
Results: We have applied the above method in two scenarios. First, we used YouTube 3D videos, searching them for the frames most similar to the query. Then, we repeated the experiments on the smaller NYU Kinect dataset of indoor scenes. While far from perfect, the presented results demonstrate that on-line repositories of 3D content can be used for effective 2D-to-3D image conversion. With the continuously increasing amount of 3D data on-line and with the rapidly growing computing power in the cloud, the proposed framework appears to be a promising alternative to operator-assisted 2D-to-3D conversion.
Below are shown three examples of automatic 2D-to-3D image conversion from YouTube 3D videos, rendered as anaglyph images.
Publications:
[1] M. Wang, J. Konrad, P. Ishwar, K. Jing, and H. Rowley, “Image saliency: From intrinsic to extrinsic context,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 417-424, June 2011.
[2] J. Konrad, G. Brown, M. Wang, P. Ishwar, C. Wu, and D. Mukherjee, “Automatic 2D-to-3D image conversion using 3D examples from the Internet,” in Proc. SPIE Stereoscopic Displays and Applications, vol. 8288, Jan. 2012. (Video recording of the presentation is shown at the top of this page.)
[3] J. Konrad, M. Wang, and P. Ishwar, “2D-to-3D image conversion by learning depth from examples,” in 3D Cinematography Workshop (3DCINE’12) at CVPR’12, pp. 16-22, June 2012.