Rendering Natural Camera Bokeh Effect with Deep Learning

Andrey Ignatov	Jagruti Patel	Radu Timofte
andrey@vision.ee.ethz.ch	patelj@student.ethz.ch	timofter@vision.ee.ethz.ch

Abstract: Bokeh is an important artistic effect used to highlight the main object of interest in a photo by blurring all out-of-focus areas. While DSLR and system camera lenses can render this effect naturally, mobile cameras are unable to produce shallow depth-of-field photos due to the very small aperture diameter of their optics. Unlike the current solutions that simulate bokeh by applying Gaussian blur to the image background, in this paper we propose to learn a realistic shallow focus technique directly from the photos produced by DSLR cameras. For this, we present a large-scale bokeh dataset consisting of 5K shallow / wide depth-of-field image pairs captured using the Canon 7D DSLR with 50mm f/1.8 lenses. We use these images to train a deep learning model to reproduce a natural bokeh effect based on a single narrow-aperture image. The experimental results show that the proposed approach is able to render a plausible non-uniform bokeh even in the case of complex input data with multiple objects. The dataset, pre-trained models and code used in this paper are provided below.

arXiv: 2006.05698, 2020

Original Photos vs. Rendered Bokeh Images

Google Pixel Camera Bokeh vs. Rendered with PyNET

Everything is Better with Bokeh! Dataset

One of the biggest challenges in the bokeh rendering task is obtaining high-quality real data that can be used for training deep models. To tackle this problem, a large-scale Everything is Better with Bokeh! (EBB!) dataset containing more than 10 thousand images was collected in the wild with the Canon 7D DSLR camera over several months. Images with shallow and wide depth-of-field were obtained by controlling the aperture size of the lens. In each photo pair, the first image was captured with a narrow aperture (f/16), which yields a normal sharp photo, whereas the second one was shot at the widest aperture (f/1.8), which produces a strong bokeh effect. The photos were taken during the daytime in a wide variety of places and under various illumination and weather conditions. They were captured in automatic mode, with the default settings used throughout the entire collection procedure.
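To give a feel for how different the two aperture settings are, the short sketch below applies the standard relation D = f / N (aperture diameter equals focal length divided by f-number) to the 50mm lens mentioned above. The function and variable names are illustrative only and are not taken from the released code.

```python
def aperture_diameter_mm(focal_length_mm: float, f_number: float) -> float:
    """Entrance pupil diameter D = f / N for focal length f and f-number N."""
    return focal_length_mm / f_number


FOCAL_LENGTH_MM = 50.0  # focal length of the lens used for the dataset

wide_open = aperture_diameter_mm(FOCAL_LENGTH_MM, 1.8)      # bokeh shot
stopped_down = aperture_diameter_mm(FOCAL_LENGTH_MM, 16.0)  # sharp shot

# The light-gathering area scales with the square of the diameter.
area_ratio = (wide_open / stopped_down) ** 2

print(f"f/1.8 diameter: {wide_open:.2f} mm")
print(f"f/16 diameter:  {stopped_down:.2f} mm")
print(f"area ratio:     {area_ratio:.1f}x")
```

The aperture is roughly 28 mm wide at f/1.8 and only about 3 mm wide at f/16, which is why the second photo of each pair shows a much stronger background blur.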

The captured image pairs are not exactly aligned, so they were first matched using SIFT keypoints and the RANSAC method. The resulting images were then cropped to their common (intersecting) region and downscaled so that their final height is equal to 1024 pixels. Finally, we computed a coarse depth map for each wide depth-of-field image using the Megadepth model. These maps can be stacked directly with the input images and used as additional guidance for the trained model. From the resulting 10 thousand images (5 thousand pairs), 200 image pairs were reserved for testing, while the other 4.8 thousand photo pairs can be used for training and validation.
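The crop-and-downscale step described above can be sketched with plain rectangle arithmetic. This is only an illustration of the geometry, not the actual preprocessing code: the SIFT and RANSAC matching itself is not reproduced, the helper names are hypothetical, and the frame size and the offset of the second image are made-up example values.

```python
from typing import Tuple

# (left, top, right, bottom) in pixels; right and bottom are exclusive.
Box = Tuple[int, int, int, int]


def intersect(a: Box, b: Box) -> Box:
    """Return the region covered by both boxes."""
    left, top = max(a[0], b[0]), max(a[1], b[1])
    right, bottom = min(a[2], b[2]), min(a[3], b[3])
    if right <= left or bottom <= top:
        raise ValueError("boxes do not overlap")
    return (left, top, right, bottom)


def downscaled_size(box: Box, target_height: int = 1024) -> Tuple[int, int]:
    """Return (width, height) after resizing the box to the target height."""
    width, height = box[2] - box[0], box[3] - box[1]
    scale = target_height / height
    return (round(width * scale), target_height)


# Hypothetical example: after alignment, the second frame ends up
# shifted by 40 px to the right and 25 px up relative to the first one.
sharp_frame = (0, 0, 5184, 3456)
aligned_bokeh_frame = (40, -25, 5224, 3431)

common = intersect(sharp_frame, aligned_bokeh_frame)
size = downscaled_size(common)
print(common, size)
```

Both images of a pair would be cropped to `common` and resized to `size`, so that they cover the same scene content at the same resolution.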

Note: The full EBB! dataset will be available after the end of the AIM 2020 Bokeh Effect Rendering Challenge. You can now download its training part (without the depth data) by registering in the above challenge using the following link:

PyNET Architecture

The bokeh effect simulation problem belongs to a group of tasks that involve both global and local image processing. High-level image analysis is needed to detect the areas of the photo where the bokeh effect should be applied, whereas low-level processing is used for rendering the actual shallow depth-of-field images and refining the results. Therefore, in this work we base our solution on the PyNET architecture, which was designed specifically for this kind of task: it processes the image at different scales and combines the learned global and local features.

The proposed architecture has a number of blocks that process feature maps in parallel with convolutional filters of different sizes (from 3×3 to 9×9). The outputs of the corresponding convolutional layers are then concatenated, which allows the network to learn a more diverse set of features at each level. The outputs obtained at lower scales are upsampled, stacked with feature maps from the upper level and then processed in the following convolutional layers. Instance normalization is used in all convolutional layers that process images at lower scales (levels 2-5). We additionally use two transposed convolutional layers on top of the main model that upsample the images to their target size.
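The idea of running filters of several sizes in parallel and concatenating their outputs can be illustrated with a deliberately simplified toy example. The sketch below is not the PyNET code: it works on a 1-D signal instead of 2-D feature maps, uses fixed averaging filters as stand-ins for learned weights, and assumes the kernel sizes 3, 5, 7 and 9. All function names are hypothetical.

```python
from typing import List, Sequence


def conv1d_same(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """1-D correlation with zero padding; output length equals input length."""
    pad = (len(kernel) - 1) // 2  # "same" padding for an odd kernel size
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(padded[i + j] * kernel[j] for j in range(len(kernel)))
        for i in range(len(signal))
    ]


def multi_kernel_block(
    signal: Sequence[float], kernel_sizes: Sequence[int] = (3, 5, 7, 9)
) -> List[List[float]]:
    """Apply one branch per kernel size and concatenate outputs as channels."""
    channels = []
    for k in kernel_sizes:
        kernel = [1.0 / k] * k  # averaging filter, stand-in for learned weights
        channels.append(conv1d_same(signal, kernel))
    return channels


signal = [0.0] * 8 + [1.0] + [0.0] * 8  # a single impulse at index 8
features = multi_kernel_block(signal)

for k, channel in zip((3, 5, 7, 9), features):
    support = sum(1 for value in channel if value != 0.0)
    print(f"kernel size {k}: response spread over {support} samples")
```

Each branch sees the same input but with a different receptive field, so the concatenated channels describe the signal at several spatial extents at once. In the 2-D case the same principle applies per feature map.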

The model is trained sequentially, starting from the lowest level. This makes it possible to achieve good semantically driven reconstruction results at the smaller scales, which work with images of very low resolution and thus perform mostly global image manipulations. After the bottom level is pre-trained, the same procedure is applied to the next level, and so on until the training is done at the original resolution. Since each higher level receives upscaled high-quality features from the lower part of the model, it mainly learns to reconstruct the missing low-level details and to refine the results.
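The order of this coarse-to-fine training can be written down as a small schedule. The sketch below rests on assumptions that are not stated in the text above: the number of levels is a free parameter, each level is assumed to work at half the resolution of the level above it, and level 1 is assumed to correspond to the original resolution. The function name and the example image size are hypothetical.

```python
from typing import List, Tuple


def coarse_to_fine_schedule(
    height: int, width: int, num_levels: int
) -> List[Tuple[int, int, int]]:
    """List (level, height, width) in training order, lowest level first.

    Assumes a factor-of-two change in resolution between adjacent levels.
    """
    schedule = []
    for level in range(num_levels, 0, -1):
        factor = 2 ** (level - 1)
        schedule.append((level, height // factor, width // factor))
    return schedule


# Hypothetical example: 1536x1024 images and five levels.
schedule = coarse_to_fine_schedule(height=1024, width=1536, num_levels=5)
for level, h, w in schedule:
    print(f"train level {level} at {w}x{h}")
```

Under these assumptions the lowest level would be trained on 96x64 images, which explains why it can only learn global manipulations, while the final stage works at full resolution.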

Code

The TensorFlow implementation and the entire training pipeline are available in our GitHub repository

PyTorch implementation of the PyNET model can be found here

Citation

Andrey Ignatov, Jagruti Patel and Radu Timofte.

"Rendering Natural Camera Bokeh Effect with Deep Learning",

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2020

Computer Vision Laboratory, ETH Zurich

Switzerland, 2020-2021