Computer Vision Laboratory, ETH Zurich
Switzerland, 2017-2021
Andrey Ignatov | Nikolay Kobyshev | Radu Timofte | Kenneth Vanhoey | Luc Van Gool
ihnatova@vision.ee.ethz.ch | nk@vision.ee.ethz.ch | timofter@vision.ee.ethz.ch | vanhoey@vision.ee.ethz.ch | vangool@vision.ee.ethz.ch
Abstract: Despite a rapid rise in the quality of built-in smartphone cameras, their physical limitations - small sensor size, compact lenses, and the lack of specific hardware - prevent them from achieving the quality of DSLR cameras. In this work we present an end-to-end deep learning approach that bridges this gap by translating ordinary photos into DSLR-quality images. We propose learning the translation function with a residual convolutional neural network that improves both color rendition and image sharpness. Since the standard mean squared loss is not well suited for measuring perceptual image quality, we introduce a composite perceptual error function that combines content, color and texture losses. The first two losses are defined analytically, while the texture loss is learned in an adversarial fashion. We also present DPED, a large-scale dataset consisting of real photos captured by three different phones and one high-end reflex camera. Our quantitative and qualitative assessments reveal that the enhanced image quality is comparable to that of DSLR-taken photos, and the methodology generalizes to any type of digital camera.
To tackle the general photo enhancement problem by mapping low-quality phone photos to photos captured by a professional DSLR camera, we introduce DPED, a large-scale dataset of photos taken synchronously in the wild by three smartphones and one DSLR camera. The devices used to collect the data are an iPhone 3GS, a BlackBerry Passport, a Sony Xperia Z and a Canon 70D DSLR. To ensure that all devices captured photos simultaneously, they were mounted on a tripod and triggered remotely by a wireless control system.
In total, over 22K photos were collected during three weeks, including 4,549 photos from the Sony smartphone, 5,727 from the iPhone and 6,015 from the BlackBerry; for each smartphone photo there is a corresponding photo from the Canon DSLR. The photos were taken during the daytime in a wide variety of places and under various illumination and weather conditions. The images were captured in automatic mode; we used default settings for all cameras throughout the whole collection procedure.
The synchronously captured photos are not perfectly aligned, since the cameras have different viewing angles, focal lengths and positions. To address this, we performed additional non-linear transformations based on SIFT features to extract the overlapping region between each phone photo and the corresponding DSLR photo (a sketch of this alignment step is shown below), and then cut the aligned image fragments into patches of size 100x100 pixels for CNN training (139K, 160K and 162K pairs for BlackBerry, iPhone and Sony, respectively). These patches constituted the input data to our CNN.
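For illustration, the snippet below sketches one common way to perform such an alignment with OpenCV: match SIFT keypoints between the two photos, robustly fit a projective warp with RANSAC, and map the phone photo onto the DSLR frame. It is a minimal sketch of the general technique, not the exact transformation used to build DPED.

```python
import cv2
import numpy as np

def align(phone_img, dslr_img, ratio=0.75):
    """Warp phone_img onto the DSLR frame via SIFT matches (illustrative)."""
    sift = cv2.SIFT_create()  # cv2.xfeatures2d.SIFT_create() in older OpenCV
    kp1, des1 = sift.detectAndCompute(phone_img, None)
    kp2, des2 = sift.detectAndCompute(dslr_img, None)

    # Match descriptors and keep matches that pass Lowe's ratio test.
    matcher = cv2.BFMatcher()
    matches = [m for m, n in matcher.knnMatch(des1, des2, k=2)
               if m.distance < ratio * n.distance]

    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

    # Robustly estimate a homography and warp the phone photo accordingly;
    # the overlapping region can then be cropped and cut into 100x100 patches.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    h, w = dslr_img.shape[:2]
    return cv2.warpPerspective(phone_img, H, (w, h))
```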
Image enhancement is performed using a deep residual convolutional neural network trained to minimize a composite perceptual loss with three terms (a sketch of the full objective is given after this list):
● Color loss: the enhanced image should be close to the target (DSLR) photo in terms of colors. To measure the difference between them, we apply Gaussian blur to both images and compute the Euclidean distance between the blurred representations.
● Texture loss: to measure texture quality of the enhanced image, we train a separate adversarial CNN (discriminator) that learns to distinguish enhanced images from real DSLR photos; the texture loss pushes the enhancement network to fool it.
● Content loss: a distinct pre-trained VGG-19 network produces feature maps for both the enhanced and the target image; the loss is defined as the Euclidean distance between these feature maps and enforces the preservation of image semantics.
All these losses are then combined into a weighted sum, and the system is trained as a whole with the backpropagation algorithm to minimize this final weighted loss.
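The snippet below is a hedged sketch of this composite objective in TensorFlow (the framework used by our implementation). The vgg_features argument stands for a hypothetical helper that extracts feature maps from a pre-trained VGG-19 layer, and the loss weights are illustrative defaults rather than the exact values used for the released models.

```python
import numpy as np
import tensorflow as tf

def gaussian_kernel(size=21, sigma=3.0, channels=3):
    """Build a [size, size, channels, 1] Gaussian kernel for depthwise conv."""
    ax = np.arange(size, dtype=np.float32) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    k = (k / k.sum()).astype(np.float32)
    return tf.constant(np.tile(k[:, :, None, None], (1, 1, channels, 1)))

def blur(images, kernel):
    # Depthwise convolution blurs each RGB channel independently.
    return tf.nn.depthwise_conv2d(images, kernel,
                                  strides=[1, 1, 1, 1], padding='SAME')

def color_loss(enhanced, target, kernel):
    # Euclidean distance between Gaussian-blurred images: compares colors
    # while being insensitive to high-frequency detail and small misalignments.
    return tf.reduce_mean(tf.square(blur(enhanced, kernel) - blur(target, kernel)))

def texture_loss(disc_logits_on_enhanced):
    # Standard adversarial generator loss: push the discriminator toward
    # labeling enhanced images as "real" (DSLR-like).
    return tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
        labels=tf.ones_like(disc_logits_on_enhanced),
        logits=disc_logits_on_enhanced))

def content_loss(enhanced, target, vgg_features):
    # Euclidean distance between VGG-19 feature maps preserves semantics.
    # vgg_features is a hypothetical helper returning one layer's activations.
    return tf.reduce_mean(tf.square(vgg_features(enhanced) - vgg_features(target)))

def total_loss(enhanced, target, disc_logits, vgg_features, kernel,
               w_content=1.0, w_texture=0.4, w_color=0.1):
    # Weighted sum minimized by backpropagation; weights here are illustrative.
    return (w_content * content_loss(enhanced, target, vgg_features)
            + w_texture * texture_loss(disc_logits)
            + w_color * color_loss(enhanced, target, kernel))
```

As is standard for adversarial losses, the discriminator is updated in alternation with the enhancement network during training.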
The TensorFlow implementation of the proposed models and the whole training pipeline is available in our GitHub repo.
Pre-trained models + standalone code to run them can be downloaded separately here.
Prerequisites: GPU + CUDA/cuDNN + TensorFlow (>= 1.0.1)
"DSLR-Quality Photos on Mobile Devices with Deep Convolutional Networks",
In