WESPE: Weakly Supervised Photo Enhancer for Digital Cameras

Andrey Ignatov Nikolay Kobyshev Radu Timofte Kenneth Vanhoey Luc Van Gool
ihnatova@vision.ee.ethz.ch nk@vision.ee.ethz.ch timofter@vision.ee.ethz.ch vanhoey@vision.ee.ethz.ch vangool@vision.ee.ethz.ch

Abstract: Low-end and compact mobile cameras deliver limited photo quality, mainly due to space, hardware and budget constraints. In this work, we propose a deep learning solution that automatically translates photos taken by cameras with limited capabilities into DSLR-quality photos. We tackle this problem by introducing a weakly supervised photo enhancer (WESPE), a novel image-to-image GAN-based architecture. The proposed model is trained by weakly supervised learning: unlike previous works, there is no need for strong supervision in the form of a large annotated dataset of aligned original/enhanced photo pairs. The sole requirement is two distinct datasets: one from the source camera, and one composed of arbitrary high-quality images; the visual content they exhibit may be unrelated. Hence, our solution is repeatable for any camera: collecting the data and training can be achieved in a couple of hours. Our experiments on the DPED, KITTI and Cityscapes datasets, as well as on photos from several generations of smartphones, demonstrate that WESPE produces qualitative results comparable to state-of-the-art strongly supervised methods.

arXiv: 1709.01118, 2017

Example Results

Interactive before/after image comparisons (original vs. WESPE-enhanced) are provided for the Cityscapes dataset and for photos taken with the following smartphone cameras: iPhone 6, Huawei P9 (Leica camera), HTC One M9, Meizu M3s, Xiaomi Redmi 3X and Nexus 5X.

Datasets

The first part of the experiments is conducted on three publicly available datasets: DPED, Cityscapes and KITTI. The DPED dataset consists of photos from three smartphones (iPhone 3GS, BlackBerry Passport, Sony Xperia Z), while the other two datasets were collected using low-end cameras mounted in cars and are intended for semantic labeling and autonomous driving tasks.


Additionally, the proposed approach was tested on photos from several common smartphones: iPhone 6, Huawei P9, HTC One M9, Meizu M3s, Xiaomi Redmi 3X and Nexus 5X. For each phone, between 600 and 1500 training images were collected within a single day.

Algorithm

The proposed solution consists of five convolutional neural networks. The first network is a 12-layer residual CNN whose goal is to enhance the input image to DSLR quality. However, since there are no explicit target images, we face the following problem: the network does not know what the enhanced image should look like. Moreover, without any constraints it could produce an enhanced image that is not related to the input photo at all. To address these challenges, we propose three loss functions, each taking care of one of these aspects:


Content loss: We introduce a second CNN with the same architecture that takes the enhanced image as input and tries to reproduce the original photo. If the content of the original photo is not preserved in the enhanced image, this reconstruction fails and the loss is high. The loss itself is computed as the difference between the original image and its reconstruction, using content features extracted by a VGG-19 network pre-trained on ImageNet.
Color loss: The enhanced image should have bright and vivid colors. To measure its color quality, we train an adversarial CNN discriminator that observes both enhanced and arbitrary high-quality images, with the objective of predicting which image is which. Before passing images to this discriminator, we apply Gaussian blur to them to prevent texture and content comparison. The goal of the image enhancement network is to fool the discriminator, so that it cannot distinguish between the enhanced and DSLR photos.
Texture loss: A separate CNN discriminator is trained to measure the texture quality of the enhanced image. To rule out color comparison, the enhanced and DSLR images are converted to grayscale before being passed to this discriminator, whose objective is again to predict which image is which. The goal of the image enhancement network is the same as in the previous case.

Finally, these losses are combined into a weighted sum, and the presented system is trained as a whole to minimize this final weighted loss.
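To make the objective concrete, here is a minimal PyTorch sketch of how the three losses could be combined into one weighted sum. The network classes (`G`, `F_inv`, the discriminators, the VGG feature extractor), the blur kernel and the loss weights are all illustrative assumptions, not the exact architectures or values from the paper:

```python
# Illustrative sketch of a WESPE-style training objective (not the
# paper's exact implementation). All networks are passed in as callables.
import torch
import torch.nn.functional as F

def gaussian_blur(img, kernel):
    # depthwise blur to discard texture before the color discriminator;
    # kernel has shape (1, 1, k, k) and is shared across channels
    c = img.shape[1]
    return F.conv2d(img, kernel.expand(c, 1, *kernel.shape[-2:]),
                    padding=kernel.shape[-1] // 2, groups=c)

def grayscale(img):
    # standard luminance conversion to rule out color comparison
    r, g, b = img[:, 0:1], img[:, 1:2], img[:, 2:3]
    return 0.299 * r + 0.587 * g + 0.114 * b

def wespe_loss(x, vgg, G, F_inv, D_color, D_texture, blur_kernel,
               w_content=1.0, w_color=5e-3, w_texture=5e-3):
    y = G(x)          # enhanced image
    x_rec = F_inv(y)  # inverse generator reconstructs the input
    # content loss: compare VGG features of the input and its reconstruction
    content = F.mse_loss(vgg(x_rec), vgg(x).detach())
    # adversarial color loss on blurred images (generator tries to fool D)
    color = F.binary_cross_entropy_with_logits(
        D_color(gaussian_blur(y, blur_kernel)),
        torch.ones(y.shape[0], 1))
    # adversarial texture loss on grayscale images
    texture = F.binary_cross_entropy_with_logits(
        D_texture(grayscale(y)), torch.ones(y.shape[0], 1))
    return w_content * content + w_color * color + w_texture * texture
```

In a full training loop the two discriminators would be updated in alternation with the generator, as is usual for GANs; only the generator-side objective is shown here.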

Note that after the system is trained, only the first CNN is needed to enhance images.
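To illustrate the inference path, the following is a hedged sketch of what a 12-convolution residual enhancer and its standalone use could look like. The channel counts, kernel sizes and block layout are assumptions for illustration; the paper's exact generator architecture may differ:

```python
# Illustrative residual enhancer (12 convolutional layers total).
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    # one residual block: two 3x3 convolutions with a skip connection
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class Generator(nn.Module):
    # 1 head conv + 4 residual blocks (8 convs) + 2 convs + 1 output conv
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, ch, 9, padding=4), nn.ReLU(),
            *[ResBlock(ch) for _ in range(4)],
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 3, 9, padding=4), nn.Tanh())

    def forward(self, x):
        return self.net(x)

# at inference time, only this generator is needed:
enhancer = Generator()
with torch.no_grad():
    enhanced = enhancer(torch.rand(1, 3, 64, 64))
```

The discriminators and the inverse (reconstruction) network exist only to shape the training signal and are discarded once training is done.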

Citation

Andrey Ignatov, Nikolay Kobyshev, Radu Timofte, Kenneth Vanhoey and Luc Van Gool.

"WESPE: Weakly Supervised Photo Enhancer for Digital Cameras",

In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018

Computer Vision Laboratory, ETH Zurich

Switzerland, 2017-2021