Fast and Accurate Camera Scene Detection on Smartphones

Angeline Pouget Sidharth Ramesh Maximilian Giang Ramithan Chandrapalan Toni Tanner Moritz Prussing Radu Timofte Andrey Ignatov
angeline.pouget@gmail.com siramesh@ethz.ch giangm@ethz.ch rachandr@ethz.ch tannerto@ethz.ch moritzpr@ethz.ch timofter@vision.ee.ethz.ch andrey@vision.ee.ethz.ch

Abstract: AI-powered automatic camera scene detection mode is nowadays available in nearly any modern smartphone, though the problem of accurate scene prediction has not yet been addressed by the research community. This paper for the first time carefully defines this problem and proposes a novel Camera Scene Detection Dataset (CamSDD) containing more than 11K manually crawled images belonging to 30 different scene categories. We propose an efficient and NPU-friendly CNN model for this task that demonstrates a top-3 accuracy of 99.5% on this dataset and achieves more than 200 FPS on the recent mobile SoCs. An additional in-the-wild evaluation of the obtained solution is performed to analyze its performance and limitation in the real-world scenarios. The dataset and pre-trained models in this paper are provided below.

arXiv: 2105.07869, 2021

< Camera Scene Detection  Dataset >

When solving the camera scene detection problem, one of the most critical challenges is to get high-quality diverse data for training the model. Since no public datasets existed for this task, a new Camera Scene Detection Dataset (CamSDD) containing more than 11K images and consisting of 30 different categories was collected first. The photos were crawled from Flickr and inspected manually to remove monochrome and heavily edited pictures, images with distorted colors and watermarks, photos that are impossible for smartphone cameras (e.g., professional underwater or night shots), etc. The dataset was designed to contain diverse images, therefore each scene category contains photos taken in different places, from different viewpoints and angles: e.g., the 'cat' category does not only contain cat faces but also normal full-body pictures shot from different positions. This diversity is essential for training a model that is generalizable to different environments and shooting conditions. Each image from the CamSDD dataset belongs to only one scene category. The dataset was designed to be balanced, thus each category contains on average around 350 photos. After the images were collected, they were resized to 576 x 384 px resolution as using larger photos will not bring any information that is vital for the considered classification problem.

< Models >

Backbone Architecture Model Type Input Size Model Size, MB Top-1 Accuracy, % Top-3 Accuracy, %
MobileNet-V2 FP32 224 × 224 73 94.17 98.67
MobileNet-V2 INT8 224 × 224 19 94.17 98.67
MobileNet-V1 FP32 224 × 224 208 92.67 99.50
MobileNet-V1 INT8 224 × 224 52 91.50 99.00

< In-the-wild  Performance  Testing >

While the proposed models demonstrate high accuracy on the CamSDD dataset, their real performance on live camera data is the most important for this task. For this, we developed an Android application that is using the obtained TensorFlow Lite models to perform real-time classification of the image frames coming from camera stream. We checked the predictions of the models on hundreds of different scenery, some sample results obtained with the Samsung Galaxy J5 are provided below.

< Citation >

Angeline Pouget, Sidharth Ramesh, Maximilian Giang, Ramithan Chandrapalan, Toni Tanner, Moritz Prussing, Radu Timofte and Andrey Ignatov,


"Fast and Accurate Camera Scene Detection on Smartphones",


Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2021

Computer Vision Laboratory, ETH Zurich

Switzerland, 2021