CamSDD Project

Fast and Accurate Camera Scene Detection on Smartphones

Angeline Pouget

Sidharth Ramesh

Maximilian Giang

Ramithan Chandrapalan

Toni Tanner

Moritz Prussing

Radu Timofte ✉

Andrey Ignatov ✉

angeline.pouget@gmail.com

siramesh@ethz.ch

giangm@ethz.ch

rachandr@ethz.ch

tannerto@ethz.ch

moritzpr@ethz.ch

timofter@vision.ee.ethz.ch

andrey@vision.ee.ethz.ch

Abstract: AI-powered automatic camera scene detection is nowadays available in nearly every modern smartphone, yet the problem of accurate scene prediction has not been addressed by the research community. This paper carefully defines this problem for the first time and proposes a novel Camera Scene Detection Dataset (CamSDD) containing more than 11K manually crawled images belonging to 30 different scene categories. We propose an efficient and NPU-friendly CNN model for this task that demonstrates a top-3 accuracy of 99.5% on this dataset and achieves more than 200 FPS on recent mobile SoCs. An additional in-the-wild evaluation of the obtained solution is performed to analyze its performance and limitations in real-world scenarios. The dataset and pre-trained models from this paper are provided below.

arXiv: 2105.07869, 2021

< Camera Scene Detection Dataset >

When solving the camera scene detection problem, one of the most critical challenges is obtaining high-quality, diverse data for training the model. Since no public dataset existed for this task, a new Camera Scene Detection Dataset (CamSDD) containing more than 11K images in 30 different categories was collected first. The photos were crawled from Flickr and inspected manually to remove monochrome and heavily edited pictures, images with distorted colors or watermarks, photos that smartphone cameras cannot capture (e.g., professional underwater or night shots), etc. The dataset was designed to contain diverse images, so each scene category includes photos taken in different places and from different viewpoints and angles: e.g., the 'cat' category contains not only cat faces but also normal full-body pictures shot from different positions. This diversity is essential for training a model that generalizes to different environments and shooting conditions. Each image in CamSDD belongs to exactly one scene category. The dataset was designed to be balanced, so each category contains around 350 photos on average. After the images were collected, they were resized to a resolution of 576 × 384 px, since larger photos would not add information that is vital for the considered classification problem.
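The balance property described above can be checked with a few lines of code. The following is a minimal sketch, assuming the images are stored in one folder per scene category; that layout, the file extensions, and the function names are illustrative assumptions rather than the documented structure of CamSDD.

```python
from collections import Counter
from pathlib import Path

# Assumed image extensions; the real dataset files may differ.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images_per_category(root):
    """Return a Counter mapping each category folder name to its image count."""
    counts = Counter()
    for category_dir in sorted(Path(root).iterdir()):
        if not category_dir.is_dir():
            continue
        counts[category_dir.name] = sum(
            1
            for f in category_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts


def balance_summary(counts):
    """Summarize how evenly images are spread across categories."""
    sizes = list(counts.values())
    if not sizes:
        raise ValueError("no category folders found")
    return {
        "categories": len(sizes),
        "total": sum(sizes),
        "mean": sum(sizes) / len(sizes),
        "min": min(sizes),
        "max": max(sizes),
    }
```

For a dataset matching the description, `balance_summary` would be expected to report 30 categories and a mean of roughly 350 images, with `min` and `max` showing how far individual categories deviate from that average.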

< Models >

Backbone Architecture	Model Type	Input Size	Model Size, MB	Top-1 Accuracy, %	Top-3 Accuracy, %
MobileNet-V2	FP32	224 × 224	73	94.17	98.67
MobileNet-V2	INT8	224 × 224	19	94.17	98.67
MobileNet-V1	FP32	224 × 224	208	92.67	99.50
MobileNet-V1	INT8	224 × 224	52	91.50	99.00
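The table reports top-1 and top-3 accuracy. As a reminder of what these metrics measure, here is a minimal sketch in plain Python: a prediction counts as a top-k hit when the true class is among the k highest-scoring classes. The function name and the toy scores are illustrative and not taken from the authors' evaluation code.

```python
def top_k_accuracy(scores, labels, k):
    """Percentage of samples whose true label is among the k highest scores."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("need at least one sample")
    hits = 0
    for row, true_label in zip(scores, labels):
        # Indices of the k largest scores, best first.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += true_label in top_k
    return 100.0 * hits / len(labels)


# Toy example with 4 classes and 3 samples (made-up numbers).
scores = [
    [0.7, 0.2, 0.1, 0.0],  # true class 0: ranked first, top-1 hit
    [0.1, 0.5, 0.3, 0.1],  # true class 2: ranked second, top-3 hit only
    [0.4, 0.3, 0.2, 0.1],  # true class 3: ranked last, miss
]
labels = [0, 2, 3]

print(top_k_accuracy(scores, labels, k=1))
print(top_k_accuracy(scores, labels, k=3))
```

In this toy example one of three samples is a top-1 hit and two of three are top-3 hits, which illustrates why top-3 accuracy is never lower than top-1 accuracy for the same model.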

< In-the-wild Performance Testing >

While the proposed models demonstrate high accuracy on the CamSDD dataset, their performance on live camera data matters most for this task. To evaluate it, we developed an Android application that uses the obtained TensorFlow Lite models to classify image frames from the camera stream in real time. We checked the models' predictions on hundreds of different scenes; some sample results obtained with the Samsung Galaxy J5 are provided below.
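The per-frame post-processing in such an application can be sketched in plain Python. The example below assumes the INT8 model emits quantized integer scores that are mapped back to real values with the standard TensorFlow Lite affine formula `real = scale * (q - zero_point)`. The scale, the zero point, the raw scores, and all class names other than 'cat' are hypothetical; in the TensorFlow Lite Python API, the actual quantization parameters are available from the interpreter's output details. This is not the authors' Android implementation.

```python
def dequantize(raw_scores, scale, zero_point):
    """Map quantized integer scores to real values: scale * (q - zero_point)."""
    return [scale * (q - zero_point) for q in raw_scores]


def top_predictions(scores, class_names, k=3):
    """Return the k (class_name, score) pairs with the highest scores, best first."""
    if len(scores) != len(class_names):
        raise ValueError("scores and class_names must have the same length")
    ranked = sorted(zip(class_names, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical class names and raw uint8 outputs for a single camera frame.
class_names = ["cat", "dog", "portrait", "landscape"]
raw_scores = [10, 200, 90, 255]

# Assumed quantization parameters (scale 1/256, zero point 0).
frame_scores = dequantize(raw_scores, scale=1 / 256, zero_point=0)

for name, score in top_predictions(frame_scores, class_names, k=3):
    print(f"{name}: {score:.3f}")
```

Showing the three best candidates per frame mirrors the top-3 metric reported for the models, although how the application presents its predictions is not specified here.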

Computer Vision Laboratory, ETH Zurich

Switzerland, 2021
