Proceedings Article | 20 June 2021
KEYWORDS: RGB color model, 3D modeling, Sensors, Cameras, Motion models, LIDAR, Data modeling, Safety, Robots, Robotics
Nowadays, thanks to the development of new advanced driver-assistance systems (ADAS) that help drivers in their driving tasks, autonomous driving is becoming part of our lives. This massive development is mainly due to the higher safety levels that these systems can offer to the vehicles that travel on our roads every day. At the heart of every application in this area lies the perception of the environment, which guides the vehicle's behavior. This holds for the autonomous driving field and for every application in which a system moves in the 3D real world, such as robotics, augmented reality, etc. For this purpose, an effective 3D perception system is necessary to accurately localize all the objects that compose the scene and to reconstruct it in a 3D model. This problem is often addressed with LIDAR sensors, which provide accurate 3D perception and high robustness in unfavorable light and weather conditions. However, these sensors are generally expensive and therefore are not the right choice for low-cost vehicles and robots. Moreover, they must be mounted in particular positions, which makes it hard to integrate them into a car without changing both its appearance and its aerodynamics. In addition, their output is point cloud data which, due to its structure, is not easily handled by the deep learning models that promise outstanding results in many similar predictive tasks. For these reasons, in some applications it is better to leverage other sensors, such as RGB cameras, for 3D perception. To this end, more classic approaches are based on stereo cameras, RGB-D cameras, and stereo from motion, which generally reconstruct the scene with less accuracy than LIDARs but still produce acceptable results. In recent years, several approaches have been proposed in the literature that aim to estimate depth from a monocular camera by leveraging deep learning models. Some of these methods use a supervised approach,1,2 but they rely on annotated datasets, which in practice can be labor-intensive to collect. Other works3,4 instead use a self-supervised training procedure based on the reprojection error. Notwithstanding their good performance, most of the proposed approaches use very deep neural networks that are power- and resource-hungry and need high-end GPUs to produce results in real time. For these reasons, such approaches cannot be used on systems with power and computational constraints. In this work, we propose a new approach based on a standard CNN from the literature, originally designed for image segmentation and built not to be highly resource-demanding. For training, we use knowledge distillation with an off-the-shelf pre-trained network as the teacher. We carry out large-scale experiments to compare our results qualitatively and quantitatively with those of the baselines. Moreover, we present an in-depth study of inference times on both general-purpose and mobile architectures.
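To illustrate the knowledge-distillation training described above, the following is a minimal PyTorch-style sketch, not the paper's actual implementation: the student and teacher networks, the log-depth L1 loss, and all names are assumptions introduced here for illustration, since the abstract does not specify them.

```python
# Hypothetical sketch of depth knowledge distillation (names and loss are assumptions,
# not the paper's actual method): a lightweight student CNN regresses the depth maps
# produced by a frozen, off-the-shelf pre-trained teacher network.
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, images, optimizer):
    """One training step: the student mimics the teacher's depth prediction."""
    teacher.eval()
    with torch.no_grad():
        target_depth = teacher(images)   # pseudo ground truth from the teacher
    pred_depth = student(images)         # prediction from the lightweight student
    # L1 loss in log-depth space is one common choice; the actual loss may differ.
    loss = F.l1_loss(torch.log(pred_depth + 1e-6),
                     torch.log(target_depth + 1e-6))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```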