Bird's-eye-view (BEV) perception has become the mainstream approach to autonomous-driving perception: it achieves comprehensive perception of the vehicle's surroundings by fusing multiple sensors at the feature level. However, existing multi-modal BEV perception methods typically demand very high computing resources, especially for the view transformation of multi-camera images. Moreover, the key to multi-modal BEV perception lies in efficiently fusing point-cloud features with image features. To address these shortcomings, this paper proposes a novel multi-modal BEV perception algorithm. First, it proposes an index-lookup method for converting multi-view image features to the BEV perspective, which greatly reduces computational cost with almost no loss of information. Second, it proposes a feature fusion method that uses a cross-modal attention mechanism to strengthen the interaction between features of different modalities and to achieve dynamic spatiotemporal alignment and fusion. Experimental results show that the proposed method perceives the environment effectively and can be deployed on a real vehicle platform for real-time detection.
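As a rough illustration of the cross-modal fusion idea, the sketch below fuses LiDAR BEV features with camera BEV features through a single cross-attention layer in PyTorch. The module name `CrossModalFusion`, the tensor shapes, and the single-layer residual design are assumptions made for illustration; the abstract does not specify the paper's actual fusion architecture.

```python
# Minimal sketch: LiDAR BEV queries attend to camera BEV features.
# Shapes, dimensions, and the single-layer design are illustrative
# assumptions, not the paper's actual network.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Fuse camera BEV features into LiDAR BEV features via cross-attention."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, lidar_bev: torch.Tensor, cam_bev: torch.Tensor) -> torch.Tensor:
        # lidar_bev, cam_bev: (B, C, H, W) feature maps on the same BEV grid.
        b, c, h, w = lidar_bev.shape
        q = lidar_bev.flatten(2).transpose(1, 2)   # (B, H*W, C) queries from LiDAR
        kv = cam_bev.flatten(2).transpose(1, 2)    # (B, H*W, C) keys/values from camera
        fused, _ = self.attn(q, kv, kv)            # LiDAR queries attend to camera features
        fused = self.norm(q + fused)               # residual connection + layer norm
        return fused.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    fusion = CrossModalFusion(dim=256)
    lidar = torch.randn(1, 256, 128, 128)
    cam = torch.randn(1, 256, 128, 128)
    print(fusion(lidar, cam).shape)  # torch.Size([1, 256, 128, 128])
```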
Depth estimation from 2D images is widely regarded as an important technology for environmental perception in autonomous driving, but it still suffers from many issues. From an application perspective, this work proposes an unsupervised monocular depth estimation method based on a hybrid Vision Transformer (ViT) to improve accuracy and reduce cost. Specifically, convolution and transformer layers are combined in the encoder to extract fine-grained features, and the decoder fuses multi-scale features to generate multi-scale disparity maps. The loss is then computed on multi-scale, full-resolution disparity maps, with stereo constraints used for image reconstruction. Experiments on the KITTI dataset show that, compared with previous works, this method improves on both the error and accuracy metrics: accuracy is 3.4% higher than the baseline, and the depth maps exhibit clearer boundaries, fewer artifacts, and higher overall quality. This demonstrates that the combination of hybrid encoding, a multi-scale decoder, and a full-resolution loss brings a significant gain in depth estimation, with the hybrid encoding contributing the most.
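As a rough illustration of the full-resolution loss idea, the sketch below upsamples each predicted disparity map to the input resolution and computes an L1 photometric reconstruction loss against the left image by warping the right image with the disparity. The helper names and the plain L1 term are assumptions; the paper's exact loss (e.g., SSIM or left-right consistency terms) is not given in the abstract.

```python
# Minimal sketch: upsample multi-scale disparities to full resolution,
# warp the right stereo image, and penalize the L1 reconstruction error.
# The plain L1 term is an illustrative assumption, not the paper's exact loss.
import torch
import torch.nn.functional as F


def warp_right_to_left(right: torch.Tensor, disp: torch.Tensor) -> torch.Tensor:
    """Reconstruct the left image by sampling the right image with disparity (in pixels)."""
    b, _, h, w = right.shape
    # Normalized sampling grid, shifted left by the disparity.
    xs = torch.linspace(-1, 1, w, device=right.device).view(1, 1, w).expand(b, h, w)
    ys = torch.linspace(-1, 1, h, device=right.device).view(1, h, 1).expand(b, h, w)
    grid = torch.stack((xs - 2 * disp.squeeze(1) / (w - 1), ys), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(right, grid, align_corners=True)


def full_resolution_loss(left, right, multi_scale_disps):
    """Upsample every predicted disparity to the input size before computing the loss."""
    loss = 0.0
    for disp in multi_scale_disps:                     # coarse-to-fine predictions
        disp_full = F.interpolate(disp, size=left.shape[-2:],
                                  mode="bilinear", align_corners=True)
        recon = warp_right_to_left(right, disp_full)   # stereo reconstruction
        loss = loss + (recon - left).abs().mean()      # L1 photometric error
    return loss / len(multi_scale_disps)


if __name__ == "__main__":
    left, right = torch.randn(1, 3, 192, 640), torch.randn(1, 3, 192, 640)
    disps = [torch.rand(1, 1, 192 // 2**i, 640 // 2**i) for i in range(4)]
    print(full_resolution_loss(left, right, disps).item())
```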