Fusion of multimodal data can improve the performance of machine learning models. One of the most common fusion approaches in deep learning is end-to-end training of a neural network on all available modalities. However, training such a network requires paired data from all modalities. Collecting paired multimodal data can be challenging and expensive due to the need for specialized equipment, atmospheric conditions, the limited ability of individual modalities to probe a scene, the integration of data from modalities with different spatial and spectral resolutions, and the difficulty of obtaining ground-truth annotations. To address this issue, this work presents a two-phase multi-stream fusion approach. First, we train the unimodal streams in parallel, each with its own decision layer, loss, and hyper-parameters. Then, we discard the individual decision layers, concatenate the last feature maps of all unimodal streams, and jointly train a common multimodal decision layer. We evaluate the proposed approach on the NTIRE-21 dataset. Our experiments show that, in multiple cases, the proposed method outperforms the alternatives.
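A minimal PyTorch sketch of the two-phase procedure described above is given below. The stream architecture, modality names, feature sizes, and dummy data are illustrative placeholders and not the actual NTIRE-21 setup; whether the unimodal backbones are frozen or fine-tuned in the second phase is not specified in the abstract, so this sketch freezes them.

```python
# Illustrative sketch of two-phase multi-stream fusion (not the authors' exact code).
import torch
import torch.nn as nn

class UnimodalStream(nn.Module):
    """One modality-specific backbone plus its own decision layer (phase 1)."""
    def __init__(self, in_channels, feat_dim=64, num_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.decision = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        feats = self.backbone(x)           # last feature representation of the stream
        return self.decision(feats), feats

# Placeholder modalities ("rgb", "sar") and dummy batches stand in for real data.
streams = {"rgb": UnimodalStream(in_channels=3), "sar": UnimodalStream(in_channels=1)}
criterion = nn.CrossEntropyLoss()

# Phase 1: train each unimodal stream independently, with its own decision
# layer, loss, and hyper-parameters.
for name, stream in streams.items():
    opt = torch.optim.Adam(stream.parameters(), lr=1e-3)
    x = torch.randn(8, 3 if name == "rgb" else 1, 32, 32)
    y = torch.randint(0, 10, (8,))
    logits, _ = stream(x)
    loss = criterion(logits, y)
    opt.zero_grad(); loss.backward(); opt.step()

# Phase 2: discard the unimodal decision layers, concatenate the last feature
# maps of all streams, and train a shared multimodal decision layer.
fusion_head = nn.Linear(64 * len(streams), 10)
opt = torch.optim.Adam(fusion_head.parameters(), lr=1e-3)
x_rgb, x_sar = torch.randn(8, 3, 32, 32), torch.randn(8, 1, 32, 32)
y = torch.randint(0, 10, (8,))
with torch.no_grad():                      # backbones kept frozen in this sketch
    feats = torch.cat([streams["rgb"](x_rgb)[1], streams["sar"](x_sar)[1]], dim=1)
logits = fusion_head(feats)
loss = criterion(logits, y)
opt.zero_grad(); loss.backward(); opt.step()
```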