Open Access Paper
Multimodal depression detection using a deep feature fusion network
Guangyao Sun, Shenghui Zhao, Bochao Zou, Yubo An
Proceedings Volume 12506, Third International Conference on Computer Science and Communication Technology (ICCSCT 2022); 125066A (28 December 2022). https://doi.org/10.1117/12.2662620
Event: International Conference on Computer Science and Communication Technology (ICCSCT 2022), 2022, Beijing, China
Abstract
Currently, with rising social pressure, more and more people suffer from depression, which has become one of the most severe health issues worldwide. Timely diagnosis of depression is therefore very important. In this paper, a deep feature fusion network is proposed for multimodal depression detection. First, an unsupervised autoencoder based on the Transformer is applied to derive sentence-level embeddings from the frame-level audiovisual features; then a deep feature fusion network based on cross-modal Transformers is proposed to fuse the text, audio, and video features. The experimental results show that the proposed method achieves superior performance compared to state-of-the-art methods on the English DAIC-WOZ database.

1. INTRODUCTION

Depression, also known as depressive disorder, is a major type of mood disorder. Currently, the main depression detection methods rely on specific questionnaires, such as the clinician-administered Hamilton Rating Scale for Depression (HAMD) [1]. These methods are time-consuming and labor-intensive. With the rise of deep learning, many studies use deep neural networks to process multimodal features for depression assessment. Yang et al. [2] fed the features of text, audio, and video into separate DCNN networks to obtain regression results; the outputs of the three networks were then passed into a DNN for decision fusion. Al Hanai et al. [3] deployed an LSTM model to capture the temporal relationship between audio features and text features for depression detection. Lin et al. [4] used an LSTM and a CNN to process text features and audio features, respectively, and combined the outputs of the two models for late fusion.

In the field of automatic depression detection, current research has certain limitations. First, the features extracted from audio and video are frame-level, and existing methods encode frame-level features into sentence-level representations with statistical functions, which loses the temporal relationship between frames. Second, most current network structures for multimodal fusion use decision fusion or late fusion, which have limited ability to capture the complementarity of features across modalities. In this work, we try to overcome these drawbacks. Our main contributions can be summarized as follows.

  • (1) An unsupervised autoencoder network is proposed to obtain sentence-level vectors from frame-level audio and video features for depression detection.

  • (2) We propose a cross-modal-transformer-based deep fusion network for multimodal features to detect depression.

2. METHOD

2.1 Model overview

The overall structure of our proposed method is shown in Figure 1.

Figure 1. An overview of the proposed model.

First, the feature representation module is used to obtain sentence-level vectors for audio, text, and video. Then a feature fusion module based on cross-modal Transformers is adopted to combine the three modalities. Next, a Bi-LSTM with self-attention is used to capture the temporal relationship within each modality. Finally, the outputs of the self-attention module are fed into the low-rank fusion module for further late fusion to obtain the final regression result.

2.2 Feature extraction

The dataset consists of three modalities: text, audio, and video. For the text data, we use pre-trained 1024-dimensional ELMo [5] sentence embeddings to encode the transcribed responses of subjects to each question. For the audio data, 39-dimensional frame-level MFCC features are extracted for each audio segment. For the video data, 20-dimensional frame-level AU (facial action unit) features are extracted for each video segment.
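As an illustration of the audio front end, a possible MFCC extraction pipeline is sketched below. The paper does not name the toolkit or the frame settings, so the use of librosa, the split into 13 static coefficients plus deltas and delta-deltas, and the sampling rate are all assumptions.

```python
# Hypothetical sketch of the 39-dimensional frame-level MFCC extraction
# (13 static coefficients + deltas + delta-deltas). The toolkit (librosa)
# and all parameters are assumptions; the paper does not specify them.
import librosa
import numpy as np

def extract_mfcc(wav_path: str, sr: int = 16000, n_mfcc: int = 13) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=sr)                    # mono waveform
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (13, n_frames)
    delta1 = librosa.feature.delta(mfcc)                     # first-order deltas
    delta2 = librosa.feature.delta(mfcc, order=2)            # second-order deltas
    feats = np.concatenate([mfcc, delta1, delta2], axis=0)   # (39, n_frames)
    return feats.T                                           # (n_frames, 39)
```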

2.3 Unsupervised autoencoder

The extracted audio and video features are both frame-level, so a temporal aggregation step is needed to compress them into sentence-level vectors before feature fusion. Most previous methods encode frame-level features into sentence-level representations with statistical functions, which may lose the temporal information between frames. To solve this problem, we propose an unsupervised autoencoder based on the Transformer [6]. The overall structure of the autoencoder is shown in Figure 2.

Figure 2. The structure of the autoencoder.

The frame-level features are fed into the frame-sentence encoder to obtain a sentence-level vector for the multi-frame features. The frame-sentence encoder consists of three layers of Transformer encoding units. The Transformer encoding unit is illustrated in Figure 3.

Figure 3. The structure of the Transformer encoding unit.

The Positional Encoding module is used to add positional information to the frame-level features X. The positional embedding is computed by equations (1) and (2).

$$PE(pos,\,2i) = \sin\!\left(\frac{pos}{10000^{2i/F}}\right) \tag{1}$$
$$PE(pos,\,2i+1) = \cos\!\left(\frac{pos}{10000^{2i/F}}\right) \tag{2}$$

Here pos is the frame index, F is the dimension of the frame-level feature, i indexes the feature dimensions, and 2i denotes the even dimensions while 2i + 1 denotes the odd dimensions.
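A minimal sketch of equations (1) and (2) is given below, assuming a PyTorch implementation (the framework is not stated in the paper).

```python
# Sinusoidal positional encoding from equations (1)-(2); PyTorch is an
# assumption, the formula itself follows the text above.
import math
import torch

def sinusoidal_positional_encoding(max_len: int, feat_dim: int) -> torch.Tensor:
    """Return a (max_len, feat_dim) table: PE[pos, 2i] = sin(pos / 10000^(2i/F)),
    PE[pos, 2i+1] = cos(pos / 10000^(2i/F))."""
    pe = torch.zeros(max_len, feat_dim)
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)       # (max_len, 1)
    div = torch.exp(torch.arange(0, feat_dim, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / feat_dim))                  # 10000^(-2i/F)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)[:, : feat_dim // 2]              # handles odd F
    return pe
```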

The frame-level features X are then added to the positional embedding PE to generate the input vector I. The Multi-Head Attention module is then used to model the temporal relationships between frames. The relationship between the attention output A and the input vector I can be expressed by

$$Q = I\, W^{Q} \tag{3}$$
$$K = I\, W^{K} \tag{4}$$
$$V = I\, W^{V} \tag{5}$$
$$A = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \tag{6}$$

where W^Q, W^K, and W^V denote linear projection matrices; Q, K, and V denote the query, key, and value vectors, respectively; and d_k is the scaling (normalization) coefficient.
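Equations (3)-(6) are the standard scaled dot-product attention; a single-head sketch in PyTorch is shown below (multi-head attention simply repeats this with several projection matrices and concatenates the results).

```python
# Single-head rendering of equations (3)-(6): Q = I W^Q, K = I W^K, V = I W^V,
# A = softmax(Q K^T / sqrt(d_k)) V. Shapes: I is (n_frames, d_model).
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(I: torch.Tensor, W_q: torch.Tensor,
                                 W_k: torch.Tensor, W_v: torch.Tensor) -> torch.Tensor:
    Q, K, V = I @ W_q, I @ W_k, I @ W_v               # linear projections, eqs. (3)-(5)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5     # pairwise frame affinities
    return F.softmax(scores, dim=-1) @ V              # eq. (6): weighted sum of values
```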

Vector A then passes through residual connections and a feed-forward network to obtain the output vector O, which fuses the temporal and attention relationships among the multi-frame features. The vector O at the last time step is taken as the output of the frame-sentence encoder. After self-filling, the vector X′ is passed through the sentence-frame decoder to reconstruct the frame-level features. The decoder is also composed of three layers of Transformer encoding units. The output vector Y of the decoder has the same dimension as the frame-level features X. The loss function of the unsupervised autoencoder is the mean squared error between X and Y, which is used to update the weights of the autoencoder. When the model has converged, the output of the encoder is stored as the sentence-level vector representation of the frame-level features.
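A minimal sketch of the frame-to-sentence autoencoder and its MSE training objective is given below. It assumes PyTorch; the head count, learning rate, and the interpretation of "self-filling" as repeating the sentence vector along the time axis are our assumptions, while the three encoder/decoder layers and the MSE loss follow the text.

```python
# Sketch of the unsupervised frame-to-sentence autoencoder (assumed PyTorch
# implementation). Three Transformer layers per side and the MSE loss follow
# the paper; head count, learning rate and "self-filling" by repetition are
# assumptions.
import torch
import torch.nn as nn

class FrameSentenceAutoencoder(nn.Module):
    def __init__(self, feat_dim: int, n_heads: int = 1, n_layers: int = 3):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(feat_dim, n_heads, batch_first=True),
            num_layers=n_layers)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(feat_dim, n_heads, batch_first=True),
            num_layers=n_layers)

    def forward(self, x):                     # x: (batch, n_frames, feat_dim),
        t = x.size(1)                         # positional encoding already added
        o = self.encoder(x)                   # temporal self-attention over frames
        sent = o[:, -1]                       # last time step = sentence-level vector
        x_prime = sent.unsqueeze(1).repeat(1, t, 1)   # "self-filling" to frame length
        y = self.decoder(x_prime)             # reconstructed frame-level features
        return sent, y

model = FrameSentenceAutoencoder(feat_dim=39)            # e.g. MFCC features
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
x = torch.randn(8, 200, 39)                              # dummy batch of segments
sent, y = model(x)
loss = nn.functional.mse_loss(y, x)                      # reconstruct X from Y
optimizer.zero_grad(); loss.backward(); optimizer.step()
```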

So far, we have obtained three sentence-level vectors X_T, X_A, and X_V for the text, audio, and video modalities, respectively. Before deep fusion in the feature fusion module, three one-dimensional convolutional layers are used to compress the dimensions of the three sentence-level vectors. Then X_T ∈ R^{S×30}, X_A ∈ R^{S×30}, and X_V ∈ R^{S×30} are fed into the feature fusion module for deep fusion, where S is the number of questions per subject and 30 is the dimension of the compressed features.
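The dimension-compression step might look like the following sketch; the kernel size of 1 and the batch layout are assumptions, while the input dimensions (1024/39/20) and the 30-dimensional output follow the text.

```python
# Per-modality 1-D convolutions that map sentence-level vectors to a shared
# 30-dimensional space. Kernel size 1 is an assumption.
import torch
import torch.nn as nn

S = 12                                              # questions per subject (illustrative)
conv_t = nn.Conv1d(1024, 30, kernel_size=1)         # text (ELMo embeddings)
conv_a = nn.Conv1d(39, 30, kernel_size=1)           # audio (MFCC sentence vectors)
conv_v = nn.Conv1d(20, 30, kernel_size=1)           # video (AU sentence vectors)

x_t = torch.randn(1, S, 1024)                       # (batch, S, feat)
X_T = conv_t(x_t.transpose(1, 2)).transpose(1, 2)   # Conv1d wants (batch, feat, S) -> (1, S, 30)
```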

2.4 Feature fusion

Six cross-modal transformers [7] are adopted to achieve a deep fusion of the three modalities. We take the A→T (audio-to-text) cross-modal transformer as an example to introduce its application in the feature fusion module. The structure of the A→T cross-modal transformer is shown in Figure 4.

Figure 4. The structure of the cross-modal transformer.

The relationship between the output of the i-th A→T cross-modal transformer layer, Z_AT[i], and the input vectors X_T and X_A can be represented by

$$\bar{X}_T = X_T + PE(X_T) \tag{7}$$
$$\bar{X}_A = X_A + PE(X_A) \tag{8}$$
$$Z_{AT}[0] = \bar{X}_T \tag{9}$$
$$Q_T = \mathrm{LayerNorm}\big(Z_{AT}[i-1]\big)\, W_{Q_T} \tag{10}$$
$$K_A = \mathrm{LayerNorm}\big(\bar{X}_A\big)\, W_{K_A} \tag{11}$$
$$V_A = \mathrm{LayerNorm}\big(\bar{X}_A\big)\, W_{V_A} \tag{12}$$
$$\hat{Z}_{AT}[i] = \mathrm{softmax}\!\left(\frac{Q_T K_A^{\top}}{\sqrt{d_k}}\right) V_A + Z_{AT}[i-1] \tag{13}$$
$$Z_{AT}[i] = \mathrm{FFN}\Big(\mathrm{LayerNorm}\big(\hat{Z}_{AT}[i]\big)\Big) + \hat{Z}_{AT}[i] \tag{14}$$

where the positional encoding adds positional information to X_T and X_A according to equations (1) and (2), LayerNorm is the layer normalization operation, and Z_AT[i-1] is the output of the (i-1)-th cross-modal transformer layer. W_QT, W_KA, and W_VA denote linear projection matrices; Q_T denotes the query vector of the text modality, and K_A and V_A denote the key and value vectors of the audio modality; d_k is the scaling coefficient. FFN is the feed-forward network, which is composed of linear layers.
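A compact PyTorch sketch of one A→T layer is given below (single-head attention for brevity; the 30-dimensional features follow Section 2.3, while the feed-forward width is an assumption). Six such stacks, one per ordered modality pair, feed the concatenation described next.

```python
# Sketch of one A->T cross-modal Transformer layer: the text stream supplies
# the queries, the audio stream supplies keys and values (single-head shown
# for brevity; the feed-forward width is an assumption).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalLayer(nn.Module):
    def __init__(self, dim: int = 30, ff_dim: int = 60):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.norm_ff = nn.LayerNorm(dim)
        self.w_q = nn.Linear(dim, dim, bias=False)       # W_QT
        self.w_k = nn.Linear(dim, dim, bias=False)       # W_KA
        self.w_v = nn.Linear(dim, dim, bias=False)       # W_VA
        self.ffn = nn.Sequential(nn.Linear(dim, ff_dim), nn.ReLU(),
                                 nn.Linear(ff_dim, dim))

    def forward(self, z_prev, x_a):                      # z_prev: text stream, x_a: audio
        q = self.w_q(self.norm_q(z_prev))                # Q_T
        k = self.w_k(self.norm_kv(x_a))                  # K_A
        v = self.w_v(self.norm_kv(x_a))                  # V_A
        attn = F.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
        z = attn @ v + z_prev                            # cross-modal attention + residual
        return self.ffn(self.norm_ff(z)) + z             # position-wise FFN + residual
```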

The outputs of the cross-modal transformers are concatenated to generate Z_T ∈ R^{S×60}, Z_A ∈ R^{S×60}, and Z_V ∈ R^{S×60}. Z_T, Z_A, and Z_V deeply fuse the information of the other two modalities and are then fed into the self-attention module to capture the temporal information of each modality.

2.5 Self-attention

A Bi-LSTM with an attention mechanism is utilized to capture the temporal relationships within Z_T, Z_A, and Z_V. Taking Z_T as an example, a Bi-LSTM with a hidden size of 30 is used to capture the temporal relationship between the time steps of Z_T. Considering that the features at different time steps contribute differently to the final assessment, an attention mechanism is introduced to weight the Bi-LSTM outputs at different time steps. The relationship between the output of the self-attention module, S_T, and the Bi-LSTM output Z°_T can be represented by

$$Z^{\circ}_T = \mathrm{BiLSTM}(Z_T) \tag{15}$$
$$e_t = w^{\top} Z^{\circ}_{T,t} \tag{16}$$
$$\alpha_t = \frac{\exp(e_t)}{\sum_{j=1}^{T}\exp(e_j)} \tag{17}$$
$$S_T = \sum_{t=1}^{T} \alpha_t\, Z^{\circ}_{T,t} \tag{18}$$

where w is the learnable weighting vector, α_t is the attention weight of the t-th time step, and T is the number of questions.

S_T is then passed through a ReLU layer and a linear layer to obtain the output O_T of the self-attention module. Finally, the three temporal vectors O_T, O_A, and O_V are passed through the late fusion module for further fusion to obtain the final assessment result.
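A minimal sketch of this block for one modality is shown below; the hidden size of 30 and the ReLU-plus-linear head follow the text, while the 30-dimensional output is an assumption.

```python
# Sketch of the Bi-LSTM + attention block for one modality (Z_T shown).
# Hidden size 30 and the ReLU + linear head follow the text; the 30-d
# output dimension is an assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiLSTMAttention(nn.Module):
    def __init__(self, in_dim: int = 60, hidden: int = 30, out_dim: int = 30):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)
        self.w = nn.Linear(2 * hidden, 1, bias=False)    # attention scoring vector w
        self.proj = nn.Linear(2 * hidden, out_dim)

    def forward(self, z):                        # z: (batch, T questions, 60)
        h, _ = self.lstm(z)                      # Z°_T: (batch, T, 2*hidden)
        alpha = F.softmax(self.w(h), dim=1)      # attention weights over time steps
        s = (alpha * h).sum(dim=1)               # weighted sum -> S_T
        return self.proj(F.relu(s))              # ReLU + linear -> O_T
```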

2.6 Late fusion

Low-rank fusion [8] is used to further fuse O_T, O_A, and O_V. The relationship between the final output H and O_T, O_V, and O_A can be represented by

$$H = \bigwedge_{m=1}^{M}\left[\sum_{i=1}^{r} w_m^{(i)} \cdot z_m\right] \tag{19}$$

where z_l, z_a, and z_v are the vectors O_T, O_A, and O_V each padded with a constant 1, M = 3 is the number of modalities, ∧ denotes the element-wise product over a sequence of tensors, and the w_m^{(i)} are the modality-specific factors of the weight tensor W. The rank r is set to 17.
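A minimal sketch of this low-rank fusion step, following Liu et al. [8], is shown below; the output dimension and initialization are assumptions, while the rank r = 17 and the padding with 1 follow the text.

```python
# Sketch of low-rank multimodal fusion (after Liu et al. [8]): each modality
# vector is padded with a constant 1, projected by r rank-specific factors,
# and the per-modality sums are combined by element-wise products.
import torch
import torch.nn as nn

class LowRankFusion(nn.Module):
    def __init__(self, in_dims=(30, 30, 30), out_dim: int = 1, rank: int = 17):
        super().__init__()
        # one factor tensor per modality: (rank, in_dim + 1, out_dim)
        self.factors = nn.ParameterList(
            [nn.Parameter(torch.randn(rank, d + 1, out_dim) * 0.1) for d in in_dims])

    def forward(self, *modalities):              # O_T, O_A, O_V: each (batch, in_dim)
        fused = None
        for z, w in zip(modalities, self.factors):
            ones = torch.ones(z.size(0), 1, device=z.device)
            z1 = torch.cat([z, ones], dim=-1)                 # pad with constant 1
            proj = torch.einsum('bd,rdo->bo', z1, w)          # sum over the r rank factors
            fused = proj if fused is None else fused * proj   # element-wise product
        return fused                                          # final output H

fusion = LowRankFusion()
H = fusion(torch.randn(4, 30), torch.randn(4, 30), torch.randn(4, 30))   # (4, 1)
```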

3. EXPERIMENTAL RESULTS AND ANALYSIS

In the experiments, the evaluation metrics for the classification task are Precision, Recall, and F1; for the regression task, they are MAE and RMSE. Our experiments are conducted on the DAIC-WOZ [9] database, which provides a training set (30 depressed, 77 non-depressed) for training the network, a validation set (12 depressed, 23 non-depressed) to verify performance during training, and a testing set (14 depressed, 33 non-depressed) for evaluating the final network. Three baseline models [2-4], introduced in the Introduction, are compared to our proposed network DDFN. Experimental results are shown in Table 1.

Table 1. Experimental results on DAIC-WOZ.

Methods            MAE    RMSE   Precision   Recall   F1
DCNN+DNN [2]       5.16   5.97   /           /        /
LSTM [3]           5.13   6.50   0.71        0.83     0.77
LSTM+CNN [4]       3.75   5.44   0.79        0.92     0.85
DDFN (proposed)    3.78   5.35   0.91        0.88     0.89

The proposed deep feature fusion network achieves superior results on both the regression and classification tasks. Compared to current state-of-the-art methods, the proposed DDFN obtains the lowest RMSE of 5.35, the highest Precision of 0.91, and the highest F1 of 0.89. Overall, DDFN deeply fuses the features of the three modalities for depression assessment; compared with decision fusion and simple feature fusion networks, it achieves better evaluation results.
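For reference, the metrics reported above could be computed as in the sketch below, which assumes that the regression outputs are PHQ-8 scores and that the binary depressed/non-depressed label is obtained with the conventional DAIC-WOZ cutoff of 10; the thresholding step is our assumption and is not stated in the paper.

```python
# Sketch of the evaluation metrics. Assumes regression outputs are PHQ-8
# scores and that the binary label uses the conventional DAIC-WOZ cutoff
# of 10 (an assumption, not stated in the paper).
import numpy as np
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             mean_absolute_error, mean_squared_error)

def evaluate(y_true_scores, y_pred_scores, threshold: int = 10) -> dict:
    mae = mean_absolute_error(y_true_scores, y_pred_scores)
    rmse = float(np.sqrt(mean_squared_error(y_true_scores, y_pred_scores)))
    y_true = (np.asarray(y_true_scores) >= threshold).astype(int)
    y_pred = (np.asarray(y_pred_scores) >= threshold).astype(int)
    return {"MAE": mae, "RMSE": rmse,
            "Precision": precision_score(y_true, y_pred),
            "Recall": recall_score(y_true, y_pred),
            "F1": f1_score(y_true, y_pred)}
```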

4. CONCLUSIONS

In this paper, we propose an unsupervised autoencoder to encode frame-level features into sentence-level representations, and a deep feature fusion network to further fuse the features of three modalities for depression detection. The experimental results show that our network DDFN achieves improved performance compared to state-of-the-art methods. Considering the limited number of samples in depression databases, we will investigate multimodal few-shot learning for depression detection in future work.

ACKNOWLEDGMENTS

This research is supported in part by National Natural Science Foundation of China (U19B2032) and Guangdong Basic and Applied Basic Research Foundation (2021A1515110249).

REFERENCES

[1] Hamilton, M., "Development of a rating scale for primary depressive illness," Br. J. Soc. Clin. Psychol., 6(4), 278-296 (1967). https://doi.org/10.1111/bjc.1967.6.issue-4

[2] Yang, L., Jiang, D., Xia, X., Pei, E., Oveneke, M. C. and Sahli, H., "Multimodal measurement of depression using deep learning models," Proc. AVEC 2017, 53-59 (2017). https://doi.org/10.1145/3133944

[3] Al Hanai, T., Ghassemi, M. M. and Glass, J. R., "Detecting depression with audio/text sequence modeling of interviews," Proc. Interspeech 2018, 1716-1720 (2018).

[4] Lin, L., Chen, X., Shen, Y. and Zhang, L., "Towards automatic depression detection: A BiLSTM/1D CNN-based model," Applied Sciences, 10(23), 8701 (2020). https://doi.org/10.3390/app10238701

[5] Peters, M. E., Neumann, M., Iyyer, M., Gardner, M. and Clark, C., "Deep contextualized word representations," Proc. NAACL 2018, 2227-2237 (2018).

[6] Vaswani, A., Shazeer, N., et al., "Attention is all you need," arXiv:1706.03762 (2017).

[7] Tsai, Y. H. H., Bai, S., Liang, P. P., Kolter, J. Z., Morency, L. P. and Salakhutdinov, R., "Multimodal transformer for unaligned multimodal language sequences," Proc. ACL 2019, 6558-6569 (2019).

[8] Liu, Z., Shen, Y., Lakshminarasimhan, V. B., Liang, P. P., Zadeh, A. and Morency, L. P., "Efficient low-rank multimodal fusion with modality-specific factors," arXiv:1806.00064 (2018).

[9] Gratch, J., Artstein, R., Lucas, G. M., Stratou, G., Scherer, S., Nazarian, A., Wood, R., Boberg, J., DeVault, D., Marsella, S., et al., "The distress analysis interview corpus of human and computer interviews," Proc. LREC 2014, 3123-3128 (2014).