1. INTRODUCTION

Depression, also known as depressive disorder, is a major type of mood disorder. Currently, the main depression detection methods rely on specific questionnaires, such as the clinician-administered Hamilton Rating Scale for Depression (HAMD) [1]. These methods are time-consuming and labor-intensive. With the rise of deep learning, many studies use deep neural networks to process multimodal features for depression assessment. Yang et al. [2] fed text, audio, and video features into separate DCNNs to obtain one regression result per modality; the outputs of the three networks were then passed into a DNN for decision fusion. Al Hanai et al. [3] deployed an LSTM model to capture the temporal relationship between audio features and text features for depression detection. Lin et al. [4] used an LSTM and a CNN to process text features and audio features, respectively, and then combined the outputs of the two models by late fusion.

Current research on automatic depression detection has two main limitations. First, the features extracted from audio and video are frame-level, and existing methods encode them into sentence-level features with statistical functions, which discards the temporal relationship between frames. Second, most current network structures for multimodal fusion use decision fusion or late fusion, which have limited ability to capture the complementarity of features across modalities.

In this work, we try to overcome the aforementioned drawbacks. Our main contributions can be summarized as follows.

2. METHOD

2.1 Model overview

The overall structure of our proposed method is shown in Figure 1. First, the feature representation module is used to obtain sentence-level vectors of audio, text, and video. Then the feature fusion module, based on the cross-modal Transformer, is adopted to combine the three modalities.
Next, a Bi-LSTM with self-attention is used to capture the temporal relationship within each single modality. Finally, the outputs of the self-attention module are fed into the low-rank fusion module for further late fusion to obtain the final regression result.

2.2 Feature extraction

The dataset consists of three modalities, namely text, video, and audio. For the text data, we use a pre-trained 1024-dimensional ELMo [5] sentence embedding to encode the transcribed response of the subject to each question. For the audio data, 39-dimensional frame-level MFCC features are extracted for each audio segment. For the video data, 20-dimensional frame-level AU features are extracted for each video segment.

2.3 Unsupervised autoencoder

The extracted video and audio features are both frame-level. Temporal aggregation is needed to compress them into sentence-level vectors before feature fusion. Most previous methods encode the frame-level features into sentence-level features with statistical functions, which may lose the temporal information between frames. To solve this problem, we propose an unsupervised autoencoder based on the Transformer [6]. The overall structure of the autoencoder is shown in Figure 2.

The frame-level features are fed into the frame-sentence encoder to obtain the sentence-level vector of the multi-frame features. The frame-sentence encoder consists of three layers of Transformer encoding units. The Transformer encoding unit is illustrated in Figure 3. The Positional Encoding module is used to add positional information to the frame-level features $X$. The positional embedding is calculated by equations (1) and (2):

$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/F}}\right) \tag{1}$$

$$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/F}}\right) \tag{2}$$

Here $pos$ is the frame number, $F$ is the dimension of the frame-level feature, and $i$ is the index of the features, where $2i$ represents the even index and $2i + 1$ represents the odd index. The frame-level features $X$ are then added to the positional embedding $PE$ to generate the input vector $I$.
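As a minimal sketch of the positional embedding described above, the following NumPy code builds the sinusoidal embedding of the standard Transformer [6] and adds it to a matrix of frame-level features. The function name `positional_encoding`, the number of frames (50), and the random features are illustrative assumptions and are not taken from the paper; only the 39-dimensional MFCC size comes from Section 2.2.

```python
import numpy as np

def positional_encoding(num_frames: int, feat_dim: int) -> np.ndarray:
    """Sinusoidal positional embedding of shape (num_frames, feat_dim)."""
    pos = np.arange(num_frames)[:, None]           # frame number, shape (N, 1)
    i = np.arange((feat_dim + 1) // 2)[None, :]    # pair index i, shape (1, ceil(F/2))
    angle = pos / np.power(10000.0, 2 * i / feat_dim)
    pe = np.zeros((num_frames, feat_dim))
    pe[:, 0::2] = np.sin(angle)                    # even feature indices 2i
    # Odd feature indices 2i + 1; the slice handles an odd feat_dim such as 39.
    pe[:, 1::2] = np.cos(angle[:, : feat_dim // 2])
    return pe

# Hypothetical input: 50 frames of 39-dimensional MFCC features (random values).
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 39))
I = X + positional_encoding(50, 39)                # input vector to the encoder
```

Because the embedding depends only on the frame number and the feature index, it can be computed once per sequence length and reused for every segment of that length.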
Then the Multi-Head Attention module is used to calculate the temporal relationship between frames. The relationship between the vector $A$ and the input vector $I$ can be expressed by

$$Q = I W_Q, \quad K = I W_K, \quad V = I W_V$$

$$A = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$

where $W_Q$, $W_K$, $W_V$ denote linear projection matrices, $Q$, $K$, $V$ denote the query vector, key vector, and value vector, respectively, and $d_k$ is the normalization coefficient. The vector $A$ then passes through residual connections and a feed-forward neural network to obtain the output vector $O$, which encodes the temporal and attention relationships among the multi-frame features. The vector $O$ of the last time step is the output of the frame-sentence encoder.

After self-filling, a vector $X'$ is passed through the sentence-frame decoder to reconstruct the frame-level features. The decoder is also composed of three layers of Transformer encoding units. The output vector $Y$ of the decoder has the same dimension as the frame-level features $X$. The loss function of the unsupervised autoencoder is the mean square error between $X$ and $Y$, which is used to update the weights of the autoencoder. Once the model has converged, the output of the encoder is stored as the sentence-level vector representation of the frame-level features.

So far, we have obtained three sentence-level vectors $X_t$, $X_a$, $X_v$ for the text, audio, and video modalities, respectively. Before deep fusion in the feature fusion module, three one-dimensional convolutional layers are used to compress the dimensions of the three sentence-level vectors. Then $X_T \in \mathbb{R}^{S \times 30}$, $X_A \in \mathbb{R}^{S \times 30}$, and $X_V \in \mathbb{R}^{S \times 30}$ are fed into the feature fusion module, where $S$ is the number of questions for each subject and 30 is the dimension of the compressed features.

2.4 Feature fusion

Six cross-modal transformers [7] are adopted to achieve a deep fusion of the three modalities. We take the A → T cross-modal transformer as an example to introduce how the cross-modal transformer is applied in the feature fusion module. The structure of the A → T cross-modal transformer is shown in Figure 4.
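As a rough illustration of the core operation in a cross-modal block of the kind introduced in [7], the sketch below computes single-head A → T attention, in which the queries come from the text features and the keys and values come from the audio features. It is a simplified sketch under stated assumptions, not the network used in the paper: layer normalization, residual connections, the feed-forward network, and multiple heads are omitted, and the function names, random weights, and number of questions (12) are hypothetical.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(x_t, x_a, w_q, w_k, w_v):
    """Single-head A -> T attention: text queries attend to audio keys/values."""
    q_t = x_t @ w_q                                  # queries from text, (S, d_k)
    k_a = x_a @ w_k                                  # keys from audio, (S, d_k)
    v_a = x_a @ w_v                                  # values from audio, (S, d_v)
    d_k = q_t.shape[-1]
    weights = softmax(q_t @ k_a.T / np.sqrt(d_k))    # (S, S), each row sums to 1
    return weights @ v_a, weights

# Hypothetical inputs: S = 12 questions, 30-dimensional compressed features.
rng = np.random.default_rng(0)
S, d = 12, 30
x_t = rng.standard_normal((S, d))
x_a = rng.standard_normal((S, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
z_a_to_t, attn = cross_modal_attention(x_t, x_a, w_q, w_k, w_v)
```

Each row of `attn` describes how strongly one question in the text sequence attends to every question in the audio sequence, which is how information from the audio modality is injected into the text representation.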
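The relationship between the output of the $i$-th A → T cross-modal transformer $Z_{A \to T}^{[i]}$ and the input vectors $X_T$, $X_A$ can be represented by

$$Z_{A \to T}^{[0]} = \mathrm{PositionalEncoding}(X_T), \quad Z_{A}^{[0]} = \mathrm{PositionalEncoding}(X_A)$$

$$Q_T = \mathrm{LayerNorm}\left(Z_{A \to T}^{[i-1]}\right) W_{Q_T}, \quad K_A = \mathrm{LayerNorm}\left(Z_{A}^{[0]}\right) W_{K_A}, \quad V_A = \mathrm{LayerNorm}\left(Z_{A}^{[0]}\right) W_{V_A}$$

$$\hat{Z}_{A \to T}^{[i]} = \mathrm{softmax}\left(\frac{Q_T K_A^{\top}}{\sqrt{d_k}}\right) V_A + \mathrm{LayerNorm}\left(Z_{A \to T}^{[i-1]}\right)$$

$$Z_{A \to T}^{[i]} = \mathrm{FFN}\left(\mathrm{LayerNorm}\left(\hat{Z}_{A \to T}^{[i]}\right)\right) + \mathrm{LayerNorm}\left(\hat{Z}_{A \to T}^{[i]}\right)$$

where PositionalEncoding adds positional information to $X_T$ and $X_A$ according to equations (1) and (2), LayerNorm is the layer normalization operation, $Z_{A \to T}^{[i-1]}$ is the output of the $(i-1)$-th cross-modal transformer layer, and $\hat{Z}_{A \to T}^{[i]}$ is the intermediate output of the attention step. $W_{Q_T}$, $W_{K_A}$, $W_{V_A}$ denote linear projection matrices, $Q_T$ denotes the query vector of the text modality, $K_A$ and $V_A$ denote the key vector and value vector of the audio modality, and $d_k$ is the normalization coefficient. FFN is the feed-forward neural network, which is composed of linear layers.

The outputs of the cross-modal transformers are concatenated to generate $Z_T \in \mathbb{R}^{S \times 60}$, $Z_A \in \mathbb{R}^{S \times 60}$, and $Z_V \in \mathbb{R}^{S \times 60}$. Each of $Z_T$, $Z_A$, $Z_V$ deeply fuses the information of the other two modalities and is then fed into the self-attention module to capture the temporal information of each modality.

2.5 Self-attention

A Bi-LSTM with an attention mechanism is used to capture the temporal relationship within $Z_T$, $Z_A$, and $Z_V$. Taking $Z_T$ as an example, a Bi-LSTM with a hidden size of 30 is used to capture the temporal relationship between the time steps of $Z_T$. Because features at different time steps are of different importance to the final assessment result, an attention mechanism is introduced to weight the output of the Bi-LSTM at different time steps. The relationship between the output of the self-attention module $S_T$ and the output of the Bi-LSTM $Z^{\circ}_{T}$ can be represented by

$$S_T = \sum_{t=1}^{T} w_t \, Z^{\circ}_{T,t}$$

where $w$ is the weighting factor and $T$ is the number of questions. Then $S_T$ is passed through a ReLU layer and a linear layer to obtain the output $O_T$ of the self-attention module. Finally, the three temporal vectors $O_T$, $O_V$, $O_A$ are passed through the late fusion module for further fusion to obtain the final assessment result.

2.6 Late fusion

Low-rank fusion [8] is used to further fuse $O_T$, $O_A$, and $O_V$.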
The relationship between the final output $H$ and $O_T$, $O_V$, and $O_A$ can be represented by

$$H = \mathop{\Lambda}_{m=1}^{M} \left[ \sum_{i=1}^{r} w_m^{(i)} \cdot z_m \right]$$

where $z_m \in \{z_l, z_v, z_a\}$ are the vectors $O_T$, $O_V$, and $O_A$, each padded with a 1, $M = 3$ is the number of modalities, $\Lambda$ denotes the element-wise product over a sequence of tensors, and $w_m^{(i)}$ is the $i$-th rank factor of the weight tensor $W$ for modality $m$. The number of rank factors $r$ is set to 17.

3. EXPERIMENTAL RESULTS AND ANALYSIS

In the experiments, the evaluation metrics for the classification task are Precision, Recall, and F1. For the regression task, the evaluation metrics are MAE and RMSE. Our experiments are conducted on the DAIC-WOZ [9] database, which has a training set (30 depressed, 77 non-depressed) for training the network, a validation set (12 depressed, 23 non-depressed) for verifying network performance during training, and a test set (14 depressed, 33 non-depressed) for testing network performance. Our proposed network, DDFN, is compared with the three baseline models [2-4] described in the Introduction. Experimental results are shown in Table 1.

Table 1. Experimental results on DAIC-WOZ.
The proposed deep feature fusion network (DDFN) achieves the best results in both the regression and the classification tasks. Compared with current state-of-the-art methods, DDFN obtains the lowest RMSE of 5.35 and the highest Precision of 0.91 and F1 of 0.89. Overall, DDFN can deeply fuse the features of the three modalities for depression assessment, and it yields better evaluation results than decision fusion and simple feature fusion networks.

4. CONCLUSIONS

In this paper, we propose an unsupervised autoencoder to encode frame-level features into sentence-level features. A deep feature fusion network is proposed to further fuse the features of three modalities for depression detection. The experimental results show that our network, DDFN, achieves improved performance compared with state-of-the-art methods. Considering the insufficient samples in the depression database, we will investigate multimodal few-shot learning for depression detection in the next step.

ACKNOWLEDGMENTS

This research is supported in part by the National Natural Science Foundation of China (U19B2032) and the Guangdong Basic and Applied Basic Research Foundation (2021A1515110249).

REFERENCES

[1] Hamilton, M., “Development of a rating scale for primary depressive illness,” Br. J. Soc. Clin. Psychol. 6(4), 278–296 (1967). https://doi.org/10.1111/bjc.1967.6.issue-4
[2] Yang, L., Jiang, D., Xia, X., Pei, E., Oveneke, M. C. and Sahli, H., “Multimodal measurement of depression using deep learning models,” AVEC 2017, 53–59 (2017). https://doi.org/10.1145/3133944
[3] Al Hanai, T., Ghassemi, M. M. and Glass, J. R., “Detecting depression with audio/text sequence modeling of interviews,” Interspeech 2018, 1716–1720 (2018).
[4] Lin, L., Chen, X., Shen, Y. and Zhang, L., “Towards automatic depression detection: A BiLSTM/1d CNN-based model,” Applied Sciences 10(23), 8701 (2020). https://doi.org/10.3390/app10238701
[5] Peters, M. E., Neumann, M., Iyyer, M., Gardner, M. and Clark, C., “Deep contextualized word representations,” NAACL 2018, 2227–2237 (2018).
[6] Vaswani, A., Shazeer, N., et al., “Attention is all you need,” arXiv:1706.03762 (2017).
[7] Tsai, Y. H. H., Bai, S., Liang, P. P., Kolter, J. Z., Morency, L. P. and Salakhutdinov, R., “Multimodal transformer for unaligned multimodal language sequences,” ACL 2019, 6558–6569 (2019).
[8] Liu, Z., Shen, Y., Lakshminarasimhan, V. B., Liang, P. P., Zadeh, A. and Morency, L. P., “Efficient low-rank multimodal fusion with modality-specific factors,” arXiv preprint arXiv:1806.00064 (2018).
[9] Gratch, J., Artstein, R., Lucas, G. M., Stratou, G., Scherer, S., Nazarian, A., Wood, R., Boberg, J., DeVault, D., Marsella, S., et al., “The distress analysis interview corpus of human and computer interviews,” LREC 2014, 3123–3128 (2014).