Emotion is highly subjective, and different parts of an image may affect emotion to different degrees. The key to image emotion recognition is therefore to effectively mine discriminative local regions. We present a deep architecture that guides the network to extract discriminative and diverse affective semantic information. First, a fully convolutional network is trained with a cross-channel max pooling (CCMP) strategy to extract discriminative feature maps. Second, to ensure that most of the discriminative sentiment regions are located accurately, we add a second module consisting of a convolution layer and CCMP. After the first module locates its discriminative regions, the feature elements corresponding to those regions are erased, and the erased features are fed into the second module. This adversarial erasure operation forces the network to discover different sentiment-discriminative regions. Third, an adaptive feature fusion mechanism is proposed to better integrate the discriminative and diverse sentiment representations. Extensive experiments on the benchmark datasets FI, EmotionROI, Instagram, and Twitter1 achieve recognition accuracies of 72.17%, 61.13%, 81.97%, and 85.44%, respectively. The experimental results demonstrate that the proposed network outperforms state-of-the-art methods.
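The following is a minimal PyTorch sketch of the pipeline described above: a grouped cross-channel max pooling (CCMP), a hard-threshold erasure of the regions where the first module responds most strongly, and a learned-gate fusion of the two branches. The layer sizes, the erasure threshold, and the scalar fusion gate are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch under stated assumptions: CCMP as grouped channel-wise max pooling,
# erasure as a hard threshold on the first module's normalized response map,
# and adaptive fusion as a learned softmax gate over the two branch logits.
import torch
import torch.nn as nn


def cross_channel_max_pool(feats: torch.Tensor, groups: int) -> torch.Tensor:
    """Max over channels within each group: (B, C, H, W) -> (B, groups, H, W)."""
    b, c, h, w = feats.shape
    return feats.view(b, groups, c // groups, h, w).max(dim=2).values


def erase_discriminative_regions(feats: torch.Tensor,
                                 response: torch.Tensor,
                                 thresh: float = 0.7) -> torch.Tensor:
    """Zero out feature elements where the first module responds strongly,
    forcing the second module to look for other sentiment regions."""
    heat = response.max(dim=1, keepdim=True).values                 # (B, 1, H, W)
    lo = heat.amin(dim=(2, 3), keepdim=True)
    hi = heat.amax(dim=(2, 3), keepdim=True)
    heat = (heat - lo) / (hi - lo + 1e-6)                           # normalize to [0, 1]
    mask = (heat < thresh).float()                                  # keep weak regions only
    return feats * mask


class TwoBranchCCMP(nn.Module):
    """Backbone features -> module 1 (conv + CCMP) -> erase -> module 2 (conv + CCMP),
    then a learned-weight fusion of the two pooled class responses."""

    def __init__(self, in_ch: int = 512, num_classes: int = 8, maps_per_class: int = 4):
        super().__init__()
        out_ch = num_classes * maps_per_class
        self.module1 = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.module2 = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.num_classes = num_classes
        self.gate = nn.Parameter(torch.tensor([0.5, 0.5]))          # adaptive fusion weights

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        r1 = cross_channel_max_pool(self.module1(feats), self.num_classes)
        erased = erase_discriminative_regions(feats, r1)
        r2 = cross_channel_max_pool(self.module2(erased), self.num_classes)
        logits1 = r1.mean(dim=(2, 3))                               # global average pooling
        logits2 = r2.mean(dim=(2, 3))
        w = torch.softmax(self.gate, dim=0)
        return w[0] * logits1 + w[1] * logits2                      # (B, num_classes)
```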
Visual Entailment has recently been proposed as a new task in the multi-modal field. Its goal is to reason about the entailment relation between a real-world image as the premise and a natural-language sentence as the hypothesis. Several models have been proposed to obtain more accurate entailment judgments; however, they neither consider the semantics of the text at both global and local granularity nor refine the two modalities sufficiently. In this paper, a new stacked multi-modal refining and fusion network is proposed. First, a Global & Local Textual Features Fusion block is introduced to enable cross-interaction and key-information activation between the global and local features of the hypothesis sentence. Second, a Refining and Affine Fusion block is proposed to achieve efficient multi-modal attention and fusion between image and text features. Finally, a stacked network that embeds an adaptive hypothesis-preserving mechanism is presented to enrich the grounds for semantic entailment judgments. Experiments demonstrate that our model improves the accuracy of visual entailment classification compared with existing methods in this field.
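Below is a hedged PyTorch sketch of the stacked refine-and-fuse idea. The block names follow the abstract, but the cross-attention layout, the affine (scale-and-shift) modulation, and the gated "hypothesis-preserving" residual are assumptions made for illustration; dimensions, head counts, and depth are placeholders.

```python
# Sketch under stated assumptions: global/local text fusion as cross-attention,
# refining-and-affine fusion as text-to-image attention followed by FiLM-style
# scale-and-shift, and hypothesis preservation as a learned residual gate.
import torch
import torch.nn as nn


class GlobalLocalTextFusion(nn.Module):
    """Fuse the global sentence embedding with local token features via
    cross-attention so that key tokens are activated by the global context."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, global_txt: torch.Tensor, local_txt: torch.Tensor) -> torch.Tensor:
        # global_txt: (B, 1, D) sentence vector; local_txt: (B, T, D) token features
        attended, _ = self.attn(query=local_txt, key=global_txt, value=global_txt)
        return self.norm(local_txt + attended)                      # (B, T, D)


class RefineAffineFusion(nn.Module):
    """Cross-modal attention from hypothesis tokens to image regions, then an
    affine modulation of the text features by the attended visual context."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.scale = nn.Linear(dim, dim)
        self.shift = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, txt: torch.Tensor, img: torch.Tensor) -> torch.Tensor:
        # txt: (B, T, D) hypothesis features; img: (B, R, D) region features
        visual_ctx, _ = self.cross_attn(query=txt, key=img, value=img)
        fused = self.scale(visual_ctx) * txt + self.shift(visual_ctx)
        return self.norm(fused)


class StackedRefineFuse(nn.Module):
    """Stack several fusion blocks; a learned gate keeps part of the original
    hypothesis representation at each step (the hypothesis-preserving idea)."""

    def __init__(self, dim: int = 512, depth: int = 3, num_classes: int = 3):
        super().__init__()
        self.text_fusion = GlobalLocalTextFusion(dim)
        self.blocks = nn.ModuleList(RefineAffineFusion(dim) for _ in range(depth))
        self.gates = nn.ModuleList(nn.Linear(dim, 1) for _ in range(depth))
        self.classifier = nn.Linear(dim, num_classes)               # entail / neutral / contradict

    def forward(self, global_txt, local_txt, img_regions):
        h0 = self.text_fusion(global_txt, local_txt)
        h = h0
        for block, gate in zip(self.blocks, self.gates):
            fused = block(h, img_regions)
            g = torch.sigmoid(gate(fused))
            h = g * fused + (1 - g) * h0                            # preserve the hypothesis
        return self.classifier(h.mean(dim=1))                       # (B, num_classes)
```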