In traditional visual relationship detection models, the entity representation of visual features tends to focus on coarse-grained features while ignoring fine-grained ones. Moreover, the entity representation of spatial features does not fully reflect the prominent role of relative position, and in certain cases a unique spatial feature vector cannot be generated. Finally, existing feature fusion does not account for the fact that visual features are primary while semantic and spatial features are secondary. To address these issues, we propose a visual relationship detection model with pixel-level feature enhancement and weighted fusion. Specifically, we embed a fine-grained information block that captures pixel-level contextual information in the feature map, providing richer visual features for relationship prediction. We adopt a coordinate encoding method that encodes both the respective and relative positions of an entity pair's bounding boxes, yielding more accurate spatial feature representations. We construct a weighting-based feature fusion method to obtain an improved fused vector. We conducted extensive experiments on a mainstream visual relationship detection dataset, and the results show that the proposed model achieves better performance.
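To make the two core ideas concrete, the sketch below illustrates, under assumed details, how the spatial encoding and the weighted fusion could look. The exact block design, dimensions, and weighting scheme of the paper are not specified here, so all names (`WeightedFeatureFusion`, `encode_spatial`), the feature dimension, and the initial branch weights are hypothetical; the sketch only captures the stated principles: encode respective and relative box positions, and fuse with the visual branch weighted as primary.

```python
import torch
import torch.nn as nn


def encode_spatial(box_s, box_o):
    """Encode subject/object boxes plus their relative position.

    Boxes are (x1, y1, x2, y2) in normalized image coordinates. The
    relative terms (offset and scale between the two boxes) make the
    encoding sensitive to relative position, as the abstract requires.
    Hypothetical formulation; the paper's exact encoding may differ.
    """
    xs1, ys1, xs2, ys2 = box_s
    xo1, yo1, xo2, yo2 = box_o
    ws, hs = xs2 - xs1, ys2 - ys1          # subject box width/height
    wo, ho = xo2 - xo1, yo2 - yo1          # object box width/height
    rel = [(xo1 - xs1) / ws, (yo1 - ys1) / hs, wo / ws, ho / hs]
    return torch.tensor(list(box_s) + list(box_o) + rel)


class WeightedFeatureFusion(nn.Module):
    """Fuse visual, semantic, and spatial features with learnable weights.

    The visual weight is initialized higher than the other two, so the
    visual branch dominates at the start of training, reflecting the
    "visual primary, semantic/spatial secondary" assumption. Dimensions
    and initial values are illustrative, not from the paper.
    """

    def __init__(self, dim=512):
        super().__init__()
        # One learnable scalar per branch: [visual, semantic, spatial].
        self.branch_weights = nn.Parameter(torch.tensor([1.0, 0.5, 0.5]))
        self.proj = nn.Linear(dim, dim)

    def forward(self, visual, semantic, spatial):
        w = torch.softmax(self.branch_weights, dim=0)  # normalize weights
        fused = w[0] * visual + w[1] * semantic + w[2] * spatial
        return self.proj(fused)
```

As a usage example, `encode_spatial((0.1, 0.1, 0.4, 0.5), (0.3, 0.2, 0.7, 0.6))` returns a 12-dimensional spatial vector, and `WeightedFeatureFusion(dim=512)` can then combine projected visual, semantic, and spatial embeddings of equal dimension into a single fused vector for relationship prediction.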