Attention mechanisms have shown impressive ability in solving downstream multi-modal tasks. However, a natural semantic gap between the vision and language modalities hinders conventional attention-based models from achieving effective cross-modal semantic alignment. In this paper, we present JoAt, a Joint Attention network, with which we investigate how to exploit visual background information more directly, in a query-adaptive manner, to enrich the querying semantics of each visual token, and how to bridge the semantic gap more fully to align visual-grid and textual features. Specifically, JoAt exploits each query's neighboring pixels, aggregates visual query tokens from different receptive fields, and lets the model dynamically select the most relevant neighboring tokens for each query; it thereby obtains representations that are better matched semantically with the textual features, enabling richer interaction between the visual and linguistic modalities. Experimental results show that JoAt fully exploits semantic signals from visual features at different receptive fields and effectively narrows the natural semantic gap between the visual and language modalities. JoAt achieves accuracies of 72.15% and 98.90% on the VQAv2.0 test-std and CLEVR benchmarks, respectively.
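The abstract does not give JoAt's equations, but its core idea, pooling each visual token's neighborhood at several receptive fields and letting the token softly select the most relevant scale, can be sketched as follows. The window sizes, the average-pooling, and the dot-product gate are all our assumptions, not the paper's actual design:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_query_aggregation(grid, window_sizes=(1, 3, 5)):
    """Sketch of query-adaptive multi-receptive-field aggregation (our
    simplification of the JoAt idea). grid: (H, W, d) visual feature map.
    Returns one enriched query vector per grid position."""
    H, W, d = grid.shape
    tokens = grid.reshape(-1, d)                       # (HW, d) query tokens
    pooled = []
    for k in window_sizes:
        p = k // 2
        padded = np.pad(grid, ((p, p), (p, p), (0, 0)), mode="edge")
        pool = np.zeros_like(grid)
        for i in range(H):                             # average-pool a k x k
            for j in range(W):                         # neighborhood per token
                pool[i, j] = padded[i:i + k, j:j + k].mean(axis=(0, 1))
        pooled.append(pool.reshape(-1, d))             # (HW, d) per scale
    pooled = np.stack(pooled, axis=1)                  # (HW, S, d)
    # query-adaptive gate: each token scores its own multi-scale contexts
    gate = softmax((pooled * tokens[:, None, :]).sum(-1) / np.sqrt(d), axis=1)
    return (gate[..., None] * pooled).sum(axis=1)      # (HW, d)
```

The enriched tokens would then serve as queries in an ordinary cross-attention against the textual features.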
KEYWORDS: Transformers, Visualization, Semantics, Performance modeling, Matrices, Information visualization, Visual process modeling, Feature extraction, Education and training, Head
Recently, Transformers based on grid features have achieved great success in image captioning, but two problems remain: the flattening operation destroys the positional information among visual objects, and only the output of the last encoder layer is sent to the decoder, losing low-level semantic information. To address these problems, we first introduce Distance-aware self-attention (DA), which accounts for the original geometric distance between visual objects in the two-dimensional image during self-attention modeling and integrates this distance information into the attention calculation through a mapping function, better capturing the relations among visual objects. Second, we propose the Multilayer Aggregation (MA) module, which aggregates the outputs of the encoder layers through a weighted residual connection and sends the result to the decoder. By aggregating information from different encoder layers, it achieves cross-layer semantic complementarity, so that semantically rich features can be drawn simultaneously from both low-level and high-level coding layers. To verify the two designs, we applied them to a standard Transformer and conducted extensive experiments on MS-COCO, a benchmark dataset for image captioning. The results demonstrate the effectiveness of the proposed Distance-aware Multilayer Aggregation Transformer (DMAT).
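The two components described above can be sketched in a few lines. The abstract does not specify the mapping function or the aggregation weights, so the logarithmic distance bias and the normalized weighted sum below are illustrative assumptions only:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distance_aware_attention(x, coords, alpha=1.0):
    """Self-attention whose logits are biased by the 2-D Euclidean distance
    between grid positions, mapped through -alpha*log(1+d) (our assumed
    mapping function). x: (N, d) tokens; coords: (N, 2) grid positions."""
    d = x.shape[1]
    scores = x @ x.T / np.sqrt(d)                  # plain QK^T (projections omitted)
    dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    scores = scores - alpha * np.log1p(dist)       # nearer objects weigh more
    return softmax(scores) @ x

def multilayer_aggregate(layer_outputs, weights):
    """Weighted residual over all encoder layers instead of only the last."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * li for wi, li in zip(w, layer_outputs))
```

In DMAT the weights would be learned; here they are fixed inputs for clarity.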
The Vision Transformer (ViT) fully demonstrates the potential of the transformer architecture in computer vision. However, its computational complexity grows quadratically with the length of the input sequence, limiting the application of transformers to high-resolution images. To improve overall performance, this paper proposes an efficient Vision Transformer with dynamic embedding of multi-scale features (MLVT). It adopts a pyramid architecture and replaces standard self-attention with linear self-attention; to address the problem that linear self-attention disperses attention scores and ignores local correlation, a local attention enhancement module is proposed, in which a convolution performing a self-attention-like computation supplements the local attention. To handle the growth of the feature dimension in the pyramid architecture, the computational bottleneck of linear self-attention is shifted from the sequence length to the feature dimension, and a linear self-attention with compressed feature dimension is proposed. In addition, since multi-scale inputs are crucial for processing image information, this paper proposes a flexible, learnable dynamic multi-scale feature embedding module that dynamically adjusts the fusion weights of features at different scales according to the input image. Extensive experiments on image classification and object detection show that competitive results are achieved while reducing the computational cost.
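To make the complexity argument concrete, here is a minimal sketch of kernel-based linear self-attention and of the feature-dimension compression the abstract mentions. The ReLU feature map and the projection matrix `wd` are our assumptions; MLVT's actual kernel and projection may differ:

```python
import numpy as np

def linear_attention(q, k, v):
    """Kernel-based linear attention: phi(Q) (phi(K)^T V) / normalizer.
    Building the (d, d) summary K^T V once makes the cost O(N * d^2)
    instead of the O(N^2 * d) of standard softmax attention."""
    phi = lambda t: np.maximum(t, 0) + 1e-6       # simple positive feature map
    q, k = phi(q), phi(k)
    kv = k.T @ v                                  # (d, d) global summary
    z = q @ k.sum(axis=0)                         # per-query normalizer
    return (q @ kv) / z[:, None]

def compressed_linear_attention(q, k, v, wd):
    """When the feature dimension d grows in a pyramid stage, project to
    d' < d first (wd: (d, d') matrix, an assumption) so the d^2 bottleneck
    becomes d'^2."""
    return linear_attention(q @ wd, k @ wd, v @ wd)
```

Note that when every value row is identical, any properly normalized attention must return that row unchanged, which is an easy sanity check on the normalizer.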
By introducing the object cloud into topological space, the spatial relationships between fuzzy objects are transformed into cloud relationships in cloud space. According to cloud theory, all spatial objects can be represented by three types of object cloud: point-cloud, line-cloud, and area-cloud. The 9-intersection model (9IM) of spatial topological relations proposed by Egenhofer can therefore be extended using this definition of the object cloud. The relationship between object clouds is a flexible one: unlike the crisp 9IM, the flexible relationship model based on object clouds can be simplified to a 4-intersection cloud model (4ICM) comprising the relations equal, contain, intersect, and disjoint. Cloud operations and the virtual cloud can be introduced to represent fuzzy and uncertain topological relations. The method enables the spatial data model to describe spatial phenomena with fuzziness and uncertainty, and enriches cloud theory.
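As a toy illustration of the 4ICM's four relations, one can approximate each object cloud by the set of cells where its membership exceeds some cut level and classify the pair set-theoretically. This discretization is our simplification; the actual 4ICM operates on clouds, not crisp sets:

```python
def cloud_relation(a, b):
    """Classify two object clouds, each approximated by the set of cells
    where its membership exceeds a chosen cut level (a crisp stand-in for
    the 4ICM relations: equal, contain, intersect, disjoint)."""
    if a == b:
        return "equal"
    if b < a or a < b:          # proper subset in either direction
        return "contain"
    if a & b:                   # overlapping but neither contains the other
        return "intersect"
    return "disjoint"
```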
A transition-region extraction and image segmentation algorithm in cloud space is proposed. By mapping an image into a one-dimensional cloud space with a one-to-many model, an object is transformed into an object-cloud with certain digital characteristics. Two neighboring objects correspond to two intersecting object-clouds in cloud space, and the intersecting region is exactly the transition region. By logical operations between the intersecting clouds, we obtain the boundary-cloud and its digital characteristics; the entropy and hyper-entropy of the boundary-cloud determine a reasonable scope for the transition region. Taking the average gray level of the transition region as the threshold, the object can be extracted accurately. Experiments show that the algorithm is both efficient and effective.
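The final thresholding step described above is simple enough to sketch directly. Here `t_lo` and `t_hi` stand in for the transition-region scope that the paper derives from the boundary-cloud's entropy and hyper-entropy; deriving them is omitted:

```python
import numpy as np

def transition_threshold(image, t_lo, t_hi):
    """Take the pixels whose gray level falls inside the transition scope
    (t_lo, t_hi), use their mean gray level as the segmentation threshold,
    and return the resulting foreground mask plus the threshold."""
    trans = image[(image > t_lo) & (image < t_hi)]   # transition-region pixels
    thr = trans.mean() if trans.size else (t_lo + t_hi) / 2.0
    return image >= thr, thr
```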
A fuzzy edge detection algorithm based on the object-cloud and the maximum fuzzy entropy principle is proposed in this paper. Owing to the uncertainty of objects in a remote sensing (RS) image, the spatial objects in RS image space can be mapped into cloud space by a 1:M cloud model. The object-cloud then has digital characteristics that describe the fuzziness and randomness of objects in the RS image. Through cloud operations, the boundary-cloud and its digital characteristics can be obtained, and the membership matrix of the transition region can be constructed. By the maximum fuzzy entropy principle, edge detection is then accomplished in the membership matrix of the transition region.
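The maximum fuzzy entropy principle rests on the fact that a membership value of 0.5 is maximally ambiguous, exactly where an edge is expected. A minimal sketch, in which the selection rule (keep pixels within a fraction of the entropy maximum) is our own simplification:

```python
import numpy as np

def fuzzy_entropy(mu, eps=1e-12):
    """Shannon-style fuzzy entropy of membership values in [0, 1];
    it peaks at mu = 0.5, i.e. at maximum ambiguity."""
    mu = np.clip(mu, eps, 1 - eps)
    return -(mu * np.log(mu) + (1 - mu) * np.log(1 - mu))

def detect_edges(membership, frac=0.9):
    """Mark as edge pixels those whose fuzzy entropy lies within `frac`
    of the maximum log(2) (an assumed selection rule, for illustration)."""
    return fuzzy_entropy(membership) >= frac * np.log(2)
```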
This paper presents a new method for texture feature representation in RS images based on the cloud model. Addressing the fuzziness and randomness of RS images, we introduce cloud theory into RS image processing. The digital characteristics of clouds integrate the fuzziness and randomness of linguistic terms in a unified way and map between quantitative and qualitative concepts. We adopt a multi-dimensional texture cloud to handle the vagueness and randomness of texture features in RS images. The method has two steps. 1) Correlation analysis of the texture statistical parameters of the Grey Level Co-occurrence Matrix (GLCM), and parameter fuzzification. The GLCM represents texture features well in many respects; according to the expressive power of the texture statistical parameters, and by analyzing their correlations, we extract the few parameters that best represent the texture. Through the fuzzification algorithm, these parameters are mapped into a fuzzy cloud space. 2) Construction of the multi-dimensional texture cloud model. Based on the extracted texture statistical parameters and the fuzzy cloud space, a multi-dimensional texture cloud model is constructed over micro-windows of the image. According to the memberships of the texture statistical parameters, cloud-drop samples are obtained. With the backward cloud generator, the digital characteristics of the multi-dimensional texture cloud model are estimated, and the Mathematical Expected Hyper Surface (MEHS) of the multi-dimensional cloud of each micro-window is constructed. Finally, a weighted sum of the three digital characteristics of the micro-window cloud model is proposed and used for texture representation in RS images. The method is demonstrated by applying it to texture representation in many RS images; various performance studies show that it is both efficient and effective. It enriches cloud theory and offers a new approach to representing and analyzing image texture, especially in RS images.
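The backward cloud generator invoked in step 2 has a standard moment-based form: it estimates the three digital characteristics, expectation Ex, entropy En, and hyper-entropy He, from a sample of cloud drops. A one-dimensional sketch (the paper applies it per dimension of the texture cloud):

```python
import numpy as np

def backward_cloud(drops):
    """Standard backward cloud generator (without certainty degrees):
    estimate (Ex, En, He) from sample drops. For a Gaussian sample,
    E|x - Ex| = sigma * sqrt(2/pi), hence the sqrt(pi/2) factor."""
    x = np.asarray(drops, dtype=float)
    ex = x.mean()                                        # expectation Ex
    en = np.sqrt(np.pi / 2.0) * np.abs(x - ex).mean()    # entropy En
    he2 = x.var(ddof=1) - en ** 2                        # He^2 = S^2 - En^2
    he = np.sqrt(max(he2, 0.0))                          # hyper-entropy He
    return ex, en, he
```

For drops drawn from a plain Gaussian, He should come out near zero, since all the spread is explained by En; a nonzero He signals the second-order randomness the cloud model adds.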
A spatial region in an RS image has uncertainties in both position and thematic value. Based on these uncertainties and cloud theory, this paper studies the representation of spatially uncertain regions in images and proposes a new representation method based on the cloud model. In a two-dimensional universe of discourse, using the gray level, gradient, or other digital characteristics of the image, we construct the object-cloud of a spatial object. An uncertain spatial region can then be represented by its object-cloud, and the edge of a spatial object by a half-cloud-ring, so spatially uncertain regions are represented properly within the cloud model. Experiments show that the method is both efficient and effective. It enriches cloud theory and offers a new approach to representing fuzzy objects and to image understanding and analysis, especially for remote sensing images.
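The counterpart of the backward generator used throughout these abstracts is the normal (forward) cloud generator, which turns the digital characteristics (Ex, En, He) back into cloud drops with certainty degrees, the raw material for an object-cloud. A one-dimensional sketch of the standard algorithm:

```python
import numpy as np

def forward_cloud(ex, en, he, n, rng=None):
    """Normal (forward) cloud generator: draw a per-drop entropy
    En' ~ N(En, He^2), a drop position x ~ N(Ex, En'^2), and compute the
    drop's certainty degree mu = exp(-(x-Ex)^2 / (2 En'^2))."""
    rng = rng or np.random.default_rng()
    en_i = rng.normal(en, he, n)                      # entropy blurred by He
    x = rng.normal(ex, np.abs(en_i))                  # drop positions
    mu = np.exp(-(x - ex) ** 2 / (2 * en_i ** 2))     # certainty degrees
    return x, mu
```

A two-dimensional version of the same recipe, driven by gray level and gradient, would yield the object-clouds the abstract describes.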
In 1999, the first Hartmann-Shack wavefront sensor in China for measuring the aberrations of the human eye was established. The sensor was subsequently improved and applied to clinical diagnosis. In this paper, the principle and method of measuring the wave aberrations of the human eye are given, the accuracy of the Hartmann-Shack sensor is measured and analyzed, and measurement results of the wavefront aberrations of real eyes obtained with the sensor are presented.
In 1980, the first laboratory on adaptive optics in China was established at the Institute of Optics and Electronics, Chinese Academy of Sciences. Several adaptive optical systems have since been set up and applied to Inertial Confinement Fusion (ICF) and high-resolution retinal imaging. In 1985, the world's first adaptive optical system for ICF equipment was set up. A 45-element adaptive optical system was then built in 2001 to correct the static and dynamic wavefront aberrations in the large-aperture Nd:glass laser for inertial confinement fusion. Two adaptive optical systems, with 19-element and 37-element deformable mirrors, were developed for human retinal imaging in 2000 and 2002, respectively. In this paper, the function and performance of these adaptive optical systems are described and experimental results are presented.
Clustering in spatial data mining groups similar objects based on their distance, connectivity, or relative density in space. Clustering algorithms typically use the Euclidean distance. In the real world, however, there are many physical obstacles such as rivers, lakes, and highways, and their presence may substantially affect the result of clustering. In this paper, we study the problem of clustering in the presence of obstacles and propose spatial clustering by Voronoi distance in the Voronoi diagram (Thiessen polygons). The Voronoi diagram captures lateral spatial adjacency, so it can conveniently express lateral adjacency relations and resolve the problems that obstacles pose for spatial clustering. The method has three steps. First, build the Voronoi diagram in the presence of obstacles. Second, define the Voronoi distance: given two spatial objects Pi and Pj, their Voronoi distance is the minimum number of object Voronoi regions crossed between Pi and Pj in the Voronoi diagram. Third, apply the proposed Following-Obstacle-Algorithm (FOA), which comprises an initializing step, a querying step, and a pruning step; with FOA, the Voronoi distance between any two objects can be obtained. With the Voronoi diagram and FOA, spatial clustering in the presence of obstacles can be accomplished conveniently and more precisely. Various performance studies show that the method is both efficient and effective.
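Once the Voronoi diagram is built and obstacle-blocked borders are removed from the region adjacency graph, computing the Voronoi distance reduces to a breadth-first search over that graph. This sketch assumes the adjacency structure as input and stands in for FOA's querying step, not for the full three-step algorithm:

```python
from collections import deque

def voronoi_distance(adjacency, src, dst):
    """Minimum number of Voronoi region crossings between two generators,
    found by BFS over the region adjacency graph. Edges blocked by
    obstacles are simply absent from `adjacency` (dict: region -> list
    of neighboring regions). Returns inf if dst is unreachable."""
    if src == dst:
        return 0
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, dist = frontier.popleft()
        for nb in adjacency.get(node, ()):
            if nb == dst:
                return dist + 1
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, dist + 1))
    return float("inf")
```

An obstacle that severs every path between two regions makes their Voronoi distance infinite, which is exactly the behavior a distance measure for obstacle-aware clustering needs.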