In the era of rapid Internet development, intelligent devices of all kinds generate massive amounts of image data. With the emergence of large-scale manually labeled datasets, deep learning has achieved major breakthroughs in computer vision tasks such as image classification, image super-resolution, and object detection. However, manually labeling image data is a tedious and time-consuming process, whereas unlabeled data are cheap and easy to obtain from the internet. How to effectively exploit this massive unlabeled data is therefore one of the research hotspots in computer vision. Self-supervised representation learning constructs supervision signals by designing self-supervised pretext tasks and learns rich semantic representations from unlabeled datasets. However, many existing self-supervised representation learning models require a large batch size during training to learn a good visual representation, and a large batch size demands substantial computing resources. To address these problems, we apply self-supervised representation learning to object detection and propose a self-supervised object detection method based on spatial scale learning and category prediction. Without adding manual labels, the spatial scale information learning task and the category prediction task help the model learn the spatial scale and category relationships between objects in an image. Moreover, in the feature extraction stage, we fuse a feature pyramid network with an attention mechanism so that the model better adapts to the size differences of objects in the image, learns richer detail information, and further improves performance. Experimental results show that our method achieves better performance.
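To make the feature-fusion idea concrete, the following minimal PyTorch sketch (module and parameter names are our own illustrative choices, not the authors' code) shows how a squeeze-and-excitation style channel attention block could be applied to each level of a feature pyramid before detection heads consume the features.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention (illustrative sketch)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # Global average pooling -> per-channel weights -> rescale features.
        w = self.fc(x.mean(dim=(2, 3)))
        return x * w[:, :, None, None]

class AttentiveFPNFusion(nn.Module):
    """Applies channel attention to every pyramid level (hypothetical names)."""
    def __init__(self, channels=256, num_levels=4):
        super().__init__()
        self.attn = nn.ModuleList([ChannelAttention(channels) for _ in range(num_levels)])

    def forward(self, pyramid):
        # pyramid: list of feature maps [P2, P3, P4, P5], each (N, C, H, W).
        return [att(p) for att, p in zip(self.attn, pyramid)]

if __name__ == "__main__":
    feats = [torch.randn(1, 256, s, s) for s in (64, 32, 16, 8)]
    fused = AttentiveFPNFusion()(feats)
    print([f.shape for f in fused])
```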
Grayscale image colorization adds plausible color information to an image; converting grayscale images into color images is an important and difficult image processing task. Colorization amounts to predicting the color information corresponding to a grayscale image with a colorization model. In this paper, the proposed colorization model is a generative adversarial network whose generator takes multi-scale inputs: an input fusion module fuses the condition map with the network feature maps so that the network can better exploit the information at each scale of the grayscale image for color prediction and improve the colorization effect. The effectiveness of the generator is verified experimentally, and comparison with several existing methods demonstrates the improvement achieved by the proposed method.
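As a rough illustration of the multi-scale input fusion described above (a sketch under our own assumptions, not the paper's released code), the snippet below downsamples the grayscale condition map and concatenates it with the feature map at each encoder scale before the next convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InputFusionBlock(nn.Module):
    """Fuses a resized grayscale condition map with a feature map (illustrative)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # +1 channel for the resized grayscale input concatenated at this scale.
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + 1, out_ch, 3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.2, inplace=True),
        )

    def forward(self, feat, gray):
        gray_s = F.interpolate(gray, size=feat.shape[2:], mode="bilinear",
                               align_corners=False)
        return self.conv(torch.cat([feat, gray_s], dim=1))

class MultiScaleEncoder(nn.Module):
    """Encoder of a colorization generator with multi-scale grayscale input."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Conv2d(1, 32, 3, padding=1)
        self.down1 = InputFusionBlock(32, 64)
        self.down2 = InputFusionBlock(64, 128)

    def forward(self, gray):
        x = self.stem(gray)
        x = self.down1(x, gray)
        x = self.down2(x, gray)
        return x

if __name__ == "__main__":
    g = torch.randn(1, 1, 128, 128)   # grayscale condition map
    print(MultiScaleEncoder()(g).shape)
```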
KEYWORDS: 3D modeling, 3D image processing, Data modeling, 3D image reconstruction, Image registration, Virtual reality, Solid modeling, Mathematical modeling, 3D displays
The virtual classroom is a challenging and frequently discussed research topic, and may be regarded as one part of the metaverse. In this paper, we introduce 3D shape estimation for the registration and authentication of teachers and students in a virtual classroom, using a new deep learning method, called the geometric generative model, together with the 2D-to-3D mapping algorithm DensePose. During login to the virtual classroom, the 3D human body reconstructed from the 2D image of the teacher or student serves as the basis for identification and authentication. Experiments on this first step of the virtual classroom, namely login and authentication based on the 3D shape model of the teacher and student, are to be carried out.
In deep learning-based video action recognition, the neural network must capture spatial information, motion information, and the associations between them over uneven time spans. We propose a network that extracts the semantic information of video sequences through deep fusion of local spatial-temporal features. Convolutional neural networks (CNNs) extract local spatial information and local motion information, respectively. The spatial information is combined with the motion information of the corresponding time by three-dimensional convolution to obtain the local spatial-temporal information at a given moment. This local spatial-temporal information is then fed into a long short-term memory (LSTM) network to obtain its contextual relationships along the long-time dimension. We further add a regional attention mechanism over video frames to the context-modeling part of the network: the spatial information of the last convolutional layer and the output of the first fully connected layer are fed into separate LSTM networks, and the outputs of the two LSTMs at each time step are merged. The fully connected layer, which is rich in categorical information, thus provides a frame attention mechanism for the spatial information layer. Experiments on three common action recognition datasets, UCF101, UCF11, and UCFSports, show that the proposed spatial-temporal information deep fusion network achieves a high recognition accuracy on the action recognition task.
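A highly simplified PyTorch sketch of the two-stream LSTM fusion (dimensions and names are illustrative assumptions, not the original network): per-frame convolutional features and fully connected features feed separate LSTMs, and the category-rich stream produces per-frame weights for the spatial stream before the outputs are merged.

```python
import torch
import torch.nn as nn

class TwoStreamLSTMFusion(nn.Module):
    """Merges a spatial-feature LSTM and an FC-feature LSTM per time step
    (an illustrative sketch, not the authors' exact architecture)."""
    def __init__(self, spatial_dim=512, fc_dim=4096, hidden=256, num_classes=101):
        super().__init__()
        self.lstm_spatial = nn.LSTM(spatial_dim, hidden, batch_first=True)
        self.lstm_fc = nn.LSTM(fc_dim, hidden, batch_first=True)
        # The FC stream produces a per-frame attention score for the spatial stream.
        self.attn = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, spatial_seq, fc_seq):
        # spatial_seq: (N, T, spatial_dim); fc_seq: (N, T, fc_dim)
        hs, _ = self.lstm_spatial(spatial_seq)
        hf, _ = self.lstm_fc(fc_seq)
        alpha = torch.softmax(self.attn(hf), dim=1)      # (N, T, 1) frame weights
        fused = torch.cat([(alpha * hs).sum(1), hf[:, -1]], dim=1)
        return self.classifier(fused)

if __name__ == "__main__":
    net = TwoStreamLSTMFusion()
    out = net(torch.randn(2, 16, 512), torch.randn(2, 16, 4096))
    print(out.shape)  # (2, 101)
```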
Monocular image-based three-dimensional (3-D) human pose recovery aims to retrieve 3-D poses using the corresponding two-dimensional image features; the recovery performance therefore depends heavily on the image representations. We propose a multispectral embedding-based deep neural network (MSEDNN) that automatically obtains the most discriminative features from multiple deep convolutional neural networks and then embeds their penultimate fully connected layers into a low-dimensional manifold. This compact manifold exploits not only the optimum output of the multiple deep networks but also their complementary properties. Furthermore, the distribution of each hierarchical discriminative manifold is sufficiently smooth that the training of our MSEDNN can be implemented effectively with only a few labeled samples. The proposed network contains a body joint detector and a human pose regressor that are trained jointly. Extensive experiments on four databases show that our MSEDNN achieves the best recovery performance compared with state-of-the-art methods.
Estimating three-dimensional (3D) human poses from a single camera is usually implemented by searching pose candidates with image descriptors. Existing methods typically assume that the mapping from feature space to pose space is linear, but the true mapping is highly nonlinear, which heavily degrades the performance of 3D pose estimation. We propose a method to recover 3D pose from a silhouette image based on multiview feature embedding (MFE) and locality-sensitive autoencoders (LSAEs). First, we describe a manifold-regularized sparse low-rank approximation for MFE, so that the input image is characterized by a fused feature descriptor. Then, the fused feature and its corresponding 3D pose are each encoded by an LSAE. A two-layer back-propagation neural network, trained by parameter fine-tuning, maps the encoded 2D features to encoded 3D poses. The LSAE ensures good preservation of the local topology of the data points. Experimental results demonstrate the effectiveness of the proposed method.
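The mapping stage can be illustrated with a small PyTorch sketch (dimensions and module names are assumptions for illustration, and the locality-sensitive regularization term is omitted): one autoencoder encodes the fused 2D feature, another encodes the 3D pose, and a two-layer network maps the feature code to the pose code, which is then decoded to a pose.

```python
import torch
import torch.nn as nn

class AE(nn.Module):
    """Plain autoencoder; the locality-sensitive term is omitted for brevity."""
    def __init__(self, dim, code):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, code), nn.Tanh())
        self.dec = nn.Linear(code, dim)

    def forward(self, x):
        z = self.enc(x)
        return self.dec(z), z

feat_ae = AE(dim=300, code=64)    # encodes the fused 2D feature descriptor (assumed size)
pose_ae = AE(dim=51, code=32)     # encodes the 3D pose (17 joints x 3, assumed)
mapper = nn.Sequential(           # two-layer network from feature code to pose code
    nn.Linear(64, 48), nn.Tanh(), nn.Linear(48, 32))

feat = torch.randn(8, 300)
_, z_feat = feat_ae(feat)
pose_pred = pose_ae.dec(mapper(z_feat))   # decode the mapped code back to a 3D pose
print(pose_pred.shape)                    # (8, 51)
```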
Estimating three-dimensional (3-D) pose from a single image is usually performed by retrieving pose candidates with two-dimensional (2-D) features. However, pose retrieval usually relies on the acquisition of sufficient labeled data and suffers from low retrieval accuracy, and acquiring a large number of unconstrained 2-D images annotated with 3-D poses is difficult. To solve these issues, we propose a coupled-source framework that integrates two independent training sources. The first source contains only 3-D poses, and the second source contains images annotated with 2-D poses. For accurate retrieval, we present local-topology preserved sparse coding (LTPSC) to generate pose candidates, where the estimated 2-D pose of a test image is regarded as the feature for pose retrieval and represented as a sparse combination of features in the exemplar database. LTPSC ensures that semantically similar poses are retrieved with larger probabilities. Extensive experiments validate the effectiveness of our method.
IPTV is a new service delivered over the Internet, made possible as network transmission and streaming media technologies mature. In this paper, the IPTV system infrastructure of UTStarcom, the key technologies deployed, and its applications are discussed and evaluated. The key technologies for delivering IPTV services include 1) codec and compression; 2) streaming media; and 3) broadband networks and access to such networks. The implementation of the Media Switch IPTV system in the Harbin CNC city network is also discussed.
We propose a PCA-based algorithm for fusing an infrared image and a visible image of the same night-vision scene. The infrared image and the visible image are each preprocessed with a Gaussian low-pass filter to smooth them. The mathematical formulation of the image fusion techniques is worked out in this paper, including the block-based method, the wavelet transform, the wavelet packet, and PCA. Experimental results show the effectiveness of the proposed method.
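The PCA weighting step can be sketched in a few lines of NumPy (a minimal illustration of the classic PCA fusion rule, assuming single-channel, co-registered images; the Gaussian pre-filtering uses scipy.ndimage): the leading eigenvector of the covariance of the two smoothed images provides the fusion weights.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def pca_fuse(ir, vis, sigma=1.0):
    """Fuse co-registered infrared and visible images via PCA weights
    (illustrative sketch, not necessarily the paper's exact variant)."""
    ir_s = gaussian_filter(ir.astype(np.float64), sigma)   # smooth to suppress noise
    vis_s = gaussian_filter(vis.astype(np.float64), sigma)
    data = np.stack([ir_s.ravel(), vis_s.ravel()])          # 2 x (H*W) samples
    cov = np.cov(data)
    eigvals, eigvecs = np.linalg.eigh(cov)
    v = np.abs(eigvecs[:, np.argmax(eigvals)])               # leading eigenvector
    w = v / v.sum()                                           # normalize to weights
    return w[0] * ir_s + w[1] * vis_s

if __name__ == "__main__":
    ir = np.random.rand(128, 128)
    vis = np.random.rand(128, 128)
    print(pca_fuse(ir, vis).shape)
```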
This paper addresses two problems of a ship handling simulator. First, 360-degree scene generation, especially 3D dynamic sea wave modeling, is described. Second, a multi-computer implementation of the ship handling simulator is presented. The paper also gives experimental results for the proposed ship handling simulator.
KEYWORDS: 3D modeling, Reconstruction algorithms, Video, 3D image processing, Cameras, Detection and tracking algorithms, 3D image reconstruction, Visual process modeling, Image processing, Motion models
The 3D reconstruction of video sequences in this paper does not require knowledge of the camera parameters or locations. The first step of our reconstruction is feature point matching and stereo correspondence by singular value decomposition (SVD). After obtaining the redundant information between image sequences, we use a stratification algorithm to build the 3D model: we first obtain the 3D model in projective space and then upgrade it to Euclidean space using constraint information.
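The SVD-based correspondence step can be illustrated with the classic proximity-matrix formulation (a sketch in the style of the Scott and Longuet-Higgins pairing scheme, with made-up data; it is not necessarily the paper's exact variant): build a Gaussian proximity matrix between the two point sets, take its SVD, replace the singular values with ones, and read matches off the mutual row and column maxima of the resulting pairing matrix.

```python
import numpy as np

def svd_correspondences(pts1, pts2, sigma=10.0):
    """Feature-point pairing via SVD of a Gaussian proximity matrix
    (illustrative sketch of the classic SVD matching scheme)."""
    d2 = ((pts1[:, None, :] - pts2[None, :, :]) ** 2).sum(-1)
    G = np.exp(-d2 / (2 * sigma ** 2))           # proximity matrix
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    P = U @ Vt                                    # singular values replaced by ones
    matches = []
    for i in range(P.shape[0]):
        j = np.argmax(P[i])
        if np.argmax(P[:, j]) == i:               # mutual row/column maximum
            matches.append((i, j))
    return matches

if __name__ == "__main__":
    pts1 = np.random.rand(20, 2) * 100
    pts2 = pts1 + np.random.randn(20, 2)          # noisy copy for demonstration
    print(svd_correspondences(pts1, pts2)[:5])
```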
In this paper, several video shot detection technologies are first discussed. An edited video contains two kinds of shot boundaries, known as straight cuts and optical cuts. Experimental results on a variety of videos demonstrate that the moving window detection algorithm and the 10-step difference histogram comparison algorithm are effective for detecting both kinds of shot cuts. After shot isolation, methods for shot characterization were investigated. We present a detailed discussion of key-frame extraction and review the visual features of key-frames, particularly the color feature based on the HSV model. Video retrieval methods based on key-frames are presented at the end of this section. The paper also presents an integrated system solution for computer-assisted video parsing and content-based video retrieval. The application software package was programmed on the Visual C++ development platform.
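A minimal sketch of histogram-difference shot detection on HSV frames (the bin counts, step size, and threshold are our own assumed parameters; the paper's moving-window and 10-step variants refine this basic comparison): compute a per-frame HSV histogram and flag frames whose distance to the frame k steps earlier exceeds a threshold.

```python
import cv2
import numpy as np

def hsv_hist(frame, bins=(16, 4, 4)):
    """Normalized HSV color histogram of a BGR frame (illustrative)."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    h = cv2.calcHist([hsv], [0, 1, 2], None, bins, [0, 180, 0, 256, 0, 256])
    return cv2.normalize(h, None).flatten()

def detect_cuts(frames, step=10, thresh=0.5):
    """Flag shot boundaries where the histogram distance to the frame
    `step` positions earlier exceeds `thresh` (simplified k-step comparison)."""
    hists = [hsv_hist(f) for f in frames]
    cuts = []
    for i in range(step, len(hists)):
        d = cv2.compareHist(hists[i - step], hists[i], cv2.HISTCMP_BHATTACHARYYA)
        if d > thresh:
            cuts.append(i)
    return cuts

if __name__ == "__main__":
    # Synthetic frames: a sudden brightness change halfway through acts as a cut.
    frames = [np.full((120, 160, 3), 40, np.uint8)] * 30 \
           + [np.full((120, 160, 3), 200, np.uint8)] * 30
    print(detect_cuts(frames))
```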
We present a 3D reconstruction method that combines geometry-based modeling, image-based modeling, and rendering techniques. The first component is an interactive geometry modeling method that recovers the basic geometry of the photographed scene. The second component is a model-based stereo algorithm. We discuss the image processing problems and algorithms of walking through a virtual space, and then design and implement a high-performance multi-thread wandering algorithm. The applications range from architectural planning and archaeological reconstruction to virtual environments and cinematic special effects.