Effectively recognizing human actions from varying viewpoints is crucial for successful collaboration between humans and robots. Deep learning approaches achieve promising action recognition performance when sufficient well-annotated real-world data are available. However, collecting and annotating real-world videos can be challenging, particularly for rare or violent actions. Synthetic data, on the other hand, can be easily obtained from simulators with fine-grained annotations and multiple modalities. To learn domain-invariant feature representations, we propose a novel method that distills pseudo labels from a strong mesh-based action recognition model into a lightweight I3D model. In this way, the model can leverage robust 3D representations while maintaining real-time inference speed. We empirically evaluate our model on the Mixamo→Kinetics benchmark, where it achieves state-of-the-art performance compared to existing video domain adaptation methods.
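The pseudo-label distillation described above can be illustrated with a minimal sketch, assuming a standard soft-label (KL-divergence) distillation loss in PyTorch; the temperature and the hypothetical `mesh_teacher` and `i3d` models in the comment are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label distillation: KL divergence between the mesh-based teacher's
    pseudo-label distribution and the lightweight I3D student's prediction."""
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2

# Hypothetical training step: the teacher produces soft pseudo labels on
# unlabeled target-domain clips, and the student is trained to match them.
# loss = distillation_loss(i3d(clip_rgb), mesh_teacher(clip_mesh).detach())
```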
Effectively recognizing human gestures from varying viewpoints plays a fundamental role in successful collaboration between humans and robots. Deep learning approaches have achieved promising performance in gesture recognition; however, they are typically data-hungry and require large-scale labeled data, which are often inaccessible in practical settings. Synthetic data, on the other hand, can be easily obtained from simulators with fine-grained annotations and multiple modalities. Existing state-of-the-art approaches have shown promising results using synthetic data, but a large performance gap remains between models trained on synthetic data and those trained on real data. To learn domain-invariant feature representations, we propose a novel approach that jointly takes RGB videos and 3D meshes as input to perform robust action recognition. We empirically validate our model on the RoCoG-v2 dataset, which consists of a variety of real and synthetic gesture videos captured from ground and air perspectives. We show that our model trained on synthetic data outperforms state-of-the-art models under the same training setting, as well as models trained on real data.
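The abstract does not specify how the RGB and mesh streams are combined; the following is only a minimal late-fusion sketch under assumed per-modality encoders (`rgb_encoder` and `mesh_encoder` are hypothetical placeholders, not the authors' architecture).

```python
import torch
import torch.nn as nn

class TwoStreamGestureClassifier(nn.Module):
    """Hypothetical late-fusion model: one encoder per modality, concatenated
    clip-level features, and a shared classification head."""
    def __init__(self, rgb_encoder, mesh_encoder, feat_dim, num_classes):
        super().__init__()
        self.rgb_encoder = rgb_encoder      # e.g., a video backbone (assumption)
        self.mesh_encoder = mesh_encoder    # e.g., a network over mesh vertices (assumption)
        self.head = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, rgb_clip, mesh_seq):
        f_rgb = self.rgb_encoder(rgb_clip)    # (B, feat_dim)
        f_mesh = self.mesh_encoder(mesh_seq)  # (B, feat_dim)
        return self.head(torch.cat([f_rgb, f_mesh], dim=-1))
```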
Multimedia Event Detection (MED) is a multimedia retrieval task whose goal is to find videos of a particular event in a large-scale Internet video archive, given example videos and text descriptions. In this paper, we focus on the 'ad-hoc' MED scenario in which no example videos are used. We aim to retrieve test videos based on their visual semantics using a Visual Concept Signature (VCS) generated for each event solely from the event description provided as the query. Visual semantics are described using the Semantic INdexing (SIN) feature, which represents the likelihood of predefined visual concepts appearing in a video. To generate a VCS for an event, we project the given event description onto a visual concept list using the proposed textual semantic similarity. Exploiting properties of the SIN feature, we harmonize the generated visual concept signature and the SIN feature to improve retrieval performance. We conduct experiments to assess the quality of the generated visual concept signatures with respect to human expectation, and, in the context of the MED task, to retrieve videos in the test dataset by their SIN features when no or only very few training videos are available.
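As an illustration of the general idea above (not the paper's exact formulation), the sketch below scores each predefined concept name against the event description with a placeholder text-similarity function to form the signature, then ranks test videos by the match between that signature and their SIN concept-likelihood vectors.

```python
import numpy as np

def build_concept_signature(event_description, concept_names, text_similarity):
    """Score every predefined visual concept against the event description.
    `text_similarity` stands in for the paper's textual semantic similarity."""
    sig = np.array([text_similarity(event_description, c) for c in concept_names])
    return sig / (np.linalg.norm(sig) + 1e-8)

def rank_videos(signature, sin_features):
    """Rank test videos by how well their SIN concept-likelihood vectors
    match the event's visual concept signature (dot-product matching)."""
    sin = np.asarray(sin_features, dtype=float)   # (num_videos, num_concepts)
    scores = sin @ signature
    return np.argsort(-scores)                    # best-matching videos first
```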
KEYWORDS: Video, Cameras, Performance modeling, Surgery, Optical flow, Medical devices, Video surveillance, Model-based design, Thallium, Chemical elements
We model the sequence of human actions involved in operating an infusion pump using a Markovian conditional exponential model. Each video recorded by a camera is divided into video action units, where a video action unit spans from the start to the end of a single human action performed on the infusion pump. For each video action unit we compute MoSIFT features, which combine the spatial and temporal dimensions of the video, and vector quantize them with K-means clustering to form video codebook elements. We estimate the parameters of the conditional exponential model from a training set under the maximum entropy criterion, using the video codebook elements as the constraint features, and likewise estimate the parameters of the Markovian conditional exponential model from the training set. The Markovian model has six states corresponding to the six classes of infusion pump operation, and its optimal state sequence, which corresponds to the class label sequence, is found with the Viterbi algorithm. The infusion pump operation is recorded from four video cameras. We report classification results for the six classes of infusion pump operation on all four cameras using both the conditional exponential model and the Markovian conditional exponential model; the Markovian conditional exponential model achieves better classification performance than the plain conditional exponential model.
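The decoding step can be illustrated with a generic Viterbi sketch over per-unit class scores; the log-emission scores, transition matrix, and prior below are placeholders, not the estimated model parameters.

```python
import numpy as np

def viterbi(log_emission, log_transition, log_prior):
    """Most likely state (class) sequence over the video action units.
    log_emission: (T, 6) per-unit log-scores from the conditional exponential model.
    log_transition: (6, 6) log-probabilities of state transitions.
    log_prior: (6,) log-probabilities of the initial state."""
    T, S = log_emission.shape
    delta = log_prior + log_emission[0]
    backptr = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_transition       # (prev state, next state)
        backptr[t] = np.argmax(scores, axis=0)
        delta = scores[backptr[t], np.arange(S)] + log_emission[t]
    # Trace the best path backwards to recover the class label sequence.
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]
```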
As video data from different domains (e.g., news, documentaries, entertainment) have distinctive data distributions, cross-domain video concept detection becomes an important task, in which labeled data from one domain are reused to benefit the learning task in another domain that has insufficient labeled data. In this paper, we approach this problem with a cross-domain active learning method that iteratively queries labels of the most informative samples in the target domain. Traditional active learning assumes that the training data (source domain) and test data (target domain) come from the same distribution; it may fail when the two domains have different distributions, because samples queried according to a base learner initially trained on the source domain may no longer be informative for the target domain. We use a Gaussian random field model as the base learner, which has the advantage of exploring the distributions of both domains, and adopt uncertainty sampling as the query strategy. Additionally, we present an instance-weighting trick to accelerate the adaptation of the base learner, and develop an efficient model-updating method that significantly speeds up the active learning process. Experimental results on TRECVID collections highlight the effectiveness of the proposed method.
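The base learner can be sketched with the standard harmonic-function solution of a Gaussian random field on a similarity graph, followed by uncertainty sampling for a binary concept; this omits the paper's instance weighting and fast model updating, and the graph weights `W` are assumed given.

```python
import numpy as np

def harmonic_predictions(W, labeled_idx, y_labeled, unlabeled_idx):
    """Gaussian random field / harmonic-function solution on a similarity graph W.
    Returns soft label scores in [0, 1] for the unlabeled (target-domain) nodes."""
    D = np.diag(W.sum(axis=1))
    L = D - W                                        # graph Laplacian
    L_uu = L[np.ix_(unlabeled_idx, unlabeled_idx)]
    L_ul = L[np.ix_(unlabeled_idx, labeled_idx)]
    return np.linalg.solve(L_uu, -L_ul @ np.asarray(y_labeled, dtype=float))

def query_most_uncertain(f_unlabeled):
    """Uncertainty sampling: query the sample whose soft score is closest
    to the 0.5 decision boundary."""
    return int(np.argmin(np.abs(f_unlabeled - 0.5)))
```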
We describe a robust new approach to extracting semantic concept information based on explicitly encoding static image appearance features together with motion information. For high-level semantic concept detection in broadcast video, we trained multi-modality classifiers that combine traditional static image features with a new motion feature analysis method (MoSIFT). The experimental results show that the combined features perform well for detecting a variety of motion-related concepts and provide a large improvement over static image analysis features in video.
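The abstract does not detail how the static and motion modalities are combined; a common choice, shown here as a hedged sketch, is late fusion of per-modality classifier scores (the fixed weight and the use of scikit-learn SVMs are assumptions).

```python
import numpy as np
from sklearn.svm import SVC

def train_late_fusion(static_feats, motion_feats, labels, alpha=0.5):
    """Train one classifier per modality (static appearance vs. MoSIFT-based
    motion features) and fuse their probability scores with weight alpha."""
    clf_static = SVC(probability=True).fit(static_feats, labels)
    clf_motion = SVC(probability=True).fit(motion_feats, labels)

    def predict_scores(static_x, motion_x):
        p_s = clf_static.predict_proba(static_x)[:, 1]
        p_m = clf_motion.predict_proba(motion_x)[:, 1]
        return alpha * p_s + (1 - alpha) * p_m   # fused concept-detection score

    return predict_scores
```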
We present a set of experiments with a video OCR system (VOCR) tailored for video information retrieval and establish its importance in multimedia search in general and for some specific queries in particular. The system, inspired by existing work on text detection and recognition in images, has been developed using techniques involving detailed analysis of video frames to produce candidate text regions. The text regions are then binarized and passed to a commercial OCR engine, yielding ASCII text that is finally used to build search indexes. The system is evaluated on TRECVID data. We compare the system's performance from an information retrieval perspective with another VOCR developed using multi-frame integration and empirically demonstrate that deep analysis of individual video frames results in better video retrieval. We also evaluate the effect of various textual sources on multimedia retrieval by combining the VOCR outputs with automatic speech recognition (ASR) transcripts. For general search queries, the VOCR system coupled with ASR sources outperforms the other system by a large margin. For search queries that involve named entities, especially people's names, the VOCR system even outperforms speech transcripts, demonstrating that source selection for particular query types is essential.
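The source-combination step can be made concrete with a minimal sketch that scores each video against the query using both its VOCR text and its ASR transcript and then fuses the scores; the simple term-frequency scorer and the equal weighting are assumptions standing in for the real retrieval model.

```python
from collections import Counter

def tf_score(query_tokens, doc_tokens):
    """Very simple term-frequency match score (placeholder for a real
    retrieval model such as BM25)."""
    tf = Counter(doc_tokens)
    return sum(tf[t] for t in query_tokens)

def fused_retrieval(query, vocr_index, asr_index, w_vocr=0.5):
    """Rank videos by a weighted combination of VOCR-text and ASR-transcript
    match scores; the fusion weight is an assumption."""
    q = query.lower().split()
    results = []
    for video_id in vocr_index.keys() | asr_index.keys():
        s_vocr = tf_score(q, vocr_index.get(video_id, "").lower().split())
        s_asr = tf_score(q, asr_index.get(video_id, "").lower().split())
        results.append((video_id, w_vocr * s_vocr + (1 - w_vocr) * s_asr))
    return sorted(results, key=lambda kv: kv[1], reverse=True)
```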
In this paper, we propose a discriminative approach for the retrieval of video shots characterized by a sequential structure. The task of retrieving shots similar in content to a few positive example shots is close to a binary classification problem, and hence can be solved with a discriminative learning approach. At the same time, for a content-based retrieval task, the twin characteristics of rare positive examples and a sequential structure in those examples make a learning approach based on a generative model such as an HMM attractive. To exploit the strengths of both discriminative and generative models, we derive Fisher and Modified score kernels for a continuous HMM and incorporate them into an SVM classification framework. The training-set video shots are used to learn the SVM classifier, and a test video shot is ranked by its proximity to the positive side of the hyperplane. We evaluate the derived kernels by retrieving video shots of airplane takeoff, and find that retrieval performance with the derived kernels is much better than with linear and RBF kernels.
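The Fisher kernel construction can be sketched generically: map each shot's feature sequence to the gradient of the HMM log-likelihood with respect to the model parameters (the Fisher score) and take inner products of those scores. The finite-difference gradient below is only an illustrative stand-in for the closed-form derivatives used in the paper.

```python
import numpy as np

def fisher_score(log_likelihood, theta, x, eps=1e-5):
    """Fisher score U_x = d/d(theta) log P(x | theta), approximated here by
    central finite differences over the flattened HMM parameters."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (log_likelihood(x, theta + step) - log_likelihood(x, theta - step)) / (2 * eps)
    return grad

def fisher_kernel(log_likelihood, theta, x1, x2):
    """Fisher kernel with the common identity approximation of the Fisher
    information matrix: K(x1, x2) = U_x1 . U_x2, usable inside an SVM."""
    return float(fisher_score(log_likelihood, theta, x1) @ fisher_score(log_likelihood, theta, x2))
```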
KEYWORDS: Video, Video compression, Bridges, Cameras, Image segmentation, Video processing, RGB color model, Computer programming, 3D video compression, Visualization
As personal wearable devices become more powerful and ubiquitous, soon everyone will be able to continuously record video of everyday life. Such archives of continuous recordings need to be segmented into manageable units so that they can be efficiently browsed and indexed by video retrieval systems. Many researchers approach the problem with two-pass methods: segmenting the continuous recordings into chunks, followed by clustering the chunks. In this paper we propose a novel one-pass algorithm that accomplishes both tasks at the same time by imposing time constraints on the K-means clustering algorithm. We evaluate the proposed algorithm on 62.5 hours of continuous recordings, and the experimental results show that the time-constrained clustering algorithm substantially outperforms the unconstrained version.
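One simple way to impose a time constraint on K-means, shown here as a sketch and not necessarily the paper's exact formulation, is to append a scaled timestamp to each frame's feature vector so that temporally distant frames are discouraged from sharing a cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

def time_constrained_kmeans(features, timestamps, k, time_weight=1.0):
    """Cluster frames into k segments while favoring temporal coherence by
    appending a weighted, normalized timestamp to each feature vector.
    `time_weight` trades off appearance similarity against temporal proximity
    (the weighting scheme is an assumption)."""
    t = np.asarray(timestamps, dtype=float).reshape(-1, 1)
    t = (t - t.min()) / (t.max() - t.min() + 1e-8)           # normalize to [0, 1]
    augmented = np.hstack([np.asarray(features, dtype=float), time_weight * t])
    return KMeans(n_clusters=k, n_init=10).fit_predict(augmented)
```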
TRECVID, an annual retrieval evaluation benchmark organized by NIST, encourages research in information retrieval from digital video. TRECVID benchmarking covers both interactive and manual searching by end users, as well as the benchmarking of supporting technologies including shot boundary detection, extraction of semantic features, and the automatic segmentation of TV news broadcasts. Evaluations done in the context of the TRECVID benchmarks show that, generally, speech transcripts and annotations provide the single most important clue for successful retrieval; automatically finding the relevant individual images, however, remains a tremendous and unsolved challenge. The evaluations repeatedly found that none of the multimedia analysis and retrieval techniques provide a significant benefit over retrieval using only textual information such as automatic speech recognition transcripts or closed captions. In interactive systems, we do find significant differences among the top systems, indicating that interfaces can make a huge difference for effective video/image search. For interactive tasks, efficient interfaces require few key clicks but display large numbers of images for visual inspection by the user. Text search generally finds the right context region in the video, but selecting specific relevant images requires good interfaces for easily browsing the storyboard pictures. In general, TRECVID has motivated the video retrieval community to be honest about what we do not know how to do well (sometimes through painful failures), and has focused our efforts on the actual task of video retrieval, as opposed to flashy demos based on technological capabilities.
Classifying the identities of people appearing in broadcast news video as anchor, reporter, or news subject is an important topic in high-level video analysis that remains a missing piece in existing research. Given the visual resemblance among different types of people, this work explores multi-modal features derived from a variety of evidence, including speech identity, transcript clues, temporal video structure, named entities, and face information. A Support Vector Machine (SVM) model is trained on manually classified people to combine the multitude of features and predict the types of people giving monologue-style speeches in news videos. Experiments conducted on ABC World News Tonight video demonstrate that this approach can achieve over 93% accuracy in classifying person types. The contributions of different categories of features are compared, showing that relatively understudied features such as speech identity and video temporal structure are very effective for this task.
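As a hedged sketch of the feature-combination step (the paper's exact features and SVM settings are not shown here), the per-modality feature vectors can be concatenated and fed to an SVM trained on the manually classified examples.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def train_person_type_classifier(modality_features, labels):
    """Concatenate per-modality feature matrices (speech identity, transcript
    clues, temporal structure, named entities, face cues -- placeholders here)
    and train an SVM to predict anchor / reporter / news subject."""
    X = np.hstack([np.asarray(f, dtype=float) for f in modality_features])
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    return clf.fit(X, labels)
```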
The difficulty with information retrieval for OCR documents lies in the fact that OCR documents contain a significant number of erroneous words, while most information retrieval techniques rely heavily on word matching between documents and queries. In this paper, we propose a general content-based correction model that can work on top of an existing OCR correction tool to "boost" retrieval performance. The basic idea of this correction model is to exploit the whole content of a document to supplement whatever useful information an existing OCR correction tool provides for word corrections. Instead of making an explicit correction decision for each erroneous word, as is typically done in traditional approaches, we account for the uncertainty in such correction decisions and compute an estimate of the original "uncorrupted" document language model accordingly. The document language model can then be used for retrieval with a language-modeling retrieval approach. Evaluation on the TREC standard test collections indicates that our method significantly improves retrieval performance compared with simple word correction approaches such as using only the top-ranked correction.
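The underlying idea can be sketched as follows (a minimal illustration, not the paper's estimator): each OCR token spreads probability mass over its candidate corrections according to the correction tool's confidence, and the resulting expected counts define the "uncorrupted" document language model.

```python
from collections import defaultdict

def expected_document_language_model(ocr_tokens, candidate_corrections):
    """Estimate P(w | d) for the uncorrupted document without committing to a
    single correction per token. `candidate_corrections[token]` maps an OCR
    token to {candidate_word: probability it is the true word}, assumed to come
    from an existing OCR correction tool."""
    expected_counts = defaultdict(float)
    for token in ocr_tokens:
        candidates = candidate_corrections.get(token, {token: 1.0})
        for word, prob in candidates.items():
            expected_counts[word] += prob          # uncertainty-weighted count
    total = sum(expected_counts.values())
    return {w: c / total for w, c in expected_counts.items()}
```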
Video contains multiple types of audio and visual information, which are difficult to extract, combine, or trade off in general video information retrieval. This paper provides an evaluation of the effects of different types of information used for retrieval from a video collection. A number of different sources of information are present in most typical broadcast video collections and can be exploited for information retrieval. We discuss the contributions of automatically recognized speech transcripts, image similarity matching, face detection, and video OCR in the context of experiments performed as part of the 2001 TREC Video Retrieval Track evaluation conducted by the National Institute of Standards and Technology. For the queries used in this evaluation, image matching and video OCR proved to be the deciding aspects of video information retrieval.