KEYWORDS: Cameras, Video, Video surveillance, Data mining, Detection and tracking algorithms, Surveillance, Sensors, Imaging systems, Radiofrequency ablation, Algorithm development
The proposed system focuses on the detection of three events in airport videos: a person running, a person putting down an object, and a person pointing with his/her hand. The system was part of the NIST TRECVid 2010 campaign; the training dataset consists of 100 hours of video from Gatwick airport captured by five different cameras. For the detection of a person running, a non-parametric approach was adopted in which statistics about tracked object velocities were accumulated over a long period of time using a Gaussian kernel. Outliers were then detected with the help of a Student's t-type test taking into account the local statistics and the number of observations. For the detection of "object put" events, we follow a dual background segmentation approach where the difference in response between a short-term and a long-term background model (Mixture of Gaussians) triggers alerts. False alerts are excluded based on a simple model of the camera geometry that rejects objects that are too large or too small given their positions in the image. The detection of pointing gesture events is based on the grouping of significant spatio-temporal corners (Harris) into 3x3x3 cells, called compound features, as recently proposed by Gilbert et al. [10]. A hierarchical codebook is then derived from the training set by a data mining algorithm that looks for frequent items (called transactions). The algorithm was modified to deal with the large number of potential transactions (several millions) during the training step.
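The dual-background rule for "object put" detection can be sketched with OpenCV's Mixture-of-Gaussians subtractor; the file name, learning rates and area threshold below are illustrative assumptions, not the system's actual settings:

```python
# Sketch: dual-background "object put" detection with two MoG models
# (short-term = fast learning rate, long-term = slow learning rate).
import cv2
import numpy as np

cap = cv2.VideoCapture("airport.avi")            # hypothetical input file
short_bg = cv2.createBackgroundSubtractorMOG2(history=100)
long_bg = cv2.createBackgroundSubtractorMOG2(history=5000)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # The fast model absorbs a dropped object quickly; the slow model keeps
    # flagging it, so (long AND NOT short) isolates static new objects.
    short_fg = short_bg.apply(frame, learningRate=0.05)
    long_fg = long_bg.apply(frame, learningRate=0.001)
    static_new = cv2.bitwise_and(long_fg, cv2.bitwise_not(short_fg))
    static_new = cv2.morphologyEx(static_new, cv2.MORPH_OPEN,
                                  np.ones((5, 5), np.uint8))
    if cv2.countNonZero(static_new) > 200:       # hypothetical area threshold
        print("possible 'object put' event")
```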
This paper reports on a new technique for unconstrained license plate detection in a surveillance context. The
proposed algorithm quickly finds license plates by performing the following steps. The image is first preprocessed
to extract the edges; opening with linear structuring elements ensures that plate sides are enhanced.
Multiple scans using the Hausdorff distance are made through the vertical edge map with binary templates representing a pair of vertical lines (with varying gap to account for unknown plate size); these scans efficiently pinpoint areas in the image where plates may be located. Inside those areas, the Hausdorff distance is used again, this time over the gradient image and with a family of templates corresponding to rectangles that have been subjected to geometric transformations (to account for perspective effects). The end result is a set of plate location candidates, each associated with a confidence level that is a function of the quality of the match between the image and the template. An additional criterion based on the symmetry of plate shapes also supplies complementary information about each hypothesis, allowing many bad candidates to be rejected. Examples are given to show the performance of the proposed method.
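The template scanning step can be illustrated with SciPy's directed Hausdorff distance; the template geometry, gaps, window stride and threshold below are hypothetical, not the paper's exact values:

```python
# Sketch of Hausdorff-based template scanning: slide a two-vertical-lines
# template over the vertical edge map and score each window by the directed
# Hausdorff distance from template points to edge points.
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def make_template(height, gap):
    """Point set for two vertical lines 'gap' pixels apart (plate sides)."""
    ys = np.arange(height)
    left = np.stack([ys, np.zeros_like(ys)], axis=1)
    right = np.stack([ys, np.full_like(ys, gap)], axis=1)
    return np.vstack([left, right])

def scan(edge_map, height=20, gaps=(40, 60, 80), step=8):
    candidates = []
    for gap in gaps:                         # vary gap: unknown plate size
        tpl = make_template(height, gap)
        for y in range(0, edge_map.shape[0] - height, step):
            for x in range(0, edge_map.shape[1] - gap - 1, step):
                win = edge_map[y:y + height, x:x + gap + 1]
                pts = np.argwhere(win > 0)
                if len(pts) == 0:
                    continue
                # Directed Hausdorff: worst-case distance from any template
                # point to its nearest edge point in the window.
                d = directed_hausdorff(tpl, pts)[0]
                if d < 3.0:                  # hypothetical match threshold
                    candidates.append((x, y, gap, d))
    return candidates
```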
In video surveillance, automatic methods for scene understanding and activity modeling can exploit the high redundancy
of object trajectories observed over a long period of time. The goal of scene understanding is to generate a semantic
model of the scene describing the patterns of normal activities. We propose to boost the performance of a real-time object tracker, in terms of object classification, based on the accumulation of statistics over time. Based on the object shape, the tracker performs an initial three-class object classification (Vehicle, Pedestrian and Other). This initial labeling is usually very noisy because of object occlusions/merging and the possible presence of shadows. The proposed scene activity modeling approach is derived from the Makris and Ellis algorithm, in which the scene is described in terms of clusters of similar trajectories (called routes). The original envelope-based model is replaced by a simpler statistical model around each route node. The resulting scene activity model is then used to improve object classification based on the statistics observed within the node population of each route. Finally, Dempster-Shafer theory is used to fuse multiple evidence sources and compute an improved object classification map. In addition, we investigate the automatic detection of problematic image areas that are the source of poor-quality trajectories (object reflections in buildings, trees, flags, etc.). The algorithm was extensively tested using a live camera in an urban environment.
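The evidence fusion step can be sketched with a small implementation of Dempster's rule of combination over the frame {Vehicle, Pedestrian, Other}; the mass values below are illustrative, not the tracker's actual outputs:

```python
# Minimal sketch of Dempster's rule of combination for two mass functions.
from itertools import product

def combine(m1, m2):
    """Combine two mass functions (dicts mapping frozensets to masses)."""
    combined, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb           # mass assigned to the empty set
    # Normalize by 1 - K (total conflict), per Dempster's rule.
    return {s: w / (1.0 - conflict) for s, w in combined.items()}

V, P = frozenset("V"), frozenset("P")
ALL = frozenset("VPO")                    # full frame = total ignorance
tracker = {V: 0.5, P: 0.2, ALL: 0.3}      # noisy shape-based labeling
route = {V: 0.7, ALL: 0.3}                # route-node population statistics
print(combine(tracker, route))            # fused belief, dominated by V
```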
This study reports on the detection of non-natural structures in outdoor natural scenes. In particular, we present a new approach based on the ridgelet transform for the segmentation of man-made objects in landscape scenes. Multiscale directional moments of ridgelet coefficients are used as features, along with a principal component analysis (PCA) followed by a linear discriminant analysis (LDA), kernel-based LDA (KLDA), or support vector classifier (SVC). The statistical learning is done on about 3,000 image patches representing natural and artificial content. Performance is measured in terms of image patch type classification (natural versus non-natural) and man-made object segmentation on two different image test sets. Results using ridgelets are compared to Gabor features. Altogether, we compare the performance of six feature/classifier combinations: ridgelets+LDA, ridgelets+KLDA, ridgelets+SVC, Gabor+LDA, Gabor+KLDA, and Gabor+SVC, under various external parameter values. Results show that, most of the time, the combinations with ridgelets provide comparable or better performance.
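A minimal sketch of the feature/classifier combinations using scikit-learn (KLDA is omitted since scikit-learn has no direct equivalent, and random vectors stand in for the ridgelet directional moments):

```python
# Sketch: PCA followed by LDA or SVC over patch features.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 64))     # stand-in for ridgelet/Gabor features
y = rng.integers(0, 2, size=3000)   # 0 = natural, 1 = man-made

for name, clf in [("PCA+LDA", LinearDiscriminantAnalysis()),
                  ("PCA+SVC", SVC(kernel="rbf"))]:
    pipe = make_pipeline(PCA(n_components=20), clf)
    score = cross_val_score(pipe, X, y, cv=3).mean()
    print(name, round(score, 3))
```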
Producing off-line captions for deaf and hearing-impaired people is a labor-intensive task that can require up to 18 hours of production per hour of film. Captions are placed manually close to the region of interest but must avoid masking human faces, text or any moving objects that might be relevant to the story flow. Our goal is to use image processing techniques to shorten the off-line caption production process by automatically placing the captions on the proper consecutive frames. We implemented a computer-assisted captioning software tool which integrates detection of faces, text and visual motion regions. Near-frontal faces are detected using a cascade of weak classifiers and tracked through a particle filter. Then, frames are scanned to perform text spotting and build a region map suitable for text recognition. Finally, motion mapping is based on the Lucas-Kanade optical flow algorithm and provides MPEG-7 motion descriptors. The combined detected items are then fed to a rule-based algorithm that determines the best caption placement for the related sequences of frames. This paper focuses on the rules defined to assist the human captioners and on the results of a user evaluation of this approach.
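The motion-mapping stage can be sketched with OpenCV's pyramidal Lucas-Kanade tracker; the input file, track count and motion threshold are placeholders, not the tool's actual settings:

```python
# Sketch: motion-region map from pyramidal Lucas-Kanade point tracks,
# marking areas the caption placer should avoid.
import cv2
import numpy as np

cap = cv2.VideoCapture("film.mp4")               # placeholder input
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=400,
                                  qualityLevel=0.01, minDistance=7)
    motion_map = np.zeros(gray.shape, np.uint8)
    if pts is not None:
        nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
        for p0, p1, st in zip(pts.reshape(-1, 2), nxt.reshape(-1, 2),
                              status.ravel()):
            # Mark neighborhoods of points that moved noticeably.
            if st and np.linalg.norm(p1 - p0) > 1.5:
                cv2.circle(motion_map, (int(p1[0]), int(p1[1])), 15, 255, -1)
    prev_gray = gray
```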
The paper reports on the development of a software module that allows autonomous object detection, recognition and tracking in an outdoor urban environment. The purpose of the project was to endow a commercial PTZ camera with object tracking and recognition capabilities to automate some surveillance tasks. The module can discriminate between various moving objects, identify the presence of pedestrians or vehicles, track them, and zoom in on them, in near real-time.
The paper gives an overview of the module characteristics and its operational uses within the commercial system.
Deaf and hearing-impaired people capture information in video through visual content and captions. These activities require different visual attention strategies and, up to now, little is known about how caption readers balance these two visual attention demands. Understanding these strategies could suggest more efficient ways of producing captions. Eye tracking and attention overload detection are used to study these strategies. Eye tracking is monitored using a pupil-center corneal-reflection apparatus. Afterward, gaze fixations are analyzed for each region of interest, such as the caption area, high-motion areas and face locations. These data are also used to identify the scanpaths. The collected data are used to establish specifications for a caption adaptation approach based on the location of visual action and the presence of character faces. This approach is implemented in a computer-assisted captioning software tool which uses a face detector and a motion detection algorithm based on the Lucas-Kanade optical flow. The different scanpaths obtained among the subjects provide us with alternatives for conflicting caption positioning. This implementation is now undergoing a user evaluation with hearing-impaired participants to validate the efficiency of our approach.
A face recognition module has been developed for an intelligent multi-camera video surveillance system. The module
can recognize a pedestrian's face in terms of six basic emotions and the neutral state. Face and facial feature detection (eyes, nasal root, nose and mouth) is first performed using cascades of boosted classifiers. These features are used to normalize the pose and dimensions of the face image. Gabor filters are then sampled on a regular grid covering the face
image to build a facial feature vector that feeds a nearest neighbor classifier with a cosine distance similarity measure
for facial expression interpretation and face model construction. A graphical user interface allows the user to adjust the
module parameters.
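A minimal sketch of the Gabor-grid feature vector and cosine nearest-neighbor matching; the filter bank parameters, face size and grid spacing are illustrative assumptions, not the module's actual settings:

```python
# Sketch: Gabor responses sampled on a regular grid over a normalized face,
# classified by nearest neighbor with cosine similarity.
import cv2
import numpy as np

def gabor_bank():
    kernels = []
    for theta in np.arange(0, np.pi, np.pi / 4):   # 4 orientations
        for lam in (4, 8):                         # 2 wavelengths
            kernels.append(cv2.getGaborKernel((21, 21), sigma=4.0,
                                              theta=theta, lambd=lam,
                                              gamma=0.5))
    return kernels

def face_vector(face64):            # face64: 64x64 normalized face image
    feats = []
    for k in gabor_bank():
        resp = cv2.filter2D(face64.astype(np.float32), -1, k)
        feats.append(resp[::8, ::8].ravel())       # sample on an 8x8 grid
    v = np.concatenate(feats)
    return v / (np.linalg.norm(v) + 1e-9)          # unit-normalize

def classify(face64, models):       # models: list of (label, unit vector)
    v = face_vector(face64)
    # Cosine similarity reduces to a dot product of unit vectors.
    return max(models, key=lambda m: float(v @ m[1]))[0]
```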
This paper presents a method for spotting key-text in videos, based on a cascade of classifiers trained with Adaboost. The video is first reduced to a set of key-frames. Each key-frame is then analyzed for its text content. Text spotting is performed by scanning the image with a variable-size window (to account for scale) within which simple features (mean/variance of grayscale values and x/y derivatives) are extracted in various sub-areas. Training builds classifiers using the most discriminant spatial combinations of features for text detection. The text-spotting module outputs a decision map of the size of the input key-frame showing regions of interest that may contain text suitable for recognition by an OCR system. Performance is measured against a dataset of 147 key-frames extracted from 22 documentary films of the National Film Board (NFB) of Canada. A detection rate of 97% is obtained with relatively few false alarms.
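The window features and boosted classifier can be sketched as follows, with scikit-learn's AdaBoost standing in for the paper's cascade and random windows standing in for training data:

```python
# Sketch: mean/variance of gray values and x/y derivatives over window
# sub-areas, feeding an AdaBoost classifier for text/non-text decisions.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def window_features(win):
    """win: 2D grayscale window; returns per-quadrant statistics."""
    gy, gx = np.gradient(win.astype(np.float64))
    feats = []
    h, w = win.shape
    for ch in (win, gx, gy):
        for qy in (slice(0, h // 2), slice(h // 2, h)):
            for qx in (slice(0, w // 2), slice(w // 2, w)):
                sub = ch[qy, qx]
                feats += [sub.mean(), sub.var()]
    return np.array(feats)

rng = np.random.default_rng(0)
X = np.stack([window_features(rng.integers(0, 255, (16, 32)))
              for _ in range(200)])       # stand-in training windows
y = rng.integers(0, 2, 200)               # 1 = text, 0 = non-text
clf = AdaBoostClassifier(n_estimators=50).fit(X, y)
```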
We present the research and development status of two MPEG-7 indexing/search systems under development at the Computer Research Institute of Montreal (CRIM). The first (called ERIC-7) targets content-based encoding of still images and is mainly designed to experiment with the various aspects of the visual MPEG-7/XML schema with the help of analysis and exploration tools. The interface allows navigating graphically among the various descriptors in the XML files and through interactive UML graphics. The second (called MADIS) aims at providing a practical audio-visual MPEG-7 indexing/retrieval tool, within the framework of a light architecture. MADIS is designed to (1) be fully MPEG-7 compliant, (2) address both encoding and search, (3) combine audio, speech and visual modalities and (4) have search capability on the Internet. MADIS currently targets content-based indexing of documentary films.
We explore the feasibility of reconstructing some three-dimensional (3D) surface information of the human fundus present in a sequence of fluorescein angiograms. The angiograms are taken during the same examination with an uncalibrated camera. The camera is still and we assume that the natural head/eye micro movement is large enough to create the necessary view change for the stereo effect. We test different approaches to calculate the fundamental matrix and the disparity map. A careful medical analysis of the reconstructed 3D information indicates that it represents the 3D distribution of the fluorescein within the eye fundus rather than the 3D retina surface itself because the latter is mainly a translucent medium. Qualitative evaluation is presented and compared with the 3D information perceived with a stereoscope. This preliminary study indicates that our approach could provide a simple way to extract 3D fluorescein information without the use of a complex stereo image acquisition setup.
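A generic sketch of fundamental matrix estimation between two uncalibrated frames with OpenCV (SIFT matches plus RANSAC); the file names are placeholders, and this is only one possible instance of the approaches the study compares:

```python
# Sketch: estimate the fundamental matrix between two angiogram frames.
import cv2
import numpy as np

img1 = cv2.imread("angio_1.png", cv2.IMREAD_GRAYSCALE)  # placeholder files
img2 = cv2.imread("angio_2.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
k1, d1 = sift.detectAndCompute(img1, None)
k2, d2 = sift.detectAndCompute(img2, None)
matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(d1, d2)

p1 = np.float32([k1[m.queryIdx].pt for m in matches])
p2 = np.float32([k2[m.trainIdx].pt for m in matches])
# Micro eye/head movement gives a small baseline; RANSAC rejects outliers.
F, inliers = cv2.findFundamentalMat(p1, p2, cv2.FM_RANSAC, 1.0, 0.99)
print("fundamental matrix:\n", F)
```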
In recent years, new harmonic analysis tools providing sparse representations in high-dimensional spaces have been proposed. In particular, ridgelet and curvelet bases are similar to the sparse components of naturally occurring image data derived empirically by computational neuroscience researchers. Ridgelets take the form of basis elements which exhibit very high directional sensitivity and are highly anisotropic. The ridgelet transform has been shown to provide a sparse representation for smooth objects with straight edges. Independently, for the purpose of scene description, the shape of the Fourier energy spectrum has been used as an efficient way to provide a “holistic” description of the scene picture and its semantic category. Similarly, we focus on a simple binary semantic classification (artificial vs. natural) based on various ridgelet features. The learning stage is performed on a large image database using different state-of-the-art linear discriminant techniques. Classification results are compared with those resulting from the Gabor representation. Additionally, the ridgelet representation provides us with a way to accurately reconstruct the original signal. Using this synthesis step, we filter the ridgelet coefficients with the discriminant vector. The resulting image identifies the elements within the scene contributing to the different perceptual dimensions.
This paper reports on the development status of a Multimedia Asset Management (MAM) test-bed for content-based indexing and retrieval of audio-visual documents within the MPEG-7 standard. The project, called "MPEG-7 Audio-Visual Document Indexing System" (MADIS), specifically targets the indexing and retrieval of video shots and key frames from documentary film archives, based on audio-visual content like face recognition, motion activity, speech recognition and semantic clustering. The MPEG-7/XML encoding of the film database is done off-line. The description decomposition is based on a temporal decomposition into visual segments (shots), key frames and audio/speech sub-segments. The visible outcome will be a web site that allows video retrieval using a proprietary XQuery-based search engine and is accessible to members of the Canadian National Film Board (NFB) Cineroute site. For example, end-users will be able to ask for movie shots in the database that were produced in a specific year, that contain the face of a specific actor saying a specific word, and in which there is no motion activity. Video streaming is performed over the high-bandwidth CA*net network deployed by CANARIE, a public Canadian Internet development organization.
We have performed a study to identify optimal texture parameters for woodland segmentation in a highly non-homogeneous urban area from a temperate-zone panchromatic IKONOS image. Texture images are produced with the sum- and difference-histogram method, which depends on two parameters: window size f and displacement step p. The four texture features yielding the best discrimination between classes are the mean, contrast, correlation and standard deviation. The f-p combinations 17-1, 17-2, 35-1 and 35-2 give the best performance, with an average classification rate of 90%.
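A sketch of Unser-style sum- and difference-histogram features (mean, contrast, correlation, standard deviation) for one window of size f and one horizontal displacement step p; the random patch is a stand-in for IKONOS data:

```python
# Sketch: sum/difference histograms and the four texture features.
import numpy as np

def sd_features(win, p):
    """win: 2D grayscale window (f x f); p: horizontal displacement step."""
    a = win[:, :-p].astype(np.int32)
    b = win[:, p:].astype(np.int32)
    s, d = (a + b).ravel(), (a - b).ravel()
    Ps = np.bincount(s, minlength=511) / s.size          # sum histogram
    Pd = np.bincount(d + 255, minlength=511) / d.size    # difference histogram
    i = np.arange(Ps.size)
    j = np.arange(Pd.size) - 255
    mean = 0.5 * np.sum(i * Ps)
    contrast = np.sum(j ** 2 * Pd)
    var = 0.5 * (np.sum((i - 2 * mean) ** 2 * Ps) + contrast)
    corr = 0.5 * (np.sum((i - 2 * mean) ** 2 * Ps) - contrast)
    return mean, contrast, corr, np.sqrt(var)

rng = np.random.default_rng(0)
patch = rng.integers(0, 256, (17, 17))    # f = 17
print(sd_features(patch, p=1))            # p = 1
```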
We present an overview of the design and testing of an image processing procedure for detecting all important anatomical structures in color fundus images. These structures are the optic disk, the macula and the retinal network. The algorithm proceeds through five main steps: (1) automatic mask generation using pixel value statistics and color thresholds, (2) visual image quality assessment using histogram matching and Canny edge distribution modeling, (3) optic disk localization using pyramidal decomposition, Hausdorff-based template matching and confidence assignment, (4) macula localization using pyramidal decomposition and (5) vessel network tracking using recursive dual edge tracking and connectivity recovering. The procedure has been tested on a database of about 40 color fundus images acquired from a digital non-mydriatic fundus camera. The database is composed of images of various types (macula- and optic disk-centered) and of various visual quality (with or without abnormal bright or dark regions, blur, etc.).
The aim of this work is to explore the applicability of a relatively new snake formulation, called geometric snakes, for robust contour segmentation in radar images. In particular, we are looking for clear experimental indicators regarding the usefulness of such a tool for radar imagery. In this work, we mainly concentrate on various contour segmentation problems in airborne and spaceborne SAR images (swath and inverse modes). As an example, we study the segmentation of coastlines and ship targets. We observe that the dynamic and adaptive properties of geometric contours are better suited to determining the morphological properties of the contours. For high-resolution radar images of ships, the underlying motivation is that these properties could help provide robust extraction of ship structures for automatic ship classification.
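A sketch of a geometric (geodesic) active contour on a SAR amplitude image, using scikit-image as a stand-in implementation; the smoothing, balloon force and box initialization are illustrative choices:

```python
# Sketch: geometric snake (geodesic active contour) for coastline/ship
# contour segmentation on a despeckled SAR amplitude image.
import numpy as np
from skimage.filters import gaussian
from skimage.segmentation import (inverse_gaussian_gradient,
                                  morphological_geodesic_active_contour)

def segment_contours(sar, iterations=200):
    """sar: 2D float array (amplitude). Returns a binary segmentation."""
    smooth = gaussian(sar, sigma=2)           # tame speckle a little
    # Edge-stopping function: small near strong gradients (contour edges).
    gimage = inverse_gaussian_gradient(smooth)
    init = np.zeros(sar.shape, dtype=np.int8)
    init[10:-10, 10:-10] = 1                  # initial level set: a box
    # The contour evolves geometrically and can split/merge, which suits
    # recovering the morphology of coastlines and ship outlines.
    return morphological_geodesic_active_contour(
        gimage, iterations, init_level_set=init, smoothing=2, balloon=-1)
```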
We investigate computer vision techniques for the stabilization of image sequences from a single image sensor. Image stabilization is required to improve the performance of human operators evaluating surveillance imagery in real-time. The input images may undergo non-trivial rotation and scale changes. Furthermore, for many operations such as airborne surveillance, perspective distortion induces an image transformation that is typically not handled well by classical registration techniques such as cross-correlation. We focus on the issue of rotation, scale and projective invariance for point feature detection and verification. Hypothesized point matches are often incorrect or poorly localized, so we investigate solutions incorporating robust estimators. Feature points are detected with the Harris-Stephens corner detector. We use the greylevel differential invariant (GDI) matching of Schmid and Mohr, which is invariant to rotation and scaling. Extensions to the basic GDI method are introduced that improve its performance. We verify the point correspondences under orthographic projection using the epipolar constraint, via M-estimators and least median of squares, on real-world and synthesized IR sequences.
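The detection and verification stages can be sketched with OpenCV: Harris-Stephens corners, then least-median-of-squares epipolar verification of hypothesized matches (the GDI matching itself is elided; parameter values are illustrative):

```python
# Sketch: Harris-Stephens corner detection plus robust epipolar
# verification of hypothesized point matches with LMedS.
import cv2
import numpy as np

def detect_corners(gray, max_pts=300):
    # goodFeaturesToTrack with useHarrisDetector=True applies the
    # Harris-Stephens cornerness measure.
    pts = cv2.goodFeaturesToTrack(gray, maxCorners=max_pts,
                                  qualityLevel=0.01, minDistance=5,
                                  useHarrisDetector=True, k=0.04)
    return pts.reshape(-1, 2) if pts is not None else np.empty((0, 2))

def verify_matches(p1, p2):
    """p1, p2: Nx2 arrays of hypothesized correspondences (e.g., from GDI)."""
    F, mask = cv2.findFundamentalMat(np.float32(p1), np.float32(p2),
                                     cv2.FM_LMEDS)
    inliers = mask.ravel().astype(bool)
    # Surviving matches satisfy the epipolar constraint robustly; the
    # rest are rejected as mismatched or poorly localized points.
    return p1[inliers], p2[inliers], F
```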
Area-based methods, such as Laplacian pyramid and Fourier transform-based phase matching, benefit from highlighting high spatial frequencies, which reduces sensitivity to the feature inconsistency problem in multisensor image registration. Feature extraction and matching methods are more powerful and versatile for processing poor-quality IR images. We implement multi-scale hierarchical edge detection and edge focusing, and introduce a new salience measure for the horizon, for multisensor image registration. The common features extracted from images of two modalities can still differ in detail. Therefore, transformation-space matching methods with the Hausdorff distance measure are more suitable than direct feature matching methods. We have introduced an image quadtree partition technique into the Hausdorff distance matching, which dramatically reduces the size of the search space. Image registration of real-world visible/IR images of battlefields is shown.
We give an overview of some R&D projects in SAR imagery at Lockheed Martin Canada. These projects are motivated by airborne surveillance applications such as the landmass and coastal surveillance missions of the Canadian CP-140 (Aurora) aircraft. The activities reviewed here are: (1) R&D support to the CP-140 Spotlight SAR upgrade, (2) a fast multiresolution prescreening filter for CFAR detection, (3) a comparison of traditional and wavelet-based speckle filters and (4) high-level ship classification in high-resolution SAR imagery.
An enhanced greylevel differential invariant matching scheme is applied to the stabilization of real-world, infrared image sequences that have large translation, rotation, scaling and viewpoint changes. Its performance is compared with that of Zhang's robust image matching method.
We report on a hierarchical design for extracting ship features and recognizing ships from SAR images, which will eventually feed a multisensor data fusion system for airborne surveillance. The target is segmented from the image background using directional thresholding and region merging processes. Ship end-points are then identified through ship centerline detection performed with a Hough transform. A ship length estimate is calculated assuming that the ship heading and/or the cross-range resolution are known. A high-level ship classification identifies whether the target belongs to the Line (mainly combatant military ships) or Merchant ship category. Category discrimination is based on the distribution of radar scatterers in 9 ship sections along the ship's range profile. A 3-layer neural network has been trained on simulated scatterer distributions, supervised by a rule-based expert system, to perform this task. The NN 'smooths out' the rules and the confidence levels on the category declaration. The Line ship type (Frigate, Destroyer, Cruiser, Battleship, Aircraft Carrier) is then estimated using a Bayes classifier based on the ship length. Classifier performances using simulated images are presented.
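The centerline/end-point step can be sketched with a Hough transform on the segmented target mask; the Hough threshold is an illustrative value:

```python
# Sketch: ship centerline via Hough transform, end-points via projection
# of mask pixels onto the centerline direction.
import cv2
import numpy as np

def ship_endpoints(mask):
    """mask: binary uint8 image of the segmented ship target."""
    lines = cv2.HoughLines(mask, rho=1, theta=np.pi / 180, threshold=30)
    if lines is None:
        return None
    rho, theta = lines[0][0]            # strongest line = centerline
    direction = np.array([-np.sin(theta), np.cos(theta)])  # along the line
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(np.float64)
    t = pts @ direction                 # project pixels onto the centerline
    p_min, p_max = pts[np.argmin(t)], pts[np.argmax(t)]
    # End-points; ship length estimate = their distance scaled by the
    # known cross-range resolution and heading.
    return p_min, p_max
```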
We present a comparative study between a complex Wavelet Coefficient Shrinkage (WCS) filter and several standard speckle filters that are widely used in the radar imaging community. The WCS filter is based on the use of symmetric Daubechies wavelets, which share the same properties as the real Daubechies wavelets but with an additional symmetry property. The filtering operation is an elliptical soft-thresholding procedure with respect to the principal axes of the 2D complex wavelet coefficient distributions. Both qualitative and quantitative results (signal to mean square error ratio, equivalent number of looks, edge-map figure of merit) are reported. Tests have been performed using simulated speckle noise as well as real radar images. It is found that the WCS filter performs as well as the standard filters for low-level noise and slightly outperforms them for higher-level noise.
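A simplified sketch of wavelet-domain soft thresholding with PyWavelets, using real Daubechies wavelets and a scalar universal threshold as stand-ins for the complex symmetric Daubechies basis and the elliptical rule:

```python
# Sketch: soft thresholding of wavelet detail coefficients for denoising.
import numpy as np
import pywt

def wcs_like_filter(img, wavelet="db4", levels=3):
    coeffs = pywt.wavedec2(img.astype(np.float64), wavelet, level=levels)
    # Universal threshold from the finest diagonal band (noise estimate).
    sigma = np.median(np.abs(coeffs[-1][2])) / 0.6745
    t = sigma * np.sqrt(2 * np.log(img.size))
    new_coeffs = [coeffs[0]]            # keep the approximation band
    for cH, cV, cD in coeffs[1:]:
        new_coeffs.append(tuple(pywt.threshold(c, t, mode="soft")
                                for c in (cH, cV, cD)))
    return pywt.waverec2(new_coeffs, wavelet)
```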
We report on an evaluation study of a ship classifier based on principal component analysis (PCA). A set of ship profiles is used to build a covariance matrix, which is diagonalized using the Karhunen-Loeve transform. A subset of the principal components corresponding to the highest eigenvalues is selected as the ship feature space. The recognition process consists of projecting a profile onto this eigen-subspace and performing a similarity measure. We have measured the recognition performance of the classifier using various sets of range-profile signatures of ship silhouette images and simulated synthetic aperture radar images of ships under various aspect angles. It is found that the PCA-based ship classifier offers good class discrimination when trained with a limited number of ship classes under an aspect angle range of 60 degrees about the ship side view. Additional tests are however necessary to validate the classifier on large data sets and real images.
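The eigen-subspace recognition step can be sketched as follows; the profiles, labels and component count are random stand-ins for the ship signature sets:

```python
# Sketch: Karhunen-Loeve (PCA) eigen-subspace projection and
# nearest-neighbor matching of ship range profiles.
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(size=(120, 64))      # 120 training profiles, 64 bins
labels = rng.integers(0, 5, 120)        # 5 hypothetical ship classes

mean = train.mean(axis=0)
# Diagonalize the covariance matrix (Karhunen-Loeve transform).
cov = np.cov(train - mean, rowvar=False)
w, V = np.linalg.eigh(cov)
basis = V[:, np.argsort(w)[::-1][:10]]  # keep the 10 largest eigenvalues

def classify(profile):
    q = (profile - mean) @ basis        # project onto the eigen-subspace
    ref = (train - mean) @ basis
    d = np.linalg.norm(ref - q, axis=1) # similarity = nearest neighbor
    return labels[np.argmin(d)]

print(classify(rng.normal(size=64)))
```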
This paper proposes using a backpropagation (BP) neural network for the classification of ship targets in airborne synthetic aperture radar (SAR) imagery. The ship targets consisted of 2 destroyers, 2 cruisers, 2 aircraft carriers, a frigate and a supply ship. A SAR image simulator was employed to generate a training set, a validation set and a test set for the BP classifier. The features required for classification were extracted from the SAR imagery using three different methods. The first method used a reduced-resolution version of the whole SAR image, obtained by simple averaging, as input to the BP classifier. The other two methods used the SAR image range profile, either before or after a local-statistics noise filtering algorithm for speckle reduction. Results on an extensive test set demonstrated the performance and computational advantages of applying the neural classification approach to targets in airborne SAR imagery. Improvements due to the use of multi-resolution features were also observed.
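A minimal sketch of a BP classifier using scikit-learn's MLP as a stand-in; the feature vectors are random placeholders for the extracted SAR inputs, and the layer sizes are illustrative:

```python
# Sketch: backpropagation-trained MLP for 8 ship targets.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(400, 32))    # e.g., range-profile features
y_train = rng.integers(0, 8, 400)       # 8 ship targets in the study

clf = MLPClassifier(hidden_layer_sizes=(32,), solver="sgd",
                    learning_rate_init=0.01, max_iter=500)
clf.fit(X_train, y_train)               # trained by backpropagation
print(clf.predict(rng.normal(size=(1, 32))))
```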
This paper describes a phased incremental integration approach for applying image analysis and data fusion technologies to provide automated intelligent target tracking and identification for airborne surveillance on board an Aurora Maritime Patrol Aircraft. The sensor suite of the Aurora consists of a radar, an identification friend or foe (IFF) system, an electronic support measures (ESM) system, a spotlight synthetic aperture radar (SSAR), a forward-looking infra-red (FLIR) sensor and a Link-11 tactical datalink system. Lockheed Martin Canada (LMCan) is developing a testbed that will be used to analyze and evaluate approaches for combining the data provided by the existing sensors, which were not initially designed to feed a fusion system. Three concurrent proof-of-concept research activities provide techniques, algorithms and methodology for three sequential phases of integration of this testbed. These activities are: (1) analysis of the fusion architecture (track/contact/hybrid) most appropriate for the type of data available, (2) extraction and fusion of simple features from the imaging data into the fusion system performing automatic target identification, and (3) development of a unique software architecture that permits the integration and independent evolution, enhancement and optimization of various decision aid capabilities, such as multi-sensor data fusion (MSDF), situation and threat assessment (STA) and resource management (RM).
We report on a study of a multiresolution speckle reduction method for airborne synthetic aperture radar (SAR) images. The SAR image is first subband-coded using complex symmetric Daubechies wavelets, followed by a noise estimate on the three high-pass bands. An elliptic wavelet coefficient thresholding rule is then applied, which preserves the global orientation of the complex wavelet coefficient distribution. Finally, a multiresolution synthesis (inverse wavelet transform) is done in a last step, preserving small dim objects. A speckle index is computed to quantify the speckle reduction performance. We compare our results with those obtained using median and geometrical (Crimmins) filters.
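A sketch of one common speckle index, the local standard-deviation-to-mean ratio averaged over the image (the paper's exact definition may differ; window size is illustrative):

```python
# Sketch: speckle index = mean of local std / local mean; lower values
# after filtering indicate stronger speckle reduction.
import numpy as np
from scipy.ndimage import uniform_filter

def speckle_index(img, size=7):
    f = img.astype(np.float64)
    m = uniform_filter(f, size)                    # local mean
    m2 = uniform_filter(f * f, size)               # local second moment
    local_std = np.sqrt(np.maximum(m2 - m * m, 0))
    return float(np.mean(local_std / (m + 1e-9)))
```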
It is shown that analyses based on Symmetric Daubechies Wavelets (SDW) lead to a multiresolution form of the Laplacian operator. This property, which is related to the complex values of the SDWs, opens the way to new methods for image enhancement applications. After a brief recall of the construction and main properties of the SDWs, we propose a representation of the sharpening operator at different scales and we discuss the 'importance of the phase' of the complex wavelet coefficients.