1. Introduction

Pedestrian detection is a fundamental challenge in computer vision due to great variation in appearance, changes in illumination, poor resolution, and partial occlusions. The general framework of pedestrian detection can be decomposed into three modules: (i) generate the region proposals that represent object hypotheses in a test image, (ii) classify the region proposals, and (iii) refine the region proposals to obtain accurate localization of pedestrians. In recent years, the Hough transform framework has attracted considerable attention for pedestrian detection.1–10 The applicability of the Hough transform framework can be attributed to its robustness against partial occlusions, as indicated in Refs. 1 and 3–5. Another attractive property of the Hough transform is its simplicity. The Hough transform framework for pedestrian detection includes three primary steps: (i) construct a visual codebook, (ii) cast probabilistic votes for the object center into a Hough image according to the codebook using voting elements of the test image, and (iii) search for maxima in the Hough image as object hypotheses. Although some Hough transform methods demonstrate the significance of the visual codebook and voting weights1,2,4 for detection performance, none use contextual information. Voting elements, which denote the image patches classified into object categories, cast probabilistic votes into a Hough image. However, an image patch contains only partial information about an object, and its appearance is highly variable. Thus, it is difficult for a classifier to disambiguate object patches from background patches at the local level, and detection performance can be reduced by noisy votes cast by background patches.
Fortunately, conditional random field (CRF) frameworks modeling context have achieved an impressive performance for semantic segmentation,11–15 image classification,16 saliency detection,17 and object detection.18 The CRF distribution can be formulated by a probabilistic graphical model, in which variables are interdependent rather than independent. Given an image, CRF inference is performed by a maximum a posteriori (MAP) or maximum posterior marginal criterion, and all patches can be classified into an object category or background simultaneously. In other words, the CRF model uses whole image information instead of local information to obtain all patch labels. In this paper, we build a CRF model that regards the locality-constrained linear coding (LLC)19 code of a local feature as a latent variable, which is more informative than the corresponding local feature. In addition, we apply a Gaussian kernel to neighboring features to measure the strength of pairwise energy in the CRF framework. In the training stage, we iteratively modulate the codebook and CRF model parameters by a max-margin approach with a maximum-likelihood criterion. Furthermore, to learn the spatial-occurrence distribution of the codebook, offset vectors of the local feature to its object center in a training image are assigned to matching codewords. In the detection stage, all image patches are classified into an object category or background simultaneously by CRF inference, and the patches classified into an object category are used as voting elements in the Hough transform. The voting element casts weighted votes into the Hough image according to its LLC coefficients on codewords, and the use of LLC enables us to reduce the reconstruction error for representing the voting element by a linear combination of codewords.20 This may result in more balanced probabilistic votes than uniform votes in the Hough image. 
Maxima are regarded as object hypotheses in the Hough image, in which all votes accumulate. The proposed method makes three main contributions:
We evaluated our method on the INRIA pedestrian, TUD Brussels, and Caltech pedestrian datasets. This work strikes a compromise among speed, accuracy, and simplicity. Experiments demonstrated the effectiveness of the proposed method compared with other Hough transform-based methods, benefiting from the contextual information in images and the weighted Hough voting strategy. The rest of the paper is structured as follows. We review literature on the Hough transform methods, encoding methods, and CRF in Sec. 2. We describe our method for pedestrian detection in Sec. 3. We evaluate the proposed method on several challenging datasets in Sec. 4, and we provide our conclusions in Sec. 5.

2. Related Work

In this section, we first discuss the Hough transform-based methods for pedestrian detection and then briefly describe encoding methods and CRF that are related to the proposed method.

2.1. Hough Transform Methods

There is extensive literature dedicated to pedestrian detection.21–39 Here, we review the methods based on the Hough transform framework1,2,4–6,8–10 that are most relevant to our work. In recent years, methods based on the Hough transform framework have driven progress in pedestrian detection. Most Hough transform methods focus on codebook learning, voting element generation, and hypothesis search. The advantage of the Hough transform methods is that they can detect pedestrians with low computational cost due to their simple structure9 and can also locate a partially occluded pedestrian in an image using a small set of local patches.1,3–5 The implicit shape model (ISM),1 which constructs a visual codebook by clustering local features in an unsupervised manner, has been widely extended by other Hough transform-based methods.
Gall and Lempitsky2 proposed the Hough forest to build decision trees in a supervised manner, where a set of leaves can be regarded as a discriminative codebook that produces probabilistic votes with better voting performance. Barinova et al.4 proposed an MAP inference method rather than nonmaximum suppression (NMS) to seek the maxima in the Hough image. Wang et al.5 proposed a structured Hough transform method that incorporates depth-dependent contexts into a codebook-based pedestrian detection model. Cabrera and López-Sastre6 proposed a boosted Hough forest, in which decision trees are trained in a stage-wise fashion to optimize a global loss function. Liu et al.9 proposed a pair Hough model (PHM) for detecting objects whose voting elements were extracted from interest points to handle the rotation of objects. In a study by Liu et al.,10 extremely randomized trees (ERTs) were constructed from features of soft-labeled training blobs, and a Hough image was accumulated by votes from features based on the soft-labeled ERTs. Unlike other Hough transform methods, the proposed method regards LLC codes as hidden variables in a unified CRF framework that exploits the contextual information between neighboring image patches, from which the visual codebook and CRF parameters are learned in a supervised manner.

2.2. Encoding Methods

Many approaches for encoding local features (image patches) have been proposed.19,20,40 Lazebnik et al.40 proposed spatial pyramid matching (SPM), which is a simple and computationally efficient extension of an orderless bag-of-features image representation. Yang et al.20 developed an extension of the SPM method called ScSPM for nonlinear codes. Wang et al.19 proposed LLC in place of the vector quantization (VQ) coding in traditional SPM, utilizing the locality constraint to project each local feature into its local coordinate system.
Moreover, dictionary learning plays a significant role in encoding.17,41 Bach et al.41 demonstrated that better results can be obtained when the dictionary is adapted to the specific task. Yang and Yang17 proposed a top-down saliency model that jointly learns a discriminative dictionary and a CRF to improve sparse coding (SC). However, codebooks optimized in these methods are utilized for image classification or saliency detection rather than Hough transform-based pedestrian detection. The LLC can represent local features by codewords with lower reconstruction error than VQ42 and SC.20 This property of LLC motivated us to utilize the code coefficients of a voting element as codeword weights to cast better balanced votes in the Hough image.

2.2.1. Locality-constrained linear coding

Feature encoding decomposes a local feature $x \in \mathbb{R}^{D}$ into a linear combination of codewords over the predefined codebook $B = [b_1, b_2, \ldots, b_M] \in \mathbb{R}^{D \times M}$, where $b_m$ denotes the $m$'th codeword that is $D$-dimensional. While the SC20 method applies a sparsity constraint to select similar codewords of local features from a codebook, the LLC method19 incorporates a locality constraint, which necessarily leads to sparsity, whereas sparsity does not necessarily lead to locality. The visual information of image patches contained in the codebook is transferred into the latent variables of the CRF model by the LLC, which is more informative than local features. The LLC code $c \in \mathbb{R}^{M}$ of a local feature $x$ is obtained by solving the following optimization problem:

$$\min_{c} \; \| x - B c \|^{2} + \lambda \, \| d \odot c \|^{2} \quad \text{s.t.} \quad \mathbf{1}^{\top} c = 1, \tag{1}$$

where $\odot$ denotes the element-wise multiplication, $\lambda$ is used to control the locality constraint, $c$ is the vector of weights corresponding to the codewords, and $d \in \mathbb{R}^{M}$ is the locality adaptor that corresponds to the similarities between the codewords and local feature $x$. Specifically,

$$d = \exp\!\left( \frac{\mathrm{dist}(x, B)}{\sigma} \right), \tag{2}$$

where $\mathrm{dist}(x, B) = [\mathrm{dist}(x, b_1), \ldots, \mathrm{dist}(x, b_M)]^{\top}$, and $\mathrm{dist}(x, b_m)$ denotes the Euclidean distance between $x$ and $b_m$. $\sigma$ denotes the weight-decay speed for the locality adaptor. Note that the LLC code in Eq. (1) is not sparse in the sense of the $\ell_0$ norm, but it is sparse in the sense that the solution has few significant values.
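The LLC coding step can be illustrated with a short NumPy sketch. It follows the closed-form solution published with the LLC method19: solve a regularized linear system built from the codewords shifted to the feature, then rescale the code so that it sums to one. The function name `llc_code`, the default values of `lam` and `sigma`, the row-wise layout of the codebook, and the subtraction of the maximum distance (which keeps the locality adaptor in (0, 1]) are assumptions made for illustration, not settings reported in this paper.

```python
import numpy as np

def llc_code(x, codebook, lam=1e-4, sigma=1.0):
    """Closed-form LLC code of one feature.

    x        : (D,) local feature
    codebook : (M, D) array, one codeword per row (layout is an assumption)
    lam      : weight of the locality term (illustrative value)
    sigma    : weight-decay speed of the locality adaptor (illustrative value)
    """
    shifted = codebook - x                      # codewords expressed relative to x
    dist = np.linalg.norm(shifted, axis=1)      # Euclidean distance to each codeword
    # Locality adaptor; subtracting the maximum keeps every entry in (0, 1].
    d = np.exp((dist - dist.max()) / sigma)
    cov = shifted @ shifted.T                   # (M, M) data covariance matrix
    # Solve (cov + lam * diag(d)) c_tilde = 1 instead of forming an inverse.
    c_tilde = np.linalg.solve(cov + lam * np.diag(d), np.ones(len(codebook)))
    return c_tilde / c_tilde.sum()              # enforce the sum-to-one constraint

# Hypothetical toy data: three 2-D codewords and a feature close to the first one.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
feature = np.array([0.1, 0.05])
code = llc_code(feature, codebook)
```

In this toy case the largest coefficient falls on the nearest codeword and the weighted combination of codewords reproduces the feature closely, which is the behavior the weighted voting strategy relies on.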
In the LLC method, the solution of the optimization problem can be derived analytically as

$$\tilde{c} = \left( C + \lambda \, \mathrm{diag}(d) \right) \backslash \mathbf{1}, \tag{3}$$

$$c = \tilde{c} \, / \, \left( \mathbf{1}^{\top} \tilde{c} \right), \tag{4}$$

where $C = (B^{\top} - \mathbf{1} x^{\top})(B^{\top} - \mathbf{1} x^{\top})^{\top}$ denotes the data covariance matrix, $\backslash$ denotes matrix left division, $\lambda$ is a parameter controlling the locality constraint, $\mathbf{1}$ indicates the constant vector of ones, and / denotes the division. Equation (4) normalizes the code so that its entries sum to one.

2.3. Conditional Random Field

A CRF is a flexible framework for modeling contextual information that can be grouped into three levels: pixels, patches, and objects. It is widely used for image semantic segmentation and patch-level labeling,11–15,18 where computer vision problems are addressed by CRF inference. Kumar and Hebert18 proposed the discriminative random field, which inherits the CRF concept for labeling man-made structures at patch level. To disambiguate local image information, He et al.11 proposed a multi-CRF with three separate components at different scales for image semantic segmentation. Quattoni et al.16 proposed a hidden-state CRF for image classification that models the latent structure of the input domain via intermediate hidden variables. Toyoda and Hasegawa12 proposed a CRF incorporating local and global image information, which achieves global consistency of layouts. Shotton et al.13 proposed a CRF model for semantic segmentation that uses a texture-layout filter incorporating texture, layout, and contextual information. To address the excessive boundary smoothing that an adjacency CRF structure causes in semantic segmentation, Krähenbühl and Koltun14 proposed a fully connected CRF that establishes pairwise potentials consisting of a linear combination of Gaussian kernels on all pairs of pixels in the image. Chen et al.15 proposed the DeepLab system, which couples a deep convolutional network-based pixel-level classifier with a fully connected CRF that models long-range dependencies to capture fine edge details.
Yang and Yang17 proposed a top-down saliency model by constructing a CRF upon SC of image patches; the codebook was optimized by jointly learning the CRF model. To speed-up the saliency detection procedure, Yang and Xiong43 proposed a saliency detection method by combining LLC and CRF. While these saliency detection methods use CRF to generate saliency maps directly, the proposed method builds the CRF model to obtain Hough voting elements. The CRF13,18 is a conditional distribution over the labels given the observations , which can be written as where is a normalizing constant known as the partition function, and are the unary and pairwise potentials, respectively, is a set of sites that refers to elements (pixels or patches) in an image, is a set of neighbors of site , and is a coefficient that modulates the effect of the pairwise potential . In general, the unary potential denotes the penalty for a local classifier applied to an image patch and ignoring its neighbors. The pairwise potential is seen as a penalty of label inconsistency that assumes neighboring pixels or patches should be classified into the same object category.3.Our MethodOur pedestrian detection system consists of two modules: (i) a CRF model with latent variables denoted by LLC codes of image patches. The visual codebook can be optimized by learning this model and can further learn a spatial-occurrence distribution that specifies where each codeword may be found on the object. (ii) A Hough voting module. Patch labels are jointly estimated in a test image by CRF inference, and the patches classified into the object category are voting elements that cast weighted votes into the Hough image. Maxima in the Hough image are regarded as object hypotheses. An overview of the detection procedure is shown in Fig. 1. 
3.1.Conditional Random Field ModelWe exploit the contextual information in an image by a CRF model that uses LLC codes as latent variables and applying a Gaussian kernel to measure the strength of pairwise energy. This model is used for two purposes: (i) to optimize the codebook by learning the CRF model and (ii) to jointly classify image patches into the object category or background by CRF inference. To reduce Hough image noise resulting from background patches, image patches classified into the object category are used as voting elements (Sec. 3.4). Yang and Yang17 developed a CRF model upon SC of image patches for saliency detection. Inspired by this CRF model, we build a CRF framework for modeling the context constraint that uses a Gaussian kernel to measure the local feature similarity between neighboring nodes for pairwise energy where is the partition function for normalization, denotes a set of local features that is sampled from different sites of the image, denotes the corresponding labels, is the visual codebook, is the energy function, are the latent variables denoting LLC codes of a set of local features , and is the model parameter vector. For clarity, we simplify the notation by writing and . The energy function is decomposed into unary and pairwise energy terms where is a set of sites that refers to patches in an image and is a set of neighbors of site . The unary energy can be measured by the total contribution of sparse codes , where is the weight vector and denotes the number of codewords. The pairwise energy can be denoted as , where the scalar measures the weight of the pairwise energy term, is a Gaussian kernel to measure the strength of pairwise energy, and is an indicator function equaling 1 for different labels. The Gaussian kernel is defined as where and denote the LLC codes of neighboring local features and , respectively. 
The degree of similarity is controlled by the parameter .Like most CRF models,11–13 the energy function is linear with the parameter , but it is nonlinear with the codebook , which is implicitly defined by in Sec. 2.2. This nonlinear parametrization makes it challenging to learn the model. We discuss the learning approach in Sec. 3.2. 3.2.Joint CRF and Codebook LearningFollowing Yang and Yang’s17 method, we learn the CRF parameters and codebook in accordance with the CRF model. Let be a set of training images and be corresponding set of labels. We aim to estimate the CRF parameter vector and the codebook by maximizing the joint likelihood of training data where and is the convex set of codebooks that satisfies the following constraint:The evaluation of the partition function of Eq. (6) is an NP-hard problem. Referring to the max-margin CRF learning approach,44 we look for the optimal weights and codebook that assign the training labels , a probability that is greater than or equal to any other labeling of instance The partition function can be canceled from both sides of the constraints [Eq. (7)], and we express the constraints in terms of energies Moreover, we desire the energy of ground truth to be lower than that of any other energies of label configurations on the training data. Thus, we have a new constraint set The margin function , where is an indicator function equal to 1 for different labels. There are an exponential number of constraints with respect to labeling for each training image. Inspired by the cutting plane algorithm,45 the most violated constraints can be found by solving Therefore, the optimal weight and the codebook can be learned by minimizing the following objective function: where and controls the regularization of the weight .The above objective function is optimized by a stochastic gradient descent algorithm, which is summarized in Algorithm 1. Algorithm 1Joint CRF and codebook learning
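The max-margin constraints can be made concrete on a toy graph. The sketch below evaluates an energy with a unary term and a Gaussian-kernel pairwise term, then finds the most violated labeling by brute force. Several details are assumptions for illustration only: labels take values in {-1, +1}, the unary energy is the negative label times a weighted sum of the code, the kernel is exp(-||s_i - s_j||^2 / (2 sigma^2)), and exhaustive search stands in for the inference algorithm used in practice. All function names are hypothetical.

```python
import itertools
import math

def gaussian_kernel(s_i, s_j, sigma=1.0):
    """Similarity of two codes (assumed form of the Gaussian kernel)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(s_i, s_j))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def energy(labels, codes, edges, w, w_pair, sigma=1.0):
    """Unary plus pairwise energy of one labeling (lower is better)."""
    unary = sum(-y * sum(wk * sk for wk, sk in zip(w, s))
                for y, s in zip(labels, codes))
    # The pairwise term only penalizes neighbors that carry different labels.
    pairwise = sum(w_pair * gaussian_kernel(codes[i], codes[j], sigma)
                   for i, j in edges if labels[i] != labels[j])
    return unary + pairwise

def hamming_margin(labels, truth):
    """Margin function: number of sites whose label differs from the truth."""
    return sum(1 for a, b in zip(labels, truth) if a != b)

def most_violated(truth, codes, edges, w, w_pair, sigma=1.0):
    """Brute-force argmin of E(Y) - margin(Y, truth); feasible only for tiny graphs."""
    candidates = itertools.product((-1, 1), repeat=len(truth))
    return min(candidates,
               key=lambda y: energy(y, codes, edges, w, w_pair, sigma)
               - hamming_margin(y, truth))

# Hypothetical toy chain of three patches with two-dimensional codes.
codes = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9)]
truth = (1, 1, -1)
edges = [(0, 1), (1, 2)]

good = most_violated(truth, codes, edges, w=(2.0, -2.0), w_pair=0.5)
weak = most_violated(truth, codes, edges, w=(0.1, -0.1), w_pair=0.5)
```

With the stronger weights the most violated labeling is the ground truth itself, so every margin constraint holds and the hinge loss is zero. With the weaker weights another labeling wins, and its energy gap is what a gradient step on the weights and codebook would try to close.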
3.3. Learning the Spatial-Occurrence Distribution

In this section, we learn the nonparametric spatial-occurrence distribution for each codeword of the optimized codebook, which is used to cast votes into the Hough image in the test stage. An occurrence represents an image patch of the training images that matches a codeword. As in the other Hough transform methods,1,4,5 a codeword represents a specific object part whose position relative to the object center is uncertain. Each codeword corresponds to a set of occurrences in the training images. As shown in Algorithm 2, we iterate over all training images to match the codewords to local features. Here, we activate the codewords whose similarity exceeds a matching threshold of 0.7 (discussed in Sec. 4.1). For every codeword, we store all occurrence positions that reflect its spatial distribution over the object area in a nonparametric form (as a list of occurrences).

Algorithm 2. Learning the spatial-occurrence distribution
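The procedure described above can be sketched in a few lines of Python. Each training patch is compared with every codeword, and for each codeword whose similarity exceeds the matching threshold, the offset from the patch to the object center is appended to that codeword's list of occurrences. The similarity measure is not specified here, so cosine similarity is used as a stand-in assumption, and the names `learn_occurrences` and `cosine_similarity` as well as the tuple layout of the training data are hypothetical.

```python
import math
from collections import defaultdict

def cosine_similarity(u, v):
    """Stand-in similarity measure (an assumption, not the paper's definition)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def learn_occurrences(training_patches, codebook, threshold=0.7):
    """Collect, per codeword, the offsets from matched patches to the object center.

    training_patches : iterable of (feature, (patch_x, patch_y), (center_x, center_y))
    codebook         : list of codewords with the same dimension as the features
    """
    occurrences = defaultdict(list)
    for feature, (px, py), (cx, cy) in training_patches:
        for k, codeword in enumerate(codebook):
            # Activate every codeword that matches the patch well enough.
            if cosine_similarity(feature, codeword) > threshold:
                occurrences[k].append((cx - px, cy - py))
    return occurrences

# Hypothetical toy data: two codewords and three object patches.
codebook = [(1.0, 0.0), (0.0, 1.0)]
patches = [
    ((0.9, 0.1), (10, 20), (30, 50)),   # matches codeword 0 only
    ((0.6, 0.6), (5, 5), (8, 9)),       # matches both codewords
    ((0.0, 1.0), (40, 40), (30, 50)),   # matches codeword 1 only
]
occ = learn_occurrences(patches, codebook)
```

Storing raw offsets keeps the distribution nonparametric: at test time each stored offset becomes one candidate vote for the object center.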
3.4.Weighted Hough Voting StrategyIn Sec. 3.3, the visual codebook was optimized by learning the CRF model iteratively, and voting elements were obtained by CRF inference in the test image. We now describe the Hough voting procedure based on the CRF model that regards the LLC code of an image patch as a latent variable. A flowchart of the detection procedure is shown in Fig. 1. The voting element consistently casts weighted votes into the Hough image according to its LLC code. To locate the objects in the test image, maxima in the Hough image are regarded as object hypotheses. Moreover, to handle scale variations, a test image is resized by a set of scale factors, and hypotheses are computed independently in the Hough images at each scale. Different from other Hough transform approaches,1,2,4–6,8–10 our Hough voting procedure is cast into a probabilistic framework with a coding strategy. Let be the local feature observed at location in the test image. By matching it to the visual codebook, a set of valid interpretations with probabilities can be obtained. If a codeword matches, it casts votes for different object positions. That is, for every , votes for several object categories and a position can be obtained according to the learned spatial-occurrence distribution . The voting probability of a local feature can be formally expressed by the following marginalization: for , where is the number of codewords. Since the unknown local feature has been replaced by a known interpretation in the test image, the first term can be considered independent from . Also, local features matched to the codebook are independent of their location. Thus, the equation is reduced to where is the voting probability for an object position given its category label , codeword , and location . The probability denotes the confidence that the codeword is matched on the object category against the background. Finally, denotes the probability that local feature matches to codeword . 
The object scale is regarded as a third dimension in the voting space. If a local feature extracted from location matches a codeword that has been observed at position on a training image, it votes for the following coordinates:Thus, the voting probability is obtained by summing the votes for all stored observations from the learned occurrence distribution . The ensemble of all such votes is used to obtain a nonparametric probability density estimate for the position of the object center. The probability of a match between a local feature and codeword is obtained according to the LLC algorithm19 described above. In other words, the LLC code is regarded as weighted probabilities for Hough voting. Next, maxima are sought to be object hypotheses in the Hough voting space, in which all votes are accumulated. The search process includes two stages. We first accumulate the voting probabilities in a three-dimensional Hough space and find maxima as candidates. We then employ the mean-shift algorithm1 to refine the locations of hypotheses. Intuitively, the probability of an object hypothesis is obtained by summing the individual voting probabilities over all observations, and we arrive at the following equation: for , where is the number of local features in the test image. is the probability of local feature being sampled for object located at . Nonetheless, it is necessary to tolerate small shape deformations to be robust for intraclass variations of the object. Thus, the mean-shift framework1 is formulated with the following kernel density estimate: where the Gaussian kernel is a radially symmetric, nonnegative function, centered at zero and integrating to one, is the kernel bandwidth, and is its volume. The mean-shift search using this formulation will quickly converge to local modes of the underlying distribution. 
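The vote accumulation described above can be sketched as follows. Each voting element distributes its LLC coefficient for a codeword evenly over that codeword's stored occurrences and adds the resulting shares to a binned Hough image; the strongest cell then serves as a candidate object center. To keep the example short it works at a single scale and omits the mean-shift refinement. The equal split across occurrences, the skipping of non-positive coefficients, the bin size, and the function names are all assumptions made for illustration.

```python
from collections import defaultdict

def cast_votes(voting_elements, occurrences, bin_size=4):
    """Accumulate weighted votes for the object center in a binned Hough image.

    voting_elements : iterable of ((x, y), code), where code[k] is the LLC
                      coefficient of the patch on codeword k
    occurrences     : dict mapping codeword index to a list of (dx, dy) offsets
    """
    hough = defaultdict(float)
    for (px, py), code in voting_elements:
        for k, weight in enumerate(code):
            offsets = occurrences.get(k, [])
            if weight <= 0 or not offsets:
                continue  # assumption: only positive coefficients cast votes
            share = weight / len(offsets)  # split the weight over the occurrences
            for dx, dy in offsets:
                cell = ((px + dx) // bin_size, (py + dy) // bin_size)
                hough[cell] += share
    return hough

def strongest_cell(hough):
    """Return (cell, score) of the maximum of the Hough image."""
    return max(hough.items(), key=lambda item: item[1])

# Hypothetical toy data: two patches whose votes agree on the center (30, 50).
occurrences = {0: [(20, 30)], 1: [(-10, 10), (3, 4)]}
elements = [((10, 20), [0.8, 0.2]), ((40, 40), [0.1, 0.9])]
hough = cast_votes(elements, occurrences)
best_cell, best_score = strongest_cell(hough)
```

Because the two patches agree on one location, their shares add up in a single cell while the remaining votes stay scattered, which is how consistent evidence separates from noise in the Hough image.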
Moreover, the search procedure can be interpreted as kernel density estimation for the position of the object center.

Candidates of objects with high scores are usually close to each other in the Hough image. As a result, the same object may correspond to multiple candidates, which produces false positives. To reduce this redundancy, we apply NMS to the overlapping object hypotheses, with the intersection over union (IoU) threshold fixed at 0.7.

4. Experiments

4.1. Datasets

To evaluate the effectiveness of the proposed method in different scenes, we choose three publicly available pedestrian datasets, namely, INRIA pedestrian, TUD Brussels, and Caltech pedestrian. Pedestrians in these datasets are mostly upright but exhibit different degrees of occlusion as well as pose and scale changes, together with variations in background and illumination.

4.1.1. INRIA Pedestrian

The INRIA pedestrian dataset consists of 614 training images and 288 test images. It is challenging due to the variability of pedestrian poses, illumination changes, and highly cluttered backgrounds (mountains, buildings, vehicles, etc.).

4.1.2. TUD Brussels

The TUD Brussels dataset contains 508 image pairs (one pair per second) at a resolution of 640 × 480, which are recorded from a car driving in the inner city of Brussels. This dataset is challenging due to partial occlusion, cluttered backgrounds (e.g., poles, parked cars, buildings, and crowds), and numerous small-scale pedestrians.

4.1.3. Caltech Pedestrian

The Caltech pedestrian dataset and its associated benchmark are among the most popular pedestrian detection datasets. It consists of about 10 h of videos (30 frames per second) collected from a vehicle driving through urban traffic. Every frame in the Caltech dataset has been densely annotated with the bounding boxes of pedestrian instances. In total, there are 350,000 bounding boxes of about 2300 unique pedestrians labeled in 250,000 frames.
The pedestrians in the Caltech pedestrian dataset appear in many positions and orientations and against varied backgrounds. In the reasonable evaluation setting, performance is evaluated on pedestrians over 50 pixels tall with no or partial occlusion.

4.2. Experiment Procedure

All experiments are carried out on a workstation equipped with a Titan Xp GPU and an Intel Xeon(R) CPU E5-2620 v4 @ 2.10 GHz. The evaluation tool is based on the codes from the official websites of Caltech and PASCAL VOC. Bounding boxes of objects are predicted in an image at test time. By default, a predicted bounding box is considered a positive when its IoU with a ground-truth bounding box exceeds 0.5, and the rest are considered negatives. We use precision-recall (PR) curves to evaluate performance on the pedestrian datasets.4,26,28 Following Refs. 9 and 28, we use average precision (AP), which denotes the area under the PR curve, to measure detection performance on these datasets. The AP was calculated in accordance with the criteria of PASCAL VOC. We densely extract scale-invariant feature transform features from images with a step length of 16 pixels. The codebook is optimized by training the CRF model with 12 iterations. The matching threshold is set to 0.7 for learning the spatial-occurrence distribution of the optimized codebook (Sec. 3.3). The number of LLC neighbors is set to 20. The codebook size is set to 512. Implemented on a CPU to detect pedestrians from the Caltech pedestrian dataset, the Hough transform-based ISM1 and Barinova et al.’s method4 require 0.48 and 0.55 s per image, respectively, whereas the proposed method requires 0.62 s per image. Our method therefore requires only 0.14 s more per image than ISM, mainly owing to the efficient LLC19 and inference algorithms in the CRF model.
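The overlap criterion and the AP measure can be sketched in plain Python. The block below computes the IoU of two boxes, applies greedy NMS with the 0.7 threshold of Sec. 3.4, and computes AP as the area under the PR curve after making precision monotonically non-increasing, which is the all-point interpolation used by later PASCAL VOC editions. Whether the evaluation in this paper uses that variant or the 11-point variant is not stated, so the choice is an assumption, as are the function names and the restriction to a single image.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, threshold=0.7):
    """Greedy NMS: keep a box unless it overlaps a higher-scoring kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep

def average_precision(detections, ground_truth, min_iou=0.5):
    """AP for one image; detections is a list of (score, box), ground_truth a non-empty list of boxes."""
    matched, hits = set(), []
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        overlaps = [iou(box, gt) for gt in ground_truth]
        best = max(range(len(overlaps)), key=lambda j: overlaps[j])
        # A detection is a true positive only if its best match is still unclaimed.
        hit = overlaps[best] > min_iou and best not in matched
        if hit:
            matched.add(best)
        hits.append(hit)
    precisions, recalls, tp = [], [], 0
    for rank, hit in enumerate(hits, start=1):
        tp += hit
        precisions.append(tp / rank)
        recalls.append(tp / len(ground_truth))
    # Make precision non-increasing from right to left before integrating.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, previous_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - previous_recall) * p
        previous_recall = r
    return ap

# Hypothetical toy data: two pedestrians, three detections (one false positive).
ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60)), (0.7, (20, 20, 30, 31))]
ap = average_precision(detections, ground_truth)
kept = nms([(0, 0, 10, 10), (0, 0, 10, 11), (20, 20, 30, 30)], [0.9, 0.8, 0.7])
```

The two thresholds play different roles: 0.7 decides when two hypotheses describe the same object, whereas 0.5 decides when a hypothesis counts as a correct detection.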
4.3. Result Analysis

Figure 2 shows the PR curves of our method compared to conventional pedestrian detection approaches (HOG,21 FPDW,23 CrossTalk,25 LatSvm-V2,22 ACF,30 Roerei,26 MT-DPM,27 and NAMC32) on the INRIA pedestrian, TUD Brussels, and Caltech pedestrian datasets according to the reasonable setting. The APs of these methods are shown in Table 1. It can be observed that our method obtained obvious improvements over the Hough transform-based methods1,4,9 on these datasets. This is mainly attributable to two properties of our method that solve two challenging problems in the INRIA, TUD, and Caltech datasets: (i) the proposed method relies on image patches; hence, it can cope with the partial occlusions that are common in pedestrian datasets and (ii) the CRF model can effectively reduce the voting noise generated by the cluttered background.

Table 1. Performance comparison in terms of AP (%) on the INRIA, TUD Brussels, and Caltech pedestrian datasets according to the reasonable setting.
Note: The bold values denote the best detection performances in terms of AP.

We further evaluated the proposed method on three subsets of the Caltech pedestrian dataset according to its evaluation settings (“Occ = none,” “Occ = partial,” and “Occ = heavy”). In these three settings, pedestrians are fully visible, 65% to 100% visible, and 20% to 65% visible, respectively. Table 2 shows that our method achieved APs of 66.4%, 47.3%, and 25.5% on these respective evaluation settings. Our method shows obvious improvements over the Hough transform-based methods1,4,9 on these evaluation settings.

Table 2. Detection performance comparisons of our method and other methods on three Caltech evaluation settings (“Occ = none,” “Occ = partial,” and “Occ = heavy”).
Note: The bold values denote the best detection performances in terms of AP. For the TUD pedestrian dataset, we masked ground-truth objects with proportions of 20%, 40%, and 60% from the left to right side, respectively, owing to an absence of occlusion information in this dataset. As shown in Fig. 3, our method has obvious improvements on these masked proportions compared to Hough transform-based ISM1 and Barinova et al.’s4 method. In addition, we verified the significance of codebook optimization, codebook size, number of LLC neighbors, and weighted voting strategy on detection performance. 4.3.1.Impact of the codebook optimizationWe initialized the codebook by the K-means clustering algorithm and then optimized the codebook by learning the CRF model. The codebook optimization was driven by top-down prior knowledge in a supervised manner. As shown in Fig. 4(a), detection performance improved rapidly in the first several iterations and converged after 12 iterations. The stochastic nature of the learning algorithm resulted in some performance perturbation in some iterations. 4.3.2.Impact of the matching thresholdAt test time, occurrence distributions of the codebook were used to cast votes into the Hough image for pedestrian detection; thus, they are significant to detection performance of the proposed method. Learning occurrence distributions mainly depends on the matching threshold that represents the similarity between a codeword and an object patch of a training image. Intuitively, the occurrence distributions may be impacted by noise when the matching threshold is set to a relatively low value. On the contrary, the occurrence distributions are likely to lack some important occurrences when the matching threshold is set to a relatively high value. To find the optimal matching threshold, we evaluated the detection performance with different values of the matching threshold. 
Figure 4(b) shows the detection results on the INRIA pedestrian and TUD Brussels datasets with different values of the matching threshold. We found that our method achieved a relatively high AP when the matching threshold was 0.7. 4.3.3.Impact of the LLC parameterTo focus on the impact of the number of LLC neighbors, the codebook size was fixed at 512. As shown in Fig. 4(c), detection performance improved dramatically when was , and it converged when was . The experimental results show that the number of LLC neighbors had a great impact on detection performance. 4.3.4.Impact of the codebook sizeTo investigate the impact of codebook size on detection performance, we compared detection performance with codebook sizes of 256 and 512, with the parameter of LLC fixed at 20. As shown in Table 3, the AP was 92.6% when on the INRIA pedestrian dataset and 94.4% when . The AP was 62.7% when on the TUD Brussels dataset and 67.1% when . We found that gives better detection results than . Table 3Performance comparison in terms of codebook size M on the TUD Brussels and INRIA pedestrian datasets.
Note: The bold values denote the best detection performances in terms of AP.

4.3.5. Performance of the weighted voting strategy

For the weighted voting strategy (Sec. 3.4), we used the LLC coefficients instead of uniform weights as voting weights on codewords. The codebook size was fixed at 512, and the number of LLC neighbors was fixed at 20. As shown in Table 4, the APs of the weighted voting were 4.0% and 2.9% higher than those of the uniform voting on the INRIA pedestrian and TUD Brussels datasets, respectively.

Table 4. Performance comparison in terms of voting strategies on the TUD Brussels and INRIA pedestrian datasets.
Note: The bold values denote the best detection performances in terms of AP. 4.3.6.Effectiveness of the CRF model using the deep convolutional featuresTo investigate the effectiveness of the CRF model in detecting pedestrians using the deep convolutional features, we capture contextual relationships on the high-quality object candidates provided by the method RPN + BF.36 The region of interest (RoI) features of size are naturally extracted from the object candidates in the feature maps as in Ref. 36. An object candidate is regarded as a node in the CRF model within a fully connected form. The unary potential of the CRF model is the cost of the confidence score on an object candidate outputted by RPN + BF, which denotes the inverse likelihood of an object candidate taking the label of pedestrian. The pairwise potential relies on the RoI features of a pair of object candidates, which measures the cost of similar object candidates with different labels (e.g., the binary labels, pedestrian, and background) as in Refs. 48 and 49. We feed the RoI features of object candidates of all test images into the CRF model. Finally, the marginal probability distributions of all object candidates can be simultaneously obtained using the mean field inference in the CRF model. The PR curves are obtained by utilizing the marginal probabilities (as the confidence scores) of the pedestrian label, rather than utilizing the initial confidence scores provided by RPN + BF. In Fig. 5, it can be observed that the CRF model achieved APs of 98.7% and 93.2% on the INRIA and Caltech datasets, respectively, which obtains improvements of 1.3% and 2.2% over the RPN + BF. 5.ConclusionIn this work, we propose a pedestrian detection method that integrates context modeling and weighted voting strategy in a unified Hough transform framework. The noisy votes from background patches can be reduced by exploiting contextual information on image patches in an image. 
The coding coefficients based on the optimized codebook contribute to casting more balanced votes in the Hough image. The experimental results on the INRIA pedestrian, TUD Brussels, and Caltech pedestrian datasets demonstrated the effectiveness of the proposed method compared with other Hough transform-based methods. In future studies, we intend to exploit contextual information among multiple images for pedestrian detection, since the contextual information exploited in this work comes only from a single image.
Acknowledgments
This work was partially supported by the National Natural Science Foundation of China under Grant Nos. 61375008 and 61673274.
References
1. B. Leibe, A. Leonardis and B. Schiele, “Robust object detection with interleaved categorization and segmentation,” Int. J. Comput. Vision 77(1–3), 259–289 (2008). https://doi.org/10.1007/s11263-007-0095-3
2. J. Gall and V. Lempitsky, “Class-specific Hough forests for object detection,” in Decision Forests for Computer Vision and Medical Image Analysis, 143–157, Springer, London (2013).
3. J. Gall et al., “Hough forests for object detection, tracking, and action recognition,” IEEE Trans. Pattern Anal. Mach. Intell. 33(11), 2188–2202 (2011). https://doi.org/10.1109/TPAMI.2011.70
4. O. Barinova, V. Lempitsky and P. Kohli, “On detection of multiple object instances using Hough transforms,” IEEE Trans. Pattern Anal. Mach. Intell. 34(9), 1773–1784 (2012). https://doi.org/10.1109/TPAMI.2012.79
5. T. Wang, X. He and N. Barnes, “Learning structured Hough voting for joint object detection and occlusion reasoning,” in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, 1790–1797 (2013). https://doi.org/10.1109/CVPR.2013.234
6. C. R. Cabrera and R. J. López-Sastre, “Because better detections are still possible: multi-aspect object detection with boosted Hough forest,” in British Machine Vision Conf. (2015).
7. X. Lou et al., “Invariant Hough random ferns for RGB-D-based object detection,” Opt. Eng. 55(9), 091403 (2016). https://doi.org/10.1117/1.OE.55.9.091403
8. F. Milletari et al., “Hough-CNN: deep learning for segmentation of deep brain regions in MRI and ultrasound,” Comput. Vision Image Understanding 164, 92–102 (2017). https://doi.org/10.1016/j.cviu.2017.04.002
9. Y. Liu et al., “A novel rotation adaptive object detection method based on pair Hough model,” Neurocomputing 194, 246–259 (2016). https://doi.org/10.1016/j.neucom.2015.12.105
10. Y. Liu et al., “Soft Hough forest-ERTs: generalized Hough transform based object detection from soft-labelled training data,” Pattern Recognit. 60, 145–156 (2016). https://doi.org/10.1016/j.patcog.2016.04.023
11. X. He, R. S. Zemel and M. A. Carreira-Perpinan, “Multiscale conditional random fields for image labeling,” in IEEE Computer Society Conf. on Computer Vision and Pattern Recognition (CVPR), II-695–II-702 (2004). https://doi.org/10.1109/CVPR.2004.1315232
12. T. Toyoda and O. Hasegawa, “Random field model for integration of local information and global information,” IEEE Trans. Pattern Anal. Mach. Intell. 30(8), 1483–1489 (2008). https://doi.org/10.1109/TPAMI.2008.105
13. J. Shotton et al., “TextonBoost for image understanding: multi-class object recognition and segmentation by jointly modeling texture, layout, and context,” Int. J. Comput. Vision 81(1), 2–23 (2009). https://doi.org/10.1007/s11263-007-0109-1
14. P. Krähenbühl and V. Koltun, “Efficient inference in fully connected CRFs with Gaussian edge potentials,” Adv. Neural Inf. Process. Syst., 109–117 (2011).
15. L. C. Chen et al., “Semantic image segmentation with deep convolutional nets and fully connected CRFs,” in Int. Conf. on Learning Representations (ICLR) (2015).
16. A. Quattoni et al., “Hidden conditional random fields,” IEEE Trans. Pattern Anal. Mach. Intell. 29(10), 1848–1852 (2007). https://doi.org/10.1109/TPAMI.2007.1124
17. M. H. Yang and J. Yang, “Top-down visual saliency via joint CRF and dictionary learning,” in IEEE Conf. on Computer Vision and Pattern Recognition, 2296–2303 (2012). https://doi.org/10.1109/CVPR.2012.6247940
18. S. Kumar and M. Hebert, “Discriminative random fields,” Int. J. Comput. Vision 68(2), 179–201 (2006). https://doi.org/10.1007/s11263-006-7007-9
19. J. Wang et al., “Locality-constrained linear coding for image classification,” in IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, 3360–3367 (2010). https://doi.org/10.1109/CVPR.2010.5540018
20. J. Yang et al., “Linear spatial pyramid matching using sparse coding for image classification,” in IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, 1794–1801 (2009). https://doi.org/10.1109/CVPR.2009.5206757
21. N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, 886–893 (2005). https://doi.org/10.1109/CVPR.2005.177
22. P. F. Felzenszwalb et al., “Object detection with discriminatively trained part-based models,” IEEE Trans. Pattern Anal. Mach. Intell. 32(9), 1627–1645 (2010). https://doi.org/10.1109/TPAMI.2009.167
23. P. Dollár, S. J. Belongie and P. Perona, “The fastest pedestrian detector in the west,” in British Machine Vision Conf. (BMVC), 7 (2010).
24. W. Ouyang and X. Wang, “A discriminative deep model for pedestrian detection with occlusion handling,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 3258–3265 (2012). https://doi.org/10.1109/CVPR.2012.6248062
25. P. Dollár, R. Appel and W. Kienzle, “Crosstalk cascades for frame-rate pedestrian detection,” Lect. Notes Comput. Sci. 7573, 645–659 (2012). https://doi.org/10.1007/978-3-642-33709-3
26. R. Benenson et al., “Seeking the strongest rigid detector,” in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, 3666–3673 (2013). https://doi.org/10.1109/CVPR.2013.470
27. J. Yan et al., “Robust multi-resolution pedestrian detection in traffic scenes,” in IEEE Conf. on Computer Vision and Pattern Recognition, 3033–3040 (2013). https://doi.org/10.1109/CVPR.2013.390
28. X. Ren and D. Ramanan, “Histograms of sparse codes for object detection,” in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, 3246–3253 (2013). https://doi.org/10.1109/CVPR.2013.417
29. B. Hariharan, C. Zitnick and P. Dollár, “Detecting objects using deformation dictionaries,” in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, 1995–2002 (2014). https://doi.org/10.1109/CVPR.2014.256
30. P. Dollár et al., “Fast feature pyramids for object detection,” IEEE Trans. Pattern Anal. Mach. Intell. 36(8), 1532–1545 (2014). https://doi.org/10.1109/TPAMI.2014.2300479
31. B. C. Ko, J. E. Son and J. Y. Nam, “View-invariant, partially occluded human detection in still images using part bases and random forest,” Opt. Eng. 54(5), 053113 (2015). https://doi.org/10.1117/1.OE.54.5.053113
32. C. Toca, M. Ciuc and C. Patrascu, “Normalized autobinomial Markov channels for pedestrian detection,” in BMVC, 175.1–175.13 (2015).
33. A. Angelova et al., “Real-time pedestrian detection with deep network cascades,” in BMVC, 4 (2015).
34. A. Verma et al., “Pedestrian detection via mixture of CNN experts and thresholded aggregated channel features,” in Proc. of the IEEE Int. Conf. on Computer Vision Workshops, 555–563 (2015). https://doi.org/10.1109/ICCVW.2015.78
35. Y. Tian et al., “Pedestrian detection aided by deep learning semantic tasks,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2015). https://doi.org/10.1109/CVPR.2015.7299143
36. L. Zhang et al., “Is faster R-CNN doing well for pedestrian detection?,” Lect. Notes Comput. Sci. 9906, 443–457 (2016). https://doi.org/10.1007/978-3-319-46475-6
37. J. Li et al., “Scale-aware fast R-CNN for pedestrian detection,” IEEE Trans. Multimedia 20, 985–996 (2017). https://doi.org/10.1109/TMM.2017.2759508
38. X. Du et al., “Fused DNN: a deep neural network fusion approach to fast and robust pedestrian detection,” in IEEE Winter Conf. on Applications of Computer Vision (WACV), 953–961 (2017). https://doi.org/10.1109/WACV.2017.111
39. G. Brazil, X. Yin and X. Liu, “Illuminating pedestrians via simultaneous detection and segmentation,” in IEEE Int. Conf. on Computer Vision (ICCV), 4960–4969 (2017).
40. S. Lazebnik, C. Schmid and J. Ponce, “Beyond bags of features: spatial pyramid matching for recognizing natural scene categories,” in IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, 2169–2178 (2006). https://doi.org/10.1109/CVPR.2006.68
41. F. Bach, J. Mairal and J. Ponce, “Task-driven dictionary learning,” IEEE Trans. Pattern Anal. Mach. Intell. 34(4), 791–804 (2012). https://doi.org/10.1109/TPAMI.2011.156
42. G. Csurka et al., “Visual categorization with bags of keypoints,” in Workshop on Statistical Learning in Computer Vision, ECCV, 1–22 (2004).
43. Z. Yang and H. Xiong, “Computing object-based saliency via locality-constrained linear coding and conditional random fields,” Visual Comput. 33(11), 1403–1413 (2017). https://doi.org/10.1007/s00371-016-1287-z
44. M. Szummer, P. Kohli and D. Hoiem, “Learning CRFs using graph cuts,” Lect. Notes Comput. Sci. 5303, 582–595 (2008). https://doi.org/10.1007/978-3-540-88688-4
45. T. Joachims, T. Finley and C. N. J. Yu, “Cutting-plane training of structural SVMs,” Mach. Learn. 77(1), 27–59 (2009). https://doi.org/10.1007/s10994-009-5108-8
46. J. Hosang et al., “Taking a deeper look at pedestrians,” in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (2015). https://doi.org/10.1109/CVPR.2015.7299034
47. Y. Tian et al., “Deep learning strong parts for pedestrian detection,” in IEEE Int. Conf. on Computer Vision, 1904–1912 (2016). https://doi.org/10.1109/ICCV.2015.221
48. Z. Hayder, M. Salzmann and X. He, “Object co-detection via efficient inference in a fully-connected CRF,” Lect. Notes Comput. Sci. 5303, 330–345 (2014). https://doi.org/10.1007/978-3-319-10578-9
49. Z. Hayder, X. He and M. Salzmann, “Structural kernel learning for large scale multiclass object co-detection,” in IEEE Int. Conf. on Computer Vision (ICCV), 2632–2640 (2015). https://doi.org/10.1109/ICCV.2015.302
Biography
Linfeng Jiang received his BS degree in computer science from Chongqing University, Chongqing, China, in 2005, and his MS degree in computer science from Kunming University of Science and Technology, Kunming, China, in 2011. Currently, he is working toward his PhD in the Department of Automation, Shanghai Jiao Tong University (SJTU), Shanghai, China. He is interested in computer vision and probabilistic graphical theory for context modeling.
Huilin Xiong received his BSc and MSc degrees in mathematics from Wuhan University, Wuhan, China, in 1986 and 1989, respectively. He received his PhD in pattern recognition and intelligent control from the Institute of Pattern Recognition and Artificial Intelligence, Huazhong University of Science and Technology, Wuhan, China, in 1999. He joined SJTU, Shanghai, China, in 2007, and he is currently a professor in the Department of Automation, SJTU. His research interests include pattern recognition, machine learning, and bioinformatics. |