1. Introduction

Visual tracking is one of the current research hotspots in computer vision. It has been widely used in many fields, such as visual surveillance, human–computer interfaces, medical image analysis,1,2 and so on. Given the initial state of the target (including position, scale, etc.), classical visual trackers achieve tracking by estimating the target's state in each following frame.

In recent years, a large number of tracking algorithms have been proposed. Existing tracking algorithms can be divided into two categories:3 generative methods and discriminative methods. The former are "model-driven" methods that use the target's information to establish a target model and select the most similar sample as the tracking result. Popular generative methods include incremental visual tracking (IVT),4 multitask tracking (MTT),5 and the adaptive structural local appearance model (ASLA).6 The latter are "data-driven" methods that treat tracking as a binary classification problem between target and background. Some state-of-the-art trackers, such as compressive tracking (CT),7 tracking-learning-detection (TLD),8 and multiple instance learning (MIL),9 are discriminative methods. These trackers, which use hand-crafted features, achieve good performance in simple and controllable environments, but they often suffer from tracking drift or target loss in practical and complicated environments involving illumination variation, deformation, occlusion, motion blur, and so on. Therefore, there is still a challenging gap between a robust real-time tracker and realistic application in extreme and complicated conditions.

The emergence and development of "deep learning" has gradually become a potential solution to the above problems.10 Different from hand-crafted features, deep learning learns high-level semantic features automatically. These features are effective in distinguishing the target from the background due to the deep architectures involved. Recently, deep learning-based trackers have gradually become the tendency in the visual tracking field owing to their outperformance of traditional tracking methods. However, tracking methods based on deep learning still suffer from some difficulties.11
In this paper, we propose a multiscale deep sparse network (MDSN) and build a robust tracker: multiscale sparse networks-based tracker (MSNT). The main contributions of our work can be summarized as follows:
A large number of experiments and analyses are carried out on the CVPR2013 tracking benchmark dataset17 (including 51 challenging videos) against nine recent state-of-the-art trackers. The experimental results show that our tracker achieves outstanding performance in challenging environments and attains a practical tracking speed.

2. Related Work

The concept of "deep learning" was first proposed by Hinton and Salakhutdinov.12 Since then, deep learning has attracted wide attention and made great progress. With its robust and efficient features, deep learning has been applied in diverse fields, such as image classification,14,18 automatic speech recognition,19 face recognition,20 and so on.

Recently, deep neural networks (DNNs) have been applied to visual tracking. Fan et al.21 extracted specific features from convolutional neural networks (CNNs) with offline pretraining for human tracking and obtained acceptable tracking results in some complex situations. By training a stacked denoising autoencoder on a large-scale image dataset, the deep learning tracker (DLT)22 learned generic features and achieved robust tracking performance. Li et al.23 applied a single CNN to visual tracking without pretraining and combined it with multiple image cues to improve the tracking success rate. Wang et al.24 used hierarchical features for tracking by training a two-layer CNN on an auxiliary dataset and obtained good results in complicated tracking situations. Zhang et al.25 proposed a convolutional network-based tracker (CNT), which combines the local structure features and global geometric information of tracked targets and attains state-of-the-art performance.

Sparse distributed representation (SDR) is the key to learning powerful features in deep learning, and the activation function plays an important role in encouraging sparsity.26 The performance of the activation function directly influences the effectiveness and robustness of the extracted features. The most popular nonlinear activation functions are "sigmoid" and "tanh." They have been widely used in many deep networks, but they suffer from some drawbacks,11 such as slow training and poor local solutions under random initialization, which harm predictive performance. Recently, a sparse activation function called the rectified linear unit (ReLU) was proposed in Ref. 15. As illustrated in Fig. 1, different from traditional activation functions such as sigmoid and tanh, the rectifier is a one-sided activation function. It enforces hard zeros in the learned feature representation26 and induces sparsity of the hidden units by rectifying their negative outputs.16 This sparsity of hidden units has the same effect as pretraining: the experiments in Ref. 27 showed that pretraining leads to sparser deep networks compared with DNNs trained without pretraining, and Glorot's experiments16 proved that deep networks with ReLU can reach their best performance without any unsupervised pretraining due to this sparsity. Further experiments confirmed the conclusion in Ref. 27 and showed no significant improvement for DNNs with ReLU when pretraining is used. Moreover, ReLU was used in a sparse deep stacking network (S-DSN) for image classification in Ref. 18; it avoided expensive inference effectively and achieved higher sparsity and better classification performance than the S-DSN with sigmoid.
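To make the hard-zero behavior concrete, the following minimal NumPy sketch (ours, not from the cited works) applies ReLU to a randomly initialized hidden layer and measures the fraction of exactly zero activations; with zero-mean inputs and weights, roughly half the units are inactive, consistent with the sparsity described above. The layer sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    """One-sided rectifier: hard zeros for all negative pre-activations."""
    return np.maximum(0.0, z)

# Hypothetical layer sizes, for illustration only.
x = rng.standard_normal((100, 256))        # 100 samples, 256 input units
W = 0.1 * rng.standard_normal((256, 128))  # 128 hidden units
h = relu(x @ W)                            # hidden activations

# Fraction of exactly zero activations: ~0.5 for zero-mean inputs/weights.
print(f"sparsity (exact zeros): {np.mean(h == 0.0):.2f}")
```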
Furthermore, the active part of ReLU is an unsaturated linear function, which effectively alleviates the "gradient vanishing" problem in training and improves the training speed. Therefore, ReLU is a practical activation function for quickly building sparse and powerful deep architectures without requiring a pretraining process.

3. Deep Sparse Network Model

Different from sparse coding (SC), the sparsity of neural networks attempts to represent the features of the input data using the least number of hidden neurons. The feature representation of objects in sparse neural networks is an SDR,26 in which all representational units participate in data representation while very few units activate for any single data sample. It can exploit more powerful and robust feature representations from the input data. Therefore, it is reasonable to build a deep sparse network model for tracking.

The sparse autoencoder (SAE) is a basic unsupervised learning model that is often used in deep learning. In this paper, we use an SAE structure similar to Ref. 14 and obtain a deep sparse network by training stacked SAEs with the "layer-by-layer greedy algorithm."12 The cost function of the model is defined as

$$J = \frac{1}{N}\sum_{i=1}^{N}\left\|x_i - \hat{x}_i\right\|^2 + \lambda\left(\|W\|_F^2 + \|W'\|_F^2\right) + \beta\sum_{j=1}^{m}\mathrm{KL}\!\left(\rho \,\|\, \hat{\rho}_j\right), \tag{1}$$

where $\hat{x}_i$ denotes the reconstruction of the input $x_i \in \mathbb{R}^n$; $W$ and $W'$ are the weight matrices of the encoder and decoder, respectively; the bias vector of the encoder is included in $W$; $N$ is the number of samples; $\lambda$ is a penalty factor that balances the reconstruction loss and the weights; $\beta$ is the sparsity penalty factor; and $\|\cdot\|_F$ denotes the Frobenius norm. The cross-entropy sparsity term is given as

$$\mathrm{KL}\!\left(\rho \,\|\, \hat{\rho}_j\right) = \rho\log\frac{\rho}{\hat{\rho}_j} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j}, \tag{2}$$

$$\hat{\rho}_j = \frac{1}{N}\sum_{i=1}^{N} h_j(x_i), \tag{3}$$

where $n$ and $m$ are the numbers of neurons in the input layer and hidden layer, respectively, and $\hat{\rho}_j$ denotes the average activation $h_j(x_i)$ of the $j$'th hidden unit over the inputs $x_i$. The sparsity target $\rho$ is close to 0; it is set to 0.05 in our experiments.

In Refs. 16, 18, and 27, it is proven that ReLU brings inherent sparsity to DNNs, which makes pretraining less effective when the ReLU activation function is used. Hence, we adopt ReLU as the activation function of the aforementioned deep sparse network and leave out offline pretraining. Benefiting from the intrinsic sparsity of ReLU, around 50% of the hidden units' output values are exact zeros once the deep network is built. This transforms the basic stacked SAEs into a variant, as shown in Fig. 2. Moreover, this percentage of inactive neurons (units that do not activate for any data sample) easily increases during online training owing to the sparsity constraints of the SAE.16

Fig. 2 The basic stacked SAEs and the variant with ReLU: (a) the basic stacked SAEs and (b) the variant of stacked SAEs with ReLU.

Based on the architecture of Fig. 2(b), a "softmax" classifier layer is added to the model as the last layer to classify the learned features. The classifier layer performs a logistic regression

$$p = \frac{1}{1 + \exp\!\left(-\theta^{\mathsf{T}} h\right)}, \tag{4}$$

where $p$ is a value in [0, 1], i.e., the probability that the sample is the true target in the visual tracking problem, $h$ is the learned feature vector, and $\theta$ is the model parameter. The final model of the deep sparse neural network for tracking is shown in Fig. 3. During tracking, each sampled patch receives a value in [0, 1] from the softmax classifier of the tracking network.
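For concreteness, the following NumPy sketch evaluates Eq. (1) for a single ReLU autoencoder layer. It is our illustration rather than the authors' code: the layer sizes are arbitrary, and $\hat{\rho}$ is clipped away from 0 and 1 so that Eq. (2) stays finite, a detail the paper does not specify.

```python
import numpy as np

def sae_cost(x, W, W_dec, b, b_dec, lam=1e-4, beta=3.0, rho=0.05):
    """Sparse-autoencoder cost: reconstruction + weight decay + KL sparsity.
    x: (N, n) batch; W: (n, m) encoder; W_dec: (m, n) decoder."""
    N = x.shape[0]
    h = np.maximum(0.0, x @ W + b)             # ReLU encoder, Fig. 2(b)
    x_hat = h @ W_dec + b_dec                  # reconstruction of x
    recon = np.sum((x - x_hat) ** 2) / N       # reconstruction loss
    decay = lam * (np.sum(W ** 2) + np.sum(W_dec ** 2))  # Frobenius norms
    rho_hat = np.clip(h.mean(axis=0), 1e-6, 1 - 1e-6)    # Eq. (3), clipped
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))  # Eq. (2)
    return recon + decay + beta * kl

# Tiny usage example with random data and small random weights.
rng = np.random.default_rng(0)
x = rng.random((100, 64))
W, Wd = 0.1 * rng.standard_normal((64, 32)), 0.1 * rng.standard_normal((32, 64))
print(sae_cost(x, W, Wd, np.zeros(32), np.zeros(64)))
```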
4. Tracking Algorithm Based on Multiscale Deep Sparse Networks

A single deep sparse network for tracking was introduced in Sec. 3. However, this fixed architecture is too rigid for practical tracking tasks and cannot adapt to different situations effectively. Based on the single network model of Sec. 3, we propose an MDSN and combine it with a particle filter framework to cope with complex tracking tasks.

4.1. Multiscale Deep Sparse Networks

A conventional neural network for tracking usually normalizes the initial target patch or the sampled patches to the same size at the input layer, which effectively reduces the number of input neurons and the complexity of the network. For example, the target patch in the first frame is normalized to a low-resolution (LR) image of 32 × 32 pixels in DLT.22 However, we observe in several experiments that a fixed normalization causes various degrees of stretching or compression for different targets. This deformation damages the inner structure information of the targets, reduces the robustness of the extracted features, and increases tracking drift. When the normalization is instead matched to the shape of the target, more inner structure information is preserved and a better tracking result is achieved owing to the reduced deformation. As shown in Fig. 4, the red bounding box and line represent the tracker based on one normalization scale (normalization-2), while the green ones represent the other (normalization-1); we clearly observe that normalization-2 outperforms normalization-1 in this case.

Fig. 4 Comparisons of two trackers based on different normalizations (the red bounding box and line represent normalization-2, while the green ones represent normalization-1). (a) The tracking results of the two trackers and (b) the center location errors (CLEs) of the two trackers.

Based on these observations, we propose an MDSN to adapt to different targets and situations effectively. It is called "multiscale" because we build four different architectures of deep sparse networks aimed at four different kinds of situations, and the network whose input shape best matches the target is selected, as in the sketch below.
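A small sketch makes the selection rule concrete. The input sizes and the low-resolution threshold below are hypothetical placeholders (the paper's four actual scales accompany Fig. 5); the point is only that the branch whose input aspect ratio is closest to the target's is chosen, minimizing stretching or compression.

```python
# Hypothetical input scales for the four network branches; the paper's
# actual sizes are defined with Fig. 5, so these values are placeholders.
SCALES = {
    "tall":   (16, 32),   # (width, height), e.g., pedestrians
    "wide":   (32, 16),   # e.g., vehicles seen from the side
    "square": (32, 32),   # roughly isotropic targets
    "small":  (16, 16),   # low-resolution targets
}

def select_network(target_w, target_h):
    """Pick the branch that distorts the patch least: low-resolution
    targets go to the small branch; otherwise match the aspect ratio."""
    if target_w * target_h < 400:             # hypothetical LR threshold
        return "small"
    ratio = target_w / target_h
    candidates = {k: v for k, v in SCALES.items() if k != "small"}
    return min(candidates,
               key=lambda k: abs(candidates[k][0] / candidates[k][1] - ratio))

print(select_network(24, 60))  # -> "tall"
```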
The four defined situations of tracked targets and the corresponding deep network architectures are as follows:

The entire architecture of the MDSN is shown in Fig. 5. Given a new tracking task, the MDSN first chooses the corresponding tracking network according to the situations defined above. The multiscale architecture preserves the inner structure information of targets as much as possible, which improves the robustness of the extracted features.

4.2. Particle Filter Tracking Framework

The particle filter algorithm22,28 is a popular framework in the visual tracking field. Let $z_t$ and $y_t$ denote the state and observation of the target at time $t$, respectively. The tracking task can be considered the search for the target state of maximum posterior probability at time $t$ given the observations:

$$\hat{z}_t = \arg\max_{z_t} p\!\left(z_t \mid y_{1:t}\right), \tag{5}$$

where $p(z_t \mid y_{1:t})$ is the posterior distribution of the target state at time $t$. According to the Bayes criterion, we get

$$p\!\left(z_t \mid y_{1:t}\right) \propto p\!\left(y_t \mid z_t\right) \int p\!\left(z_t \mid z_{t-1}\right) p\!\left(z_{t-1} \mid y_{1:t-1}\right) \mathrm{d}z_{t-1}. \tag{6}$$

The particle filter estimates the posterior distribution through a set of random particles $\{z_t^i\}_{i=1}^{N_p}$ with corresponding weights $\{w_t^i\}_{i=1}^{N_p}$, where $N_p$ denotes the number of sampled particles and the initial weights are $w_0^i = 1/N_p$. Because the weights easily suffer from weight degeneracy, they are updated as

$$w_t^i = w_{t-1}^i \, \frac{p\!\left(y_t \mid z_t^i\right) p\!\left(z_t^i \mid z_{t-1}^i\right)}{q\!\left(z_t^i \mid z_{1:t-1}^i, y_{1:t}\right)}, \tag{7}$$

where $q(\cdot)$ is the proposal distribution, which depends on the particle distribution at time $t-1$ and the observation at time $t$. It is often simplified to a first-order Markov process $q(z_t \mid z_{1:t-1}, y_{1:t}) = p(z_t \mid z_{t-1})$, which is independent of the current observation. Thus, the update formulation simplifies to

$$w_t^i = w_{t-1}^i \, p\!\left(y_t \mid z_t^i\right). \tag{8}$$

Meanwhile, the weights are further normalized to satisfy

$$\sum_{i=1}^{N_p} w_t^i = 1. \tag{9}$$

In our proposed algorithm, we use the particle filter to randomly sample candidate patches around the last tracking result; we then feed the sampled patches to the tracking network proposed in Sec. 4.1. The confidence coefficient obtained from the classifier layer serves as the likelihood $p(y_t \mid z_t^i)$, and we choose the particle of maximum confidence to obtain the current target state by Eq. (5). Meanwhile, to adapt to changes of the target's scale during tracking, random disturbances $\eta_w$ and $\eta_h$ are added to the width and height of the candidate patches. In this paper, $\eta_w$ and $\eta_h$ obey a normal distribution with zero mean and a variance of 0.01.
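The per-frame estimation can be condensed into a short sketch. This is our schematic of the framework above, not the authors' implementation: `score_patches` stands in for the tracking network's softmax confidence, the width/height noise uses the 0.01 variance from the text, and the position spread and the resampling step are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def track_frame(particles, weights, frame, score_patches):
    """One particle-filter step: propagate, weight by network confidence,
    pick the maximum-posterior particle, then resample against degeneracy.
    `score_patches(frame, particles)` returns one confidence in [0, 1]
    per particle (a stand-in for the softmax output of the network)."""
    # Propagate states (x, y, w, h); width/height noise has variance 0.01
    # as in the text, while the position spread is a hypothetical choice.
    particles = particles + rng.normal(0.0, [2.0, 2.0, 0.1, 0.1],
                                       size=particles.shape)
    # Simplified weight update, Eq. (8): w_t = w_{t-1} * p(y_t | z_t).
    weights = weights * score_patches(frame, particles)
    weights = weights / weights.sum()           # normalization, Eq. (9)
    estimate = particles[np.argmax(weights)]    # maximum posterior, Eq. (5)
    # Resample particles in proportion to their weights, then reset weights.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles)), estimate
```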
4.3. Online Training and Updating Strategy for the Tracking Network

After the corresponding tracking network is determined, a randomly initialized network cannot satisfy the requirements of a specific tracking task, so we adjust the network parameters by online training with task-specific labeled samples. In a specific tracking task, only the initial state $(x_0, y_0, w_0, h_0)$ is given, where $(x_0, y_0)$ denotes the initial position of the target and $w_0$ and $h_0$ denote its width and height, respectively; from this state we must collect enough positive and negative samples to train the network. In our proposed method, we randomly collect 10 positive samples close to the target's center and 100 negative samples far away from the target. Training the tracking network with these positive and negative samples at the beginning of tracking yields a network adapted to the specific task.

A robust tracking algorithm should consistently track the target without drifting or losing it, which requires the tracker to adjust its parameters adaptively according to changes in the environment. The condition for updating the proposed method is

$$\max_i \, p\!\left(y_t \mid z_t^i\right) < \tau \quad \text{or} \quad fn > F, \tag{10}$$

where $\tau$ is the updating threshold, $fn$ is the number of accumulated frames since the last update, and $F$ is the maximum number of accumulated frames. If Eq. (10) is satisfied, the current tracking result is added to the positive sample set, the negative samples are randomly sampled again in the current frame, and the network is retrained with the updated positive and negative samples to realize the update of the tracking network.

4.4. Overall Process of the Algorithm

Integrating the above key components, we present the visual tracking method MSNT via the proposed MDSN. The main steps of MSNT are shown in Table 1, and the flow chart of the overall algorithm is shown in Fig. 6.

Table 1 The main steps of the MSNT algorithm.
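To complement Table 1, here is a compact, runnable sketch of the main MSNT loop as we read Secs. 4.1 to 4.3. All component functions are toy stand-ins rather than the authors' API, and `tau` and `F` are hypothetical settings, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Stubs standing in for the real components (not the authors' API) ----
def train(net, pos, neg):                 # online training step (Sec. 4.3)
    return net

def confidence(net, frame, candidates):   # softmax confidences in [0, 1]
    return rng.random(len(candidates))

def sample_candidates(state, n=400):
    """Particles around the last state (x, y, w, h); the width/height
    disturbance variance 0.01 follows Sec. 4.2, position spread is ours."""
    return state + rng.normal(0.0, [2.0, 2.0, 0.1, 0.1], size=(n, 4))

# --- Main loop, mirroring Table 1 -----------------------------------------
def msnt_track(frames, init_state, tau=0.9, F=50):
    """tau and F are hypothetical placeholders for the update threshold
    and the maximum accumulated frames."""
    net = train("shape-matched network", [init_state], [])  # Secs. 4.1, 4.3
    state, fn, states = np.asarray(init_state, float), 0, []
    for frame in frames:
        cand = sample_candidates(state)
        conf = confidence(net, frame, cand)
        state, fn = cand[conf.argmax()], fn + 1   # maximum-posterior choice
        if conf.max() < tau or fn > F:            # update condition, Eq. (10)
            net, fn = train(net, [state], ["resampled negatives"]), 0
        states.append(state)
    return states
```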
5. Experiments

The proposed MSNT algorithm is implemented in MATLAB® on an experimental platform with a CPU (Intel Xeon, 2.4 GHz) and a GPU (TITAN X). The initial parameters of the MSNT are set as follows: …. In addition, we set the learning rate to 0.01 during online training. The weight matrix $W^{(l)}$ between the $l$'th layer and the $(l+1)$'th layer is randomly initialized, where $n_l$ and $n_{l+1}$ denote the numbers of neurons in the $l$'th and $(l+1)$'th layers and $\mathrm{rand}(n_{l+1}, n_l)$ generates an $n_{l+1} \times n_l$ matrix with entries uniformly distributed between 0 and 1. Therefore, each $W^{(l)}$ is randomly initialized with small weights, and the weights differ between layers.

To verify the validity of our proposed method, the one-pass evaluation (OPE) of Ref. 17 is used in our experiments. The MSNT algorithm is evaluated on the tracking benchmark dataset,17 which includes 51 fully annotated videos. We compare the performance of our tracker with nine state-of-the-art trackers: DLT,22 CNT,25 kernelized correlation filters (KCF),29 tracking with Gaussian processes regression (TGPR),30 the sparsity-based collaborative model (SCM),31 Struck,32 structural sparse tracking (SST),33 the linearization to nonlinear learning tracker (LNLT),34 and the circulant sparse tracker (CST).35 A brief introduction to these reference trackers is given in Table 2, and their tracking results are those provided by their authors. Qualitative and quantitative comparisons are carried out to evaluate the performance of our tracker; detailed color comparisons are available in the online version of this paper.

Table 2 Brief introduction to the nine reference trackers.
Note: In the basic method column: SC, sparse coding; KCF, kernelized correlation filter; DL, deep learning; CNN, convolutional neural network; AE, autoencoder; GPR, Gaussian processes regression; SVM, support vector machine.

5.1. Qualitative Comparisons

For the qualitative comparisons, eight challenging sequences are selected to evaluate the MSNT intuitively. The results are shown in Fig. 7, where different colors indicate different trackers. We then analyze the trackers qualitatively from the following aspects:
5.2. Quantitative Comparisons

To evaluate our tracker comprehensively and reliably, we use four quantitative evaluation metrics, introduced in Ref. 17, to carry out the quantitative analysis.
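Two quantities underlying these metrics are the bounding-box overlap rate and the CLE. The following minimal sketch restates the standard benchmark definitions17 (our illustration, not code from the paper):

```python
import math

def overlap_rate(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h);
    a success at threshold t0 means overlap_rate(a, b) > t0 (0.5 below)."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union

def center_location_error(a, b):
    """Euclidean distance in pixels between the two box centers."""
    return math.hypot((a[0] + a[2] / 2) - (b[0] + b[2] / 2),
                      (a[1] + a[3] / 2) - (b[1] + b[3] / 2))
```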
In our experiments, we quantitatively analyze our tracker from three aspects: the tracking performance on single sequences, the overall performance on the 51 sequences, and the attribute-based performance on the 51 sequences.17

5.2.1. Tracking performance for a single sequence

The eight challenging sequences introduced in Sec. 5.1 are used to compare tracking performance on single sequences quantitatively. Figure 8 and Table 3 show the overlap rate plots and the success rates at the success threshold $t_0 = 0.5$, respectively, of the 10 trackers on the eight challenging sequences. From Fig. 8, we see that the overlap rates of our tracker remain at a high level on these eight challenging sequences, and the success rates of our tracker in Table 3 are higher than those of most other trackers. These metrics show that our tracker achieves a good tracking success rate on single sequences in different challenging scenes.

Fig. 8 Overlap rate plots of the 10 trackers on eight challenging sequences: (a) to (h) Car4, Coke, Deer, Freeman4, Girl, Jogging-1, Singer1, and Tiger1, respectively.

Table 3 The success rates at the success threshold $t_0 = 0.5$ of the trackers on the eight challenging sequences.
Note: Values are in percent; the best results are in bold and the second best in italics.

Figure 9 and Table 4 show the CLE plots and the average CLEs of the trackers, respectively, on the eight challenging sequences. Over the course of tracking a single sequence, our tracker maintains lower center errors than the others and achieves a low tracking error for the whole sequence. These metrics show that our tracker attains higher precision during the tracking process.

Fig. 9 CLE plots of the 10 trackers on eight challenging sequences: (a) to (h) Car4, Coke, Deer, Freeman4, Girl, Jogging-1, Singer1, and Tiger1, respectively.

Table 4 Average CLEs of the trackers on the eight sequences.
Note: Values are in pixels; the best results are in bold and the second best in italics.

5.2.2. Overall performance for 51 sequences

To evaluate our tracker's overall performance on the 51 sequences of the benchmark,17 we plot the success plots and precision plots of the above 10 trackers. The success plot shows the success rates at overlap thresholds varied over the interval [0, 1], and the precision plot shows the precisions at CLE thresholds varied from 0 to 50 pixels. Furthermore, to verify the effectiveness of the multiscale tracking networks of our tracker, a single-network tracker based on the S-target network, named the single network-based tracker (SNT), is added to the comparison.

Figure 10 shows the overall performance of the 11 trackers in terms of success plots and precision plots. The trackers are ranked according to the area under curve (AUC) values of the success plots in Fig. 10(a) and the precision values at the threshold of 20 pixels in Fig. 10(b). For the success plots, MSNT achieves an AUC value of 0.564 and ranks first among the 11 trackers; compared with DLT and CNT, which are also deep learning-based methods, the AUC of MSNT is higher by 29.4% and 4.0%, respectively. For the precision plots, MSNT achieves a precision of 0.753 and ranks second, behind only CST at 0.777; similarly, the precision of MSNT is 28.3% and 4.1% higher than those of DLT and CNT, respectively.

Fig. 10 The success plots and precision plots of OPE for the trackers: (a) success plots and (b) precision plots.

Comparing the success plots and precision plots of MSNT and SNT, we find that MSNT clearly improves on SNT in both metrics: MSNT's AUC is 6.6% higher than SNT's in the success plots, and its precision is 8.5% higher. These results suggest that our proposed multiscale networks extract more robust and effective features and perform better than a single, fixed network. Together, these experimental data and the above analyses illustrate that our tracker outperforms these state-of-the-art trackers and achieves satisfactory tracking results in different challenging scenarios.

5.2.3. Attribute-based performance for 51 sequences

To further analyze the performance of our tracker under different tracking conditions, we evaluate the trackers on the 11 attributes defined in Ref. 17. The success plots and precision plots for the different attributes are shown in Figs. 11 and 12, respectively. Among the 11 attributes, MSNT ranks first on eight (including "illumination variation," "out-of-plane rotation," "scale variation," "occlusion," "fast motion," "in-plane rotation," "out of view," and "LR") and outperforms SNT on all attributes in Fig. 11. Only on the attributes of "deformation" and "background clutter" does MSNT fail to rank in the top 3 of the success plots. For the precision plots in Fig. 12, MSNT ranks in the top 3 on eight attributes and outperforms SNT on all attributes. In particular, on the attributes of fast motion, out of view, and LR, MSNT attains the best tracking precision. However, MSNT performs worse on the attributes of illumination variation, deformation, and background clutter than some trackers, such as CST, KCF, and LNLT. Several observations follow from these attribute-based data. First, our tracker achieves good performance on most attributes, especially fast motion, out of view, and LR.
Second, our tracker does not perform as well as CST and KCF on some attributes, such as deformation and background clutter, especially in the precision plots; these may be the next research areas for improving our tracker. Third, MSNT outperforms SNT on all attributes in both the success plots and the precision plots, which further proves the effectiveness of the multiscale tracking networks.

5.3. Tracking Speed of the Tracker

On our experimental platform, our tracker achieves a practical tracking speed averaging 13.2 frames per second (FPS) over the 51 sequences. Table 5 shows the tracking speeds of the above 10 trackers; all data are those published by the authors in their papers, and "—" indicates that the authors do not give the tracking speed explicitly. From Table 5, we see that KCF has the highest tracking speed and that our tracker is faster than CST, SST, TGPR, and SCM. Compared with DLT, which is also based on deep learning, our tracker runs slightly slower, but it avoids the complex and time-consuming pretraining process. This property makes the establishment and adjustment of the tracking networks simpler and more flexible than in DLT.

Table 5 The tracking speed comparison for the 10 trackers.
5.4. Discussion

For a more thorough evaluation, we also add the following recent trackers, with their reported results (success rate, precision, FPS), to the comparison: STCT (0.640, 0.780, 2.5),36 RTT (0.588, 0.827, 3 to 4),37 and DLRT (0.512, 0.694, 3).38 Among these trackers, the proposed MSNT (0.564, 0.753, 13.2) performs better than DLRT but is inferior to STCT and RTT. Nevertheless, our tracker has a faster processing speed than these trackers and achieves performance comparable to RTT in success rate and to STCT in precision. There is still room for improvement compared with the best tracker, STCT, which regards the CNN as an ensemble of base learners and trains the convolutional layers with random binary masks; these techniques reduce the correlation between the learned features and prevent overfitting effectively, albeit at a somewhat higher computational cost. Similar to the random binary masks in STCT, a related trick, "dropout,"39 could be used in our tracker to further reduce overfitting.

In this paper, we propose a simple but effective MDSN for achieving real-time tracking. A robust tracker is built on the MDSN without offline pretraining on an auxiliary dataset, and the tracker alleviates "gradient vanishing" during training thanks to the ReLU activation function. However, as shown in Fig. 13, there are some serious failure cases for our MSNT tracker. In "Bolt," the runners have similar appearances, so it is difficult to discriminate the correct target from the others. In "Ironman," a combination of factors (intense lighting changes, similar background, fast motion, rotation, etc.) renders the tracker unable to differentiate the dark target from the noisy background effectively. Finally, in "MotorRolling," the significant rotation and deformation of the target cause the tracking failure. In these cases, our tracker is prone to its largest errors, i.e., tracking drift and target loss, from the very beginning of tracking.

Fig. 13 Some failure cases for our tracker. Red boxes show our results, and the yellow ones are the ground truth. (a) Bolt, (b) Ironman, and (c) MotorRolling.

Analyzing these failure cases, deformation and background clutter may be the main factors causing failure for the proposed MSNT; the rankings of our tracker in Figs. 11 and 12 likewise indicate relatively poor performance on the deformation and background clutter attributes. The above trick for preventing overfitting, such as dropout, could improve the performance of our tracker to some degree, and a combination with more robust and semantic feature extractors, such as CNNs (Ref. 36) or RNNs (Ref. 37), may be a potential solution for both challenging attributes. These problems will be interesting research directions in our future work.

6. Conclusions

In this paper, we proposed an MDSN for extracting robust and powerful features for visual tracking. The intrinsic sparsity of the networks avoids offline pretraining on an auxiliary image dataset and exploits sparser and more robust feature representations. The multiscale networks adaptively select the corresponding tracking network based on the shape of the target, which captures more useful structural information. Combining the MDSN with the particle filter tracking framework, the MSNT tracker is proposed to solve tracking problems.
Through quantitative and qualitative comparisons with state-of-the-art trackers on the challenging tracking benchmark dataset, our proposed tracker achieves satisfactory results and a practicable tracking speed. Furthermore, there are several possible directions to investigate in detail for this work. First, blocking techniques, such as histograms of oriented gradients (HOG),40 can be used in the proposed method to improve performance on the deformation and background clutter attributes. Second, CNNs and other feature extractors may be combined with our proposed method to exploit more robust and semantic features for tracking. Third, more effective classification methods, such as the support vector machine, could be employed instead of the softmax classifier, which may further improve the robustness of tracking.

Acknowledgments

This research was supported by the National Natural Science Foundation of China (No. 61473309) and by the Natural Science Basic Research Plan in Shaanxi Province (Nos. 2015JM6269 and 2016JM6050). The authors declare that no financial interests or conflicts of interest are involved in this article.

References
1. A. W. M. Smeulders et al., "Visual tracking: an experimental survey," IEEE Trans. Pattern Anal. Mach. Intell. 36(7), 1442–1468 (2014). http://dx.doi.org/10.1109/TPAMI.2013.230
2. A. Yilmaz, O. Javed, and M. Shah, "Object tracking: a survey," ACM Comput. Surv. 38(4), 13 (2006). http://dx.doi.org/10.1145/1177352.1177355
3. X. Li et al., "A survey of appearance models in visual object tracking," ACM Trans. Intell. Syst. Technol. 4(4), 58 (2013). http://dx.doi.org/10.1145/2508037.2508039
4. D. A. Ross, J. Lim, and R. S. Lin, "Incremental learning for robust visual tracking," Int. J. Comput. Vision 77(1–3), 125–141 (2008). http://dx.doi.org/10.1007/s11263-007-0075-7
5. T. Z. Zhang et al., "Robust visual tracking via multi-task sparse learning," in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR'12), pp. 2042–2049 (2012). http://dx.doi.org/10.1109/CVPR.2012.6247908
6. X. Jia, H. Lu, and M. H. Yang, "Visual tracking via adaptive structural local sparse appearance model," in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR'12), pp. 1822–1829 (2012). http://dx.doi.org/10.1109/CVPR.2012.6247880
7. K. H. Zhang, L. Zhang, and M. H. Yang, "Real-time compressive tracking," in European Conf. on Computer Vision, pp. 864–877 (2012). http://dx.doi.org/10.1007/978-3-642-33712-3_62
8. Z. Kalal, K. Mikolajczyk, and J. Matas, "Tracking-learning-detection," IEEE Trans. Pattern Anal. Mach. Intell. 34(7), 1409–1422 (2012). http://dx.doi.org/10.1109/TPAMI.2011.239
9. B. Babenko, M. H. Yang, and S. Belongie, "Robust object tracking with online multiple instance learning," IEEE Trans. Pattern Anal. Mach. Intell. 33(8), 1619–1632 (2011). http://dx.doi.org/10.1109/TPAMI.2010.226
10. Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature 521(7553), 436–444 (2015). http://dx.doi.org/10.1038/nature14539
11. X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proc. of the 13th Int. Conf. on Artificial Intelligence and Statistics (AISTATS'10), pp. 249–256 (2010).
12. G. E. Hinton and R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science 313(5786), 504–507 (2006). http://dx.doi.org/10.1126/science.1127647
13. P. Vincent et al., "Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion," J. Mach. Learn. Res. 11(12), 3371–3408 (2010).
14. Y. Zhang, E. H. Zhang, and W. J. Chen, "Deep neural network for halftone image classification based on sparse auto-encoder," Eng. Appl. Artif. Intell. 50(1), 245–255 (2016). http://dx.doi.org/10.1016/j.engappai.2016.01.032
15. V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. of the 27th Int. Conf. on Machine Learning (ICML'10), pp. 807–814 (2010).
16. X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier neural networks," in Proc. of the 14th Int. Conf. on Artificial Intelligence and Statistics (AISTATS'11), pp. 315–323 (2011).
17. Y. Wu, J. Lim, and M. H. Yang, "Online object tracking: a benchmark," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 2411–2418 (2013). http://dx.doi.org/10.1109/CVPR.2013.312
18. J. Li, H. Chang, and J. Yang, "Sparse deep stacking network for image classification," in Proc. Twenty-Ninth AAAI Conf. on Artificial Intelligence (AAAI'15), pp. 3804–3810 (2015).
19. M. Baccouche et al., "Deep learning of split temporal context for automatic speech recognition," in IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP'14), pp. 5459–5463 (2014). http://dx.doi.org/10.1109/ICASSP.2014.6854639
20. O. M. Parkhi, A. Vedaldi, and A. Zisserman, "Deep face recognition," in Proc. British Machine Vision Conf., pp. 1–12 (2015). http://dx.doi.org/10.5244/C.29.41
21. J. Fan et al., "Human tracking using convolutional neural networks," IEEE Trans. Neural Networks 21(10), 1610–1623 (2010). http://dx.doi.org/10.1109/TNN.2010.2066286
22. N. Y. Wang and D. Yeung, "Learning a deep compact image representation for visual tracking," in Proc. Advances in Neural Information Processing Systems (NIPS'13), pp. 809–817 (2013).
23. H. Li, Y. Li, and F. Porikli, "Robust online visual tracking with a single convolutional neural network," in Proc. Asian Conf. on Computer Vision, pp. 194–209 (2014). http://dx.doi.org/10.1007/978-3-319-16814-2_13
24. L. Wang et al., "Video tracking using learned hierarchical features," IEEE Trans. Image Process. 24(4), 1424–1435 (2015). http://dx.doi.org/10.1109/TIP.2015.2403231
25. K. H. Zhang, Q. S. Liu, and Y. Wu, "Robust visual tracking via convolutional networks without training," IEEE Trans. Image Process. 25(6), 1779–1792 (2016). http://dx.doi.org/10.1109/TIP.2016.2531283
26. D. Arpit et al., "Why regularized auto-encoders learn sparse representation?," in Proc. Int. Conf. on Machine Learning (ICML'15), pp. 134–144 (2015).
27. J. Li et al., "Sparseness analysis in the pretraining of deep neural networks," IEEE Trans. Neural Networks Learn. Syst. PP(99), 1–14 (2016). http://dx.doi.org/10.1109/TNNLS.2016.2541681
28. F. S. Wang, "Particle filters for visual tracking," in Proc. Advanced Research on Computer Science and Information Engineering, pp. 107–112 (2011). http://dx.doi.org/10.1007/978-3-642-21402-8_17
29. J. F. Henriques et al., "High-speed tracking with kernelized correlation filters," IEEE Trans. Pattern Anal. Mach. Intell. 37(3), 583–596 (2015). http://dx.doi.org/10.1109/TPAMI.2014.2345390
30. J. Gao et al., "Transfer learning based visual tracking with Gaussian processes regression," Lect. Notes Comput. Sci. 8691(1), 188–203 (2014). http://dx.doi.org/10.1007/978-3-319-10578-9_13
31. W. Zhong, H. Lu, and M. H. Yang, "Robust object tracking via sparsity-based collaborative model," in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR'12), pp. 1838–1845 (2012). http://dx.doi.org/10.1109/CVPR.2012.6247882
32. S. Hare, A. Saffari, and P. H. Torr, "Struck: structured output tracking with kernels," in IEEE Int. Conf. on Computer Vision (ICCV'11), pp. 263–270 (2011). http://dx.doi.org/10.1109/ICCV.2011.6126251
33. T. Z. Zhang et al., "Structural sparse tracking," in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR'15), pp. 150–158 (2015). http://dx.doi.org/10.1109/CVPR.2015.7298610
34. B. Ma et al., "Linearization to nonlinear learning for visual tracking," in IEEE Int. Conf. on Computer Vision (ICCV'15), pp. 4400–4407 (2015). http://dx.doi.org/10.1109/ICCV.2015.500
35. T. Z. Zhang, A. Bibi, and B. Ghanem, "In defense of sparse tracking: circulant sparse tracker," in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR'16), pp. 3880–3888 (2016). http://dx.doi.org/10.1109/CVPR.2016.421
36. L. J. Wang et al., "STCT: sequentially training convolutional networks for visual tracking," in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR'16), pp. 1373–1381 (2016). http://dx.doi.org/10.1109/CVPR.2016.153
37. Z. Cui et al., "Recurrently target-attending tracking," in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR'16), pp. 1449–1458 (2016). http://dx.doi.org/10.1109/CVPR.2016.161
38. Y. Sui, Y. F. Tang, and L. Zhang, "Discriminative low-rank tracking," in Proc. IEEE Int. Conf. on Computer Vision, pp. 3002–3010 (2015). http://dx.doi.org/10.1109/ICCV.2015.344
39. G. E. Hinton et al., "Improving neural networks by preventing co-adaptation of feature detectors," Comput. Sci. 3(4), 212–223 (2012).
40. N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in IEEE Computer Society Conf. on Computer Vision and Pattern Recognition (CVPR'05), pp. 886–893 (2005). http://dx.doi.org/10.1109/CVPR.2005.177
Biography

Xin Wang received his BS degree from Air Force Engineering University (AFEU), Xi'an, China, in 2015. He is currently a master's candidate at the Information and Navigation College, AFEU. His current research interests include pattern recognition, computer vision, and machine learning.

Zhiqiang Hou graduated from AFEU and received his MS degree in 1998. He received his PhD from Xi'an Jiaotong University in 2005. He was a visiting scholar at the University of Bristol, UK, in 2009. He is currently a professor at AFEU. His research interests include pattern recognition, computer vision, image processing, and information fusion.

Wangsheng Yu received his MS degree and PhD in communication and information systems from AFEU in 2010 and 2014, respectively. He is currently a lecturer at the Information and Navigation College, AFEU. His research interests include computer vision and image processing.

Yang Xue received his BS degree in information engineering from AFEU, Xi'an, China, in 2015. He is currently working toward his graduate degree in the group of Professor L. H. Ma. His research interests include quantum information and machine learning.