Two step vehicle trajectory reconstruction strategy based on an improved outlier detection and data smoothing algorithm
Rui Cao
Proceedings Volume 13018, International Conference on Smart Transportation and City Engineering (STCE 2023); 130184R (14 February 2024). https://doi.org/10.1117/12.3024008
Event: International Conference on Smart Transportation and City Engineering (STCE 2023), 2023, Chongqing, China
Abstract
Vehicle trajectory data contain abundant traffic information and play a key role in traffic flow research. Due to unsuitable sensor positions or perspectives, existing vehicle trajectory data sets contain local errors. To address this problem, a two-step vehicle trajectory data reconstruction strategy is proposed in this paper. In the first step, an outlier detection algorithm combining the cluster-based local outlier factor (CBLOF) and the k-means clustering algorithm (K-CBLOF) is proposed to identify outliers in the target dataset. In the second step, the Savitzky-Golay (SG) filter is selected to smooth the noisy data stream. The prototype data of the I-80 highway in the open-source trajectory data set NGSIM are used for a case study. Taking vehicle 1 as an example, its speed and acceleration data are reconstructed, and time windows of length 11 and 21 are used to implement the SG filter. The results show that with a window length of 21 the trajectory data are better smoothed and the data noise is largely eliminated, which verifies that the trajectory data smoothing strategy proposed in this paper has a certain reliability.

1. INTRODUCTION

Accurate and reliable vehicle trajectory data can provide information such as vehicle position, speed, acceleration, queue length and travel time, which is of great significance for the development of intelligent transportation systems and for urban transportation system management. In previous studies, vehicle trajectory data have been widely used in traffic state estimation, traffic flow modeling, signal optimization, and vehicle fuel consumption and emission studies [1][2]. At present, two types of traffic sensors are widely used to collect vehicle trajectory information, namely fixed and mobile sensors. Fixed sensors have the advantages of simple operation and a consistent detection target and monitoring environment, and they can provide accurate target positions and speeds under various conditions. Perhaps the best-known vehicle trajectory dataset is the Next Generation Simulation (NGSIM) dataset [3], which is widely used in vehicle trajectory reconstruction, lane-change analysis, performance evaluation of traffic flow operations, etc. [4][5][6][7][10][11][12][13][14].

However, errors in the trajectories affect the accuracy of driving behaviour evaluation [15]. Erroneous position data affect the estimation of kinematic quantities, and the measurement errors also have adverse effects on the calibration of car-following models [8][17]. As Thiemann et al [6] have already pointed out, the distribution of accelerations in the NGSIM datasets is unrealistic. Coifman and Li [17] critically evaluated the vehicle trajectory data extracted from the NGSIM video and found that the magnitude of the errors depends on vehicle speed, location, and length. According to the study conducted by Ossen and Hoogendoorn [16], the influence of trajectory data noise on the estimated parameters of car-following models is non-negligible. It is therefore important for experimentalists to remove data noise as much as possible without compromising data quality [18].

Yet the quality of trajectory data and its impact on the reliability of related studies was a vastly underestimated problem in the traffic literature even before the availability of the NGSIM data [9]. In recent years, research on improving trajectory data quality has mainly focused on noise (outlier) recognition and vehicle trajectory reconstruction. Fitters et al [19] proposed OE-LSTM (Outlier-Enriched LSTM), a novel framework for traffic flow prediction that identifies outliers deviating from regular traffic flow. Fard et al [20] proposed a two-step technique based on wavelet analysis, primarily using a wavelet transform with dedicated processing steps to identify and modify outliers. Montanino and Punzo [4] presented a multistep filtering method aimed at eliminating outliers that lead to nonphysical acceleration values by local reconstruction of the vehicle trajectory. Wei et al [21] adopted a particle filter (PF)-based approach to reconstruct the missing trajectory between consecutive position updates of vehicles on the main road of single and multiple intersections.

Although many studies have made great contributions to noise elimination and reconstruction of vehicle trajectory data, few have combined outlier recognition and trajectory reconstruction into a unified framework; most focus on only one aspect. This paper proposes a two-step data reconstruction strategy to detect outliers in, and reconstruct, the target data set. In the first step, a local outlier identification algorithm based on k-means clustering is used to perform cluster analysis on the target vehicle trajectory data set and to identify the outliers in the data. In the second step, the Savitzky-Golay filter is used to smooth and de-noise the data stream containing outliers, which not only improves the smoothness of the target data but also reduces the interference of noise with the trajectory data.

2. TWO-STEP VEHICLE TRAJECTORY DATA RECONSTRUCTION STRATEGY

The main objective of the two-step vehicle trajectory data reconstruction strategy proposed in this paper is to effectively identify the outliers of the target data set, select a suitable filter to smooth the data containing noise, and ensure that the data are not distorted in the process of noise removal.

2.1 Vehicle trajectory data outlier recognition based on the K-CBLOF algorithm

(1) Definition of cluster-based local outlier.

The cluster-based local outlier factor (CBLOF) detection method applies the idea of local outlier factor detection to clusters. The basic idea of CBLOF is to cluster the data first and then divide the clusters into large and small ones; outliers are few in number and tend to deviate strongly from most normal data. The distance between a record and the large clusters is therefore used to measure how abnormal the record is: the larger the distance, the more abnormal the record.

This paper is based on the CBLOF proposed by He et al [22]: the target data set is divided into different clusters, data belonging to small clusters are regarded as outliers, and such cluster-based outliers are "local" to specified clusters. To quantify the physical significance of outliers, an outlier factor, the cluster-based local outlier factor, is assigned to each object; it is measured by both the size of the cluster the object belongs to and the distance between the object and its closest large cluster (when the object lies in a small cluster). The relevant cluster concepts are introduced below.

Define |S| as the size of a set S, i.e., the number of elements it contains.

  • (ⅰ) Data clustering:

    Let D1, …, Dm be a set of data records, denoted as set D, and let C1, …, Ck be the clusters of these data, denoted as set C. After the clustering algorithm is applied, the resulting clusters must satisfy C = {C1, C2, …, Ck}, Ci ∩ Cj = ∅ for i ≠ j, and C1 ∪ C2 ∪ … ∪ Ck = D, where k is the number of clusters.

  • (ⅱ) Large clusters (LC) and small clusters (SC)

    Let C = {C1, C2, …, Ck} be a set of clusters that meets the data clustering conditions, with the clusters sorted by size from largest to smallest, that is, |C1| ≥ |C2| ≥ ⋯ ≥ |Ck|. Two parameters α and β are defined, and b is the boundary index between large and small clusters. Then b needs to meet one of the following two conditions:

    |C_1| + |C_2| + \cdots + |C_b| \ge \alpha \cdot |D| \quad \text{or} \quad |C_b| / |C_{b+1}| \ge \beta \qquad (1)

    where LC = {Ci | i ≤ b} and SC = {Cj | j > b}. When α = 0.9, the large clusters contain more than 90% of the data; β = 5 indicates that the boundary cluster contains 5 times as many records as the next (smaller) cluster.

  • (ⅲ) Cluster-based local outlier factor

    Suppose C = {C1, C2, …, Ck} is a set of clusters satisfying |C1| ≥ |C2| ≥ ⋯ ≥ |Ck|, with the parameters defined as above. For any record t, the cluster-based local outlier factor is defined as:

    CBLOF(t) = \begin{cases} |C_i| \cdot \min\{\mathrm{dist}(t, C_j) \mid C_j \in LC\}, & t \in C_i,\ C_i \in SC \\ |C_i| \cdot \mathrm{dist}(t, C_i), & t \in C_i,\ C_i \in LC \end{cases} \qquad (2)

    From equation (2), the CBLOF of a record is determined not only by the size of its cluster but also by a distance. If the record lies in a small cluster, the distance between the record and its closest large cluster is considered; otherwise, the distance between the record and the cluster it belongs to is used, which captures the local behaviour of the data. A minimal sketch of this scoring procedure is given after this list.
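To make the scoring rule concrete, the following is a minimal, illustrative Python sketch of the CBLOF computation described by equations (1) and (2). The function name, the boundary-search loop and the use of pre-computed k-means labels and centroids are assumptions introduced only for illustration, not part of the original paper.

```python
import numpy as np

def cblof_scores(X, labels, centers, alpha=0.9, beta=5):
    """Illustrative CBLOF scoring following the definitions above.

    X       : (N, d) array of records
    labels  : cluster index of each record (e.g. from k-means)
    centers : (k, d) array of cluster centroids
    """
    # Sort clusters by size, largest first: |C1| >= |C2| >= ... >= |Ck|
    sizes = np.bincount(labels, minlength=len(centers))
    order = np.argsort(sizes)[::-1]

    # Find the boundary b between large clusters (LC) and small clusters (SC)
    cum = np.cumsum(sizes[order])
    b = len(order) - 1                      # default: all clusters are "large"
    for i in range(len(order) - 1):
        if cum[i] >= alpha * len(X) or sizes[order[i]] >= beta * sizes[order[i + 1]]:
            b = i
            break
    large = set(order[:b + 1])

    scores = np.empty(len(X))
    for idx, (x, c) in enumerate(zip(X, labels)):
        if c in large:
            # record in a large cluster: distance to its own centroid
            dist = np.linalg.norm(x - centers[c])
        else:
            # record in a small cluster: distance to the closest large cluster
            dist = min(np.linalg.norm(x - centers[j]) for j in large)
        scores[idx] = sizes[c] * dist       # weighted by the size of the record's cluster
    return scores
```

Records with the largest scores are the candidate outliers; in Section 3 the same idea is applied through an off-the-shelf implementation.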

(2) K-means clustering algorithm.

In this paper, the k-means algorithm [24] is used as the clustering algorithm; it produces good clustering results and has good stability. A brief introduction to the k-means algorithm follows, and its process is shown in Figure 1.

Figure 1. K-means algorithm description

Suppose the target object is x and \bar{x}_i denotes the mean of cluster C_i. The sum of squared errors E of all objects, measured by the Euclidean distance, is used as the criterion function, defined as follows:

E = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \bar{x}_i \rVert^2

The Euclidean distance between two vectors x = (x1, …, xn) and y = (y1, …, yn) is obtained as follows:

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

In the k-means algorithm, the distance between each object and the cluster centers is recomputed over several iterations. When the cluster centers are no longer updated, the algorithm has converged. The exact number of iterations t depends on the initial cluster centers and on the distribution of the data points in the target data set. Let n be the number of data objects, k the number of clusters, and t the number of iterations; usually k ≪ n and t ≪ n.
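For illustration, the minimal sketch below (using scikit-learn, with synthetic one-dimensional data standing in for a trajectory variable such as speed) runs k-means and verifies that the criterion E defined above coincides with the library's reported inertia; the toy data and the choice k = 3 are assumptions made only for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 1-D feature (e.g. instantaneous speed); in practice this would be a
# column of the NGSIM trajectory table.
rng = np.random.default_rng(0)
speed = np.concatenate([rng.normal(20, 2, 500), rng.normal(5, 1, 500), [60.0, 65.0]])
X = speed.reshape(-1, 1)

k = 3
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

# Criterion E: sum of squared Euclidean distances of every object to the
# centroid of its cluster. scikit-learn exposes the same quantity as inertia_.
E = sum(np.sum((X[km.labels_ == i] - km.cluster_centers_[i]) ** 2) for i in range(k))
assert np.isclose(E, km.inertia_)
print(f"k = {k}, SSE criterion E = {E:.2f}")
```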

The K-CBLOF algorithm for detecting outliers is listed in Figure 2.

Figure 2. K-CBLOF algorithm description

2.2 Vehicle trajectory data smoothing based on the Savitzky-Golay filter

The Savitzky-Golay filter [18][23], originally proposed by A. Savitzky and M. J. E. Golay in 1964, is widely used to smooth data streams by removing noise. It performs a best fit in the time domain by moving a window over the data and fitting a polynomial by the least squares method. The method is relatively simple and fast, and it preserves the relative maxima, minima and width distribution characteristics better than other similar averaging methods.

Let u(i) be a set of data, where i takes 2m + 1 consecutive integer values, i.e., i = −m, …, 0, …, m. An n-th order polynomial (n ≤ 2m + 1) is constructed to fit this set of data:

f_i = \sum_{k=0}^{n} h_{nk} i^k

The sum of squared errors over these points is:

M = \sum_{i=-m}^{m} (f_i - u(i))^2 = \sum_{i=-m}^{m} \left[ \sum_{k=0}^{n} h_{nk} i^k - u(i) \right]^2

To minimize M, the derivative of M with respect to each coefficient is set to 0:

\frac{\partial M}{\partial h_{nr}} = 2 \sum_{i=-m}^{m} \left[ \sum_{k=0}^{n} h_{nk} i^k - u(i) \right] i^r = 0, \quad r = 0, 1, \ldots, n

\sum_{k=0}^{n} h_{nk} S_{k+r} = F_r

S_{k+r} = \sum_{i=-m}^{m} i^{k+r}, \qquad F_r = \sum_{i=-m}^{m} i^r u(i)

Given the number of unilateral points m to be fitted, the order n of the polynomial and the data u(−m), …, u(0), …, u(m), the quantities F_r can be computed. By substituting S_{k+r} into the normal equations, the coefficients h_{n0}, h_{n1}, …, h_{nn} can be found and the polynomial f_i can be determined.

When k + r is odd,

S_{k+r} = \sum_{i=-m}^{m} i^{k+r} = 0

so the terms S_{k+r} = 2\sum_{i=1}^{m} i^{k+r}, r = 0, 1, 2, …, n, are nonzero only when k + r is an even number.

When n and s are both even or both odd:

00172_PSISDG13018_130184R_page_5_3.jpg

In practice, it is often not necessary to calculate all the coefficients h_{n0}, h_{n1}, …, h_{nn}; the following can be obtained:

00172_PSISDG13018_130184R_page_5_4.jpg
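Although the derivation above explains how the coefficients are obtained, in practice they need not be computed by hand. The sketch below, assuming SciPy is available, obtains the coefficients for a window of 2m + 1 = 21 points and a cubic polynomial (the polynomial order is an assumption; the paper does not state it) and checks that applying them to a noisy test signal reproduces scipy.signal.savgol_filter at an interior point.

```python
import numpy as np
from scipy.signal import savgol_coeffs, savgol_filter

# Window of 2m + 1 points and polynomial order n, as in the derivation above.
m, n = 10, 3                       # window length 21, cubic polynomial (assumed)
window = 2 * m + 1

# The convolution coefficients are fixed once m and n are chosen.
h = savgol_coeffs(window, n, use='dot')

# Smoothing a noisy signal: the local least-squares polynomial fit is
# equivalent to applying these coefficients, which savgol_filter does directly.
t = np.linspace(0, 10, 201)
u = np.sin(t) + np.random.default_rng(1).normal(0, 0.1, t.size)
u_smooth = savgol_filter(u, window_length=window, polyorder=n)

# At an interior point the two computations agree.
i = 100
assert np.isclose(np.dot(h, u[i - m:i + m + 1]), u_smooth[i])
```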

3. CASE STUDY

Researchers for the NGSIM program collected detailed vehicle trajectory data on eastbound I-80 in the San Francisco Bay area in Emeryville, CA, on April 13, 2005. As shown in Figure 3, the study area was approximately 500 meters (1,640 feet) in length and consisted of six freeway lanes, including a high-occupancy vehicle (HOV) lane. An on-ramp was also located within the study area. Seven synchronized digital video cameras, mounted on top of a 30-story building adjacent to the freeway, recorded vehicles passing through the study area. The resulting trajectory data provide the precise location of each vehicle within the study area every one-tenth of a second, yielding detailed lane positions and locations relative to other vehicles.

Figure 3. The aerial photograph on the left shows the extent of the I-80 study area (from: https://www.fhwa.dot.gov/publications/research/operations/06137/index.cfm)

A total of 45 minutes of data are available in the full dataset, segmented into three 15-minute periods: 4:00 p.m. to 4:15 p.m.; 5:00 p.m. to 5:15 p.m.; and 5:15 p.m. to 5:30 p.m. These periods represent the buildup of congestion, or the transition between uncongested and congested conditions, and full congestion during the peak period. In order to observe the error existing in the original data more clearly, the data from 4:00 to 4:15 was chosen to verify the strategy proposed in this paper.

In the prototype data set, two-thirds of all accelerations are beyond ±3 m/s². The example trajectory shows the driver allegedly alternating between hard acceleration and hard deceleration several times per second, which is unrealistic [6]. Table 1 lists the field meanings of the NGSIM trajectory data set; a quick unit-aware check of this acceleration claim is sketched after the table.

Table 1. Parameter naming and definition

Column Name | Description | Unit
Vehicle_ID | Vehicle identification number | -
Frame_ID | Frame identification number | -
Total_Frames | Total number of frames in which the vehicle appears in this data set | -
Global_Time | Elapsed time in milliseconds | ms
Local_X | Lateral (X) coordinate in the data acquisition area coordinate system | ft
Local_Y | Longitudinal (Y) coordinate in the data acquisition area coordinate system | ft
Global_X | Lateral (X) coordinate in the standard geographic coordinate system | ft
Global_Y | Longitudinal (Y) coordinate in the standard geographic coordinate system | ft
v_length | Length of vehicle | ft
v_Width | Width of vehicle | ft
v_Class | Vehicle type: 1 - motorcycle, 2 - auto, 3 - truck | -
v_Vel | Instantaneous velocity of vehicle | ft/s
v_Acc | Instantaneous acceleration of vehicle | ft/s²
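As a quick, unit-aware sanity check of the acceleration claim above, the following sketch computes the share of acceleration samples beyond ±3 m/s². It assumes a hypothetical CSV export of the 4:00-4:15 p.m. period with the Table 1 columns; the file name is illustrative only.

```python
import pandas as pd

FT_TO_M = 0.3048   # Table 1 units are feet and ft/s^2

# Hypothetical path to a CSV export of the 4:00-4:15 p.m. I-80 period with the
# columns listed in Table 1 (Vehicle_ID, Frame_ID, ..., v_Vel, v_Acc).
df = pd.read_csv("i80_0400-0415.csv")

acc_ms2 = df["v_Acc"] * FT_TO_M
share_extreme = (acc_ms2.abs() > 3.0).mean()
print(f"Share of acceleration samples beyond ±3 m/s²: {share_extreme:.1%}")
```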

3.1 Vehicle trajectory data set outlier detection

The CBLOF algorithm of PyOD in Python was used to identify outliers in the I-80 data set from 4:00-4:15 p.m., with k-means as the clustering algorithm. The parameter α is assigned a value of 0.9, meaning that the large clusters contain more than 90% of the data, and β is assigned a value of 5, indicating that a large cluster must contain at least 5 times as many records as the next (smaller) cluster. When a data point belongs to a large cluster, its distance to the center of that cluster is calculated; when it belongs to a small cluster, its distance to the center of the nearest large cluster is calculated. By sorting the outlier factors from largest to smallest, outliers can be picked out; a minimal sketch of this step is given below. The analysis results for vehicle speed and acceleration are shown in Figure 4.
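The sketch below assumes PyOD and scikit-learn are installed and that the 4:00-4:15 p.m. data are available as a hypothetical CSV file with the Table 1 columns. The number of clusters, the contamination rate and the use of speed and acceleration as the two features are assumptions made for illustration, since the paper does not report them.

```python
import pandas as pd
from sklearn.cluster import KMeans
from pyod.models.cblof import CBLOF

# Hypothetical CSV export of the 4:00-4:15 p.m. I-80 period (see Table 1).
df = pd.read_csv("i80_0400-0415.csv")
veh1 = df[df["Vehicle_ID"] == 1].sort_values("Frame_ID")

# Score each sample by speed and acceleration jointly (assumed feature choice).
X = veh1[["v_Vel", "v_Acc"]].to_numpy()

clf = CBLOF(
    n_clusters=8,                               # assumed; the paper does not state k
    alpha=0.9,                                  # large clusters hold >= 90% of the data
    beta=5,                                     # size-ratio threshold between clusters
    clustering_estimator=KMeans(n_clusters=8, n_init=10, random_state=0),
    contamination=0.05,                         # assumed outlier share
)
clf.fit(X)

scores = clf.decision_scores_                   # cluster-based local outlier factors
outliers = veh1[clf.labels_ == 1]               # label 1 marks detected outliers
print(outliers[["Frame_ID", "v_Vel", "v_Acc"]].head())
```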

Figure 4. Vehicle speed and acceleration outlier analysis. (a): speed; (b): acceleration

The analysis shows obvious abrupt changes in the speed and acceleration of the vehicle, which are unrealistic in real driving and consistent with previous research. Compared with the speed, the abrupt changes in vehicle acceleration are relatively large, and acceleration is therefore the main target of data smoothing in this paper.

3.2 Vehicle trajectory data smoothing

In this paper, the Savitzky-Golay filter is used to smooth the vehicle speed and acceleration. Three main approaches can be used to smooth the dataset: (ⅰ) smooth all four variables, i.e. X, Y, velocities and accelerations; (ⅱ) smooth X and Y, then differentiate to obtain velocities and accelerations; (ⅲ) smooth X and Y, differentiate to obtain velocities, smooth the velocities, differentiate to obtain accelerations, and finally smooth the accelerations.

This paper relies on the second approach and applies the filter to the Local_X and Local_Y values rather than to Global_X and Global_Y, since the latter are based on the California State Plane coordinate system. The width of the filter window is 2m + 1; windows of 11 and 21 points are used separately, because they represent, respectively, 10 and 20 data points in addition to the target data point. In other words, roughly 1 second and 2 seconds of trajectory data are used to smooth each point. The two resulting datasets are stored in the folders window-11 and window-21, respectively; a sketch of this pipeline is given below. Taking vehicle 1 as an example, the smoothed speed and acceleration data are shown in Figure 5.
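A sketch of this second approach, assuming the vehicle's rows are available as a pandas DataFrame with the Table 1 columns (as in the earlier detection sketch) and that a cubic fitting polynomial is used (an assumption; the paper does not state the order), could look as follows.

```python
import numpy as np
from scipy.signal import savgol_filter

FT_TO_M = 0.3048
DT = 0.1                 # NGSIM sampling interval: one frame every 0.1 s

def reconstruct(veh, window=21, polyorder=3):
    """Approach (ii): smooth Local_X / Local_Y, then differentiate.

    `veh` is one vehicle's rows in the Table 1 layout, sorted by Frame_ID.
    Column names and the polynomial order are assumptions for illustration.
    """
    x = savgol_filter(veh["Local_X"].to_numpy() * FT_TO_M, window, polyorder)
    y = savgol_filter(veh["Local_Y"].to_numpy() * FT_TO_M, window, polyorder)

    # Speed and acceleration follow by numerical differentiation of the
    # smoothed longitudinal position (I-80 traffic runs along Local_Y).
    speed = np.gradient(y, DT)
    accel = np.gradient(speed, DT)
    return x, y, speed, accel

# Example: compare both window lengths used in the paper for vehicle 1
# (veh1 as defined in the detection sketch above):
# for w in (11, 21):
#     _, _, v, a = reconstruct(veh1, window=w)
```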

Figure 5. Smoothing results of vehicle speed and acceleration. (a): speed; (b): acceleration

As can be seen from Figure 5, the acceleration noise in the original data is more pronounced than the speed noise, which is consistent with the analysis above. After smoothing the speed and acceleration with the SG filter, the smoothed data are clearly improved compared with the prototype data. In Figure 5, the black line represents the prototype data, and the red and blue lines represent the results after smoothing with window lengths of 11 and 21, respectively. The analysis shows that a filter window length of 21 yields a greater improvement in data quality after smoothing; that is, the larger the window, the more significant the noise-elimination effect of the SG filter in this case.

4. CONCLUSION

This study points out the root mechanism of errors in the NGSIM data, a widely used prototype dataset. Some of the data are seriously affected by errors in the measured vehicle spatial coordinates, which are further amplified by the numerical differentiation used to compute velocity and acceleration values. Based on previous studies, the main cause of measurement errors in velocity and acceleration is the poor positioning of points in space. Therefore, a two-step vehicle trajectory data reconstruction strategy is proposed in this paper to eliminate outliers in vehicle velocity and acceleration and to remove residual random noise through local reconstruction of the vehicle trajectory.

Firstly, the CBLOF algorithm based on k-means clustering is used to identify outliers in vehicle speed and acceleration. This step divides the data into large and small clusters and determines outliers by calculating the distance between each record and the large clusters. Then, the noisy data are smoothed locally with the SG filter: the vehicle speed and acceleration are smoothed via Local_X and Local_Y using windows of length 11 and 21, respectively, and compared with the prototype data.

The analysis shows that the two-step vehicle trajectory data reconstruction strategy proposed in this paper can effectively identify the outliers in the data set, and that data quality after SG filtering and smoothing is significantly improved. In addition, in the case studied here, the larger the filtering window, the better the smoothing effect. In summary, the proposed strategy can effectively eliminate outliers and reconstruct the data without distorting it.

REFERENCES

[1] Guo, Q., Li, L., Ban, X. J., "Urban traffic signal control with connected and automated vehicles: a survey," Transp. Res. Part C Emerg. Technol., 101, 313-334 (2019). https://doi.org/10.1016/j.trc.2019.01.026

[2] Li, L., Jiang, R., He, Z., Chen, X. M., Zhou, X., "Trajectory data-based traffic flow studies: a revisit," Transp. Res. Part C Emerg. Technol., 114, 225-240 (2020). https://doi.org/10.1016/j.trc.2020.02.016

[3] NGSIM, "Next Generation Simulation (NGSIM) Vehicle Trajectories and Supporting Data," [Online] (2016). https://data.transportation.gov/Automobiles/Next-Generation-Simulation-NGSIM-Vehicle-Trajector/8ect-6jqj

[4] Montanino, M., Punzo, V., "Making NGSIM data usable for studies on traffic flow theory," Transp. Res. Rec., 2390(1), 99-111 (2013). https://doi.org/10.3141/2390-11

[5] Altche, F., de La Fortelle, A., "An LSTM network for highway trajectory prediction," in Proceedings of the 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), 353-359 (2017). https://doi.org/10.1109/itsc.2017.8317913

[6] Thiemann, C., Treiber, M., Kesting, A., "Estimating acceleration and lane-changing dynamics from next generation simulation trajectory data," Transp. Res. Rec., 2088(1), 90-101 (2008). https://doi.org/10.3141/2088-10

[7] Lei, C., Zhao, C., Ji, Y., Shen, Y., Du, Y., "Identifying and correcting the errors of vehicle trajectories from roadside millimetre-wave radars," IET Intell. Transp. Syst., 17, 418-434 (2023). https://doi.org/10.1049/itr2.v17.2

[8] Chen, X., Yin, J., Qin, G., Tang, K., Wang, Y., Sun, J., "Integrated macro-micro modelling for individual vehicle trajectory reconstruction using fixed and mobile sensor data," Transp. Res. Part C Emerg. Technol., 145, 103929 (2022). https://doi.org/10.1016/j.trc.2022.103929

[9] Punzo, V., Borzacchiello, M. T., Ciuffo, B., "On the assessment of vehicle trajectory data accuracy and application to the Next Generation SIMulation (NGSIM) program data," Transp. Res. Part C Emerg. Technol., 19(6), 1243-1262 (2011). https://doi.org/10.1016/j.trc.2010.12.007

[10] Yang, L., Fang, S., Wu, G. Y., Sheng, H., Xu, Z. G., Zhang, M. X., Zhao, X. M., "Physical model versus artificial neural network (ANN) model: a comparative study on modeling car-following behavior at signalized intersections," J. Adv. Transp., (3), 1-18 (2022). https://doi.org/10.1155/2022/8482846

[11] Hu, X., Zheng, Z., Chen, D., Zhang, X., Sun, J., "Processing, assessing, and enhancing the Waymo autonomous vehicle open dataset for driving behavior research," Transp. Res. Part C Emerg. Technol., 134, 103490 (2022). https://doi.org/10.1016/j.trc.2021.103490

[12] Montanino, M., Punzo, V., "Trajectory data reconstruction and simulation-based validation against macroscopic traffic patterns," Transp. Res. Part B Methodol., 80, 82-106 (2015). https://doi.org/10.1016/j.trb.2015.06.010

[13] Chen, P., Wang, T., Zheng, N., "Reconstructing vehicle trajectories on freeways based on motion detection data of connected and automated vehicles," J. Intell. Transp. Syst., 26(6), 1-16 (2021). https://doi.org/10.1080/15472450.2021.1955211

[14] Venthuruthiyil, S. P., Chunchu, M., "Trajectory reconstruction using locally weighted regression: a new methodology to identify the optimum window size and polynomial order," Transportmetrica A Transp. Sci., 1-20 (2018). https://doi.org/10.1080/23249935.2018.1449032

[15] van Beinum, A., Farah, H., Wegman, F., Hoogendoorn, S., "Driving behaviour at motorway ramps and weaving segments based on empirical trajectory data," Transp. Res. Part C Emerg. Technol., 92, 426-441 (2018). https://doi.org/10.1016/j.trc.2018.05.018

[16] Ossen, S., Hoogendoorn, S. P., "Validity of trajectory-based calibration approach of car-following models in presence of measurement errors," Transp. Res. Rec., (2088), 117-125 (2008). https://doi.org/10.3141/2088-13

[17] Coifman, B., Li, L., "A critical evaluation of the Next Generation Simulation (NGSIM) vehicle trajectory dataset," Transp. Res. Part B Methodol., 105, 362-377 (2017). https://doi.org/10.1016/j.trb.2017.09.018

[18] Savitzky, A., Golay, M. J. E., "Smoothing and differentiation of data by simplified least squares procedures," Anal. Chem., 36, 1627-1639 (1964). https://doi.org/10.1021/ac60214a047

[19] Fitters, W., Cuzzocrea, A., Hassani, M., "Enhancing LSTM prediction of vehicle traffic flow data via outlier correlations," in Proceedings of the 2021 IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC), 210-217 (2021). https://doi.org/10.1109/COMPSAC51774.2021.00039

[20] Fard, M. R., Shariat Mohaymany, A., Shahri, M., "A new methodology for vehicle trajectory reconstruction based on wavelet analysis," Transp. Res. Part C Emerg. Technol., 74, 150-167 (2017). https://doi.org/10.1016/j.trc.2016.11.010

[21] Wei, L., Wang, Y., Chen, P., "A particle filter-based approach for vehicle trajectory reconstruction using sparse probe data," IEEE Trans. Intell. Transp. Syst., 22(5), 2878-2890 (2020). https://doi.org/10.1109/TITS.2020.2976671

[22] He, Z., Xu, X., Deng, S., "Discovering cluster-based local outliers," Pattern Recognit. Lett., 24(9-10), 1641-1650 (2003). https://doi.org/10.1016/S0167-8655(03)00003-5

[24] Shi, N., Liu, X. M., Guan, Y., "Research on k-means clustering algorithm: an improved k-means clustering algorithm," in Third International Symposium on Intelligent Information Technology and Security Informatics (IITSI 2010), 63-67 (2010). https://doi.org/10.1109/IITSI.2010.74