|
1. INTRODUCTION
With the continued promotion of smart city construction and the rapid development of technologies such as the Internet and big data, smart city data platforms are aggregating ever more multi-source urban data, and the volume is growing explosively. Most of these data come from geographic information system (GIS) software, which is an important platform for data management. Different GIS packages use their own data formats, and the technical constraints of different periods also produced differing formats; together these give rise to spatial data heterogeneity [1]. In a broad sense, multi-source spatial data can include multiple data sources, multiple data formats, multiple spatio-temporal data, multiple scales (multi-precision) and multiple semantic levels [2]. In a narrow sense, it refers mainly to the variety of data formats, including the differing formats of different data sources and the differing storage formats that result from different data structures [3-5]. This variety creates the multi-source data problem in urban planning. Finding the correlations among multi-source data according to actual needs, and thereby helping city managers better understand the state of city operation and development, is one of the core tasks of current smart city construction.
2. RELATED RESEARCH
At present, although multi-source data are managed centrally through data aggregation in physical storage, the data still reside in separate data tables, and different tables define and describe the fields for the same city object differently. Because these correlations are missing, it is difficult to link data from different sources efficiently and to provide users with in-depth data support services.
The traditional method is to filter table information manually and then add correlations between the tables by hand. This can be tailored to actual needs and suits data management and application needs in different scenarios. However, urban-social big data typically has diverse sources and formats, scattered storage and large volumes, which makes collaborative data representation, information aggregation, information derivation and value addition, and data mining very difficult. The manual approach is consequently inefficient, cumbersome and error-prone [6]. Automated methods are therefore recommended: tools such as machine learning algorithms, natural language processing techniques or graph databases can learn and calculate the correlations between data automatically, reducing manual intervention and improving data processing efficiency. In addition, an integrated management system based on GIS and graphical hypertext (GBH) technology can manage complex heterogeneous data and information relationships such as graphics, attributes and multimedia in a simple, intuitive and integrated way, making urban spatial planning and management more efficient [7]. Building on the results of data governance, a method that supports the automatic construction of data associations is an important route to efficient association services across multiple data sources in smart cities. The key to fusing multi-source spatio-temporal big data lies in merging features of geographic entities such as geometric location, semantic attributes and topological relationships.
In other words, only by fusing entities with the same name across multi-source, multi-scale spatio-temporal big data, and by ensuring a consistent mapping between the data and the physical urban space, can the different data be brought together into a higher-quality spatio-temporal database that meets application requirements and supports urban management and intelligent services [8-9]. Before the data can be correlated, they must therefore be pre-processed, including the removal of redundant data, data desensitisation and data cleaning. Among these steps, it is particularly important to provide accurate and complete Chinese descriptions of the data fields. Next, according to the application requirements, the data relationships to be established are defined, making clear which records are relevant. For example, a study of the number of buildings and the population living in them may need a relationship between these two datasets. The textual similarity between each field of a data table and the fields of the other tables is then calculated, and the results are compared to determine the field associations across tables. Finally, the relationships determined are stored in the database to support subsequent cross-table queries and information presentation.
3. METHODOLOGY
3.1 Data governance, manual identification of fields to determine Chinese descriptions
On the developed data governance platform, each field is commonly described in Chinese so that users can query and manipulate the data more intuitively.
The Chinese description is an accurate and complete definition of what the field contains. It must clearly state the specific name of the data field and the data table to which it belongs, to avoid conflicting interpretations of the same field name in different data tables. For example, in two data tables for enterprises and buildings, the “name” field has a different meaning in each: the “name” field in the building table should be described as “building name”, and the “name” field in the enterprise table as “enterprise name”. Accurate and complete Chinese descriptions of the table fields not only help users understand the meaning of each field but also facilitate the subsequent automated or semi-automated data association calculations.
3.2 Calculation of field similarity based on Chinese interpretations
Field similarity is calculated with a dynamic programming algorithm for the longest common subsequence (LCS), which runs through all combinations of characters in two strings with a double loop. The LCS is the longest subsequence, not necessarily contiguous, that occurs in both strings with the relative order of its characters unchanged. For example, the LCS of the strings A = “ABCD” and B = “ACDF” is “ACD”. Let a and b be two strings of lengths m and n. Initialise a two-dimensional array dp, where dp[i][j] is the LCS length of the first i characters of a and the first j characters of b, with dp[i][0] = dp[0][j] = 0. The state transition equations are:
(1) If a[i-1] == b[j-1], the character can be included in the longest common subsequence, so dp[i][j] = dp[i-1][j-1] + 1.
(2) If a[i-1] != b[j-1], we must choose between dropping the last character of the first string and dropping the last character of the second, whichever gives the longer common subsequence, so dp[i][j] = max(dp[i-1][j], dp[i][j-1]), where max() returns the larger of its two arguments.
The similarity of two strings is then calculated from the length of their LCS. If the two strings have lengths m and n and their LCS has length len, the similarity is len / max(m, n). If the two strings are identical, the value is 1; if they share no characters, the value is 0. Finally, the similarity between each field of a data table and all fields of the other data tables is computed and output as a data table. To narrow the search, only pairs of Chinese descriptions with a similarity greater than or equal to a predetermined threshold (e.g. 0.5) are output, sorted from largest to smallest. The associations are then confirmed in a human-in-the-loop manner and stored in the database to support subsequent cross-table queries and information presentation. The flow chart of the procedure is shown in Figure 1.
4. TABLE OF EXPERIMENTAL TRANSLATIONS INTO ENGLISH
Suppose a building in a city houses a number of businesses, and a database already contains basic information and financial data about the building and the businesses. Table 1 contains basic information on the building: name, location, type, investment, area and ID. Table 2 contains the same kinds of basic information (name, location, type, investment, area and ID) for each company occupying the building. Table 3 contains quarterly financial data for each company, such as consumption, income, tax and net income, together with the company code.
Table 1 (T1): Building base information table
Table 2 (T2): Basic information table for occupying companies
Table 3 (T3): Quarterly profit and loss information table for the businesses
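The similarity measure of Section 3.2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `lcs_length`, `similarity` and `rank_field_pairs` are hypothetical, the field descriptions are English stand-ins for the Chinese descriptions, and returning 0.0 when both strings are empty is a choice made here because the text does not cover that case.

```python
from itertools import product


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    m, n = len(a), len(b)
    # dp[i][j] = LCS length of the first i characters of a and the first j of b
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]


def similarity(a: str, b: str) -> float:
    """LCS length divided by the length of the longer string."""
    longest = max(len(a), len(b))
    # Assumption: two empty strings score 0.0 (not specified in the text)
    return lcs_length(a, b) / longest if longest else 0.0


def rank_field_pairs(source, others, threshold=0.5):
    """Pairs scoring at least `threshold`, sorted from largest to smallest."""
    scored = [(s, o, similarity(s, o)) for s, o in product(source, others)]
    kept = [pair for pair in scored if pair[2] >= threshold]
    return sorted(kept, key=lambda pair: pair[2], reverse=True)


print(similarity("ABCD", "ACDF"))  # LCS "ACD" has length 3, so 3 / 4 = 0.75

# Hypothetical field descriptions, for illustration only (not the paper's data)
t2_fields = ["enterprise name", "enterprise address", "enterprise code"]
other_fields = ["building name", "building address", "enterprise code"]

for src, other, score in rank_field_pairs(t2_fields, other_fields):
    print(f"{src} ~ {other}: {score:.2f}")
```

The ranked list is only a set of candidates. A high score does not guarantee that two fields really correspond, so the output is meant to be reviewed by a person before any association is stored.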
These tables can be used to manage and analyse the operations of buildings and businesses. For example, Tables 1 and 2 can be used to identify the businesses in each building and to calculate the area and investment accounted for by each business. Table 3 can then be used to track the profit and loss of each business and to determine which businesses contribute significantly to the operating profit of the building as a whole. The data can also be used to forecast future income and expenditure and to support better business decisions. In practice, however, the three tables sit in a large city database, and establishing the links by manual search would be time-consuming. Because the Chinese field descriptions of the tables stand in one-to-many relationships, a string similarity algorithm (such as the dynamic programming LCS algorithm) can be used to compare the descriptions and establish the associations. For example, for the association between Table 1 and Table 2, the building address can be compared with the address of the building where the business is located; if their similarity is at or above a certain threshold (e.g. 0.5), the two fields are considered related. Similarly, for the association between Table 2 and Table 3, the business codes can be compared and considered related if their similarity is at or above the threshold. Although this method does not establish associations with complete accuracy, it is a useful tool for improving the accuracy of data matching.
Example: the similarity between the Chinese description strings of all fields in Table 2 and those of the fields in Tables 1 and 3 is output in Table 4 (sorted from largest to smallest, with values below 0.5 omitted; T1, T2 and T3 denote Tables 1, 2 and 3). As Table 4 shows, there is a one-to-many relationship between Table 1 and Table 2: a building can have multiple occupants, but each occupant belongs to only one building. Each enterprise is uniquely identified by the ‘ID’ field (enterprise code). There is also a one-to-many relationship between Table 2 and Table 3: an enterprise can have several quarters of profit and loss information, but each quarterly record belongs to only one enterprise. This relationship is linked by the ‘ID’ (enterprise code) field. There is no direct relationship between Table 1 and Table 3, but the “location” field in Table 2 links an enterprise to its building, so the financial data in Table 3 can be matched to the corresponding building and enterprise.
Table 4. Similarity of Chinese description strings for all fields in Table 2 and the remaining fields in Tables 1 and 3
Note that some of the fields in Table 1, despite their similarity scores, do not genuinely correspond to one another, so manual judgement is needed to filter the candidates and determine the true associations between the three tables. The results are shown in Table 5.
Table 5. Associated fields
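Once the associated fields have been confirmed by a reviewer, they can be stored and used for cross-table queries. The sketch below uses Python's built-in `sqlite3` module to illustrate the idea. Every table name, column name and row in it is hypothetical and chosen only for illustration, and the joins (T1 to T2 on location, T2 to T3 on the enterprise code) follow the relationships described in the text rather than the paper's actual schema.

```python
import sqlite3

# All table names, column names and rows below are hypothetical illustrations.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1_building  (id TEXT, name TEXT, location TEXT);
    CREATE TABLE t2_company   (id TEXT, name TEXT, location TEXT);
    CREATE TABLE t3_quarterly (company_id TEXT, quarter TEXT, net_income REAL);
    -- Field pairs a reviewer has confirmed, kept for later cross-table queries
    CREATE TABLE field_association (left_field TEXT, right_field TEXT);
""")
conn.execute("INSERT INTO t1_building VALUES ('B1', 'Tower A', '1 Example Road')")
conn.executemany("INSERT INTO t2_company VALUES (?, ?, ?)", [
    ("C1", "Firm X", "1 Example Road"),
    ("C2", "Firm Y", "1 Example Road"),
])
conn.executemany("INSERT INTO t3_quarterly VALUES (?, ?, ?)", [
    ("C1", "2023Q1", 120.0),
    ("C1", "2023Q2", 95.5),
    ("C2", "2023Q1", 40.0),
])
conn.executemany("INSERT INTO field_association VALUES (?, ?)", [
    ("t1_building.location", "t2_company.location"),
    ("t2_company.id", "t3_quarterly.company_id"),
])


def occupants_of(building_name):
    """Names of all companies located in the given building (T1 joined to T2)."""
    rows = conn.execute(
        """SELECT c.name FROM t1_building AS b
           JOIN t2_company AS c ON c.location = b.location
           WHERE b.name = ? ORDER BY c.name""",
        (building_name,),
    ).fetchall()
    return [name for (name,) in rows]


def net_income(company_name, quarter):
    """Net income of one company in one quarter (T2 joined to T3), or None."""
    row = conn.execute(
        """SELECT q.net_income FROM t2_company AS c
           JOIN t3_quarterly AS q ON q.company_id = c.id
           WHERE c.name = ? AND q.quarter = ?""",
        (company_name, quarter),
    ).fetchone()
    return row[0] if row else None


print(occupants_of("Tower A"))         # ['Firm X', 'Firm Y']
print(net_income("Firm X", "2023Q1"))  # 120.0
```

Joining on an exact match of the location text is the simplest possible choice here. Real address strings rarely match exactly, so in practice the confirmed association would probably need a normalised key or a record-level similarity check.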
In summary, by following the steps above, we determined the true associations between the three tables and obtained the results shown in Table 5. This allowed us to match and analyse the building, business and financial data and obtain more comprehensive information.
5. CONCLUSION AND DISCUSSION
This study proposes a method, grounded in data governance, for calculating associations between data fields from their Chinese descriptive text, which in turn supports the association of multi-source city data. By analysing the similarity of Chinese character strings, we can quickly establish the association relationships between tables and match the data within them. This helps us better understand the relationships between the data and therefore analyse the data more accurately. Ultimately, these associations can be used to answer questions such as finding all the occupants of a building or finding a company’s financial data for a particular quarter.
The algorithm still has shortcomings and limitations. If two strings are semantically similar but literally different, the method cannot accurately measure their similarity. The string similarity algorithm also cannot handle complex language structure and syntax. In practice, social sensing data and government information data take complex forms, and during data fusion it is difficult to achieve interconnection, interoperability, shared use and integrated operation and management of dynamic information, which makes dynamic fusion error-prone. It is also difficult to accurately match multi-source spatio-temporal data with urban entities, and the relative isolation and poor correlation between different types of data make it difficult to analyse and construct spatio-temporal data association relationships, which reduces the efficiency of dynamic fusion between urban entities and multi-source spatio-temporal data. Therefore, to make better use of Chinese character similarity association, the algorithm needs to be further explored and optimised, and tested and validated in practical application scenarios [10]. Despite its limitations, the Chinese character similarity association algorithm has the following important implications:
In the future, as geographic information and big data technologies become more intertwined with the social systems of smart cities, there will be more room for geographic data association technologies to develop. For example, in the automatic extraction and deep understanding of multiple semantics, most current research remains at the level of low-level semantic features such as entity extraction and element discovery, while efficient and robust extraction algorithms are still lacking for high-level semantic features such as location features, entity relationships, scene features and geographic events contained in text and images. In multi-objective environments, current data association algorithms face problems such as high computational complexity and poor real-time tracking. With artificial intelligence technology, the extraction of spatial semantics can advance from syntax, pixels and elements to high-level spatio-temporal scene recognition and inference, helping computers to “see” and “read” the rich spatio-temporal semantics contained in ubiquitous geographic information [11]. In conclusion, Chinese character similarity association has important practical significance. Despite its limitations, by continuously improving the algorithm and combining it with practical application scenarios, we can make better use of this method to discover connections between data and to improve data quality and application value.
REFERENCES
LIU Yingcai, LU Laijun, WANG Yan, et al.,
“Analysis and application of multi-source data integration methods in geology[J],” Geology and Resources, 23(6), 583–586 (2014).
Sun Hongyan, “GIS multi-source data integration model review[J],” Power Technology, (Z1), 8–10 (2010).
Mohammadi, Hossein, Abbas Rajabifard, and Ian P. Williamson, “Enabling spatial data sharing through multi-source spatial data integration,” in Proceedings of GSID, 11 (2009).
Ergun, Bahadir, et al., “A case study on the historical peninsula of Istanbul based on three-dimensional modeling by using photogrammetry and terrestrial laser scanning,” Environmental Monitoring and Assessment, 165, 595–601 (2010). https://doi.org/10.1007/s10661-009-0971-0
Song Guanfu, Zhong Ershun, Liu Jiyuan, et al., “Seamless integration of multi-source spatial data[J],” Advances in Geoscience, 19(2), 110–115 (2000).
Zheng Minjuan, Master’s degree (Major: Computer Application Technology; Supervisor: Fang Ming), Xi’an University of Petroleum (2008).
Huang Shengli, “An integrated land resources management system based on GBH/GIS[J],” China Land Science, (5), 39–41 (1998).
Li Kexia, “Research on semantic consistency matching of toponymic data[D],” Southwest Petroleum University, Chengdu (2017).
Yager, Ronald R., “A framework for multi-source data fusion,” Information Sciences, 163(1-3), 175–200 (2004). https://doi.org/10.1016/j.ins.2003.03.018
Liu Jiping, Wang Yong, Hu Yanzhu, et al., “A review of Internet ubiquitous geographic information sensing fusion technology[J],” Journal of Surveying and Mapping, 51(7), 1618–1628 (2022).
Liu Shangqin, Zhang Fuhao, Qiu Aigen, et al., “A framework for multi-source spatio-temporal data fusion based on urban information units[J],” Integration Technology,
|