Paper
Automatic web page data extraction through MD5 trigeminal tree and improved BIRCH
6 May 2022
Jibing Gong, Xiaomeng Kou, Hanyun Zhang, Jiquan Peng, Shishan Gong, Shuli Wang
Proceedings Volume 12256, International Conference on Electronic Information Engineering, Big Data, and Computer Technology (EIBDCT 2022); 122561M (2022) https://doi.org/10.1117/12.2635678
Event: 2022 International Conference on Electronic Information Engineering, Big Data and Computer Technology, 2022, Sanya, China
Abstract
This paper proposes an automatic data extraction algorithm for web pages based on noise reduction and the construction of visualization blocks. We first build an MD5 trigeminal tree from the web page's source files using the MD5 message-digest algorithm, and then perform noise reduction. Finally, we construct visualization blocks with an optimized clustering algorithm named BIRCH A-CF, which creates an area clustering feature forest and completes the clustering by dynamically changing the circle's radius according to a correction factor. We run experiments on eight datasets to compare our method with eight baseline methods. The experimental results show that our approach is more accurate and more robust than current methods; it also accelerates noise reduction and effectively reduces the number of nodes.
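The abstract does not spell out how the MD5 tree is built, so the following is only a minimal sketch of the general idea of fingerprinting page structure with MD5. It assumes a Merkle-style scheme in which each element's digest is computed from its tag name and its children's digests, so that structurally identical subtrees (for example repeated list items or navigation entries) share a digest. The names `Node`, `TreeBuilder`, `md5_tree`, `digest_counts`, and `build` are hypothetical and are not taken from the paper, and the BIRCH A-CF clustering step is not covered.

```python
import hashlib
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, enough for a sketch).
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    """One element of the parsed page (hypothetical helper type)."""

    def __init__(self, tag):
        self.tag = tag
        self.children = []
        self.digest = None


class TreeBuilder(HTMLParser):
    """Builds a simple element tree; text and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def md5_tree(node):
    """Assumed scheme: digest = MD5(tag + digests of children, in order)."""
    child_digests = [md5_tree(child) for child in node.children]
    payload = node.tag + "(" + ",".join(child_digests) + ")"
    node.digest = hashlib.md5(payload.encode("utf-8")).hexdigest()
    return node.digest


def digest_counts(root):
    """Counts how many subtrees share each structural digest."""
    counts = Counter()
    pending = [root]
    while pending:
        node = pending.pop()
        counts[node.digest] += 1
        pending.extend(node.children)
    return counts


def build(html):
    """Parses an HTML string and annotates every node with its digest."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    md5_tree(builder.root)
    return builder.root
```

Under these assumptions, a digest that occurs many times marks a repeated template structure, which a later step could treat either as a candidate data record or as noise to prune. Because text content is ignored in this sketch, two list items with different text but the same markup receive the same digest.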
© (2022) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Jibing Gong, Xiaomeng Kou, Hanyun Zhang, Jiquan Peng, Shishan Gong, and Shuli Wang "Automatic web page data extraction through MD5 trigeminal tree and improved BIRCH", Proc. SPIE 12256, International Conference on Electronic Information Engineering, Big Data, and Computer Technology (EIBDCT 2022), 122561M (6 May 2022); https://doi.org/10.1117/12.2635678
KEYWORDS
Denoising
Information visualization
Data modeling
Feature extraction
Information science