This paper presents a novel clustering method for XML documents. Much current research on document clustering is devoted to supporting the storage and retrieval of large collections of XML documents. However, traditional text clustering approaches cannot capture the structural information of semi-structured documents. Our technique first extracts relative path features to represent each document. We then transform the documents into the Vector Space Model (VSM) and propose a similarity computation. Before clustering, we apply Independent Component Analysis (ICA) to reduce the dimensionality of the VSM. To the best of the author's knowledge, ICA has not been used for XML clustering before. The standard C-means partition algorithm is also improved: when a solution can no longer be improved, the algorithm applies an appropriate disturbance to the local minimum solution and then proceeds to the next iteration. The algorithm can thus escape the local minimum while still being able to reach the whole search space. Experimental results on two real datasets and one synthetic dataset show that the proposed approach is efficient and outperforms a naive clustering method that does not apply ICA.
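The improved C-means step described above can be illustrated with a minimal sketch. The abstract does not specify the form of the disturbance, so everything here is an assumption made for illustration: the disturbance is Gaussian noise added to the cluster centers, a perturbed run is kept only if it lowers the squared-error cost, and the names `c_means`, `perturbed_c_means`, `rounds`, and `scale` are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]

def update(points, labels, centers):
    """Recompute each center as the mean of its members."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            new_centers.append([sum(m[d] for m in members) / len(members)
                                for d in range(len(old))])
        else:
            new_centers.append(list(old))  # empty cluster: keep old center
    return new_centers

def cost(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))

def c_means(points, centers, max_iter=100):
    """Standard C-means iterations until the assignment stops changing."""
    labels = assign(points, centers)
    for _ in range(max_iter):
        centers = update(points, labels, centers)
        new_labels = assign(points, centers)
        if new_labels == labels:
            break
        labels = new_labels
    return centers, labels, cost(points, centers, labels)

def perturbed_c_means(points, init_centers, rounds=20, scale=0.5, seed=0):
    """Rerun C-means from a disturbed copy of the best local minimum.

    Assumption: the disturbance is Gaussian noise on the centers; the
    source abstract does not say which disturbance it uses.
    """
    rng = random.Random(seed)
    best = c_means(points, [list(c) for c in init_centers])
    for _ in range(rounds):
        shaken = [[x + rng.gauss(0.0, scale) for x in c] for c in best[0]]
        candidate = c_means(points, shaken)
        if candidate[2] < best[2]:  # keep only strict improvements
            best = candidate
    return best
```

Because a perturbed run is accepted only when it lowers the cost, the result is never worse than the plain C-means run started from the same `init_centers`; how much it helps depends on `scale` and `rounds`.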
Recently, XML heterogeneity has become a new challenge. This paper proposes a novel clustering strategy to regroup heterogeneous XML sources, since searching in a smaller space of similar documents reduces cost. The strategy consists of four steps. First, we extract path features and map them into a High-dimension Vector Space (HDVS). Second, during data pre-processing, two algorithms are applied to reduce the redundancies in the XML sources. Third, the heterogeneous documents are clustered. Finally, Multivalued Dependency (MVD) is introduced, since MVD can be redefined according to the range of XML constraints. This paper also proposes a novel algorithm for discovering minimal MVDs, based on rough sets that handle incomplete (non-integrity) data. The algorithm addresses the problem that incomplete XML data interferes with finding the MVDs of XML, so that patterns can be extracted from each cluster.
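The first step above, extracting path features and mapping them into a vector space, might look roughly like the following sketch. It is not the paper's method: it assumes root-to-node tag paths as the features, raw occurrence counts as the vector weights, and cosine similarity as the comparison, none of which the abstract specifies. The function names `path_features`, `to_vectors`, and `cosine` are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter

def path_features(xml_text):
    """Count every root-to-node tag path in one XML document."""
    root = ET.fromstring(xml_text)
    counts = Counter()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        counts[path] += 1
        for child in node:
            walk(child, path)

    walk(root, "")
    return counts

def to_vectors(docs):
    """Map per-document path counts onto one shared, sorted vocabulary."""
    vocab = sorted({path for counts in docs for path in counts})
    vectors = [[counts.get(path, 0) for path in vocab] for counts in docs]
    return vocab, vectors

def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Usage: two structurally similar documents and one unrelated document.
sources = [
    "<book><title/><author/></book>",
    "<book><title/><author/><author/></book>",
    "<movie><director/></movie>",
]
vocab, vectors = to_vectors([path_features(s) for s in sources])
```

With heterogeneous sources the shared vocabulary grows quickly, which is consistent with the abstract's description of the space as high-dimensional; documents that share no paths get a similarity of zero.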