This paper presents an approach to classifying/clustering web documents by decomposing hypergraphs. The various levels of co-occurring frequent terms in documents, called association rules (undirected rules), form a hypergraph. Clustering methods are then applied to analyze such hypergraphs; a simple and fast algorithm decomposes the hypergraph into connected components. Each connected component represents a primitive concept within the given documents, and the documents are then classified/clustered by these primitive concepts.
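As a minimal illustration of the decomposition step, the sketch below treats each undirected rule as a hyperedge over frequent terms and finds connected components with union-find; the term data and function names are hypothetical, not from the paper.

```python
# Minimal sketch: decompose a term hypergraph into connected components
# using union-find. Each hyperedge is a set of co-occurring frequent terms
# (an undirected rule); each component approximates one "primitive concept".

def find(parent, x):
    # Find the root of x, compressing the path as we go.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def connected_components(hyperedges):
    parent = {}
    for edge in hyperedges:
        terms = list(edge)
        for t in terms:
            parent.setdefault(t, t)
        # Union every term in the hyperedge with the first one.
        for t in terms[1:]:
            ra, rb = find(parent, terms[0]), find(parent, t)
            if ra != rb:
                parent[rb] = ra
    components = {}
    for t in parent:
        components.setdefault(find(parent, t), set()).add(t)
    return list(components.values())

edges = [{"stock", "market"}, {"market", "index"}, {"gene", "protein"}]
print(connected_components(edges))
# two components: {stock, market, index} and {gene, protein}
```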
A relation is a representation of a set of entities, called the universe V, by a set of tuples. Hence it is associated with a unique sublattice, called the relation lattice, of the partition lattice of V.
In this paper, we examine the relation lattices on V and on a sample V' (a subset of V). The analysis concludes that only very special types of samples can have the "same" association rules as the original universe; here "same" means allowing some statistical error. In other words, finding association rules by mere random sampling may not fully reflect the association rules of the original universe; special attention to sampling is needed.
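A small illustration of the sampling concern, on synthetic data: the support of an itemset estimated on a random sample V' can drift from its support on V, so a rule can cross the support threshold in one and not the other.

```python
# Illustrative only: support of an itemset on the full universe vs. a
# random sample. All data here is synthetic.

import random

def support(transactions, itemset):
    # Fraction of transactions containing every item of itemset.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

V = [{"a", "b"}] * 60 + [{"a"}] * 40    # support({a, b}) = 0.60 on V
random.seed(1)
V_prime = random.sample(V, 20)          # a 20% random sample

print(support(V, {"a", "b"}))           # 0.6
print(support(V_prime, {"a", "b"}))     # sample estimate; may drift from 0.6
```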
Finding reducts is a core theme in rough set theory and can be considered one form of data mining. However, finding the "perfect" reduct has been proved to be an NP-hard problem. In this paper, we compute a reduct based on a granular data model, in which each granule is represented by a bit string. The computation is very fast.
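A hedged sketch of the granule idea, not the paper's exact algorithm: each attribute's equivalence classes are encoded as bit masks over row indices, and an attribute is dropped greedily whenever removing it leaves the joint partition unchanged.

```python
# Hedged sketch: granule-based reduct computation. Each combination of
# attribute values labels a granule, encoded as an integer bit mask over
# row indices; an attribute is dispensable if removing it leaves the
# joint partition unchanged. Greedy elimination, illustrative data.

def granules(table, attrs):
    # One bit mask per distinct value combination of attrs.
    groups = {}
    for i, row in enumerate(table):
        key = tuple(row[a] for a in attrs)
        groups[key] = groups.get(key, 0) | (1 << i)
    return frozenset(groups.values())

def reduct(table, attrs):
    full = granules(table, attrs)
    kept = list(attrs)
    for a in attrs:
        trial = [x for x in kept if x != a]
        if trial and granules(table, trial) == full:
            kept = trial          # a is dispensable; drop it
    return kept

table = [{"A": 0, "B": 0, "C": 0},
         {"A": 0, "B": 1, "C": 1},
         {"A": 1, "B": 1, "C": 0}]
print(reduct(table, ["A", "B", "C"]))   # -> ['B', 'C']; A is dispensable here
```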
Finding all closed frequent itemsets is a key step of association rule mining, since the non-redundant association rules can be inferred from them. In this paper we present a new method for finding closed frequent itemsets based on an attribute-value lattice. We argue that vertical data representation and the attribute-value lattice together find all closed frequent itemsets efficiently, thus greatly improving the efficiency of association rule mining. We discuss how these techniques are applied to find closed frequent itemsets. In our method, the data are represented vertically: each frequent attribute value is associated with its granule, which is represented as a hybrid bitmap. Based on the partial order defined between attribute values in the database, an attribute-value lattice is constructed, which is much smaller than the original database. Instead of searching all the items in the database, as almost all association rule algorithms do to find frequent itemsets, our method searches only the attribute-value lattice. A bottom-up, breadth-first approach is employed to search the attribute-value lattice for the closed frequent itemsets.
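The following is a simplified sketch of the vertical approach under stated assumptions: plain Python sets stand in for hybrid bitmaps, and the closure operator (all items common to an itemset's supporting transactions) is enumerated bottom-up, breadth-first. It is illustrative, not the paper's implementation.

```python
# Simplified sketch: closed frequent itemsets via vertical tidsets.
# Each item maps to the set of transaction ids containing it (its granule);
# the closure of an itemset is every item shared by all its transactions.

def closed_frequent(transactions, minsup):
    items = {i for t in transactions for i in t}
    tidset = {i: {tid for tid, t in enumerate(transactions) if i in t}
              for i in items}

    def closure(itemset, tids):
        # All items appearing in every supporting transaction.
        return frozenset(i for i in items if tids <= tidset[i])

    closed = {}
    frontier = [(frozenset([i]), tidset[i]) for i in items
                if len(tidset[i]) >= minsup]
    seen = set()
    while frontier:                      # bottom-up, breadth-first
        nxt = []
        for itemset, tids in frontier:
            c = closure(itemset, tids)
            closed[c] = len(tids)
            for i in items - c:          # extend by one item at a time
                t2 = tids & tidset[i]
                key = c | {i}
                if len(t2) >= minsup and key not in seen:
                    seen.add(key)
                    nxt.append((key, t2))
        frontier = nxt
    return closed

db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(closed_frequent(db, 2))   # every closed itemset with support >= 2
```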
KEYWORDS: Data mining, Databases, Data modeling, Probability theory, Computing systems, Data processing, Computer science, Knowledge discovery
Associations (not necessarily in rule form) as patterns in data are critically analyzed. We build the theory based only on what the data says, with no other implicit assumptions. Data mining is regarded as a deductive science: first, we observe that isomorphic relations have isomorphic associations. Somewhat surprisingly, this simple observation turns out to have far-reaching consequences. It implies that associations are properties of an isomorphic class, not of an individual relation. A similar conclusion holds for probability theory based on item counting; hence such a theory is not adequate to characterize "interestingness," since the latter is a property of an individual relation. As a by-product of this analysis, we find that all generalized associations can be found by simply solving a set of integral linear inequalities; this is a striking result. Finally, from the structure of the relation lattice, we conclude that random sampling may lose substantial information about patterns.
The observation that isomorphic relations have isomorphic high-frequency patterns implies some unexpected properties of association rules. First, the patterns are properties of the isomorphic class, not of an individual relation. Second, the counts of itemsets, association rules, etc., are invariant under isomorphism, and hence any probability theory based on such counts is again a theory of the whole class, not of an individual relation. On the other hand, examples show that "interestingness" (of association rules) is a property of an individual relation, not of the whole isomorphic class. As a corollary, and contrary to many authors' beliefs, we conclude that interestingness cannot be characterized by such a probability theory.
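A small demonstration of the counting invariance, on synthetic data: bijectively relabeling attribute values (an isomorphism) leaves every itemset count unchanged, so any measure computed purely from counts cannot tell the two relations apart.

```python
# Illustration: an isomorphism (bijective relabeling of attribute values)
# preserves every itemset count. Data and names are synthetic.

from itertools import combinations
from collections import Counter

def itemset_counts(rows, k):
    # Count every k-subset of (attribute, value) pairs per row.
    c = Counter()
    for row in rows:
        pairs = sorted(row.items())
        for combo in combinations(pairs, k):
            c[combo] += 1
    return c

rows = [{"color": "red", "size": "S"},
        {"color": "red", "size": "L"},
        {"color": "blue", "size": "S"}]

# Relabel values bijectively: red->r1, blue->r2, S->x, L->y.
relabel = {"red": "r1", "blue": "r2", "S": "x", "L": "y"}
rows_iso = [{a: relabel[v] for a, v in row.items()} for row in rows]

# The multisets of pair counts coincide.
print(sorted(itemset_counts(rows, 2).values()))
print(sorted(itemset_counts(rows_iso, 2).values()))  # identical
```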
Let V be a set of real-world entities that admits a relational model. An attribute induces an equivalence relation on V and will be regarded as such. Let A* be the set of such equivalence relations and their refinements. Then the smallest sublattice L(A*) generated by A* in the partition lattice of V is called the generalized relation lattice. The set of granules in L(A*) whose cardinality is above a certain threshold is the set of all possible patterns derivable from the given attributes (features) of the given relation.
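A minimal sketch of this construction, assuming just two attributes: each attribute induces a partition of V, their meet (pairwise intersections of equivalence classes) yields the finer granules of L(A*), and granules above the cardinality threshold are the candidate patterns. Data and names are illustrative.

```python
# Sketch: granules of the lattice generated by two attribute partitions.

def partition(V, rows, attr):
    # Equivalence classes of V induced by one attribute.
    classes = {}
    for v in V:
        classes.setdefault(rows[v][attr], set()).add(v)
    return list(classes.values())

def meet(p1, p2):
    # Pairwise non-empty intersections of equivalence classes.
    return [a & b for a in p1 for b in p2 if a & b]

V = [0, 1, 2, 3]
rows = {0: {"A": "x", "B": 1}, 1: {"A": "x", "B": 1},
        2: {"A": "x", "B": 2}, 3: {"A": "y", "B": 2}}

granules = meet(partition(V, rows, "A"), partition(V, rows, "B"))
threshold = 2
print([g for g in granules if len(g) >= threshold])   # [{0, 1}]
```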
Data mining has been treated as a set of operations added to the classical data model, so data mining uses the same terms as databases. This is an incorrect or confusing approach: though the syntax is the same, the semantics are very different. This paper presents a series of critiques, analyzes the current status, and proposes a data model for data mining in both traditional relational databases and semantically richer databases.
This paper compares two artificial intelligence methods, the C4.5 decision tree and rough set theory, on stock market data. The C4.5 decision tree is reviewed alongside rough set theory. An enhanced windowing application is developed to facilitate pre-processing and filtering by introducing feature (attribute) transformations, which allow users to input formulas and create new attributes. The application also produces three varieties of data sets using delaying, averaging, and summation (sketched below). The results demonstrate the improvement that pre-processing with feature (attribute) transformations brings to the C4.5 decision tree. Moreover, the comparison between C4.5 and rough set theory is based on clarity, automation, accuracy, dimensionality, raw data, and speed, and is supported by the rule sets generated by both algorithms on three different data sets.
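The three transformations can be sketched as follows on a univariate price series; window sizes and data are arbitrary, and this is not the application's actual code.

```python
# Illustrative sketch of the three named transformations -- delaying (lag),
# averaging (moving mean), and summation (rolling sum).

def delayed(series, lag):
    # Shift the series forward by `lag` positions.
    return [None] * lag + series[:-lag]

def averaged(series, window):
    # Moving mean over a trailing window.
    return [sum(series[i - window + 1:i + 1]) / window if i >= window - 1
            else None for i in range(len(series))]

def summed(series, window):
    # Rolling sum over a trailing window.
    return [sum(series[i - window + 1:i + 1]) if i >= window - 1
            else None for i in range(len(series))]

prices = [10.0, 11.0, 12.0, 11.5, 12.5]
print(delayed(prices, 1))   # [None, 10.0, 11.0, 12.0, 11.5]
print(averaged(prices, 2))  # [None, 10.5, 11.5, 11.75, 12.0]
print(summed(prices, 2))    # [None, 21.0, 23.0, 23.5, 24.0]
```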
Given a finite sequence of vectors (numerical tuples), there is a complexity associated with it, called the data complexity. The 'simplest' pattern supported by this data set has a complexity, called the pattern complexity. The 'smallest' subsequence whose pattern complexity and data complexity both equal the pattern complexity of the original sequence is the smallest sample, called the theoretical sample. This paper investigates such samples.
KEYWORDS: Data modeling, Data mining, Information operations, Databases, Data storage, Mathematical modeling, Computer science, Data processing
An attribute value, in a relational model, is a meaningful label of a collection of objects; the collection is referred to as a granule of the universe of discourse. The granule itself can also be regarded as a label of the collection; it will be referred to as the canonical name of the granule. A relational model that uses these canonical names themselves as attribute values (their bit patterns or lists of members) is called a machine-oriented data model. For moderate-size databases, finding association rules, decision rules, etc., reduces to simple set-theoretic operations on these collections. In this paper, a very fast computing algorithm is presented.
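A hedged sketch of the machine-oriented model: each attribute value is stored as its granule, encoded as an integer bit pattern over row ids, so the support of a conjunction is a single bitwise AND plus a popcount. Names and data are illustrative.

```python
# Sketch: attribute values as canonical names (bit patterns of granules).
# Support of "A=x and B=1" is one bitwise AND plus a popcount.

def build_granules(table):
    granules = {}
    for i, row in enumerate(table):
        for attr, val in row.items():
            # Set bit i in the granule of (attr, val).
            granules[(attr, val)] = granules.get((attr, val), 0) | (1 << i)
    return granules

table = [{"A": "x", "B": 1}, {"A": "x", "B": 1},
         {"A": "x", "B": 2}, {"A": "y", "B": 2}]
g = build_granules(table)

# Popcount of the intersection gives the support count.
support = bin(g[("A", "x")] & g[("B", 1)]).count("1")
print(support)  # 2
```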