Use domain knowledge to improve data mining performance of very large data sets via clustering
Original Publication Date: 2005-Oct-07
Included in the Prior Art Database: 2005-Oct-07
Abstract
Presented are methods that use domain knowledge, particularly in the medical domain, to reduce the size of a dataset before further data-mining analysis.
Data mining is a very compute-intensive task. It is not a data query problem, in which some information is retrieved from a data repository, but rather an exhaustive computation to uncover information hidden in the data in the form of patterns.
The "raw materials" are instances of data, each represented as a record of an individual (e.g., a person) in a certain population (e.g., patients in a given geographic area). A common way of organizing this data is as a table that can be viewed in a spreadsheet program (e.g., MS Excel*) or a relational database (e.g., IBM DB2**). When searching for hidden patterns, the computational complexity of the problem can explode very quickly, since all possible combinations of the selected data attributes must be analyzed. When different attributes behave very similarly, or very differently, in the data, this is valuable information and evidence of a newly discovered pattern.
The high computational complexity of this problem stems from two sources: the number of features and attributes, all of whose combinations must be analyzed, and the number of records in the data. Proposed here are methods that reduce this complexity by using domain knowledge to act directly on both factors, making it possible to find more dependencies in the data. A more detailed account is given in a technical report.
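To give a sense of the first source of complexity, the following minimal Python sketch counts the attribute combinations that an exhaustive analysis would have to visit. It assumes that "all combinations" means every non-empty subset of the attributes, which is one possible reading of the text; the column names and the function name are hypothetical and used only for illustration.

```python
from itertools import combinations
from math import comb


def count_attribute_combinations(num_attributes: int) -> int:
    """Count the non-empty subsets of a set of attributes (equals 2**m - 1)."""
    return sum(comb(num_attributes, k) for k in range(1, num_attributes + 1))


# Hypothetical column names, for illustration only.
attributes = ["age", "blood_sugar", "smoking", "exam_date"]

# Enumerate every non-empty subset explicitly for this small example.
subsets = [
    subset
    for size in range(1, len(attributes) + 1)
    for subset in combinations(attributes, size)
]
```

Under this assumption the count grows exponentially: 4 attributes give 15 combinations, while 20 attributes already give more than a million, each of which would have to be evaluated against every record.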
The proposal is to use domain knowledge as follows:
1. Organize the combinations of attributes for the analysis phase, so that data is analyzed according to prior knowledge of how valuable each combination is. For example, domain knowledge suggests that analyzing the exam date together with the patient's age is less important than analyzing age together with blood sugar level and smoking habits. One possible source of such domain knowledge is an ontology.
2. For each choice of columns to be analyzed, reduce the variability of the column data, based on knowledge of its behavior. For instance, if the normal range of a certain attribute is known, all values within that range can be combined into a single new abstract value called "normal".
3. Use clustering on the data preprocessed in point 2, restricted to the collection of attributes produced in point 1, so that grouping similar records yields a much smaller set of records than the original collection. Proposed is a very simple classification algorithm whose association criterion is exact equivalence among the records assigned to the same class. This can be done simply by sorting and collecting similar records, by insertion into a tree, or by hashing. The hashing method has complexity of n and is thu...
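Points 2 and 3 above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the normal range for blood sugar is a made-up placeholder rather than clinical guidance, the "low" and "high" labels are an added assumption (the text names only "normal"), and all function and column names are hypothetical. Records are grouped by exact equality of their abstracted values, using a hash table so that each record is visited once.

```python
from collections import Counter

# Hypothetical normal ranges; real values would come from domain knowledge.
NORMAL_RANGES = {"blood_sugar": (70, 100)}


def abstract_value(column, value):
    """Replace a raw value by an abstract label when a normal range is known."""
    if column not in NORMAL_RANGES:
        return value  # no domain knowledge for this column: keep the raw value
    low, high = NORMAL_RANGES[column]
    if value < low:
        return "low"  # label assumed for illustration
    if value > high:
        return "high"  # label assumed for illustration
    return "normal"


def cluster_records(records, columns):
    """Group records that are exactly equal on the abstracted columns.

    A Counter (hash table) maps each distinct abstracted record to the
    number of original records it represents, in one pass over the data.
    """
    clusters = Counter()
    for record in records:
        key = tuple(abstract_value(col, record[col]) for col in columns)
        clusters[key] += 1
    return clusters


# Made-up sample records, for illustration only.
records = [
    {"blood_sugar": 85, "smoker": "yes"},
    {"blood_sugar": 92, "smoker": "yes"},
    {"blood_sugar": 140, "smoker": "no"},
    {"blood_sugar": 78, "smoker": "yes"},
]
clusters = cluster_records(records, ["blood_sugar", "smoker"])
```

In this toy example the four records collapse into two classes. Keeping a count per class preserves how many original records each class stands for, which a later analysis step may need as a weight.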