Data mining

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by Mydogategodshat (talk | contribs) at 03:39, 25 February 2004 (Formtng). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Data mining is the practice of automatically searching large stores of data for patterns. To do this, it uses computational techniques from statistics and pattern recognition.

In the technical context of data warehousing, the term is neutral. However, it also has a wider, more pejorative usage that implies imposing patterns (particularly causal relationships) on data where none exist.

Data mining has been defined as "The nontrivial extraction of implicit, previously unknown, and potentially useful information from data" [1] and "The science of extracting useful information from large data sets or databases" [2].

It is also known as knowledge discovery in databases (KDD).

Used in the pejorative sense, "data mining" implies scanning the data for any relationship and then, when one is found, coming up with an interesting explanation for it. The problem is that large data sets almost invariably contain some striking relationships that are peculiar to that data and arise by chance. Any conclusions reached this way are therefore likely to be highly suspect. Even so, some exploratory work is always required in applied statistical analysis to get a feel for the data, so the line between good statistical practice and data mining is sometimes less than clear.
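The effect described above can be illustrated with a small simulation. The following is only a sketch under stated assumptions: every series is pure random noise, and the function names, the sample size of 30 observations, and the seed are all hypothetical choices made for illustration. It scans many unrelated random variables and reports the one that happens to correlate most strongly with a random target.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(n_variables, n_observations=30, seed=0):
    """Scan n_variables noise series and return the correlation with the
    largest magnitude against a noise target. No real relationship exists,
    so whatever is returned arose purely by chance."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    target = [rng.gauss(0, 1) for _ in range(n_observations)]
    strongest = 0.0
    for _ in range(n_variables):
        candidate = [rng.gauss(0, 1) for _ in range(n_observations)]
        r = pearson(target, candidate)
        if abs(r) > abs(strongest):
            strongest = r
    return strongest


if __name__ == "__main__":
    # The more variables are scanned, the stronger the best chance match.
    for n in (1, 10, 100, 1000):
        print(n, round(strongest_chance_correlation(n), 2))
```

Because the search keeps only the best match, the reported correlation tends to grow with the number of variables scanned, even though none of them is related to the target.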

A significant danger of this approach is finding correlations that do not really exist. An example is found at the investment website The Motley Fool. In the late 1990s the website had a suggested investment portfolio known as the Foolish Four, which was based on a data mining analysis of trends in the stock market. Further research in the early 2000s showed that the correlations they found were an artifact of the particular data set they used, rather than a reflection of reality. This is one of many similar false findings linked to the stock market.

There are also privacy concerns associated with data mining. For example, if an employer has access to medical records, it may screen out people who have diabetes or have had a heart attack. Screening out such people will cut insurance costs, but it creates ethical and legal problems.

There are many legitimate uses of data mining. For example, a database of all prescription drugs taken by people can be used to find combinations of drugs that cause an adverse reaction. Because a given combination may be taken by only 100 people, with a reaction in only 10 of them, a single case may not raise a red flag, whereas a search of such a database could reveal the pattern and save lives. However, there is huge potential for abuse of such a database.
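The prescription example can be sketched in code. This is a minimal illustration, not a description of any real pharmacovigilance system: the record layout (a set of drug names plus a reaction flag per patient), the function names, and the thresholds `min_patients` and `min_rate` are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def pair_reaction_counts(records):
    """Map each drug pair to (patients with a reaction, patients taking it).

    records: iterable of (set_of_drug_names, had_reaction) tuples; this
    layout is hypothetical."""
    taken = Counter()
    reacted = Counter()
    for drugs, had_reaction in records:
        # Sort so each pair has one canonical key regardless of set order.
        for pair in combinations(sorted(drugs), 2):
            taken[pair] += 1
            if had_reaction:
                reacted[pair] += 1
    return {pair: (reacted[pair], count) for pair, count in taken.items()}


def flag_pairs(records, min_patients=50, min_rate=0.05):
    """Return pairs taken by at least min_patients people whose reaction
    rate is at least min_rate. Both thresholds are illustrative only."""
    flagged = []
    for pair, (reactions, patients) in pair_reaction_counts(records).items():
        if patients >= min_patients and reactions / patients >= min_rate:
            flagged.append(pair)
    return sorted(flagged)
```

With 100 patients on one combination and 10 reactions among them, the pair's rate of 10% stands out once the records are aggregated, even though each case looks unremarkable on its own. A pair flagged this way is a candidate for further study rather than a confirmed finding.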

Data mining yields information that would not otherwise be available, but that information must be properly interpreted to be useful. When the data collected involve individual people, many questions arise concerning privacy, legality, and ethics.

External sources

[1] W. Frawley, G. Piatetsky-Shapiro, C. Matheus: Knowledge Discovery in Databases: An Overview. AI Magazine, Fall 1992, pp. 213-228.

[2] D. Hand, H. Mannila, P. Smyth: Principles of Data Mining. MIT Press, Cambridge, MA, 2001.


Note: if you got here by looking for the rapper KDD, see KDD (rapper).