Intelligent Choices Preceding Data Analysis

Katharina Morik

Univ. Dortmund, Computer Science VIII,
D-44221 Dortmund, Germany

e-mail: morik@ls8.cs.uni-dortmund.de
http://www-ai.cs.uni-dortmund.de


For a long time, machine learning was oriented towards the development of new and better algorithms that solve a well-defined learning task. Typically, the data are well prepared before learning begins. The main steps are sampling (the data set is representative or carefully biased by informative examples), feature generation and selection (necessary features are given and irrelevant features are excluded), data cleaning (ideally, noisy feature values have been deleted), model selection (the hypothesis space or model class is carefully chosen), and the definition of an evaluation model (the criteria for success or failure of learning are precisely stated). This approach is best illustrated by the UCI library of benchmark data for learning algorithms: given the learning task and the well-prepared data, design the algorithm that optimizes the specified criteria.

Knowledge discovery in databases (KDD) confronts machine learning with a different task: given raw data and a set of algorithms, find an adequate sequence of preparation steps and choose a learning algorithm such that the algorithm optimizes the criteria then selected. Usually, it is up to the specialist to design the preprocessing for the chosen algorithm. This is the most tedious step in the overall discovery process. This talk describes a case-based approach to easing this task. Cases of successful preprocessing are stored for reuse. The metadata of stored cases are adapted to similar cases. A library of best-practice cases, represented by their metadata, is currently being built. The talk presents cases from areas ranging from on-line monitoring in intensive care to direct mailing actions.
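
The case-based idea can be made concrete with a small sketch in Python. It is not the system described in the talk: the metadata attributes, the preprocessing step names, and the overlap-based similarity measure are all assumptions chosen for illustration. A case pairs a metadata description of data and task with an ordered chain of preprocessing steps; the most similar stored case is retrieved for a new problem, and its chain is carried over to the new metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """A stored preprocessing case (hypothetical structure, not from the talk)."""
    name: str
    metadata: dict   # descriptive properties of the data and the learning task
    steps: tuple     # ordered names of preprocessing operators


def similarity(a: dict, b: dict) -> float:
    """Fraction of metadata keys (over the union) on which both descriptions agree.

    This simple overlap measure is an assumption; any other case similarity
    could be substituted.
    """
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    agreeing = sum(1 for k in keys if k in a and k in b and a[k] == b[k])
    return agreeing / len(keys)


def retrieve(library, query: dict) -> Case:
    """Return the stored case whose metadata is most similar to the query."""
    return max(library, key=lambda case: similarity(case.metadata, query))


def adapt(case: Case, query: dict) -> Case:
    """Reuse the retrieved chain of steps, bound to the new problem's metadata."""
    return Case(name=f"{case.name}-adapted", metadata=dict(query), steps=case.steps)


# Illustrative library; the step names are invented placeholders.
LIBRARY = [
    Case(
        "icu-monitoring",
        {"data": "time series", "task": "classification", "missing_values": True},
        ("impute_missing", "window_time_series", "select_features"),
    ),
    Case(
        "direct-mailing",
        {"data": "relational", "task": "classification", "missing_values": False},
        ("join_tables", "aggregate_per_customer", "sample_balanced"),
    ),
]

query = {"data": "time series", "task": "regression", "missing_values": True}
best = retrieve(LIBRARY, query)
new_case = adapt(best, query)
```

In this sketch adaptation only copies the chain of steps; a real library would presumably also check that each step is still applicable to the new data.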