Abstract
The amount of data in the world and in our lives seems ever-increasing, and there is no end in sight. This situation is sustained by omnipresent computers along with inexpensive disks and online storage. To get the best from the data that overwhelm us, computers give us the opportunity to analyse them for decision making. For example, as reported in the literature, dairy farmers in New Zealand have to make a tough business decision every year: which cows to retain in their herd and which to sell off to an abattoir. Each cow's breeding and milk production history, age, health problems, and many other factors influence this decision. About 700 attributes for each of several million cows have been recorded over the years. This is an example of a large data set (large number of individuals: several million) containing high-dimensional descriptions (large number of variables: 700). Classically, the observed value of each variable for each individual has a scalar data type.
Large multidimensional data sets abound in real applications. Large size and high dimensionality are two aspects of the complexity inherent in these data sets. In the framework of this thesis, we are not interested in the complexity induced by the dimensionality...
One way to deal with large data sets is to summarize the data and use adequate methods for mining the summarized data. Summarizing data, as we understand it here, does not reduce the dimensionality; it reduces the number of individuals. In the summarized data table, each row can be viewed as the description of a concept, which is a high-level individual (for example, a given species of birds) containing lower-level individuals in its extent (for instance, all birds of the given species). The variability of the elements in an extent should be revealed by the row describing the concept in the summarized table, hence the use of data structures in the cells of the summarized data table.
Summarizing data using data structures leads to descriptors which are no longer scalar-type attributes. For instance, knowing that a cow nears the end of its productive life at 8 years, one might group together all cows that are at least 8 years old. For such a group, the observed value of the attribute 'AGE' would be an interval like [8, 15], if we assume that there is no cow over 15 years old. Likewise, the other attributes used to describe a group of cows would be structure-valued, leading to a non-standard data table. In other words, each value of a variable used to describe a group of cows would be a structure. Intervals, multivalued data, distributions, histograms, functions, time series, graphs, and so on are examples of such structures.
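The kind of summarization described above can be sketched in a few lines of base R. This is only an illustration under made-up data (the ages and group labels are invented), not the thesis's implementation: scalar ages of individual cows are collapsed into one interval [min, max] per group.

```r
## Hypothetical scalar data: one age per cow, and a grouping of the cows.
age   <- c(3, 9, 12, 5, 8, 15)
group <- c("young", "old", "old", "young", "old", "old")

## Summarize each group's scalar ages as an interval [min, max]:
## the resulting table has one row per concept (group), not per cow.
interval_summary <- t(sapply(split(age, group), range))
colnames(interval_summary) <- c("lower", "upper")
interval_summary
##       lower upper
## old       8    15
## young     3     5
```

Each row of `interval_summary` is an interval-valued description of a group, retaining the within-group variability that a single scalar summary (such as a mean) would discard.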
To our knowledge, there is currently only one complete software environment that is publicly and freely available for tabular structure-valued data analysis. The Symbolic Objects Data Analysis System (SODAS2) was designed and implemented as a "black box" and consequently does not provide the flexibility and adaptability necessary to support research activity in the field of tabular structure-valued data analysis. Using open source data analysis software such as R is a good step towards the desired flexibility. Even though R, unlike the SODAS2 software, does not natively provide non-standard data types, it offers the necessary infrastructure for designing and implementing them. Therefore, one of the contributions of this thesis is a well-designed class system for representing non-standard data types in R.
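As a minimal sketch of the kind of infrastructure R offers for this purpose (and not the class system actually proposed in the thesis), an interval-valued type can be defined with an S4 class from base R's methods package, the class name and slots here being illustrative assumptions:

```r
## Illustrative S4 class for an interval-valued attribute.
## The class name "Interval" and its slots are assumptions for this sketch.
setClass("Interval",
         slots = c(lower = "numeric", upper = "numeric"),
         validity = function(object) {
           ## Reject ill-formed intervals at construction time.
           if (object@lower <= object@upper) TRUE
           else "'lower' must not exceed 'upper'"
         })

age <- new("Interval", lower = 8, upper = 15)
age@upper - age@lower  # width of the interval: 7
```

S4's formal slots and validity checking are what make such non-standard types safe building blocks for the cells of a summarized data table.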
Another major contribution of this thesis is a binary tree inducer called the Structure-Valued Attributes Tree (SVATree) for dealing with tabular non-standard data. The originality of the SVATree algorithm lies in its splitting strategy. Built upon resemblance measures between non-standard data, the proposed splitting strategy generalizes the splitting approach traditionally used for binary tree construction. The SVATree algorithm applies to three types of problems involving tabular non-standard data: classification, regression and clustering. It was designed and implemented in R using the class system proposed in this thesis.
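To convey the flavour of a resemblance-based split (this sketch is NOT the SVATree algorithm; the distance, the reference intervals and the data are all assumptions made for illustration), one can partition interval-valued individuals by assigning each to the closer of two reference intervals under a Hausdorff-type distance:

```r
## Hausdorff-type distance between two intervals a = [a1, a2], b = [b1, b2].
hausdorff <- function(a, b) max(abs(a[1] - b[1]), abs(a[2] - b[2]))

## Four individuals described by one interval-valued attribute (invented data).
x <- list(c(8, 15), c(9, 14), c(2, 5), c(3, 6))

## Binary split: assign each individual to the closer of two reference
## intervals (here arbitrarily the 1st and 3rd individuals).
refs <- list(x[[1]], x[[3]])
side <- sapply(x, function(xi) which.min(sapply(refs, hausdorff, a = xi)))
side  # 1 1 2 2: the first two intervals fall on one side, the last two on the other
```

A resemblance measure between structures thus plays the role that a threshold on a scalar variable plays in classical binary tree induction.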
Non-standard data are dealt with in Part Two of this thesis. Part One focuses on standard data and details a method we have developed for increasing the predictive power of any supervised classification model.