Paul, Jérôme
[UCL]
Modern personalised medicine uses high-dimensional genomic data to perform customised diagnosis and prognosis. In addition, physicians record several medical parameters to evaluate a patient's clinical status. In this thesis we are interested in jointly using these different but complementary kinds of variables to perform classification tasks. Our main goal is to make predictive models interpretable by reducing the number of variables they use, keeping only the most relevant ones. Selecting a few variables that suffice to predict a clinical outcome greatly helps medical doctors to better understand the biological process under study.

Mixing gene expression data and clinical variables is challenging because of their different nature: genomic measurements are expressed on a continuous scale, while clinical variables can be continuous or categorical. While the biomedical domain is the original motivation for this work, we tackle the more general problem of feature selection in the presence of heterogeneous variables. Few variable selection methods handle both kinds of features jointly, which is why we focus on tree ensemble methods and kernel approaches.

Tree ensemble methods, such as random forests, successfully perform classification on data with heterogeneous variables. In addition, they provide a feature importance index that ranks variables according to their role in the predictive model. Yet that index suffers from two main drawbacks. Firstly, the resulting feature rankings are highly sensitive to small variations of the dataset. Secondly, even when the variables are accurately ranked, it is very difficult to decide which features actually play a role in the decision process. This work puts forward solutions to both problems. In an analysis of the stability of tree ensemble methods, we show that feature rankings become considerably more stable when growing more trees than needed for good predictive performance.
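The stabilising effect of growing more trees can be illustrated with a generic sketch (not the thesis code; the synthetic dataset, tree counts, and use of Spearman rank correlation as a stability proxy are all assumptions for illustration):

```python
# Sketch: agreement between feature-importance rankings of two
# independently seeded random forests, as the number of trees grows.
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical high-dimensional data with few informative features
X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=5, random_state=0)

def importances(n_trees, seed):
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    return rf.fit(X, y).feature_importances_

rhos = {}
for n_trees in (10, 100, 500):
    # Spearman correlation between the two importance vectors:
    # higher rho means the two forests rank the features more alike.
    rhos[n_trees] = spearmanr(importances(n_trees, 1),
                              importances(n_trees, 2))[0]
    print(f"{n_trees:4d} trees: rank agreement rho = {rhos[n_trees]:.2f}")
```

With few trees the two forests disagree noticeably; with many trees the importance estimates, and hence the rankings, converge.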
We also introduce a statistically interpretable feature selection index that assesses whether a variable is important for predicting the class of unseen samples. The resulting p-values offer a natural threshold to decide which features are significant. Apart from tree ensemble approaches, few feature selection methods handle continuous and categorical variables in an embedded way. It is however possible to build classifiers that benefit from both kinds of data by using kernels. In this thesis, we adapt those techniques to perform heterogeneous feature selection. We propose two kernel-based algorithms that rely on a recursive feature elimination procedure, extracting the importance of the variables either from a non-linear SVM or from multiple kernel learning. These approaches are shown to provide state-of-the-art results in terms of predictive performance and feature selection stability.
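The general idea behind a p-value-based importance index can be sketched with a generic permutation test (the thesis defines its own index; the label-permutation null, forest settings, and dataset below are stand-in assumptions):

```python
# Sketch: empirical p-values for feature importances via label permutation.
# A feature is "significant" if its observed importance is rarely matched
# when the class labels carry no signal.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=120, n_features=20,
                           n_informative=3, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
observed = rf.feature_importances_

# Null distribution: importances obtained after shuffling the labels
n_perm = 30
rng = np.random.default_rng(0)
null = np.empty((n_perm, X.shape[1]))
for b in range(n_perm):
    rf_b = RandomForestClassifier(n_estimators=300, random_state=b)
    null[b] = rf_b.fit(X, rng.permutation(y)).feature_importances_

# One-sided empirical p-value per feature (with the usual +1 correction)
pvals = (1 + (null >= observed).sum(axis=0)) / (1 + n_perm)
print("features with p < 0.05:", np.where(pvals < 0.05)[0])
```

The p-values give a directly interpretable cut-off (e.g. 0.05), in contrast with raw importance scores, which have no natural threshold.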
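The recursive feature elimination backbone of the proposed kernel algorithms can be illustrated with scikit-learn's `RFE`; here a linear SVM's weights serve as a simplified stand-in for the non-linear SVM or multiple-kernel-learning importances used in the thesis, and the dataset and elimination step are assumptions:

```python
# Sketch: recursive feature elimination driven by SVM weights.
# At each iteration, the features with the smallest (squared) weights
# are dropped and the classifier is retrained on the remainder.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

X, y = make_classification(n_samples=150, n_features=30,
                           n_informative=4, random_state=0)

# Drop 10% of the remaining features per iteration until 5 are left
selector = RFE(SVC(kernel="linear"), n_features_to_select=5, step=0.1)
selector.fit(X, y)

selected = [i for i, keep in enumerate(selector.support_) if keep]
print("selected features:", selected)
```

Replacing the linear weight criterion with an importance derived from a non-linear kernel machine, as the thesis does, keeps this elimination loop unchanged.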
Bibliographic reference: Paul, Jérôme. Feature selection from heterogeneous biomedical data. Prom. : Dupont, Pierre
Permanent URL: http://hdl.handle.net/2078.1/165076