Paul, Jérôme
[UCL]
Modern personalised medicine uses high-dimensional genomic data to perform customised diagnosis and prognosis. In addition, physicians record several medical parameters to evaluate a patient's clinical status. In this thesis we are interested in jointly using these different but complementary kinds of variables to perform classification tasks. Our main goal is to make predictive models interpretable by reducing the number of variables they use, keeping only the most relevant ones. Selecting a few variables that suffice to predict a clinical outcome greatly helps medical doctors to better understand the biological process under study.

Mixing gene expression data and clinical variables is challenging because of their different nature: genomic measurements are expressed on a continuous scale, while clinical variables can be continuous or categorical. While the biomedical domain is the original motivation for this work, we tackle the more general problem of feature selection in the presence of heterogeneous variables. Few variable selection methods handle both kinds of features jointly, which is why we focus on tree ensemble methods and kernel approaches.

Tree ensemble methods, such as random forests, successfully perform classification on data with heterogeneous variables. In addition, they provide a feature importance index that ranks variables according to their role in the predictive model. Yet that index suffers from two main drawbacks. Firstly, the resulting feature rankings are highly sensitive to small variations of the dataset. Secondly, even when the variables are accurately ranked, it is very difficult to decide which features actually play a role in the decision process. This work puts forward solutions to both problems. In an analysis of the stability of tree ensemble methods, we show that feature rankings become considerably more stable when growing more trees than needed for good predictive performance.
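The stabilising effect of growing more trees can be illustrated with a generic sketch (not the thesis code; the synthetic dataset, tree counts, and use of Spearman rank correlation as a stability proxy are all assumptions for illustration):

```python
# Sketch: agreement between feature-importance rankings of two
# independently seeded random forests, as the number of trees grows.
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical high-dimensional data with few informative features
X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=5, random_state=0)

def importances(n_trees, seed):
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    return rf.fit(X, y).feature_importances_

rhos = {}
for n_trees in (10, 100, 500):
    # Spearman correlation between the two importance vectors:
    # higher rho means the two forests rank the features more alike.
    rhos[n_trees] = spearmanr(importances(n_trees, 1),
                              importances(n_trees, 2))[0]
    print(f"{n_trees:4d} trees: rank agreement rho = {rhos[n_trees]:.2f}")
```

With few trees the two forests disagree noticeably; with many trees the importance estimates, and hence the rankings, converge.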
We also introduce a statistically interpretable feature selection index that assesses whether a variable is important for predicting the class of unseen samples. The resulting p-values offer a natural threshold to decide which features are significant. Apart from tree ensemble approaches, few feature selection methods handle continuous and categorical variables in an embedded way. It is however possible to build classifiers that benefit from both kinds of data by using kernels. In this thesis, we adapt those techniques to perform heterogeneous feature selection. We propose two kernel-based algorithms that rely on a recursive feature elimination procedure, extracting the importance of the variables either from a non-linear SVM or from multiple kernel learning. These approaches are shown to provide state-of-the-art results in terms of predictive performance and feature selection stability.
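The general idea behind a p-value-based importance index can be sketched with a generic permutation test (the thesis defines its own index; the label-permutation null, forest settings, and dataset below are stand-in assumptions):

```python
# Sketch: empirical p-values for feature importances via label permutation.
# A feature is "significant" if its observed importance is rarely matched
# when the class labels carry no signal.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=120, n_features=20,
                           n_informative=3, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
observed = rf.feature_importances_

# Null distribution: importances obtained after shuffling the labels
n_perm = 30
rng = np.random.default_rng(0)
null = np.empty((n_perm, X.shape[1]))
for b in range(n_perm):
    rf_b = RandomForestClassifier(n_estimators=300, random_state=b)
    null[b] = rf_b.fit(X, rng.permutation(y)).feature_importances_

# One-sided empirical p-value per feature (with the usual +1 correction)
pvals = (1 + (null >= observed).sum(axis=0)) / (1 + n_perm)
print("features with p < 0.05:", np.where(pvals < 0.05)[0])
```

The p-values give a directly interpretable cut-off (e.g. 0.05), in contrast with raw importance scores, which have no natural threshold.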
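The recursive feature elimination backbone of the proposed kernel algorithms can be illustrated with scikit-learn's `RFE`; here a linear SVM's weights serve as a simplified stand-in for the non-linear SVM or multiple-kernel-learning importances used in the thesis, and the dataset and elimination step are assumptions:

```python
# Sketch: recursive feature elimination driven by SVM weights.
# At each iteration, the features with the smallest (squared) weights
# are dropped and the classifier is retrained on the remainder.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

X, y = make_classification(n_samples=150, n_features=30,
                           n_informative=4, random_state=0)

# Drop 10% of the remaining features per iteration until 5 are left
selector = RFE(SVC(kernel="linear"), n_features_to_select=5, step=0.1)
selector.fit(X, y)

selected = [i for i, keep in enumerate(selector.support_) if keep]
print("selected features:", selected)
```

Replacing the linear weight criterion with an importance derived from a non-linear kernel machine, as the thesis does, keeps this elimination loop unchanged.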
Bibliographic reference: Paul, Jérôme. Feature selection from heterogeneous biomedical data. Prom. : Dupont, Pierre
Permanent URL: http://hdl.handle.net/2078.1/165076