Kaisin, Brieuc
[UCL]
Dourt, Nicolas
[UCL]
Verleysen, Michel
[UCL]
Data has become increasingly available over the last decades, and this abundance does not come without problems. The main challenge is to cope with the size of the available data, in order to perform meaningful analyses and to counter the Curse of Dimensionality, a problem that can be addressed via feature selection. Two subproblems then arise: the presence of non-numerical data and of missing data. A widely used measure for feature selection is Mutual Information, whose value is usually computed with the estimators of Kraskov and Ross. In this work, we propose to modify the Kraskov mutual information estimator and the Kozachenko-Leonenko entropy estimator, both of which rely on distances and therefore do not handle non-numerical or missing data, by using a distance that works with both: the Heterogeneous Euclidean-Overlap Metric. This thesis shows that using such a distance for feature selection, through three different algorithms, works at least as well as popular state-of-the-art methods such as Relief-F. Some perspectives are finally given on how further investigations could be carried out.
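The Heterogeneous Euclidean-Overlap Metric mentioned above combines a range-normalised Euclidean term for numeric attributes with an overlap term for nominal ones, and assigns the maximal per-attribute distance when a value is missing. The following is a minimal sketch of that standard definition (the `ranges` and `nominal` arguments are illustrative assumptions, not part of the thesis's interface):

```python
import math

def heom(x, y, ranges, nominal):
    """Heterogeneous Euclidean-Overlap Metric between two samples.

    x, y    : tuples of attribute values; None marks a missing value
    ranges  : dict mapping numeric attribute index -> (max - min) over the data
    nominal : set of indices of nominal (categorical) attributes
    (Argument names are illustrative assumptions.)
    """
    total = 0.0
    for a in range(len(x)):
        xa, ya = x[a], y[a]
        if xa is None or ya is None:
            d = 1.0                           # missing value -> maximal distance
        elif a in nominal:
            d = 0.0 if xa == ya else 1.0      # overlap metric for nominal data
        else:
            # range-normalised absolute difference for numeric data
            d = abs(xa - ya) / ranges[a] if ranges[a] else 0.0
        total += d * d
    return math.sqrt(total)
```

For example, with one numeric attribute of range 4.0 and one nominal attribute, `heom((1.0, 'red'), (3.0, 'blue'), {0: 4.0}, {1})` yields sqrt(0.5² + 1²). Because every per-attribute distance is defined even when a value is absent, the metric can be dropped into distance-based estimators such as Kraskov's without discarding incomplete samples.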


Bibliographic reference: Kaisin, Brieuc ; Dourt, Nicolas. Machine Learning: feature selection on incomplete and non-numerical data. Ecole polytechnique de Louvain, Université catholique de Louvain, 2020. Supervisor: Verleysen, Michel.
Permalink: http://hdl.handle.net/2078.1/thesis:26510