Derval, Guillaume
[UCL]
Data mining is nowadays an important part of numerous fields of science, from linguistics to ecology, via chemistry and medicine. Since the beginning of this century, many multi-billion-dollars companies, that focus on data mining as they represent their main source of income, have emerged. Data mining is now pervasive in our society. The field is thus very active, and numerous results, techniques, studies have been made to solve problems encountered in many different disciplines. This thesis explores the specific problem of finding maximum-sum submatrices in matrix data, an unsupervised learning technique for knowledge discovery. The methodology can be used in any context where users want to find biclusters based on sum of values. This problem emerged in the context of the analysis of gene expression data, matrices where each column represent a gene and each row a sample (a condition, a patient, a tissue or any other entity where gene can express themselves); a cell in such a matrix represents the expression of a gene on an entity, which can be measured in multiple ways. A submatrix with a large sum can represent a group of patients that share a common illness, induced by a common group of genes, for example. We present multiple techniques for finding good solutions to the maximum-sum submatrix problem, along with two of its variants aiming to find multiples interesting submatrices, with or without overlap between them. We focus on methods based on exact approaches, eventually modified to allow them to scale.


Bibliographic reference |
Derval, Guillaume. Finding maximum sum submatrices. Prom. : Schaus, Pierre |
Permanent URL |
http://hdl.handle.net/2078.1/258210 |