Marion, Rebecca
[UCL]
In many fields, researchers are confronted with datasets whose variables exhibit grouping patterns. For example, in transcriptomics data, where the variables are gene expression levels, certain groups of genes are involved in the same biological processes, so their expression levels are highly correlated. For complex diseases, such as cancer or heart disease, entire groups of genes are expected to contribute to the development or progression of disease. Thus, identifying these variable groups, or "clusters," can be instrumental in uncovering the mechanisms of disease and developing targeted treatments. However, in practice, these variable clusters are not known in advance and must be learned from the data.

Clustering is a data analysis technique used to assign a set of objects to groups, or clusters, such that similar objects are assigned to the same cluster and dissimilar objects to different clusters. While most work in the literature has focused on the problem of clustering observations (e.g. patients) given a set of variables (e.g. genes), this thesis proposes several statistical and machine learning methods for the problem of variable clustering. The objective of the thesis is to propose methods that improve data analysis in contexts where the ultimate goal is to predict one or more targets (e.g. disease class) and to identify the clusters of predictor variables (e.g. genes, metabolites) that are most predictive of the target(s).

We explore three problems related to this theme, drawing on applications from the fields of metabolomics, genomics, ecology and psychology. First, we propose AdaCLV, a variable clustering method for pre-processing high-dimensional metabolomics data such that important clusters of variables can be identified with greater precision. Second, we investigate the added value of integrating the target variable (e.g. disease class) into the variable clustering process. We introduce Weighted SOS-NMF, a method that improves variable clustering and variable selection performance by supervising the clustering of variables with the target before a predictive model is fitted. Finally, we examine the case of supervised variable clustering for data with multiple, orthogonal targets. Inspired by a common research problem in ecology and psychology, we propose BIOT, a method for transforming the dimensions of the target matrix so that they can be accurately predicted by small clusters of predictor variables.


Bibliographic reference
Marion, Rebecca. Statistical and machine learning methods for identifying clusters of variables : with applications in omics, ecology and psychology. Prom. : von Sachs, Rainer ; Govaerts, Bernadette
Permanent URL
http://hdl.handle.net/2078.1/252866