Schellekens, Vincent
[UCL]
Jacques, Laurent
[UCL]
Machine learning algorithms, such as the k-means clustering problem, typically require several passes on a dataset of learning examples, which becomes prohibitive when the amount of examples becomes very large. Inspired by the field of Compressed Sensing, recent work has proposed a new technique to reduce the size of very large dataset by summarizing it into a single object called the sketch. To construct this summary, the sketch passes random projections of the learning examples through a complex exponential before a pooling step, i.e. a summation over the set of learning examples. The resulting sketch has a size that is independent of the size of the dataset and has been used successfully in Gaussian Mixture Model estimation and clustering problems. This work considers the possibility to define a new sketching operator that replaces the complex exponential with the universal quantization function. The idea of this modification of the sketch is inspired by 1-bit embeddings of signals with the universal quantization, that preserve the local distances of high-dimensional vectors. This new sketch has the advantage of being a step closer to the hardware implementation of a sensor capturing the sketch of a dataset directly. However this modification of the sketch operator presents new challenges for the centroid retrieval algorithms due to the discontinuous nature of the universal quantization operation, and raises questions regarding the selection of the parameters of this new sketch. New algorithms to find the centroids from the quantized sketch are proposed. A theoretical analysis of generalized sketching operators that rely on any periodic function as signature function allows to obtain new insights that highlight the consequences of changing the sketch signature function. Eventually those new insights lead to empirical rules for tuning the parameters of the sketch relying on universal quantization. Experimental results show that the quantized sketch, combined with adapted algorithms and sketch parameters, allows for accurate clustering results. However, the required sketch size in the quantized case is at least 20 times larger than when the traditional sketch is used. The reason for this increase of the sketch dimension is discussed and potential solutions are proposed.


Bibliographic reference |
Schellekens, Vincent. Compressive clustering of high-dimensional datasets by 1-bit sketching. Ecole polytechnique de Louvain, Université catholique de Louvain, 2017. Prom. : Jacques, Laurent. |
Permanent URL |
http://hdl.handle.net/2078.1/thesis:10695 |