González, J.
[Universidad politécnica de Valencia]
Juan, A.
[Universidad politécnica de Valencia]
Dupont, Pierre
[Université Jean Monnet]
Vidal, E.
[Universidad politécnica de Valencia]
Casacuberta, F.
[Universidad politécnica de Valencia]
The problem of word categorisation is formulated as one of unsupervised
mixture modelling where Bernoulli distributions capture contextual information.
We detail how the free parameters of the mixture models can be estimated
through an EM procedure. A deterministic word-to-class mapping is
derived from this model using a hierarchical clustering algorithm.
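
As an illustration of the EM estimation mentioned above, the following is a minimal sketch for a mixture of multivariate Bernoulli distributions, p(x) = sum_k pi_k prod_j p_kj^x_j (1 - p_kj)^(1 - x_j), fitted to binary context vectors. The function name `em_bernoulli_mixture`, the random initialisation, the fixed iteration count and the `eps` smoothing are assumptions made for this example, not details taken from the paper, and the hierarchical step that derives the word-to-class mapping is not shown.

```python
import math
import random


def em_bernoulli_mixture(X, n_components, n_iter=50, seed=0, eps=1e-6):
    """Fit a Bernoulli mixture to binary vectors X (lists of 0/1) with EM.

    Returns (priors, protos, resp): mixing weights, per-component Bernoulli
    parameters, and the posterior of each component for each vector.
    All names and defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    priors = [1.0 / n_components] * n_components
    # Random prototypes away from 0 and 1, so components start out distinct.
    protos = [[rng.uniform(0.25, 0.75) for _ in range(d)]
              for _ in range(n_components)]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior of each component given each vector (log domain).
        resp = []
        for x in X:
            log_joint = []
            for k in range(n_components):
                lp = math.log(priors[k])
                for xj, pj in zip(x, protos[k]):
                    lp += math.log(pj if xj else 1.0 - pj)
                log_joint.append(lp)
            top = max(log_joint)
            weights = [math.exp(lp - top) for lp in log_joint]
            total = sum(weights)
            resp.append([w / total for w in weights])
        # M-step: re-estimate priors and prototypes from the posteriors.
        for k in range(n_components):
            nk = sum(r[k] for r in resp)
            priors[k] = (nk + eps) / (n + n_components * eps)
            for j in range(d):
                pj = sum(r[k] * x[j] for r, x in zip(resp, X)) / max(nk, eps)
                # Clamp so that log(pj) and log(1 - pj) stay finite.
                protos[k][j] = min(max(pj, eps), 1.0 - eps)
    return priors, protos, resp
```

Calling `em_bernoulli_mixture(X, 2)` on a list of binary context vectors returns the estimated mixing weights, the Bernoulli prototypes and the per-word posteriors; the number of components has to be chosen by the caller.
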
Categorisation plays an important role in language modelling.
It lets us reduce the number of free parameters to be estimated and allows us
to easily increase the vocabulary of the task without the need for retraining.
In this paper, we address the word-class selection problem by means of an
unsupervised method that uses contextual information about the words in the
training set together with a suitable distance measure.
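
One possible way to turn contextual information into vectors that a distance measure can compare is sketched below. It assumes a binary coding in which each word is described by the set of words seen immediately to its left and right, and it uses the Hamming distance; both choices, and the names `context_vectors` and `hamming`, are assumptions for illustration, since the abstract does not specify the coding or the distance.

```python
def context_vectors(sentences):
    """Map each word to a binary vector marking its left and right neighbours.

    `sentences` is a list of tokenised sentences (lists of strings).
    The one-word window on each side is an assumption of this sketch.
    """
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    v = len(vocab)
    # First v positions flag left neighbours, last v positions right neighbours.
    vectors = {w: [0] * (2 * v) for w in vocab}
    for s in sentences:
        for pos, w in enumerate(s):
            if pos > 0:
                vectors[w][index[s[pos - 1]]] = 1
            if pos + 1 < len(s):
                vectors[w][v + index[s[pos + 1]]] = 1
    return vocab, vectors


def hamming(a, b):
    """Number of positions at which two equal-length binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))
```

With the toy sentences "the cat sleeps", "the dog sleeps" and "a cat eats", the vectors of "cat" and "dog" differ in two positions, whereas those of "cat" and "the" differ in six, so words used in similar contexts end up closer together.
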
This paper describes a technique to build a hierarchical structure over
words through an efficient agglomerative hierarchical clustering algorithm,
in a syntax-constrained task. Assigning words to categories then becomes
straightforward, since cutting this structure at any chosen point yields
a partition of the vocabulary into categories. We call
this algorithm efficient because it uses min-heaps in order to avoid
an exhaustive search for the nearest neighbour of each sample. Methods for
encoding the words, based on the words that usually surround them in
the sentences of the task, are described, and experiments were carried out
to tune some essential representation- and algorithm-dependent parameters.
Finally, subjectively good results were achieved; they are called
subjective because the only way to evaluate them is to inspect the
resulting structure and grade it.
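
The idea of heap-driven agglomerative clustering followed by a cut of the hierarchy can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of the paper: it uses a single global min-heap with lazy deletion of stale entries and single linkage for the distance between merged clusters, and the names `agglomerative` and `cut` are hypothetical.

```python
import heapq


def agglomerative(points, dist):
    """Build a merge list [(id_a, id_b, distance, new_id), ...].

    Leaves have ids 0..n-1; each merge creates a new, larger id.
    Single linkage and lazy heap deletion are assumptions of this sketch.
    """
    n = len(points)
    active = set(range(n))
    d = {}      # pairwise distances, keyed by (smaller_id, larger_id)
    heap = []   # entries are (distance, id_a, id_b)
    for i in range(n):
        for j in range(i + 1, n):
            d[(i, j)] = dist(points[i], points[j])
            heap.append((d[(i, j)], i, j))
    heapq.heapify(heap)
    merges = []
    next_id = n
    while len(active) > 1:
        dij, i, j = heapq.heappop(heap)
        if i not in active or j not in active:
            continue  # stale entry: one of the clusters was already merged
        active.discard(i)
        active.discard(j)
        for c in active:
            # Single linkage: distance to the closer of the two merged parts.
            dc = min(d[(min(i, c), max(i, c))], d[(min(j, c), max(j, c))])
            d[(c, next_id)] = dc
            heapq.heappush(heap, (dc, c, next_id))
        active.add(next_id)
        merges.append((i, j, dij, next_id))
        next_id += 1
    return merges


def cut(merges, n, n_classes):
    """Replay the first n - n_classes merges and return the resulting classes."""
    clusters = {i: [i] for i in range(n)}
    for a, b, _, new in merges[: n - n_classes]:
        clusters[new] = clusters.pop(a) + clusters.pop(b)
    return sorted(sorted(c) for c in clusters.values())
```

Here `cut(merges, n, n_classes)` plays the role of breaking the structure at a chosen level: the same merge list yields a partition for any number of classes between 1 and n without re-running the clustering.
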


Bibliographic reference |
González, J.; Juan, A.; Dupont, Pierre; Vidal, E.; Casacuberta, F. A Bernoulli mixture model for word categorisation. Proceedings of Simposium Nacional de Reconocimiento de Formas y Análisis de Imágenes (Benicàssim (Spain), from 14/05/2001 to 18/05/2001). |
Permanent URL |
http://hdl.handle.net/2078.1/108944 |