Colson, Jean-Pierre
[UCL]
The automatic extraction of all collocations/phraseologisms from corpora has a crucial role to play in the development of computational phraseology. Unfortunately, after "50-something years of work on collocations", the results are still disappointing (Gries 2013). One possible way of improving them is to start not from traditional statistical scores but from techniques inspired by information retrieval, in particular metric clusters. The Corpus Proximity Ratio (CPR, J.-P. Colson 2014) makes it possible to reach a precision of about 85 percent for the extraction of bigrams, and also of higher-order n-grams (up to seven-grams). In this paper we present the results obtained by combining raw frequency (on web corpora of 200 million words) with CPR for the English formula It's not and its French counterpart C'est pas, as well as for the French pattern en toute and its counterparts in English, German and Spanish (in all, in aller and en toda, respectively). The results show that such structures encompass a wide variety of communicative and semantic phrases, most of which are not recorded in dictionaries. With the help of CPR, no fewer than 150 phrases beginning with C'est pas were extracted.
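The abstract does not define CPR, so the following is only a minimal sketch of the general idea of combining raw frequency with a proximity-based score. It assumes, purely for illustration, that the proximity score is the number of exact contiguous occurrences of an n-gram divided by the number of occurrences in which its words appear in order with at most `window` intervening tokens between neighbours. The function names, the whitespace tokenisation and the window size are hypothetical and are not taken from Colson (2014).

```python
from collections import Counter


def ngrams_with_prefix(tokens, prefix, n):
    """Raw frequency of every n-gram of length n that starts with `prefix`."""
    prefix = list(prefix)
    k = len(prefix)
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + k] == prefix:
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def proximity_ratio(tokens, gram, window=1):
    """Assumed, simplified proximity score (not the published CPR formula).

    exact: contiguous occurrences of `gram`.
    loose: occurrences where each next word is found, by greedy earliest
           match, within `window` intervening tokens of the previous word.
    Every exact occurrence is also a loose one, so the ratio is at most 1.
    """
    gram = tuple(gram)
    n = len(gram)
    exact = 0
    loose = 0
    for i, tok in enumerate(tokens):
        if tok != gram[0]:
            continue
        if tuple(tokens[i:i + n]) == gram:
            exact += 1
        pos = i
        matched = True
        for word in gram[1:]:
            span = tokens[pos + 1:pos + 2 + window]
            if word not in span:
                matched = False
                break
            pos = pos + 1 + span.index(word)
        if matched:
            loose += 1
    return exact / loose if loose else 0.0


# Toy corpus, whitespace-tokenised (an assumption made for brevity).
tokens = "it's not fair . it's really not fair . it's not fair".split()

# Step 1: candidate trigrams starting with "it's not", ranked by raw frequency.
candidates = ngrams_with_prefix(tokens, ["it's", "not"], 3)

# Step 2: score each candidate with the assumed proximity ratio.
scores = {gram: proximity_ratio(tokens, gram, window=1) for gram in candidates}
```

In a setup like this, candidates would first be filtered by a raw-frequency threshold and then ranked or filtered by the proximity score. Both thresholds would have to be tuned on real corpus data.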
Bibliographic reference |
Colson, Jean-Pierre. Phrases and associations across languages: experiments in corpus-based computational phraseology. Europhras 2014 (Paris, from 10/09/2014 to 12/09/2014). |
Permanent URL |
http://hdl.handle.net/2078.1/152419 |