Colson, Jean-Pierre
[UCL]
In this paper we propose a new method for the automatic extraction of set phrases from corpora: the Corpus Proximity Ratio (CPR), based on the average proximity between grams within a given window. This score is non-parametric and comes close to the vectorial models used in information retrieval. Although the score still needs experimental confirmation, the preliminary results obtained (Colson 2010, Colson & Granger 2011) and the comparison with native speaker judgment reveal a high degree of precision, while recall still needs to be explored. This paper also reports the results of an experiment carried out in collaboration with the Centre for English Corpus Linguistics at Louvain University. We will argue that translation corpora and advanced learner corpora play a key role in this debate. Traditional studies have indeed shown that those corpora are lacking in phraseology. Efficient methods for the extraction of phraseology at all levels should therefore be able to measure a different concentration of phraseology in translation and learner corpora on the one hand, and in native corpora on the other. In our experiment, a benchmark of 11,000 collocations (from bigrams to sixgrams) was extracted from a 200-million-word corpus (part of the ukWacky corpus, Baroni et al. 2009) by means of the Corpus Proximity Ratio. The n-grams (from bigrams to sixgrams) were then extracted from the 2.6-million-word ICLE corpus (International Corpus of Learner English, University of Louvain) and from a comparable portion of 2.6 million words randomly selected from the BAWE corpus (British Academic Written English, University of Warwick). Our results indicate a very marked difference between the learner corpus ICLE and the native corpus BAWE, and are therefore consistent with previous work on non-native English phraseology.
This paper focuses on the practical significance of this experiment: a partly automated correction of essays and translations, and the gradual compilation of large collections of fixed and semi-fixed expressions that might be used by language learners and translators.
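The comparison described in the abstract (matching the n-grams of a corpus, from bigrams to sixgrams, against a benchmark list of collocations) can be illustrated with a minimal sketch. It does not implement the Corpus Proximity Ratio itself, whose formula is not given here; it only shows one possible way of measuring the concentration of benchmark phrases in a text. The function names, the per-1,000-token normalisation, and the toy benchmark and sentences are all assumptions made for illustration.

```python
def ngrams(tokens, n):
    """Yield every contiguous n-gram of `tokens` as a tuple."""
    return (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def benchmark_density(tokens, benchmark, min_n=2, max_n=6):
    """Count benchmark n-gram occurrences per 1,000 tokens.

    `benchmark` is a set of token tuples (hypothetical stand-in for the
    11,000 collocations mentioned in the abstract).
    """
    if not tokens:
        return 0.0
    hits = 0
    for n in range(min_n, max_n + 1):
        hits += sum(1 for gram in ngrams(tokens, n) if gram in benchmark)
    return 1000 * hits / len(tokens)


# Toy data, invented for illustration only.
benchmark = {("on", "the", "other", "hand"), ("play", "a", "key", "role")}
native = "these corpora play a key role and on the other hand they differ".split()
learner = "these corpora have a big role and in the other side they differ".split()

native_density = benchmark_density(native, benchmark)
learner_density = benchmark_density(learner, benchmark)
print(native_density > learner_density)
```

On real data, the two inputs would be tokenised corpora of comparable size, and a lower density in the learner corpus would be in line with the difference reported above.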


Bibliographic reference
Colson, Jean-Pierre. Corpus-driven phraseology assessment: an experiment. Europhras 2012. Phraseology and Culture (Maribor, Slovenia, 27/08/2012 to 31/08/2012).
Permanent URL
http://hdl.handle.net/2078.1/114361