Bestgen, Yves
[UCL]
Pearson's chi-squared test is probably the most popular statistical test in corpus linguistics, particularly for studying linguistic variation between corpora. Oakes and Farrow (Literary and Linguistic Computing, 2007, 22, 85-99) proposed several adaptations of this test that allow more than two corpora to be compared simultaneously while still yielding an almost correct Type I error rate (i.e. the rate at which a word is claimed to be most frequent in a given variety of English when this is not actually the case). Using resampling procedures, the present study shows that, when used in this context, the chi-squared test produces far too many significant results, even in its modified version. Several potential ways of circumventing this problem are discussed in the conclusion.
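The kind of resampling check described in the abstract can be illustrated with a minimal sketch. It is not the procedure used in the article: it assumes only two pseudo-corpora (so the test has one degree of freedom), a synthetic "bursty" word that occurs 20 times in 20 of 200 texts of 1,000 tokens each, and random reassignment of whole texts to the two groups. All function names and data values are hypothetical. Because the texts are split at random, the null hypothesis is true by construction, and a well-calibrated test should reject it in roughly 5% of resamples.

```python
import math
import random


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        return 0.0  # degenerate table: no evidence either way
    return n * (a * d - b * c) ** 2 / margins


def p_value_df1(stat):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def false_positive_rate(word_counts, text_lengths,
                        n_resamples=2000, alpha=0.05, seed=0):
    """Share of random text-level splits in which the test is significant."""
    rng = random.Random(seed)
    texts = list(range(len(word_counts)))
    half = len(texts) // 2
    hits = 0
    for _ in range(n_resamples):
        # Assign whole texts at random to two pseudo-corpora (true null).
        rng.shuffle(texts)
        group_a, group_b = texts[:half], texts[half:]
        a = sum(word_counts[i] for i in group_a)
        c = sum(word_counts[i] for i in group_b)
        tokens_a = sum(text_lengths[i] for i in group_a)
        tokens_b = sum(text_lengths[i] for i in group_b)
        # Rows: pseudo-corpus A / B; columns: target word / all other tokens.
        stat = chi2_2x2(a, tokens_a - a, c, tokens_b - c)
        if p_value_df1(stat) < alpha:
            hits += 1
    return hits / n_resamples


# Synthetic, assumed data: the word is concentrated in a few texts.
word_counts = [20] * 20 + [0] * 180
text_lengths = [1000] * 200
rate = false_positive_rate(word_counts, text_lengths)
print(f"Share of significant results under a true null: {rate:.2f}")
```

In this toy setting the share of significant results lies well above the nominal 5% level, because the test treats every token as an independent observation although occurrences of the word cluster within texts.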
Bibliographic reference
Bestgen, Yves. Inadequacy of the chi-squared test to examine vocabulary differences between corpora. In: Literary and Linguistic Computing : the journal of digital scholarship in the humanities, Vol. 29, no. 2, p. 164-170 (2014)
Permanent URL
http://hdl.handle.net/2078.1/156101