Souza Wilkens, Rodrigo
[UCL]
Zilio, Leonardo
[UCL]
Fairon, Cédrick
[UCL]
Learning a second language requires considerable time and dedication. Part of the process involves reading and writing texts in the target language, so teachers tend to look for texts that match the interests and abilities of their learners, especially to support reading. But searching for such texts is itself time-consuming. To address this need for texts suited to different language learners, we present SW4ALL, a corpus of documents classified by language proficiency level (based on the CEFR recommendations) that lets learners observe how the same topic or content can be described using strategies from different proficiency levels. The corpus uses the alignments between the English Wikipedia and the Simple English Wikipedia to ensure that the two texts in each pair cover similar content or topics, and an annotation of language levels to ensure that the two texts differ in language proficiency level. Given the size of the corpus, we annotated it automatically and then analyzed the results to sort out annotation errors. SW4ALL contains 8,669 pairs of documents that present different levels of language proficiency.


Bibliographic reference
Souza Wilkens, Rodrigo ; Zilio, Leonardo ; Fairon, Cédrick. SW4ALL: a CEFR Classified and Aligned Corpus for Language Learning. In: Nicoletta Calzolari, Khalid Choukri, Christopher Cieri et al., Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association: Paris, 2018
Permanent URL
http://hdl.handle.net/2078.1/208085