Colson, Jean-Pierre
[UCL]
One of the main goals of corpus-based and computational phraseology is to develop tools and algorithms that can extract all phrasemes in an objective and reproducible way. An innovative approach to this daunting challenge has recently been taken by the Parseme project (PARSEME: PARSing and Multi-word Expressions, https://typo.uni-konstanz.de/parseme/). As is often the case within the field of computational linguistics, a shared task was proposed on the occasion of the Parseme workshops: on the basis of a gold set of data, research teams are invited to take part in the task by analyzing blind data with their algorithms. An automated program then evaluates the results sent by the participants.
In this paper, I present the results of an original experiment with the Parseme 2018 Shared Task, devoted to the extraction of verbal multiword expressions (VMWEs). In designing this experiment, the following research questions were posed:
1. To what extent is there a cultural bias in the parsing and analysis of phraseological units, even when fully automated tools are used?
2. Should future work on the automatic extraction of phraseology rely more on machine learning (especially deep learning) or on corpus-based methods?
Regarding the first research question, several elements are examined in the Parseme dataset and in the results provided by the different systems. The selection of the various categories of verbal multiword expressions is the result of careful work by the organizers of the workshop, but it remains open to criticism as regards the number of categories and the labels used. As is often the case in computational linguistics, the categories reveal a partly Eurocentric vision of linguistic structure, with syntax and morphology at the center of meaningful constructions. In spite of sometimes arbitrary decisions on the borderline between several categories of VMWEs, the Parseme data provide the researcher with a gold mine of testable hypotheses.
One of the most fascinating of these is the link between natural association (as measured by statistical scores) and culturally imposed associations (based on specific grammatical categories). The results of this study demonstrate that cultural elements come into play at various levels of the compilation and analysis of the dataset. In addition, the model proposed for the analysis of the data differs from the approaches behind the main results obtained in the Parseme workshop, in that it relies purely on two general corpora and does not use any training data. This approach, which can be situated more on the “open track” side (with external data: the corpora), yields scores (measured by the automated tools of Parseme 2018) that are, for most categories, superior to those obtained by deep learning, especially in the case of verbal idioms. The recall scores, in particular, outperform those obtained by deep learning, which suggests that a combination of both approaches might be useful in the future.
Bibliographic reference
Colson, Jean-Pierre. Phraseology, data and culture: an experiment with the Parseme 2018 dataset. EUROPHRAS 2019, Computational and Corpus-based phraseology (University of Málaga, from 25/09/2019 to 27/09/2019).
Permanent URL
http://hdl.handle.net/2078.1/224139