D 2021

Introducing a Gold Standard Corpus from Young Multilinguals for the Evaluation of Automatic UD-PoS Taggers for Italian

SCHMALZ, Verena; Jennifer-Carmen FREY and Egon STEMLE

Basic information

Original title

Introducing a Gold Standard Corpus from Young Multilinguals for the Evaluation of Automatic UD-PoS Taggers for Italian

Authors

SCHMALZ, Verena; Jennifer-Carmen FREY and Egon STEMLE

Edition

Milan, Italy, 8th Italian Conference on Computational Linguistics, CLiC-it 2021, pp. 1-7, 7 pp., 2021

Publisher

CEUR Workshop Proceedings

Additional information

Language

English

Result type

Proceedings paper

Publisher country

Italy

Confidentiality

not subject to state or trade secrecy

Publication form

electronic version "online"

Links

URL

Marked for transfer to RIV

Yes

RIV code

RIV/00216224:14330/21:00125291

Organization

Faculty of Informatics – Masaryk University – Repository

ISSN

EID Scopus

2-s2.0-85121223452

Keywords in English

PoS tagging; automatic evaluation
Changed: 7 April 2023, 04:30, RNDr. Daniel Jakubík

Abstract

In the original

Part-of-speech (PoS) tagging constitutes a common task in Natural Language Processing (NLP), given its widespread applicability. However, with the advance of new information technologies and language variation, the contents and methods for PoS-tagging have changed. The majority of existing Italian data for this task originate from standard texts, where language use is far from multifaceted informal real-life situations. Automatic PoS-tagging models trained with such data do not perform reliably on non-standard language, like social media content or language learners’ texts. Our aim is to provide additional training and evaluation data from language learners tagged in Universal Dependencies (UD), as well as to test current automatic PoS-tagging systems and evaluate their performance on such data. We use a multilingual corpus of young language learners, LEONIDE, to create a tagged gold standard for evaluating UD PoS-tagging performance on non-standard Italian. With the 3.7 version of Stanza, a Python NLP package, we apply available automatic PoS-taggers, namely ISDT, ParTUT, POSTWITA, TWITTIRÒ and VIT, trained with both standard and non-standard data, to our dataset. Our results show that the above taggers, trained on non-standard data or multilingual treebanks, can achieve up to 95% accuracy on multilingual learner data, if combined.
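
The evaluation described in the abstract amounts to comparing each tagger's output with the gold-standard tags token by token. The following is a minimal sketch of that comparison, not the authors' code. The toy sentence, its tags, and the helper names `accuracy` and `majority_vote` are invented for illustration. The abstract does not say how the taggers were combined, so the per-token majority vote used here is purely an assumption.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Share of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must align token by token")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def majority_vote(predictions):
    """Combine taggers per token (an assumed scheme, not the paper's).

    Ties go to the tag proposed by the tagger listed first.
    """
    combined = []
    for tags in zip(*predictions.values()):
        counts = Counter(tags)
        best = max(counts.values())
        combined.append(next(t for t in tags if counts[t] == best))
    return combined


# Invented toy data: UPOS tags for "Ieri ho visto un film bellissimo".
gold = ["ADV", "AUX", "VERB", "DET", "NOUN", "ADJ"]

# Hypothetical outputs keyed by tagger name; real outputs would come
# from running each tagger on the same tokenization as the gold data.
predictions = {
    "isdt":     ["ADV", "AUX", "VERB", "DET", "NOUN", "NOUN"],
    "postwita": ["ADV", "VERB", "VERB", "DET", "NOUN", "ADJ"],
    "vit":      ["NOUN", "AUX", "VERB", "DET", "NOUN", "ADJ"],
}

scores = {name: accuracy(gold, tags) for name, tags in predictions.items()}
combined = majority_vote(predictions)
combined_score = accuracy(gold, combined)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print(f"combined: {combined_score:.3f}")
```

In this toy case each tagger makes a different single error, so the vote recovers every gold tag. Real learner data will not be this tidy, and token-level accuracy presupposes that the gold corpus and the tagger output share the same tokenization.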