SCHMALZ, Verena, Jennifer-Carmen FREY and Egon STEMLE. Introducing a Gold Standard Corpus from Young Multilinguals for the Evaluation of Automatic UD-PoS Taggers for Italian. Online. In 8th Italian Conference on Computational Linguistics, CLiC-it 2021. Milan, Italy: CEUR Workshop Proceedings, 2021, p. 1-7. ISSN 1613-0073.
Basic information
Original name Introducing a Gold Standard Corpus from Young Multilinguals for the Evaluation of Automatic UD-PoS Taggers for Italian
Authors SCHMALZ, Verena (380 Italy), Jennifer-Carmen FREY (40 Austria) and Egon STEMLE (276 Germany, guarantor, belonging to the institution).
Edition Milan, Italy, 8th Italian Conference on Computational Linguistics, CLiC-it 2021, p. 1-7, 7 pp. 2021.
Publisher CEUR Workshop Proceedings
Other information
Original language English
Type of outcome Proceedings paper
Country of publisher Italy
Confidentiality degree is not subject to a state or trade secret
Publication form electronic version available online
RIV identification code RIV/00216224:14330/21:00125291
Organization Fakulta informatiky – Repository – Repository
ISSN 1613-0073
Keywords in English PoS tagging; automatic evaluation
Changed by RNDr. Daniel Jakubík, učo 139797. Changed: 7/4/2023 04:30.
Abstract
Part-of-speech (PoS) tagging is a common task in Natural Language Processing (NLP), given its widespread applicability. However, with the advance of new information technologies and language variation, the contents and methods for PoS-tagging have changed. The majority of existing Italian data for this task originate from standard texts, where language use is far removed from multifaceted, informal real-life situations. Automatic PoS-tagging models trained with such data do not perform reliably on non-standard language, such as social media content or language learners’ texts. Our aim is to provide additional training and evaluation data from language learners tagged in Universal Dependencies (UD), as well as to test current automatic PoS-tagging systems and evaluate their performance on such data. We use a multilingual corpus of young language learners, LEONIDE, to create a tagged gold standard for evaluating UD PoS-tagging performance on non-standard Italian. With the 3.7 version of Stanza, a Python NLP package, we apply available automatic PoS-taggers, namely ISDT, ParTUT, POSTWITA, TWITTIRÒ and VIT, trained with both standard and non-standard data, to our dataset. Our results show that the above taggers, trained on non-standard data or multilingual Treebanks, can achieve up to 95% accuracy on multilingual learner data, if combined.
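The evaluation described in the abstract can be illustrated with a minimal sketch. It assumes token-aligned UPOS sequences and computes accuracy against a gold standard. The abstract does not state how the taggers are combined, so the majority vote below is only an assumption, and the function names and toy tag sequences are hypothetical (they are not taken from LEONIDE or from the paper's results).

```python
from collections import Counter


def upos_accuracy(gold, predicted):
    """Share of tokens whose predicted UPOS tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be token-aligned")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def majority_vote(predictions):
    """Combine token-aligned tag sequences from several taggers.

    Assumption: simple majority vote per token; on a tie, the tag
    from the earliest tagger in the list wins.
    """
    return [Counter(tags).most_common(1)[0][0] for tags in zip(*predictions)]


# Toy data, invented for illustration only.
gold = ["DET", "NOUN", "VERB", "ADJ"]
tagger_a = ["DET", "NOUN", "VERB", "NOUN"]
tagger_b = ["DET", "VERB", "VERB", "ADJ"]
tagger_c = ["PRON", "NOUN", "VERB", "ADJ"]

combined = majority_vote([tagger_a, tagger_b, tagger_c])
for name, tags in [("a", tagger_a), ("b", tagger_b), ("c", tagger_c), ("combined", combined)]:
    print(name, upos_accuracy(gold, tags))
```

In this toy case each tagger makes a different single error, so the vote recovers every gold tag. On real learner data the gain from combining would depend on how strongly the taggers' errors overlap.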
Type Name Uploaded/Created by Uploaded/Created Rights
paper13.pdf Licence Creative Commons  File version 27/1/2022

Properties

Name
paper13.pdf
Address within IS
https://repozitar.cz/auth/repo/48309/1233896/
Address for the users outside IS
https://repozitar.cz/repo/48309/1233896/
Address within Manager
https://repozitar.cz/auth/repo/48309/1233896/?info
Address within Manager for the users outside IS
https://repozitar.cz/repo/48309/1233896/?info
Uploaded/Created
Thu 27/1/2022 02:16

Rights

Right to read
  • anyone on the Internet
Right to upload
 
Right to administer:
  • a concrete person Mgr. Lucie Vařechová, uco 106253
  • a concrete person RNDr. Daniel Jakubík, uco 139797
  • a concrete person Mgr. Jolana Surýnková, uco 220973
Attributes
 