SOTOLÁŘ, Ondřej, Jaromír PLHÁK, Michal TKACZYK, Michaela LEBEDÍKOVÁ and David ŠMAHEL. Constructing Datasets from Dialogue Data. In Proceedings of the 16th Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2022. Brno: Tribun EU, 2022, p. 131-139. ISBN 978-80-263-1752-4.
Basic information
Original name Constructing Datasets from Dialogue Data
Name in Czech Sestavování datových souborů z dialogových dat
Authors SOTOLÁŘ, Ondřej (203 Czech Republic, guarantor, belonging to the institution), Jaromír PLHÁK (203 Czech Republic, belonging to the institution), Michal TKACZYK (616 Poland, belonging to the institution), Michaela LEBEDÍKOVÁ (203 Czech Republic, belonging to the institution) and David ŠMAHEL (203 Czech Republic, belonging to the institution).
Edition Brno, Proceedings of the 16th Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2022, p. 131-139, 9 pp. 2022.
Publisher Tribun EU
Other information
Original language English
Type of outcome Proceedings paper
Country of publisher Czech Republic
Confidentiality degree is not subject to a state or trade secret
Publication form printed version "print"
RIV identification code RIV/00216224:14330/22:00129251
Organization Fakulta informatiky – Repository
ISBN 978-80-263-1752-4
ISSN 2336-4289
Keywords in English Dialogue Dataset;Dataset Split;Online Conversations
Links GX19-27828X, research and development project.
Changed by: RNDr. Daniel Jakubík, učo 139797. Changed: 16/5/2024 04:14.
Abstract
We present methods for transforming raw dialogue data into a dataset suitable for processing with statistical NLP models. We identify potential pitfalls in processing this type of data, such as ensuring the representativeness of the sample, the generalization ability of models, and the definition of the local context of utterances. We use novel methods to solve these problems and demonstrate their effectiveness on an utterance classification problem. As a result, this paper provides guidelines for generating valuable datasets from dialogue data.
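The abstract does not spell out the methods themselves, so the following is only an illustrative sketch of two of the concerns it names, not the authors' procedure. It assumes each utterance is a dict with hypothetical keys `dialogue_id`, `text`, and `label`. The split keeps every dialogue entirely on one side, so a model is evaluated on conversations it has not seen, and the context helper pairs each utterance with a fixed window of preceding utterances as one possible definition of local context.

```python
import random
from collections import defaultdict


def split_by_dialogue(utterances, test_fraction=0.2, seed=0):
    """Split utterances so that each dialogue lands wholly in train or test.

    The keys "dialogue_id", "text", "label" are assumptions for this sketch.
    """
    by_dialogue = defaultdict(list)
    for utt in utterances:
        by_dialogue[utt["dialogue_id"]].append(utt)

    # Shuffle dialogue ids (not utterances) with a fixed seed for repeatability.
    ids = sorted(by_dialogue)
    random.Random(seed).shuffle(ids)

    n_test = max(1, round(len(ids) * test_fraction))
    test_ids = set(ids[:n_test])

    train = [u for d in ids if d not in test_ids for u in by_dialogue[d]]
    test = [u for d in ids if d in test_ids for u in by_dialogue[d]]
    return train, test


def with_local_context(dialogue, window=2):
    """Pair each utterance with up to `window` preceding utterances.

    `dialogue` is the ordered list of utterances of a single conversation.
    """
    return [
        {
            "context": [u["text"] for u in dialogue[max(0, i - window):i]],
            "text": utt["text"],
            "label": utt.get("label"),
        }
        for i, utt in enumerate(dialogue)
    ]
```

Splitting at the dialogue level rather than the utterance level is a common way to avoid leaking conversation-specific vocabulary between train and test sets; the window size is a free parameter that would need tuning for a given task.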
Type Name Uploaded/Created by Uploaded/Created Rights
RASLAN_2022_sotolar.pdf   File version 15/12/2022

Properties

Name
RASLAN_2022_sotolar.pdf
Address within IS
https://repozitar.cz/auth/repo/53147/1418592/
Address for the users outside IS
https://repozitar.cz/repo/53147/1418592/
Address within Manager
https://repozitar.cz/auth/repo/53147/1418592/?info
Address within Manager for the users outside IS
https://repozitar.cz/repo/53147/1418592/?info
Uploaded/Created
Thu 15/12/2022 04:34

Rights

Right to read
  • anyone on the Internet
Right to administer:
  • a concrete person Mgr. Lucie Vařechová, uco 106253
  • a concrete person RNDr. Daniel Jakubík, uco 139797
  • a concrete person Mgr. Jolana Surýnková, uco 220973