D
2022
Constructing Datasets from Dialogue Data
SOTOLÁŘ, Ondřej; Jaromír PLHÁK; Michal TKACZYK; Michaela LEBEDÍKOVÁ; David ŠMAHEL et al.
Basic information
Original name
Constructing Datasets from Dialogue Data
Name in Czech
Sestavování datových souborů z dialogových dat
Authors
SOTOLÁŘ, Ondřej; Jaromír PLHÁK; Michal TKACZYK; Michaela LEBEDÍKOVÁ and David ŠMAHEL
Edition
Brno, Proceedings of the 16th Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2022, p. 131-139, 9 pp. 2022
Other information
Type of outcome
Proceedings paper
Country of publisher
Czech Republic
Confidentiality degree
is not subject to a state or trade secret
Publication form
printed version "print"
Marked to be transferred to RIV
Yes
RIV identification code
RIV/00216224:14330/22:00129251
Organization
Faculty of Informatics – Repository
Keywords in English
Dialogue Dataset; Dataset Split; Online Conversations
Links
GX19-27828X, research and development project.
In the original language
We present methods for transforming raw dialogue data into a dataset suitable for processing with statistical NLP models. We identify potential pitfalls in processing this type of data, such as ensuring the representativeness of the sample, preserving the generalization ability of models, and defining the local context of utterances. We use novel methods to address these problems and demonstrate their effectiveness on an utterance classification problem. As a result, this paper provides guidelines for generating valuable datasets from dialogue data.