2022
Constructing Datasets from Dialogue Data
SOTOLÁŘ, Ondřej; Jaromír PLHÁK; Michal TKACZYK; Michaela LEBEDÍKOVÁ; David ŠMAHEL et al.
Basic information
Original name
Constructing Datasets from Dialogue Data
	Name in Czech
Sestavování datových souborů z dialogových dat
	Authors
SOTOLÁŘ, Ondřej (203 Czech Republic, guarantor, belonging to the institution); Jaromír PLHÁK (203 Czech Republic, belonging to the institution); Michal TKACZYK (616 Poland, belonging to the institution); Michaela LEBEDÍKOVÁ (203 Czech Republic, belonging to the institution) and David ŠMAHEL (203 Czech Republic, belonging to the institution)
			Edition
Brno, Proceedings of the 16th Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2022, p. 131-139, 9 pp. 2022
			Publisher
Tribun EU
		Other information
Language
English
		Type of outcome
Proceedings paper
		Country of publisher
Czech Republic
		Confidentiality degree
is not subject to a state or trade secret
		Publication form
printed version "print"
		RIV identification code
RIV/00216224:14330/22:00129251
		Organization
Faculty of Informatics – Repository
			ISBN
978-80-263-1752-4
		ISSN
EID Scopus
2-s2.0-85171476391
		Keywords in English
Dialogue Dataset;Dataset Split;Online Conversations
		Links
GX19-27828X, research and development project.
Changed: 16/5/2024 04:14, RNDr. Daniel Jakubík
		Abstract
In the original language
We present methods for transforming raw dialogue data into a dataset suitable for processing with statistical NLP models. We reveal potential pitfalls in processing this type of data, such as ensuring the representativeness of the sample, the generalization ability of models, and the definition of the local context of the utterances. We use novel methods to solve these problems and demonstrate their effectiveness on an utterance classification problem. As a result, this paper provides guidelines for generating valuable datasets from dialogue data.