D 2022

Constructing Datasets from Dialogue Data

SOTOLÁŘ, Ondřej; Jaromír PLHÁK; Michal TKACZYK; Michaela LEBEDÍKOVÁ; David ŠMAHEL et al.

Basic information

Original name

Constructing Datasets from Dialogue Data

Name in Czech

Sestavování datových souborů z dialogových dat

Authors

SOTOLÁŘ, Ondřej (203 Czech Republic, guarantor, belonging to the institution); Jaromír PLHÁK (203 Czech Republic, belonging to the institution); Michal TKACZYK (616 Poland, belonging to the institution); Michaela LEBEDÍKOVÁ (203 Czech Republic, belonging to the institution) and David ŠMAHEL (203 Czech Republic, belonging to the institution)

Edition

Brno, Proceedings of the 16th Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2022, p. 131-139, 9 pp. 2022

Publisher

Tribun EU

Other information

Language

English

Type of outcome

Proceedings paper

Country of publisher

Czech Republic

Confidentiality degree

is not subject to a state or trade secret

Publication form

printed version "print"

References:

RIV identification code

RIV/00216224:14330/22:00129251

Organization

Faculty of Informatics – Repository

ISBN

978-80-263-1752-4

ISSN

EID Scopus

2-s2.0-85171476391

Keywords in English

Dialogue Dataset;Dataset Split;Online Conversations

Links

GX19-27828X, research and development project.
Changed: 16/5/2024 04:14, RNDr. Daniel Jakubík

Abstract

In the original language

We present methods for transforming raw dialogue data into a dataset suitable for processing with statistical NLP models. We identify potential pitfalls in processing this type of data, such as ensuring the representativeness of the sample and the generalization ability of models, and defining the local context of utterances. We use novel methods to address these problems and demonstrate their effectiveness on an utterance classification problem. As a result, this paper provides guidelines for generating valuable datasets from dialogue data.
