Detailed publication record

In the original (translated from Czech)

In this text we present an experimental probe that draws on automatically transcribed spoken data from the Czech Radio (Český rozhlas) archive. This archive contains the largest collection of spoken documents recorded over the last ninety years. The automatic transcription of a portion of Czech Radio programmes is the goal of the project from which our study arises. A distinctive feature of this transcription is that the original audio track is stored in parallel with the textual form. This dual storage of linguistic information extends the possibilities for examining language more closely and for working with information. One important aspect of converting the spoken word into written form is the automatic recognition of the formal units of speech, that is, the boundaries of words and also of sentences, which is closely tied to the use of punctuation that makes the text easy to read. It was precisely the use of punctuation in automatically transcribed speech that became the aim of our investigation. The test we designed for this purpose shows how native speakers perceive speech and their need to use punctuation to segment the text. The results of this study served to train a program for the automatic recognition of speech units at the syntactic level and for the automatic use of punctuation. For the purposes of the research probe, a text was assembled from the utterances of typologically different speakers (the total length of the spoken text was 30 minutes; 5,247 words). This text was presented to fifty respondents, who added punctuation to it. For the experiment we used two special tools: NanoTrans, for playing back the audio of the texts and for adding punctuation to their transcribed form, and Transcription Viewer, for comparing the use of punctuation across individual respondents.

In English

In this paper we introduce an experimental probe based on transcribed texts of spoken documents stored in the large Czech Radio audio archive. This archive contains the largest collection of spoken documents recorded during the last 90 years. The ultimate goal of the project introduced in the paper is to transcribe part of the audio archive and store the transcriptions in a database in which it will be possible to search and retrieve information. The value of such a search is that information can be found on two linguistic levels: in the written form and in the spoken form. This dual storage of information is especially important for convenient retrieval, and it greatly extends the possibilities for working with the information. One important issue in converting speech into written text is the automatic delimitation of speech units and of sentences or clauses in the final text processing, which is connected with the use of punctuation, itself important for convenient reading of the transcribed texts. For this reason we decided to test Czech native speakers' perception of speech and their need for punctuation in the transcribed texts. The results have served to train a program for the automatic recognition of speech units and the correct insertion of punctuation. For the probe we prepared a mix of texts spoken by typologically diverse speakers (the length of the speech was 30 minutes; 5,247 words). These were given to 50 respondents, whose task was to add punctuation to the automatically transcribed texts. We used two special tools to run this experiment. Respondents used NanoTrans to insert punctuation. The other tool, Transcription Viewer, was written especially for the probe and was used to view and compare the respondents' performance. In the text we give detailed information about these comparisons.

Použití interpunkce v automatických přepisech mluveného slova [The use of punctuation in automatic transcripts of the spoken word]
