Building a 70 billion word corpus of English from ClueWeb


Detailed publication record

POMIKÁLEK, Jan, Pavel RYCHLÝ and Miloš JAKUBÍČEK. Building a 70 billion word corpus of English from ClueWeb. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12). Istanbul, Turkey: European Language Resources Association (ELRA), 2012, pp. 502-506. ISBN 978-2-9517408-7-7.


Basic information
Original title	Building a 70 billion word corpus of English from ClueWeb
Authors	POMIKÁLEK, Jan (203 Czech Republic, belonging to the institution), Pavel RYCHLÝ (203 Czech Republic, belonging to the institution) and Miloš JAKUBÍČEK (203 Czech Republic, guarantor, belonging to the institution).
Edition	Istanbul, Turkey, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pp. 502-506, 5 pp. 2012.
Publisher	European Language Resources Association (ELRA)

Other information
Original language	English
Type of outcome	Proceedings paper
Field of study	Informatics
Country of publisher	Czech Republic
Confidentiality	not subject to a state or trade secret
Publication form	printed version "print"
WWW	URL
RIV identification code	RIV/00216224:14330/12:00057572
Organization	Faculty of Informatics – Masaryk University – Repository
ISBN	978-2-9517408-7-7
UT WoS	000323927700080
Keywords in English	corpus; clueweb; English; encoding; word sketch
Links	GAP401/10/0792, research and development project. LM2010013, research and development project. 248307, internal Repo code.
Changed by	Changed by: RNDr. Daniel Jakubík, učo 139797. Changed: 1 September 2020, 12:48.

Abstract

This work describes the creation of a 70 billion word text corpus of English. We used an existing language resource, the ClueWeb09 dataset, as the source of the corpus data. Processing such a vast amount of data presented several challenges, mainly in the pre-processing steps (boilerplate cleaning, text de-duplication) and the post-processing steps (indexing for efficient corpus querying with CQL, the Corpus Query Language). In this paper we explain how we tackled them: we describe the tools used for boilerplate cleaning (jusText) and for de-duplication (onion), which was performed not only on full (document-level) duplicates but also on near-duplicate texts. Moreover, we show the impact of each pre-processing step on the final corpus size. Furthermore, we show how the corpus indexing procedure was effectively parallelized within the Manatee corpus management system and during the computation of word sketches (one-page, automatic, corpus-derived summaries of a word's grammatical and collocational behaviour) from the resulting corpus.
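
The two levels of de-duplication mentioned in the abstract can be illustrated with a minimal sketch. It is not the onion implementation and does not reproduce the paper's settings: the function names, the n-gram length `n` and the `threshold` value are all assumptions chosen for illustration. Exact duplicates are caught by hashing the whole document; near duplicates are caught by checking what fraction of a document's word n-grams has already been seen in previously kept documents.

```python
import hashlib


def shingles(text, n=5):
    """Return the set of word n-grams (shingles) in a text."""
    words = text.lower().split()
    if len(words) < n:
        # Short texts yield a single shingle made of all their words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def deduplicate(documents, n=5, threshold=0.5):
    """Keep documents that are neither exact nor near duplicates of earlier ones.

    A document is dropped when more than `threshold` of its n-grams were
    already seen in previously kept documents. Both `n` and `threshold`
    are illustrative values, not settings taken from the paper.
    """
    seen_hashes = set()    # digests of kept documents (exact duplicates)
    seen_shingles = set()  # n-grams of kept documents (near duplicates)
    kept = []
    for doc in documents:
        digest = hashlib.sha1(doc.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # document-level duplicate
        doc_shingles = shingles(doc, n)
        if doc_shingles:
            overlap = len(doc_shingles & seen_shingles) / len(doc_shingles)
            if overlap > threshold:
                continue  # near duplicate of earlier content
        seen_hashes.add(digest)
        seen_shingles |= doc_shingles
        kept.append(doc)
    return kept
```

Holding every n-gram in an in-memory set is only practical for small inputs; at the scale described in the abstract, a production tool would need a more compact representation of the seen n-grams.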

Type	Name	Uploaded
	lrec2012.pdf	8 January 2013
Properties
Name	lrec2012.pdf
Address within the IS	https://repozitar.cz/auth/repo/15605/70357/
Address from outside	https://repozitar.cz/repo/15605/70357/
Address to the Manager	https://repozitar.cz/auth/repo/15605/70357/?info
From outside to the Manager	https://repozitar.cz/repo/15605/70357/?info
Uploaded	Tue 8 January 2013, 16:56
Rights
Right to read	anyone on the Internet
Right to upload
Right to manage	person RNDr. Daniel Jakubík, učo 139797; person Mgr. Ľuboš Lunter, učo 143320
Attributes
	lrec2012.pdf	1 September 2020
Properties
Name	lrec2012.pdf
Address within the IS	https://repozitar.cz/auth/repo/15605/894672/
Address from outside	https://repozitar.cz/repo/15605/894672/
Address to the Manager	https://repozitar.cz/auth/repo/15605/894672/?info
From outside to the Manager	https://repozitar.cz/repo/15605/894672/?info
Uploaded	Tue 1 September 2020, 12:48
Rights
Right to read	anyone on the Internet
Right to upload
Right to manage	person Mgr. Lucie Vařechová, učo 106253; person RNDr. Daniel Jakubík, učo 139797; person Mgr. Jolana Surýnková, učo 220973
Attributes
