Building a 70 billion word corpus of English from ClueWeb

Detailed Information on Publication Record

POMIKÁLEK, Jan, Pavel RYCHLÝ and Miloš JAKUBÍČEK. Building a 70 billion word corpus of English from ClueWeb. In Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12). Istanbul, Turkey: European Language Resources Association (ELRA), 2012, p. 502-506. ISBN 978-2-9517408-7-7.

Basic information
Original name	Building a 70 billion word corpus of English from ClueWeb
Authors	POMIKÁLEK, Jan (203 Czech Republic, belonging to the institution), Pavel RYCHLÝ (203 Czech Republic, belonging to the institution) and Miloš JAKUBÍČEK (203 Czech Republic, guarantor, belonging to the institution).
Edition	Istanbul, Turkey, Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12), p. 502-506, 5 pp. 2012.
Publisher	European Language Resources Association (ELRA)

Other information
Original language	English
Type of outcome	Proceedings paper
Field of Study	Informatics
Country of publisher	Czech Republic
Confidentiality degree	is not subject to a state or trade secret
Publication form	printed version "print"
RIV identification code	RIV/00216224:14330/12:00057572
Organization	Faculty of Informatics – Repository – Repository
ISBN	978-2-9517408-7-7
UT WoS	000323927700080
Keywords in English	corpus; clueweb; English; encoding; word sketch
Links	GAP401/10/0792, research and development project. LM2010013, research and development project. 248307, internal code Repo.
Changed by	RNDr. Daniel Jakubík, učo 139797. Changed: 1/9/2020 12:48.

Abstract

This work describes the creation of a 70 billion word text corpus of English. We used an existing language resource, the ClueWeb09 dataset, as the source of the corpus data. Processing such a vast amount of data presented several challenges, mainly in the pre-processing steps (boilerplate cleaning, text de-duplication) and the post-processing steps (indexing for efficient corpus querying with CQL, the Corpus Query Language). In this paper we explain how we tackled them: we describe the tools used for boilerplate cleaning (jusText) and for de-duplication (onion), which was performed not only on full (document-level) duplicates but also on near-duplicate texts. We also show the impact of each pre-processing step on the final corpus size. Finally, we show how the corpus indexing procedure was effectively parallelized within the Manatee corpus management system and during the computation of word sketches (one-page, automatic, corpus-derived summaries of a word’s grammatical and collocational behaviour) from the resulting corpus.
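
The near-duplicate removal mentioned in the abstract can be illustrated with a minimal sketch. This is not the onion tool or its interface: the function names, the 5-token n-gram size, and the 0.5 threshold below are assumptions chosen for illustration only. The idea shown is a common one, namely to drop a text when most of its word n-grams have already been seen in texts kept earlier.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-token windows of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def deduplicate(paragraphs: Iterable[str], n: int = 5,
                threshold: float = 0.5) -> List[str]:
    """Keep a paragraph unless more than `threshold` of its n-grams
    already occurred in previously kept paragraphs.

    Hypothetical helper: n and threshold are illustrative assumptions,
    not values taken from the paper.
    """
    seen: Set[Tuple[str, ...]] = set()
    kept: List[str] = []
    for text in paragraphs:
        # Naive whitespace tokenization, lowercased for matching.
        grams = ngrams(text.lower().split(), n)
        if not grams:
            # Shorter than n tokens: no n-grams to compare, so keep it.
            kept.append(text)
            continue
        dup_ratio = sum(g in seen for g in grams) / len(grams)
        if dup_ratio <= threshold:
            kept.append(text)
            seen.update(grams)
    return kept
```

Because the set of seen n-grams grows with the corpus, a sketch like this would need a much more compact representation (for example hashed n-grams) to be applied to data on the scale described here.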

Type	Name	Uploaded/Created
	lrec2012.pdf	8/1/2013
Properties
Name	lrec2012.pdf
Address within IS	https://repozitar.cz/auth/repo/15605/70357/
Address for the users outside IS	https://repozitar.cz/repo/15605/70357/
Address within Manager	https://repozitar.cz/auth/repo/15605/70357/?info
Address within Manager for the users outside IS	https://repozitar.cz/repo/15605/70357/?info
Uploaded/Created	Tue 8/1/2013 16:56
Rights
Right to read	anyone on the Internet
Right to upload
Right to administer	specific persons: RNDr. Daniel Jakubík, uco 139797; Mgr. Ľuboš Lunter, uco 143320
	lrec2012.pdf	1/9/2020
Properties
Name	lrec2012.pdf
Address within IS	https://repozitar.cz/auth/repo/15605/894672/
Address for the users outside IS	https://repozitar.cz/repo/15605/894672/
Address within Manager	https://repozitar.cz/auth/repo/15605/894672/?info
Address within Manager for the users outside IS	https://repozitar.cz/repo/15605/894672/?info
Uploaded/Created	Tue 1/9/2020 12:48
Rights
Right to read	anyone on the Internet
Right to upload
Right to administer	specific persons: Mgr. Lucie Vařechová, uco 106253; RNDr. Daniel Jakubík, uco 139797; Mgr. Jolana Surýnková, uco 220973
