D 2015

Determining Window Size from Plagiarism Corpus for Stylometric Features

SUCHOMEL, Šimon and Michal BRANDEJS

Basic information

Original name

Determining Window Size from Plagiarism Corpus for Stylometric Features

Authors

SUCHOMEL, Šimon and Michal BRANDEJS

Edition

Toulouse, France, Experimental IR Meets Multilinguality, Multimodality, and Interaction, p. 293-299, 7 pp. 2015

Publisher

Springer International Publishing

Other information

Language

English

Type of outcome

Proceedings paper

Field of Study

Informatics

Country of publisher

France

Confidentiality degree

is not subject to a state or trade secret

Publication form

printed version "print"

References:

Marked to be transferred to RIV

Yes

RIV identification code

RIV/00216224:14330/15:00084706

Organization

Fakulta informatiky – Repository

ISBN

978-3-319-24026-8

ISSN

Keywords in English

plagiarism; average word frequency class; stylometry; text classification; intrinsic plagiarism

Links

LG13010, research and development project.
Changed: 2/9/2020 09:52, RNDr. Daniel Jakubík

Abstract

In the original language

The sliding window concept is a common method for computing a profile of a document with unknown structure. This paper outlines an experiment with a stylometric word-based feature in order to determine the optimal size of the sliding window. It was conducted for a vocabulary richness method called ‘average word frequency class’ using the PAN 2015 source retrieval training corpus for plagiarism detection. The paper shows the pros and cons of stop-word removal for sliding window document profiling and discusses the use of the selected feature for intrinsic plagiarism detection. The experiment resulted in the recommendation to set the sliding window to around 100 words in length when computing the text profile with the average word frequency class stylometric feature.
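
To illustrate the kind of profile the abstract describes, here is a minimal Python sketch. It is not the authors' implementation. It assumes the commonly used definition of word frequency class, floor(log2(f(w*) / f(w))), where w* is the most frequent word in a reference frequency list. The function names, the `step` parameter, the `stop_words` parameter, and the choice to drop stop words before windowing are all hypothetical and introduced only for this example.

```python
import math


def word_frequency_class(word, freq, max_freq):
    """Class 0 is the most frequent word; rarer words get higher classes.

    Assumed definition: floor(log2(f(most frequent word) / f(word))).
    """
    return math.floor(math.log2(max_freq / freq[word]))


def awfc_profile(tokens, freq, window=100, step=50, stop_words=frozenset()):
    """Average word frequency class for each sliding window over `tokens`.

    `freq` maps words to counts in some reference corpus (an assumption:
    the record does not say which frequency list the paper used).
    Tokens missing from `freq` are ignored; a window with no known
    tokens yields None.
    """
    # Hypothetical choice: remove stop words before cutting windows,
    # so every window still spans `window` remaining words.
    tokens = [t for t in tokens if t not in stop_words]
    max_freq = max(freq.values())
    profile = []
    # A text shorter than one window is treated as a single window.
    last_start = max(len(tokens) - window, 0)
    for start in range(0, last_start + 1, step):
        known = [t for t in tokens[start:start + window] if t in freq]
        if not known:
            profile.append(None)
            continue
        classes = [word_frequency_class(t, freq, max_freq) for t in known]
        profile.append(sum(classes) / len(classes))
    return profile
```

The default `window=100` mirrors the window length recommended in the abstract; the overlap given by `step=50` is an arbitrary illustrative value. Windows whose value deviates strongly from the rest of the profile would be the candidates to inspect in an intrinsic plagiarism setting.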
