D 2018

Implementation Notes for the Soft Cosine Measure

NOVOTNÝ, Vít

Basic information

Original name

Implementation Notes for the Soft Cosine Measure

Authors

NOVOTNÝ, Vít

Edition

Torino, Italy, Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM '18), p. 1639-1642, 4 pp. 2018

Publisher

Association for Computing Machinery

Other information

Language

English

Type of outcome

Proceedings paper

Country of publisher

Italy

Confidentiality degree

is not subject to a state or trade secret

Publication form

electronic version available online

References:

Marked to be transferred to RIV

Yes

RIV identification code

RIV/00216224:14330/18:00101853

Organization

Fakulta informatiky – Repository – Repository

ISBN

978-1-4503-6014-2

EID Scopus

Keywords in English

Vector Space Model; computational complexity; similarity measure

Links

MUNI/A/1038/2017, interní kód Repo. MUNI/A/1213/2017, interní kód Repo. TD03000295, research and development project.
Changed: 26/4/2022 03:08, RNDr. Daniel Jakubík

Abstract

In the original language

The standard bag-of-words vector space model (VSM) is efficient, and ubiquitous in information retrieval, but it underestimates the similarity of documents with the same meaning, but different terminology. To overcome this limitation, Sidorov et al. proposed the Soft Cosine Measure (SCM) that incorporates term similarity relations. Charlet and Damnati showed that the SCM is highly effective in question answering (QA) systems. However, the orthonormalization algorithm proposed by Sidorov et al. has an impractical time complexity of O(n^4), where n is the size of the vocabulary. In this paper, we prove a tighter lower worst-case time complexity bound of O(n^3). We also present an algorithm for computing the similarity between documents and we show that its worst-case time complexity is O(1) given realistic conditions. Lastly, we describe implementation in general-purpose vector databases such as Annoy, and Faiss and in the inverted indices of text search engines such as Apache Lucene, and ElasticSearch. Our results enable the deployment of the SCM in real-world information retrieval systems.

Files attached