Přehled o publikaci
2018
Implementation Notes for the Soft Cosine Measure
NOVOTNÝ, VítBasic information
Original name
Implementation Notes for the Soft Cosine Measure
Authors
NOVOTNÝ, Vít
Edition
Torino, Italy, Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM '18), p. 1639-1642, 4 pp. 2018
Publisher
Association for Computing Machinery
Other information
Language
English
Type of outcome
Proceedings paper
Country of publisher
Italy
Confidentiality degree
is not subject to a state or trade secret
Publication form
electronic version available online
Marked to be transferred to RIV
Yes
RIV identification code
RIV/00216224:14330/18:00101853
Organization
Fakulta informatiky – Repository – Repository
ISBN
978-1-4503-6014-2
UT WoS
EID Scopus
Keywords in English
Vector Space Model; computational complexity; similarity measure
Links
MUNI/A/1038/2017, interní kód Repo. MUNI/A/1213/2017, interní kód Repo. TD03000295, research and development project.
Changed: 26/4/2022 03:08, RNDr. Daniel Jakubík
Abstract
In the original language
The standard bag-of-words vector space model (VSM) is efficient, and ubiquitous in information retrieval, but it underestimates the similarity of documents with the same meaning, but different terminology. To overcome this limitation, Sidorov et al. proposed the Soft Cosine Measure (SCM) that incorporates term similarity relations. Charlet and Damnati showed that the SCM is highly effective in question answering (QA) systems. However, the orthonormalization algorithm proposed by Sidorov et al. has an impractical time complexity of O(n^4), where n is the size of the vocabulary. In this paper, we prove a tighter lower worst-case time complexity bound of O(n^3). We also present an algorithm for computing the similarity between documents and we show that its worst-case time complexity is O(1) given realistic conditions. Lastly, we describe implementation in general-purpose vector databases such as Annoy, and Faiss and in the inverted indices of text search engines such as Apache Lucene, and ElasticSearch. Our results enable the deployment of the SCM in real-world information retrieval systems.