Insert ("Formal definition will follow") after the problem description #2

Snailed · 2021-12-30T11:27:38Z

    
           The \textit{Approximate Similarity Search Problem} concerns efficiently finding a set $A$ in a corpus $\mathcal{F}$ that is approximately similar to a query set $Q$ with respect to the \textit{Jaccard Similarity} metric $J(A,Q) = \frac{|A\cap Q|}{|A\cup Q|}$\cite{dahlgaard2017fast}\cite{fast-similarity-search}. Practical applications include searching through large corpora of high-dimensional text documents, for example in plagiarism detection or website duplication checking\cite{vassilvitskii2018}. The main bottleneck in this problem is the \textit{curse of dimensionality}. A trivial algorithm can solve the problem in $O(nd|Q|)$ time, but algorithms whose query time is linear in the dimensionality of the corpus scale poorly on high-dimensional datasets. Text documents are especially bad in this regard, since they are often encoded using \textit{$w$-shingles} ($w$ contiguous words), which \citet{li2011hashing} show can easily reach a dimensionality upwards of $d=2^{83}$ using just $5$-shingles.\\
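
To make the definitions in the paragraph concrete, here is a minimal Python sketch of $w$-shingling, the Jaccard similarity, and a brute-force nearest-set search. It is only an illustration under stated assumptions: the function names (`shingles`, `jaccard`, `brute_force_search`) are hypothetical, whitespace tokenization is assumed, and the similarity of two empty sets is set to 1 by convention. It is not the method from the cited papers.

```python
def shingles(text, w):
    """Return the set of w-shingles (tuples of w contiguous words) of a text.

    Assumption: words are obtained by lowercasing and splitting on whitespace.
    """
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def jaccard(a, b):
    """Jaccard similarity J(A, Q) = |A ∩ Q| / |A ∪ Q| of two sets."""
    if not a and not b:
        return 1.0  # assumed convention for two empty sets
    return len(a & b) / len(a | b)


def brute_force_search(corpus, query):
    """Return the set in the corpus most similar to the query.

    This compares the query against every set in the corpus, so its cost
    grows with both the corpus size and the size of the sets.
    """
    return max(corpus, key=lambda a: jaccard(a, query))


if __name__ == "__main__":
    docs = [
        "the quick brown fox",
        "a completely different sentence here",
    ]
    corpus = [shingles(d, 2) for d in docs]
    query = shingles("the quick brown dog", 2)
    best = brute_force_search(corpus, query)
    print(sorted(best), jaccard(best, query))
```

In this example the query shares two of its three 2-shingles with the first document, and the union of the two shingle sets has four elements, so the similarity is $2/4 = 0.5$. The brute-force scan is the kind of trivial baseline the paragraph refers to; it touches every set in the corpus for each query.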
