Commit
added styles and finished first draft
Snailed committed Dec 30, 2021
1 parent 7336b49 commit 5aa32e8
Showing 86 changed files with 355 additions and 8 deletions.
8 changes: 8 additions & 0 deletions bibliography.bib
@@ -104,3 +104,11 @@ @misc{futhark
howpublished = {\url{https://futhark-lang.org/}},
note = {Accessed: 2021-12-22}
}
@inproceedings{andoni2006efficient,
author = {Andoni, Alexandr and Indyk, Piotr},
year = {2006},
month = {01},
pages = {1203--1212},
title = {Efficient algorithms for substring near neighbor problem},
doi = {10.1145/1109557.1109690}
}
45 changes: 45 additions & 0 deletions forside/eksempel.tex
@@ -0,0 +1,45 @@
\documentclass[11pt]{article}
\usepackage[a4paper, hmargin={2.8cm, 2.8cm}, vmargin={2.5cm, 2.5cm}]{geometry} % Geometry package: controls the margins, among other things %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\usepackage[babel, lille]{ku-forside} % KU front page
%
% Mini-manual for the ku-forside package:
%
% Language options: da, en
% babel loads the babel package with the chosen language
% Faculty options: farma, hum, jur, ku, life, nat, samf, sund, teo
% Colour options: sh, farve
% Front page options: lille, stor, titelside
% titelside is identical to the design at ku.dk/designmanual
% lille gives a small logo together with the title on the first page
% stor gives a large logo together with the title on the first page
%
% The default is [da,nat,farve,titelside]
%
% E.g. \usepackage[babel, lille, jur, sh, en]{ku-forside} gives a small black-and-white logo for the Faculty of Law and loads the babel package with English as the language.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% Title %%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\titel{Test} %
\undertitel{Test test} %
\opgave{Procrastination} % Only available with 'titelside'
\forfatter{The Name}%
\dato{\today}%
\vejleder{The Doctor} % Only available with 'titelside'
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% The document begins here %%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}
\maketitle % CREATES THE TITLE
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%% TEXT BEGINS HERE %%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Reserving on an aggregated level}
\begin{equation}
R=\sum_{i=2}^{n}R_{i}=\sum_{i=2}^{n}D_{i(n+1-i)}\left(\prod_{j=n+1-i}^{n-1}\hat{f}_{j}-1\right)
\label{(a0)}
\end{equation}


\end{document}
7 changes: 4 additions & 3 deletions introduction/main.tex
@@ -1,6 +1,7 @@
\section{Introduction}
The \textit{Approximate Similarity Search Problem} concerns efficiently finding a set $A$ from a corpus $\mathcal{F}$ that is approximately similar to a query set $Q$ with respect to the \textit{Jaccard Similarity} metric $J(A,Q) = \frac{|A\cap Q|}{|A\cup Q|}$\cite{dahlgaard2017fast}\cite{fast-similarity-search}. Practical applications include searching through large corpora of high-dimensional text documents, for example for plagiarism detection or website duplication checking\cite{vassilvitskii2018}. The main bottleneck in this problem is the \textit{curse of dimensionality}. A trivial algorithm can solve this problem in $O(nd|Q|)$ time, but algorithms whose query time is linear in the dimensionality of the corpus scale poorly on high-dimensional datasets. Text documents are especially bad in this regard since they are often encoded using \textit{$w$-shingles} ($w$ contiguous words), which \citet{li2011hashing} show can easily reach a dimensionality upwards of $d=2^{83}$ using just $5$-shingles.\\
The classic solution to this problem is the MinHash algorithm presented by \citet{broder1997minhash} to perform website duplication checking for the AltaVista search engine. It preprocesses the data once using hashing, after which queries run in $O(n + |Q|)$ time, a significant improvement that is independent of the dimensionality of the corpus.
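As a small worked example of these definitions (the sets are chosen here purely for illustration and do not come from the cited papers), let $A = \{1,2,3,4\}$ and $Q = \{3,4,5\}$. Then
\[
J(A,Q) = \frac{|A\cap Q|}{|A\cup Q|} = \frac{|\{3,4\}|}{|\{1,2,3,4,5\}|} = \frac{2}{5}.
\]
MinHash builds on the fact that, assuming a hash function $h$ that orders the universe uniformly at random,
\[
\Pr\left[\min_{a\in A} h(a) = \min_{q\in Q} h(q)\right] = J(A,Q),
\]
since the smallest hash value in $A\cup Q$ is equally likely to belong to any of its elements, and the two minima agree exactly when that element lies in $A\cap Q$. In this instance the two minima would therefore agree with probability $2/5$.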
Many improvements have since been presented to improve processing time, query time and space efficiency. Notable mentions include (but are not limited to) \textit{b-bit minwise hashing}\cite{ping2011theory}, \textit{fast similarity sketching}\cite{dahlgaard2017fast} and \textit{parallel bit-counting}\cite{fast-similarity-search} (the latter of which is the main focus of this project).
These contributions have brought the query time down to sublinear time while keeping a constant error probability.\\
The addition of parallel bit-counting for querying
Many improvements have since been presented to improve processing time, query time and space efficiency. Notable mentions include (but are not limited to) the use of \textit{tensoring}\cite{andoni2006efficient}, \textit{b-bit minwise hashing}\cite{ping2011theory} and \textit{fast similarity sketching}\cite{dahlgaard2017fast}. Simple applications of these techniques lead to efficient querying with a constant error probability. If one wishes to achieve a better error probability such as $\varepsilon = o(1)$, it is standard practice within the field to use $O(\log_2(1/\varepsilon))$ independent data structures and return the best result, resulting in a query time of $O(\log_2(1/\varepsilon) (n^\rho + |Q|))$. Recent advances by \citet{fast-similarity-search} show that it is possible to achieve an even better query time by sampling these data structures from one large sketch. The similarity between these sub-sketches and a query set needs to be evaluated efficiently when querying, which requires efficient computation of the cardinality of a bit-string. To do this, \citet{fast-similarity-search} present a general parallel bit-counting algorithm that computes the cardinality of a list of bit-strings in amortized sub-linear time thanks to word-parallelism. This brings the query time down to $O((\frac{n\log_2 w}{w})^\rho \log(1/\varepsilon) + |Q|)$.\\
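To give a feel for word-parallel bit-counting, the following is a minimal sketch of the classical masked-addition technique on a single $8$-bit word. It is not the algorithm of \citet{fast-similarity-search}, and the word $x$ and the masks $m_1, m_2, m_4$ are assumptions chosen here only for illustration. Take $x = \mathtt{10110110}$, which has cardinality $5$, and let $m_1 = \mathtt{01010101}$, $m_2 = \mathtt{00110011}$ and $m_4 = \mathtt{00001111}$, where $\&$ denotes bitwise AND and $\gg$ a right shift:
\[
\begin{array}{rclcl}
x_1 &=& (x \mathbin{\&} m_1) + ((x \gg 1) \mathbin{\&} m_1) &=& \mathtt{01100101},\\
x_2 &=& (x_1 \mathbin{\&} m_2) + ((x_1 \gg 2) \mathbin{\&} m_2) &=& \mathtt{00110010},\\
x_3 &=& (x_2 \mathbin{\&} m_4) + ((x_2 \gg 4) \mathbin{\&} m_4) &=& \mathtt{00000101} = 5.
\end{array}
\]
After round $k$, each block of $2^k$ bits holds the number of set bits in the corresponding block of $x$, so a $w$-bit word needs only $\log_2 w$ rounds of word operations rather than $w$ single-bit inspections.\\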
The main focus of this project is to analyse, prove, implement and evaluate this parallel bit-counting technique. The analysis will be based on the original paper \cite{fast-similarity-search}, but with some modifications to resolve some of the issues with the original algorithm. This will also include a pseudo-code implementation of the algorithm since the original paper only describes it through recurrences. This leads to a proof of correctness that differs slightly from the one presented in the paper, and a running time analysis that does indeed show the sub-linear running time as claimed.\\
This theoretical analysis will be backed up by a real-life implementation that can be benchmarked to help show this sub-linear running time in practice. Finally, reflections on the results and methods will be made to support the conclusions.

129 changes: 129 additions & 0 deletions ku-forside.sty
@@ -0,0 +1,129 @@
% KU-forside package. Front pages for assignments written at the University of Copenhagen
% Written by Christian Aastrup. The design of the front pages follows the one at http://www.ku.dk/designmanual
%
\ProvidesPackage{ku-forside}[2007/07/07 v1.0 Frontpages with University of Cph. logos]
%
% Defines the defaults for SPROG (language), AFDELING (department) and FARVE (colour)
\def\SPROG{da}\def\FARVE{farve}\def\AFDELING{nat}\def\FORSIDE{titelside}
%
% Turns the SPROG (language) options into 'if's
\newif\if@en \newif\if@da
%
% Turns the AFDELING (department) options into 'if's
\newif\if@ku \newif\if@farma \newif\if@hum
\newif\if@jur \newif\if@life \newif\if@nat
\newif\if@samf \newif\if@sund \newif\if@teo
%
% Turns the FARVE (colour) options into 'if's
\newif\if@farve \newif\if@sh
%
% Turns the FORSIDE (front page) options into 'if's
\newif\if@titelside \newif\if@stor \newif\if@lille
%
\newif\if@babel \DeclareOption{babel}{\@babeltrue}
%
% Declares the languages as 'options' in the package call
\DeclareOption{en}{\@entrue} \DeclareOption{da}{\@datrue}
%
% Declares the departments as 'options' in the package call
\DeclareOption{ku}{\@kutrue} \DeclareOption{farma}{\@farmatrue} \DeclareOption{hum}{\@humtrue}
\DeclareOption{jur}{\@jurtrue} \DeclareOption{life}{\@lifetrue} \DeclareOption{nat}{\@nattrue}
\DeclareOption{samf}{\@samftrue} \DeclareOption{sund}{\@sundtrue} \DeclareOption{teo}{\@teotrue}
%
% Declares the colours as 'options' in the package call
\DeclareOption{farve}{\@farvetrue} \DeclareOption{sh}{\@shtrue}
%
% Declares the front page choices as 'options' in the package call
\DeclareOption{lille}{\@lilletrue} \DeclareOption{stor}{\@stortrue}
\DeclareOption{titelside}{\@titelsidetrue}
%
\ProcessOptions\relax
%
% Defines what happens when the language options are TRUE
\if@en \def\SPROG{en} \fi \if@da \def\SPROG{da} \fi
%
% Defines what happens when the department options are TRUE
\if@ku \def\AFDELING{ku} \fi \if@farma \def\AFDELING{farma} \fi \if@hum \def\AFDELING{hum} \fi
\if@jur \def\AFDELING{jur} \fi \if@life \def\AFDELING{life} \fi \if@nat \def\AFDELING{nat} \fi
\if@samf \def\AFDELING{samf} \fi \if@sund \def\AFDELING{sund} \fi \if@teo \def\AFDELING{teo} \fi
%
% Defines what happens when the colour options are TRUE
\if@sh \def\FARVE{sh} \fi \if@farve \def\FARVE{farve} \fi
%
% Defines what happens when the different front page options are TRUE
\if@stor \def\FORSIDE{stor} \fi \if@lille \def\FORSIDE{lille} \fi
\if@titelside \def\FORSIDE{titelside} \fi
%
\def\OPGAVE{$\backslash$opgave$\{\ldots\}$}
\def\FORFATTER{$\backslash$forfatter$\{\ldots\}$ or $\backslash$author$\{\ldots\}$ }
\def\TITEL{$\backslash$titel$\{\ldots\}$ or $\backslash$title$\{\ldots\}$}
\def\UNDERTITEL{$\backslash$undertitel$\{\ldots\}$}
\def\VEJLEDER{$\backslash$vejleder$\{\ldots\}$}
\def\AFLEVERINGSDATO{$\backslash$dato$\{\ldots\}$ or $\backslash$date$\{\ldots\}$}
%
\renewcommand{\author}[1]{\def\FORFATTER{#1}}
\renewcommand{\title}[1]{\def\TITEL{#1}}
\renewcommand{\date}[1]{\def\AFLEVERINGSDATO{#1}}
%
\newcommand{\opgave}[1]{\def\OPGAVE{#1}}
\newcommand{\forfatter}[1]{\def\FORFATTER{#1}}
\newcommand{\titel}[1]{\def\TITEL{#1}}
\newcommand{\undertitel}[1]{\def\UNDERTITEL{#1}}
\newcommand{\vejleder}[1]{\def\VEJLEDER{#1}}
\newcommand{\dato}[1]{\def\AFLEVERINGSDATO{#1}}
%
% Packages needed to set up the front page %
%
%\RequirePackage[OT2,OT4]{fontenc}
\RequirePackage{eso-pic,graphicx,fix-cm,ae,aecompl,ifthen} %
\RequirePackage[usenames]{color} %
%% BABEL option: checks the declared language and loads the babel package accordingly %%
\if@babel
\ifthenelse{\equal{\SPROG}{en}}{\RequirePackage[danish,english]{babel}}{} % English hyphenation, heading and chapter structure %
\ifthenelse{\equal{\SPROG}{da}}{\RequirePackage[english,danish]{babel}}{} % Danish hyphenation, heading and chapter structure %
% Note that both languages are loaded. The order matters, since the last language is the one the document is generally typeset in. %
% The other language's hyphenation etc. can be accessed by writing \selectlanguage{language} in the body text %
\fi
%
%% THE FRONT PAGE IS DEFINED: %
%
% Option: titelside
\ifthenelse{\equal{\FORSIDE}{titelside}}{
\def\tyk{\fontfamily{phv}\fontseries{bx}\selectfont} %Bold extended %
\def\tynd{\fontfamily{phv}\fontseries{sb}\selectfont} % Semi-bold %
\def\maketitle{\thispagestyle{empty} %
\AddToShipoutPicture*{\put(0,0){\includegraphics*[viewport=0 0 700 600]{\AFDELING-\FARVE}}}% %
\AddToShipoutPicture*{\put(0,602){\includegraphics*[viewport=0 600 700 1600]{\AFDELING-\FARVE}}}% %
\AddToShipoutPicture*{\put(0,0){\includegraphics*{\AFDELING-\SPROG}}}% %
\AddToShipoutPicture*{\put(50,583.5){\fontsize{20 pt}{22 pt} \tyk \OPGAVE }} % %
\AddToShipoutPicture*{\put(50,555.3){\fontsize{14 pt}{16 pt} \tynd \FORFATTER }} % %
\AddToShipoutPicture*{\put(50,499){\fontsize{22 pt}{24 pt} \tynd \TITEL }} % %
\AddToShipoutPicture*{\put(50,480.5){\fontsize{14 pt}{16 pt} \tynd \UNDERTITEL }} % %
\AddToShipoutPicture*{\put(50,92){\fontsize{11 pt}{12 pt} \tynd \VEJLEDER }} % %
\AddToShipoutPicture*{\put(50,66.7){\fontsize{11 pt}{12 pt} \tynd \AFLEVERINGSDATO }} % %
\phantom{Usynlig, men nødvendig} %
\newpage \noindent}}{} %
% Option: lille
\ifthenelse{\equal{\FORSIDE}{lille}}{
\def\maketitle{\thispagestyle{plain}
\AddToShipoutPicture*{\put(035,613){\includegraphics*[viewport=0 600 700 1600, scale=0.88]{\AFDELING-\FARVE}}}% The image is used
\AddToShipoutPicture*{\put(-010,613){\includegraphics*[viewport=0 600 420 1600, scale=0.88]{\AFDELING-\FARVE}}}% three times to
\AddToShipoutPicture*{\put(400,613){\includegraphics*[viewport=0 600 420 1600, scale=0.88]{\AFDELING-\FARVE}}}% make the line long.
\AddToShipoutPicture*{\put(79,755){\large{\textbf{\TITEL}}}}%
\AddToShipoutPicture*{\put(79,733){\UNDERTITEL}}%
\AddToShipoutPicture*{\put(79,715){\tiny{\emph{\FORFATTER}}}}%
\AddToShipoutPicture*{\put(79,702){\tiny{\AFLEVERINGSDATO}}}%
\phantom{Usynlig, men nødvendig}
\vspace*{3.2cm} %
\noindent}}{} %
% Option: stor
\ifthenelse{\equal{\FORSIDE}{stor}}{
\def\maketitle{\thispagestyle{plain}
\AddToShipoutPicture*{\put(0,602){\includegraphics*[viewport=156 649 700 1600, scale=1.4]{\AFDELING-\FARVE}}} % %
\AddToShipoutPicture*{\put(79,755){\LARGE{\textbf{\TITEL}}}}%
\AddToShipoutPicture*{\put(79,723){\Large{\UNDERTITEL}}}%
\AddToShipoutPicture*{\put(79,695){\normalsize{\emph{\FORFATTER}}}}%
\AddToShipoutPicture*{\put(79,670){\footnotesize{\AFLEVERINGSDATO}}}%
\phantom{Usynlig, men nødvendig}
\vspace*{5cm} %
\noindent}}{}
Binary file added ku-forside.zip
Binary file not shown.
Binary file added ku-forside/Startark.pdf
Binary file not shown.