-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement exact string searching using FM index and LZ index #57
Implement exact string searching using FM index and LZ index #57
Conversation
Draft! Make refactor + write unit tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, please introduce appropriate changes and let me know if everything is clear.
FYI: make check
provides you a linter check for all py files, it might be useful to you. In general the convention is as following: function_name
, _private_function_name
(i.e. helper, not used outside the package), variable_name
+ ClassName
(but try to eliminate methods, use it only as C-like structs data holders)
string_indexing/fm_index.py
Outdated
if ran[0] == -1: | ||
return 0 | ||
return max(ran[1] - ran[0] + 1, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return max(ran[1] - ran[0] + 1, 0) if ran[0] > -1 else 0
or even better
return max(high - low + 1, 0) if low > -1 else 0
with tuple unpacking above
string_indexing/fm_index.py
Outdated
|
||
#prepare char mapping for F | ||
self.mapperOfChar = { self.F[2] : 0} | ||
self.begginings = [2] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
beginnings
string_indexing/fm_index.py
Outdated
self.mapperOfChar = { self.F[2] : 0} | ||
self.begginings = [2] | ||
last = self.F[2] | ||
lenOfBeginings = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove and replace with len(beginnings), it's still
string_indexing/fm_index.py
Outdated
currChar = p[size-1] | ||
if currChar not in self.mapperOfChar: | ||
return [-1, -1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if p[-1] not in self.mapperOfChar:
return -1, -1
string_indexing/fm_index.py
Outdated
|
||
# O(|p|) | ||
def count(self, p, size): | ||
ran = self.getRangeOfOccurence(p, size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unpack on return e.g.
low, high = self.getRangeOfOccurence(p, size)
string_indexing/fm_index.py
Outdated
nextSample = nextSample + self.sampleSize | ||
|
||
# should be private | ||
def getRangeOfOccurence(self, p, size): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PS. Please check spelling of "occurrences", "beginnings" etc., because you have several variants of both ;)
string_indexing/fm_index.py
Outdated
@@ -0,0 +1,118 @@ | |||
|
|||
class FMIndex: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general I'd avoid creating classes in favor of procedural (i.e. bunch of functions) approach, please check e.g. suffix_array.py for details. Of course you can build a structure which will be used in methods like contains - see suffix_array.contains
string_indexing/fm_index.py
Outdated
# O(|p| + k) where k is the number or occurances of p in text | ||
def get_all_occurrance(self, p, l): | ||
arr = self.getRangeOfOccurence(p, l) | ||
if arr[0] == -1: | ||
return [] | ||
return [self.SA[i-1] for i in range(arr[0], arr[1] + 1)] | ||
|
||
# O(|p|) | ||
def get_any_occurrance(self, p, l): | ||
arr = self.getRangeOfOccurence(p, l) | ||
if arr[0] == -1: | ||
return -1 | ||
return self.SA[arr[0]-1] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please replace with a function "contains" like in suffix_tree and suffix_array packages, which returns values sequentially using yield
test/test_fm_index.py
Outdated
from string_indexing import suffix_array | ||
from string_indexing import fm_index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from string_indexing import suffix_array, fm_index
test/test_fm_index.py
Outdated
run_large = unittest.skipUnless( | ||
os.environ.get('LARGE', False), 'Skip test in small runs') | ||
|
||
def get_all_occurences_of_pattern_naive(self, text, n, pattern, l): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can move retrieving all occurences to `test_exact_string_matching.py' (see implementations of suffix tree and suffix array there)
Is there any other operation that you would like to test aside from get_all_occurrance
and get_any_occurrance
(i.e. contains
after renaming)? If not, then maybe this file is rendundant - or if you want to test query, please restict this file only to doing so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I think FM index is ready to go (modulo some nitpicks), please proceed with other structs.
string_indexing/fm_index.py
Outdated
|
||
for i in range(size-1, 0, -1): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for c in p[::-1]:
wouldn't be simpler?
test/test_exact_string_matching.py
Outdated
@@ -45,6 +52,7 @@ def lcp_lr_contains(t, w, n, m): | |||
suffix_array.prefix_doubling(t, n), t, w, n, m), | |||
], | |||
[ 'lcp-lr array', lcp_lr_contains ], | |||
[ 'Fm index', fm_index_contains] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either FM index or fm index
test/test_exact_string_matching.py
Outdated
def fm_index_contains(t, w, n, m): | ||
SA = suffix_array.skew(t, n) | ||
BWT = burrows_wheeler.transform_from_suffix_array(SA, t, n) | ||
fm = fm_index.from_suffix_array_and_bwt(SA, BWT, t, n) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Staying true to the convention above let this be FM
, ok?
string_indexing/fm_index.py
Outdated
return [l, r] | ||
# O(|p|) | ||
def count(fm, p, size): | ||
(low, high) = _get_range_of_occurrences(fm, p, size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please replace with low, high = _get_range_of_occurrences(...)
string_indexing/fm_index.py
Outdated
l = fm.beginnings[map_idx] | ||
r = fm.n + 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
l, r = fm.beginnings[map_idx], fm.n + 1
or even better
l = fm.beginnings[map_idx]
r = fm.beginnings[map_idx + 1] - 1 if map_idx != fm.len_of_alphabet - 1 else fm.n + 1
string_indexing/fm_index.py
Outdated
|
||
# O(|p| + k) where k is the number or occurances of p in text | ||
def contains(fm, p, l): | ||
(low, high) = _get_range_of_occurrences(fm, p, l) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
low, high = ...
string_indexing/fm_index.py
Outdated
return max(high - low + 1, 0) if low > -1 else 0 | ||
|
||
# O(|p| + k) where k is the number or occurances of p in text | ||
def contains(fm, p, l): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since structures are written SA, BWT etc., probably we should use FM not fm.
string_indexing/fm_index.py
Outdated
current_value = 0 | ||
next_sample = self.sample_size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
current_value, next_sample = 0, self.sample_size
string_indexing/fm_index.py
Outdated
self.F = '#$' + ''.join(text[SA[i]] for i in range(1, n + 1)) | ||
self.n = n | ||
self.SA = SA | ||
self.sample_size = 8 # const for sampling |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make it a class-level constant:
class FMIndex:
SAMPLE_SIZE = 8
string_indexing/fm_index.py
Outdated
|
||
for i in range(size-1, 0, -1): | ||
if p[i] not in fm.mapper_of_chars: | ||
return (-1, -1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return -1, -1
here and elsewhere
Added lz_index.py Added test/test_wavelet_tree.py Tested FmIndex and Wavelet tree LzIndex has to be debugged yet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for your contribution - well done!
Added FM Index and unit tests