Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement exact string searching using FM index and LZ index #57

Merged
merged 24 commits into from
Jul 8, 2024

Conversation

prolik123
Copy link
Contributor

Added FM Index and unit tests

@prolik123
Copy link
Contributor Author

Draft!

Make refactor + write unit tests

@prolik123 prolik123 marked this pull request as ready for review March 16, 2024 15:03
Copy link
Owner

@krzysztof-turowski krzysztof-turowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, please introduce appropriate changes and let me know if everything is clear.

FYI: make check provides you a linter check for all py files, it might be useful to you. In general the convention is as following: function_name, _private_function_name (i.e. helper, not used outside the package), variable_name + ClassName (but try to eliminate methods, use it only as C-like structs data holders)

Comment on lines 81 to 83
if ran[0] == -1:
return 0
return max(ran[1] - ran[0] + 1, 0)
Copy link
Owner

@krzysztof-turowski krzysztof-turowski Mar 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return max(ran[1] - ran[0] + 1, 0) if ran[0] > -1 else 0
or even better
return max(high - low + 1, 0) if low > -1 else 0 with tuple unpacking above


#prepare char mapping for F
self.mapperOfChar = { self.F[2] : 0}
self.begginings = [2]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

beginnings

self.mapperOfChar = { self.F[2] : 0}
self.begginings = [2]
last = self.F[2]
lenOfBeginings = 1

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove and replace with len(beginnings), it's still $O(1)$

Comment on lines 53 to 55
currChar = p[size-1]
if currChar not in self.mapperOfChar:
return [-1, -1]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if p[-1] not in self.mapperOfChar:
  return -1, -1


# O(|p|)
def count(self, p, size):
ran = self.getRangeOfOccurence(p, size)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unpack on return e.g.
low, high = self.getRangeOfOccurence(p, size)

nextSample = nextSample + self.sampleSize

# should be private
def getRangeOfOccurence(self, p, size):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PS. Please check spelling of "occurrences", "beginnings" etc., because you have several variants of both ;)

@@ -0,0 +1,118 @@

class FMIndex:
Copy link
Owner

@krzysztof-turowski krzysztof-turowski Mar 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I'd avoid creating classes in favor of procedural (i.e. bunch of functions) approach, please check e.g. suffix_array.py for details. Of course you can build a structure which will be used in methods like contains - see suffix_array.contains

Comment on lines 105 to 118
# O(|p| + k) where k is the number or occurances of p in text
def get_all_occurrance(self, p, l):
arr = self.getRangeOfOccurence(p, l)
if arr[0] == -1:
return []
return [self.SA[i-1] for i in range(arr[0], arr[1] + 1)]

# O(|p|)
def get_any_occurrance(self, p, l):
arr = self.getRangeOfOccurence(p, l)
if arr[0] == -1:
return -1
return self.SA[arr[0]-1]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please replace with a function "contains" like in suffix_tree and suffix_array packages, which returns values sequentially using yield

Comment on lines 6 to 7
from string_indexing import suffix_array
from string_indexing import fm_index

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from string_indexing import suffix_array, fm_index

run_large = unittest.skipUnless(
os.environ.get('LARGE', False), 'Skip test in small runs')

def get_all_occurences_of_pattern_naive(self, text, n, pattern, l):
Copy link
Owner

@krzysztof-turowski krzysztof-turowski Mar 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can move retrieving all occurences to `test_exact_string_matching.py' (see implementations of suffix tree and suffix array there)

Is there any other operation that you would like to test aside from get_all_occurrance and get_any_occurrance (i.e. contains after renaming)? If not, then maybe this file is rendundant - or if you want to test query, please restict this file only to doing so.

Copy link
Owner

@krzysztof-turowski krzysztof-turowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think FM index is ready to go (modulo some nitpicks), please proceed with other structs.


for i in range(size-1, 0, -1):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for c in p[::-1]: wouldn't be simpler?

@@ -45,6 +52,7 @@ def lcp_lr_contains(t, w, n, m):
suffix_array.prefix_doubling(t, n), t, w, n, m),
],
[ 'lcp-lr array', lcp_lr_contains ],
[ 'Fm index', fm_index_contains]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either FM index or fm index

def fm_index_contains(t, w, n, m):
SA = suffix_array.skew(t, n)
BWT = burrows_wheeler.transform_from_suffix_array(SA, t, n)
fm = fm_index.from_suffix_array_and_bwt(SA, BWT, t, n)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Staying true to the convention above let this be FM, ok?

return [l, r]
# O(|p|)
def count(fm, p, size):
(low, high) = _get_range_of_occurrences(fm, p, size)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please replace with low, high = _get_range_of_occurrences(...)

Comment on lines 71 to 72
l = fm.beginnings[map_idx]
r = fm.n + 1

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

l, r = fm.beginnings[map_idx], fm.n + 1
or even better

l = fm.beginnings[map_idx]
r = fm.beginnings[map_idx + 1] - 1 if map_idx != fm.len_of_alphabet - 1 else fm.n + 1


# O(|p| + k) where k is the number or occurances of p in text
def contains(fm, p, l):
(low, high) = _get_range_of_occurrences(fm, p, l)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

low, high = ...

return max(high - low + 1, 0) if low > -1 else 0

# O(|p| + k) where k is the number or occurances of p in text
def contains(fm, p, l):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since structures are written SA, BWT etc., probably we should use FM not fm.

Comment on lines 33 to 34
current_value = 0
next_sample = self.sample_size

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

current_value, next_sample = 0, self.sample_size

self.F = '#$' + ''.join(text[SA[i]] for i in range(1, n + 1))
self.n = n
self.SA = SA
self.sample_size = 8 # const for sampling

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make it a class-level constant:

class FMIndex:
  SAMPLE_SIZE = 8


for i in range(size-1, 0, -1):
if p[i] not in fm.mapper_of_chars:
return (-1, -1)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return -1, -1 here and elsewhere

@krzysztof-turowski krzysztof-turowski changed the title Add Fm index Implement exact string searching using FM index and LZ index Jul 8, 2024
Copy link
Owner

@krzysztof-turowski krzysztof-turowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for your contribution - well done!

@krzysztof-turowski krzysztof-turowski merged commit 7e12217 into krzysztof-turowski:master Jul 8, 2024
@krzysztof-turowski krzysztof-turowski added documentation Improvements or additions to documentation enhancement New feature or request labels Jul 10, 2024
@krzysztof-turowski krzysztof-turowski linked an issue Jul 10, 2024 that may be closed by this pull request
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement alternative indices for standard text problems
2 participants