Implement exact string searching using FM index and LZ index #57

prolik123 · 2024-03-15T20:21:10Z

Added FM Index and unit tests

prolik123 · 2024-03-15T20:22:02Z

Draft!

Make refactor + write unit tests

string_indexing/__pycache__/suffix_array.cpython-310.pyc

string_indexing/__pycache__/suffix_array.cpython-311.pyc

krzysztof-turowski

Thank you, please introduce appropriate changes and let me know if everything is clear.

FYI: make check runs a linter over all .py files, which you might find useful. In general the convention is as follows: function_name, _private_function_name (i.e. a helper not used outside the package), variable_name, and ClassName (but try to eliminate methods; use classes only as C-like struct data holders).

krzysztof-turowski · 2024-03-19T16:31:27Z

string_indexing/fm_index.py

+      if ran[0] == -1:
+         return 0
+      return max(ran[1] - ran[0] + 1, 0)


return max(ran[1] - ran[0] + 1, 0) if ran[0] > -1 else 0
or even better
return max(high - low + 1, 0) if low > -1 else 0 with tuple unpacking above
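As a minimal sketch of the tuple-unpacking variant suggested here: the helper name `count_from_range` and its single-argument signature are hypothetical, chosen only so the snippet runs on its own; the PR's `count` takes the index and pattern instead.

```python
def count_from_range(occurrence_range):
    """Return how many matches an inclusive (low, high) range encodes.

    A range whose lower bound is -1 is treated as "pattern not found".
    """
    # Unpack once instead of indexing ran[0] and ran[1] repeatedly.
    low, high = occurrence_range
    return max(high - low + 1, 0) if low > -1 else 0


print(count_from_range((3, 5)))    # 3
print(count_from_range((-1, -1)))  # 0
```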

krzysztof-turowski · 2024-03-20T07:58:32Z

string_indexing/fm_index.py

+
+      #prepare char mapping for F
+      self.mapperOfChar = { self.F[2] : 0}
+      self.begginings = [2]


krzysztof-turowski · 2024-03-20T07:58:57Z

string_indexing/fm_index.py

+      self.mapperOfChar = { self.F[2] : 0}
+      self.begginings = [2]
+      last = self.F[2]
+      lenOfBeginings = 1


Please remove and replace with len(beginnings), it's still $O(1)$

krzysztof-turowski · 2024-03-20T08:00:25Z

string_indexing/fm_index.py

+      currChar = p[size-1]
+      if currChar not in self.mapperOfChar:
+         return [-1, -1]


if p[-1] not in self.mapperOfChar: return -1, -1

krzysztof-turowski · 2024-03-20T08:01:01Z

string_indexing/fm_index.py

+
+    # O(|p|)
+   def count(self, p, size):
+      ran = self.getRangeOfOccurence(p, size)


Unpack on return e.g.
low, high = self.getRangeOfOccurence(p, size)

krzysztof-turowski · 2024-03-20T08:06:37Z

string_indexing/fm_index.py

+               nextSample = nextSample + self.sampleSize
+
+    # should be private
+   def getRangeOfOccurence(self, p, size):


PS. Please check spelling of "occurrences", "beginnings" etc., because you have several variants of both ;)

krzysztof-turowski · 2024-03-20T08:07:45Z

string_indexing/fm_index.py

@@ -0,0 +1,118 @@
+
+class FMIndex:


In general I'd avoid creating classes in favor of a procedural approach (i.e. a bunch of functions); please check e.g. suffix_array.py for details. Of course you can build a structure that is then used by functions like contains (see suffix_array.contains).

krzysztof-turowski · 2024-03-20T08:11:08Z

string_indexing/fm_index.py

+   # O(|p| + k) where k is the number or occurances of p in text
+   def get_all_occurrance(self, p, l):
+      arr = self.getRangeOfOccurence(p, l)
+      if arr[0] == -1:
+         return []
+      return [self.SA[i-1] for i in range(arr[0], arr[1] + 1)]
+
+    # O(|p|)
+   def get_any_occurrance(self, p, l):
+      arr = self.getRangeOfOccurence(p, l)
+      if arr[0] == -1:
+         return -1
+      return self.SA[arr[0]-1]
+


Please replace with a function "contains" like in suffix_tree and suffix_array packages, which returns values sequentially using yield
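A rough sketch of what a generator-based `contains` could look like. This is not the PR's code: it assumes 0-indexed strings, a `$` terminator that is smaller than every text character, half-open ranges, a naively built suffix array, and an unsampled occurrence table. The `build` helper and the `FMIndex` field names are hypothetical.

```python
from collections import namedtuple

# Hypothetical data holder; field names are illustrative only.
FMIndex = namedtuple('FMIndex', ['SA', 'BWT', 'C', 'occ'])


def build(text):
    """Build a toy FM index for text + '$' (naive construction)."""
    t = text + '$'
    SA = sorted(range(len(t)), key=lambda i: t[i:])
    BWT = ''.join(t[i - 1] for i in SA)  # t[-1] == '$' covers i == 0
    # C[c] = number of characters in t strictly smaller than c
    C, total = {}, 0
    for c in sorted(set(t)):
        C[c] = total
        total += t.count(c)
    # occ[c][i] = number of occurrences of c in BWT[:i]
    occ = {c: [0] * (len(t) + 1) for c in C}
    for i, b in enumerate(BWT):
        for c in C:
            occ[c][i + 1] = occ[c][i] + (b == c)
    return FMIndex(SA, BWT, C, occ)


def _get_range_of_occurrences(FM, p):
    """Backward search; returns a half-open (low, high) or (-1, -1)."""
    low, high = 0, len(FM.BWT)
    for c in reversed(p):
        if c not in FM.C:
            return -1, -1
        low = FM.C[c] + FM.occ[c][low]
        high = FM.C[c] + FM.occ[c][high]
        if low >= high:
            return -1, -1
    return low, high


def contains(FM, p):
    """Yield starting positions of p in the text, one at a time."""
    low, high = _get_range_of_occurrences(FM, p)
    if low == -1:
        return
    for i in range(low, high):
        yield FM.SA[i]


FM = build('banana')
print(sorted(contains(FM, 'ana')))  # [1, 3]
```

With a generator, a caller who needs any single occurrence can use `next(contains(FM, p), -1)`, while `list(contains(FM, p))` collects all of them, so one function can stand in for both `get_any_occurrance` and `get_all_occurrance`.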

krzysztof-turowski · 2024-03-20T08:11:38Z

test/test_fm_index.py

+from string_indexing import suffix_array
+from string_indexing import fm_index


from string_indexing import suffix_array, fm_index

krzysztof-turowski · 2024-03-20T08:14:46Z

test/test_fm_index.py

+  run_large = unittest.skipUnless(
+      os.environ.get('LARGE', False), 'Skip test in small runs')
+
+  def get_all_occurences_of_pattern_naive(self, text, n, pattern, l):


I think you can move retrieving all occurrences to `test_exact_string_matching.py` (see the implementations of suffix tree and suffix array there).

Is there any other operation that you would like to test aside from get_all_occurrance and get_any_occurrance (i.e. contains after renaming)? If not, then maybe this file is redundant; or, if you want to test query, please restrict this file to doing only that.

krzysztof-turowski

OK, I think FM index is ready to go (modulo some nitpicks), please proceed with other structs.

krzysztof-turowski · 2024-03-24T19:15:28Z

string_indexing/fm_index.py


+   for i in range(size-1, 0, -1):


Wouldn't for c in p[::-1]: be simpler?

krzysztof-turowski · 2024-03-24T19:16:07Z

test/test_exact_string_matching.py

@@ -45,6 +52,7 @@ def lcp_lr_contains(t, w, n, m):
            suffix_array.prefix_doubling(t, n), t, w, n, m),
    ],
    [ 'lcp-lr array', lcp_lr_contains ],
+    [ 'Fm index', fm_index_contains]


Either FM index or fm index

krzysztof-turowski · 2024-03-24T19:16:35Z

test/test_exact_string_matching.py

+def fm_index_contains(t, w, n, m):
+  SA = suffix_array.skew(t, n)
+  BWT = burrows_wheeler.transform_from_suffix_array(SA, t, n)
+  fm = fm_index.from_suffix_array_and_bwt(SA, BWT, t, n)


Staying true to the convention above, let this be FM, OK?

krzysztof-turowski · 2024-03-24T19:22:16Z

string_indexing/fm_index.py

-      return [l, r]
+# O(|p|)
+def count(fm, p, size):
+   (low, high) = _get_range_of_occurrences(fm, p, size)


Please replace with low, high = _get_range_of_occurrences(...)

krzysztof-turowski · 2024-03-24T19:23:03Z

string_indexing/fm_index.py

+   l = fm.beginnings[map_idx]
+   r = fm.n + 1


l, r = fm.beginnings[map_idx], fm.n + 1
or even better

l = fm.beginnings[map_idx]
r = fm.beginnings[map_idx + 1] - 1 if map_idx != fm.len_of_alphabet - 1 else fm.n + 1
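To illustrate the idea behind this suggestion, here is a small self-contained sketch. It assumes a 0-indexed sorted first column `F` without the PR's `#$` prefix; `build_beginnings`, `initial_range` and the `last_index` parameter are hypothetical names introduced for the example. It also uses `len(beginnings)` instead of a separate counter, as requested earlier in the review.

```python
def build_beginnings(F):
    """Map each distinct character of sorted F to the index of its first row."""
    mapper_of_chars, beginnings = {}, []
    for i, c in enumerate(F):
        if c not in mapper_of_chars:
            # len() on a list is O(1), so no separate counter is needed.
            mapper_of_chars[c] = len(beginnings)
            beginnings.append(i)
    return mapper_of_chars, beginnings


def initial_range(beginnings, map_idx, last_index):
    """Inclusive range of rows in F that start with the map_idx-th character."""
    l = beginnings[map_idx]
    # The block ends just before the next character's block, or at the
    # last row when this is the largest character of the alphabet.
    r = beginnings[map_idx + 1] - 1 if map_idx != len(beginnings) - 1 else last_index
    return l, r


F = '$aaabnn'  # sorted characters of 'banana$'
mapper_of_chars, beginnings = build_beginnings(F)
print(initial_range(beginnings, mapper_of_chars['a'], len(F) - 1))  # (1, 3)
```

Starting from the exact block of the last pattern character, rather than from the block start up to the end of the array, means the first backward-search step already works on the tightest possible range.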

krzysztof-turowski · 2024-03-25T14:17:53Z

string_indexing/fm_index.py

+
+# O(|p| + k) where k is the number or occurances of p in text
+def contains(fm, p, l):
+  (low, high) = _get_range_of_occurrences(fm, p, l)


low, high = ...

krzysztof-turowski · 2024-03-25T14:18:20Z

string_indexing/fm_index.py

+  return max(high - low + 1, 0) if low > -1 else 0
+
+# O(|p| + k) where k is the number or occurances of p in text
+def contains(fm, p, l):


Since structures are written SA, BWT etc., probably we should use FM not fm.

krzysztof-turowski · 2024-03-25T14:19:26Z

string_indexing/fm_index.py

+      current_value = 0
+      next_sample = self.sample_size


current_value, next_sample = 0, self.sample_size

krzysztof-turowski · 2024-03-25T14:20:13Z

string_indexing/fm_index.py

+    self.F = '#$' + ''.join(text[SA[i]] for i in range(1, n + 1))
+    self.n = n
+    self.SA = SA
+    self.sample_size = 8 # const for sampling


Please make it a class-level constant:

class FMIndex:
  SAMPLE_SIZE = 8
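A possible illustration of a class-level sampling constant. What exactly the PR samples is not visible in this thread, so the sketch below assumes the constant controls how often running character counts are stored; the `SampledRank` holder and the `rank` function are hypothetical names for the example.

```python
class SampledRank:
    """Data holder: counts of character c sampled every SAMPLE_SIZE positions."""
    SAMPLE_SIZE = 8  # class-level constant shared by all instances

    def __init__(self, bwt, c):
        self.bwt, self.c = bwt, c
        # samples[k] = number of occurrences of c in bwt[:k * SAMPLE_SIZE]
        self.samples = [0]
        for start in range(0, len(bwt), self.SAMPLE_SIZE):
            block = bwt[start:start + self.SAMPLE_SIZE]
            self.samples.append(self.samples[-1] + block.count(c))


def rank(sr, i):
    """Number of occurrences of sr.c in sr.bwt[:i]."""
    k = i // sr.SAMPLE_SIZE
    # Start from the nearest stored sample, then scan fewer than
    # SAMPLE_SIZE characters to cover the remainder.
    return sr.samples[k] + sr.bwt[k * sr.SAMPLE_SIZE:i].count(sr.c)


bwt = 'abracadabra' * 3
sr = SampledRank(bwt, 'a')
print(rank(sr, 11))  # 5
```

Keeping the constant on the class rather than on each instance makes it clear that it is a tuning parameter of the structure, not per-index state.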

krzysztof-turowski · 2024-03-25T14:20:47Z

string_indexing/fm_index.py

+
+  for i in range(size-1, 0, -1):
+    if p[i] not in fm.mapper_of_chars:
+      return (-1, -1)


return -1, -1 here and elsewhere

krzysztof-turowski

Thank you very much for your contribution - well done!

prolik123 added 2 commits March 15, 2024 20:59

Beginings

347c84d

It works

b495f66

prolik123 added 3 commits March 15, 2024 21:38

Add locate function

700fbb9

Fix for locate

51882c6

Added unit tests

72fb475

prolik123 commented Mar 16, 2024

string_indexing/__pycache__/suffix_array.cpython-310.pyc

prolik123 commented Mar 16, 2024

string_indexing/__pycache__/suffix_array.cpython-311.pyc

delete cache files

a85b978

prolik123 marked this pull request as ready for review March 16, 2024 15:03

krzysztof-turowski requested changes Mar 20, 2024

prolik123 added 3 commits March 24, 2024 19:54

CV changes

d92588a

Small naming changes

acb7f7f

whitespace fix

af7a06f

krzysztof-turowski reviewed Mar 25, 2024

prolik123 and others added 14 commits June 7, 2024 17:59

Added wavelet_tree.py

f0a1256

Added lz_index.py. Added test/test_wavelet_tree.py. Tested FmIndex and Wavelet tree. LzIndex still has to be debugged.

Style fixes

11f445d

Make check fix

e05c1d3

LZIndex works + little refactor

2354d3a

Refactor and make check pass

55c7193

optimize wavelete_tree

e7a6db0

Name fix

7f301cb

Add optimal range_search function

356fdf6

revert LARGE test comment

7d91b4c

test fix

07e0a6c

Add Naive RangeSearcher

0fc0941

Add Text

92c15d9

Change text

7791ef7

Move tex files

a5b244e

krzysztof-turowski changed the title ~~Add Fm index~~ Implement exact string searching using FM index and LZ index Jul 8, 2024

Fix style in tests

5a67d93

krzysztof-turowski approved these changes Jul 8, 2024

krzysztof-turowski merged commit 7e12217 into krzysztof-turowski:master Jul 8, 2024

krzysztof-turowski assigned prolik123 Jul 10, 2024

krzysztof-turowski added the documentation and enhancement labels Jul 10, 2024

krzysztof-turowski linked an issue Jul 10, 2024 that may be closed by this pull request

Implement alternative indices for standard text problems #50

Open

4 tasks
