Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize string repetition and string copying #2634

Merged
merged 6 commits into from
Apr 4, 2024

Conversation

advikkabra
Copy link
Collaborator

@advikkabra advikkabra commented Mar 28, 2024

Fixes #2616

  1. The current string repetition algorithm was inefficient, where memset could be used for single char strings, and binary exponentiation could be used for multi char strings. This is faster than copying one byte at a time.
  2. _lfortran_strcpy was calling strlen in every loop iteration, which was slowing down the program significantly.

After this change, a 1e6 repetition as shown in the issue is 0.04s, which is comparable to CPython.

@kmr-srbh
Copy link
Contributor

kmr-srbh commented Mar 28, 2024

Advik, I recognize and appreciate the effort you have put in to find and resolve this issue. I fear, we have increased the time instead.

N = 10 6

string1: str = "a"
string2: str = string1 * 10**6
(base) saurabh-kumar@Awadh:~/Projects/System/lpython$ time ./src/bin/lpython ./examples/example.py
^C
real    1m43.633s
user    1m43.501s
sys     0m0.048s

The operation took around 30s before. I request you to look into it.

@kmr-srbh
Copy link
Contributor

I think you might have mistakenly tested the above using CPython.

@advikkabra
Copy link
Collaborator Author

It works fine on my machine, and after profiling more than 99% of the time used was in the strlen calls in the strcpy function.

advik@advik-Inspiron-7560:~/dev/oss/lpython$ time src/bin/lpython ../test.py
1000000

real    0m0.082s
user    0m0.058s
sys     0m0.020s
advik@advik-Inspiron-7560:~/dev/oss/lpython$ cat ../test.py
string1: str = "a";
string2: str = string1 * 10**6
print(len(string2))

@kmr-srbh
Copy link
Contributor

Great work Advik! It works properly on my machine now too! That's an outright 300x improvement! 👏

(lp) saurabh-kumar@Awadh:~/Projects/System/lpython$ time ./src/bin/lpython ./examples/example.py

real    0m0.110s
user    0m0.071s
sys     0m0.039s

The reason it did not work the previous time was because I was on an older branch. The new changes from LFortran have done an amazing job with increasing the speed.

Comment on lines +2142 to +2144
if (s_len == 1) {
memset(dest_char, *(*s), n);
} else {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why we need to handle the case for s_len == 1 separately. The else portion seems general, so I think that should be enough to handle the case of s_len == 1.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the else portion fail when s_len == 1?

Copy link
Collaborator Author

@advikkabra advikkabra Mar 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not fail when s_len == 1, but memset is very optimized for this usecase. Setting a large block of data to a single value is what memset excels in, and it is optimized for every architecture. While testing, these are the compared times:

100000000
Binary exponentiation: 0.086963
memset: 0.006341
400000000
Binary exponentiation: 0.211737
memset: 0.018612

This is after compiling with -O3

x[0][i] = y[i];
}
for (; i < x_len; i++) {
x[0][i] = ' ';
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great improvement. Thanks!

Copy link
Collaborator

@Shaikh-Ubaid Shaikh-Ubaid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add a test for this with a large n (so that we are sure that the time taken does not break in future). We can do it in a separate PR.

It looks good to me. Thanks for this. Great work!

@Shaikh-Ubaid
Copy link
Collaborator

It does not fail when s_len == 1, but memset is very optimized for this usecase.

Did you trying checking this if the else works for s_len == 1? It is fine to have a separate case for s_len == 1, but I just want to be sure that the else works perfectly (for example it works for odd numbers like 1).

@Shaikh-Ubaid
Copy link
Collaborator

Let's merge this after we are sure that the else case works for odd numbers like 1.

@advikkabra
Copy link
Collaborator Author

This test is already there in test_str_01.py in integration_tests.

def test_str_repeat():
    a: str
    a = "Xyz"
    assert a*3 == "XyzXyzXyz"
    assert a*2*3 == "XyzXyzXyzXyzXyzXyz"
    assert 3*a*3 == "XyzXyzXyzXyzXyzXyzXyzXyzXyz"
    assert a*-1 == ""

@Shaikh-Ubaid
Copy link
Collaborator

This test is already there in test_str_01.py in integration_tests.

Do we have a test with a large n like (10^6 or so)? If so, then it is good.

          I think we should add a test for this with a large `n` (so that we are sure that the time taken does not break in future). We can do it in a separate PR.

Originally posted by @Shaikh-Ubaid in #2634 (review)

@advikkabra
Copy link
Collaborator Author

How would you suggest I add this as a test? Is there a way to check if something is taking too long?

@Shaikh-Ubaid
Copy link
Collaborator

Shaikh-Ubaid commented Apr 3, 2024

How would you suggest I add this as a test? Is there a way to check if something is taking too long?

I think not yet in LPython (for LFortran we have cpu_time). For now, by adding a test we are just trying to ensure that the test is not taking too long.

Previously (as of the main branch) the time complexity of _lfortran_strcpy() was O(n^2). That means, for n = 10^6 the time taken could be in orders of n^12, which would never practically complete at the CI.

If we add a test for n = 10^6 and the test completes in a finite time (I am assuming it will take < 1s. If it take more than 1s then it is not much helpful and we should reduce n), then it verifies/ensures that the time complexity of _lfortran_strcpy() has improved.

@Shaikh-Ubaid
Copy link
Collaborator

Shaikh-Ubaid commented Apr 3, 2024

For a robust test, we should support something like time.time(). But it should not be part of this PR and can be done later.

Once we have support for time.time(), we can later come back to the above n = 10^6 test case and update it to add a check for the time taken.

@advikkabra
Copy link
Collaborator Author

I have added a test for a 10**6 repetition. This would take >1 minute in the previous code, so it would be easy to catch if it goes wrong. Right now it completes it in 20ms on my machine, so it is comfortable with the new implementation. I think the PR is ready now.

Copy link
Collaborator

@Shaikh-Ubaid Shaikh-Ubaid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect! It looks good to me. Thanks for this. Great work!

@Shaikh-Ubaid Shaikh-Ubaid merged commit 4ceac47 into lcompilers:main Apr 4, 2024
13 checks passed
assem2002 pushed a commit to assem2002/lpython that referenced this pull request Apr 28, 2024
* Faster string repeat algorithm

* Optimize `_lfortran_strcpy`

* Faster string repeat algorithm

* Optimize `_lfortran_strcpy`

* Added stress test for repeat
assem2002 pushed a commit to assem2002/lpython that referenced this pull request Apr 28, 2024
* Faster string repeat algorithm

* Optimize `_lfortran_strcpy`

* Faster string repeat algorithm

* Optimize `_lfortran_strcpy`

* Added stress test for repeat
@advikkabra advikkabra deleted the optimize-repeat branch July 21, 2024 13:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improvement: Assignment string2: str = string1 * N takes a very long time for a large integer N
3 participants