Searching disk images for a text regex causes rg to be SIGKILL'd #2831
Comments
The SIGKILL is likely a symptom of your OS killing your process as a result of it using too much memory. In general, this is expected in some cases. It's documented in the man page:
The key sentence there for your case is likely:
Binary data can have very long runs of NUL bytes. Or just long runs of bytes that don't contain a line terminator. greps are fundamentally line oriented tools. The data you are searching is likely not line oriented. From the first sentence of the README:
ripgrep makes an effort to work on data that isn't line oriented, but it is fundamentally not designed for it.
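To see why a zero-filled image is hard on a line-oriented tool, it can help to count the line terminators in it. The following is a minimal sketch on a scaled-down stand-in (about 2 MiB rather than about 20 GB); the temporary file and the sizes are illustrative assumptions, not the reporter's actual data:

```shell
# Build a small stand-in for a disk image: 1 MiB of NUL bytes,
# one line of text, then another 1 MiB of NUL bytes.
tmp=$(mktemp)
dd if=/dev/zero bs=1024 count=1024 2>/dev/null > "$tmp"
echo 'This is the text I am searching for' >> "$tmp"
dd if=/dev/zero bs=1024 count=1024 2>/dev/null >> "$tmp"

# Keep only newline bytes and count them: this is the number of line terminators.
newlines=$(tr -cd '\n' < "$tmp" | wc -c | tr -d ' ')
# Keep only NUL bytes and count them.
nuls=$(tr -cd '\000' < "$tmp" | wc -c | tr -d ' ')

echo "newlines=$newlines nuls=$nuls"
rm -f "$tmp"
```

With a single newline in the whole file, a line-oriented search has to treat everything before that newline as one line, however large it is.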
I'm not sure how ripgrep aims to compare to ugrep, but currently it's less good at this, and a fix seems possible. I'd suggest that leaving this issue open, even if you don't intend to solve it in the near future, might be valuable.
Without a reproduction to compare what precisely is happening, there's really not much more I can say. There are likely trade offs here that you aren't accounting for.
The issue isn't gone. It's just closed. If something changes, we can reopen it.
It's exactly what you were describing with large strings of null bytes. Here's somewhat of a repro:
This prepped file seems to work fine:
# 1 MiB of random data
dd if=/dev/urandom count=1024 bs=1024 >> ./binary_test_1.bin
# about 10 GB of zeros
dd status=progress if=/dev/zero count=10240000 bs=1024 >> ./binary_test_1.bin
echo "This is the text I'm searching for" >> ./binary_test_1.bin
This prepped file results in very high memory usage:
dd status=progress if=/dev/zero count=10240000 bs=1024 > ./binary_test_2.bin
echo 'This is the text I am searching for' >> ./binary_test_2.bin
dd if=/dev/zero count=10240000 bs=1024 status=progress >> ./binary_test_2.bin
Search the file:
rg -uuu 'searching for' ./binary_test_2.bin
rg -uuu --text --byte-offset --only-matching 'searching for' ./binary_test_2.bin
Use htop to see high memory usage while doing this. The usage grows steadily with time, all the way up to around 18G.
Yeah, exactly as I thought. You're looking at its behavior with respect to memory, but you aren't actually checking that the result is consistent. I roughly copied your process but simplified things a little in places. Here's my
And now my grep commands:
(I added
Oh it must be that ugrep is doing something amazing here! No... Let's check the output file:
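The output listing from this comment is not preserved here. As a rough sketch of the same kind of check, on a scaled-down file and with GNU grep standing in for the tools being compared, one can test whether the printed line has the size the input implies. The temporary files and sizes are illustrative assumptions:

```shell
# Scaled-down stand-in: a ~1 MiB "line" of NUL bytes that ends with the
# needle, followed by another 1 MiB of NUL bytes with no newline.
needle_line='This is the text I am searching for'
tmp=$(mktemp)
out=$(mktemp)
dd if=/dev/zero bs=1024 count=1024 2>/dev/null > "$tmp"
echo "$needle_line" >> "$tmp"
dd if=/dev/zero bs=1024 count=1024 2>/dev/null >> "$tmp"

# -a (--text) makes grep print the matching line even though the file is binary.
grep -a 'searching for' "$tmp" > "$out"

# A complete matching line is the leading NUL run, the text, and one newline.
expected=$((1024 * 1024 + ${#needle_line} + 1))
actual=$(wc -c < "$out" | tr -d ' ')
echo "expected=$expected actual=$actual"
rm -f "$tmp" "$out"
```

If `actual` comes out smaller than `expected`, the tool printed only part of the line, which is the kind of silent truncation discussed here.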
Both GNU grep and ripgrep emit the correct result here. Namely, the input file has two ~10GB sized lines, by construction. The thing we're searching for is at the very end of the first line. Both GNU grep and ripgrep print that line as-is. ugrep... does not. It prints a truncated version of it. You still see the match, but you're missing a huge chunk of the first part of the line.
What ugrep does might be good enough for you, and that's fine and great. But silently truncating lines like this is not something ripgrep is going to do. I would rather ripgrep fail, because then at least you don't think you're getting the correct output.
With all that said, there is a flag that will help in this case, where ripgrep will treat all NUL bytes as line terminators. GNU grep supports it too:
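The exact commands from this comment are not preserved here. The sketch below assumes the flag meant is `--null-data` (`-z` in GNU grep; ripgrep has a flag of the same long name) and shows its effect with GNU grep on a scaled-down file. The temporary file and sizes are illustrative assumptions:

```shell
# Scaled-down stand-in: 1 MiB of NUL bytes, one line of text, 1 MiB more.
tmp=$(mktemp)
dd if=/dev/zero bs=1024 count=1024 2>/dev/null > "$tmp"
echo 'This is the text I am searching for' >> "$tmp"
dd if=/dev/zero bs=1024 count=1024 2>/dev/null >> "$tmp"

# -z (--null-data): NUL, not newline, terminates a record, so each zero
# byte ends a (mostly empty) record and no record is megabytes long.
# -a (--text): search the file as text even though it is binary.
count=$(grep -z -a -c 'searching for' "$tmp" | tr -d '\000\n')

# Size of the printed matching record: the text plus its terminator,
# instead of the whole ~1 MiB run that precedes it.
size=$(grep -z -a 'searching for' "$tmp" | wc -c | tr -d ' ')

echo "matching records=$count, bytes printed=$size"
rm -f "$tmp"
```

The trade-off is that the output records are NUL-delimited as well, so anything consuming the output needs to expect that.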
There is a perf bug in ripgrep here (filed in #2832), but it does avoid the exorbitant memory usage. The search is just slow. What about ugrep? Its README says:
So the GNU grep command above should work if I swap out grep for ugrep, right?
🤔 Anyway, it looks like there's a perf bug to fix in ripgrep when |
Notice that I mention the |
A potential optimization for sure, but not one that I plan to pursue. It requires the regex engine to support streaming, and this comes with a whole host of other trade offs. It's a big effort. ripgrep's regex engine supports this to a degree, but wiring everything up is non-trivial. Stream searching is abstraction busting. The lower hanging fruit here is to fix the perf bug in |
What version of ripgrep are you using?
ripgrep 14.1.0
features:-simd-accel,-pcre2
simd(compile):+SSE2,-SSSE3,-AVX2
simd(runtime):+SSE2,+SSSE3,-AVX2
PCRE2 is not available in this build of ripgrep.
How did you install ripgrep?
Cargo
What operating system are you using ripgrep on?
Linux Mint 21.3
Describe your bug.
Running the following example on several TB of disk images fails.
What are the steps to reproduce the behavior?
What is the actual behavior?
What is the expected behavior?
It should work.