-
Notifications
You must be signed in to change notification settings - Fork 197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrite regexes where common prefix can be pulled out from alternation branches #464
base: master
Are you sure you want to change the base?
Conversation
d0d8da3
to
af27b24
Compare
I've noticed one thing here... Is this intentional? Should I add the new suite 27 to RunTest.bat? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't fully look at the code (or even pulled it), as the difference on style/design and white space changes keeps distracting me though, but if this code hasn't been tested yet in Windows it might be worth marking it as "draft".
Notice there are some warnings triggered by the CI, and there is at least one place where -Wdeclaration-after-statement
should trigger.
doc/pcre2_compile.3
Outdated
@@ -65,6 +65,7 @@ The primary option bits are: | |||
theses (named ones available) | |||
PCRE2_NO_AUTO_POSSESS Disable auto-possessification | |||
PCRE2_NO_DOTSTAR_ANCHOR Disable automatic anchoring for .* | |||
PCRE2_NO_PATTERN_REWRITE Disable pattern rewriting optimizations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the future, there might be more rewrite optimizations. I would name this specifically for #411 optimization only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Disagree (at least in principle). As described in the discussion for that feature request this is indeed adding a completely new mode of operation for PCRE which was NEVER an optimizing compiler and where the normal use cases expected some other way to optimize the pattern (the same applies to JIT BTW which mostly translates the interpreter opcodes into native CPU code)
I "agree" that long term other "rewrites" might be added and would be a good idea to identify them here, but the way to do so would be to add flags for each case as extended options IMHO. Indeed I would argue that this option SHOULDN'T be using one of the few options bits left and should be instead enabled through an extended option itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would use an O0-O3 system similar to gcc. The O1 is the default, and the behaviour is same as now. Then we could introduce optimizations, which costs more time, but improves performace more. We could also disable optimizations, such as auto-possession with O0. Disabling first character optimizations would be deprecated as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My view when Alex and I discussed this was that NO_PATTERN_REWRITE would disable all rewriting optimizations, present and future. If there is a need to disable individual ones, then additional extended options can be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I wasn't clear enough and indeed might had confused the discussion further by mentioning the extended options, which I was proposing should be used to ENABLE each optimization, and therefore the bit in the compile options that this thread is attached to wouldn't be needed and could be retired (with the advantage that we are back at having two "free" options for the future).
Agree that the (*NO_PATTERN_REWRITE)
pattern option setting should disable all optimizations.
Also agree with the OP that it makes sense to enable this optimization with an option that should be more specific, and indeed, using extended options for them would make the design more future proof.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought the rewriting happens after the parsing is successful, so no error (besides out of memory) can happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought the rewriting happens after the parsing is successful, so no error (besides out of memory) can happen.
You are right that rewriting happens after a successful parse... but there are just a few types of errors which are not detected during parsing, but rather when processing the parsed_pattern
vector. For all such cases, offsets into the original pattern string are stored in the parsed_pattern
vector, and the pattern rewriter makes sure not to destroy these.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The erroroffset issue was likely a red herring, but it is offtopic for this conversation which I think needs a resolution.
It would seem that @PhilipHazel likes the current one as it is sufficient and doesn't make a difference when there is only one rewriting optimization anyway and might want this discussion punted until later.
Note that the option flag mentioned here is to disable all rewriting (hence the subsystem and not only this specific "optimization").
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whatever we do, it should happen before the next release. There are only a few compile flags left, and this flag feel out of scope for me. What about making this always enabled in this patch, and make it conditional when we decide how to do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely, this issue needs to be resolved. But even just for testing purposes, it's vital that we can disable the optimization. I think making it "always enabled" is not an option.
My next priority on this PR is to investigate why the 'code memory' size has increased for some patterns. After that will be to eliminate all compiler warnings. Other issues to be worked on include: enabling tests on Windows if appropriate, coming to a resolution on whether PCRE2 will move to the C99 standard or not, questions of brace style, and performance benchmarking. (Philip has already done a bit of benchmarking, and from his limited results, it appeared that this optimization is both 1) worthwhile and 2) has relatively small impact on compile time. But more thorough and extensive benchmarking is needed. It's possible that the benchmarking might cause me to do more performance work to reduce the impact on compile time.)
As all those matters fall into place one by one, at some point, the issue of how to disable the optimization must be worked out. I agree with Zoltan that the number of bits left in the compile options is truly small. But otherwise I haven't thought hard enough to weigh in on the issue. I just followed what has been done for previous optimizations.
I consider it extremely important that Zoltan, Carlo, and other key contributors who have helped to make PCRE2 what it is now are all happy with whatever is decided.
Similar to other analysis, I would move this out of the pcre_compile. As a first step, I would introduce a |
Are we ready to merge this yet? |
Absolutely not. 😄 There has been some great feedback from @carenas and others which I need to implement. |
I have marked this PR as draft. |
Could you share your results about the optimization? Compile time overhead vs speed gain. Interpreter and jit is also interesting. |
I realized we are not just adding a new function here, we are also adding an entirely new subsystem. So we should discuss how similar optimizations should be handled in the future, because changing the api after a release is difficult. My api proposal: Most people don't know the effect of various optimizations, so they can simply set The It is also a good question how patterns could override the optimization level. Setting it to 0 may hurt system performance, and not necessary desired in all cases. Shall we add an option to |
bd4423e
to
12147f7
Compare
Although not all issues raised by reviewers have been resolved, I am pushing an update so any comments can refer to the latest version of the PR. |
3fcbe8e
to
fa645ae
Compare
Tomorrow I hope to look into the issue @carenas mentioned whereby the 'code size memory' has increased for some patterns. Will post my findings here. |
Two comments: (1) I was about to comment about erroroffset not being a problem, but that seems to have been explained by others. (2) How about adding a new 32-bit field to the compile context called, say, OPTIMIZE_SWITCHES, which contains a bit for each optimization that can be enabled/disabled. A function could then set them individually, or "levels" such as 0, 1, 2, etc could set a selection of them. Existing NO_xxx options would have to be kept, of course, for backward compatibility, but we could move the new NO_REWRITE into the new field. |
👏🏻 Nice. Very nice. |
Carlo raised the question of why the memory used for regex code seems to have increased in some test cases. It's because of rewriting alternations to character classes, like: PCRE2 has more than one way to represent character classes in its bytecode, but for cases like this, it's one unit for For char classes which only have a small number of character values (say up to 8), I wouldn't be surprised if it would be faster and make the regex bytecode smaller to represent the character values as literal bytes. Although it would require some benchmarking to see if this is actually faster or not, I would think of possibly using SIMD to test for 8 different byte values at once... or as an alternative, test all 8 values and OR the results together before doing a single branch (rather than branching 8 times). |
For x86, perhaps something like: __m64 class = _mm_set_pi8(char1, char2, char3, char4, char5, char6, char7, char8);
__m64 test = _mm_set1_pi8(subject_char);
if (_mm_movemask_pi8(_mm_cmpeq_pi8(class, test)) {
/* subject_char matches char class */
} In JIT'd code, hopefully it might be possible to load the MM register just once and keep the character class values there for the duration of the match. |
JIT has many optimizations for classes. E.g. |
Most optimizations are a simple boolean, so I would prefer a boolean array. The other alternative is multiple call of the API function, where each call sets one optimization. We can also add more api functions later if needed. We have a context, so everything is backward compatible. |
A function such as pcre2_set_optimization() should have as its arguments (1) a pointer to the compile context, (2) the index into an array of 32-bit words (so that we can extend ad infinitum), (3) two 32-bit words, the first being bits to unset followed by the bits to set. Alternatively, do you think, if we were to make it a 64-bit field, we would ever run out? That makes for an easier interface. I happened to be looking at some historical documents today, and found this quote from a paper published in 1963, which I cannot resist passing on: "In general it is true to say that the facilities most popular with programmers are those which are easy to use." |
Dear Philip, I'm just trying to figure out how this interface will work. Why is there both an index into an array of 32-bit words, and also two 32-bit words with bits to set/clear? When you say there should be an index into an array of 32-bit words, how will this array work? What will the values in the array be? To make this discussion a bit more concrete, I may open a PR today with a proposed interface. |
frankly, I think that a 63-bit field must be enough (even maybe stick with With that the inteface will be instead:
|
Some optimizations may take more time, and less useful for simple patterns, and disabling it is ok. I would prefer a O0-O3 system for that, even if we don't use all levels at the moment. Enabling optimizations individually is more of a private system, because users unlikely know the details of them. Hence a bitset (uint32_t) array could be enough in long term. |
Because since you wanted an infinitely large bitset vector, then the mask to enable/disable has to know at which offset on that vector to apply the passed values.
because the API is called I agree with Philip though that doing |
Thanks to all for the suggestions; it is helpful. I've opened a (draft) PR adding an API which I believe is better than the kind suggestions which the reviewers have shared. It currently has Zoltan's -O0 and -O3. |
I have rebased this PR on the latest For all (or basically all) of the current compile option flags, it is possible to set them using syntax like If more optimization flags are added in the future, I doubt if we should keep adding more and more of such options syntax. Personally, it doesn't seem necessary that users should be able to set any possible optimization flag using such syntax. So I'm thinking there may be no need to support an inline |
No; the in pattern VERB are there to allow disabling the option for final users of the library who might not have access to the C API to do so, and even for the ones that do, it might need to be disabled for specific patterns because it is not a good option (compilation time can be significantly increased for an already optimized pattern) or has bugs that are triggered by it. |
The original motivation for introducing (*NO_JIT) etc was to allow users who have no access to the C code to change these options. In other words, users of some application that allows them to provide patterns might want or need to set such things. In particular (*UTF) and (*UCP) might be necessary - though these can be locked out by the application. I can see that there those that relate to optimizations are somewhat different, but it's not really much code in pcre2_compile.c. We cannot remove the old ones without provoking complaints about backwards compatibility, and if we don't add NO_PATTERN_REWRITE somebody is sure to grumble. So I would add it, for an easy life, but what do others think? |
One important thing to clarify here too, is WHAT is being disabled, and if the name might need to be changed as a result. There is the "rewrite subsystem" which is an extra compilation step that could be potentially used for more than optimizations, and there is the "rewrite optimization" itself that uses that subsystem mainly to optimize alternations. |
I would say that for now, it's disabling all rewrites. If there is need to add more fine-grained options, that can be done later. |
Looks like I need to understand Zoltan's new code for handling character classes. The code in this PR for rewriting char classes worked on the previous |
Am I understand that rewriting happens using META code? My new code is running in META -> BYTE CODE translation phase, and should not interact with rewriting, since it happens later. Also, I have introduced a |
Indeed, it does. So as you said, it seems that the new code should not have an effect here. But I'll investigate just to make sure.
No objection, though it means that it will not be possible to make the entry point |
Not really, and I think this point is worth clarifying, because unlike most codebases we do have a lot of static functions in C files that are included into other C files. All that Zoltan is suggesting is to move your code to a different file, if you MUST keep your entrypoint static (and I can't see why, since it is not an exported symbol anyway, it is called only once and is probably too big to inline) then you could by following our existing style. |
Ooh, this is interesting. Thanks for the tip. |
ff9d7ef
to
43e4112
Compare
…n branches Thanks to Michael Voříšek for suggesting this optimization. A new "rewrite" pass has been added to the regex compilation process. For now, the rewrite pass only optimized one type of regex: those where every branch of an alternation construct has a common prefix. In such cases, we rewrite the regex like so (for example): (abc|abd|abe) ⇒ (ab(?:c|d|e)) An extra non-capturing group is not introduced if the alternation is within a non-capturing group (which is not quantified using ?, *, or a similar suffix). In that case we simply do something like: (?:abc|abd|abe) ⇒ ab(?:c|d|e) In some edge cases, it is possible that rewriting a group with common alternation prefix might open up the opportunity to pull out more common prefixes. For example: (a(b|c)d|(ab|ac)e) In that case, if the group '(ab|ac)' was rewritten to pull out the common prefix, it would then become possible to pull out a common prefix from the top-level group. However, we do not take advantage of that opportunity. Further, we do not perform the rewrite in cases where the prefixes are semantically equivalent, but parse to a different parsed_pattern sequence. Groups which the regex engine might need to backtrack into are never pulled out, since this could change the order in which the regex engine considers possible ways of matching the pattern against the subject string, and could thus change the returned match. For example, this pattern will not be rewritten: ((?:a|b)c|(?:a|b)d) Also, callouts are never extracted even if they form a common prefix to an alternation. Some backtracking control verbs, like (*SKIP) and (*COMMIT), are never extracted either. Further, we never rewrite alternation which appears directly within a lookbehind assertion, since this would tend to convert (fast) fixed-length lookbehind to (slow) variable-length lookbehind. A different type of rewrite is performed if an alternation construct matches only single, literal characters: (a|b|c) ⇒ ([a-c]) The pattern rewrite phase can be disabled, either using pcre2_set_optimize(), or by including the inline option (*NO_REWRITE). Co-authored-by: Carlo Marcelo Arenas Belón <[email protected]>
Just updated this PR to work on the latest Tests are passing, except for "check autogenerated file freshness" task which is complaining about "noise" changes, like changes in the spaces between words in |
Yes, I apologise. It's due to the |
Dear @NWilson, this is quite enlightening. Do you happen to know what version of |
Apparently it's 1.23.0: https://packages.ubuntu.com/search?keywords=groff&searchon=names&suite=all§ion=all (I'm travelling at the moment) I have a set of Linux VMs for testing features of different distributions, with my home directory mounted and shared across them. I use the Ubuntu 24.04 one for running PCRE2 commands. It shouldn't be needed though. We really ought to make it less sensitive to tool versions. |
I just cloned the repo for GNU nroff and built version 1.23.0, but its output is quite different from the documentation files which are currently checked into the PCRE2 repository. For now, I am considering the above failure of the "Check autogenerated file freshness" CI task to be spurious. |
Thanks to @mvorisek for suggesting this optimization.
A new "rewrite" pass has been added to the regex compilation process. For now, the rewrite pass only optimized one type of regex: those where every branch of an alternation construct has a common prefix.
In such cases, we rewrite the regex like so (for example):
An extra non-capturing group is not introduced if the alternation is within a non-capturing group (which is not quantified using ?, *, or a similar suffix). In that case we simply do something like:
In some edge cases, it is possible that rewriting a group with common alternation prefix might open up the opportunity to pull out more common prefixes. For example:
In that case, if the group '(ab|ac)' was rewritten to pull out the common prefix, it would then become possible to pull out a common prefix from the top-level group. However, we do not take advantage of that opportunity. Further, we do not perform the rewrite in cases where the prefixes are semantically equivalent, but parse to a different
parsed_pattern
sequence.Groups which the regex engine might need to backtrack into are never pulled out, since this could change the order in which the regex engine considers possible ways of matching the pattern against the subject string, and could thus change the returned match. For example, this pattern will not be rewritten:
Also, callouts are never extracted even if they form a common prefix to an alternation. Some backtracking control verbs, like (*SKIP) and (*COMMIT), are never extracted either. Further, alternation is not rewritten in lookbehind assertions, since this can easily convert fixed-length lookbehind to variable lookbehind.
A different type of rewrite is performed if an alternation construct matches only single, literal characters:
A new compile option,
PCRE2_NO_PATTERN_REWRITE
, has been added to skip the pattern rewrite phase when compiling a pattern.I have extensively fuzzed this patch. Fuzzing revealed many bugs, which have been fixed.