Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite regexes where common prefix can be pulled out from alternation branches #464

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

alexdowad
Copy link
Contributor

Thanks to @mvorisek for suggesting this optimization.

A new "rewrite" pass has been added to the regex compilation process. For now, the rewrite pass only optimized one type of regex: those where every branch of an alternation construct has a common prefix.

In such cases, we rewrite the regex like so (for example):

    (abc|abd|abe) ⇒ (ab(?:c|d|e))

An extra non-capturing group is not introduced if the alternation is within a non-capturing group (which is not quantified using ?, *, or a similar suffix). In that case we simply do something like:

    (?:abc|abd|abe) ⇒ ab(?:c|d|e)

In some edge cases, it is possible that rewriting a group with common alternation prefix might open up the opportunity to pull out more common prefixes. For example:

    (a(b|c)d|(ab|ac)e)

In that case, if the group '(ab|ac)' was rewritten to pull out the common prefix, it would then become possible to pull out a common prefix from the top-level group. However, we do not take advantage of that opportunity. Further, we do not perform the rewrite in cases where the prefixes are semantically equivalent, but parse to a different parsed_pattern sequence.

Groups which the regex engine might need to backtrack into are never pulled out, since this could change the order in which the regex engine considers possible ways of matching the pattern against the subject string, and could thus change the returned match. For example, this pattern will not be rewritten:

    ((?:a|b)c|(?:a|b)d)

Also, callouts are never extracted even if they form a common prefix to an alternation. Some backtracking control verbs, like (*SKIP) and (*COMMIT), are never extracted either. Further, alternation is not rewritten in lookbehind assertions, since this can easily convert fixed-length lookbehind to variable lookbehind.

A different type of rewrite is performed if an alternation construct matches only single, literal characters:

    (a|b|c) ⇒ ([a-c])

A new compile option, PCRE2_NO_PATTERN_REWRITE, has been added to skip the pattern rewrite phase when compiling a pattern.

I have extensively fuzzed this patch. Fuzzing revealed many bugs, which have been fixed.

@alexdowad alexdowad force-pushed the rewrite branch 6 times, most recently from d0d8da3 to af27b24 Compare September 6, 2024 23:54
@alexdowad
Copy link
Contributor Author

I've noticed one thing here... RunTest.bat only runs test suites 1-23 on Windows (24, 25, and 26) are not run.

Is this intentional? Should I add the new suite 27 to RunTest.bat?

Copy link
Contributor

@carenas carenas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't fully look at the code (or even pulled it), as the difference on style/design and white space changes keeps distracting me though, but if this code hasn't been tested yet in Windows it might be worth marking it as "draft".
Notice there are some warnings triggered by the CI, and there is at least one place where -Wdeclaration-after-statement should trigger.

RunTest Show resolved Hide resolved
testdata/testinput17 Show resolved Hide resolved
src/pcre2_compile.c Outdated Show resolved Hide resolved
src/pcre2_compile.c Outdated Show resolved Hide resolved
src/pcre2_compile.c Show resolved Hide resolved
@@ -65,6 +65,7 @@ The primary option bits are:
theses (named ones available)
PCRE2_NO_AUTO_POSSESS Disable auto-possessification
PCRE2_NO_DOTSTAR_ANCHOR Disable automatic anchoring for .*
PCRE2_NO_PATTERN_REWRITE Disable pattern rewriting optimizations
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the future, there might be more rewrite optimizations. I would name this specifically for #411 optimization only.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disagree (at least in principle). As described in the discussion for that feature request this is indeed adding a completely new mode of operation for PCRE which was NEVER an optimizing compiler and where the normal use cases expected some other way to optimize the pattern (the same applies to JIT BTW which mostly translates the interpreter opcodes into native CPU code)

I "agree" that long term other "rewrites" might be added and would be a good idea to identify them here, but the way to do so would be to add flags for each case as extended options IMHO. Indeed I would argue that this option SHOULDN'T be using one of the few options bits left and should be instead enabled through an extended option itself.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use an O0-O3 system similar to gcc. The O1 is the default, and the behaviour is same as now. Then we could introduce optimizations, which costs more time, but improves performace more. We could also disable optimizations, such as auto-possession with O0. Disabling first character optimizations would be deprecated as well.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My view when Alex and I discussed this was that NO_PATTERN_REWRITE would disable all rewriting optimizations, present and future. If there is a need to disable individual ones, then additional extended options can be used.

Copy link
Contributor

@carenas carenas Sep 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I wasn't clear enough and indeed might had confused the discussion further by mentioning the extended options, which I was proposing should be used to ENABLE each optimization, and therefore the bit in the compile options that this thread is attached to wouldn't be needed and could be retired (with the advantage that we are back at having two "free" options for the future).

Agree that the (*NO_PATTERN_REWRITE) pattern option setting should disable all optimizations.

Also agree with the OP that it makes sense to enable this optimization with an option that should be more specific, and indeed, using extended options for them would make the design more future proof.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the rewriting happens after the parsing is successful, so no error (besides out of memory) can happen.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the rewriting happens after the parsing is successful, so no error (besides out of memory) can happen.

You are right that rewriting happens after a successful parse... but there are just a few types of errors which are not detected during parsing, but rather when processing the parsed_pattern vector. For all such cases, offsets into the original pattern string are stored in the parsed_pattern vector, and the pattern rewriter makes sure not to destroy these.

Copy link
Contributor

@carenas carenas Sep 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The erroroffset issue was likely a red herring, but it is offtopic for this conversation which I think needs a resolution.

It would seem that @PhilipHazel likes the current one as it is sufficient and doesn't make a difference when there is only one rewriting optimization anyway and might want this discussion punted until later.

Note that the option flag mentioned here is to disable all rewriting (hence the subsystem and not only this specific "optimization").

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whatever we do, it should happen before the next release. There are only a few compile flags left, and this flag feel out of scope for me. What about making this always enabled in this patch, and make it conditional when we decide how to do it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely, this issue needs to be resolved. But even just for testing purposes, it's vital that we can disable the optimization. I think making it "always enabled" is not an option.

My next priority on this PR is to investigate why the 'code memory' size has increased for some patterns. After that will be to eliminate all compiler warnings. Other issues to be worked on include: enabling tests on Windows if appropriate, coming to a resolution on whether PCRE2 will move to the C99 standard or not, questions of brace style, and performance benchmarking. (Philip has already done a bit of benchmarking, and from his limited results, it appeared that this optimization is both 1) worthwhile and 2) has relatively small impact on compile time. But more thorough and extensive benchmarking is needed. It's possible that the benchmarking might cause me to do more performance work to reduce the impact on compile time.)

As all those matters fall into place one by one, at some point, the issue of how to disable the optimization must be worked out. I agree with Zoltan that the number of bits left in the compile options is truly small. But otherwise I haven't thought hard enough to weigh in on the issue. I just followed what has been done for previous optimizations.

I consider it extremely important that Zoltan, Carlo, and other key contributors who have helped to make PCRE2 what it is now are all happy with whatever is decided.

@zherczeg
Copy link
Collaborator

zherczeg commented Sep 7, 2024

Similar to other analysis, I would move this out of the pcre_compile. As a first step, I would introduce a pcre_compile.h which contains the internal definitions, and only included by similar optimizations. We could organize optimizations better in the future.

@PhilipHazel
Copy link
Collaborator

Are we ready to merge this yet?

@alexdowad
Copy link
Contributor Author

Are we ready to merge this yet?

Absolutely not. 😄 There has been some great feedback from @carenas and others which I need to implement.

@PhilipHazel PhilipHazel marked this pull request as draft September 7, 2024 11:10
@PhilipHazel
Copy link
Collaborator

I have marked this PR as draft.

@zherczeg
Copy link
Collaborator

zherczeg commented Sep 7, 2024

Could you share your results about the optimization? Compile time overhead vs speed gain. Interpreter and jit is also interesting.

@zherczeg
Copy link
Collaborator

zherczeg commented Sep 8, 2024

I realized we are not just adding a new function here, we are also adding an entirely new subsystem. So we should discuss how similar optimizations should be handled in the future, because changing the api after a release is difficult.

My api proposal:
pcre2_compile_set_opt(compile_context, optimization_level, disabled_optimizations);

Most people don't know the effect of various optimizations, so they can simply set disabled_optimizations to 0. However, if a certain optimization cause issues in a certain system, they can disable that particular one.

The optimization_level can be 0-3, and performs a disabled_optimizations |= operation to add disabled optimizations internally. Hence the compile context does not store optimization_level directly.

It is also a good question how patterns could override the optimization level. Setting it to 0 may hurt system performance, and not necessary desired in all cases. Shall we add an option to disabled_optimizations which prevents overriding the optimization level? Or can only increase or decrease the optimization level.

@alexdowad alexdowad force-pushed the rewrite branch 2 times, most recently from bd4423e to 12147f7 Compare September 8, 2024 03:06
@alexdowad
Copy link
Contributor Author

Although not all issues raised by reviewers have been resolved, I am pushing an update so any comments can refer to the latest version of the PR.

@alexdowad alexdowad force-pushed the rewrite branch 4 times, most recently from 3fcbe8e to fa645ae Compare September 8, 2024 09:06
@alexdowad
Copy link
Contributor Author

Tomorrow I hope to look into the issue @carenas mentioned whereby the 'code size memory' has increased for some patterns. Will post my findings here.

@PhilipHazel
Copy link
Collaborator

Two comments: (1) I was about to comment about erroroffset not being a problem, but that seems to have been explained by others. (2) How about adding a new 32-bit field to the compile context called, say, OPTIMIZE_SWITCHES, which contains a bit for each optimization that can be enabled/disabled. A function could then set them individually, or "levels" such as 0, 1, 2, etc could set a selection of them. Existing NO_xxx options would have to be kept, of course, for backward compatibility, but we could move the new NO_REWRITE into the new field.

@alexdowad
Copy link
Contributor Author

How about adding a new 32-bit field to the compile context called, say, OPTIMIZE_SWITCHES, which contains a bit for each optimization that can be enabled/disabled...

👏🏻

Nice. Very nice.

@alexdowad
Copy link
Contributor Author

alexdowad commented Sep 9, 2024

Carlo raised the question of why the memory used for regex code seems to have increased in some test cases. It's because of rewriting alternations to character classes, like: (?:a|b) ⇒ [ab].

PCRE2 has more than one way to represent character classes in its bytecode, but for cases like this, it's one unit for OP_CLASS, then 32 more bytes for a bit vector which represents all character values from 0-255. So using a lot of char classes tends to make the bytecode a bit large.

For char classes which only have a small number of character values (say up to 8), I wouldn't be surprised if it would be faster and make the regex bytecode smaller to represent the character values as literal bytes. Although it would require some benchmarking to see if this is actually faster or not, I would think of possibly using SIMD to test for 8 different byte values at once... or as an alternative, test all 8 values and OR the results together before doing a single branch (rather than branching 8 times).

@alexdowad
Copy link
Contributor Author

...I would think of possibly using SIMD to test for 8 different byte values at once...

For x86, perhaps something like:

    __m64 class = _mm_set_pi8(char1, char2, char3, char4, char5, char6, char7, char8);
    __m64 test = _mm_set1_pi8(subject_char);
    if (_mm_movemask_pi8(_mm_cmpeq_pi8(class, test)) {
        /* subject_char matches char class */
    }

In JIT'd code, hopefully it might be possible to load the MM register just once and keep the character class values there for the duration of the match.

@zherczeg
Copy link
Collaborator

zherczeg commented Sep 9, 2024

JIT has many optimizations for classes. E.g. [Aa] is converted to (char | 0x20) == a, or [a-z] is converted to (char - a) <= z - a.

@zherczeg
Copy link
Collaborator

Most optimizations are a simple boolean, so I would prefer a boolean array. The other alternative is multiple call of the API function, where each call sets one optimization. We can also add more api functions later if needed. We have a context, so everything is backward compatible.

@PhilipHazel
Copy link
Collaborator

A function such as pcre2_set_optimization() should have as its arguments (1) a pointer to the compile context, (2) the index into an array of 32-bit words (so that we can extend ad infinitum), (3) two 32-bit words, the first being bits to unset followed by the bits to set. Alternatively, do you think, if we were to make it a 64-bit field, we would ever run out? That makes for an easier interface.

I happened to be looking at some historical documents today, and found this quote from a paper published in 1963, which I cannot resist passing on: "In general it is true to say that the facilities most popular with programmers are those which are easy to use."

@alexdowad
Copy link
Contributor Author

A function such as pcre2_set_optimization() should have as its arguments (1) a pointer to the compile context, (2) the index into an array of 32-bit words (so that we can extend ad infinitum), (3) two 32-bit words, the first being bits to unset followed by the bits to set.

Dear Philip, I'm just trying to figure out how this interface will work. Why is there both an index into an array of 32-bit words, and also two 32-bit words with bits to set/clear? When you say there should be an index into an array of 32-bit words, how will this array work? What will the values in the array be?

To make this discussion a bit more concrete, I may open a PR today with a proposed interface.

@carenas
Copy link
Contributor

carenas commented Sep 14, 2024

if we were to make it a 64-bit field, we would ever run out? That makes for an easier interface.

frankly, I think that a 63-bit field must be enough (even maybe stick with uint32_t, although I can't think of any relevant architecture were a int64_t wouldn't be as available), after all if an optimization NEEDS to be disabled then it probably shouldn't exist anyway and so these bits will be mostly allocated when the optimization is developed and just to make sure that it can be disabled if found buggy or to compare with the unoptimized code, and hopefully retired before being made permanent.

With that the inteface will be instead:

#define PCRE2_COMPILE_OPTIMIZE_ALTERNATE (1l << 0)
void pcre2_disable_optimizations(pcre2_compile_context *, int64_t);

@zherczeg
Copy link
Collaborator

Some optimizations may take more time, and less useful for simple patterns, and disabling it is ok. I would prefer a O0-O3 system for that, even if we don't use all levels at the moment. Enabling optimizations individually is more of a private system, because users unlikely know the details of them. Hence a bitset (uint32_t) array could be enough in long term.

@carenas
Copy link
Contributor

carenas commented Sep 14, 2024

Why is there both an index into an array of 32-bit words

Because since you wanted an infinitely large bitset vector, then the mask to enable/disable has to know at which offset on that vector to apply the passed values.

and also two 32-bit words with bits to set/clear?

because the API is called _set_, so doesn't assume either that all optimizations are enabled by default unlike my "simplified" suggestion.

I agree with Philip though that doing _set_ would be better for this API but wanted to avoid the confusion this comment hopefully clarifies.

@alexdowad
Copy link
Contributor Author

Thanks to all for the suggestions; it is helpful.

I've opened a (draft) PR adding an API which I believe is better than the kind suggestions which the reviewers have shared. It currently has Zoltan's -O0 and -O3.

@alexdowad
Copy link
Contributor Author

I have rebased this PR on the latest master. Before pushing an update, I would like to hear opinions from others: Do you agree that (*NO_PATTERN_REWRITE) can be removed?

For all (or basically all) of the current compile option flags, it is possible to set them using syntax like (*NO_JIT), etc. at the beginning of a regex. This syntax is converted to META_OPTIONS by the parser.

If more optimization flags are added in the future, I doubt if we should keep adding more and more of such options syntax. Personally, it doesn't seem necessary that users should be able to set any possible optimization flag using such syntax. So I'm thinking there may be no need to support an inline (*NO_PATTERN_REWRITE) option.

@carenas
Copy link
Contributor

carenas commented Sep 21, 2024

Do you agree that (*NO_PATTERN_REWRITE) can be removed?

No; the in pattern VERB are there to allow disabling the option for final users of the library who might not have access to the C API to do so, and even for the ones that do, it might need to be disabled for specific patterns because it is not a good option (compilation time can be significantly increased for an already optimized pattern) or has bugs that are triggered by it.

@PhilipHazel
Copy link
Collaborator

The original motivation for introducing (*NO_JIT) etc was to allow users who have no access to the C code to change these options. In other words, users of some application that allows them to provide patterns might want or need to set such things. In particular (*UTF) and (*UCP) might be necessary - though these can be locked out by the application. I can see that there those that relate to optimizations are somewhat different, but it's not really much code in pcre2_compile.c. We cannot remove the old ones without provoking complaints about backwards compatibility, and if we don't add NO_PATTERN_REWRITE somebody is sure to grumble. So I would add it, for an easy life, but what do others think?

@carenas
Copy link
Contributor

carenas commented Sep 21, 2024

I can see that there those that relate to optimizations are somewhat different, but it's not really much code in pcre2_compile.c. We cannot remove the old ones without provoking complaints about backwards compatibility, and if we don't add NO_PATTERN_REWRITE somebody is sure to grumble.

One important thing to clarify here too, is WHAT is being disabled, and if the name might need to be changed as a result.

There is the "rewrite subsystem" which is an extra compilation step that could be potentially used for more than optimizations, and there is the "rewrite optimization" itself that uses that subsystem mainly to optimize alternations.

@alexdowad
Copy link
Contributor Author

One important thing to clarify here too, is WHAT is being disabled

I would say that for now, it's disabling all rewrites. If there is need to add more fine-grained options, that can be done later.

@alexdowad
Copy link
Contributor Author

Looks like I need to understand Zoltan's new code for handling character classes. The code in this PR for rewriting char classes worked on the previous master, but doesn't look like it works now.

@zherczeg
Copy link
Collaborator

Am I understand that rewriting happens using META code? My new code is running in META -> BYTE CODE translation phase, and should not interact with rewriting, since it happens later. Also, I have introduced a pcre_compile.h, and added new .c file for doing range optimization, and I would like that rewriting optimization also goes to another .c file. The pcre_compile.c is big enough.

@alexdowad
Copy link
Contributor Author

Am I understand that rewriting happens using META code?

Indeed, it does. So as you said, it seems that the new code should not have an effect here. But I'll investigate just to make sure.

I would like that rewriting optimization also goes to another .c file. The pcre_compile.c is big enough.

No objection, though it means that it will not be possible to make the entry point static. I guess it will be more in line with the existing style of the project as well (since DFA matching is in a separate file, the interpreter is in a separate file, etc).

@carenas
Copy link
Contributor

carenas commented Sep 22, 2024

I guess it will be more in line with the existing style of the project as well

Not really, and I think this point is worth clarifying, because unlike most codebases we do have a lot of static functions in C files that are included into other C files.

All that Zoltan is suggesting is to move your code to a different file, if you MUST keep your entrypoint static (and I can't see why, since it is not an exported symbol anyway, it is called only once and is probably too big to inline) then you could by following our existing style.

@alexdowad
Copy link
Contributor Author

Not really, and I think this point is worth clarifying, because unlike most codebases we do have a lot of static functions in C files that are included into other C files.

Ooh, this is interesting. Thanks for the tip.

@alexdowad alexdowad force-pushed the rewrite branch 6 times, most recently from ff9d7ef to 43e4112 Compare December 28, 2024 01:20
…n branches

Thanks to Michael Voříšek for suggesting this optimization.

A new "rewrite" pass has been added to the regex compilation process.
For now, the rewrite pass only optimized one type of regex: those where
every branch of an alternation construct has a common prefix.

In such cases, we rewrite the regex like so (for example):

    (abc|abd|abe) ⇒ (ab(?:c|d|e))

An extra non-capturing group is not introduced if the alternation
is within a non-capturing group (which is not quantified using ?, *, or
a similar suffix). In that case we simply do something like:

    (?:abc|abd|abe) ⇒ ab(?:c|d|e)

In some edge cases, it is possible that rewriting a group with common
alternation prefix might open up the opportunity to pull out more common
prefixes. For example:

    (a(b|c)d|(ab|ac)e)

In that case, if the group '(ab|ac)' was rewritten to pull out the
common prefix, it would then become possible to pull out a common
prefix from the top-level group. However, we do not take advantage of
that opportunity.

Further, we do not perform the rewrite in cases where the prefixes are
semantically equivalent, but parse to a different parsed_pattern
sequence.

Groups which the regex engine might need to backtrack into are never
pulled out, since this could change the order in which the regex
engine considers possible ways of matching the pattern against the
subject string, and could thus change the returned match. For
example, this pattern will not be rewritten:

    ((?:a|b)c|(?:a|b)d)

Also, callouts are never extracted even if they form a common prefix
to an alternation. Some backtracking control verbs, like (*SKIP) and
(*COMMIT), are never extracted either. Further, we never rewrite
alternation which appears directly within a lookbehind assertion,
since this would tend to convert (fast) fixed-length lookbehind to
(slow) variable-length lookbehind.

A different type of rewrite is performed if an alternation construct
matches only single, literal characters:

    (a|b|c) ⇒ ([a-c])

The pattern rewrite phase can be disabled, either using
pcre2_set_optimize(), or by including the inline option
(*NO_REWRITE).

Co-authored-by: Carlo Marcelo Arenas Belón <[email protected]>
@alexdowad
Copy link
Contributor Author

Just updated this PR to work on the latest master, though it needs to be fuzzed again.

Tests are passing, except for "check autogenerated file freshness" task which is complaining about "noise" changes, like changes in the spaces between words in pcre2.txt, which are not caused by this PR. (It would be helpful if the PrepareRelease script could be made to generate text files with consistent spacing, so that diffs are not cluttered with irrelevant changes.)

@NWilson
Copy link
Member

NWilson commented Dec 28, 2024

Yes, I apologise. It's due to the nroff tool using different justification rules at different versions. I've standardised on the version that's in a Ubuntu 24.04LTS just for convenience in the CI. It is a pain though. I'll see what I can do.

@alexdowad
Copy link
Contributor Author

Yes, I apologise. It's due to the nroff tool using different justification rules at different versions. I've standardised on the version that's in a Ubuntu 24.04LTS just for convenience in the CI. It is a pain though. I'll see what I can do.

Dear @NWilson, this is quite enlightening. Do you happen to know what version of nroff that is? The version I have locally is GNU nroff 1.22.4, but I can build another version if it helps to generate files which match those generated by CI.

@NWilson
Copy link
Member

NWilson commented Dec 28, 2024

Apparently it's 1.23.0: https://packages.ubuntu.com/search?keywords=groff&searchon=names&suite=all&section=all

(I'm travelling at the moment)

I have a set of Linux VMs for testing features of different distributions, with my home directory mounted and shared across them. I use the Ubuntu 24.04 one for running PCRE2 commands.

It shouldn't be needed though. We really ought to make it less sensitive to tool versions.

@alexdowad
Copy link
Contributor Author

I just cloned the repo for GNU nroff and built version 1.23.0, but its output is quite different from the documentation files which are currently checked into the PCRE2 repository.

For now, I am considering the above failure of the "Check autogenerated file freshness" CI task to be spurious.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants