Commit
Rewrite regexes where common prefix can be pulled out from alternation branches

Thanks to Michael Voříšek for suggesting this optimization.

A new "rewrite" pass has been added to the regex compilation process. For now, the rewrite pass only optimizes one type of regex: those where every branch of an alternation construct has a common prefix. In such cases, we rewrite the regex like so (for example):

    (abc|abd|abe) ⇒ (ab(?:c|d|e))

An extra non-capturing group is not introduced if the alternation is within a non-capturing group (which is not quantified using ?, *, or a similar suffix). In that case we simply do something like:

    (?:abc|abd|abe) ⇒ ab(?:c|d|e)

In some edge cases, rewriting a group with a common alternation prefix might open up the opportunity to pull out more common prefixes. For example:

    (a(b|c)d|(ab|ac)e)

In that case, if the group '(ab|ac)' were rewritten to pull out the common prefix, it would then become possible to pull out a common prefix from the top-level group. However, we do not take advantage of that opportunity. Further, we do not perform the rewrite in cases where the prefixes are semantically equivalent but parse to a different parsed_pattern sequence.

Groups which the regex engine might need to backtrack into are never pulled out, since this could change the order in which the regex engine considers possible ways of matching the pattern against the subject string, and could thus change the returned match. For example, this pattern will not be rewritten:

    ((?:a|b)c|(?:a|b)d)

Also, callouts are never extracted even if they form a common prefix to an alternation. Some backtracking control verbs, like (*SKIP) and (*COMMIT), are never extracted either.

A different type of rewrite is performed if an alternation construct matches only single, literal characters:

    (a|b|c) ⇒ ([a-c])

A new compile option, PCRE2_NO_PATTERN_REWRITE, has been added to skip the pattern rewrite phase when compiling a pattern.
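The two rewrites described above can be illustrated with a small Python sketch. This is not the PCRE2 implementation, which works on the parsed pattern in C; it is a simplified model that assumes every branch is a plain alphanumeric literal, so none of the backtracking, callout, or control-verb restrictions arise. The helper name `rewrite_alternation` and its `capturing` parameter are hypothetical, and the handling of non-contiguous single characters and of non-capturing single-character groups is an assumption rather than something the commit message specifies.

```python
import os
import re


def rewrite_alternation(branches, capturing=True):
    """Rewrite an alternation of plain literal branches (illustrative only)."""
    # Assumption: only non-empty alphanumeric literals, so no escaping is needed.
    if len(branches) < 2 or not all(b.isalnum() for b in branches):
        raise ValueError("this sketch needs two or more alphanumeric literals")

    # Single-character branches become a character class: (a|b|c) => ([a-c]).
    if all(len(b) == 1 for b in branches):
        chars = sorted(set(branches))
        contiguous = ord(chars[-1]) - ord(chars[0]) == len(chars) - 1
        if len(chars) >= 3 and contiguous:
            char_class = f"[{chars[0]}-{chars[-1]}]"
        else:
            # Assumption: non-contiguous characters are simply listed.
            char_class = "[" + "".join(chars) + "]"
        return f"({char_class})" if capturing else char_class

    # Longest prefix shared by every branch (character-wise comparison).
    prefix = os.path.commonprefix(branches)
    if not prefix:
        # Nothing to pull out; emit the alternation unchanged.
        opener = "(" if capturing else "(?:"
        return opener + "|".join(branches) + ")"

    # Keep the branch order so the engine tries alternatives in the same order.
    suffixes = "|".join(b[len(prefix):] for b in branches)
    inner = f"{prefix}(?:{suffixes})"
    # A capturing group keeps its parentheses so group numbering is unchanged.
    return f"({inner})" if capturing else inner


print(rewrite_alternation(["abc", "abd", "abe"]))         # (ab(?:c|d|e))
print(rewrite_alternation(["abc", "abd", "abe"], False))  # ab(?:c|d|e)
print(rewrite_alternation(["a", "b", "c"]))               # ([a-c])

# Spot check with Python's own regex engine: both forms find the same match.
original = re.compile("(abc|abd|abe)")
rewritten = re.compile(rewrite_alternation(["abc", "abd", "abe"]))
for subject in ["xxabd", "abe", "abz"]:
    m1, m2 = original.search(subject), rewritten.search(subject)
    assert (m1 and m1.group(1)) == (m2 and m2.group(1))
```

Preserving the branch order in the suffix group reflects the concern raised in the commit message: the rewrite should not change which alternative the engine tries first, and therefore should not change the returned match.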
Showing 30 changed files with 3,858 additions and 528 deletions.