Deactivate many `regex` Unicode crate features #2702

lopopolo · 2023-12-27T15:51:24Z

#1643 disabled many default features of the regex crate but left the unicode meta feature enabled. With the unicode feature enabled and bindgen as a build dependency, regex-syntax (a direct dependency of the regex crate) takes 7 seconds to compile as a build dependency in my application.

The unicode feature includes support for many Unicode character class lookups that bindgen is unlikely to use. Even without the unicode feature enabled, the regex crate supports Unicode. The various unicode-* features only remove compiled-in data tables that support various types of character classes.

From https://docs.rs/regex/latest/regex/#unicode-features:

unicode-age - Provide the data for the Unicode Age property. This makes it possible to use classes like \p{Age:6.0} to refer to all codepoints first introduced in Unicode 6.0

unicode-bool - Provide the data for numerous Unicode boolean properties. The full list is not included here, but contains properties like Alphabetic, Emoji, Lowercase, Math, Uppercase and White_Space.

unicode-case - Provide the data for case insensitive matching using Unicode's "simple loose matches" specification.

unicode-gencat - Provide the data for Unicode general categories. This includes, but is not limited to, Decimal_Number, Letter, Math_Symbol, Number and Punctuation.

unicode-script - Provide the data for Unicode scripts and script extensions. This includes, but is not limited to, Arabic, Cyrillic, Hebrew, Latin and Thai.

unicode-segment - Provide the data necessary to provide the properties used to implement the Unicode text segmentation algorithms. This enables using classes like \p{gcb=Extend}, \p{wb=Katakana} and \p{sb=ATerm}.

I have retained the unicode-perl feature, which gives support for \w, \s and \d, because these character classes were required to get tests to pass.

Removing support for these character classes removes the need to compile many data tables, which should significantly reduce compile times.


lopopolo · 2023-12-27T16:01:37Z

As an aside, the various regexes in bindgen that use \d are probably incorrect and should use [[:digit:]] instead. \d is a Unicode-aware character class that matches things like Devanagari numerals, whereas [[:digit:]] matches only ASCII 0-9.
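
The difference between the two classes can be illustrated without pulling in the regex crate. The following is a minimal sketch using only the standard library, under two assumptions: `char::is_ascii_digit` stands in for [[:digit:]], and `char::is_numeric` stands in for Unicode-aware \d (it covers the Nd, Nl and No general categories, so it is somewhat broader than \d). The helper names are hypothetical and are not part of bindgen or regex.

```rust
/// Stand-in for `[[:digit:]]`: accepts ASCII 0-9 only.
fn is_ascii_digits(s: &str) -> bool {
    !s.is_empty() && s.chars().all(|c| c.is_ascii_digit())
}

/// Rough stand-in for Unicode-aware `\d`. `char::is_numeric` covers the
/// Nd, Nl and No general categories, a superset of what `\d` matches.
fn is_unicode_numeric(s: &str) -> bool {
    !s.is_empty() && s.chars().all(char::is_numeric)
}

fn main() {
    let ascii = "123";
    let devanagari = "१२३"; // DEVANAGARI DIGIT ONE, TWO, THREE

    for input in [ascii, devanagari] {
        println!(
            "{input}: ascii-only = {}, unicode-aware = {}",
            is_ascii_digits(input),
            is_unicode_numeric(input)
        );
    }
}
```

For input that is expected to hold C-style integer literals, the ASCII-only check is usually the intended one.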

Dr-Emann · 2024-02-13T04:02:01Z

I think case insensitivity (unicode-case) was useful:

--allowlist-item '(?i)(aaa|bbb).*' used to work for e.g. AAA_CONST, and bbb_func()

Now it (silently 😲) matches nothing.

As a workaround, it looks like for my case I can use --allowlist-item '(?i-u:aaa|bbb).*', which turns on case insensitivity, and turns off unicode matching for my literal prefixes.
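
To make the intent of the workaround concrete: inside (?i-u:...), case insensitivity applies with Unicode mode switched off, so only ASCII case folding is involved and no Unicode case tables are needed. Below is a minimal sketch of roughly the same matching rule, written against the standard library rather than the regex crate. `matches_prefix_ascii_ci` is a hypothetical helper for illustration, not something bindgen provides.

```rust
/// Returns true if `name` starts with any of `prefixes`, comparing ASCII
/// letters case-insensitively. This approximates what a pattern such as
/// `(?i-u:aaa|bbb).*` accepts for identifier-like input.
fn matches_prefix_ascii_ci(name: &str, prefixes: &[&str]) -> bool {
    prefixes.iter().any(|p| {
        // Compare bytes so a prefix length that falls inside a multi-byte
        // character cannot cause a slicing panic.
        name.len() >= p.len()
            && name.as_bytes()[..p.len()].eq_ignore_ascii_case(p.as_bytes())
    })
}

fn main() {
    let prefixes = ["aaa", "bbb"];
    for name in ["AAA_CONST", "bbb_func", "ccc_other"] {
        println!("{name}: {}", matches_prefix_ascii_ci(name, &prefixes));
    }
}
```

Because the literal prefixes are plain ASCII, nothing is lost by dropping Unicode case folding for them.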

MightyPork · 2024-07-25T12:08:12Z

This broke my project; I relied on (?i) as well. There's no warning, the output file is just empty.

--allowlist-function "(?i)(bacnet|bactext|bacapp|bacstack|bacw).*"

Is ?i-u: equivalent? I have never seen this syntax before. It doesn't work when written as (?i-u), in the style of the old syntax.

so something like this?

  --allowlist-function "(?i-u:bacnet|bactext|bacapp|bacstack|bacw).*"

emilio merged commit d0c2b1e into rust-lang:main Dec 27, 2023
32 checks passed

lopopolo deleted the lopopolo/regex-cheaper-compile branch December 28, 2023 00:34

Dr-Emann mentioned this pull request Feb 13, 2024

Case-insensitive regexes were useful #2760

Open
