Deactivate many regex Unicode crate features #2702

Merged
merged 1 commit into from
Dec 27, 2023

Conversation

lopopolo
Contributor

#1643 disabled many default features of the regex crate but left the unicode meta feature enabled. With the unicode feature enabled and bindgen as a build dependency, regex-syntax (a direct dependency of the regex crate) takes 7 seconds to compile as a build dependency in my application.

The unicode feature includes support for many Unicode character class lookups that bindgen is unlikely to use. Even without the unicode feature enabled, the regex crate supports Unicode. Disabling the various unicode-* features only removes the compiled-in data tables that support particular kinds of character classes.

From https://docs.rs/regex/latest/regex/#unicode-features:

  • unicode-age - Provide the data for the Unicode Age property. This makes it possible to use classes like \p{Age:6.0} to refer to all codepoints first introduced in Unicode 6.0
  • unicode-bool - Provide the data for numerous Unicode boolean properties. The full list is not included here, but contains properties like Alphabetic, Emoji, Lowercase, Math, Uppercase and White_Space.
  • unicode-case - Provide the data for case insensitive matching using Unicode's "simple loose matches" specification.
  • unicode-gencat - Provide the data for Unicode general categories. This includes, but is not limited to, Decimal_Number, Letter, Math_Symbol, Number and Punctuation.
  • unicode-script - Provide the data for Unicode scripts and script extensions. This includes, but is not limited to, Arabic, Cyrillic, Hebrew, Latin and Thai.
  • unicode-segment - Provide the data necessary to provide the properties used to implement the Unicode text segmentation algorithms. This enables using classes like \p{gcb=Extend}, \p{wb=Katakana} and \p{sb=ATerm}.

I have retained the unicode-perl feature, which gives support for \w, \s and \d, because these character classes were required to get tests to pass.

Removing support for the other character classes listed above removes the need to compile many data tables, which should significantly reduce compile times.

@lopopolo
Contributor Author

As an aside, the various regexes in bindgen that use \d are probably incorrect and should use [[:digit:]] instead. \d is a Unicode-aware character class and matches things like Devanagari numerals, whereas [[:digit:]] matches only ASCII 0-9.
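
The distinction can be illustrated without the regex crate. The sketch below uses only the Rust standard library as a stand-in: `char::is_ascii_digit` plays the role of `[[:digit:]]`, and `char::is_numeric` is a rough approximation of a Unicode-aware `\d` (it is somewhat broader, since it also accepts the Nl and No categories). The function names are hypothetical and are not part of bindgen or regex.

```rust
/// Stand-in for `[[:digit:]]`: true only for ASCII '0'..='9'.
fn is_ascii_digit_only(c: char) -> bool {
    c.is_ascii_digit()
}

/// Rough stand-in for a Unicode-aware `\d`. Note: `char::is_numeric` covers
/// the Nd, Nl and No general categories, so it is broader than `\d` (Nd only).
fn is_unicode_numeric(c: char) -> bool {
    c.is_numeric()
}

fn main() {
    let ascii_seven = '7';
    let devanagari_five = '\u{096B}'; // DEVANAGARI DIGIT FIVE

    for &c in &[ascii_seven, devanagari_five] {
        println!(
            "{:?}: ascii-only = {}, unicode-aware = {}",
            c,
            is_ascii_digit_only(c),
            is_unicode_numeric(c)
        );
    }
}
```

A pattern intended to match only numbers as they appear in C identifiers or literals would presumably want the ASCII-only behaviour.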

@emilio emilio merged commit d0c2b1e into rust-lang:main Dec 27, 2023
32 checks passed
@lopopolo lopopolo deleted the lopopolo/regex-cheaper-compile branch December 28, 2023 00:34
@Dr-Emann

Dr-Emann commented Feb 13, 2024

I think case insensitivity (unicode-case) was useful:

--allowlist-item '(?i)(aaa|bbb).*' used to work for e.g. AAA_CONST, and bbb_func()

Now it (silently 😲) matches nothing.

As a workaround, it looks like for my case I can use --allowlist-item '(?i-u:aaa|bbb).*', which turns on case insensitivity and turns off Unicode matching for my literal prefixes.
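
To make concrete what that workaround asks for, here is a minimal standard-library sketch of ASCII-only case-insensitive prefix matching, which is roughly what `(?i-u:aaa|bbb).*` does, assuming the pattern is matched against the whole item name. It does not use the regex crate or bindgen; `has_prefix_ascii_ci` is a hypothetical helper, and the sample names are taken from the example above.

```rust
/// Returns true if `name` starts with any of `prefixes`, comparing ASCII
/// letters case-insensitively and all other bytes exactly.
fn has_prefix_ascii_ci(name: &str, prefixes: &[&str]) -> bool {
    prefixes.iter().any(|prefix| {
        name.as_bytes()
            .get(..prefix.len()) // None if `name` is shorter than the prefix
            .map_or(false, |head| head.eq_ignore_ascii_case(prefix.as_bytes()))
    })
}

fn main() {
    let prefixes = ["aaa", "bbb"];

    for &name in &["AAA_CONST", "bbb_func", "ccc_other"] {
        println!("{}: {}", name, has_prefix_ascii_ci(name, &prefixes));
    }
}
```

Because only ASCII case folding is involved, no Unicode case tables are needed, which is why this form keeps working with the unicode-case feature disabled.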

@MightyPork

This broke my project; I relied on (?i) as well. There's no warning; the output file is just empty.

--allowlist-function "(?i)(bacnet|bactext|bacapp|bacstack|bacw).*"

Is (?i-u:...) equivalent? I have never seen this syntax before. It doesn't work when written as (?i-u), the way the old (?i) syntax was.

So something like this?

  --allowlist-function "(?i-u:bacnet|bactext|bacapp|bacstack|bacw).*"
