Regex 'v' mode case insensitive matching says characters are folded to lower case, which is incorrect? #37466

Open
tjenkinson opened this issue Jan 1, 2025 · 3 comments
Labels: accepting PR (feel free to open a PR to resolve this issue); Content:JS (JavaScript docs)

Comments

tjenkinson commented Jan 1, 2025

MDN URL

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Regular_expressions/Character_class

What specific section or headline is this issue about?

Complement classes and case-insensitive matching

What information was incorrect, unhelpful, or incomplete?

Recall that case-insensitive matching happens by folding both the pattern and the input to the same case (see ignoreCase for more details). For r1, the character class a-z stays the same after case folding, while both upper- and lower-case ASCII string inputs are folded to lower-case

What did you expect to see?

I expected the input to be folded to upper case, since the linked ignoreCase page says:

If the regex is Unicode-unaware, case mapping uses the Unicode Default Case Conversion — the same algorithm used in String.prototype.toUpperCase()

and https://tc39.es/ecma262/multipage/text-processing.html#sec-runtime-semantics-canonicalize-ch says

Let u be toUppercase(« cp »), according to the Unicode Default Case Conversion algorithm.

Do you have any supporting links, references, or citations?

Do you have anything more you want to share?

Also confused by the example. I think it implies that /[^A-Z]/vi should match a, but it doesn't?

/[^A-Z]/vi.test('a') === false

tjenkinson added the "needs triage" label Jan 1, 2025
github-actions bot added the "Content:JS" label Jan 1, 2025
Josh-Cena (Member) commented Jan 2, 2025

Yes, this is kind of surprising, but Unicode-aware case mapping folds everything to lowercase. If you check https://unicode.org/Public/UCD/latest/ucd/CaseFolding.txt, you can see that, for example, the first line, 0041; C; 0061;, maps A to a. Unicode-unaware case mapping does use toUpperCase. For v mode, the Unicode-aware case mapping is the relevant one, and it maps uppercase letters to lowercase.

> Also confused by the example. I think it implies that /[^A-Z]/vi should match a, but it doesn't?
>
> /[^A-Z]/vi.test('a') === false

Yeah that does make the example kind of confusing. The problem is that A-Z is not \P{Lowercase_Letter}; it's \p{Uppercase_Letter}, which, in case-insensitive matching, is exactly the same as \p{Lowercase_Letter}. The example here is just to illustrate why the behavior is like this for ui, but there's no simplified analogy for vi.

Welcoming two note blocks to these two pages making these clarifications!

Josh-Cena added the "accepting PR" label and removed the "needs triage" label Jan 2, 2025
tjenkinson (Author) commented Jan 2, 2025

Thanks @Josh-Cena

so IIUC with

/[a-z\p{Uppercase_Letter}]/vi

if the input were a, then A would be compared with A-Z and a with \p{Lowercase_Letter}? And that is exactly the same as if the input were A?

\p{Uppercase_Letter} -> \p{Lowercase_Letter}, and also an input of A -> a for that check because of unicode case folding rules?

tjenkinson (Author) commented

> \p{Uppercase_Letter} -> \p{Lowercase_Letter}, and also an input of A -> a for that check because of unicode case folding rules?

or actually it's more like every code point in \p{Uppercase_Letter} is folded into a new set, where the result would end up matching \p{Lowercase_Letter}?

So then for

/[^a\P{Lowercase_Letter}]/vi

this essentially becomes a mix of "not A, with the input uppercased" and "something in Lowercase_Letter after each code point is run through Unicode case folding, with the input also run through Unicode case folding"

whereas

/[^a\P{Lowercase_Letter}]/ui

becomes a mix of "not A, with the input uppercased" and "not something in the inverse of Lowercase_Letter after each code point is run through Unicode case folding, with the input also run through Unicode case folding". Given that the inverse of Lowercase_Letter includes all upper case letters, which would then be lowercased and match the lowercased input, this would match nothing

Hence why

/[^a\P{Lowercase_Letter}]/ui.test('b') === false
/[^a\P{Lowercase_Letter}]/vi.test('b') === true

?
