Regex 'v' mode case insensitive matching says characters are folded to lower case, which is incorrect? #37466

tjenkinson · 2025-01-01T20:19:00Z

MDN URL

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Regular_expressions/Character_class

What specific section or headline is this issue about?

Complement classes and case-insensitive matching

What information was incorrect, unhelpful, or incomplete?

Recall that case-insensitive matching happens by folding both the pattern and the input to the same case (see ignoreCase for more details). For r1, the character class a-z stays the same after case folding, while both upper- and lower-case ASCII string inputs are folded to lower-case

What did you expect to see?

I thought it would be folded to upper case, as the linked page says

If the regex is Unicode-unaware, case mapping uses the Unicode Default Case Conversion — the same algorithm used in String.prototype.toUpperCase()

and https://tc39.es/ecma262/multipage/text-processing.html#sec-runtime-semantics-canonicalize-ch says

Let u be toUppercase(« cp »), according to the Unicode Default Case Conversion algorithm.

Do you have any supporting links, references, or citations?

Do you have anything more you want to share?

Also confused by the example. I think it implies that /[^A-Z]/vi should match a, but it doesn't?

/[^A-Z]/vi.test('a') === false

MDN metadata

Page report details

Folder: en-us/web/javascript/reference/regular_expressions/character_class
MDN URL: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Regular_expressions/Character_class
GitHub URL: https://github.com/mdn/content/blob/main/files/en-us/web/javascript/reference/regular_expressions/character_class/index.md
Last commit: 4f86aad
Document last modified: 2023-09-12T03:52:28.000Z


Josh-Cena · 2025-01-02T03:29:36Z

Yes this is kind of surprising, but Unicode-aware case mapping folds everything to lowercase. If you check https://unicode.org/Public/UCD/latest/ucd/CaseFolding.txt, you see that, for example, the first line 0041; C; 0061; maps A to a. Unicode-unaware case mapping does use toUpperCase. For v mode, the Unicode-aware case mapping is relevant, which maps uppercase letters to lowercase.

Also confused by the example. I think it implies that /[^A-Z]/vi should match a, but it doesn't?

/[^A-Z]/vi.test('a') === false

Yeah that does make the example kind of confusing. The problem is that A-Z is not \P{Lowercase_Letter}; it's \p{Uppercase_Letter}, which, in case-insensitive matching, is exactly the same as \p{Lowercase_Letter}. The example here is just to illustrate why the behavior is like this for ui, but there's no simplified analogy for vi.

PRs adding a note block to each of these two pages with these clarifications are welcome!

tjenkinson · 2025-01-02T21:09:57Z

Thanks @Josh-Cena

so IIUC with

/[a-z\p{Uppercase_Letter}]/vi

if the input was a then A will be compared with A-Z and a with \p{Lowercase_Letter}? Which is exactly the same as if the input was A?

\p{Uppercase_Letter} -> \p{Lowercase_Letter}, and also an input of A -> a for that check because of unicode case folding rules?

tjenkinson · 2025-01-02T21:52:20Z

\p{Uppercase_Letter} -> \p{Lowercase_Letter}, and also an input of A -> a for that check because of unicode case folding rules?

or actually it's more like every code point in \p{Uppercase_Letter} is folded into a new set, where the result would end up matching \p{Lowercase_Letter}?

So then for

/[^a\P{Lowercase_Letter}]/vi

this essentially becomes a mix of "not A, with the input uppercased" and "something in Lowercase_Letter after each code point is run through Unicode case folding, with the input also run through Unicode case folding"

whereas

/[^a\P{Lowercase_Letter}]/ui

becomes a mix of "not A, with the input uppercased" and "not something in the inverse of Lowercase_Letter after each code point is run through Unicode case folding, with the input also run through Unicode case folding". Given that the inverse of Lowercase_Letter would be all uppercase letters, which would then be lowercased and match the lowercased input, this would match nothing.

Hence why

/[^a\P{Lowercase_Letter}]/ui.test('b') === false

/[^a\P{Lowercase_Letter}]/vi.test('b') === true

?

tjenkinson added the "needs triage" label Jan 1, 2025

github-actions bot added the "Content:JS" label Jan 1, 2025

Josh-Cena added the "accepting PR" label and removed the "needs triage" label Jan 2, 2025

bsmth added this to MDN Issue Board Jan 3, 2025
