Lexer: Properly support Unicode 15.1.0 #4
Merged
The previous lexer implementation in `Language.Rust.Parser.Lexer` was broken for Unicode characters with sufficiently large codepoints, as the previous implementation incorrectly attempted to port UTF-16–encoded codepoints over to `alex`, which is UTF-8–encoded. Rather than try to fix the previous implementation (which was based on old `rustc` code that is no longer used), this ports the lexer to a new implementation that is based on the Rust `unicode-xid` crate (which is how modern versions of `rustc` lex Unicode characters). Specifically:

- This adapts `unicode-xid`'s lexer generation script to generate an `alex`-based lexer instead of a Rust-based one.
- The new lexer is generated to support codepoints from Unicode 15.1.0. (It is unclear which exact Unicode version the previous lexer targeted, but given that it was last updated in 2016, it was likely quite an old version.)
- I have verified that the new lexer can lex exotic Unicode characters such as `𝑂` and `𐌝` by adding them as regression tests.

Fixes #3.
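To illustrate the encoding mismatch described above, here is a minimal Haskell sketch of how a codepoint above U+FFFF splits into different code units under UTF-8 and UTF-16. It is not code from this PR or from `alex`; the names `utf8Units`, `utf16Units`, and `hexUnits` are hypothetical helpers written only for this illustration, and the sketch assumes the input is a valid Unicode scalar value.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word16, Word8)
import Numeric (showHex)

-- | UTF-8 code units (bytes) for one character.
-- Hypothetical helper; assumes a valid Unicode scalar value.
utf8Units :: Char -> [Word8]
utf8Units c
  | cp < 0x80    = [fromIntegral cp]
  | cp < 0x800   = [lead 0xC0 6, cont 0]
  | cp < 0x10000 = [lead 0xE0 12, cont 6, cont 0]
  | otherwise    = [lead 0xF0 18, cont 12, cont 6, cont 0]
  where
    cp = ord c
    -- Leading byte: length prefix plus the top bits of the codepoint.
    lead prefix s = fromIntegral (prefix .|. (cp `shiftR` s))
    -- Continuation byte: 10xxxxxx carrying six bits of the codepoint.
    cont s = fromIntegral (0x80 .|. ((cp `shiftR` s) .&. 0x3F))

-- | UTF-16 code units for one character.
-- Codepoints above U+FFFF become a surrogate pair.
utf16Units :: Char -> [Word16]
utf16Units c
  | cp < 0x10000 = [fromIntegral cp]
  | otherwise    = [ fromIntegral (0xD800 + (v `shiftR` 10))
                   , fromIntegral (0xDC00 + (v .&. 0x3FF)) ]
  where
    cp = ord c
    v  = cp - 0x10000

-- | Render code units as space-separated lowercase hex.
hexUnits :: (Integral a, Show a) => [a] -> String
hexUnits = unwords . map (\u -> showHex u "")

main :: IO ()
main = do
  let c = '\x1D442'               -- the character 𝑂 from the regression tests
  putStrLn (hexUnits (utf8Units c))   -- prints "f0 9d 91 82"
  putStrLn (hexUnits (utf16Units c))  -- prints "d835 dc42"
```

The two encodings share no code units for such characters, so ranges expressed in terms of UTF-16 surrogate pairs cannot be carried over unchanged to a lexer that consumes UTF-8 bytes. Characters below U+0080 encode to the same single value in both schemes, which is consistent with the breakage only showing up for sufficiently large codepoints.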