Lexer: Properly support Unicode 15.1.0 #4

Merged: 2 commits, Sep 4, 2024

RyanGlScott

The previous lexer implementation in `Language.Rust.Parser.Lexer` was broken for Unicode characters with sufficiently large codepoints, as the previous implementation incorrectly attempted to port UTF-16–encoded codepoints over to `alex`, which is UTF-8–encoded. Rather than try to fix the previous implementation (which was based on old `rustc` code that is no longer used), this ports the lexer to a new implementation that is based on the Rust `unicode-xid` crate (which is how modern versions of `rustc` lex Unicode characters). Specifically:

  • This adapts `unicode-xid`'s lexer generation script to generate an `alex`-based lexer instead of a Rust-based one.

  • The new lexer is generated to support codepoints from Unicode 15.1.0. (It is unclear which exact Unicode version the previous lexer targeted, but given that it was last updated in 2016, it was likely quite an old version.)

  • I have verified that the new lexer can lex exotic Unicode characters such as `𝑂` and `𐌝` by adding them as regression tests.

Fixes #3.
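
The encoding mismatch described above can be illustrated with a small standard-library Rust sketch. This is not code from the PR: the function names are hypothetical, and the example assumes that `𝑂` is the codepoint U+1D442. It only shows that a character outside the Basic Multilingual Plane has different code units in UTF-16 than it has bytes in UTF-8, which is why a table written in terms of one encoding cannot be reused as-is for the other.

```rust
/// UTF-8 bytes of a single character, i.e. what a byte-oriented lexer sees.
/// (Hypothetical helper, for illustration only.)
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// UTF-16 code units of a single character. Characters above U+FFFF
/// are represented as a surrogate pair (two units).
fn utf16_units(c: char) -> Vec<u16> {
    let mut buf = [0u16; 2];
    c.encode_utf16(&mut buf).to_vec()
}

/// True when the character lies outside the Basic Multilingual Plane,
/// which is where the two encodings diverge most visibly.
fn is_supplementary(c: char) -> bool {
    (c as u32) > 0xFFFF
}
```

Under the assumption above, `utf8_bytes('\u{1D442}')` yields the four bytes `F0 9D 91 82`, while `utf16_units('\u{1D442}')` yields the surrogate pair `D835 DC42`. A lexer that consumes UTF-8 input never encounters the surrogate values, so rules expressed in terms of them would not match.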

RyanGlScott requested a review from yav on August 28, 2024 20:04
RyanGlScott self-assigned this on Aug 28, 2024
RyanGlScott force-pushed the T3-fix-unicode-lexing branch from 34ed94d to 86a6540 on September 4, 2024 19:29
RyanGlScott (author) left a comment:

@yav and I reviewed this synchronously, and @yav indicated that he is happy with this direction of travel.

RyanGlScott merged commit 74a05b7 into master on Sep 4, 2024 (3 checks passed)
RyanGlScott deleted the T3-fix-unicode-lexing branch on September 4, 2024 19:34
Linked issue: language-rust lexer rejects Unicode symbols that rustc accepts