Lexer: Properly support Unicode 15.1.0 #4

RyanGlScott · 2024-08-28T20:04:53Z

The previous lexer implementation in `Language.Rust.Parser.Lexer` was broken for Unicode characters with sufficiently large codepoints, as the previous implementation incorrectly attempted to port UTF-16–encoded codepoints over to `alex`, which is UTF-8–encoded. Rather than try to fix the previous implementation (which was based on old `rustc` code that is no longer used), this ports the lexer to a new implementation that is based on the Rust `unicode-xid` crate (which is how modern versions of `rustc` lex Unicode characters). Specifically:

* This adapts `unicode-xid`'s lexer generation script to generate an `alex`-based lexer instead of a Rust-based one.
* The new lexer is generated to support codepoints from Unicode 15.1.0. (It is unclear which exact Unicode version the previous lexer targeted, but given that it was last updated in 2016, it was likely quite an old version.)
* I have verified that the new lexer can lex exotic Unicode characters such as `𝑂` and `𐌝` by adding them as regression tests.
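To illustrate why codepoints above U+FFFF are the ones affected, here is a minimal Rust sketch using only the standard library. It is not code from this PR: the helper names `encoded_lengths`, `utf16_units`, and `utf8_bytes` are hypothetical and exist only for illustration. It shows that a character such as `𝑂` (assumed here to be U+1D442) becomes a surrogate pair in UTF-16 but a four-byte sequence in UTF-8, so tables written in terms of UTF-16 code units do not carry over directly to a lexer that consumes UTF-8.

```rust
/// Returns (UTF-16 code units, UTF-8 bytes) needed to encode `c`.
/// Hypothetical helper for illustration only; not part of the PR.
pub fn encoded_lengths(c: char) -> (usize, usize) {
    (c.len_utf16(), c.len_utf8())
}

/// The UTF-16 code units for `c`.
/// Characters above U+FFFF produce a two-element surrogate pair.
pub fn utf16_units(c: char) -> Vec<u16> {
    let mut buf = [0u16; 2];
    c.encode_utf16(&mut buf).to_vec()
}

/// The UTF-8 bytes for `c`, i.e. the byte sequence a UTF-8-based
/// lexer would actually see for this character.
pub fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}
```

For example, `utf16_units('\u{1D442}')` yields the surrogate pair `[0xD835, 0xDC42]`, whereas `utf8_bytes('\u{1D442}')` yields `[0xF0, 0x9D, 0x91, 0x82]`. Neither surrogate value appears anywhere in the UTF-8 form.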

Fixes #3.


RyanGlScott

@yav and I reviewed this synchronously, and @yav indicated that he is happy with this direction of travel.

RyanGlScott requested a review from yav August 28, 2024 20:04

RyanGlScott self-assigned this Aug 28, 2024

RyanGlScott added 2 commits September 4, 2024 15:29

Whitespace only

fd184b1

RyanGlScott force-pushed the T3-fix-unicode-lexing branch from 34ed94d to 86a6540 Compare September 4, 2024 19:29

RyanGlScott commented Sep 4, 2024


RyanGlScott merged commit 74a05b7 into master Sep 4, 2024
3 checks passed

RyanGlScott deleted the T3-fix-unicode-lexing branch September 4, 2024 19:34
