perf(es/lexer): Use logos lexer as a sub-lexer #9807

Draft
wants to merge 201 commits into
base: main
from

Conversation

kdy1
Member

@kdy1 kdy1 commented Dec 19, 2024

Description:

  • Define RawToken.

    • It should be declared in a standalone crate because it requires heavy code generation. logos generates a deterministic finite state machine that consists of lookup tables and jump tables.
    • In the future, RawToken will be renamed to Token and used directly by the parser. I still need to investigate the API much more.
    • For performance, RawToken should be single-byte-sized.
  • Adjust lexer to work on RawToken instead of char.

    • logos takes &str and generates Iterator<Item = RawToken>.
    • Some ECMAScript code cannot be lexed with an FSM alone. logos provides a callback API, and that is how we should handle ambiguous tokens.
    • The regex syntax of logos is more limited than that of regex. In other words, even if a regex is valid, the logos lexer may not generate a matching RawToken. This is for performance, and it's documented here.
  • Wrapper: logos::Lexer => RawLexer => Lexer => Parser

    • The callback API of logos is not enough for lexing ECMAScript. We have to wrap the logos lexer with our own lexer and call the logos lexer only when it is valid to do so. That is the case most of the time, but currently we handle regex and template literals using a separate method.
    • I introduced RawLexer as a sort of buffer (based on peek_nth from itertools), but I found that it's a good place for various lexing methods, so I added read_regexp. read_regexp uses another logos token definition, so it should live in the swc_ecma_raw_lexer crate.
  • Fix tests

    • Currently, some spans are wrong.
    • Currently, processed values in the AST are simply filled with the raw value from the lexer. These should have the correct values. For example, Str has the value of \\\\ and the raw value of \\\\ for input string \\\\. But the value field should be \\ instead.
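
To illustrate the value/raw distinction in the last bullet, here is a minimal sketch of turning the raw body of a string literal into its processed value. The helper name `cook_str` is hypothetical and not part of swc. It only handles a few simple escapes; hex, unicode, octal escapes and line continuations are left out, so it is not a substitute for the real escape handling.

```rust
/// Hypothetical helper (not swc API): converts the raw body of a string
/// literal (without the surrounding quotes) into its processed value.
/// Handles `\n` and `\t`; for any other escaped character (including `\\`,
/// `\'` and `\"`) the character after the backslash is kept as-is.
fn cook_str(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next() {
            Some('n') => out.push('\n'),
            Some('t') => out.push('\t'),
            Some(other) => out.push(other),
            // A lone trailing backslash is kept unchanged in this sketch.
            None => out.push('\\'),
        }
    }
    out
}
```

With this sketch, a raw body of four backslashes cooks down to two, which is the relationship the `value` and `raw` fields of Str are expected to have.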

Related issue (if exists):

@kdy1 kdy1 self-assigned this Dec 19, 2024

changeset-bot bot commented Dec 19, 2024

⚠️ No Changeset found

Latest commit: c862038

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@kdy1 kdy1 changed the title perf(es/lexer): Use logos for Text => RawToken => Token phase perf(es/lexer): Use logos for sub-lexer Jan 2, 2025
@kdy1 kdy1 changed the title perf(es/lexer): Use logos for sub-lexer perf(es/lexer): Use logos lexer as a sub-lexer Jan 2, 2025
@kdy1 kdy1 added this to the Planned milestone Jan 2, 2025
@kdy1 kdy1 removed their assignment Jan 7, 2025
@@ -372,7 +353,7 @@ impl Iterator for Lexer<'_> {
}

self.state.update(start, token.kind());
-self.state.prev_hi = self.last_pos();
+self.state.prev_hi = self.input.cur_pos();

@GiveMe-A-Name GiveMe-A-Name Jan 8, 2025

We do not need to maintain cur_pos in parser.lexer. How about getting the span from the logos lexer instead? That would reduce the complexity of ecma.parser.lexer.

Member Author

I agree that it can be a better design. You would only need to add start_pos to those.
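
A rough sketch of the idea discussed in this thread: the sub-lexer reports byte ranges relative to the slice it was given (as `logos::Lexer::span()` does), and the wrapping lexer adds `start_pos` to get absolute positions. Everything here is a std-only stand-in written for illustration; `SubLexer`, `Span` and `next_with_span` are hypothetical names and the whitespace tokenizer merely replaces the generated logos lexer.

```rust
use std::ops::Range;

/// Stand-in for a sub-lexer such as `logos::Lexer`: splits on whitespace and
/// remembers the byte range of the last token, relative to `src`.
struct SubLexer<'a> {
    src: &'a str,
    pos: usize,
    last: Range<usize>,
}

impl<'a> SubLexer<'a> {
    fn new(src: &'a str) -> Self {
        Self { src, pos: 0, last: 0..0 }
    }

    fn next_token(&mut self) -> Option<&'a str> {
        let rest = &self.src[self.pos..];
        // Skip leading whitespace, counting bytes rather than chars.
        let start = self.pos + (rest.len() - rest.trim_start().len());
        let rest = &self.src[start..];
        if rest.is_empty() {
            return None;
        }
        let len = rest.find(char::is_whitespace).unwrap_or(rest.len());
        self.last = start..start + len;
        self.pos = start + len;
        Some(&self.src[self.last.clone()])
    }

    /// Byte range of the most recent token, relative to `src`.
    fn span(&self) -> Range<usize> {
        self.last.clone()
    }
}

/// Absolute byte positions within the whole source map (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Span {
    lo: u32,
    hi: u32,
}

/// Outer lexer: keeps no running cursor, only the offset of the slice start.
struct Lexer<'a> {
    sub: SubLexer<'a>,
    start_pos: u32,
}

impl<'a> Lexer<'a> {
    fn new(src: &'a str, start_pos: u32) -> Self {
        Self { sub: SubLexer::new(src), start_pos }
    }

    fn next_with_span(&mut self) -> Option<(&'a str, Span)> {
        let tok = self.sub.next_token()?;
        let rel = self.sub.span();
        let span = Span {
            lo: self.start_pos + rel.start as u32,
            hi: self.start_pos + rel.end as u32,
        };
        Some((tok, span))
    }
}
```

The outer lexer in this sketch never tracks a current position of its own; the only state it carries for spans is the fixed `start_pos` offset.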


#[derive(Logos, Debug, Clone, Copy, PartialEq, Eq)]
#[logos(error = LogosError)]
pub enum JsxToken {}

@GiveMe-A-Name GiveMe-A-Name Jan 20, 2025

< in JSX may be either BinOpToken::Lt or JSXTagStart, and it is hard for the logos lexer to express two token kinds with a single kind enum. So I think we should not generate JSX tokens with the logos lexer.
We can think of the logos lexer as a basic lexer that generates basic token kinds such as LtAngle (<), RtAngle (>), and so on.
The Lexer then uses logos together with its own state to generate more specific Tokens.
Do you think that is a better way?

Member Author

Yeap, that's the way I was thinking of.
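
A minimal sketch of the two-layer approach agreed on here, under simplifying assumptions: the basic lexer only emits context-free kinds, and the outer lexer uses its state to decide whether `<` is a comparison or the start of a JSX tag. The enums, the `State` struct and the rule "`<` after an operand is `Lt`, otherwise a tag start" are all hypothetical and far cruder than what a real ECMAScript lexer needs.

```rust
/// Context-free kinds, as a basic (logos-style) lexer could emit them.
/// A fieldless enum like this occupies a single byte.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RawToken {
    LtAngle,
    RtAngle,
    Ident,
    Num,
}

/// Context-dependent kinds produced by the outer lexer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Token {
    Lt,
    Gt,
    JsxTagStart,
    JsxTagEnd,
    Ident,
    Num,
}

/// Hypothetical outer-lexer state: the previous token and whether we are
/// currently inside a JSX opening tag.
#[derive(Default)]
struct State {
    prev: Option<Token>,
    in_jsx_tag: bool,
}

/// Refines one raw token using the state, then records it as `prev`.
fn refine(raw: RawToken, state: &mut State) -> Token {
    let tok = match raw {
        RawToken::LtAngle => {
            // Simplified rule: `<` directly after an operand is a comparison.
            if matches!(state.prev, Some(Token::Ident | Token::Num)) {
                Token::Lt
            } else {
                state.in_jsx_tag = true;
                Token::JsxTagStart
            }
        }
        RawToken::RtAngle => {
            if state.in_jsx_tag {
                state.in_jsx_tag = false;
                Token::JsxTagEnd
            } else {
                Token::Gt
            }
        }
        RawToken::Ident => Token::Ident,
        RawToken::Num => Token::Num,
    };
    state.prev = Some(tok);
    tok
}

/// Convenience wrapper: refines a whole raw token stream from a fresh state.
fn refine_all(raws: &[RawToken]) -> Vec<Token> {
    let mut state = State::default();
    raws.iter().map(|&raw| refine(raw, &mut state)).collect()
}
```

In this sketch the raw stream for `a < b` refines to `Ident, Lt, Ident`, while the same `LtAngle` kind at the start of `<div>` becomes `JsxTagStart`, so the basic lexer never has to choose between the two.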

2 participants