Relative tokenizer #14270

lukaszsamson · 2025-02-16T14:19:57Z

This is an attempt at making the tokenizer produce relative tokens. My intention is to enable building incremental parsers better suited to the LSP use case. With relative tokenization and parsing, it may be possible to build a parser that does not need to rebuild the entire AST on each document edit.

Design choices:

In relative mode, a token's line and column represent the difference from the previous token's position
For interpolated binaries, charlists and sigils, the parts are relative to the position of the token itself
Inside an interpolation, positions are relative to the beginning of the interpolation #{
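
To illustrate why relative positions help with incremental work, here is a minimal sketch on bare `{line, column}` pairs. The module `RelativePositionsSketch` and its functions are hypothetical and not part of `:elixir_tokenizer`; the sketch also assumes each delta is a plain component-wise difference from the previous position, which may not match the PR in every case (for example across newlines).

```elixir
defmodule RelativePositionsSketch do
  # Hypothetical helper, not part of :elixir_tokenizer.
  # Assumption: a delta is the plain difference of line and column
  # from the previous position; the first one is relative to `start`.
  def to_relative(positions, start \\ {1, 1}) do
    {deltas, _last} =
      Enum.map_reduce(positions, start, fn {line, col} = pos, {prev_line, prev_col} ->
        {{line - prev_line, col - prev_col}, pos}
      end)

    deltas
  end

  # Count the entries that differ between two equally long lists.
  def changed(old, new) do
    old |> Enum.zip(new) |> Enum.count(fn {a, b} -> a != b end)
  end
end

# Token start positions for `fun(x + 1)` and, after an edit, `fun(xy + 1)`.
before_edit = [{1, 1}, {1, 4}, {1, 5}, {1, 7}, {1, 9}, {1, 10}]
after_edit = [{1, 1}, {1, 4}, {1, 5}, {1, 8}, {1, 10}, {1, 11}]

# Absolute positions: every token after the edited one shifts.
IO.inspect(RelativePositionsSketch.changed(before_edit, after_edit))

# Relative positions: only the delta right after the edited token changes.
IO.inspect(
  RelativePositionsSketch.changed(
    RelativePositionsSketch.to_relative(before_edit),
    RelativePositionsSketch.to_relative(after_edit)
  )
)
```

In this toy case three absolute positions change but only one relative delta does, which is the property an incremental parser could exploit.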

The current state:

Relative mode produces valid relative tokens that, after conversion to absolute via :elixir_tokenizer.to_absolute_tokens, are identical to the ones produced by absolute mode. I verified this over the Elixir source as well as a number of other projects.
All tests pass
Parser and error tests return the same tokens and errors/warnings

Examples:

iex(2)> :elixir_tokenizer.tokenize(~c'fun(x + 1)', 1, 1, [mode: :absolute]) |> elem(4) |> Enum.reverse
[
  {:paren_identifier, {1, 1, ~c"fun"}, :fun},
  {:"(", {1, 4, nil}},
  {:identifier, {1, 5, ~c"x"}, :x},
  {:dual_op, {1, 7, nil}, :+},
  {:int, {1, 9, 1}, ~c"1"},
  {:")", {1, 10, nil}}
]
iex(3)> :elixir_tokenizer.tokenize(~c'fun(x + 1)', 1, 1, [mode: :relative]) |> elem(4) |> Enum.reverse
[
  {:paren_identifier, {0, 0, ~c"fun"}, :fun},
  {:"(", {0, 3, nil}},
  {:identifier, {0, 1, ~c"x"}, :x},
  {:dual_op, {0, 2, nil}, :+},
  {:int, {0, 2, 1}, ~c"1"},
  {:")", {0, 1, nil}}
]
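
For the flat token list above, the conversion back to absolute positions amounts to a running sum. The following is a minimal sketch of that idea only: `ToAbsoluteSketch` is a hypothetical stand-in, not the implementation of `:elixir_tokenizer.to_absolute_tokens`, and it assumes plain differences and no nested interpolation parts.

```elixir
defmodule ToAbsoluteSketch do
  # Hypothetical stand-in for the PR's to_absolute_tokens; the real
  # function may differ. Handles flat token lists only and assumes each
  # delta is a plain difference from the previous token's start.
  def to_absolute(tokens, start \\ {1, 1}) do
    {absolute, _last} =
      Enum.map_reduce(tokens, start, fn token, {prev_line, prev_col} ->
        # The position triple is always the second element of a token.
        {dline, dcol, extra} = elem(token, 1)
        line = prev_line + dline
        col = prev_col + dcol
        {put_elem(token, 1, {line, col, extra}), {line, col}}
      end)

    absolute
  end
end

relative = [
  {:paren_identifier, {0, 0, ~c"fun"}, :fun},
  {:"(", {0, 3, nil}},
  {:identifier, {0, 1, ~c"x"}, :x},
  {:dual_op, {0, 2, nil}, :+},
  {:int, {0, 2, 1}, ~c"1"},
  {:")", {0, 1, nil}}
]

IO.inspect(ToAbsoluteSketch.to_absolute(relative))
```

With the relative tokens from the `iex(3)` call as input, the result equals the absolute-mode list from the `iex(2)` call.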

iex(7)> :elixir_tokenizer.tokenize(~c'"\#{fun(x + 1)}" <> ""', 1, 1, [mode: :absolute]) |> elem(4) |> Enum.reverse
[
  {:bin_string, {1, 1, nil},
   [
     {{1, 2, nil}, {1, 14, nil},
      [
        {:paren_identifier, {1, 4, ~c"fun"}, :fun},
        {:"(", {1, 7, nil}},
        {:identifier, {1, 8, ~c"x"}, :x},
        {:dual_op, {1, 10, nil}, :+},
        {:int, {1, 12, 1}, ~c"1"},
        {:")", {1, 13, nil}}
      ]}
   ]},
  {:concat_op, {1, 17, nil}, :<>},
  {:bin_string, {1, 20, nil}, [""]}
]
iex(8)> :elixir_tokenizer.tokenize(~c'"\#{fun(x + 1)}" <> ""', 1, 1, [mode: :relative]) |> elem(4) |> Enum.reverse
[
  {:bin_string, {0, 0, nil},
   [
     {{0, 1, nil}, {0, 13, nil},
      [
        {:paren_identifier, {0, 3, ~c"fun"}, :fun},
        {:"(", {0, 3, nil}},
        {:identifier, {0, 1, ~c"x"}, :x},
        {:dual_op, {0, 2, nil}, :+},
        {:int, {0, 2, 1}, ~c"1"},
        {:")", {0, 1, nil}}
      ]}
   ]},
  {:concat_op, {0, 16, nil}, :<>},
  {:bin_string, {0, 3, nil}, [""]}
]

josevalim · 2025-02-20T05:30:38Z

I have been thinking about this. Couldn't this be implemented by doing a later pass on the tokens or the AST that computes the differences and relative positions? Or perhaps we include more information in the metadata so it can be done by a later pass?
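
A later pass of this kind over a flat token list could look roughly like the sketch below. `LaterPassSketch` is a hypothetical name; the sketch assumes plain differences from the previous token's start and ignores interpolation parts, so it only covers the simplest case from the PR description.

```elixir
defmodule LaterPassSketch do
  # Hypothetical post-processing pass, not part of the tokenizer.
  # Rewrites absolute {line, column, extra} triples into deltas from the
  # previous token's start. Nested interpolation parts are not handled.
  def to_relative(tokens, start \\ {1, 1}) do
    {relative, _last} =
      Enum.map_reduce(tokens, start, fn token, {prev_line, prev_col} ->
        {line, col, extra} = elem(token, 1)
        {put_elem(token, 1, {line - prev_line, col - prev_col, extra}), {line, col}}
      end)

    relative
  end
end

absolute = [
  {:paren_identifier, {1, 1, ~c"fun"}, :fun},
  {:"(", {1, 4, nil}},
  {:identifier, {1, 5, ~c"x"}, :x},
  {:dual_op, {1, 7, nil}, :+},
  {:int, {1, 9, 1}, ~c"1"},
  {:")", {1, 10, nil}}
]

IO.inspect(LaterPassSketch.to_relative(absolute))
```

Note that such a pass still walks every token, so by itself it does not avoid reprocessing the whole document after an edit.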

lukaszsamson · 2025-02-25T09:18:57Z

I'm not fully convinced this is the right step on the road to incremental parsing. My intention was to explore a direction and see where we can go from there. Metadata in its current form is the problem. What I would like to see is a separation of AST and position/range metadata, so that it would be possible to keep AST nodes in one place and store metadata elsewhere. A weak map would be nice, but I'm not aware of one for OTP.
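
As a rough illustration of that separation, the sketch below moves each node's metadata into a side table keyed by an integer id. `MetaSideTableSketch` is hypothetical and not an existing Elixir or OTP API. It also still leaves an `id` in each node's metadata, so it is only an approximation of a true weak map, whose entries would disappear when the node is garbage collected.

```elixir
defmodule MetaSideTableSketch do
  # Hypothetical sketch: replaces each node's metadata with [id: n] and
  # stores the original metadata in a map keyed by that id.
  def split(ast) do
    {stripped, {table, _next_id}} =
      Macro.prewalk(ast, {%{}, 0}, fn
        {form, meta, args}, {table, id} when is_list(meta) ->
          {{form, [id: id], args}, {Map.put(table, id, meta), id + 1}}

        other, acc ->
          {other, acc}
      end)

    {stripped, table}
  end
end

ast = Code.string_to_quoted!("fun(x + 1)", columns: true)
{stripped, table} = MetaSideTableSketch.split(ast)

# The AST now carries only ids; positions live in the side table.
IO.inspect(stripped)
IO.inspect(table)
```

With this layout, an edit that only shifts positions would update the table and leave the stripped AST untouched. Unlike a weak map, entries here have to be removed by hand when nodes are discarded.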

josevalim · 2025-02-25T09:33:04Z

@lukaszsamson something like this for the tracking? https://github.com/jonatanklosko/elixir_ast_ranges

I honestly don't think a weak map would be that hard to implement in C, as you could send messages when a reference/resource is GCed, so I think we could explore solutions along those lines.

lukaszsamson added 18 commits February 9, 2025 14:15

Add relative mode for tokenizer

ed8a8d1

fix newline handling

16c55f6

Make interpolation tests pass

2e7b64f

Correctly handle offsets in interpolation

7da4f3b

Fix sigil positions

617e1ba

Fix kw_identifier

6c19bd0

Fix match on terminator

4182978

Fix variable name

caf1ae9

Temp fix for crashes

d06715d

Correct error meta

0c8e39b

Fix position in errors

cacf7f1

handle cursor

c28c98c

fix confusables warning

80e17f8

fix invalid match

5be62d7

add helper

377e7c8

handle cursor

c25cfc3

handle relative tokens in fragment

7e2c313

format

14d0e12
