
JSON example fails to lex unicode escapes like "unicode\u2028escape" #458

Closed
kdy1 opened this issue Jan 2, 2025 · 10 comments · Fixed by #464
Labels
bug Something isn't working question Further information is requested

Comments

@kdy1

kdy1 commented Jan 2, 2025

While working on swc-project/swc#9807, I found that logos fails to lex some string literals. After some debugging, I found that the official example fails to lex Unicode escapes, even though it defines String as

    #[regex(r#""([^"\\]|\\["\\bnfrt]|u[a-fA-F0-9]{4})*""#, |lex| lex.slice())]
    String(&'source str),

test.json:

{
  "use\u2028strict": "use\u2028strict"
}
  • Command: cargo run --example json-borrowed test.json

    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.03s
     Running `target/debug/examples/json-borrowed test.json`
Error: Invalid JSON
   ╭─[test.json:2:11]
   │
 2 │   "use\u2028strict": "use\u2028strict"
   │   ──┬─
   │     ╰─── unexpected token here (context: object)
───╯
The graph
{
    1: ::<skip> (<skip>),
    2: {
        [09-0A] ⇒ 2,
        [0C-0D] ⇒ 2,
          ⇒ 2,
        _ ⇒ 1,
    },
    4: ::Bool (<inline>),
    5: ::Bool (<inline>),
    6: ::BraceOpen,
    7: ::BraceClose,
    8: ::BracketOpen,
    9: ::BracketClose,
    10: ::Colon,
    11: ::Comma,
    12: ::Null,
    13: ::Number (<inline>),
    14: {
        [0-9] ⇒ 14,
        [D9] ⇒ 15,
        [DB] ⇒ 16,
        [DF] ⇒ 17,
        [E0] ⇒ 44,
        [E1] ⇒ 69,
        [EA] ⇒ 82,
        [EF] ⇒ 83,
        [F0] ⇒ 209,
        _ ⇒ 13,
    },
    15: [A0-A9] ⇒ 14,
    16: [B0-B9] ⇒ 14,
    17: [80-89] ⇒ 14,
    21: [A6-AF] ⇒ 14,
    40: [90-99] ⇒ 14,
    44: {
        [A5] ⇒ 21,
        [A7] ⇒ 21,
        [A9] ⇒ 21,
        [AB] ⇒ 21,
        [AD] ⇒ 21,
        [AF] ⇒ 21,
        [B1] ⇒ 21,
        [B3] ⇒ 21,
        [B5] ⇒ 21,
        [B7] ⇒ 21,
        [B9] ⇒ 40,
        [BB] ⇒ 40,
        [BC] ⇒ 15,
    },
    54: [86-8F] ⇒ 14,
    61: {
        [80-89] ⇒ 14,
        [90-99] ⇒ 14,
    },
    69: {
        [81] ⇒ 17,
        [82] ⇒ 40,
        [9F] ⇒ 15,
        [A0] ⇒ 40,
        [A5] ⇒ 54,
        [A7] ⇒ 40,
        [AA] ⇒ 61,
        [AD] ⇒ 40,
        [AE] ⇒ 16,
        [B1] ⇒ 61,
    },
    78: {
        [90-99] ⇒ 14,
        [B0-B9] ⇒ 14,
    },
    81: [AF][B0-B9] ⇒ 14,
    82: {
        [98] ⇒ 15,
        [A3] ⇒ 40,
        [A4] ⇒ 17,
        [A7] ⇒ 78,
        [A9] ⇒ 40,
        [AF] ⇒ 16,
    },
    83: [BC][90-99] ⇒ 14,
    93: {
        [92] ⇒ 15,
        [B4] ⇒ 16,
        [B5] ⇒ 17,
    },
    105: [B6-BF] ⇒ 14,
    135: {
        [80-89] ⇒ 14,
        [90-A3] ⇒ 14,
    },
    165: {
        [81] ⇒ 21,
        [83] ⇒ 16,
        [84] ⇒ 105,
        [87] ⇒ 40,
        [8B] ⇒ 16,
        [91] ⇒ 40,
        [93] ⇒ 40,
        [99] ⇒ 40,
        [9B] ⇒ 135,
        [9C] ⇒ 16,
        [A3] ⇒ 15,
        [A5] ⇒ 40,
        [AF] ⇒ 16,
        [B1] ⇒ 40,
        [B5] ⇒ 40,
        [B6] ⇒ 15,
        [BD] ⇒ 40,
    },
    183: {
        [84] ⇒ 16,
        [A9] ⇒ 15,
        [AB] ⇒ 17,
        [AD] ⇒ 40,
        [B5] ⇒ 16,
    },
    186: [B3][B0-B9] ⇒ 14,
    189: [9F][8E-BF] ⇒ 14,
    204: [B1-BA] ⇒ 14,
    207: {
        [85] ⇒ 17,
        [8B] ⇒ 16,
        [93] ⇒ 16,
        [97] ⇒ 204,
        [A5] ⇒ 40,
    },
    209: {
        [90] ⇒ 93,
        [91] ⇒ 165,
        [96] ⇒ 183,
        [9C] ⇒ 186,
        [9D] ⇒ 189,
        [9E] ⇒ 207,
        [9F] ⇒ 81,
    },
    210: {
        [0-9] ⇒ 14,
        [D9] ⇒ 15,
        [DB] ⇒ 16,
        [DF] ⇒ 17,
        [E0] ⇒ 44,
        [E1] ⇒ 69,
        [EA] ⇒ 82,
        [EF] ⇒ 83,
        [F0] ⇒ 209,
    },
    211: {
        + ⇒ 210,
        - ⇒ 210,
        _ ⇒ 210,
    },
    212: {
        E ⇒ 211,
        e ⇒ 211,
        _ ⇒ 13,
    },
    213: {
        [0-9] ⇒ 213,
        [D9] ⇒ 214,
        [DB] ⇒ 215,
        [DF] ⇒ 216,
        [E0] ⇒ 243,
        [E1] ⇒ 268,
        [EA] ⇒ 281,
        [EF] ⇒ 282,
        [F0] ⇒ 408,
        _ ⇒ 212,
    },
    214: [A0-A9] ⇒ 213,
    215: [B0-B9] ⇒ 213,
    216: [80-89] ⇒ 213,
    220: [A6-AF] ⇒ 213,
    239: [90-99] ⇒ 213,
    243: {
        [A5] ⇒ 220,
        [A7] ⇒ 220,
        [A9] ⇒ 220,
        [AB] ⇒ 220,
        [AD] ⇒ 220,
        [AF] ⇒ 220,
        [B1] ⇒ 220,
        [B3] ⇒ 220,
        [B5] ⇒ 220,
        [B7] ⇒ 220,
        [B9] ⇒ 239,
        [BB] ⇒ 239,
        [BC] ⇒ 214,
    },
    253: [86-8F] ⇒ 213,
    260: {
        [80-89] ⇒ 213,
        [90-99] ⇒ 213,
    },
    268: {
        [81] ⇒ 216,
        [82] ⇒ 239,
        [9F] ⇒ 214,
        [A0] ⇒ 239,
        [A5] ⇒ 253,
        [A7] ⇒ 239,
        [AA] ⇒ 260,
        [AD] ⇒ 239,
        [AE] ⇒ 215,
        [B1] ⇒ 260,
    },
    277: {
        [90-99] ⇒ 213,
        [B0-B9] ⇒ 213,
    },
    280: [AF][B0-B9] ⇒ 213,
    281: {
        [98] ⇒ 214,
        [A3] ⇒ 239,
        [A4] ⇒ 216,
        [A7] ⇒ 277,
        [A9] ⇒ 239,
        [AF] ⇒ 215,
    },
    282: [BC][90-99] ⇒ 213,
    292: {
        [92] ⇒ 214,
        [B4] ⇒ 215,
        [B5] ⇒ 216,
    },
    304: [B6-BF] ⇒ 213,
    334: {
        [80-89] ⇒ 213,
        [90-A3] ⇒ 213,
    },
    364: {
        [81] ⇒ 220,
        [83] ⇒ 215,
        [84] ⇒ 304,
        [87] ⇒ 239,
        [8B] ⇒ 215,
        [91] ⇒ 239,
        [93] ⇒ 239,
        [99] ⇒ 239,
        [9B] ⇒ 334,
        [9C] ⇒ 215,
        [A3] ⇒ 214,
        [A5] ⇒ 239,
        [AF] ⇒ 215,
        [B1] ⇒ 239,
        [B5] ⇒ 239,
        [B6] ⇒ 214,
        [BD] ⇒ 239,
    },
    382: {
        [84] ⇒ 215,
        [A9] ⇒ 214,
        [AB] ⇒ 216,
        [AD] ⇒ 239,
        [B5] ⇒ 215,
    },
    385: [B3][B0-B9] ⇒ 213,
    388: [9F][8E-BF] ⇒ 213,
    403: [B1-BA] ⇒ 213,
    406: {
        [85] ⇒ 216,
        [8B] ⇒ 215,
        [93] ⇒ 215,
        [97] ⇒ 403,
        [A5] ⇒ 239,
    },
    408: {
        [90] ⇒ 292,
        [91] ⇒ 364,
        [96] ⇒ 382,
        [9C] ⇒ 385,
        [9D] ⇒ 388,
        [9E] ⇒ 406,
        [9F] ⇒ 280,
    },
    409: {
        [0-9] ⇒ 213,
        [D9] ⇒ 214,
        [DB] ⇒ 215,
        [DF] ⇒ 216,
        [E0] ⇒ 243,
        [E1] ⇒ 268,
        [EA] ⇒ 281,
        [EF] ⇒ 282,
        [F0] ⇒ 408,
    },
    410: [
        . ⇒ 409,
        _ ⇒ 212,
    ],
    412: {
        [0-9] ⇒ 412,
        [D9] ⇒ 413,
        [DB] ⇒ 414,
        [DF] ⇒ 415,
        [E0] ⇒ 442,
        [E1] ⇒ 467,
        [EA] ⇒ 480,
        [EF] ⇒ 481,
        [F0] ⇒ 607,
        _ ⇒ 410,
    },
    413: [A0-A9] ⇒ 412,
    414: [B0-B9] ⇒ 412,
    415: [80-89] ⇒ 412,
    419: [A6-AF] ⇒ 412,
    438: [90-99] ⇒ 412,
    442: {
        [A5] ⇒ 419,
        [A7] ⇒ 419,
        [A9] ⇒ 419,
        [AB] ⇒ 419,
        [AD] ⇒ 419,
        [AF] ⇒ 419,
        [B1] ⇒ 419,
        [B3] ⇒ 419,
        [B5] ⇒ 419,
        [B7] ⇒ 419,
        [B9] ⇒ 438,
        [BB] ⇒ 438,
        [BC] ⇒ 413,
    },
    452: [86-8F] ⇒ 412,
    459: {
        [80-89] ⇒ 412,
        [90-99] ⇒ 412,
    },
    467: {
        [81] ⇒ 415,
        [82] ⇒ 438,
        [9F] ⇒ 413,
        [A0] ⇒ 438,
        [A5] ⇒ 452,
        [A7] ⇒ 438,
        [AA] ⇒ 459,
        [AD] ⇒ 438,
        [AE] ⇒ 414,
        [B1] ⇒ 459,
    },
    476: {
        [90-99] ⇒ 412,
        [B0-B9] ⇒ 412,
    },
    479: [AF][B0-B9] ⇒ 412,
    480: {
        [98] ⇒ 413,
        [A3] ⇒ 438,
        [A4] ⇒ 415,
        [A7] ⇒ 476,
        [A9] ⇒ 438,
        [AF] ⇒ 414,
    },
    481: [BC][90-99] ⇒ 412,
    491: {
        [92] ⇒ 413,
        [B4] ⇒ 414,
        [B5] ⇒ 415,
    },
    503: [B6-BF] ⇒ 412,
    533: {
        [80-89] ⇒ 412,
        [90-A3] ⇒ 412,
    },
    563: {
        [81] ⇒ 419,
        [83] ⇒ 414,
        [84] ⇒ 503,
        [87] ⇒ 438,
        [8B] ⇒ 414,
        [91] ⇒ 438,
        [93] ⇒ 438,
        [99] ⇒ 438,
        [9B] ⇒ 533,
        [9C] ⇒ 414,
        [A3] ⇒ 413,
        [A5] ⇒ 438,
        [AF] ⇒ 414,
        [B1] ⇒ 438,
        [B5] ⇒ 438,
        [B6] ⇒ 413,
        [BD] ⇒ 438,
    },
    581: {
        [84] ⇒ 414,
        [A9] ⇒ 413,
        [AB] ⇒ 415,
        [AD] ⇒ 438,
        [B5] ⇒ 414,
    },
    584: [B3][B0-B9] ⇒ 412,
    587: [9F][8E-BF] ⇒ 412,
    602: [B1-BA] ⇒ 412,
    605: {
        [85] ⇒ 415,
        [8B] ⇒ 414,
        [93] ⇒ 414,
        [97] ⇒ 602,
        [A5] ⇒ 438,
    },
    607: {
        [90] ⇒ 491,
        [91] ⇒ 563,
        [96] ⇒ 581,
        [9C] ⇒ 584,
        [9D] ⇒ 587,
        [9E] ⇒ 605,
        [9F] ⇒ 479,
    },
    609: {
        0 ⇒ 410,
        [1-9] ⇒ 412,
    },
    611: ::String (<inline>),
    612: " ⇒ 611,
    613: {
        [00-!] ⇒ 613,
        [#-[] ⇒ 613,
        \ ⇒ 615,
        []-t] ⇒ 613,
        u ⇒ 622,
        [v-FF] ⇒ 613,
        _ ⇒ 612,
    },
    615: {
        " ⇒ 613,
        \ ⇒ 613,
        b ⇒ 613,
        f ⇒ 613,
        n ⇒ 613,
        r ⇒ 613,
        t ⇒ 613,
    },
    622: {
        [00-!] ⇒ 613,
        " ⇒ 611,
        [#-/] ⇒ 613,
        [0-9] ⇒ 623,
        [:-@] ⇒ 613,
        [A-F] ⇒ 623,
        [G-[] ⇒ 613,
        \ ⇒ 615,
        []-`] ⇒ 613,
        [a-f] ⇒ 623,
        [g-t] ⇒ 613,
        u ⇒ 622,
        [v-FF] ⇒ 613,
    },
    623: {
        [00-!] ⇒ 613,
        " ⇒ 611,
        [#-/] ⇒ 613,
        [0-9] ⇒ 624,
        [:-@] ⇒ 613,
        [A-F] ⇒ 624,
        [G-[] ⇒ 613,
        \ ⇒ 615,
        []-`] ⇒ 613,
        [a-f] ⇒ 624,
        [g-t] ⇒ 613,
        u ⇒ 622,
        [v-FF] ⇒ 613,
    },
    624: {
        [00-!] ⇒ 613,
        " ⇒ 611,
        [#-/] ⇒ 613,
        [0-9] ⇒ 625,
        [:-@] ⇒ 613,
        [A-F] ⇒ 625,
        [G-[] ⇒ 613,
        \ ⇒ 615,
        []-`] ⇒ 613,
        [a-f] ⇒ 625,
        [g-t] ⇒ 613,
        u ⇒ 622,
        [v-FF] ⇒ 613,
    },
    625: {
        [00-!] ⇒ 613,
        " ⇒ 611,
        [#-[] ⇒ 613,
        \ ⇒ 615,
        []-t] ⇒ 613,
        u ⇒ 622,
        [v-FF] ⇒ 613,
    },
    627: alse ⇒ 4,
    628: rue ⇒ 5,
    629: ull ⇒ 12,
    630: {
        [09-0A] ⇒ 2,
        [0C-0D] ⇒ 2,
          ⇒ 2,
        " ⇒ 613,
        , ⇒ 11,
        - ⇒ 609,
        0 ⇒ 410,
        [1-9] ⇒ 412,
        : ⇒ 10,
        [ ⇒ 8,
        ] ⇒ 9,
        f ⇒ 627,
        n ⇒ 629,
        t ⇒ 628,
        { ⇒ 6,
        } ⇒ 7,
    },
}
  • Note: cat test.json | jq works without any issue

Is there a working syntax for escapes in string literals?

@kdy1 kdy1 changed the title from 'Borrowed json fails to lex unicode escapes like "unicode\u2028escape"' to 'JSON example fails to lex unicode escapes like "unicode\u2028escape"' Jan 2, 2025
@jeertmans
Collaborator

Hi @kdy1, thanks for reporting this bug!

Indeed, the regex used in the example is not perfect, and was observed to fail in some rare cases, which I excluded for simplicity. I am not sure whether it is possible to express something that matches all possible JSON strings as a regex, since such strings may include fairly complex patterns, like escapes. If you are aware of such a regex, please let me know. Otherwise, using a callback is probably the right way.
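
As a rough illustration of the callback route mentioned above, here is a minimal hand-written scanner using only the standard library. It assumes the string grammar from the JSON specification (escapes `\"`, `\\`, `\/`, `\b`, `\f`, `\n`, `\r`, `\t`, `\uXXXX`, and no raw control characters). The name `json_string_len` is hypothetical, and wiring it into a logos callback is deliberately left out, so treat this as a sketch rather than the example's actual code.

```rust
/// Returns the byte length of the JSON string literal that starts at the
/// beginning of `input` (both quotes included), or `None` if the literal
/// is malformed or unterminated. `json_string_len` is a hypothetical name.
fn json_string_len(input: &str) -> Option<usize> {
    let bytes = input.as_bytes();
    // A string literal must start with a double quote.
    if bytes.first() != Some(&b'"') {
        return None;
    }
    let mut i = 1;
    while i < bytes.len() {
        match bytes[i] {
            // Closing quote: the literal ends here.
            b'"' => return Some(i + 1),
            // Escape sequence: inspect the byte after the backslash.
            b'\\' => match *bytes.get(i + 1)? {
                b'"' | b'\\' | b'/' | b'b' | b'f' | b'n' | b'r' | b't' => i += 2,
                b'u' => {
                    // \u must be followed by exactly four hex digits.
                    let hex = bytes.get(i + 2..i + 6)?;
                    if !hex.iter().all(u8::is_ascii_hexdigit) {
                        return None;
                    }
                    i += 6;
                }
                _ => return None,
            },
            // Raw control characters are not allowed inside a JSON string.
            0x00..=0x1F => return None,
            // Any other byte (including UTF-8 continuation bytes) is accepted.
            _ => i += 1,
        }
    }
    // Ran out of input before finding the closing quote.
    None
}
```

For the key from test.json, `json_string_len(r#""use\u2028strict": "x""#)` returns `Some(17)`, the length of `"use\u2028strict"` including both quotes.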

@jeertmans jeertmans added bug Something isn't working question Further information is requested labels Jan 2, 2025
@kdy1
Author

kdy1 commented Jan 2, 2025

I worked around it by creating another logos lexer and using it within the callback.

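use logos::Logos;

// Tokens for the contents of a string literal, lexed from within the outer lexer's callback.
#[derive(Logos, Debug, Clone, Copy, PartialEq, Eq)]
enum StrContent {
    // Known escape forms get a high priority so they win over the catch-all rule below.
    #[regex(r#"\\["'\\bfnrtv]"#, priority = 100)]
    #[regex(r#"\\0[0-7]*"#, priority = 100)]
    #[regex(r#"\\x[0-9a-fA-F]{2}"#, priority = 100)]
    #[regex(r#"\\u[0-9a-fA-F]{4}"#, priority = 100)]
    // Catch-all: a backslash followed by a run of characters that are neither quotes nor backslashes.
    #[regex(r#"\\[^'"\\]+"#)]
    Escape,

    // A run of ordinary characters.
    #[regex(r#"[^'"\\]+"#)]
    Normal,

    #[regex(r#"'"#)]
    SingleQuote,

    #[regex(r#"""#)]
    DoubleQuote,
}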

@jeertmans
Collaborator

Nice! Do you have a complete example to share?

@kdy1
Author

kdy1 commented Jan 2, 2025

I have one, but as it's still on the fork branch and still WIP, I'll post a comment with a link to the main branch after finishing the PR.

@jeertmans
Collaborator

Thanks!

@pamburus
Contributor

The problem in the example seems to be the lack of a regex group for a single escape sequence: without the group, the leading \\ belongs only to the first escape alternative, so u[a-fA-F0-9]{4} is tried without a preceding backslash.
A regex like this works: r#""([^"\\\x00-\x1F]|\\(["\\bnfrt/]|u[a-fA-F0-9]{4}))*""#.
It is important to make a group here, (["\\bnfrt/]|u[a-fA-F0-9]{4}), so that | can properly select between the alternatives.
Another approach is to repeat \\ before u. It also works.

By the way, pay attention to the escape sequence \/. It is also valid according to the JSON specification, but it is missing from the example.
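
To make the grouping point concrete, the sketch below hand-expands one iteration of the `(...)*` loop of each regex into a plain Rust function, using only the standard library. It does not use logos or any regex engine; `step_original`, `step_grouped`, and `matches` are hypothetical names, and the functions work on bytes, which should agree with the regexes for valid UTF-8 input.

```rust
/// One iteration of the original loop: [^"\\] | \\["\\bnfrt] | u[a-fA-F0-9]{4}
/// Returns the number of bytes consumed, or None if no alternative applies.
fn step_original(rest: &[u8]) -> Option<usize> {
    match rest {
        // Alternative 1: any byte except a quote or a backslash.
        [b, ..] if *b != b'"' && *b != b'\\' => Some(1),
        // Alternative 2: a backslash followed by one of "\bnfrt (no `u` here).
        [b'\\', e, ..] if b"\"\\bnfrt".contains(e) => Some(2),
        // Alternative 3 is a bare `u` plus four hex digits, with no backslash.
        // Alternative 1 already accepts those bytes, so it adds nothing.
        _ => None,
    }
}

/// One iteration of the grouped loop:
/// [^"\\\x00-\x1F] | \\(["\\bnfrt/] | u[a-fA-F0-9]{4})
fn step_grouped(rest: &[u8]) -> Option<usize> {
    match rest {
        // Any byte except a quote, a backslash, or a control character.
        [b, ..] if *b != b'"' && *b != b'\\' && *b > 0x1F => Some(1),
        // Backslash, then the first alternative inside the group.
        [b'\\', e, ..] if b"\"\\bnfrt/".contains(e) => Some(2),
        // Backslash, then the second alternative inside the group.
        [b'\\', b'u', h @ ..]
            if h.len() >= 4 && h[..4].iter().all(u8::is_ascii_hexdigit) =>
        {
            Some(6)
        }
        _ => None,
    }
}

/// Checks a whole literal: opening quote, repeated `step`, closing quote.
fn matches(literal: &str, step: fn(&[u8]) -> Option<usize>) -> bool {
    let bytes = literal.as_bytes();
    if bytes.len() < 2 || bytes[0] != b'"' || bytes[bytes.len() - 1] != b'"' {
        return false;
    }
    let mut body = &bytes[1..bytes.len() - 1];
    while !body.is_empty() {
        match step(body) {
            Some(n) => body = &body[n..],
            None => return false,
        }
    }
    true
}
```

With the key from test.json, `matches(r#""use\u2028strict""#, step_original)` is false because the backslash is followed by `u`, while the same call with `step_grouped` is true.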

@jeertmans
Collaborator

@pamburus The JSON example String regex does not cover all possible cases, I know, but I am open to better regexes, if any.

@pamburus
Contributor

pamburus commented Jan 18, 2025

@jeertmans
According to the spec, it seems to me that the regex I provided above is exhaustive.
Here it is: r#""([^"\\\x00-\x1F]|\\(["\\bnfrt/]|u[a-fA-F0-9]{4}))*""#.
It fixes the issue with Unicode escape sequences, covers the \/ sequence, and forbids control characters (0x00..0x1F). Everything else seems to be fine already.

@jeertmans
Collaborator

Nice @pamburus! If I remember correctly, my example failed to parse this: https://github.com/json-iterator/test-data/blob/master/large-file.json.

If yours succeeds (or passes other examples that mine fails), please create a PR to change the regex ;-)

@pamburus
Contributor

pamburus commented Jan 19, 2025

@jeertmans
No, the example successfully parses that file. In fact, it does not contain any \u Unicode escape sequences, and those were the main issue with the original regular expression. The escape sequence \/ is almost never used because it makes no sense; it is just present in the spec for some reason that I cannot understand. And the characters \x00-\x1F are never present inside a string in valid JSON, so this should just protect against unintended successful parsing of invalid JSON files.
