fix(marchaler): escape runes outside the multilingual plane #937

JanDeDobbeleer · 2024-03-12T08:33:57Z

I recently switched to your amazing toml library and noticed it has an issue quoting strings as literal which ruines the rendering in oh-my-posh when exporting a configuration (see JanDeDobbeleer/oh-my-posh#4750).

This PR ensure we only quote strings as literal when it only contains characters inside the multilingual plane and/or emoji.

There was a V1 test that, rightfully so, fails which I also adjusted for correctness.

Paste benchstat results here (running ATM, takes a loooong time). Will post when it's done.

pelletier · 2024-03-12T12:56:08Z

Thank you for your contribution!

I'm a little confused about the need to marshal strings with those characters into TOML basic strings instead of literal strings.

Looking at the spec, literal strings can contain the same characters as basic strings, except those that should be quoted:

Any Unicode character may be used except those that must be escaped: quotation mark, backslash, and the control characters other than tab (U+0000 to U+0008, U+000A to U+001F, U+007F).

So any other Unicode point, as long as it is correctly encoded in UTF-8, should be allowed in literal strings.

Do you mind elaborating on why this is a bug? I'm sorry if I've missed something.

JanDeDobbeleer · 2024-03-12T13:07:24Z

@pelletier it is allowed, but when you unmarchal these, also in go they will be string literals. And there we get into issues as this is in our case then also printed literally instead of interpreted. You can see the effect in the linked issue. They aren't literal strings in the object, yet always exported as such.

In our case, this struct:

{
    PowerlineSymbol: "\ue0b0"
}

Is exported as:

[[blocks.segments]]
    powerline_symbol = '\ue0b0'

while it should be:

[[blocks.segments]]
    powerline_symbol = "\ue0b0"

I'd argue there's a difference between the above and:

{
    PowerlineSymbol: `\ue0b0`
}

which is an actual string literal.

pelletier · 2024-03-12T13:50:34Z

The literal "\ue0b0" in Go represents a string with one code point, which should be marshaled in its UTF-8 encoding as '' by go-toml. That seems to be the case https://go.dev/play/p/oY0zaVTb9Tl

For go-toml to emit the TOML string "\ue0b0", the Go string needs to contain those exact characters. For example, any of those should work:

`\ue0b0`
"\\ue0b0"

I'm not familiar with oh-my-posh, but took at stab at looking that the config export code in that repo. This seems to be the relevant snippet:

https://github.com/JanDeDobbeleer/oh-my-posh/blob/3d7be9250c7b84a173fbf503670167570a47e45d/src/engine/config.go#L170-L211

Looks like all three export formats follow a pattern. First export using a format-specific library, then all apply the same post-processing routine escapeGlyphs:

https://github.com/JanDeDobbeleer/oh-my-posh/blob/3d7be9250c7b84a173fbf503670167570a47e45d/src/engine/config.go#L265-L333

From what I gather, this function goes over all runes in the output, and transforms some unicode points to escape them in a \u... notation.

This process probably only works in JSON, where strings are always represented in double quotes, and double quotes accept the \u... notation. In the case of TOML (and I suspect YAML, as the original poster reports a problem with it too), it would only work if the emitted string happens to be double quoted to accept the \u... notation as a way to express code points.

Assuming this analysis is correct, I don't think there is an issue with go-toml here. Maybe move the the glyph escaping to a pre-processing step, before calling the encoders? That way it would avoid needing to be format-specific, and work for all three formats?

JanDeDobbeleer · 2024-03-12T14:04:20Z

@pelletier I can validate that indeed. Allow me to check and report back. I didn't think about that anymore as that part was somewhere in the past and did miss that during my analysis.

JanDeDobbeleer · 2024-03-12T14:20:33Z

@pelletier confirmed, and sorry for missing that part completely. If I need to get a string representation of these code points, I'll figure that out later. For now that's not as important.

pelletier · 2024-03-12T14:24:54Z

No worries, glad you figured out a solution to your issue! 🙌🏻

fix(marchaler): escape runes outside the multilingual plane

7e3c9c2

JanDeDobbeleer closed this Mar 12, 2024

JanDeDobbeleer deleted the fix-quotes branch March 12, 2024 14:20

JanDeDobbeleer restored the fix-quotes branch March 12, 2024 14:20

JanDeDobbeleer deleted the fix-quotes branch March 12, 2024 14:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(marchaler): escape runes outside the multilingual plane #937

fix(marchaler): escape runes outside the multilingual plane #937

JanDeDobbeleer commented Mar 12, 2024

pelletier commented Mar 12, 2024

JanDeDobbeleer commented Mar 12, 2024

pelletier commented Mar 12, 2024 •

edited

Loading

JanDeDobbeleer commented Mar 12, 2024 •

edited

Loading

JanDeDobbeleer commented Mar 12, 2024

pelletier commented Mar 12, 2024

fix(marchaler): escape runes outside the multilingual plane #937

fix(marchaler): escape runes outside the multilingual plane #937

Conversation

JanDeDobbeleer commented Mar 12, 2024

pelletier commented Mar 12, 2024

JanDeDobbeleer commented Mar 12, 2024

pelletier commented Mar 12, 2024 • edited Loading

JanDeDobbeleer commented Mar 12, 2024 • edited Loading

JanDeDobbeleer commented Mar 12, 2024

pelletier commented Mar 12, 2024

pelletier commented Mar 12, 2024 •

edited

Loading

JanDeDobbeleer commented Mar 12, 2024 •

edited

Loading