Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

is_leading_utf8_byte returns wrong answer for 0xf8..=0xff (utf-8 invalid bytes are not leading bytes) #54

Closed
thomcc opened this issue May 9, 2020 · 0 comments

Comments

@thomcc
Copy link
Contributor

thomcc commented May 9, 2020

is_leading_utf8_byte seems to assume it's input byte came from a stream of utf-8 encoded text

bstr/src/utf8.rs

Lines 797 to 802 in 91edb3f

fn is_leading_utf8_byte(b: u8) -> bool {
// In the ASCII case, the most significant bit is never set. The leading
// byte of a 2/3/4-byte sequence always has the top two most significant
// bigs set.
(b & 0b1100_0000) != 0b1000_0000
}

Specifically, it returns true for bytes in the range 0xf8..=0xff, which is wrong. These bytes are not leading bytes, nor are they trailing bytes. They're bytes which can only appear in sequences of invalid utf-8 -- there's no case where they appear in a stream of valid utf8-encoded text, but obviously bstr doesn't want to rely on this.

This is an internal function so it may not matter, but I have a hard time understanding much of the code in this module (I am not someone who looks at a DFA state table and sees meaning in the numbers), so I am unsure whether or not it can cause problems... It's also totally possible this is handled in some other way, I was just finally getting around to doing #44 and noticed this.

(I did try some trivial tests to try and get weird things to happen without avail, but figured it was better to report it -- when I look at those tables my eyes glaze over, but hopefully the code's author understands it!)

BurntSushi added a commit that referenced this issue May 10, 2020
It turns out that this returns true not just if the byte is a leading
byte, but also if the byte never appears in any valid UTF-8 sequence.
Furthermore, this is OK based on how we're using the function. It's only
used in two places. The first place only ever calls it with valid UTF-8
bytes, since it's backing up in the input in a region that's guaranteed
to be valid UTF-8. The second place (decoding a codepoint in reverse) is
OK treating invalid UTF-8 bytes as leading bytes, since they can never
be a valid prefix anyway. After backing up, the forward decode function
will handle the case correctly.

Since the behavior is OK, we change the name of the function to indicate
its true behavior and add a comment clarifying it.

Fixes #54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant