You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
// In the ASCII case, the most significant bit is never set. The leading
// byte of a 2/3/4-byte sequence always has the top two most significant
// bigs set.
(b &0b1100_0000) != 0b1000_0000
}
Specifically, it returns true for bytes in the range 0xf8..=0xff, which is wrong. These bytes are not leading bytes, nor are they trailing bytes. They're bytes which can only appear in sequences of invalid utf-8 -- there's no case where they appear in a stream of valid utf8-encoded text, but obviously bstr doesn't want to rely on this.
This is an internal function so it may not matter, but I have a hard time understanding much of the code in this module (I am not someone who looks at a DFA state table and sees meaning in the numbers), so I am unsure whether or not it can cause problems... It's also totally possible this is handled in some other way, I was just finally getting around to doing #44 and noticed this.
(I did try some trivial tests to try and get weird things to happen without avail, but figured it was better to report it -- when I look at those tables my eyes glaze over, but hopefully the code's author understands it!)
The text was updated successfully, but these errors were encountered:
It turns out that this returns true not just if the byte is a leading
byte, but also if the byte never appears in any valid UTF-8 sequence.
Furthermore, this is OK based on how we're using the function. It's only
used in two places. The first place only ever calls it with valid UTF-8
bytes, since it's backing up in the input in a region that's guaranteed
to be valid UTF-8. The second place (decoding a codepoint in reverse) is
OK treating invalid UTF-8 bytes as leading bytes, since they can never
be a valid prefix anyway. After backing up, the forward decode function
will handle the case correctly.
Since the behavior is OK, we change the name of the function to indicate
its true behavior and add a comment clarifying it.
Fixes#54
is_leading_utf8_byte
seems to assume it's input byte came from a stream of utf-8 encoded textbstr/src/utf8.rs
Lines 797 to 802 in 91edb3f
Specifically, it returns true for bytes in the range 0xf8..=0xff, which is wrong. These bytes are not leading bytes, nor are they trailing bytes. They're bytes which can only appear in sequences of invalid utf-8 -- there's no case where they appear in a stream of valid utf8-encoded text, but obviously bstr doesn't want to rely on this.
This is an internal function so it may not matter, but I have a hard time understanding much of the code in this module (I am not someone who looks at a DFA state table and sees meaning in the numbers), so I am unsure whether or not it can cause problems... It's also totally possible this is handled in some other way, I was just finally getting around to doing #44 and noticed this.
(I did try some trivial tests to try and get weird things to happen without avail, but figured it was better to report it -- when I look at those tables my eyes glaze over, but hopefully the code's author understands it!)
The text was updated successfully, but these errors were encountered: