`is_leading_utf8_byte` returns wrong answer for 0xf8..=0xff (utf-8 invalid bytes are not leading bytes) #54

thomcc · 2020-05-09T21:34:28Z

is_leading_utf8_byte seems to assume it's input byte came from a stream of utf-8 encoded text

Lines 797 to 802 in 91edb3f

    
           fn is_leading_utf8_byte(b: u8) -> bool { 
        
               // In the ASCII case, the most significant bit is never set. The leading 
        
               // byte of a 2/3/4-byte sequence always has the top two most significant 
        
               // bigs set. 
        
               (b & 0b1100_0000) != 0b1000_0000 
        
           }

Specifically, it returns true for bytes in the range 0xf8..=0xff, which is wrong. These bytes are not leading bytes, nor are they trailing bytes. They're bytes which can only appear in sequences of invalid utf-8 -- there's no case where they appear in a stream of valid utf8-encoded text, but obviously bstr doesn't want to rely on this.

This is an internal function so it may not matter, but I have a hard time understanding much of the code in this module (I am not someone who looks at a DFA state table and sees meaning in the numbers), so I am unsure whether or not it can cause problems... It's also totally possible this is handled in some other way, I was just finally getting around to doing #44 and noticed this.

(I did try some trivial tests to try and get weird things to happen without avail, but figured it was better to report it -- when I look at those tables my eyes glaze over, but hopefully the code's author understands it!)

The text was updated successfully, but these errors were encountered:

It turns out that this returns true not just if the byte is a leading byte, but also if the byte never appears in any valid UTF-8 sequence. Furthermore, this is OK based on how we're using the function. It's only used in two places. The first place only ever calls it with valid UTF-8 bytes, since it's backing up in the input in a region that's guaranteed to be valid UTF-8. The second place (decoding a codepoint in reverse) is OK treating invalid UTF-8 bytes as leading bytes, since they can never be a valid prefix anyway. After backing up, the forward decode function will handle the case correctly. Since the behavior is OK, we change the name of the function to indicate its true behavior and add a comment clarifying it. Fixes #54

BurntSushi closed this as completed in c9e0e73 May 10, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`is_leading_utf8_byte` returns wrong answer for 0xf8..=0xff (utf-8 invalid bytes are not leading bytes) #54

`is_leading_utf8_byte` returns wrong answer for 0xf8..=0xff (utf-8 invalid bytes are not leading bytes) #54

thomcc commented May 9, 2020

is_leading_utf8_byte returns wrong answer for 0xf8..=0xff (utf-8 invalid bytes are not leading bytes) #54

is_leading_utf8_byte returns wrong answer for 0xf8..=0xff (utf-8 invalid bytes are not leading bytes) #54

Comments

thomcc commented May 9, 2020

`is_leading_utf8_byte` returns wrong answer for 0xf8..=0xff (utf-8 invalid bytes are not leading bytes) #54

`is_leading_utf8_byte` returns wrong answer for 0xf8..=0xff (utf-8 invalid bytes are not leading bytes) #54