Distinguish between invalid and valid-but-incomplete utf-8 #42
Comments
😅 (To be fair, I figured this might be the case, hence my apology-in-advance.) That said, it's a little tricky to use for my case, because I do intend to convert the non-truncated portion in a lossy way (via `to_str_lossy_into`). It will probably let me delete some code, though. I think the remaining issues I have with the code here would be solved if bstr had APIs like I suggested above, but I'll file another issue for it, since it's not directly related to this problem, and the names need a bit of bikeshed discussion since I think they imply valid UTF-8.
Had another case where I was slightly frustrated by this in bstr. Here's the concrete example, with the caveat that it's a pretty frustrating case: mouse input can arrive at a terminal program in a couple of different formats.

You don't really know which of these formats the data is in up front. Anyway, you get this data from nonblocking IO, and you have no idea whether what you've read so far is complete. So, assuming you don't already know the format for sure, basically what you want is to decode what you can and then decide whether the remainder is garbage or just hasn't arrived yet.

The difficulty here is telling those two cases apart. So yeah, it's pretty annoying. I ended up punting on it because it was a hassle. That said, the point here isn't so much the specifics of parsing these formats; in practice the ambiguity is not a huge problem since heuristics work well.
@thomcc Sorry, I might be missing something, but what API would be useful in that context? (It's not clear to me why this is hard to do with what's already there.)
A version of `decode_utf8` that returns a `Utf8Error` would be a definite improvement.
Hmmm, I'm not sure I quite follow?
Right, I'm interested in truncation within a codepoint -- I'd like to be able to distinguish a case like `\xf0` (possibly just truncated) from `\xf0\xff` (outright invalid). `Utf8Error` seems to distinguish these cases -- the `Utf8Error` API is just awkward for what I'm doing, for reasons described above (I need the codepoints, and before getting them I don't know where the valid region ends).
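For reference, std's `Utf8Error` already exposes that distinction through `error_len()`; a minimal illustration with the two byte strings discussed here (using std rather than bstr):

```rust
fn main() {
    // A lone lead byte of a 4-byte sequence: more bytes may simply not have
    // arrived yet. `error_len()` is None for "unexpected end of input".
    let truncated = std::str::from_utf8(b"\xf0").unwrap_err();
    assert_eq!(truncated.error_len(), None);

    // The same lead byte followed by a byte that can never continue the
    // sequence: this is genuinely invalid, and `error_len()` is Some(_).
    let invalid = std::str::from_utf8(b"\xf0\xff").unwrap_err();
    assert_eq!(invalid.error_len(), Some(1));
}
```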
Oh I see. Yeah, that is fundamentally at odds with how the current API reports failure. Maybe there is another kind of helper function that could help here? Perhaps an `is_utf8_prefix` check?
I mean, I'm making the API up, but the difference I want between `\xf0\xff` and `\xf0` would just be why the failure happened, in a very broad sense -- e.g. if one says it saw an illegal byte and the other that it saw truncation, that would be usable. I don't plan on using the replacement behavior either way. Anyway, I don't think this requires looking ahead; you already know these things at the time you return the data.

Note that the docs in these cases are imprecise in ways that have caused a bit of confusion for me. In particular:
You should probably adjust the wording to "a maximal prefix of a valid UTF-8 code unit sequence, or if none exists, 1", which matches the behavior and the description of "maximal subpart" in https://www.unicode.org/review/pr-121.html.

Sadly, the behavior the docs currently describe would be more helpful to me for this case than what the function actually does. It feels like the current function is mainly useful when you plan on replacing anything invalid anyway -- which is only in the rarest cases a thing I want to do, and even then, only when I'm absolutely certain it's not caused by truncation. That said, it's possible I'm the outlier here -- bstr is certainly a library written with the assumption that lossy replacement is an okay default (and for very many cases this is probably true -- I'm not attempting to be critical of the design on the whole). And for the case where you do want to replace, it does seem to give you the ideal result.
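As a concrete reading of that wording, here is roughly what the "maximal prefix, or 1" behavior means for `decode_utf8` (a sketch; exact values should be checked against the bstr version in use):

```rust
fn main() {
    // `\xf0\x9f` is a valid two-byte prefix of a four-byte sequence, so the
    // reported length covers both bytes even though no char was decoded.
    let (ch, len) = bstr::decode_utf8(&b"\xf0\x9f"[..]);
    assert_eq!((ch, len), (None, 2));

    // `\xff` can never start a UTF-8 sequence, so there is no valid prefix
    // and the length falls back to 1.
    let (ch, len) = bstr::decode_utf8(&b"\xff"[..]);
    assert_eq!((ch, len), (None, 1));
}
```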
Hmm, I mean, maybe? I already feel like I'm cobbling together what I need from a bunch of other APIs. It would be possible to use this, but I think it would make the code more awkward. It's worth noting again: before I decode, I don't know how long the codepoint is going to be (obviously), and I also have no guarantee that the bytes after the codepoint are going to be valid UTF-8 at all -- it's completely reasonable for them to be non-utf8. So `is_utf8_prefix` would have to only look at one codepoint's worth of bytes? It's not like I could pass in the whole buffer.

Would it help if I gave you example code to show you?

[0] This one is dubious also, and confused me earlier in this conversation: https://docs.rs/bstr/0.2.8/bstr/struct.Utf8Error.html#method.valid_up_to

Since it goes on to talk about sub-bytes within a codepoint, I thought this would mean it would tell me about a partial codepoint read, but what it actually does only seems to consider bytes which lie on codepoint boundaries, or on the boundaries where you'd insert a replacement character.
I think I have pretty well lost track of what it is you're trying to do, so yes, I think a short example would help here.
I don't know whether it's worth making your use case completely non-awkward. If your use case is very rare, then I'd consider "possible but awkward" to be a success here. But I don't mean to state any absolutes. I'm generally happy to add more methods or functions that are orthogonal in purpose, assuming a use case for them can be easily articulated in the docs with a short example. The road I don't want to go down is to provide "alternate" APIs of things like `decode_utf8` that differ only in how they report failure.

As far as documentation clarity goes, I'm always happy to see that improved. I confess that I do not quite grok your confusion about the existing wording. I think the best way to hash that out, if you're willing, is to submit a PR with the docs rewritten to match your ideal.

Thanks for your patience in dealing with me here. :)
Here's an example. It's not the same as the example I opened this bug with, but essentially each time this comes up for me, I've found the way I have to handle it in bstr fairly painful. That said, I don't know if it's compelling, as real usage would be more complex, and it's only vaguely similar to what I'm asking for.

And I guess that's the crux of it -- do you want to support cases where the action on seeing invalid UTF-8 is anything other than 'replace' or 'bail out'? IMO the answer should be yes, at least in the case where the reason it's invalid is that it's truncated... That said, essentially every API in the crate that can make a choice here seems to behave the other way -- either it performs the replacement for you, or it discards the information (e.g. https://github.com/BurntSushi/bstr/blob/master/src/utf8.rs#L601) you'd need in order to handle it without replacing. It's admittedly a little frustrating, although I truly apologize if I've let my frustration come through or I've come across as impatient.
OK, I've finally had a chance to look at your code. Your example helped a lot. I'd like to brainstorm some possible solutions for you.

All right, so here is the relevant code from the playground:

```rust
use bstr::ByteSlice; // for the to_str() method on &[u8]

#[derive(Debug, Clone, Copy, PartialEq)]
enum ReadErr {
    Partial,
    Invalid,
}

fn try_decode(b: &[u8]) -> Result<(char, usize), ReadErr> {
    if b.is_empty() {
        return Err(ReadErr::Partial);
    }
    let (ch, l) = bstr::decode_utf8(b);
    if let Some(c) = ch {
        Ok((c, l))
    } else if l < b.len() {
        // If we didn't see all characters, something must have been invalid?
        Err(ReadErr::Invalid)
    } else {
        // Could either be invalid-because-truncated, or invalid-because-invalid.
        assert_eq!(l, b.len(), "decode_utf8 returned length greater than input slice?");
        let err = b.to_str().expect_err("should fail here...");
        // sanity check...
        debug_assert_eq!(err.valid_up_to(), 0);
        if err.error_len().is_some() {
            Err(ReadErr::Invalid)
        } else {
            Err(ReadErr::Partial)
        }
    }
}
```

For me personally, I think that if we could make the error classification at the end look like this:

```rust
if utf8_prefix_len(b) == 0 {
    Err(ReadErr::Invalid)
} else {
    Err(ReadErr::Partial)
}
```

then you avoid a weird second decoding step. Specifically, I think the new routine would look something like this:

```rust
/// Returns the number of bytes at the beginning of `slice` that correspond to
/// a valid prefix of a UTF-8 encoded codepoint.
///
/// If slice does not start with a valid prefix, then this returns 0. If slice
/// starts with a complete UTF-8 encoded codepoint, then the prefix length is
/// equivalent to the number of bytes in the encoded codepoint.
///
/// The maximum value this function can return is 4.
fn utf8_prefix_len(slice: &[u8]) -> usize;
```
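In the meantime, here is one way the signature above could be filled in outside the crate -- a sketch only, not bstr code; it assumes the "maximal valid prefix" reading and follows Unicode's table of well-formed byte sequences:

```rust
fn utf8_prefix_len(slice: &[u8]) -> usize {
    let lead = match slice.first() {
        Some(&b) => b,
        None => return 0,
    };
    // Expected total sequence length, plus the allowed range for the second
    // byte (per Unicode Table 3-7, "Well-Formed UTF-8 Byte Sequences").
    let (want, second) = match lead {
        0x00..=0x7F => return 1, // ASCII: already a complete codepoint
        0xC2..=0xDF => (2, 0x80..=0xBF),
        0xE0 => (3, 0xA0..=0xBF),
        0xE1..=0xEC | 0xEE..=0xEF => (3, 0x80..=0xBF),
        0xED => (3, 0x80..=0x9F),
        0xF0 => (4, 0x90..=0xBF),
        0xF1..=0xF3 => (4, 0x80..=0xBF),
        0xF4 => (4, 0x80..=0x8F),
        // 0x80..=0xC1 and 0xF5..=0xFF can never begin a codepoint.
        _ => return 0,
    };
    // Count the lead byte plus every following byte that keeps the prefix
    // well-formed, stopping at the first byte that breaks it.
    let mut len = 1;
    for (i, &b) in slice.iter().enumerate().take(want).skip(1) {
        let ok = if i == 1 {
            second.contains(&b)
        } else {
            (0x80..=0xBF).contains(&b)
        };
        if !ok {
            break;
        }
        len += 1;
    }
    len
}
```

With that, the classification above maps `utf8_prefix_len(b) == 0` to `Invalid` and everything else to `Partial`, with no second decoding pass.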
It would help if you could be more specific than "every API," because to me, that reads as an overstatement. Now, there are many APIs that substitute in replacement codepoints, but that's exactly what they're documented to do. Moreover, truncation is precisely the motivation behind the way `Utf8Error` reports errors. Since my judgment at the moment (absent further evidence) is that this is a fairly niche use case, I'd lean toward a small additive helper like the one sketched above rather than reworking existing APIs.
You've expressed as much. My desire is mainly wanting to minimize how much of this I end up having to hand-roll.
Yes, I agree. Also, thank you for writing out the signature and documentation. I'll respond to the rest of this just to get my thoughts down here.
You're right, of course -- and looking more closely, I had thought it was worse than it actually is. I think part of the difficulty is that the functions which can produce this information don't expose it directly, and that many commonly used or desirable functions in the library assume lossy replacement is what you want. Which I guess means it comes down to the fact that I don't really have a concrete alternative to propose. (There's more I could say here, but there's not always a path forward when you realize the problem is partly one of expectations.)
I mean... I've seen that elsewhere too. In an ideal world this sort of operation would be less common, though...
std does much the same here. It would really help if you could suggest ways to improve the wording. I personally don't know what to do, but I'm ready and willing to believe it's confusing and I would love to fix it. I just don't know how.
Right. That's why bstr has a top-level...
I'm going to close this, since life stuff happened before I got a chance to do it, and I believe that in the meantime #68 looks like a better approach overall. I will note that:
This is one of the initial things I really liked about bstr.
I have a `&[u8]` that represents a section of a streaming read of some text. I'd like to convert it (possibly lossily) to a String as if it had all arrived in a single chunk. The issue is that the chunk may end partway through a valid utf8 sequence, and naively converting will corrupt any character unlucky enough to get sliced in this manner. To avoid that, I need to be able to distinguish whether the final utf8 sequence is invalid or just incomplete.
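To make the failure mode concrete, here is a small illustrative example (not from the original code): a read boundary that happens to fall inside a multi-byte sequence corrupts a perfectly good character if each chunk is converted on its own.

```rust
fn main() {
    let full = "café".as_bytes(); // [.., 0xC3, 0xA9]
    let (chunk1, chunk2) = full.split_at(4); // the read happens to end mid-'é'

    // Converting each chunk naively yields two replacement characters...
    assert_eq!(String::from_utf8_lossy(chunk1), "caf\u{FFFD}");
    assert_eq!(String::from_utf8_lossy(chunk2), "\u{FFFD}");

    // ...even though the data as a whole was perfectly valid.
    assert_eq!(String::from_utf8_lossy(full), "café");
}
```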
Right now I do something like this: split the chunk just before any trailing bytes that form a valid-but-incomplete UTF-8 sequence, convert the front lossily, and carry the tail over into the next read, where `utf8_valid_prefix` is the small helper that finds the split point (a sketch of the idea is below). This works fine [0] (assuming I didn't mangle it too badly trying to simplify/clean it up for the issue), but it feels redundant given that bstr has already done this (or something equivalent) when decoding, and it just feels like the library can probably help more.

Anyway, I don't have strong opinions on what an API for this should look like, but I figured I'd bring it up.
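For the splitting step described above, a minimal sketch is below, written against std's `Utf8Error` rather than bstr; the name `split_incomplete_tail` and the exact shape are illustrative assumptions, not the original `utf8_valid_prefix`.

```rust
/// Split `chunk` into a part that is safe to convert now and a (possibly
/// empty) tail that is a valid-but-incomplete UTF-8 prefix, i.e. bytes that
/// should be held back and prepended to the next read.
fn split_incomplete_tail(chunk: &[u8]) -> (&[u8], &[u8]) {
    // An encoded codepoint is at most 4 bytes, so an incomplete one can only
    // begin within the last 3 bytes of the chunk.
    let start = chunk.len().saturating_sub(3);
    for i in start..chunk.len() {
        if let Err(e) = std::str::from_utf8(&chunk[i..]) {
            // An error at offset 0 with `error_len() == None` means everything
            // from `i` onward is a prefix that was cut off, not garbage.
            if e.valid_up_to() == 0 && e.error_len().is_none() {
                return chunk.split_at(i);
            }
        }
    }
    (chunk, &[])
}

fn main() {
    // "é" is 0xC3 0xA9; this chunk ends after the first of those two bytes.
    let (head, tail) = split_incomplete_tail(b"caf\xc3");
    assert_eq!((head, tail), (&b"caf"[..], &b"\xc3"[..]));

    // A byte that can never start a sequence is not held back; it really is
    // invalid, and lossy conversion may as well replace it.
    let (head, tail) = split_incomplete_tail(b"caf\xff");
    assert_eq!((head, tail), (&b"caf\xff"[..], &b""[..]));
}
```

The caller would then lossily convert `head` and stash `tail` until more bytes arrive.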
One easy option might be adding, e.g., `ByteSlice::is_char_boundary(&self, index: usize) -> bool` and `ByteSlice::utf8_sequence_len(&self, starting_at: usize) -> Option<usize>`; those might be enough. Maybe also a `prev_char_boundary`? Had these existed, I probably wouldn't be filing this issue, but implementing these for not-necessarily-valid utf8 might end up being hard or producing confusing results, and they might not be obvious functions to look for if you aren't familiar with how utf8 is encoded. I'm open to thoughts, though.

Also, apologies in advance if something for this exists and I'm just missing it...
[0]: Actually, I suspect the last part (where we check that `tail` is valid) has a bug, in that it can cause us to take a sequence which is both invalid and incomplete and turn it into two replacement characters, whereas if we had processed the whole buffer in one go we would only have one. That said, I'm not worried about it really, since the data was already invalid at that position; it might be an indication that "distinguish between invalid and incomplete" isn't quite the right framing, though.