Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose Utf8LossyChunksIter #54

Closed
dylni opened this issue Jun 18, 2022 · 7 comments
Closed

Expose Utf8LossyChunksIter #54

dylni opened this issue Jun 18, 2022 · 7 comments
Labels
ACP-accepted API Change Proposal is accepted (seconded with no objections) api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api

Comments

@dylni
Copy link

dylni commented Jun 18, 2022

Proposal

Problem statement

When Utf8Error::valid_up_to and Utf8Error::error_len are used, their results will almost always be used to get substrings of the original string. However, since Utf8Error does not have a reference to the original string, it cannot have methods to return the substrings.

Utf8LossyChunksIter is also much easier to use.

Motivation, use-cases

This is useful when creating a custom byte string formatter. UTF-8 portions are usually output using the Display implementation for str or str::escape_debug, but invalid portions might require custom formatting.

Example

Code using str::from_utf8 (requires unsafe):

while !string.is_empty() {
    let (valid, invalid) = match str::from_utf8(string) {
        Ok(string) => (string, &[][..]),
        Err(error) => {
            let valid_len = error.valid_up_to();
            let valid = unsafe { str::from_utf8_unchecked(&string[..valid_len]) };
            let mut invalid = &string[valid_len..];
            if let Some(invalid_len) = error.error_len() {
                invalid = &invalid[..invalid_len];
            }
            (valid, invalid)
        }
    };
    // formatting for `valid` and `invalid`
    string = &string[valid.len() + invalid.len()..];
}

Code using the new API:

for chunk in Utf8Chunks::new(string) {
    let valid = chunk.valid();
    let invalid = chunk.invalid();
    // formatting for `valid` and `invalid`
}

Solution sketches

Make the following changes, and change the feature for these structs from str_internals to utf8_chunks.

Links and related work

What happens now?

This issue is part of the libs-api team API change proposal process. Once this issue is filed the libs-api team will review open proposals in its weekly meeting. You should receive feedback within a week or two.

@dylni dylni added api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api labels Jun 18, 2022
@m-ou-se
Copy link
Member

m-ou-se commented Jun 20, 2022

The bstr crate has a similar API, which also exposes another field/method on the chunks: incomplete, to check whether the incomplete sequence might still become valid if more bytes are added: https://docs.rs/bstr/latest/bstr/struct.Utf8Chunk.html

@m-ou-se
Copy link
Member

m-ou-se commented Jun 20, 2022

Regardless of that minor detail, I'm in favor of working towards exposing this as a public API!

@m-ou-se m-ou-se added initial-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. disposition-merge This RFC is in PFCP or FCP with a disposition to merge it. labels Jun 20, 2022
@dylni
Copy link
Author

dylni commented Jun 20, 2022

Great! I agree that basing the API off bstr would be sensible, so I have updated the issue. The changes were actually surprisingly close already, but providing methods instead of fields will be better for future compatibility.

I'm leaving out the incomplete method for now, since that can be added later and I don't personally need it. However, I can add it back if it seems useful.

Also, it might be worth discussing if the constructor should be <[u8]>::utf8_chunks or UtfChunks::new. If I understand the process correctly, that discussion can happen on the tracking issue.

@dylni
Copy link
Author

dylni commented Jul 17, 2022

@m-ou-se Is this proposal ready for a PR? It still has the "initial-comment-period" label, so I'm not sure if it's been agreed that this should be added.

@m-ou-se
Copy link
Member

m-ou-se commented Jul 20, 2022

@dylni Sorry! We don't have this processs automated yet. Yes, feel free to go ahead and open a tracking issue for this new unstable feature and send a PR for its implementation.

@m-ou-se m-ou-se added finished-initial-comment-period The initial comment period is finished for this API change proposal. and removed initial-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. labels Jul 20, 2022
@dylni
Copy link
Author

dylni commented Jul 21, 2022

Thanks!

Tracking issue: rust-lang/rust#99543

matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Aug 20, 2022
…lacrum

Expose `Utf8Lossy` as `Utf8Chunks`

This PR changes the feature for `Utf8Lossy` from `str_internals` to `utf8_lossy` and improves the API. This is done to eventually expose the API as stable.

Proposal: rust-lang/libs-team#54
Tracking Issue: rust-lang#99543
@dylni
Copy link
Author

dylni commented Aug 20, 2022

This has been merged: rust-lang/rust#99544

@dylni dylni closed this as completed Aug 20, 2022
workingjubilee pushed a commit to tcdi/postgrestd that referenced this issue Sep 15, 2022
Expose `Utf8Lossy` as `Utf8Chunks`

This PR changes the feature for `Utf8Lossy` from `str_internals` to `utf8_lossy` and improves the API. This is done to eventually expose the API as stable.

Proposal: rust-lang/libs-team#54
Tracking Issue: #99543
@dtolnay dtolnay added ACP-accepted API Change Proposal is accepted (seconded with no objections) and removed disposition-merge This RFC is in PFCP or FCP with a disposition to merge it. finished-initial-comment-period The initial comment period is finished for this API change proposal. labels Nov 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ACP-accepted API Change Proposal is accepted (seconded with no objections) api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api
Projects
None yet
Development

No branches or pull requests

3 participants