Add new function bytes_to_utf8_free_me #22823
utf8.c
const U8 * const send = s + *lenp;
Size_t variant_count = variant_under_utf8_count(s, send);
if (free_me_ptr != NULL && variant_count == 0 && s[*lenp-1] == '\0') {
Here it looks like *lenp includes the terminating NUL, but below it doesn't.
Consider when *lenp starts as 1: this expects s[0] to be NUL, which doesn't match what I expect from this API.
That said, we would be assuming that s[*lenp] is valid. s+*lenp would always be valid as a one-past-the-end pointer, but such a pointer cannot be dereferenced.
So "s[*lenp] is safe" becomes a pre-condition when free_me_ptr isn't NULL.
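To illustrate the distinction raised here between forming a one-past-the-end pointer and reading through it, here is a minimal sketch. The helper name `last_counted_byte_is_nul` is hypothetical and not part of the Perl API, and the guard for a zero length is an assumption about what a safe check would need.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the Perl API.
 * Returns 1 if the last counted byte of s[0 .. len-1] is NUL, else 0.
 * It never reads s[len], only bytes inside the counted range. */
static int last_counted_byte_is_nul(const unsigned char *s, size_t len)
{
    assert(s != NULL);

    if (len == 0) {
        return 0;   /* no counted bytes; s[len - 1] would be out of range */
    }

    /* One past the end: this pointer may be formed but not dereferenced. */
    const unsigned char * const send = s + len;

    /* send - 1 is the last counted byte, so reading it is in range. */
    return *(send - 1) == '\0';
}
```

With the literal "abc", a length of 4 counts the terminating NUL and the helper returns 1; a length of 3 does not count it and the helper returns 0.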
I'm trying to understand this comment. I don't see how I'm dereferencing s[*lenp]. I am dereferencing s[*lenp - 1].
The input is not required to be NUL-terminated, but the output is. So if *lenp is 1, it looks at s[0]. If that is NUL, the function returns s unchanged, as it is a NUL-terminated string whose representation doesn't change when encoded in UTF-8. If it isn't a NUL, the function allocates new memory that includes whatever byte is in s[0] and appends a NUL to it.
In re-reading the code, I see I failed to consider the possibility that *lenp is 0, and that I might be overallocating the new memory by 1 byte. I also did a bit more cleanup, so I now dereference *(send - 1) instead. And I think it is better to dereference a pointer once into a local variable, rather than to dereference it multiple times.
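The behaviour described in this reply can be modelled in plain C. The sketch below is a simplified stand-in, not the Perl source: the name `model_bytes_to_utf8_free_me`, the `void **` type of the third parameter, and the use of `malloc` are all assumptions made for illustration. It also assumes an ASCII platform, treats the input as Latin-1, and requires a non-NULL `free_me_ptr`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified model of the behaviour described above; NOT the Perl source.
 * Assumes an ASCII platform and Latin-1 input bytes. */
static const unsigned char *
model_bytes_to_utf8_free_me(const unsigned char *s, size_t *lenp,
                            void **free_me_ptr)
{
    assert(s != NULL && lenp != NULL && free_me_ptr != NULL);

    const size_t len = *lenp;   /* dereference once into a local */
    size_t variants = 0;

    /* Count the bytes whose UTF-8 form differs from the native form. */
    for (size_t i = 0; i < len; i++) {
        if (s[i] >= 0x80) {
            variants++;
        }
    }

    /* Reuse the input only if nothing changes under UTF-8 and the last
     * counted byte is already NUL.  Testing len > 0 first keeps
     * s[len - 1] in range when the length is 0. */
    if (variants == 0 && len > 0 && s[len - 1] == '\0') {
        *free_me_ptr = NULL;
        return s;
    }

    /* Each variant needs one extra byte; one more byte for the NUL. */
    unsigned char *d = malloc(len + variants + 1);
    if (d == NULL) {
        *free_me_ptr = NULL;
        return NULL;
    }

    size_t j = 0;
    for (size_t i = 0; i < len; i++) {
        if (s[i] < 0x80) {
            d[j++] = s[i];
        } else {
            d[j++] = (unsigned char) (0xC0 | (s[i] >> 6));
            d[j++] = (unsigned char) (0x80 | (s[i] & 0x3F));
        }
    }
    d[j] = '\0';

    *lenp = j;          /* new length, not counting the appended NUL */
    *free_me_ptr = d;   /* the caller frees this when done */
    return d;
}
```

In this model, a single counted NUL byte is returned unchanged with `*free_me_ptr` set to NULL, while a single non-NUL byte or an empty input causes an allocation. The model does not address the case where the counted range both contains variants and ends in a NUL.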
There are no tests or example code, so it's hard to tell how it's meant to be called.
Let's say I call:
Size_t len = 10; /* does not count the NUL, is that typical/expected? */
const U8 *free_me;
const U8 *result = bytes_to_utf8_free_me("0123456789", &len, &free_me);
...
Safefree(free_me);
As the code is written now, this will always allocate a new string, but if I call it with:
Size_t len = 11; /* does count the NUL */
const U8 *free_me;
const U8 *result = bytes_to_utf8_free_me("0123456789", &len, &free_me);
...
Safefree(free_me);
result will be a pointer to the string passed in, and free_me will be NULL.
Is including the NUL in the count passed in the intended way to call this function?
Note that if you do expect that, then when a string is allocated due to variants the resulting string will have double NUL termination, which is a bit unexpected.
I think you reviewed this before refreshing with the latest version available at the time. I had already noticed the double NUL and fixed it.
I don't know what to do about the length disparity. If you include the NUL in the length in blead, you will get a double NUL. In order for the new form to know that there is a trailing NUL, it has to be able to examine that byte, and so the length has to include it. I added a paragraph to the pod that (hopefully) explains it.
It has since occurred to me that it might be better to just not make the guarantee of a trailing NUL if there is no other reason to allocate new memory.
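One way to picture this alternative: the input is returned whenever it has no variants, whether or not its last counted byte is NUL, so a trailing NUL is only guaranteed on newly allocated output. The sketch below is an assumption about how such a design could look, not the code in the pull request; the name `model_skip_alloc_if_invariant`, the `void **` parameter type, and `malloc` are hypothetical, and an ASCII platform with Latin-1 input is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of the alternative design; NOT the Perl source.
 * Assumes an ASCII platform and Latin-1 input bytes. */
static const unsigned char *
model_skip_alloc_if_invariant(const unsigned char *s, size_t *lenp,
                              void **free_me_ptr)
{
    assert(s != NULL && lenp != NULL && free_me_ptr != NULL);

    const size_t len = *lenp;
    size_t variants = 0;

    for (size_t i = 0; i < len; i++) {
        if (s[i] >= 0x80) {
            variants++;
        }
    }

    if (variants == 0) {
        /* Returned as-is: NUL-terminated only if the caller's buffer
         * already was.  The last byte is never examined. */
        *free_me_ptr = NULL;
        return s;
    }

    unsigned char *d = malloc(len + variants + 1);
    if (d == NULL) {
        *free_me_ptr = NULL;
        return NULL;
    }

    size_t j = 0;
    for (size_t i = 0; i < len; i++) {
        if (s[i] < 0x80) {
            d[j++] = s[i];
        } else {
            d[j++] = (unsigned char) (0xC0 | (s[i] >> 6));
            d[j++] = (unsigned char) (0x80 | (s[i] & 0x3F));
        }
    }
    d[j] = '\0';   /* newly allocated output is always NUL-terminated */

    *lenp = j;
    *free_me_ptr = d;
    return d;
}
```

Under this model the length no longer has to count the NUL for the allocation to be skipped: passing "0123456789" with a length of 10 returns the input pointer and sets `*free_me_ptr` to NULL.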
ce98a9c to 5d31895
5d31895 to 4a3e5ea
4a3e5ea to e051041
The only change since the last time is rewriting the pod.
e051041 to cf64e7c
Still approved :)
This is like bytes_to_utf8, but if the representation of the input string is the same in UTF-8 as it is in native format, the allocation of new memory is skipped.
This presents optimization possibilities.
Suggestions for a better name are welcome.
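The condition "the representation is the same in UTF-8 as in native format" amounts to the string containing no variant bytes. As a rough sketch of that test, assuming an ASCII platform where the variants are exactly the bytes 0x80 and above (EBCDIC platforms differ), a counter could look like the following. The name `count_utf8_variants` is hypothetical; it only mirrors the role that variant_under_utf8_count plays in the diff and is not its implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the variant count used in the diff.
 * Counts the bytes in [s, send) whose UTF-8 encoding differs from their
 * single-byte native form.  Assumes an ASCII platform. */
static size_t count_utf8_variants(const unsigned char *s,
                                  const unsigned char *send)
{
    assert(s != NULL && send != NULL && s <= send);

    size_t count = 0;
    for (; s < send; s++) {
        if (*s >= 0x80) {
            count++;   /* this byte would become two bytes in UTF-8 */
        }
    }
    return count;
}
```

A count of zero means the input bytes are already valid UTF-8 with the same meaning, which is the case where the allocation can be skipped.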