-
-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unicode String #179
base: main
Are you sure you want to change the base?
Unicode String #179
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for putting this together. I have left some comments about things I would prefer to see changed.
Also, I suspect that this RFC may not be totally complete yet, in terms of tracking what all of the implications would be. I think it would be valuable for someone to start implementing this change (particularly the method signatures, even if the code for encoding/decoding is not fully robust yet) so that we could examine in more detail all of the things that will be affected by this change.
I also suspect there could be certain implications within this RFC that may be implicitly understood by the author, but may be not clear to the reader, resulting in confusion about exactly how these changes will look. The step I mentioned above of putting together a proof of concept level implementation could also help with this problem of potential misunderstandings of this very complex change.
text/0000-unicode-string.md
Outdated
1. The values() function will return an iterator over the string codepoints. Same as runes(). | ||
1. A concat_bytes() function will be added to add a sequence of codepoints to the string from an iterator of bytes. | ||
|
||
Add traits Encoder and Decoder to the builtin package. Any function that produces a String from bytes, or produces bytes from a String must take an Encoder or Decoder as a parameter as is appropriate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding these type names to the builtin package means that no Pony program will be able to use those type names for user-defined types. The words Encoder
and Decoder
are incredibly general, and I can imagine many Pony programs or libraries using them to refer to things other than this string encoding use case.
I suggest using the names StringEncoder
and StringDecoder
instead.
|
||
The ByteSeq type defined in std_stream.pony will be changed to remove String. | ||
``` | ||
type ByteSeq is (Array[U8] val) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's not much point in having this type alias if it is not a union.
The purpose of this type is to define the types for which it is safe to call cpointer
and size
to pass to FFI calls. In the standard library it is only safe to do those for standard library types, since we can't trust any user-defined types that implement the cpointer
and size
interface to give us a pointer that is valid for the given number of bytes.
Given that size
is no longer the right method to call (it needs to call byte_size
instead, I recommend just removing this union altogether and using (String | Array[U8] val)
in places where it is currently used, and in each such place we would need to introduce a match
statement that checks which type it is and uses size
for Array and byte_size
for String.
text/0000-unicode-string.md
Outdated
|
||
Change Writer in buffered/writer.pony to accept Encoder parameters in the write() and writev() functions. | ||
|
||
Add a FileCharacters class in buffered/file_characters.pony that provides an iterator of characters in a file. The implementation will be similar to the FileLines class. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you meant to put this in the files
package (where it can live alongside FileLines
) rather than the buffered
package.
Add a FileCharacters class in buffered/file_characters.pony that provides an iterator of characters in a file. The implementation will be similar to the FileLines class. | |
Add a FileCharacters class in files/file_characters.pony that provides an iterator of characters in a file. The implementation will be similar to the FileLines class. |
text/0000-unicode-string.md
Outdated
1. The truncate() function will only support len parameter values less than the string size. | ||
1. The utf32() function will be removed. It is redundant, and returns pair that includes a byte count that is no longer needed. | ||
1. The insert_byte() function will be changed to insert_utf32() | ||
1. The values() function will return an iterator over the string codepoints. Same as runes(). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest that we rename values
to bytes
instead of removing it entirely, so that the user can still iterate over bytes when they need to, without introducing an allocation to convert to Array[U8] val
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps it's unclear to me about the allocations, but it does seem like the array() method has not been removed? If so, then one can still use that method for an allocation-less iteration method (and if necessary that type can be improved to fun box (): this->Array[U8] box
). That said, I believe that bytes() is a much clearer name for that use case
# Alternatives | ||
|
||
1. Leave the String class as it is. This is likely to result in hidden errors in many programs that work with Strings as they will not work correctly with unicode data if they encounter it. It will also make it difficult to use Pony in settings where ASCII is not sufficient for local language (most non English speaking countries). | ||
1. A more complicated implementation with multiple String types capable of storing text data internally using different byte encodings. This approach would improve performance in situations where strings needed to be created from bytes of various encodings. The String type could be matched to the native byte encoding to eliminate any need for conversion. Such an implementation would add complexity to the String API as it would require use of a String trait with multiple implementations. It would also add considerable complexity to the String class implementation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I mentioned in our discussion in Zulip, I would prefer to see a version of this RFC that takes the encoding as a type parameter to all encoding-aware operations within the String
class, using UTF-8 as the default (so that everything in this RFC works for the user as you describe by default), but so that non-UTF-8 use cases can also be supported.
This does not necessarily mean you need multiple String classes or a String trait - just multiple encoding classes and an encoding trait (I believe you already have an encoder/decoder traits as part of this RFC, so that could potentially be used).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for creating this RFC!
I left a few comments.
The main concern here is that this very substantial change to the language is missing some motivational comments. Especially: What will be improved? What will this change make possible, which isnt possible yet?
I wanna see the costs of implementing this RFC be justified.
|
||
# Summary | ||
|
||
Change the builtin String class to present a sequence of unicode codepoints as 4 byte numbers (U32) instead of a sequence of bytes. All conversions from a sequence of bytes to a String, or from a String to a sequence of bytes will require specifying a decoding or encoding respectively. The default encoding will be UTF-8, and Strings will be represented internally as a sequence of UTF-8 encoded codepoints. Provide encoders/ decoders for UTF-8, UTF-16, UTF-32, ASCII and ISO-8859-1 as part of the stdlib. Change the character literal to represent a single unicode codepoint and produce a UTF-32 codepoint value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not 100% clear to me what the internal representation of String will be. Will it remain a pointer and a size only that the bytes pointed to will be actual U32
codepoints without encoding? Or will the actual underlying data remain encoded? This RFC should also state why this approach was chosen and the other ones dicarded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get the sense that by "present a sequence..." he may just mean that it implements Seq[U32]
and would support those APIs and that as he's said below the data would still be represented internally as UTF-8.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So to make sure I understand correctly, "hello" would be internally stored as five bytes: 104, 101, 108, 108, 111
, but if we asked for third character it would return 0,0,0,108
(assuming endian)?
and, when you stray out of ASCII into codepoints, for example:
"hӧle" => 104, **211, 167**, 108, 101
Would still have a length of 4, and the third character would be l?
fun encode(codepoint: U32): (USize, U8, U8, U8, U8) | ||
|
||
trait val Decoder | ||
fun decode(bytes: U32): (U32, U8) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the meaning of the elements of the returned tuple? Should this function be partial considering that not all byte-combinations are valid in every encoding?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a better signature/approach for encoding/decoding would be to encode a whole String into an Array[U8] and vice versa for decoding. This might seem more work on the implementor of Encoder
/Decoder
, but might allow to employ more tricks and be slightly more performant/convenient for some encodings. I also suspect, that implementing a special Encoder
/Decoder
is fairly rare.
Hi Joe,
Thanks for providing feedback so quickly. I have make some comments below.
*Rowland E. Smith*
+1-201-396-3842
On 8/3/20 11:29 AM, Joe Eli McIlvain wrote:
***@***.**** commented on this pull request.
Thanks for putting this together. I have left some comments about
things I would prefer to see changed.
Also, I suspect that this RFC may not be totally complete yet, in
terms of tracking what all of the implications would be. I think it
would be valuable for someone to start implementing this change
(particularly the method signatures, even if the code for
encoding/decoding is not fully robust yet) so that we could examine in
more detail all of the things that will be affected by this change.
I also suspect there could be certain implications within this RFC
that may be implicitly understood by the author, but may be not clear
to the reader, resulting in confusion about exactly how these changes
will look. The step I mentioned above of putting together a proof of
concept level implementation could also help with this problem of
potential misunderstandings of this very complex change.
I agree that a sample implementation would help. I have 70% of an
implementation done. How I go about sharing that with the group? Create
a pull request?
------------------------------------------------------------------------
In text/0000-unicode-string.md
<#179 (comment)>:
> +
+Change the builtin String class to present of sequence of unicode codepoints as 4 byte numbers (U32).
+```
+class val String is (Seq[U32] & Comparable[String box] & Stringable)
+```
+Where String functions took parameters of type U8 or returned values of U8, they will take or return values of type U32. All indexes will be interpreted as codepoints. Only functions that manage String allocated memory will continue to refer to bytes.
+
+The following additional changes will be made to the String class:
+1. The size() function will return the number of unicode codepoints in the string. A new byte_size() function will return the number of bytes.
+1. The truncate() function will only support len parameter values less than the string size.
+1. The utf32() function will be removed. It is redundant, and returns pair that includes a byte count that is no longer needed.
+1. The insert_byte() function will be changed to insert_utf32()
+1. The values() function will return an iterator over the string codepoints. Same as runes().
+1. A concat_bytes() function will be added to add a sequence of codepoints to the string from an iterator of bytes.
+
+Add traits Encoder and Decoder to the builtin package. Any function that produces a String from bytes, or produces bytes from a String must take an Encoder or Decoder as a parameter as is appropriate.
Adding these type names to the builtin package means that no Pony
program will be able to use those type names for user-defined types.
The words |Encoder| and |Decoder| are incredibly general, and I can
imagine many Pony programs or libraries using them to refer to things
other than this string encoding use case.
I suggest using the names |StringEncoder| and |StringDecoder| instead.
I agree, I will make the change.
------------------------------------------------------------------------
In text/0000-unicode-string.md
<#179 (comment)>:
> +1. The insert_byte() function will be changed to insert_utf32()
+1. The values() function will return an iterator over the string codepoints. Same as runes().
+1. A concat_bytes() function will be added to add a sequence of codepoints to the string from an iterator of bytes.
+
+Add traits Encoder and Decoder to the builtin package. Any function that produces a String from bytes, or produces bytes from a String must take an Encoder or Decoder as a parameter as is appropriate.
+```
+trait val Encoder
+ fun encode(codepoint: U32): (USize, U8, U8, U8, U8)
+
+trait val Decoder
+ fun decode(bytes: U32): (U32, U8)
+```
+
+The ByteSeq type defined in std_stream.pony will be changed to remove String.
+```
+type ByteSeq is (Array[U8] val)
There's not much point in having this type alias if it is not a union.
The purpose of this type is to define the types for which it is safe
to call |cpointer| and |size| to pass to FFI calls. In the standard
library it is only safe to do those for standard library types, since
we can't trust any user-defined types that implement the |cpointer|
and |size| interface to give us a pointer that is valid for the given
number of bytes.
Given that |size| is no longer the right method to call (it needs to
call |byte_size| instead, I recommend just removing this union
altogether and using |(String | Array[U8] val)| in places where it is
currently used, and in each such place we would need to introduce a
|match| statement that checks which type it is and uses |size| for
Array and |byte_size| for String.
I agree. I left the type in to improve backward compatibility.
------------------------------------------------------------------------
In text/0000-unicode-string.md
<#179 (comment)>:
> +Many functions that accept ByteSeq as a parameter will be changed to accept (String | ByteSeq) as a parameter.
+
+A new StringIter interface will be added to std_stream.pony
+```
+interface val StringIter
+ """
+ An iterable collection of String box.
+ """
+ fun values(): Iterator[this->String box]
+```
+
+Change Reader in buffered/reader.pony to add functions to read a codepoint and to read a String of a given number of codepoints. Update function line() to accept a decoder, and to return a pair with the line string, and the number of bytes consumed.
+
+Change Writer in buffered/writer.pony to accept Encoder parameters in the write() and writev() functions.
+
+Add a FileCharacters class in buffered/file_characters.pony that provides an iterator of characters in a file. The implementation will be similar to the FileLines class.
I think you meant to put this in the |files| package (where it can
live alongside |FileLines|) rather than the |buffered| package.
⬇️ Suggested change
-Add a FileCharacters class in buffered/file_characters.pony that provides an iterator of characters in a file. The implementation will be similar to the FileLines class.
+Add a FileCharacters class in files/file_characters.pony that provides an iterator of characters in a file. The implementation will be similar to the FileLines class.
Yes, that is a mistake. I will fix.
------------------------------------------------------------------------
In text/0000-unicode-string.md
<#179 (comment)>:
> +Unicode is an accepted standard for representing text in all languages used on planet earth. Modern programming languages should have first class support for unicode.
+
+# Detailed design
+
+Change the builtin String class to present of sequence of unicode codepoints as 4 byte numbers (U32).
+```
+class val String is (Seq[U32] & Comparable[String box] & Stringable)
+```
+Where String functions took parameters of type U8 or returned values of U8, they will take or return values of type U32. All indexes will be interpreted as codepoints. Only functions that manage String allocated memory will continue to refer to bytes.
+
+The following additional changes will be made to the String class:
+1. The size() function will return the number of unicode codepoints in the string. A new byte_size() function will return the number of bytes.
+1. The truncate() function will only support len parameter values less than the string size.
+1. The utf32() function will be removed. It is redundant, and returns pair that includes a byte count that is no longer needed.
+1. The insert_byte() function will be changed to insert_utf32()
+1. The values() function will return an iterator over the string codepoints. Same as runes().
I suggest that we rename |values| to |bytes| instead of removing it
entirely, so that the user can still iterate over bytes when they need
to, without introducing an allocation to convert to |Array[U8] val|
||A bytes function would need to take an Encoder as a parameter, but I
am OK with adding it.
------------------------------------------------------------------------
In text/0000-unicode-string.md
<#179 (comment)>:
> +This can be presented as a continuation of existing Pony patterns.
+
+The Pony tutorial will need to be updated to reflect these changes.
+
+# How We Test This
+
+Extend CI coverage to cover these changes and new features.
+
+# Drawbacks
+
+This is a change to the String API, and as such will break existing programs. String functions returning or taking as parameters U8 values now returning or taking U32 values will probably be the most common source of program breakage. Also, programs that used the existing unicode friendly functions in String will need to change, but they should be greatly simplified.
+
+# Alternatives
+
+1. Leave the String class as it is. This is likely to result in hidden errors in many programs that work with Strings as they will not work correctly with unicode data if they encounter it. It will also make it difficult to use Pony in settings where ASCII is not sufficient for local language (most non English speaking countries).
+1. A more complicated implementation with multiple String types capable of storing text data internally using different byte encodings. This approach would improve performance in situations where strings needed to be created from bytes of various encodings. The String type could be matched to the native byte encoding to eliminate any need for conversion. Such an implementation would add complexity to the String API as it would require use of a String trait with multiple implementations. It would also add considerable complexity to the String class implementation.
As I mentioned in our discussion in Zulip
<https://ponylang.zulipchat.com/#narrow/stream/192795-contribute-to.20Pony/topic/Changes.20to.20String.20class/near/202574862>,
I would prefer to see a version of this RFC that takes the encoding as
a type parameter to all encoding-aware operations within the |String|
class, using UTF-8 as the default (so that everything in this RFC
works for the user as you describe by default), but so that non-UTF-8
use cases can also be supported.
This does not necessarily mean you need multiple String classes or a
String trait - just multiple encoding classes and an encoding trait (I
believe you already have an encoder/decoder traits as part of this
RFC, so that could potentially be used).
I am not sure what you mean by the "encoding-aware operations". I think
that what would work would be adding an "internal encoding" parameter to
all of the String constructors. This would require storing the internal
encoding in every String instance, so some impact on memory footprint.
Adding an "internal encoding" parameter to every encoding-aware
operation would cover many operations. For example, the eq function
would need to be encoding aware because it takes a String as a
parameter. It would not be possible to compare 2 strings without knowing
how the codepoints were encoded in each of them. It would also require
that any class that takes a String as a parameter would need to know the
strings are internal encoding in order to call its functions. That does
not sound workable.
Multiple encodings would also complicate FFI. Even in the standard lib,
strings encoded with UTF-16 would require some changes in the pony runtime.
I would prefer to defer this enhancement as it could be implemented
later with minimal disruption if needed, and it would significantly
increase the complexity and work required to make this String change.
… —
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#179 (review)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AGEQLV4FJR2Z4KZP66XU3WDR63JXNANCNFSM4PS4AMOQ>.
|
Hi @mfelsche,
Thanks for the feedback. Comments below.
*Rowland E. Smith*
+1-201-396-3842
On 8/3/20 4:25 PM, Matthias Wahl wrote:
***@***.**** commented on this pull request.
Thanks for creating this RFC!
I left a few comments.
The main concern here is that this very substantial change to the
language is missing some motivational comments. Especially: What will
be improved? What will this change make possible, which isnt possible yet?
I wanna see the costs of implementing this RFC be justified.
------------------------------------------------------------------------
In text/0000-unicode-string.md
<#179 (comment)>:
> @@ -0,0 +1,97 @@
+- Feature Name: unicode-string
+- Start Date: 2020-07-30
+- RFC PR: (leave this empty)
+- Pony Issue: (leave this empty)
+
+# Summary
+
+Change the builtin String class to present a sequence of unicode codepoints as 4 byte numbers (U32) instead of a sequence of bytes. All conversions from a sequence of bytes to a String, or from a String to a sequence of bytes will require specifying a decoding or encoding respectively. The default encoding will be UTF-8, and Strings will be represented internally as a sequence of UTF-8 encoded codepoints. Provide encoders/ decoders for UTF-8, UTF-16, UTF-32, ASCII and ISO-8859-1 as part of the stdlib. Change the character literal to represent a single unicode codepoint and produce a UTF-32 codepoint value.
It is not 100% clear to me what the internal representation of String
will be. Will it remain a pointer and a size only that the bytes
pointed to will be actual |U32| codepoints without encoding? Or will
the actual underlying data remain encoded? This RFC should also state
why this approach was chosen and the other ones dicarded.
The internal representation of String will remain as it is now except
that the pointer will always reference a UTF-8 encoding of the strings
codepoints. Today, there is no such certainty.
------------------------------------------------------------------------
In text/0000-unicode-string.md
<#179 (comment)>:
> +
+The following additional changes will be made to the String class:
+1. The size() function will return the number of unicode codepoints in the string. A new byte_size() function will return the number of bytes.
+1. The truncate() function will only support len parameter values less than the string size.
+1. The utf32() function will be removed. It is redundant, and returns pair that includes a byte count that is no longer needed.
+1. The insert_byte() function will be changed to insert_utf32()
+1. The values() function will return an iterator over the string codepoints. Same as runes().
+1. A concat_bytes() function will be added to add a sequence of codepoints to the string from an iterator of bytes.
+
+Add traits Encoder and Decoder to the builtin package. Any function that produces a String from bytes, or produces bytes from a String must take an Encoder or Decoder as a parameter as is appropriate.
+```
+trait val Encoder
+ fun encode(codepoint: U32): (USize, U8, U8, U8, U8)
+
+trait val Decoder
+ fun decode(bytes: U32): (U32, U8)
What is the meaning of the elements of the returned tuple? Should this
function be partial considering that not all byte-combinations are
valid in every encoding?
The tuple returned by the encode function contains the number of bytes
required to encode the codepoint (1-4), and the byte values of the
encoded codepoint. Perhaps it would be more efficient to compact the 4
bytes in a single U32. That is what is done with decode. In the case of
bytes that do not correctly decode into codepoints, the unicode
replacement character (0xFFFD) is inserted for each invalid byte
consumed. This is how the existing UTF-8 decoding function worked, and I
think it is the best approach. Without it, many String constructors
become partial, and that is a big backward compatibility issue.
------------------------------------------------------------------------
In text/0000-unicode-string.md
<#179 (comment)>:
> +
+The following additional changes will be made to the String class:
+1. The size() function will return the number of unicode codepoints in the string. A new byte_size() function will return the number of bytes.
+1. The truncate() function will only support len parameter values less than the string size.
+1. The utf32() function will be removed. It is redundant, and returns pair that includes a byte count that is no longer needed.
+1. The insert_byte() function will be changed to insert_utf32()
+1. The values() function will return an iterator over the string codepoints. Same as runes().
+1. A concat_bytes() function will be added to add a sequence of codepoints to the string from an iterator of bytes.
+
+Add traits Encoder and Decoder to the builtin package. Any function that produces a String from bytes, or produces bytes from a String must take an Encoder or Decoder as a parameter as is appropriate.
+```
+trait val Encoder
+ fun encode(codepoint: U32): (USize, U8, U8, U8, U8)
+
+trait val Decoder
+ fun decode(bytes: U32): (U32, U8)
Maybe a better signature/approach for encoding/decoding would be to
encode a whole String into an Array[U8] and vice versa for decoding.
This might seem more work on the implementor of |Encoder|/|Decoder|,
but might allow to employ more tricks and be slightly more
performant/convenient for some encodings. I also suspect, that
implementing a special |Encoder|/|Decoder| is fairly rare.
The Encoder/Decoder classes are used in a variety of situations. When
reading from a file or a socket, the bytes read might not decode to a
characters completely. An additional few bytes from the next read might
be required. This approach deals with that situation with reasonable
ease. Yes, I don't expect that there will be very many Encoder/Decoder's.
…
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#179 (review)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AGEQLV644N6VX2AEHQSKKHTR64MK3ANCNFSM4PS4AMOQ>.
|
You don't necessarily need to make a pull request - you can just make sure your branch is public on GitHub in your fork of the ponyc repo, then share the link here so we can go inspect your branch as we read the RFC. |
Regarding some of your other responses - I'm a bit confused currently, but I'll wait and look at your sample implementation so I can be sure I understand your comments correctly before I respond. |
Thanks Joe
I will distribute the link as soon as I can. My power is out due to the
storm on the east coast, so it may take me a while. Also it looks like am
going to miss the weekly sync call.
…On Tue, Aug 4, 2020, 1:02 PM Joe Eli McIlvain ***@***.***> wrote:
I agree that a sample implementation would help. I have 70% of an
implementation done. How I go about sharing that with the group? Create
a pull request?
You don't necessarily need to make a pull request - you can just make sure
your branch is public on GitHub in your fork of the ponyc repo, then share
the link here so we can go inspect your branch as we read the RFC.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#179 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AGEQLV3E73QK6M4F5E3KZRDR7A5MFANCNFSM4PS4AMOQ>
.I
|
I'm worried about the encoding, and it wasn't clear to me if the proposal would leave that unchanged and merely implement the Seq trait as though it were a sequence of U32 elements, or make the representation more pervasive. With a representation as a subset of byte sequences we can have |
Hi All,
My house has finally gotten electric power back, and I have been able to
fork the ponyc repo and push a branch with my string class changes. The
URL for the branch is: https://github.com/rowland66/ponyc/tree/string
I hope that this make some of the items that we have been discussing
clearer. I am new to GitHub, so if you have any trouble with access,
please let me know.
*Rowland E. Smith*
+1-201-396-3842
…On 8/4/20 1:02 PM, Joe Eli McIlvain wrote:
I agree that a sample implementation would help. I have 70% of an
implementation done. How I go about sharing that with the group?
Create
a pull request?
You don't necessarily need to make a pull request - you can just make
sure your branch is public on GitHub in your fork of the ponyc repo,
then share the link here so we can go inspect your branch as we read
the RFC.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#179 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AGEQLV3E73QK6M4F5E3KZRDR7A5MFANCNFSM4PS4AMOQ>.
|
Rowland has updated the RFC in response to comments. @jemc @jasoncarr0 @mfelsche can you have another look? |
Some notes from the post-sync-call on suggestions to improve the performance and flexibility of various use cases, based on some limited review of Rowland's branch so far:
|
Though, we should perhaps verify that there is a runtime cost being paid here without the generics. I suspect given the size and structure of the method that LLVM would have an easy time eliminating it. |
Hi Joe and all,
I have added comments below.
On Tue, Aug 18, 2020 at 3:40 PM Joe Eli McIlvain ***@***.***> wrote:
Some notes from the post-sync-call on suggestions to improve the
performance and flexibility of various use cases, based on some limited
review of Rowland's branch so far:
-
add embed _array: Array[U8] as the field within String, instead of the
three fields we have now that happen to be the same as Array[U8]
- this makes it so that string to array conversions can be zero
allocation - every string already contains an array that it can share as a
box/val. See the next bullet point for more info;
[RS] I have no objection to doing this in principal. It would be a lot of
work, and is not related to the changes that I am proposing. Does anyone
know why String is not implemented this way currently? I am just concerned
that there might be some subtle reason why it won't work.
- fun val array: Array[U8] val become fun array: this->Array[U8] box
-
- it is a superset of what we have now, and allows a readable "byte
string" reference to a String ref, rather than requiring val.
[RS] OK. Again, this is just an improvement to String that could be
implemented independently of this RFC, right?
-
add whatever methods we need to add Array[U8] that are restricted by
A:U8 (e.g. read_u8) to make Array[U8] have everything that a "byte string"
class needs
- this will mitigate the usability issues that will result from
bytewise operations disappearing from String.
[RS] I like this idea. This way, it would not be necessary to convert a
byte array to a String just to get access to some String functions. I will
look at both classes and figure out what is missing.
…
- encoder as a type parameter rather than a runtime parameter
-
- also make the method use iftype E <: UTF8StringEncoder instead of if
encoder is UTF8StringEncoder
- these together make it so that there is zero cost for detecting
the encoder at runtime - it is known at compile time.
[RS] I don't understand this. Can you provide a more complete example?
- —
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#179 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AGEQLVZFBZXHVKO3JOFVYZTSBLKKBANCNFSM4PS4AMOQ>
.
--
*Rowland E. Smith*
P: (862) 260-4163
M: (201) 396-3842
|
@rowland66 The reasons that these changes are being mentioned now is because of the change to byte-wise API in the String. Efficient access to the underlying array and byte array methods is one way to allow us to access these lost APIs more effectively, without compromising on performance. It is true that I certainly don't know why the existing String has not been implemented that way. Hence why these suggestions are intended to be part of this RFC (but note that implementation details such as the embed could be implemented in multiple PRs). Anyone else feel free to chime in if I'm not understanding correctly. |
|
||
# Detailed design | ||
|
||
Change the builtin String class to present of sequence of unicode codepoints as 4 byte numbers (U32). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would make sense to mention endianness here (which I assume will be the one of the machine for which the application is compiled)
fun values(): Iterator[this->String box] | ||
``` | ||
|
||
Change Reader in buffered/reader.pony to add functions to read a codepoint and to read a String of a given number of codepoints. Update function line() to accept a decoder, and to return a pair with the line string, and the number of bytes consumed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's worth clarifying here if only \r\n
and \n
will be considered line breaks or if support for unicode line breaks (like U+2028, U+2029) will be added
|
||
Change Writer in buffered/writer.pony to accept StringEncoder parameters in the write() and writev() functions. | ||
|
||
Add a FileCharacters class in files/file_characters.pony that provides an iterator of characters in a file. The implementation will be similar to the FileLines class in files/file_lines.pony. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the Unicode standard does not define what a character is, I think it's worth claryfing if it means codepoints, grapheme clusters or something else
Hi Agares and all,
Thank you for the feedback. I have added comments below, and will make
changes to the RFC document.
On Tue, Oct 20, 2020 at 8:25 AM agares ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In text/0000-unicode-string.md
<#179 (comment)>:
> +- Feature Name: unicode-string
+- Start Date: 2020-07-30
+- RFC PR: (leave this empty)
+- Pony Issue: (leave this empty)
+
+# Summary
+
+Change the builtin String class to present a sequence of unicode codepoints as 4 byte numbers (U32) instead of a sequence of bytes. All conversions from a sequence of bytes to a String, or from a String to a sequence of bytes will require specifying a decoding or encoding respectively. The default encoding will be UTF-8, and Strings will be represented internally as a sequence of UTF-8 encoded codepoints. Provide encoders/ decoders for UTF-8, UTF-16, UTF-32, ASCII and ISO-8859-1 as part of the stdlib. Change the character literal to represent a single unicode codepoint and produce a UTF-32 codepoint value.
+
+# Motivation
+
+Unicode is an accepted standard for representing text in all languages used on planet earth. Modern programming languages should have first class support for unicode.
+
+# Detailed design
+
+Change the builtin String class to present of sequence of unicode codepoints as 4 byte numbers (U32).
I think it would make sense to mention endianness here (which I assume
will be the one of the machine for which the application is compiled)
For the provided encoders, I have UTF-16BE, UTF-16LE, UTF-32BE and
UTF-32LE. As I understand it, UTF-8 has no concept of endianness. I will
update the text above to indicate the additional provided encoders. Does
this cover your concern?
------------------------------
In text/0000-unicode-string.md
<#179 (comment)>:
> +The ByteSeq type defined in std_stream.pony will be changed to remove String.
+```
+type ByteSeq is (Array[U8] val)
+```
+Many functions that accept ByteSeq as a parameter will be changed to accept (String | ByteSeq) as a parameter.
+
+A new StringIter interface will be added to std_stream.pony
+```
+interface val StringIter
+ """
+ An iterable collection of String box.
+ """
+ fun values(): Iterator[this->String box]
+```
+
+Change Reader in buffered/reader.pony to add functions to read a codepoint and to read a String of a given number of codepoints. Update function line() to accept a decoder, and to return a pair with the line string, and the number of bytes consumed.
I think it's worth clarifying here if only \r\n and \n will be considered
line breaks or if support for unicode line breaks (like U+2028, U+2029)
will be added
I will add this clarification. Only \r\n and \n will be considered line
breaks. No change to what is done today. As per wikipedia
https://en.wikipedia.org/wiki/Newline "Recognizing and using the newline
codes greater than 0x7F (NEL, LS and PS) is not often done."
In text/0000-unicode-string.md
<#179 (comment)>:
> +Many functions that accept ByteSeq as a parameter will be changed to accept (String | ByteSeq) as a parameter.
+
+A new StringIter interface will be added to std_stream.pony
+```
+interface val StringIter
+ """
+ An iterable collection of String box.
+ """
+ fun values(): Iterator[this->String box]
+```
+
+Change Reader in buffered/reader.pony to add functions to read a codepoint and to read a String of a given number of codepoints. Update function line() to accept a decoder, and to return a pair with the line string, and the number of bytes consumed.
+
+Change Writer in buffered/writer.pony to accept StringEncoder parameters in the write() and writev() functions.
+
+Add a FileCharacters class in files/file_characters.pony that provides an iterator of characters in a file. The implementation will be similar to the FileLines class in files/file_lines.pony.
Since the Unicode standard does not define what a character is, I think
it's worth claryfing if it means codepoints, grapheme clusters or
something else
Agreed. I will change the FileCharacters class to FileCodepoints.
… —
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#179 (review)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AGEQLVYAKMNI5CLAKNDENVLSLV6UTANCNFSM4PS4AMOQ>
.
--
*Rowland E. Smith*
P: (862) 260-4163
M: (201) 396-3842
|
I'd like to leave a note here that, after looking over the PR (which has not yet implemented this portion), I believe my suggested change for the capability of the array() method causes very confusing behavior This can be fixed by using an embedded array, or two separate methods with restricted types and necessary warnings about confusing behavior. |
We talked about this over Sync. The plan we went forward with is keeping the The low-level access can be provided by a method with the suggested type signature: The documentation for the method will indicate that the array will track writes until the String's buffer is reallocated due to length changes. |
Hi Jason,
I have implemented this with this function.
fun current_byte_buffer(): Array[U8] =>
Array[U8].from_cpointer(_ptr._unsafe(), _size, _alloc)
I did not understand what "this->Array[U8] box" would mean. As I understand
it, "this->" mean return the reference capability of the receiver, but then
why say box. I think that we want Array[U8] box, but since the default
reference capability for a function is box, so the function will return a
box also. This code seems to work. I tried it with a val, trn and ref
String. If this makes sense, I will add some tests and commit the change.
Also, how does Pony garbage collection work for this type of thing. I seems
that we now have 2 references to the same memory. How does the Pony runtime
know than and not collect the underlying array bytes while they are still
referenced?
…On Tue, Dec 8, 2020 at 2:26 PM Jason Carr ***@***.***> wrote:
We talked about this over Sync. The plan we went forward with is keeping
the array method as val only. This will continue the behavior of not
copying for UTF8Encoder
The low-level access can be provided by a method with the suggested type
signature:
fun current_byte_buffer(): this->Array[U8] box
which always constructs the underlying byte array without regard to
encoding.
The documentation for the method will indicate that the array will track
updates until the String's buffer is reallocated.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#179 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AGEQLVZMTE2BHJRQRVFE5ETSTZ4VRANCNFSM4PS4AMOQ>
.
--
*Rowland E. Smith*
P: (862) 260-4163
M: (201) 396-3842
|
In this case we want to ensure regardless of the receiver that it is read-only (so as to not break the string invariants), but we can be a bit stronger for
As I understand it, the allocation as a whole is still garbage collected (and it needs to be read by the garbage collector for general arrays) and deallocation doesn't occur until neither the String nor Array are pointing to it. |
Hi Jason,
I have been struggling with this with no success. I can write three
separate functions and it compiles:
fun ref current_byte_buffer_ref(): Array[U8] ref =>
Array[U8].from_cpointer(_ptr._unsafe(), _size, _alloc)
fun box current_byte_buffer_box(): Array[U8] box =>
Array[U8].from_cpointer(_ptr._unsafe(), _size, _alloc)
fun val current_byte_buffer_val(): Array[U8] val =>
recover val Array[U8].from_cpointer(_ptr._unsafe(), _size, _alloc) end
I can see how current_byte_buffer_ref() and current_byte_buffer_box() could
be combined as a single function that returned this->Array[U8] box. But I
can't figure out how to handle the val case, and since a box function works
with a ref, val and box receiver I need to handle all three. What am I
missing?
…On Tue, Dec 8, 2020 at 11:44 PM Jason Carr ***@***.***> wrote:
I did not understand what "this->Array[U8] box" would mean
this->Array[U8] box is valid syntax for the return type.
What that means can be factored into
this->(Array[U8] box)
i.e. it's the type that the current receiver views box as. Specifically
this means that if the receiver is box or ref, it is indeed just box. But
if the receiver is val then it means val. This is still useful given the
other array method as it doesn't copy for any encoding.
Also, how does Pony garbage collection work for this type of thing. I seems
that we now have 2 references to the same memory. How does the Pony runtime
know than and not collect the underlying array bytes while they are still
referenced?
As I understand it, the allocation as a whole is still garbage collected
(and it needs to be read by the garbage collector for general arrays) and
deallocation doesn't occur until neither the String nor Array are pointing
to it.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#179 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AGEQLV2T5RF2IYXVHNMWFUDST36ENANCNFSM4PS4AMOQ>
.
--
*Rowland E. Smith*
P: (862) 260-4163
M: (201) 396-3842
|
In this case you can use a bare fun current_byte_buffer(): this->Array[U8] box =>
recover Array[U8].from_cpointer(_ptr._unsafe(), _size, _alloc) end Strictly speaking you should also be able to recover to val and cover it, but it's not actually "correct" for most of the cases |
Finally got my ponyc env working again. The field access checks for recover blocks are still a tiny bit too strict, they should drop down to tag instead of yelling in this case, so you actually need to do: fun current_byte_buffer(): this->Array[U8] box =>
let ptr = _ptr
recover Array[U8].from_cpointer(ptr._unsafe(), _size, _alloc) end But this code builds for me |
Hello, just realized this PR existed and am going to carry my feedback over from the issue I originally commented on. I provided a quick overview of some differences between what was suggested there and what is done in Rust and the benefits of those design choices. My understanding is that this RFC encodes all strings as UTF-8 byte arrays and only allows users to operate on them using their character indices. Assuming that is correct and all reads and mutations of strings are performed on the byte representation, then I am concerned there will be issues with offering efficient string operations. Specifically because on an arbitrary UTF-8 byte sequence computing the byte index from the character index is What is the intended approach of this RFC and how does it avoid or mitigate this issue? |
Hi Kyle, Thank you for taking the time to look at this. I am the author of the RFC, and have worked on the implementation. As you mention, the biggest down side to using a variable size encoding for unicode is that index operations are slow, and the implementation of this RFC has no particularly good solution. Pony does have an Array class that has many of the same functions as String, so for performance critical code, one option is to use Array's to hold String bytes. Also, some of the String functions take positive and negative index values. Negative indexes are relative to the end of the String, so in some cases index access using a negative index would be much faster. In how different programming languages deal with Strings seems to be a tradeoff between speed, efficiency and correctness. As I understand it, Rust strings are UTF-8 encoded internally, but there are functions to access the underlying bytes, but they are prone to panics if used incorrectly. So Rust seems to be leaning toward correctness, but provides a loaded gun in case you want to take it and shoot yourself. Phython seems to be favoring efficiency and speed with the ability to choose between 1, 2 and 4 bytes per character strings. I assume that adding certain unicode characters to a 1 or 2 byte per character string results in some type of runtime error. So it is easier to write programs that fail at runtime unless you use 4 character byte strings, but that is pretty inefficient in most cases. The Pony implemenation in this RFC favors correctness (number 1 on the list in the Pony Philosophy). In the RFC design, we have tried to leave room for alternate String implementations that might be better suited for some situations. The new String design also tries to remain as comaptible as possible with the existing Pony String class. The existing Pony String has a mix of functions some of which assume 1 byte per character and others that support a UTF-8 encoding. This was a confusing API that left a lot of room for programming error. Hope this makes sense, and I am happy to discuss further if you have some ideas for improvement. |
Hi Rowland, Thanks for getting back to me quickly, I have a few specific comments below about things you mentioned and have tried to better outline which Rust features would best inform a Pony Unicode String implementation. RE: Negative IndicesThat is an interesting point about negative indices. If reverse unicode segmentation is used, you can potentially traverse fewer bytes. RE: Rust
This is technically true of the Rust design, but there are panic-free options available and the ones that can panic are meant to be used with something like Additionally, the fact that it does allow RE: Python
This is not correct because strings in python are immutable and you cannot create such a scenario. Python can more correctly be said to emphasize correctness over speed/efficiency in the same way that concatenating Java, JavaScript, or C# strings directly does. i.e. Python is in the "shared immutable strings" family and is optimized for that case What would it look like to "follow Rust's example"?Where Python offers shared immutable strings and is a good example for how to do so, Rust offers owned mutable append-only strings with an efficient and clean (ignoring the aforementioned) API. In Pony terms, as I understand them, implementing an approach similar to Rust's would involve the following characteristics.
|
@Kylebrown9 - if I understand your proposal correctly, it sounds like you are recommending something like: Keep byte-based operations as the first-class operations, but also have character-based operations. This seems similar to what we have in Pony today (perhaps with the differences of: preference for UTF8 rather than UTF32 and perhaps with support for more character operations), so I'd be curious to hear you compare your suggested approach to Pony's current approach, rather than the approach proposed in this PR. |
In Rust, making a string from bytes is effectively partial because not all byte sequences are valid UTF-8 strings. As a change to Pony today
Edit: If this was going to follow the Rust approach, then a different string type(s) could/would be added to handle platform specific strings (see |
Also, I realize I've ended up proposing and defending a specific prescription for how to do strings in Pony. The language comparisons and proposal outlines I've made are just intended to demonstrate what |
For byte reading in this RFC, we have the |
The only two things I would ask of a byte-based method on a string is that
There is an argument to be made that some methods might confuse users or encourage them to ignore the fact Note: I don't consider a method to be "byte-based" if it does not take in or return U8 values, even if it uses byte-indices. Edit: Forgot to directly answer your question. What you described meets the requirements I would set for it and sounds fine but others may have opinions on those choices. |
Hi Kyle, There is a lot to digest here. Let me start with your list from above of String features in Rust and how the compare to Pony post adoption of this RFC.
Same
Same. Pony has no Character type, but String functions return unicode codepoints as U32
Same
No. As mentioned above, String has an array function that is very efficient assuming you want the String bytes UTF-8 encoded. However, the resulting Array is only safe to use if you know that your String contains only ASCII characters. The resulting Array is also a val, so it cannot be modified. Since the Arrays shares the memory used by the String, modifying the Array might corrupt the String.
This would add a bunch of partial functions to String which in undesireable. Byte index values are never safe to change (increment or decrement), so their value is limited to remembing a position in a String. They could be useful if you know that your String is ASCII, but then you can use the array function and work with the String as an array of bytes. There is no right or wrong answer here. Its a decision on whether adding functions that are likely to be misused resulting in runtime errors is worth some added utility provided by correct use of the functions.
There is already a byte_size() function that serves this purpose. Knowing a String size in codepoints is more often what is needed, so I think that the size() behavior is correct.
I think that it makes sense that String be a sequence of Characters, and Characters are no longer single bytes. For applications that want to work with bytes, Pony has the Array class. I think that keeping String's and Arrays of bytes separate and distinct results in less error prone code and I value that highly. Something worth considering is whether Pony should support alternate internal presentations of String data. The intent is to leave that possibility open for the future. Another thing to consider is optimizations to the current String class. For example, a String could maintain a boolean indicator of whether it contains any multi-byte encoded characters. In cases where it does not, character index based operations could be much faster. This does add additional data to every String instance, so such a feature would need to be carefully considered. |
Hi Rowland, I think so far it sounds like we're both in agreement about my priority A that It seems like we disagree about priority B, that it should use "indexing which matches the stored representation". 1. Principle of Least SurprisePopular programming languages usually offer constant time indexing for strings and I believe users now expect this. 2. PerformanceI think there are also other use cases where code that could have worked well on the existing string design or other options we have now will be meaningfully less performant. |
Honestly for 1. I'd be surprised to see constant-time indexing for anything other than ascii. |
@Kylebrown9 For 1, I agree with @jasoncarr0 why would you expect constant-time indexing for variable size elements. Also there is another way to look at principal of least surprise. I should not be surprised when my code fails because it encounters strings containing multi-byte characters. Having more functions in String that use byte indexes increases the likelyhood of these unpleasant surprises. For 2, I agree that there is a performance pennalty in some cases for using Strings. In cases where the pennalty is large (a compiler tokenizing text) there is always the option to use Array. |
How many average programming language users are thinking actively about the fact that strings are variable length when using a unicode API that gives character indices? I think they'll tend to expect indexing into this As a simple example program that I think will surprise people, imagine writing a palindrome checker. |
Here's a table I'm putting together to summarize different language's design choices on this matter.
|
Ah, so you're specifically talking about O(1) byte-indexing. I'm more thinking character indexing, and hence I would be surprised to see it be O(1), at least with good character support. Could you help explain the use case for byte-indexing, particularly if there's something that's not covered by substrings? Or O(1) character-indexing? |
Among those examples there are some that index by character (e.g. Python) and some that index by byte (e.g. Rust). Most cases for byte-indexing (e.g. compiler lexers & error reporting and text editor piece tables) can probably be replaced by allocating a new substring if the cost of doing so is acceptable. I think that offering (I've spent quite a lot of time writing these responses and will be dialing that back, but I can clarify things if needed) |
Got it, thanks for the thoughtfulness @Kylebrown9 |
One more datapoint - The BEAM (erlang, elixir, lfe et al) uses utf-8 encoded binaries as the native string datatype. There are multiple other formats supported for external compatibility - but that is its primary. All the string operations work on graphemes, not bytes. |
(I'm also going to demonstrate my lack of CS education and admit I don't understand how you can have an algorithm which can identify say the 10,000th character in a string with O(1) performance when you have the variable width encoding that utf8 provides) |
The only examples on my list above that uses UTF-8 and offers |
I think the only other thing I think I should mention should be in the "How We Teach" section. Since we're counting and indexing on codepoints and not graphemes we should probably mention that it is possible to get legal codepoints out of this interface which are illegal graphemes. For example, in a grapheme-based implementation:
ie: "Man" + "ZWJ" + "Red Hair" In this RFC implementation, the length would be 3 (as there are three codepoints). Without documenting it - that's arguably unexpected behaviour because most (or many) developers don't know that there isn't a 1:1 mapping between codepoints and characters. Having the interface be grapheme-based would end up having our functions that return single characters have to return Array[U8] again instead of U32. A significantly higher lift for the 99% use-case. |
RFC for changes to String class to better support unicode.