Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

System.Convert.ToBase64String uses CR+LF on non-Windows platforms #32452

Open
perlun opened this issue Feb 17, 2020 · 6 comments
Open

System.Convert.ToBase64String uses CR+LF on non-Windows platforms #32452

perlun opened this issue Feb 17, 2020 · 6 comments
Labels
area-System.Runtime backlog-cleanup-candidate An inactive issue that has been marked for automated closure. bug no-recent-activity
Milestone

Comments

@perlun
Copy link
Contributor

perlun commented Feb 17, 2020

Hi,

While debugging a failing test in my code tonight, I was made aware of the utterly surprising fact that a call to Convert.ToBase64String(plainTextBytes, Base64FormattingOptions.InsertLineBreaks) will insert not just LF linefeeds but CR+LF, even though I am running my code on Linux.

This is definitely not what I would expect; I would have assumed that the class would use the newline format used by the platform in question (i.e. CR+LF on Windows, LF on Linux and macOS).

Here is the code in question (permalink):

for (i = offset; i < calcLength; i += 3)
{
if (insertLineBreaks)
{
if (charcount == base64LineBreakPosition)
{
outChars[j++] = '\r';
outChars[j++] = '\n';
charcount = 0;
}
charcount += 4;
}
outChars[j] = base64[(inData[i] & 0xfc) >> 2];
outChars[j + 1] = base64[((inData[i] & 0x03) << 4) | ((inData[i + 1] & 0xf0) >> 4)];
outChars[j + 2] = base64[((inData[i + 1] & 0x0f) << 2) | ((inData[i + 2] & 0xc0) >> 6)];
outChars[j + 3] = base64[inData[i + 2] & 0x3f];
j += 4;
}

I realize that changing this behavior at this stage would be a slightly breaking change, so it's not something to be taken lightly. Still, it seems like the only sensible thing to do since this definitely breaks the principle of least surprise. Please let me know what you think; I can probably manage to fix a PR if you are happy with the change per se.

(I will fix my failing test now and go to sleep. 😉)

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Feb 17, 2020
@filipnavara
Copy link
Member

It is documented behavior:

The Convert.ToBase64CharArray and Convert.ToBase64String methods convert the value of an array of 8-bit unsigned integers to an equivalent string representation that consists of base 64 digits. The string representation can contain one or more line breaks, where a line break is defined as a carriage return character (U+000D) followed by a line feed character (U+000A). Because line breaks are considered white-space characters in a base-64 encoding, they are ignored when converting a base-64 encoded string back to a byte array. The line breaks are simply convenient when displaying the encoded string to a control or a device such as a console window.

@perlun
Copy link
Contributor Author

perlun commented Feb 18, 2020

It is documented behavior:

Yes, I noted that. What I am questioning is whether this is the right, reasonable thing, though.

Case in point: Console.WriteLine honors the value of Console.Out.NewLine as the "current line terminator". Quoting from the docs (emphasis mine):

Writes the specified string value, followed by the current line terminator, to the standard output stream.

Console.Out.NewLine is initialized with \n (LF) only on Linux, just like you would expect it to be.

I would personally prefer for the framework to be consistent in its handling of newlines across all corefx classes. To me, the fact that Convert.ToBase64String uses "non-native" line terminators on LF-only platforms seems to be nothing more than an indication that the code hasn't been adjusted to be truly cross-platform. I might clearly be wrong on that one (it could very well be a conscious decision), and if the issue has been discussed before, please point me to that discussion (I searched the repo briefly before filing the issue).


If changing the current semantics would be considered a "too big" breaking change, I think the second-best thing would be to introduce a new enum value like Base64FormattingOptions.InsertPlatformLineBreaks or something like that, to get the semantics I am after: use CR+LF when running on Windows, use LF when running on LF-only platforms. If we go this route, we should probably add a forth enum member as well: Base64FormattingOptions.InsertUnixLineBreaks, to support all scenarios:

  • Base64FormattingOptions.InsertLineBreaks: always insert CR+LF line endings
  • Base64FormattingOptions.InsertUnixLineBreaks: always insert LF line endings
  • Base64FormattingOptions.InsertPlatformLineBreaks: use CR+LF on Windows, LF elsewhere

Thinking out loud, this would perhaps be the "best" option in a way. It adds a little bit of code complexity but OTOH, it makes the method much more flexible and users can choose the semantics that fit their particular use case best. Comments on this idea?

(Feel free to suggest better enum member names)

@filipnavara
Copy link
Member

My personal opinion is that Convert.ToBase64String is somewhat legacy API. It is intentionally not flexible and enforces the line breaks after 76 characters. It doesn't scale well to large text sizes. The limitation to CRLF and 76 characters closely matches the email RFC specifications where such use was common.

There's new high performance API in System.Buffers.Text.Base64 but AFAIK it doesn't do the convenient line breaking for you. Unfortunately the MSDN pages are down for me at the moment so I cannot consult the documentation.

@gfoidl
Copy link
Member

gfoidl commented Feb 18, 2020

but AFAIK it doesn't do the convenient line breaking for you.

System.Buffers.Text.Base64 doesn't insert any linebreaks.

@perlun with the point of view that the base64 encoded string is a kind of "contract" it makes sense to have CR + LF (note: it could have also be defined with just LF), so that decoding can be done with expected line endings on all platforms. See also Variants summary table. So according to the various RFCs about base64 the behavior is correct. Hence there is no need for Base64FormattingOptions.InsertPlatformLineBreaks, as that would violate the RFCs (and make decoding more difficult).

Also look at HTTP, where line endings are CR + LF independent of the platform. For historical and practical (parsing) reasons.
This has nothing to do with Console.WriteLine

prefer for the framework to be consistent

...with the RFCs / standards 😉

@perlun
Copy link
Contributor Author

perlun commented Feb 18, 2020

Interesting. You are indeed correct, the RFC:s speak about CR+LF like you suggest. In practice, I don't think I've ever seen a Unix-oriented tool generate base64 with CR+LF though, which is why I was (perhaps incorrectly) assuming that LF would be kosher to use here. A couple of examples:

GNU coreutils

Note the single dot character in the stream at the v.I position, hex character 0a on the left-hand view:

$ echo "hello world, hello world, hello world, hello world, hello world" | base64 | hexdump -C
00000000  61 47 56 73 62 47 38 67  64 32 39 79 62 47 51 73  |aGVsbG8gd29ybGQs|
00000010  49 47 68 6c 62 47 78 76  49 48 64 76 63 6d 78 6b  |IGhlbGxvIHdvcmxk|
00000020  4c 43 42 6f 5a 57 78 73  62 79 42 33 62 33 4a 73  |LCBoZWxsbyB3b3Js|
00000030  5a 43 77 67 61 47 56 73  62 47 38 67 64 32 39 79  |ZCwgaGVsbG8gd29y|
00000040  62 47 51 73 49 47 68 6c  62 47 78 76 0a 49 48 64  |bGQsIGhlbGxv.IHd|
00000050  76 63 6d 78 6b 43 67 3d  3d 0a                    |vcmxkCg==.|
0000005a
$ base64 --version 
base64 (GNU coreutils) 8.30
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Simon Josefsson.

MRI Ruby

Also outputs LF as line separator. Note that it splits the lines at 60 characters instead of 76. It claims to be compliant with RFC 2045, but I can only see CRLF mentioned in that RFC.

$ ruby -r base64 -e  "puts Base64.encode64('hello world, hello world, hello world, hello world, hello world');" | hexdump -C
00000000  61 47 56 73 62 47 38 67  64 32 39 79 62 47 51 73  |aGVsbG8gd29ybGQs|
00000010  49 47 68 6c 62 47 78 76  49 48 64 76 63 6d 78 6b  |IGhlbGxvIHdvcmxk|
00000020  4c 43 42 6f 5a 57 78 73  62 79 42 33 62 33 4a 73  |LCBoZWxsbyB3b3Js|
00000030  5a 43 77 67 61 47 56 73  62 47 38 67 0a 64 32 39  |ZCwgaGVsbG8g.d29|
00000040  79 62 47 51 73 49 47 68  6c 62 47 78 76 49 48 64  |ybGQsIGhlbGxvIHd|
00000050  76 63 6d 78 6b 0a                                 |vcmxk.|
00000056
$ ruby -v
ruby 2.5.5p157 (2019-03-15 revision 67260) [x86_64-linux-gnu]

Python 3

Also outputs LF when splitting lines in base64-encoded content (around the v\nI part in the printed data):

$ python3 -c "import base64; print(base64.encodebytes(b'hello world, hello world, hello world, hello world, hello world'));"
b'aGVsbG8gd29ybGQsIGhlbGxvIHdvcmxkLCBoZWxsbyB3b3JsZCwgaGVsbG8gd29ybGQsIGhlbGxv\nIHdvcmxk\n'
$ python3 --version
Python 3.7.3

To be honest, I challenge anyone reading this to find me one single Unix utility or programming language/platform (apart from .NET Core) that generates base64-content with CR+LF line endings. 🙂

I think it depends greatly on what the imagined use case is here. If it is for programmatically generating RFC-compliant content for inserting into a MIME-encoded email, sure; the current semantics are fine and valid. If it is for other use cases (i.e. printing the output to the console), producing something with extra characters isn't as useful.

Also look at HTTP, where line endings are CR + LF independent of the platform. For historical and practical (parsing) reasons.

Thanks, I must admit my ignorance, I wasn't actually aware of this (have obviously been doing too little wiresharking lately. 😉)

Bottom line/TL;DR:

Hence there is no need for Base64FormattingOptions.InsertPlatformLineBreaks, as that would violate the RFCs (and make decoding more difficult).

While I do agree that it can be seen as a violation of the RFCs, the de facto way tools work on Linux and other LF-only platforms is that they bend the RFCs to fit the platform's native world view better... 🙂

prefer for the framework to be consistent

...with the RFCs / standards wink

Being consistent with RFCs and standards is indeed important. What also makes sense is to carefully inspect how other tools & languages interpret those same RFCs. I don't think we should be too "wise in our own eyes", to use some biblical vocabulary. It might very well be better to assume all these other tools have good reasons for their choices and that we perhaps just haven't seen the light yet. 🙈

@joperezr joperezr added bug and removed untriaged New issue has not been triaged by the area owner labels Jul 7, 2020
@joperezr joperezr added this to the Future milestone Jul 7, 2020
Copy link
Contributor

Due to lack of recent activity, this issue has been marked as a candidate for backlog cleanup. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will undo this process.

This process is part of our issue cleanup automation.

@dotnet-policy-service dotnet-policy-service bot added backlog-cleanup-candidate An inactive issue that has been marked for automated closure. no-recent-activity labels Jan 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-System.Runtime backlog-cleanup-candidate An inactive issue that has been marked for automated closure. bug no-recent-activity
Projects
None yet
Development

No branches or pull requests

7 participants