-
Notifications
You must be signed in to change notification settings - Fork 338
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Emit warnings when dumping binary strings #643
Merged
Merged
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is #138. I'm having second thoughts about deprecating / removing it, as I fear lots of people depend on it, and looking at other parsers they all seem to accept it. So maybe we should just support it in the Java parser?
WDYT @headius @eregon ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, I commented too early and thought this was a YAML issue.
After reexamining the code, it does appear that the json implementation for JRuby works directly with bytes via the ragel parser. It would probably be possible to make it accept invalid characters.
My position remains the same, however. I'm not sure what other parsers you tested, but all json specifications I looked at explicitly require content to be in one of the main unicode UTF encodings. That says to me that providing invalid characters is off spec and should produce an error.
The original RFC from 2006 even says:
https://www.ietf.org/rfc/rfc4627.txt
The character stream cannot be parsed as Unicode, then it is not up to spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oj
and Python's stdlib.I totally get your argument, and generally agree, the concern is really about breaking existing (misbehaving) code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I'm crazy, but I would break it in a minor release so it's easy to revert with another minor release, rather than break it in a major release and have to go back. I doubt the impact will be large. If we put out a minor release that starts erroring on bad input and it doesn't get reported for a few months, it's probably fine.
I think it's also important to remember that Ruby has become much more strict about character encodings over the years so the chance of getting bad content has reduced significantly since the original issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A common case where this still happens a lot today is JSON POST requests handled by Rack.
The Rack spec clearly state the body should be
ASCII_8BIT
, so if you do the commonJSON.parse(request.body)
you hit this case.That's why I'm worried about breaking this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that's a decent point.
Just yesterday someone at work told me they had an outage because they have a JSON based logger and were logging binary strings (some avro stuff or whatever).
So perhaps you are right and should just keep this behavior, unless
strict: true
is set.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
json/lib/json/pure/generator.rb
Lines 227 to 228 in 6d3b3ac
I think even strict: true should accept it given the documented meaning of it.
It could be another option, but not sure there is any value to prohibit broken strings if anyway it works fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you be willing to make a PR with a test of what you have in mind? I can make it pass, just want to be 100% sure we're talking about the same thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I'm thinking is:
And that passes on master, because of the
rb_str_conv_enc()
below which if there is an error during conversion returns the source string as-is (https://github.com/ruby/ruby/blob/22abcce704cf3d82aaa9af5b8ae4e6bf628502ea/spec/ruby/optional/capi/string_spec.rb#L884). And I guess the C extension doesn't really care whether the input String is internally in UTF-8 or BINARY and just works on bytes (although that could lead to subtle bugs as e.g.ENC_CODERANGE_VALID
means something different for both encodings).If I replace the
rb_str_conv_enc()
withreturn rb_funcall(source, rb_intern("encode"), 1, rb_const_get(rb_path2class("Encoding"), rb_intern("UTF_8")));
then I getEncoding::UndefinedConversionError
.If on top of that I undo the change in this file then it works again.
BTW, the pure-Ruby parser actually uses the input String as BINARY internally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#665