Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: url_decode function #2957

Closed
wants to merge 3 commits into from
Closed

Conversation

rodcoffani
Copy link

@rodcoffani rodcoffani commented Nov 30, 2023

url_decode function

Creates:

  • url_decode
  • url_encode (just to keep the names similar)
  • tests for both of them.

The code for this functions were based of Rosetta Code, like suggested by others maintainers.

Motivation

There is some issues related and asking for this implementation, for example: #2261 and even older ones like #798 , closing this issues.

@emanuele6
Copy link
Member

emanuele6 commented Nov 30, 2023

This implementation is just not correct.
You are not converting bytes back to codepoints; you are just using bytes as if they were codepoints:

$ ./jq -na '"è" | @uri | ., url_decode'
"%C3%A8"
"\u00c3\u00a8"
$ ./jq -na '"è" | ., (@uri | ., (url_decode | ., (@uri | ., url_decode)))'
"\u00e8"
"%C3%A8"
"\u00c3\u00a8"
"%C3%83%C2%A8"
"\u00c3\u0083\u00c2\u00a8"

The correct result should have been an è/\u00e8

$ jq -na '"è"'
"\u00e8"
"\u00e8"
"%C3%A8"
"\u00e8"
"%C3%A8"
"\u00e8"

@rodcoffani
Copy link
Author

This implementation is just not correct. You are not converting bytes back to codepoints; you are just using bytes as if they were codepoints:

$ ./jq -na '"è" | @uri | ., url_decode'
"%C3%A8"
"\u00c3\u00a8"
$ ./jq -na '"è" | ., (@uri | ., (url_decode | ., (@uri | ., url_decode)))'
"\u00e8"
"%C3%A8"
"\u00c3\u00a8"
"%C3%83%C2%A8"
"\u00c3\u0083\u00c2\u00a8"

The correct result should have been an è/\u00e8

$ jq -na '"è"'
"\u00e8"
"\u00e8"
"%C3%A8"
"\u00e8"
"%C3%A8"
"\u00e8"

You were right! Sorry, we didn't catch those cases in the first implementation. Analyzing the issues we commented (especially #798 ), there were a function that matches exactly with your test cases and even with emotes. (This one

Example:
image

@rodcoffani
Copy link
Author

Showed up an failed test at the checks, but as it is only in one check and it is labeled as "disabled", I don't believe it has to do with this new feature.

@emanuele6
Copy link
Member

emanuele6 commented Nov 30, 2023

You have added those tests to tests/jq.test instead of tests/onig.test, but your implementations requires regular expression support (provided by liboniguruma) for gsub/2, so they fail in the build without liboniguruma.

@wader
Copy link
Member

wader commented Nov 30, 2023

#2261 and #798 is about doing this is a format and also gojq implemented this as @urid, think that would be better. Also would be great to be consistent with uri vs url

@itchyny
Copy link
Contributor

itchyny commented Nov 30, 2023

Why not just adding @urid?

@emanuele6
Copy link
Member

emanuele6 commented Nov 30, 2023

URL decoding a byte sequence that only contains non-utf-8 bytes triggers an error:

$ ./jq -na '"%ff%fa" | url_decode'
jq: error (at <unknown>): null (null) and number (128) cannot be subtracted

URL decoding a byte sequence that contains non-utf-8 bytes, but contains a valid utf-8 subsequences, results in a single \ufffd.

$ ./jq -na '"%ff%c3%a7%fa" | url_decode'
"\ufffd"

The correct behaviour would be that every non-utf-8 byte becomes \ufffd:

$ jq -Ra . <<< $'\xff\xfa'
"\ufffd\ufffd"
$ jq -Ra . <<< $'\xff\xc3\xa7\xfa'
"\ufffd\u00e7\ufffd"

@emanuele6
Copy link
Member

I've just noticed that the implementation is literally just entirely copied from this gist https://gist.github.com/jcracknell/52cf4e0f8c4518a853784638db8a258d

Come on now. :/

@emanuele6 emanuele6 closed this Nov 30, 2023
@rodcoffani
Copy link
Author

rodcoffani commented Dec 1, 2023

As the code was never merged to the main repo I thought would be a good addition to it, that's why I referenced the gist in my comment, sorry for not being more clear.

But thanks for the feedback! I read the comments and I would like to know what is the best approach here:

  • would be better to add only as @urid? or with the functions url_decode/url_encode AND with @urid?
  • fixing the code and implementing more accurate tests, it still would be a nice thing to add?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants