Text encode decode #1645

Draft: wants to merge 2 commits into base: master

Conversation

hhugo
Member

@hhugo hhugo commented Jul 26, 2024

No description provided.

@hhugo
Member Author

hhugo commented Jul 26, 2024

I'd like to benchmark this change

@vouillon
Member

vouillon commented Aug 1, 2024

One should create the encoder and the decoder only once and reuse them, and use a fixed buffer for caml_jsstring_of_string.
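
A minimal sketch of what this suggestion could look like, not the code from this PR. It assumes the OCaml string arrives as a JS "byte string" (one char code in 0..255 per byte); the names `decodeUtf8ByteString`, `encodeUtf8ByteString` and `SCRATCH_SIZE`, and the buffer size itself, are hypothetical.

```javascript
// Hypothetical sketch: illustrative names, not the runtime's actual code.
const encoder = new TextEncoder(); // created once at load time, then reused
const decoder = new TextDecoder(); // created once at load time, then reused

// Fixed scratch buffer shared by all calls; the size is an arbitrary assumption.
const SCRATCH_SIZE = 4096;
const scratch = new Uint8Array(SCRATCH_SIZE);

// Decode a byte string (one char code 0..255 per byte) holding UTF-8 data.
function decodeUtf8ByteString(s) {
  // Small inputs reuse the shared buffer; larger ones get a temporary one.
  const buf = s.length <= SCRATCH_SIZE ? scratch : new Uint8Array(s.length);
  for (let i = 0; i < s.length; i++) buf[i] = s.charCodeAt(i);
  return decoder.decode(buf.subarray(0, s.length));
}

// Encode a JS (UTF-16) string into a UTF-8 byte string.
function encodeUtf8ByteString(s) {
  // One UTF-16 code unit needs at most 3 UTF-8 bytes.
  const needed = s.length * 3;
  const buf = needed <= SCRATCH_SIZE ? scratch : new Uint8Array(needed);
  const { written } = encoder.encodeInto(s, buf);
  let out = "";
  for (let i = 0; i < written; i++) out += String.fromCharCode(buf[i]);
  return out;
}
```

Reusing the shared buffer is safe here because `decode` returns a fresh string and the encoded bytes are copied out before the function returns.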

@hhugo
Member Author

hhugo commented Aug 2, 2024

> One should create the encoder and the decoder only once and reuse them, and use a fixed buffer for caml_jsstring_of_string.

Done

@hhugo
Member Author

hhugo commented Aug 30, 2024

@vouillon, any idea how to benchmark this other than with micro benchmarks?

@hhugo hhugo force-pushed the text-encode-decode branch 2 times, most recently from 9fea72f to 5499d77 Compare October 9, 2024 19:47
@hhugo hhugo force-pushed the text-encode-decode branch from 5499d77 to 0e9dcba Compare October 16, 2024 08:10
@hhugo hhugo force-pushed the text-encode-decode branch from 0e9dcba to 7375672 Compare October 23, 2024 21:12
@hhugo
Member Author

hhugo commented Oct 24, 2024

A quick micro benchmark shows a significant slowdown.
master

n: 3, 0.167000
n: 7, 0.268000
n: 10, 0.371000
n: 20, 0.811000
n: 50, 1.989000
n: 100, 3.618000

this PR

n: 3, 1.209000
n: 7, 1.269000
n: 10, 1.354000
n: 20, 1.660000
n: 50, 2.632000
n: 100, 4.578000
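
One way to read the two lists together is to compute, for each n, the ratio and the absolute difference between them. The sketch below only reuses the numbers posted above (their unit is not stated); the variable names are illustrative. The ratio goes from about 7x at n = 3 down to about 1.3x at n = 100, while the difference stays between roughly 0.6 and 1.05, which might point to a mostly fixed overhead rather than a cost that grows with n. That reading is an interpretation, not something measured separately.

```javascript
// Timings copied from the two lists above; the unit is not stated there.
const sizes = [3, 7, 10, 20, 50, 100];
const master = [0.167, 0.268, 0.371, 0.811, 1.989, 3.618];
const thisPr = [1.209, 1.269, 1.354, 1.66, 2.632, 4.578];

// For each n: how many times slower the PR is, and by how much in absolute terms.
const rows = sizes.map((n, i) => ({
  n,
  ratio: thisPr[i] / master[i],
  delta: thisPr[i] - master[i],
}));

for (const r of rows) {
  console.log(`n: ${r.n}, ratio: ${r.ratio.toFixed(2)}, delta: ${r.delta.toFixed(3)}`);
}
```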

@hhugo hhugo marked this pull request as draft October 24, 2024 07:19
@hhugo hhugo added the runtime label Nov 25, 2024
@Copilot Copilot AI left a comment

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (2)

runtime/js/mlBytes.js:642

  • The function caml_jsstring_of_string should handle non-ASCII strings correctly. The previous implementation used caml_utf16_of_utf8, which may handle edge cases differently than TextDecoder.
if (jsoo_is_ascii(s)) return s;

runtime/js/mlBytes.js:659

  • The function caml_string_of_jsstring should handle non-ASCII strings correctly. The previous implementation used caml_utf8_of_utf16, which may handle edge cases differently than TextEncoder.
if (jsoo_is_ascii(s)) return caml_string_of_jsbytes(s);
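
For reference, a short sketch of how the standard `TextDecoder` and `TextEncoder` behave with default options on the kinds of input mentioned above. How `caml_utf16_of_utf8` and `caml_utf8_of_utf16` treat the same inputs is not shown in this thread and would need to be checked against the runtime source.

```javascript
// Default options: fatal = false, ignoreBOM = false.
const decoder = new TextDecoder();
const encoder = new TextEncoder();

// An invalid UTF-8 byte is replaced by U+FFFD instead of raising an error.
const invalid = decoder.decode(new Uint8Array([0x61, 0xff, 0x62]));

// A leading UTF-8 byte order mark (EF BB BF) is dropped from the result.
const withBom = decoder.decode(new Uint8Array([0xef, 0xbb, 0xbf, 0x61]));

// A 4-byte sequence decodes to a surrogate pair (two UTF-16 code units).
const astral = decoder.decode(new Uint8Array([0xf0, 0x9f, 0x98, 0x80]));

// A lone surrogate cannot be encoded; it is emitted as U+FFFD (EF BF BD).
const lone = Array.from(encoder.encode("\ud800"));

console.log(JSON.stringify(invalid), JSON.stringify(withBom), astral.length, lone);
```

Passing `{ fatal: true }` or `{ ignoreBOM: true }` to the `TextDecoder` constructor changes the first two behaviours, so the choice of options matters when comparing against the previous implementation.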

var r = this.toString();
if (this.t === 9) return r;
return caml_utf16_of_utf8(r);
if (this.t === 9) return this.c;

Copilot AI Dec 28, 2024

The method toUtf16 should handle ASCII and non-ASCII cases consistently. The previous implementation used caml_utf16_of_utf8 for non-ASCII strings, which may handle surrogate pairs and invalid sequences differently than TextDecoder.

Suggested change
if (this.t === 9) return this.c;
return caml_utf16_of_utf8(r);
