
Speed up EnsoInputStream again #10515

Merged: 5 commits into develop, Jul 11, 2024
Conversation

@JaroslavTulach (Member) commented Jul 10, 2024

Pull Request Description

Fix #10503 by creating a benchmark and then speeding things up by making sure the usage of InteropLibrary remains in partially evaluated code and isn't hidden behind @TruffleBoundary.
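To illustrate the general pattern (a minimal sketch only, not the actual code from this PR; the class name, method names, and array-based access are assumptions), the difference is whether the InteropLibrary call is visible to Truffle's partial evaluator or cut off behind a boundary:

```java
import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary;
import com.oracle.truffle.api.interop.InteropLibrary;
import com.oracle.truffle.api.interop.InvalidArrayIndexException;
import com.oracle.truffle.api.interop.UnsupportedMessageException;

final class ReadPatternSketch {
    // Slow path: @TruffleBoundary stops partial evaluation here, so the
    // uncached (slow-path) interop dispatch runs on every single read.
    @TruffleBoundary
    static Object readBehindBoundary(Object vector, long index)
            throws UnsupportedMessageException, InvalidArrayIndexException {
        return InteropLibrary.getUncached().readArrayElement(vector, index);
    }

    // Fast path: no boundary; a cached InteropLibrary passed in by the
    // caller (e.g. obtained via @CachedLibrary in a Truffle node) stays
    // visible to the partial evaluator, which can inline and specialize it.
    static Object readInPartialEvaluation(InteropLibrary interop, Object vector, long index)
            throws UnsupportedMessageException, InvalidArrayIndexException {
        return interop.readArrayElement(vector, index);
    }
}
```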

Checklist

Please ensure that the following checklist has been satisfied before submitting the PR:

@JaroslavTulach added the 'CI: No changelog needed' label Jul 10, 2024
@JaroslavTulach requested a review from hubertp July 10, 2024 16:37
@JaroslavTulach self-assigned this Jul 10, 2024
@GregoryTravis (Contributor) commented:

What is the change that sped it up?

@JaroslavTulach (Member, Author) commented Jul 10, 2024

> What is the change that sped it up?

First of all, I wanted to run the benchmarks to find out how bad we are. The results are coming.

Now there is also a fix, 0accf83. With that one we are at around 90%-110% of the original speed (before the regression).

@JaroslavTulach changed the title from 'Speed up Stream_Utils.asInputStream' to 'Speed up EnsoInputStream again' Jul 10, 2024
@hubertp (Collaborator) left a comment

Tested on a project with a large CSV (77 MB) and it loaded almost instantaneously.

I still disagree with committing a test file to our repo, even if there is already an exact copy somewhere in it (it shouldn't be there in the first place, but that's another story). For this benchmark to be a bit more reliable, I would prefer a much larger file, either fetched from the internet or kept in a separate repo for such large test files.
I'm not going to block this PR over that, because it fixes a rather critical problem.

@jdunkerley (Member) left a comment

Just a question on a couple of methods.

@radeusgd (Member) commented:

> I would prefer a much larger file, either fetched from the internet or kept in a separate repo for such large test files.

Given that we are just testing loading here, could we just generate such a file? We can create a big CSV of random texts, numbers, etc. very easily using the Random or Faker types.

@JaroslavTulach (Member, Author) commented Jul 11, 2024

> I would prefer a much larger file, either fetched from the internet or kept in a separate repo for such large test files.

> Given that we are just testing loading here, could we just generate such a file? We can create a big CSV of random texts, numbers, etc. very easily using the Random or Faker types.

We could certainly generate a .csv file, but by the same logic we could generate test/Table_Tests/data/data.csv. Yet we don't generate that file. This data.csv is an exact copy of the existing one.

> I still disagree with committing a test file to our repo, even if there is already an exact copy somewhere in it (it shouldn't be there in the first place, but that's another story).

I am not adding anything to the size of our Git repository. Git stores files in the .git directory by content, not by name, so adding an exact copy adds nothing to the size of the Git repository that needs to be downloaded.
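This claim about content-addressed storage is easy to sanity-check: a Git blob's object id is the SHA-1 of the header "blob <size>\0" followed by the file contents, independent of file name or path, so two identical files share one stored object. A minimal sketch (the sample content is made up):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

public final class GitBlobId {
    public static void main(String[] args) throws Exception {
        // Hypothetical file contents; the name never enters the hash.
        byte[] content = "a,b\n1,2\n".getBytes(StandardCharsets.UTF_8);
        byte[] header = ("blob " + content.length + "\0").getBytes(StandardCharsets.UTF_8);
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        sha1.update(header);
        sha1.update(content);
        // Prints the same object id git would assign to this blob.
        System.out.println(HexFormat.of().formatHex(sha1.digest()));
    }
}
```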

> For this benchmark to be a bit more reliable, I would prefer a much larger file, either fetched from the internet or kept in a separate repo for such large test files.

Downloading only adds flakiness. A separate repo is overkill. Why should we generate this file when that one isn't generated? Where is the consistency in such a suggestion?

@JaroslavTulach (Member, Author) commented:

> First of all, I wanted to run the benchmarks to find out how bad we are. The results are coming.

Significant speedup

Benchmarks are fine. We need to decide what to do with data.csv: I can, for example, point to the existing sibling data.csv file. Then we can declare victory over my regression.

@radeusgd (Member) commented:

> Benchmarks are fine. We need to decide what to do with data.csv: I can, for example, point to the existing sibling data.csv file. Then we can declare victory over my regression.

IMHO we should proceed as-is and 'fix' this later. As you said, it's an exact copy, so it does not introduce additional overhead. Overall, we should probably be generating both files.

> Why should we generate this file when that one isn't generated? Where is the consistency in such a suggestion?

I guess the difference is that we can randomly generate this file, as we don't care about its exact contents, only its structure. For the other one we are checking aggregation results, so we need to know the exact values. But we could probably generate that one with a deterministic seed too, so yes, you are right.
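For reference, a deterministic generator along the lines discussed here might look like the following sketch (the file name, column layout, and row count are all made up; the fixed seed makes the contents reproducible across runs):

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Random;

public final class GenerateBenchmarkCsv {
    public static void main(String[] args) throws IOException {
        Random rnd = new Random(42); // fixed seed: identical file every run
        Path out = Path.of("bench-data.csv"); // hypothetical output path
        try (PrintWriter w = new PrintWriter(Files.newBufferedWriter(out))) {
            w.println("id,value,label");
            // ~2M rows of random numbers and short strings: tens of MB of text.
            for (long i = 0; i < 2_000_000L; i++) {
                w.printf("%d,%d,row_%08x%n", i, rnd.nextInt(1_000_000), rnd.nextInt());
            }
        }
    }
}
```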

@radeusgd (Member) left a comment

Shouldn't we revert the change that disabled the use of the default encoding?

I assumed that the performance problem occurred when using Encoding.Default, so to check whether this fix really addresses the issue in full, we should probably also run the benchmarks with this patch:

diff --git a/distribution/lib/Standard/Base/0.0.0-dev/src/Data/Text/Encoding.enso b/distribution/lib/Standard/Base/0.0.0-dev/src/Data/Text/Encoding.enso
--- a/distribution/lib/Standard/Base/0.0.0-dev/src/Data/Text/Encoding.enso	(revision 8da06309e99f24d7062df5638f49e000fb292eff)
+++ b/distribution/lib/Standard/Base/0.0.0-dev/src/Data/Text/Encoding.enso	(revision 71a6e2162e41653b7b239a756adb8bf14db2e8f4)
@@ -71,8 +71,7 @@
     default -> Encoding =
         # This factory method is used to publicly expose the `Default` constructor.
         # The constructor itself has to be private, because we want to make `Value` constructor private, but all constructors must have the same privacy.
-        # ToDo: This is a workaround for performance issue.
-        Encoding.utf_8
+        Encoding.Default
 
     ## PRIVATE
        A default encoding that will try to guess the encoding based on some heuristics.

@JaroslavTulach (Member, Author) commented:

> Shouldn't we revert the change that disabled the use of the default encoding?

Why?

> I assumed that the performance problem occurred when using Encoding.Default,

No. The first bad merge is fe2cf49 coming from

> so to check whether this fix really addresses the issue in full, we should probably also run the benchmarks with this patch: [the Encoding.enso patch quoted above]

I don't understand why we should be doing that, or how. I don't want to change /Data/Text/Encoding.enso; that's an unrelated change.

@JaroslavTulach added the 'CI: Ready to merge' label Jul 11, 2024
@hubertp (Collaborator) commented Jul 11, 2024

> Shouldn't we revert the change that disabled the use of the default encoding?

@jdunkerley mentioned he will re-enable it in a separate PR. But it would be good to use this benchmark to compare the performance with and without it.

@JaroslavTulach added the 'CI: Clean build required' label Jul 11, 2024
@radeusgd (Member) commented:

> Shouldn't we revert the change that disabled the use of the default encoding?

> @jdunkerley mentioned he will re-enable it in a separate PR. But it would be good to use this benchmark to compare the performance with and without it.

Exactly. @jdunkerley's hotfix of disabling the default encoding was already said to improve performance, so it also affected the CSV-loading performance measured here. To measure the target performance properly, we should run the benchmarks in the target state, with this hotfix reverted.

@radeusgd (Member) commented:

> I don't understand why we should be doing that, or how. I don't want to change /Data/Text/Encoding.enso; that's an unrelated change.

It's not unrelated. It was done in #10493 with the comment "Disable Encoding.default for now as big performance impact on CSVs." So it seems crucial to measure with benchmarks whether, after we re-enable this encoding, the performance is back to normal.

@JaroslavTulach (Member, Author) commented:

OK, let's get this PR in and then you can run the measurements with and without Encoding.default enabled.

@mergify mergify bot merged commit 077b86f into develop Jul 11, 2024
43 checks passed
@mergify mergify bot deleted the wip/jtulach/SlowCsvRead10503 branch July 11, 2024 10:08
@JaroslavTulach (Member, Author) commented:

Once this benchmark run is over, we will have the initial results for the new .csv benchmark.

Labels
CI: Clean build required · CI: No changelog needed · CI: Ready to merge
Development

Successfully merging this pull request may close these issues.

Investigate slow opening of a project having large CSV file
5 participants