
Speed up EnsoInputStream again #10515

Merged: 5 commits into develop, Jul 11, 2024
Conversation

@JaroslavTulach (Member) commented Jul 10, 2024

Pull Request Description

Fix #10503 by creating a benchmark and then speeding things up by making sure the usage of InteropLibrary remains in partially evaluated code and isn't hidden behind @TruffleBoundary.
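To illustrate the general pattern (a minimal sketch only, not the actual code from this PR; the class name, method names, and array-based access are assumptions), the difference is whether the InteropLibrary call is visible to Truffle's partial evaluator or cut off behind a boundary:

```java
import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary;
import com.oracle.truffle.api.interop.InteropLibrary;
import com.oracle.truffle.api.interop.InvalidArrayIndexException;
import com.oracle.truffle.api.interop.UnsupportedMessageException;

final class ReadPatternSketch {
    // Slow path: @TruffleBoundary stops partial evaluation here, so the
    // uncached (slow-path) interop dispatch runs on every single read.
    @TruffleBoundary
    static Object readBehindBoundary(Object vector, long index)
            throws UnsupportedMessageException, InvalidArrayIndexException {
        return InteropLibrary.getUncached().readArrayElement(vector, index);
    }

    // Fast path: no boundary; a cached InteropLibrary passed in by the
    // caller (e.g. obtained via @CachedLibrary in a Truffle node) stays
    // visible to the partial evaluator, which can inline and specialize it.
    static Object readInPartialEvaluation(InteropLibrary interop, Object vector, long index)
            throws UnsupportedMessageException, InvalidArrayIndexException {
        return interop.readArrayElement(vector, index);
    }
}
```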

Checklist

Please ensure that the following checklist has been satisfied before submitting the PR:

@JaroslavTulach added the 'CI: No changelog needed' label Jul 10, 2024
@JaroslavTulach requested a review from hubertp July 10, 2024 16:37
@JaroslavTulach self-assigned this Jul 10, 2024
@GregoryTravis (Contributor) commented:

What is the change that sped it up?

@JaroslavTulach (Member, Author) commented Jul 10, 2024

> What is the change that sped it up?

First of all, I wanted to run the benchmarks to find out how bad we are. The results are coming.

Now there is also a fix, 0accf83. With that one we are at around 90%-110% of the original speed (before the regression).

@JaroslavTulach changed the title from 'Speed up Stream_Utils.asInputStream' to 'Speed up EnsoInputStream again' Jul 10, 2024
@hubertp (Collaborator) left a comment

Tested on a project with a large CSV (77 MB) and it loaded almost instantaneously.

I still disagree with committing a test file to our repo, even if there is already an exact copy somewhere in it (it shouldn't be there in the first place, but that's another story). For this benchmark to be a bit more reliable, I would prefer a much larger file, either fetched from the internet or kept in a separate repo for such large test files.
I'm not going to block this PR over that, because it fixes a rather critical problem.

@jdunkerley (Member) left a comment

Just a question on a couple of methods.

@radeusgd (Member) commented:

> I would prefer a much larger file, either fetched from the internet or kept in a separate repo for such large test files.

Given that we are just testing loading here, could we just generate such a file? We can create a big CSV of random texts, numbers, etc. very easily using the Random or Faker types.

@JaroslavTulach (Member, Author) commented Jul 11, 2024

> I would prefer a much larger file, either fetched from the internet or kept in a separate repo for such large test files.

> Given that we are just testing loading here, could we just generate such a file? We can create a big CSV of random texts, numbers, etc. very easily using the Random or Faker types.

We could certainly generate a .csv file, but by the same logic we could generate test/Table_Tests/data/data.csv. Yet we don't generate that file. This data.csv is an exact copy of the existing one.

> I still disagree with committing a test file to our repo, even if there is already an exact copy somewhere in it (it shouldn't be there in the first place, but that's another story).

I am not adding anything to the size of our Git repository. Git stores files in the .git directory by content, not by name, so adding an exact copy adds nothing to the size of the Git repository that needs to be downloaded.
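This claim about content-addressed storage is easy to sanity-check: a Git blob's object id is the SHA-1 of the header "blob <size>\0" followed by the file contents, independent of file name or path, so two identical files share one stored object. A minimal sketch (the sample content is made up):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

public final class GitBlobId {
    public static void main(String[] args) throws Exception {
        // Hypothetical file contents; the name never enters the hash.
        byte[] content = "a,b\n1,2\n".getBytes(StandardCharsets.UTF_8);
        byte[] header = ("blob " + content.length + "\0").getBytes(StandardCharsets.UTF_8);
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        sha1.update(header);
        sha1.update(content);
        // Prints the same object id git would assign to this blob.
        System.out.println(HexFormat.of().formatHex(sha1.digest()));
    }
}
```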

> For this benchmark to be a bit more reliable, I would prefer a much larger file, either fetched from the internet or kept in a separate repo for such large test files.

Downloading only adds flakiness. A separate repo is overkill. Why should we generate this file when that one isn't generated? Where is the consistency in such a suggestion?

@JaroslavTulach (Member, Author) commented:

> First of all, I wanted to run the benchmarks to find out how bad we are. The results are coming.

Significant speedup

Benchmarks are fine. We need to decide what to do with data.csv: I can, for example, point to the existing sibling data.csv file. Then we can declare victory over my regression.

@radeusgd (Member) commented:

> Benchmarks are fine. We need to decide what to do with data.csv: I can, for example, point to the existing sibling data.csv file. Then we can declare victory over my regression.

IMHO we should proceed as-is and 'fix' this later. As you said, it's an exact copy, so it does not introduce additional overhead. Overall, we should probably be generating both files.

> Why should we generate this file when that one isn't generated? Where is the consistency in such a suggestion?

I guess the difference is that we can randomly generate this file, as we don't care about its exact contents, only its structure. For the other one we are checking aggregation results, so we need to know the exact values. But we could probably generate that one with a deterministic seed too, so yes, you are right.
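For reference, a deterministic generator along the lines discussed here might look like the following sketch (the file name, column layout, and row count are all made up; the fixed seed makes the contents reproducible across runs):

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Random;

public final class GenerateBenchmarkCsv {
    public static void main(String[] args) throws IOException {
        Random rnd = new Random(42); // fixed seed: identical file every run
        Path out = Path.of("bench-data.csv"); // hypothetical output path
        try (PrintWriter w = new PrintWriter(Files.newBufferedWriter(out))) {
            w.println("id,value,label");
            // ~2M rows of random numbers and short strings: tens of MB of text.
            for (long i = 0; i < 2_000_000L; i++) {
                w.printf("%d,%d,row_%08x%n", i, rnd.nextInt(1_000_000), rnd.nextInt());
            }
        }
    }
}
```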

@radeusgd (Member) left a comment

Shouldn't we revert the change that disabled the use of the default encoding?

I assumed that the performance problem occurred when using Encoding.Default, so to check whether this fix really addresses the issue in full, we should probably also run the benchmarks with this patch:

diff --git a/distribution/lib/Standard/Base/0.0.0-dev/src/Data/Text/Encoding.enso b/distribution/lib/Standard/Base/0.0.0-dev/src/Data/Text/Encoding.enso
--- a/distribution/lib/Standard/Base/0.0.0-dev/src/Data/Text/Encoding.enso	(revision 8da06309e99f24d7062df5638f49e000fb292eff)
+++ b/distribution/lib/Standard/Base/0.0.0-dev/src/Data/Text/Encoding.enso	(revision 71a6e2162e41653b7b239a756adb8bf14db2e8f4)
@@ -71,8 +71,7 @@
     default -> Encoding =
         # This factory method is used to publicly expose the `Default` constructor.
         # The constructor itself has to be private, because we want to make `Value` constructor private, but all constructors must have the same privacy.
-        # ToDo: This is a workaround for performance issue.
-        Encoding.utf_8
+        Encoding.Default
 
     ## PRIVATE
        A default encoding that will try to guess the encoding based on some heuristics.

@JaroslavTulach (Member, Author) commented:

> Shouldn't we revert the change that disabled the use of the default encoding?

Why?

> I assumed that the performance problem occurred when using Encoding.Default,

No. The first bad merge is fe2cf49 coming from

> so to check whether this fix really addresses the issue in full, we should probably also run the benchmarks with this patch: [the Encoding.enso patch quoted above]

I don't understand why we should be doing that, or how. I don't want to change /Data/Text/Encoding.enso; that's an unrelated change.

@JaroslavTulach added the 'CI: Ready to merge' label Jul 11, 2024
@hubertp (Collaborator) commented Jul 11, 2024

> Shouldn't we revert the change that disabled the use of the default encoding?

@jdunkerley mentioned he will re-enable it in a separate PR. But it would be good to use this benchmark to compare the performance with and without it.

@JaroslavTulach added the 'CI: Clean build required' label Jul 11, 2024
@radeusgd (Member) commented:

> Shouldn't we revert the change that disabled the use of the default encoding?

> @jdunkerley mentioned he will re-enable it in a separate PR. But it would be good to use this benchmark to compare the performance with and without it.

Exactly. @jdunkerley's hotfix of disabling the default encoding was already said to improve performance, so it also affected the CSV-loading performance measured here. To measure the target performance properly, we should run the benchmarks in the target state, with this hotfix reverted.

@radeusgd (Member) commented:

> I don't understand why we should be doing that, or how. I don't want to change /Data/Text/Encoding.enso; that's an unrelated change.

It's not unrelated. It was done in #10493 with the comment "Disable Encoding.default for now as big performance impact on CSVs." So it seems crucial to measure with benchmarks whether, after we re-enable this encoding, the performance is back to normal.

@JaroslavTulach (Member, Author) commented:

OK, let's get this PR in and then you can run the measurements with and without Encoding.default enabled.

@mergify mergify bot merged commit 077b86f into develop Jul 11, 2024
43 checks passed
@mergify mergify bot deleted the wip/jtulach/SlowCsvRead10503 branch July 11, 2024 10:08
@JaroslavTulach (Member, Author) commented:

Once this benchmark run is over, we will have the initial results for the new .csv benchmark.

Labels
CI: Clean build required · CI: No changelog needed · CI: Ready to merge
Development

Successfully merging this pull request may close these issues.

Investigate slow opening of a project having large CSV file
5 participants