Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[1.20][FLINK-37100][tests] Fix test_netty_shuffle_memory_control.sh with Netty4 RPC #25955

Merged
merged 2 commits into from
Jan 23, 2025

Conversation

ferenc-csaky
Copy link
Contributor

@ferenc-csaky ferenc-csaky commented Jan 10, 2025

What is the purpose of the change

Fixes the test executed by test_netty_shuffle_memory_control.sh that can possibly fail the CI in case Netty4 cannot reserve enough memory, hence Pekko is not able to start up.

Brief change log

  • Reverted the commit backported from the 2.0 branch, which I believe invalidates the purpose of this test case.
  • Increased the off-heap memory from 7BM to 12MB, which according to local testing has to be enough to stabilize the CI execution.

Verifying this change

Existing test should succeed consistently in CI.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? handled in different PR

@ferenc-csaky ferenc-csaky changed the title [FLINK-37100][tests] Fix test_netty_shuffle_memory_control.sh in CI for JDK11+ [1.20][FLINK-37100][tests] Fix test_netty_shuffle_memory_control.sh in CI for JDK11+ Jan 10, 2025
@flinkbot
Copy link
Collaborator

flinkbot commented Jan 10, 2025

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@ferenc-csaky ferenc-csaky force-pushed the FLINK-37100-1.20 branch 2 times, most recently from aa8a499 to 09cf21f Compare January 20, 2025 13:11
@ferenc-csaky
Copy link
Contributor Author

@flinkbot run azure

@afedulov
Copy link
Contributor

Hey @ferenc-csaky , thanks for addressing this.
I actually already went ahead and merged the fix for this test that was made on master after the Netty 4 upgrade.
#26041
#26042

Do you think there are benefits to use unpooled config here instead?

@He-Pin
Copy link
Member

He-Pin commented Jan 21, 2025

I think the unpooled will affect other places of Flink

@ferenc-csaky
Copy link
Contributor Author

After I rethink it again, it should probably be not enabled here, because the test should verify the flink-runtime shuffle memory allocation, which was on Netty4 back in 2019 as well when it was originally implemented, so it used the pooled allocator since ever.

Enabling reflection only seemed to stabilize the test on my machine, but since this test was flaky on the CI failed sporadically, and on my machine consistently, it may not add much. My concern about the backported commit from master is that it gives 90MB memory instead of 7. It states one chunk (4M * 20) plus some margins, which might be true for master if there were some other changes how we (did not follow this up), but definitely not true for 1.19 and 1.20, so IMO it invalidates this test case.

@afedulov
Copy link
Contributor

@ferenc-csaky thanks for the clarification. It sounds like the correct way would be to find out what is the off-heap overhead induced by Netty 4 default settings in RPC and bump that setting by exactly that amount.

@afedulov
Copy link
Contributor

@ferenc-csaky could you rerun the tests without reflection and unpooled, but with a slightly increased off-heap memory? We probably need to just find the sweet spot to account for the increased off-heap memory consumption caused by the RPC.

@He-Pin
Copy link
Member

He-Pin commented Jan 21, 2025

I would like to see anyone run a test with pekko 1.2.0 nightly ... I think then @pjfanning should be happy to get that pr in

@ferenc-csaky
Copy link
Contributor Author

@ferenc-csaky could you rerun the tests without reflection and unpooled, but with a slightly increased off-heap memory? We probably need to just find the sweet spot to account for the increased off-heap memory consumption caused by the RPC.

@afedulov Yeah, I cover that today.

@ferenc-csaky
Copy link
Contributor Author

@afedulov So on my M4 Pro chip MacBook the test starts to run with 11MB consistentlt, and consistently fail with anything under that, since TMs cannot even start because there are not enough resource to allocate for Netty. Since the CI runs on x86 machines, and only fail once in a while, it makes me think CPU architecture matters here and it allocates less memory on a non-ARM architecture. All in all this points me to think we should get away with 12MB for these tests for 1.19 and 1.20.

In the meantime I also thought a bit about where that 90mb memory the backported commit added, and I think they just multiplied the 4M chunk by the task slot number and added some extra margin: 4m x 20 + 10 = 90m, which is semantically wrong.

@He-Pin
Copy link
Member

He-Pin commented Jan 22, 2025

@ferenc-csaky I think the cache depends on how many cores you have, would you like to test the pekko 1.2.x nightly too, I would like to, but don't know how.
refs: apache/pekko#1709

@ferenc-csaky
Copy link
Contributor Author

@He-Pin That's a good point regarding cache! Probably the CI machines has less than 10 cores. I can give it a try, I just have to do a local Pekko build myself to be able to build Flink on top of it.

@He-Pin
Copy link
Member

He-Pin commented Jan 22, 2025

@ferenc-csaky Cool But the pekko 1.2.x snapshot should be binary compatible with 1.1.x

And there are nightly on https://repository.apache.org/content/groups/snapshots/org/apache/pekko/pekko-actor_2.13/1.2.0-M0+55-a75bc7a7-SNAPSHOT/

Hope that saves you some time.

@ferenc-csaky
Copy link
Contributor Author

ferenc-csaky commented Jan 22, 2025

@ferenc-csaky Cool But the pekko 1.2.x snapshot should be binary compatible with 1.1.x

And there are nightly on https://repository.apache.org/content/groups/snapshots/org/apache/pekko/pekko-actor_2.13/1.2.0-M0+55-a75bc7a7-SNAPSHOT/

Hope that saves you some time.

TBH I have very limited Scala knowledge, but Flink does not support 2.13 at all, so my preconception was that I need a 2.12 build. I pretty much figured out building it from the nightly GH workflow and just finished building with this cmd:

sbt -Dpekko.build.scalaVersion=2.12.x "++ 2.12.x ;publishLocal;publishM2"

(After 1 failure because of graphviz missing...)

And I built 1.1.x after I applied your commit from apache/pekko#1709.

@He-Pin
Copy link
Member

He-Pin commented Jan 22, 2025

ouch, but Flink does not support 2.13 at all so seems we can't drop 2.12 support very soon!

@ferenc-csaky
Copy link
Contributor Author

ouch, but Flink does not support 2.13 at all so seems we can't drop 2.12 support very soon!

We are actually trying to remove Scala from the project: https://issues.apache.org/jira/browse/FLINK-29741
I'm not sure when that happens, and maybe if upstream projects are using a more recent version would not be meaningful, but I think for the foreseeable future it will definitely stay with us.

Anyways, I managed to run the tests making Pekko use the unpooled allocator and the test PASS on my laptop with 7MB, so it works as before. IMO this config option will be definitely useful for us. If it's meaningful enough to urge another Pekko release, I am leaning towards to not, but if @afedulov thinks we can discuss this on the mailing list.

@He-Pin
Copy link
Member

He-Pin commented Jan 22, 2025

I knew , after alibaba acquired, then more Java, thanks for that information.

@afedulov
Copy link
Contributor

@He-Pin now that it is confirmed that the "unpooled" config solves the issue for us, how realistic do you think it is to get a Pekko release with the allocator config support included soon?

@ferenc-csaky I would prefer not to postpone much longer unless we get the Pekko release ASAP. Please drop a line in response to @pjfanning's question here: apache/pekko#1709 (comment)
In parallel, let's assume that there won't be a new Pekko release in the upcoming day or two. Should we simply bump from 7 to 12MB and hope it does it, what do you think?

@He-Pin
Copy link
Member

He-Pin commented Jan 22, 2025

Would you like to comment on the pr against 1.1.x too.

@ferenc-csaky
Copy link
Contributor Author

@He-Pin now that it is confirmed that the "unpooled" config solves the issue for us, how realistic do you think it is to get a Pekko release with the allocator config support included soon?

@ferenc-csaky I would prefer not to postpone much longer unless we get the Pekko release ASAP. Please drop a line in response to @pjfanning's question here: apache/pekko#1709 (comment) In parallel, let's assume that there won't be a new Pekko release in the upcoming day or two. Should we simply bump from 7 to 12MB and hope it does it, what do you think?

I agree with moving forward and adjusting the off-heap memory to 12MB. Will comment to the Pekko PR in a bit.

…by increase the direct memory of TM"

This reverts commit 407c3d5.
…Netty4 RPC

With Pekko updated and using Netty4, the default memory buffer allocation
is different compared to Netty3, thus to stabilize this test we increased
the given memory a bit.
@ferenc-csaky ferenc-csaky requested a review from afedulov January 23, 2025 16:19
@ferenc-csaky ferenc-csaky changed the title [1.20][FLINK-37100][tests] Fix test_netty_shuffle_memory_control.sh in CI for JDK11+ [1.20][FLINK-37100][tests] Fix test_netty_shuffle_memory_control.sh with Netty4 RPC Jan 23, 2025
Copy link
Contributor

@afedulov afedulov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@ferenc-csaky ferenc-csaky merged commit 72a3232 into apache:release-1.20 Jan 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants