[1.20][FLINK-37100][tests] Fix test_netty_shuffle_memory_control.sh
with Netty4 RPC
#25955
@flinkbot run azure |
Hey @ferenc-csaky , thanks for addressing this. Do you think there are benefits to use |
I think the |
After rethinking it, it should probably not be enabled here, because the test should verify the Enabling reflection only seemed to stabilize the test on my machine, but since this test failed sporadically on the CI and consistently on my machine, it may not add much. My concern about the backported commit from |
@ferenc-csaky thanks for the clarification. It sounds like the correct way would be to find out how much off-heap overhead the Netty 4 default settings induce in RPC and bump that setting by exactly that amount.
@ferenc-csaky could you rerun the tests without reflection and unpooled, but with a slightly increased off-heap memory? We probably need to just find the sweet spot to account for the increased off-heap memory consumption caused by the RPC. |
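A rough sketch of what "finding the sweet spot" could look like as a loop: try increasing off-heap sizes and stop at the first one that works. Everything here is hypothetical. `run_with_off_heap_mb` is only a stub standing in for a real run of the e2e test, and `STUB_MIN_MB` is an invented threshold that lets the sketch execute on its own; neither comes from the PR.

```shell
#!/usr/bin/env bash
# Invented threshold so the stub below has something to compare against.
STUB_MIN_MB=${STUB_MIN_MB:-10}

# Hypothetical stand-in for "run the e2e test with this off-heap size (MB)".
# A real version would configure Flink and run the test script instead.
run_with_off_heap_mb() {
  local mb=$1
  (( mb >= STUB_MIN_MB ))
}

# Linear search from a lower bound up to an upper bound (both in MB).
# Prints the first size for which the run succeeds; returns 1 if none does.
find_min_off_heap_mb() {
  local mb=$1 max=$2
  while (( mb <= max )); do
    if run_with_off_heap_mb "$mb"; then
      echo "$mb"
      return 0
    fi
    mb=$(( mb + 1 ))
  done
  return 1
}

find_min_off_heap_mb 8 16
```

Since the failure is reported as sporadic in CI, a real check would likely need to repeat each run several times before treating a size as stable.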
I would like to see someone run a test with the Pekko 1.2.0 nightly ... I think @pjfanning would then be happy to get that PR in.
@afedulov Yeah, I'll cover that today.
@afedulov So on my M4 Pro MacBook the test starts to run consistently with 11MB, and consistently fails with anything under that, since the TMs cannot even start because there are not enough resources to allocate for Netty. Since the CI runs on x86 machines and only fails once in a while, it makes me think CPU architecture matters here and it allocates less memory on a non-ARM architecture. All in all, this makes me think we should get away with 12MB for these tests for 1.19 and 1.20. In the meantime I also thought a bit about where that |
@ferenc-csaky I think the cache depends on how many cores you have. Would you like to test the Pekko 1.2.x nightly too? I would like to, but don't know how.
@He-Pin That's a good point regarding the cache! The CI machines probably have fewer than 10 cores. I can give it a try, I just have to do a local Pekko build myself to be able to build Flink on top of it.
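To illustrate one way the core count can enter the picture: as far as I recall, Netty 4's pooled allocator picks its default number of direct arenas as the smaller of `2 * cores` and `maxDirectMemory / chunkSize / 2 / 3`. The sketch below only reproduces that arithmetic in bash. The function name is made up, and the core counts, the 256 MiB direct-memory limit, and the 4 MiB chunk size are illustrative assumptions, not values measured in this PR.

```shell
#!/usr/bin/env bash
# Sketch of the default direct-arena count as min(2 * cores,
# max_direct / chunk / 2 / 3), using integer division throughout.
# Arguments: cores, max direct memory in bytes, chunk size in bytes.
default_direct_arenas() {
  local cores=$1 max_direct_bytes=$2 chunk_bytes=$3
  local by_cores=$(( cores * 2 ))
  local by_memory=$(( max_direct_bytes / chunk_bytes / 2 / 3 ))
  if (( by_cores < by_memory )); then
    echo "$by_cores"
  else
    echo "$by_memory"
  fi
}

mib=$(( 1024 * 1024 ))
# Assumed inputs: same memory limit and chunk size, different core counts.
default_direct_arenas 4  $(( 256 * mib )) $(( 4 * mib ))
default_direct_arenas 12 $(( 256 * mib )) $(( 4 * mib ))
```

Under these assumptions a machine with more cores ends up with more arenas until the memory-based bound takes over, which would be consistent with a many-core laptop needing more off-heap memory than a smaller CI agent. Thread-local caches add further per-thread overhead that this sketch does not model.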
@ferenc-csaky Cool. But the Pekko 1.2.x snapshot should be binary compatible with 1.1.x, and there are nightlies at https://repository.apache.org/content/groups/snapshots/org/apache/pekko/pekko-actor_2.13/1.2.0-M0+55-a75bc7a7-SNAPSHOT/ Hope that saves you some time.
TBH I have very limited Scala knowledge, but Flink does not support 2.13 at all, so my assumption was that I need a 2.12 build. I pretty much figured out how to build it from the nightly GH workflow and just finished building with this cmd: sbt -Dpekko.build.scalaVersion=2.12.x "++ 2.12.x ;publishLocal;publishM2" (After 1 failure because of And I built 1.1.x after I applied your commit from apache/pekko#1709.
ouch, |
We are actually trying to remove Scala from the project: https://issues.apache.org/jira/browse/FLINK-29741 Anyways, I managed to run the tests making Pekko use the |
I knew: after the Alibaba acquisition, more Java. Thanks for that information.
@He-Pin now that it is confirmed that the "unpooled" config solves the issue for us, how realistic do you think it is to get a Pekko release with the allocator config support included soon? @ferenc-csaky I would prefer not to postpone much longer unless we get the Pekko release ASAP. Please drop a line in response to @pjfanning's question here: apache/pekko#1709 (comment) |
Would you like to comment on the PR against 1.1.x too?
I agree with moving forward and adjusting the off-heap memory to 12MB. Will comment on the Pekko PR in a bit.
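For illustration, the adjustment could look roughly like the snippet below. This is a sketch, not the actual diff. `set_key` is a hypothetical stand-in for the config helper the e2e scripts use, the config file is a temporary file, and `taskmanager.memory.framework.off-heap.size` is assumed to be the option the script tunes, since the thread only says "off-heap memory".

```shell
#!/usr/bin/env bash
# Hypothetical helper: replace an existing "key: value" line in a
# flat YAML-style config file, or append the line if the key is absent.
set_key() {
  local file=$1 key=$2 value=$3
  if grep -q "^${key}:" "$file"; then
    # -i.bak works with both GNU and BSD sed; the backup is removed after.
    sed -i.bak "s|^${key}:.*|${key}: ${value}|" "$file" && rm -f "${file}.bak"
  else
    echo "${key}: ${value}" >> "$file"
  fi
}

# Temporary file standing in for the Flink configuration of the test.
conf=$(mktemp)

# Assumed option name; 12m is the value agreed on in the thread.
set_key "$conf" "taskmanager.memory.framework.off-heap.size" "12m"
cat "$conf"
```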
…by increase the direct memory of TM" This reverts commit 407c3d5.
…Netty4 RPC With Pekko updated and using Netty4, the default memory buffer allocation is different compared to Netty3, thus to stabilize this test we increased the given memory a bit.
LGTM!
What is the purpose of the change
Fixes the test executed by test_netty_shuffle_memory_control.sh, which can fail in CI when Netty4 cannot reserve enough memory and Pekko is therefore unable to start up.
Brief change log
Verifying this change
Existing test should succeed consistently in CI.
Does this pull request potentially affect one of the following parts:
@Public(Evolving): no
Documentation