cargo test performance regression on Windows on version 1.75.0 #119560
Comments
Maybe this also helps a bit to investigate, if we run datafusion windows CI with
it helps to reduce timing from 190m to 60m, but it is still way longer than before.
It might help to look at the raw log and narrow down which steps specifically are taking a long time. So the issue is maybe around the generation of debuginfo?
Just a quick update, narrowed down the testing. Limiting to running only tpcds_physical_q44:

With backtrace feature:
good (2023-10-29 nightly) = 70 seconds
bad (2023-10-30 nightly) = 692 seconds

Without backtrace:
good (2023-10-29 nightly) = 1.25 seconds
bad (2023-10-30 nightly) = 1.26 seconds

Edit: Also just to be clear, we are running with env vars:
For both good & bad runs.
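To check the cost of backtraces in isolation from the DataFusion tests, something like the following could be timed on each toolchain. This is only a minimal sketch: the function name `time_backtraces` and the iteration count are made up for illustration, it uses `std::backtrace::Backtrace` rather than whatever the `backtrace` feature does internally, and whether it reproduces the slowdown seen in CI is an assumption.

```rust
use std::backtrace::Backtrace;
use std::time::{Duration, Instant};

/// Captures and formats `iterations` backtraces (hypothetical helper).
/// Returns the elapsed time and the total number of bytes of formatted output.
fn time_backtraces(iterations: usize) -> (Duration, usize) {
    let start = Instant::now();
    let mut total_len = 0;
    for _ in 0..iterations {
        // force_capture captures regardless of the RUST_BACKTRACE settings.
        let bt = Backtrace::force_capture();
        // Formatting forces symbols to be resolved if they have not been already.
        total_len += bt.to_string().len();
    }
    (start.elapsed(), total_len)
}

fn main() {
    let (elapsed, bytes) = time_backtraces(100);
    println!("100 backtraces: {elapsed:?} ({bytes} bytes formatted)");
}
```

Building this once with and once without debuginfo, on both nightly-2023-10-29 and nightly-2023-10-30, would show whether the delta follows the toolchain, the debuginfo setting, or both.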
The speedup seems to occur due to adding
#117089 was merged in the right window, and the PR description notes that it contains rust-lang/backtrace-rs#569 which sounds relevant.
cc @wesleywiser who authored #117089 and changes to the backtrace-rs submodule (IIUC)
I'm able to repro locally:
Looking at profiling data, the vast majority of time in the regressed case is spent in
One thing that's very interesting is that if I do not set
@Jefffrey do you know why
Thank you for the detailed breakdown! @wesleywiser
The reasoning you provide makes sense, I'm not sure why both
I'll experiment with setting
Thanks for the suggested config fix, @wesleywiser. This does eliminate the performance regression (or at least brings the times within a more reasonable range of each other). Even though it was caused by some weird settings on our end (disabling debuginfo but still using backtrace, on Windows), it was still a significant regression in performance. Let me know if this is the expected behaviour now and I can close this issue, or whether to keep it open if you're planning to investigate this weird edge case further 👍 (or move it to the backtrace repo?)
This was discussed in the libs meeting. While this is an edge case and likely a bug in dbghelp, it was suggested that maybe backtrace could work around it by somehow detecting that there is no debug info present? Even if it can only tell after trying, it could perhaps cache this knowledge so it doesn't try again the next time?
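The caching idea could look roughly like the sketch below. Everything in it is hypothetical: `NoDebugInfoCache`, its `resolve` method, and the `lookup` closure standing in for the expensive symbolization call are invented for illustration and are not part of backtrace-rs. It also assumes that one failed lookup is enough to conclude that a module has no debug info, which a real implementation would need to verify.

```rust
use std::collections::HashSet;

/// Hypothetical cache that remembers which modules yielded no symbols,
/// so the expensive lookup is attempted only once per such module.
struct NoDebugInfoCache {
    known_empty: HashSet<String>,
}

impl NoDebugInfoCache {
    fn new() -> Self {
        Self { known_empty: HashSet::new() }
    }

    /// `lookup` stands in for the expensive symbolization call (an assumption,
    /// not a real backtrace-rs API). It is skipped for modules already known
    /// to have no debug info.
    fn resolve<F>(&mut self, module: &str, addr: usize, mut lookup: F) -> Option<String>
    where
        F: FnMut(&str, usize) -> Option<String>,
    {
        if self.known_empty.contains(module) {
            return None; // negative result cached: do not try again
        }
        let result = lookup(module, addr);
        if result.is_none() {
            // Assumption: a failed lookup means the whole module lacks debug info.
            self.known_empty.insert(module.to_string());
        }
        result
    }
}

fn main() {
    let mut cache = NoDebugInfoCache::new();
    let mut calls = 0;
    for addr in [0x1000, 0x1008, 0x1010] {
        // The closure pretends the module has no debug info at all.
        let name = cache.resolve("app.exe", addr, |_, _| {
            calls += 1;
            None
        });
        assert!(name.is_none());
    }
    println!("expensive lookups performed: {calls}");
}
```

The trade-off is that a module whose symbols become available later (or that only fails for some addresses) would never be retried, so the key and the invalidation policy would need more thought than this sketch gives them.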
Code
DataFusion issue with complete details: apache/datafusion#8696
arrow-datafusion runs cargo test on a Windows runner as part of CI. When on Rust version 1.74.1 (and below), the check takes under 30 minutes. After upgrading to Rust version 1.75.0, it now takes over 3 hours, with no other change in code on our side. This seems to only take effect on Windows, as Linux/Mac tests didn't seem to be affected.

After debugging, I found the regression occurs between toolchains nightly-2023-10-29 (rust e5cfc5547) and nightly-2023-10-30 (rust 608e9682f).

We are running on GitHub actions runner windows-latest.

So expected run is here, on toolchain nightly-2023-10-29: https://github.com/apache/arrow-datafusion/actions/runs/7394674719/job/20116418078

When I bump to toolchain nightly-2023-10-30, with no other code changes: https://github.com/apache/arrow-datafusion/actions/runs/7394848426/job/20116910586

Slow tests
The slowness occurs primarily in two tests.
tpcds_planning
Test code: https://github.com/apache/arrow-datafusion/blob/1179a76567892b259c88f08243ee01f05c4c3d5c/datafusion/core/tests/tpcds_planning.rs
On the good run (before regression):
On the bad run (after regression):
sqllogictest
Test code: https://github.com/apache/arrow-datafusion/tree/1179a76567892b259c88f08243ee01f05c4c3d5c/datafusion/sqllogictest
On the good run (before regression):
On the bad run (after regression):
These two tests are the only ones with a significant delta, the rest don't seem affected by the upgrade.
Version it worked on
Ran fast with Rust 1.74.1 (and nightly-2023-10-29)
Version with regression
Ran slow on Rust 1.75.0 (and nightly-2023-10-30)
Additional context
Apologies if the example is too large to easily determine where the issue is. I'll try to reduce this to a smaller MRE, but I don't have a Windows machine to test on locally, so I have had to check via CI.