
Fix flaky CI checks on Ubuntu #6906

Closed
mhucka opened this issue Jan 1, 2025 · 5 comments · Fixed by #6916
Assignees
Labels
area/ci kind/health For CI/testing/release process/refactoring/technical debt items priority/p1 Fix is needed as soon as possible. Should be staffed. It is blocking some major flows for users

Comments


mhucka commented Jan 1, 2025

The current pytest runs on Ubuntu often fail during PR checks. It seems that sometimes the optimization steps take too long and time out:

=========================== short test summary info ============================
FAILED cirq-core/cirq/experiments/z_phase_calibration_test.py::test_calibrate_z_phases[angles6-error6-characterization_flags6] - RuntimeError: Optimal parameters not found: The maximum number of function evaluations is exceeded.
= 1 failed, 19825 passed, 121 skipped, 89 xfailed, 7 xpassed, 843 warnings in 334.79s (0:05:34) =
Error: Process completed with exit code 1.

I'm seeing this kind of thing happen on more than 50% of my PRs.

@mhucka mhucka added kind/health For CI/testing/release process/refactoring/technical debt items area/ci labels Jan 1, 2025
@mhucka mhucka changed the title Flaky CI checks on Ubuntu Fix flaky CI checks on Ubuntu Jan 1, 2025
@mhucka mhucka self-assigned this Jan 2, 2025
@mhucka mhucka added the priority/p1 Fix is needed as soon as possible. Should be staffed. It is blocking some major flows for users label Jan 2, 2025

mhucka commented Jan 2, 2025

This seems to be coming from calls to SciPy's `curve_fit()` in the file `xeb_fitting.py`. The usual reason for this error is that the optimization algorithm failed to find the best-fit parameters within the allowed number of iterations.
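The failure mode is easy to demonstrate in isolation: when `curve_fit` exhausts its function-evaluation budget (`maxfev`), it raises the same `RuntimeError` seen in the CI logs. A minimal sketch — the toy model and tiny budget here are illustrative, not the values used in `xeb_fitting.py`:

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_decay(x, a, layer_fid):
    # Toy stand-in for the exponential-decay model fitted in XEB.
    return a * layer_fid**x

x = np.arange(1, 40, dtype=float)
y = 0.95 * 0.97**x

err = None
try:
    # A budget of 2 evaluations cannot even finish the first Jacobian
    # estimate, so the optimizer gives up and raises RuntimeError.
    curve_fit(exp_decay, x, y, p0=(1.0, 1.0), maxfev=2)
except RuntimeError as exc:
    err = exc
print(err)
```

The CI failure is the same exception, just triggered by an unlucky data/seed combination rather than an artificially small budget.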


mhucka commented Jan 2, 2025

What's frustrating about this is that I can't seem to get it to happen locally on a Linux system, even if I explicitly set the parameters `angles` and `error` in `test_calibrate_z_phases_no_options` to the same values as in an example failure on GitHub.


mhucka commented Jan 3, 2025

I've finally been able to reproduce it. I had to set the random seed to the same value as in one of the failing CI runs.

mhucka added a commit to mhucka/Cirq that referenced this issue Jan 4, 2025
The test function `test_calibrate_z_phases()` in the file
`cirq-core/cirq/experiments/z_phase_calibration_test.py` eventually
ends up calling `fit_exponential_decays()` in `xeb_fitting.py`. That
function uses SciPy's `curve_fit()` to fit an exponential function to
data. For some unlucky combination of random seed and problem cases,
`curve_fit()` can fail to find a solution, and raises `RuntimeError`
when that happens.

This change wraps the call to `curve_fit()` in a try-except for
`RuntimeError`; when that happens, the wrapper calls `curve_fit()` one
more time with an explicit setting of higher max function evaluations.
If that fails too, then the resulting new `RuntimeError` will not be
caught.

This is admittedly not a true solution to the problem. In my testing,
this prevents the original CI error (with a particular random seed),
but it's probably the case that the second run of `curve_fit()` could
get unlucky again. Still, this may be good enough for most cases in
practice.

If we continue to see failures, we should explore a more sophisticated
solution to this.
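The retry approach described in this commit message can be sketched as follows. This is a simplified illustration, not the actual Cirq patch: the helper name and the `first_maxfev`/`retry_maxfev` knobs are hypothetical.

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_decay(x, a, layer_fid):
    # Toy exponential-decay model of the kind fitted in xeb_fitting.py.
    return a * layer_fid**x

def fit_with_retry(f, x, y, p0, first_maxfev=None, retry_maxfev=10_000):
    """Call curve_fit; on RuntimeError, retry once with a larger budget.

    `first_maxfev` and `retry_maxfev` are illustrative knobs, not values
    from the actual patch.  If the retry also fails, its RuntimeError
    propagates to the caller uncaught.
    """
    first_kwargs = {} if first_maxfev is None else {"maxfev": first_maxfev}
    try:
        return curve_fit(f, x, y, p0=p0, **first_kwargs)
    except RuntimeError:
        # One more attempt with an explicitly larger evaluation budget.
        return curve_fit(f, x, y, p0=p0, maxfev=retry_maxfev)

x = np.arange(1, 40, dtype=float)
y = 0.95 * 0.97**x

# Starve the first attempt of evaluations so the retry path runs; the
# second attempt recovers the true parameters (0.95, 0.97).
popt, _ = fit_with_retry(exp_decay, x, y, p0=(1.0, 1.0), first_maxfev=2)
```

As the commit message notes, this only lowers the probability of failure rather than eliminating it, since the second attempt can in principle also run out of budget.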

mhucka commented Jan 4, 2025

This will be fixed by PR #6916 once it's accepted.

@mhucka mhucka closed this as completed Jan 4, 2025
mhucka added a commit to mhucka/Cirq that referenced this issue Jan 4, 2025
This workflow is designed to let you run pytest with various debugging
options. It's meant to be executed manually using the "Run workflow" button on
https://github.com/quantumlib/Cirq/actions/workflows/debug.yaml

Clicking the "Run workflow" button there will make GitHub present a
form interface with various options for the characteristics of the
run. Options include:

- give it a specific test to run (in the form of a path to a test file)
- set the number of repetitions of the test
- set the Python version to use
- set the random seed
- turn on verbose logging

This workflow was developed as an aid to debugging issue quantumlib#6906,
involving curve fitting that failed on occasion, but it is general and
I think it will be useful for other debugging needs.
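A manually triggered workflow of this kind is built on the GitHub Actions `workflow_dispatch` trigger with declared `inputs`. A hypothetical minimal sketch of how the options above could be wired up — input names, defaults, and steps here are illustrative, and the real file is the `debug.yaml` at the URL above:

```yaml
name: Pytest debugging
on:
  workflow_dispatch:
    inputs:
      test-path:
        description: 'Path to the test file (or file::test_function) to run'
        type: string
      repetitions:
        description: 'Number of times to repeat the test'
        type: string
        default: '1'
      python-version:
        description: 'Python version to use'
        type: string
        default: '3.11'
      random-seed:
        description: 'Seed to pass to pytest-randomly'
        type: string
      verbose:
        description: 'Turn on verbose logging'
        type: boolean
        default: false

jobs:
  pytest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ inputs.python-version }}
      - name: Run pytest repeatedly
        run: |
          for i in $(seq '${{ inputs.repetitions }}'); do
            ./check/pytest ${{ inputs.verbose && '-v' || '' }} \
              --randomly-seed='${{ inputs.random-seed }}' \
              '${{ inputs.test-path }}'
          done
```

Pinning the seed via `--randomly-seed` is what makes an intermittent CI failure repeatable on demand.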
@pavoljuhas pavoljuhas reopened this Jan 7, 2025
pavoljuhas added a commit to pavoljuhas/Cirq that referenced this issue Jan 8, 2025
Remove test dependence on the pytest randomly-seed value by passing the
`random_state` argument to `calibrate_z_phases` and `z_phase_calibration_workflow`.

Choose seed value from several trial runs with a faster fit convergence
and shorter execution time.

Related to quantumlib#6906
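The general pattern behind this fix: rather than letting the code under test draw from the global NumPy RNG (which pytest-randomly reseeds differently on every CI run), thread an explicit `random_state` through, so results no longer depend on the ambient seed. A minimal sketch with a hypothetical stand-in function, not the real Cirq API:

```python
import numpy as np

def noisy_fidelities(n, random_state=None):
    # Hypothetical stand-in for a routine like z_phase_calibration_workflow
    # that accepts an explicit seed instead of using the global RNG.
    rng = np.random.default_rng(random_state)
    return 0.97 ** np.arange(n) + 1e-3 * rng.standard_normal(n)

np.random.seed(0)   # stands in for whatever global seed pytest-randomly set
a = noisy_fidelities(8, random_state=1234)
np.random.seed(42)  # a different ambient seed...
b = noisy_fidelities(8, random_state=1234)
# ...yet the outputs match, because the explicit random_state wins.
```

With the seed fixed inside the test, the fit always sees the same data, so the flaky convergence failure cannot recur regardless of the pytest-randomly seed.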

mhucka commented Jan 8, 2025

Noting here how I reproduced it on a Linux system, using Python 3.11 and this command:

./check/pytest -v --randomly-seed=3258636985 \
cirq-core/cirq/experiments/z_phase_calibration_test.py::test_calibrate_z_phases_no_options

results:

FAILED cirq-core/cirq/experiments/z_phase_calibration_test.py::
test_calibrate_z_phases_no_options[angles1-error1] - RuntimeError:
Optimal parameters not found: The maximum number of function 
evaluations is exceeded.

Attached is a file containing the full output, for historical reference: curve-fit-error.log

pavoljuhas added a commit that referenced this issue Jan 8, 2025
Remove test dependence on the pytest randomly-seed value by passing
`random_state` to `calibrate_z_phases` and `z_phase_calibration_workflow`.

Choose seed value from several trial runs with a faster fit convergence
and shorter execution time.

Related to #6906