test_put_multi_fetch_page fails frequently against Ubuntu 24.04 / MySQL 8.4 #279
Comments
While this is not the same failure as #246, it is in the same test class as that failure, so I'm going to link them as possibly related...
Saw the following messages in the log following a failure...
Here's a more legible version of that iRODS stacktrace:
I see many of these failures throughout the logs, but this is the only data object for which the trim failed and a stale replica was left on the destination resource. I also see many messages like the ones described here: #246 (comment)
Forgot to include this in previous comment... Found the place in the logs where the unlink failed for this data object:
Things are still fuzzy, but a few things are becoming clearer. First, the "Function sequence error" is likely a problem similar (but not necessarily identical) to irods/irods#7440. I will need to try building against 4-3-stable to see whether that changes the issue from the database error to an iRODS error (preferable, but not the solution to this issue).
The cause behind these database errors is still unclear. My initial thought was that multiple delay rule executors were picking up the rule at the same time and multiple replications were happening as a result. My current suspicion is the "transaction isolation level" that we have to adjust for MySQL in the testing environment and elsewhere: https://github.com/irods/irods_testing_environment/blob/56e7c810dc6fbbfc9f0ccd139119314b8c3c6cb5/projects/ubuntu-24.04/ubuntu-24.04-mysql-8.4/docker-compose.yml#L4-L39
However, we don't see this failure for other databases. I tried running the test against Postgres 16 and carefully watched the logs. While I saw many messages about delay rules not existing...
I did not see any errors about failure to replicate. I also tried running the test a few times on MySQL 8.0 and the test did not fail. On MySQL 8.0 I did see failures to replicate in the logs, but no failure in the test itself and no failures to trim afterwards. Finally, I added some logging statements using the new logger. Investigation ongoing.
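To compare runs across databases (MySQL 8.4, MySQL 8.0, Postgres 16), it can help to tally how often each error message shows up in a log. The following is only a minimal sketch: it treats the log as plain text lines and matches substrings, and the sample lines and the second marker string are made up for illustration, not copied from a real iRODS log.

```python
def count_markers(log_lines, markers):
    """Count how many log lines contain each marker substring."""
    counts = {marker: 0 for marker in markers}
    for line in log_lines:
        for marker in markers:
            if marker in line:
                counts[marker] += 1
    return counts


# Hypothetical sample lines; real log lines will look different.
sample_lines = [
    "db plugin: Function sequence error",
    "delay server: rule does not exist",
    "db plugin: Function sequence error",
]

# "Function sequence error" comes from the discussion above;
# "failed to replicate" is an assumed placeholder marker.
markers = ["Function sequence error", "failed to replicate"]
counts = count_markers(sample_lines, markers)
print(counts)
```

Running the same tally against the log from each database would show whether the database error appears only on MySQL 8.4 or is merely more frequent there.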
I think this failure is not related to storage tiering. I was able to reproduce it with a native rule language rule: irods/irods#8154 I don't think we will be able to fix this in 4.3.3.1, as the issue exists in the core server (either in the delay server, the replication API, or the database plugin).
Removing from the 4.3.3.1 milestone as this issue is not related to storage tiering. We can circle back to this once irods/irods#8154 is resolved.
Bug Report
Encountered during testing of what will be iRODS 4.3.3.
Platform is Ubuntu 24.04.
Database is MySQL 8.4.
The test fails due to the following assertion (irods_capability_storage_tiering/packaging/test_plugin_unified_storage_tiering.py, line 1520 in 822347c):
The test creates (256 * 2) + 1 = 513 data objects and then waits for them to be moved to another tier. Most of the data movement succeeds, but close observation shows failures which leave stale replicas on the original tier. I've noticed at least 3 replicas in this state following test completion. The test fails because it finds replicas on the original tier, even though it moved 400+ replicas.
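The failure condition described above can be sketched in a few lines of Python. This is not the test's actual code: the helper `stale_on_source`, the resource names `tier0_resc` and `tier1_resc`, and the logical paths are all hypothetical, and the three leftover replicas are simulated to mirror the "at least 3" observed.

```python
def stale_on_source(replicas, source_resc):
    """Return the logical paths that still have a replica on source_resc."""
    return sorted({path for path, resc in replicas if resc == source_resc})


TOTAL_OBJECTS = (256 * 2) + 1  # 513 data objects, as in the test

# Hypothetical logical paths; every object has a replica on the destination tier.
paths = [f"/tempZone/home/alice/obj_{i:03d}" for i in range(TOTAL_OBJECTS)]
replicas = [(path, "tier1_resc") for path in paths]

# Simulate three replicas that were never trimmed from the original tier.
stuck = [paths[7], paths[100], paths[400]]
replicas += [(path, "tier0_resc") for path in stuck]

stale = stale_on_source(replicas, "tier0_resc")
print(f"{len(stale)} of {TOTAL_OBJECTS} objects still have a replica on the original tier")
```

A check that requires the original tier to be empty fails as soon as `stale` is non-empty, no matter how many of the 513 objects moved cleanly. Listing the affected paths, rather than only asserting on the count, makes it easier to find the matching errors in the server log.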
Decreasing the number of replicas involved (by 100 or so) resulted in the test passing. However, reducing the number of replicas isn't a real fix. We need to figure out WHY data movement fails for some replicas.
Below is the output of the failed test.
Here is the final listing before the assertion fails.