`test_put_multi_fetch_page` fails frequently against Ubuntu 24.04 / MySQL 8.4 #279

korydraughn · 2024-08-14T15:44:02Z

main
4-3-stable

Bug Report

Encountered during testing of what will be iRODS 4.3.3.
Platform is Ubuntu 24.04.
Database is MySQL 8.4.

The test fails due to the following assertion.

irods_capability_storage_tiering/packaging/test_plugin_unified_storage_tiering.py

Line 1520 in 822347c

    
           delay_assert(lambda: admin_session.assert_icommand_fail(['ils', '-l', dirname], 'STDOUT_SINGLELINE', 'ufs0'))

The test creates (256 * 2) + 1 data objects and then starts waiting for them to be moved to another tier. This works, but close observation shows there are failures which lead to stale replicas existing on the original tier. I've noticed at least 3 replicas in this state following test completion. The test fails because it finds replicas on the original tier, even though it moved 400+ replicas.

Decreasing the number of replicas involved (by 100 or so) resulted in the test passing. However, reducing the number of replicas isn't a real fix. We need to figure out WHY data movement fails for some replicas.

Below is the output of the failed test.

        <testcase classname="irods.test.test_plugin_unified_storage_tiering.TestStorageTieringContinueInxMigration" name="test_put_multi_fetch_page" time="287.555" timestamp="2024-08-13T21:11:55" file="scripts/irods/test/test_plugin_unified_storage_tiering.py" line="1503">
                <failure type="AssertionError" message=""><![CDATA[Traceback (most recent call last):
  File "/var/lib/irods/scripts/irods/test/test_plugin_unified_storage_tiering.py", line 180, in delay_assert
    out, err, rc = function()
                   ^^^^^^^^^^
  File "/var/lib/irods/scripts/irods/test/test_plugin_unified_storage_tiering.py", line 1520, in <lambda>
    delay_assert(lambda: admin_session.assert_icommand_fail(['ils', '-l', dirname], 'STDOUT_SINGLELINE', 'ufs0'))
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/irods/scripts/irods/test/session.py", line 166, in assert_icommand_fail
    return assert_command_fail(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/irods/scripts/irods/test/command.py", line 79, in assert_command_fail
    return _assert_helper(*args, should_fail=True, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/irods/scripts/irods/test/command.py", line 104, in _assert_helper
    assert result
           ^^^^^^
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/irods/scripts/irods/test/test_plugin_unified_storage_tiering.py", line 1520, in test_put_multi_fetch_page
    delay_assert(lambda: admin_session.assert_icommand_fail(['ils', '-l', dirname], 'STDOUT_SINGLELINE', 'ufs0'))
  File "/var/lib/irods/scripts/irods/test/test_plugin_unified_storage_tiering.py", line 187, in delay_assert
    assert(False)
           ^^^^^
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/irods/scripts/irods/test/test_plugin_unified_storage_tiering.py", line 1526, in test_put_multi_fetch_page
    admin_session.assert_icommand('irm -r ' + dirname)
  File "/var/lib/irods/scripts/irods/test/session.py", line 162, in assert_icommand
    return assert_command(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/irods/scripts/irods/test/command.py", line 76, in assert_command
    return _assert_helper(*args, should_fail=False, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/irods/scripts/irods/test/command.py", line 104, in _assert_helper
    assert result
           ^^^^^^
AssertionError
]]></failure>

Here is the final listing before the assertion fails.

id     name
11110 {"delay_conditions":"<INST_NAME>irods_rule_engine_plugin-unified_storage_tiering-instance</INST_NAME><EF>60s REPEAT UNTIL SUCCESS OR 5 TIMES</EF><PLUSET>1s</PLUSET>","destination-resource":"ufs1","group-name":"example_group","md5":"ca0e41ba44e21e0c2e4eb7d9064d0caf","object-path":"/tempZone/home/rods/test_put_multi_fetch_page/junk0365","preserve-replicas":false,"rule-
engine-instance-name":"irods_rule_engine_plugin-unified_storage_tiering-instance","rule-engine-operation":"irods_policy_data_movement","source-replica-number":"0","source-resource":"ufs0","user-name":"rods","user-zone":"tempZone","verification-type":"catalog"}

 --- IrodsSession: icommand executed by [rods#tempZone] [ils -l test_put_multi_fetch_page] ---
Assert FAIL Command: ils -l test_put_multi_fetch_page
Expecting STDOUT_SINGLELINE: ['ufs0']
  stdout:
    | /tempZone/home/rods/test_put_multi_fetch_page:
<snip>
    |   rods              1 ufs1            1 2024-08-14.15:32 & junk0360
    |   rods              1 ufs1            1 2024-08-14.15:32 & junk0361
    |   rods              1 ufs1            1 2024-08-14.15:32 & junk0362
    |   rods              1 ufs1            1 2024-08-14.15:32 & junk0363
    |   rods              1 ufs1            1 2024-08-14.15:32 & junk0364
    |   rods              0 ufs0            1 2024-08-14.15:30 & junk0365  <== Should not see ufs0.
    |   rods              1 ufs1            1 2024-08-14.15:32 X junk0365  <== Or a stale replica.
    |   rods              1 ufs1            1 2024-08-14.15:32 & junk0366
    |   rods              1 ufs1            1 2024-08-14.15:32 & junk0367
    |   rods              1 ufs1            1 2024-08-14.15:32 & junk0368
    |   rods              1 ufs1            1 2024-08-14.15:32 & junk0369
<snip>

The text was updated successfully, but these errors were encountered:

alanking · 2024-12-12T18:06:38Z

While this is not the same failure as #246, it is in the same test class as that failure, so I'm going to link them as possibly related...

alanking · 2025-01-31T15:45:44Z

Saw the following messages in the log following a failure...

"bindVar[1]=0"
"bindVar[2]="
"bindVar[3]=generic"
"bindVar[4]=1"
"bindVar[5]=/tmp/irods/ufs0/home/rods/test_put_multi_fetch_page/junk0474"
"bindVar[6]=rods"
"bindVar[7]=tempZone"
"bindVar[8]=1"
"bindVar[9]="
"bindVar[10]="
"bindVar[11]=00000000000"
"bindVar[12]=0"
"bindVar[13]=33204"
"bindVar[14]="
"bindVar[15]=01738335972"
"bindVar[16]=01738335972"
"bindVar[17]=11454"
"bindVar[18]=12111"
"bindVar[19]=11454"
"_cllExecSqlNoResult: SQLExecDirect error: -1 sql:update R_DATA_MAIN set data_repl_num = ?, data_version = ?, data_type_name = ?, data_size = ?, data_path = ?, data_owner_name = ?, data_owner_zone = ?, data_is_dirty = ?, data_status = ?, data_checksum = ?, data_expiry_ts = ?, data_map_id = ?, data_mode = ?, r_comment = ?, create_ts = ?, modify_ts = ?, resc_id = ? where data_id = ? and resc_id = ?"
"SQLSTATE: S1010"
"SQLCODE: 0"
"SQL Error message: [unixODBC][Driver Manager]Function sequence error"
"data_object_finalize cmlExecuteNoAnswerSql(rollback) succeeded"
"[db_data_object_finalize_op:15722] - [cmlExecuteNoAnswerSql failed [ec=[-806000]]]"
"failed to publish replica states for [12111]"
"[finalize_destination_replica:277] - failed to finalize data object [error_code=[-806000], path=[/tempZone/home/rods/test_put_multi_fetch_page/junk0474], hierarchy=[ufs1]]"
"[replicate_data:712] - closing destination replica [/tempZone/home/rods/test_put_multi_fetch_page/junk0474] failed with [-806000]"
"[replicate_data_object:929] - failed to replicate [/tempZone/home/rods/test_put_multi_fetch_page/junk0474]"
"rsDataObjRepl - Failed to replicate data object. status:[-806000]"
"[-]\t/irods_plugin_source/src/main.cpp:834:irods::error exec_rule_expression(irods::default_re_ctx &, const std::string &, msParamArray_t *, irods::callback) :  status [CAT_SQL_ERR]  errno [] -- message [iRODS Exception:\n    file: /irods_plugin_source/src/main.cpp\n    function: void (anonymous namespace)::replicate_object_to_resource(rcComm_t *, const std::string &, const std::string &, const std::string &, const std::string &)\n    line: 142\n    code: -806000 (CAT_SQL_ERR)\n    message:\n        failed to migrate [/tempZone/home/rods/test_put_multi_fetch_page/junk0474] to [ufs1]\nstack trace:\n--------------\n 0# irods::stacktrace::dump() const in /lib/libirods_common.so.4.3.3\n 1# irods::exception::assemble_full_display_what() const in /lib/libirods_common.so.4.3.3\n 2# irods::exception::what() const in /lib/libirods_common.so.4.3.3\n 3# exec_rule_expression(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback) in /usr/lib/irods/plugins/rule_engines/libirods_rule_engine_plugin-unified_storage_tiering.so\n 4# std::__1::__function::__func<irods::error (*)(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback), std::__1::allocator<irods::error (*)(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback)>, irods::error (std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback)>::operator()(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*&&, irods::callback&&) in /usr/lib/irods/plugins/rule_engines/libirods_rule_engine_plugin-unified_storage_tiering.so\n 5# irods::pluggable_rule_engine<std::__1::tuple<> >::exec_rule_expression(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback) in /lib/libirods_server.so.4.3.3\n 6# irods::rule_engine_context_manager<std::__1::tuple<>, RuleExecInfo*, (irods::rule_execution_manager_pack)0>::exec_rule_expression(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*) in /lib/libirods_server.so.4.3.3\n 7# rsExecRuleExpression(RsComm*, ExecRuleExpression*) in /lib/libirods_server.so.4.3.3\n 8# irods::api_call_adaptor<ExecRuleExpression*>::operator()(irods::plugin_context&, RsComm*, ExecRuleExpression*) in /lib/libirods_server.so.4.3.3\n 9# std::__1::__function::__func<irods::api_call_adaptor<ExecRuleExpression*>, std::__1::allocator<irods::api_call_adaptor<ExecRuleExpression*> >, irods::error (irods::plugin_context&, RsComm*, ExecRuleExpression*)>::operator()(irods::plugin_context&, RsComm*&&, ExecRuleExpression*&&) in /lib/libirods_server.so.4.3.3\n10# int irods::api_entry::call_handler<ExecRuleExpression*>(RsComm*, ExecRuleExpression*) in /lib/libirods_server.so.4.3.3\n11# rsApiHandler(RsComm*, int, BytesBuf*, BytesBuf*) in /lib/libirods_server.so.4.3.3\n12# readAndProcClientMsg(RsComm*, int) in /lib/libirods_server.so.4.3.3\n13# agentMain(RsComm*) in /lib/libirods_server.so.4.3.3\n14# runIrodsAgentFactory(sockaddr_un) in /lib/libirods_server.so.4.3.3\n15# main::$_5::operator()() const at rodsServer.cpp:?\n16# main in /usr/sbin/irodsServer\n17# 0x00007F97DE50F1CA in /lib/x86_64-linux-gnu/libc.so.6\n18# __libc_start_main in /lib/x86_64-linux-gnu/libc.so.6\n19# _start in /usr/sbin/irodsServer\n\n]\n\n"

Here's a more legible version of that iRODS stacktrace:

    file: /irods_plugin_source/src/main.cpp
    function: void (anonymous namespace)::replicate_object_to_resource(rcComm_t *, const std::string &, const std::string &, const std::string &, const std::string &)
    line: 142 
    code: -806000 (CAT_SQL_ERR)
    message:
        failed to migrate [/tempZone/home/rods/test_put_multi_fetch_page/junk0474] to [ufs1]
stack trace:
--------------
 0# irods::stacktrace::dump() const in /lib/libirods_common.so.4.3.3
 1# irods::exception::assemble_full_display_what() const in /lib/libirods_common.so.4.3.3
 2# irods::exception::what() const in /lib/libirods_common.so.4.3.3
 3# exec_rule_expression(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback) in /usr/lib/irods/plugins/rule_engines/libirods_rule_engine_plugin-unified_storage_tiering.so
 4# std::__1::__function::__func<irods::error (*)(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback), std::__1::allocator<irods::error (*)(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback)>, irods::error (std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback)>::operator()(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*&&, irods::callback&&) in /usr/lib/irods/plugins/rule_engines/libirods_rule_engine_plugin-unified_storage_tiering.so
 5# irods::pluggable_rule_engine<std::__1::tuple<> >::exec_rule_expression(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback) in /lib/libirods_server.so.4.3.3
 6# irods::rule_engine_context_manager<std::__1::tuple<>, RuleExecInfo*, (irods::rule_execution_manager_pack)0>::exec_rule_expression(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*) in /lib/libirods_server.so.4.3.3
 7# rsExecRuleExpression(RsComm*, ExecRuleExpression*) in /lib/libirods_server.so.4.3.3
 8# irods::api_call_adaptor<ExecRuleExpression*>::operator()(irods::plugin_context&, RsComm*, ExecRuleExpression*) in /lib/libirods_server.so.4.3.3
 9# std::__1::__function::__func<irods::api_call_adaptor<ExecRuleExpression*>, std::__1::allocator<irods::api_call_adaptor<ExecRuleExpression*> >, irods::error (irods::plugin_context&, RsComm*, ExecRuleExpression*)>::operator()(irods::plugin_context&, RsComm*&&, ExecRuleExpression*&&) in /lib/libirods_server.so.4.3.3
10# int irods::api_entry::call_handler<ExecRuleExpression*>(RsComm*, ExecRuleExpression*) in /lib/libirods_server.so.4.3.3
11# rsApiHandler(RsComm*, int, BytesBuf*, BytesBuf*) in /lib/libirods_server.so.4.3.3
12# readAndProcClientMsg(RsComm*, int) in /lib/libirods_server.so.4.3.3
13# agentMain(RsComm*) in /lib/libirods_server.so.4.3.3
14# runIrodsAgentFactory(sockaddr_un) in /lib/libirods_server.so.4.3.3
15# main::$_5::operator()() const at rodsServer.cpp:?
16# main in /usr/sbin/irodsServer
17# 0x00007F97DE50F1CA in /lib/x86_64-linux-gnu/libc.so.6
18# __libc_start_main in /lib/x86_64-linux-gnu/libc.so.6
19# _start in /usr/sbin/irodsServer

I see many of these failures throughout the logs, but this is the only data object which resulted in a failure to trim and a stale replica on the destination resource. I also see many messages like the ones described here: #246 (comment)

alanking · 2025-01-31T18:22:35Z

Forgot to include this in previous comment... Found the place in the logs where the unlink failed for this data object:

[-]    /irods_source/server/api/src/rsFileUnlink.cpp:112:int _rsFileUnlink(rsComm_t *, fileUnlinkInp_t *) :  status [UNIX_FILE_UNLINK_ERR]  errno [No such file or directory] -- message [fileRead failed for [/tmp/irods/ufs1/home/rods/test_put_multi_fetch_page/junk0474]-515002]
    [-] /irods_source/server/drivers/src/fileDriver.cpp:197:irods::error fileUnlink(rsComm_t *, irods::first_class_object_ptr) :  status [UNIX_FILE_UNLINK_ERR]  errno [No such file or directory] -- message [failed to call unlink]
        [-] /irods_source/plugins/resources/unixfilesystem/src/unixfilesystem.cpp:828:irods::error unix_file_unlink(irods::plugin_context &) :  status [UNIX_FILE_UNLINK_ERR]  errno [No such file or directory] -- message [Unlink error for file: "/tmp/irods/ufs1/home/rods/test_put_multi_fetch_page/junk0474", errno = "No such file or directory", status = -515002.]

"[dataObjUnlinkS:610] - failed to physically unlink replica [error_code=[-515002], path=[/tempZone/home/rods/test_put_multi_fetch_page/junk0474], hierarchy=[ufs1]"

Things are still fuzzy, but I'm seeing a few things now.

First, the "Function sequence error" is likely a problem similar (but not necessarily identical) to irods/irods#7440. I will need to try building against 4-3-stable to see if that changes the issue from the database error to an iRODS error (preferable, but not the solution to this issue).

The cause behind these database errors is still unclear. My initial thought was that multiple delay rule executors were picking up the rule at the same time and multiple replications were happening as a result. My suspicion is this "transaction isolation level" that we have to mess with for MySQL in the testing environment and elsewhere: https://github.com/irods/irods_testing_environment/blob/56e7c810dc6fbbfc9f0ccd139119314b8c3c6cb5/projects/ubuntu-24.04/ubuntu-24.04-mysql-8.4/docker-compose.yml#L4-L39

However, we don't see this failure for other databases. I tried running the test against Postgres 16 and carefully watched the logs. While I saw many messages about delay rules not existing...

"Could not find a delay rule with id [12637]."
"Could not get delay rule information [rule id=[12637]]"
"dequeueing rule because rule ID [12637] no longer exists in the catalog"

I did not see any errors about failure to replicate.

I tried running the test a few times on MySQL 8.0 and did not see the failure. I did see the failure to replicate, but no failure in the test itself, and no failures to trim afterwards.

Finally, I added some logging statements using the new logger (e.g. irods::experimental::log::rule_engine) and they're not showing up, despite calls to rodsLog working. So, something is wrong there, too.

Investigation ongoing.

alanking · 2025-01-31T22:31:09Z

I think this failure is not related to storage tiering. I was able to reproduce with a native rule language rule: irods/irods#8154 I don't think we will be able to fix this in 4.3.3.1 as the issue exists in the core server (either in the delay server or the replication API or the database plugin).

alanking · 2025-02-03T16:16:03Z

Removing from the 4.3.3.1 milestone as this issue is not related to storage tiering. We can circle back to this once irods/irods#8154 is resolved.

korydraughn added the testing label Aug 14, 2024

alanking added this to the 4.3.3.1 milestone Dec 12, 2024

alanking self-assigned this Dec 12, 2024

alanking mentioned this issue Jan 31, 2025

Scheduling replications of many independent data objects can result in many failures irods/irods#8154

Open

2 tasks

alanking removed this from the 4.3.3.1 milestone Feb 3, 2025

alanking removed their assignment Feb 3, 2025

This was referenced Feb 5, 2025

[#306] Ensure replicas are trimmed before replication in more tests (4-3-stable) #340

Merged

Test failure: TestStorageTieringContinueInxMigration.test_put_object_limit_lt #246

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`test_put_multi_fetch_page` fails frequently against Ubuntu 24.04 / MySQL 8.4 #279

`test_put_multi_fetch_page` fails frequently against Ubuntu 24.04 / MySQL 8.4 #279

korydraughn commented Aug 14, 2024

alanking commented Dec 12, 2024

alanking commented Jan 31, 2025

alanking commented Jan 31, 2025

alanking commented Jan 31, 2025

alanking commented Feb 3, 2025

test_put_multi_fetch_page fails frequently against Ubuntu 24.04 / MySQL 8.4 #279

test_put_multi_fetch_page fails frequently against Ubuntu 24.04 / MySQL 8.4 #279

Comments

korydraughn commented Aug 14, 2024

Bug Report

alanking commented Dec 12, 2024

alanking commented Jan 31, 2025

alanking commented Jan 31, 2025

alanking commented Jan 31, 2025

alanking commented Feb 3, 2025

`test_put_multi_fetch_page` fails frequently against Ubuntu 24.04 / MySQL 8.4 #279

`test_put_multi_fetch_page` fails frequently against Ubuntu 24.04 / MySQL 8.4 #279