Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_put_multi_fetch_page fails frequently against Ubuntu 24.04 / MySQL 8.4 #279

Open
2 tasks
korydraughn opened this issue Aug 14, 2024 · 5 comments
Open
2 tasks
Labels

Comments

@korydraughn
Copy link
Collaborator

  • main
  • 4-3-stable

Bug Report

Encountered during testing of what will be iRODS 4.3.3.
Platform is Ubuntu 24.04.
Database is MySQL 8.4.

The test fails due to the following assertion.

delay_assert(lambda: admin_session.assert_icommand_fail(['ils', '-l', dirname], 'STDOUT_SINGLELINE', 'ufs0'))

The test creates (256 * 2) + 1 data objects and then starts waiting for them to be moved to another tier. This works, but close observation shows there are failures which lead to stale replicas existing on the original tier. I've noticed at least 3 replicas in this state following test completion. The test fails because it finds replicas on the original tier, even though it moved 400+ replicas.

Decreasing the number of replicas involved (by 100 or so) resulted in the test passing. However, reducing the number of replicas isn't a real fix. We need to figure out WHY data movement fails for some replicas.

Below is the output of the failed test.

        <testcase classname="irods.test.test_plugin_unified_storage_tiering.TestStorageTieringContinueInxMigration" name="test_put_multi_fetch_page" time="287.555" timestamp="2024-08-13T21:11:55" file="scripts/irods/test/test_plugin_unified_storage_tiering.py" line="1503">
                <failure type="AssertionError" message=""><![CDATA[Traceback (most recent call last):
  File "/var/lib/irods/scripts/irods/test/test_plugin_unified_storage_tiering.py", line 180, in delay_assert
    out, err, rc = function()
                   ^^^^^^^^^^
  File "/var/lib/irods/scripts/irods/test/test_plugin_unified_storage_tiering.py", line 1520, in <lambda>
    delay_assert(lambda: admin_session.assert_icommand_fail(['ils', '-l', dirname], 'STDOUT_SINGLELINE', 'ufs0'))
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/irods/scripts/irods/test/session.py", line 166, in assert_icommand_fail
    return assert_command_fail(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/irods/scripts/irods/test/command.py", line 79, in assert_command_fail
    return _assert_helper(*args, should_fail=True, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/irods/scripts/irods/test/command.py", line 104, in _assert_helper
    assert result
           ^^^^^^
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/irods/scripts/irods/test/test_plugin_unified_storage_tiering.py", line 1520, in test_put_multi_fetch_page
    delay_assert(lambda: admin_session.assert_icommand_fail(['ils', '-l', dirname], 'STDOUT_SINGLELINE', 'ufs0'))
  File "/var/lib/irods/scripts/irods/test/test_plugin_unified_storage_tiering.py", line 187, in delay_assert
    assert(False)
           ^^^^^
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/irods/scripts/irods/test/test_plugin_unified_storage_tiering.py", line 1526, in test_put_multi_fetch_page
    admin_session.assert_icommand('irm -r ' + dirname)
  File "/var/lib/irods/scripts/irods/test/session.py", line 162, in assert_icommand
    return assert_command(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/irods/scripts/irods/test/command.py", line 76, in assert_command
    return _assert_helper(*args, should_fail=False, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/irods/scripts/irods/test/command.py", line 104, in _assert_helper
    assert result
           ^^^^^^
AssertionError
]]></failure>

Here is the final listing before the assertion fails.

id     name
11110 {"delay_conditions":"<INST_NAME>irods_rule_engine_plugin-unified_storage_tiering-instance</INST_NAME><EF>60s REPEAT UNTIL SUCCESS OR 5 TIMES</EF><PLUSET>1s</PLUSET>","destination-resource":"ufs1","group-name":"example_group","md5":"ca0e41ba44e21e0c2e4eb7d9064d0caf","object-path":"/tempZone/home/rods/test_put_multi_fetch_page/junk0365","preserve-replicas":false,"rule-
engine-instance-name":"irods_rule_engine_plugin-unified_storage_tiering-instance","rule-engine-operation":"irods_policy_data_movement","source-replica-number":"0","source-resource":"ufs0","user-name":"rods","user-zone":"tempZone","verification-type":"catalog"}

 --- IrodsSession: icommand executed by [rods#tempZone] [ils -l test_put_multi_fetch_page] ---
Assert FAIL Command: ils -l test_put_multi_fetch_page
Expecting STDOUT_SINGLELINE: ['ufs0']
  stdout:
    | /tempZone/home/rods/test_put_multi_fetch_page:
<snip>
    |   rods              1 ufs1            1 2024-08-14.15:32 & junk0360
    |   rods              1 ufs1            1 2024-08-14.15:32 & junk0361
    |   rods              1 ufs1            1 2024-08-14.15:32 & junk0362
    |   rods              1 ufs1            1 2024-08-14.15:32 & junk0363
    |   rods              1 ufs1            1 2024-08-14.15:32 & junk0364
    |   rods              0 ufs0            1 2024-08-14.15:30 & junk0365  <== Should not see ufs0.
    |   rods              1 ufs1            1 2024-08-14.15:32 X junk0365  <== Or a stale replica.
    |   rods              1 ufs1            1 2024-08-14.15:32 & junk0366
    |   rods              1 ufs1            1 2024-08-14.15:32 & junk0367
    |   rods              1 ufs1            1 2024-08-14.15:32 & junk0368
    |   rods              1 ufs1            1 2024-08-14.15:32 & junk0369
<snip>
@alanking alanking added this to the 4.3.3.1 milestone Dec 12, 2024
@alanking alanking self-assigned this Dec 12, 2024
@alanking
Copy link
Contributor

While this is not the same failure as #246, it is in the same test class as that failure, so I'm going to link them as possibly related...

@alanking
Copy link
Contributor

Saw the following messages in the log following a failure...

"bindVar[1]=0"
"bindVar[2]="
"bindVar[3]=generic"
"bindVar[4]=1"
"bindVar[5]=/tmp/irods/ufs0/home/rods/test_put_multi_fetch_page/junk0474"
"bindVar[6]=rods"
"bindVar[7]=tempZone"
"bindVar[8]=1"
"bindVar[9]="
"bindVar[10]="
"bindVar[11]=00000000000"
"bindVar[12]=0"
"bindVar[13]=33204"
"bindVar[14]="
"bindVar[15]=01738335972"
"bindVar[16]=01738335972"
"bindVar[17]=11454"
"bindVar[18]=12111"
"bindVar[19]=11454"
"_cllExecSqlNoResult: SQLExecDirect error: -1 sql:update R_DATA_MAIN set data_repl_num = ?, data_version = ?, data_type_name = ?, data_size = ?, data_path = ?, data_owner_name = ?, data_owner_zone = ?, data_is_dirty = ?, data_status = ?, data_checksum = ?, data_expiry_ts = ?, data_map_id = ?, data_mode = ?, r_comment = ?, create_ts = ?, modify_ts = ?, resc_id = ? where data_id = ? and resc_id = ?"
"SQLSTATE: S1010"
"SQLCODE: 0"
"SQL Error message: [unixODBC][Driver Manager]Function sequence error"
"data_object_finalize cmlExecuteNoAnswerSql(rollback) succeeded"
"[db_data_object_finalize_op:15722] - [cmlExecuteNoAnswerSql failed [ec=[-806000]]]"
"failed to publish replica states for [12111]"
"[finalize_destination_replica:277] - failed to finalize data object [error_code=[-806000], path=[/tempZone/home/rods/test_put_multi_fetch_page/junk0474], hierarchy=[ufs1]]"
"[replicate_data:712] - closing destination replica [/tempZone/home/rods/test_put_multi_fetch_page/junk0474] failed with [-806000]"
"[replicate_data_object:929] - failed to replicate [/tempZone/home/rods/test_put_multi_fetch_page/junk0474]"
"rsDataObjRepl - Failed to replicate data object. status:[-806000]"
"[-]\t/irods_plugin_source/src/main.cpp:834:irods::error exec_rule_expression(irods::default_re_ctx &, const std::string &, msParamArray_t *, irods::callback) :  status [CAT_SQL_ERR]  errno [] -- message [iRODS Exception:\n    file: /irods_plugin_source/src/main.cpp\n    function: void (anonymous namespace)::replicate_object_to_resource(rcComm_t *, const std::string &, const std::string &, const std::string &, const std::string &)\n    line: 142\n    code: -806000 (CAT_SQL_ERR)\n    message:\n        failed to migrate [/tempZone/home/rods/test_put_multi_fetch_page/junk0474] to [ufs1]\nstack trace:\n--------------\n 0# irods::stacktrace::dump() const in /lib/libirods_common.so.4.3.3\n 1# irods::exception::assemble_full_display_what() const in /lib/libirods_common.so.4.3.3\n 2# irods::exception::what() const in /lib/libirods_common.so.4.3.3\n 3# exec_rule_expression(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback) in /usr/lib/irods/plugins/rule_engines/libirods_rule_engine_plugin-unified_storage_tiering.so\n 4# std::__1::__function::__func<irods::error (*)(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback), std::__1::allocator<irods::error (*)(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback)>, irods::error (std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback)>::operator()(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*&&, irods::callback&&) in /usr/lib/irods/plugins/rule_engines/libirods_rule_engine_plugin-unified_storage_tiering.so\n 5# irods::pluggable_rule_engine<std::__1::tuple<> >::exec_rule_expression(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback) in /lib/libirods_server.so.4.3.3\n 6# irods::rule_engine_context_manager<std::__1::tuple<>, RuleExecInfo*, (irods::rule_execution_manager_pack)0>::exec_rule_expression(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*) in /lib/libirods_server.so.4.3.3\n 7# rsExecRuleExpression(RsComm*, ExecRuleExpression*) in /lib/libirods_server.so.4.3.3\n 8# irods::api_call_adaptor<ExecRuleExpression*>::operator()(irods::plugin_context&, RsComm*, ExecRuleExpression*) in /lib/libirods_server.so.4.3.3\n 9# std::__1::__function::__func<irods::api_call_adaptor<ExecRuleExpression*>, std::__1::allocator<irods::api_call_adaptor<ExecRuleExpression*> >, irods::error (irods::plugin_context&, RsComm*, ExecRuleExpression*)>::operator()(irods::plugin_context&, RsComm*&&, ExecRuleExpression*&&) in /lib/libirods_server.so.4.3.3\n10# int irods::api_entry::call_handler<ExecRuleExpression*>(RsComm*, ExecRuleExpression*) in /lib/libirods_server.so.4.3.3\n11# rsApiHandler(RsComm*, int, BytesBuf*, BytesBuf*) in /lib/libirods_server.so.4.3.3\n12# readAndProcClientMsg(RsComm*, int) in /lib/libirods_server.so.4.3.3\n13# agentMain(RsComm*) in /lib/libirods_server.so.4.3.3\n14# runIrodsAgentFactory(sockaddr_un) in /lib/libirods_server.so.4.3.3\n15# main::$_5::operator()() const at rodsServer.cpp:?\n16# main in /usr/sbin/irodsServer\n17# 0x00007F97DE50F1CA in /lib/x86_64-linux-gnu/libc.so.6\n18# __libc_start_main in /lib/x86_64-linux-gnu/libc.so.6\n19# _start in /usr/sbin/irodsServer\n\n]\n\n"

Here's a more legible version of that iRODS stacktrace:

    file: /irods_plugin_source/src/main.cpp
    function: void (anonymous namespace)::replicate_object_to_resource(rcComm_t *, const std::string &, const std::string &, const std::string &, const std::string &)
    line: 142 
    code: -806000 (CAT_SQL_ERR)
    message:
        failed to migrate [/tempZone/home/rods/test_put_multi_fetch_page/junk0474] to [ufs1]
stack trace:
--------------
 0# irods::stacktrace::dump() const in /lib/libirods_common.so.4.3.3
 1# irods::exception::assemble_full_display_what() const in /lib/libirods_common.so.4.3.3
 2# irods::exception::what() const in /lib/libirods_common.so.4.3.3
 3# exec_rule_expression(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback) in /usr/lib/irods/plugins/rule_engines/libirods_rule_engine_plugin-unified_storage_tiering.so
 4# std::__1::__function::__func<irods::error (*)(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback), std::__1::allocator<irods::error (*)(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback)>, irods::error (std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback)>::operator()(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*&&, irods::callback&&) in /usr/lib/irods/plugins/rule_engines/libirods_rule_engine_plugin-unified_storage_tiering.so
 5# irods::pluggable_rule_engine<std::__1::tuple<> >::exec_rule_expression(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback) in /lib/libirods_server.so.4.3.3
 6# irods::rule_engine_context_manager<std::__1::tuple<>, RuleExecInfo*, (irods::rule_execution_manager_pack)0>::exec_rule_expression(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*) in /lib/libirods_server.so.4.3.3
 7# rsExecRuleExpression(RsComm*, ExecRuleExpression*) in /lib/libirods_server.so.4.3.3
 8# irods::api_call_adaptor<ExecRuleExpression*>::operator()(irods::plugin_context&, RsComm*, ExecRuleExpression*) in /lib/libirods_server.so.4.3.3
 9# std::__1::__function::__func<irods::api_call_adaptor<ExecRuleExpression*>, std::__1::allocator<irods::api_call_adaptor<ExecRuleExpression*> >, irods::error (irods::plugin_context&, RsComm*, ExecRuleExpression*)>::operator()(irods::plugin_context&, RsComm*&&, ExecRuleExpression*&&) in /lib/libirods_server.so.4.3.3
10# int irods::api_entry::call_handler<ExecRuleExpression*>(RsComm*, ExecRuleExpression*) in /lib/libirods_server.so.4.3.3
11# rsApiHandler(RsComm*, int, BytesBuf*, BytesBuf*) in /lib/libirods_server.so.4.3.3
12# readAndProcClientMsg(RsComm*, int) in /lib/libirods_server.so.4.3.3
13# agentMain(RsComm*) in /lib/libirods_server.so.4.3.3
14# runIrodsAgentFactory(sockaddr_un) in /lib/libirods_server.so.4.3.3
15# main::$_5::operator()() const at rodsServer.cpp:?
16# main in /usr/sbin/irodsServer
17# 0x00007F97DE50F1CA in /lib/x86_64-linux-gnu/libc.so.6
18# __libc_start_main in /lib/x86_64-linux-gnu/libc.so.6
19# _start in /usr/sbin/irodsServer

I see many of these failures throughout the logs, but this is the only data object which resulted in a failure to trim and a stale replica on the destination resource. I also see many messages like the ones described here: #246 (comment)

@alanking
Copy link
Contributor

Forgot to include this in previous comment... Found the place in the logs where the unlink failed for this data object:

[-]    /irods_source/server/api/src/rsFileUnlink.cpp:112:int _rsFileUnlink(rsComm_t *, fileUnlinkInp_t *) :  status [UNIX_FILE_UNLINK_ERR]  errno [No such file or directory] -- message [fileRead failed for [/tmp/irods/ufs1/home/rods/test_put_multi_fetch_page/junk0474]-515002]
    [-] /irods_source/server/drivers/src/fileDriver.cpp:197:irods::error fileUnlink(rsComm_t *, irods::first_class_object_ptr) :  status [UNIX_FILE_UNLINK_ERR]  errno [No such file or directory] -- message [failed to call unlink]
        [-] /irods_source/plugins/resources/unixfilesystem/src/unixfilesystem.cpp:828:irods::error unix_file_unlink(irods::plugin_context &) :  status [UNIX_FILE_UNLINK_ERR]  errno [No such file or directory] -- message [Unlink error for file: "/tmp/irods/ufs1/home/rods/test_put_multi_fetch_page/junk0474", errno = "No such file or directory", status = -515002.]
"[dataObjUnlinkS:610] - failed to physically unlink replica [error_code=[-515002], path=[/tempZone/home/rods/test_put_multi_fetch_page/junk0474], hierarchy=[ufs1]"

Things are still fuzzy, but I'm seeing a few things now.

First, the "Function sequence error" is likely a problem similar (but not necessarily identical) to irods/irods#7440. I will need to try building against 4-3-stable to see if that changes the issue from the database error to an iRODS error (preferable, but not the solution to this issue).

The cause behind these database errors is still unclear. My initial thought was that multiple delay rule executors were picking up the rule at the same time and multiple replications were happening as a result. My suspicion is this "transaction isolation level" that we have to mess with for MySQL in the testing environment and elsewhere: https://github.com/irods/irods_testing_environment/blob/56e7c810dc6fbbfc9f0ccd139119314b8c3c6cb5/projects/ubuntu-24.04/ubuntu-24.04-mysql-8.4/docker-compose.yml#L4-L39

However, we don't see this failure for other databases. I tried running the test against Postgres 16 and carefully watched the logs. While I saw many messages about delay rules not existing...

"Could not find a delay rule with id [12637]."
"Could not get delay rule information [rule id=[12637]]"
"dequeueing rule because rule ID [12637] no longer exists in the catalog"

I did not see any errors about failure to replicate.

I tried running the test a few times on MySQL 8.0 and did not see the failure. I did see the failure to replicate, but no failure in the test itself, and no failures to trim afterwards.

Finally, I added some logging statements using the new logger (e.g. irods::experimental::log::rule_engine) and they're not showing up, despite calls to rodsLog working. So, something is wrong there, too.

Investigation ongoing.

@alanking
Copy link
Contributor

I think this failure is not related to storage tiering. I was able to reproduce with a native rule language rule: irods/irods#8154 I don't think we will be able to fix this in 4.3.3.1 as the issue exists in the core server (either in the delay server or the replication API or the database plugin).

@alanking alanking removed this from the 4.3.3.1 milestone Feb 3, 2025
@alanking
Copy link
Contributor

alanking commented Feb 3, 2025

Removing from the 4.3.3.1 milestone as this issue is not related to storage tiering. We can circle back to this once irods/irods#8154 is resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

No branches or pull requests

2 participants