Scheduling replications of many independent data objects can result in many failures #8154

Open
2 tasks
alanking opened this issue Jan 31, 2025 · 0 comments
@alanking

  • main
  • 4-3-stable

Bug Report

iRODS Version, OS and Version

4.3.3, Ubuntu 24.04, MySQL 8.0

What did you try to do?

Schedule a few hundred data objects to be replicated on the delay queue.

Expected behavior

I expect them all to migrate safely with no incidents.

Observed behavior (including steps to reproduce, if applicable)

There are many errors in the logs and, in some cases, the replication fails.

Here's how to do it...

Make 513 files (513 = 256 * 2 + 1, which forces paging in queries), each with some content so we're actually moving some bits. Then, iput them to iRODS:

mkdir -p bigdir ; for i in {0..512}; do echo ${i} > bigdir/file${i} ; done
iput bigdir -r

You'll have something that looks like this...

$ ils -l bigdir | head
/tempZone/home/rods/bigdir:
  rods              0 demoResc            2 2025-01-31.22:09 & file0
  rods              0 demoResc            2 2025-01-31.22:09 & file1
  rods              0 demoResc            3 2025-01-31.22:09 & file10
  rods              0 demoResc            4 2025-01-31.22:09 & file100
  rods              0 demoResc            4 2025-01-31.22:09 & file101
  rods              0 demoResc            4 2025-01-31.22:09 & file102
  rods              0 demoResc            4 2025-01-31.22:09 & file103
  rods              0 demoResc            4 2025-01-31.22:09 & file104
  rods              0 demoResc            4 2025-01-31.22:09 & file105
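
The rule in the next step addresses file0 through file512, so the local directory should hold exactly 513 files before the upload. A minimal sketch for creating and counting them with a portable loop; the helper name `make_test_files` is hypothetical and not part of iRODS:

```shell
# Hypothetical helper: create COUNT small files named file0..file(COUNT-1) in DIR.
# Each file contains its own index, so every file has a few bytes of content.
make_test_files() {
    dir="$1"
    count="$2"
    mkdir -p "$dir"
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "$i" > "$dir/file$i"
        i=$((i + 1))
    done
}

make_test_files bigdir 513

# Count what was created; the positional parameters hold the matching paths.
set -- bigdir/file*
echo "created $# files"
```

If the count is anything other than 513, the rule's loop bounds and the uploaded collection will not line up.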

Run a rule that asynchronously replicates the whole collection to another resource (this is very unpolished...):

main {
        *destination_resource = "AnotherResc";
        *collection = "/tempZone/home/rods/bigdir";
        *file_count = 256 * 2 + 1;
        for (*i = 0; *i < *file_count; *i = *i + 1) {
                *logical_path = "*collection/file*i";
                delay("<PLUSET>1s</PLUSET>") {
                        writeLine("serverLog", "Migrating [*logical_path] to [*destination_resource].");
                        msiDataObjRepl("*logical_path", "destRescName=*destination_resource", *status);
                }
        }
}

INPUT null
OUTPUT ruleExecOut

Things will move along fine for a while, and then you'll start getting messages like this:

"bindVar[1]=1"                                                                                                                          
"bindVar[2]="                                                                                                                           
"bindVar[3]=generic"                                                                                                                    
"bindVar[4]=3"                                                                                                                          
"bindVar[5]=/tmp/irods/AnotherResc/home/rods/bigdir/file18"                                                                             
"bindVar[6]=rods"                                                                                                                       
"bindVar[7]=tempZone"                                                                                                                   
"bindVar[8]=1"                                                                                                                          
"bindVar[9]="                                                                                                                           
"bindVar[10]="                                                                                                                          
"bindVar[11]=00000000000"                                                                                                               
"bindVar[12]=0"                                                                                                                         
"bindVar[13]=33204"                                                                                                                     
"bindVar[14]="                                                                                                                          
"bindVar[15]=01738361363"                                                                                                               
"bindVar[16]=01738361592"                                                                                                               
"bindVar[17]=13999"                                                                                                                     
"bindVar[18]=16826"
"bindVar[19]=13999"
"_cllExecSqlNoResult: SQLExecDirect error: -1 sql:update R_DATA_MAIN set data_repl_num = ?, data_version = ?, data_type_name = ?, data_size = ?, data_path = ?, data_owner_name = ?, data_owner_zone = ?, data_is_dirty = ?, data_status = ?, data_checksum = ?, data_expiry_ts = ?, data_map_id = ?, data_mode = ?, r_comment = ?, create_ts = ?, modify_ts = ?, resc_id = ? where data_id = ? and resc_id = ?"
"SQLSTATE: S1010"
"SQLCODE: 0"
"SQL Error message: [unixODBC][Driver Manager]Function sequence error"
"data_object_finalize cmlExecuteNoAnswerSql(rollback) succeeded"
"[db_data_object_finalize_op:15722] - [cmlExecuteNoAnswerSql failed [ec=[-806000]]]"
"failed to publish replica states for [16826]"
"[finalize_destination_replica:277] - failed to finalize data object [error_code=[-806000], path=[/tempZone/home/rods/bigdir/file18], hierarchy=[AnotherResc]]"
"[replicate_data:712] - closing destination replica [/tempZone/home/rods/bigdir/file18] failed with [-806000]"
"[replicate_data_object:929] - failed to replicate [/tempZone/home/rods/bigdir/file18]"
"rsDataObjRepl - Failed to replicate data object. status:[-806000]"
"msiDataObjRepl: rsDataObjRepl failed /tempZone/home/rods/bigdir/file18, status = -806000"
"executeRuleAction Failed for msiDataObjRepl status = -806000 CAT_SQL_ERR"
"executeRuleBody: Microservice or Action msiDataObjRepl Failed with status -806000"
"execMicroService3: error when executing microservice\nline 3, col 3\n\t\t\tmsiDataObjRepl(\"*logical_path\", \"destRescName=*destination_resource\", *status);\n\t\t\t^\n\ncaused by: DEBUG: msiDataObjRepl: rsDataObjRepl failed /tempZone/home/rods/bigdir/file18, status = -806000\n\n"
"Rule Engine Plugin returned [-806000]."
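
With a few hundred delayed rules in flight, it helps to reduce messages like the ones above to a list of affected logical paths. A rough sketch, assuming the log messages have already been extracted to plain text in the form shown here; the function name `failed_replication_paths` is hypothetical:

```shell
# Hypothetical helper: read log messages on stdin and print each logical path
# that appears in a "failed to replicate [<path>]" message, once per path.
failed_replication_paths() {
    sed -n 's/.*failed to replicate \[\([^]]*\)\].*/\1/p' | LC_ALL=C sort -u
}
```

For example, `failed_replication_paths < messages.txt` (where `messages.txt` is a placeholder for wherever the messages were saved) would print `/tempZone/home/rods/bigdir/file18` for the excerpt above. The match is case-sensitive, so the `rsDataObjRepl - Failed to replicate data object` line, which carries no path, is ignored.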

Sometimes (not all the time), objects will get stuck in the intermediate state.

Other times, the new replica will be stale and the source replica will be good.

$ ils -L bigdir/file18
  rods              0 demoResc            3 2025-01-31.22:09 & file18
        generic    /var/lib/irods/Vault/home/rods/bigdir/file18
  rods              1 AnotherResc            3 2025-01-31.22:13 X file18
        generic    /tmp/irods/AnotherResc/home/rods/bigdir/file18
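
To see how many replicas ended up in each state without inspecting objects one at a time, the status marker column of the listing can be tallied per resource. A minimal sketch, assuming the `ils -l` column layout shown in this report (owner, replica number, resource, size, timestamp, marker, name); the function name `tally_replica_markers` is hypothetical:

```shell
# Hypothetical helper: read `ils -l` (or `ils -L`) output on stdin and count
# the status markers of replicas that live on the resource named in $1.
# Assumes the marker is always the sixth whitespace-separated field.
tally_replica_markers() {
    resc="$1"
    awk -v resc="$resc" '
        # Replica lines have at least 7 fields; collection headers and the
        # physical-path lines printed by -L have fewer and are skipped.
        NF >= 7 && $3 == resc { count[$6]++ }
        END { for (m in count) printf "%s %d\n", m, count[m] }
    ' | LC_ALL=C sort
}
```

Running `ils -l bigdir | tally_replica_markers AnotherResc` prints one line per marker with its count, where `&` is a good replica and `X` is a stale one, as in the listings above. Any other marker in the output points at replicas that need a closer look.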

This issue does not seem to happen when the catalog database is PostgreSQL.

This issue is likely the cause of irods/irods_capability_storage_tiering#279.
