Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MDEV-35304: Add Connects_Tried and Primary_Retry_Count to SSS #3764

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

ParadoxV5
Copy link
Contributor

@ParadoxV5 ParadoxV5 commented Jan 14, 2025

  • The Jira issue number for this PR is: MDEV-35304

Description

These new Master_info numbers supplement SHOW SLAVE REPLICA STATUS’s (most-recent) ‘Connecting’ state with statistics on (re)connect attempts:

  • Connects_Tried: how many retries have been attempted so far
    • This was previously a local variable that only counter re-attempts; it’s now meaningful even after the “Connecting” state concludes.
  • Primary_Retry_Count: out of how many configured
    • Formerly known as the global option --master-retry-count, it’s now copied per-replica to pave way for CHANGE PRIMARY MASTER … TO in MDEV-25674.
    • I’m pre-emptively naming this one ‘Primary’ re. MDEV-30189 of MDEV-18777.

This PR also includes additional commits that refactor relevant code.
I strongly recommend reviewing commit-by-commit.

What problem is the patch trying to solve?

When the IO thread (re)connect to a primary, no updates are available besides unique errors that cause the failure.

If some output changed that is not visible in a test case, what was it looking like before the change and how it's looking with this patch applied?

From #3764 (comment), appearently:

  • --master-retry-count=0 is treated as =1 for reconnecting to a previously connected primary.
  • There may have been an extra sleep in try_to_reconnect().
    In full, looks like this happens when the connection drops at a different substep in the setup phase.

Both of these code are now gone as part of a preparation commit.

Do you think this patch might introduce side-effects in other parts of the server?

Component that expects a fixed number of SHOW REPLICA STATUS entries in fixed sections might start seeïng uninvited guests.
On that note, we have many MTR tests that run show slave status directly and in full (i.e., not using source include/show_slave_status.inc).

Release Notes

Added two new SLOW REPLICA STATUS entries for statistics of the most-recent ‘Connecting’ state:

  • Connects_Tried: How many retries were attempted
  • Primary_Retry_Count: Out of how many configured (i.e., the --master-retry-count option)

Of course, they’re also accessible from INFORMATION_SCHEMA.SLAVE_STATUS; e.g.:

SELECT
  Connection_name,
  Slave_IO_Running,
  JOIN(Connects_Tried, 'attempt(s) used out of ', Primary_Retry_Count, ' max')
FROM INFORMATION_SCHEMA.SLAVE_STATUS;

Knowledge Base pages that need changing

How can this PR be tested?

I need some help from test specialists.

I’ve drafted the MTR test rpl.rpl_connects_tried for Connets_Tried’s behavior, but I couldn’t test it locally because my build doesn’t have debug_sync.
I couldn’t refer to the buildbots either – they’re either successful (no search results on Cross Reference either) or they fail with issues I have no clue how they relate to my modifications 😶‍🌫️. (Besides the ones that already fail in main, that is 😶.)

Besides checking that Primary_Retry_Count matches --master-retry-count, we should also test the option itself regarding #3764 (comment).
A draft MTR test is at #3731.

If the changes are not amenable to automated testing, please explain why not and carefully describe how to test manually.

A tester can instead replicate (wat) the MTR test by observing the SHOW REPLICA STATUS and/or INFORMATION_SCHEMA.SLAVE_STATUS of a long-master_connect_retry replication over time.

Basing the PR against the correct MariaDB version

  • This is a new feature or a refactoring, and the PR is based against the main branch.
  • This is a bug fix, and the PR is based against the earliest maintained branch in which the bug can be reproduced.

PR quality check

  • I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
  • For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

@ParadoxV5 ParadoxV5 requested a review from andrelkin January 14, 2025 20:08
@ParadoxV5 ParadoxV5 changed the title Mdev 35304 MDEV-35304: Add Connects_Tried and Primary_Retry_Count to SSS Jan 14, 2025
@ParadoxV5 ParadoxV5 force-pushed the mdev-35304 branch 6 times, most recently from 87112f2 to 43da219 Compare January 16, 2025 03:51
Copy link
Contributor Author

@ParadoxV5 ParadoxV5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the code, the --master-retry-count=0 behavior appearently doesn’t match the doc’s expectation:

A value of 0 means the replica will not stop attempting to reconnect.
https://mariadb.com/kb/en/mariadbd-options/#-master-retry-count

This may be a bug.
The code is found as early as 10.5.

connect
*/
if (++err_count == master_retry_count)
if (++(mi->connects_tried) == mi->retry_count)
Copy link
Contributor Author

@ParadoxV5 ParadoxV5 Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There’s no special case for zero since 👀 2002.

With an ==, it would treat 0 as ULONG_MAX + 1.
Fortunately, even if it retries once every millisecond, I don’t see a practical difference between 2^32 attempts (totalling almost 50 days) and “the server is down”.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A question: The documentation for --master-retry-count states:

Number of times a replica will attempt to connect to a primary before giving up.
A value of 0 means the replica will not stop attempting to reconnect.

Are these still true after this patch?

Right, I think this makes --master-retry-count=0 mean forever, ok (partly answering my own question above).

I believe they were not true even before this patch.

C/C++ integers are not limitless (unlike Ruby – I main Ruby 💎).
The counter (new or old) will rollover to 0 when it performs ++ULONG_MAX.

So it’s not forever forever, only practically forever.

Comment on lines -4431 to -4434
if ((*retry_count)++)
{
if (*retry_count > master_retry_count)
return 1; // Don't retry forever
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be hard to comprehend. Step by step:

  1. If *retry_count was 0, skip the block
  2. Else If *retry_count becomes > master_retry_count, return 1
  3. Else, proceed with the block

This could mean that for --master-retry-count=0, the termination condition always matches, but the flow always bypasses it for the first try. That is, 0 is equivalent to 1.

I tried writing an MTR test for --master-retry-count, but my build doesn’t have debug_sync, and appearently none of our buildbots have both it and log_bin…??? (#3731)

@ParadoxV5
Copy link
Contributor Author

ParadoxV5 commented Jan 16, 2025

innodb.log_file_size_online and main.log_state fails on main.
P.S.: and appearently unit.conc_connection too.

`try_to_reconnect()` wraps `safe_reconnect()` with logging, but the
latter already loops reconnection attempts up to `master_retry_count`
times with `mi->connect_retry`-msec sleeps inbetween.
This means `try_to_reconnect()` has been counting number of disconnects
(since it doesn’t have a loop) while `safe_reconnect()` was counting
actual attempts (which may be multiple per disconnect).
In practice, this outer counter’s only benefit was to cover the edge
case `--master-retry-count=0` for the inner loop… by treating it as 1…
“Lightly” refactor `try_to_reconnect()` `messages`
to reduce duplication and improve consistency
When the IO thread (re)connect to a primary,
no updates are available besides unique errors that cause the failure.
These new `Master_info` numbers supplement SHOW REPLICA STATUS’s (most-
recent) ‘Connecting’ state with statistics on (re)connect attempts:

* `Connects_Tried`: how many retries have been attempted so far
  This was previously a local variable that only counter re-attempts;
  it’s now meaningful even after the “Connecting” state concludes.

* `Primary_Retry_Count`: out of how many configured
  Formerly known as the global option `--master-retry-count`, it’s now
  copied per-replica to pave way for CHANGE MASTER … TO in MDEV-25674.
* `sql/mysqld.cc`: init `master-retry-count` with `master_retry_count`
* `get_master_version_and_clock()` de-duplicate label using fall-through
* `io_slave_killed()` & `check_io_slave_killed()`:
  * reüse the result from the level lower
  * add distinguishing docs
* `try_to_reconnect()`: extract `'` from `if`-`else`
* `handle_slave_io()`: Both `while`s have the same condition;
  looks like the outer `while` can simply be an `if`.
* `connect_to_master()`:
  * assume `mysql_errno()` is not 0 on connection error
  * utilize 0’s falsiness in the loop
  * remove kill check around error reporting –
    other kill checks don’t even use this result
  * extend docs
These tests dump the whole SSS – results are not guaranteed consistent!
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Room for refactor that I excluded from this PR:

  • There are a few instances of
    if (!…)
    …
    else
  • Both register_slave_on_master and request_dump provides failure warnings, yet their caller, handle_slave_io, still emits similar warnings on their failure.
  • Extraneous EOL whitespaces

@ParadoxV5 ParadoxV5 marked this pull request as ready for review January 16, 2025 06:21
@ParadoxV5
Copy link
Contributor Author

I’m done with staring at these consistent but incomprehensible test failures.

Copy link
Member

@knielsen knielsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi ParadoxV5,

Here is my review of MDEV-35304:

diff --git a/mysql-test/include/check-testcase.inc b/mysql-test/include/check-testcase.inc
index 25990c1a9e379..4ae0da30d7bba 100644
--- a/mysql-test/include/check-testcase.inc
+++ b/mysql-test/include/check-testcase.inc
@@ -74,10 +74,12 @@ if ($tmp)
--echo Slave_Non_Transactional_Groups #
--echo Slave_Transactional_Groups #
--echo Replicate_Rewrite_DB #

  • --echo Connects_Tried #
  • --echo Primary_Retry_Count #

IIUC, this is in preparation for adding a MASTER_RETRY_COUNT to the CHANGE
MASTER, MDEV-25674.

But this is not done in this patch. So let's add that output to SHOW SLAVE
STATUS as part of MDEV-25674, not as part of this change. SHOW SLAVE STATUS
is for status about the running slave and for values setable with CHANGE
MASTER, not for showing the values of configuration parameters.

Second, when this is added (here or as part of MDEV-25674), use consistent
naming using the primary Master/Slave terminology, do not introduce a mix
with the alternate terminology primary/replica.

So "Master_Retry_Count". But I would suggest that when this is added as
part of MDEV-25674, to name it better as "Master_Max_Connect_Tries" or
something like that. To make it clear that 1) it is the configured maximum,
not any current count, and 2) it is the total number of connect tries,
whereas "retry" might suggest it does not include the initial try.

diff --git a/mysql-test/suite/rpl/t/rpl_connects_tried.test b/mysql-test/suite/rpl/t/rpl_connects_tried.test
new file mode 100644
index 0000000000000..b9f5dd3163030
--- /dev/null
+++ b/mysql-test/suite/rpl/t/rpl_connects_tried.test
@@ -0,0 +1,52 @@

+--SOURCE include/have_debug.inc
+--SOURCE include/have_debug_sync.inc
+--SOURCE include/have_log_bin.inc # The test is agnostic of binlog formats.
+--SOURCE include/master-slave.inc

Please use lower-case for mysqltest commands --source, --let, --connection,
etc. to be consistent with other test files.

+--LET $rpl_server_number= 1
+--LET $status_items= Slave_IO_Running
+
+#--CALL mtr.add_suppression("Slave I/O: Master command COM_REGISTER_SLAVE failed: .")
+#--CALL mtr.add_suppression("Slave I/O: .
failed with error: Lost connection to MySQL server at 'reading initial communication packet'")
+#--CALL mtr.add_suppression("Fatal error: The slave I/O thread stops because master and slave have equal MySQL server ids; .")
+#--CALL mtr.add_suppression("Slave I/O thread .
register on master")

Why are these commented out but left in the test case?

diff --git a/mysql-test/suite/rpl/t/rpl_heartbeat_debug.test b/mysql-test/suite/rpl/t/rpl_heartbeat_debug.test
index e593786655bcb..8b1d7517b2e1e 100644
--- a/mysql-test/suite/rpl/t/rpl_heartbeat_debug.test
+++ b/mysql-test/suite/rpl/t/rpl_heartbeat_debug.test
@@ -5,7 +5,7 @@

--disable_query_log
CALL mtr.add_suppression('SET @master_heartbeat_period to master failed with error');
-CALL mtr.add_suppression('Master command COM_REGISTER_SLAVE failed: failed registering on master, reconnecting to try again');
+CALL mtr.add_suppression('Master command COM_REGISTER_SLAVE failed: Replica I/O thread: register on primary failed, reconnecting to try again');

Should be "Slave I/O thread: ...", see comment below.

diff --git a/sql/mysqld.cc b/sql/mysqld.cc
index 89d9cc646c8ca..f69d6fcfc5ae0 100644
--- a/sql/mysqld.cc
+++ b/sql/mysqld.cc
@@ -780,7 +780,7 @@ File_parser_dummy_hook file_parser_dummy_hook;

/* replication parameters, if master_host is not NULL, we are a slave */
uint report_port= 0;
-ulong master_retry_count=0;
+ulong master_retry_count=100000;

@@ -6820,7 +6820,7 @@ struct my_option my_long_options[]=
{"master-retry-count", 0,
"The number of tries the slave will make to connect to the master before giving up",
&master_retry_count, &master_retry_count, 0, GET_ULONG,

  • REQUIRED_ARG, 100000, 0, 0, 0, 0, 0},
  • REQUIRED_ARG, static_cast(master_retry_count), 0, 0, 0, 0, 0},

This looks strange. These are statically initialized global variables. How
can you initialize one of them with the value of the other? Better use a
#define or static constexpr to initialize them both. Or do you have a
reference to somewhere it says this is valid standard c++?

A question: The documentation for --master-retry-count states:

Number of times a replica will attempt to connect to a primary before
giving up.

A value of 0 means the replica will not stop attempting to reconnect.

Are these still true after this patch?

diff --git a/sql/rpl_mi.h b/sql/rpl_mi.h
index 37a5fc30099c9..de8a422ae9498 100644
--- a/sql/rpl_mi.h
+++ b/sql/rpl_mi.h
@@ -253,7 +253,12 @@ class Master_info : public Slave_reporting_capability

  • /** replica-specific @ref master_retry_count */

s/replica/slave/

  • ulong retry_count;

Better name, for example max_connect_tries. This is easily confused with a
count of the current count of connection attempts that failed.

diff --git a/sql/slave.cc b/sql/slave.cc
index e16834a387da3..1d748789a245a 100644
--- a/sql/slave.cc
+++ b/sql/slave.cc
@@ -102,57 +102,33 @@ LEX_CSTRING default_master_connection_name= { (char*) "", 0 };

  • reg=
  • {
  • "Reconnecting after a failed registration on primary",
  • "register on primary",

Keep to the standard terminology master/slave, don't introduce a mix with
the alternate primary/replica terminology except in optional user aliases.

And best not to change the error message wording unless there is a good
reason to do so. Users may occasionally have regexps or something matching
against the old wording.

  • "Reconnecting after a failed primary event read",
  • "primary event read",

s/primary/master/

@@ -2609,10 +2586,6 @@ when it try to get the value of TIME_ZONE global variable from master.";
DBUG_RETURN(0);

network_err:

  • if (master_res)
  • mysql_free_result(master_res);
  • DBUG_RETURN(2);

slave_killed_err:

Agree, thanks for spotting the redundancy.
Did you consider merging the two goto labels into one also? But just a
suggestion, up to you.

@@ -4445,36 +4409,35 @@ static int try_to_reconnect(THD *thd, MYSQL *mysql, Master_info *mi,

  • my_snprintf(buf, sizeof(buf), messages[SLAVE_RECON_MSG_FAILED],
  •            IO_RPL_LOG_NAME, mi->master_log_pos,
    
  • my_snprintf(buf, sizeof(buf), "Replica I/O thread: %s failed, "

s/Replica/Slave/

@@ -4695,7 +4656,7 @@ pthread_handler_t handle_slave_io(void *arg)

DBUG_PRINT("info",("Starting reading binary log from master"));
thd->set_command(COM_SLAVE_IO);

  • while (!io_slave_killed(mi))
  • if (!io_slave_killed(mi))
    {

So this changes a nested while() { ... while() { ... } ... } into a
single while() inside an if().

It was hard for me to see exactly what the reason was for this in the patch
/ why it is correct, can you try to explain the reason for it?

@@ -7038,16 +6998,10 @@ static int connect_to_master(THD* thd, MYSQL* mysql, Master_info* mi,
" - retry-time: %d maximum-retries: %lu message: %s",
(reconnect ? "reconnecting" : "connecting"),
mi->user, mi->host, mi->port,

  •             mi->connect_retry, master_retry_count,
    
  •             mi->connect_retry, mi->retry_count,
                mysql_error(mysql));
    
    }
  • /*
  •  By default we try forever. The reason is that failure will trigger
    
  •  master election, so if the user did not set master_retry_count we
    
  •  do not want to have election triggered on the first failure to
    
  •  connect
    
  • */
  • if (++err_count == master_retry_count)
  • if (++(mi->connects_tried) == mi->retry_count)

Right, I think this makes --master-retry-count=0 mean forever, ok (partly
answering my own question above).

diff --git a/sql/sql_show.cc b/sql/sql_show.cc
index 7f4781c1cd5b9..cdc4209faf2de 100644
--- a/sql/sql_show.cc
+++ b/sql/sql_show.cc
@@ -9201,14 +9201,16 @@ int make_proc_old_format(THD *thd, ST_SCHEMA_TABLE *schema_table)
Constants for columns that are present in
SHOW ALL SLAVES STATUS
that are not in SHOW SLAVE STATUS. Specifically, columns 0 and 1, and

  • everything at and above 56.
  • everything between 56 and 64 inclusive.
    0: Connection_name
    1: Slave_SQL_State
    56: Retried_transactions
  • 64: Master_Slave_time_diff
    */
    #define SLAVE_STATUS_COL_CONNECTION_NAME 0
    #define SLAVE_STATUS_COL_SLAVE_SQL_STATE 1
    #define SLAVE_STATUS_COL_RETRIED_TRANSACTIONS 56
    +#define SLAVE_STATUS_COL_MASTER_SLAVE_TIME_DIFF 64

No, I think instead SLAVE_STATUS_COL_RETRIED_TRANSACTIONS should be changed
to 58 (or 57 until MDEV-25674), see below.

@@ -9220,18 +9222,13 @@ static int make_slave_status_old_format(THD *thd, ST_SCHEMA_TABLE *schema_table)
for (uint i=0; !field_info->end_marker(); field_info++, i++)
{
if (all_slaves ||

  •    // not SLAVE_STATUS_COL_CONNECTION_NAME,
    
  •    // SLAVE_STATUS_COL_SLAVE_SQL_STATE
    
  •    // and less
    
  •    // SLAVE_STATUS_COL_RETRIED_TRANSACTIONS
    
  •    !(i <= SLAVE_STATUS_COL_SLAVE_SQL_STATE ||
    
  •      i >= SLAVE_STATUS_COL_RETRIED_TRANSACTIONS))
    
  •    (i > SLAVE_STATUS_COL_SLAVE_SQL_STATE &&
    
  •     i < SLAVE_STATUS_COL_RETRIED_TRANSACTIONS) ||
    
  •    i > SLAVE_STATUS_COL_MASTER_SLAVE_TIME_DIFF)
    
    {
    LEX_CSTRING field_name= field_info->name();
    Item_field *field= new (thd->mem_root)
    Item_field(thd, context, field_name);
  •  DBUG_ASSERT(all_slaves || (i > SLAVE_STATUS_COL_SLAVE_SQL_STATE &&
    
  •                             i < SLAVE_STATUS_COL_RETRIED_TRANSACTIONS));
     if (!field || add_item_to_list(thd, field))
       return 1;
    
    }
    @@ -10710,6 +10707,8 @@ ST_FIELD_INFO slave_status_info[]=
    Column("Master_last_event_time", Datetime(0), NULLABLE),
    Column("Slave_last_event_time", Datetime(0), NULLABLE),
    Column("Master_Slave_time_diff", SLonglong(10), NULLABLE),
  • Column("Connects_Tried", ULonglong(20), NOT_NULL),
  • Column("Primary_Retry_Count", ULonglong(20), NOT_NULL),
    CEnd()

No, I think this is wrong.

When new entries are added to the SHOW SLAVE STATUS output, they are added
after all the existing columns output from SHOW SLAVE STATUS, but before
the columns output by SHOW ALL SLAVES STATUS. Somewhat tricky, but that is
why there is now an INFORMATION_SCHEMA view that allows to select just the
columns of interest and not depend on column order.

So I think the patch should add the Connects_Tried column just after
Replicate_Rewrite_DB (and MDEV-25674 can add Master_Max_Connect_Tries after
that again). And then you do not need to change the if() condition or the
DBUG_ASSERT above.

And s/Primary_Retry_Count/Master_Max_Connect_Tries/ as noted above.

  • Kristian.

Copy link
Contributor Author

@ParadoxV5 ParadoxV5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your thorough review, Master @knielsen.

Some side tips about GitHub:

  • It uses Markdown, so your plaintext review rendered oddly on the web.
    (Perhaps this plaintext is what a mailing list reply looks like, complete with a line width limit?)
  • Instead of quoting diffs, you can create threads on specific line(s) or a file in a PR’s “Files changed” tab.

For oraganization, I’ve replied in said threads (if you can see them, that is 😅.)

Comment on lines 74 to +78
--echo Slave_Non_Transactional_Groups #
--echo Slave_Transactional_Groups #
--echo Replicate_Rewrite_DB #
--echo Connects_Tried #
--echo Primary_Retry_Count #
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, this is in preparation for adding a MASTER_RETRY_COUNT to the CHANGE
MASTER, MDEV-25674.

But this is not done in this patch. So let's add that output to SHOW SLAVE
STATUS as part of MDEV-25674, not as part of this change. SHOW SLAVE STATUS
is for status about the running slave and for values setable with CHANGE
MASTER, not for showing the values of configuration parameters.

The visibility on the --master-retry-count config together with the other new output was part of MDEV-35304 (and its customer)’s request.

But I agree that the change from master_retry_count to mi->retry_count belongs to MDEV-25674 better.
I included that ::retry_count because, like you said, SHOW REPLICA STATUS reports Master_info entries, not global configs.

I’ve requested assigning MDEV-25674 to myself and, especially re. your criticism, plan to include it in this PR too.
I’ll think about what’s the best sequence to reorganize the commit history to.

Second, when this is added (here or as part of MDEV-25674), use consistent
naming using the primary Master/Slave terminology, do not introduce a mix
with the alternate terminology primary/replica.

So MDEV-18777 is too soon for new columns by themselves? 👌

So "Master_Retry_Count". But I would suggest that when this is added as
part of MDEV-25674, to name it better as "Master_Max_Connect_Tries" or
something like that. To make it clear that 1) it is the configured maximum,
not any current count, and 2) it is the total number of connect tries,
whereas "retry" might suggest it does not include the initial try.

Much agreed – this name is so unfluent.
I was also thinking about deprecating --master-retry-count itself too in favor of MDEV-25674. This could be another argument.

Switching to the ‘primary’ term as part of a “rename” is quite tempting though.
Or maybe throw out the ‘master’ part like Connect_Tried did (for disambiguation) – Max_Connect_Tries? 🤔 clean

Comment on lines +2 to +5
--SOURCE include/have_debug.inc
--SOURCE include/have_debug_sync.inc
--SOURCE include/have_log_bin.inc # The test is agnostic of binlog formats.
--SOURCE include/master-slave.inc
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use lower-case for mysqltest commands --source, --let, --connection,
etc. to be consistent with other test files.

Will do
I’m still relatively new to the job and am building my coding style, especially that there’s no explcit guide on certain MTR conventions like this.

Comment on lines +6 to +13

--LET $rpl_server_number= 1
--LET $status_items= Slave_IO_Running

#--CALL mtr.add_suppression("Slave I/O: Master command COM_REGISTER_SLAVE failed: .*")
#--CALL mtr.add_suppression("Slave I/O: .* failed with error: Lost connection to MySQL server at 'reading initial communication packet'")
#--CALL mtr.add_suppression("Fatal error: The slave I/O thread stops because master and slave have equal MySQL server ids; .*")
#--CALL mtr.add_suppression("Slave I/O thread .* register on master")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these commented out but left in the test case?

I based this file on the test rpl_get_master_version_and_clock, couldn’t get either my local build nor buildbots to run the test, and forgot I had them like this 😅.
Will check again when situations improve

Comment on lines 6820 to +6823
{"master-retry-count", 0,
"The number of tries the slave will make to connect to the master before giving up",
&master_retry_count, &master_retry_count, 0, GET_ULONG,
REQUIRED_ARG, 100000, 0, 0, 0, 0, 0},
REQUIRED_ARG, static_cast<longlong>(master_retry_count), 0, 0, 0, 0, 0},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks strange. These are statically initialized global variables. How
can you initialize one of them with the value of the other? Better use a
#define or static constexpr to initialize them both. Or do you have a
reference to somewhere it says this is valid standard c++?

It was by intuition when I made that change.
But I’ve now looked up: non-static global variables in a (read: the same) translation unit (i.e., .cpp) fall under ordered dynamic initialization.

Though of course, this refactor is unnecessary if our preference is to throw out --master-retry-count altogether in favor of MDEV-25674.

@@ -780,7 +780,7 @@ File_parser_dummy_hook file_parser_dummy_hook;

/* replication parameters, if master_host is not NULL, we are a slave */
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while on this file – Where master_host?

Comment on lines +113 to +116
reg=
{
"Reconnecting after a failed registration on primary",
"register on primary",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep to the standard terminology master/slave, don't introduce a mix with the alternate primary/replica terminology except in optional user aliases.
And best not to change the error message wording unless there is a good reason to do so.

Personal preference was my reason to do so 😅.
I’m not woke well-aware of our stance on the master–slave terminology and, seeïng this PR merges to a new version, didn’t weigh compatibility as much.

The previous (and the first) time I asked about this was #3615 (comment).

Users may occasionally have regexps or something matching against the old wording.

Including our warning suppressions in MTR 📠.

Comment on lines 2586 to 2589
DBUG_RETURN(0);

network_err:
if (master_res)
mysql_free_result(master_res);
DBUG_RETURN(2);

slave_killed_err:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you consider merging the two goto labels into one also? But just a suggestion, up to you.

I considered but made no decisions. I can’t tell if they ever were or will differ.

I prefer exceptions personally.

Comment on lines 4656 to 4660

DBUG_PRINT("info",("Starting reading binary log from master"));
thd->set_command(COM_SLAVE_IO);
while (!io_slave_killed(mi))
if (!io_slave_killed(mi))
{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this changes a nested while() { ... while() { ... } ... } into a
single while() inside an if().

It was hard for me to see exactly what the reason was for this in the patch
/ why it is correct, can you try to explain the reason for it?

  • Both whiles have the same condition of !io_slave_killed(mi).
  • There is no code that runs after the inner loop concludes before the outer’s }.

A killed replica cannot resurrect (right?), especially during the few instructions of the inner exiting to the outer.

connect
*/
if (++err_count == master_retry_count)
if (++(mi->connects_tried) == mi->retry_count)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A question: The documentation for --master-retry-count states:

Number of times a replica will attempt to connect to a primary before giving up.
A value of 0 means the replica will not stop attempting to reconnect.

Are these still true after this patch?

Right, I think this makes --master-retry-count=0 mean forever, ok (partly answering my own question above).

I believe they were not true even before this patch.

C/C++ integers are not limitless (unlike Ruby – I main Ruby 💎).
The counter (new or old) will rollover to 0 when it performs ++ULONG_MAX.

So it’s not forever forever, only practically forever.

Comment on lines 9201 to +9213
Constants for columns that are present in
SHOW ALL SLAVES STATUS
that are not in SHOW SLAVE STATUS. Specifically, columns 0 and 1, and
everything at and above 56.
everything between 56 and 64 inclusive.
0: Connection_name
1: Slave_SQL_State
56: Retried_transactions
64: Master_Slave_time_diff
*/
#define SLAVE_STATUS_COL_CONNECTION_NAME 0
#define SLAVE_STATUS_COL_SLAVE_SQL_STATE 1
#define SLAVE_STATUS_COL_RETRIED_TRANSACTIONS 56
#define SLAVE_STATUS_COL_MASTER_SLAVE_TIME_DIFF 64
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I think this is wrong.

start a thread for myself so I don’t forget
(But yeah, something overcomplicated is goïng on with this “add at the end only” limitation.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

2 participants