MDEV-35304: Add `Connects_Tried` and `Primary_Retry_Count` to SSS #3764

ParadoxV5 · 2025-01-14T20:08:23Z

The Jira issue number for this PR is: MDEV-35304

Description

These new Master_info numbers supplement SHOW ~~SLAVE~~ REPLICA STATUS’s (most-recent) ‘Connecting’ state with statistics on (re)connect attempts:

Connects_Tried: how many retries have been attempted so far
- This was previously a local variable that only counter re-attempts; it’s now meaningful even after the “Connecting” state concludes.
Primary_Retry_Count: out of how many configured
- Formerly known as the global option --master-retry-count, it’s now copied per-replica to pave way for CHANGE ~~PRIMARY~~ MASTER … TO in MDEV-25674.
- I’m pre-emptively naming this one ‘Primary’ re. MDEV-30189 of MDEV-18777.

This PR also includes additional commits that refactor relevant code.
I strongly recommend reviewing commit-by-commit.

What problem is the patch trying to solve?

When the IO thread (re)connect to a primary, no updates are available besides unique errors that cause the failure.

If some output changed that is not visible in a test case, what was it looking like before the change and how it's looking with this patch applied?

From #3764 (comment), appearently:

--master-retry-count=0 is treated as =1 for reconnecting to a previously connected primary.
There may have been an extra sleep in try_to_reconnect().
In full, looks like this happens when the connection drops at a different substep in the setup phase.

Both of these code are now gone as part of a preparation commit.

Do you think this patch might introduce side-effects in other parts of the server?

Component that expects a fixed number of SHOW REPLICA STATUS entries in fixed sections might start seeïng uninvited guests.
On that note, we have many MTR tests that run show slave status directly and in full (i.e., not using source include/show_slave_status.inc).

Release Notes

Added two new SLOW REPLICA STATUS entries for statistics of the most-recent ‘Connecting’ state:

Connects_Tried: How many retries were attempted
Primary_Retry_Count: Out of how many configured (i.e., the --master-retry-count option)

Of course, they’re also accessible from INFORMATION_SCHEMA.SLAVE_STATUS; e.g.:

SELECT
  Connection_name,
  Slave_IO_Running,
  JOIN(Connects_Tried, 'attempt(s) used out of ', Primary_Retry_Count, ' max')
FROM INFORMATION_SCHEMA.SLAVE_STATUS;

Knowledge Base pages that need changing

https://mariadb.com/kb/en/show-replica-status/ – This PR is about adding two new entries.
https://mariadb.com/kb/en/replica-io-thread-states/ – While refactoring, I also changed ‘master’ in the status keys to ‘primary’ re. MDEV-18777. (Too soon?)

How can this PR be tested?

I need some help from test specialists.

I’ve drafted the MTR test rpl.rpl_connects_tried for Connets_Tried’s behavior, but I couldn’t test it locally because my build doesn’t have debug_sync.
I couldn’t refer to the buildbots either – they’re either successful (no search results on Cross Reference either) or they fail with issues I have no clue how they relate to my modifications 😶‍🌫️. (Besides the ones that already fail in main, that is 😶.)

Besides checking that Primary_Retry_Count matches --master-retry-count, we should also test the option itself regarding #3764 (comment).
A draft MTR test is at #3731.

If the changes are not amenable to automated testing, please explain why not and carefully describe how to test manually.

A tester can instead replicate (wat) the MTR test by observing the SHOW REPLICA STATUS and/or INFORMATION_SCHEMA.SLAVE_STATUS of a long-master_connect_retry replication over time.

Basing the PR against the correct MariaDB version

This is a new feature or a refactoring, and the PR is based against the main branch.
~~This is a bug fix, and the PR is based against the earliest maintained branch in which the bug can be reproduced.~~

PR quality check

I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

ParadoxV5

According to the code, the --master-retry-count=0 behavior appearently doesn’t match the doc’s expectation:

A value of 0 means the replica will not stop attempting to reconnect.
⸺ https://mariadb.com/kb/en/mariadbd-options/#-master-retry-count

This may be a bug.
The code is found as early as 10.5.

ParadoxV5 · 2025-01-16T03:59:51Z

sql/slave.cc

-      connect
-    */
-    if (++err_count == master_retry_count)
+    if (++(mi->connects_tried) == mi->retry_count)


There’s no special case for zero since 👀 2002.

With an ==, it would treat 0 as ULONG_MAX + 1.
Fortunately, even if it retries once every millisecond, I don’t see a practical difference between 2^32 attempts (totalling almost 50 days) and “the server is down”.

A question: The documentation for --master-retry-count states:

Number of times a replica will attempt to connect to a primary before giving up.
A value of 0 means the replica will not stop attempting to reconnect.

Are these still true after this patch?

Right, I think this makes --master-retry-count=0 mean forever, ok (partly answering my own question above).

I believe they were not true even before this patch.

C/C++ integers are not limitless (unlike Ruby – I main Ruby 💎).
The counter (new or old) will rollover to 0 when it performs ++ULONG_MAX.

So it’s not forever forever, only practically forever.

ParadoxV5 · 2025-01-16T04:23:52Z

sql/slave.cc

-  if ((*retry_count)++)
-  {
-    if (*retry_count > master_retry_count)
-      return 1;                             // Don't retry forever


This can be hard to comprehend. Step by step:

If *retry_count was 0, skip the block

Else If *retry_count becomes > master_retry_count, return 1

Else, proceed with the block

This could mean that for --master-retry-count=0, the termination condition always matches, but the flow always bypasses it for the first try. That is, 0 is equivalent to 1.

I tried writing an MTR test for --master-retry-count, but my build doesn’t have debug_sync, and appearently none of our buildbots have both it and log_bin…??? (#3731)

ParadoxV5 · 2025-01-16T04:47:47Z

innodb.log_file_size_online and main.log_state fails on main.
P.S.: and appearently unit.conc_connection too.

`try_to_reconnect()` wraps `safe_reconnect()` with logging, but the latter already loops reconnection attempts up to `master_retry_count` times with `mi->connect_retry`-msec sleeps inbetween. This means `try_to_reconnect()` has been counting number of disconnects (since it doesn’t have a loop) while `safe_reconnect()` was counting actual attempts (which may be multiple per disconnect). In practice, this outer counter’s only benefit was to cover the edge case `--master-retry-count=0` for the inner loop… by treating it as 1…

“Lightly” refactor `try_to_reconnect()` `messages` to reduce duplication and improve consistency

When the IO thread (re)connect to a primary, no updates are available besides unique errors that cause the failure. These new `Master_info` numbers supplement SHOW REPLICA STATUS’s (most- recent) ‘Connecting’ state with statistics on (re)connect attempts: * `Connects_Tried`: how many retries have been attempted so far This was previously a local variable that only counter re-attempts; it’s now meaningful even after the “Connecting” state concludes. * `Primary_Retry_Count`: out of how many configured Formerly known as the global option `--master-retry-count`, it’s now copied per-replica to pave way for CHANGE MASTER … TO in MDEV-25674.

* `sql/mysqld.cc`: init `master-retry-count` with `master_retry_count` * `get_master_version_and_clock()` de-duplicate label using fall-through * `io_slave_killed()` & `check_io_slave_killed()`: * reüse the result from the level lower * add distinguishing docs * `try_to_reconnect()`: extract `'` from `if`-`else` * `handle_slave_io()`: Both `while`s have the same condition; looks like the outer `while` can simply be an `if`. * `connect_to_master()`: * assume `mysql_errno()` is not 0 on connection error * utilize 0’s falsiness in the loop * remove kill check around error reporting – other kill checks don’t even use this result * extend docs

These tests dump the whole SSS – results are not guaranteed consistent!

ParadoxV5 · 2025-01-16T06:11:34Z

sql/slave.cc

Room for refactor that I excluded from this PR:

There are a few instances of
if (!…) … else …

Both register_slave_on_master and request_dump provides failure warnings, yet their caller, handle_slave_io, still emits similar warnings on their failure.

Extraneous EOL whitespaces

ParadoxV5 · 2025-01-16T06:21:31Z

I’m done with staring at these consistent but incomprehensible test failures.

knielsen

Hi ParadoxV5,

Here is my review of MDEV-35304:

diff --git a/mysql-test/include/check-testcase.inc b/mysql-test/include/check-testcase.inc
index 25990c1a9e379..4ae0da30d7bba 100644
--- a/mysql-test/include/check-testcase.inc
+++ b/mysql-test/include/check-testcase.inc
@@ -74,10 +74,12 @@ if ($tmp)
--echo Slave_Non_Transactional_Groups #
--echo Slave_Transactional_Groups #
--echo Replicate_Rewrite_DB #

--echo Connects_Tried #

--echo Primary_Retry_Count #

IIUC, this is in preparation for adding a MASTER_RETRY_COUNT to the CHANGE
MASTER, MDEV-25674.

But this is not done in this patch. So let's add that output to SHOW SLAVE
STATUS as part of MDEV-25674, not as part of this change. SHOW SLAVE STATUS
is for status about the running slave and for values setable with CHANGE
MASTER, not for showing the values of configuration parameters.

Second, when this is added (here or as part of MDEV-25674), use consistent
naming using the primary Master/Slave terminology, do not introduce a mix
with the alternate terminology primary/replica.

So "Master_Retry_Count". But I would suggest that when this is added as
part of MDEV-25674, to name it better as "Master_Max_Connect_Tries" or
something like that. To make it clear that 1) it is the configured maximum,
not any current count, and 2) it is the total number of connect tries,
whereas "retry" might suggest it does not include the initial try.

diff --git a/mysql-test/suite/rpl/t/rpl_connects_tried.test b/mysql-test/suite/rpl/t/rpl_connects_tried.test
new file mode 100644
index 0000000000000..b9f5dd3163030
--- /dev/null
+++ b/mysql-test/suite/rpl/t/rpl_connects_tried.test
@@ -0,0 +1,52 @@

+--SOURCE include/have_debug.inc
+--SOURCE include/have_debug_sync.inc
+--SOURCE include/have_log_bin.inc # The test is agnostic of binlog formats.
+--SOURCE include/master-slave.inc

Please use lower-case for mysqltest commands --source, --let, --connection,
etc. to be consistent with other test files.

+--LET $rpl_server_number= 1
+--LET $status_items= Slave_IO_Running
+
+#--CALL mtr.add_suppression("Slave I/O: Master command COM_REGISTER_SLAVE failed: .")
+#--CALL mtr.add_suppression("Slave I/O: . failed with error: Lost connection to MySQL server at 'reading initial communication packet'")
+#--CALL mtr.add_suppression("Fatal error: The slave I/O thread stops because master and slave have equal MySQL server ids; .")
+#--CALL mtr.add_suppression("Slave I/O thread . register on master")

Why are these commented out but left in the test case?

diff --git a/mysql-test/suite/rpl/t/rpl_heartbeat_debug.test b/mysql-test/suite/rpl/t/rpl_heartbeat_debug.test
index e593786655bcb..8b1d7517b2e1e 100644
--- a/mysql-test/suite/rpl/t/rpl_heartbeat_debug.test
+++ b/mysql-test/suite/rpl/t/rpl_heartbeat_debug.test
@@ -5,7 +5,7 @@

--disable_query_log
CALL mtr.add_suppression('SET @master_heartbeat_period to master failed with error');
-CALL mtr.add_suppression('Master command COM_REGISTER_SLAVE failed: failed registering on master, reconnecting to try again');
+CALL mtr.add_suppression('Master command COM_REGISTER_SLAVE failed: Replica I/O thread: register on primary failed, reconnecting to try again');

Should be "Slave I/O thread: ...", see comment below.

diff --git a/sql/mysqld.cc b/sql/mysqld.cc
index 89d9cc646c8ca..f69d6fcfc5ae0 100644
--- a/sql/mysqld.cc
+++ b/sql/mysqld.cc
@@ -780,7 +780,7 @@ File_parser_dummy_hook file_parser_dummy_hook;

/* replication parameters, if master_host is not NULL, we are a slave */
uint report_port= 0;
-ulong master_retry_count=0;
+ulong master_retry_count=100000;

@@ -6820,7 +6820,7 @@ struct my_option my_long_options[]=
{"master-retry-count", 0,
"The number of tries the slave will make to connect to the master before giving up",
&master_retry_count, &master_retry_count, 0, GET_ULONG,

REQUIRED_ARG, 100000, 0, 0, 0, 0, 0},

REQUIRED_ARG, static_cast(master_retry_count), 0, 0, 0, 0, 0},

This looks strange. These are statically initialized global variables. How
can you initialize one of them with the value of the other? Better use a
#define or static constexpr to initialize them both. Or do you have a
reference to somewhere it says this is valid standard c++?

A question: The documentation for --master-retry-count states:

Number of times a replica will attempt to connect to a primary before
giving up.

A value of 0 means the replica will not stop attempting to reconnect.

Are these still true after this patch?

diff --git a/sql/rpl_mi.h b/sql/rpl_mi.h
index 37a5fc30099c9..de8a422ae9498 100644
--- a/sql/rpl_mi.h
+++ b/sql/rpl_mi.h
@@ -253,7 +253,12 @@ class Master_info : public Slave_reporting_capability

/** replica-specific @ref master_retry_count */

s/replica/slave/

ulong retry_count;

Better name, for example max_connect_tries. This is easily confused with a
count of the current count of connection attempts that failed.

diff --git a/sql/slave.cc b/sql/slave.cc
index e16834a387da3..1d748789a245a 100644
--- a/sql/slave.cc
+++ b/sql/slave.cc
@@ -102,57 +102,33 @@ LEX_CSTRING default_master_connection_name= { (char*) "", 0 };

reg=

{

"Reconnecting after a failed registration on primary",

"register on primary",

Keep to the standard terminology master/slave, don't introduce a mix with
the alternate primary/replica terminology except in optional user aliases.

And best not to change the error message wording unless there is a good
reason to do so. Users may occasionally have regexps or something matching
against the old wording.

"Reconnecting after a failed primary event read",

"primary event read",

s/primary/master/

@@ -2609,10 +2586,6 @@ when it try to get the value of TIME_ZONE global variable from master.";
DBUG_RETURN(0);

network_err:

if (master_res)

mysql_free_result(master_res);

DBUG_RETURN(2);

slave_killed_err:

Agree, thanks for spotting the redundancy.
Did you consider merging the two goto labels into one also? But just a
suggestion, up to you.

@@ -4445,36 +4409,35 @@ static int try_to_reconnect(THD *thd, MYSQL *mysql, Master_info *mi,

my_snprintf(buf, sizeof(buf), messages[SLAVE_RECON_MSG_FAILED],
           IO_RPL_LOG_NAME, mi->master_log_pos,
my_snprintf(buf, sizeof(buf), "Replica I/O thread: %s failed, "

s/Replica/Slave/

@@ -4695,7 +4656,7 @@ pthread_handler_t handle_slave_io(void *arg)

DBUG_PRINT("info",("Starting reading binary log from master"));
thd->set_command(COM_SLAVE_IO);

while (!io_slave_killed(mi))

if (!io_slave_killed(mi))
{

So this changes a nested while() { ... while() { ... } ... } into a
single while() inside an if().

It was hard for me to see exactly what the reason was for this in the patch
/ why it is correct, can you try to explain the reason for it?

@@ -7038,16 +6998,10 @@ static int connect_to_master(THD* thd, MYSQL* mysql, Master_info* mi,
" - retry-time: %d maximum-retries: %lu message: %s",
(reconnect ? "reconnecting" : "connecting"),
mi->user, mi->host, mi->port,
            mi->connect_retry, master_retry_count,
            mi->connect_retry, mi->retry_count,
            mysql_error(mysql));
}
/*
 By default we try forever. The reason is that failure will trigger
 master election, so if the user did not set master_retry_count we
 do not want to have election triggered on the first failure to
 connect
*/

if (++err_count == master_retry_count)
if (++(mi->connects_tried) == mi->retry_count)

Right, I think this makes --master-retry-count=0 mean forever, ok (partly
answering my own question above).

diff --git a/sql/sql_show.cc b/sql/sql_show.cc
index 7f4781c1cd5b9..cdc4209faf2de 100644
--- a/sql/sql_show.cc
+++ b/sql/sql_show.cc
@@ -9201,14 +9201,16 @@ int make_proc_old_format(THD *thd, ST_SCHEMA_TABLE *schema_table)
Constants for columns that are present in
SHOW ALL SLAVES STATUS
that are not in SHOW SLAVE STATUS. Specifically, columns 0 and 1, and

everything at and above 56.

everything between 56 and 64 inclusive.
0: Connection_name
1: Slave_SQL_State
56: Retried_transactions

64: Master_Slave_time_diff
*/
#define SLAVE_STATUS_COL_CONNECTION_NAME 0
#define SLAVE_STATUS_COL_SLAVE_SQL_STATE 1
#define SLAVE_STATUS_COL_RETRIED_TRANSACTIONS 56
+#define SLAVE_STATUS_COL_MASTER_SLAVE_TIME_DIFF 64

No, I think instead SLAVE_STATUS_COL_RETRIED_TRANSACTIONS should be changed
to 58 (or 57 until MDEV-25674), see below.

@@ -9220,18 +9222,13 @@ static int make_slave_status_old_format(THD *thd, ST_SCHEMA_TABLE *schema_table)
for (uint i=0; !field_info->end_marker(); field_info++, i++)
{
if (all_slaves ||
   // not SLAVE_STATUS_COL_CONNECTION_NAME,
   // SLAVE_STATUS_COL_SLAVE_SQL_STATE
   // and less
   // SLAVE_STATUS_COL_RETRIED_TRANSACTIONS
   !(i <= SLAVE_STATUS_COL_SLAVE_SQL_STATE ||
     i >= SLAVE_STATUS_COL_RETRIED_TRANSACTIONS))
   (i > SLAVE_STATUS_COL_SLAVE_SQL_STATE &&
    i < SLAVE_STATUS_COL_RETRIED_TRANSACTIONS) ||
   i > SLAVE_STATUS_COL_MASTER_SLAVE_TIME_DIFF)
{
LEX_CSTRING field_name= field_info->name();
Item_field *field= new (thd->mem_root)
Item_field(thd, context, field_name);
 DBUG_ASSERT(all_slaves || (i > SLAVE_STATUS_COL_SLAVE_SQL_STATE &&
                            i < SLAVE_STATUS_COL_RETRIED_TRANSACTIONS));
 if (!field || add_item_to_list(thd, field))
   return 1;
}
@@ -10710,6 +10707,8 @@ ST_FIELD_INFO slave_status_info[]=
Column("Master_last_event_time", Datetime(0), NULLABLE),
Column("Slave_last_event_time", Datetime(0), NULLABLE),
Column("Master_Slave_time_diff", SLonglong(10), NULLABLE),
Column("Connects_Tried", ULonglong(20), NOT_NULL),

Column("Primary_Retry_Count", ULonglong(20), NOT_NULL),
CEnd()

No, I think this is wrong.

When new entries are added to the SHOW SLAVE STATUS output, they are added
after all the existing columns output from SHOW SLAVE STATUS, but before
the columns output by SHOW ALL SLAVES STATUS. Somewhat tricky, but that is
why there is now an INFORMATION_SCHEMA view that allows to select just the
columns of interest and not depend on column order.

So I think the patch should add the Connects_Tried column just after
Replicate_Rewrite_DB (and MDEV-25674 can add Master_Max_Connect_Tries after
that again). And then you do not need to change the if() condition or the
DBUG_ASSERT above.

And s/Primary_Retry_Count/Master_Max_Connect_Tries/ as noted above.

Kristian.

ParadoxV5

Thank you for your thorough review, Master @knielsen.

Some side tips about GitHub:

It uses Markdown, so your plaintext review rendered oddly on the web.
(Perhaps this plaintext is what a mailing list reply looks like, complete with a line width limit?)
Instead of quoting diffs, you can create threads on specific line(s) or a file in a PR’s “Files changed” tab.

For oraganization, I’ve replied in said threads (if you can see them, that is 😅.)

ParadoxV5 · 2025-01-24T20:55:18Z

mysql-test/include/check-testcase.inc

  --echo Slave_Non_Transactional_Groups	#
  --echo Slave_Transactional_Groups	#
  --echo Replicate_Rewrite_DB	#
+  --echo Connects_Tried	#
+  --echo Primary_Retry_Count	#


IIUC, this is in preparation for adding a MASTER_RETRY_COUNT to the CHANGE
MASTER, MDEV-25674.

But this is not done in this patch. So let's add that output to SHOW SLAVE
STATUS as part of MDEV-25674, not as part of this change. SHOW SLAVE STATUS
is for status about the running slave and for values setable with CHANGE
MASTER, not for showing the values of configuration parameters.

The visibility on the --master-retry-count config together with the other new output was part of MDEV-35304 (and its customer)’s request.

But I agree that the change from master_retry_count to mi->retry_count belongs to MDEV-25674 better.
I included that ::retry_count because, like you said, SHOW REPLICA STATUS reports Master_info entries, not global configs.

I’ve requested assigning MDEV-25674 to myself and, especially re. your criticism, plan to include it in this PR too.
I’ll think about what’s the best sequence to reorganize the commit history to.

Second, when this is added (here or as part of MDEV-25674), use consistent
naming using the primary Master/Slave terminology, do not introduce a mix
with the alternate terminology primary/replica.

So MDEV-18777 is too soon for new columns by themselves? 👌

So "Master_Retry_Count". But I would suggest that when this is added as
part of MDEV-25674, to name it better as "Master_Max_Connect_Tries" or
something like that. To make it clear that 1) it is the configured maximum,
not any current count, and 2) it is the total number of connect tries,
whereas "retry" might suggest it does not include the initial try.

Much agreed – this name is so unfluent.
I was also thinking about deprecating --master-retry-count itself too in favor of MDEV-25674. This could be another argument.

Switching to the ‘primary’ term as part of a “rename” is quite tempting though.
Or maybe throw out the ‘master’ part like Connect_Tried did (for disambiguation) – Max_Connect_Tries? 🤔 clean

ParadoxV5 · 2025-01-24T21:13:40Z

mysql-test/suite/rpl/t/rpl_connects_tried.test

+--SOURCE include/have_debug.inc
+--SOURCE include/have_debug_sync.inc
+--SOURCE include/have_log_bin.inc # The test is agnostic of binlog formats.
+--SOURCE include/master-slave.inc


Please use lower-case for mysqltest commands --source, --let, --connection,
etc. to be consistent with other test files.

Will do
I’m still relatively new to the job and am building my coding style, especially that there’s no explcit guide on certain MTR conventions like this.

ParadoxV5 · 2025-01-24T21:17:21Z

mysql-test/suite/rpl/t/rpl_connects_tried.test

+
+--LET $rpl_server_number= 1
+--LET $status_items= Slave_IO_Running
+
+#--CALL mtr.add_suppression("Slave I/O: Master command COM_REGISTER_SLAVE failed: .*")
+#--CALL mtr.add_suppression("Slave I/O: .* failed with error: Lost connection to MySQL server at 'reading initial communication packet'")
+#--CALL mtr.add_suppression("Fatal error: The slave I/O thread stops because master and slave have equal MySQL server ids; .*")
+#--CALL mtr.add_suppression("Slave I/O thread .* register on master")


Why are these commented out but left in the test case?

I based this file on the test rpl_get_master_version_and_clock, couldn’t get either my local build nor buildbots to run the test, and forgot I had them like this 😅.
Will check again when situations improve

ParadoxV5 · 2025-01-24T21:59:11Z

sql/mysqld.cc

  {"master-retry-count", 0,
   "The number of tries the slave will make to connect to the master before giving up",
   &master_retry_count, &master_retry_count, 0, GET_ULONG,
-   REQUIRED_ARG, 100000, 0, 0, 0, 0, 0},
+   REQUIRED_ARG, static_cast<longlong>(master_retry_count), 0, 0, 0, 0, 0},


This looks strange. These are statically initialized global variables. How
can you initialize one of them with the value of the other? Better use a
#define or static constexpr to initialize them both. Or do you have a
reference to somewhere it says this is valid standard c++?

It was by intuition when I made that change.
But I’ve now looked up: non-static global variables in a (read: the same) translation unit (i.e., .cpp) fall under ordered dynamic initialization.

Though of course, this refactor is unnecessary if our preference is to throw out --master-retry-count altogether in favor of MDEV-25674.

ParadoxV5 · 2025-01-24T22:00:41Z

sql/mysqld.cc

@@ -780,7 +780,7 @@ File_parser_dummy_hook file_parser_dummy_hook;

 /* replication parameters, if master_host is not NULL, we are a slave */


while on this file – Where master_host?

ParadoxV5 · 2025-01-24T22:18:14Z

sql/slave.cc

+  reg=
+  {
+    "Reconnecting after a failed registration on primary",
+    "register on primary",


Keep to the standard terminology master/slave, don't introduce a mix with the alternate primary/replica terminology except in optional user aliases.
And best not to change the error message wording unless there is a good reason to do so.

Personal preference was my reason to do so 😅.
I’m not ~~woke~~ well-aware of our stance on the master–slave terminology and, seeïng this PR merges to a new version, didn’t weigh compatibility as much.

The previous (and the first) time I asked about this was #3615 (comment).

Users may occasionally have regexps or something matching against the old wording.

Including our warning suppressions in MTR 📠.

ParadoxV5 · 2025-01-24T22:26:59Z

sql/slave.cc

  DBUG_RETURN(0);

 network_err:
-  if (master_res)
-    mysql_free_result(master_res);
-  DBUG_RETURN(2);
-
 slave_killed_err:


Did you consider merging the two goto labels into one also? But just a suggestion, up to you.

I considered but made no decisions. I can’t tell if they ever were or will differ.

~~I prefer exceptions personally.~~

ParadoxV5 · 2025-01-24T22:36:06Z

sql/slave.cc


  DBUG_PRINT("info",("Starting reading binary log from master"));
  thd->set_command(COM_SLAVE_IO);
-  while (!io_slave_killed(mi))
+  if (!io_slave_killed(mi))
  {


So this changes a nested while() { ... while() { ... } ... } into a
single while() inside an if().

It was hard for me to see exactly what the reason was for this in the patch
/ why it is correct, can you try to explain the reason for it?

Both whiles have the same condition of !io_slave_killed(mi).

There is no code that runs after the inner loop concludes before the outer’s }.

A killed replica cannot resurrect (right?), especially during the few instructions of the inner exiting to the outer.

ParadoxV5 · 2025-01-24T22:44:37Z

sql/slave.cc

-      connect
-    */
-    if (++err_count == master_retry_count)
+    if (++(mi->connects_tried) == mi->retry_count)


A question: The documentation for --master-retry-count states:

Number of times a replica will attempt to connect to a primary before giving up.
A value of 0 means the replica will not stop attempting to reconnect.

Are these still true after this patch?

Right, I think this makes --master-retry-count=0 mean forever, ok (partly answering my own question above).

I believe they were not true even before this patch.

C/C++ integers are not limitless (unlike Ruby – I main Ruby 💎).
The counter (new or old) will rollover to 0 when it performs ++ULONG_MAX.

So it’s not forever forever, only practically forever.

ParadoxV5 · 2025-01-24T23:06:55Z

sql/sql_show.cc

  Constants for columns that are present in
  SHOW ALL SLAVES STATUS
  that are not in SHOW SLAVE STATUS. Specifically, columns 0 and 1, and
-  everything at and above 56.
+  everything between 56 and 64 inclusive.
    0: Connection_name
    1: Slave_SQL_State
    56: Retried_transactions
+    64: Master_Slave_time_diff
 */
 #define SLAVE_STATUS_COL_CONNECTION_NAME 0
 #define SLAVE_STATUS_COL_SLAVE_SQL_STATE 1
 #define SLAVE_STATUS_COL_RETRIED_TRANSACTIONS 56
+#define SLAVE_STATUS_COL_MASTER_SLAVE_TIME_DIFF 64


No, I think this is wrong.

start a thread for myself so I don’t forget
(But yeah, something overcomplicated is goïng on with this “add at the end only” limitation.)

ParadoxV5 requested a review from andrelkin January 14, 2025 20:08

ParadoxV5 changed the title ~~Mdev 35304~~ MDEV-35304: Add Connects_Tried and Primary_Retry_Count to SSS Jan 14, 2025

ParadoxV5 force-pushed the mdev-35304 branch 6 times, most recently from 87112f2 to 43da219 Compare January 16, 2025 03:51

ParadoxV5 commented Jan 16, 2025

View reviewed changes

ParadoxV5 added 5 commits January 15, 2025 22:00

slave.cc: Refactor reconnect_messages

a6e31e1

“Lightly” refactor `try_to_reconnect()` `messages` to reduce duplication and improve consistency

update .results of bad tests

7353053

These tests dump the whole SSS – results are not guaranteed consistent!

ParadoxV5 force-pushed the mdev-35304 branch from 43da219 to 7353053 Compare January 16, 2025 05:16

ParadoxV5 requested a review from knielsen January 16, 2025 06:03

ParadoxV5 commented Jan 16, 2025

View reviewed changes

ParadoxV5 marked this pull request as ready for review January 16, 2025 06:21

ParadoxV5 added the MariaDB Corporation label Jan 17, 2025

knielsen requested changes Jan 24, 2025

View reviewed changes

ParadoxV5 commented Jan 24, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MDEV-35304: Add `Connects_Tried` and `Primary_Retry_Count` to SSS #3764

MDEV-35304: Add `Connects_Tried` and `Primary_Retry_Count` to SSS #3764

ParadoxV5 commented Jan 14, 2025 •

edited

Loading

ParadoxV5 left a comment

ParadoxV5 Jan 16, 2025 •

edited

Loading

ParadoxV5 Jan 24, 2025

ParadoxV5 Jan 16, 2025

ParadoxV5 commented Jan 16, 2025 •

edited

Loading

ParadoxV5 Jan 16, 2025

ParadoxV5 commented Jan 16, 2025

knielsen left a comment

ParadoxV5 left a comment

ParadoxV5 Jan 24, 2025

ParadoxV5 Jan 24, 2025

ParadoxV5 Jan 24, 2025

ParadoxV5 Jan 24, 2025

ParadoxV5 Jan 24, 2025

ParadoxV5 Jan 24, 2025

ParadoxV5 Jan 24, 2025

ParadoxV5 Jan 24, 2025

ParadoxV5 Jan 24, 2025

ParadoxV5 Jan 24, 2025

		@@ -780,7 +780,7 @@ File_parser_dummy_hook file_parser_dummy_hook;

		/* replication parameters, if master_host is not NULL, we are a slave */

MDEV-35304: Add Connects_Tried and Primary_Retry_Count to SSS #3764

Are you sure you want to change the base?

MDEV-35304: Add Connects_Tried and Primary_Retry_Count to SSS #3764

Conversation

ParadoxV5 commented Jan 14, 2025 • edited Loading

Description

What problem is the patch trying to solve?

If some output changed that is not visible in a test case, what was it looking like before the change and how it's looking with this patch applied?

Do you think this patch might introduce side-effects in other parts of the server?

Release Notes

Knowledge Base pages that need changing

How can this PR be tested?

If the changes are not amenable to automated testing, please explain why not and carefully describe how to test manually.

Basing the PR against the correct MariaDB version

PR quality check

ParadoxV5 left a comment

Choose a reason for hiding this comment

ParadoxV5 Jan 16, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ParadoxV5 commented Jan 16, 2025 • edited Loading

Choose a reason for hiding this comment

ParadoxV5 commented Jan 16, 2025

knielsen left a comment

Choose a reason for hiding this comment

ParadoxV5 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MDEV-35304: Add `Connects_Tried` and `Primary_Retry_Count` to SSS #3764

MDEV-35304: Add `Connects_Tried` and `Primary_Retry_Count` to SSS #3764

ParadoxV5 commented Jan 14, 2025 •

edited

Loading

ParadoxV5 Jan 16, 2025 •

edited

Loading

ParadoxV5 commented Jan 16, 2025 •

edited

Loading