-
-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MDEV-35304: Add Connects_Tried
and Primary_Retry_Count
to SSS
#3764
base: main
Are you sure you want to change the base?
Conversation
Connects_Tried
and Primary_Retry_Count
to SSS
87112f2
to
43da219
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the code, the --master-retry-count=0
behavior appearently doesn’t match the doc’s expectation:
A value of 0 means the replica will not stop attempting to reconnect.
⸺ https://mariadb.com/kb/en/mariadbd-options/#-master-retry-count
This may be a bug.
The code is found as early as 10.5.
connect | ||
*/ | ||
if (++err_count == master_retry_count) | ||
if (++(mi->connects_tried) == mi->retry_count) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There’s no special case for zero since 👀 2002.
With an ==
, it would treat 0 as ULONG_MAX
+ 1.
Fortunately, even if it retries once every millisecond, I don’t see a practical difference between 2^32 attempts (totalling almost 50 days) and “the server is down”.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A question: The documentation for --master-retry-count states:
Number of times a replica will attempt to connect to a primary before giving up.
A value of 0 means the replica will not stop attempting to reconnect.Are these still true after this patch?
Right, I think this makes --master-retry-count=0 mean forever, ok (partly answering my own question above).
I believe they were not true even before this patch.
C/C++ integers are not limitless (unlike Ruby – I main Ruby 💎).
The counter (new or old) will rollover to 0
when it performs ++ULONG_MAX
.
So it’s not forever forever, only practically forever.
if ((*retry_count)++) | ||
{ | ||
if (*retry_count > master_retry_count) | ||
return 1; // Don't retry forever |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be hard to comprehend. Step by step:
- If
*retry_count
was 0, skip the block - Else If
*retry_count
becomes> master_retry_count
,return 1
- Else, proceed with the block
This could mean that for --master-retry-count=0
, the termination condition always matches, but the flow always bypasses it for the first try. That is, 0
is equivalent to 1
.
I tried writing an MTR test for --master-retry-count
, but my build doesn’t have debug_sync
, and appearently none of our buildbots have both it and log_bin
…??? (#3731)
|
`try_to_reconnect()` wraps `safe_reconnect()` with logging, but the latter already loops reconnection attempts up to `master_retry_count` times with `mi->connect_retry`-msec sleeps inbetween. This means `try_to_reconnect()` has been counting number of disconnects (since it doesn’t have a loop) while `safe_reconnect()` was counting actual attempts (which may be multiple per disconnect). In practice, this outer counter’s only benefit was to cover the edge case `--master-retry-count=0` for the inner loop… by treating it as 1…
“Lightly” refactor `try_to_reconnect()` `messages` to reduce duplication and improve consistency
When the IO thread (re)connect to a primary, no updates are available besides unique errors that cause the failure. These new `Master_info` numbers supplement SHOW REPLICA STATUS’s (most- recent) ‘Connecting’ state with statistics on (re)connect attempts: * `Connects_Tried`: how many retries have been attempted so far This was previously a local variable that only counter re-attempts; it’s now meaningful even after the “Connecting” state concludes. * `Primary_Retry_Count`: out of how many configured Formerly known as the global option `--master-retry-count`, it’s now copied per-replica to pave way for CHANGE MASTER … TO in MDEV-25674.
* `sql/mysqld.cc`: init `master-retry-count` with `master_retry_count` * `get_master_version_and_clock()` de-duplicate label using fall-through * `io_slave_killed()` & `check_io_slave_killed()`: * reüse the result from the level lower * add distinguishing docs * `try_to_reconnect()`: extract `'` from `if`-`else` * `handle_slave_io()`: Both `while`s have the same condition; looks like the outer `while` can simply be an `if`. * `connect_to_master()`: * assume `mysql_errno()` is not 0 on connection error * utilize 0’s falsiness in the loop * remove kill check around error reporting – other kill checks don’t even use this result * extend docs
These tests dump the whole SSS – results are not guaranteed consistent!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Room for refactor that I excluded from this PR:
- There are a few instances of
if (!…) … else …
- Both
register_slave_on_master
andrequest_dump
provides failure warnings, yet their caller,handle_slave_io
, still emits similar warnings on their failure. - Extraneous EOL whitespaces
I’m done with staring at these consistent but incomprehensible test failures. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi ParadoxV5,
Here is my review of MDEV-35304:
diff --git a/mysql-test/include/check-testcase.inc b/mysql-test/include/check-testcase.inc
index 25990c1a9e379..4ae0da30d7bba 100644
--- a/mysql-test/include/check-testcase.inc
+++ b/mysql-test/include/check-testcase.inc
@@ -74,10 +74,12 @@ if ($tmp)
--echo Slave_Non_Transactional_Groups #
--echo Slave_Transactional_Groups #
--echo Replicate_Rewrite_DB #
- --echo Connects_Tried #
- --echo Primary_Retry_Count #
IIUC, this is in preparation for adding a MASTER_RETRY_COUNT to the CHANGE
MASTER, MDEV-25674.
But this is not done in this patch. So let's add that output to SHOW SLAVE
STATUS as part of MDEV-25674, not as part of this change. SHOW SLAVE STATUS
is for status about the running slave and for values setable with CHANGE
MASTER, not for showing the values of configuration parameters.
Second, when this is added (here or as part of MDEV-25674), use consistent
naming using the primary Master/Slave terminology, do not introduce a mix
with the alternate terminology primary/replica.
So "Master_Retry_Count". But I would suggest that when this is added as
part of MDEV-25674, to name it better as "Master_Max_Connect_Tries" or
something like that. To make it clear that 1) it is the configured maximum,
not any current count, and 2) it is the total number of connect tries,
whereas "retry" might suggest it does not include the initial try.
diff --git a/mysql-test/suite/rpl/t/rpl_connects_tried.test b/mysql-test/suite/rpl/t/rpl_connects_tried.test
new file mode 100644
index 0000000000000..b9f5dd3163030
--- /dev/null
+++ b/mysql-test/suite/rpl/t/rpl_connects_tried.test
@@ -0,0 +1,52 @@
+--SOURCE include/have_debug.inc
+--SOURCE include/have_debug_sync.inc
+--SOURCE include/have_log_bin.inc # The test is agnostic of binlog formats.
+--SOURCE include/master-slave.inc
Please use lower-case for mysqltest commands --source, --let, --connection,
etc. to be consistent with other test files.
+--LET $rpl_server_number= 1
+--LET $status_items= Slave_IO_Running
+
+#--CALL mtr.add_suppression("Slave I/O: Master command COM_REGISTER_SLAVE failed: .")
+#--CALL mtr.add_suppression("Slave I/O: . failed with error: Lost connection to MySQL server at 'reading initial communication packet'")
+#--CALL mtr.add_suppression("Fatal error: The slave I/O thread stops because master and slave have equal MySQL server ids; .")
+#--CALL mtr.add_suppression("Slave I/O thread . register on master")
Why are these commented out but left in the test case?
diff --git a/mysql-test/suite/rpl/t/rpl_heartbeat_debug.test b/mysql-test/suite/rpl/t/rpl_heartbeat_debug.test
index e593786655bcb..8b1d7517b2e1e 100644
--- a/mysql-test/suite/rpl/t/rpl_heartbeat_debug.test
+++ b/mysql-test/suite/rpl/t/rpl_heartbeat_debug.test
@@ -5,7 +5,7 @@--disable_query_log
CALL mtr.add_suppression('SET @master_heartbeat_period to master failed with error');
-CALL mtr.add_suppression('Master command COM_REGISTER_SLAVE failed: failed registering on master, reconnecting to try again');
+CALL mtr.add_suppression('Master command COM_REGISTER_SLAVE failed: Replica I/O thread: register on primary failed, reconnecting to try again');
Should be "Slave I/O thread: ...", see comment below.
diff --git a/sql/mysqld.cc b/sql/mysqld.cc
index 89d9cc646c8ca..f69d6fcfc5ae0 100644
--- a/sql/mysqld.cc
+++ b/sql/mysqld.cc
@@ -780,7 +780,7 @@ File_parser_dummy_hook file_parser_dummy_hook;/* replication parameters, if master_host is not NULL, we are a slave */
uint report_port= 0;
-ulong master_retry_count=0;
+ulong master_retry_count=100000;
@@ -6820,7 +6820,7 @@ struct my_option my_long_options[]=
{"master-retry-count", 0,
"The number of tries the slave will make to connect to the master before giving up",
&master_retry_count, &master_retry_count, 0, GET_ULONG,
- REQUIRED_ARG, 100000, 0, 0, 0, 0, 0},
- REQUIRED_ARG, static_cast(master_retry_count), 0, 0, 0, 0, 0},
This looks strange. These are statically initialized global variables. How
can you initialize one of them with the value of the other? Better use a
#define or static constexpr to initialize them both. Or do you have a
reference to somewhere it says this is valid standard c++?
A question: The documentation for --master-retry-count states:
Number of times a replica will attempt to connect to a primary before
giving up.
A value of 0 means the replica will not stop attempting to reconnect.
Are these still true after this patch?
diff --git a/sql/rpl_mi.h b/sql/rpl_mi.h
index 37a5fc30099c9..de8a422ae9498 100644
--- a/sql/rpl_mi.h
+++ b/sql/rpl_mi.h
@@ -253,7 +253,12 @@ class Master_info : public Slave_reporting_capability
- /** replica-specific @ref master_retry_count */
s/replica/slave/
- ulong retry_count;
Better name, for example max_connect_tries. This is easily confused with a
count of the current count of connection attempts that failed.
diff --git a/sql/slave.cc b/sql/slave.cc
index e16834a387da3..1d748789a245a 100644
--- a/sql/slave.cc
+++ b/sql/slave.cc
@@ -102,57 +102,33 @@ LEX_CSTRING default_master_connection_name= { (char*) "", 0 };
- reg=
- {
- "Reconnecting after a failed registration on primary",
- "register on primary",
Keep to the standard terminology master/slave, don't introduce a mix with
the alternate primary/replica terminology except in optional user aliases.
And best not to change the error message wording unless there is a good
reason to do so. Users may occasionally have regexps or something matching
against the old wording.
- "Reconnecting after a failed primary event read",
- "primary event read",
s/primary/master/
@@ -2609,10 +2586,6 @@ when it try to get the value of TIME_ZONE global variable from master.";
DBUG_RETURN(0);network_err:
- if (master_res)
- mysql_free_result(master_res);
- DBUG_RETURN(2);
slave_killed_err:
Agree, thanks for spotting the redundancy.
Did you consider merging the two goto labels into one also? But just a
suggestion, up to you.
@@ -4445,36 +4409,35 @@ static int try_to_reconnect(THD *thd, MYSQL *mysql, Master_info *mi,
- my_snprintf(buf, sizeof(buf), messages[SLAVE_RECON_MSG_FAILED],
IO_RPL_LOG_NAME, mi->master_log_pos,
- my_snprintf(buf, sizeof(buf), "Replica I/O thread: %s failed, "
s/Replica/Slave/
@@ -4695,7 +4656,7 @@ pthread_handler_t handle_slave_io(void *arg)
DBUG_PRINT("info",("Starting reading binary log from master"));
thd->set_command(COM_SLAVE_IO);
- while (!io_slave_killed(mi))
- if (!io_slave_killed(mi))
{
So this changes a nested while() { ... while() { ... } ... } into a
single while() inside an if().
It was hard for me to see exactly what the reason was for this in the patch
/ why it is correct, can you try to explain the reason for it?
@@ -7038,16 +6998,10 @@ static int connect_to_master(THD* thd, MYSQL* mysql, Master_info* mi,
" - retry-time: %d maximum-retries: %lu message: %s",
(reconnect ? "reconnecting" : "connecting"),
mi->user, mi->host, mi->port,
mi->connect_retry, master_retry_count,
}mi->connect_retry, mi->retry_count, mysql_error(mysql));
- /*
By default we try forever. The reason is that failure will trigger
master election, so if the user did not set master_retry_count we
do not want to have election triggered on the first failure to
connect
- */
- if (++err_count == master_retry_count)
- if (++(mi->connects_tried) == mi->retry_count)
Right, I think this makes --master-retry-count=0 mean forever, ok (partly
answering my own question above).
diff --git a/sql/sql_show.cc b/sql/sql_show.cc
index 7f4781c1cd5b9..cdc4209faf2de 100644
--- a/sql/sql_show.cc
+++ b/sql/sql_show.cc
@@ -9201,14 +9201,16 @@ int make_proc_old_format(THD *thd, ST_SCHEMA_TABLE *schema_table)
Constants for columns that are present in
SHOW ALL SLAVES STATUS
that are not in SHOW SLAVE STATUS. Specifically, columns 0 and 1, and
- everything at and above 56.
- everything between 56 and 64 inclusive.
0: Connection_name
1: Slave_SQL_State
56: Retried_transactions- 64: Master_Slave_time_diff
*/
#define SLAVE_STATUS_COL_CONNECTION_NAME 0
#define SLAVE_STATUS_COL_SLAVE_SQL_STATE 1
#define SLAVE_STATUS_COL_RETRIED_TRANSACTIONS 56
+#define SLAVE_STATUS_COL_MASTER_SLAVE_TIME_DIFF 64
No, I think instead SLAVE_STATUS_COL_RETRIED_TRANSACTIONS should be changed
to 58 (or 57 until MDEV-25674), see below.
@@ -9220,18 +9222,13 @@ static int make_slave_status_old_format(THD *thd, ST_SCHEMA_TABLE *schema_table)
for (uint i=0; !field_info->end_marker(); field_info++, i++)
{
if (all_slaves ||
// not SLAVE_STATUS_COL_CONNECTION_NAME,
// SLAVE_STATUS_COL_SLAVE_SQL_STATE
// and less
// SLAVE_STATUS_COL_RETRIED_TRANSACTIONS
!(i <= SLAVE_STATUS_COL_SLAVE_SQL_STATE ||
i >= SLAVE_STATUS_COL_RETRIED_TRANSACTIONS))
(i > SLAVE_STATUS_COL_SLAVE_SQL_STATE &&
i < SLAVE_STATUS_COL_RETRIED_TRANSACTIONS) ||
{i > SLAVE_STATUS_COL_MASTER_SLAVE_TIME_DIFF)
LEX_CSTRING field_name= field_info->name();
Item_field *field= new (thd->mem_root)
Item_field(thd, context, field_name);
DBUG_ASSERT(all_slaves || (i > SLAVE_STATUS_COL_SLAVE_SQL_STATE &&
}i < SLAVE_STATUS_COL_RETRIED_TRANSACTIONS)); if (!field || add_item_to_list(thd, field)) return 1;
@@ -10710,6 +10707,8 @@ ST_FIELD_INFO slave_status_info[]=
Column("Master_last_event_time", Datetime(0), NULLABLE),
Column("Slave_last_event_time", Datetime(0), NULLABLE),
Column("Master_Slave_time_diff", SLonglong(10), NULLABLE),
- Column("Connects_Tried", ULonglong(20), NOT_NULL),
- Column("Primary_Retry_Count", ULonglong(20), NOT_NULL),
CEnd()
No, I think this is wrong.
When new entries are added to the SHOW SLAVE STATUS output, they are added
after all the existing columns output from SHOW SLAVE STATUS, but before
the columns output by SHOW ALL SLAVES STATUS. Somewhat tricky, but that is
why there is now an INFORMATION_SCHEMA view that allows to select just the
columns of interest and not depend on column order.
So I think the patch should add the Connects_Tried column just after
Replicate_Rewrite_DB (and MDEV-25674 can add Master_Max_Connect_Tries after
that again). And then you do not need to change the if() condition or the
DBUG_ASSERT above.
And s/Primary_Retry_Count/Master_Max_Connect_Tries/ as noted above.
- Kristian.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your thorough review, Master @knielsen.
Some side tips about GitHub:
- It uses Markdown, so your plaintext review rendered oddly on the web.
(Perhaps this plaintext is what a mailing list reply looks like, complete with a line width limit?) - Instead of quoting diffs, you can create threads on specific line(s) or a file in a PR’s “Files changed” tab.
For oraganization, I’ve replied in said threads (if you can see them, that is 😅.)
--echo Slave_Non_Transactional_Groups # | ||
--echo Slave_Transactional_Groups # | ||
--echo Replicate_Rewrite_DB # | ||
--echo Connects_Tried # | ||
--echo Primary_Retry_Count # |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, this is in preparation for adding a MASTER_RETRY_COUNT to the CHANGE
MASTER, MDEV-25674.But this is not done in this patch. So let's add that output to SHOW SLAVE
STATUS as part of MDEV-25674, not as part of this change. SHOW SLAVE STATUS
is for status about the running slave and for values setable with CHANGE
MASTER, not for showing the values of configuration parameters.
The visibility on the --master-retry-count
config together with the other new output was part of MDEV-35304 (and its customer)’s request.
But I agree that the change from master_retry_count
to mi->retry_count
belongs to MDEV-25674 better.
I included that ::retry_count
because, like you said, SHOW REPLICA STATUS reports Master_info
entries, not global configs.
I’ve requested assigning MDEV-25674 to myself and, especially re. your criticism, plan to include it in this PR too.
I’ll think about what’s the best sequence to reorganize the commit history to.
Second, when this is added (here or as part of MDEV-25674), use consistent
naming using the primary Master/Slave terminology, do not introduce a mix
with the alternate terminology primary/replica.
So MDEV-18777 is too soon for new columns by themselves? 👌
So "Master_Retry_Count". But I would suggest that when this is added as
part of MDEV-25674, to name it better as "Master_Max_Connect_Tries" or
something like that. To make it clear that 1) it is the configured maximum,
not any current count, and 2) it is the total number of connect tries,
whereas "retry" might suggest it does not include the initial try.
Much agreed – this name is so unfluent.
I was also thinking about deprecating --master-retry-count
itself too in favor of MDEV-25674. This could be another argument.
Switching to the ‘primary’ term as part of a “rename” is quite tempting though.
Or maybe throw out the ‘master’ part like Connect_Tried
did (for disambiguation) – Max_Connect_Tries
? 🤔 clean
--SOURCE include/have_debug.inc | ||
--SOURCE include/have_debug_sync.inc | ||
--SOURCE include/have_log_bin.inc # The test is agnostic of binlog formats. | ||
--SOURCE include/master-slave.inc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use lower-case for mysqltest commands --source, --let, --connection,
etc. to be consistent with other test files.
Will do
I’m still relatively new to the job and am building my coding style, especially that there’s no explcit guide on certain MTR conventions like this.
|
||
--LET $rpl_server_number= 1 | ||
--LET $status_items= Slave_IO_Running | ||
|
||
#--CALL mtr.add_suppression("Slave I/O: Master command COM_REGISTER_SLAVE failed: .*") | ||
#--CALL mtr.add_suppression("Slave I/O: .* failed with error: Lost connection to MySQL server at 'reading initial communication packet'") | ||
#--CALL mtr.add_suppression("Fatal error: The slave I/O thread stops because master and slave have equal MySQL server ids; .*") | ||
#--CALL mtr.add_suppression("Slave I/O thread .* register on master") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are these commented out but left in the test case?
I based this file on the test rpl_get_master_version_and_clock
, couldn’t get either my local build nor buildbots to run the test, and forgot I had them like this 😅.
Will check again when situations improve
{"master-retry-count", 0, | ||
"The number of tries the slave will make to connect to the master before giving up", | ||
&master_retry_count, &master_retry_count, 0, GET_ULONG, | ||
REQUIRED_ARG, 100000, 0, 0, 0, 0, 0}, | ||
REQUIRED_ARG, static_cast<longlong>(master_retry_count), 0, 0, 0, 0, 0}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks strange. These are statically initialized global variables. How
can you initialize one of them with the value of the other? Better use a
#define or static constexpr to initialize them both. Or do you have a
reference to somewhere it says this is valid standard c++?
It was by intuition when I made that change.
But I’ve now looked up: non-static
global variables in a (read: the same) translation unit (i.e., .cpp
) fall under ordered dynamic initialization.
Though of course, this refactor is unnecessary if our preference is to throw out --master-retry-count
altogether in favor of MDEV-25674.
@@ -780,7 +780,7 @@ File_parser_dummy_hook file_parser_dummy_hook; | |||
|
|||
/* replication parameters, if master_host is not NULL, we are a slave */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while on this file – Where master_host
?
reg= | ||
{ | ||
"Reconnecting after a failed registration on primary", | ||
"register on primary", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keep to the standard terminology master/slave, don't introduce a mix with the alternate primary/replica terminology except in optional user aliases.
And best not to change the error message wording unless there is a good reason to do so.
Personal preference was my reason to do so 😅.
I’m not woke well-aware of our stance on the master–slave terminology and, seeïng this PR merges to a new version, didn’t weigh compatibility as much.
The previous (and the first) time I asked about this was #3615 (comment).
Users may occasionally have regexps or something matching against the old wording.
Including our warning suppressions in MTR 📠.
DBUG_RETURN(0); | ||
|
||
network_err: | ||
if (master_res) | ||
mysql_free_result(master_res); | ||
DBUG_RETURN(2); | ||
|
||
slave_killed_err: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you consider merging the two goto labels into one also? But just a suggestion, up to you.
I considered but made no decisions. I can’t tell if they ever were or will differ.
I prefer exceptions personally.
|
||
DBUG_PRINT("info",("Starting reading binary log from master")); | ||
thd->set_command(COM_SLAVE_IO); | ||
while (!io_slave_killed(mi)) | ||
if (!io_slave_killed(mi)) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this changes a nested while() { ... while() { ... } ... } into a
single while() inside an if().It was hard for me to see exactly what the reason was for this in the patch
/ why it is correct, can you try to explain the reason for it?
- Both
while
s have the same condition of!io_slave_killed(mi)
. - There is no code that runs after the inner loop concludes before the outer’s
}
.
A killed replica cannot resurrect (right?), especially during the few instructions of the inner exiting to the outer.
connect | ||
*/ | ||
if (++err_count == master_retry_count) | ||
if (++(mi->connects_tried) == mi->retry_count) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A question: The documentation for --master-retry-count states:
Number of times a replica will attempt to connect to a primary before giving up.
A value of 0 means the replica will not stop attempting to reconnect.Are these still true after this patch?
Right, I think this makes --master-retry-count=0 mean forever, ok (partly answering my own question above).
I believe they were not true even before this patch.
C/C++ integers are not limitless (unlike Ruby – I main Ruby 💎).
The counter (new or old) will rollover to 0
when it performs ++ULONG_MAX
.
So it’s not forever forever, only practically forever.
Constants for columns that are present in | ||
SHOW ALL SLAVES STATUS | ||
that are not in SHOW SLAVE STATUS. Specifically, columns 0 and 1, and | ||
everything at and above 56. | ||
everything between 56 and 64 inclusive. | ||
0: Connection_name | ||
1: Slave_SQL_State | ||
56: Retried_transactions | ||
64: Master_Slave_time_diff | ||
*/ | ||
#define SLAVE_STATUS_COL_CONNECTION_NAME 0 | ||
#define SLAVE_STATUS_COL_SLAVE_SQL_STATE 1 | ||
#define SLAVE_STATUS_COL_RETRIED_TRANSACTIONS 56 | ||
#define SLAVE_STATUS_COL_MASTER_SLAVE_TIME_DIFF 64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I think this is wrong.
start a thread for myself so I don’t forget
(But yeah, something overcomplicated is goïng on with this “add at the end only” limitation.)
Description
These new
Master_info
numbers supplement SHOWSLAVEREPLICA STATUS’s (most-recent) ‘Connecting’ state with statistics on (re)connect attempts:Connects_Tried
: how many retries have been attempted so farPrimary_Retry_Count
: out of how many configured--master-retry-count
, it’s now copied per-replica to pave way for CHANGEPRIMARYMASTER … TO in MDEV-25674.This PR also includes additional commits that refactor relevant code.
I strongly recommend reviewing commit-by-commit.
What problem is the patch trying to solve?
When the IO thread (re)connect to a primary, no updates are available besides unique errors that cause the failure.
If some output changed that is not visible in a test case, what was it looking like before the change and how it's looking with this patch applied?
From #3764 (comment), appearently:
--master-retry-count=0
is treated as=1
for reconnecting to a previously connected primary.try_to_reconnect()
.In full, looks like this happens when the connection drops at a different substep in the setup phase.
Both of these code are now gone as part of a preparation commit.
Do you think this patch might introduce side-effects in other parts of the server?
Component that expects a fixed number of SHOW REPLICA STATUS entries in fixed sections might start seeïng uninvited guests.
On that note, we have many MTR tests that run
show slave status
directly and in full (i.e., not usingsource include/show_slave_status.inc
).Release Notes
Added two new SLOW REPLICA STATUS entries for statistics of the most-recent ‘Connecting’ state:
Connects_Tried
: How many retries were attemptedPrimary_Retry_Count
: Out of how many configured (i.e., the--master-retry-count
option)Of course, they’re also accessible from INFORMATION_SCHEMA.SLAVE_STATUS; e.g.:
Knowledge Base pages that need changing
How can this PR be tested?
I need some help from test specialists.
I’ve drafted the MTR test
rpl.rpl_connects_tried
forConnets_Tried
’s behavior, but I couldn’t test it locally because my build doesn’t havedebug_sync
.I couldn’t refer to the buildbots either – they’re either successful (no search results on Cross Reference either) or they fail with issues I have no clue how they relate to my modifications 😶🌫️. (Besides the ones that already fail in
main
, that is 😶.)Besides checking that
Primary_Retry_Count
matches--master-retry-count
, we should also test the option itself regarding #3764 (comment).A draft MTR test is at #3731.
If the changes are not amenable to automated testing, please explain why not and carefully describe how to test manually.
A tester can instead replicate (wat) the MTR test by observing the SHOW REPLICA STATUS and/or INFORMATION_SCHEMA.SLAVE_STATUS of a long-
master_connect_retry
replication over time.Basing the PR against the correct MariaDB version
main
branch.This is a bug fix, and the PR is based against the earliest maintained branch in which the bug can be reproduced.PR quality check