-
Notifications
You must be signed in to change notification settings - Fork 713
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix START/STOP SLAVE deadlock caused by slave stats daemon #1377
base: fb-mysql-8.0.32
Are you sure you want to change the base?
Fix START/STOP SLAVE deadlock caused by slave stats daemon #1377
Conversation
@hermanlee , @luqun , I am PR'ing a possible fix but note that I'm not a replication stats thread expert nor do I see any specific MTR tests I could run for it. |
fcbd7f7
to
d07cb23
Compare
@@ -230,18 +230,16 @@ static void *handle_slave_stats_daemon(void *arg MY_ATTRIBUTE((unused))) { | |||
bool start_handle_slave_stats_daemon() { | |||
DBUG_ENTER("start_handle_slave_stats_daemon"); | |||
|
|||
channel_map.rdlock(); | |||
channel_map.assert_some_lock(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this working when start_handle_slave_stats_daemon() is called from register_raft_followers()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this working when start_handle_slave_stats_daemon() is called from register_raft_followers()?
@george-reynya , is there any way I can check this with the open source build?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, is the problem that you don't have the raft plugin? Let me check what the options are.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for delay. Left a question about having the channel map lock in the raft case.
Under load, if START SLAVE IO_THREAD and STOP SLAVE execute concurrently, the following deadlock is possible: - Thread 44 is executing STOP SLAVE, has read-locked the channel map, and is waiting for the slave I/O thread to quit. - Thread 55 is executing START SLAVE, and is blocked waiting for the channel map write lock. - Thread 54 is the SLAVE I/O thread, and is trying to read-lock the channel map lock. The request of T54 is compatible with the current lock state, but according to POSIX, once a write request is pending, it is up to the implementation whether to satisfy them or block. For the fix, observe that the starting replica I/O thread only tries to signal the stats thread to start, thus move this code to the START REPLICA command-executing thread instead, which already happens to hold the channel map lock. This also forces to move the stopping of the stats thread from the replica I/O thread to the STOP REPLICA command-executing thread. This fixes intermittent but often-seen failures on rpl.rpl_multi_source_channel_map_stress under macOS. Squash with b015dd3 Stacktraces: Thread 44: frame facebook#5: 0x0000000106483108 mysqld`inline_mysql_cond_timedwait(that=0x000000014585d438, mutex=0x000000014585d2f0, abstime=0x000000016db61588, src_file="/Users/laurynas/vilniusdb/fb-mysql/sql/rpl_replica.cc", src_line=2367) at mysql_cond.h:224:16 frame facebook#6: 0x0000000106460818 mysqld`terminate_slave_thread(thd=0x00000001458fb800, term_lock=0x000000014585d2f0, term_cond=0x000000014585d438, slave_running=0x000000014585d4f4, stop_wait_timeout=0x000000016db61700, need_lock_term=false, force=false) at rpl_replica.cc:2367:9 frame facebook#7: 0x0000000106460188 mysqld`terminate_slave_threads(mi=0x000000014585ce00, thread_mask=1, stop_wait_timeout=31536000, need_lock_term=false) at rpl_replica.cc:2204:18 frame facebook#8: 0x000000010645873c mysqld`stop_slave(thd=0x0000000145952800, mi=0x000000014585ce00, net_report=true, for_one_channel=true, push_temp_tables_warning=0x000000016db61a37, invoked_by_raft=false) at rpl_replica.cc:10032:9 frame facebook#9: 0x00000001064593c4 mysqld`stop_slave_cmd(thd=0x0000000145952800) at rpl_replica.cc:971:13 Thread 55: frame facebook#2: 0x000000018e5bf73c libsystem_pthread.dylib`_pthread_rwlock_lock_slow + 720 frame facebook#3: 0x0000000104c3d2ac mysqld`native_rw_wrlock(rwp=0x00006000039dc0e8) at thr_rwlock.h:101:10 frame facebook#4: 0x0000000104c3d21c mysqld`inline_mysql_rwlock_wrlock(that=0x00006000039dc0e8, src_file="/Users/laurynas/vilniusdb/fb-mysql/sql/rpl_gtid.h", src_line=473) at mysql_rwlock.h:410:12 frame facebook#5: 0x0000000104c3c6f0 mysqld`Checkable_rwlock::wrlock(this=0x00006000039dc0e0) at rpl_gtid.h:473:5 frame facebook#6: 0x00000001054dc6e0 mysqld`Multisource_info::wrlock(this=0x000000010a182980) at rpl_msr.h:441:25 frame facebook#7: 0x0000000106458bb0 mysqld`start_slave_cmd(thd=0x0000000130008a00) at rpl_replica.cc:832:15 Thread 54: frame facebook#2: 0x000000018e5bf73c libsystem_pthread.dylib`_pthread_rwlock_lock_slow + 720 frame facebook#3: 0x0000000104b62d90 mysqld`native_rw_rdlock(rwp=0x00006000039dc0e8) at thr_rwlock.h:82:10 frame facebook#4: 0x0000000104b62d00 mysqld`inline_mysql_rwlock_rdlock(that=0x00006000039dc0e8, src_file="/Users/laurynas/vilniusdb/fb-mysql/sql/rpl_gtid.h", src_line=464) at mysql_rwlock.h:340:12 frame facebook#5: 0x0000000104b62bcc mysqld`Checkable_rwlock::rdlock(this=0x00006000039dc0e0) at rpl_gtid.h:464:5 frame facebook#6: 0x0000000104b61cec mysqld`Multisource_info::rdlock(this=0x000000010a182980) at rpl_msr.h:415:25 frame facebook#7: 0x0000000104b618f0 mysqld`start_handle_slave_stats_daemon() at slave_stats_daemon.cc:233:15 frame facebook#8: 0x00000001064627dc mysqld`handle_slave_io(arg=0x000000014585ce00) at rpl_replica.cc:6050:36
d07cb23
to
ca53b66
Compare
Rebased on 8.0.32. "This branch has conflicts that must be resolved" looks bogus |
@george-reynya has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Fix START/STOP SLAVE deadlock caused by slave stats daemon
Under load, if START SLAVE IO_THREAD and STOP SLAVE execute concurrently, the
following deadlock is possible:
The request of T54 is compatible with the current lock state, but according to POSIX, once a write request is pending, it is up to the implementation whether to satisfy them or block.
For the fix, observe that the starting replica I/O thread only tries to signal the stats thread to start, thus move this code to the START REPLICA command-executing thread instead, which already happens to hold the channel map lock. This also forces to move the stopping of the stats thread from the replica I/O thread to the STOP REPLICA command-executing thread.
This fixes intermittent but often-seen failures on rpl.rpl_multi_source_channel_map_stress under macOS.
Squash with b015dd3
Stacktraces:
Thread 44:
Thread 55:
Thread 54: