Commit
Fix START/STOP SLAVE deadlock caused by slave stats daemon
Under load, if START SLAVE IO_THREAD and STOP SLAVE execute concurrently, the following deadlock is possible:

1) The START thread waits to acquire the channel map RW lock in shared mode and does not check whether it was signaled to quit by the STOP thread:

    frame #5: 0x000000010075303c mysqld`Checkable_rwlock::rdlock(this=0x0000600001cd80e0) at rpl_gtid.h:464:5
    frame #6: 0x0000000100752168 mysqld`Multisource_info::rdlock(this=0x0000000105d2a780) at rpl_msr.h:408:25
    frame #7: 0x0000000100751dc8 mysqld`start_handle_slave_stats_daemon() at slave_stats_daemon.cc:233:15
    frame #8: 0x00000001020520ec mysqld`handle_slave_io(arg=0x000000011a869000) at rpl_replica.cc:6051:36

2) The STOP thread waits for the previous thread to stop while holding the channel map RW lock in exclusive mode:

    frame #5: 0x0000000102072a38 mysqld`inline_mysql_cond_timedwait(that=0x000000011a869638, mutex=0x000000011a8694f0, abstime=0x00000001732d9588, src_file="/Users/laurynas/vilniusdb/myr-native-dd-proto/sql/rpl_replica.cc", src_line=2368) at mysql_cond.h:224:16
    frame #6: 0x0000000102050128 mysqld`terminate_slave_thread(thd=0x000000011b3ed800, term_lock=0x000000011a8694f0, term_cond=0x000000011a869638, slave_running=0x000000011a8696f4, stop_wait_timeout=0x00000001732d9700, need_lock_term=false, force=false) at rpl_replica.cc:2368:9
    frame #7: 0x000000010204fa6c mysqld`terminate_slave_threads(mi=0x000000011a869000, thread_mask=1, stop_wait_timeout=31536000, need_lock_term=false) at rpl_replica.cc:2204:18
    frame #8: 0x0000000102048020 mysqld`stop_slave(thd=0x000000011a908000, mi=0x000000011a869000, net_report=true, for_one_channel=true, push_temp_tables_warning=0x00000001732d9a37, invoked_by_raft=false) at rpl_replica.cc:10033:9
    frame #9: 0x0000000102048ca8 mysqld`stop_slave_cmd(thd=0x000000011a908000) at rpl_replica.cc:971:13

For the fix, observe that the starting replica I/O thread only tries to signal the stats thread to start, so move this code to the START REPLICA command-executing thread instead, which already happens to hold the channel map lock. This in turn requires moving the stopping of the stats thread from the replica I/O thread to the STOP REPLICA command-executing thread.

This fixes intermittent but frequently seen failures on rpl.rpl_multi_source_channel_map_stress.

Squash with b015dd3
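
The locking pattern after the fix can be illustrated with a minimal standalone sketch. This is not the server code: `Channel`, `io_thread_body`, `start_replica`, and `stop_replica` are hypothetical names, `std::shared_mutex` stands in for the channel map RW lock, and a plain flag stands in for the stats daemon. The sketch also assumes that both command threads take the lock in exclusive mode. The point it tries to show is that the I/O thread no longer acquires the channel map lock, so the STOP path can wait for it while holding that lock.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <shared_mutex>
#include <thread>

// Hypothetical stand-in for per-channel replication state.
// None of these names come from the MySQL source.
struct Channel {
    std::shared_mutex channel_map_lock;             // models the channel map RW lock
    std::atomic<bool> io_abort{false};              // models the "quit" signal to the I/O thread
    std::atomic<bool> stats_daemon_running{false};  // models the stats daemon state
    std::thread io_thread;
};

// I/O thread body after the fix: it never touches channel_map_lock,
// so it can always observe io_abort and exit.
void io_thread_body(Channel &ch) {
    while (!ch.io_abort.load()) {
        std::this_thread::yield();
    }
}

// Models the START REPLICA command thread. It holds the channel map lock
// (assumed exclusive here) and signals the stats daemon itself, instead of
// leaving that to the I/O thread.
void start_replica(Channel &ch) {
    std::unique_lock<std::shared_mutex> guard(ch.channel_map_lock);
    assert(!ch.io_thread.joinable());
    ch.io_abort = false;
    ch.stats_daemon_running = true;  // moved here from the I/O thread
    ch.io_thread = std::thread([&ch] { io_thread_body(ch); });
}

// Models the STOP REPLICA command thread. It waits for the I/O thread while
// holding the lock, which is safe only because the I/O thread needs no lock
// held here.
void stop_replica(Channel &ch) {
    std::unique_lock<std::shared_mutex> guard(ch.channel_map_lock);
    if (!ch.io_thread.joinable()) return;
    ch.io_abort = true;
    ch.io_thread.join();
    ch.stats_daemon_running = false;  // moved here from the I/O thread
}
```

If `io_thread_body` instead began by taking `channel_map_lock` in shared mode, a `stop_replica` call that won the race for the lock would block forever in `join()`, which mirrors the two stack traces above.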