CP io fibers deadlock #634

Open
Hooper9973 opened this issue Jan 23, 2025 · 2 comments

@Hooper9973
Contributor

We hit this deadlock during the NuObject 2.0 SM long-running test. Here is the issue doc:
https://docs.google.com/document/d/16DMqv9J-JuNs5c25IBuX01QXe5WbxPm4B4hAfz4JNZM/edit?tab=t.0#heading=h.94n4m5lw7ms3

Root Cause
CP flush is blocked by a deadlock caused by a fiber holding a thread-level mutex.
For example:
During a CP flush, fiber1 is selected to execute MetaBlkService::update_sub_sb. This function acquires a thread-level lock, m_meta_mtx. Fiber1 then performs some I/O through IOManager; the completion is handled via FiberManagerLib::Promise m_promise, which is derived from boost::fibers::promise and is used for io_uring (https://github.com/eBay/IOManager/blob/f78a58264f461ba8707478d7d7252cfef37bb255/src/lib/interfaces/uring_drive_interface.cpp#L386).
When the future obtained from the promise is waited on for the I/O to complete, fiber1 is suspended, allowing other fibers to be scheduled.
Subsequently, another CP flush is triggered, and fiber2 is selected to execute update_sub_sb again. Fiber2 attempts to acquire m_meta_mtx. Since both fibers belong to the same thread, fiber2 blocks the entire thread while waiting for the lock. As a result, fiber1 can never be scheduled to release it, and we have a deadlock.

The underlying problem: if a fiber holds a thread-level lock and then performs sync I/O, it yields and other fibers on the same thread get to run. If any of those fibers tries to acquire the same thread-level lock, it blocks the entire thread, so the lock holder can never resume. That is the deadlock risk.
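To make the pattern concrete, here is a minimal, self-contained sketch using plain boost.fiber and std::mutex instead of the actual HomeStore/IOManager types (g_meta_mtx, fiber1 and fiber2 are illustrative names only); the program hangs in exactly the way described above:

#include <boost/fiber/all.hpp>
#include <iostream>
#include <mutex>

std::mutex g_meta_mtx; // stand-in for the thread-level m_meta_mtx

int main() {
    boost::fibers::fiber fiber1([] {
        std::unique_lock< std::mutex > lk{g_meta_mtx};
        std::cout << "fiber1: holding the mutex, yielding as if waiting for sync io\n";
        boost::this_fiber::yield();                             // sync io wait -> scheduler runs fiber2
        std::cout << "fiber1: io done, releasing the mutex\n";  // never reached
    });

    boost::fibers::fiber fiber2([] {
        std::cout << "fiber2: trying to take the same mutex\n";
        std::unique_lock< std::mutex > lk{g_meta_mtx};          // blocks the whole OS thread
        std::cout << "fiber2: got the mutex\n";                 // never reached
    });

    fiber1.join(); // hangs: fiber2 blocks the thread, so fiber1 can never resume
    fiber2.join();
    return 0;
}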

@Hooper9973
Contributor Author

Hooper9973 commented Jan 23, 2025

HO's PR triggers back-to-back CP instead of force CP:
eBay/HomeObject@dbc0d71

We have adjusted the number of sync_io_fibers for CP on the HS side to 1. This is a quick way to avoid potential risks. However, the fundamental solution to the problem still requires further discussion.
c739f11

On the HS side, concurrent checkpoints do not occur. However, even with sequential CP flushes, there remains a risk of deadlock. @JacksonYao287, could you please provide more details on this?
cc @yamingk @xiaoxichen

@JacksonYao287
Contributor

JacksonYao287 commented Jan 24, 2025

Let me clarify the root cause of the deadlock found in cp_flush.

Background

  1. After we create the reactor for cp_manager, we select all the sync_io fibers as cp_manager worker fibers, and when we submit an io job to this reactor, we randomly select one of these worker fibers to handle it.
    iomanager.create_reactor("cp_io", iomgr::INTERRUPT_LOOP, 2u, [this, ctx](bool is_started) {
        if (is_started) {
            {
                std::unique_lock< std::mutex > lk{ctx->mtx};
                auto v = iomanager.sync_io_capable_fibers();
                m_cp_io_fibers.insert(m_cp_io_fibers.end(), v.begin(), v.end());

From the code above, we now create the reactor with 2 fibers (it was 8 before the workaround PR), then select all the sync_io fibers as our worker fibers. The sync_io fibers are all fibers in the reactor excluding the main fiber (the first one). Thus, if the number of fibers is larger than 2 when creating the reactor, we will have multiple worker fibers.

iomgr::io_fiber_t CPManager::pick_blocking_io_fiber() const {
    static thread_local std::random_device s_rd{};
    static thread_local std::default_random_engine s_re{s_rd()};
    static auto rand_fiber = std::uniform_int_distribution< size_t >(0, m_cp_io_fibers.size() - 1);
    return m_cp_io_fibers[rand_fiber(s_re)];
}
void CPManager::on_cp_flush_done(CP* cp) {
    HS_DBG_ASSERT_EQ(cp->m_cp_status, cp_status_t::cp_flushing);
    cp->m_cp_status = cp_status_t::cp_flush_done;

    iomanager.run_on_forget(pick_blocking_io_fiber(),

From the two code pieces above, we can see that when we submit an io job to the cp reactor, we randomly select a sync_io fiber. Thus, if there are multiple sync_io fibers, different io jobs can land on different fibers.

  2. For the sync io path (the uring_drive_interface code linked in the issue description), we can see that when doing sync io, it calls FiberManagerLib::Promise#wait to wait for the io to complete. FiberManagerLib::Promise#wait, unlike folly's promise wait, yields to another fiber; see the sketch below.
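For illustration, here is a minimal sketch of that behavior, assuming FiberManagerLib::Promise is a thin wrapper over boost::fibers::promise as the linked io_uring code suggests; waiting on the fiber future suspends only the calling fiber, so other fibers on the same reactor thread keep running:

#include <boost/fiber/all.hpp>
#include <iostream>

int main() {
    boost::fibers::promise< bool > io_done;
    boost::fibers::future< bool > fut = io_done.get_future();

    // Another fiber stands in for the io completion path (e.g. the io_uring cqe handler).
    boost::fibers::fiber completer([&io_done] {
        std::cout << "completer fiber runs while the waiter is suspended\n";
        io_done.set_value(true);
    });

    fut.wait(); // suspends only the calling fiber; the completer fiber gets to run
    std::cout << "io reported complete: " << std::boolalpha << fut.get() << "\n";
    completer.join();
    return 0;
}

This is exactly why a fiber that acquired a thread-level mutex before issuing the sync io can yield while still holding it.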

Root Cause

void CPManager::on_cp_flush_done(CP* cp) {
    HS_DBG_ASSERT_EQ(cp->m_cp_status, cp_status_t::cp_flushing);
    cp->m_cp_status = cp_status_t::cp_flush_done;

    iomanager.run_on_forget(pick_blocking_io_fiber(), [this, cp]() {
        // Persist the superblock with this flushed cp information
        ++(m_sb->m_last_flushed_cp);
        m_sb.write();

        cleanup_cp(cp);

        // Setting promise will cause the CP manager destructor to cleanup before getting a chance to do the
        // checking if shutdown has been initiated or not.
        auto promise = std::move(cp->m_comp_promise);
        m_wd_cp->reset_cp();
        delete cp;

        bool trigger_back_2_back_cp{false};

        {
            std::unique_lock< std::mutex > lk(m_trigger_cp_mtx);
            m_in_flush_phase = false;
            trigger_back_2_back_cp = m_pending_trigger_cp;
        }

        promise.setValue(true);

        // Dont access any cp state after this, in case trigger_back_2_back_cp is false, because its false on
        // cp_shutdown_initated and setting this promise could destruct the CPManager itself.
        if (trigger_back_2_back_cp) {
            HS_PERIODIC_LOG(INFO, cp, "Triggering back to back CP");
            COUNTER_INCREMENT(*m_metrics, back_to_back_cps, 1);
            trigger_cp_flush(false);
        }
    });
}

From the code above, when one cp flush is done, it submits an io job to the cp reactor, and from the perspective of cp_manager that cp flush is finished and a new cp flush can start. Note that it does not wait for this io job to complete, and we have no idea when the job will actually run, since that is decided by the fiber scheduler. Also, the job updates the metablk (m_sb.write()), which acquires the metablk lock. Moreover, m_sb.write() is a sync io, so as described in background #2 it acquires the lock and then yields to another fiber without releasing it.

So there is a scenario where, before this cp_flush_done io job completes, another cp_flush is triggered that also tries to update the metablk and acquire the metablk lock. A deadlock then occurs: the lock is held by the previous fiber, which has yielded, while the current fiber blocks the whole thread trying to acquire it.

One example is wb_cache#cp_flush:

void IndexWBCache::process_write_completion(IndexCPContext* cp_ctx, IndexBufferPtr const& buf) {
........................
        // We are done flushing the buffers, We flush the vdev to persist the vdev bitmaps and free blks
        // Pick a CP Manager blocking IO fiber to execute the cp flush of vdev
        iomanager.run_on_forget(cp_mgr().pick_blocking_io_fiber(), [this, cp_ctx]() {
            LOGTRACEMOD(wbcache, "Initiating CP flush");
            m_vdev->cp_flush(cp_ctx); // This is a blocking io call
            cp_ctx->complete(true);
        });
    }
}

Here, m_vdev->cp_flush(cp_ctx) updates the metablk and therefore acquires the meta lock. If it tries to do so while the job created by on_cp_flush_done is still in the middle of updating the metablk, we can hit the deadlock.

Also, the blocking stack can be seen here.

Conclusion
According to the analysis above, the deadlock is not caused by concurrent cp_flush; even sequential cp flushes can hit it. It is caused by the uncontrollable execution order of the different io fibers.

For now, we use a workaround PR that sets the number of sync_io fibers to one, so that all io jobs execute sequentially on the single fiber capable of sync io.

Alternatively, we can change run_on_forget to run_on_wait in CPManager::on_cp_flush_done to make sure that, before the next cp starts, all the sync io of the previous cp is done. A rough sketch of the idea follows.
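This is a self-contained sketch of that idea using plain boost.fiber in place of the real run_on_forget / run_on_wait calls (whose exact signatures are not reproduced here): if the caller waits for the flush-done job before starting the next cp, the meta lock is guaranteed to have been released again.

#include <boost/fiber/all.hpp>
#include <iostream>
#include <mutex>

std::mutex g_meta_mtx; // stand-in for the metablk lock

void flush_done_job() {
    std::unique_lock< std::mutex > lk{g_meta_mtx}; // update metablk
    boost::this_fiber::yield();                    // sync io inside m_sb.write()
    std::cout << "flush-done job finished, meta lock released\n";
}

int main() {
    // "run_on_wait" style: dispatch the job and wait for it before the next cp.
    boost::fibers::fiber job{flush_done_job};
    job.join(); // the next cp only starts after the job has released the lock

    // The next cp flush can now take the meta lock without deadlocking.
    std::unique_lock< std::mutex > lk{g_meta_mtx};
    std::cout << "next cp flush acquired the meta lock\n";
    return 0;
}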
