The bug occurs when pg_shard fails to INSERT into a shard placement and PostgreSQL is shut down, or the psql connection is closed, before the shard placement's status is updated.
This bug is not easy to reproduce. However, if a sleep() call is added to this line, reproducing it becomes easy.
Assuming that sleep() has been added, the bug can be reproduced with the following steps:
1. Create a cluster with 1 master and 2 workers.
2. Distribute a table and create worker shards with replication factor 2.
3. Stop one of the worker nodes.
4. Connect with psql and get its pid: select pg_backend_pid();
5. Issue an INSERT in that psql session. While the INSERT is running (because of the added sleep, it takes at least as long as the sleep), run the shell command "kill -9 pid_of_psql".
6. Restart both the master and the stopped worker node.
7. Connect to the worker nodes and observe that one of the shards has diverged.
8. Observe that the shard placements in the metadata nevertheless all have STATE_FINALIZED status.
The main problem here is that the remote commands and the placement status changes are not executed atomically.
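To make the failure window concrete, here is a toy model in Python. It is not pg_shard code: `insert_row`, `BackendKilled`, the dictionaries, and the "INACTIVE" status string are all hypothetical stand-ins. It only illustrates that if the process dies after the remote INSERTs but before the status update, the metadata still reports every placement as STATE_FINALIZED.

```python
class BackendKilled(Exception):
    """Stands in for the session ending between the two steps."""


def insert_row(placements, metadata, row, die_before_status_update=False):
    # Step 1: send the row to every placement; a stopped node fails.
    failed = []
    for node, shard in placements.items():
        if shard is None:          # None models a worker that is down
            failed.append(node)
        else:
            shard.append(row)

    # The window the issue describes: step 1 is done, step 2 is not.
    if die_before_status_update:
        raise BackendKilled()

    # Step 2: record the failed placements in the metadata.
    for node in failed:
        metadata[node] = "INACTIVE"  # hypothetical status name


placements = {"worker1": [], "worker2": None}
metadata = {"worker1": "STATE_FINALIZED", "worker2": "STATE_FINALIZED"}

try:
    insert_row(placements, metadata, ("key", 1), die_before_status_update=True)
except BackendKilled:
    pass

placements["worker2"] = []  # the stopped worker restarts without the new row

print(placements)  # the two placements now hold different data
print(metadata)    # yet both are still marked STATE_FINALIZED
```

Calling `insert_row` without `die_before_status_update` marks `worker2` as failed, which is the behavior the two steps are meant to guarantee together.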
One possible solution is to check whether HOLD_INTERRUPTS()/RESUME_INTERRUPTS() works here. We should also check whether this pair of function calls has any drawbacks.
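As a rough illustration of the idea, the following Python sketch models an interrupt hold-off counter: a pending termination request is only acted on when the counter is zero. The `Backend` class, `Terminated`, and `run_insert` are hypothetical names, and this is a simplified model rather than PostgreSQL's actual implementation. It also says nothing about cases no hold-off can cover, such as the backend process itself receiving SIGKILL.

```python
class Terminated(Exception):
    """Stands in for the backend acting on a termination request."""


class Backend:
    def __init__(self):
        self.holdoff_count = 0    # > 0 means interrupts are deferred
        self.die_pending = False  # set when a termination request arrives
        self.steps = []           # work completed so far

    def hold_interrupts(self):
        self.holdoff_count += 1

    def resume_interrupts(self):
        assert self.holdoff_count > 0
        self.holdoff_count -= 1

    def check_for_interrupts(self):
        # Act on a pending request only when nothing is holding it off.
        if self.die_pending and self.holdoff_count == 0:
            raise Terminated()


def run_insert(hold):
    backend = Backend()
    try:
        if hold:
            backend.hold_interrupts()
        backend.steps.append("remote INSERT")
        backend.die_pending = True        # request arrives mid-operation
        backend.check_for_interrupts()    # deferred if held
        backend.steps.append("status update")
        if hold:
            backend.resume_interrupts()
        backend.check_for_interrupts()    # the deferred request fires here
    except Terminated:
        pass
    return backend.steps


print(run_insert(hold=False))
print(run_insert(hold=True))
```

In this model, holding interrupts lets both steps finish before the termination is processed. One drawback to weigh is that the session stays unresponsive to cancel and terminate requests for as long as the remote commands take.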