pg_shard may fail to mark shard placement as invalid under some circumstances #101

Open
onderkalaci opened this issue Apr 2, 2015 · 0 comments

The bug happens when pg_shard fails to INSERT into a shard placement and PostgreSQL is shut down, or the psql connection is closed, before the shard placement's status is updated.

This bug is not easy to reproduce. However, if a sleep() call is added to this line, reproducing it becomes easy.

Assuming that sleep() is added, the bug can be reproduced with the following steps:

  1. Create a cluster with 1 master and 2 workers
  2. Distribute a table and create worker shards with replication factor 2
  3. Stop one of the worker nodes
  4. Connect with psql and get its pid with select pg_backend_pid();
  5. Issue an INSERT in that psql session. While the INSERT is running (because of the added sleep, it takes at least as long as the sleep duration), execute the shell command "kill -9 pid_of_psql"
  6. Restart both the master and the stopped worker node
  7. Connect to the worker nodes and observe that one of the shards is divergent
  8. Observe that all shard placements in the metadata nevertheless have STATE_FINALIZED status

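The main problem here is that executing the remote commands and updating the shard placement state are not done atomically. If the session dies between the two, the failed placement is never marked as invalid.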
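To make the failure window concrete, here is a minimal, self-contained Python model of the sequence described above. It is only a sketch and not pg_shard's actual code: `Placement`, `distributed_insert`, `BackendKilled`, and the `STATE_INVALID` value are hypothetical names made up for illustration; only `STATE_FINALIZED` is taken from the report.

```python
from dataclasses import dataclass, field

STATE_FINALIZED = "STATE_FINALIZED"
STATE_INVALID = "STATE_INVALID"  # hypothetical stand-in for the "invalid" state


class BackendKilled(Exception):
    """Stands in for the session dying (e.g. kill -9) mid-statement."""


@dataclass
class Placement:
    node: str
    up: bool = True
    state: str = STATE_FINALIZED
    rows: list = field(default_factory=list)


def distributed_insert(placements, row, die_before_state_update=False):
    # Phase 1: send the INSERT to every placement and remember failures.
    failed = []
    for p in placements:
        if p.up:
            p.rows.append(row)
        else:
            failed.append(p)

    # Failure window: the remote writes are done, the metadata is not updated.
    if die_before_state_update:
        raise BackendKilled()

    # Phase 2: mark the placements that missed the write.
    for p in failed:
        p.state = STATE_INVALID
    return len(placements) - len(failed)


def divergent_but_finalized(placements):
    # True if placements still marked finalized hold different data.
    finalized = [p for p in placements if p.state == STATE_FINALIZED]
    return len({tuple(p.rows) for p in finalized}) > 1
```

In this model, a normal run with one stopped worker leaves that placement marked invalid, while raising `BackendKilled` between the two phases leaves both placements finalized with different contents, which matches what steps 7 and 8 observe.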

A possible solution to try is wrapping this section in HOLD_INTERRUPTS()/RESUME_INTERRUPTS(). We should check whether that works, and also whether this pair of calls has any drawbacks.
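As a rough illustration of what the holdoff pair would buy us, the following Python sketch models the counter-based behavior behind HOLD_INTERRUPTS()/RESUME_INTERRUPTS(): while the holdoff count is nonzero, a pending interrupt is not serviced at interrupt check points. This is a simplified model, not PostgreSQL's implementation; the `Backend` class, `run_statement`, and the event strings are hypothetical.

```python
class Interrupted(Exception):
    """Stands in for a cancel/terminate request being serviced."""


class Backend:
    def __init__(self):
        self.holdoff_count = 0
        self.pending = False

    def hold_interrupts(self):
        self.holdoff_count += 1

    def resume_interrupts(self):
        assert self.holdoff_count > 0
        self.holdoff_count -= 1

    def signal(self):
        # An interrupt request arrives; it is only recorded here.
        self.pending = True

    def check_for_interrupts(self):
        # Service a pending interrupt only when nothing holds it off.
        if self.pending and self.holdoff_count == 0:
            self.pending = False
            raise Interrupted()


def run_statement(backend, hold):
    events = []
    if hold:
        backend.hold_interrupts()
    try:
        events.append("remote INSERT failed")
        backend.signal()                # request arrives mid-statement
        backend.check_for_interrupts()  # raises unless interrupts are held
        events.append("placement marked invalid")
    except Interrupted:
        events.append("interrupted")
        return events
    finally:
        if hold:
            backend.resume_interrupts()
    try:
        backend.check_for_interrupts()  # deferred interrupt is serviced now
    except Interrupted:
        events.append("interrupted")
    return events
```

With `hold=False` the model is interrupted before the placement is marked; with `hold=True` the state update completes first and the interrupt is serviced afterwards. One drawback worth keeping in mind: a holdoff can only defer interrupts that the process itself handles. SIGKILL cannot be caught or deferred by any process, so the `kill -9` in step 5 would not be covered by this approach.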
