ConnectorSource needs to avoid leaking stream IDs #3014

Open
slfritchie opened this issue Sep 18, 2019 · 2 comments

Comments

@slfritchie

These comments from connector_framed_source_notify.pony show that the stream registry service can leak file stream ID registrations. A stream ID is leaked when Wallaroo mistakenly believes that the ID is in active use by a Wallaroo worker when in fact no worker is using it. Once an ID is leaked, it becomes impossible to resume sending messages to Wallaroo with that ID.

  // This is a reply from a query that we'd sent in a prior TCP
  // connection, or else the TCP connection is closed now. If the
  // connection has been closed, any state about this query would
  // have already been purged from any local state ... which makes
  // it difficult to recover from the situation we're in here.
  // After all, that stream ID may already be registered & in active
  // use on some other worker right now.
  //
  // TODO: The one hammer that we have in our toolbox is a complete
  // rollback to the prior state: we can force the next checkpoint
  // to rollback. That would cause the entire cluster to rollback,
  // and each worker would tell all active ConnectorSource sessions
  // to RESTART and close. Then the entire stream registry starts
  // from a clean slate.  However, Wallaroo sources cannot abort
  // a checkpoint, so we cannot use this method.  Either we need
  // to allow sources to abort a checkpoint, or else we need
  // another way to address the problem of leaked stream ids.
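
To make the failure mode concrete, here is a toy model in Python rather than Pony. It is only a sketch: `StreamRegistry`, `SourceNotify`, and all of their methods are hypothetical names invented for illustration, not Wallaroo's actual API. It assumes the worker purges its pending-query state when the TCP connection closes, as the comment describes, so a late `success=true` reply has nothing to attach to.

```python
class StreamRegistry:
    """Toy stand-in for the global stream registry (hypothetical API)."""

    def __init__(self):
        self.active = {}  # stream_id -> worker believed to own it

    def request(self, stream_id, worker):
        """Register stream_id for worker; return True on success."""
        if stream_id in self.active:
            return False
        self.active[stream_id] = worker
        return True


class SourceNotify:
    """Toy stand-in for one worker's connector source state."""

    def __init__(self, name):
        self.name = name
        self.session = 0       # bumped on every new TCP connection
        self.pending = {}      # stream_id -> session that issued the query
        self.streams = set()   # stream IDs this worker actually serves

    def connect(self):
        self.session += 1

    def close(self):
        # Closing the connection purges all per-connection state.
        self.pending.clear()
        self.streams.clear()

    def send_query(self, stream_id):
        self.pending[stream_id] = self.session

    def on_reply(self, stream_id, success):
        """Return True if the reply was used, False if it was stale."""
        if self.pending.pop(stream_id, None) != self.session:
            return False  # no matching query in the current connection
        if success:
            self.streams.add(stream_id)
        return True


registry = StreamRegistry()
worker = SourceNotify("worker1")

worker.connect()
worker.send_query(7)
success = registry.request(7, worker.name)  # registry grants stream ID 7
worker.close()                              # connection closes before the reply lands
worker.connect()                            # the client reconnects
used = worker.on_reply(7, success)          # late reply is dropped as stale

# The registry still lists ID 7 as active, but no worker is serving it.
leaked = 7 in registry.active and 7 not in worker.streams
retry_ok = registry.request(7, worker.name)  # a new request for ID 7 is refused
```

In this model nothing short of clearing `registry.active` frees ID 7 again, which is the situation the TODO describes.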

Here's a variation of a stream ID leak that can be addressed by rollback, namely, a rollback triggered by a worker crash:

  // If the global stream registry sends a success=true reply but
  // this worker were to crash immediately afterward and drop that
  // reply, then we might have a "leak" of the stream id,
  // permanently stuck in active state.  Also, we don't have
  // Erlang's process link and monitor mechanisms to help repair
  // such "leaked" stream id registrations. Fortunately, because
  // this worker crashed, when this worker restarts, it will cause a
  // global rollback and thus, as noted above, restart the stream
  // registry from a clean slate.
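
The crash variation can be sketched the same way. Again this is a toy Python model with hypothetical names (`StreamRegistry`, `request`, `rollback`), and it assumes, per the comments above, that a global rollback leaves the registry with no active registrations.

```python
class StreamRegistry:
    """Toy stand-in for the global stream registry (hypothetical API)."""

    def __init__(self):
        self.active = {}  # stream_id -> worker believed to own it

    def request(self, stream_id, worker):
        """Register stream_id for worker; return True on success."""
        if stream_id in self.active:
            return False
        self.active[stream_id] = worker
        return True

    def rollback(self):
        # Assumption: after a global rollback every session is told to
        # RESTART and close, so no registration survives.
        self.active.clear()


registry = StreamRegistry()

granted = registry.request(7, "worker1")      # registry replies success=true
# worker1 crashes here, so the reply is dropped and never processed.
blocked = not registry.request(7, "worker2")  # ID 7 looks active, so it is refused

registry.rollback()                           # worker1's restart forces a global rollback
recovered = registry.request(7, "worker1")    # ID 7 can be registered again
```

The difference from the stale-reply case is only the trigger: here the crash itself guarantees the rollback that clears the leaked registration.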
@slfritchie

See #3012 for additional information.

@slfritchie

Update: I haven't seen this bug happen once in the last several months. That doesn't mean it can't still happen, but if it can, it is rare.
