ConnectorSource needs to avoid leaking stream IDs #3014

slfritchie · 2019-09-18T16:31:12Z

The comments below, taken from connector_framed_source_notify.pony, describe how the stream registry service can leak file stream ID registrations. If a stream ID is leaked (i.e., Wallaroo mistakenly believes that the stream ID is in active use by a Wallaroo worker when in fact the stream ID is not in use at any worker), then it becomes impossible to resume sending messages to Wallaroo with that ID.

  // This is a reply from a query that we'd sent in a prior TCP
  // connection, or else the TCP connection is closed now. If the
  // connection has been closed, any state about this query would
  // have already been purged from any local state ... which makes
  // it difficult to recover from the situation we're in here.
  // After all, that stream ID may already be registered & in active
  // use on some other worker right now.
  //
  // TODO: The one hammer that we have in our toolbox is a complete
  // rollback to the prior state: we can force the next checkpoint
  // to roll back. That would cause the entire cluster to roll back,
  // and each worker would tell all active ConnectorSource sessions
  // to RESTART and close. Then the entire stream registry starts
  // from a clean slate.  However, Wallaroo sources cannot abort
  // a checkpoint, so we cannot use this method.  Either we need
  // to allow sources to abort a checkpoint, or else we need
  // another way to address the problem of leaked stream ids.
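
To make the failure mode concrete, here is a minimal Python sketch of one way such a leak could arise. It is a toy model, not Wallaroo's code: the `StreamRegistry` and `Worker` classes, the session counter, and every method name are hypothetical. The sketch only assumes what the comment describes, namely that per-query state is purged when the TCP connection closes, so a late reply can no longer be matched to anything.

```python
class StreamRegistry:
    """Toy global registry (hypothetical): stream ID -> worker holding it."""

    def __init__(self):
        self.active = {}

    def request(self, stream_id, worker):
        # Grant the ID only if no worker is recorded as using it.
        if stream_id in self.active:
            return False
        self.active[stream_id] = worker
        return True


class Worker:
    """Toy worker (hypothetical) that tracks queries per TCP connection."""

    def __init__(self, name, registry):
        self.name = name
        self.registry = registry
        self.session = 0    # incremented for each new TCP connection
        self.pending = {}   # stream_id -> session that sent the query
        self.owned = set()  # stream IDs this worker is really using

    def new_connection(self):
        # Closing the old connection purges all state about its queries.
        self.session += 1
        self.pending.clear()

    def send_query(self, stream_id):
        self.pending[stream_id] = self.session
        success = self.registry.request(stream_id, self.name)
        # The returned tuple stands in for a reply still "in flight".
        return (stream_id, self.session, success)

    def handle_reply(self, reply):
        stream_id, session, success = reply
        if self.pending.get(stream_id) != session:
            # Reply belongs to a prior connection: nothing to match it
            # against, so it is dropped and the registration stays behind.
            return "stale"
        del self.pending[stream_id]
        if success:
            self.owned.add(stream_id)
        return "ok"


registry = StreamRegistry()
worker = Worker("worker1", registry)
worker.new_connection()
reply = worker.send_query(7)   # registry marks stream 7 as active
worker.new_connection()        # TCP close + reconnect before the reply lands
outcome = worker.handle_reply(reply)

# The registry believes stream 7 is in use, but no worker is using it.
leaked = 7 in registry.active and 7 not in worker.owned
# Any later attempt to resume stream 7 is refused.
can_resume = registry.request(7, "worker2")
```

In this model the late reply is classified as stale, the registry keeps stream 7 marked active, and the follow-up request from another worker is refused, which mirrors the "impossible to resume" symptom described above.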

Here's a variation of a stream ID leak that can be addressed by rollback, namely, a rollback triggered by a worker crash:

  // If the global stream registry sends a success=true reply but
  // this worker were to crash immediately afterward and drop that
  // reply, then we might have a "leak" of the stream id,
  // permanently stuck in active state.  Also, we don't have
  // Erlang's process link and monitor mechanisms to help repair
  // such "leaked" stream id registrations. Fortunately, because
  // this worker crashed, when this worker restarts, it will cause a
  // global rollback and thus, as noted above, restart the stream
  // registry from a clean slate.
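
The crash variation can be sketched the same way. Again this is only a toy Python model under stated assumptions: `StreamRegistry`, its `request` method, and especially the `rollback` method (standing in for "restart the stream registry from a clean slate") are hypothetical names, not Wallaroo APIs.

```python
class StreamRegistry:
    """Toy global registry (hypothetical): stream ID -> worker holding it."""

    def __init__(self):
        self.active = {}

    def request(self, stream_id, worker):
        # Grant the ID only if no worker is recorded as using it.
        if stream_id in self.active:
            return False
        self.active[stream_id] = worker
        return True

    def rollback(self):
        # Assumption: a global rollback discards every registration,
        # so all sources must register their stream IDs again.
        self.active.clear()


registry = StreamRegistry()

# worker1 asks for stream 7 and the registry answers success=true ...
granted = registry.request(7, "worker1")
# ... but worker1 crashes before processing the reply, so no worker
# holds any local state for stream 7. The ID is stuck as "active".
before_rollback = registry.request(7, "worker2")

# worker1 restarts, which triggers a global rollback in this model.
registry.rollback()
after_rollback = registry.request(7, "worker1")
```

Here the request made while the registration is stuck is refused, and the same stream ID can be registered again once the rollback has cleared the registry.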

slfritchie · 2019-12-07T04:21:49Z

See #3012 for additional information.

slfritchie · 2020-04-20T20:51:34Z

Update: I haven't seen this bug happen once in the last several months. That doesn't mean it can't still happen, but if it can, it is rare.

slfritchie added bug 1: - needs investigation resilience connectors labels Sep 18, 2019

slfritchie mentioned this issue Dec 7, 2019

ConnectorSource + stream registry can "leak" a stream ID after TCP close #3012

Closed
