Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

restarted plugin is discoverable but fails to connect and get put back into the active state #385

Open
edaniszewski opened this issue Apr 7, 2020 · 6 comments
Assignees

Comments

@edaniszewski
Copy link
Contributor

This was discovered when updating a plugin ConfigMap, where the Deployment was configured to restart on config change in order to pick up the config.

It looks like Synse failed to connect to the plugin, which makes sense if the container went down temporarily while restarting

timestamp='2020-04-07T16:07:25.459012Z' logger='synse_server.plugin' level='info' event='error on plugin context exit' exc_type=<class 'grpc._channel._MultiThreadedRendezvous'> exc_val=<_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1586275645.458472337","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3941,"referenced_errors":[{"created":"@1586275638.040326950","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":393,"grpc_status":14}]}"
> exc_tb=<traceback object at 0x7f523816b288> id='59aa69eb-a5da-569f-aaa4-3ebbdb72530b'
timestamp='2020-04-07T16:07:25.459193Z' logger='synse_server.plugin' level='info' event='marking plugin as inactive' id='59aa69eb-a5da-569f-aaa4-3ebbdb72530b' tag='vaporio/modbus-ip'

the issue is that it never seems to be able to reconnect. This is running 3.0.0-alpha.19, which should have the changes to create a reconnect background task when a plugin is marked inactive (need to verify this, but I am pretty confident it is true), but I did not see any logs indicating that reconnect happened. Additionally, it should have eventually reconnected/been marked active during the periodic plugin refresh. It looks like it was marked active, but immediately afterwards fails to connect again.

timestamp='2020-04-07T17:08:53.457246Z' logger='synse_server.plugin' level='info' event='registering new plugin' addr='xx.xx.xx.xx:5004' protocol='tcp'
timestamp='2020-04-07T17:08:53.457355Z' logger='synse_server.plugin' level='debug' event='application metrics enabled: registering gRPC interceptor'
timestamp='2020-04-07T17:08:53.459808Z' logger='synse_server.plugin' level='debug' event='loaded plugin info' id='59aa69eb-a5da-569f-aaa4-3ebbdb72530b' version={'plugin_version': '3.0.0-alpha.7', 'sdk_version': '3.0.0-alpha.3', 'build_date': '2020-02-24T15:35:49Z', 'git_commit': 'ce0da27', 'git_tag': '3.0.0-alpha.7', 'arch': 'amd64', 'os': 'linux'} addr='xx.xx.xx.xx:5004' tag='vaporio/modbus-ip'
timestamp='2020-04-07T17:08:53.459918Z' logger='synse_server.plugin' level='debug' event='plugin with id already registered - skipping' id='59aa69eb-a5da-569f-aaa4-3ebbdb72530b'
timestamp='2020-04-07T17:08:53.460010Z' logger='synse_server.plugin' level='info' event='marking plugin as active' id='59aa69eb-a5da-569f-aaa4-3ebbdb72530b' tag='vaporio/modbus-ip'
timestamp='2020-04-07T17:08:53.460109Z' logger='synse_server.plugin' level='info' event='registering new plugin' addr='10.42.4.64:5003' protocol='tcp'
timestamp='2020-04-07T17:08:53.460202Z' logger='synse_server.plugin' level='debug' event='application metrics enabled: registering gRPC interceptor'
timestamp='2020-04-07T17:08:53.462390Z' logger='synse_server.plugin' level='debug' event='loaded plugin info' id='d0c746de-832e-52b7-bc38-51115117a84d' version={'plugin_version': '3.0.0-alpha.7', 'sdk_version': '3.0.0-alpha.3', 'build_date': '2020-02-05T16:51:57Z', 'git_commit': '8620ab6', 'git_tag': '3.0.0-alpha.7', 'arch': 'amd64', 'os': 'linux'} addr='10.42.4.64:5003' tag='vaporio/snmp'
timestamp='2020-04-07T17:08:53.462501Z' logger='synse_server.plugin' level='debug' event='plugin with id already registered - skipping' id='d0c746de-832e-52b7-bc38-51115117a84d'
timestamp='2020-04-07T17:08:53.462608Z' logger='synse_server.plugin' level='debug' event='plugin manager refresh complete' plugin_count=9 elapsed_time=0.07009196281433105
timestamp='2020-04-07T17:08:55.083351Z' logger='synse_server.api.http' level='debug' event='processing request' request_id='KxyYkCXy5paZgAVQs4okcB' method='GET' ip='10.193.6.3' path='/test'
timestamp='2020-04-07T17:08:55.083833Z' logger='synse_server.cmd.test' level='info' event='issuing command' request_id='KxyYkCXy5paZgAVQs4okcB' command='TEST'
timestamp='2020-04-07T17:08:55.084812Z' logger='sanic.access' level='info' event='' extra={'status': 200, 'byte': 58, 'host': '10.193.6.3:49748', 'request': 'GET http://10.42.0.27:5000/test'}
timestamp='2020-04-07T17:08:55.614599Z' logger='synse_server.api.http' level='debug' event='processing request' request_id='NHF3oYJGb2NxutuiSMC4xa' method='GET' ip='10.42.2.24' path='/test'
timestamp='2020-04-07T17:08:55.615034Z' logger='synse_server.cmd.test' level='info' event='issuing command' request_id='NHF3oYJGb2NxutuiSMC4xa' command='TEST'
timestamp='2020-04-07T17:08:55.615962Z' logger='sanic.access' level='info' event='' extra={'status': 200, 'byte': 58, 'host': '10.42.2.24:54538', 'request': 'GET http://10.42.0.27:5000/test'}
timestamp='2020-04-07T17:08:55.617963Z' logger='synse_server.api.http' level='debug' event='processing request' request_id='xtztP9vw7gsX8eE7U9FXJn' method='GET' ip='10.42.2.24' path='/v3/readcache' params={'start': ['2020-04-07T17:08:50Z']}
timestamp='2020-04-07T17:08:55.618698Z' logger='synse_server.cmd.read' level='info' event='issuing command' command='READ CACHE' start='2020-04-07T17:08:50Z' end=''
timestamp='2020-04-07T17:08:55.619088Z' logger='synse_server.cmd.read' level='debug' event='getting cached readings for plugin' plugin='vaporio/modbus-ip' command='READ CACHE'
timestamp='2020-04-07T17:08:55.702634Z' logger='synse_server.cmd.read' level='debug' event='getting cached readings for plugin' plugin='vaporio/i2c' command='READ CACHE'
timestamp='2020-04-07T17:08:55.705890Z' logger='synse_server.cmd.read' level='debug' event='getting cached readings for plugin' plugin='vaporio/i2c' command='READ CACHE'
timestamp='2020-04-07T17:08:55.710092Z' logger='synse_server.cmd.read' level='debug' event='getting cached readings for plugin' plugin='vaporio/i2c' command='READ CACHE'
timestamp='2020-04-07T17:08:55.716640Z' logger='synse_server.cmd.read' level='debug' event='getting cached readings for plugin' plugin='vaporio/modbus-ip' command='READ CACHE'
timestamp='2020-04-07T17:08:58.718346Z' logger='synse_server.plugin' level='info' event='error on plugin context exit' exc_type=<class 'grpc._channel._MultiThreadedRendezvous'> exc_val=<_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "{"created":"@1586279338.717792605","description":"Deadline Exceeded","file":"src/core/ext/filters/deadline/deadline_filter.cc","file_line":69,"grpc_status":4}"
> exc_tb=<traceback object at 0x7f523a6bbbc8> id='59aa69eb-a5da-569f-aaa4-3ebbdb72530b'
timestamp='2020-04-07T17:08:58.718714Z' logger='synse_server.plugin' level='info' event='marking plugin as inactive' id='59aa69eb-a5da-569f-aaa4-3ebbdb72530b' tag='vaporio/modbus-ip'
timestamp='2020-04-07T17:08:58.718967Z' logger='synse_server.api.http' level='error' event='failure when streaming cached readings' exception='Traceback (most recent call last):\n  File "/usr/local/lib/python3.6/site-packages/synse_server/cmd/read.py", line 192, in read_cache\n    for reading in client.read_cache(start=start, end=end):\n  File "/usr/local/lib/python3.6/site-packages/synse_grpc/client.py", line 209, in read_cache\n    errors.wrap_and_raise(e)\n  File "/usr/local/lib/python3.6/site-packages/synse_grpc/errors.py", line 35, in wrap_and_raise\n    raise exception\n  File "/usr/local/lib/python3.6/site-packages/synse_grpc/client.py", line 206, in read_cache\n    for reading in self.client.ReadCache(request, timeout=self.timeout):\n  File "/usr/local/lib/python3.6/site-packages/synse_server/metrics.py", line 219, in wrap_stream_resp\n    for item in response:\n  File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 416, in __next__\n    return self._next()\n  File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 706, in _next\n    raise self\ngrpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus = StatusCode.DEADLINE_EXCEEDED\n\tdetails = "Deadline Exceeded"\n\tdebug_error_string = "{"created":"@1586279338.717792605","description":"Deadline Exceeded","file":"src/core/ext/filters/deadline/deadline_filter.cc","file_line":69,"grpc_status":4}"\n>\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File "/usr/local/lib/python3.6/site-packages/synse_server/api/http.py", line 402, in response_streamer\n    async for reading in cmd.read_cache(start, end):\n  File "/usr/local/lib/python3.6/site-packages/synse_server/cmd/read.py", line 197, in read_cache\n    ) from e\nsynse_server.errors.ServerError: error while issuing gRPC request: read cache'

So it seems like we may be hanging onto a stale connection or something. That avenue is at least worth investigating a bit.

@edaniszewski
Copy link
Contributor Author

As a data point, this has happened again with the eGauge (modbus) plugin. The logs are fairly similar to the above, so not re-pasting here. Basically it looks like it does get discovered correctly, but beyond that it fails to connect and gets marked as inactive. Issuing a request with a force refresh doesn't seem to help, as it just goes back into the inactive state.

Looking at plugin logs, I didn't notice any grpc requests coming through for reads, which I would expect. There may have been other lightweight grpc messages (version/metadata) which have historically been able to make it through, but I didn't see them mostly because of the volume of debug logs as I was tailing the container.

This issue keeps cropping up on deploys, not super regularly, but regularly enough. I may need to carve out some time this week to see if I can get some grpc debug messages enabled and get better visibility into what is happening here.

@lazypower
Copy link
Contributor

I know how we can reproduce this if it's indeed the same bug. We observed this behavior today in ke2-ord where the egauge plugin was evicted from CN2 and migrated to CN3 due to full ephemeral disk. WIth that plugin change, was enough to disrupt synse and it dropped all devices, had no readings, and would only register the plugin as "active" with ?refresh=true appended to the url.

eg:

https://synse.c1.ke2.ord.vio.ke/v3/plugin?refresh=true

We also observed the same behavior by deploying the synse-emulator with a static config, updated the plugin configuration, then re-deployed the emulator and synse exhibited the same issue. So I think we can try that and avoid whack-a-mole to profile and figure out why this is happening.

@edaniszewski edaniszewski self-assigned this Dec 11, 2020
@edaniszewski
Copy link
Contributor Author

Thanks for the update on this. I'm going to try and poke around at a local deployment to see if I can recreate following these observations. If so, thats a huge win in being able to debug this.

@edaniszewski
Copy link
Contributor Author

I think I may be on to what is happening here. The bit from the notes above that I'm less certain of is:

WIth that plugin change, was enough to disrupt synse and it dropped all devices, had no readings

The only way I see this as plausible with the path I'm currently working down is if the egauge plugin was the only synse plugin running in ke2-ord at the time. If there were other plugins running and producing data (e.g. devices/readings) without incurring the connectivity/device+plugin drop error, then I am not confident in the leads I am working down. That or there are multiple bugs somewhere.

@lazypower @marcoceppi @MatthewHink -- I believe you were the ones investigating the aforementioned issue at ke2-ord a few days ago. Can any of you confirm/deny whether the egauge plugin was the only plugin running at the time?

@MatthewHink
Copy link
Contributor

I think egauge was the only plugin running. i2c and rs485 are not up yet. snmp has connectivity issues to the ups.

[mhink@ion ~]$ kubectl get pods -o wide
NAME                                 READY   STATUS    RESTARTS   AGE     IP            NODE   NOMINATED NODE   READINESS GATES
edens-autofan-78977c5457-ldhgp       1/1     Running   4          52d     10.42.0.10    cn2    <none>           <none>
edens-blackbox-8645fb9756-dw7p8      2/2     Running   15         52d     10.42.2.6     cn3    <none>           <none>
edens-blackbox-8645fb9756-q5pnb      2/2     Running   9          52d     10.42.0.8     cn2    <none>           <none>
edens-blackbox-8645fb9756-xt5q6      2/2     Running   8          52d     10.42.1.9     cn1    <none>           <none>
edens-blackbox-postgresql-0          2/2     Running   0          52d     10.42.0.20    cn2    <none>           <none>
edens-edge-monitor-ccd5878d7-rz7jz   2/2     Running   0          52d     10.42.1.8     cn1    <none>           <none>
edens-egauge-fc4889ffb-4n8vt         0/1     Evicted   0          12d     <none>        cn3    <none>           <none>
edens-egauge-fc4889ffb-p2rqb         1/1     Running   0          2d23h   10.42.0.59    cn2    <none>           <none>
edens-synse-688b7f8578-5h594         1/1     Running   0          2d23h   10.42.2.216   cn3    <none>           <none>

@edaniszewski
Copy link
Contributor Author

cool - thanks! that makes me pretty confident that I have an idea what this bug is finally. I need to do a bit more testing / log analysis locally to be sure. I'll post a larger writeup shortly if what I'm looking into ends up being the bug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants