restarted plugin is discoverable but fails to connect and get put back into the active state #385
Comments
As a data point, this has happened again with the eGauge (Modbus) plugin. The logs are fairly similar to those above, so I'm not re-pasting them here. The plugin does appear to be discovered correctly, but it then fails to connect and gets marked inactive. Issuing a request with a forced refresh doesn't help; the plugin just goes back into the inactive state. Looking at the plugin logs, I didn't notice any gRPC read requests coming through, which I would expect to see. There may have been other lightweight gRPC messages (version/metadata), which have historically been able to make it through, but I didn't see them, mostly because of the volume of debug logs as I was tailing the container. This issue keeps cropping up on deploys: not on every deploy, but regularly enough. I may need to carve out some time this week to enable gRPC debug messages and get better visibility into what is happening here.
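A minimal sketch of how gRPC debug output could be switched on for a process, assuming that process uses the C-core based gRPC library (as the Python `grpcio` package does). `GRPC_VERBOSITY` and `GRPC_TRACE` are gRPC core environment variables and must be set before the library initializes. The helper name `grpc_debug_env` and the choice of the `connectivity_state` tracer are illustrative assumptions; available tracer names vary by gRPC version, and a Go-based plugin would use different logging variables.

```python
import os


def grpc_debug_env(base=None, tracers=("connectivity_state",)):
    """Return a copy of an environment with gRPC core debug logging enabled.

    `base` defaults to the current process environment. The returned dict is
    meant to be passed to a child process (or exported in a container spec);
    it does not modify os.environ.
    """
    env = dict(os.environ if base is None else base)
    env["GRPC_VERBOSITY"] = "DEBUG"          # log level for gRPC core
    env["GRPC_TRACE"] = ",".join(tracers)    # comma-separated tracer names
    return env
```

The resulting mapping could be passed as `env=` to `subprocess.run(...)` for a local run, or translated into container environment variables for a deployment.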
I know how we can reproduce this, if it's indeed the same bug. We observed this behavior today in ke2-ord, where the egauge plugin was evicted from CN2 and migrated to CN3 due to a full ephemeral disk. That plugin change was enough to disrupt Synse: it dropped all devices, had no readings, and would only register the plugin as "active" with a forced refresh (e.g. https://synse.c1.ke2.ord.vio.ke/v3/plugin?refresh=true ). We also observed the same behavior by deploying the synse-emulator with a static config, updating the plugin configuration, and then re-deploying the emulator; Synse exhibited the same issue. So I think we can try that and avoid playing whack-a-mole while we profile and figure out why this is happening.
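When reproducing this, the symptom described here (plugin reported as active, but no devices behind it) could be checked with something like the sketch below. It assumes the `/v3/plugin` response is a JSON list of objects with `id` and `active` fields, and that a `/v3/scan` endpoint returns a JSON list of devices with a `plugin` field; those field names and the helper names are assumptions and should be checked against the actual API responses.

```python
import json
import urllib.request


def fetch_json(base_url, path, timeout=10.0):
    """GET a JSON document from the Synse HTTP API (path like '/v3/plugin?refresh=true')."""
    url = base_url.rstrip("/") + path
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def active_without_devices(plugins, devices):
    """Return IDs of plugins marked active that have no devices registered.

    `plugins` and `devices` are the decoded JSON lists; the 'id', 'active',
    and 'plugin' keys are assumed field names.
    """
    with_devices = {d.get("plugin") for d in devices}
    return [
        p.get("id")
        for p in plugins
        if p.get("active") and p.get("id") not in with_devices
    ]
```

For example, `active_without_devices(fetch_json(base, "/v3/plugin?refresh=true"), fetch_json(base, "/v3/scan"))` would return a non-empty list while the bug is in effect, where `base` is the Synse server URL.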
Thanks for the update on this. I'm going to try and poke around at a local deployment to see if I can recreate this following these observations. If so, that's a huge win in being able to debug this.
I think I may be on to what is happening here. The bit from the notes above that I'm less certain of is:
The only way I see this as plausible with the path I'm currently working down is if the egauge plugin was the only Synse plugin running in ke2-ord at the time. If there were other plugins running and producing data (e.g. devices/readings) without incurring the connectivity/device+plugin drop error, then I am not confident in the leads I am working down. That, or there are multiple bugs somewhere. @lazypower @marcoceppi @MatthewHink -- I believe you were the ones investigating the aforementioned issue at ke2-ord a few days ago. Can any of you confirm/deny whether the egauge plugin was the only plugin running at the time?
I think egauge was the only plugin running. i2c and rs485 are not up yet. snmp has connectivity issues to the ups.
cool - thanks! that makes me pretty confident that I finally have an idea what this bug is. I need to do a bit more testing / log analysis locally to be sure. I'll post a larger writeup shortly if what I'm looking into ends up being the bug.
This was discovered when updating a plugin ConfigMap, where the Deployment was configured to restart on config change in order to pick up the config.
It looks like Synse failed to connect to the plugin, which makes sense if the container went down temporarily while restarting. The issue is that it never seems to be able to reconnect. This is running 3.0.0-alpha.19, which should have the changes to create a reconnect background task when a plugin is marked inactive (need to verify this, but I am pretty confident it is true), but I did not see any logs indicating that a reconnect happened. Additionally, it should have eventually reconnected and been marked active during the periodic plugin refresh. It looks like it was marked active, but immediately afterwards it failed to connect again. So it seems like we may be hanging onto a stale connection or something similar. That avenue is at least worth investigating a bit.
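To illustrate the stale-connection hypothesis, here is a minimal sketch of a reconnect task that discards the old channel and opens a fresh one on every attempt, rather than reusing the channel that was in place when the plugin went down. This is not Synse's actual implementation: the class `ReconnectingPlugin` and the `open_channel`, `probe`, and `close_channel` callables are hypothetical stand-ins for whatever creates, health-checks, and closes the gRPC channel.

```python
import asyncio


class ReconnectingPlugin:
    """Hypothetical plugin handle that never reuses a channel across reconnects."""

    def __init__(self, open_channel, probe, close_channel=lambda ch: None):
        self._open = open_channel    # () -> new channel object
        self._probe = probe          # async (channel) -> None, raises on failure
        self._close = close_channel  # (channel) -> None
        self.channel = None
        self.active = False

    def mark_inactive(self):
        """Mark the plugin inactive and drop any existing (possibly stale) channel."""
        self.active = False
        if self.channel is not None:
            self._close(self.channel)
            self.channel = None

    async def reconnect(self, attempts=5, base_delay=0.1, max_delay=5.0):
        """Try to reconnect with exponential backoff; return True on success."""
        self.mark_inactive()
        delay = base_delay
        for attempt in range(attempts):
            channel = self._open()  # always a fresh channel
            try:
                await self._probe(channel)
            except Exception:
                self._close(channel)
                if attempt < attempts - 1:
                    await asyncio.sleep(delay)
                    delay = min(delay * 2, max_delay)
                continue
            self.channel = channel
            self.active = True
            return True
        return False
```

If the real code instead keeps the original channel object and only flips the active flag during the periodic refresh, that would be consistent with the plugin being marked active and then immediately failing again.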