Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Log more info about hosts to delete/disable in Zabbix #90

Open
pederhan opened this issue Nov 5, 2024 · 0 comments
Open

Log more info about hosts to delete/disable in Zabbix #90

pederhan opened this issue Nov 5, 2024 · 0 comments
Assignees
Labels
enhancement New feature or request

Comments

@pederhan
Copy link
Member

pederhan commented Nov 5, 2024

It would be beneficial to log more information about the hosts to delete/disable in Zabbix, such as their current host groups, templates, etc.

Improved logs

Initially, the focus should be on logging more informations about hosts to disable. The challenge here is the refactoring required to do it consistently. Whether or not we hit the failsafe determines if we enter ZabbixHostUpdater.disable_host() or not, thus we cannot rely on adding/expanding the logging statements there.

# Check if we have too many hosts to add/remove
check_failsafe(self.settings, hostnames_to_add, hostnames_to_remove)
for hostname in hostnames_to_remove:
if self.stop_event.is_set():
logging.debug("Told to stop. Breaking")
break
zabbix_host = zabbix_hosts[hostname]
self.disable_host(zabbix_host)

One solution would be to pass more information to failsafe.check_failsafe_hosts in the form of the zabbix_hosts mapping:

def check_failsafe(config: Settings, to_add: List[str], to_remove: List[str]) -> None:
"""Check if number of hosts to add/remove exceeds the failsafe and handle appropriately."""
failsafe = config.zabbix.failsafe
if len(to_remove) <= failsafe and len(to_add) <= failsafe:
return
# Failsafe exceeded - check for failsafe OK file
if check_failsafe_ok_file(config.zac):
return
# Failsafe OK file validation failed
# We must write the hosts to add/remove and raise an exception
write_failsafe_hosts(config.zac, to_add, to_remove)
logging.warning(
"Too many hosts to change (failsafe=%d). Remove: %d, Add: %d. Aborting",
failsafe,
len(to_remove),
len(to_add),
)
raise ZACException("Failsafe triggered")

However, that is not without its own set of problems. Having that function be responsible for logging information about hosts to delete even if we don't trigger the failsafe doesn't make sense.

In all likelyhood, the best solution might be simply looping through each host to delete in ZabbixHostUpdater.do_update before we check failsafe and before we attempt to delete the hosts.

Thoughts

I need to gather my thoughts on this for a bit. What do we want to achieve with the logging? Is it only to perform a better "post-mortem" once a host has been deleted, or is it to add better introspection before a host has been deleted (i.e. the logging needs to happen before failsafe checks)?

Adding better logging as we delete the host is easy. Adding better logging that also triggers if we hit the failsafe is harder.

More detailed JSON dumps

Another improvement would be dumping more information about the hosts to remove in the failsafe_hosts.json file. That would let us filter hosts by their Source-* host group to determine which source(s) they come from, and thus more easily troubleshoot a faulty source.

TUI

It could be very useful to add a command that displays a TUI with Textual to browse the detailed JSON dump. We could then dump a lot more information, and then use the TUI to visualize it in a human-readable way. The TUI should be able to filter host by name, templates and host groups at the bare minimum.

@pederhan pederhan added the enhancement New feature or request label Nov 5, 2024
@pederhan pederhan self-assigned this Nov 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant