[CheckMK] Make staging monitoring systems match production #5725

Open
5 of 11 tasks
acozine opened this issue Jan 10, 2025 · 1 comment
Assignees: kayiwa
Labels: maintenance, Operations (pulls issues into the Operations ZenHub board)

Comments

acozine commented Jan 10, 2025

What maintenance needs to be done?

Upgrade/reinstall and rename the CheckMK staging server and site so our two monitoring environments are as similar as possible.

Level of urgency

  • High
  • Moderate
  • Low

Why is this maintenance needed?

To make it easy to test future changes to our monitoring platform, we want the staging and production systems to be as similar as possible. In particular, we want:

  • clear naming for CheckMK VMs - prod and staging names should be related
  • clear DNS names for CheckMK - prod and staging site names should be related, and the sites should be easy to remember/find
  • software version and type should match in staging and production; this allows us to
    • test performance in staging before moving to prod
    • test integrating our OOBM sites (which are already running the paid version)

Acceptance criteria

  • we have two VMs for CheckMK - one prod, one staging - with identical resources (memory, CPU, storage), clear names that reflect their envs (e.g. if prod is 'checkmk-prod1' call the staging VM 'checkmk-staging1'), and the same version and type of CheckMK installed
  • we have two sites for CheckMK - one prod, one staging - with names that match ('checkmk.princeton.edu' and 'checkmk-staging.princeton.edu')
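
The criteria above could be checked mechanically once both VMs exist. The following is only a minimal sketch: the `MonitorVM` record, its field names, and the sample values are hypothetical and do not come from our inventory or from any CheckMK API.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class MonitorVM:
    """Hypothetical record of the attributes the acceptance criteria compare."""
    name: str
    memory_gb: int
    cpus: int
    storage_gb: int
    checkmk_version: str  # placeholder string, not a real release number
    checkmk_edition: str  # e.g. "free" or "paid"


def expected_staging_name(prod_name: str) -> str:
    """Derive the staging VM name from the prod name (checkmk-prod1 -> checkmk-staging1)."""
    return prod_name.replace("-prod", "-staging", 1)


def parity_problems(prod: MonitorVM, staging: MonitorVM) -> list[str]:
    """Return a human-readable list of differences; an empty list means parity."""
    problems = []
    wanted = expected_staging_name(prod.name)
    if staging.name != wanted:
        problems.append(f"name: expected {wanted!r}, got {staging.name!r}")
    for field in fields(MonitorVM):
        if field.name == "name":
            continue  # names are related, not identical, so they are checked above
        p, s = getattr(prod, field.name), getattr(staging, field.name)
        if p != s:
            problems.append(f"{field.name}: prod={p!r}, staging={s!r}")
    return problems
```

An empty result from `parity_problems(prod, staging)` would mean the two VMs satisfy the first criterion; the site names in the second criterion would still need a separate DNS check.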

Implementation notes, if any

  • rebuild the VM for staging CheckMK, name it so it matches production (e.g. if prod is 'checkmk-prod1' call the staging VM 'checkmk-staging1')
  • update princeton_ansible inventory as needed
  • set up a new staging site whose name matches prod (prod is 'checkmk.princeton.edu', staging will be 'checkmk-staging.princeton.edu')
  • install the paid version of CheckMK on the new staging VM, using our Ansible playbook/role
  • back up the old, free-version staging CheckMK instance and restore it to the new, paid-version staging VM/site
  • decommission the old staging site and VM
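
The backup-and-restore step above might look roughly like the plan below. This is a sketch under assumptions, not our playbook: the site names and archive path are hypothetical, the archive still has to be copied between the two VMs, and moving from the free to the paid version may need an extra `omd update` step that is not shown here. The code only builds the `omd` command lines; it does not run them.

```python
import shlex


def staging_migration_plan(old_site, new_site, archive):
    """Return (where to run it, command) pairs for moving a site via omd backup/restore."""
    return [
        # Run on the old staging VM: write the whole site to an archive file.
        ("old staging VM", ["omd", "backup", old_site, archive]),
        # Run on the new staging VM, after copying the archive over.
        ("new staging VM", ["omd", "restore", new_site, archive]),
        # Start the restored site so it can be checked before decommissioning the old one.
        ("new staging VM", ["omd", "start", new_site]),
    ]


# Hypothetical site names and path, for illustration only.
for host, cmd in staging_migration_plan("staging_old", "staging", "/tmp/staging_old.tar.gz"):
    print(f"[{host}] {shlex.join(cmd)}")
```

Keeping the plan as data makes it easy to review the exact commands in a pull request before anyone runs them by hand or wires them into an Ansible task.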
@acozine added the maintenance and Operations (pulls issues into the Operations ZenHub board) labels Jan 10, 2025
@kayiwa self-assigned this Jan 10, 2025

acozine commented Feb 4, 2025

Putting everything on one server led to performance issues, so we have changed tack and created a distributed architecture. We will have one production VM for monitoring production infrastructure and a second production VM for monitoring staging infrastructure. Since both are production VMs, we can call them pulmonitor-prod1 and pulmonitor-prod2: the first will run pulmonitor.princeton.edu/production and the second will run pulmonitor.princeton.edu/staging.

We will also have a production VM for each data center to monitor anything we cannot connect to from the outside.
