Add post "We built server monitoring in a day" #4171

Open
wants to merge 5 commits into
base: main
70 changes: 70 additions & 0 deletions content/posts/2025-02-01-we-built-server-monitoring-in-a-day.md
@@ -0,0 +1,70 @@
---
title: We built server monitoring in a day
authors:
- matt-cloyd
- john-skinner
- magdaline-derosena
- ethan-marcotte
excerpt: |
We needed to know when our server went down, so we built our own monitor — in a day.

---

Our partners at one government agency had been experiencing periodic internet outages at one of their rural data centers. They would sometimes get emails from users who encountered problems during the outages, but our partners didn't have a reliable way to know when their servers were down — until 18F stepped in.

We were already working with them on a solution to the root problem: deploying to the cloud would remove the need for rural data centers with spotty connectivity. But our partners' immediate need was to know when their servers were down, so they could respond quickly.

What should we do? We considered procuring commercial server monitoring services. But that would have meant doing a marketplace analysis, drafting and publishing a solicitation, going through a contracting process, and setting up full-fledged server monitoring. Why go through a whole procurement and pay for features beyond our immediate needs, when we only needed a simple monitor to see if the server was up or down?

So we thought some more, and someone asked: _"Why not make it a spreadsheet?"_ What if we ran a script that checked whether the server was up or down and logged the status in a spreadsheet? From there, we'd figure out how to read the spreadsheet automatically and alert the team when the server went down.

We iterated on this idea a few more times and landed on the following design:

1. Every fifteen minutes, run a script to see if the server is up or down. We decided to use GitHub Actions to run the script at no cost.

2. Log the server status, the HTTP response code (for a little extra detail), and the current time in a plain text log file, also hosted for free on GitHub.

The log file looks like this:

```
timestamp | http | up_or_down
2024-12-04T00:00:00-05:00 | 200 | up
2024-12-04T00:15:00-05:00 | 200 | up
2024-12-04T00:30:00-05:00 | 503 | dn
```

3. In the same script on GitHub Actions, check if the server has just gone down (or just come back up). We do this by comparing the last two lines of the log file.

4. If the status has changed, we send a notification through GitHub to the team to let them know. Our team members receive emails for GitHub notifications, so they're notified in near real time.

5. We host a webpage using GitHub Pages, also for free, to show the current server status and recent status history. The [monitor page](https://doi-os-orda.github.io/uptime/) reads the last 24 hours of history and displays it.

![Screenshot of the server status monitor page](https://github.com/user-attachments/assets/cafb0c48-ab73-420a-ba6a-08fb1e4bd76e)
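
Steps 2 and 3 above can be sketched in a few lines. This is a minimal illustration in Python, not the actual script (which the post describes only as a Unix-based script running in GitHub Actions). The function names are hypothetical, and the rule that any 2xx or 3xx response counts as "up" is an assumption; the log layout follows the sample shown above.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional

HEADER = "timestamp | http | up_or_down"


def format_entry(when: datetime, http_code: int) -> str:
    """Build one log line: ISO 8601 timestamp, HTTP code, then 'up' or 'dn'."""
    # Assumption: any 2xx or 3xx response means the server is up.
    status = "up" if 200 <= http_code < 400 else "dn"
    return f"{when.isoformat(timespec='seconds')} | {http_code} | {status}"


def status_of(line: str) -> str:
    """Return the last field of a log line ('up' or 'dn')."""
    return line.split("|")[-1].strip()


def status_change(log_lines: List[str]) -> Optional[str]:
    """Compare the last two entries; return e.g. 'up -> dn' if they differ."""
    entries = [ln for ln in log_lines if ln.strip() and ln.strip() != HEADER]
    if len(entries) < 2:
        return None  # not enough history to detect a change
    previous, current = status_of(entries[-2]), status_of(entries[-1])
    return f"{previous} -> {current}" if previous != current else None
```

A caller would append the result of `format_entry` to the log, then send a notification only when `status_change` returns something other than `None`, so the team hears about transitions rather than every check.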


We developed a working solution in less than one day. We refined it over the next few weeks: we aligned it with Section 508 / WCAG accessibility guidelines, improved the overall look, and clarified the content based on user feedback.

We also discovered that one component we wanted to monitor was not accessible over the public internet and ran in a Windows environment. This meant we couldn't reach the component from GitHub Actions, nor could we run our Unix-based script on a Windows operating system. So, we worked with our partner to adapt the script to run on a Windows server, allowing us to monitor the complete system.


### Fast facts

**Procurement cost**: None.

**Labor cost**: Negligible. The work took significantly fewer hours than a procurement would have.

**Time**: About a day for basic working software. It's been described by one team member as "maybe the fastest delivery that's ever happened on a partner project".

Since this is open-source software, any government agency already using GitHub Pages has everything they need to set up no-cost server monitoring and alerting. All they have to do is copy the code and change a few configuration details.

We designed the software to be very simple and therefore easy to read. Theoretically, anyone in the country with a basic knowledge of HTML and JavaScript could use this to monitor their website.


### Questions engineers might have

**Won't the file you're logging to become huge?** At one check every 15 minutes, it will take about 40 years before GitHub starts warning us that the file is getting too large, and about 80 years for the file to reach GitHub's file size limit.
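
A back-of-the-envelope check of that estimate, as a short Python sketch. It assumes every entry is the same length as the sample log lines shown earlier (36 characters plus a newline), and that GitHub warns about files over 50 MiB and blocks files over 100 MiB, which is our understanding of its documented limits.

```python
# Assumption: every entry has the same length as this sample line.
SAMPLE_LINE = "2024-12-04T00:00:00-05:00 | 200 | up\n"

BYTES_PER_ENTRY = len(SAMPLE_LINE.encode("utf-8"))  # 37 bytes
ENTRIES_PER_YEAR = 4 * 24 * 365.25                  # one check every 15 minutes
BYTES_PER_YEAR = BYTES_PER_ENTRY * ENTRIES_PER_YEAR

MIB = 1024 * 1024
WARN_LIMIT = 50 * MIB    # assumed GitHub warning threshold
BLOCK_LIMIT = 100 * MIB  # assumed GitHub hard limit


def years_until(limit_bytes: int) -> float:
    """Years of logging before the file reaches limit_bytes."""
    return limit_bytes / BYTES_PER_YEAR


print(round(years_until(WARN_LIMIT), 1))   # 40.4
print(round(years_until(BLOCK_LIMIT), 1))  # 80.8
```

At roughly 1.2 MiB of new log data per year, the growth is slow enough that file size is not a practical concern.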

**Won't it take a long time to read the log file once it becomes large?** Nope! Every time we log the server status, we copy the last 24 hours of logs to a separate smaller file. This keeps the page load zippy!
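
One way to produce that smaller file, sketched in Python for illustration. The function name is hypothetical and this is not the actual script; it assumes the log layout shown earlier and a timezone-aware `now`.

```python
from datetime import datetime, timedelta
from typing import List


def recent_entries(log_lines: List[str], now: datetime, hours: int = 24) -> List[str]:
    """Keep only entries whose timestamp falls within the last `hours` hours."""
    cutoff = now - timedelta(hours=hours)
    recent = []
    for line in log_lines:
        stamp = line.split("|")[0].strip()
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # skip the header row and blank or malformed lines
        if when >= cutoff:
            recent.append(line)
    return recent
```

The page then only ever downloads at most 96 entries (four per hour for 24 hours), no matter how long the full log grows. Note that this sketch drops the header row; a real implementation might write it back at the top of the smaller file.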

**What frameworks did you use?** None! The current version is about 300 lines of hand-coded HTML, CSS, and JavaScript. We use one CSS reset stylesheet, and that's it.

3 changes: 3 additions & 0 deletions data/team_members.csv
@@ -122,6 +122,7 @@ erin-strenio,Erin,Strenio,,
estherchang,Esther,Chang,,/team/estherchang/
estherkim,Esther,Kim,,/team/estherkim/
ethan,Ethan,Heppner,,/team/ethan/
ethan-marcotte,Ethan,Marcotte,,
fureigh,Rhys,Fureigh,,/team/fureigh/
gail,Gail,Swanson,,/team/gail/
garren,Garren,Givens,,/team/garren/
@@ -171,6 +172,7 @@ jim-sheire,Jim,Sheire,,
joe,Joe,Polastre,,
joel-minton,Joel,Minton,,
john-donmoyer,John,Donmoyer,,/team/john-donmoyer/
john-skinner,John,Skinner,,
jon-geselle,Jon,Geselle,,
jon-prisby,Jon,Prisby,,
jon-roberts,Jon,Roberts,,
@@ -219,6 +221,7 @@ lenny-bogdonoff,Lenny,Bogdonoff,,/team/lenny-bogdonoff/
lindsay,Lindsay,Young,,/team/lindsay/
lisagelobter,Lisa,Gelobter,,
logan-mcdonald,Logan,McDonald,,/team/logan-mcdonald/
magdaline-derosena,Magdaline,Derosena,,
majma,Raphael,Majma,,
malaika-carpenter,Malaika,Carpenter,,
manger,Noah,Manger,,/team/manger/