From 258ddbf4fd5acb066501061450eeefa11c45ff7d Mon Sep 17 00:00:00 2001
From: Matt Cloyd
Date: Wed, 29 Jan 2025 17:12:56 -0500
Subject: [PATCH 1/5] Create 2025-02-01-we-built-server-monitoring-in-a-day.md

---
 ...-01-we-built-server-monitoring-in-a-day.md | 70 +++++++++++++++++++
 1 file changed, 70 insertions(+)
 create mode 100644 content/posts/2025-02-01-we-built-server-monitoring-in-a-day.md

diff --git a/content/posts/2025-02-01-we-built-server-monitoring-in-a-day.md b/content/posts/2025-02-01-we-built-server-monitoring-in-a-day.md
new file mode 100644
index 000000000..8a431b421
--- /dev/null
+++ b/content/posts/2025-02-01-we-built-server-monitoring-in-a-day.md
@@ -0,0 +1,70 @@
+---
+title: We built server monitoring in a day
+authors:
+  - matt-cloyd
+  - john-skinner
+  - magdaline-derosena
+  - ethan-marcotte
+excerpt: |
+  We needed to know when our server went down, so we built our own monitor — in a day.
+
+---
+
+Our partners at one government agency experience periodic internet outages at one of their rural data centers. They sometimes get emails from users about the outages, but they didn't have a good way to reliably know when their servers went down.
+
+We're working a root-cause solution where we deploy to the cloud, but for now, our partner needs to know when their servers are down.
+
+We considered procuring commercial server monitoring services. But that would mean we'd have to do a marketplace analysis, draft and publish a solicitation, go through a contracting process, and set up full-fledged server monitoring.
+
+But why go through a whole procurement for and pay for features beyond our immediate needs, when we only need a simple periodic check if the server is up or down?
+
+So, we thought some more, and someone asked — why not make it a spreadsheet? What if we ran a script that would check if the server is up or down, and log the status in a spreadsheet? From there, we'd figure out how to automatically read the spreadsheet, and alert the team when the server went down.
+
+We iterated on this solution idea a few more times, and landed on the following solution.
+
+1. Every fifteen minutes, run a script to see if the server is up or down. We decided to use GitHub Actions to run the script at no cost.
+
+2. Log the server status, the HTTP response code (for a little extra detail), and the current time in a plain text log file, also hosted for free on GitHub.
+
+The log file looks like this:
+
+```
+timestamp | http | up_or_down
+2024-12-04T00:00:00-05:00 | 200 | up
+2024-12-04T00:15:00-05:00 | 200 | up
+2024-12-04T00:30:00-05:00 | 503 | dn
+```
+
+3. In the same script on GitHub Actions, check if the server has just gone down (or just come back up). We do this by comparing the last two lines of the log file.
+
+4. If the status has changed, we send a notification through GitHub to the team to let them know. Our team members receive emails for GitHub notifications, so they're notified in near real time.
+
+5. We host a webpage using GitHub Pages, also for free, to show the current server status and recent status history. This page reads the last 24 hours of history and shows [a nice monitor page](https://doi-os-orda.github.io/uptime/).
+
+![Screenshot 2025-01-29 at 5 11 26 PM](https://github.com/user-attachments/assets/cafb0c48-ab73-420a-ba6a-08fb1e4bd76e)
+
+
+We developed a working solution in less than one day. We refined it over the next few weeks: we brought it into alignment with Section 508 / WCAG accessibility guidelines, improved the overall look, and clarified the content.
+
+We also discovered that one component we wanted to monitor was not accessible over the public internet, and ran in a Windows environment. This meant that we couldn't reach the component using our GitHub Actions script. So, we worked with our partner to adapt the script to run on a Windows server, allowing us to monitor the complete system.
+
+
+### Fast facts
+
+**Procurement cost**: None.
+
+**Labor cost**: Negligible. Required significantly fewer hours than would have been needed for a procurement.
+
+**Time**: About a day for basic working software. It's been described by one team member as "maybe the fastest delivery that's ever happened on a partner project".
+
+Because this is open-source software, and because we wrote it to be very simple, anyone in the country (or the world) with a basic knowledge of HTML and JavaScript could use this to monitor their website.
+
+
+### Questions engineers might have
+
+**Won't the file you're logging to become huge?** Checking the server status every 15 minutes, it will take about 40 years for GitHub to warn us that the file is getting too large. It will take 80 years for GitHub to stop allowing us to keep writing logs to the file.
+
+**Won't it take a long time to read the log file once it becomes large?** Nope! Every time we log the server status, we copy the last 24 hours of logs to a separate smaller file. This keeps the page load zippy!
+
+**What frameworks did you use?** We used zero frameworks! The current version is about 300 lines of hand-coded HTML, CSS, and JavaScript, with no frameworks. We use one CSS reset stylesheet, and that's it.
+

From abaed4f7e1c7eec36a04705fc41d2f370f10e71e Mon Sep 17 00:00:00 2001
From: Matt Cloyd
Date: Thu, 30 Jan 2025 11:03:54 -0500
Subject: [PATCH 2/5] Update 2025-02-01-we-built-server-monitoring-in-a-day.md

Clarify some language
---
 ...-02-01-we-built-server-monitoring-in-a-day.md | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/content/posts/2025-02-01-we-built-server-monitoring-in-a-day.md b/content/posts/2025-02-01-we-built-server-monitoring-in-a-day.md
index 8a431b421..0480a2b90 100644
--- a/content/posts/2025-02-01-we-built-server-monitoring-in-a-day.md
+++ b/content/posts/2025-02-01-we-built-server-monitoring-in-a-day.md
@@ -10,15 +10,13 @@ excerpt: |
 
 ---
 
-Our partners at one government agency experience periodic internet outages at one of their rural data centers. They sometimes get emails from users about the outages, but they didn't have a good way to reliably know when their servers went down.
+Our partners at one government agency experience periodic internet outages at one of their rural data centers. They would sometimes get emails from users who encountered problems during the outages, but our partners didn't have a good way to reliably know when their servers went down — until 18F stepped in.
 
-We're working a root-cause solution where we deploy to the cloud, but for now, our partner needs to know when their servers are down.
+We're working on a root-cause solution — deploying to the cloud, avoiding the need for rural data centers. But our partner's immediate need was to know when their servers were down, so they could take corrective actions quickly.
 
-We considered procuring commercial server monitoring services. But that would mean we'd have to do a marketplace analysis, draft and publish a solicitation, go through a contracting process, and set up full-fledged server monitoring.
+We considered procuring commercial server monitoring services. But that would mean we'd have to do a marketplace analysis, draft and publish a solicitation, go through a contracting process, and set up full-fledged server monitoring. Why go through a whole procurement for and pay for features beyond our immediate needs, when we only need a simple monitor to see if the server is up or down?
 
-But why go through a whole procurement for and pay for features beyond our immediate needs, when we only need a simple periodic check if the server is up or down?
-
-So, we thought some more, and someone asked — why not make it a spreadsheet? What if we ran a script that would check if the server is up or down, and log the status in a spreadsheet? From there, we'd figure out how to automatically read the spreadsheet, and alert the team when the server went down.
+So, we thought some more, and someone asked — _"why not make it a spreadsheet"_? What if we ran a script that would check if the server is up or down, and log the status in a spreadsheet? From there, we'd figure out how to automatically read the spreadsheet, and alert the team when the server went down.
 
 We iterated on this solution idea a few more times, and landed on the following solution.
 
@@ -57,12 +55,14 @@ We also discovered that one component we wanted to monitor was not accessible ov
 
 **Time**: About a day for basic working software. It's been described by one team member as "maybe the fastest delivery that's ever happened on a partner project".
 
-Because this is open-source software, and because we wrote it to be very simple, anyone in the country (or the world) with a basic knowledge of HTML and JavaScript could use this to monitor their website.
+Since this is open-source software, any government agency already using GitHub Pages has everything they need to set up no-cost server monitoring and alerting. All they have to do is copy the code and change a few configuration details.
+
+We designed the software to be very simple and therefore easy to read. Theoretically, anyone in the country with a basic knowledge of HTML and JavaScript could use this to monitor their website.
 
 
 ### Questions engineers might have
 
-**Won't the file you're logging to become huge?** Checking the server status every 15 minutes, it will take about 40 years for GitHub to warn us that the file is getting too large. It will take 80 years for GitHub to stop allowing us to keep writing logs to the file.
+**Won't the file you're logging to become huge?** Checking the server status every 15 minutes, it will take about 40 years before GitHub starts warning us that the file is getting too large. It will take 80 years for the file to reach GitHub's file size limits.
 
 **Won't it take a long time to read the log file once it becomes large?** Nope! Every time we log the server status, we copy the last 24 hours of logs to a separate smaller file. This keeps the page load zippy!
 

From 6db219048640c7f5f0bbd277304a8f3e052ab8e6 Mon Sep 17 00:00:00 2001
From: Matt Cloyd
Date: Thu, 30 Jan 2025 11:05:58 -0500
Subject: [PATCH 3/5] Update team_members.csv

Add missing author info
---
 data/team_members.csv | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/data/team_members.csv b/data/team_members.csv
index ac5fa75de..06e9407d9 100644
--- a/data/team_members.csv
+++ b/data/team_members.csv
@@ -122,6 +122,7 @@ erin-strenio,Erin,Strenio,,
 estherchang,Esther,Chang,,/team/estherchang/
 estherkim,Esther,Kim,,/team/estherkim/
 ethan,Ethan,Heppner,,/team/ethan/
+ethan-marcotte,Ethan,Marcotte,,
 fureigh,Rhys,Fureigh,,/team/fureigh/
 gail,Gail,Swanson,,/team/gail/
 garren,Garren,Givens,,/team/garren/
@@ -171,6 +172,7 @@ jim-sheire,Jim,Sheire,,
 joe,Joe,Polastre,,
 joel-minton,Joel,Minton,,
 john-donmoyer,John,Donmoyer,,/team/john-donmoyer/
+john-skinner,John,Skinner,,
 jon-geselle,Jon,Geselle,,
 jon-prisby,Jon,Prisby,,
 jon-roberts,Jon,Roberts,,
@@ -219,6 +221,7 @@ lenny-bogdonoff,Lenny,Bogdonoff,,/team/lenny-bogdonoff/
 lindsay,Lindsay,Young,,/team/lindsay/
 lisagelobter,Lisa,Gelobter,,
 logan-mcdonald,Logan,McDonald,,/team/logan-mcdonald/
+magdaline-derosena,Magdaline,Derosena,,
 majma,Raphael,Majma,,
 malaika-carpenter,Malaika,Carpenter,,
 manger,Noah,Manger,,/team/manger/

From c0cd7a558a0078f24c455bc6b901abc128048713 Mon Sep 17 00:00:00 2001
From: Matt Cloyd
Date: Thu, 30 Jan 2025 11:12:40 -0500
Subject: [PATCH 4/5] Update 2025-02-01-we-built-server-monitoring-in-a-day.md

Fix typos, polish some language
---
 .../posts/2025-02-01-we-built-server-monitoring-in-a-day.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/content/posts/2025-02-01-we-built-server-monitoring-in-a-day.md b/content/posts/2025-02-01-we-built-server-monitoring-in-a-day.md
index 0480a2b90..9d2d03a21 100644
--- a/content/posts/2025-02-01-we-built-server-monitoring-in-a-day.md
+++ b/content/posts/2025-02-01-we-built-server-monitoring-in-a-day.md
@@ -14,7 +14,7 @@ Our partners at one government agency experience periodic internet outages at on
 
 We're working on a root-cause solution — deploying to the cloud, avoiding the need for rural data centers. But our partner's immediate need was to know when their servers were down, so they could take corrective actions quickly.
 
-We considered procuring commercial server monitoring services. But that would mean we'd have to do a marketplace analysis, draft and publish a solicitation, go through a contracting process, and set up full-fledged server monitoring. Why go through a whole procurement for and pay for features beyond our immediate needs, when we only need a simple monitor to see if the server is up or down?
+We considered procuring commercial server monitoring services. But that would mean we'd have to do a marketplace analysis, draft and publish a solicitation, go through a contracting process, and set up full-fledged server monitoring. Why go through a whole procurement and pay for features beyond our immediate needs, when we only need a simple monitor to see if the server is up or down?
 
 So, we thought some more, and someone asked — _"why not make it a spreadsheet"_? What if we ran a script that would check if the server is up or down, and log the status in a spreadsheet? From there, we'd figure out how to automatically read the spreadsheet, and alert the team when the server went down.
 
@@ -42,9 +42,9 @@ timestamp | http | up_or_down
 ![Screenshot 2025-01-29 at 5 11 26 PM](https://github.com/user-attachments/assets/cafb0c48-ab73-420a-ba6a-08fb1e4bd76e)
 
 
-We developed a working solution in less than one day. We refined it over the next few weeks: we brought it into alignment with Section 508 / WCAG accessibility guidelines, improved the overall look, and clarified the content.
+We developed a working solution in less than one day. We refined it over the next few weeks: we brought it into alignment with Section 508 / WCAG accessibility guidelines, improved the overall look, and clarified the content based on user feedback.
 
-We also discovered that one component we wanted to monitor was not accessible over the public internet, and ran in a Windows environment. This meant that we couldn't reach the component using our GitHub Actions script. So, we worked with our partner to adapt the script to run on a Windows server, allowing us to monitor the complete system.
+We also discovered that one component we wanted to monitor was not accessible over the public internet, and ran in a Windows environment. This meant that we couldn't reach the component using our GitHub Actions script, nor could we use the Unix-based GitHub Actions script on a Windows operating system. So, we worked with our partner to adapt the script to run on a Windows server, allowing us to monitor the complete system.
 
 
 ### Fast facts

From 37c17618859ed1e63cfa3dca9814bdea3e46667f Mon Sep 17 00:00:00 2001
From: Allison Press <94010142+allisonpress@users.noreply.github.com>
Date: Thu, 30 Jan 2025 12:12:51 -0500
Subject: [PATCH 5/5] Plain language edit of we-built-server-monitoring-in-a-day.md

Made some small plain language suggestions, like changing verb tenses so they're consistent and breaking up a couple of longer sentences.
---
 .../2025-02-01-we-built-server-monitoring-in-a-day.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/content/posts/2025-02-01-we-built-server-monitoring-in-a-day.md b/content/posts/2025-02-01-we-built-server-monitoring-in-a-day.md
index 9d2d03a21..cb65aedf1 100644
--- a/content/posts/2025-02-01-we-built-server-monitoring-in-a-day.md
+++ b/content/posts/2025-02-01-we-built-server-monitoring-in-a-day.md
@@ -10,15 +10,15 @@ excerpt: |
 
 ---
 
-Our partners at one government agency experience periodic internet outages at one of their rural data centers. They would sometimes get emails from users who encountered problems during the outages, but our partners didn't have a good way to reliably know when their servers went down — until 18F stepped in.
+Our partners at one government agency had been experiencing periodic internet outages at one of their rural data centers. They would sometimes get emails from users who encountered problems during the outages, but our partners didn't have a reliable way to know when their servers were down — until 18F stepped in.
 
-We're working on a root-cause solution — deploying to the cloud, avoiding the need for rural data centers. But our partner's immediate need was to know when their servers were down, so they could take corrective actions quickly.
+We were already working with them on a solution to address the root problem. Deploying to the cloud would avoid the need for rural data centers with spotty coverage. But our partner's immediate need was to know when their servers were down, so they could fix it quickly.
 
-We considered procuring commercial server monitoring services. But that would mean we'd have to do a marketplace analysis, draft and publish a solicitation, go through a contracting process, and set up full-fledged server monitoring. Why go through a whole procurement and pay for features beyond our immediate needs, when we only need a simple monitor to see if the server is up or down?
+What should we do? We considered procuring commercial server monitoring services. But that would mean we'd have to do a marketplace analysis, draft and publish a solicitation, go through a contracting process, and set up full-fledged server monitoring. Why go through a whole procurement and pay for features beyond our immediate needs, when we only need a simple monitor to see if the server is up or down?
 
 So, we thought some more, and someone asked — _"why not make it a spreadsheet"_? What if we ran a script that would check if the server is up or down, and log the status in a spreadsheet? From there, we'd figure out how to automatically read the spreadsheet, and alert the team when the server went down.
 
-We iterated on this solution idea a few more times, and landed on the following solution.
+We iterated on this solution idea a few more times, and landed on the following solution:
 
 1. Every fifteen minutes, run a script to see if the server is up or down. We decided to use GitHub Actions to run the script at no cost.
 
@@ -42,7 +42,7 @@ timestamp | http | up_or_down
 ![Screenshot 2025-01-29 at 5 11 26 PM](https://github.com/user-attachments/assets/cafb0c48-ab73-420a-ba6a-08fb1e4bd76e)
 
 
-We developed a working solution in less than one day. We refined it over the next few weeks: we brought it into alignment with Section 508 / WCAG accessibility guidelines, improved the overall look, and clarified the content based on user feedback.
+We developed a working solution in less than one day. We refined it over the next few weeks: we aligned it with Section 508 / WCAG accessibility guidelines, improved the overall look, and clarified the content based on user feedback.
 
 We also discovered that one component we wanted to monitor was not accessible over the public internet, and ran in a Windows environment. This meant that we couldn't reach the component using our GitHub Actions script, nor could we use the Unix-based GitHub Actions script on a Windows operating system. So, we worked with our partner to adapt the script to run on a Windows server, allowing us to monitor the complete system.
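
The logging and change-detection steps described in the post (steps 2 to 4), along with its file-size estimate, can be sketched roughly as follows. This is an illustrative Python sketch, not the team's script, which the post describes only as a Unix-based script run by GitHub Actions. The names `format_entry`, `parse_entry`, `status_changed`, and `years_until` are hypothetical. The rule that only HTTP 200 counts as "up" is an assumption, as is the single-space column layout of the log. The 50 MiB and 100 MiB figures are GitHub's documented per-file warning threshold and hard limit.

```python
from datetime import datetime, timedelta, timezone

HEADER = "timestamp | http | up_or_down"


def format_entry(when: datetime, http_code: int) -> str:
    """Build one log line in the layout shown in the post."""
    # Assumption: only HTTP 200 counts as "up"; the post does not state the rule.
    status = "up" if http_code == 200 else "dn"
    return f"{when.isoformat(timespec='seconds')} | {http_code} | {status}"


def parse_entry(line: str) -> tuple[str, int, str]:
    """Split a log line back into (timestamp, http_code, status)."""
    timestamp, http_code, status = (part.strip() for part in line.split("|"))
    return timestamp, int(http_code), status


def status_changed(log_lines: list[str]) -> bool:
    """Step 3 of the post: compare the status of the last two log entries."""
    entries = [ln for ln in log_lines if ln.strip() and ln.strip() != HEADER]
    if len(entries) < 2:
        return False  # not enough history to detect a transition
    return parse_entry(entries[-2])[2] != parse_entry(entries[-1])[2]


def years_until(limit_bytes: int, bytes_per_line: int, checks_per_day: int = 96) -> float:
    """Years of logging (one check every 15 minutes by default) to reach a size."""
    return limit_bytes / (bytes_per_line * checks_per_day * 365.25)


if __name__ == "__main__":
    eastern = timezone(timedelta(hours=-5))
    start = datetime(2024, 12, 4, tzinfo=eastern)
    log = [HEADER]
    for i, code in enumerate([200, 200, 503]):
        log.append(format_entry(start + timedelta(minutes=15 * i), code))
    print("\n".join(log))
    print("notify team:", status_changed(log))

    line_bytes = len(log[-1]) + 1  # one entry plus its newline
    print("years to 50 MiB warning:", round(years_until(50 * 2**20, line_bytes), 1))
    print("years to 100 MiB limit:", round(years_until(100 * 2**20, line_bytes), 1))
```

With the sample entries from the post, the last two lines differ (`up` then `dn`), so a notification would be sent. Under the stated layout assumption each entry takes 37 bytes, which puts the two thresholds at roughly 40 and 81 years, in line with the post's estimates.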