[Alerts] Create Datadog alert for edx-platform edge to learn about alerting #628

Closed
3 of 4 tasks
timmc-edx opened this issue May 7, 2024 · 7 comments

@timmc-edx
Member

timmc-edx commented May 7, 2024

A/C:

  • One edxapp Edge alert policy migrated from NR to DD
    • Include both performance and error related conditions—prod-edge-edxapp-lms-arch-bom covers this.
    • May not need to transfer everything over—some conditions are duplicates or have been disabled.
  • New alert is informed by DD best practices
  • How-to documentation created or updated for alert migration
  • Team alerts runbook updated

We'll use Edge because it has DD enabled already and actually has traffic.


Created monitors

Error rate

Apdex

Datadog doesn't have a way to break out apdex by custom span tags, so I reimplemented the apdex formula in Trace Analytics.

Given a base filter env:(prod OR edge) service:edx-edxapp-lms operation_name:django.request @code_owner_squad:arch-bom (over service entry spans) we can define three queries:

And then alert on (a + b/2) / c < 0.8 over 15 minutes: https://app.datadoghq.com/monitors/145167039

Here I've chosen 400 ms as the apdex T value; we're using 2 seconds in New Relic, but that's way too high. I set 0.85 as the recovery threshold to add some hysteresis.
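
For readers who haven't worked with apdex before, here is a minimal Python sketch of the arithmetic behind the monitor. It is not the Datadog configuration itself. It assumes the standard Apdex buckets: `a` counts "satisfied" requests (duration at most T), `b` counts "tolerating" requests (between T and 4T), and `c` is the total request count. The function names, the sample durations, and the exact boundary comparisons are illustrative assumptions; only T = 400 ms, the 0.8 alert threshold, and the 0.85 recovery threshold come from the monitor described above.

```python
from typing import Iterable, Optional, Tuple


def classify(durations_ms: Iterable[float], t_ms: float = 400) -> Tuple[int, int, int]:
    """Return (satisfied, tolerating, total) counts for the given durations.

    Assumes the standard Apdex buckets: satisfied <= T, tolerating <= 4T.
    """
    durations = list(durations_ms)
    satisfied = sum(1 for d in durations if d <= t_ms)
    tolerating = sum(1 for d in durations if t_ms < d <= 4 * t_ms)
    return satisfied, tolerating, len(durations)


def apdex(satisfied: int, tolerating: int, total: int) -> Optional[float]:
    """Compute (a + b/2) / c, or None when there is no traffic."""
    if total == 0:
        return None
    return (satisfied + tolerating / 2) / total


def next_state(alerting: bool, score: float,
               alert_below: float = 0.8, recover_at: float = 0.85) -> bool:
    """Hysteresis: enter alert below 0.8, leave it only at 0.85 or above.

    Whether recovery is inclusive of 0.85 is an assumption of this sketch.
    """
    if score < alert_below:
        return True
    if score >= recover_at:
        return False
    return alerting  # between the thresholds: keep the previous state


# Hypothetical sample: 3 satisfied, 2 tolerating, 1 frustrated request.
a, b, c = classify([100, 300, 400, 500, 1600, 2000])
score = apdex(a, b, c)
```

The gap between 0.8 and 0.85 means a score hovering around the alert threshold won't cause the monitor to flap between alerting and recovered.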

@timmc-edx timmc-edx converted this from a draft issue May 7, 2024
@timmc-edx timmc-edx moved this to In Progress in Arch-BOM May 7, 2024
@timmc-edx timmc-edx self-assigned this May 7, 2024
@robrap
Contributor

robrap commented May 7, 2024

@timmc-edx: I started a Slack thread with DD about getting a best practice doc. Also, it looks like @alangsto started some how-to docs on monitors here.

@alangsto: [inform] Tim is working on arch-bom alerts in this ticket.

@robrap
Contributor

robrap commented May 8, 2024

@timmc-edx: I'm wondering if these supporting runbooks have dashboards, and if it would be useful to migrate those dashboards as part of this ticket, or if we should spin off a separate ticket for dashboards. Let me know your thoughts. Thanks.

@timmc-edx
Member Author

I think we should ticket dashboards separately.

@timmc-edx
Member Author

timmc-edx commented May 8, 2024

Notes:

These conditions are currently enabled in the NR policy; the disabled ones are "baseline" error rate alerts that we replaced with static threshold alerts:

  • BOM Edge LMS APM Apdex Burst
  • BOM Edge LMS APM Apdex Sustained
  • BOM Edge LMS APM Error Burst (1% over 10 min)
  • BOM Edge LMS APM Error Sustained (0.1% over 60 min)
  • BOM Edge LMS Workers High Error Rate

Pattern: Burst/sustained are the same query with different thresholds. All filter on code owner being arch_bom or NULL.
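
A rough Python sketch of that pattern, for illustration only: one error-rate calculation evaluated against two window/threshold pairs. The window lengths and percentages come from the condition names above; the class and function names are hypothetical, and whether the real conditions use a strict or inclusive comparison is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorRateCondition:
    name: str
    window_minutes: int
    threshold: float  # error rate as a fraction, e.g. 0.01 == 1%


# Same query, different windows and thresholds.
BURST = ErrorRateCondition("Error Burst", window_minutes=10, threshold=0.01)
SUSTAINED = ErrorRateCondition("Error Sustained", window_minutes=60, threshold=0.001)


def error_rate(errors: int, total: int) -> float:
    """Fraction of requests that errored; 0.0 when there is no traffic."""
    return errors / total if total else 0.0


def breached(cond: ErrorRateCondition, errors: int, total: int) -> bool:
    """True when the rate over the condition's window exceeds its threshold.

    The counts are assumed to already be aggregated over cond.window_minutes.
    """
    return error_rate(errors, total) > cond.threshold
```

With these numbers, a 0.5% error rate would trip the sustained condition if it held for the full 60 minutes, but would not trip the burst condition.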

@timmc-edx
Member Author

timmc-edx commented May 20, 2024

I was using @http.status_code:5* for the error monitors but have switched to status:error to make it more general. However, these produce nearly identical results, as status:error is only being evaluated on the service entry span. When an error bubbles up that far, we produce a 500 anyhow. (There are a handful of traces that have status:error and no response code but these are probably just incomplete traces.)
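
To illustrate the comparison, here is a small Python sketch of how the two filters could be checked against each other on service entry spans. The span dictionaries and their keys are a hypothetical stand-in for exported span data, not a Datadog API; the sample values are made up.

```python
def is_5xx(span: dict) -> bool:
    """Rough equivalent of @http.status_code:5* (any code starting with 5)."""
    code = span.get("http.status_code")
    return code is not None and str(code).startswith("5")


def is_error(span: dict) -> bool:
    """Rough equivalent of status:error."""
    return span.get("status") == "error"


def disagreements(spans: list) -> list:
    """Spans matched by exactly one of the two filters."""
    return [s for s in spans if is_5xx(s) != is_error(s)]


# Hypothetical entry spans: the last one has status:error but no response code.
sample = [
    {"status": "ok", "http.status_code": 200},
    {"status": "error", "http.status_code": 500},
    {"status": "error"},
]
mismatched = disagreements(sample)
```

In this sample only the span without a response code ends up in `mismatched`, which is the kind of span described above as probably belonging to an incomplete trace.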

@robrap
Contributor

robrap commented May 20, 2024

What is an incomplete trace, and is it the result of a bug? Can we ask DD about this, possibly in a support ticket?

@timmc-edx
Member Author

I can make a note to look into it further.

@github-project-automation github-project-automation bot moved this from In Progress to Done in Arch-BOM May 31, 2024
@jristau1984 jristau1984 moved this from Done to Done - Long Term Storage in Arch-BOM Sep 30, 2024