[Alerts] Create Datadog alert for edx-platform edge to learn about alerting #628
Comments
@timmc-edx: I started a Slack thread with DD about getting a best practice doc. Also, it looks like @alangsto started some how-to docs on monitors here. @alangsto: [inform] Tim is working on arch-bom alerts in this ticket.
@timmc-edx: I'm wondering if these supporting runbooks have dashboards, and whether it would be useful to migrate those dashboards as part of this ticket, or if we should spin off a separate ticket for dashboards. Let me know your thoughts. Thanks.
I think we should ticket dashboards separately.
Notes: These conditions are currently enabled in the NR policy; the disabled ones are "baseline" error rate alerts that we replaced with static threshold alerts:
Pattern: Burst/sustained are the same query with different thresholds. All filter on code owner being arch_bom or NULL.
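For illustration, here's a minimal sketch of that pattern, assuming Python and made-up thresholds and windows (the real values live in the Datadog monitors, not here): the same error-rate series is checked against a short-window/high-threshold "burst" condition and a long-window/lower-threshold "sustained" condition.

```python
# Hedged sketch: evaluate one error-rate series against a "burst" pair
# (short window, high threshold) and a "sustained" pair (long window,
# lower threshold). Thresholds and windows are hypothetical.
from statistics import mean

def error_rate_alerts(error_rates_per_minute):
    """error_rates_per_minute: most-recent-last list of per-minute error rates (0.0-1.0)."""
    alerts = []
    # Burst: average over the last 5 minutes exceeds 10%
    if mean(error_rates_per_minute[-5:]) > 0.10:
        alerts.append("burst")
    # Sustained: average over the last 30 minutes exceeds 5%
    if mean(error_rates_per_minute[-30:]) > 0.05:
        alerts.append("sustained")
    return alerts

# Example: a brief spike trips the burst alert but not the sustained one.
rates = [0.01] * 25 + [0.15] * 5
print(error_rate_alerts(rates))  # ['burst']
```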
I was using
What is an incomplete trace, and is it the result of a bug? Can we ask DD about this, possibly in a support ticket? |
I can make a note to look into it further.
A/C:
We'll use Edge because it has DD enabled already and actually has traffic.
Created monitors
Error rate
env:stg is downgraded to a P2
Apdex
Datadog doesn't have a way to break out apdex by custom span tags, so I reimplemented the apdex formula in Trace Analytics.
Given a base filter
env:(prod OR edge) service:edx-edxapp-lms operation_name:django.request @code_owner_squad:arch-bom
(over service entry spans) we can define three queries:
a: <BASE> -@http.status_code:5* @duration:[0ms TO 400ms]
b: <BASE> -@http.status_code:5* @duration:[401ms TO 1600ms]
c: <BASE>
And then alert on
(a + b/2) / c < 0.8
over 15 minutes: https://app.datadoghq.com/monitors/145167039
Here I've chosen 400 ms as the apdex T value; we're using 2 seconds in New Relic, but that's way too high. I set 0.85 as the recovery threshold to add some hysteresis.
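For reference, here's a minimal sketch (in Python, with made-up request counts) of the formula and the hysteresis behavior described above, where a = satisfied, b = tolerating, and c = all requests:

```python
# Hedged sketch of the reimplemented apdex formula and the alert/recovery
# thresholds described above. The counts are made up for illustration.
ALERT_THRESHOLD = 0.80     # fire when apdex drops below this
RECOVERY_THRESHOLD = 0.85  # recover only once apdex climbs back above this

def apdex(a, b, c):
    """a = satisfied (<= 400 ms, no 5xx), b = tolerating (401-1600 ms, no 5xx), c = all requests."""
    return (a + b / 2) / c

def next_state(currently_alerting, a, b, c):
    """Apply hysteresis: different thresholds for entering and leaving the alert state."""
    score = apdex(a, b, c)
    if currently_alerting:
        return score < RECOVERY_THRESHOLD  # stay in alert until we clear 0.85
    return score < ALERT_THRESHOLD         # enter alert only below 0.8

# Worked example over a 15-minute window:
# 700 satisfied + 100 tolerating out of 1000 requests -> (700 + 50) / 1000 = 0.75
print(apdex(700, 100, 1000))              # 0.75 -> below 0.8, monitor fires
print(next_state(False, 700, 100, 1000))  # True (enters alert)
# Recovery: 810 satisfied + 60 tolerating out of 1000 -> 0.84
print(next_state(True, 810, 60, 1000))    # True (0.84 < 0.85, still alerting)
print(next_state(True, 830, 60, 1000))    # False (0.86 >= 0.85, recovers)
```

The gap between the 0.8 trigger and the 0.85 recovery threshold keeps the monitor from flapping when apdex hovers right around 0.8.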