-
-
Notifications
You must be signed in to change notification settings - Fork 218
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
process_repeater_stub() task #28978
process_repeater_stub() task #28978
Conversation
fcaa4d8
to
2931f8c
Compare
"6 backoff attempts, 3 normal attempts" commit. |
b302830
to
526cd02
Compare
ping @dannyroberts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Very happy to see more models getting migrated from Couch to SQL.
corehq/motech/repeaters/models.py
Outdated
process_repeater_stub.delay(repeater_stub) | ||
|
||
|
||
def get_payload(repeater: Repeater, repeat_record: SQLRepeatRecord) -> Any: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this return type really Any
? What kinds of values are typical? (for my curiosity only)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The return value of anything that implements BasePayloadGenerator.get_payload()
. So far, form XML, form JSON, case XML, case JSON, an app ID, a Tastypie-serialized CommCareUser, a Location as JSON.
But that's a very good question, because I think all those are returned a strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think this be a good candidate for just leaving out the type hint for the return value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think yes. Maybe I'm too biased to answer that question objectively 😛
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On second thought, no. I think it's fine as is since you're using type annotations in this code anyway.
self.repeater.delete() | ||
|
||
|
||
class ServerErrorTests(RepeaterFixtureMixin, TestCase, DomainSubscriptionMixin): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these the tests you were referring to that are slow? I'm assuming they're slow because they call submit_form_locally()
, is that correct? If yes, I don't think there's any way to get around that if you really need to submit a form for every test.
Maybe you could submit one form in setUpClass
and then have tearDown
reset any form-related state that could have been changed by the test?
Co-authored-by: Daniel Miller <[email protected]>
Co-authored-by: Daniel Miller <[email protected]>
Co-authored-by: Daniel Miller <[email protected]>
Co-authored-by: Daniel Miller <[email protected]>
Summary
This PR is the next in a series to implement [CEP] Migrate RepeatRecord to SQL.
Previous PR: SQLRepeatRecord
no-obligation fyi: @orangejenny @millerdev @snopoke
I'm opening as a draft PR because there are two things I am looking for feedback on:
1. When to back off
Currently, if 10 forms are submitted and forwarded to an endpoint that is offline, that endpoint is attempted 10 times, then about an hour later it is attempted 10 times again, three hours later 10 times again, each time the interval is multiplied by 3, for up to five days, when eventually all 10 repeat records will be cancelled.
This PR changes that behaviour. Exponential backoff is tracked on a
RepeaterStub
model instead of on each independent repeat record: If 10 forms are submitted, the first form will be forwarded. If the endpoint is offline, only that form will be retried an hour later, then three hours later, etc. If it succeeds within five days, the rest of the 9 forms will be sent. If it continues to fail, it will be cancelled, and HQ will try to send just the second form.The idea is to dramatically reduce the rate of send attempts that are likely to fail. It also sets up a better foundation for automatically pausing repeaters that appear to be permanently offline.
My biggest concern is to make sure that we never back off when the server is fine, and the error is caused by the payload. The
check_repeaters()
task runs every five minutes. So the code differentiates between errors to back off on, and errors to retry on the nextcheck_repeaters()
call. My intention is that bad payloads can be cancelled fast (ideally in about half an hour; six attempts, each about five minutes apart), so that bad payloads can't hold up good payloads for five days!I am open to the idea of not retrying some kinds of errors, but in my experience some third-party servers are under-resourced, and retrying a few minutes later can turn a "500" into a "201". Reducing the number of retries for some kinds of errors might be a happy compromise. I'm also open to taking this conversation somewhere else, and pulling in some USH AEs.
2. How to improve slow tests
I am very interested in alternatives to the test suite in the "Slooow tests to test backoff" commit. It takes 11 seconds to run five tests with
REUSE_DB=True
. The tests follow the approach used by these Repeater tests. The lowest effort would probably be to submit the form insetUpClass()
and just requeue it insetUp()
. But I'm curious to know if there is a better, completely different approach the covers the behaviour that these tests are meant to verify.Do you know faster, better ways to test the same functionality?
Safety Assurance
Automated test coverage
The new code is covered by automated tests. (Please point out important gaps in testing in PR feedback.)
Safety story
Other than through unit tests, none of this code is reachable yet.
Rollback instructions