Merge repeat records #35732
base: master
Conversation
corehq/motech/repeaters/tasks.py (outdated)
```diff
@@ -362,7 +362,7 @@ def process_repeater(repeater, lock_token):
     """
     Initiates a Celery task to process a repeater.
     """
-    if hasattr(repeater, 'merge_records'):
+    if repeater.can_merge_records:
```
👍
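For context, a property like the one checked above might be declared along these lines. This is a hedged sketch: the name `can_merge_records` comes from the diff, but the base-class default and the override shown here are assumptions, not the PR's actual code.

```python
class Repeater:
    @property
    def can_merge_records(self):
        # Assumed default: most repeaters process their repeat
        # records individually and don't support merging.
        return False


class DataSourceRepeater(Repeater):
    @property
    def can_merge_records(self):
        # Assumed override: this repeater can merge repeat records
        # for the same data source into a single payload.
        return True
```

Compared with `hasattr(repeater, 'merge_records')`, an explicit property makes the capability check independent of whether the method happens to be defined on the class.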
corehq/motech/repeaters/models.py (outdated)
```diff
@@ -721,6 +726,45 @@ def get_headers(self, repeat_record):
         })
         return headers
 
+    def merge_records(self):
```
So if I understand this method correctly, here we're simply deduplicating the repeat records, right?

The behaviour of this `merge_records` method is different to the `DataSourceRepeater.merge_records` method, which is somewhat surprising to me.

I like the fact that we're "cleaning up" the repeat records here, but would it not be more suited to do this when we create the `CaseRepeater`/`UpdateCaseRepeater`?
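To make the question concrete, here is a minimal sketch of what deduplication-style merging might look like. It is hypothetical: attribute names like `payload_id` and `registered_at` are assumptions, and the real method would operate on model instances and persist the state changes.

```python
from collections import defaultdict


def deduplicate_repeat_records(repeat_records):
    """
    Group pending repeat records by payload ID, keep the earliest
    record in each group, and return the rest for cancellation.
    """
    by_payload_id = defaultdict(list)
    for record in repeat_records:
        by_payload_id[record.payload_id].append(record)

    to_keep, to_cancel = [], []
    for records in by_payload_id.values():
        # Keep the oldest record so retry timing is preserved;
        # the later duplicates point at the same payload anyway.
        records.sort(key=lambda r: r.registered_at)
        to_keep.append(records[0])
        to_cancel.extend(records[1:])
    return to_keep, to_cancel
```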
I'm just thinking maybe we should not use the "merge records" mechanism for this: if we want to actually chunk the `CaseRepeater` records in the future for some reason, this method would not do that, and we'd have to refactor it to get it to do that.
> but would it not be more suited to do this when we create the `CaseRepeater`/`UpdateCaseRepeater`?

Do you mean when we register the repeat record? i.e. if we already have a repeat record for case `abc123` that is waiting to be sent, don't create another one?

That is not a bad idea.
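A hedged sketch of that idea, checking for an existing pending record at registration time. This is written in Django ORM style to match the codebase, but the model, field, and state names are assumptions, not the PR's actual code:

```python
def register_repeat_record(repeater, payload_id):
    """
    Skip creating a new repeat record if one for the same payload
    is already waiting to be sent. Hypothetical names throughout.
    """
    already_queued = RepeatRecord.objects.filter(
        repeater_id=repeater.id,
        payload_id=payload_id,
        state=State.Pending,  # assumed "waiting to be sent" state
    ).exists()
    if already_queued:
        # A record for this payload is already queued; don't create
        # another one.
        return None
    return RepeatRecord.objects.create(
        repeater_id=repeater.id,
        payload_id=payload_id,
        state=State.Pending,
    )
```

One trade-off, raised later in this thread, is that this fires an extra query on every registration.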
> If we already have a repeat record for case `abc123` that is waiting to be sent, don't create another one?

Exactly.
> Do you mean when we register the repeat record?

Yes, sorry... I conflated the two concepts.
Hey @kaapstorm, just sharing some thoughts on the approach. I am on board with clubbing repeat records. An alternative I can think of is simply clubbing the repeat records to be sent when they are being processed/fired.
We don't fire them one at a time. We fetch a chunk of records here (7 by default), and then spawn parallel tasks to fire all of them at the same time. So if we are going to be clubbing them, then we have to do that before we spawn the parallel tasks. So that's what this PR does over here.
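In code terms, the ordering being described might look roughly like this. It is a sketch: `repeat_records_ready` and the task name are illustrative, and `merge_records` is shown taking the chunk as an argument for clarity, while the diff above defines `merge_records(self)` with no arguments.

```python
def process_repeater(repeater, lock_token, chunk_size=7):
    # Fetch the next chunk of pending repeat records (7 by default,
    # per the comment above).
    repeat_records = repeater.repeat_records_ready[:chunk_size]

    # Merging has to happen here, before the parallel tasks are
    # spawned, because each task fires its repeat record
    # independently of the others.
    if repeater.can_merge_records:
        repeat_records = repeater.merge_records(repeat_records)

    # Spawn parallel tasks to fire all remaining records at once.
    for repeat_record in repeat_records:
        process_repeat_record.delay(repeat_record.id, lock_token)
```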
The repeat records of `DataSourceRepeater`s are interesting because their payload IDs aren't the unique IDs of the data source rows. As you know, they are the IDs of the forms or cases that resulted in those rows, so one repeat record can already refer to multiple rows.

One thing that I don't like about the approach that this PR takes is that it limits the number of merged repeat records to no more than 6. I was hoping for the same ballpark of performance improvements as a CSV import, so we would need to merge a lot more than 6 repeat records. I'd like at least 100, and if possible 1,000.

But your comment got me thinking: The payload that we are actually sending is a

I am going to keep going with this PR, and change
BRB
Hey @kaapstorm

I didn't realize we were already fetching multiple rows for the doc ID. That seems useful when we have multiple rows for the same doc ID in a data source. I can see how `DataSourceRepeater` could immediately benefit from the PR and have an advantage right away. Though we'd be firing a query each time we register the repeat record, which I am not sure about, we'd benefit by avoiding a new repeat record in the relevant cases.

I am still unsure about us adding a new repeat record that replaces others, though if it's for 100 or so records like you mentioned, then it might seem okay to do so.

Just a side thought: having more records in a single request would also increase the time it takes the receiver to process the request, and could additionally increase the chances of failures at the receiving end. So we just need to be careful about pushing the amount of data included in each request too high. I think this is a challenge to solve and would take iterations to come up with a wise approach.
Repeat records are really just queue tokens. They are not like forms, cases, users, locations, etc. that store data. Instead, their purpose is scheduling tasks. So using repeat records to optimize the task schedule feels like the correct approach.
Yes. Web services that take longer to process a POST request should return a "202 Accepted" response and process the payload asynchronously. We should definitely consider this for CommCare Analytics if we want the same performance that we are getting with importing CSVs.
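As a generic illustration of that pattern (not CommCare Analytics' actual API; the endpoint path and handler names are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

from flask import Flask, jsonify, request

app = Flask(__name__)
executor = ThreadPoolExecutor(max_workers=4)


def apply_data_source_rows(payload):
    # Placeholder for the slow work of processing the merged payload.
    ...


@app.route("/data-source/rows", methods=["POST"])
def accept_rows():
    payload = request.get_json()
    # Hand the payload to a background worker and acknowledge receipt
    # immediately, instead of holding the HTTP connection open while
    # a large merged payload is processed.
    executor.submit(apply_data_source_rows, payload)
    return jsonify({"status": "accepted"}), 202
```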
If processing the payload exceeds the machine's resources, that's definitely something we should avoid. On the flip side, the motivation for this change is that tens of thousands of requests, each with only one update, have strained CCA resources in the past and caused failures. So if we can optimize the payload size, then this change should reduce the chances of failures at the receiving end. This will also have a positive effect on the whole of HQ: Just this morning, one domain sent a gigantic number of payloads to CCA, backing up the queue and slowing down data forwarding across the environment.
Thanks @kaapstorm. As stated in your last comment, there are definitely multiple factors to consider. I'd suggest taking this to a tech spec stage first: share it with the team, and keep this PR available as a code reference, to get some input toward a sound decision.
I don't think that's correct in this specific situation: adding this method and then dropping it later would have no after-effects. I also think that setting up
You have mentioned how one repeat record should represent one document. I'd suggest that one repeat record should represent one request.

If you have multiple repeat records resulting in a single request, how should we process the response? If the request fails, did one repeat record fail, or did all of them? If there is a payload error, which repeat record should be cancelled? I think the approach here, where we create one merged

But if there is a simpler way to do this, I would be happy to review the spec.
If the request fails, all of them fail. If there is a payload error, we could consider sending them all individually. Or, to keep things simple, fall back to sending them all individually in case of any error?
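A sketch of that fallback logic, just to make the two failure modes concrete. The exception classes and the sender callables are hypothetical:

```python
class RequestError(Exception):
    """The request failed at the transport or server level."""


class PayloadError(Exception):
    """The receiver rejected the payload content."""


def send_merged(repeat_records, merged_payload, send_payload, send_record):
    """
    Try the merged request first. On a payload error, fall back to
    firing each repeat record individually, so one bad record doesn't
    cancel the whole batch. On a request-level failure, let all the
    merged records fail together and retry later as one.
    """
    try:
        send_payload(merged_payload)
    except PayloadError:
        for repeat_record in repeat_records:
            send_record(repeat_record)
    except RequestError:
        raise  # all merged records fail together; retry as one
```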
Technical Summary
Context: SC-3872
Corresponding commcare-analytics PR: dimagi/commcare-analytics#108

This change reduces the load on CommCare Analytics by merging payloads for the same data source. So instead of having up to 6 separate updates for the same table, this change results in just one update.
Feature Flag
SUPERSET_ANALYTICS + PROCESS_REPEATERS
Safety Assurance
Safety story
Automated test coverage
Includes tests
QA Plan
No QA planned.
Rollback instructions
Labels & Review