
Merge repeat records #35732

Open

wants to merge 14 commits into master
Conversation

@kaapstorm (Contributor) commented Feb 3, 2025

Technical Summary

Context: SC-3872
Corresponding commcare-analytics PR: dimagi/commcare-analytics#108

This change reduces the load on CommCare Analytics by merging payloads for the same data source. So instead of having up to 6 separate updates for the same table, this change results in just one update.
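Roughly, the merge collapses the pending row updates for a data source into a single payload. A simplified sketch of the idea (names illustrative, not the actual implementation):

    def merge_data_source_updates(repeat_records, get_rows):
        # All of a DataSourceRepeater's repeat records target the same data
        # source table, so their row updates can be combined into one payload.
        # get_rows() is a placeholder for however the repeater builds the rows
        # for a single repeat record.
        doc_ids, merged_rows = [], []
        for record in repeat_records:
            doc_ids.append(record.payload_id)     # the originating form/case ID
            merged_rows.extend(get_rows(record))
        return {"doc_ids": doc_ids, "rows": merged_rows}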

Feature Flag

SUPERSET_ANALYTICS + PROCESS_REPEATERS

Safety Assurance

Safety story

  • Tested merged payloads locally.
  • The PROCESS_REPEATERS flag must be enabled for this code to be called. This flag is not yet used for any production domains.

Automated test coverage

Includes tests

QA Plan

No QA planned.

Rollback instructions

  • This PR can be reverted after deploy with no further considerations

Labels & Review

  • Risk label is set correctly
  • The set of people pinged as reviewers is appropriate for the level of risk of the change

@kaapstorm changed the title from Nh/merge records to Merge repeat records on Feb 3, 2025
@@ -362,7 +362,7 @@ def process_repeater(repeater, lock_token):
     """
     Initiates a Celery task to process a repeater.
     """
-    if hasattr(repeater, 'merge_records'):
+    if repeater.can_merge_records:
Contributor

👍

@@ -721,6 +726,45 @@ def get_headers(self, repeat_record):
         })
         return headers

+    def merge_records(self):
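(The body of the added method is truncated in this view. Based on the review thread below, a deduplicating version might look roughly like the following sketch; helper and attribute names are placeholders, not the actual diff.)

    def dedupe_repeat_records(repeat_records):
        # Keep one pending repeat record per payload_id; the rest are redundant.
        seen_payload_ids = set()
        keep, redundant = [], []
        for record in repeat_records:
            if record.payload_id in seen_payload_ids:
                redundant.append(record)   # these would be cancelled or deleted
            else:
                seen_payload_ids.add(record.payload_id)
                keep.append(record)
        return keep, redundant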
Contributor
So if I understand this method correctly, here we're simply deduplicating the repeat records, right?

The behaviour of this merge_records method is different to the DataSourceRepeater.merge_records method, which is somewhat surprising to me.

I like the fact that we're "cleaning up" the repeat records here, but would it not be more suited to do when we create the CaseRepeater/UpdateCaseRepeater?

Contributor

I'm just thinking maybe we should not use the "merge records" mechanism for this: if we ever want to actually chunk the CaseRepeater records in the future for some reason, this method would not do that, and we'd have to refactor it.

Contributor Author

but would it not be more suited to do when we create the CaseRepeater/UpdateCaseRepeater?

Do you mean when we register the repeat record? i.e. If we already have a repeat record for case abc123 that is waiting to be sent, don't create another one?

That is not a bad idea.
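Something like this at registration time, maybe (sketch only; model and field names approximate):

    def register_if_not_pending(repeater, payload_id):
        # Skip creating a new repeat record if one for the same payload is
        # already waiting to be sent. RepeatRecord and State stand in for the
        # repeat record model and its state enum.
        already_waiting = RepeatRecord.objects.filter(
            repeater_id=repeater.repeater_id,
            payload_id=payload_id,
            state=State.Pending,
        ).exists()
        if not already_waiting:
            RepeatRecord.objects.create(
                repeater_id=repeater.repeater_id,
                payload_id=payload_id,
            )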

Contributor

If we already have a repeat record for case abc123 that is waiting to be sent, don't create another one?

Exactly

Contributor

Do you mean when we register the repeat record?

Yes, sorry... I conflated the two concepts.


@Charl1996 (Contributor)
@kaapstorm
Just a couple of comments, but otherwise looking great! What a breath of fresh air!

@mkangia (Contributor) commented Feb 4, 2025

Hey @kaapstorm

Just sharing some thoughts on the approach.

I am on board with clubbing repeat records.
The approach in the PR adds one additional repeat record for every 6 repeat records.
That does add more data for HQ to store, which could be okay; however, it also changes the invariant that one repeat record is always bound to exactly one document/record, which is a bigger conceptual change. I'd like us to reconsider this.

An alternative I can think of is simply clubbing repeat records to be sent when they are being processed/fired.
So, when fetching records and firing them one at a time, you could consider clubbing 6 of them and sending them together via a separate task.
They can then be clubbed into one payload but still be updated as usual.
I feel this would be less of a deviation from how repeat records currently function and would lead to simpler logical branching.

@kaapstorm (Contributor, Author)

An alternate I can think of is simply clubbing repeat records to be sent when they are being processed/fired. So, when fetching records and firing them one at a time, you could consider clubbing 6 of them and sending them together via a separate task.

We don't fire them one at a time. We fetch a chunk of records here (7 by default), and then spawn parallel tasks to fire all of them at the same time. So if we are going to be clubbing them, then we have to do that before we spawn the parallel tasks. So that's what this PR does over here.
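Roughly, the ordering is (heavily simplified; get_ready_records and fire_repeat_record are placeholders, not the actual task code):

    def process_repeater(repeater, lock_token):
        # Clubbing/merging has to happen here, before the fan-out below.
        if repeater.can_merge_records:
            repeater.merge_records()
        # Fetch a chunk of ready records (7 by default) and spawn one
        # parallel task per remaining record.
        for record in get_ready_records(repeater, chunk_size=7):
            fire_repeat_record.delay(record.id, lock_token)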

it changes how one repeat record is always bound to one document/record only

The repeat records of DataSourceRepeaters are interesting because their payload IDs aren't the unique IDs of the data source rows. As you know, they are the IDs of the forms or cases that resulted in those rows, so one repeat record can already refer to multiple rows.

One thing that I don't like about the approach that this PR takes is that it limits the number of merged repeat records to no more than 6. I was hoping for the same ballpark of performance improvements as a CSV import, so we would need to merge a lot more than 6 repeat records. I'd like at least 100, and if possible 1,000.

But your comment got me thinking: The payload that we are actually sending is a DataSourceUpdateLog. Its ID is set as the ID of a form or a case. But it doesn't have to be that way.

I am going to keep going with this PR, and change DataSourceUpdateLog (rough payload sketch below) so that:

  • its ID is unique
  • it allows sending rows from multiple forms/cases
  • it allows merging more than 6 payloads.
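Roughly, a merged payload would look like this (shape illustrative, not the final model):

    {
        "id": "a-new-unique-id",                     # no longer a form or case ID
        "data_source_id": "<data source>",
        "doc_ids": ["form-1", "case-2", "case-3"],   # rows from multiple docs
        "rows": [...],                               # all the merged row updates
    }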

BRB

@mkangia (Contributor) commented Feb 5, 2025

Hey @kaapstorm

The repeat records of DataSourceRepeaters are interesting because their payload IDs aren't the unique IDs of the data source rows. As you know, they are the IDs of the forms or cases that resulted in those rows, so one repeat record can already refer to multiple rows

I didn't realize we were already fetching multiple rows for the doc id. That seems useful when we have multiple rows for the same doc id in a data source.
My point originally was about a repeat record working with one record (form/case) at a time, which seems to be the case with DataSourceRepeater as well.

I can see how DataSourceRepeater could immediately benefit from this PR. Though we are now firing a query each time we register the repeater, which I am not sure about, we do benefit by avoiding adding a new repeat record in the relevant cases.

I am still unsure about us adding a new repeat record that replaces others. Though if it's for 100 or so records, like you mentioned, then it might be okay to do so. Just a side thought: having more records in a single request would also increase the time it takes the receiver to process the request, and could additionally increase the chances of failures at the receiving end. So we need to be careful about pushing the amount of data included in each request too high.

I think this is a challenging problem, and it will take iterations to arrive at a sound approach.

@dimagimon added the reindex/migration label (Reindex or migration will be required during or before deploy) on Feb 5, 2025
@kaapstorm (Contributor, Author)

I am still unsure about us adding a new repeat record replacing others.

Repeat records are really just queue tokens. They are not like forms, cases, users, locations, etc. that store data. Instead their purpose is for scheduling tasks. So using repeat records to optimize the task schedule feels like the correct approach.

Just a side thought that, having more records in a single request also would increase the time it would take for the receiver to process the request ... So, we just need to also be careful about pushing this number high, for the data to be included in each request.

Yes. Web services that take longer to process a POST request should return a 202: Accepted response, and process the payload asynchronously.

We should definitely consider this for CommCare Analytics if we want the same performance that we are getting with importing CSVs.
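In Flask-style pseudocode, the receiving side would look something like this (sketch only, not CCA's actual endpoint; enqueue_for_processing is a placeholder for a worker hand-off):

    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/data_source/update", methods=["POST"])
    def receive_update():
        payload = request.get_json()
        enqueue_for_processing(payload)   # hand off to a background worker
        return "", 202                    # Accepted: processed asynchronously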

additionally could increase chances of failures at the receiving end.

If processing the payload exceeds the machine's resources, that's definitely something we should avoid. On the flip side, the motivation for this change is that tens of thousands of requests, each with only one update, have strained CCA resources in the past and caused failures. So if we can optimize the payload size, then this change should reduce the chances of failures at the receiving end.

This will also have a positive effect on the whole of HQ: Just this morning, one domain sent a gigantic number of payloads to CCA, backing up the queue and slowing down data forwarding across the environment.

@mkangia (Contributor) commented Feb 7, 2025

Thanks @kaapstorm

As stated in your last comment there are definitely multiple factors to consider.
I'd still be in favor of sending the payloads for multiple records together as part of the request-handling mechanism instead of merging them at the model level, which would effectively achieve the same outcome without changing how the model works.
I am always hesitant about changing the architecture of the models, specifically because such changes can't be reverted once done.

I'd suggest taking this to a tech spec stage first: share it with the team, and keep this PR available as a code reference, to gather input and make a sound decision.

@kaapstorm (Contributor, Author) commented Feb 7, 2025

I am always hesitant with changing the architecture of the models, specifically because they can't be reverted once done.

I don't think that's correct in this specific situation: adding this method and then dropping it later would have no after-effects.

I also think that setting up DataSourceUpdateLog as a proper payload brings DataSourceRepeater in line with all other repeaters.

@kaapstorm (Contributor, Author)

I'd still be in favor of sending payload for multiple records together as a part of the request handling mechanism

You have mentioned how one repeat record should represent one document. I'd suggest that one repeat record should represent one request. If you have multiple repeat records resulting in a single request, how should we process the response? If the request fails, did one repeat record fail, or did all of them? If there is a payload error, which repeat record should be cancelled?

I think the approach here, where we create one merged DataSourceUpdateLog, means that processing the response and creating a RepeatRecordAttempt remains the same. This keeps the footprint small in an area that could stand to be simplified, instead of making it more complicated.

But if there is a simpler way to do this, I would be happy to review the spec.

@mkangia (Contributor) commented Feb 9, 2025

If you have multiple repeat records resulting in a single request, how should we process the response? If the request fails, did one repeat record fail, or did all of them? If there is a payload error, which repeat record should be cancelled?

If the request fails, they all fail. If there is a payload error, we could consider sending them all individually. Or, to keep things simple, in case of any error, fall back to sending them all individually?
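Something like this (sketch; helper names made up):

    class PayloadError(Exception):
        """Stands in for 'the receiver rejected the merged payload'."""

    def send_clubbed(records, send_request, build_payload):
        try:
            send_request(build_payload(records))      # one clubbed request
        except PayloadError:
            for record in records:                    # fall back: one by one
                send_request(build_payload([record]))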

Labels
reindex/migration Reindex or migration will be required during or before deploy

4 participants