
Merge repeat records #35732

Open

wants to merge 14 commits into master
Conversation

@kaapstorm (Contributor) commented Feb 3, 2025

Technical Summary

Context: SC-3872
Corresponding commcare-analytics PR: dimagi/commcare-analytics#108

This change reduces the load on CommCare Analytics by merging payloads for the same data source. So instead of having up to 6 separate updates for the same table, this change results in just one update.
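Roughly, the merge collapses the pending row updates for a data source into a single payload. A simplified sketch of the idea (names illustrative, not the actual implementation):

    def merge_data_source_updates(repeat_records, get_rows):
        # All of a DataSourceRepeater's repeat records target the same data
        # source table, so their row updates can be combined into one payload.
        # get_rows() is a placeholder for however the repeater builds the rows
        # for a single repeat record.
        doc_ids, merged_rows = [], []
        for record in repeat_records:
            doc_ids.append(record.payload_id)     # the originating form/case ID
            merged_rows.extend(get_rows(record))
        return {"doc_ids": doc_ids, "rows": merged_rows}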

Feature Flag

SUPERSET_ANALYTICS + PROCESS_REPEATERS

Safety Assurance

Safety story

  • Tested merged payloads locally.
  • The PROCESS_REPEATERS flag must be enabled for this code to be called. This flag is not yet used for any production domains.

Automated test coverage

Includes tests

QA Plan

No QA planned.

Rollback instructions

  • This PR can be reverted after deploy with no further considerations

Labels & Review

  • Risk label is set correctly
  • The set of people pinged as reviewers is appropriate for the level of risk of the change

@kaapstorm changed the title from Nh/merge records to Merge repeat records on Feb 3, 2025
@@ -362,7 +362,7 @@ def process_repeater(repeater, lock_token):
     """
     Initiates a Celery task to process a repeater.
     """
-    if hasattr(repeater, 'merge_records'):
+    if repeater.can_merge_records:
Contributor

👍

@@ -721,6 +726,45 @@ def get_headers(self, repeat_record):
         })
         return headers

+    def merge_records(self):
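(The body of the added method is truncated in this view. Based on the review thread below, a deduplicating version might look roughly like the following sketch; helper and attribute names are placeholders, not the actual diff.)

    def dedupe_repeat_records(repeat_records):
        # Keep one pending repeat record per payload_id; the rest are redundant.
        seen_payload_ids = set()
        keep, redundant = [], []
        for record in repeat_records:
            if record.payload_id in seen_payload_ids:
                redundant.append(record)   # these would be cancelled or deleted
            else:
                seen_payload_ids.add(record.payload_id)
                keep.append(record)
        return keep, redundant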
Contributor
So if I understand this method correctly, here we're simply deduplicating the repeat records, right?

The behaviour of this merge_records method is different to the DataSourceRepeater.merge_records method, which is somewhat surprising to me.

I like the fact that we're "cleaning up" the repeat records here, but would it not be more suited to do when we create the CaseRepeater/UpdateCaseRepeater?

Contributor

I'm just thinking maybe we should not use the "merge records" mechanism for this: if we ever want to actually chunk the CaseRepeater records in the future for some reason, this method would not do that, and we'd have to refactor it.

Contributor Author

but would it not be more suited to do when we create the CaseRepeater/UpdateCaseRepeater?

Do you mean when we register the repeat record? i.e. If we already have a repeat record for case abc123 that is waiting to be sent, don't create another one?

That is not a bad idea.
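Something like this at registration time, maybe (sketch only; model and field names approximate):

    def register_if_not_pending(repeater, payload_id):
        # Skip creating a new repeat record if one for the same payload is
        # already waiting to be sent. RepeatRecord and State stand in for the
        # repeat record model and its state enum.
        already_waiting = RepeatRecord.objects.filter(
            repeater_id=repeater.repeater_id,
            payload_id=payload_id,
            state=State.Pending,
        ).exists()
        if not already_waiting:
            RepeatRecord.objects.create(
                repeater_id=repeater.repeater_id,
                payload_id=payload_id,
            )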

Contributor

If we already have a repeat record for case abc123 that is waiting to be sent, don't create another one?

Exactly

Contributor

Do you mean when we register the repeat record?

Yes, sorry... I conflated the two concepts.


@Charl1996 (Contributor)
@kaapstorm
Just a couple of comments, but otherwise looking great! What a breath of fresh air!

@mkangia (Contributor) commented Feb 4, 2025

Hey @kaapstorm

Just sharing some thoughts on the approach.

I am on board with clubbing repeat records.
The approach in the PR adds one additional repeat record for every 6 repeat records.
That does add more data for HQ to store, which could be okay; however, it also changes the invariant that one repeat record is always bound to exactly one document/record, which is a bigger conceptual change. I'd like us to reconsider this.

An alternative I can think of is simply clubbing repeat records to be sent when they are being processed/fired.
So, when fetching records and firing them one at a time, you could consider clubbing 6 of them and sending them together via a separate task.
They can then be clubbed into one payload but still be updated as usual.
I feel this would be less of a deviation from how repeat records currently function and would lead to simpler logical branching.

@kaapstorm (Contributor, Author)

An alternate I can think of is simply clubbing repeat records to be sent when they are being processed/fired. So, when fetching records and firing them one at a time, you could consider clubbing 6 of them and sending them together via a separate task.

We don't fire them one at a time. We fetch a chunk of records here (7 by default), and then spawn parallel tasks to fire all of them at the same time. So if we are going to be clubbing them, then we have to do that before we spawn the parallel tasks. So that's what this PR does over here.
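Roughly, the ordering is (heavily simplified; get_ready_records and fire_repeat_record are placeholders, not the actual task code):

    def process_repeater(repeater, lock_token):
        # Clubbing/merging has to happen here, before the fan-out below.
        if repeater.can_merge_records:
            repeater.merge_records()
        # Fetch a chunk of ready records (7 by default) and spawn one
        # parallel task per remaining record.
        for record in get_ready_records(repeater, chunk_size=7):
            fire_repeat_record.delay(record.id, lock_token)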

it changes how one repeat record is always bound to one document/record only

The repeat records of DataSourceRepeaters are interesting because their payload IDs aren't the unique IDs of the data source rows. As you know, they are the IDs of the forms or cases that resulted in those rows, so one repeat record can already refer to multiple rows.

One thing that I don't like about the approach that this PR takes is that it limits the number of merged repeat records to no more than 6. I was hoping for the same ballpark of performance improvements as a CSV import, so we would need to merge a lot more than 6 repeat records. I'd like at least 100, and if possible 1,000.

But your comment got me thinking: The payload that we are actually sending is a DataSourceUpdateLog. Its ID is set as the ID of a form or a case. But it doesn't have to be that way.

I am going to keep going with this PR, and change DataSourceUpdateLog (rough payload sketch below) so that:

  • its ID is unique
  • it allows sending rows from multiple forms/cases
  • it allows merging more than 6 payloads.
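Roughly, a merged payload would look like this (shape illustrative, not the final model):

    {
        "id": "a-new-unique-id",                     # no longer a form or case ID
        "data_source_id": "<data source>",
        "doc_ids": ["form-1", "case-2", "case-3"],   # rows from multiple docs
        "rows": [...],                               # all the merged row updates
    }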

BRB

@mkangia (Contributor) commented Feb 5, 2025

Hey @kaapstorm

The repeat records of DataSourceRepeaters are interesting because their payload IDs aren't the unique IDs of the data source rows. As you know, they are the IDs of the forms or cases that resulted in those rows, so one repeat record can already refer to multiple rows

I didn't realize we were already fetching multiple rows for the doc id. That seems useful when we have multiple rows for the same doc id in a data source.
My point originally was about a repeat record working with one record (form/case) at a time, which seems to be the case with DataSourceRepeater as well.

I can see how DataSourceRepeater could immediately benefit from this PR. Though we are now firing a query each time we register the repeater, which I am not sure about, we do benefit by avoiding adding a new repeat record in the relevant cases.

I am still unsure about us adding a new repeat record that replaces others. Though if it's for 100 or so records, like you mentioned, then it might be okay to do so. Just a side thought: having more records in a single request would also increase the time it takes the receiver to process the request, and could additionally increase the chances of failures at the receiving end. So we need to be careful about pushing the amount of data included in each request too high.

I think this is a challenging problem, and it will take iterations to arrive at a sound approach.

@dimagimon added the reindex/migration label (Reindex or migration will be required during or before deploy) on Feb 5, 2025
@kaapstorm (Contributor, Author)

I am still unsure about us adding a new repeat record replacing others.

Repeat records are really just queue tokens. They are not like forms, cases, users, locations, etc. that store data. Instead their purpose is for scheduling tasks. So using repeat records to optimize the task schedule feels like the correct approach.

Just a side thought that, having more records in a single request also would increase the time it would take for the receiver to process the request ... So, we just need to also be careful about pushing this number high, for the data to be included in each request.

Yes. Web services that take longer to process a POST request should return a 202: Accepted response, and process the payload asynchronously.

We should definitely consider this for CommCare Analytics if we want the same performance that we are getting with importing CSVs.
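In Flask-style pseudocode, the receiving side would look something like this (sketch only, not CCA's actual endpoint; enqueue_for_processing is a placeholder for a worker hand-off):

    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/data_source/update", methods=["POST"])
    def receive_update():
        payload = request.get_json()
        enqueue_for_processing(payload)   # hand off to a background worker
        return "", 202                    # Accepted: processed asynchronously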

additionally could increase chances of failures at the receiving end.

If processing the payload exceeds the machine's resources, that's definitely something we should avoid. On the flip side, the motivation for this change is that tens of thousands of requests, each with only one update, have strained CCA resources in the past and caused failures. So if we can optimize the payload size, then this change should reduce the chances of failures at the receiving end.

This will also have a positive effect on the whole of HQ: Just this morning, one domain sent a gigantic number of payloads to CCA, backing up the queue and slowing down data forwarding across the environment.

@mkangia (Contributor) commented Feb 7, 2025

Thanks @kaapstorm

As stated in your last comment there are definitely multiple factors to consider.
I'd still be in favor of sending the payloads for multiple records together as part of the request-handling mechanism instead of merging them at the model level, which would effectively achieve the same outcome without changing how the model works.
I am always hesitant about changing the architecture of the models, specifically because such changes can't be reverted once done.

I'd suggest taking this to a tech spec stage first: share it with the team, and keep this PR available as a code reference, to gather input and make a sound decision.

@kaapstorm (Contributor, Author) commented Feb 7, 2025

I am always hesitant with changing the architecture of the models, specifically because they can't be reverted once done.

I don't think that's correct in this specific situation: adding this method and then dropping it later would have no after-effects.

I also think that setting up DataSourceUpdateLog as a proper payload brings DataSourceRepeater in line with all other repeaters.

@kaapstorm (Contributor, Author)

I'd still be in favor of sending payload for multiple records together as a part of the request handling mechanism

You have mentioned how one repeat record should represent one document. I'd suggest that one repeat record should represent one request. If you have multiple repeat records resulting in a single request, how should we process the response? If the request fails, did one repeat record fail, or did all of them? If there is a payload error, which repeat record should be cancelled?

I think the approach here, where we create one merged DataSourceUpdateLog, means that processing the response and creating a RepeatRecordAttempt remains the same. This keeps the footprint small in an area that could stand to be simplified, instead of making it more complicated.

But if there is a simpler way to do this, I would be happy to review the spec.

@mkangia (Contributor) commented Feb 9, 2025

If you have multiple repeat records resulting in a single request, how should we process the response? If the request fails, did one repeat record fail, or did all of them? If there is a payload error, which repeat record should be cancelled?

If the request fails, they all fail. If there is a payload error, we could consider sending them all individually. Or, to keep things simple, in case of any error, fall back to sending them all individually?
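Something like this (sketch; helper names made up):

    class PayloadError(Exception):
        """Stands in for 'the receiver rejected the merged payload'."""

    def send_clubbed(records, send_request, build_payload):
        try:
            send_request(build_payload(records))      # one clubbed request
        except PayloadError:
            for record in records:                    # fall back: one by one
                send_request(build_payload([record]))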

Labels
reindex/migration Reindex or migration will be required during or before deploy

4 participants