Added batch creation of entities #345

Draft
wants to merge 1 commit into
base: main

Sujanadh
Contributor

Updates:

Creating a large number of entities sometimes takes a long time to respond, so batch creation improves performance a little.
Used a Semaphore to limit the number of simultaneous operations. For now I am using:

  • batch_size = 5000
  • concurrency = 5
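
As a rough illustration of the pattern described above, a semaphore-bounded batch upload could look like the sketch below. It is not the code from this PR: `upload_batch` is a hypothetical stand-in for the real request to Central, and `chunk` and `create_entities_in_batches` are illustrative names chosen for this example.

```python
import asyncio


def chunk(items: list, size: int) -> list[list]:
    """Split items into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


async def upload_batch(batch: list[dict]) -> int:
    """Hypothetical placeholder for the real HTTP call; returns the count sent."""
    await asyncio.sleep(0)  # simulate awaiting a network response
    return len(batch)


async def create_entities_in_batches(
    entities: list[dict],
    batch_size: int = 5000,
    concurrency: int = 5,
) -> int:
    """Upload all entities in batches, with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def limited(batch: list[dict]) -> int:
        # The semaphore blocks here once `concurrency` uploads are running.
        async with semaphore:
            return await upload_batch(batch)

    counts = await asyncio.gather(
        *(limited(batch) for batch in chunk(entities, batch_size))
    )
    return sum(counts)


entities = [{"label": f"entity-{i}"} for i in range(12_000)]
total_uploaded = asyncio.run(create_entities_in_batches(entities))
```

With the values above, 12,000 entities are split into three batches (5000, 5000, 2000), all of which fit under the concurrency limit of 5.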

Member

@spwoodcock spwoodcock left a comment

Interesting - so we can POST multiple entities at once via simultaneous calls (bulk upload in batches of 5000).

My first thought is: will this create inconsistencies in ODK, since we are doing simultaneous inserts? I also wonder whether error handling works as expected.

Definitely not opposed to this idea though: do you have a rough idea of how much performance we gain doing this? (in approx seconds of time saved?)

It could be worth it! Plus Central is pretty robust, so I imagine the simultaneous inserts shouldn't be a big issue.
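
One way to get a rough number for the time saved is to time the sequential and concurrent paths against the same batches. The sketch below only simulates the upload with a fixed delay, so `fake_upload` and `SIMULATED_LATENCY` are assumptions for illustration, not measurements of Central.

```python
import asyncio
import time

SIMULATED_LATENCY = 0.05  # assumed per-batch response time, in seconds


async def fake_upload(batch: list[int]) -> None:
    """Hypothetical upload that just waits for the simulated latency."""
    await asyncio.sleep(SIMULATED_LATENCY)


async def time_sequential(batches: list[list[int]]) -> float:
    """Upload batches one after another and return elapsed seconds."""
    start = time.perf_counter()
    for batch in batches:
        await fake_upload(batch)
    return time.perf_counter() - start


async def time_concurrent(batches: list[list[int]], concurrency: int = 5) -> float:
    """Upload batches with bounded concurrency and return elapsed seconds."""
    semaphore = asyncio.Semaphore(concurrency)

    async def limited(batch: list[int]) -> None:
        async with semaphore:
            await fake_upload(batch)

    start = time.perf_counter()
    await asyncio.gather(*(limited(batch) for batch in batches))
    return time.perf_counter() - start


batches = [[i] for i in range(5)]
sequential_seconds = asyncio.run(time_sequential(batches))
concurrent_seconds = asyncio.run(time_concurrent(batches))
print(f"sequential: {sequential_seconds:.2f}s, concurrent: {concurrent_seconds:.2f}s")
```

Swapping `fake_upload` for the real request would give the actual difference for a given entity list.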

@Sujanadh Sujanadh marked this pull request as draft February 20, 2025 08:36
@spwoodcock
Member

spwoodcock commented Feb 20, 2025

Related: what if 4/5 batches upload, but batch 5/5 fails? We have a partially uploaded set of entities then.

I would assume that if this were done in one operation, the invalid entity would roll back the inserts and result in an empty entity list.

So the question is: is the upload idempotent to handle this scenario? If we try again and batch upload, will it overwrite the same entities, or will it end up inserting new additional entities?

(I have a feeling it won't work out that the entities are the same on the second upload, and we will end up with duplicates)
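
To make the partial-failure scenario concrete, here is a minimal sketch of how the caller could at least find out which batches failed, using `asyncio.gather(..., return_exceptions=True)`. The `upload_batch` function is hypothetical and is rigged so that the last of five batches fails; nothing here reflects how Central itself behaves.

```python
import asyncio


async def upload_batch(index: int, batch: list[dict]) -> int:
    """Hypothetical upload; the batch at index 4 fails to mimic the 4/5 case."""
    await asyncio.sleep(0)
    if index == 4:
        raise RuntimeError("simulated upload failure")
    return len(batch)


async def upload_all(
    batches: list[list[dict]], concurrency: int = 5
) -> tuple[list[int], list[int]]:
    """Return (indices of batches that succeeded, indices that failed)."""
    semaphore = asyncio.Semaphore(concurrency)

    async def limited(index: int, batch: list[dict]) -> int:
        async with semaphore:
            return await upload_batch(index, batch)

    # return_exceptions=True keeps one failure from hiding the other results.
    results = await asyncio.gather(
        *(limited(i, b) for i, b in enumerate(batches)),
        return_exceptions=True,
    )
    succeeded = [i for i, r in enumerate(results) if not isinstance(r, BaseException)]
    failed = [i for i, r in enumerate(results) if isinstance(r, BaseException)]
    return succeeded, failed


batches = [[{"label": f"b{n}-e{i}"} for i in range(3)] for n in range(5)]
succeeded, failed = asyncio.run(upload_all(batches))
```

Retrying only the indices in `failed` would avoid re-sending batches that were already stored, but it does not settle whether a repeated upload overwrites or duplicates entities; that still needs testing against Central.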

If the answer to the above is that we only save a second or two, then it may not be worth the risk introduced. In that scenario, there is no need to investigate and answer these questions.

@Sujanadh
Contributor Author

That's a well-thought-out question 🙌 Yeah, I guess we might not be able to roll back all the entities once they have been created in the batch process, though I am not sure. I can't answer those questions without testing, and they are definitely all important to consider 👍. That said, using concurrent asynchronous requests definitely improves performance: it is almost 15 seconds faster.

Labels
None yet
Projects
None yet
2 participants