Retry batch delete blob on 503 #1277

Open

maingoh opened this issue May 21, 2024 · 1 comment
Labels
api: storage · priority: p3 · type: feature request

Comments


maingoh commented May 21, 2024

Is your feature request related to a problem? Please describe.

When deleting a lot of blobs using the batch API, it sometimes raises `ServiceUnavailable: 503 BATCH contentid://None: We encountered an internal error. Please try again.`. This is undesirable, as it aborts in the middle of a big deletion job.

Describe the solution you'd like

I tried setting the retry parameter at the client level (`client.get_bucket(bucket_name, retry=retry, timeout=600)`) and at the blob level (`blob.delete(retry=retry, timeout=600)`), and even forced `if_generation_match=blob.generation`. No retry seems to be performed. The class does not seem to use any retry here:

response = self._client._base_connection._make_request(

Either the client should support retries here, or at the very least the batch object should expose the blobs (sub-requests) that could not be deleted so that we can retry them manually.
A manual retry of the full batch (in a for loop) does not work, because some of the blobs in the batch were already deleted on the first attempt, which raises a 404 on the second attempt.


Retry automatically, or give the user the ability to retry only the requests that failed.
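
For reference, a minimal workaround sketch under the current behavior (the bucket and object names below are placeholders, not taken from the issue): issue the deletes in a batch, and if the batch raises a 503, fall back to deleting the remaining blobs one by one, treating a 404 as "already deleted on the first attempt".

```python
from google.api_core.exceptions import NotFound, ServiceUnavailable
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")            # placeholder bucket name
blob_names = ["shard-0001", "shard-0002"]      # placeholder object names

try:
    # Queue all deletes and send them as a single batch request.
    with client.batch():
        for name in blob_names:
            bucket.delete_blob(name)
except ServiceUnavailable:
    # The batch does not report which sub-requests failed, so retry each blob
    # individually and ignore the ones that were already deleted.
    for name in blob_names:
        try:
            bucket.delete_blob(name)
        except NotFound:
            pass
```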

product-auto-label bot added the api: storage label May 21, 2024
cojenco added the type: feature request and priority: p3 labels May 21, 2024
@StrtCoding

Hi, has anyone found a solution for this?

sadovnychyi added a commit to sadovnychyi/beam that referenced this issue Jan 8, 2025
A transient error might occur when writing a lot of shards to GCS, and right now
the GCS IO does not have any retry logic in place:

https://github.com/apache/beam/blob/a06454a2/sdks/python/apache_beam/io/gcp/gcsio.py#L269

It means that in such cases the entire bundle of elements fails, and then Beam
itself will attempt to retry the entire bundle, and will fail the job if it
exceeds the number of retries.

This change adds new logic to retry only failed requests, and uses the typical
exponential backoff strategy.

Note that this change accesses a private method (`_predicate`) of the retry
object, which we could avoid by basically copying the logic over here. But the
existing code already accesses the `_responses` property, so maybe it's not a
big deal.

https://github.com/apache/beam/blob/b4c3a4ff/sdks/python/apache_beam/io/gcp/gcsio.py#L297

Existing (unresolved) issue in the GCS client library:

googleapis/python-storage#1277
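
For illustration, a rough sketch of the retry-only-the-failed-requests idea described above (this is not the Beam implementation; `delete_batch` and its `(name, exception_or_None)` result format are hypothetical):

```python
import random
import time


def delete_batch_with_retry(delete_batch, blob_names, attempts=5, base_delay=1.0):
    """Call `delete_batch(names) -> list of (name, exception_or_None)`, retrying only failures."""
    pending = list(blob_names)
    for attempt in range(attempts):
        results = delete_batch(pending)
        # Keep only the names whose sub-request failed.
        pending = [name for name, exc in results if exc is not None]
        if not pending:
            return
        # Exponential backoff with jitter before retrying just the failed names.
        time.sleep(base_delay * (2 ** attempt) * (0.5 + random.random()))
    raise RuntimeError(f"{len(pending)} deletes still failing after {attempts} attempts")
```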
Abacn pushed a commit to apache/beam that referenced this issue Jan 10, 2025
* Add retry logic to each batch method of the GCS IO


* Catch correct exception type in `_batch_with_retry`

The `RetryError` would always be raised, since the retry decorator catches all
HTTP-related exceptions.

* Update changelog with GCSIO retry logic fix
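
As a minimal sketch of the exception-handling point in the second bullet (assuming the batched call is wrapped with `google.api_core.retry.Retry`, which is an assumption here): once retries are exhausted, api_core raises `RetryError` with the last HTTP error as its cause, so that is the type the caller has to catch.

```python
from google.api_core.exceptions import RetryError, ServiceUnavailable
from google.api_core.retry import Retry, if_exception_type

# Retry only on 503s, with a short overall timeout so the example finishes quickly.
retryer = Retry(predicate=if_exception_type(ServiceUnavailable),
                initial=0.1, maximum=1.0, timeout=5.0)


def flaky_delete():
    # Placeholder for the batched delete request that keeps returning 503.
    raise ServiceUnavailable("503 BATCH contentid://None")


try:
    retryer(flaky_delete)()
except RetryError as exc:
    # Retries exhausted; the original ServiceUnavailable is available as the cause.
    print("gave up:", exc.cause)
```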