Retry OpenAlex SSL exceptions #48

Merged
merged 1 commit into from
Jun 24, 2024
edsu (Contributor) commented Jun 24, 2024

I noticed that I hit some SSL exceptions when harvesting more data from
OpenAlex (AIRFLOW_VAR_DEV_LIMIT=10000).

```
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='api.openalex.org', port=443): Max retries exceeded with url: /authors/https://orcid.org/0000-0001-5838-5335 (Caused by SSLError(SSLEOFError(8, '[SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1000)')))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/models/taskinstance.py", line 465, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/models/taskinstance.py", line 432, in _execute_callable
    return execute_callable(context=context, **execute_callable_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/models/baseoperator.py", line 401, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/decorators/base.py", line 265, in execute
    return_value = super().execute(context)
                   ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/models/baseoperator.py", line 401, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/operators/python.py", line 235, in execute
    return_value = self.execute_callable()
                   ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/operators/python.py", line 252, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/airflow/rialto_airflow/dags/harvest.py", line 59, in openalex_harvest_dois
    openalex.doi_orcids_pickle(authors_csv, pickle_file, limit=dev_limit)
  File "/opt/airflow/rialto_airflow/harvest/openalex.py", line 22, in doi_orcids_pickle
    orcid_dois[orcid] = list(dois_from_orcid(orcid))
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/airflow/rialto_airflow/harvest/openalex.py", line 41, in dois_from_orcid
    author_resp = requests.get(
                  ^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/requests/adapters.py", line 698, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='api.openalex.org', port=443): Max retries exceeded with url: /authors/https://orcid.org/0000-0001-5838-5335 (Caused by SSLError(SSLEOFError(8, '[SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1000)')))
[2024-06-24, 11:14:31 UTC] {taskinstance.py:1206} INFO - Marking task as FAILED. dag_id=harvest, task_id=openalex_harvest_dois, run_id=manual__2024-06-24T11:02:02.383856+00:00, execution_date=20240624T110202, start_date=20240624T110205, end_date=20240624T111431
[2024-06-24, 11:14:31 UTC] {standard_task_runner.py:110} ERROR - Failed to execute job 222 for task openalex_harvest_dois (HTTPSConnectionPool(host='api.openalex.org', port=443): Max retries exceeded with url: /authors/https://orcid.org/0000-0001-5838-5335 (Caused by SSLError(SSLEOFError(8, '[SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1000)'))); 86)
[2024-06-24, 11:14:31 UTC] {local_task_job_runner.py:240} INFO - Task exited with return code 1
```

This commit uses tenacity to retry these requests, waiting a random
1-5 seconds between attempts and giving up after 60 seconds of trying.
We may want to adjust these values based on how well they work. Only
SSLError is retried for now, so that other errors still surface and we
can get insight into what else we might encounter.
@edsu edsu force-pushed the openalex-retries branch from b8d8976 to 7441c14 Compare June 24, 2024 14:17
@jacobthill jacobthill merged commit 95e1ba1 into main Jun 24, 2024
1 check passed
@jacobthill jacobthill deleted the openalex-retries branch June 24, 2024 14:46