
DSEGOG-271: Handling failed ingestion #148

Merged: 25 commits into main, Feb 4, 2025

Conversation

@moonraker595 (Contributor) commented Nov 28, 2024

This PR addresses cases where "network blips" may occur and impact the ingestion process. It focuses on three situations:

  1. Failures that happen before ingestion: the API is still accessible, but the DB and ECHO are not, BEFORE the ingestion process starts.
  2. Failures that happen mid-way through ingesting a NEW HDF file: some data has already been ingested before network issues render the DB, ECHO, or both unavailable.
  3. Failures that happen mid-way through ingesting an UPDATED HDF file: as above, but for a file that updates an existing record rather than creating a new one.

Each of these can occur in three failure states, depending on which store cannot be contacted:

• Neither the DB nor ECHO is accessible
• The DB is accessible but ECHO is not
• ECHO is accessible but the DB is not
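To make the expected behaviour concrete, here is a minimal sketch (not the actual API code; the function and names are illustrative) mapping store availability to the HTTP status the API should return, following the state table below:

```python
def expected_status(db_up: bool, echo_up: bool) -> int:
    """Map DB/ECHO availability to the HTTP status the API should return.

    If the DB is down, the user cannot be authenticated (or records
    cannot be written), so the request fails with a 500. If only ECHO
    is down, ECHO-backed channels (waveforms, images) are marked as
    failed but the rest of the file is still ingested, so a 201 is
    returned.
    """
    if not db_up:
        return 500  # "Database error" / authentication failure
    if not echo_up:
        return 201  # partial success: failed channels recorded
    return 201      # full success


print(expected_status(db_up=True, echo_up=False))   # 201
print(expected_status(db_up=False, echo_up=True))   # 500
```

Note that ECHO-only failures still yield a 201 because the response body lists the failed channels, so the client is not misled about what was stored.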

To help with manual debugging of these scenarios, an application was created to generate and send newly created HDF files: https://github.com/ral-facilities/operationsgateway-hdf5-generator

From this, the state table below can be constructed.

| No | State | Mitigation | Response | Tested by |
|----|-------|------------|----------|-----------|
| 1.1 | The DB and ECHO are not accessible before ingestion starts. | The API is unable to authenticate the user, so the request fails its pre-flight checks and is thrown out before any endpoint code runs. | 500 | Manual debugging |
| 1.2 | The DB is accessible but ECHO is not, before ingestion starts. | Any channels that require ECHO (such as waveforms and images) are added to the list of failed channels and are not added to the DB. The upload to ECHO happens before the DB insertions, so there is nothing to roll back. Other channels are processed and the file's metadata is updated in the DB. | 201 | test_echo_failure_on_start |
| 1.3 | The DB is not accessible but ECHO is, before ingestion starts. | The API is unable to authenticate the user; same outcome as 1.1. | 500 | Manual debugging |
| 2.1 | The DB and ECHO become inaccessible mid-way through ingesting a NEW .h5 file. | If both become unavailable after some channels have already been updated and the user has already been authenticated, ECHO fails first and its channels are removed from the record; the DB insertion then fails and a 500 is returned along with "Database error". | 500 | test_partial_s3_and_db_upload_failure |
| 2.2 | The DB is accessible but ECHO is not, mid-way through ingesting a NEW .h5 file. | Same outcome as 1.2. | 201 | test_partial_s3_upload_failure |
| 2.3 | The DB is not accessible but ECHO is, mid-way through ingesting a NEW .h5 file. | Files are uploaded to ECHO but not recorded in the DB; a 500 is returned along with "Database error". THIS WILL MEAN DANGLING DATA IN ECHO. This is difficult to test: the only way to force an error mid-way through ingestion is via the unit tests, which clean up all test files afterwards. | 500 | test_partial_db_failure |
| 3.1 | The DB and ECHO become inaccessible mid-way through ingesting an UPDATED .h5 file. | Same outcome as 2.1. | 500 | Manual debugging |
| 3.2 | The DB is accessible but ECHO is not, mid-way through ingesting an UPDATED .h5 file. | Same outcome as 1.2. | 201 | Manual debugging |
| 3.3 | The DB is not accessible but ECHO is, mid-way through ingesting an UPDATED .h5 file. | Same outcome as 2.3. | 500 | Manual debugging |
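The "mid-way" rows above are exercised in the unit tests with side effects on patched upload functions. A self-contained sketch of that technique (the names `ingest_channels` and `upload_file` are illustrative stand-ins for the patched S3 calls, not the actual OperationsGateway API):

```python
# Simulate a "network blip": the first upload succeeds, then ECHO
# becomes unavailable part-way through ingesting the remaining channels.
from unittest import mock


class EchoUnavailableError(Exception):
    """Stand-in for the exception raised when ECHO cannot be reached."""


def ingest_channels(channels, upload_file):
    """Upload each channel; collect the names of any that fail."""
    stored, failed = [], []
    for name, payload in channels:
        try:
            upload_file(name, payload)
            stored.append(name)
        except EchoUnavailableError:
            failed.append(name)
    return stored, failed


# side_effect list: succeed once, then raise on every later call.
fake_upload = mock.Mock(
    side_effect=[None, EchoUnavailableError(), EchoUnavailableError()],
)

stored, failed = ingest_channels(
    [("trace_1", b".."), ("image_1", b".."), ("image_2", b"..")],
    fake_upload,
)
print(stored)  # ['trace_1']
print(failed)  # ['image_1', 'image_2']
```

Because the failed channels are collected rather than raised, the record can still be written with the surviving channels and a 201 returned, matching rows 1.2, 2.2, and 3.2.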

codecov bot commented Nov 29, 2024

Codecov Report

Attention: Patch coverage is 93.90244% with 5 lines in your changes missing coverage. Please review.

Project coverage is 95.62%. Comparing base (ad8f996) to head (70bbd4e).
Report is 26 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| operationsgateway_api/src/routes/ingest_data.py | 83.33% | 2 Missing and 1 partial ⚠️ |
| operationsgateway_api/src/records/record.py | 60.00% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #148      +/-   ##
==========================================
- Coverage   95.67%   95.62%   -0.05%     
==========================================
  Files          73       74       +1     
  Lines        3766     3838      +72     
  Branches      659      692      +33     
==========================================
+ Hits         3603     3670      +67     
- Misses        113      116       +3     
- Partials       50       52       +2     


@patrick-austin (Contributor) left a comment

EDIT: I almost forgot to mention, I've only looked at the changes from 6a190a3 onwards - I had a look at the git graph, and the ones before that also appear in the branch for #147 . So I figured they would be covered and reviewed by that PR. If you want I can review them as part of this one as well, just let me know.

Looks good on the whole, especially the in-depth testing with the extensive use of side effects on the patched functions.

I had a few queries in the in line comments, but most are minor - the only one I feel (somewhat) strongly about is that we should be calling log.error in mongo_error_handling.py like we do elsewhere.

In terms of coverage I mentioned one of the places, the other two seem like either defensive programming (the remove_channel else block) or otherwise would be difficult to reach (having multiple failure reasons for a single channel - setting up the mocks to get one failure seems like enough of an ask), so I think it's OK to leave these uncovered if you're happy to.
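For reference, a hedged sketch of what the requested `log.error` change in `mongo_error_handling.py` might look like. The decorator name, `DatabaseError` type, and caught exception are assumptions based on this discussion, not the merged code:

```python
import functools
import logging

log = logging.getLogger(__name__)


class DatabaseError(Exception):
    """Raised when a MongoDB operation fails."""


def mongodb_error_handling(operation: str):
    """Wrap a DB call, log the failure, and re-raise as DatabaseError."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except ConnectionError as exc:
                # Call log.error before raising, matching the rest of
                # the codebase, so failures are visible in the API logs.
                log.error("Database error during %s: %s", operation, exc)
                raise DatabaseError(
                    f"Database error during {operation}",
                ) from exc
        return wrapper
    return decorator


@mongodb_error_handling("insert")
def insert_record(record):
    # Simulated failure for illustration only.
    raise ConnectionError("mongod unreachable")
```

Logging at the point the exception is translated means the original driver error is captured even though callers only see the wrapped `DatabaseError`.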

Files with review comments:

- operationsgateway_api/src/mongo/interface.py
- operationsgateway_api/src/mongo/mongo_error_handling.py
- operationsgateway_api/src/records/record.py
- test/mongo/test_mongo_error_handling.py
@moonraker595 (Contributor, Author) commented Feb 4, 2025

Putting this here for historical purposes:
Having run `poetry update` and committed the updated lock file, the CI was failing due to perceived LDAP issues and a failure in running the ingestion script, even though everything worked fine locally. The issue was that a recent update to `certifi` (most likely pulled in by the `requests` package) was causing it; downgrading the version confirmed this.

Having come back to the issue the next day and run another `poetry update`, another package had been updated (`paramiko`), which fixed the issue. So basically... the thing fixed itself!

@moonraker595 moonraker595 merged commit bfbbab9 into main Feb 4, 2025
6 checks passed
@moonraker595 moonraker595 deleted the DSEGOG-271-Handling-Failed-Ingestion branch February 4, 2025 11:39
3 participants