
DSEGOG-271: Handling failed ingestion #148

Merged: 25 commits into main, Feb 4, 2025

Conversation

@moonraker595 (Contributor) commented Nov 28, 2024

This PR addresses cases where "network blips" may occur and impact the ingestion process. It focuses on three situations:

  1. Failures that happen before ingestion: the API is still accessible, but the DB and ECHO are not, BEFORE the ingestion process starts.
  2. Failures that happen mid-way through ingesting a NEW HDF file: some data has already been ingested before network issues render the DB, ECHO, or both unavailable.
  3. Failures that happen mid-way through ingesting an UPDATED HDF file: as above, but for a file that updates an existing record rather than creating a new one.

Each of these can occur in three failure states, depending on which store cannot be contacted:

• Neither the DB nor ECHO is accessible
• The DB is accessible but ECHO is not
• ECHO is accessible but the DB is not
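To make the expected behaviour concrete, here is a minimal sketch (not the actual API code; the function and names are illustrative) mapping store availability to the HTTP status the API should return, following the state table below:

```python
def expected_status(db_up: bool, echo_up: bool) -> int:
    """Map DB/ECHO availability to the HTTP status the API should return.

    If the DB is down, the user cannot be authenticated (or records
    cannot be written), so the request fails with a 500. If only ECHO
    is down, ECHO-backed channels (waveforms, images) are marked as
    failed but the rest of the file is still ingested, so a 201 is
    returned.
    """
    if not db_up:
        return 500  # "Database error" / authentication failure
    if not echo_up:
        return 201  # partial success: failed channels recorded
    return 201      # full success


print(expected_status(db_up=True, echo_up=False))   # 201
print(expected_status(db_up=False, echo_up=True))   # 500
```

Note that ECHO-only failures still yield a 201 because the response body lists the failed channels, so the client is not misled about what was stored.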

To help with manual debugging of these scenarios, an application was created to generate and send newly created HDF files: https://github.com/ral-facilities/operationsgateway-hdf5-generator

From this, the state table below can be constructed.

| No | State | Mitigation | Response | Tested by |
|----|-------|------------|----------|-----------|
| 1.1 | The DB and ECHO are not accessible before ingestion starts. | The API is unable to authenticate the user, so the request fails its pre-flight checks and is thrown out before any endpoint code runs. | 500 | Manual debugging |
| 1.2 | The DB is accessible but ECHO is not, before ingestion starts. | Any channels that require ECHO (such as waveforms and images) are added to the list of failed channels and are not added to the DB. The upload to ECHO happens before the DB insertions, so there is nothing to roll back. Other channels are processed and the file's metadata is updated in the DB. | 201 | test_echo_failure_on_start |
| 1.3 | The DB is not accessible but ECHO is, before ingestion starts. | The API is unable to authenticate the user; same outcome as 1.1. | 500 | Manual debugging |
| 2.1 | The DB and ECHO become inaccessible mid-way through ingesting a NEW .h5 file. | If both become unavailable after some channels have already been updated and the user has already been authenticated, ECHO fails first and its channels are removed from the record; the DB insertion then fails and a 500 is returned along with "Database error". | 500 | test_partial_s3_and_db_upload_failure |
| 2.2 | The DB is accessible but ECHO is not, mid-way through ingesting a NEW .h5 file. | Same outcome as 1.2. | 201 | test_partial_s3_upload_failure |
| 2.3 | The DB is not accessible but ECHO is, mid-way through ingesting a NEW .h5 file. | Files are uploaded to ECHO but not recorded in the DB; a 500 is returned along with "Database error". THIS WILL MEAN DANGLING DATA IN ECHO. This is difficult to test: the only way to force an error mid-way through ingestion is via the unit tests, which clean up all test files afterwards. | 500 | test_partial_db_failure |
| 3.1 | The DB and ECHO become inaccessible mid-way through ingesting an UPDATED .h5 file. | Same outcome as 2.1. | 500 | Manual debugging |
| 3.2 | The DB is accessible but ECHO is not, mid-way through ingesting an UPDATED .h5 file. | Same outcome as 1.2. | 201 | Manual debugging |
| 3.3 | The DB is not accessible but ECHO is, mid-way through ingesting an UPDATED .h5 file. | Same outcome as 2.3. | 500 | Manual debugging |
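The "mid-way" rows above are exercised in the unit tests with side effects on patched upload functions. A self-contained sketch of that technique (the names `ingest_channels` and `upload_file` are illustrative stand-ins for the patched S3 calls, not the actual OperationsGateway API):

```python
# Simulate a "network blip": the first upload succeeds, then ECHO
# becomes unavailable part-way through ingesting the remaining channels.
from unittest import mock


class EchoUnavailableError(Exception):
    """Stand-in for the exception raised when ECHO cannot be reached."""


def ingest_channels(channels, upload_file):
    """Upload each channel; collect the names of any that fail."""
    stored, failed = [], []
    for name, payload in channels:
        try:
            upload_file(name, payload)
            stored.append(name)
        except EchoUnavailableError:
            failed.append(name)
    return stored, failed


# side_effect list: succeed once, then raise on every later call.
fake_upload = mock.Mock(
    side_effect=[None, EchoUnavailableError(), EchoUnavailableError()],
)

stored, failed = ingest_channels(
    [("trace_1", b".."), ("image_1", b".."), ("image_2", b"..")],
    fake_upload,
)
print(stored)  # ['trace_1']
print(failed)  # ['image_1', 'image_2']
```

Because the failed channels are collected rather than raised, the record can still be written with the surviving channels and a 201 returned, matching rows 1.2, 2.2, and 3.2.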

codecov bot commented Nov 29, 2024

Codecov Report

Attention: Patch coverage is 93.90244% with 5 lines in your changes missing coverage. Please review.

Project coverage is 95.62%. Comparing base (ad8f996) to head (70bbd4e).
Report is 26 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| operationsgateway_api/src/routes/ingest_data.py | 83.33% | 2 Missing and 1 partial ⚠️ |
| operationsgateway_api/src/records/record.py | 60.00% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #148      +/-   ##
==========================================
- Coverage   95.67%   95.62%   -0.05%     
==========================================
  Files          73       74       +1     
  Lines        3766     3838      +72     
  Branches      659      692      +33     
==========================================
+ Hits         3603     3670      +67     
- Misses        113      116       +3     
- Partials       50       52       +2     


@patrick-austin (Contributor) left a comment

EDIT: I almost forgot to mention, I've only looked at the changes from 6a190a3 onwards - I had a look at the git graph, and the ones before that also appear in the branch for #147 . So I figured they would be covered and reviewed by that PR. If you want I can review them as part of this one as well, just let me know.

Looks good on the whole, especially the in-depth testing with the extensive use of side effects on the patched functions.

I had a few queries in the in line comments, but most are minor - the only one I feel (somewhat) strongly about is that we should be calling log.error in mongo_error_handling.py like we do elsewhere.

In terms of coverage I mentioned one of the places, the other two seem like either defensive programming (the remove_channel else block) or otherwise would be difficult to reach (having multiple failure reasons for a single channel - setting up the mocks to get one failure seems like enough of an ask), so I think it's OK to leave these uncovered if you're happy to.
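For reference, a hedged sketch of what the requested `log.error` change in `mongo_error_handling.py` might look like. The decorator name, `DatabaseError` type, and caught exception are assumptions based on this discussion, not the merged code:

```python
import functools
import logging

log = logging.getLogger(__name__)


class DatabaseError(Exception):
    """Raised when a MongoDB operation fails."""


def mongodb_error_handling(operation: str):
    """Wrap a DB call, log the failure, and re-raise as DatabaseError."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except ConnectionError as exc:
                # Call log.error before raising, matching the rest of
                # the codebase, so failures are visible in the API logs.
                log.error("Database error during %s: %s", operation, exc)
                raise DatabaseError(
                    f"Database error during {operation}",
                ) from exc
        return wrapper
    return decorator


@mongodb_error_handling("insert")
def insert_record(record):
    # Simulated failure for illustration only.
    raise ConnectionError("mongod unreachable")
```

Logging at the point the exception is translated means the original driver error is captured even though callers only see the wrapped `DatabaseError`.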

Files with review comments:

- operationsgateway_api/src/mongo/interface.py
- operationsgateway_api/src/mongo/mongo_error_handling.py
- operationsgateway_api/src/records/record.py
- test/mongo/test_mongo_error_handling.py
@moonraker595 (Contributor, Author) commented Feb 4, 2025

Putting this here for historical purposes:
Having run `poetry update` and committed the updated lock file, the CI was failing due to perceived LDAP issues and a failure in running the ingestion script, even though everything worked fine locally. The issue was that a recent update to `certifi` (most likely pulled in by the `requests` package) was causing it; downgrading the version confirmed this.

Having come back to the issue the next day and run another `poetry update`, another package had been updated (`paramiko`), which fixed the issue. So basically... the thing fixed itself!

@moonraker595 moonraker595 merged commit bfbbab9 into main Feb 4, 2025
6 checks passed
@moonraker595 moonraker595 deleted the DSEGOG-271-Handling-Failed-Ingestion branch February 4, 2025 11:39
3 participants