
Record ID 99125158555106421 was not indexed - June 11th #2411

Closed
1 of 3 tasks
christinach opened this issue Jul 11, 2024 · 5 comments
Assignees
Labels
bug 🐛 The application does not work as expected because of a defect investigate Tickets related to work that needs investigation

Comments


christinach commented Jul 11, 2024

Expected behavior

Record with ID 99125158555106421 was changed in Alma on June 11th. It was part of the file incremental_38205463170006421_20240611_180656[026]_new. As of today, with no additional updates, the catalog record should reflect the changes from June 11th.

Actual behavior

The record did not get indexed.

Further Notes

Nancy B. from E-resources reported this issue in the catalog channel. @mzelesky investigated the exported files in Alma during this period and found that the record did get sent from Alma.

The timestamp from the JSON file indicates that the record was last indexed on 2024-05-22.

```json
  "electronic_portfolio_s": [
    "{\"desc\":\" Available from 12/27/1890 until 12/31/1890.\",\"title\":\"CRL Open Access Newspapers\",\"url\":\"https://na05.alma.exlibrisgroup.com/view/uresolver/01PRI_INST/openurl?u.ignore_date_coverage=true&portfolio_pid=531019287200006421&Force_direct=true\",\"start\":\"1890\",\"end\":\"1890\",\"notes\":[]}",
    "{\"desc\":\" Available from 1884 until 1936.\",\"title\":\"NewspaperARCHIVE.com\",\"url\":\"https://na05.alma.exlibrisgroup.com/view/uresolver/01PRI_INST/openurl?u.ignore_date_coverage=true&portfolio_pid=53765186600006421&Force_direct=true\",\"start\":\"1884\",\"end\":\"1936\",\"notes\":[]}"
  ],
  "hashed_id_ssi": "2f904f131eec82c4",
  "_version_": 1799761628651061248,
  "timestamp": "2024-05-22T14:00:40.856Z"
```
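The staleness can be confirmed by comparing the Solr timestamp with the date encoded in the incremental file name. A minimal Ruby sketch, using only the two values quoted above (the comparison logic is illustrative, not bibdata code):

```ruby
require "json"
require "time"

# Reduced Solr document for record 99125158555106421 (fields quoted above).
solr_doc = JSON.parse('{"hashed_id_ssi":"2f904f131eec82c4","timestamp":"2024-05-22T14:00:40.856Z"}')

indexed_at = Time.parse(solr_doc["timestamp"])
# Date/time encoded in incremental_38205463170006421_20240611_180656[026]_new
changed_at = Time.utc(2024, 6, 11, 18, 6, 56)

puts "record is stale" if indexed_at < changed_at
# => prints "record is stale": last indexed 2024-05-22, changed 2024-06-11
```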

Impact of this bug

Users cannot find all the available issues for this record in the catalog.

Suggestion

  • Fix this one record:
    1. Create an XML file for the record.
    2. scp the file to bibdata-qa-worker1.
    3. Follow the [documentation on how to index an XML file](https://github.com/pulibrary/bibdata/blob/main/docs/test_indexing.md#scenario-1-test-indexing-a-specific-xml-file) and index the file.
    4. Repeat the same steps on bibdata-worker-staging1.
    5. Review the record in catalog-staging.princeton.edu and catalog-qa.princeton.edu (make sure catalog-qa.princeton.edu is pointing to the Solr collection you used to index the file on bibdata-qa-worker1 in step 3).
    6. If the record does not reflect the changes, review the logs on the VM, download the incremental file locally, and troubleshoot in your dev environment.
    7. If everything looks OK, index the record on bibdata-worker-prod1.
    8. Review the record in catalog.princeton.edu.
    9. If the record reflects the portfolio changes, follow up in the catalog channel with Nancy and close this ticket.
  • OR run the Alma updates since June 9th, 2024
  • OR run a full reindex
@christinach christinach added bug 🐛 The application does not work as expected because of a defect investigate Tickets related to work that needs investigation labels Jul 11, 2024
@christinach christinach self-assigned this Jul 11, 2024

christinach commented Jul 12, 2024


christinach commented Jul 17, 2024

We checked job_id 38206309060006421 from Alma with @mzelesky. That specific job failed. We found the ID in the webhook.
I looked into the database and, even though the job failed, the event was created with success: true.

  1. We should not create an Event if the job failed in Alma. Currently we only check whether the message body includes the Alma job names from our alma.yml configuration.
  2. We should also check whether the message_body["job_instance"]["status"]["value"] matches 'COMPLETED_FAILED' and skip it. For a failed job the JSON looks like the following:
```
"body:{\"id\":\"38206309060006421\",\"action\":\"JOB_END\",\"institution\":{\"value\":\"01PRI_INST\",\"desc\":\"Princeton University Library\"},\"time\":\"2024-06-11T19:40:52.051Z\",\"job_instance\":{\"id\":\"38206309060006421\",\"name\":\"Publishing Platform Job Incremental Publishing\",\"progress\":209.0,\"status\":{\"value\":\"COMPLETED_FAILED\",\"desc\":\"Completed with Errors\"},\"external_id\":\"38206612000006421\",\"submitted_by\":{\"value\":\"System\"},\"submit_time\":\"2024-06-11T18:00:12.461Z\",\"start_time\":\"2024-06-11T18:30:16.963Z\",\"end_time\":\"2024-06-11T19:40:52.051Z\",\"status_date\":\"2024-06-11Z\",\"alert\":[{\"value\":\"alert_general_error\",\"desc\":\"The job completed with errors. For more information view the report details (or contact Support using the process ID).\"}],\"counter\":[{\"type\":{\"value\":\"label.new.records\",\"desc\":\"New Records\"},\"value\":\"7\"},{\"type\":{\"value\":\"label.updated.records\",\"desc\":\"Updated Records\"},\"value\":\"432\"},{\"type\":{\"value\":\"label.deleted.records\",\"desc\":\"Deleted Records\"},\"value\":\"1\"},{\"type\":{\"value\":\"c.jobs.publishing.failed.publishing\",\"desc\":\"Unpublished failed records\"},\"value\":\"0\"},{\"type\":{\"value\":\"c.jobs.publishing.skipped\",\"desc\":\"Skipped records (update date changed but no data change)\"},\"value\":\"193\"},{\"type\":{\"value\":\"c.jobs.publishing.filtered_out\",\"desc\":\"Filtered records (not published due to filter)\"},\"value\":\"0\"},{\"type\":{\"value\":\"c.jobs.publishing.totalRecordsWrittenToFile\",\"desc\":\"Total records written to file\"},\"value\":\"0\"},{\"type\":{\"value\":\"FTP has failed.\",\"desc\":\"\"},\"value\":\"Download zip file for manual FTP.\"}],\"job_info\":{\"id\":\"S32986800410006421\",\"name\":\"Publishing Platform Job Incremental Publishing\",\"description\":\"Publishing Platform Job\",\"type\":{\"value\":\"SCHEDULED\",\"desc\":\"Scheduled\"},\"category\":{\"value\":\"PUBLISHING\",\"desc\":\"Publishing\"},\"link\":\"/almaws/v1/conf/jobs/S32986800410006421\"},\"link\":\"/almaws/v1/conf/jobs/S32986800410006421/instances/38206309060006421\"}}"
```

This is what we save in the Event as message_body.
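Point 2 above could be implemented as a guard on the webhook payload before an Event is created. A minimal sketch, assuming the payload is available as a JSON string (`failed_alma_job?` is a hypothetical name, not bibdata's actual API):

```ruby
require "json"

# Returns true when the Alma job instance in the webhook payload
# finished with status COMPLETED_FAILED.
def failed_alma_job?(message_body)
  status = JSON.parse(message_body).dig("job_instance", "status", "value")
  status == "COMPLETED_FAILED"
end

payload = '{"job_instance":{"status":{"value":"COMPLETED_FAILED","desc":"Completed with Errors"}}}'
puts failed_alma_job?(payload) # => true
```

The existing check on the Alma job names from alma.yml would stay; this guard would additionally skip Event creation when the job status is COMPLETED_FAILED.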


christinach commented Jul 17, 2024

The incremental file that includes the record that was not indexed is incremental_38205463170006421_20240611_180656[026]_new. The second part, after "incremental", is the job process ID from Alma. This incremental file was in lib_sftp. When I checked the AWS Lambda log for this job ID, it had status 'COMPLETED_FAILED'.

```
"body:{\"id\":\"38205463170006421\",\"action\":\"JOB_END\",\"institution\":{\"value\":\"01PRI_INST\",\"desc\":\"Princeton University Library\"},\"time\":\"2024-06-11T18:27:10.961Z\",\"job_instance\":{\"id\":\"38205463170006421\",\"name\":\"Publishing Platform Job Incremental Publishing\",\"progress\":101.3,\"status\":{\"value\":\"COMPLETED_FAILED\",\"desc\":\"Completed with Errors\"},\"external_id\":\"38205728880006421\",\"submitted_by\":{\"value\":\"System\"},\"submit_time\":\"2024-06-11T17:00:11.594Z\",\"start_time\":\"2024-06-11T17:20:13.584Z\",\"end_time\":\"2024-06-11T18:27:10.961Z\",\"status_date\":\"2024-06-11Z\",\"alert\":[{\"value\":\"alert_general_error\",\"desc\":\"The job completed with errors. For more information view the report details (or contact Support using the process ID).\"}],\"counter\":[{\"type\":{\"value\":\"label.new.records\",\"desc\":\"New Records\"},\"value\":\"66\"},{\"type\":{\"value\":\"label.updated.records\",\"desc\":\"Updated Records\"},\"value\":\"3453\"},{\"type\":{\"value\":\"label.deleted.records\",\"desc\":\"Deleted Records\"},\"value\":\"0\"},{\"type\":{\"value\":\"c.jobs.publishing.failed.publishing\",\"desc\":\"Unpublished failed records\"},\"value\":\"0\"},{\"type\":{\"value\":\"c.jobs.publishing.skipped\",\"desc\":\"Skipped records (update date changed but no data change)\"},\"value\":\"224\"},{\"type\":{\"value\":\"c.jobs.publishing.filtered_out\",\"desc\":\"Filtered records (not published due to filter)\"},\"value\":\"0\"},{\"type\":{\"value\":\"c.jobs.publishing.totalRecordsWrittenToFile\",\"desc\":\"Total records written to file\"},\"value\":\"0\"},{\"type\":{\"value\":\"FTP has failed.\",\"desc\":\"\"},\"value\":\"Download zip file for manual FTP.\"}],\"job_info\":{\"id\":\"S32986800410006421\",\"name\":\"Publishing Platform Job Incremental Publishing\",\"description\":\"Publishing Platform Job\",\"type\":{\"value\":\"SCHEDULED\",\"desc\":\"Scheduled\"},\"category\":{\"value\":\"PUBLISHING\",\"desc\":\"Publishing\"},\"link\":\"/almaws/v1/conf/jobs/S32986800410006421\"},\"link\":\"/almaws/v1/conf/jobs/S32986800410006421/instances/38205463170006421\"}}"
```

@mzelesky since this job failed in Alma how did it generate the file with the failed Job Id?


christinach commented Jul 17, 2024

The message_body with job_id '38205463170006421' is in bibdata event 7338, with no attached dump file.

  • Next step: check the timestamp of the incremental file in lib-sftp.

Update on 7/18/2024: I checked lib-sftp and the file has timestamp "Jun 12 14:05 (UTC)": -rw-r--r-- 1 almasftp pul_g 1417919 Jun 12 14:05 'incremental_38205463170006421_20240611_180656[026]_new.tar.gz'

So far my understanding is that:

  • The Alma job with process_id 38205463170006421 failed on 2024-06-11T18:27 (UTC) (this is the time we see in bibdata as 'finish' and in Alma as 'finished on')

[screenshot: bibdata events page – event 7338]

[screenshot: Alma job report page – process id 38205463170006421]

  • The webhook received the Alma job event with ID 38205463170006421 ("time": "2024-06-11T18:27:10.961Z").
  • Bibdata created event #7338 based on the job ID that it received from the webhook.
  • There was no file yet in lib-sftp (probably because there was no disk space; this is the issue we had during that week).
  • For some reason the failed Alma job generated an incremental file: incremental_38205463170006421_20240611_180656[026]_new.
  • The incremental file was transferred to lib-sftp on Jun 12 14:05 (UTC), probably once some disk space had been freed.
  • Bibdata did not attach the file to the event that was created the previous day. The bibdata event had the Alma process ID in its message_body and triggered a background job to fetch an incremental file with this job ID in the file name. This never happened, because the job failed in Alma and the file that was generated was only transferred a day later, when there was free space on lib-sftp.
    Alma continued creating failed jobs every hour, with the same effect on bibdata. Every time a successful job ran, bibdata would store the Alma process ID in a new Event, look for the transferred file in lib_sftp, and move forward. Bibdata does not look back at an event that was created the previous day and already finished, whose message_body contains an Alma process ID A, so it will never search lib-sftp for an incremental file A that arrives days later. The AWS SQS poller retains messages in the queue for 14 days.

The main issues here are:

  1. Alma had a failed job that sent a file a day later with the failed process ID in the filename. This file should have been included in the next successful Alma job, with the successful job ID in the filename. (I will discuss this with @mzelesky.)
    • Update on 07/19/2024: @mzelesky confirmed that the failed job in Alma should not have generated a file with these records. The records should have moved into the next successful job and been published as part of it.
  2. lib-sftp ran out of space (@acozine is working on setting up a scheduled cleanup job).
  3. Bibdata creates events for failed jobs.
    • I discussed this with @mzelesky. It would still be better to create the event for the failed Alma job in bibdata and track it in the UI. We can add a new attribute 'Alma job status' to the events table, populated with the ["status"]["value"] from the job_instance message body that we save from the webhook message. Next we can create a Datadog alert to let us know when there is an Alma job with status 'COMPLETED_FAILED' so we can investigate further. I will create a new ticket for this.

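The proposed 'Alma job status' attribute could be derived from the message body that bibdata already stores. A sketch of the idea (the Event struct and attribute name are illustrative; bibdata's actual schema may differ):

```ruby
require "json"

# Illustrative stand-in for bibdata's Event model with the proposed attribute.
Event = Struct.new(:message_body, :alma_job_status)

# Build an event, extracting the Alma job status from the webhook payload.
def build_event(raw_body)
  status = JSON.parse(raw_body).dig("job_instance", "status", "value")
  Event.new(raw_body, status)
end

body = '{"job_instance":{"status":{"value":"COMPLETED_FAILED","desc":"Completed with Errors"}}}'
event = build_event(body)

# A Datadog monitor could then alert on this condition:
puts "alert: failed Alma job" if event.alma_job_status == "COMPLETED_FAILED"
```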
@christinach

I created #2415 and #2416.
