Nuxeo API bug: harvesting fetch counts for a large Nuxeo collection fluctuate across harvesting runs - possibly an issue with deeply nested objects - revisit after the Nuxeo DB query work is completed #1166

Closed
christinklez opened this issue Jan 24, 2025 · 3 comments

@christinklez

Related: https://github.com/orgs/ucldc/projects/2/views/2?pane=issue&itemId=84872751&issue=ucldc%7Cnuxeo_merritt%7C12

==

Registry ID: 26713
https://calisphere-stage.cdlib.org/collections/26713/
This is a Nuxeo API bug.
Perhaps ask Nuxeo if they can provide guidance to query the database directly? Also press Nuxeo to address the API bug.

Expected counts for 26713, according to the doclist (as of 2024-09-04):
https://docs.google.com/spreadsheets/d/1_atOF_NRSNGFBktgecZoU3Hkiz-IsFQP/edit?gid=129213630#gid=129213630

  • main docs: 33,758
  • child objects: 883

==

Harvest attempt #1

Run ID: manual__2024-09-13T16:33:18+00:00 (this is the failed run, which never finished)
Fetch log: https://7a8067cb-3b99-477e-a883-7e311175a9b4.c3.us-west-2.airflow.amazonaws.com/dags/harvest_collection/grid?dag_run_id=manual__2024-09-13T16%3A33%3A18%2B00%3A00&tab=logs&task_id=fetching.fetch_collection
This fetch job took 53 minutes.

[2024-09-13, 17:16:48 UTC] {{logging_mixin.py:150}} INFO - 34025 parent items 2226 parent pages 731 child items 51 child pages
[2024-09-13, 17:16:48 UTC] {{logging_mixin.py:150}} INFO - 26713 : success,  2277 pages,  34756 items,  33733 solr items,   292 new items, solr count last updated: May 02, 2023 15:26:50.129314

Note: This harvest job did not complete. (A new job was started instead, after some content_harvest errors occurred.)

Harvest attempt #2

Run ID: manual__2024-09-16T15:56:49+00:00 (this one was successful)
Fetch log: https://7a8067cb-3b99-477e-a883-7e311175a9b4.c3.us-west-2.airflow.amazonaws.com/dags/harvest_collection/grid?dag_run_id=manual__2024-09-16T15%3A56%3A49%2B00%3A00&task_id=fetching.fetch_collection&tab=logs
This fetch job took 43 minutes.

[2024-09-16, 16:41:09 UTC] {{logging_mixin.py:150}} INFO - 33196 parent items 2222 parent pages 907 child items 62 child pages
[2024-09-16, 16:41:09 UTC] {{logging_mixin.py:150}} INFO - 26713 : success,  2284 pages,  34103 items,  33733 solr items, -537 lost items, solr count last updated: May 02, 2023 15:26:50.129314

Note: This harvest job did complete.
-stage counts from this job: 29,433
-prod counts (currently published): 30,720

UCI alerted us to the drop in counts, so we started a new harvest job.

Harvest attempt #3

Run ID: manual__2024-09-19T16:52:39+00:00
Fetch log: https://7a8067cb-3b99-477e-a883-7e311175a9b4.c3.us-west-2.airflow.amazonaws.com/dags/harvest_collection/grid?dag_run_id=manual__2024-09-19T16%3A52%3A39%2B00%3A00&task_id=fetching.fetch_collection&tab=logs
This fetch job took 42 minutes.

[2024-09-19, 17:35:52 UTC] {{logging_mixin.py:150}} INFO - 33541 parent items 2231 parent pages 785 child items 51 child pages
[2024-09-19, 17:35:52 UTC] {{logging_mixin.py:150}} INFO - 26713 : success,  2282 pages,  34326 items,  33733 solr items, -192 lost items, solr count last updated: May 02, 2023 15:26:50.129314

Note: This harvest job did complete.
-stage counts from this job: 25,763
-prod counts (currently published): 30,720

Harvest attempt #4

Before attempting this harvest, we decided to test whether there was an issue with harvesting "Deeply Nested Objects." We encountered this issue before when generating Nuxeo extent stats: Deeply Nested Objects weren't fully picked up.

We worked with Elvia/UCI to move the folders up one level. (The contents were previously separated into "Do Not Publish" and "Publish" folders; the contents of the Publish folder were moved up one level.) The harvesting results are below.

Run ID: manual__2024-09-24T18:32:22+00:00
Fetch log: https://7a8067cb-3b99-477e-a883-7e311175a9b4.c3.us-west-2.airflow.amazonaws.com/dags/harvest_collection/grid?dag_run_id=manual__2024-09-24T18%3A32%3A22%2B00%3A00&task_id=fetching.fetch_collection&tab=logs
This fetch job took 6 hours 1 minute.

[2024-09-25, 00:33:40 UTC] {{logging_mixin.py:150}} INFO - 33530 parent items 2227 parent pages 963 child items 60 child pages
[2024-09-25, 00:33:40 UTC] {{logging_mixin.py:150}} INFO - 26713 : success,  2287 pages,  34493 items,  33733 solr items, -203 lost items, solr count last updated: May 02, 2023 15:26:50.129314

Note: This harvest job did complete.
-stage counts from this job: 32,704
-prod counts (currently published): 30,720

Harvest attempt #5

Run ID: manual__2024-10-16T19:35:54+00:00
Fetch log: https://7a8067cb-3b99-477e-a883-7e311175a9b4.c3.us-west-2.airflow.amazonaws.com/dags/harvest_collection/grid?dag_run_id=manual__2024-10-16T19%3A35%3A54%2B00%3A00&task_id=fetching.fetch_collection&tab=logs

[2024-10-16, 20:41:24 UTC] {{logging_mixin.py:188}} INFO - 33519 parent items 2225 parent pages 609 child items 42 child pages
[2024-10-16, 20:41:24 UTC] {{logging_mixin.py:188}} INFO - 26713 : success,  2267 pages,  34128 items,  33733 solr items, -214 lost items, solr count last updated: May 02, 2023 15:26:50.129314

Harvest attempt #6

Run ID: manual__2024-10-21T06%3A36%3
Fetch log: https://7a8067cb-3b99-477e-a883-7e311175a9b4.c3.us-west-2.airflow.amazonaws.com/dags/harvest_collection/grid?base_date=2024-10-22T19%3A35%3A54Z&dag_run_id=manual__2024-10-21T06%3A36%3

Harvest attempt #7: 2025-01-16

Run ID: manual__2025-01-16T20:59:15+00:00
Fetch log: https://7a8067cb-3b99-477e-a883-7e311175a9b4.c3.us-west-2.airflow.amazonaws.com/dags/harvest_collection/grid?dag_run_id=manual__2025-01-16T20%3A59%3A15%2B00%3A00&task_id=fetching.fetch_collection&tab=logs

[2025-01-16, 21:38:44 UTC] {logging_mixin.py:188} INFO - 33606 parent items 2224 parent pages 1171 child items 78 child pages
[2025-01-16, 21:38:44 UTC] {logging_mixin.py:188} INFO - 26713 : success,  2302 pages,  34777 items,  33733 solr items, -127 lost items, solr count last updated: May 02, 2023 15:26:50.129314

Note: This harvest job did complete.
-stage counts from this job: 27,439
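
To make the fluctuation easier to see, here is a minimal sketch that compares the parent and child counts from each fetch log above against the expected doclist counts (33,758 main docs, 883 child objects). The numbers are copied from this issue; the names `FETCH_COUNTS` and `deviations` are hypothetical and are not part of the harvester.

```python
# Expected counts from the doclist (as of 2024-09-04), as quoted in this issue.
EXPECTED_PARENTS = 33758
EXPECTED_CHILDREN = 883

# (attempt label, parent items, child items), copied from the fetch logs above.
# Attempt #6 is omitted because no counts were recorded for it.
FETCH_COUNTS = [
    ("#1", 34025, 731),
    ("#2", 33196, 907),
    ("#3", 33541, 785),
    ("#4", 33530, 963),
    ("#5", 33519, 609),
    ("#7", 33606, 1171),
]


def deviations(counts, expected_parents, expected_children):
    """Return (label, parent delta, child delta) for each fetch attempt."""
    return [
        (label, parents - expected_parents, children - expected_children)
        for label, parents, children in counts
    ]


if __name__ == "__main__":
    for label, d_parent, d_child in deviations(
        FETCH_COUNTS, EXPECTED_PARENTS, EXPECTED_CHILDREN
    ):
        print(f"attempt {label}: parents {d_parent:+d}, children {d_child:+d}")
```

None of the API-based fetches matches the expected counts, and the deltas change sign from run to run, which is consistent with a nondeterministic API problem rather than a fixed set of missing records.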

@christinklez

Harvesting notes: 2025-01-23 (Fetching from the Nuxeo DB)

[2025-01-24, 00:14:00 UTC] {logging_mixin.py:188} INFO - 33758 parent items 2225 parent pages 883 child items 60 child pages
[2025-01-24, 00:14:00 UTC] {logging_mixin.py:188} INFO - 26713 : success,  2285 pages,  34641 items,  33733 solr items,    25 new items, solr count last updated: May 02, 2023 15:26:50.129314

https://7a8067cb-3b99-477e-a883-7e311175a9b4.c3.us-west-2.airflow.amazonaws.com/dags/harvest_collection/grid?dag_run_id=manual__2025-01-23T22%3A41%3A13%2B00%3A00&tab=logs&num_runs=365&task_id=fetching.fetch_collection
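
For anyone re-checking these logs, here is a rough sketch that parses the two summary lines and checks their internal consistency. It assumes relationships inferred from the numbers in this issue, not from the harvester source: total items = parent items + child items, total pages = parent pages + child pages, and the new/lost figure = parent items minus solr items. The function name `check_fetch_log` and both regexes are hypothetical.

```python
import re

# Patterns based on the log lines quoted in this issue (an assumption about
# the format, not taken from the harvester code).
COUNT_RE = re.compile(
    r"(\d+) parent items (\d+) parent pages (\d+) child items (\d+) child pages"
)
SUMMARY_RE = re.compile(
    r"(\d+) : success,\s+(\d+) pages,\s+(\d+) items,"
    r"\s+(\d+) solr items,\s+(-?\d+) (?:new|lost) items"
)


def check_fetch_log(count_line, summary_line):
    """Parse both log lines and report whether the totals add up."""
    m_count = COUNT_RE.search(count_line)
    m_summary = SUMMARY_RE.search(summary_line)
    if m_count is None or m_summary is None:
        raise ValueError("log lines do not match the expected format")
    parents, parent_pages, children, child_pages = map(int, m_count.groups())
    collection_id, pages, items, solr_items, delta = map(int, m_summary.groups())
    return {
        "collection_id": collection_id,
        "parents": parents,
        "children": children,
        # Inferred relationships; see the assumptions stated above.
        "items_consistent": items == parents + children,
        "pages_consistent": pages == parent_pages + child_pages,
        "delta_consistent": delta == parents - solr_items,
    }


if __name__ == "__main__":
    result = check_fetch_log(
        "INFO - 33758 parent items 2225 parent pages 883 child items 60 child pages",
        "INFO - 26713 : success,  2285 pages,  34641 items,  33733 solr items,"
        "    25 new items, solr count last updated: May 02, 2023 15:26:50.129314",
    )
    print(result)
```

Under these assumptions, the DB-based fetch is internally consistent and its parent and child counts equal the expected doclist counts (33,758 and 883).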

@barbarahui

I think 33758 parents is good: the script I previously ran to get the count didn't include the 3 records that were in the top-level folder! The previous count I got was 33755.

@christinklez

Harvest job just finished and we have 33,758 records on -stage: https://calisphere-stage.cdlib.org/collections/26713/

This also matches our Sept 2024 extent report counts!
