CKAN duplicate dataset (same harvest object) #3944

Closed
jbrown-xentity opened this issue Sep 7, 2022 · 12 comments

@jbrown-xentity
Contributor

jbrown-xentity commented Sep 7, 2022

Perfect duplicate (other than name):

These even have the same harvest object id, but were harvested almost 4 hours apart (on the same job; at the time of this analysis only 1 harvest job had been run). There is no reason for this to occur; the code should be preventing it. The only explanation I can imagine is that the harvest fails to mark this object as "complete" after running through everything up to that point (including actually creating the dataset), and the object somehow re-enters the queue.

How to reproduce

  1. Unknown

Expected behavior

No duplicate datasets are harvested from a DCAT source (by design)

Actual behavior

Duplicates are created

Sketch

The goal of this ticket is to investigate and propose a resolution to this problem. This is a different outcome (the datasets are created in CKAN), but may be related to #3943.

This may involve digging into logs during this time period for this dataset, trying to resolve how this record was parsed twice from the queue.

Once the solution is in place, a follow-up ticket will need to be created to:

  1. Discover how many items this affects
  2. Clean up any affected records

This may require creating custom processes/scripts in https://github.com/GSA/datagov-dedupe

@jbrown-xentity jbrown-xentity added the bug Software defect or bug label Sep 7, 2022
@hkdctol hkdctol moved this to Product Backlog in data.gov team board Sep 8, 2022
@hkdctol hkdctol moved this from Product Backlog to Sprint Backlog [7] in data.gov team board Sep 15, 2022
@jbrown-xentity
Contributor Author

After examining logs using catalog-fetch observed-precipitation-temperature-and-snow-water-equivalent-for-the-rio-grande-headw-1905 for the period from Sep 1 6am to Sep 2 6am, we found that 3 objects were created. The first and final objects each logged 5 lines, with messages corresponding to:

  • Starting package_create
  • Indexing
  • Package_create complete
  • Starting re-index
  • Re-index complete

The 2nd object logged only the Starting package_create message and nothing else. We need to investigate further what may have happened during/around the 2nd statement at 2022-09-01T16:26:06.157Z on the 0 instance to see what errors or commands may have halted the import.
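The check described above (every import should log all five messages) can be sketched in Python. This is only an illustration: the `(object_key, message)` record shape and the keys `first`, `second`, `third` are hypothetical, and real log lines would have to be parsed into that shape first.

```python
from collections import defaultdict

# The five messages listed above, in the order a successful import logs them.
EXPECTED_STAGES = [
    "Starting package_create",
    "Indexing",
    "Package_create complete",
    "Starting re-index",
    "Re-index complete",
]


def find_incomplete_imports(records):
    """Return {object_key: missing_stages} for imports that did not log every stage.

    `records` is an iterable of (object_key, message) pairs. That shape is
    hypothetical; it is not the format of the actual fetch logs.
    """
    seen = defaultdict(set)
    for object_key, message in records:
        if message in EXPECTED_STAGES:
            seen[object_key].add(message)
    return {
        key: [stage for stage in EXPECTED_STAGES if stage not in stages]
        for key, stages in seen.items()
        if len(stages) < len(EXPECTED_STAGES)
    }


# Made-up records mirroring the situation above: two complete imports, one cut short.
records = [("first", stage) for stage in EXPECTED_STAGES]
records.append(("second", "Starting package_create"))
records += [("third", stage) for stage in EXPECTED_STAGES]

incomplete = find_incomplete_imports(records)
print(incomplete)
```

Any key reported here marks an import that started but never reached "Re-index complete", which is the window to inspect for restarts or errors.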

@btylerburton btylerburton moved this from Sprint Backlog [7] to In Progress [8] in data.gov team board Sep 22, 2022
@btylerburton btylerburton self-assigned this Sep 22, 2022
@btylerburton btylerburton moved this from In Progress [8] to Done in data.gov team board Sep 26, 2022
@btylerburton
Contributor

Determined that a restart of the container occurred directly after 2022-09-01T16:26:06.157Z. With the recent fix to restart the fetch process only when idle, we expect this issue to resolve itself.

GSA/catalog.data.gov#548

@hkdctol hkdctol closed this as completed Sep 29, 2022
@jbrown-xentity
Contributor Author

We have seen this occur again. See list of DOI duplicates here: https://catalog.data.gov/api/action/package_search?fq=organization:doi-gov&facet.field=[%22identifier%22]&facet.limit=-1&facet.mincount=2&rows=0
We have confirmed that these datasets exist with the same harvest object and are duplicated in the DB and SOLR, even after the fix above was put in place.
Re-opening...

Repository owner moved this from ✔ Done to 📟 Sprint Backlog [7] in data.gov team board Oct 14, 2022
@btylerburton btylerburton removed their assignment Oct 14, 2022
@jbrown-xentity
Contributor Author

We believe that with #4007 complete, along with the previous work on this ticket, this may be resolved. We should examine DOI as an example, and compare the list of all duplicates from here against previous runs to check whether any counts are growing. If the counts are static or down, this can be closed. If they are growing, we should investigate further.
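A minimal sketch of that comparison, assuming each run of the duplicate check has already been reduced to a mapping of identifier to duplicate count. The function name, identifiers, and counts below are made up for illustration.

```python
def compare_duplicate_counts(previous, current):
    """Compare two runs of the duplicate check.

    Both arguments map identifier -> number of datasets sharing that identifier.
    Returns the identifiers whose count grew, plus a suggested next step.
    """
    growing = {
        identifier: (previous.get(identifier, 0), count)
        for identifier, count in current.items()
        if count > previous.get(identifier, 0)
    }
    return {
        "growing": growing,
        "next_step": "investigate" if growing else "close",
    }


# Made-up identifiers and counts, for illustration only.
previous_run = {"id-a": 2, "id-b": 3}
current_run = {"id-a": 2, "id-b": 4, "id-c": 2}

report = compare_duplicate_counts(previous_run, current_run)
print(report["next_step"])
print(report["growing"])
```

An identifier that is missing from the previous run is treated as having had no duplicates then, so a newly duplicated identifier counts as growing.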

@FuhuXia FuhuXia self-assigned this Oct 25, 2022
@FuhuXia FuhuXia moved this from 📟 Sprint Backlog [7] to 🏗 In Progress [8] in data.gov team board Oct 25, 2022
@jbrown-xentity
Contributor Author

I have spot-checked various current duplicates and found a different case than this one. Closing.

Repository owner moved this from 🏗 In Progress [8] to ✔ Done in data.gov team board Oct 27, 2022
@FuhuXia
Member

FuhuXia commented Nov 28, 2022

Found a newly created one: same harvest object id, with creation times 14 seconds apart.

https://catalog.data.gov/api/action/package_search?fq=(identifier:%22https://doi.org/10.23719/1526064%22)

dataset 1
dataset 2

@FuhuXia FuhuXia moved this from ✔ Done to 🏗 In Progress [8] in data.gov team board Nov 28, 2022
@FuhuXia FuhuXia reopened this Nov 28, 2022
Repository owner moved this from 🏗 In Progress [8] to 📟 Sprint Backlog [7] in data.gov team board Nov 28, 2022
@jbrown-xentity
Contributor Author

The relevant logs should be here. I don't see anything that jumps out, whether instances restarting or suspicious failures. Is it possible that it was inserted into the queue twice from the gather? Or maybe something else weird is going on... gather logs.

@hkdctol hkdctol moved this from 📟 Sprint Backlog [7] to 🏗 In Progress [8] in data.gov team board Dec 1, 2022
@FuhuXia
Member

FuhuXia commented Dec 13, 2022

We can manually create a duplicate by inserting the same harvest_object_id into the Redis DB twice. For a datajson source, this creates a duplicate dataset. For XML sources, it generates the error Validation Error: {'Name': 'That URL is already in use.'} that we saw in the other tickets, but does not create a duplicate. So the issue is most likely due to glitches between the multiple fetch processes and Redis.
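To illustrate the failure mode, here is a minimal in-memory sketch of a consumer that tolerates the same harvest_object_id being queued twice. It uses a plain `deque` as a stand-in for the Redis queue; `run_fetch`, `create_dataset`, and the ids are hypothetical names, and this is not the code in ckanext-harvest.

```python
from collections import deque


def run_fetch(queue, create_dataset, processed=None):
    """Drain the queue, creating at most one dataset per harvest_object_id.

    `processed` holds ids that were already imported. Ids seen a second time
    are skipped and returned so they can be logged or counted.
    """
    processed = set() if processed is None else processed
    skipped = []
    while queue:
        harvest_object_id = queue.popleft()
        if harvest_object_id in processed:
            skipped.append(harvest_object_id)
            continue
        create_dataset(harvest_object_id)
        processed.add(harvest_object_id)
    return skipped


# The same id is queued twice, as in the manual reproduction above.
created = []
queue = deque(["obj-1", "obj-2", "obj-1"])
skipped = run_fetch(queue, created.append)

print(created)
print(skipped)
```

This only covers a single consumer. With multiple fetch processes, the set of processed ids would have to live in shared storage with an atomic check-and-set, otherwise two processes could both pass the check before either records the id.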

@FuhuXia
Member

FuhuXia commented Dec 13, 2022

Commit GSA/ckanext-harvest@f2f97f4 should prevent a duplicate dataset (same harvest object) from being created. We will take it further if this issue happens more frequently. So far it has happened only once, which does not give us enough logs to find a general pattern. We can use this API call to run a regular check.

https://catalog.data.gov/api/action/package_search?facet.field=[%22harvest_object_id%22]&facet.limit=-1&facet.mincount=2&rows=0&fq=(collection_package_id:*%20OR%20*:*)

Will manually run the dedupe process to remove the one found, then close this ticket.
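The regular check could be scripted roughly as follows. This is a sketch under assumptions: it takes the facet counts from `result.search_facets[field].items` (each item with `name` and `count`), which is my understanding of the CKAN `package_search` response shape, and the function names and sample ids are made up. The live request is wrapped in a function that is defined but not called here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://catalog.data.gov/api/action/package_search"


def build_check_url(field="harvest_object_id"):
    """Build the same facet query as the URL above: values of `field` used 2+ times."""
    params = {
        "facet.field": json.dumps([field]),
        "facet.limit": -1,
        "facet.mincount": 2,
        "rows": 0,
        "fq": "(collection_package_id:* OR *:*)",
    }
    return f"{BASE_URL}?{urlencode(params)}"


def duplicated_values(payload, field="harvest_object_id"):
    """Extract {value: count} for facet values that appear on 2 or more datasets.

    Assumes the counts are under result.search_facets[field].items.
    """
    items = payload["result"]["search_facets"].get(field, {}).get("items", [])
    return {item["name"]: item["count"] for item in items if item["count"] >= 2}


def fetch_duplicates(field="harvest_object_id", timeout=60):
    """Run the check against the live API (needs network access)."""
    with urlopen(build_check_url(field), timeout=timeout) as response:
        return duplicated_values(json.load(response), field)


# Offline demonstration with a hand-written payload (made-up ids).
sample = {
    "result": {
        "search_facets": {
            "harvest_object_id": {
                "title": "harvest_object_id",
                "items": [
                    {"name": "example-id-a", "display_name": "example-id-a", "count": 2},
                ],
            }
        }
    }
}
duplicates = duplicated_values(sample)
print(duplicates)
```

Calling `fetch_duplicates()` on a schedule and alerting when the result is non-empty would turn the manual URL check into a recurring one.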

@FuhuXia FuhuXia moved this from 🏗 In Progress [8] to ✔ Done in data.gov team board Dec 13, 2022
@FuhuXia
Member

FuhuXia commented Jan 5, 2023

Moving back to in progress since we found 83 cases during the past two weeks.

@FuhuXia FuhuXia moved this from ✔ Done to 🏗 In Progress [8] in data.gov team board Jan 5, 2023
@hkdctol hkdctol moved this from 🏗 In Progress [8] to 📟 Sprint Backlog [7] in data.gov team board Jan 5, 2023
@hkdctol hkdctol moved this from 📟 Sprint Backlog [7] to 📔 Product Backlog in data.gov team board Jan 5, 2023
@FuhuXia
Member

FuhuXia commented Mar 1, 2023

The 83 cases have resolved on their own, most likely due to the recently deployed db-solr-sync with harvest_object_id.

@FuhuXia FuhuXia closed this as completed Mar 1, 2023
@github-project-automation github-project-automation bot moved this from 📔 Product Backlog to ✔ Done in data.gov team board Mar 1, 2023
@jbrown-xentity
Contributor Author

I found another one, not sure if it will resolve on its own or not (but I believe it persisted from yesterday): https://catalog.data.gov/api/action/package_search?q=identifier:%22https://doi.org/10.23719/1526064%22&facet.field=[%22harvest_object_id%22,%20%22harvest_source_title%22]
