CKAN duplicate dataset (same harvest object) #3944
Comments
After examining logs using
The 2nd object only logged the Starting package_create log, nothing else. Need to investigate further on what may have happened during/around the 2nd statement at
Determined that a restart of the container occurred directly after
We have seen this occur again. See list of DOI duplicates here: https://catalog.data.gov/api/action/package_search?fq=organization:doi-gov&facet.field=[%22identifier%22]&facet.limit=-1&facet.mincount=2&rows=0
We believe that, with #4007 complete and the previous work on this ticket, this may be resolved. We should examine DOI as an example and compare the list of all duplicates from here against previous runs to check whether any counts are growing. If the counts are static or falling, this can be closed. If they are growing, we should investigate further.
I have spot-checked various current duplicates and found a different case from this one. Closing.
Found a newly created one, same harvest object id, 14 seconds apart on creation time.
The relevant logs should be here. I don't see anything that jumps out, whether instances restarting or suspicious failures. It's possible that it was inserted into the queue twice from the gather? Or maybe something else weird is going on... gather logs.
We can manually create a duplicate by inserting the same
Commit GSA/ckanext-harvest@f2f97f4 should prevent duplicate datasets (same harvest object) from being created. We will take it further if this issue occurs more frequently. So far it has happened only once, which does not give us enough logs to find a general pattern. We can use this API call to run a regular check.
Will manually run the dedupe process to remove the one we found, then close this ticket.
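The regular check mentioned above, and the earlier suggestion to compare duplicate counts against previous runs, could be scripted roughly as follows. This is a minimal sketch, not an existing tool: the function names are hypothetical, and it assumes the package_search response exposes facet counts under result.search_facets.<field>.items with name and count keys.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CATALOG_API = "https://catalog.data.gov/api/action/package_search"


def build_duplicate_query(organization, field="identifier"):
    """Build a package_search URL that facets on `field` and returns no rows."""
    params = {
        "fq": f"organization:{organization}",
        "facet.field": json.dumps([field]),
        "facet.limit": -1,
        "facet.mincount": 2,  # only facet values shared by 2+ datasets
        "rows": 0,
    }
    return f"{CATALOG_API}?{urlencode(params)}"


def duplicate_counts(response, field="identifier"):
    """Return {facet value: count} for values that appear on 2+ datasets.

    Assumes the response shape result.search_facets[field]["items"].
    """
    facet = response.get("result", {}).get("search_facets", {}).get(field, {})
    return {
        item["name"]: item["count"]
        for item in facet.get("items", [])
        if item["count"] >= 2
    }


def growing(previous, current):
    """Return values whose duplicate count rose, or newly appeared, since the last run."""
    return {k: v for k, v in current.items() if v > previous.get(k, 0)}


def fetch_duplicates(organization, field="identifier"):
    """Query the live catalog (requires network access)."""
    with urlopen(build_duplicate_query(organization, field)) as resp:
        return duplicate_counts(json.load(resp), field)
```

Saving the output of fetch_duplicates("doi-gov") from each run and passing two consecutive results to growing() would show whether the counts are static or increasing.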
Moving back to in progress since we found 83 cases during the past two weeks.
The 83 have resolved on their own, most likely due to the recently deployed db-solr-sync with harvest_object_id.
I found another one, not sure if it will resolve on its own or not (but I believe it persisted from yesterday): https://catalog.data.gov/api/action/package_search?q=identifier:%22https://doi.org/10.23719/1526064%22&facet.field=[%22harvest_object_id%22,%20%22harvest_source_title%22]
Perfect duplicate (other than name):
These even have the same harvest object id, but were harvested almost 4 hours apart from each other (in the same job; at the time of this analysis only 1 harvest job had been run). There is no reason for this to occur; the code should be preventing it. The only explanation I can imagine is that the harvest fails to mark this object as "complete" after running through everything up to that point (including actually creating the dataset), and the object somehow re-enters the queue.
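To illustrate the suspected failure mode, here is a minimal sketch of an idempotency guard keyed on the harvest object id. It is purely illustrative and is not the ckanext-harvest implementation or the change in the commit referenced above: HarvestObjectRegistry and create_once are hypothetical names, and the in-memory dict stands in for what would have to be a transactional check or a unique constraint in the real database.

```python
import threading


class HarvestObjectRegistry:
    """Hypothetical in-memory stand-in for the harvest object table."""

    def __init__(self):
        self._lock = threading.Lock()
        self._package_by_object = {}

    def create_once(self, harvest_object_id, create_package):
        """Call create_package() only if this harvest object has no package yet.

        Returns (package_id, created).
        """
        with self._lock:
            existing = self._package_by_object.get(harvest_object_id)
            if existing is not None:
                # The same queue message was delivered again: reuse, do not create.
                return existing, False
            package_id = create_package()
            # Record the link in the same critical section as the create, so a
            # redelivered message cannot see "no package yet".
            self._package_by_object[harvest_object_id] = package_id
            return package_id, True


if __name__ == "__main__":
    registry = HarvestObjectRegistry()
    created_ids = []

    def fake_package_create():
        # Placeholder for the real dataset creation call.
        created_ids.append(f"pkg-{len(created_ids) + 1}")
        return created_ids[-1]

    # Simulate the same harvest object being consumed from the queue twice.
    print(registry.create_once("obj-1", fake_package_create))
    print(registry.create_once("obj-1", fake_package_create))
    print(len(created_ids))
```

The point of the sketch is that the "already processed" record has to be written atomically with the create. If the process restarts between creating the dataset and marking the object complete, a redelivered message would otherwise produce a second dataset.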
How to reproduce
Expected behavior
No duplicate datasets are harvested from a DCAT source (by design)
Actual behavior
Duplicates are created
Sketch
The goal of this ticket is to investigate and propose a resolution to this problem. This is a different outcome (the datasets are created in CKAN), but may be related to #3943.
This may involve digging into logs during this time period for this dataset, trying to resolve how this record was parsed twice from the queue.
Once the solution is in place, a follow up will need to be done/ticket be created to
This may require creating custom processes/scripts in https://github.com/GSA/datagov-dedupe