CKAN duplicate dataset (same harvest object) #3944

jbrown-xentity · 2022-09-07T20:39:50Z

Perfect duplicate (other than name):

These even have the same harvest object id, but were harvested almost 4 hours apart from each other (on the same job; at the time of this analysis only 1 harvest job has been run). There is no reason for this to occur; the code should be preventing it. The only explanation I can imagine is that the harvest fails to mark this object as "complete" after running through everything up to that point (including actually creating the dataset), and the object somehow re-enters the queue.

How to reproduce

Unknown

Expected behavior

No duplicate datasets are harvested from a DCAT source (by design)

Actual behavior

Duplicates are created

Sketch

The goal of this ticket is to investigate and propose a resolution to this problem. This is a different outcome (the datasets are created in CKAN), but may be related to #3943.

This may involve digging into logs during this time period for this dataset, trying to resolve how this record was parsed twice from the queue.

Once the solution is in place, a follow-up ticket will need to be created to:

Discover how many items this affects
Clean up any affected records

This may require creating custom processes/scripts in https://github.com/GSA/datagov-dedupe

jbrown-xentity · 2022-09-22T19:58:16Z

After examining logs using catalog-fetch observed-precipitation-temperature-and-snow-water-equivalent-for-the-rio-grande-headw-1905 for the time period from Sep 1 6am to Sep 2 6am, we found that 3 objects were created. The first and final objects each logged 5 times, with notes corresponding to:

Starting package_create
Indexing
Package_create complete
Starting re-index
Re-index complete

The 2nd object only logged the Starting package_create message, nothing else. We need to investigate further what may have happened around the 2nd statement at 2022-09-01T16:26:06.157Z on the 0 instance, to see what errors or commands may have halted the import.
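
A minimal sketch of how this check could be automated, assuming the log lines have already been reduced to (object key, message) pairs. That input shape and the function name are hypothetical; only the five message strings come from the logs described above.

```python
from collections import defaultdict

# The five messages a successful import logged, in the order listed above.
EXPECTED_STEPS = [
    "Starting package_create",
    "Indexing",
    "Package_create complete",
    "Starting re-index",
    "Re-index complete",
]


def incomplete_imports(entries):
    """Return {object_key: [missing steps]} for imports that skipped a step.

    `entries` is an iterable of (object_key, message) pairs. This shape is
    an assumption and must be adapted to the real log format.
    """
    seen = defaultdict(set)
    for object_key, message in entries:
        seen[object_key].add(message)

    expected = set(EXPECTED_STEPS)
    return {
        key: [step for step in EXPECTED_STEPS if step not in steps]
        for key, steps in seen.items()
        if not expected <= steps  # at least one expected step is missing
    }
```

An object that only logged Starting package_create would be reported with the other four steps missing, which matches the interrupted import described above.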

btylerburton · 2022-09-26T21:01:03Z

Determined that a restart of the container occurred directly after 2022-09-01T16:26:06.157Z. With the recent fix to only restart fetch process when idle, we expect this issue to resolve itself.

GSA/catalog.data.gov#548

jbrown-xentity · 2022-10-14T15:18:38Z

We have seen this occur again. See list of DOI duplicates here: https://catalog.data.gov/api/action/package_search?fq=organization:doi-gov&facet.field=[%22identifier%22]&facet.limit=-1&facet.mincount=2&rows=0
We have confirmed that these datasets exist with the same harvest object and are duplicated in the DB and SOLR, even after the fix above was put in place.
Re-opening...

jbrown-xentity · 2022-10-20T20:35:59Z

We believe that with #4007 complete, along with the previous work on this ticket, this may be resolved. We should examine DOI as an example, and compare the list of all duplicates from here against previous runs to check whether any counts are growing. If the counts are static or down, this can be closed. If they are growing, we should investigate further.
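
The comparison against previous runs could be sketched as follows, assuming each run has been saved as a mapping from some key (an organization or identifier) to its duplicate count. The function name and the data shape are hypothetical.

```python
def growing_duplicates(previous, current):
    """Return {key: (old_count, new_count)} for keys whose count increased.

    Keys missing from `previous` are treated as having had zero duplicates.
    """
    return {
        key: (previous.get(key, 0), count)
        for key, count in current.items()
        if count > previous.get(key, 0)
    }
```

An empty result would mean the counts are static or down, which is the condition for closing described above.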

jbrown-xentity · 2022-10-27T22:57:09Z

I have spot-checked various current duplicates and found that they are a different case from this one. Closing.

FuhuXia · 2022-11-28T17:19:03Z

Found a newly created one, same harvest object id, 14 seconds apart on creation time.

https://catalog.data.gov/api/action/package_search?fq=(identifier:%22https://doi.org/10.23719/1526064%22)

dataset 1
dataset 2

jbrown-xentity · 2022-11-28T18:20:58Z

The relevant logs should be here. I don't see anything that jumps out, whether instances restarting or suspicious failures. It's possible that it was inserted into the queue twice from the gather? Or maybe something else weird is going on... gather logs.

FuhuXia · 2022-12-13T18:00:22Z

We can manually create a duplicate by inserting the same harvest_object_id into the Redis DB twice. For a datajson source, this creates a duplicate dataset. For XML sources, it generates the error Validation Error: {'Name': 'That URL is already in use.'} that we saw in the other tickets, but it does not create a duplicate. So the issue is most likely due to glitches between the multiple fetch processes and Redis.
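
To illustrate why a repeated queue entry matters, here is a minimal sketch of an idempotent consumer that imports each harvest_object_id at most once. It is not how ckanext-harvest is implemented; the function names are hypothetical, and the in-memory set stands in for persistent state.

```python
def process_queue(queue, create_dataset):
    """Consume harvest object ids, importing each id at most once.

    `queue` is any iterable of ids; `create_dataset` is a hypothetical
    callable that performs the import and returns the created dataset.
    """
    processed = set()  # stand-in for persistent state shared by workers
    created = []
    for harvest_object_id in queue:
        if harvest_object_id in processed:
            # Same id delivered twice: skip instead of importing again.
            continue
        processed.add(harvest_object_id)
        created.append(create_dataset(harvest_object_id))
    return created
```

With several fetch processes, the check and the mark would have to be a single atomic operation in shared storage (for example a Redis SET with the NX option, or a database unique constraint); otherwise two workers could both pass the check.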

FuhuXia · 2022-12-13T21:32:27Z

Commit GSA/ckanext-harvest@f2f97f4 should prevent duplicate datasets (same harvest object) from being created. Will move it further if this issue happens more frequently. So far it has only happened once, which does not give us enough logs to analyze and find a general pattern. We can use this API call as a regular check.

https://catalog.data.gov/api/action/package_search?facet.field=[%22harvest_object_id%22]&facet.limit=-1&facet.mincount=2&rows=0&fq=(collection_package_id:*%20OR%20*:*)
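
A rough sketch of scripting this check, using the same query parameters as the URL above. It assumes the response follows CKAN's usual package_search shape, with facet counts under result.search_facets; the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

API = "https://catalog.data.gov/api/action/package_search"


def duplicated_harvest_objects(response):
    """Map each harvest_object_id shared by 2+ datasets to its dataset count.

    `response` is the decoded JSON body of a package_search call.
    """
    facet = response["result"]["search_facets"].get("harvest_object_id", {})
    return {
        item["name"]: item["count"]
        for item in facet.get("items", [])
        if item["count"] >= 2
    }


def fetch_duplicates():
    """Run the facet query from the URL above and return the duplicates."""
    params = {
        "facet.field": '["harvest_object_id"]',
        "facet.limit": -1,
        "facet.mincount": 2,
        "rows": 0,
        "fq": "(collection_package_id:* OR *:*)",
    }
    url = API + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        return duplicated_harvest_objects(json.load(resp))
```

An empty dictionary from fetch_duplicates() would mean no harvest object currently backs more than one dataset.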

Will manually run the dedupe process to remove the found one, then close this ticket.

FuhuXia · 2023-01-05T18:01:38Z

Moving back to in progress since we found 83 cases during the past two weeks.

FuhuXia · 2023-03-01T17:38:46Z

The 83 cases have resolved on their own, most likely due to the recently deployed db-solr-sync with harvest_object_id.

jbrown-xentity · 2023-03-01T17:53:02Z

I found another one, not sure if it will resolve on its own or not (but I believe it persisted from yesterday): https://catalog.data.gov/api/action/package_search?q=identifier:%22https://doi.org/10.23719/1526064%22&facet.field=[%22harvest_object_id%22,%20%22harvest_source_title%22]

jbrown-xentity added the bug Software defect or bug label Sep 7, 2022

jbrown-xentity added this to data.gov team board Sep 7, 2022

hkdctol moved this to Product Backlog in data.gov team board Sep 8, 2022

hkdctol moved this from Product Backlog to Sprint Backlog [7] in data.gov team board Sep 15, 2022

jbrown-xentity mentioned this issue Sep 22, 2022

CKAN duplicate dataset (different harvest object) #3968

Closed

btylerburton moved this from Sprint Backlog [7] to In Progress [8] in data.gov team board Sep 22, 2022

btylerburton self-assigned this Sep 22, 2022

btylerburton moved this from In Progress [8] to Done in data.gov team board Sep 26, 2022

hkdctol closed this as completed Sep 29, 2022

jbrown-xentity reopened this Oct 14, 2022

Repository owner moved this from ✔ Done to 📟 Sprint Backlog [7] in data.gov team board Oct 14, 2022

btylerburton removed their assignment Oct 14, 2022

FuhuXia self-assigned this Oct 25, 2022

FuhuXia moved this from 📟 Sprint Backlog [7] to 🏗 In Progress [8] in data.gov team board Oct 25, 2022

jbrown-xentity closed this as completed Oct 27, 2022

Repository owner moved this from 🏗 In Progress [8] to ✔ Done in data.gov team board Oct 27, 2022

FuhuXia moved this from ✔ Done to 🏗 In Progress [8] in data.gov team board Nov 28, 2022

FuhuXia reopened this Nov 28, 2022

Repository owner moved this from 🏗 In Progress [8] to 📟 Sprint Backlog [7] in data.gov team board Nov 28, 2022

hkdctol moved this from 📟 Sprint Backlog [7] to 🏗 In Progress [8] in data.gov team board Dec 1, 2022

FuhuXia moved this from 🏗 In Progress [8] to ✔ Done in data.gov team board Dec 13, 2022

FuhuXia moved this from ✔ Done to 🏗 In Progress [8] in data.gov team board Jan 5, 2023

hkdctol moved this from 🏗 In Progress [8] to 📟 Sprint Backlog [7] in data.gov team board Jan 5, 2023

hkdctol moved this from 📟 Sprint Backlog [7] to 📔 Product Backlog in data.gov team board Jan 5, 2023

FuhuXia closed this as completed Mar 1, 2023

github-project-automation bot moved this from 📔 Product Backlog to ✔ Done in data.gov team board Mar 1, 2023
