In one of the NCUA harvest jobs, the harvester added all 36 datasets as new instead of updating the existing ones, resulting in duplicate datasets. Out of 60 datasets in total, 36 are newly harvested and 24 are duplicates. This differs from the other data.json duplicate issue #2981 in several ways:
In the DB, one of the duplicate datasets has no harvest_object linked to its package_id.
On the UI, both duplicate datasets show harvest_object info pointing to the same harvest_object_id. A Solr reindex does not help.
The following SQL query picks up the 24 duplicates from the NCUA org, but it also shows that this is a widespread issue affecting other orgs as well.
SELECT "group".name, COUNT(*) FROM package
JOIN "group" ON package.owner_org = "group".id
LEFT JOIN harvest_object ON package.id = harvest_object.package_id
WHERE package.state='active' AND package.type='dataset' AND harvest_object.package_id IS NULL
GROUP BY 1
ORDER BY 2 DESC
;
Ooooo this was a fun one! I remember @FuhuXia and @Jin-Sun-tts pairing heavily on this to manually clean up the DB, and then Jin got proficient enough to do it herself! 🥲
How to reproduce
Cannot replicate.
Sketch
One-time fix: collect all the duplicate IDs and delete them via the API (see the sketch below).
Long-term fix: improve the de-dupe script to handle these cases.
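A minimal sketch of the one-time fix, assuming the duplicates to remove are exactly the active datasets with no linked harvest_object (the same condition used in the SQL above). The DB connection string, portal URL, and API key are placeholders, and psycopg2 plus ckanapi are assumed to be available; package_delete only soft-deletes, so dataset_purge could be used instead to remove the records entirely.

# Sketch: find active packages with no linked harvest_object and delete them
# through the CKAN action API. Connection details below are placeholders.
import psycopg2                    # assumed available for direct DB access
from ckanapi import RemoteCKAN     # assumed CKAN API client

DB_DSN = "dbname=ckan user=ckan host=localhost"   # placeholder DSN
CKAN_URL = "https://catalog.data.gov"             # placeholder portal URL
API_KEY = "REPLACE_WITH_SYSADMIN_KEY"             # placeholder API key

# Same condition as the count query above, but returning individual package ids.
ORPHAN_SQL = """
    SELECT package.id
    FROM package
    LEFT JOIN harvest_object ON package.id = harvest_object.package_id
    WHERE package.state = 'active'
      AND package.type = 'dataset'
      AND harvest_object.package_id IS NULL
"""

def collect_orphan_ids():
    """Return ids of active datasets that have no linked harvest_object."""
    with psycopg2.connect(DB_DSN) as conn, conn.cursor() as cur:
        cur.execute(ORPHAN_SQL)
        return [row[0] for row in cur.fetchall()]

def delete_orphans(ids):
    """Soft-delete each duplicate via the CKAN action API."""
    ckan = RemoteCKAN(CKAN_URL, apikey=API_KEY)
    for pkg_id in ids:
        ckan.action.package_delete(id=pkg_id)  # dataset_purge removes entirely
        print(f"deleted {pkg_id}")

if __name__ == "__main__":
    orphans = collect_orphan_ids()
    print(f"found {len(orphans)} duplicate datasets")
    delete_orphans(orphans)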