NCUA harvest job created duplicated datasets #3567

FuhuXia · 2021-12-01T14:25:36Z

In one of the NCUA harvest jobs, the harvester added all 36 datasets as new instead of updating the existing ones, which resulted in duplicate datasets. Of the 60 total datasets, 36 are newly harvested and 24 are duplicates. This is different from the other data.json duplicate issue, #2981, in several ways:

In the DB, the package_id of one of the duplicate datasets has no linked harvest_object.
On the UI, both duplicate datasets show harvest_object info, pointing to the same harvest_object_id. A Solr reindex does not help.
The de-dupe script does not work on them.

The following SQL query picks up the 24 duplicates from the NCUA org, but it also shows that the issue is widespread across other orgs.

SELECT "group".name, COUNT(*) FROM package
JOIN "group" ON package.owner_org = "group".id
LEFT JOIN harvest_object ON package.id = harvest_object.package_id
WHERE package.state='active' AND package.type='dataset' AND harvest_object.package_id IS NULL
GROUP BY 1
ORDER BY 2 DESC
;

                      name                      | count
------------------------------------------------+-------
 doc-gov                                        | 23423
 noaa-gov                                       |  9749
 ca-gov                                         |  2868
 usaid-gov                                      |   508
 hhs-gov                                        |   132
 state-of-oklahoma                              |   100
 city-of-baltimore                              |    81
 doe-gov                                        |    64
 usgs-gov                                       |    48
 usda-gov                                       |    45
 city-of-new-york                               |    39
 epa-gov                                        |    33
 federal-laboratory-consortium                  |    29
 national-credit-union-administration           |    24
 vcgi-org                                       |    17
 city-of-sioux-falls                            |    14
 doi-gov                                        |    14
 dot-gov                                        |    13
 city-of-austin                                 |    11
 king-county-washington                         |     5
 ed-gov                                         |     3
 fema-gov                                       |     3
 centers-for-disease-control-and-prevention     |     2
 va-gov                                         |     2
 national-institute-of-standards-and-technology |     2
 state-of-connecticut                           |     2
 state-of-maryland                              |     2
 city-of-baton-rouge                            |     1
 rrb-gov                                        |     1
 census-gov                                     |     1
 city-of-bloomington                            |     1
 fcc-gov                                        |     1
 doj-gov                                        |     1
(33 rows)
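
To act on a single org, the same anti-join (LEFT JOIN plus IS NULL) can return package ids instead of counts. The sketch below is only an illustration: it runs the adapted query against an in-memory SQLite database with a made-up toy schema and rows (`make_toy_db`, `example-org`, and ids `p1` to `p3` are all hypothetical). The production CKAN database is PostgreSQL, where the `?` placeholder would need to be replaced by the driver's own parameter style.

```python
import sqlite3

# Same join logic as the query above, but returning ids for one org.
ORPHAN_IDS_SQL = """
SELECT package.id
FROM package
JOIN "group" ON package.owner_org = "group".id
LEFT JOIN harvest_object ON package.id = harvest_object.package_id
WHERE package.state = 'active'
  AND package.type = 'dataset'
  AND harvest_object.package_id IS NULL
  AND "group".name = ?
ORDER BY package.id
"""


def orphan_package_ids(conn, org_name):
    """Return ids of active datasets in org_name with no linked harvest_object."""
    return [row[0] for row in conn.execute(ORPHAN_IDS_SQL, (org_name,))]


def make_toy_db():
    """Build a tiny in-memory stand-in for the three tables (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE "group" (id TEXT PRIMARY KEY, name TEXT);
        CREATE TABLE package (id TEXT PRIMARY KEY, owner_org TEXT, state TEXT, type TEXT);
        CREATE TABLE harvest_object (id TEXT PRIMARY KEY, package_id TEXT);
        INSERT INTO "group" VALUES ('g1', 'example-org');
        INSERT INTO package VALUES ('p1', 'g1', 'active', 'dataset');
        INSERT INTO package VALUES ('p2', 'g1', 'active', 'dataset');
        INSERT INTO package VALUES ('p3', 'g1', 'deleted', 'dataset');
        INSERT INTO harvest_object VALUES ('h1', 'p1');
        """
    )
    return conn


if __name__ == "__main__":
    print(orphan_package_ids(make_toy_db(), "example-org"))
```

In the toy data only `p2` is both active and missing a harvest_object row, so it is the only id the query selects.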

How to reproduce

Cannot replicate.

Sketch

One time fix: collect all ids and delete all duplicates via API.
Long term fix: improve de-dupe script to handle them.
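
The one-time fix could look roughly like the sketch below, which assumes the ids have already been collected and uses CKAN's `package_delete` action over the action API. The base URL, token, and helper names (`build_delete_request`, `delete_duplicates`) are hypothetical, and this is not the script the team actually used. It defaults to a dry run that sends nothing. Note that `package_delete` marks a dataset as deleted rather than purging it from the database.

```python
import json
import urllib.request


def build_delete_request(base_url, api_token, package_id):
    """Build (but do not send) a POST to CKAN's package_delete action."""
    url = base_url.rstrip("/") + "/api/3/action/package_delete"
    body = json.dumps({"id": package_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json", "Authorization": api_token},
        method="POST",
    )


def delete_duplicates(base_url, api_token, package_ids,
                      send=urllib.request.urlopen, dry_run=True):
    """Delete each package id; with dry_run=True only report what would be sent."""
    planned = []
    for package_id in package_ids:
        request = build_delete_request(base_url, api_token, package_id)
        planned.append((request.full_url, package_id))
        if not dry_run:
            # Any HTTP error raised here stops the loop at the first failure.
            with send(request) as response:
                response.read()
    return planned


if __name__ == "__main__":
    # Hypothetical host and ids; dry run, so no request is made.
    for url, package_id in delete_duplicates(
        "https://catalog.example.gov", "API_TOKEN", ["id-1", "id-2"]
    ):
        print(url, package_id)
```

Reviewing the dry-run output against the id list before passing `dry_run=False` keeps the destructive step explicit.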

hkdctol · 2022-10-13T18:10:34Z

Could be resolved by #4007

Jin-Sun-tts · 2022-10-25T13:44:50Z

After cleaning up the bad data in #4007, the above query now returns 0 rows.

nickumia-reisys · 2023-02-02T20:11:44Z

Ooooo this was a fun one! I remember @FuhuXia and @Jin-Sun-tts pairing heavily on this to manually clean up the DB and then Jin got proficient enough to do it herself! 🥲

FuhuXia added the bug Software defect or bug label Dec 1, 2021

mogul added this to data.gov team board Mar 23, 2022

mogul moved this to Product Backlog in data.gov team board Mar 23, 2022

Jin-Sun-tts mentioned this issue Oct 18, 2022

Clear bad production data #4007

Closed

35 tasks

Jin-Sun-tts self-assigned this Oct 25, 2022

Jin-Sun-tts moved this from 📔 Product Backlog to 🏗 In Progress [8] in data.gov team board Oct 25, 2022

Jin-Sun-tts moved this from 🏗 In Progress [8] to 👀 Needs Review [2] in data.gov team board Oct 26, 2022

Jin-Sun-tts moved this from 👀 Needs Review [2] to ✔ Done in data.gov team board Oct 27, 2022

nickumia-reisys closed this as completed Feb 2, 2023

btylerburton added the harvest-duplicates Issues related to Duplicated Datasets label Dec 21, 2023

hkdctol moved this from ✔ Done to 🗄 Closed in data.gov team board Dec 26, 2023
