NCUA harvest job created duplicated datasets #3567

Closed
FuhuXia opened this issue Dec 1, 2021 · 3 comments
Labels: bug (Software defect or bug), harvest-duplicates (Issues related to Duplicated Datasets)

FuhuXia commented Dec 1, 2021

In one of the NCUA harvest jobs, the harvester added all 36 datasets as new instead of updating the existing ones, which produced duplicate datasets. Of the 60 datasets in total, 36 are newly harvested and 24 are duplicates. This differs from the other data.json duplicate issue, #2981, in several ways:

  1. In the DB, the package_id of one of the duplicate datasets has no linked harvest_object.
  2. In the UI, both duplicate datasets show harvest_object info, pointing to the same harvest_object_id. A Solr reindex does not help.
  3. The de-dupe script does not work on them.

The following SQL query picks up the 24 duplicates from the NCUA org, but it also shows that the issue is widespread across other orgs.

SELECT "group".name, COUNT(*) FROM package
JOIN "group" ON package.owner_org = "group".id
LEFT JOIN harvest_object ON package.id = harvest_object.package_id
WHERE package.state='active' AND package.type='dataset' AND harvest_object.package_id IS NULL
GROUP BY 1
ORDER BY 2 DESC
;
                      name                      | count
------------------------------------------------+-------
 doc-gov                                        | 23423
 noaa-gov                                       |  9749
 ca-gov                                         |  2868
 usaid-gov                                      |   508
 hhs-gov                                        |   132
 state-of-oklahoma                              |   100
 city-of-baltimore                              |    81
 doe-gov                                        |    64
 usgs-gov                                       |    48
 usda-gov                                       |    45
 city-of-new-york                               |    39
 epa-gov                                        |    33
 federal-laboratory-consortium                  |    29
 national-credit-union-administration           |    24
 vcgi-org                                       |    17
 city-of-sioux-falls                            |    14
 doi-gov                                        |    14
 dot-gov                                        |    13
 city-of-austin                                 |    11
 king-county-washington                         |     5
 ed-gov                                         |     3
 fema-gov                                       |     3
 centers-for-disease-control-and-prevention     |     2
 va-gov                                         |     2
 national-institute-of-standards-and-technology |     2
 state-of-connecticut                           |     2
 state-of-maryland                              |     2
 city-of-baton-rouge                            |     1
 rrb-gov                                        |     1
 census-gov                                     |     1
 city-of-bloomington                            |     1
 fcc-gov                                        |     1
 doj-gov                                        |     1
(33 rows)
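
The query above is an anti-join: the LEFT JOIN keeps every package, and the `harvest_object.package_id IS NULL` filter keeps only those with no matching harvest_object row. A minimal sketch of that logic on toy data, using Python's built-in sqlite3 rather than the production database. The three tables are stripped down to the columns the query touches, and all ids, org names, and the helper names `build_toy_db` and `orphan_counts` are made up for illustration:

```python
import sqlite3

# Same query as in the issue, minus the trailing semicolon.
QUERY = """
SELECT "group".name, COUNT(*) FROM package
JOIN "group" ON package.owner_org = "group".id
LEFT JOIN harvest_object ON package.id = harvest_object.package_id
WHERE package.state='active' AND package.type='dataset' AND harvest_object.package_id IS NULL
GROUP BY 1
ORDER BY 2 DESC
"""


def build_toy_db():
    """Create an in-memory database with a simplified, hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE "group" (id TEXT PRIMARY KEY, name TEXT);
        CREATE TABLE package (id TEXT PRIMARY KEY, owner_org TEXT, state TEXT, type TEXT);
        CREATE TABLE harvest_object (id TEXT PRIMARY KEY, package_id TEXT);

        INSERT INTO "group" VALUES ('g1', 'org-a'), ('g2', 'org-b');

        INSERT INTO package VALUES
            ('p1', 'g1', 'active',  'dataset'),  -- linked to h1
            ('p2', 'g1', 'active',  'dataset'),  -- no harvest_object: counted
            ('p3', 'g1', 'deleted', 'dataset'),  -- not active: ignored
            ('p4', 'g2', 'active',  'dataset'),  -- no harvest_object: counted
            ('p5', 'g2', 'active',  'dataset'),  -- no harvest_object: counted
            ('p6', 'g2', 'active',  'dataset');  -- linked to h2

        INSERT INTO harvest_object VALUES ('h1', 'p1'), ('h2', 'p6');
    """)
    return conn


def orphan_counts(conn):
    """Return (org name, count) for active datasets with no harvest_object."""
    return conn.execute(QUERY).fetchall()


if __name__ == "__main__":
    print(orphan_counts(build_toy_db()))  # [('org-b', 2), ('org-a', 1)]
```

Replacing `COUNT(*)` and the `GROUP BY` with `package.id` in the SELECT list would return the individual package ids instead of per-org totals.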

How to reproduce

Cannot replicate.

Sketch

One-time fix: collect the ids of all duplicates and delete them via the API.
Long-term fix: improve the de-dupe script to handle them.
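
A rough sketch of what the one-time fix could look like, assuming the ids have already been collected and that the standard CKAN action API endpoint `package_delete` is the one used. The issue does not say which API call was intended, and the base URL, token, and function name below are hypothetical. The function only builds the requests, so they can be reviewed before anything is sent:

```python
import json
import urllib.request


def build_delete_requests(base_url, api_token, package_ids):
    """Build (but do not send) one CKAN package_delete request per package id.

    Assumption: the catalog exposes the standard CKAN action API under
    /api/3/action/ and accepts the API token in the Authorization header.
    """
    url = base_url.rstrip("/") + "/api/3/action/package_delete"
    requests = []
    for package_id in package_ids:
        body = json.dumps({"id": package_id}).encode("utf-8")
        requests.append(
            urllib.request.Request(
                url,
                data=body,
                headers={
                    "Authorization": api_token,
                    "Content-Type": "application/json",
                },
                method="POST",
            )
        )
    return requests


if __name__ == "__main__":
    # Hypothetical values; replace with the real catalog URL, token, and ids.
    reqs = build_delete_requests(
        "https://catalog.example.gov", "MY-API-TOKEN", ["id-1", "id-2"]
    )
    for req in reqs:
        print(req.get_method(), req.full_url, req.data.decode("utf-8"))
        # To actually send: urllib.request.urlopen(req)
```

In CKAN, `package_delete` marks a dataset as deleted rather than removing its rows, so a permanent cleanup would need `dataset_purge` or direct DB work instead.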

@FuhuXia FuhuXia added the bug Software defect or bug label Dec 1, 2021
@mogul mogul moved this to Product Backlog in data.gov team board Mar 23, 2022

hkdctol commented Oct 13, 2022

Could be resolved by #4007

@Jin-Sun-tts

After cleaning up the bad data in #4007, the above query now returns 0 rows.

@Jin-Sun-tts Jin-Sun-tts self-assigned this Oct 25, 2022
@Jin-Sun-tts Jin-Sun-tts moved this from 📔 Product Backlog to 🏗 In Progress [8] in data.gov team board Oct 25, 2022
@Jin-Sun-tts Jin-Sun-tts moved this from 🏗 In Progress [8] to 👀 Needs Review [2] in data.gov team board Oct 26, 2022
@Jin-Sun-tts Jin-Sun-tts moved this from 👀 Needs Review [2] to ✔ Done in data.gov team board Oct 27, 2022
@nickumia-reisys

Ooooo this was a fun one! I remember @FuhuXia and @Jin-Sun-tts pairing heavily on this to manually clean up the DB, and then Jin got proficient enough to do it herself! 🥲

@btylerburton btylerburton added the harvest-duplicates Issues related to Duplicated Datasets label Dec 21, 2023
@hkdctol hkdctol moved this from ✔ Done to 🗄 Closed in data.gov team board Dec 26, 2023