SOLR phantom duplicate discovery #3943

Closed
jbrown-xentity opened this issue Sep 7, 2022 · 6 comments
Labels
bug Software defect or bug harvest-duplicates Issues related to Duplicated Datasets


@jbrown-xentity
Contributor

jbrown-xentity commented Sep 7, 2022

A duplicate record exists in SOLR that doesn't exist in the DB.

Note that the two records have different names, so it's not a bug in SOLR that is causing this duplicate. CKAN is for some reason creating the record twice, but only in SOLR.

How to reproduce

  1. Unknown

Expected behavior

If a dataset doesn't exist in the DB, it can't exist in SOLR

Actual behavior

Duplicate record only exists in SOLR

Sketch

Since this came from CKAN, we expect that it is related to a logic issue. It doesn't seem to be reproducible (it didn't occur in dev, and more duplicates aren't created when re-harvesting).
This will be mitigated by #2213, but that won't fix how this occurred initially.
It could be that a restart at the wrong time caused the system to fail mid-operation, but we're not sure. This could theoretically be validated by examining logs.
The goal of this ticket is to solve the problem (code, infrastructure, restarts, whatever it is) and stop this from occurring.

There should be a follow-up ticket to this one to:

  1. Discover how many items this affects
  2. Clean up any affected records
    This may be done by utilizing Test ckan-db-solr-sync job #2213.
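
A minimal sketch of how the two follow-up steps could be approached, assuming the package IDs and `metadata_modified` values from the DB and from SOLR have already been fetched into plain dictionaries. The function name `diff_db_solr` and the dictionary shape are hypothetical and are not taken from the actual ckan-db-solr-sync job in #2213.

```python
from typing import Dict, Set, Tuple


def diff_db_solr(
    db: Dict[str, str], solr: Dict[str, str]
) -> Tuple[Set[str], Set[str]]:
    """Compare two {package_id: metadata_modified} mappings.

    Hypothetical helper: returns (ids to remove from SOLR, ids to add/update in SOLR).
    """
    # Phantom records: indexed in SOLR but absent from the DB.
    to_remove = set(solr) - set(db)
    # Missing or stale records: in the DB but not indexed, or indexed
    # with a different modification timestamp.
    to_update = {pid for pid, modified in db.items() if solr.get(pid) != modified}
    return to_remove, to_update


# Illustrative data only; these IDs and timestamps are made up.
db_packages = {"a": "2022-09-01", "b": "2022-09-05", "c": "2022-09-06"}
solr_packages = {"a": "2022-09-01", "b": "2022-09-02", "ghost": "2022-08-30"}

to_remove, to_update = diff_db_solr(db_packages, solr_packages)
print(f"{len(to_remove)} packages need to be removed from Solr")
print(f"{len(to_update)} packages need to be updated/added to Solr")
```

The size of `to_remove` answers "how many items this affects", and the same set is the input for the cleanup step.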
@jbrown-xentity jbrown-xentity added the bug Software defect or bug label Sep 7, 2022
@hkdctol hkdctol moved this to Product Backlog in data.gov team board Sep 8, 2022
@hkdctol hkdctol moved this from Product Backlog to Sprint Backlog [7] in data.gov team board Sep 15, 2022
@FuhuXia FuhuXia moved this from Sprint Backlog [7] to In Progress [8] in data.gov team board Sep 20, 2022
@FuhuXia FuhuXia self-assigned this Sep 20, 2022
@FuhuXia
Member

FuhuXia commented Sep 20, 2022

  • We can finish Test ckan-db-solr-sync job #2213 first to clean up Solr so that it is free of duplicates, then wait for the issue to happen again and examine the log.

  • One possible cause of the issue is that we force-restart the fetch process every 30 minutes. Let us change the way we restart it; that might help resolve this one. A new ticket has been created.
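
One way a gentler restart could look, as a rough sketch only: instead of killing the fetch process outright, the process receives a signal, finishes the job it is working on, and then exits. This assumes the process can be stopped with `SIGTERM`; the names `request_stop`, `install_handlers`, and `run_fetch_loop` are hypothetical and do not come from the CKAN harvester code.

```python
import signal
import threading

# Set when a shutdown has been requested; checked between jobs only.
stop_requested = threading.Event()


def request_stop(signum, frame):
    """Signal handler: record the request instead of exiting immediately."""
    stop_requested.set()


def install_handlers():
    """Register the handler for SIGTERM (must be called from the main thread)."""
    signal.signal(signal.SIGTERM, request_stop)


def run_fetch_loop(jobs, handle):
    """Process jobs one at a time, stopping only at a job boundary.

    `handle` stands in for whatever writes to both the DB and the index,
    so a restart never lands between those two writes.
    """
    processed = []
    for job in jobs:
        if stop_requested.is_set():
            break
        handle(job)
        processed.append(job)
    return processed
```

Whether an interrupted write is really the cause of the phantom record is still unconfirmed; this only removes one possible source of partial work.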

@FuhuXia
Member

FuhuXia commented Sep 28, 2022

Moving back to backlog. Hopefully this is no longer an issue once the two tickets above have been addressed.

@FuhuXia FuhuXia moved this from In Progress [8] to Product Backlog in data.gov team board Sep 28, 2022
@hkdctol hkdctol moved this from 📔 Product Backlog to 📟 Sprint Backlog [7] in data.gov team board Oct 13, 2022
@FuhuXia FuhuXia removed their assignment Oct 13, 2022
@Jin-Sun-tts Jin-Sun-tts self-assigned this Oct 13, 2022
@Jin-Sun-tts Jin-Sun-tts moved this from 📟 Sprint Backlog [7] to 🏗 In Progress [8] in data.gov team board Oct 13, 2022
@Jin-Sun-tts
Contributor

Monitoring the scheduled db-solr-sync job:

10/12
0 packages need to be removed from Solr
1 packages need to be updated/added to Solr

10/13
1 packages need to be removed from Solr
4 packages need to be updated/added to Solr

10/14
0 packages need to be removed from Solr
5 packages need to be updated/added to Solr

@Jin-Sun-tts
Contributor

10/17
0 packages need to be removed from Solr
1 packages need to be updated/added to Solr

10/18
0 packages need to be removed from Solr
1 packages need to be updated/added to Solr
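
For anyone who wants to track these numbers automatically, here is a small sketch that parses daily summaries in the format pasted above and flags the days on which something had to be removed from Solr. It assumes the two summary lines appear exactly as shown, each group preceded by an `MM/DD` line; `parse_sync_report` is a hypothetical helper, not part of the sync job.

```python
import re
from typing import Dict

LINE_RE = re.compile(
    r"^(\d+) packages need to be (removed from|updated/added to) Solr$"
)
DATE_RE = re.compile(r"^\d{1,2}/\d{1,2}$")


def parse_sync_report(text: str) -> Dict[str, Dict[str, int]]:
    """Return {date: {"remove": n, "update": m}} from pasted sync summaries."""
    report: Dict[str, Dict[str, int]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if DATE_RE.match(line):
            current = line
            report[current] = {"remove": 0, "update": 0}
            continue
        match = LINE_RE.match(line)
        if match and current is not None:
            key = "remove" if match.group(2) == "removed from" else "update"
            report[current][key] = int(match.group(1))
    return report


# Sample copied from the 10/12 and 10/13 entries in this thread.
sample = """10/12
0 packages need to be removed from Solr
1 packages need to be updated/added to Solr

10/13
1 packages need to be removed from Solr
4 packages need to be updated/added to Solr
"""

report = parse_sync_report(sample)
# Days with a non-zero "remove" count had a record in Solr with no DB match.
flagged = [day for day, counts in report.items() if counts["remove"] > 0]
```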

@hkdctol
Contributor

hkdctol commented Oct 20, 2022

Marking as done

@hkdctol hkdctol moved this from 🏗 In Progress [8] to ✔ Done in data.gov team board Oct 20, 2022
@nickumia-reisys
Contributor

This is awesome! These are the fruits of the daily db-solr-sync task that we run. Granted, this bug should not need to be fixed with a custom script the way it is now, but that's a different story haha...

This is indeed fixed right now, but if the db-solr-sync task were to break this would come back. Ticket that automated this fix:

@btylerburton btylerburton added the harvest-duplicates Issues related to Duplicated Datasets label Dec 21, 2023
@hkdctol hkdctol moved this from ✔ Done to 🗄 Closed in data.gov team board Dec 26, 2023