
PXD-1187 ⁃ make sheepdog configurable to enforce indexd record exists before file node registration #164

Closed
philloooo opened this issue Jul 20, 2018 · 5 comments
@philloooo
Contributor

philloooo commented Jul 20, 2018

scope:

implement step 3 in the scenario described below:

  • add a "REQUIRE_INDEX_EXISTS_FOR_FILE" (or a better name, if you can think of one) configuration flag to sheepdog.
  • if such a flag is present and set to True, then when creating or updating a file node, sheepdog should return an error if no matching indexd record exists (see the sketch below).
  • if a matching record does exist, store its object_id in the node.
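
A minimal sketch of how such a check could be wired, assuming sheepdog reads its settings from the Flask app config; `validate_file_node` and `indexd_lookup` are hypothetical names used only for illustration, not sheepdog's actual API:

```python
# Hypothetical sketch, not sheepdog's actual code: gate file-node creation and
# update on a REQUIRE_INDEX_EXISTS_FOR_FILE config flag.
import flask

def validate_file_node(checksum, indexd_lookup):
    """Return the indexd record id to store on the node, or abort with a 400.

    indexd_lookup: hypothetical callable that returns the indexd record
    matching `checksum`, or None if no such record exists.
    """
    if not flask.current_app.config.get("REQUIRE_INDEX_EXISTS_FOR_FILE", False):
        return None  # flag absent or False: keep the current behavior
    record = indexd_lookup(checksum)
    if record is None:
        flask.abort(400, "no indexd record matches this file's checksum")
    return record["did"]  # indexd record ids are exposed in the "did" field
```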

context:

current status described by users:

  1. As far as I know, there is no way to get a registered URL without typing it into api/index/index/UUID, since you can't query "urls" on a node (IMO it should be queryable. If it were, you could write a script to query the URL, then use the AWS CLI to download it in your VM.)
  2. I have searched for data files myself and found that many of them don't even have the same file_name in Windmill vs. S3 storage. So how are users supposed to find a large list of files? Right now, I think they have to do it by hand, since allowing this sort of mismatch means they can't do it programmatically.
  3. Given that BloodPAC has tighter security (it only allows users to download to a machine in the VPC), will they be able to implement "Files/Exploration" in Windmill? And/or "Workspace"?

For BloodPAC's use case, we can support this data flow:
Scenario 1: user uploads data first, to storage that users have upload access to
BPA uses this flow because their access to the buckets predates Gen3. DCP and EDC also use this flow: the data is not owned by us, we are given read access to those buckets in those commons, so we continuously index them into indexd and link them in our graph.

  1. the user uploads data to buckets that they have direct access to
  2. there is a lambda hosted by Gen3 that listens for bucket updates, then checksums and indexes the new objects into indexd ( https://github.com/occ-data/goes16-indexer - work needs to be done to automate deployment and polish the prototype for integration into Gen3)
  3. the user uploads the metadata; if sheepdog can’t find a record in indexd that matches the checksum, it returns 400. If it can, it creates the data node with file_id == indexd’s record id (sketched below)
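
A rough sketch of that step-3 lookup, assuming indexd's listing endpoint accepts a `hash` filter of the form `md5:<hexdigest>` and returns matches under a `records` key with the record id in `did`; the base URL and those details are assumptions about the deployment, not verified here:

```python
# Rough sketch: resolve a submitted file's checksum to an existing indexd record.
import requests

INDEXD_BASE = "https://example-commons.org/index/index"  # placeholder URL

def get_indexd_record(md5):
    """Return the first indexd record whose md5 hash matches, or None."""
    resp = requests.get(INDEXD_BASE, params={"hash": "md5:" + md5})
    resp.raise_for_status()
    records = resp.json().get("records", [])
    return records[0] if records else None

def resolve_file_id(md5):
    record = get_indexd_record(md5)
    if record is None:
        # sheepdog would translate this into a 400 response to the submitter
        raise ValueError("no indexd record matches checksum md5:" + md5)
    return record["did"]  # stored on the data node as file_id / object_id
```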
@philloooo
Contributor Author

➤ Pauline Ribeyre commented:

[~zflamig] [~[email protected]] Have steps 1 and 2 already been implemented? On which branch?

@philloooo
Contributor Author

1 is not a service; it's just that users have direct access to S3 in some cases.
2 is the indexer service that I linked.

@philloooo
Contributor Author

➤ Pauline Ribeyre commented:

[~[email protected]] I meant steps 1 and 2 of the scope. Because it's written here to "implement step 3"

@philloooo
Contributor Author

Oops, sorry. I rearranged the comment; it was referring to step 3 of the scenario at the bottom of the comment... None of the items in the scope are implemented.

@paulineribeyre
Contributor

Got it! thanks
