
PXD-1187 ⁃ make sheepdog configurable to enforce indexd record exists before file node registration #164

Closed
philloooo opened this issue Jul 20, 2018 · 5 comments
@philloooo
Contributor

philloooo commented Jul 20, 2018

scope:

implement step 3 in the scenario described below:

  • add a "REQUIRE_INDEX_EXISTS_FOR_FILE" (or a better name, if you can think of one) configuration flag to sheepdog.
  • if such a flag is present and set to True, then when creating or updating a file node, sheepdog should return an error if no matching indexd record exists (see the sketch below).
  • if a matching record does exist, store its object_id in the node.
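
A minimal sketch of how such a check could be wired, assuming sheepdog reads its settings from the Flask app config; `validate_file_node` and `indexd_lookup` are hypothetical names used only for illustration, not sheepdog's actual API:

```python
# Hypothetical sketch, not sheepdog's actual code: gate file-node creation and
# update on a REQUIRE_INDEX_EXISTS_FOR_FILE config flag.
import flask

def validate_file_node(checksum, indexd_lookup):
    """Return the indexd record id to store on the node, or abort with a 400.

    indexd_lookup: hypothetical callable that returns the indexd record
    matching `checksum`, or None if no such record exists.
    """
    if not flask.current_app.config.get("REQUIRE_INDEX_EXISTS_FOR_FILE", False):
        return None  # flag absent or False: keep the current behavior
    record = indexd_lookup(checksum)
    if record is None:
        flask.abort(400, "no indexd record matches this file's checksum")
    return record["did"]  # indexd record ids are exposed in the "did" field
```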

context:

current status described by users:

  1. As far as I know, there is no way to get a registered URL without typing it into api/index/index/UUID, since you can't query "urls" on a node (IMO it should be queryable. If it were, you could write a script to query the URL, then use the AWS CLI to download it in your VM.)
  2. I have searched for data files myself and found that many of them don't even have the same file_name in Windmill vs. S3 storage. So how are users supposed to find a large list of files? Right now, I think they have to do it by hand, since allowing this sort of mismatch means they can't do it programmatically.
  3. Given that BloodPAC has tighter security (it only allows users to download to a machine in the VPC), will they be able to implement "Files/Exploration" in Windmill? And/or "Workspace"?

For BloodPAC's use case, we can support this data flow:
Scenario 1: user uploads data first, to storage that users have upload access to
BPA uses this flow because their access to the buckets predates Gen3. DCP and EDC also use this flow: the data is not owned by us, we are given read access to those buckets in those commons, so we continuously index them into indexd and link them in our graph.

  1. the user uploads data to buckets that they have direct access to
  2. there is a lambda hosted by Gen3 that listens for bucket updates, then checksums and indexes the new objects into indexd ( https://github.com/occ-data/goes16-indexer - work needs to be done to automate deployment and polish the prototype for integration into Gen3)
  3. the user uploads the metadata; if sheepdog can’t find a record in indexd that matches the checksum, it returns 400. If it can, it creates the data node with file_id == indexd’s record id (sketched below)
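
A rough sketch of that step-3 lookup, assuming indexd's listing endpoint accepts a `hash` filter of the form `md5:<hexdigest>` and returns matches under a `records` key with the record id in `did`; the base URL and those details are assumptions about the deployment, not verified here:

```python
# Rough sketch: resolve a submitted file's checksum to an existing indexd record.
import requests

INDEXD_BASE = "https://example-commons.org/index/index"  # placeholder URL

def get_indexd_record(md5):
    """Return the first indexd record whose md5 hash matches, or None."""
    resp = requests.get(INDEXD_BASE, params={"hash": "md5:" + md5})
    resp.raise_for_status()
    records = resp.json().get("records", [])
    return records[0] if records else None

def resolve_file_id(md5):
    record = get_indexd_record(md5)
    if record is None:
        # sheepdog would translate this into a 400 response to the submitter
        raise ValueError("no indexd record matches checksum md5:" + md5)
    return record["did"]  # stored on the data node as file_id / object_id
```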
@philloooo
Contributor Author

➤ Pauline Ribeyre commented:

[~zflamig] [~[email protected]] Have steps 1 and 2 already been implemented? On which branch?

@philloooo
Contributor Author

1 is not a service; it's just that users have direct access to S3 in some cases.
2 is the indexer service that I linked.

@philloooo
Contributor Author

➤ Pauline Ribeyre commented:

[~[email protected]] I meant steps 1 and 2 of the scope. Because it's written here to "implement step 3"

@philloooo
Contributor Author

Oops, sorry. I rearranged the comment; it was referring to step 3 of the scenario at the bottom of the comment... None of the items in the scope are implemented.

@paulineribeyre
Contributor

Got it! thanks
