
# Expose dataset upload stage directly to users #40

**Open** · mortenpi opened this issue on Oct 26, 2023 · 1 comment
Labels: enhancement (New feature or request)

**@mortenpi** (Member):
Uploading a dataset to JuliaHub is actually a multi-step process: you "open an upload" to obtain temporary S3 credentials, then talk directly to S3, and finally you "close the upload". We currently hide away that complexity in:

JuliaHub.jl/src/datasets.jl, lines 482 to 606 at 27e5c72:
```julia
@_authuser function upload_dataset(
    dsref::_DatasetRefTuple,
    local_path::AbstractString;
    # Operation type
    create::Bool=true,
    update::Bool=false,
    replace::Bool=false,
    # Dataset metadata
    description::Union{AbstractString, Missing}=missing,
    tags=missing,
    visibility::Union{AbstractString, Missing}=missing,
    license::Union{AbstractString, Tuple{Symbol, <:AbstractString}, Missing}=missing,
    groups=missing,
    # Authentication
    auth::Authentication=__auth__(),
)
    username, dataset_name = dsref
    _assert_current_user(username, auth; op="upload_new_dataset")
    if !create && !update
        throw(ArgumentError("'create' and 'update' can not both be false"))
    end
    if update && replace
        throw(ArgumentError("'update' and 'replace' can not both be true"))
    end
    tags = _validate_iterable_argument(String, tags; argument="tags")
    groups = _validate_iterable_argument(String, groups; argument="groups")
    # We determine the dataset dtype from the local path.
    # This may throw an ArgumentError.
    dtype = _dataset_dtype(local_path)
    # We need to declare `r` here, because we want to reuse the variable name
    local r::_RESTResponse
    # If `create`, then we first try to create the dataset. If the dataset name
    # is already taken, then we should get a 409 back.
    local newly_created_dataset::Bool = false
    if create
        # Note: we do not set tags or description here (even though we could), but we
        # will do that in an update_dataset() call later.
        r = _new_dataset(dataset_name, dtype; auth)
        if r.status == 409
            # 409 Conflict indicates that a dataset with this name already exists.
            if !update && !replace
                # If neither update nor replace is set, and the dataset exists, then
                # we must throw an invalid request error.
                throw(
                    InvalidRequestError(
                        "Dataset '$dataset_name' for user '$username' already exists, but update=false and replace=false.",
                    ),
                )
            elseif replace
                # In replace mode we will delete the existing dataset and
                # create a new one.
                delete_dataset((username, dataset_name); auth)
                r_recreated::_RESTResponse = _new_dataset(dataset_name, dtype; auth)
                if r_recreated.status == 200
                    newly_created_dataset = true
                else
                    _throw_invalidresponse(r_recreated)
                end
            end
            # There is one more case -- `update && !replace` -- but in this case
            # we just move on to uploading a new version.
        elseif r.status == 200
            # The only other valid response is 200, when we create the dataset
            newly_created_dataset = true
        else
            # For any non-200/409 responses we throw a backend error.
            _throw_invalidresponse(r)
        end
    end
    # If `!create`, the only option allowed is `update` (`replace` is excluded).
    #
    # Acquire an upload for the dataset. By this point, the dataset with this name
    # should definitely exist, although race conditions are always a possibility.
    r = _open_dataset_version(dataset_name; auth)
    if (r.status == 404) && !create
        # A non-existent dataset if create=false indicates a user error.
        throw(
            InvalidRequestError(
                "Dataset '$dataset_name' for '$username' does not exist and create=false."
            ),
        )
    elseif r.status != 200
        # Any other 404 or other non-200 response indicates a backend failure
        _throw_invalidresponse(r)
    end
    upload_config, _ = _parse_response_json(r, Dict)
    # Verify that the dtype of the remote dataset is what we expect it to be.
    if upload_config["dataset_type"] != dtype
        if newly_created_dataset
            # If we just created the dataset, then there has been some strange error
            # if dtypes do not match.
            throw(JuliaHubError("Dataset types do not match."))
        else
            # Otherwise, it's a user error (i.e. they are trying to update a dataset
            # with the wrong dtype).
            throw(
                InvalidRequestError(
                    "Local data type ($dtype) does not match existing dataset dtype $(upload_config["dataset_type"])",
                ),
            )
        end
    end
    # Upload the actual data
    try
        _upload_dataset(upload_config, local_path)
    catch e
        throw(JuliaHubError("Data upload failed", e, catch_backtrace()))
    end
    # Finalize the upload
    try
        # _close_dataset_version will also throw on non-200 responses
        _close_dataset_version(dataset_name, upload_config; local_path, auth)
    catch e
        throw(JuliaHubError("Finalizing upload failed", e, catch_backtrace()))
    end
    # Finally, update the dataset metadata with the new metadata fields.
    if !all(ismissing.((description, tags, visibility, license, groups)))
        update_dataset(
            (username, dataset_name); auth,
            description, tags, visibility, license, groups
        )
    end
    # If everything was successful, we'll return an updated DataSet object.
    return dataset((username, dataset_name); auth)
end
```

Sometimes users may want to control the upload step themselves. So we should expose a slightly lower-level API that returns an object containing the information and credentials for the active dataset upload. The user can then upload the data themselves, and finally they just need to close the upload.

The use cases I see for this:

  1. Users wanting to upload things to S3 themselves using another tool, and they just want the credentials (e.g. they want to invoke rclone by hand for one reason or another).
  2. Tools that want to have more control over how data gets written to the S3 bucket (e.g. to avoid writing all the files as temporary files).
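A minimal sketch of what that lower-level flow could look like. The names `open_dataset_upload` and `close_dataset_upload` are hypothetical (not part of JuliaHub.jl today), and the stubs below merely stand in for the real `_open_dataset_version` / `_close_dataset_version` REST calls:

```julia
# Hypothetical sketch of the proposed user-facing flow. The function names are
# illustrative only; the stubs stand in for the actual REST calls to JuliaHub.

# Stub: would call _open_dataset_version, parse the response, and hand the
# upload context back to the user instead of consuming it internally.
function open_dataset_upload(dataset_name::AbstractString)
    upload_config = Dict{String, Any}(
        "location" => Dict("bucket" => "example-bucket",
                           "prefix" => "datasets/$(dataset_name)"),
    )
    return (; dataset_name = String(dataset_name), upload_config)
end

# Stub: would call _close_dataset_version with the stored upload_config.
close_dataset_upload(upload) = true

# Use case 1: the user uploads out-of-band (e.g. with rclone) and only needs
# the location and credentials from the open step.
upload = open_dataset_upload("MyDataset")
location = upload.upload_config["location"]
remote_path = "$(location["bucket"])/$(location["prefix"])"
# ... invoke rclone (or another S3 client) against remote_path here ...
close_dataset_upload(upload)
```

The key design point is that the open step returns the full upload context, so whatever object eventually gets closed carries everything the backend needs to finalize the version.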
@mortenpi added the **enhancement** (New feature or request) label on Oct 26, 2023
**@pfitzseb** (Member):
I think this boils down to

```julia
auth = JuliaHub.__auth__()
dataset_name = "MyDataset"

r = JuliaHub._new_dataset(dataset_name, "BlobTree"; auth)
@assert r.status == 200
r = JuliaHub._open_dataset_version(dataset_name; auth)
@assert r.status == 200
upload_config, _ = JuliaHub._parse_response_json(r, Dict)

bucket = upload_config["location"]["bucket"]
prefix = upload_config["location"]["prefix"]
remote_path = "$bucket/$prefix"
# extract auth from `upload_config` and upload to `remote_path` manually

r = JuliaHub._close_dataset_version(dataset_name, upload_config; auth)
@assert r.status == 200
```

I like the idea of putting all of the context into a struct that users can manually inspect and eventually pass to `close_dataset_version`.
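One possible shape for such a struct, as a sketch only: the field names and the two-argument convenience constructor are assumptions, with just the `"location"` / `"bucket"` / `"prefix"` keys taken from the snippet above:

```julia
# Sketch of a context struct built from the parsed `upload_config` Dict.
# Field names are illustrative; the raw config is kept verbatim so it can be
# passed through to _close_dataset_version unchanged.
struct DatasetVersionUpload
    dataset_name::String
    bucket::String
    prefix::String
    upload_config::Dict{String, Any}
end

# Convenience constructor: pull the S3 location out of the parsed config.
function DatasetVersionUpload(dataset_name::AbstractString, upload_config::Dict)
    location = upload_config["location"]
    return DatasetVersionUpload(
        String(dataset_name), location["bucket"], location["prefix"], upload_config,
    )
end

# Example with a config shaped like the snippet above:
cfg = Dict{String, Any}("location" => Dict("bucket" => "b", "prefix" => "p"))
u = DatasetVersionUpload("MyDataset", cfg)
```

Users could then inspect `u.bucket` and `u.prefix` (and the credentials inside `u.upload_config`) to drive their own upload, and hand `u` back to the close call when done.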
