
# Expose dataset upload stage directly to users #40

**Open** · mortenpi opened this issue on Oct 26, 2023 · 1 comment
Labels: enhancement (New feature or request)

**@mortenpi** (Member):
Uploading a dataset to JuliaHub is actually a multi-step process: you "open an upload" to obtain temporary S3 credentials, then talk directly to S3, and finally you "close the upload". We currently hide away that complexity in:

JuliaHub.jl/src/datasets.jl, lines 482 to 606 at 27e5c72:
```julia
@_authuser function upload_dataset(
    dsref::_DatasetRefTuple,
    local_path::AbstractString;
    # Operation type
    create::Bool=true,
    update::Bool=false,
    replace::Bool=false,
    # Dataset metadata
    description::Union{AbstractString, Missing}=missing,
    tags=missing,
    visibility::Union{AbstractString, Missing}=missing,
    license::Union{AbstractString, Tuple{Symbol, <:AbstractString}, Missing}=missing,
    groups=missing,
    # Authentication
    auth::Authentication=__auth__(),
)
    username, dataset_name = dsref
    _assert_current_user(username, auth; op="upload_new_dataset")
    if !create && !update
        throw(ArgumentError("'create' and 'update' can not both be false"))
    end
    if update && replace
        throw(ArgumentError("'update' and 'replace' can not both be true"))
    end
    tags = _validate_iterable_argument(String, tags; argument="tags")
    groups = _validate_iterable_argument(String, groups; argument="groups")
    # We determine the dataset dtype from the local path.
    # This may throw an ArgumentError.
    dtype = _dataset_dtype(local_path)
    # We need to declare `r` here, because we want to reuse the variable name
    local r::_RESTResponse
    # If `create`, then we first try to create the dataset. If the dataset name
    # is already taken, then we should get a 409 back.
    local newly_created_dataset::Bool = false
    if create
        # Note: we do not set tags or description here (even though we could), but we
        # will do that in an update_dataset() call later.
        r = _new_dataset(dataset_name, dtype; auth)
        if r.status == 409
            # 409 Conflict indicates that a dataset with this name already exists.
            if !update && !replace
                # If neither update nor replace is set, and the dataset exists, then
                # we must throw an invalid request error.
                throw(
                    InvalidRequestError(
                        "Dataset '$dataset_name' for user '$username' already exists, but update=false and replace=false.",
                    ),
                )
            elseif replace
                # In replace mode we will delete the existing dataset and
                # create a new one.
                delete_dataset((username, dataset_name); auth)
                r_recreated::_RESTResponse = _new_dataset(dataset_name, dtype; auth)
                if r_recreated.status == 200
                    newly_created_dataset = true
                else
                    _throw_invalidresponse(r_recreated)
                end
            end
            # There is one more case -- `update && !replace` -- but in this case
            # we just move on to uploading a new version.
        elseif r.status == 200
            # The only other valid response is 200, when we create the dataset
            newly_created_dataset = true
        else
            # For any non-200/409 responses we throw a backend error.
            _throw_invalidresponse(r)
        end
    end
    # If `!create`, the only option allowed is `update` (`replace` is excluded).
    #
    # Acquire an upload for the dataset. By this point, the dataset with this name
    # should definitely exist, although race conditions are always a possibility.
    r = _open_dataset_version(dataset_name; auth)
    if (r.status == 404) && !create
        # A non-existent dataset if create=false indicates a user error.
        throw(
            InvalidRequestError(
                "Dataset '$dataset_name' for '$username' does not exist and create=false."
            ),
        )
    elseif r.status != 200
        # Any other 404 or other non-200 response indicates a backend failure
        _throw_invalidresponse(r)
    end
    upload_config, _ = _parse_response_json(r, Dict)
    # Verify that the dtype of the remote dataset is what we expect it to be.
    if upload_config["dataset_type"] != dtype
        if newly_created_dataset
            # If we just created the dataset, then there has been some strange error
            # if dtypes do not match.
            throw(JuliaHubError("Dataset types do not match."))
        else
            # Otherwise, it's a user error (i.e. they are trying to update a dataset
            # with the wrong dtype).
            throw(
                InvalidRequestError(
                    "Local data type ($dtype) does not match existing dataset dtype $(upload_config["dataset_type"])",
                ),
            )
        end
    end
    # Upload the actual data
    try
        _upload_dataset(upload_config, local_path)
    catch e
        throw(JuliaHubError("Data upload failed", e, catch_backtrace()))
    end
    # Finalize the upload
    try
        # _close_dataset_version will also throw on non-200 responses
        _close_dataset_version(dataset_name, upload_config; local_path, auth)
    catch e
        throw(JuliaHubError("Finalizing upload failed", e, catch_backtrace()))
    end
    # Finally, update the dataset metadata with the new metadata fields.
    if !all(ismissing.((description, tags, visibility, license, groups)))
        update_dataset(
            (username, dataset_name); auth,
            description, tags, visibility, license, groups
        )
    end
    # If everything was successful, we'll return an updated DataSet object.
    return dataset((username, dataset_name); auth)
end
```

Sometimes users may want to control the upload step themselves. So we should expose a slightly lower-level API that returns an object containing the information and credentials for the active dataset upload. The user can then upload the data themselves, and finally they just need to close the upload.

The use cases I see for this:

  1. Users wanting to upload things to S3 themselves using another tool, and they just want the credentials (e.g. they want to invoke rclone by hand for one reason or another).
  2. Tools that want to have more control over how data gets written to the S3 bucket (e.g. to avoid writing all the files as temporary files).
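A minimal sketch of what that lower-level flow could look like. The names `open_dataset_upload` and `close_dataset_upload` are hypothetical (not part of JuliaHub.jl today), and the stubs below merely stand in for the real `_open_dataset_version` / `_close_dataset_version` REST calls:

```julia
# Hypothetical sketch of the proposed user-facing flow. The function names are
# illustrative only; the stubs stand in for the actual REST calls to JuliaHub.

# Stub: would call _open_dataset_version, parse the response, and hand the
# upload context back to the user instead of consuming it internally.
function open_dataset_upload(dataset_name::AbstractString)
    upload_config = Dict{String, Any}(
        "location" => Dict("bucket" => "example-bucket",
                           "prefix" => "datasets/$(dataset_name)"),
    )
    return (; dataset_name = String(dataset_name), upload_config)
end

# Stub: would call _close_dataset_version with the stored upload_config.
close_dataset_upload(upload) = true

# Use case 1: the user uploads out-of-band (e.g. with rclone) and only needs
# the location and credentials from the open step.
upload = open_dataset_upload("MyDataset")
location = upload.upload_config["location"]
remote_path = "$(location["bucket"])/$(location["prefix"])"
# ... invoke rclone (or another S3 client) against remote_path here ...
close_dataset_upload(upload)
```

The key design point is that the open step returns the full upload context, so whatever object eventually gets closed carries everything the backend needs to finalize the version.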
@mortenpi added the **enhancement** (New feature or request) label on Oct 26, 2023
**@pfitzseb** (Member):
I think this boils down to

```julia
auth = JuliaHub.__auth__()
dataset_name = "MyDataset"

r = JuliaHub._new_dataset(dataset_name, "BlobTree"; auth)
@assert r.status == 200
r = JuliaHub._open_dataset_version(dataset_name; auth)
@assert r.status == 200
upload_config, _ = JuliaHub._parse_response_json(r, Dict)

bucket = upload_config["location"]["bucket"]
prefix = upload_config["location"]["prefix"]
remote_path = "$bucket/$prefix"
# extract auth from `upload_config` and upload to `remote_path` manually

r = JuliaHub._close_dataset_version(dataset_name, upload_config; auth)
@assert r.status == 200
```

I like the idea of putting all of the context into a struct that users can manually inspect and eventually pass to `close_dataset_version`.
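One possible shape for such a struct, as a sketch only: the field names and the two-argument convenience constructor are assumptions, with just the `"location"` / `"bucket"` / `"prefix"` keys taken from the snippet above:

```julia
# Sketch of a context struct built from the parsed `upload_config` Dict.
# Field names are illustrative; the raw config is kept verbatim so it can be
# passed through to _close_dataset_version unchanged.
struct DatasetVersionUpload
    dataset_name::String
    bucket::String
    prefix::String
    upload_config::Dict{String, Any}
end

# Convenience constructor: pull the S3 location out of the parsed config.
function DatasetVersionUpload(dataset_name::AbstractString, upload_config::Dict)
    location = upload_config["location"]
    return DatasetVersionUpload(
        String(dataset_name), location["bucket"], location["prefix"], upload_config,
    )
end

# Example with a config shaped like the snippet above:
cfg = Dict{String, Any}("location" => Dict("bucket" => "b", "prefix" => "p"))
u = DatasetVersionUpload("MyDataset", cfg)
```

Users could then inspect `u.bucket` and `u.prefix` (and the credentials inside `u.upload_config`) to drive their own upload, and hand `u` back to the close call when done.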
