# Resolve against OCI / container registries as sha256-addressed content stores? #94
@cboettig nice! Thanks for sharing. I haven't looked at this yet, but it definitely sounds interesting. At a quick glance, the Open Container Initiative image spec https://github.com/opencontainers/image-spec/blob/main/spec.md sounds much like the architecture of Preston, with a provenance / content layer. The difference is that the OCI image spec is more specific (less flexible): a Preston-style package can use any text-based format (including rdf/nquads), whereas the OCI image spec is geared towards file systems and computer programs. I'd be open to experimenting with supporting the OCI image spec in the context of the Preston code base.
fyi @mielliott
Note that no open access appears to be allowed for the github content registry -
Also, note that GitHub doesn't seem to be able to locate content store SHAs via their API - is this expected? In which package registry did you publish hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37 ? https://github.com/search?q=9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37 (see attached screenshot)
Hi @jhpoelen - open access is absolutely supported; please be sure to set the header as specified in my example:

No authentication is required, only setting the header. I published this example to the package called - note that this has nothing to do, per se, with the GitHub API. To locate content on the container registry you need to search the container registry -- this could be any container registry -- e.g. the oras docs show using a zot registry on localhost. I haven't really played around with a local registry yet. So far, though, I think you should be able to access the object by hash, with the header as in my original example, without any authentication.
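A hedged sketch of what anonymous, header-only access to a public GHCR package can look like. The token endpoint and the `/v2/<name>/blobs/<digest>` route follow the standard registry API, but the package name `cboettig/content-store` is a made-up placeholder:

```shell
# Anonymous pull of a public blob from GHCR, addressed by sha256 digest.
# ASSUMPTION: the package name below is a made-up placeholder.
REPO=cboettig/content-store
HASH=9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37

# 1. Fetch an anonymous pull token (no GitHub account needed for public packages):
#    TOKEN=$(curl -s "https://ghcr.io/token?scope=repository:${REPO}:pull" | jq -r .token)
# 2. Request the blob, with the Authorization header as the only credential:
URL="https://ghcr.io/v2/${REPO}/blobs/sha256:${HASH}"
echo "$URL"
#    curl -sL -H "Authorization: Bearer ${TOKEN}" "$URL" -o vostok.co2
```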
right, OCI was imagined for that purpose, but as you have so often said, bits are bits: it's just a key-value store indexed by sha256 hashes of those bits, and it really doesn't care whether they are rdf, or plain txt, or something else. As the oras docs say,

oras has a nice command line tool and client libraries, but as I understand it, it should be possible to interact with any OCI registry using standard tools (i.e. sha256 hashes and curl requests). This overview is a particularly nice summary which I think very much resonates with the preston design? Independent of the oras implementation, you've probably seen the opencontainers descriptor spec, which applies to any compliant OCI registry. A nice list of compliant registries includes many open-source self-hosting options as well as the registries of most commercial cloud providers, which suggests to me this strategy has at least as much capacity to scale as IPFS, but who knows.
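To make the "key-value store" view concrete, these are the standard pull routes any compliant registry serves, sketched here with placeholder registry and repository names (the digest shown is the sha256 of empty input):

```shell
# The standard pull routes defined by the OCI distribution spec.
# ASSUMPTION: REGISTRY and NAME are placeholders, not a real published package.
REGISTRY=ghcr.io
NAME=example/data
DIGEST=sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855

echo "https://${REGISTRY}/v2/"                          # API version check
echo "https://${REGISTRY}/v2/${NAME}/manifests/latest"  # manifest, by tag or digest
echo "https://${REGISTRY}/v2/${NAME}/blobs/${DIGEST}"   # blob, addressed purely by content
```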
Thanks for elaborating. From

```
preston cat \
  --remote https://softwareheritage.org \
  'line:hash://sha256/69f9224697d534daf8079fad21cd45cbbe888720014455dc9b15b600fc8cc063!/L52,L53'
```

I got pretty excited when I read -
Thank you! Hoping to play around with this some more sooner rather than later. It'd be fun to have the oras documentation be hosted via . . . a content registry . . . aside from all known biodiversity datasets that have been tracked by Preston since 2018 (and DataONE, and Zenodo, etc. etc.).
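The `line:` deep link above (selecting L52,L53) amounts to fetching a blob by its hash and slicing out lines; a rough equivalent with plain tools, assuming a hash resolver such as linker.bio, with the line-slicing step itself runnable offline:

```shell
# A deep link like line:hash://sha256/<H>!/L52,L53 is roughly:
#   curl -sL "https://linker.bio/hash://sha256/<H>" | sed -n '52p;53p'
# The slicing step itself, demonstrated on a small stand-in:
printf 'a\nb\nc\n' | sed -n '2p'   # selects line 2 only
```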
@jhpoelen very cool, love the deep link referencing in preston btw! I think ages ago we talked about hosting mirrors of preston, and I know we've also been on the lookout for more reliable / stable options for https://hash-archive.org, which seems to have gone forever dark now (even though my fork, https://hash-archive.carlboettiger.info/, is still up for now). My instinct is that these container registries are going to be around for a while (the open source docker registry idea / spec / software has already been around for about a decade, and clearly the concept is steadily growing). So I'm keen for alternatives that can function in a similar capacity to the hash-archive.org registry (even better if they also act as content stores), with greater scalability and reliability. How do you feel about leveraging all these existing container registries, with their ability both to act as content-addressable storage and to host manifests as a DAG, as the new hash-archive.org or as preston mirrors? Obviously the container registries need a bit of software wrapped around them to make it easy to use them this way rather than in the way they were originally intended (i.e. using
Leveraging, and experimenting with, existing infrastructure / technologies like container registries (via the Open Container Initiative and associated projects like https://github.com/oras-project ) sounds like a good idea, especially now that @nfranz , @GregPost-ASU and friends at ASU (@n8upham) seem pretty excited about helping to increase the mobility of our diversity data through content-addressed repositories and associated registries. They've already helped keep a copy of GenBank's plant sequence flat file archives in a preston package via BioKiC / Globus (see the use case linking Jenn Yost's San Luis Obispo Herbarium OBI globalbioticinteractions/globalbioticinteractions#904 to their associated GenBank sequences). Is there a particular use case you have in mind?
Just implemented rudimentary support for the github content repository:
Is there a registry of OCI registries?
Here's a list -

```
preston cat --remote https://softwareheritage.org 'line:hash://sha256/85044a71a67b1ad51e71e718eb773a5977f0f60c8bcf7771e61905ca9c160cfb!/L15-L24'
```

## Registries supporting OCI Artifacts
- [CNCF Distribution](#cncf-distribution) - local/offline verification
- [Amazon Elastic Container Registry](#amazon-elastic-container-registry-ecr)
- [Azure Container Registry](#azure-container-registry-acr)
- [Google Artifact Registry](#google-artifact-registry-gar)
- [GitHub Packages container registry](#github-packages-container-registry-ghcr)
- [Bundle Bar](#bundle-bar)
- [Docker Hub](#docker-hub)
- [Zot Registry](#zot-registry)
Nice @jhpoelen , this is awesome. Yup, I have a bunch of use cases in mind! I've always had my eye out for robust / low-friction ways for researchers to work with and distribute their own data using content-addressed storage -- i.e., from an R perspective, replacing URL-based downloads with hash-based resolution. In contrast to the scientific data repositories, these OCI repos seem far better set up for scale and ease of use in day-to-day operations. (Note that I believe this complements rather than competes with the scientific data repositories, which were never intended for this task.) So my basic use case is that a user can push data to an OCI registry, and retrieve it with a content-based address later, with minimal friction and maximum performance. I'm curious as well about mirroring existing datasets, or a preston archive, into these OCI systems, where we can access them with content-based addresses and benefit from the greater storage / bandwidth availability of these hosts. iiuc, with preston I can copy either just the registry metadata or the actual content over to my own machine -- maybe one could do that with, say, globalbioticinteractions?
@cboettig this is all awesome. I'm just catching up with the thread, but it seems so promising. Seems like making DataONE and other repository systems OCI-compliant would be a great and easy approach to integrating traditional repositories with OCI. Given that Brew is using GHCR for their packages, I wonder if GHCR limits the sizes of its public uploads? It seems it would be easy to upload copies of the DataONE corpus to GHCR as a caching/distribution layer. I'm giving a brief orienting talk on content-based identifiers at ESIP tomorrow (https://sched.co/1NodS), and will plan to include this in my remarks on authority-based versus content-based identification. It's very cool. If either of you wanted to join remotely, they have hybrid access set up for registered conference folks.
hey @mbjones - looks like a great conference - I probably won't join 'cause I am attending another workshop at the Field Museum Thu/Fri. Can't help but plug our recent publication on the application of content-based identifiers in the form of signed citations of digital scientific data publications - Elliott, M.J., Poelen, J.H. & Fortes, J.A.B. Signing data citations enables data verification and citation persistence. Sci Data 10, 419 (2023). https://doi.org/10.1038/s41597-023-02230-y https://linker.bio/hash://sha256/f849c870565f608899f183ca261365dce9c9f1c5441b1c779e0db49df9c2a19d Please keep me posted on the responses you get, and insights gained, following the ESIP sessions. PS - I am still looking for a list of OCI-enabled endpoints (I have ghcr.io, and am hoping to add other ones, especially those that have open access support).
Just a note: I see that OCI has consolidated on SHA-256 and SHA-512 as the only two officially listed hash algorithms for digests. See: https://github.com/opencontainers/image-spec/blob/main/descriptor.md#registered-algorithms But I also see they say:
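For reference, a registered digest shows up inside a descriptor roughly like this (a hand-written illustration following the shape of the descriptor spec, not a quoted example; the digest and size here correspond to the empty blob):

```json
{
  "mediaType": "application/vnd.oci.image.layer.v1.tar",
  "digest": "sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
  "size": 0
}
```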
Hey, so do either of you know how to search an OCI registry like GHCR to resolve a hash without knowing the namespace that it's in? For example, while this works, it requires info that is not in the hash:

But is there a way to eliminate the need for the namespace?
@mbjones good question, I wondered that too. There's an optional endpoint in the spec called - From the contentid perspective, it seems like one might need to consider the namespace and domain name together as defining an endpoint? Definitely not ideal, though if we're really leaning into the distributed logic I guess that's no different than working across, say, GHCR and GitLab OCI, or one of the self-hosted OCI systems (zot, harbor, etc)... I'm not clear on size or bandwidth limits for GHCR; the docs for public storage only say "free".

I'm going to try playing a bit with the self-hosted registries, they look very simple to deploy...
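One hedged way to "resolve without a namespace" on registries that allow repository listing (the Docker-era `/v2/_catalog` route is optional, and does not appear to be exposed by GHCR, but many self-hosted registries implement it) is simply to probe every repository's blob route:

```shell
# Brute-force lookup of a digest across a registry's repositories.
# ASSUMPTION: a self-hosted registry (e.g. zot or CNCF Distribution) on localhost.
REGISTRY=localhost:5000
DIGEST=sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855

# for NAME in $(curl -s "http://${REGISTRY}/v2/_catalog" | jq -r '.repositories[]'); do
#   curl -sfI "http://${REGISTRY}/v2/${NAME}/blobs/${DIGEST}" >/dev/null \
#     && echo "found in ${NAME}"
# done
echo "probe pattern: http://${REGISTRY}/v2/<name>/blobs/${DIGEST}"
```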
See the endpoints section of the distribution spec: https://github.com/opencontainers/distribution-spec/blob/main/spec.md#endpoints
@jeroen pointed out to me that (docker or OCI) container registries like GitHub Package Registry are really just a bunch of sha256-addressed blobs. Moreover, existing open-source tools like oras already make it pretty easy to push arbitrary content there.

Notably, the docker/OCI registry is already a widely adopted standard, easy to self-host, and readily found on any major cloud provider. GitHub package registry is 'free for public packages' -- Jeroen notes that large projects like brew use it as their back-end storage and distribution medium. For instance, here's a command to grab my favorite example from the GitHub content registry; you can merely request it by its sha256 hash:
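A hedged reconstruction of that retrieval (the package path is hypothetical; the hash is the Vostok example quoted earlier in the thread), with the verification step demonstrated offline on a stand-in payload:

```shell
# Fetch by hash (package path "cboettig/content-store" is a made-up placeholder):
HASH=9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37
# curl -sL -H "Authorization: Bearer ${TOKEN}" \
#   "https://ghcr.io/v2/cboettig/content-store/blobs/sha256:${HASH}" -o vostok.co2
# sha256sum vostok.co2   # must print ${HASH} if the bytes are intact

# The verification step itself, runnable offline on a stand-in payload:
TMP=$(mktemp)
printf 'hello\n' > "$TMP"
sha256sum "$TMP" | cut -d' ' -f1   # digest of the stand-in payload
rm -f "$TMP"
```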
and verify this is indeed the Vostok ice core file. To push to the content store I've used the oras client tool:

Container registries actually have a nice manifest system for associating different files with a single manifest, and for adding metadata, including file names, tags (i.e. version tags), and (I think) MIME types, as well as generic extensible metadata in labels, all of which could facilitate much more discovery than the approach above, where I just use opaque blobs. These features could be particularly interesting, but maybe beyond the scope / generality of
contentid?

Anyway, it does seem like an intriguing way to deploy (public) data to a system that acts both as a content store and as a registry addressable by SHA-256 hash, using a mechanism that is easily deployed locally (self-hosted container registry) and already in use by so many major providers. @jhpoelen @mbjones thoughts?
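For completeness, a hedged sketch of what the push side can look like with the oras client; the registry reference and tag are made-up placeholders, and the annotation key is the conventional OCI title annotation - treat this as a sketch, not the exact command used above:

```shell
# Push a file as an OCI artifact, attaching manifest metadata via annotations.
# ASSUMPTION: "ghcr.io/cboettig/content-store:vostok" is a made-up reference.
# oras push ghcr.io/cboettig/content-store:vostok \
#   --annotation "org.opencontainers.image.title=vostok.icecore.co2" \
#   vostok.co2:text/plain
FILE=vostok.co2
MEDIATYPE=text/plain
echo "oras push <registry>/<name>:<tag> ${FILE}:${MEDIATYPE}"
```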