Support mirror servers: custom downloading upstream #136

Closed
johnnychen94 opened this issue Aug 20, 2021 · 12 comments · Fixed by #177
Labels
enhancement New feature or request
Milestone

Comments

@johnnychen94
Member

johnnychen94 commented Aug 20, 2021

One of the main reasons I built jill.py is that I live in China, where AWS S3 and GitHub are quite unstable and often unreachable. We have a few mirror sites in China, and for users here it's generally much faster and more stable to download binaries from them. jill.py has a built-in mechanism to download binaries from the nearest server (measured by RTT) and uses GPG verification to ensure the downloads can be trusted.

This can also speed up Julia setup in CI environments (e.g., self-hosted GitLab), where you can point your runners at a Julia binary mirror on the LAN instead of the Julia S3 storage.

Is there any plan to support this?

@davidanthoff
Collaborator

I hadn't thought of this at all, but it sounds like a great feature and I'd love to see something like that here as well! If there is a way that we could port/learn from the stuff you did on that in jill.py it would of course be even better :)

Right now we have download URLs very hard coded in our "versions db" (which is just a JSON file), so we would need to redesign that a bit, but I don't think that would be prohibitive at all. I'm in the process of redesigning some of the logic for checking for new versions etc at the moment, so this issue came just in time, I'll try to design this from the get-go now so that we might be able to add mirrors down the road.

Are these mirrors "official" Julia language org mirrors? If not, would it make sense to make them that? Also CCing @staticfloat.

@johnnychen94 I actually had planned to reach out at some point to discuss the relationship between jill.py and juliaup, as they obviously have very similar (if not identical) goals. Not sure what your current thinking on all of this is, I think from my end this all started as a "get Julia in the Windows Store" project, and then spiraled out of control. At this point I do like that everything in Juliaup depends essentially on nothing (because it is written in Rust), but of course feature wise there is a fair bit of stuff that it is missing relative to jill.py. Not sure whether it would make sense to consider joining forces at some point?

@davidanthoff davidanthoff added the enhancement New feature or request label Aug 21, 2021
@davidanthoff davidanthoff added this to the Backlog milestone Aug 21, 2021
@johnnychen94
Member Author

johnnychen94 commented Aug 21, 2021

In jill.py, supporting mirror servers involves two separate non-trivial tasks:

  • Create a virtual database at runtime by composing the mirror registry (also a JSON file) with versions.json (#131). The database can be built per version-sys-arch triplet. To keep the mirror registry easy to maintain, I predefined some placeholders.
  • For any given triplet, find the "best" choice from the virtual database. Because there can be multiple sources to download from, this allows a graceful fallback (for instance, when downloading from the BFSU mirror fails, it falls back to Julia S3).
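The two tasks above could be sketched roughly as follows. This is not jill.py's actual code: the registry layout, the placeholder names ($os, $arch, $minor, $filename), the mirror.example.org host, and the x64 to x86_64 mapping are all assumptions made for illustration. Only the Julia S3 URL layout is taken from this thread.

```python
from string import Template

# Hypothetical mirror registry, in preference order. Each entry is a URL
# template; the placeholder names are invented for this sketch.
MIRRORS = [
    {"name": "example-mirror",
     "url": "https://mirror.example.org/julia/bin/$os/$arch/$minor/$filename"},
    {"name": "Julia S3",
     "url": "https://julialang-s3.julialang.org/bin/$os/$arch/$minor/$filename"},
]

# Assumed mapping from directory arch name to file-name arch name.
# Only the single case that appears in this thread is covered.
FILE_ARCH = {"x64": "x86_64"}


def candidates(version, os_name, arch, mirrors=MIRRORS):
    """Expand every mirror template for one version-sys-arch triplet.

    Returns (mirror name, URL) pairs in registry order, so a caller can try
    the first entry and fall back to the following ones.
    """
    major, minor, _patch = version.split(".")
    fields = {
        "os": os_name,
        "arch": arch,
        "minor": f"{major}.{minor}",
        "filename": f"julia-{version}-{os_name}-{FILE_ARCH[arch]}.tar.gz",
    }
    return [(m["name"], Template(m["url"]).substitute(fields)) for m in mirrors]


for name, url in candidates("1.6.2", "linux", "x64"):
    print(name, url)
```

Keeping Julia S3 as the last registry entry is what gives the fallback behaviour: it is only tried once every mirror listed before it has failed.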

With multiple sources, it's no longer safe to assume that any predefined URL is available, so every time it tries to download something, it needs to check that the URL is valid (e.g., send an HTTP HEAD request and see if it returns 200). And because there are multiple sources, these queries need to be fast: asynchronous, with a timeout.

There will be some technical issues if we try to bring mirror support to nightly versions. Not all mirror sites are willing to provide nightly builds because only a few people want them. And even if they do, there's no guarantee that they provide the latest nightly build. (CRef #96 (comment), johnnychen94/jill.py#17)

This reminds me that back when I wrote jill.py, @staticfloat provided a cache service on the Julia pkg server (https://github.com/JuliaPackaging/PkgServerS3Mirror). For instance:

- https://julialang-s3.julialang.org/bin/linux/x64/1.6/julia-1.6.2-linux-x86_64.tar.gz
+ https://mirror.us-east.pkg.julialang.org/julialang2/bin/linux/x64/1.6/julia-1.6.2-linux-x86_64.tar.gz

But it seems that it's not available anymore.


> At this point I do like that everything in Juliaup depends essentially on nothing (because it is written in Rust)

This can be both an advantage and a disadvantage. The advantage is that it depends on nothing and can run on any target system as long as a compiled juliaup binary is provided. The disadvantage is that you now need another set of distribution channels to make juliaup available to users (for instance #84, #97). How reliable would that be compared to python/pip, which almost everyone can reliably get in all sorts of ways? I believe this is also one of the original reasons you wanted to ship the Windows installer for juliaup via the Windows Store. For jill.py it's just pip install <something>, which every programmer nowadays knows how to do on every platform that Julia supports. Julia S3 is by all means reliable, but again, for users in China, where S3 is usually not reliable, it becomes another pain :(

I like the juliaup default and juliaup link subcommands, and I plan to add corresponding features to jill.py. As for joining forces on juliaup development, I'll watch and comment, but given my lack of Rust knowledge I probably won't be able to contribute code.

@StefanKarpinski
Member

StefanKarpinski commented Nov 22, 2021

We could easily extend pkg.julialang.org servers to also serve Julia version tarballs. It seems silly to have a second set of infrastructure for distributing Julia versions separate from the infrastructure we already have for distributing artifacts and package versions. @staticfloat has already worked hard at making sure that these are highly available all around the world. It would likely be a pair of HTTP endpoints along the lines of:

  • https://pkg.julialang.org/julia/versions returning a mapping of versions to tree hashes, and
  • https://pkg.julialang.org/julia/$hash serving up a content-addressed tarball of each hash

Maybe also something for serving up channel info if that would be helpful. Note that if someone is already setting JULIA_PKG_SERVER to a static mirror (for Great Firewall reasons, for example), then as long as that mirror included the right set of static files for Julia versions, everything would just work.

@staticfloat
Member

We in fact already have this! The whole julialang2 S3 bucket is transparently mirrored on all pkgservers. Just take the path within the S3 bucket (such as /bin/linux/x64/1.6/julia-1.6.4-linux-x86_64.tar.gz) and prepend https://pkg.julialang.org/mirror/julialang2:

https://pkg.julialang.org/mirror/julialang2/bin/linux/x64/1.6/julia-1.6.4-linux-x86_64.tar.gz

If you want to programmatically get a list of available versions, you just download the versions.json file, just like you would from the typical S3 bucket:

https://pkg.julialang.org/mirror/julialang2/bin/versions.json
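As a small illustration of that rewrite rule, here is a sketch that takes an existing S3 download URL, keeps only its path, and prepends the mirror prefix. The helper name to_pkgserver_mirror is made up for this example; the prefix and paths are the ones quoted above.

```python
from urllib.parse import urlsplit

PKG_SERVER_MIRROR = "https://pkg.julialang.org/mirror/julialang2"


def to_pkgserver_mirror(s3_url, prefix=PKG_SERVER_MIRROR):
    """Map a julialang2 S3 URL to the matching pkgserver mirror URL by
    keeping the path within the bucket and swapping out the host part."""
    path = urlsplit(s3_url).path  # e.g. "/bin/linux/x64/1.6/julia-1.6.4-..."
    return prefix.rstrip("/") + path


print(to_pkgserver_mirror(
    "https://julialang-s3.julialang.org/bin/linux/x64/1.6/julia-1.6.4-linux-x86_64.tar.gz"
))
```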

@StefanKarpinski
Member

It's nice that we can mirror from S3 but it feels a little haphazard (what if we decide to change the layout of that bucket or stop using S3?), so what about making it an official part of the Pkg server protocol so that we guarantee that there's a way to get this info and we'll keep it working? Of course, I'm long overdue on writing up the entire protocol as an official standard, but we largely stuck to the plan laid out in the original issue.

@davidanthoff
Collaborator

Agreed, I think it would be nice if eventually this all worked via the package server protocol, in particular if it was simple to just redirect everything to some mirror via one central place.

@johnnychen94
Member Author

johnnychen94 commented Nov 23, 2021

Actually, I think the triplet version julia-1.6.4-linux-x86_64.tar.gz is already a good "hash" value; there is no need to encrypt it.

@davidanthoff
Collaborator

@johnnychen94 do you know whether ipfs works in China? I've played around with that a bit lately, and it looks quite fantastic and could be a really efficient and simple solution for this problem?

A very simple implementation could be that we add a configuration flag that makes juliaup download everything from a local ipfs node. The idea then would essentially be that if you are in a situation where the normal URLs are blocked, you can install ipfs desktop (or the daemon) on your system, and then change one setting in juliaup so that it grabs everything from that local ipfs node. We could also set up an ipfs cluster for the juliaup specific stuff, then anyone or any organization could run a node that automatically follows the set of pins that are needed to keep the juliaup story up and running. It would essentially be a very simple way to have a completely decentralized solution.

@johnnychen94
Member Author

I'm not sure. My understanding is that we should try to follow the Pkg protocol and stick with HTTP.

The non-technical issue with ipfs in China is that an IPFS node can store anything, including sensitive content that the government doesn't want. So I don't think the traditional university mirror sites (e.g., TUNA) will run an ipfs node. And relying on anonymous ipfs nodes might be unstable. I've never used it, so I'm really unsure.

@davidanthoff
Collaborator

davidanthoff commented Nov 25, 2021

Ok, so I looked a bit more into this, and I think in the short term we can do the following: we just introduce a new env variable called JULIAUP_SERVER. If that is not set, we assume it is https://julialang-s3.julialang.org. And then I change the internal code so that instead of storing the entire URL for each download in the version DB, we only store the bin/linux/x64/1.6/julia-1.6.2-linux-x86_64.tar.gz part, and then at runtime we combine the value from JULIAUP_SERVER with that.

So that would mean that any mirror that simply reflects the file structure we have on S3 should work. It should also work with the mirror service from the existing package server, and it can even work with ipfs: the JULIAUP_SERVER value would then just be http://localhost:8080/SOMEIPNS.

There are a few, not very involved, things we need to do to make this work:

  • change the version DB to only record the second part of the URL
  • introduce support for the JULIAUP_SERVER env var
  • host the juliaup binaries on S3 as well, right now the juliaup selfupdate command downloads them from GitHub

And I think that is pretty much it? We can then still think about some fancier story down the road along the lines of what @johnnychen94 has done with a hosted server database etc.
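To make the plan concrete, here is a rough sketch of the runtime URL composition, written in Python for brevity even though juliaup itself is Rust. The function name download_url and the trailing-slash handling are assumptions; the variable name, default server, and relative path come from the description above.

```python
import os

DEFAULT_SERVER = "https://julialang-s3.julialang.org"


def download_url(relative_path, environ=os.environ):
    """Combine JULIAUP_SERVER (or the default) with the relative path that
    the version DB would store for a download."""
    # An unset or empty variable falls back to the default server.
    server = environ.get("JULIAUP_SERVER") or DEFAULT_SERVER
    # Tolerate a trailing slash on the server and a leading one on the path.
    return server.rstrip("/") + "/" + relative_path.lstrip("/")


path = "bin/linux/x64/1.6/julia-1.6.2-linux-x86_64.tar.gz"
print(download_url(path, environ={}))
print(download_url(path, environ={"JULIAUP_SERVER": "http://localhost:8080/SOMEIPNS"}))
```

With this shape, pointing at the package server mirror or a local ipfs gateway is only a matter of changing the environment variable; nothing in the version DB has to change.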

@johnnychen94
Member Author

Sounds like a good plan to me.
One small thing on this: because only a few people live on the edge, mirror sites usually don't host the static content for nightly builds, so maybe JULIAUP_SERVER should only affect the julialang2 bucket and not the julialangnightlies bucket. Letting JULIAUP_SERVER also affect julialangnightlies would require juliaup to check whether the downloaded tarballs are up to date, which becomes troublesome.

@StefanKarpinski
Member

> Actually, I think the triplet version julia-1.6.4-linux-x86_64.tar.gz is already a good "hash" value; there is no need to encrypt it.

That's a name, not a hash value. Specifically, when we identify content by hash, you can inherently check that you have the right version—just compute the tree hash of the content you got and it should be the same as the hash you requested. That obviously can't be done with that file name.
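The difference can be shown with a toy example. Note the substitution: the Pkg ecosystem identifies content by git tree hash, whereas this sketch uses a plain SHA-256 over raw bytes, purely to illustrate that a content hash is self-verifying and a file name is not. The function name verify is hypothetical.

```python
import hashlib


def verify(content: bytes, requested_hash: str) -> bool:
    """Check that downloaded bytes match the hash they were requested by.

    SHA-256 is a stand-in here; it is not the tree hash that Pkg uses.
    """
    return hashlib.sha256(content).hexdigest() == requested_hash.lower()


original = b"pretend this is a Julia tarball"
requested = hashlib.sha256(original).hexdigest()

print(verify(original, requested))             # the content we asked for
print(verify(b"tampered content", requested))  # same name, different bytes
```

A mirror could serve arbitrary bytes under the name julia-1.6.4-linux-x86_64.tar.gz and the name alone would give the client no way to notice, while a mismatch against the requested hash is detected immediately.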


4 participants