Integrate hadronsampledata into hadron #295
base: master
Conversation
This is a very nice solution!
I've just implemented what you have linked in the blog posts, so it isn't my original idea 😄. Carsten suggested in the linked issue that we should rather use a web server; I guess we would then offer a bunch of ZIP archives with data to download. Perhaps we should pick something, and then perhaps the whole PR here is obsolete.
Well, since you already did it... Is the idea to have this on CRAN as well? Otherwise we cannot depend on it.
CRAN has a 5 MB limit; the idea is specifically not to have it on CRAN and to put a bunch of data here. And just because I put in the work does not mean that we have to keep it. If having a loose collection of ZIP files is easier in the long run, we should discard the work here.
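For the web-server/ZIP alternative, the consumer side could stay entirely in base R. A minimal sketch, where the URL and archive name are invented for illustration and do not point at an existing resource:

```r
## Hypothetical sketch of fetching one of the proposed ZIP archives.
## URL and file names are placeholders, not an existing endpoint.
url  <- "https://example.org/hadron-data/pion-ff.zip"
dest <- file.path(tempdir(), "pion-ff.zip")
download.file(url, dest, mode = "wb")   # mode = "wb" for binary data
unzip(dest, exdir = file.path(tempdir(), "pion-ff"))
```

Users would then read the extracted files with hadron's reading routines; nothing needs to live inside an installed package.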
> CRAN has a 5 MB limit, the idea is to specifically not have it on CRAN and put a bunch of data here.

But how do we implement the dependency here? Maybe I don't fully understand yet. CRAN policy is that packages can only depend on CRAN packages (also in Suggests, as far as I understand).
The dependency is a weak one: users need to manually install the example data package, and it will live in a different namespace. If we were to limit ourselves to 5 MB, then we could make a data package, publish it on CRAN, and make it a hard dependency. But for the data I have in mind, 5 MB would not be enough, I think.
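Such a weak dependency is usually guarded at run time. A sketch, assuming the package name `hadronexampledata` and that the data files end up in a subdirectory under `inst/` (the directory name `extdata` is an assumption), so that `system.file()` can find them:

```r
## Soft-dependency guard: only touch the data if the package is installed.
if (requireNamespace("hadronexampledata", quietly = TRUE)) {
  ## locate the installed data files (subdirectory name is an assumption)
  datadir <- system.file("extdata", package = "hadronexampledata")
  print(list.files(datadir))
} else {
  message("Install 'hadronexampledata' to run this example.")
}
```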
We just have it in `Suggests`, where we already have soft dependencies on Bioconductor. That itself does not seem to be a problem. And the `Additional_repositories` entry makes sure that the existence can be verified during *check*.

I had to remove the dependency on Bioconductor for CRAN.

We could also just use `.onLoad` and not list the package in the `DESCRIPTION` if that makes it better. Perhaps we just wait for the next round of feedback from CRAN when we want to publish an update?

CRAN offers data repositories. I haven't read the policy for those yet.
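In `DESCRIPTION` terms, the setup described above would look roughly like this (values assumed from this discussion; `Additional_repositories` is the actual field name in Writing R Extensions):

```
Suggests:
    hadronexampledata
Additional_repositories:
    https://hiskp-lqcd.github.io/r-repo/
```

With this in place, `R CMD check` can verify that the suggested package exists in the extra repository even though it is not on CRAN.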
It seems that if one promises to update a package less often, one can exceed the 5 MB of data. What exactly do we want to offer? So far I have only seen the pion form factor data, which would be used to show the reading routines. The data would be more useful as a ZIP archive than installed into some location in the R library, I think. What other data would we want to provide to end users?
I want to add gradient flow and loop files, as well as some stuff for
As far as I have understood the literature that I posted, CRAN data packages should allow for this.
If we just have it on our web server (GitHub Pages), we would have the maximum flexibility. If the package is several hundred MB, then we will have a problem after a few releases, as GitHub only allows repositories of up to 1 GB in size. We would need to truncate the history of the repository. I guess the most interesting question is whether we want the data to be available via the R package system.
If I use the data in examples, it must be available in `Suggests`. You are right that this will basically be impossible to do on GitHub. Pulling it as a ZIP file might be an option, although I'm not sure the examples would continue to work.
I need to see how small I can make the example data for it to still be practically useful.
There must be additional possibilities to host some files; Carsten could for instance do that in his home directory at the institute. Or we ask for web space, which should not be that hard. Otherwise, 1 GB on GitHub would bring us pretty far; we could also use multiple repositories to circumvent the limitations, or just ask for an upgrade as part of our academic plan.

In machine-learning notebooks with Python, one usually downloads the data when it is needed from some website. I guess with a package like curl we can do the same thing in R. The disadvantage of a package would be that all data needs to be in one place.

If you want to use the data in examples of hadron, we will need the data package to be on CRAN. Installing it from external sources only seems to work after the fact, and we can't have it as a dependency.

A data package on CRAN would be the most formal variant and easiest to use afterwards. The process is just the most tedious and the limitations the strictest.
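The download-on-demand pattern from the Python world translates directly to R. A sketch of a caching fetch helper, where the base URL and the cache location are assumptions rather than existing endpoints (`tools::R_user_dir()` needs R >= 4.0):

```r
## Hypothetical download-on-demand helper with a local cache.
fetch_data <- function(name,
                       base_url = "https://hiskp-lqcd.github.io/r-repo/data",
                       cache = tools::R_user_dir("hadron", "cache")) {
  dir.create(cache, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(cache, name)
  if (!file.exists(dest)) {
    ## base R download; the curl package would work equally well
    utils::download.file(paste(base_url, name, sep = "/"), dest, mode = "wb")
  }
  dest
}
```

Each data file is fetched once and reused from the cache afterwards, so examples and vignettes would not hammer the server.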
Having such large data sets in regular examples is not a good idea anyhow, because they would presumably run for too long!
Closes #292.
I have made the repository hadron_example_data into a real R package. It needs to be built using

```
R CMD build .
```

and deployed in a moment.

Then I have created a new repository r-repo, which will serve to hold the data for serving over HTTP. The example data package is deployed via

```
drat::insertPackage('hadronexampledata_0.0.0.9000.tar.gz', repodir = '../r-repo/', action = 'prune')
```

Then go to `r-repo`, commit, and push.

At `r-repo` I have enabled GitHub Pages to serve the content at https://hiskp-lqcd.github.io/r-repo/, such that we have our own R repository (like CRAN).

This pull request adds a little more metadata to hadron, such that one can do

```
install.packages('hadronexampledata')
```

after loading the hadron library. This seems to work just fine after installing and loading the version with the pull request.

The data that was already present in the example data package was not in R format (Rdata/rds), so I have moved it into the `inst` directory. One can now add more Rdata/rds files to that package and have them exported.

The size limit on GitHub is 1 GB, so we should be able to use it even for larger files this way.
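For users who have not loaded hadron first, the data package can presumably also be installed by pointing `repos` at the GitHub Pages repository directly, keeping the default repositories as fallback:

```r
## Install directly from the drat repository on GitHub Pages.
install.packages(
  "hadronexampledata",
  repos = c("https://hiskp-lqcd.github.io/r-repo/", getOption("repos"))
)
```

This is the manual equivalent of what the `Additional_repositories` metadata advertises during *check*.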