Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg: compute checksums for packages without checksums #9384

Merged
merged 1 commit into from
Dec 11, 2023

Conversation

gridbugs
Copy link
Collaborator

@gridbugs gridbugs commented Dec 5, 2023

fixes #9336

@gridbugs
Copy link
Collaborator Author

gridbugs commented Dec 5, 2023

It occurs to me that maybe we should add some indication that packages have had their hashes added by dune in this way. Hashes in opam have some air of legitimacy since they've at least gone through a manual code review process when the package was added to the repo, but if dune is just going to add its own hashes to packages that otherwise lack them people might start to trust our hashes to the same degree that they trust opam's.

@gridbugs gridbugs force-pushed the compute-checksums-when-missing branch 2 times, most recently from 8b65abe to ceb21d8 Compare December 5, 2023 06:53
@gridbugs gridbugs changed the title pkg: compute checksums for packages with checksums pkg: compute checksums for packages without checksums Dec 5, 2023
@rgrinberg
Copy link
Member

I suppose we could print the hashes that we've had to compute to let the user know we're doing this. FYI, opam installs such packages without checksums without any ceremony either. I'm not even sure if it saves the hash it ends up downloading.

@gridbugs gridbugs force-pushed the compute-checksums-when-missing branch 3 times, most recently from b1378f4 to f7b4e96 Compare December 6, 2023 02:46
@gridbugs
Copy link
Collaborator Author

gridbugs commented Dec 6, 2023

I added a message about how dune is generating its own checksum. It also prints the source archive url so users can vet suspicious-looking urls. Also I changed the test to use a text file rather than an archive since creating the archive (tar) was producing different results on different platforms.

let target = Path.relative temp_dir "archive" in
Fetch.fetch ~unpack:false ~checksum:None ~target (OpamUrl.of_string url)
>>| function
| Ok () -> Some (Dune_digest.file target |> Checksum.of_dune_digest)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, the internals of Fetch.fetch can already give you the newly calculated checksum. It should be enough to just tweak the API to get it somehow.

Copy link
Member

@rgrinberg rgrinberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for a start, but I see a couple things that are fishy:

  1. Why do we include checksums for sources computed from the local file system? Seems a bit pointless since they can't be shared anyway.

  2. The build step is going to refetch the sources again. That seems wasteful. We need to reuse the build rules somehow when fetching here to avoid this.

User_message.print
(User_message.make
[ Pp.textf
"Package %S has source archive which lacks a checksum."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Package %S has source archive which lacks a checksum."
"Package %S has a source archive which lacks a checksum."

"Package %S has source archive which lacks a checksum."
(Package_name.to_string package_name)
; Pp.textf "The source archive will be downloaded from: %s" url
; Pp.text "Dune will compute its own checksum for this source archive."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel that it would be good to print the computed checksum as well, just for reference for the user.

> sed -e "s#$PWD#<pwd>#"
> }

$ echo "Hello, World!" > foo.txt
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alternative would be to create a tarball and run md5sum and then compare it to the one in the lock file, without ever printing it.

file://<pwd>/foo.txt
Dune will compute its own checksum for this source archive.
Warning: Failed to retrieve source archive from:
file://<pwd>/foo.txt
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also print what dune will do in that case (that is, not include a checksum).

@gridbugs
Copy link
Collaborator Author

gridbugs commented Dec 7, 2023

The build step is going to refetch the sources again. That seems wasteful. We need to reuse the build rules somehow when fetching here to avoid this.

I agree that we should avoid fetching the archive a second time. Conceptually it makes sense to have a rule which downloads the archive and a second rule which extracts it. How hard would this be to do in practice though? Currently we don't use rules when generating the lockdir or in Fetch.fetch. I don't have a clear sense of how much refactoring would be required to express the dependency between a source archive and the extracted contents with rules.

The alternative would be to download the archive to a well-known place in _build and have Fetch.fetch check for the archive before downloading it.

@rgrinberg
Copy link
Member

rgrinberg commented Dec 7, 2023

How hard would this be to do in practice though? Currently we don't use rules when generating the lockdir or in Fetch.fetch. I don't have a clear sense of how much refactoring would be required to express the dependency between a source archive and the extracted contents with rules.

It would be rather tricky. I'll try and think about it a little more.

The alternative would be to download the archive to a well-known place in _build and have Fetch.fetch check for the archive before downloading it.

If dune doesn't know something is a build target in _build it will consider it stale and just delete it. So it would have to live elsewhere. We could introduce this cache some other directory, but it seems a bit premature to do this.

For now what you have is fine. Let's just document the issues and solve them later.

@Leonidas-from-XIV
Copy link
Collaborator

We could introduce this cache some other directory, but it seems a bit premature to do this.

Wouldn't $XDG_CACHE_LOCATION/dune be the right place for it, potentially even reusing the same convention as opam? Another option would be to share OPAMs cache, but I am not sure whether that's a good move.

@rgrinberg
Copy link
Member

It's an option, but not particularly good. The dune cache deduplicates all files so that makes it more efficient. Also, we'd have to write yet another cache cleanup mechanism.

@rgrinberg
Copy link
Member

Btw, you can set dev = true for packages that are fetched this way

@Leonidas-from-XIV
Copy link
Collaborator

Deduplication would happen automatically if you use <hash-algo>/<first two chars of hash>/<rest-of-hash> for the filename as the OPAM cache does. Which works rather well with packages sharing a single-tarball (e.g. all multi-package projects using dune-release).

@rgrinberg
Copy link
Member

Deduplication would happen automatically if you use // for the filename as the OPAM cache does. Which works rather well with packages sharing a single-tarball (e.g. all multi-package projects using dune-release).

The dune cache is much better than that. It deduplicates on a per file basis, so tarballs of different versions of the same package will end up sharing the majority of the contents.

But even without that, creating a secondary cache system parallel to the one we already have is somewhat of a last resort.

@gridbugs gridbugs force-pushed the compute-checksums-when-missing branch from f7b4e96 to 53dafd7 Compare December 11, 2023 07:40
@gridbugs
Copy link
Collaborator Author

gridbugs commented Dec 11, 2023

It sounds like there's no obvious idiomatic place to put the file so it can be reused. Since re-using the archive is an optimization and doesn't affect functionality, in the interest of avoiding dealing with ongoing merge conflicts would it make sense to merge this now and work out where to put it in a new issue?

@Leonidas-from-XIV
Copy link
Collaborator

That sounds reasonable to me.

Out of curiosity, how many packages do not have checksums in opam-repository?

@rgrinberg
Copy link
Member

Out of curiosity, how many packages do not have checksums in opam-repository?

Not sure how many, but the +trunk OCaml packages use it so it's important to support.

@rgrinberg rgrinberg merged commit 611e5fc into ocaml:main Dec 11, 2023
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support packages without checksums
3 participants