Does git-sizer count objects managed by Git LFS? #50

Open
mloskot opened this issue Nov 13, 2018 · 8 comments

Comments

@mloskot

mloskot commented Nov 13, 2018

I have a largish bare repo with Git LFS installed (SVN to Git migration):

proj.git (BARE:master) $ git-sizer
Processing blobs: 1107392
Processing trees: 178226
Processing commits: 29412
Matching commits to trees: 29412
Processing annotated tags: 0
Processing references: 24
| Name                         | Value     | Level of concern               |
| ---------------------------- | --------- | ------------------------------ |
| Overall repository size      |           |                                |
| * Blobs                      |           |                                |
|   * Total size               |  12.8 GiB | *                              |
|                              |           |                                |
| Biggest objects              |           |                                |
| * Trees                      |           |                                |
|   * Maximum entries      [1] |  1.96 k   | *                              |
| * Blobs                      |           |                                |
|   * Maximum size         [2] |   113 MiB | ***********                    |
|                              |           |                                |
| Biggest checkouts            |           |                                |
| * Number of directories  [3] |  13.3 k   | ******                         |
| * Maximum path depth     [4] |    18     | *                              |
| * Maximum path length    [5] |   232 B   | **                             |
| * Number of files        [6] |   910 k   | ******************             |
| * Total size of files    [7] |  3.37 GiB | ***                            |

I've written a little git lfs ls-files helper, git_lfs_calculate_size_by_type.py, which reports this for the proj.git repo:

Git LFS objects summary:
.lib:   count: 1111     size: 8764.66 MB
.dll:   count: 749      size: 1427.98 MB
.pdb:   count: 612      size: 2814.09 MB
.exe:   count: 786      size: 2005.72 MB
.zip:   count: 24       size: 1153.65 MB
Total:  count: 3282     size: 16166.11 MB

Does the latter 16166.11 MB relate to the former 12.8 GiB in any way?
Or, is the grand total of the repo, Git and Git LFS objects, a sum of the two figures?
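
The helper script itself is not shown in the thread. A minimal sketch of that kind of per-extension summary, assuming input lines shaped like the output of git lfs ls-files --size (object ID, a * or - marker, the path, then a human-readable size in parentheses), might look like this. The function name, the regular expression, and the decimal unit table are assumptions made for illustration, not the author's script.

```python
import os
import re
from collections import defaultdict

# Assumed line shape: "<oid> <*|-> <path> (<number> <unit>)"
LINE_RE = re.compile(r"^(\S+) [*-] (.+) \((\d+(?:\.\d+)?) ([KMGT]?B)\)$")

# Assumption: the units are decimal (1 MB = 10**6 bytes).
UNIT_BYTES = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}


def summarize_by_extension(lines):
    """Return {extension: (count, total_bytes)} for ls-files-style lines."""
    counts = defaultdict(int)
    sizes = defaultdict(float)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed shape
        _oid, path, number, unit = match.groups()
        ext = os.path.splitext(path)[1].lower()
        counts[ext] += 1
        sizes[ext] += float(number) * UNIT_BYTES[unit]
    return {ext: (counts[ext], sizes[ext]) for ext in counts}
```

Because the sizes are already rounded for display, totals built this way are approximate. When comparing units: 16166.11 MB is about 15.06 GiB if MB means 10^6 bytes, or about 15.79 GiB if it means 2^20 bytes.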

@ttaylorr
Member

Git Sizer does not do this, but I think that it would be neat if it did. @mhagger: do you agree?

@mloskot
Author

mloskot commented Nov 14, 2018

@ttaylorr Thanks for answering my question. It would be neat if it did, indeed.

@mhagger
Member

mhagger commented Nov 20, 2018

I agree that this would be neat, with one proviso: either we should prove using benchmarks that this feature is not too costly, or we should make it possible to turn it on/off via command-line options. (Currently, git-sizer never has to open up any blob files, but if this feature were implemented, as I understand it, it would have to open and parse any blob files smaller than some limit, correct?)

@mloskot
Author

mloskot commented Nov 20, 2018

I'm new to Git LFS, but AFAIU, it would have to open each pointer file and parse the size key. An option sounds perfect.
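
For reference, a Git LFS pointer file is a short text file of "key value" lines: a version line, an oid line, and a size line giving the real object's size in bytes. A rough sketch of reading the size key follows; read_pointer_size is a hypothetical name, and the version check is a loose prefix match chosen for illustration.

```python
import re


def read_pointer_size(text):
    """Return the 'size' value of an LFS pointer file, or None if absent."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" ")
        if not sep:
            return None  # every pointer line is expected to be "key value"
        fields[key] = value
    # Loose check (assumption): any spec version under this URL prefix is accepted.
    if not fields.get("version", "").startswith("https://git-lfs.github.com/spec/"):
        return None
    size = fields.get("size", "")
    if re.fullmatch(r"[0-9]+", size) is None:
        return None
    return int(size)
```

The text passed in would be the blob's content, for example as printed by git cat-file blob for that object.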

@ttaylorr
Member

(Currently, git-sizer never has to open up any blob files, but if this feature were implemented, as I understand it, it would have to open and parse any blob files smaller than some limit, correct?)

Right, we'll have to inflate blobs, but I don't think that we have to do so based on size, if I'm understanding correctly. Git LFS only watches files which match patterns given in any .gitattributes in a parent directory, so at worst we'll have to open up a blob, but at best we'll only match its path in the tree.

Some code that already exists to that end:

@adam-azarchs

The entire point of git lfs is to be able to not care about the size of LFS files except at HEAD.

  • Given that you've already got the list of blobs and their sizes from previous steps, you can make a guess at which ones might be lfs without touching the filesystem any further, because their blob sizes will all be roughly in the range of 100-200 bytes.
  • You can pretty quickly determine which of those are actually stored in lfs using git check-attr filter on them (or as mentioned above just parse the .gitattributes files yourself in go, without the subprocess, though you probably want git check-attr --cached, which is likely faster in many cases since it uses the index).
  • git lfs ls-files -s -I is probably the most "correct" way to get the size for those objects, though if you're going by path then you have to go one invocation per path. Except, git lfs was written in go, so you can just use github.com/git-lfs/git-lfs/lfs.GitScanner with appropriate filters so it doesn't waste time searching paths you already know don't have anything interesting.
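
The first step in the list above (guessing LFS candidates purely from blob sizes) can be sketched as follows. This assumes lines in the default "<object id> <type> <size>" format printed by git cat-file --batch-check; the function name and the default 100 to 200 byte window are illustrative, taken from the rough range mentioned above.

```python
def lfs_pointer_candidates(batch_check_lines, min_size=100, max_size=200):
    """Return IDs of blobs whose size falls in the pointer-like range."""
    candidates = []
    for line in batch_check_lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # not "<oid> <type> <size>"
        oid, obj_type, size = parts
        if obj_type != "blob":
            continue
        try:
            size_bytes = int(size)
        except ValueError:
            continue
        if min_size <= size_bytes <= max_size:
            candidates.append(oid)
    return candidates
```

The input could come from git cat-file --batch-check --batch-all-objects. The result is only a list of suspects; each one still has to be confirmed by an attribute check or by looking at its content.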

@mhagger
Member

mhagger commented Dec 3, 2022

git-sizer usually doesn't know the path associated with a particular blob, and indeed that's a feature, not a bug. Why? In certain types of pathological repositories like git bombs, the same blob is repeated over and over at an astronomical number of different paths. git-sizer goes to great lengths to be immune to pathological Git repositories, always scaling like the number of objects in the Git object database rather than like the actual size of checked-out trees or whatever. So I'd be reluctant to add any features that require paths for blobs. If we start doing gitattribute checks, then I think we'd need those paths.

One could imagine skipping those gitattribute checks and instead deciding which items are LFS pointer files based only on their lengths and contents. This would probably be a nearly perfect approximation, and I think that it could be implemented in a way that isn't extravagantly expensive. (We'd still want it to be optional, though.)
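
A sketch of that path-free approach, in Python rather than git-sizer's Go: classify each blob by its length and content alone, and count each object ID once so the cost scales with the number of objects rather than the number of paths. The 1024-byte cutoff and the strict three-line pattern are assumptions for illustration; pointers carrying extension lines would be missed by this version.

```python
import re

MAX_POINTER_BYTES = 1024  # assumed upper bound on pointer file length

# Strict pattern (assumption): exactly version, oid, and size lines.
POINTER_RE = re.compile(
    rb"\Aversion https://git-lfs\.github\.com/spec/v1\n"
    rb"oid sha256:[0-9a-f]{64}\n"
    rb"size (\d+)\n\Z"
)


def pointed_size(content):
    """Return the size recorded in a pointer-like blob, else None."""
    if len(content) >= MAX_POINTER_BYTES:
        return None  # too long to be a pointer; no need to parse it
    match = POINTER_RE.match(content)
    return int(match.group(1)) if match else None


def summarize_pointers(blobs):
    """blobs: iterable of (oid, content). Return (pointer_count, total_bytes)."""
    seen = set()
    count = 0
    total = 0
    for oid, content in blobs:
        if oid in seen:
            continue  # the same blob at many paths is counted once
        seen.add(oid)
        size = pointed_size(content)
        if size is not None:
            count += 1
            total += size
    return count, total
```

Checking the length first means large blobs are rejected without parsing, which keeps the extra cost limited to small objects.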

If you want an exact count including gitattribute checks, then you're probably better off asking the LFS project for such a tool, if they don't already have one.

@adam-azarchs

git lfs ls-files -s will tell you the sizes of all the lfs files. But I'm not even sure what git-sizer would do with it; storing large files in git is a bad idea, so git warns you about it, but storing them in LFS is precisely what LFS is for, so what would you warn about?

I feel like anything that actually inspects the content of files is going to be prohibitively slow. Probably the most useful thing it could do would actually be to just get the total number/size of objects in .git/lfs/objects. That would be relatively quick, I think, though the results would be highly dependent on which commits have ever been checked out. The second-most-useful thing I can think of would be to flag any very small objects in there, e.g. <100 bytes, at which point the stub file in the git index actually takes up more space than it would have taken to just put the file in git without the lfs, making the use of LFS counterproductive (I've definitely seen people using sloppy glob patterns to put even empty files into LFS).
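
Both ideas (totalling the local LFS store and flagging tiny objects) could be sketched with a plain directory walk. The function name and the 100-byte default are illustrative, and the sketch simply counts every regular file found under the directory it is given.

```python
import os


def summarize_lfs_store(objects_dir, small_threshold=100):
    """Return (count, total_bytes, small_paths) for files under objects_dir."""
    count = 0
    total = 0
    small = []
    for root, _dirs, files in os.walk(objects_dir):
        for name in files:
            path = os.path.join(root, name)
            size = os.path.getsize(path)
            count += 1
            total += size
            if size < small_threshold:
                small.append(path)  # smaller than a typical pointer file
    return count, total, sorted(small)
```

It would be pointed at .git/lfs/objects (or lfs/objects inside a bare repository). As noted above, the numbers only reflect objects that have been downloaded locally.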


4 participants