
Add option for ignore file content #160

Open
wants to merge 1 commit into master

Conversation

Arseney300

Hello.
When the files in a directory are very large, fdupes can take a very long time to run.
I think it would be great if fdupes had an option to ignore file contents and compare files only by size.
I've added a new "-c --ignore-content" option, a new compare function, and a small workaround in checkmatch to avoid reading the whole file.
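For illustration only (this is not the code from the patch, which does the equivalent inside checkmatch in C), a size-only comparison boils down to the following shell check, assuming GNU coreutils stat and placeholder file names:

# Hypothetical sketch of the size-only idea, not code from this pull request.
same_size() {
    [ "$(stat -c %s "$1")" = "$(stat -c %s "$2")" ]
}
same_size film1.mkv film2.mkv && echo "same size: candidate duplicates"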

@adrianlopezroche
Owner

adrianlopezroche commented Sep 4, 2021 via email

@moonshiner

I totally agree that checksums are better.

But I've written something for myself that does this and keeps the hashes saved to make it easier to compare with other folders on other file systems.

Perhaps an option to save the checksum metadata between runs?
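Not part of this patch, but as a rough sketch of the idea (md5sum, sort, and join assumed available; the paths are placeholders), hashes from one tree can be saved and compared against another tree on a later run:

( cd /mnt/disk1 && find . -type f -exec md5sum {} + ) | sort > disk1.md5
( cd /mnt/disk2 && find . -type f -exec md5sum {} + ) | sort > disk2.md5
# Lines sharing a hash in both listings are candidate duplicates across filesystems.
join -j 1 disk1.md5 disk2.md5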

@mydogteddy

I tried to use fdupes to find duplicate films on a 5 TB drive full of films with many duplicates, and it was useless for that purpose (unless you were somewhere very cold and wanted some warmth from your CPU), because fdupes just ran on and on forever at 100% CPU.

Now, with -c, fdupes worked superbly well and took less than a minute to find all the many duplicates, and as far as I could determine it made no mistakes, so for this use case fdupes -c is excellent.

Prior to this I almost paid good money for some proprietary software; God forbid, now I can spend it on beer instead.

update --avoid-content

change naming

update manual page
@moonshiner

Would it be possible to add an option to log all the MD5 sums generated during a run?

@mydogteddy

I have been using fdupes with the -c option for some time now. I find it particularly useful for quickly finding duplicate names in my very large film and music collections. To the best of my knowledge, there is no other free option available which does the same thing as quickly as fdupes -c can do it.

With the -c option I can search 5 TB of data in just a few seconds to find duplicate names, which is perfect for finding duplicate film names and the like, where the exact content is not so important.

There is a paid-for version that can do the same however it is quite expensive.

I really do think the -c option ought to be included; otherwise, many other people who just wish to search their film/music collections for duplicate names will have to pay for a proprietary alternative.

Including the -c option will not detract from any of the other options available in fdupes, so there is everything to gain and nothing to lose. Unless we are just being purists for no good reason, I see no reason not to have the -c option.

@lpcvoid

lpcvoid commented Mar 11, 2022

I agree; I would love to see this added, as my use case is exactly what was described. I don't care much about the actual content; I care about fast comparisons, and I also don't see the harm in giving users this option if they want it.

@philipphutterer

philipphutterer commented Feb 20, 2023

You could basically use find, sort, and awk for that:
find . -type f -printf '%s %p\n' | sort -V | awk '{if ($1 == s) {print l; c=1} else {if (c) {print l "\n"};c=0} s=$1; l=$0}'

This will print a list of files with equal sizes.
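One caveat with the awk above: if the input ends on a run of equal-size files, the last file of that final group is never printed, because there is no END block. A variant that also flushes the final group might look like this (same pipeline, with an END block added):

find . -type f -printf '%s %p\n' | sort -V |
  awk '{if ($1 == s) {print l; c=1} else {if (c) {print l "\n"};c=0} s=$1; l=$0}
       END {if (c) print l}'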

@jbruchon

This is a fantastic way to lose a lot of data quickly. Don't do this unless you know your data quite intimately.

@philipphutterer

How can you lose your data this way?

@mydogteddy

I have been using this option for many months to sort my vast film and music collection and have lost nothing. It works really fast, is easy to use, and is, as far as I am concerned, reliable for what I use it for.

If you want to be a die-hard purist, then go ahead and try:
find . -type f -printf '%s %p\n' | sort -V | awk '{if ($1 == s) {print l; c=1} else {if (c) {print l "\n"};c=0} s=$1; l=$0}'

@jbruchon

How can you lose data by assuming identical size equals identical contents and then taking potentially destructive actions based on that assumption? Are you seriously asking me this question?

@philipphutterer

Okay, are we even talking about the same thing? The command I posted just lists file names with equal sizes, no more, no less. No destructive actions; in fact, no actions at all. And as people mentioned above, there are use cases where you might want that list of files with equal file sizes. What you want to do with that information is a different story.

@jbruchon

This tool is used primarily to delete duplicate files. -dN is the most common use case. Now imagine someone sees "faster" in the help text, uses the new option, and it deletes all "duplicate" files that are the same size only. Just because it's not YOUR use case or YOU wouldn't walk into that trap doesn't mean it's not a use case or a trap for someone else less experienced or careful.

https://en.wikipedia.org/wiki/Principle_of_least_astonishment

https://www.jjinux.com/2021/05/add-another-entry-to-unix-haters.html

Also, my response was primarily against the idea in general, not your code in particular.
