
Add option for ignore file content #160

Open
wants to merge 1 commit into master

Conversation

Arseney300

Hello.
When the files in a directory are very large, fdupes can take a very long time to run.
I think it would be great if fdupes had an option to ignore file contents and compare files only by size.
I've added a new "-c --ignore-content" option, a new compare function, and a small workaround in checkmatch to avoid reading the whole file.
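For illustration only (this is not the code from the patch, which does the equivalent inside checkmatch in C), a size-only comparison boils down to the following shell check, assuming GNU coreutils stat and placeholder file names:

# Hypothetical sketch of the size-only idea, not code from this pull request.
same_size() {
    [ "$(stat -c %s "$1")" = "$(stat -c %s "$2")" ]
}
same_size film1.mkv film2.mkv && echo "same size: candidate duplicates"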

@adrianlopezroche
Owner

adrianlopezroche commented Sep 4, 2021 via email

@moonshiner

I totally agree that checksums are better.

But I've written something for myself that does this and keeps the hashes saved to make it easier to compare with other folders on other file systems.

Perhaps an option to save the checksum metadata between runs?
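Not part of this patch, but as a rough sketch of the idea (md5sum, sort, and join assumed available; the paths are placeholders), hashes from one tree can be saved and compared against another tree on a later run:

( cd /mnt/disk1 && find . -type f -exec md5sum {} + ) | sort > disk1.md5
( cd /mnt/disk2 && find . -type f -exec md5sum {} + ) | sort > disk2.md5
# Lines sharing a hash in both listings are candidate duplicates across filesystems.
join -j 1 disk1.md5 disk2.md5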

@mydogteddy

I tried to use fdupes to find duplicate films on a 5 TB drive full of films with many duplicates, and it was useless for that purpose (unless you were somewhere very cold and wanted some warmth from your CPU), because fdupes just ran on and on forever at 100% CPU.

Now, with -c, fdupes worked superbly well and took less than a minute to find all the many duplicates, and as far as I could determine it made no mistakes, so for this use case fdupes -c is excellent.

Prior to this I almost paid good money for some proprietary software; God forbid, now I can spend it on beer instead.

update --avoid-content

change naming

update manual page
@moonshiner

Would it be possible to add an option to log all the MD5 sums generated during a run?

@mydogteddy

I have been using fdupes with the -c option for some time now. I find it particularly useful for quickly finding duplicate names in my very large film and music collections. To the best of my knowledge, there is no other free option available which does the same thing as quickly as fdupes -c can do it.

With the -c option I can search 5 TB of data in just a few seconds to find duplicate names, which is perfect for finding duplicate film names and the like, where the exact content is not so important.

There is a paid-for version that can do the same however it is quite expensive.

I really do think the -c option ought to be included; otherwise, many other people who just wish to search their film/music collections for duplicate names will have to pay for a proprietary alternative.

Including the -c option will not detract from any of the other options available in fdupes, so there is everything to gain and nothing to lose. Unless we are just being purists for no good reason, I see no reason not to have the -c option.

@lpcvoid

lpcvoid commented Mar 11, 2022

I agree; I would love to see this added, as my use case is exactly what was described. I don't care much about the actual content; I care about fast comparisons, and I also don't see the harm in giving users this option if they want it.

@philipphutterer

philipphutterer commented Feb 20, 2023

You could basically use find, sort, and awk for that:
find . -type f -printf '%s %p\n' | sort -V | awk '{if ($1 == s) {print l; c=1} else {if (c) {print l "\n"};c=0} s=$1; l=$0}'

This will print a list of files with equal sizes.
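One caveat with the awk above: if the input ends on a run of equal-size files, the last file of that final group is never printed, because there is no END block. A variant that also flushes the final group might look like this (same pipeline, with an END block added):

find . -type f -printf '%s %p\n' | sort -V |
  awk '{if ($1 == s) {print l; c=1} else {if (c) {print l "\n"};c=0} s=$1; l=$0}
       END {if (c) print l}'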

@jbruchon

This is a fantastic way to lose a lot of data quickly. Don't do this unless you know your data quite intimately.

@philipphutterer

How can you lose your data this way?

@mydogteddy

I have been using this option for many months to sort my vast film and music collection and have lost nothing. It works really fast, is easy to use, and is, as far as I am concerned, reliable for what I use it for.

If you want to be a die-hard purist, then go ahead and try:
find . -type f -printf '%s %p\n' | sort -V | awk '{if ($1 == s) {print l; c=1} else {if (c) {print l "\n"};c=0} s=$1; l=$0}'

@jbruchon

How can you lose data by assuming identical size equals identical contents and then taking potentially destructive actions based on that assumption? Are you seriously asking me this question?

@philipphutterer

Okay, are we even talking about the same thing? The command I posted just lists file names with equal sizes, no more, no less. No destructive actions; in fact, no actions at all. And as people mentioned above, there are use cases where you might want that list of files with equal file sizes. What you want to do with that information is a different story.

@jbruchon

This tool is used primarily to delete duplicate files. -dN is the most common use case. Now imagine someone sees "faster" in the help text, uses the new option, and it deletes all "duplicate" files that are the same size only. Just because it's not YOUR use case or YOU wouldn't walk into that trap doesn't mean it's not a use case or a trap for someone else less experienced or careful.

https://en.wikipedia.org/wiki/Principle_of_least_astonishment

https://www.jjinux.com/2021/05/add-another-entry-to-unix-haters.html

Also, my response was primarily against the idea in general, not your code in particular.
