A simple utility for finding duplicated files within a specified path. It is intended to be used as a library but also works as a command-line tool. It doesn't delete the duplicate files it finds; instead, it returns a list of junk files so that you can choose which ones to delete.
- Import the `dupliCat` class and create an object by passing the following arguments:
  - `path`: where the search will be made; defaults to the current directory.
  - `recurse`: boolean; set to `True` if you want it to recurse down to all files in the path, including sub-directories. Defaults to `False`.
- The `generate_secure_hash` method takes a file as its first argument and generates a secure hash for it. The hashing algorithm is blake2b, and the key is the size of the file. It returns the file with its `secure_hash` attribute set. The file must be of type `dupliFile`.
- The `read_chunk` method reads a default of 400 bytes of data from a file. It takes the file as its first positional argument and a size as its second argument, which defaults to 400. The file must be of type `dupliFile`.
- `human_size` is a static method that converts bytes into a human-readable format.

  ```python
  >>> human_size(nbytes=123456)
  120.56 KB
  ```
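A conversion like the one `human_size` performs can be sketched with the standard library alone. This is a minimal illustration, not the package's actual implementation; the unit list and base-1024 scaling are assumptions consistent with the example above (123456 bytes → 120.56 KB):

```python
# Minimal sketch of a human_size-style helper: repeatedly divide by 1024
# and step to the next unit until the value fits.
def human_size(nbytes: int) -> str:
    suffixes = ["B", "KB", "MB", "GB", "TB", "PB"]
    i = 0
    value = float(nbytes)
    while value >= 1024 and i < len(suffixes) - 1:
        value /= 1024
        i += 1
    return f"{value:.2f} {suffixes[i]}"

print(human_size(123456))  # 120.56 KB
```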
- `hash_chunk` is a static method that takes two positional arguments, `text: str` and `key: int`, and hashes the text with the key using blake2b.
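The keyed hashing described above can be sketched with the standard library's `hashlib` (an illustration of the idea, not the library's exact code). Because the key is derived from the file's size, two chunks only hash equal when both the bytes and the originating file sizes match:

```python
import hashlib

# Sketch of keyed chunk hashing: blake2b accepts a key of up to 64 bytes;
# here an int key (e.g. a file's size) is encoded into that key slot.
def hash_chunk(text: str, key: int) -> str:
    return hashlib.blake2b(text.encode(), key=str(key).encode()).hexdigest()

# Equal chunks collide only under the same key (same file size).
assert hash_chunk("same bytes", 1024) == hash_chunk("same bytes", 1024)
assert hash_chunk("same bytes", 1024) != hash_chunk("same bytes", 2048)
```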
- Call the `search_duplicate` method to begin the 🔍 search; results are stored in the `duplicates` property of the class. This method is essentially the main API of the class and does everything for you; calling other methods directly instead might remove the ability to use files from `size_index` as input for generating a hash index. It takes no additional arguments. The junk files set by this method contain all duplicates, with one file left over for each.
- Use the `analyse` method to analyse search results; it returns a named tuple of type `Analysis`. It contains the total number of duplicate files, accessed through `analysis.total_file_num`; their total size on disk, accessed through `analysis.total_size`; and the most occurred file, accessed through `analysis.most_occurrence`.
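The kind of summary `analyse` returns can be sketched as follows. The `Analysis` tuple below is a hypothetical stand-in with the field names from the README, and the hash-to-files mapping it consumes is assumed for illustration:

```python
from collections import namedtuple

# Hypothetical stand-in for the library's Analysis named tuple.
Analysis = namedtuple("Analysis", ["total_file_num", "total_size", "most_occurrence"])

def analyse(hash_index: dict) -> Analysis:
    # hash_index maps a secure hash -> list of (filename, size) duplicates.
    groups = [g for g in hash_index.values() if len(g) > 1]
    total = sum(len(g) for g in groups)
    size = sum(s for g in groups for _, s in g)
    biggest = max(groups, key=len, default=[])
    return Analysis(total, size, biggest[0][0] if biggest else None)

index = {
    "h1": [("a.txt", 10), ("copy_of_a.txt", 10)],
    "h2": [("b.txt", 5)],
}
print(analyse(index))  # Analysis(total_file_num=2, total_size=20, most_occurrence='a.txt')
```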
- The `generate_size_index` method is used to generate the size index from files. It sets the generated size index to `self.size_index`. Takes the parameter `files`: files from which the size index should be generated.
- The `generate_hash_index` method is used to generate the hash index from files in the size index. It sets the generated hash index to `self.hash_index`. Takes the argument `files`: files from which the hash index should be generated.
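The two indexing steps above amount to grouping: first by file size (cheap), then hashing only files that share a size. Here is a minimal sketch under assumed plain-tuple files rather than the library's `dupliFile` objects:

```python
import hashlib
from collections import defaultdict

# Files are hypothetical (name, size, first_chunk) tuples for illustration.
def generate_size_index(files):
    index = defaultdict(list)
    for f in files:
        index[f[1]].append(f)          # group by size
    return {s: g for s, g in index.items() if len(g) > 1}

def generate_hash_index(size_index):
    index = defaultdict(list)
    for size, group in size_index.items():
        for f in group:                # hash only same-size candidates
            digest = hashlib.blake2b(f[2], key=str(size).encode()).hexdigest()
            index[digest].append(f)
    return {h: g for h, g in index.items() if len(g) > 1}

files = [("a.txt", 3, b"abc"), ("b.txt", 3, b"abc"), ("c.txt", 5, b"hello")]
hashes = generate_hash_index(generate_size_index(files))
print([f[0] for g in hashes.values() for f in g])  # ['a.txt', 'b.txt']
```

Filtering singletons after each stage is what makes the two-pass scheme fast: unique-sized files never get read or hashed at all.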
You can also access the following properties:
- `size_index`: a dictionary containing files grouped by their sizes.
- `hash_index`: a dictionary containing files grouped by their secure hashes.
- `fetched_files`: all fetched files from the search.
- `path`: where the search will be made; defaults to the current directory.
- `recurse`: boolean; set to `True` if you want it to recurse down to all files in the path, including sub-directories. Defaults to `False`.
- `junk_files`: a list containing all duplicate files, leaving an original copy of each, meaning you can go ahead and delete all files in this list.
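The junk-file selection can be sketched as keeping the first file of every duplicate group and flagging the rest. The hash-to-names mapping here is hypothetical, for illustration only:

```python
# For each group of duplicates, keep one original and mark the rest as junk.
def junk_files(hash_index: dict) -> list:
    junk = []
    for group in hash_index.values():
        junk.extend(group[1:])         # everything after the first is junk
    return junk

index = {"h1": ["a.txt", "a_copy.txt", "a_copy2.txt"], "h2": ["b.txt"]}
print(junk_files(index))  # ['a_copy.txt', 'a_copy2.txt']
```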
- Fixed the total size value.
- Added the `junk_files` property.
- New method `set_secure_hash` for setting the secure hash of a file if one is provided, else it generates one for the file.
- Updated `generate_secure_hash` to only generate and return a secure hash for the file.
- `fetch_files` now implements a recursive use of `os.scandir` instead of `os.walk` for faster file fetching.
- Increased overall speed.
You can now use dupliCat from the command line:

```shell
$ dupliCat --help
```

The above command will show you how to use it.
- twitter: teddbug
- facebook: Tedd Bug
- mail: [email protected]
- mail: [email protected]
- facebook: Kwieeciol
😄 Happy Coding!