Article : checksum_on_directorys

The checksum needs to be of a deterministic and unambiguous representation of the files as a string. Deterministic means that if you put the same files at the same locations, you'll get the same result. Unambiguous means that two different sets of files have different representations.
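To make the two requirements concrete, here is a small sketch (not part of the original answer) in which file trees are modelled as dictionaries mapping names to contents. The helper names `naive_repr` and `framed_repr` are made up for illustration. Plain concatenation is deterministic but ambiguous; prefixing each field with its length is one possible way to make the boundaries recoverable.

```python
def naive_repr(files):
    """Concatenate names and contents in sorted order (ambiguous)."""
    return b"".join(name + data for name, data in sorted(files.items()))

def framed_repr(files):
    """Prefix every name and content with its length (hypothetical framing)."""
    parts = []
    for name, data in sorted(files.items()):
        parts.append(b"%d:%s%d:%s" % (len(name), name, len(data), data))
    return b"".join(parts)

# Two different trees: a file "a" containing "bc", and a file "ab" containing "c".
tree_a = {b"a": b"bc"}
tree_b = {b"ab": b"c"}

# The naive form cannot tell them apart; the framed form can.
print(naive_repr(tree_a) == naive_repr(tree_b))
print(framed_repr(tree_a) == framed_repr(tree_b))
```

Sorting the entries is what gives determinism here; the length prefixes are what give unambiguity.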

Data and metadata

Making an archive containing the files is a good start. This is an unambiguous representation (obviously, since you can recover the files by extracting the archive). It may include file metadata such as dates and ownership. However, this isn't quite right yet: an archive is not deterministic, because its representation depends on the order in which the files are stored and, if applicable, on the compression.

A solution is to sort the file names before archiving them. If your file names don't contain newlines, you can run find | sort to list them, and add them to the archive in this order. Take care to tell the archiver not to recurse into directories. Here are examples with POSIX pax, GNU tar and cpio:

find | LC_ALL=C sort | pax -w -d | md5sum

find | LC_ALL=C sort | tar -cf - -T - --no-recursion | md5sum

find | LC_ALL=C sort | cpio -o | md5sum
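The same idea can be sketched in Python with the standard `tarfile` module. This is only an illustration under a few assumptions: the function name `archive_digest` is made up, the whole archive is buffered in memory (fine for small trees only), and the resulting digest will not match the output of the shell pipelines above, since the archive bytes differ.

```python
import hashlib
import io
import os
import tarfile

def archive_digest(root):
    """Hash an uncompressed tar archive of `root` with members in sorted order."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            # Store paths relative to root so the location of root does not matter
            paths.append(os.path.relpath(os.path.join(dirpath, name), root))
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for rel in sorted(paths):
            # recursive=False: directories are added as entries only,
            # their contents are added by later iterations of this loop
            tar.add(os.path.join(root, rel), arcname=rel, recursive=False)
    return hashlib.sha256(buf.getvalue()).hexdigest()
```

Like the archiver-based commands, this takes metadata such as modification times and ownership into account, so touching a file changes the result even if its contents stay the same.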

Names and contents only, the low-tech way

If you only want to take the file data into account and not metadata, you can make an archive that includes only the file contents, but there are no standard tools for that. Instead of including the file contents, you can include the hash of the files. If the file names contain no newlines, and there are only regular files and directories (no symbolic links or special files), this is fairly easy, but you do need to take care of a few things:

{ export LC_ALL=C;
  find -type f -exec wc -c {} \; | sort; echo;              # file sizes
  find -type f -exec md5sum {} + | sort; echo;              # file hashes
  find . -type d | sort; find . -type d | sort | md5sum;    # directory listing
} | md5sum

We include a directory listing in addition to the list of checksums, as otherwise empty directories would be invisible. The file list is sorted (in a specific, reproducible locale; thanks to Peter.O for reminding me of that). echo separates the parts (without this, you could make some empty directories whose names look like md5sum output that could also pass for ordinary files). We also include a listing of file sizes, to avoid length-extension attacks.
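The point about empty directories can be checked with a small experiment. The following is a sketch, not the method above: `manifest_digest` is a hypothetical helper, and it assumes the tree contains only regular files and directories. It builds a sorted manifest and optionally includes directory entries.

```python
import hashlib
import os

def manifest_digest(root, include_dirs):
    """Hash a sorted manifest of file hashes, optionally with directory entries."""
    lines = []
    for dirpath, dirnames, filenames in os.walk(root):
        rel_dir = os.path.relpath(dirpath, root)
        if include_dirs:
            lines.append("dir %r" % rel_dir)
        for name in filenames:
            with open(os.path.join(dirpath, name), "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            lines.append("reg %r %s" % (os.path.join(rel_dir, name), digest))
    manifest = "\n".join(sorted(lines))
    return hashlib.sha256(manifest.encode("utf-8")).hexdigest()
```

With `include_dirs=False`, creating an empty directory inside the tree leaves the digest unchanged; with `include_dirs=True`, the digest changes.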

By the way, MD5 is deprecated. If it's available, consider using SHA-2, or at least SHA-1.

Names and data, supporting newlines in names

Here is a variant of the code above that relies on GNU tools to separate the file names with null bytes. This allows file names to contain newlines. The GNU digest utilities quote special characters in their output, so there won't be ambiguous newlines.

{ export LC_ALL=C;
  du -0ab | sort -z; # sizes of files and directories (directories show cumulative totals)
  echo | tr '\n' '\000'; # separator
  find -type f -exec sha256sum {} + | sort; # file hashes (newline-terminated lines, special characters escaped)
  echo | tr '\n' '\000'; # separator
  echo "End of hashed data."; # End of input marker
} | sha256sum
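To see why the separator matters for a listing of names, consider this minimal sketch (the helper `listing` is made up for illustration). A file name can contain a newline but never a null byte, so a newline-separated listing can describe two different sets of files with the same bytes, while a null-separated one cannot.

```python
def listing(names, sep):
    """Join sorted file names, terminating each with the given separator."""
    return b"".join(name + sep for name in sorted(names))

one_file = [b"a\nb"]      # a single file whose name contains a newline
two_files = [b"a", b"b"]  # two files with ordinary names

# Newline-separated listings collide; null-separated listings do not.
print(listing(one_file, b"\n") == listing(two_files, b"\n"))
print(listing(one_file, b"\0") == listing(two_files, b"\0"))
```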

A more robust approach

Here's a minimally tested Python script that builds a hash describing a hierarchy of files. It takes directories and file contents into account, ignores symbolic links and other special files, and aborts with a fatal error if any file can't be read.

#!/usr/bin/env python3
import hashlib, os, stat, sys

## Return the hash of the contents of the specified file, as a hex string
def file_hash(name):
    h = hashlib.sha256()
    # Read in binary mode so the hash covers the exact bytes on disk
    with open(name, 'rb') as f:
        while True:
            buf = f.read(16384)
            if len(buf) == 0: break
            h.update(buf)
    return h.hexdigest()

## Traverse the specified path and update the hash with a description of its
## name and contents
def traverse(h, path):
    rs = os.lstat(path)
    # repr() quotes the name, so special characters cannot be ambiguous
    quoted_name = repr(path)
    if stat.S_ISDIR(rs.st_mode):
        h.update(('dir ' + quoted_name + '\n').encode('utf-8'))
        for entry in sorted(os.listdir(path)):
            traverse(h, os.path.join(path, entry))
    elif stat.S_ISREG(rs.st_mode):
        h.update(('reg ' + quoted_name + ' ').encode('utf-8'))
        h.update((str(rs.st_size) + ' ').encode('utf-8'))
        h.update((file_hash(path) + '\n').encode('utf-8'))
    else: pass # silently skip symlinks and other special files

h = hashlib.sha256()
for root in sys.argv[1:]: traverse(h, root)
h.update(b'end\n')
print(h.hexdigest())

 

original post: https://unix.stackexchange.com/questions/35832/how-do-i-get-the-md5-sum-of-a-directorys-contents-as-one-sum