Get the md5sum of every 64MB block in a large file using bash.

sill...@yahoo.com

May 7, 2008, 3:18:27 PM

Hello, can someone please help.

I have an 8GB file and need the md5sum of every 64MB block in the
file.

I'm looking for some ideas on how to write a script to do this using
bash - not interested in perl or other language solutions.

The size of my disk is 10GB with a smallish linux system and 256MB
free disk space.

Thanks for all constructive posts.

Hal

mo

May 7, 2008, 5:18:02 PM

If time is not a problem you can try:

MD5=;x=0;while [ "$MD5" != d41d8cd98f00b204e9800998ecf8427e ];do
# count=1 limits each dd run to a single 64MB block
MD5=`dd if=file bs=64M skip=$x count=1 2>/dev/null|md5sum|cut -d ' ' -f 1`
x=$((x+1))
echo $x $MD5 # for block 1, skipped 0
done >blocks

Not tested!
See also the split command.

The problem with the idea above is that the file read is restarted
for each block:
$ echo $[8000/64]
125
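
As a cross-check on the block count: assuming the file is exactly 8 GiB and the blocks are 64 MiB (binary units, which is my assumption, not something the thread states), ceiling division gives 128 blocks rather than 125. A minimal sketch; the function name `block_count` is made up for illustration:

```shell
#!/bin/bash
# block_count SIZE BLOCKSIZE: number of blocks needed to cover SIZE bytes,
# counting a final partial block as one block (ceiling division).
block_count() {
    local size=$1 bs=$2
    echo $(( (size + bs - 1) / bs ))
}

block_count $((8 * 1024 * 1024 * 1024)) $((64 * 1024 * 1024))   # prints 128
```

Rounding up matters only when the file size is not a multiple of the block size; the last block is then shorter than the rest.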

Luuk

May 7, 2008, 5:35:55 PM

sill...@yahoo.com wrote:

something like:

for ((x=0; x<128; x++))
do
  # count=1 copies exactly one 64MB block per iteration
  dd if=largefile bs=64M skip=$x count=1 of=tmp
  echo $x
  md5sum tmp
done

This reads through your file in 128 steps, each time writing one block
to a (temp) file 'tmp' and computing its md5sum.

--
Luuk

Dave B

May 8, 2008, 4:18:48 AM

On Wednesday 7 May 2008 23:35, Luuk wrote:

> something like:
>
> for ((x=0; x<128; x++))
> do
>   dd if=largefile bs=64M skip=$x count=1 of=tmp
>   echo $x
>   md5sum tmp
> done
>
>
> reading in 128 steps, through your file, creating a (temp)file 'tmp' of
> which the md5sum is created.

You don't need the temp file, since you can pipe the output of dd directly
to md5sum. Just omit the "of=" part in the dd command, and dd will write to
stdout.
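
A minimal sketch of that suggestion, assuming a regular file and the usual dd, md5sum and cut tools. The function name `sum_blocks` and its arguments are illustrative, not from the thread:

```shell
#!/bin/bash
# sum_blocks FILE BLOCKSIZE: print "<block index> <md5>" for each block,
# piping dd straight into md5sum so no temporary file is written.
sum_blocks() {
    local file=$1 bs=$2 size blocks x sum
    size=$(wc -c < "$file")
    blocks=$(( (size + bs - 1) / bs ))   # round up to include a short last block
    x=0
    while [ "$x" -lt "$blocks" ]; do
        # Without of=, dd writes the block to stdout; skip= counts in units of bs.
        sum=$(dd if="$file" bs="$bs" skip="$x" count=1 2>/dev/null | md5sum | cut -d ' ' -f 1)
        echo "$x $sum"
        x=$((x + 1))
    done
}
```

For the file in question this would be called as `sum_blocks largefile $((64*1024*1024))`.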

--
D.

Stephane CHAZELAS

May 8, 2008, 6:55:01 AM

2008-05-7, 12:18(-07), sill...@yahoo.com:

> Hello, can someone please help.
>
> I have an 8GB file and need the md5sum of every 64MB block in the
> file.
>
> I'm looking for some ideas on how to write a script to do this using
> bash - not interested in perl or other language solutions.
>
> The size of my disk is 10GB with a smallish linux system and 256MB
> free disk space.

[...]

You shouldn't need to use any disk space:
while
{
details=$(
{
LC_ALL=C dd bs="$((64*1024*1024))" count=1 2>&3 | md5sum >&4
} 3>&1
)
} 4>&1 &&
case $details in (*"1+0 records in"*) ;; (*) false;; esac
do :
done < your-big-file
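
The redirections in that loop are dense, so here is the same logic restated as a function with comments on what each file descriptor does. This is a sketch under the assumption that stdin is a regular file (so each dd read returns a full block until the end); the name `md5_stream` is made up for illustration:

```shell
#!/bin/bash
# md5_stream BLOCKSIZE: hash stdin block by block without temporary files.
md5_stream() {
    local bs=$1 details
    while
        {
            # Inside $( ), fd 1 is the capture pipe. "3>&1" points fd 3 at it,
            # so dd's status report (sent to fd 3) ends up in $details.
            details=$(
                {
                    LC_ALL=C dd bs="$bs" count=1 2>&3 | md5sum >&4
                } 3>&1
            )
        # "4>&1" is set up outside the substitution, so fd 4 is the real
        # stdout and md5sum's output is printed rather than captured.
        } 4>&1 &&
        # A full block was read only if dd reported "1+0 records in".
        case $details in (*"1+0 records in"*) ;; (*) false ;; esac
    do :
    done
}
```

Called as `md5_stream $((64*1024*1024)) < your-big-file`, it prints one md5sum line per block. The last line covers the short final block, or an empty read when the size is an exact multiple of the block size.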

--
Stéphane

mo

May 8, 2008, 10:38:45 PM

Updating my previous code with CHAZELAS' smart idea of feeding the file
through stdin, and now using head instead of dd:

md5(){ [ "$1" ]||{ echo "md5 <block_size_in_bytes>" >&2;return 1;}
x=0;pMD5=
while MD5=`head -c "$1"|md5sum|cut -d ' ' -f 1`&&[ "$pMD5" != "$MD5" ];do
pMD5=$MD5
x=$((x+1))
echo "$MD5 $x"
done
}

### Usage:
$ md5
md5 <block_size_in_bytes>
$

$ md5 512 <x.log
ae85a3ff6457e95c723c7d90232cf738 1
b09cdeb518910d3b8b8fb09bd8a57488 2
4fd51c79224201930e294d71feb47c7a 3
d41d8cd98f00b204e9800998ecf8427e 4
$

$ cat x.log|md5 1024
dd66ec03b7c59ed2828090dae4421818 1
4fd51c79224201930e294d71feb47c7a 2
d41d8cd98f00b204e9800998ecf8427e 3
$

With this new code, the last line is always the MD5 of an empty string.
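
One thing to watch: because the loop stops as soon as a block's MD5 equals the previous block's, it also stops early when two consecutive blocks of real data are identical (a long run of zeros, for example). A possible variant, sketched here, compares against the MD5 of an empty string instead, which also drops the trailing empty-string line. It assumes GNU head, which reads no more than the requested number of bytes; the name `md5_blocks` is hypothetical:

```shell
#!/bin/bash
# md5_blocks BLOCKSIZE: hash stdin in BLOCKSIZE-byte blocks, printing
# "<md5> <block number>". Stops at the first empty read instead of
# comparing against the previous block's hash.
md5_blocks() {
    [ -n "$1" ] || { echo "md5_blocks <block_size_in_bytes>" >&2; return 1; }
    local empty sum n=0
    empty=$(md5sum < /dev/null | cut -d ' ' -f 1)   # MD5 of zero bytes
    while sum=$(head -c "$1" | md5sum | cut -d ' ' -f 1) && [ "$sum" != "$empty" ]; do
        n=$((n + 1))
        echo "$sum $n"
    done
}
```

With a 12-byte file of identical characters and a block size of 4, this prints three lines with the same hash, where a previous-hash comparison would stop after the first.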

sill...@yahoo.com

May 10, 2008, 2:28:39 PM

Thanks for the responses. Good work!

I also put together this attempt which seems to do the trick as well.
It's based on previous responses and some ideas found in the article
'Mincing Your Data into Arbitrary Chunks (in bash)' from the book
Linux Server Hacks. I found the article online.

If there are some glaring errors, please let me know.

Hal
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#!/bin/sh

#IN="/dev/hda1"
IN="bigfile"

if [ "${IN:0:5}" == "/dev/" ] ; then
echo "filename starts '/dev/...' so assume its a disk or
partition."
INX=`echo $IN | sed 's:/:\\\\/:g'`
SIZE=`fdisk -l $IN | awk '/'$INX'\:/ {print $5}'`
else
echo "filename does not start '/dev/..' so assume its a normal
file."
SIZE=`ls -l $IN | awk '{print $5}'`
fi

echo "SIZE="$SIZE

OUT="out"
B="$((64*1024*1024))"

total=0
while [ $total -lt $SIZE ]; do
dd bs="$B" count=1 2> /dev/null | md5sum
total=$((total + B))
done < $IN

Stephane CHAZELAS

May 10, 2008, 3:11:31 PM

2008-05-10, 11:28(-07), sill...@yahoo.com:
[...]

> #!/bin/sh
>
> #IN="/dev/hda1"
> IN="bigfile"
>
> if [ "${IN:0:5}" == "/dev/" ] ; then

That's not standard sh syntax: ${IN:0:5} and "==" are ksh93
syntax also recognised by bash (and not by GNU "[", BTW).

case $IN in
(/dev/*) ...

> echo "filename starts '/dev/...' so assume its a disk or
> partition."

But to check whether it's a block device, you can simply do:

[ -b "$IN" ]
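
A short sketch of how that test could replace the name-based check. The function name `input_kind` is made up for illustration:

```shell
#!/bin/bash
# input_kind PATH: classify PATH by file type rather than by its name.
input_kind() {
    if [ -b "$1" ]; then
        echo "block device"
    elif [ -f "$1" ]; then
        echo "regular file"
    else
        echo "other"
    fi
}
```

Unlike a `/dev/*` name match, this does not mistake a character device such as /dev/null for a disk.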

> INX=`echo $IN | sed 's:/:\\\\/:g'`
> SIZE=`fdisk -l $IN | awk '/'$INX'\:/ {print $5}'`

which works for disks but not for other block devices.

And instead of escaping IN and using awk /.../, you could have
done:

awk -v disk="$IN" 'index($0, disk ":") {print $5}'

Which is a substring search instead of a pattern search.
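
To see the difference, here is a small sketch that runs the awk command on a made-up line in the style of old `fdisk -l` output (the sample line and the function name `disk_size_from` are assumptions for illustration, not real output from the poster's system):

```shell
#!/bin/bash
# disk_size_from DEVICE: read fdisk-style text on stdin and print the 5th
# field of the line containing "DEVICE:" as a literal substring.
disk_size_from() {
    awk -v disk="$1" 'index($0, disk ":") {print $5}'
}

printf 'Disk /dev/hda: 10.7 GB, 10737418240 bytes\n' | disk_size_from /dev/hda
# prints 10737418240
```

Because index() does a literal comparison, characters such as "/" or "." in the device name need no escaping, and a name like `/dev/hd.` does not match `/dev/hda` the way it would as a pattern.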