I have an 8GB file and need the md5sum of every 64MB block in the
file.
I'm looking for some ideas on how to write a script to do this using
bash - not interested in perl or other language solutions.
The size of my disk is 10GB with a smallish linux system and 256MB
free disk space.
Thanks for all constructive posts.
Hal
If time is not a problem you can try:
MD5=;x=0;while [ "$MD5" != d41d8cd98f00b204e9800998ecf8427e ];do
MD5=`dd status=noxfer if=file bs=64M skip=$x count=1|md5sum|cut -d ' ' -f 1`
x=$((x+1))
echo $x $MD5 # for block 1, skipped 0
done >blocks
Not tested!
See also command split.
The problem with the idea above is that it restarts the file read
for each block:
$ echo $[8000/64]
125
something like:
for ((x=0; x<128; x++)) ;
do
dd if=largefile ibs=64M obs=64M skip=$x count=1 of=tmp;
echo $x;
md5sum tmp;
done
This reads through your file in 128 steps, each time writing one block
to a temp file 'tmp' whose md5sum is then computed.
--
Luuk
> something like:
>
> for ((x=0; x<128; x++)) ;
> do
> dd if=largefile ibs=64M obs=64M skip=$x count=1 of=tmp;
> echo $x;
> md5sum tmp;
> done
>
>
> reading in 128 steps, through your file, creating a (temp)file 'tmp' of
> which the md5sum is created.
You don't need the temp file, since you can pipe the output of dd directly
to md5sum. Just omit the "of=" part in the dd command, and dd will write to
stdout.
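
A minimal sketch of that pipe-only loop, assuming GNU dd and md5sum are
available. The function name block_sums and its arguments are made up for
illustration, and the block count is derived from the file size rather
than hard-coded:

```shell
# Hypothetical helper: print "<index> <md5>" for each block of a file.
# usage: block_sums FILE [BLOCK_SIZE_IN_BYTES]   (default 64 MiB)
block_sums() {
    file=$1
    bs=${2:-$((64 * 1024 * 1024))}
    size=$(wc -c < "$file")
    blocks=$(( (size + bs - 1) / bs ))   # round up to cover a short last block
    i=0
    while [ "$i" -lt "$blocks" ]; do
        # count=1 limits dd to one block; with no of=, the data goes to stdout
        sum=$(dd if="$file" bs="$bs" skip="$i" count=1 2>/dev/null |
              md5sum | cut -d ' ' -f 1)
        echo "$i $sum"
        i=$((i + 1))
    done
}
```

Called as "block_sums largefile", it should need no temporary file at all.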
--
D.
You shouldn't need to use any disk space:
while
{
details=$(
{
LC_ALL=C dd bs="$((64*1024*1024))" count=1 2>&3 | md5sum >&4
} 3>&1
)
} 4>&1 &&
case $details in (*"1+0 records in"*) ;; (*) false;; esac
do :
done < your-big-file
--
Stéphane
I've updated my previous code with CHAZELAS' smart idea of feeding the
loop from a single redirection, and it now uses head instead of dd:
md5(){ [ "$1" ]||{ echo "md5 <block_size_in_bytes>" >&2;return 1;}
x=0;pMD5=
while MD5=`head -c "$1"|md5sum|cut -d ' ' -f 1`&&[ "$pMD5" != "$MD5" ];do
pMD5=$MD5
x=$((x+1))
echo "$MD5 $x"
done
}
### Usage:
$ md5
md5 <block_size_in_bytes>
$
$ md5 512 <x.log
ae85a3ff6457e95c723c7d90232cf738 1
b09cdeb518910d3b8b8fb09bd8a57488 2
4fd51c79224201930e294d71feb47c7a 3
d41d8cd98f00b204e9800998ecf8427e 4
$
$ cat x.log|md5 1024
dd66ec03b7c59ed2828090dae4421818 1
4fd51c79224201930e294d71feb47c7a 2
d41d8cd98f00b204e9800998ecf8427e 3
$
With this new code the last line is always the MD5 of the empty string
(end of input).
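
One caveat: because the loop above ends when two consecutive blocks hash
the same, it would also stop early on a file containing two identical
adjacent blocks (a run of zeros, say). A possible variant, sketched here
under the assumption that head -c consumes no more than the requested
number of bytes (as GNU head does), stops on the empty-string hash
instead. The name md5_blocks is hypothetical:

```shell
# Hypothetical variant: read stdin in blocks of $1 bytes, print "<md5> <n>".
md5_blocks() {
    bs=$1
    empty=d41d8cd98f00b204e9800998ecf8427e   # md5 of zero bytes = end of input
    n=0
    while sum=$(head -c "$bs" | md5sum | cut -d ' ' -f 1) &&
          [ "$sum" != "$empty" ]; do
        n=$((n + 1))
        echo "$sum $n"
    done
}
```

Usage would be the same as before, e.g. "md5_blocks 512 <x.log", but
without the trailing empty-string line.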
I also put together this attempt which seems to do the trick as well.
It's based on previous responses and some ideas found in the article
'Mincing Your Data into Arbitrary Chunks (in bash)' from the book
Linux Server Hacks. I found the article online.
If there are some glaring errors, please let me know.
Hal
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#!/bin/sh
#IN="/dev/hda1"
IN="bigfile"
if [ "${IN:0:5}" == "/dev/" ] ; then
    echo "filename starts '/dev/...' so assume its a disk or partition."
    INX=`echo $IN | sed 's:/:\\\\/:g'`
    SIZE=`fdisk -l $IN | awk '/'$INX'\:/ {print $5}'`
else
    echo "filename does not start '/dev/..' so assume its a normal file."
    SIZE=`ls -l $IN | awk '{print $5}'`
fi
echo "SIZE="$SIZE
OUT="out"
B="$((64*1024*1024))"
total=0
while [ $total -lt $SIZE ]; do
    dd bs="$B" count=1 2> /dev/null | md5sum
    total=$((total + B))
done < $IN
That's not standard sh syntax: ${IN:0:5} and "==" are ksh93
syntax, also recognised by bash (but not by GNU "[", BTW).
The standard way to test the prefix is:
case $IN in
(/dev/*) ...
> echo "filename starts '/dev/...' so assume its a disk or
> partition."
But to check whether it's a block device, you can simply do:
[ -b "$IN" ]
> INX=`echo $IN | sed 's:/:\\\\/:g'`
> SIZE=`fdisk -l $IN | awk '/'$INX'\:/ {print $5}'`
That works for disks but not for other block devices.
And instead of escaping IN and using awk /.../, you could have
done:
awk -v disk="$IN" 'index($0, disk ":") {print $5}'
Which is a substring search instead of a pattern search.
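
To see the substring match at work without touching a real disk, here is
a small sketch. The wrapper name disk_bytes and the sample input line are
made up for illustration; real fdisk output varies between versions:

```shell
# Hypothetical wrapper around the awk call above; reads fdisk-style text on stdin.
disk_bytes() {
    awk -v disk="$1" 'index($0, disk ":") {print $5}'
}

# Illustrative input line only, not captured from a real system.
printf 'Disk /dev/hda1: 10.7 GB, 10737418240 bytes\n' | disk_bytes /dev/hda1
```

Since index() compares literal text, the slashes in the device name need
no escaping.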
See also
fdisk -l -- "$IN" | sed -n "s#.* $IN:.* \([0-9]*\) bytes#\1#p"
The util-linux tools have fdisk but also the blockdev command:
SIZE=$(blockdev --getsize64 "$IN")
> else
> echo "filename does not start '/dev/..' so assume its a normal
> file."
> SIZE=`ls -l $IN | awk '{print $5}'`
SIZE=$(wc -c < "$IN")
You may want to check that it's a regular file as well:
[ -f "$IN" ]
With ls, you'd need the "-L" option because for symlinks, you
want the size of the file pointed to, not of the symlink which
is irrelevant.
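
Putting the file-type tests and the two size commands together, one
possible sketch follows. The function name input_size is hypothetical,
and the block-device branch assumes blockdev from util-linux is installed
and that you have permission to query the device:

```shell
# Hypothetical helper: print the size in bytes of a block device or regular file.
input_size() {
    if [ -b "$1" ]; then
        # block device: ask the kernel for its size in bytes
        blockdev --getsize64 "$1"
    elif [ -f "$1" ]; then
        # regular file (symlinks are followed by -f and by the redirection)
        wc -c < "$1"
    else
        echo "input_size: $1: not a block device or regular file" >&2
        return 1
    fi
}
```

SIZE=$(input_size "$IN") would then replace the whole if/else block.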
> fi
>
> echo "SIZE="$SIZE
Funny that you put quotes where it was not necessary and not
where it was. In shells, double quotes are to be put around
variables:
echo "SIZE=$SIZE"
Without quotes, variable expansions are subject to word
splitting and filename generation.
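
A quick way to see the word splitting, assuming the default IFS. The
helper count_args is made up for the demonstration:

```shell
count_args() { echo "$#"; }   # hypothetical helper: prints how many arguments it got

v='1 2'
count_args $v      # unquoted: the value is split into two words
count_args "$v"    # quoted: the value stays one word
```

An unquoted value containing * or ? would additionally be expanded
against the file names in the current directory.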
>
> OUT="out"
> B="$((64*1024*1024))"
>
> total=0
> while [ $total -lt $SIZE ]; do
> dd bs="$B" count=1 2> /dev/null | md5sum
> total=$((total + B))
total=$(($total + $B)) is preferred (though not necessary in most
recent implementations of sh)
> done < $IN
The solution I gave earlier, which parses the stderr output of
dd, means you don't have to find out the size beforehand. It
stops when dd can't read a whole input block.
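
The stop condition on its own can be sketched like this, assuming a dd
that reports "1+0 records in" on stderr after one complete block (as GNU
dd does under LC_ALL=C). The helper name read_full_block is hypothetical,
and it throws the data away, so it only illustrates the test, not the
checksumming:

```shell
# Hypothetical helper: consume one block of $1 bytes from stdin and
# succeed only if dd reports that the block was complete.
read_full_block() {
    # 2>&1 sends dd's report into the substitution; the data itself is discarded
    report=$(LC_ALL=C dd bs="$1" count=1 2>&1 >/dev/null)
    case $report in
        *"1+0 records in"*) return 0 ;;
        *) return 1 ;;   # partial block ("0+1") or nothing left ("0+0")
    esac
}
```

Run repeatedly with stdin redirected from a file, it succeeds once per
full block and fails on the first short or empty read.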
--
Stéphane