I have a string $hex which contains, let's assume, "0012345689abcd".
How can I split it into an array so that
$arr[0]=00, $arr[1]=12, etc.?
It works to some extent with the split command, like this:
foreach (split(//, $hex)) {
$arr[$i]=$_;
$i++;
}
Unfortunately, when I read big files of 4 MB it takes
about 10 minutes to complete. No good.
(I couldn't split it like 00,12, only like 0,0,1,2.)
Then I thought unpack would be a better idea.
@arr = unpack("H2",$data); or
@arr = unpack("H2*",$data);
But only the first element got transferred, i.e. 00:
$arr[0]=00 and $arr[1] is undefined.
Can anyone help me with this?
thanks,
jis
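For reference, a minimal sketch of how unpack treats these templates. It assumes $data holds raw bytes (the sample bytes below are made up to match the example string) and a perl recent enough to support parenthesised groups in templates. A bare H2 converts a single byte, which is consistent with only the first element coming back; repeating the group walks the whole string.

```perl
use strict;
use warnings;

# Hypothetical sample: the raw bytes behind the hex string "0012345689abcd".
my $data = "\x00\x12\x34\x56\x89\xab\xcd";

# "H2" converts exactly one byte, so the list has a single element.
my @first = unpack 'H2', $data;

# "(H2)*" repeats the group until the input is exhausted.
my @all = unpack '(H2)*', $data;

print scalar(@first), " element: @first\n";
print scalar(@all), " elements: @all\n";
```

This works on the raw data directly, so no intermediate hex string is needed.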
while ( $hex =~ /[[:xdigit:]]{2}/g ) { push @arr, $1 }
The "g" flag on the regex tells Perl to do its search from where the
previous search left off, so this will just walk through your string two
characters at a time and relieve you from having to keep track of where
you are in the string and in your array.
The "o" flag may also be useful; check "Regexp Quote-Like Operators" in
perlop for more info.
No need for a loop:
my @arr = $hex =~ /[[:xdigit:]]{2}/g;
Also, you don't use capturing parentheses in your regular expression so
$1 will always be empty.
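A short sketch of both forms side by side, using the sample string from the question. The list-context match needs no parentheses; the loop form only fills $1 once the pattern has a capturing group.

```perl
use strict;
use warnings;

my $hex = '0012345689abcd';

# List context: //g returns every match, so no loop and no $1 are needed.
my @arr = $hex =~ /[[:xdigit:]]{2}/g;

# Scalar context: the loop form needs capturing parentheses to fill $1.
my @loop;
while ( $hex =~ /([[:xdigit:]]{2})/g ) {
    push @loop, $1;
}

print "@arr\n";
print "@loop\n";
```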
> The "g" flag on the regex tells Perl to do its search from where the
> previous search left off, so this will just walk through your string two
> characters at a time and relieve you from having to keep track of where
> you are in the string and in your array.
>
> The "o" flag may also be useful; check "Regexp Quote-Like Operators" in
> perlop for more info.
The /o option would not be useful in this case as there are no variables
in the regular expression to interpolate and in any case modern versions
of perl would not re-interpolate a variable that doesn't change.
perldoc -q /o
John
--
The programmer is fighting against the two most
destructive forces in the universe: entropy and
human stupidity. -- Damian Conway
my @arr = unpack '(a2)*', $hex;
The /o flag is rarely useful. Any meaningful use of it is better
replaced with qr//, and if compilation-speed is really a concern it's
best to build the regex as a string and pass it through qr// only once.
Ben
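As an illustration of the qr// suggestion, here is a minimal sketch. The $width variable is hypothetical and exists only to give the pattern something to interpolate; the pattern is built and compiled once, then reused in the match.

```perl
use strict;
use warnings;

my $hex   = '0012345689abcd';
my $width = 2;    # hypothetical: chunk size to interpolate into the pattern

# Interpolated and compiled once by qr//, then reused below.
my $pair = qr/[[:xdigit:]]{$width}/;

my @arr = $hex =~ /$pair/g;
print "@arr\n";
```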
> No need for a loop:
>
> my @arr = $hex =~ /[[:xdigit:]]{2}/g;
>
> Also, you don't use capturing parentheses in your regular expression so
> $1 will always be empty.
So much for my proofreading :-P You're right, of course.
>Guys,
>
>I have a string $hex which has lets assume "0012345689abcd"
>[snip]
>Unfortunately when i read big files of 4MB size it takes
>like 10mins before it completes execution. No good.
>(i couldnt split it like 00,12 but only like 0,0,1,2)
>
>Then I thought unpack wud be a better idea.
> @arr = unpack("H2",$data); or
>@arr = unpack("H2*",$data);
>
Perl distributions for win32 have a problem with
native realloc(). On these, the larger the dynamic list
generated by the function, the longer it takes.
Linux doesn't have this problem.
In general, if you expect to be splitting up very
large data segments, it's better to control the list
external to the function, where push() is better.
Of the 3 types of basic methods: substr/unpack/regexp,
the one that's the fastest seems to be substr().
Additionally, on win32 platforms, any method using a
push is far better.
My platform is Windows in generating the below data.
If you have Linux, your results will be different.
Post your numbers if you can.
-sln
Output:
--------------------
Size of bigstring = 560
Substr/push took: 0.00030303 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
Unpack/list took: 0.000344038 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
Unpack/push took: 0.000586033 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
Regexp/list took: 0.000608206 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
Regexp/push took: 0.000404835 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
--------------------
Size of bigstring = 5600
Substr/push took: 0.002841 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
Unpack/list took: 0.00334311 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
Unpack/push took: 0.00657105 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
Regexp/list took: 0.00673795 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
Regexp/push took: 0.004076 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
--------------------
Size of bigstring = 56000
Substr/push took: 0.0301139 wallclock secs ( 0.03 usr + 0.00 sys = 0.03 CPU)
Unpack/list took: 0.0458951 wallclock secs ( 0.05 usr + 0.00 sys = 0.05 CPU)
Unpack/push took: 0.0644789 wallclock secs ( 0.06 usr + 0.00 sys = 0.06 CPU)
Regexp/list took: 0.07149 wallclock secs ( 0.06 usr + 0.00 sys = 0.06 CPU)
Regexp/push took: 0.03965 wallclock secs ( 0.03 usr + 0.00 sys = 0.03 CPU)
--------------------
Size of bigstring = 560000
Substr/push took: 0.309315 wallclock secs ( 0.30 usr + 0.02 sys = 0.31 CPU)
Unpack/list took: 0.723145 wallclock secs ( 0.61 usr + 0.11 sys = 0.72 CPU)
Unpack/push took: 0.640141 wallclock secs ( 0.64 usr + 0.00 sys = 0.64 CPU)
Regexp/list took: 0.927701 wallclock secs ( 0.92 usr + 0.00 sys = 0.92 CPU)
Regexp/push took: 0.516143 wallclock secs ( 0.52 usr + 0.00 sys = 0.52 CPU)
--------------------
Size of bigstring = 5600000
Substr/push took: 3.79988 wallclock secs ( 3.75 usr + 0.06 sys = 3.81 CPU)
Unpack/list took: 40.0264 wallclock secs (34.97 usr + 5.06 sys = 40.03 CPU)
Unpack/push took: 6.71793 wallclock secs ( 6.70 usr + 0.01 sys = 6.72 CPU)
Regexp/list took: 34.6208 wallclock secs (34.56 usr + 0.06 sys = 34.63 CPU)
Regexp/push took: 7.93654 wallclock secs ( 7.89 usr + 0.05 sys = 7.94 CPU)
=======
use strict;
use warnings;
use Benchmark qw(:hireswallclock);

for my $multiplier (40, 400, 4_000, 40_000, 400_000)
{
my $bigstring = '0012345689abcd' x $multiplier;
print "\n",'-'x20,"\nSize of bigstring = ",length($bigstring),"\n\n";
##
{
my ($val, $offs, @pairs) = ('',0);
my $t0 = new Benchmark;
while ($val=substr( $bigstring, $offs, 2))
{
push @pairs, $val;
$offs+=2;
}
my $t1 = new Benchmark;
print "Substr/push took: ",timestr(timediff($t1, $t0)),"\n";
}
##
{
my $t0 = new Benchmark;
my @pairs = unpack '(a2)*', $bigstring;
my $t1 = new Benchmark;
print "Unpack/list took: ",timestr(timediff($t1, $t0)),"\n";
}
##
{
my ($val, $offs, @pairs) = ('',0);
my $t0 = new Benchmark;
while ($val=unpack("x$offs a2", $bigstring) )
{
push @pairs, $val;
$offs+=2;
}
my $t1 = new Benchmark;
print "Unpack/push took: ",timestr(timediff($t1, $t0)),"\n";
}
##
{
my $t0 = new Benchmark;
my @pairs = $bigstring =~ /[0-9a-f]{2}/g;
my $t1 = new Benchmark;
print "Regexp/list took: ",timestr(timediff($t1, $t0)),"\n";
}
##
{
my @pairs;
my $t0 = new Benchmark;
while ( $bigstring =~ /([0-9a-f]{2})/g ) {
push @pairs, $1;
}
my $t1 = new Benchmark;
print "Regexp/push took: ",timestr(timediff($t1, $t0)),"\n";
}
}
__END__
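One caveat about the substr loops above: `while ($val = substr(...))` ends when the chunk is false, so it depends on the string having an even length and would also stop early on a lone trailing "0". A sketch of a variant with an explicit bound, which does not depend on the truth of each chunk (its speed relative to the versions above is not measured here):

```perl
use strict;
use warnings;

my $hex = '0012345689abcd';
my $len = length $hex;

my @pairs;
# Stop on the offset, not on the value of the chunk.
for ( my $offs = 0; $offs < $len; $offs += 2 ) {
    push @pairs, substr( $hex, $offs, 2 );
}

print "@pairs\n";
```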
Thanks for the replies.
As said, regex and unpack took longer than substr.
I use Windows. These are the times taken to read a 4 MB file into an array:
1. Regex: @arr = $hex =~ /[[:xdigit:]]{2}/g;
   took 1 min 7 seconds.
2. Unpack: @arr = unpack("(C2)*",$hex);
   took 3 min 26 seconds.
3. Substr:
   while ($val=substr( $hex, $offs, 2))
   {
       push @arr, $val;
       $offs+=2;
   }
   took 11 seconds.
thanks,
jis
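A note on the unpack template in item 2, as a small sketch on a made-up four-character string: C yields the numeric code of each character, so "(C2)*" produces one number per character rather than one two-character string per pair. The A (or a) format keeps the characters.

```perl
use strict;
use warnings;

my $hex = '0012';    # made-up short sample

# "C" returns character codes: one number per character.
my @codes = unpack '(C2)*', $hex;

# "A2" returns two-character strings: one element per pair.
my @pairs = unpack '(A2)*', $hex;

print "C2: @codes\n";
print "A2: @pairs\n";
```

So the "(C2)*" version builds an array twice as long, holding numbers instead of hex pairs, which makes its timing hard to compare with the other two.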
j> As said regex and unpack took longer time than substr.
j> I use Windows. The following are the time taken.
j> 1. Regex : @arr = $hex =~ /[[:xdigit:]]{2}/g; - To read 4Mb file
j> into an array it took 1min 7 seconds.
j> 2. Unpack : @arr = unpack("(C2)*",$hex); - To read 4Mb file into
j> an array it took 3min 26seconds.
j> 3. Substr: while ($val=substr( $hex, $offs, 2))
j> {
j> push @arr, $val;
j> $offs+=2;
j> } - To read 4Mb file into an array it took 11 seconds.
i am sorry, i can't believe it took on the order of minutes to read in a
file and convert from hex to binary. this is not possible on anything
but an abacus. given you haven't shown the complete script for each
version i have to assume your code is broken in some way. also there is
no way a substr loop would be faster than unpack or a regex. both of
those would spend all their time in perl's guts while the substr version
spends most of its time doing slow perl ops in a loop. i say this from
plenty of experience benchmarking perl code. you can easily write an
incorrect test of this so i must ask you to post complete working
programs that exhibit the slowness you claim. i will wager large amounts
of quatloos i can fix them so the substr will be outed as the slowest
one.
uri
--
Uri Guttman ------ u...@stemsystems.com -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------
I would also like to believe it should take much less time.
Here are the scripts I used for testing.
1.
#!/usr/bin/perl
use strict;
use warnings;
my $binary_file="28247101.bin";
open FILE, $binary_file or die "Can't open $binary_file $!\n";
# binmode FILE to suppress conversion of line endings
binmode FILE;
undef $/;
my $data = <FILE>;
close FILE;
# convert data to hex form
my $hex = unpack 'H*', $data;
my ($val, $offs, @arr) = ('',0);
#@arr = $hex =~ /[[:xdigit:]]{2}/g;
@arr = unpack("(C2)*",$hex);
print "bye";
print $arr[2];
(This took 3 minutes 25 seconds.)
If I uncomment the regex portion and comment out the unpack, it takes
1 minute 25 seconds.
2.
#!/usr/bin/perl
use strict;
use warnings;
my $binary_file="28247101.bin";
open FILE, $binary_file or die "Can't open $binary_file $!\n";
# binmode FILE to suppress conversion of line endings
binmode FILE;
undef $/;
my $data = <FILE>;
close FILE;
# convert data to hex form
my $hex = unpack 'H*', $data;
my $i=0;
my ($val, $offs, @arr) = ('',0);
while ($val=substr( $hex, $offs, 2)){
push @arr, $val;
$offs+=2;
}
print "bye";
print $arr[2];
This takes only 9 seconds.
I used a stopwatch to measure the time.
I'd appreciate your help in finding how it can be improved.
thanks,
jis
On Mar 10, 12:51 pm, "Uri Guttman" <u...@StemSystems.com> wrote:
j> Even I want to beleive it should take very less time.
j> I post the scripts I used for testing.
j> 1. #!/usr/bin/perl
j> # convert data to hex form
j> my $hex = unpack 'H*', $data;
j> my ($val, $offs, @arr) = ('',0);
j> #@arr = $hex =~ /[[:xdigit:]]{2}/g;
j> @arr = unpack("(C2)*",$hex);
j> my $data = <FILE>;
j> close FILE;
j> # convert data to hex form
j> my $hex = unpack 'H*', $data;
j> my $i=0;
j> my ($val, $offs, @arr) = ('',0);
j> while ($val=substr( $hex, $offs, 2)){
j> push @arr, $val;
j> $offs+=2;
j> }
j> print "bye";
j> print $arr[2]; This would take only 9 seconds.
j> I have used a stopwatch to calculate time.
a stopwatch? you need to learn how to use the Benchmark.pm module.
j> Appreciate your help in finding how it can be improved.
easy. let me do a proper benchmark.
and you should learn how to properly bottom post and not leave my entire
post in the message.
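For anyone new to the module, a minimal sketch of timing one snippet with Benchmark instead of a stopwatch. The test string and the iteration count of 50 are made up for illustration; timeit() runs the code the given number of times and timestr() formats the result.

```perl
use strict;
use warnings;
use Benchmark qw(timeit timestr);

# Made-up in-memory test data, so no file I/O is included in the timing.
my $hex = '0012345689abcd' x 1_000;

# Run the snippet 50 times and collect the timing in a Benchmark object.
my $t = timeit( 50, sub { my @arr = unpack '(A2)*', $hex; } );

print "unpack: ", timestr($t), "\n";
```

Timing only the in-memory conversion keeps disk reads out of the measurement.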
j> if i uncommment regex protion and comment unpack it would take
j> 1minute 25 sec
j> print "bye";
j> print $arr[2]; This would take only 9 seconds.
j> I have used a stopwatch to calculate time.
as i said, that is a silly way to time programs. and there is no way it
would take minutes to do this unless you are on a severely slow cpu or
you are low on ram and are disk thrashing. here is my benchmarked
version which shows that unpacking (fixed to use A and not C) is the
fastest and regex (also fixed to do the simplest but correct thing which
is grab 2 chars) ties your code.
uncomment out those commented lines to see that this does the same and
correct thing in all cases.
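That check can also be automated. The following is only a sketch under stated assumptions: the sample bytes are made up, the regex is the plain two-character form /../ rather than the pattern in the posted script, and the substr loop uses an explicit length bound.

```perl
use strict;
use warnings;

# Made-up sample data; any byte string would do.
my $hex = unpack 'H*', "\x00\x10\xab\xcd";

my %method = (
    unpacking => sub { [ unpack '(A2)*', $hex ] },
    regex     => sub { [ $hex =~ /../g ] },
    substring => sub {
        my @arr;
        for ( my $offs = 0; $offs < length $hex; $offs += 2 ) {
            push @arr, substr( $hex, $offs, 2 );
        }
        return \@arr;
    },
);

# Join each result into one string so the lists can be compared with eq.
my %result = map { $_ => join( ' ', @{ $method{$_}->() } ) } keys %method;

for my $name ( sort keys %result ) {
    my $verdict = $result{$name} eq $result{unpacking} ? 'same' : 'DIFFERENT';
    print "$name: $result{$name} ($verdict)\n";
}
```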
here is the timing result run for 10 seconds each:
          s/iter     regex substring unpacking
regex       2.11        --       -0%      -25%
substring   2.11        0%        --      -25%
unpacking   1.58       33%       33%        --
uri
use strict;
use warnings;
use File::Slurp ;
use Benchmark qw(:all) ;
my $duration = shift || -2 ;
my $file_name = '/boot/vmlinuz-2.6.28-15-generic' ;
my $data = read_file( $file_name, binary => 1 ) ;
#$data = "\x00\x10" ;
my $hex = unpack 'H*', $data;
# unpacking() ;
# regex() ;
# substring() ;
# exit ;
cmpthese( $duration, {
unpacking => \&unpacking,
regex => \&regex,
substring => \&substring,
} ) ;
sub unpacking {
my @arr = unpack( '(A2)*' , $hex) ;
# print "@arr\n"
}
sub regex {
my @arr = $hex =~ /(..{2})/g ;
# print "@arr\n"
}
sub substring {
my ($val, $offs, @arr) = ('',0);
while ($val=substr( $hex, $offs, 2)){
push @arr, $val;
$offs+=2;
}
# print "@arr\n"
}
Uri,
I used the script you posted, changing only the input file,
and I get the following results.
(warning: too few iterations for a reliable count)
(warning: too few iterations for a reliable count)
(warning: too few iterations for a reliable count)
          s/iter unpacking     regex substring
unpacking   9.06        --      -27%      -34%
regex       6.59       37%        --       -9%
substring   6.01       51%       10%        --
Unpacking is still the slowest to finish.
I use Windows XP Professional with 2 GB of RAM. I also have 45 GB of
free space on my C drive.
Do you see anything else different?
thanks,
jis
Shouldn't that be:
my @arr = $hex =~ /../g ;
Or:
my @arr = $hex =~ /.{2}/g ;
You are capturing *three* characters instead of two.
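A quick sketch on the sample string from the start of the thread shows the difference. In /(..{2})/ the {2} applies only to the second dot, so each match is one character plus two more.

```perl
use strict;
use warnings;

my $hex = '0012345689abcd';

# "." followed by ".{2}" matches three characters per chunk;
# the last two characters are left over and never matched.
my @three = $hex =~ /(..{2})/g;

# Two characters per chunk, as intended.
my @two = $hex =~ /../g;

print "three: @three\n";
print "two:   @two\n";
```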
You have Windows!
Try the test below. It uses timethis() for $count iterations.
You don't want a partial iteration result given a small time interval.
After you run the code as written, run it again with your file
information plugged in and $count changed to 3 iterations.
Go for a coffee break. Post back.
My results:
Unpacking: 12.7929 wallclock secs ( 9.94 usr + 2.84 sys = 12.78 CPU) @ 0.08/s (n=1)
Regex: 29.6103 wallclock secs (29.53 usr + 0.08 sys = 29.61 CPU) @ 0.03/s (n=1)
Substring: 2.85185 wallclock secs ( 2.81 usr + 0.03 sys = 2.84 CPU) @ 0.35/s (n=1)
-sln
-----------------
use strict;
use warnings;
use Benchmark qw(:all :hireswallclock) ;
#---- Uncomment, plug in filename ---------
# use File::Slurp ;
# my $file_name = '/boot/vmlinuz-2.6.28-15-generic' ;
# my $data = read_file( $file_name, binary => 1 ) ;
# #$data = "\x00\x10" ;
# my $hex = unpack 'H*', $data;
#------------------------------------------
my $count = 1; # increase count to 3 after first testing 1
#---- Comment out $hex -------------------
my $hex = 'a0b0c1d2e3f411aabbcc' x 200_000; # about 4MB's
#-----------------------------------------
timethis ($count, \&unpacking, "Unpacking");
timethis ($count, \&regex, "Regex");
timethis ($count, \&substring, "Substring");
sub unpacking {
my @arr = unpack( '(A2)*' , $hex) ;
# print "@arr\n"
}
sub regex {
my @arr = $hex =~ /.{2}/g ; # regex modified
# print "@arr\n"
}
sub substring {
my ($val, $offs, @arr) = ('',0);
while ($val=substr( $hex, $offs, 2)) {
push @arr, $val;
$offs+=2;
}
# print "@arr\n"
}
__END__
j> Uri,
j> I have used the script you have posted with only change in input file
j> i get the following results.
j> (warning: too few iterations for a reliable count)
j> (warning: too few iterations for a reliable count)
j> (warning: too few iterations for a reliable count)
j> s/iter unpacking regex substring
j> unpacking 9.06 -- -27% -34%
j> regex 6.59 37% -- -9%
j> substring 6.01 51% 10% --
j> Unpacking still remains the longest to finish.
j> I use Windows XP professional with a 2Gb RAM. I also have got a 45GB
j> free space in my C drive.
j> DO you see something else different?
i don't have 45GB files nor do i intend to do that. you are disk
thrashing which is the cause of your slowdowns. you are not properly
testing the perl code as your OS I/O is the limiting factor here. learn
how to understand benchmarks better. your test is not legitimate in
comparing the algorithms as the disk I/O dominates.
try it with smaller files that will fit in your ram. not more than .5 gb
given your systems. and with files that large, i would do the conversion
in large chunks in a loop to mitigate the i/o and then see which does
better.
uri
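A rough sketch of what converting in chunks could look like. Everything concrete here is an assumption for illustration: the data is a made-up in-memory stand-in for the file so the example runs anywhere, and the chunk size is tiny; a real run would open the actual file and use a much larger chunk.

```perl
use strict;
use warnings;

# Made-up stand-in for the file contents (1000 bytes cycling 0x00..0xff).
my $data = join '', map { chr( $_ % 256 ) } 0 .. 999;

# In-memory filehandle; for a real file this would be open my $fh, '<', $file_name.
open( my $fh, '<', \$data ) or die "Can't open in-memory file: $!";
binmode $fh;

my $chunk_size = 256;    # hypothetical; far larger for real files
my @arr;

# Read a chunk of raw bytes, convert it to hex pairs, append, repeat.
while ( read( $fh, my $buf, $chunk_size ) ) {
    push @arr, unpack '(H2)*', $buf;
}
close $fh;

print scalar(@arr), " pairs, first three: @arr[0..2]\n";
```

Because each chunk holds whole bytes, no hex pair is ever split across a chunk boundary.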
JWK> Uri Guttman wrote:
>>
>> sub regex {
>> my @arr = $hex =~ /(..{2})/g ;
>> # print "@arr\n"
>> }
JWK> Shouldn't that be:
JWK> my @arr = $hex =~ /../g ;
JWK> Or:
JWK> my @arr = $hex =~ /.{2}/g ;
JWK> You are capturing *three* characters instead of two.
true. i did my output test and must have optimized this without running
the tests again. anyhow, this whole thing is moot. the OP never said he
had a 25GB file on a 2gb system. slurping in the whole file and then
processing it is disk bound and the 2 char algorithm is irrelevant. i am
out of this thread. the OP doesn't seem to get the concept of
benchmarking or optimizing. let him stick to his substr and stopwatch.
uri
Right. He never said that. So where did you get that information?
He said he had a 4 MB file and 45 GB of free space (the latter is rather
irrelevant, of course).
hp
PJH> On 2010-03-11 18:30, Uri Guttman <u...@StemSystems.com> wrote:
>>>>>>> "JWK" == John W Krahn <som...@example.com> writes:
>> anyhow, this whole thing is moot. the OP never said he had a 25GB file
>> on a 2gb system.
PJH> Right. He never said that. So where did you get that information?
PJH> He said he had a 4 MB file and 45 GB of free space (the latter is rather
PJH> irrelevant, of course).
i misread the 45Gb free disk as the file size. he still never mentioned
the file size. as i showed, the unpack is fastest with the data in
ram. i still would want to know his setup (file size included) to see
why his substr would be fastest. it has to be some very odd thing he is
doing and not telling us. there is no way a substr loop could be faster
than a single call to unpack.
The odd thing he is doing seems to be "using perl on Windows". Sln has
repeatedly pointed out that growing strings or arrays on Windows is
extremely slow (yes, sln sometimes makes strange claims, but he not only
provided benchmark results but also a link to a perlmonks thread - so he
isn't the only one who noticed this). I don't have access to a Windows
machine where I could test this myself, though.
hp
Win32's malloc is well known to be appalling, but you can't build perl
with perl's malloc if you want USE_IMPLICIT_SYS. Since IMP_SYS is required
for fork emulation, most perls on Win32 are built with Win32's malloc,
and are thus much slower than they might be :(.
Ben