use strict; use warnings;
use Parallel::ForkManager;
my $pm = Parallel::ForkManager->new(10);
# assume @files contains 100 files that will be processed,
# and processing time could range from subseconds to hours
my $out = 'results.txt';
for my $file (@files) {
    $pm->start and next;

    # some code to process file
    # blah blah blah

    open( my $fh_out, '>', $out ) or die "can't open $out: $!\n";
    print $fh_out "$file\n";
    close $fh_out;

    $pm->finish;
}
$pm->wait_all_children;
Apologies... ^^^ should be '>>'
As long as you don't care about the order in which the output from the
respective files is appended to $out, I can't see what the problem would be.
--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
order is unimportant at this juncture. i was concerned if one process
would interrupt another that was writing, so that either both failed,
or a single entry became a garbled hybrid of two entries...something
along those lines. these, or other cases that may interfere with
writing one entry per line to $out, are what cause me apprehension.
Probably somebody more knowledgeable than me should comment on your concerns.
Awaiting that, have you read
perldoc -q append.+text
I can add that I'm using the above method in a widely used Perl program
(run on various platforms), without any problems having been reported.
However, I do use flock() to set an exclusive lock before printing to
the file. Can't tell if locking is necessary.
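The locked-append idiom being described looks roughly like this (a sketch only; the file name and record are placeholders, and the constants come from Fcntl):

```perl
use strict; use warnings;
use Fcntl qw(:flock);

# Hypothetical file and record names, for illustration only.
my $out  = 'results.txt';
my $line = 'some record';

open( my $fh_out, '>>', $out ) or die "can't open $out: $!\n";
flock( $fh_out, LOCK_EX )      or die "can't lock $out: $!\n";
print $fh_out "$line\n";
close $fh_out;    # closing the handle releases the lock
```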
On Linux this is safe, provided the string $file is not more than a few
hundred bytes. On Windows, it is not safe (although probably safe
enough, provided this is just for progress monitoring)
Xho
--
-------------------- http://NewsReader.Com/ --------------------
Usenet Newsgroup Service $9.95/Month 30GB
I meant to say "On Linux this is fairly safe." I have never run into a
problem with it, but that is no guarantee of absolute safety.
It would be better to have the file opened for writing in
the parent process, and let all the forked processes inherit
that file handle.
-Joe
i thought about that, but i tried something similar with an FTP
connection, and discovered it was not fork safe, so i was hesitant to
share anything among the forked processes.
i'm running into some issues, which may be related to this. unsure
right now. i'm also using a similar process to actually print out
apache web log records (unlike the above example, which is just a short
file name), some of which may be large due to cookies and query
strings, and additional authentication information we put in there. i
know in at least one case, the user agent string appears once fully,
then later in the processed record--BUT the second time only a jagged
latter half of it gets inserted somewhere before the end of the record.
from your response here, it seems like you may be aware of some problem
if the string is large?
should i use flock()? i don't really know much about it...
this never occurred when i wrote to the same file using many system()
calls. again, i am unsure if this is actually a forking issue. i am
currently investigating. just wondered if this raised any alarms with
anyone who may know what is going on.
> Gunnar Hjalmarsson wrote:
>> I can add that I'm using the above method in a widely used Perl
>> program (run on various platforms), without any problems having been
>> reported. However, I do use flock() to set an exclusive lock before
>> printing to the file. Can't tell if locking is necessary.
...
> hello Gunnar, could you please post a snippet of code illustrating how
> you would use flock() in my above example?
So far, I do not see any code posted by you.
Maybe, you should read the documentation, and come up with a short
script illustrating your problem, and we can comment on it:
perldoc -q lock
perldoc -f flock
Sinan
--
A. Sinan Unur <1u...@llenroc.ude.invalid>
(reverse each component and remove .invalid for email address)
comp.lang.perl.misc guidelines on the WWW:
http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
apologies, i was referring to the following example:
>
> A. Sinan Unur wrote:
>> simon...@gmail.com wrote in
>> news:1129994532.1...@g47g2000cwa.googlegroups.com:
>>
>> > Gunnar Hjalmarsson wrote:
>>
>> >> I can add that I'm using the above method in a widely used Perl
>> >> program (run on various platforms), without any problems having
>> >> been reported. However, I do use flock() to set an exclusive lock
>> >> before printing to the file. Can't tell if locking is necessary.
>>
>> ...
>>
>> > hello Gunnar, could you please post a snippet of code illustrating
>> > how you would use flock() in my above example?
>>
>> So far, I do not see any code posted by you.
>>
>> Maybe, you should read the documentation, and come up with a short
>> script illustrating your problem, and we can comment on it:
>>
>> perldoc -q lock
>>
>> perldoc -f flock
>>
>
> apologies, i was referring to the following example:
I see now ... You like to keep your readers guessing by not sticking
with a single posting address. Not very polite. Just configure whichever
client you are using with the same name and email address, please, so I
can figure out with whom I am corresponding.
> use strict; use warnings;
> use Parallel::ForkManager;
>
>
> my $pm = Parallel::ForkManager->new(10);
>
>
> # assume @files contains 100 files that will be processed,
> # and processing time could range from subseconds to hours
>
>
> my $out = 'results.txt';
> for my $file (@files) {
> $pm->start and next;
>
>
> # some code to process file
> # blah blah blah
>
This is a good place to put an exclusive flock on a sentinel file (not
the output file). Your code will block until it gets the exclusive lock.
> open( my $fh_out, '>>', $out ) or die "can't open $out: $!\n";
> print $fh_out "$file\n";
> close $fh_out;
And, this would be the place to release that lock.
So long as the sentinel file is not on a network mounted volume, this
will give you the assurance that at most one process will write to the
file at any given time.
> $pm->finish;
> }
> $pm->wait_all_children;
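A minimal sketch of that sentinel-file scheme (the lock-file name 'results.lock' is made up for illustration; each child would take the lock before appending):

```perl
use strict; use warnings;
use Fcntl qw(:flock);

my $out = 'results.txt';

# Take an exclusive lock on a separate sentinel file; flock blocks
# here until the lock is granted.
open( my $lock_fh, '>', 'results.lock' ) or die "can't open lock file: $!\n";
flock( $lock_fh, LOCK_EX ) or die "can't lock: $!\n";

# Now at most one process at a time runs this section.
open( my $fh_out, '>>', $out ) or die "can't open $out: $!\n";
print $fh_out "record\n";
close $fh_out;

# Release the lock once the write is done.
flock( $lock_fh, LOCK_UN ) or die "can't unlock: $!\n";
close $lock_fh;
```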
sorry about that. i am unable to access non-work email at work. i will
try to set up my home account to use the same name and address.
>
> > use strict; use warnings;
> > use Parallel::ForkManager;
> >
> >
> > my $pm = Parallel::ForkManager->new(10);
> >
> >
> > # assume @files contains 100 files that will be processed,
> > # and processing time could range from subseconds to hours
> >
> >
> > my $out = 'results.txt';
> > for my $file (@files) {
> > $pm->start and next;
> >
> >
> > # some code to process file
> > # blah blah blah
> >
>
> This is a good place to put an exclusive flock on a sentinel file (not
> the output file). Your code will block until it gets the exclusive lock.
i don't understand. in the Cookbook 2nd ed., pg 279 Locking a File they
write:
open(FH, "+<", $path) or die "can't open $path: $!";
flock(FH, LOCK_EX);
# update file, then...
close(FH);
i'm not importing the Fcntl module, so i must use the numeric values. 2
corresponds to LOCK_EX.
i don't understand what you mean by placing a LOCK_EX on a sentinel
file. why wouldn't i lock my output file? i thought i would:
open the fh
lock it
write to the file
close the fh (which unlocks it).
is this not correct? the file may get up to 3 GB, so i don't know if
locking a REGION of the file may be more apt, but i am also unsure of
whether or not that only applies to file updates, because how can you
lock a region of a new file, when the newest part doesn't exist yet.
Why not the output file?
>> open( my $fh_out, '>>', $out ) or die "can't open $out: $!\n";
Why not just:
flock $fh_out, 2 or die $!;
>> print $fh_out "$file\n";
>> close $fh_out;
>
> And, this would be the place to release that lock.
Unless you use a lexical ref to the filehandle as above, in which case
you don't need to release it explicitly.
My related comment was posted at
http://groups.google.com/group/comp.lang.perl.misc/msg/2ca8d6e6894030b3
You should do
use Fcntl qw(:flock);
because there is no guarantee that LOCK_EX will always be 2. Using the
LOCK_EX will make sure that the correct value for the system is going to
be used.
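For example, a quick check of what the :flock tag exports on a given system:

```perl
use strict; use warnings;
use Fcntl qw(:flock);

# :flock exports the four flock() constants with whatever values this
# platform uses; hard-coding 2 for LOCK_EX merely happens to work on
# common systems, it is not guaranteed.
printf "LOCK_SH=%d LOCK_EX=%d LOCK_UN=%d LOCK_NB=%d\n",
    LOCK_SH, LOCK_EX, LOCK_UN, LOCK_NB;
```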
> i don't understand what you mean by placing a LOCK_EX on a sentinel
> file. why wouldn't i lock my output file? i thought i would :
>
> open the fh
> lock it
> write to the file
> close the fh (which unlocks it).
>
> is this not correct?
It is not wrong, although I can see how my post can be interpreted as
stating otherwise.
I have gotten into the habit of locking a sentinel file as a semaphore
because I frequently find myself needing to serialize access to
multiple resources, in addition to occasionally serializing access to
files on a network mounted drive.
> My related comment was posted at
> http://groups.google.com/group/comp.lang.perl.misc/msg/2ca8d6e6894030b3
Gunnar, I had seen your response, but I thought Simon Chao and 'balls'
were different people.
<snip>
>> My related comment was posted at
>> http://groups.google.com/group/comp.lang.perl.misc/msg/2ca8d6e6894030b3
>
> Gunnar, I had seen your response, but I thought Simon Chao and 'balls'
> were different people.
Yeah, I know, and you also answered my question above in your reply to
"it_says_BALLS_on_your forehead".
Thanks.
well, it's been a long day at the office. thanks for all your help
guys! i apologize for my lack of clarity in the beginning--i wasn't
attempting to be disingenuous, i just was stressed and felt i didn't
have the time to explain everything in minute detail (even though the
slightest thing could radically change the advice i get). anyway, i
think things are going well with the flock. slowed things down a bit,
but what can you do? :-). thanks again!
I have to ask: Did flock() actually make a difference (besides the
execution time)?
previously there would be web log records that had bits of another web
log record spliced in the middle of them.
after loading into a table, there were many (much more than normal)
that were thrown into an error table. after the flocking, we seem to
have reduced the number of error records to normal.
it's difficult to look at the processed flat file that was produced to
check for these splices--they appear sporadically, and may be
infrequent relative to the size of the data set (50 million), but once
it was loaded into a table, we could query against it and see if things
turned out ok.
we re-ran, and examined the error table real-time up until 25 million
records were loaded, and the number of errors seemed normal. it's
possible that the last 25 million records could have many many errors,
but unlikely.
also, when i used the non-flocking method, we got high error counts for
particular servers (some high volume ones), and since i'm parsing these
files based on reverse sorted size, i know that the these will be in
the 1st half of the total number of records. after the flocking, the
biggest files were still processed first, so when we looked at the 1st
25 million records, most of the big servers were done, so we know that
they, at least, had many fewer errors. so we think this solved the
problem...
still, i have the code written in such a way that i'm opening, locking,
writing to, and closing a file handle for every record. i wish there
were a way to lock and release without having to open/close every
time. i don't know if that would save much time or not. maybe the extra
time is not from the extra opening/closing, but from the locking and
having to wait to write. i will do more research and share if i find a
solution.
Thanks for sharing. Then the lock is probably not redundant, as I
previously was suspecting, even if I'm doing it as well.
yeah, i don't think it's an issue when you're appending small strings
to a file, but they should make file locking default, don't you think?!
:-)
when *wouldn't* you want to lock a file when writing?
The flocking slowed things down a noticeable amount? Since you seem to
be running only one flock per process (i.e. every flock is accompanied
by at least one fork, whatever processing you are doing, and one open),
that is very strange. Are you holding the lock longer than necessary?
> >
> > And, this would be the place to release that lock.
>
> Unless you use a lexical ref to the filehandle as above, in which case
> you don't need to release it explicitly.
If your lock is contentious, then you should release it explicitly (or
close the filehandle explicitly) rather than just letting it go out of
scope. Even if the natural place to release it happens to be the end of
the scope, you never know when the code will be changed to add more things
to the scope. Plus, garbage collection can sometimes be time consuming, and
there is no reason to make other processes wait upon it.
From the code you posted before, you are forking a process for every
record. Given that, it is hard to believe that opening and closing a file
is significant.
> i wish there
> were a way to lock and release without having to open/close every
> time.
There is. Have you read the perldoc -q append entry someone pointed out
earlier?
> i don't know if that would save much time or not. maybe the extra
> time is not from the extra opening/closing, but from the locking and
> having to wait to write. i will do more research and share if i find a
> solution.
I'm still incredulous. If any of this has a meaningful impact on
performance, you are doing something wrong at a more fundamental level.
<snip appending without locking>
> > On Linux this is safe, provided the string $file is not more than a few
> > hundred bytes. On Windows, it is not safe (although probably safe
> > enough, provided this is just for progress monitoring)
> >
>
> i'm running into some issues, which may be related to this. unsure
> right now. i'm also using a similar process to actually print out
> apache web log records (unlike th above example, which is just a short
> file name), some of which, may be large due to cookies and query
> strings, and additional authentication information we put in there.
On my single CPU Linux system, ~4096 bytes seems to be the magic string
length where you go from reliably not screwing up to occasionally
screwing up. It is possible that multi CPU systems might screw up on
less than 4096, but I doubt it.
Consider that, when you are not using autoflush, you never really know
when it is that you are actually writing.
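One way to take perl's own buffering out of the picture is to turn autoflush on for the handle (a sketch; the log file name is illustrative):

```perl
use strict; use warnings;

# With autoflush off, a print may sit in perl's internal buffer and hit
# the file long after the statement runs -- a problem when the write is
# supposed to happen while a lock is held. The classic select() trick
# enables autoflush for one specific handle:
open( my $fh, '>>', 'log.txt' ) or die "can't open log.txt: $!\n";
{
    my $old = select($fh);    # make $fh the currently selected handle
    $| = 1;                   # set autoflush on it
    select($old);             # restore the previous default handle
}
print $fh "flushed immediately\n";
close $fh;
```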
Locking is used to resolve thorny concurrency issues. Being thorny issues,
any overly simplistic solution is not going to work. When do two different
print statements need to be in the same "transaction", and when don't they?
Most programmers who should know better can't even figure this out, so why
would you expect Perl to be able to?
sorry, i was in a rush, so had to use that overly simplistic example.
in my real code ( i may post later when i have time ) i am running
flock per RECORD.
i open a lexical anonymous filehandle, flock it, write to it, and close
it, each and every record.
I see now that my comment did not express what I wanted to say. It
should rather have been:
"Unless you close the filehandle, in which case you don't need to
release the lock explicitly."
;-)
Thanks for the correction.
i'm not sure i really follow, xho...are you advocating file locking as
default or not?
hey Xho, i tried to trim down the code as much as possible while
maintaining the exact same functionality of the code. in this
simplified version, there are 2 scripts: betaProcess.pl and
betaParse.pl
betaProcess.pl
#-------------
use strict; use warnings;
use Net::FTP;
use File::Copy;
use Cwd;
use Parallel::ForkManager;
my ($area, $processDate) = @ARGV;
my %fetch;
my $files_are_missing;
my $count = 1;
do {
    &get_fetch(\%fetch, $area, $processDate);
    $files_are_missing = 0;
    if ( ($count % 30) == 0 ) {
        doMail("$0 - slept $count times...each for a minute. maybe some files are missing.", "", $toList);
    }
    print LOGS localtime() . " - about to start the cycle $count.\n";
    my $redo = &startCycle( $area, 16, \%fetch );
    print LOGS localtime() . " - about to end the cycle $count.\n";
    %fetch = ();
    if ( $redo ) {
        print LOGS localtime() . " - will sleep for a minute because at least one file was missing.\n";
        $fetch{redo}++;
        sleep 60;
        $files_are_missing = 1;
    }
    $count++;
} while ($files_are_missing);
print LOGS "Ending $0 at -" . UnixDate('now','%m/%d/%Y %H:%M:%S') . "-\n";
close LOGS;
my $time_end = localtime();
print "START: $time_start\n";
print "  END: $time_end\n";
doMail("$time_start: $0 - Start => " . localtime() . ": $0 - Done", "", $toList);
#--- subs ---
sub get_fetch {
    my ($fetch_ref, $area, $processDate) = @_;
    my ($command, @result);
    if (! keys( %{ $fetch_ref } ) ) {
        print "in get_fetch there were no keys. so NO MISSING FILES\n";
        $command = "ssh -2 -l <user> <server> /export/home/<user>/getFileSize.pl $area $processDate";
    }
    else {
        # process missing files
        print "in get_fetch there were keys. so files were missing!\n";
        delete $fetch_ref->{redo};
        $command = "ssh -2 -l <user> <server> /export/home/<user>/getFileSize.pl $area.missing $processDate";
    }
    @result = `$command`;
    for (@result) {
        chomp;
        my ($file, $size, $time) = split('=>');
        $fetch_ref->{$file} = {
            time => $time,
            size => $size,
        };
    }
}
sub startCycle {
    my ($area, $num_groups, $data_ref) = @_;
    my %data = %{$data_ref};
    my $server = 'server';
    my ($userName, $password) = split /\|/, $mValues{$server};
    my %missing;
    my $pm = Parallel::ForkManager->new($num_groups);
    for my $file ( sort { $data{$b}->{size} <=> $data{$a}->{size} } keys %data ) {
        $pm->start and next;
        if ( $data{$file}->{size} ) {
            print "size is TRUE: $file has size -$data{$file}->{size}-\n";
            # fetch and parse
            my ($server, $dir, $file_type, $slice, $controlNo, $isCritical) = split(/\|/, $file);
            my $remoteFile = "$dir/$processDate.gz";
            my $localFile  = "$cycleType{$area}->{rawLogs}/$file_type.$processDate$slice";
            #
            # Establish FTP Connection:
            #
            my $ftp;
            unless ($ftp = Net::FTP->new($server)) {
                doMail("Problem with $0", "Can't connect to $server with ftp, $@\n");
                die;
            }
            unless ($ftp->login($userName, $password)) {
                doMail("Problem with $0", "Can't login to $server with ftp using -$userName- and -$password- $@\n");
                die;
            }
            $ftp->binary();
            if ($ftp->get($remoteFile, $localFile)) {
                # print "got $remoteFile to $localFile\n";
                my $doneFile = $localFile;
                $doneFile =~ s/^.*\///g;
                $doneFile =~ s/\.gz$//;
                $doneFile = "$cycleType{$area}->{work}/$doneFile";
                # Kick Off Parsing for this file:
                my $command = "betaParse.pl $processDate $localFile $doneFile $hash{$area}->{parseMethod}";
                system($command);
            }
            else {
                print localtime() . " - FTP MESSAGE: " . $ftp->message . ": $@\n";
                open( my $fh_missing, '>>', "$area.missing.fetch" )
                    or die "can't open $area.missing.fetch: $!\n";
                my $missingFile = $file;
                $missingFile =~ s/\|/\ /g;
                my $controlNo = 1;
                my $isCritical = 'Y';
                print $fh_missing "$missingFile $controlNo $isCritical\n";
                close $fh_missing;
            }
            $ftp->quit();
        }
        else {
            # Capture missing logs to deal with later
            print "size is FALSE: $file has size -$data{$file}->{size}-\n";
            $missing{$file} = {
                time => scalar(localtime()),
                size => $data{$file}->{size},
            };
            print "$file: $missing{$file}->{time} and -$missing{$file}->{size}-\n";
            open( my $fh_missing, '>>', "$area.missing.fetch" )
                or die "can't open $area.missing.fetch: $!\n";
            while( my ($missingFile, $attr) = each %missing ) {
                # my ($server, $path, $frontName, $backName) = split(/\|/, $missingFile);
                $missingFile =~ s/\|/\ /g;
                my $controlNo = 1;
                my $isCritical = 'Y';
                print $fh_missing "$missingFile $controlNo $isCritical\n";
            }
            close $fh_missing;
        }
        $pm->finish;
    }
    $pm->wait_all_children;
    my $redo = 0;
    if ( -e "$area.missing.fetch" ) {
        my $command = "scp -oProtocol=2 $area.missing.fetch <user>\@<server>:/export/home/<user>/data";
        my $rc = system($command);
        unlink "$area.missing.fetch";
        $redo = 1;
    }
    return $redo;
}
##
# betaParse.pl
##
use strict;
use Cwd;
use Date::Manip;
use File::Basename;
use File::Copy;
my $rawCounts = 0;
my $numOutputFiles = 10;
open(my $fh_in, "gzip -dc $inputFile|") || dieWithMail("Can't open $inputFile $!\n");
my $status = &reformatLogs($fh_in);
close $fh_in;
if ($status == 1) {
    system("touch $inputFile.DONE");
}
#--- subs ---
sub reformatLogs {
    my ($fh_in) = @_;
    while( <$fh_in> ) {
        $rawCounts++;
        chomp;
        # process $_
        # evenly distribute data to output files
        my $section = $rawCounts % $numOutputFiles;
        open( my $fh, ">>log$section" ) || die "can't open log$section: $!\n";
        flock( $fh, 2 );
        print $fh "$_\n";
        close( $fh );
    }
    return 1;
}
....is there a problem because of the mix of forked processes and
system calls? perhaps i should change the system call to a function
call (after making the necessary code changes)
previously, the open filehandles were at the top of betaParse.pl and
there was no locking. this appeared to cause the record splicing,
although the processing was about twice as fast. can you shed some
light on this phenomenon? you said before that the string length at
which writing would start going crazy was around maybe 4096. the
records in these weblogs are perhaps a maximum of 10 lines (this is a
conservative estimate) on a 1024 x 768 res screen maximized.
i have looked at it. i don't see how it's helpful at all? am i missing
something? it says Perl is not a text editor, in general there's no
direct way for Perl to seek a particular line of a file, insert to, or
delete from a file. in special cases you can write to the end of a
file. i know that. i'm using Perl v. 5.6.1...is there more in the
perldocs for higher versions?
i don't see how this is helpful. i know that open my $fh, '>>', $file
appends. how does this perldoc help me? also, would syswrite with
O_APPEND be faster? would that not be as safe though?
i don't understand why in Programming Perl 3rd ed. page 421, it states:
To get an exclusive lock, typically used for writing, you have
to be more careful. You cannot use a regular open for this; if you
use an open mode of <, it will fail on files that don't exist yet,
and if you use >, it will clobber any files that do. Instead, use
sysopen on the file so it can be locked before getting
overwritten.
what about open my $fh, '>>', $file ... ? that's what i do, and it seems
to work.
but i've also read this:
If you really want to get faster I/O in Perl, you might experiment with
the sysopen(), sysread(), sysseek(), and syswrite() functions. But
beware, they interact quirkily with normal Perl I/O functions
and:
sysopen(), sysread(), sysseek(), and syswrite()
These are low level calls corresponding to C's open(), read() and
write() functions. Due to lack of buffering, they interact strangely
with calls to Perl's buffered file I/O. But if you really want to speed
up Perl's I/O, this might (or might not) be a way to do it. This is
beyond the scope of this document.
which can be found at:
http://www.troubleshooters.com/codecorn/littperl/perlfile.htm
and instead of for every log record (50 million times across 16
processes), doing:
while( <$fh_in> ) {
    chomp;
    # process line
    open( my $fh, '>>', $file ) || die "can't open $file: $!\n";
    flock( $fh, 2 );
    print $fh "$_\n";
    close( $fh );
}
would it be better for each process to have a global filehandle at the
top, and then in the loop for each record lock then unlock it? would
this be faster? would this be SAFE?
ex:
open( my $fh_out, '>>', $file ) || die "can't open $file: $!\n";
# ...some code here
while( <$fh_in> ) {
    chomp;
    # ...process line
    flock( $fh_out, LOCK_EX );
    print $fh_out "$_\n";
    flock( $fh_out, LOCK_UN );
}
# ...some code here
close( $fh_out );
does anyone know if this second method is faster and 100% safe across
multiple forked processes?
You seem to have read the old version of the second entry in perlfaq5
(which indeed has been changed since v5.6.1). However, assuming that Xho
was thinking of the FAQ entry I suggested:
perldoc -q append.+text
what you read was not that one.
This is the current version of the entry I suggested:
http://faq.perl.org/perlfaq5.html#All_I_want_to_do_is_
Apparently so. See Gunnar's post. Sorry for the confusion.
I think you could have trimmed it down a lot more in most places, and
a little less in others. :)
For example, all the stuff in betaProcess which are in paths other than
those invoking betaParse are almost certainly irrelevant.
Also, I think you need to rear back and think of the big picture some
more. You are forking off 16 processes, and you are creating 10 log files,
and you want all 16 of those processes to write to all 10 of those log
files, so you have 160 filehandles all fighting with each other. Is that
really necessary? Do the 10 log files serve 10 different purposes
(unlikely, it seems) or they are just to keep the size of any given log
file down, or what?
More on the big picture. A couple weeks ago when you started asking here
about parsing (or was it FTPing?) lots of files, I thought
Parallel::ForkManager was the right solution. But you keep incrementally
adding new wrinkles, and I fear that by the time you are done adding them
perhaps the right solution will no longer be Parallel::ForkManager but
rather threads or event-loops or some RDBMS-interfaced program or even some
other language entirely. Parallelization is inherently challenging, and it
probably needs some deep big-picture thinking, not incremental tinkering.
> ##
> # betaParse.pl
> ##
> use strict;
> use Cwd;
> use Date::Manip;
> use File::Basename;
> use File::Copy;
>
> my $rawCounts = 0;
>
> my $numOutputFiles = 10;
>
> open(my $fh_in, "gzip -dc $inputFile|") || dieWithMail("Can't open
> $inputFile $!\n");
There is no $inputFile! Like I said, too much trimming in places.
>
> $status = &reformatLogs($fh_in);
>
> close $fh_in;
>
> if ($status == 1) {
>
> system("touch $inputFile.DONE");
> }
>
> #--- subs ---
>
> sub reformatLogs {
> my ($fh_in) = @_;
>
> while( <$fh_in> ) {
> $rawCounts++;
> chomp;
>
> # process $_
>
> # evenly distribute data to output files
> my $section = $rawCounts % $numOutputFiles;
I don't understand what you want to accomplish with this. Do your lines
need to be parceled out to output files in this inherently inefficient way?
That seems unlikely, as log files generally have one record per line and
the relative position of lines to each other is meaningless. I doubt you
have a good reason for doing this, but for the rest of the post I'll assume
you do.
In my experience, having all the children write to a common file for
monitoring, debugging, errors, etc, is fine. But for ordinary output, I've
almost never found it beneficial to have multiple children multiplex into
shared unstructured output files. It is usually far easier to produce
10,000 output files and then combine them at a later stage if that is
necessary.
> open( my $fh, ">>log$section" )
> || die "can't open log$section: $!\n";
## don't reopen the handle each time.
## rather, keep a hash of open handles
unless ( $section_handles{$section} ) {
    open( my $fh, ">>", "log$section" )
        or die "can't open log$section: $!\n";
    $section_handles{$section} = $fh;
}
my $fh = $section_handles{$section};
> flock( $fh, 2 );
flock( $fh, LOCK_EX ) or die $!;
> print $fh "$_\n";
> close( $fh );
## and now you don't close the handle, but you still need it unlocked
flock( $fh, LOCK_UN ) or die $!;
> }
> return 1;
> }
>
> ....is there a problem because of the mix of forked processes and
> system calls? perhaps i should change the system call to a function
> call (after making the necessary code changes)
While I don't think this causes this particular problem, I would make that
change anyway.
> previously, the open filehandles were at the top of betaParse.pl and
> there was no locking. this appeared to cause the record splicing,
> although the processing was about twice as fast. can you shed some
> light on this phenommenon?
I would guess that reopening a file for every line is going to be slow.
Locking the file for every line is probably also going to be kind of
slow, but probably not nearly as slow as re-opening it is. If it turns out to
be a bottleneck, you could batch up, say, 100 lines and write them in one
chunk, with only one lock around it. If you are going to do chunking, you
should probably whip up a module for it rather than putting it directly
into your code.
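A sketch of that chunking idea (the batch size, file name, and line source here are all made up; autoflush keeps perl's own buffer from leaking writes past the unlock):

```perl
use strict; use warnings;
use Fcntl qw(:flock);
use IO::Handle;

my $batch_size = 100;    # hypothetical chunk size
my @buffer;

open( my $fh_out, '>>', 'out.txt' ) or die "can't open out.txt: $!\n";
$fh_out->autoflush(1);   # so the data really hits the OS before LOCK_UN

# Write everything buffered so far under a single lock.
sub flush_buffer {
    my ($fh) = @_;
    return unless @buffer;
    flock( $fh, LOCK_EX ) or die "can't lock: $!\n";
    print $fh @buffer;
    flock( $fh, LOCK_UN ) or die "can't unlock: $!\n";
    @buffer = ();
}

# Stand-in for the real record loop.
for my $line ( map { "line $_\n" } 1 .. 250 ) {
    push @buffer, $line;
    flush_buffer($fh_out) if @buffer >= $batch_size;
}
flush_buffer($fh_out);   # whatever is left over
close $fh_out;
```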
> you said before that the string length at
> which writing would start going crazy was around maybe 4096. the
> records in these weblogs are perhaps a maximum of 10 lines (this is a
> conservative estimate) on a 1024 x 768 res screen maximized.
I don't know how long 10 lines on a 1024x768 screen are. But I do know
that Perl has a length function :)
perl -lne 'print length' file
thanks, i'll work on improving this to help you help me more :-)
>
> Also, I think you need to rear back and think of the big picture some
> more. You are forking off 16 processes, and you are creating 10 log files,
> and you want all 16 of those processes to write to all 10 of those log
> files, so you have 160 filehandles all fighting with each other. Is that
> really necessary? Do the 10 log files serve 10 different purposes
> (unlikely, it seems) or they are just to keep the size of any given log
> file down, or what?
they were initially there because we have an ETL (Extract, Transform,
Load) tool that picks them up, and it was determined that this tool
could optimally use 10 threads to gather these processed records. it
would be best if these 10 files were the same size (or near enough).
i'm working on a 16 CPU machine, so that's why i created 16 forked
processes. however, i have learned that the ETL tool is being used
differently now to use a round robin technique--several readers (fast
readers) pass the data streams along to processing threads. So it
appears that we are doing work to balance the data twice--inefficient.
Now it will be much easier for me to just have every process write to
its own file. well, not every process, since there aren't really only
16 processes but rather a max of 16 at one time. So I was thinking of
creating a counter, and incrementing it in the loop that spawns the
processes, and mod this counter by 16. the mod'd counter will be a
value that is passed to the betaParse script/function and the parsing
will use this value to choose which filehandle it writes to.
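That slot idea can be sketched with a plain fork (all names and counts here are illustrative, and this omits Parallel::ForkManager's concurrency cap). Note that with more jobs than slots, a slow child from an earlier round can still share a slot with a newer one, so this by itself does not make locking unnecessary:

```perl
use strict; use warnings;

# A counter mod N picks which output file each child appends to.
my $max_workers = 4;                        # hypothetical slot count
my @jobs  = map { "file$_" } 1 .. 8;        # hypothetical work list
my $count = 0;
my @pids;

for my $job (@jobs) {
    my $slot = $count++ % $max_workers;     # choose slot before forking
    my $pid  = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        # child: append its output to the file for its slot
        open( my $fh_out, '>>', "parsed.$slot" )
            or die "can't open parsed.$slot: $!\n";
        print $fh_out "$job\n";             # stand-in for real parsing output
        close $fh_out;
        exit 0;
    }
    push @pids, $pid;
}
waitpid($_, 0) for @pids;                   # reap all children
```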
>
> More on the big picture. A couple weeks ago when you started asking here
> about parsing (or was it FTPing?) lots of files, I thought
> Parallel::ForkManager was the right solution. But you keep incrementally
> adding new wrinkles, and I fear that by the time you are done adding them
> perhaps the right solution will no longer be Parallel::ForkManager but
> rather threads or event-loops or some RDBMS-interfaced program or even some
> other language entirely. Parallelization is inherently challenging, and it
> probably needs some deep big-picture thinking, not incremental tinkering.
>
yeah, i never thought it would be easy :-)
>
> > ##
> > # betaParse.pl
> > ##
> > use strict;
> > use Cwd;
> > use Date::Manip;
> > use File::Basename;
> > use File::Copy;
> >
> > my $rawCounts = 0;
> >
> > my $numOutputFiles = 10;
> >
> > open(my $fh_in, "gzip -dc $inputFile|") || dieWithMail("Can't open
> > $inputFile $!\n");
>
> There is no $inputFile! Like I said, too much trimming in places.
sorry!
>
> >
> > $status = &reformatLogs($fh_in);
> >
> > close $fh_in;
> >
> > if ($status == 1) {
> >
> > system("touch $inputFile.DONE");
> > }
> >
> > #--- subs ---
> >
> > sub reformatLogs {
> > my ($fh_in) = @_;
> >
> > while( <$fh_in> ) {
> > $rawCounts++;
> > chomp;
> >
> > # process $_
> >
> > # evenly distribute data to output files
> > my $section = $rawCounts % $numOutputFiles;
>
> I don't understand what you want to accomplish with this. Do your lines
> need to be parceled out to output files in this inherently inefficient way?
> That seems unlikely, as log files generally have one record per line and
> the relative position of lines to each other is meaningless. I doubt you
> have a good reason for doing this, but for the rest of the post I'll assume
> you do.
>
don't think i need this anymore, thank God!
>
> In my experience, having all the children write to a common file for
> monitoring, debugging, errors, etc, is fine. But for ordinary output, I've
> almost never found it beneficial to have multiple children multiplex into
> shared unstructured output files. It is usually far easier to produce 10,
> 000 output files and then combine them at a later stage if that is
> necessary.
>
> > open( my $fh, ">>log$section" )
> > || die "can't open log$section: $!\n";
>
> ## don't reopen the handle each time.
> ## rather, keep a hash of open handles
>
> unless ( $section_handles{$section} ) {
> open( my $fh, ">>", "log$section" )
> or die "can't open log$section: $!\n";
> $section_handles{$section} = $fh;
> }
> my $fh = $section_handles{$section};
>
this is similar to what was done before, only an array was used since
$section is simply a number 0 thru max processes - 1
> > flock( $fh, 2 );
>
> flock( $fh, LOCK_EX ) or die $!;
>
>
> > print $fh "$_\n";
> > close( $fh );
>
> ## and now you don't close the handle, but you still need it unlocked.
> flock( $fh, LOCK_UN ) or die $!;
>
yeah, i thought this may be better...
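[The lock-print-unlock pattern being discussed, written out as a
self-contained sketch; the handle, filename, and record here are
placeholders, not taken from the original script:]

```perl
use strict;
use warnings;
use Fcntl qw(:flock);   # LOCK_EX, LOCK_UN
use IO::Handle;         # for $fh->flush

# Append one record under an exclusive lock, then release the lock
# without closing the handle. All names are illustrative only.
sub locked_append {
    my ($fh, $record) = @_;
    flock($fh, LOCK_EX) or die "flock: $!";
    print {$fh} "$record\n";
    $fh->flush;    # flush before unlocking so no half-buffered line leaks
    flock($fh, LOCK_UN) or die "flock: $!";
}

open(my $fh, '>>', 'results.txt') or die "can't open results.txt: $!";
locked_append($fh, 'some entry');
close $fh;
```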
> > }
> > return 1;
> > }
> >
> > ....is there a problem because of the mix of forked processes and
> > system calls? perhaps i should change the system call to a function
> > call (after making the necessary code changes)
>
> While I don't think this causes this particular problem, I would make that
> change anyway.
>
yeah, seems cleaner. less complexity.
>
> > previously, the open filehandles were at the top of betaParse.pl and
> > there was no locking. this appeared to cause the record splicing,
> > although the processing was about twice as fast. can you shed some
> > light on this phenomenon?
>
> I would guess that reopening a file for every line is going to be slow.
>
> Locking the file for every line is probably also going to be kind of
> slow, but not nearly as slow as re-opening it is. If it turns out to
> be a bottleneck, you could batch up, say, 100 lines and write them in one
> chunk, with only one lock around it. If you are going to do chunking, you
> should probably whip up a module for it rather than putting it directly
> into your code.
interesting concept. i will investigate...
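[The chunking idea described above could be sketched like this; the
buffer size, variable names, and the flush-before-finish reminder are
editorial inventions, not from the thread:]

```perl
use strict;
use warnings;
use Fcntl qw(:flock);
use IO::Handle;

# Buffer lines in memory and write each batch under a single flock,
# so the locking cost is paid once per chunk instead of once per line.
my @buffer;
my $chunk_size = 100;    # illustrative; tune by benchmarking

sub buffered_print {
    my ($fh, $line) = @_;
    push @buffer, $line;
    flush_buffer($fh) if @buffer >= $chunk_size;
}

sub flush_buffer {
    my ($fh) = @_;
    return unless @buffer;
    flock($fh, LOCK_EX) or die "flock: $!";
    print {$fh} map { "$_\n" } @buffer;
    $fh->flush;
    flock($fh, LOCK_UN) or die "flock: $!";
    @buffer = ();
}

# Note: a child must call flush_buffer($fh) one last time before
# $pm->finish, or the tail of its buffer is silently lost.
```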
>
> > you said before that the string length at
> > which writing would start going crazy was around maybe 4096. the
> > records in these weblogs are perhaps a maximum of 10 lines (this is a
> > conservative estimate) on a 1024 x 768 res screen maximized.
>
> I don't know how long 10 lines on a 1024x768 screen are. But I do know
> that Perl has a length function :)
>
> perl -lne 'print length' file
>
i deserved that :-). thanks again Xho!
> > I think you could have trimmed it down a lot more in most places, and
> > a little less in others. :)
> thanks, i'll work on improving this to help you help me more :-)
And even better, trimmed down code is good for benchmarking and other
experimentation.
> >
> > Also, I think you need to rear back and think of the big picture some
> > more. You are forking of 16 processes, and you are creating 10 log
> > files, and you want all 16 of those processes to write to all 10 of
> > those log files, so you have 160 filehandles all fighting with each
> > other. Is that really necessary? Do the 10 log files serve 10
> > different purposes (unlikely, it seems) or they are just to keep the
> > size of any given log file down, or what?
>
> they were initially there because we have an ETL (Extract, Transform,
> Load) tool that picks them up, and it was determined that this tool
> could optimally use 10 threads to gather these processed records. it
> would be best if these 10 records were the same size (or near enough).
> i'm working on a 16 CPU machine, so that's why i created 16 forked
> processes.
Since your processes have a mixed work load, needing to do both Net::FTP
(probably I/O bound) and per line processing (probably CPU bound), it might
make sense to use more than 16 of them.
...
> So I was thinking of
> creating a counter, and incrementing it in the loop that spawns the
> processes, and mod this counter by 16. the mod'd counter will be a
> value that is passed to the betaParse script/function and the parsing
> will use this value to choose which filehandle it writes to.
Don't do that. Let's say you start children 0 through 15, writing to
files 0 through 15. Child 8 finishes first. So ForkManager starts child
16, which tries to write to file 16%16 i.e. 0. But of course file 0 is
still being used by child 0. If you wish to avoid doing a flock for every
row, you need to mandate that no two children can be using the same file at
the same time.
I see two good ways to accomplish that. The first is simply to have each
child, as one of the first things it does, loop through a list of
filenames. For each, it opens it and attempts a nonblocking flock. Once it
finds a file it can successfully flock, it keeps that lock for the rest of
its life, and uses that filehandle for output.
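[That first approach might look like the following sketch; the candidate
filenames and the helper's name are invented for illustration:]

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Each child tries a nonblocking exclusive lock on each candidate
# file in turn, and keeps the first lock it wins for its lifetime.
sub claim_output_file {
    my @candidates = @_;
    for my $name (@candidates) {
        open(my $fh, '>>', $name) or die "can't open $name: $!";
        if ( flock($fh, LOCK_EX | LOCK_NB) ) {
            return ($name, $fh);    # hold this lock until exit
        }
        close $fh;                  # another child owns it; try the next
    }
    die "no unlocked output file among: @candidates";
}

my ($name, $fh) = claim_output_file( map { "log$_" } 0 .. 9 );
print {$fh} "claimed by $$\n";
close $fh;    # closing the handle also releases the lock
```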
The other way is to have ForkManager (in the parent) manage the files that
the children will write to. This has the advantage that, as long as there
is only one parent process running at once, you don't actually need to do
any flocking in the children, as the parent ensures they don't interfere
(but I do the locking anyway if I'm on a system that supports it. Better
safe than sorry):
use Fcntl qw(:flock);    # LOCK_EX, LOCK_NB
my $pm = Parallel::ForkManager->new(16);
# tokens for the output files.
my @outputID = ("file01" .. "file20");    # needs to be >= 16, of course
# put the token back into the queue once the child is done.
$pm->run_on_finish( sub { push @outputID, $_[2] } );
#...
foreach my $whatever (@whatever) {
    # get the next available token for output
    my $oid = shift @outputID or die "ran out of output tokens";
    $pm->start($oid) and next;
    open my $fh, ">>", "/tmp/$oid" or die $!;
    flock $fh, LOCK_EX|LOCK_NB or die "Hey, someone is using my file!";
    # hold the lock for life
    #...
    while (<$in>) {
        #...
        print $fh $stuff_to_print;
    }
    close $fh or die $!;
    $pm->finish();
}
> >
> > More on the big picture. A couple weeks ago when you started asking
> > here about parsing (or was it FTPing?) lots of files, I thought
> > Parallel::ForkManager was the right solution. But you keep
> > incrementally adding new wrinkles, and I fear that by the time you are
> > done adding them perhaps the right solution will no longer be
> > Parallel::ForkManager but rather threads or event-loops or some
> > RDMS-interfaced program or even some other language entirely.
> > Parallelization is inherently challenging, and it probably needs some
> > deep big-picture thinking, not incremental tinkering.
> >
>
> yeah, i never thought it would be easy :-)
Theme song from grad school days: "No one said it would be easy, but no one
said it would be this hard."
i'm not sure i see how you derive your conclusion here? i don't know
too much about resource utilization/consumption. i just heard that it
was optimal to use as many processes as there are CPUs. i can see why
more processes would be better for more concurrent FTP gets, but
wouldn't the subsequent processing of > 16 files make the machine
sluggish since there are only 16 CPUs? i would say overall the entire
'thing' (hesitant to use the word process) is more CPU bound since the
file transfer comprises a significantly smaller percentage of the time
relative to the actual processing.
is it actually better to use more if i have a mixed work load? why
exactly? how does the machine allocate its resources?
Doing an FTP is generally limited by IO (or, at least it is reasonable to
assume so until specific data on it is gathered). It would be nice if your
CPUs had something to do while waiting for this IO, other than just sit and
wait for IO.
> i don't know
> too much about resource utilization/consumption. i just heard that it
> was optimal to use as many processes as there are CPUs.
This is true if your processes are CPU bound.
> i can see why
> more processes would be better for more concurrent FTP gets, but
> wouldn't the subsequent processing of > 16 files
But you are not running synchronously--each child proceeds to the per-line
processing as soon as its own data is loaded, regardless of what the other
children are doing. If you run 20 children and the FTP part takes 20% of
the time, then at any given moment you would expect about 4 processes to
be FTPing, using little CPU, and 16 processes to be grinding on the CPUs.
(If you use 16 children and FTPing is 20% of the time, then you would
expect 3.2 to be FTPing and 12.8 to be crunching at any one time, leaving
almost 3.2 CPUs idle)
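[The back-of-envelope arithmetic above can be written out directly; a
throwaway sketch, where the 20-children/20% figures are just the example
numbers from the post:]

```perl
use strict;
use warnings;

# Expected steady-state split of N children between FTPing (an
# io_fraction of their time) and CPU-bound crunching.
sub expected_split {
    my ($children, $io_fraction) = @_;
    my $ftping    = $children * $io_fraction;
    my $crunching = $children - $ftping;
    return ($ftping, $crunching);
}

my ($ftp20, $cpu20) = expected_split(20, 0.20);   # ~4 FTPing, ~16 crunching
my ($ftp16, $cpu16) = expected_split(16, 0.20);   # ~3.2 FTPing, ~12.8 crunching
```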
Of course, one thing you have to watch out for is the distribution of the
processing. Just because programs *can* be arranged in a way that uses
both IO and CPU most efficiently doesn't mean they automatically will
arrange themselves that way. But with each task being of a more-or-less
random length, as you have, they shouldn't do too badly arranged just by
chance (other than right after the parent starts, when they will all be
doing FTP at the same time.)
> make the machine
> sluggish since there are only 16 CPUs?
"sluggish" is a term usually used for interactive programs, while yours
seems to be batch, and so sluggishness is probably of minimal concern.
> i would say overall the entire
> 'thing' (hesitant to use the word process) is more CPU bound since the
> file transfer comprises a significantly smaller percentage of the time
> relative to the actual processing.
Yeah, in that case there is no compelling reason to go above 16. Which is
nice, as it is one less thing to need to consider.
> is it actually better to use more if i have a mixed work load? why
> exactly? how does the machine allocate its resources?
The FTP asks the remote computer to send it bunches of data. While it is
waiting for those bunches of data to appear, the CPU is able to do other
things, provided there are other things to do. So, it is often worthwhile
to ensure that there *are* other things to do. But obviously if the FTP
time is a small percentage of overall time, then this makes little
difference in your case.
Also, it is possible, if your remote machines, ethernet, etc. are fast
enough compared to your CPUs, that the FTP itself is CPU rather than IO
bound, or nearly so. It's also possible that the local hard-drive that the
FTPed files are being written to is the limiting factor, or that the
remote machine(s) you are fetching from are the limiting factors.
But it occurs to me that I might be advocating premature optimization here.
Get it working first and see if it is "fast enough". If you can process an
hour's worth of logs in 5 minutes, there is probably no point in making it
faster. If it takes 65 minutes to process an hour's worth of logs, that
is probably the time to worry about 16 versus 20 versus 24 parallel
children.
Axel
actually, i had several processes appending to the same log file and
came across records that had spliced entries. this was the very problem
of which i spoke, and why i ended up needing to explicitly flock.
Which operating system were you using? I never ran across this problem
using Solaris.
Axel
SunOS 5.8