use strict; use warnings;
use Parallel::ForkManager;
my $pm = Parallel::ForkManager->new(10);
# assume @files contains 100 files that will be processed,
# and processing time could range from subseconds to hours
my $out = 'results.txt';
for my $file (@files) {
    $pm->start and next;

    # some code to process file
    # blah blah blah

    open( my $fh_out, '>', $out ) or die "can't open $out: $!\n";
    print $fh_out "$file\n";
    close $fh_out;

    $pm->finish;
}
$pm->wait_all_children;
Apologies... ^^^ should be '>>'
As long as you don't care about the order in which the output from the
respective files is appended to $out, I can't see what the problem would be.
--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
order is unimportant at this juncture. i was concerned if one process
would interrupt another that was writing, so that either both failed,
or a single entry became a garbled hybrid of two entries...something
along those lines. these, or other cases that may interfere with
writing one entry per line to $out, are what cause me apprehension.
Probably somebody more knowledgeable than me should comment on your concerns.
Awaiting that, have you read
perldoc -q append.+text
I can add that I'm using the above method in a widely used Perl program
(run on various platforms), without any problems having been reported.
However, I do use flock() to set an exclusive lock before printing to
the file. Can't tell if locking is necessary.
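The locked-append idiom being described looks roughly like this (a sketch only; the file name and record are placeholders, and the constants come from Fcntl):

```perl
use strict; use warnings;
use Fcntl qw(:flock);

# Hypothetical file and record names, for illustration only.
my $out  = 'results.txt';
my $line = 'some record';

open( my $fh_out, '>>', $out ) or die "can't open $out: $!\n";
flock( $fh_out, LOCK_EX )      or die "can't lock $out: $!\n";
print $fh_out "$line\n";
close $fh_out;    # closing the handle releases the lock
```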
On Linux this is safe, provided the string $file is not more than a few
hundred bytes. On Windows, it is not safe (although probably safe
enough, provided this is just for progress monitoring)
Xho
--
-------------------- http://NewsReader.Com/ --------------------
Usenet Newsgroup Service $9.95/Month 30GB
I meant to say "On Linux this is fairly safe." I have never run into a
problem with it, but that is no guarantee of absolute safety.
It would be better to have the file opened for writing in
the parent process, and let all the forked processes inherit
that file handle.
-Joe
i thought about that, but i tried something similar with an FTP
connection, and discovered it was not fork safe, so i was hesitant to
share anything among the forked processes.
i'm running into some issues, which may be related to this. unsure
right now. i'm also using a similar process to actually print out
apache web log records (unlike the above example, which is just a short
file name), some of which may be large due to cookies and query
strings, and additional authentication information we put in there. i
know in at least one case, the user agent string appears once fully,
then later in the processed record--BUT the second time only a jagged
latter half of it gets inserted somewhere before the end of the record.
from your response here, it seems like you may be aware of some problem
if the string is large?
should i use flock()? i don't really know much about it...
this never occurred when i wrote to the same file using many system()
calls. again, i am unsure if this is actually a forking issue. i am
currently investigating. just wondered if this raised any alarms with
anyone who may know what is going on.
> Gunnar Hjalmarsson wrote:
>> I can add that I'm using the above method in a widely used Perl
>> program (run on various platforms), without any problems having been
>> reported. However, I do use flock() to set an exclusive lock before
>> printing to the file. Can't tell if locking is necessary.
...
> hello Gunnar, could you please post a snippet of code illustrating how
> you would use flock() in my above example?
So far, I do not see any code posted by you.
Maybe, you should read the documentation, and come up with a short
script illustrating your problem, and we can comment on it:
perldoc -q lock
perldoc -f flock
Sinan
--
A. Sinan Unur <1u...@llenroc.ude.invalid>
(reverse each component and remove .invalid for email address)
comp.lang.perl.misc guidelines on the WWW:
http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
apologies, i was referring to the following example:
>
> A. Sinan Unur wrote:
>> simon...@gmail.com wrote in
>> news:1129994532.1...@g47g2000cwa.googlegroups.com:
>>
>> > Gunnar Hjalmarsson wrote:
>>
>> >> I can add that I'm using the above method in a widely used Perl
>> >> program (run on various platforms), without any problems having
>> >> been reported. However, I do use flock() to set an exclusive lock
>> >> before printing to the file. Can't tell if locking is necessary.
>>
>> ...
>>
>> > hello Gunnar, could you please post a snippet of code illustrating
>> > how you would use flock() in my above example?
>>
>> So far, I do not see any code posted by you.
>>
>> Maybe, you should read the documentation, and come up with a short
>> script illustrating your problem, and we can comment on it:
>>
>> perldoc -q lock
>>
>> perldoc -f flock
>>
>
> apologies, i was referring to the following example:
I see now ... You like to keep your readers guessing by not sticking
with a single posting address. Not very polite. Just configure whichever
client you are using with the same name and email address, please, so I
can figure out with whom I am corresponding.
> use strict; use warnings;
> use Parallel::ForkManager;
>
>
> my $pm = Parallel::ForkManager->new(10);
>
>
> # assume @files contains 100 files that will be processed,
> # and processing time could range from subseconds to hours
>
>
> my $out = 'results.txt';
> for my $file (@files) {
> $pm->start and next;
>
>
> # some code to process file
> # blah blah blah
>
This is a good place to put an exclusive flock on a sentinel file (not
the output file). Your code will block until it gets the exclusive lock.
> open( my $fh_out, '>>', $out ) or die "can't open $out: $!\n";
> print $fh_out "$file\n";
> close $fh_out;
And, this would be the place to release that lock.
So long as the sentinel file is not on a network mounted volume, this
will give you the assurance that at most one process will write to the
file at any given time.
> $pm->finish;
> }
> $pm->wait_all_children;
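A minimal sketch of that sentinel-file scheme (the lock-file name 'results.lock' is made up for illustration; each child would take the lock before appending):

```perl
use strict; use warnings;
use Fcntl qw(:flock);

my $out = 'results.txt';

# Take an exclusive lock on a separate sentinel file; flock blocks
# here until the lock is granted.
open( my $lock_fh, '>', 'results.lock' ) or die "can't open lock file: $!\n";
flock( $lock_fh, LOCK_EX ) or die "can't lock: $!\n";

# Now at most one process at a time runs this section.
open( my $fh_out, '>>', $out ) or die "can't open $out: $!\n";
print $fh_out "record\n";
close $fh_out;

# Release the lock once the write is done.
flock( $lock_fh, LOCK_UN ) or die "can't unlock: $!\n";
close $lock_fh;
```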
sorry about that. i am unable to access non-work email at work. i will
try to set up my home account to use the same name and address.
>
> > use strict; use warnings;
> > use Parallel::ForkManager;
> >
> >
> > my $pm = Parallel::ForkManager->new(10);
> >
> >
> > # assume @files contains 100 files that will be processed,
> > # and processing time could range from subseconds to hours
> >
> >
> > my $out = 'results.txt';
> > for my $file (@files) {
> > $pm->start and next;
> >
> >
> > # some code to process file
> > # blah blah blah
> >
>
> This is a good place to put an exclusive flock on a sentinel file (not
> the output file). Your code will block until it gets the exclusive lock.
i don't understand. in the Cookbook 2nd ed., pg 279 Locking a File they
write:
open(FH, "+<", $path) or die "can't open $path: $!";
flock(FH, LOCK_EX);
# update file, then...
close(FH);
i'm not importing the Fcntl module, so i must use the numeric values. 2
corresponds to LOCK_EX.
i don't understand what you mean by placing a LOCK_EX on a sentinel
file. why wouldn't i lock my output file? i thought i would:
open the fh
lock it
write to the file
close the fh (which unlocks it).
is this not correct? the file may get up to 3 GB, so i don't know if
locking a REGION of the file may be more apt, but i am also unsure of
whether or not that only applies to file updates, because how can you
lock a region of a new file, when the newest part doesn't exist yet.
Why not the output file?
>> open( my $fh_out, '>>', $out ) or die "can't open $out: $!\n";
Why not just:
flock $fh_out, 2 or die $!;
>> print $fh_out "$file\n";
>> close $fh_out;
>
> And, this would be the place to release that lock.
Unless you use a lexical ref to the filehandle as above, in which case
you don't need to release it explicitly.
My related comment was posted at
http://groups.google.com/group/comp.lang.perl.misc/msg/2ca8d6e6894030b3
You should do
use Fcntl qw(:flock);
because there is no guarantee that LOCK_EX will always be 2. Using the
LOCK_EX will make sure that the correct value for the system is going to
be used.
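For example, a quick check of what the :flock tag exports on a given system:

```perl
use strict; use warnings;
use Fcntl qw(:flock);

# :flock exports the four flock() constants with whatever values this
# platform uses; hard-coding 2 for LOCK_EX merely happens to work on
# common systems, it is not guaranteed.
printf "LOCK_SH=%d LOCK_EX=%d LOCK_UN=%d LOCK_NB=%d\n",
    LOCK_SH, LOCK_EX, LOCK_UN, LOCK_NB;
```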
> i don't understand what you mean by placing a LOCK_EX on a sentinel
> file. why wouldn't i lock my output file? i thought i would :
>
> open the fh
> lock it
> write to the file
> close the fh (which unlocks it).
>
> is this not correct?
It is not wrong, although I can see how my post can be interpreted as
stating otherwise.
I have gotten into the habit of locking a sentinel file as a semaphore
because I frequently find myself needing to serialize access to
multiple resources, in addition to occasionally serializing access to
files on a network mounted drive.
> My related comment was posted at
> http://groups.google.com/group/comp.lang.perl.misc/msg/2ca8d6e6894030b3
Gunnar, I had seen your response, but I thought Simon Chao and 'balls'
were different people.
<snip>
>> My related comment was posted at
>> http://groups.google.com/group/comp.lang.perl.misc/msg/2ca8d6e6894030b3
>
> Gunnar, I had seen your response, but I thought Simon Chao and 'balls'
> were different people.
Yeah, I know, and you also answered my question above in your reply to
"it_says_BALLS_on_your forehead".
Thanks.
well, it's been a long day at the office. thanks for all your help
guys! i apologize for my lack of clarity in the beginning--i wasn't
attempting to be disingenuous, i just was stressed and felt i didn't
have the time to explain everything in minute detail (even though the
slightest thing could radically change the advice i get). anyway, i
think things are going well with the flock. slowed things down a bit,
but what can you do? :-). thanks again!
I have to ask: Did flock() actually make a difference (besides the
execution time)?
previously there would be web log records that had bits of another web
log record spliced in the middle of them.
after loading into a table, there were many (much more than normal)
that were thrown into an error table. after the flocking, we seem to
have reduced the number of error records to normal.
it's difficult to look at the processed flat file that was produced to
check for these splices--they appear sporadically, and may be
infrequent relative to the size of the data set (50 million), but once
it was loaded into a table, we could query against it and see if things
turned out ok.
we re-ran, and examined the error table real-time up until 25 million
records were loaded, and the number of errors seemed normal. it's
possible that the last 25 million records could have many many errors,
but unlikely.
also, when i used the non-flocking method, we got high error counts for
particular servers (some high volume ones), and since i'm parsing these
files based on reverse sorted size, i know that the these will be in
the 1st half of the total number of records. after the flocking, the
biggest files were still processed first, so when we looked at the 1st
25 million records, most of the big servers were done, so we know that
they, at least, had many fewer errors. so we think this solved the
problem...
still, i have the code written in such a way that i'm opening, locking,
writing to, and closing a file handle for every record. i wish there
were a way to lock and release without having to open/close every
time. i don't know if that would save much time or not. maybe the extra
time is not from the extra opening/closing, but from the locking and
having to wait to write. i will do more research and share if i find a
solution.
Thanks for sharing. Then the lock is probably not redundant, as I
previously was suspecting, even if I'm doing it as well.
yeah, i don't think it's an issue when you're appending small strings
to a file, but they should make file locking default, don't you think?!
:-)
when *wouldn't* you want to lock a file when writing?
The flocking slowed things down a noticeable amount? Since you seem to
be running only one flock per process (i.e. every flock is accompanied
by at least one fork, whatever processing you are doing, and one open),
that is very strange. Are you holding the lock longer than necessary?
> >
> > And, this would be the place to release that lock.
>
> Unless you use a lexical ref to the filehandle as above, in which case
> you don't need to release it explicitly.
If your lock is contentious, then you should release it explicitly (or
close the filehandle explicitly) rather than just letting it go out of
scope. Even if the natural place to release it happens to be the end of
the scope, you never know when the code will be changed to add more things
to the scope. Plus, garbage collection can sometimes be time consuming, and
there is no reason to make other processes wait upon it.
From the code you posted before, you are forking a process for every
record. Given that, it is hard to believe that opening and closing a file
is significant.
> i wish there
> were a way to lock and release without having to open/close every
> time.
There is. Have you read the perldoc -q append entry someone pointed out
earlier?
> i don't know if that would save much time or not. maybe the extra
> time is not from the extra opening/closing, but from the locking and
> having to wait to write. i will do more research and share if i find a
> solution.
I'm still incredulous. If any of this has a meaningful impact on
performance, you are doing something wrong at a more fundamental level.
<snip appending without locking>
> > On Linux this is safe, provided the string $file is not more than a few
> > hundred bytes. On Windows, it is not safe (although probably safe
> > enough, provided this is just for progress monitoring)
> >
>
> i'm running into some issues, which may be related to this. unsure
> right now. i'm also using a similar process to actually print out
> apache web log records (unlike th above example, which is just a short
> file name), some of which, may be large due to cookies and query
> strings, and additional authentication information we put in there.
On my single CPU Linux system, ~4096 bytes seems to be the magic string
length where you go from reliably not screwing up to occasionally
screwing up. It is possible that multi CPU systems might screw up on
less than 4096, but I doubt it.
Consider that, when you are not using autoflush, you never really know
when it is that you are actually writing.
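One way to take perl's own buffering out of the picture is to turn autoflush on for the handle (a sketch; the log file name is illustrative):

```perl
use strict; use warnings;

# With autoflush off, a print may sit in perl's internal buffer and hit
# the file long after the statement runs -- a problem when the write is
# supposed to happen while a lock is held. The classic select() trick
# enables autoflush for one specific handle:
open( my $fh, '>>', 'log.txt' ) or die "can't open log.txt: $!\n";
{
    my $old = select($fh);    # make $fh the currently selected handle
    $| = 1;                   # set autoflush on it
    select($old);             # restore the previous default handle
}
print $fh "flushed immediately\n";
close $fh;
```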
Locking is used to resolve thorny concurrency issues. Being thorny issues,
any overly simplistic solution is not going to work. When do two different
print statements need to be in the same "transaction", and when don't they?
Most programmers who should know better can't even figure this out, so why
would you expect Perl to be able to?
sorry, i was in a rush, so had to use that overly simplistic example.
in my real code ( i may post later when i have time ) i am running
flock per RECORD.
i open a lexical anonymous filehandle, flock it, write to it, and close
it, each and every record.
I see now that my comment did not express what I wanted to say. It
should rather have been:
"Unless you close the filehandle, in which case you don't need to
release the lock explicitly."
;-)
Thanks for the correction.
i'm not sure i really follow, xho...are you advocating file locking as
default or not?
hey Xho, i tried to trim down the code as much as possible while
maintaining the exact same functionality of the code. in this
simplified version, there are 2 scripts: betaProcess.pl and
betaParse.pl
betaProcess.pl
#-------------
use strict; use warnings;
use Net::FTP;
use File::Copy;
use Cwd;
use Parallel::ForkManager;
my ($area, $processDate) = @ARGV;
my %fetch;
my $files_are_missing;
my $count = 1;
do {
    &get_fetch(\%fetch, $area, $processDate);
    $files_are_missing = 0;
    if ( ($count % 30) == 0 ) {
        doMail("$0 - slept $count times...each for a minute. maybe some files are missing.", "", $toList);
    }
    print LOGS localtime() . " - about to start the cycle $count.\n";
    my $redo = &startCycle( $area, 16, \%fetch );
    print LOGS localtime() . " - about to end the cycle $count.\n";
    %fetch = ();
    if ( $redo ) {
        print LOGS localtime() . " - will sleep for a minute because at least one file was missing.\n";
        $fetch{redo}++;
        sleep 60;
        $files_are_missing = 1;
    }
    $count++;
} while ($files_are_missing);
print LOGS "Ending $0 at -" . UnixDate('now','%m/%d/%Y %H:%M:%S') . "-\n";
close LOGS;
my $time_end = localtime();
print "START: $time_start\n";
print "  END: $time_end\n";
doMail("$time_start: $0 - Start => " . localtime() . ": $0 - Done", "", $toList);
#--- subs ---
sub get_fetch {
    my ($fetch_ref, $area, $processDate) = @_;
    my ($command, @result);
    if (! keys( %{ $fetch_ref } ) ) {
        print "in get_fetch there were no keys. so NO MISSING FILES\n";
        $command = "ssh -2 -l <user> <server> /export/home/<user>/getFileSize.pl $area $processDate";
    }
    else {
        # process missing files
        print "in get_fetch there were keys. so files were missing!\n";
        delete $fetch_ref->{redo};
        $command = "ssh -2 -l <user> <server> /export/home/<user>/getFileSize.pl $area.missing $processDate";
    }
    @result = `$command`;
    for (@result) {
        chomp;
        my ($file, $size, $time) = split('=>');
        $fetch_ref->{$file} = {
            time => $time,
            size => $size,
        };
    }
}
sub startCycle {
    my ($area, $num_groups, $data_ref) = @_;
    my %data = %{$data_ref};
    my $server = 'server';
    my ($userName, $password) = split /\|/, $mValues{$server};
    my %missing;
    my $pm = Parallel::ForkManager->new($num_groups);
    for my $file ( sort { $data{$b}->{size} <=> $data{$a}->{size} } keys %data ) {
        $pm->start and next;
        if ( $data{$file}->{size} ) {
            print "size is TRUE: $file has size -$data{$file}->{size}-\n";
            # fetch and parse
            my ($server, $dir, $file_type, $slice, $controlNo, $isCritical) = split(/\|/, $file);
            my $remoteFile = "$dir/$processDate.gz";
            my $localFile  = "$cycleType{$area}->{rawLogs}/$file_type.$processDate$slice";
            #
            # Establish FTP Connection:
            #
            my $ftp;
            unless ($ftp = Net::FTP->new($server)) {
                doMail("Problem with $0", "Can't connect to $server with ftp, $@\n");
                die;
            }
            unless ($ftp->login($userName, $password)) {
                doMail("Problem with $0", "Can't login to $server with ftp using -$userName- and -$password- $@\n");
                die;
            }
            $ftp->binary();
            if ($ftp->get($remoteFile, $localFile)) {
                # print "got $remoteFile to $localFile\n";
                my $doneFile = $localFile;
                $doneFile =~ s/^.*\///g;
                $doneFile =~ s/\.gz$//;
                $doneFile = "$cycleType{$area}->{work}/$doneFile";
                # Kick Off Parsing for this file:
                my $command = "betaParse.pl $processDate $localFile $doneFile $hash{$area}->{parseMethod}";
                system($command);
            }
            else {
                print localtime() . " - FTP MESSAGE: " . $ftp->message . ": $@\n";
                open( my $fh_missing, '>>', "$area.missing.fetch" )
                    or die "can't open $area.missing.fetch: $!\n";
                my $missingFile = $file;
                $missingFile =~ s/\|/\ /g;
                my $controlNo = 1;
                my $isCritical = 'Y';
                print $fh_missing "$missingFile $controlNo $isCritical\n";
                close $fh_missing;
            }
            $ftp->quit();
        }
        else {
            # Capture missing logs to deal with later
            print "size is FALSE: $file has size -$data{$file}->{size}-\n";
            $missing{$file} = {
                time => scalar(localtime()),
                size => $data{$file}->{size},
            };
            print "$file: $missing{$file}->{time} and -$missing{$file}->{size}-\n";
            open( my $fh_missing, '>>', "$area.missing.fetch" )
                or die "can't open $area.missing.fetch: $!\n";
            while( my ($missingFile, $attr) = each %missing ) {
                # my ($server, $path, $frontName, $backName) = split(/\|/, $missingFile);
                $missingFile =~ s/\|/\ /g;
                my $controlNo = 1;
                my $isCritical = 'Y';
                print $fh_missing "$missingFile $controlNo $isCritical\n";
            }
            close $fh_missing;
        }
        $pm->finish;
    }
    $pm->wait_all_children;
    my $redo = 0;
    if ( -e "$area.missing.fetch" ) {
        my $command = "scp -oProtocol=2 $area.missing.fetch <user>\@<server>:/export/home/<user>/data";
        my $rc = system($command);
        unlink "$area.missing.fetch";
        $redo = 1;
    }
    return $redo;
}
##
# betaParse.pl
##
use strict;
use Cwd;
use Date::Manip;
use File::Basename;
use File::Copy;
my $rawCounts = 0;
my $numOutputFiles = 10;
open(my $fh_in, "gzip -dc $inputFile|") || dieWithMail("Can't open $inputFile $!\n");
my $status = &reformatLogs($fh_in);
close $fh_in;
if ($status == 1) {
    system("touch $inputFile.DONE");
}
#--- subs ---
sub reformatLogs {
    my ($fh_in) = @_;
    while( <$fh_in> ) {
        $rawCounts++;
        chomp;
        # process $_
        # evenly distribute data to output files
        my $section = $rawCounts % $numOutputFiles;
        open( my $fh, ">>log$section" ) || die "can't open log$section: $!\n";
        flock( $fh, 2 );
        print $fh "$_\n";
        close( $fh );
    }
    return 1;
}
....is there a problem because of the mix of forked processes and
system calls? perhaps i should change the system call to a function
call (after making the necessary code changes)
previously, the open filehandles were at the top of betaParse.pl and
there was no locking. this appeared to cause the record splicing,
although the processing was about twice as fast. can you shed some
light on this phenomenon? you said before that the string length at
which writing would start going crazy was around maybe 4096. the
records in these weblogs are perhaps a maximum of 10 lines (this is a
conservative estimate) on a 1024 x 768 res screen maximized.
i have looked at it. i don't see how it's helpful at all? am i missing
something? it says Perl is not a text editor, in general there's no
direct way for Perl to seek a particular line of a file, insert to, or
delete from a file. in special cases you can write to the end of a
file. i know that. i'm using Perl v. 5.6.1...is there more in the
perldocs for higher versions?
i don't see how this is helpful. i know that open my $fh, '>>', $file
appends. how does this perldoc help me? also, would syswrite with
O_APPEND be faster? would that not be as safe though?
i don't understand why in Programming Perl 3rd ed. page 421, it states:
To get an exclusive lock, typically used for writing, you have
to be more careful. You cannot use a regular open for this; if you
use an open mode of <, it will fail on files that don't exist yet,
and if you use >, it will clobber any files that do. Instead, use
sysopen on the file so it can be locked before getting
overwritten.
what about open my $fh, '>>', $file ... ? that's what i do, and it seems
to work.
but i've also read this:
If you really want to get faster I/O in Perl, you might experiment with
the sysopen(), sysread(), sysseek(), and syswrite() functions. But
beware, they interact quirkily with normal Perl I/O functions
and:
sysopen(), sysread(), sysseek(), and syswrite()
These are low level calls corresponding to C's open(), read() and
write() functions. Due to lack of buffering, they interact strangely
with calls to Perl's buffered file I/O. But if you really want to speed
up Perl's I/O, this might (or might not) be a way to do it. This is
beyond the scope of this document.
which can be found at:
http://www.troubleshooters.com/codecorn/littperl/perlfile.htm
and instead of for every log record (50 million times across 16
processes), doing:
while( <$fh_in> ) {
    chomp;
    # process line
    open( my $fh, '>>', $file ) || die "can't open $file: $!\n";
    flock( $fh, 2 );
    print $fh "$_\n";
    close( $fh );
}
would it be better for each process to have a global filehandle at the
top, and then in the loop for each record lock then unlock it? would
this be faster? would this be SAFE?
ex:
open( my $fh_out, '>>', $file ) || die "can't open $file: $!\n";
# ...some code here
while( <$fh_in> ) {
    chomp;
    # ...process line
    flock( $fh_out, LOCK_EX );
    print $fh_out "$_\n";
    flock( $fh_out, LOCK_UN );
}
# ...some code here
close( $fh_out );
does anyone know if this second method is faster and 100% safe across
multiple forked processes?
You seem to have read the old version of the second entry in perlfaq5
(which indeed has been changed since v5.6.1). However, assuming that Xho
was thinking of the FAQ entry I suggested:
perldoc -q append.+text
what you read was not that one.
This is the current version of the entry I suggested:
http://faq.perl.org/perlfaq5.html#All_I_want_to_do_is_
Apparently so. See Gunnar's post. Sorry for the confusion.
I think you could have trimmed it down a lot more in most places, and
a little less in others. :)
For example, all the stuff in betaProcess which are in paths other than
those invoking betaParse are almost certainly irrelevant.
Also, I think you need to rear back and think of the big picture some
more. You are forking off 16 processes, and you are creating 10 log files,
and you want all 16 of those processes to write to all 10 of those log
files, so you have 160 filehandles all fighting with each other. Is that
really necessary? Do the 10 log files serve 10 different purposes
(unlikely, it seems) or they are just to keep the size of any given log
file down, or what?
More on the big picture. A couple weeks ago when you started asking here
about parsing (or was it FTPing?) lots of files, I thought
Parallel::ForkManager was the right solution. But you keep incrementally
adding new wrinkles, and I fear that by the time you are done adding them
perhaps the right solution will no longer be Parallel::ForkManager but
rather threads or event-loops or some RDBMS-interfaced program or even some
other language entirely. Parallelization is inherently challenging, and it
probably needs some deep big-picture thinking, not incremental tinkering.
> ##
> # betaParse.pl
> ##
> use strict;
> use Cwd;
> use Date::Manip;
> use File::Basename;
> use File::Copy;
>
> my $rawCounts = 0;
>
> my $numOutputFiles = 10;
>
> open(my $fh_in, "gzip -dc $inputFile|") || dieWithMail("Can't open
> $inputFile $!\n");
There is no $inputFile! Like I said, too much trimming in places.
>
> $status = &reformatLogs($fh_in);
>
> close $fh_in;
>
> if ($status == 1) {
>
> system("touch $inputFile.DONE");
> }
>
> #--- subs ---
>
> sub reformatLogs {
> my ($fh_in) = @_;
>
> while( <$fh_in> ) {
> $rawCounts++;
> chomp;
>
> # process $_
>
> # evenly distribute data to output files
> my $section = $rawCounts % $numOutputFiles;
I don't understand what you want to accomplish with this. Do your lines
need to be parceled out to output files in this inherently inefficient way?
That seems unlikely, as log files generally have one record per line and
the relative position of lines to each other is meaningless. I doubt you
have a good reason for doing this, but for the rest of the post I'll assume
you do.
In my experience, having all the children write to a common file for
monitoring, debugging, errors, etc, is fine. But for ordinary output, I've
almost never found it beneficial to have multiple children multiplex into
shared unstructured output files. It is usually far easier to produce
10,000 output files and then combine them at a later stage if that is
necessary.
> open( my $fh, ">>log$section" )
> || die "can't open log$section: $!\n";
## don't reopen the handle each time.
## rather, keep a hash of open handles
unless ( $section_handles{$section} ) {
    open( my $fh, ">>", "log$section" )
        or die "can't open log$section: $!\n";
    $section_handles{$section} = $fh;
}
my $fh = $section_handles{$section};
> flock( $fh, 2 );
flock( $fh, LOCK_EX ) or die $!;
> print $fh "$_\n";
> close( $fh );
## and now you don't close the handle, but you still need it unlocked
flock( $fh, LOCK_UN ) or die $!;
> }
> return 1;
> }
>
> ....is there a problem because of the mix of forked processes and
> system calls? perhaps i should change the system call to a function
> call (after making the necessary code changes)
While I don't think this causes this particular problem, I would make that
change anyway.
> previously, the open filehandles were at the top of betaParse.pl and
> there was no locking. this appeared to cause the record splicing,
> although the processing was about twice as fast. can you shed some
> light on this phenommenon?
I would guess that reopening a file for every line is going to be slow.
Locking the file for every line is probably also going to be kind of
slow, but probably not nearly as slow as re-opening it is. If it turns out to
be a bottleneck, you could batch up, say, 100 lines and write them in one
chunk, with only one lock around it. If you are going to do chunking, you
should probably whip up a module for it rather than putting it directly
into your code.
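A sketch of that chunking idea (the batch size, file name, and line source here are all made up; autoflush keeps perl's own buffer from leaking writes past the unlock):

```perl
use strict; use warnings;
use Fcntl qw(:flock);
use IO::Handle;

my $batch_size = 100;    # hypothetical chunk size
my @buffer;

open( my $fh_out, '>>', 'out.txt' ) or die "can't open out.txt: $!\n";
$fh_out->autoflush(1);   # so the data really hits the OS before LOCK_UN

# Write everything buffered so far under a single lock.
sub flush_buffer {
    my ($fh) = @_;
    return unless @buffer;
    flock( $fh, LOCK_EX ) or die "can't lock: $!\n";
    print $fh @buffer;
    flock( $fh, LOCK_UN ) or die "can't unlock: $!\n";
    @buffer = ();
}

# Stand-in for the real record loop.
for my $line ( map { "line $_\n" } 1 .. 250 ) {
    push @buffer, $line;
    flush_buffer($fh_out) if @buffer >= $batch_size;
}
flush_buffer($fh_out);   # whatever is left over
close $fh_out;
```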
> you said before that the string length at
> which writing would start going crazy was around maybe 4096. the
> records in these weblogs are perhaps a maximum of 10 lines (this is a
> conservative estimate) on a 1024 x 768 res screen maximized.
I don't know how long 10 lines on a 1024x768 screen are. But I do know
that Perl has a length function :)
perl -lne 'print length' file
thanks, i'll work on improving this to help you help me more :-)
>
> Also, I think you need to rear back and think of the big picture some
> more. You are forking off 16 processes, and you are creating 10 log files,
> and you want all 16 of those processes to write to all 10 of those log
> files, so you have 160 filehandles all fighting with each other. Is that
> really necessary? Do the 10 log files serve 10 different purposes
> (unlikely, it seems) or they are just to keep the size of any given log
> file down, or what?
they were initially there because we have an ETL (Extract, Transform,
Load) tool that picks them up, and it was determined that this tool
could optimally use 10 threads to gather these processed records. it
would be best if these 10 files were the same size (or near enough).
i'm working on a 16 CPU machine, so that's why i created 16 forked
processes. however, i have learned that the ETL tool is being used
differently now to use a round robin technique--several readers (fast
readers) pass the data streams along to processing threads. So it
appears that we are doing work to balance the data twice--inefficient.
Now it will be much easier for me to just have every process write to
its own file. well, not every process, since there aren't really only
16 processes but rather a max of 16 at one time. So I was thinking of
creating a counter, and incrementing it in the loop that spawns the
processes, and mod this counter by 16. the mod'd counter will be a
value that is passed to the betaParse script/function and the parsing
will use this value to choose which filehandle it writes to.
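That slot idea can be sketched with a plain fork (all names and counts here are illustrative, and this omits Parallel::ForkManager's concurrency cap). Note that with more jobs than slots, a slow child from an earlier round can still share a slot with a newer one, so this by itself does not make locking unnecessary:

```perl
use strict; use warnings;

# A counter mod N picks which output file each child appends to.
my $max_workers = 4;                        # hypothetical slot count
my @jobs  = map { "file$_" } 1 .. 8;        # hypothetical work list
my $count = 0;
my @pids;

for my $job (@jobs) {
    my $slot = $count++ % $max_workers;     # choose slot before forking
    my $pid  = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        # child: append its output to the file for its slot
        open( my $fh_out, '>>', "parsed.$slot" )
            or die "can't open parsed.$slot: $!\n";
        print $fh_out "$job\n";             # stand-in for real parsing output
        close $fh_out;
        exit 0;
    }
    push @pids, $pid;
}
waitpid($_, 0) for @pids;                   # reap all children
```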
>
> More on the big picture. A couple weeks ago when you started asking here
> about parsing (or was it FTPing?) lots of files, I thought
> Parallel::ForkManager was the right solution. But you keep incrementally
> adding new wrinkles, and I fear that by the time you are done adding them
> perhaps the right solution will no longer be Parallel::ForkManager but
> rather threads or event-loops or some RDBMS-interfaced program or even some
> other language entirely. Parallelization is inherently challenging, and it
> probably needs some deep big-picture thinking, not incremental tinkering.
>
yeah, i never thought it would be easy :-)
>
> > ##
> > # betaParse.pl
> > ##
> > use strict;
> > use Cwd;
> > use Date::Manip;
> > use File::Basename;
> > use File::Copy;
> >
> > my $rawCounts = 0;
> >
> > my $numOutputFiles = 10;
> >
> > open(my $fh_in, "gzip -dc $inputFile|") || dieWithMail("Can't open
> > $inputFile $!\n");
>
> There is no $inputFile! Like I said, too much trimming in places.
sorry!
>
> >
> > $status = &reformatLogs($fh_in);
> >
> > close $fh_in;
> >
> > if ($status == 1) {
> >
> > system("touch $inputFile.DONE");
> > }
> >
> > #--- subs ---
> >
> > sub reformatLogs {
> > my ($fh_in) = @_;
> >
> > while( <$fh_in> ) {
> > $rawCounts++;
> > chomp;
> >
> > # process $_
> >
> > # evenly distribute data to output files
> > my $section = $rawCounts % $numOutputFiles;
>
> I don't understand what you want to accomplish with this. Do your lines
> need to be parceled out to output files in this inherently inefficient way?
> That seems unlikely, as log files generally have one record per line and
> the relative position of lines to each other is meaningless. I doubt you
> have a good reason for doing this, but for the rest of the post I'll assume
> you do.
>
don't think i need this anymore, thank God!
>
> In my experience, having all the children write to a common file for
> monitoring, debugging, errors, etc, is fine. But for ordinary output, I've
> almost never found it beneficial to have multiple children multiplex into
> shared unstructured output files. It is usually far easier to produce 10,
> 000 output files and then combine them at a later stage if that is
> necessary.
>
> > open( my $fh, ">>log$section" )
> > || die "can't open log$section: $!\n";
>
> ## don't reopen the handle each time.
> ## rather, keep a hash of open handles
>
> unless ( $section_handles{$section} ) {
> open( my $fh, ">>", "log$section" )
> or die "can't open log$section: $!\n";
> $section_handles{$section} = $fh;
> }
> my $fh = $section_handles{$section};
>
this is similar to what was done before, only an array was used since
$section is simply a number 0 thru max processes - 1
> > flock( $fh, 2 );
>
> flock( $fh, LOCK_EX ) or die $!;
>
>
> > print $fh "$_\n";
> > close( $fh );
>
> ## and now you don't close the handle, but you still need it unlocked.
> flock( $fh, LOCK_UN ) or die $!;
>
yeah, i thought this may be better...
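[The lock-print-unlock pattern being discussed, written out as a
self-contained sketch; the handle, filename, and record here are
placeholders, not taken from the original script:]

```perl
use strict;
use warnings;
use Fcntl qw(:flock);   # LOCK_EX, LOCK_UN
use IO::Handle;         # for $fh->flush

# Append one record under an exclusive lock, then release the lock
# without closing the handle. All names are illustrative only.
sub locked_append {
    my ($fh, $record) = @_;
    flock($fh, LOCK_EX) or die "flock: $!";
    print {$fh} "$record\n";
    $fh->flush;    # flush before unlocking so no half-buffered line leaks
    flock($fh, LOCK_UN) or die "flock: $!";
}

open(my $fh, '>>', 'results.txt') or die "can't open results.txt: $!";
locked_append($fh, 'some entry');
close $fh;
```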
> > }
> > return 1;
> > }
> >
> > ....is there a problem because of the mix of forked processes and
> > system calls? perhaps i should change the system call to a function
> > call (after making the necessary code changes)
>
> While I don't think this causes this particular problem, I would make that
> change anyway.
>
yeah, seems cleaner. less complexity.
>
> > previously, the open filehandles were at the top of betaParse.pl and
> > there was no locking. this appeared to cause the record splicing,
> > although the processing was about twice as fast. can you shed some
> > light on this phenomenon?
>
> I would guess that reopening a file for every line is going to be slow.
>
> Locking the file for every line is probably also going to be kind of
> slow, but not nearly as slow as re-opening it is. If it turns out to
> be a bottleneck, you could batch up, say, 100 lines and write them in one
> chunk, with only one lock around it. If you are going to do chunking, you
> should probably whip up a module for it rather than putting it directly
> into your code.
interesting concept. i will investigate...
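[The chunking idea described above could be sketched like this; the
buffer size, variable names, and the flush-before-finish reminder are
editorial inventions, not from the thread:]

```perl
use strict;
use warnings;
use Fcntl qw(:flock);
use IO::Handle;

# Buffer lines in memory and write each batch under a single flock,
# so the locking cost is paid once per chunk instead of once per line.
my @buffer;
my $chunk_size = 100;    # illustrative; tune by benchmarking

sub buffered_print {
    my ($fh, $line) = @_;
    push @buffer, $line;
    flush_buffer($fh) if @buffer >= $chunk_size;
}

sub flush_buffer {
    my ($fh) = @_;
    return unless @buffer;
    flock($fh, LOCK_EX) or die "flock: $!";
    print {$fh} map { "$_\n" } @buffer;
    $fh->flush;
    flock($fh, LOCK_UN) or die "flock: $!";
    @buffer = ();
}

# Note: a child must call flush_buffer($fh) one last time before
# $pm->finish, or the tail of its buffer is silently lost.
```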
>
> > you said before that the string length at
> > which writing would start going crazy was around maybe 4096. the
> > records in these weblogs are perhaps a maximum of 10 lines (this is a
> > conservative estimate) on a 1024 x 768 res screen maximized.
>
> I don't know how long 10 lines on a 1024x768 screen are. But I do know
> that Perl has a length function :)
>
> perl -lne 'print length' file
>
i deserved that :-). thanks again Xho!
> > I think you could have trimmed it down a lot more in most places, and
> > a little less in others. :)
> thanks, i'll work on improving this to help you help me more :-)
And even better, trimmed down code is good for benchmarking and other
experimentation.
> >
> > Also, I think you need to rear back and think of the big picture some
> > more. You are forking of 16 processes, and you are creating 10 log
> > files, and you want all 16 of those processes to write to all 10 of
> > those log files, so you have 160 filehandles all fighting with each
> > other. Is that really necessary? Do the 10 log files serve 10
> > different purposes (unlikely, it seems) or they are just to keep the
> > size of any given log file down, or what?
>
> they were initially there because we have an ETL (Extract, Transform,
> Load) tool that picks them up, and it was determined that this tool
> could optimally use 10 threads to gather these processed records. it
> would be best if these 10 records were the same size (or near enough).
> i'm working on a 16 CPU machine, so that's why i created 16 forked
> processes.
Since your processes have a mixed work load, needing to do both Net::FTP
(probably I/O bound) and per line processing (probably CPU bound), it might
make sense to use more than 16 of them.
...
> So I was thinking of
> creating a counter, and incrementing it in the loop that spawns the
> processes, and mod this counter by 16. the mod'd counter will be a
> value that is passed to the betaParse script/function and the parsing
> will use this value to choose which filehandle it writes to.
Don't do that. Let's say you start children 0 through 15, writing to
files 0 through 15. Child 8 finishes first. So ForkManager starts child
16, which tries to write to file 16%16 i.e. 0. But of course file 0 is
still being used by child 0. If you wish to avoid doing a flock for every
row, you need to mandate that no two children can be using the same file at
the same time.
I see two good ways to accomplish that. The first is simply to have each
child, as one of the first things it does, loop through a list of
filenames. For each, it opens it and attempts a nonblocking flock. Once it
finds a file it can successfully flock, it keeps that lock for the rest of
its life, and uses that filehandle for output.
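[That first approach might look like the following sketch; the candidate
filenames and the helper's name are invented for illustration:]

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Each child tries a nonblocking exclusive lock on each candidate
# file in turn, and keeps the first lock it wins for its lifetime.
sub claim_output_file {
    my @candidates = @_;
    for my $name (@candidates) {
        open(my $fh, '>>', $name) or die "can't open $name: $!";
        if ( flock($fh, LOCK_EX | LOCK_NB) ) {
            return ($name, $fh);    # hold this lock until exit
        }
        close $fh;                  # another child owns it; try the next
    }
    die "no unlocked output file among: @candidates";
}

my ($name, $fh) = claim_output_file( map { "log$_" } 0 .. 9 );
print {$fh} "claimed by $$\n";
close $fh;    # closing the handle also releases the lock
```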
The other way is to have ForkManager (in the parent) manage the files that
the children will write to. This has the advantage that, as long as there
is only one parent process running at once, you don't actually need to do
any flocking in the children, as the parent ensures they don't interfere
(but I do the locking anyway if I'm on a system that supports it. Better
safe than sorry):
use Fcntl qw(:flock);    # LOCK_EX, LOCK_NB
my $pm = Parallel::ForkManager->new(16);
# tokens for the output files.
my @outputID = ("file01" .. "file20");    # needs to be >= 16, of course
# put the token back into the queue once the child is done.
$pm->run_on_finish( sub { push @outputID, $_[2] } );
#...
foreach my $whatever (@whatever) {
    # get the next available token for output
    my $oid = shift @outputID or die "ran out of output tokens";
    $pm->start($oid) and next;
    open my $fh, ">>", "/tmp/$oid" or die $!;
    flock $fh, LOCK_EX|LOCK_NB or die "Hey, someone is using my file!";
    # hold the lock for life
    #...
    while (<$in>) {
        #...
        print $fh $stuff_to_print;
    }
    close $fh or die $!;
    $pm->finish();
}
> >
> > More on the big picture. A couple weeks ago when you started asking
> > here about parsing (or was it FTPing?) lots of files, I thought
> > Parallel::ForkManager was the right solution. But you keep
> > incrementally adding new wrinkles, and I fear that by the time you are
> > done adding them perhaps the right solution will no longer be
> > Parallel::ForkManager but rather threads or event-loops or some
> > RDMS-interfaced program or even some other language entirely.
> > Parallelization is inherently challenging, and it probably needs some
> > deep big-picture thinking, not incremental tinkering.
> >
>
> yeah, i never thought it would be easy :-)
Theme song from grad school days: "No one said it would be easy, but no one
said it would be this hard."
i'm not sure i see how you derive your conclusion here? i don't know
too much about resource utilization/consumption. i just heard that it
was optimal to use as many processes as there are CPUs. i can see why
more processes would be better for more concurrent FTP gets, but
wouldn't the subsequent processing of > 16 files make the machine
sluggish since there are only 16 CPUs? i would say overall the entire
'thing' (hesitant to use the word process) is more CPU bound since the
file transfer comprises a significantly smaller percentage of the time
relative to the actual processing.
is it actually better to use more if i have a mixed work load? why
exactly? how does the machine allocate its resources?
Doing an FTP is generally limited by IO (or, at least it is reasonable to
assume so until specific data on it is gathered). It would be nice if your
CPUs had something to do while waiting for this IO, other than just sit and
wait for IO.
> i don't know
> too much about resource utilization/consumption. i just heard that it
> was optimal to use as many processes as there are CPUs.
This is true if your processes are CPU bound.
> i can see why
> more processes would be better for more concurrent FTP gets, but
> wouldn't the subsequent processing of > 16 files
But you are not running synchronously--each child proceeds to the per-line
processing as soon as its own data is loaded, regardless of what the other
children are doing. If you run 20 children and the FTP part takes 20% of
the time, then at any given moment you would expect about 4 processes to
be FTPing, using little CPU, and 16 processes to be grinding on the CPUs.
(If you use 16 children and FTPing is 20% of the time, then you would
expect 3.2 to be FTPing and 12.8 to be crunching at any one time, leaving
almost 3.2 CPUs idle)
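[The back-of-envelope arithmetic above can be written out directly; a
throwaway sketch, where the 20-children/20% figures are just the example
numbers from the post:]

```perl
use strict;
use warnings;

# Expected steady-state split of N children between FTPing (an
# io_fraction of their time) and CPU-bound crunching.
sub expected_split {
    my ($children, $io_fraction) = @_;
    my $ftping    = $children * $io_fraction;
    my $crunching = $children - $ftping;
    return ($ftping, $crunching);
}

my ($ftp20, $cpu20) = expected_split(20, 0.20);   # ~4 FTPing, ~16 crunching
my ($ftp16, $cpu16) = expected_split(16, 0.20);   # ~3.2 FTPing, ~12.8 crunching
```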
Of course, one thing you have to watch out for is the distribution of the
processing. Just because programs *can* be arranged in a way that uses
both IO and CPU most efficiently doesn't mean they automatically will
arrange themselves that way. But with each task being of a more-or-less
random length, as you have, they shouldn't do too badly arranged just by
chance (other than right after the parent starts, when they will all be
doing FTP at the same time.)
> make the machine
> sluggish since there are only 16 CPUs?
"sluggish" is a term usually used for interactive programs, while yours
seems to be batch, and so sluggishness is probably of minimal concern.
> i would say overall the entire
> 'thing' (hesitant to use the word process) is more CPU bound since the
> file transfer comprises a significantly smaller percentage of the time
> relative to the actual processing.
Yeah, in that case there is no compelling reason to go above 16. Which is
nice, as it is one less thing to need to consider.
> is it actually better to use more if i have a mixed work load? why
> exactly? how does the machine allocate its resources?
The FTP asks the remote computer to send it bunches of data. While it is
waiting for those bunches of data to appear, the CPU is able to do other
things, provided there are other things to do. So, it is often worthwhile
to ensure that there *are* other things to do. But obviously if the FTP
time is a small percentage of overall time, then this makes little
difference in your case.
Also, it is possible, if your remote machines, ethernet, etc. are fast
enough compared to your CPUs, that the FTP itself is CPU rather than IO
bound, or nearly so. It's also possible that the local hard-drive that the
FTPed files are being written to is the limiting factor, or that the
remote machine(s) you are fetching from are the limiting factors.
But it occurs to me that I might be advocating premature optimization here.
Get it working first and see if it is "fast enough". If you can process an
hour's worth of logs in 5 minutes, there is probably no point in making it
faster. If it takes 65 minutes to process an hour's worth of logs, that
is probably the time to worry about 16 versus 20 versus 24 parallel
children.
Axel
actually, i had several processes appending to the same log file and
came across records that had spliced entries. this was the very problem
of which i spoke, and why i ended up needing to explicitly flock.
Which operating system were you using? I never ran across this problem
using Solaris.
Axel
SunOS 5.8