It manifested when sorting a large (4000+) list of URLs. The perl
process just crashed: nothing output to STDERR, no core file
left over, it was just gone. Often it would work for days at a
time; then there would be long periods when it would not work at
all. I have also seen crashes _near_ this segment of code: again,
the perl process just exits silently. Those crashes have not seemed
related to the sort, but it's hard to tell.
The sort uses a custom routine to sort the URLs into an order
that's a little more natural than alphanumeric, which I call
"URLDomainOrder". For debugging, I added a filehandle to the
comparison function; it looks like this now:
[...]
    my @ers_keys = keys %$cur_ers_ref;
    open(UDOFH, ">/tmp/udofh") or die "can't write udofh, $!";
    select(UDOFH); $|=1; select(STDOUT);
    my @sorted_ers_keys =
        sort { &URLDomainOrder($a, $b, \*UDOFH) }
        @ers_keys;
    close UDOFH;
[...]
The comparison function starts off like this:
    sub URLDomainOrder {
        my($a, $b, $fh) = @_;
        $fh = '' if !$fh;
        if ($fh) {
            print $fh "UDO1 a='$a' b='$b'\n";
        }
        my($a_www, $b_www,
           $a_host, $b_host,
           $a_port_given, $b_port_given,
           $a_host_num_str, $b_host_num_str,
           $a_domain, $b_domain,
           $a_path, $b_path) = ( ) x 12;
        my($a_host_num_val, $b_host_num_val) = (0, 0);
        my($a_port, $b_port) = (80, 80);
        if ($fh) {
            print $fh "UDO a='$a' b='$b'\n";
        }
        if ($a and $b) {
            if ($fh) {
                print $fh "UDO a='$a' b='$b'\n";
            }
            ($a_www, $a_host, $a_port, $a_path) = $a =~ m!^(?:http://)?(www\.)?([^/]+?)(\:\d+)?(/.*)!;
            if ($fh) {
                print $fh "UDO a_www='$a_www' a_host='$a_host' a_port='$a_port' a_path='$a_path'\n";
            }
            ($b_www, $b_host, $b_port, $b_path) = $b =~ m!^(?:http://)?(www\.)?([^/]+?)(\:\d+)?(/.*)!;
            if ($fh) {
                print $fh "UDO b_www='$b_www' b_host='$b_host' b_port='$b_port' b_path='$b_path'\n";
            }
[...snip...]
The last lines left in my debugging log file before the crash are
these (and this is at least somewhat repeatable: it has crashed
exactly like this twice in a row):
UDO1 a='http://www.projo.com/report/pjb/stories/02573501.htm' b='http://www.projo.com/report/pjb/stories/02568772.htm'
UDO a='http://www.projo.com/report/pjb/stories/02573501.htm' b='http://www.projo.com/report/pjb/stories/02568772.htm'
UDO a='http://www.projo.com/report/pjb/stories/02573501.htm' b='http://www.projo.com/report/pjb/stories/02568772.htm'
UDO a_www='www.' a_host='projo.com' a_port='' a_path='/report/pjb/stories/02573501.htm'
It thus appears to have crashed on the last regex above ("$b_www").
Of course this regex works fine on the data normally, and in fact
the debugging log file is full of a few hundred other references to
this same URL in which the regex was successfully executed.
This behavior is with perl5.005_61, and though I haven't run the
debugging log with 5.005_03, essentially identical behavior was
occurring with 5.005_03 as well.
My guess is that perl has a bug causing a stray pointer, but I
don't know enough to chase this down.
I can do workarounds, but I don't like this. Any advice will be
appreciated.
--
Jamie McCarthy
ja...@mccarthy.org
Are you sure you are looking for the core in the correct directory? What is
your core-size limit? What exit code does the parent process get?
Ilya
> It manifested when sorting a large (4000+) list of URLs. The perl
> process just crashed: nothing output to STDERR, no core file
> left over, it was just gone.
Could core files be disabled, or could there be too little (non-reserved)
space remaining for the large core? Other than that, I can't see why there
would be no core.
Your sorting problem might benefit from the Schwartzian Transform, so that
you won't have to process each URL multiple times.
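A minimal sketch of that idea, not the original code: each URL is parsed once into a list of fields, and the sort compares the precomputed fields. The helper name `url_sort_key` and the field order (host, port, path) are assumptions for illustration, since the rest of URLDomainOrder was snipped from the post; only the regex is taken from it.

```perl
use strict;
use warnings;

# Hypothetical helper: parse a URL once and return its sort fields.
# The regex is the one from the original post; the choice and order of
# fields is an assumption, because the real comparison logic was snipped.
sub url_sort_key {
    my ($url) = @_;
    my ($www, $host, $port, $path) =
        $url =~ m!^(?:http://)?(www\.)?([^/]+?)(\:\d+)?(/.*)!;
    $host = '' unless defined $host;
    $path = '' unless defined $path;
    $port = defined $port ? substr($port, 1) : 80;   # strip the leading ':'
    return (lc $host, $port, $path);
}

# Schwartzian Transform: decorate each URL with its fields, sort on the
# fields, then strip the decoration again.
sub sort_urls {
    return map  { $_->[0] }
           sort { $a->[1] cmp $b->[1]
               or $a->[2] <=> $b->[2]
               or $a->[3] cmp $b->[3] }
           map  { [ $_, url_sort_key($_) ] }
           @_;
}

my @sorted = sort_urls(
    'http://www.projo.com/report/pjb/stories/02573501.htm',
    'http://example.org:8080/b',
    'http://www.projo.com/report/pjb/stories/02568772.htm',
    'http://example.org/a',
);
print "$_\n" for @sorted;
```

With this shape the regex runs once per URL rather than twice per comparison, so a 4000-element sort performs about 4000 matches instead of tens of thousands.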
> My guess is that perl has a bug causing a stray pointer, but I
> don't know enough to chase this down.
I'd suspect a memory leak. But I'd expect a core, too. It sounds as if
your perl (or your libraries) has a bug or misconfiguration.
If you're using perl's malloc, try your system's, and vice versa.
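To see which malloc a given perl was built with, the build configuration can be queried. This is a small sketch using the standard Config module; changing the setting means rebuilding perl with a different answer to Configure's malloc question.

```perl
use strict;
use warnings;
use Config;

# 'usemymalloc' is 'y' when perl was built with its own malloc and 'n'
# when it uses the system's malloc.
my $malloc = defined $Config{usemymalloc} ? $Config{usemymalloc} : 'unknown';
print "usemymalloc=$malloc\n";
```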
Good luck!
--
Tom Phoenix Perl Training and Hacking Esperanto
Randal Schwartz Case: http://www.rahul.net/jeffrey/ovs/
> Are you sure you are looking for the core in the correct directory? What is
> your core-size limit? What exit code does the parent process get?
There are no core files on any filesystem on this machine.
My core size limit is 1000000 blocks, which it seems is 512 MB
(this perl process takes up 5-10 MB, 20 at most). I don't know
the exit code because its parent process has long since exited.
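One way to capture the exit status next time is to start the script from a small wrapper that stays alive and logs how the child ended. The following is only a sketch: `describe_status` is a hypothetical helper name, and the `perl -e 'exit 3'` child stands in for the real script, which is not shown in the thread.

```perl
use strict;
use warnings;

# Decode a wait status (the value of $? after system() or wait()) into
# an exit code, a terminating signal number, and the core-dump flag.
sub describe_status {
    my ($status) = @_;
    return "failed to start" if $status == -1;
    my $signal = $status & 127;                 # low 7 bits: signal number
    my $core   = $status & 128 ? 'yes' : 'no';  # bit 7: core was dumped
    my $exit   = $status >> 8;                  # high byte: exit code
    return $signal
        ? "killed by signal $signal (core dumped: $core)"
        : "exited with code $exit";
}

# Stand-in child process; replace with the real script to be watched.
system($^X, '-e', 'exit 3');
print describe_status($?), "\n";
```

A "killed by signal" line would also say which signal was involved, which a silent exit does not reveal.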
Tom Phoenix wrote:
> Could core files be disabled, or could there be too little (non-reserved)
> space remaining for the large core? Other than that, I can't see why there
> would be no core.
There's plenty of space free on every filesystem (gigabytes).
> Your sorting problem might benefit from the Schwartzian Transform, so that
> you won't have to process each URL multiple times.
That's true. I've been meaning to get around to that :-)
> > My guess is that perl has a bug causing a stray pointer, but I
> > don't know enough to chase this down.
>
> I'd suspect a memory leak. But I'd expect a core, too. It sounds as if
> your perl (or your libraries) has a bug or misconfiguration.
I confirmed that it does happen both on the 5.00503 shipped with
Red Hat 6.0, and on 5.00561 as installed with "Configure -des".
> If you're using perl's malloc, try your system's, and vice versa.
That's an excellent idea. I'll try perl's malloc.
(Un)Fortunately, everything has worked fine for the last 48 hours
with no changes in the code; a few dozen of these large sorts
have been done in that time. So I won't know whether any changes
have made it better or worse. Phase of the moon, for all I know...
--
Jamie McCarthy
In <37E121D1...@mccarthy.org> Jamie McCarthy wrote:
> (Un)Fortunately, everything has worked fine for the last 48 hours
> with no changes in the code; a few dozen of these large sorts
> have been done in that time. So I won't know whether any changes
> have made it better or worse. Phase of the moon, for all I know...
Another guess: There are known problems with perl signal handling in that
you can crash a non-threaded perl by sending it a signal at an inopportune
time. Perhaps your process is getting a signal of some sort (no pun
intended), most likely during the time the program spends in the
large sort, and that causes the process to go belly up silently?
From what I understand, it is a hit-or-miss kind of thing - sometimes the
signal is handled gracefully, sometimes (usually very rarely) it isn't. But
with large sorts, perhaps you would be more likely to run into the problem.
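One hedged way to test this hypothesis is to ignore the suspect signals for the duration of the sort and see whether the crashes stop. The sketch below assumes HUP, USR1 and USR2 are the signals of interest; the thread does not say which signals, if any, the process actually receives, and `sort_with_signals_ignored` is a hypothetical name.

```perl
use strict;
use warnings;

# Sort a list while selected signals are ignored. 'IGNORE' asks the OS
# to discard the signal, so no Perl-level handler runs during the sort.
# local() restores the previous dispositions when the sub returns.
sub sort_with_signals_ignored {
    my @list = @_;
    local $SIG{HUP}  = 'IGNORE';   # assumption: these three signals
    local $SIG{USR1} = 'IGNORE';   # are the ones worth masking
    local $SIG{USR2} = 'IGNORE';
    my @sorted = sort @list;       # the real code would use its own comparator
    return @sorted;
}

my @result = sort_with_signals_ignored(qw(pear apple fig));
print "@result\n";
```

If the crashes continue with the signals ignored, signal delivery during the sort becomes a less likely explanation.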
Cheers,
Mark
I think without my voodoo patch your "very rarely" is 1/30. With the
voodoo patch it was down to less than 1/100000. (I had seen a failure
only once on many *very* long tests - with a signal each 30ms tick. I
could not get a better granularity on OS/2.)
Ilya
P.S. I do not remember what the voodoo patch was doing. It *should
not have* made any difference... See archives for details.