Ideas for 5.10

Nick Ing-Simmons

unread,

Aug 10, 2002, 5:28:32 PM8/10/02

to art...@contiller.se, Nick Ing-Simmons, h...@crypt.org, perl5-...@perl.org, raphel.gar...@hexaflux.com

Arthur Bergman <art...@contiller.se> writes:
>>
>> It will of course require new versions of XS modulues that use magic.
>> Tk for sure.
>
>I had high hopes for a compatibility layer that made them work, with the
>goal of supporting the big ones (Tk, DBI ...). otherwise I don't think
>this is possible.

Tk _uses_ MAGIC quite heavily. '~' magic is used to associate core-Tk
data structures with perl objects, and to attach some Tcl_Obj vtables
to SVs - so sometimes the vtable is not empty, nor is it one of standard
perl ones.

Multiple 'U' magics are possible for "watching" variables to change
displayed text, or to change variables when GUI event occurs
(radio button).

Current Tk also uses Tie::Handle hooks to do 'fileevent'
(though this is likely to change for Tk804 which needs perl5.8
for Unicode so will switch to PerlIO layers).

Unless the wrapper layer presents a linked list of MAGIC * with
same field names Tk is going need to change.

>
>It would be a good way for a future path to parrot land.

If you can persuade Tcl/Tk folk of the merits of Parrot I may not need
to do a Tk port at all ;-)

If not then it is going to need re-porting - some way to make
a Tcl_Obj "be" a PMC (or vice versa) is the obvious thing to do.

--
Nick Ing-Simmons
http://www.ni-s.u-net.com/

Nick Ing-Simmons

unread,

Aug 10, 2002, 6:09:01 PM8/10/02

to gol...@earthlink.net, h...@crypt.org, perl5-...@perl.org, Arthur Bergman

Benjamin Goldberg <gol...@earthlink.net> writes:
>Arthur Bergman wrote:
>>
>> Hi,
>>
>> Here is a list of things that I would like to see in 5.10
>
>Another think I would like to see would be to allow $/ to be a qr//
>regex.

The main snag with that is getting to grips with the regex engine
and avoiding issues with re-entrancy if <$fh> calls into a layer/tie
which also uses regex-es.

Nick Ing-Simmons

unread,

Aug 10, 2002, 6:06:08 PM8/10/02

to gol...@earthlink.net, h...@crypt.org, perl5-...@perl.org, Arthur Bergman

Benjamin Goldberg <gol...@earthlink.net> writes:
>Arthur Bergman wrote:
>>
>> Hi,
>>
>> Here is a list of things that I would like to see in 5.10
>

>1) Allow filehandles to have their own input record seperators. If a
>filehandle doesn't have it's input record seperator set at the time that
>readline/<> is called, then the global $/ variable would be used.

This has often been suggested. The snag is knowing when legacy code
expects the global $/ to have an effect. So we need a _new_ mechanism
that sets "the per-handle record separator and causes $/ to be ignored".

>
>2) A mechanism like IO::Select which is able to see inside of the perlio
>buffers, and which will, if necessary, perform C-level read()s for you
>on filehandles which do have data pending, but which don't yet have the
>record seperator.

Perl/Tk has had various approximations to that for years.
One that does a better job using layers will be along "soon".
The main pain is that C's (POSIX's really) select() only tells you
that reading one byte is possible. Doing non-blocking read is far
more tricky to do portably.

>
>In otherwords, make a select() which works with buffered io.
>
> my $select = PerlIO::Select->new;
> $select->add($foo); $foo->input_record_seperator( $CRLF );
> $select->add($bar); $bar->input_record_seperator( \8 );
> my @ready = $select->can_read();

->can_read seems to be really ->has_whole_record

> # ->can_read will call the C-select function, and will call read()
> # if necessary.
> foreach my $h ( @ready ) {
> my $line = <$h>; # doesn't block, doesn't confuse ->can_read.
> }
>
>This could of course be implemented without (1) also being done, but
>then you would want to have:
> $select->add_with_input_record_sep( $foo, $CRLF );
>
>3) Allow the user more control of buffering: Normally, one can only
>set autoflush on or off. I'd like to be able to say: flush every X
>bytes,

would be easy to give a parameter to say :perlio(X) to set buffer size.

>or flush whenever the string Y is encountered,

That is messier but still doable with a simple layer with custom write
method - before or after the string? ;-)

>or flush after
>every print,

Auto-flush is the best way to do that. Lower level stuff has no idea
where the 'print' boundaries are.

>or never flush at all until asked to (keep growing the
>output buffer as needed).

A :scalar -like layer with a flush method that looks for layer below
indeed it may make sense for :scalar to behave like that.

>
>That last option would be useful with my idea (2) -- it could print
>bytes for you and remove them from the output buffer as necessary.

Benjamin Goldberg

unread,

Aug 10, 2002, 9:15:13 PM8/10/02

to Nick Ing-Simmons, h...@crypt.org, perl5-...@perl.org, Arthur Bergman

Nick Ing-Simmons wrote:
>
> Benjamin Goldberg <gol...@earthlink.net> writes:
> >Arthur Bergman wrote:
> >>
> >> Hi,
> >>
> >> Here is a list of things that I would like to see in 5.10
> >

[snip]

> >2) A mechanism like IO::Select which is able to see inside of the
> >perlio buffers, and which will, if necessary, perform C-level read()s
> >for you on filehandles which do have data pending, but which don't
> >yet have the record seperator.
>
> Perl/Tk has had various approximations to that for years.
> One that does a better job using layers will be along "soon".
> The main pain is that C's (POSIX's really) select() only tells you
> that reading one byte is possible. Doing non-blocking read is far
> more tricky to do portably.

I was thinking something like this:

sub PerlIO::Select::can_readline {
my $self = shift;
my $handles = $self->[HANDLES];
my $input_rec_seps = $self->[IRS];
my @ready;

# Most important difference from IO::Select ... it
# checks if there's a record already buffered.
foreach my $handle (@$handles) {
my $bufref = GetPerlIOBuffer( $handle );
my $fd = fileno $handle;
my $irs = $input_rec_seps->{$fd};
$irs = $/ unless exists $input_rec_seps->{$fd};
if(!defined($irs) ? GetPerlIOEOFWasRead($handle) :
ref($irs) eq "Regex" ? $$bufref =~ $irs :
ref($irs) ? length($$bufref) >= $$irs :
$irs eq "" ? $$bufref =~ /\n\n./ :
-1 != index( $$bufref, $irs ) )
{
push @ready, $handle;
}
}
return @ready if @ready;

my $timeout = shift;
my $rvec = $self->[READ_VEC];
SELECTLOOP: {
((my $nready), $timeout) =
CORE::select( (my $ready=$rvec), '', '', $timeout );
return if $nready <= 0;
while( $ready =~ /[^\0]+/g ) {
for my $fd ( $-[0] * 8 .. $+[0] * 8 ) {
next unless vec( $ready, $fd, 1 );
my $handle = $handles->[$fd];
my $bufref = GetPerlIOBuffer( $handle );
my $irs = $input_rec_seps->{$fd};
$irs = $/ unless exists $input_rec_seps->{$fd};
my $m = sysread $handle, $$bufref, LOTS, length $$bufref;
if(!$m or defined($irs) and
ref($irs) eq "Regex" ? $$bufref =~ $irs :
ref($irs) ? length($$bufref) >= $$irs :
$irs eq "" ? $$bufref =~ /\n\n./ :
-1 != index( $$bufref, $irs ) )
{
push @ready, $handle;
}
} } # end for $fd, end while [^\0]
redo SELECTLOOP unless @ready;
} # end select block
@ready;
}

Which uses C/POSIX select.

The important (difficult) thing is that you need a way to be able to
read from *underneath* the PerlIO layer, and store it into the input
buffer for the PerlIO object, which a later read() or readline() will
get data from.

> >In otherwords, make a select() which works with buffered io.
> >
> > my $select = PerlIO::Select->new;
> > $select->add($foo); $foo->input_record_seperator( $CRLF );
> > $select->add($bar); $bar->input_record_seperator( \8 );
> > my @ready = $select->can_read();
>
> ->can_read seems to be really ->has_whole_record

Ok... or perhaps, can_readline. The reason I wanted can_read was to
make PerlIO::Select be a drop-in replacement for IO::Select, except that
one can safely call readline/<> after the method returns, whereas one
normally may only safely call sysread after the method returns.

> > # ->can_read will call the C-select function, and will call read()
> > # if necessary.
> > foreach my $h ( @ready ) {
> > my $line = <$h>; # doesn't block, doesn't confuse ->can_read.
> > }
> >
> >This could of course be implemented without (1) also being done, but
> >then you would want to have:
> > $select->add_with_input_record_sep( $foo, $CRLF );
> >
> >3) Allow the user more control of buffering: Normally, one can only
> >set autoflush on or off. I'd like to be able to say: flush every X
> >bytes,
>
> would be easy to give a parameter to say :perlio(X) to set buffer
> size.
>
> >or flush whenever the string Y is encountered,
>
> That is messier but still doable with a simple layer with custom write
> method - before or after the string? ;-)

After, of course... it's not as if Y is going to get removed from the
buffer. The most common usage would be something like:

binmode SOCKET, qq/:flush(after => "\015\012")/;

So that if a print statement contains the internet record terminator,
the line gets flushed. If one record is produced with multiple print
statements, it's only sent after the last print (which has the CRLF).

Note that this is less often than autoflush, which would send bytes
after every print.

> >or flush after every print,
>
> Auto-flush is the best way to do that.

Yes, but I was kinda thinking of having a unified interface:

binmode SOCKET, qq/:flush(after => "\015\012")/;
binmode SOCKET, qq/:flush(after => \\72)/;
binmode SOCKET, qq/:flush("autoflush")/;
binmode SOCKET, qq/:flush("manual")/;

> Lower level stuff has no idea where the 'print' boundaries are.

It doesn't?

I thought we were integrating via and tiehandle ... I know that
tiehandle knows where 'print' boundaries are (it recently came up on
clp.misc, that mod_perl's tiehandle for stdout forgot to add $/ to the
strings print()ed to it, which was a problem for someone).

> >or never flush at all until asked to (keep growing the
> >output buffer as needed).
>
> A :scalar -like layer with a flush method that looks for layer below
> indeed it may make sense for :scalar to behave like that.

I was thinking maybe:
binmode SOCKET, qq/:flush("manual")/;

> >That last option would be useful with my idea (2) -- it could print
> >bytes for you and remove them from the output buffer as necessary.

--
tr/`4/ /d, print "@{[map --$| ? ucfirst lc : lc, split]},\n" for
pack 'u', pack 'H*', 'ab5cf4021bafd28972030972b00a218eb9720000';

Nick Ing-Simmons

unread,

Aug 11, 2002, 5:43:37 AM8/11/02

to yves....@mciworldcom.de, Nicholas Clark, perl5-...@perl.org

Yves Orton <yves....@mciworldcom.de> writes:
>
>Forgive my ignorance in commenting on this, but for this purpose do we
>really need the full power of the perl (irregular) regexes? It seems to me
>that for this type of task a simple DFA based (regular)regex engine would be
>admirable suited, and would be much faster too.

A lex-like DFA which safely finds longest matches without backtracking
would also be useful for lex-like duties.

>
>Ive often thought there is room in perl for a DFA based regex engine. For
>instance perls regex engine doesn't handle option list very well as it has
>to do a lot of backtracking whereas a DFA based regex engine wouldn't
>backtrack at all. Perhaps such a beast could be put in a module and changes
>to the way $/ is handled could provide a way for that module to hook in
>somehow.

Nick Ing-Simmons

unread,

Aug 11, 2002, 5:41:08 AM8/11/02

to ni...@ccl4.org, h...@crypt.org, perl5-...@perl.org, Arthur Bergman, Benjamin Goldberg

Nicholas Clark <ni...@ccl4.org> writes:

>On Fri, Aug 09, 2002 at 01:15:15AM -0400, Benjamin Goldberg wrote:
>> Arthur Bergman wrote:
>> Another think I would like to see would be to allow $/ to be a qr//
>> regex.
>

>I'd like to see this, but without a rewrite of the regexp engine to allow
>the engine to accept incomplete strings with an associated "get more"
>function I cannot see how it could be implemented for the general case
>without internally unsetting $/, slurping the file, and then finding the
>regexp.

I was naively assuming one would apply the regexp to the current
buffer, and if it did not match 'cos it needed more it would return
false.

>
>However, this idea is not simple if the regexp engine in all its
>backtracking is keeping absolute pointers to match points in the string
>(and expecting a string contiguous in memory) as there's no way a "get more"
>function could reallocate a buffer larger, or hang the extra into another
>buffer (making a discontinuous string)

Again I assumed the regexp would be passed the current buffer afresh
at each attempt.

>
>Nicholas Clark

Nick Ing-Simmons

unread,

Aug 11, 2002, 5:45:37 PM8/11/02

to gol...@earthlink.net, h...@crypt.org, Nick Ing-Simmons, perl5-...@perl.org, Arthur Bergman

Benjamin Goldberg <gol...@earthlink.net> writes:

>Nick Ing-Simmons wrote:
>>
>> >2) A mechanism like IO::Select which is able to see inside of the
>> >perlio buffers, and which will, if necessary, perform C-level read()s
>> >for you on filehandles which do have data pending, but which don't
>> >yet have the record seperator.
>>
>> Perl/Tk has had various approximations to that for years.
>> One that does a better job using layers will be along "soon".
>> The main pain is that C's (POSIX's really) select() only tells you
>> that reading one byte is possible. Doing non-blocking read is far
>> more tricky to do portably.
>
>I was thinking something like this:
>
>sub PerlIO::Select::can_readline {

....

> SELECTLOOP: {
> ((my $nready), $timeout) =
> CORE::select( (my $ready=$rvec), '', '', $timeout );
> return if $nready <= 0;
> while( $ready =~ /[^\0]+/g ) {
> for my $fd ( $-[0] * 8 .. $+[0] * 8 ) {
> next unless vec( $ready, $fd, 1 );
> my $handle = $handles->[$fd];
> my $bufref = GetPerlIOBuffer( $handle );
> my $irs = $input_rec_seps->{$fd};
> $irs = $/ unless exists $input_rec_seps->{$fd};
> my $m = sysread $handle, $$bufref, LOTS, length $$bufref;

And that sysread hangs. The select() just says a sysread() of one byte
will not hang - nothing more.
To avoid the hang you either have to sysread one byte (not even one char
in a Unicode world!) at a time, or put handle into non-blocking mode.
1-byte at a time is horribly slow (as in _most_ cases there is a lot
more than that). Non-blocking IO is messy - as I said.

>
>The important (difficult) thing is that you need a way to be able to
>read from *underneath* the PerlIO layer, and store it into the input
>buffer for the PerlIO object, which a later read() or readline() will
>get data from.

A layer naturaly does a read from layer below and stores
it into its own buffer. So this just means adding a layer
with this functionality on top of what we had. readline() (aka sv_gets() will
call the new top layer and get its buffer which has been so manipulated.

>> ->can_read seems to be really ->has_whole_record
>
>Ok... or perhaps, can_readline. The reason I wanted can_read was to
>make PerlIO::Select be a drop-in replacement for IO::Select, except that
>one can safely call readline/<> after the method returns, whereas one
>normally may only safely call sysread after the method returns.
>

>> >or flush whenever the string Y is encountered,
>>
>> That is messier but still doable with a simple layer with custom write
>> method - before or after the string? ;-)
>
>After, of course... it's not as if Y is going to get removed from the
>buffer. The most common usage would be something like:
>
> binmode SOCKET, qq/:flush(after => "\015\012")/;
>
>So that if a print statement contains the internet record terminator,
>the line gets flushed. If one record is produced with multiple print
>statements, it's only sent after the last print (which has the CRLF).
>
>Note that this is less often than autoflush, which would send bytes
>after every print.

Agreed.

>
>> >or flush after every print,
>>
>> Auto-flush is the best way to do that.
>
>Yes, but I was kinda thinking of having a unified interface:
>
> binmode SOCKET, qq/:flush(after => "\015\012")/;
> binmode SOCKET, qq/:flush(after => \\72)/;
> binmode SOCKET, qq/:flush("autoflush")/;
> binmode SOCKET, qq/:flush("manual")/;
>
>> Lower level stuff has no idea where the 'print' boundaries are.
>
>It doesn't?

No.

In absense of tie

print $a,$b,$c;

Calls PerlIO_write() or similar at least three times - once for each string.
(If each of those is 10Mbytes there is no point in building one 30Mbyte
string somewhere is there?)
Then if autoflush is on it calls PerlIO_flush().

TIEHANDLE may indeed build the one big string - but that is the class's
choice not perl's.

>
>I thought we were integrating via and tiehandle ...

I don't know what "we" are doing but I was considering
deprecating tiehandle, or re-implementing it in terms of layers or some
other scheme to make it stop getting in the way.

>I know that
>tiehandle knows where 'print' boundaries are (it recently came up on
>clp.misc, that mod_perl's tiehandle for stdout forgot to add $/ to the
>strings print()ed to it, which was a problem for someone).
>
>> >or never flush at all until asked to (keep growing the
>> >output buffer as needed).
>>
>> A :scalar -like layer with a flush method that looks for layer below
>> indeed it may make sense for :scalar to behave like that.
>
>I was thinking maybe:
> binmode SOCKET, qq/:flush("manual")/;

Please try and think of :foo() as a layer (a filtering object if you must)
and not as some magical attribute of the file handle. I see no real barier to
writing PerlIO::flush which does what you suggest - but it is NOT really
altering perl's flush behaviour - rather the "flush object" accumulates
data until it sees fit (depending on how it is configured) to send
it down to the layer below.

The main snag with :flush(manual) is perl has NO native function
which does PerlIO_flush() - one turns on autoflush and does a print :-(
So what do you put in your code to "manualy" flush the thing
binmode SOCKET, ":flush(now)" ?

Nicholas Clark

unread,

Aug 11, 2002, 5:52:41 PM8/11/02

to Nick Ing-Simmons, gol...@earthlink.net, h...@crypt.org, perl5-...@perl.org, Arthur Bergman

On Sun, Aug 11, 2002 at 10:45:37PM +0100, Nick Ing-Simmons wrote:

> Please try and think of :foo() as a layer (a filtering object if you must)

> and not as some magical attribute of the file handle. I see no real barrier to

> writing PerlIO::flush which does what you suggest - but it is NOT really
> altering perl's flush behaviour - rather the "flush object" accumulates
> data until it sees fit (depending on how it is configured) to send
> it down to the layer below.
>
> The main snag with :flush(manual) is perl has NO native function
> which does PerlIO_flush() - one turns on autoflush and does a print :-(
> So what do you put in your code to "manualy" flush the thing
> binmode SOCKET, ":flush(now)" ?

It's not clear to me how the PerlIO vtable allows a layer to distinguish to
the layer below whether it is merely asking for it to empty its buffer
(in a the middle of a regular write) or hurry the data along as fast as
possible. (I suspect "expedite" is the marketdroid term)

For example, PerlIOBuf_write does this for the non-linebuf case:

while (count > 0) {
SSize_t avail = b->bufsiz - (b->ptr - b->buf);
if ((SSize_t) count < avail)
avail = count;
PerlIOBase(f)->flags |= PERLIO_F_WRBUF;
if (PerlIOBase(f)->flags & PERLIO_F_LINEBUF) {
....
}
else {
if (avail) {
Copy(buf, b->ptr, avail, STDCHAR);
count -= avail;
buf += avail;
written += avail;
b->ptr += avail;
}
}
if (b->ptr >= (b->buf + b->bufsiz))
PerlIO_flush(f);
}

It's calling flush on itself to empty the buffer. Yes, it's true that it is
"flushing" in one sense. But it's not really flushing in the fflush() sense.
(ie please sync this data to disk; don't return until you've done it/
please send this socket data at once; set the push flag, don't buffer it up
so that you can send the full window at once/
please pass this data out the other side of the compressor immediately, even
if that gives sub-optimal compression; don't hang onto it for the best bulk
transfer)

Am I missing a way to make this distinction?

Nicholas Clark
--
Even better than the real thing: http://nms-cgi.sourceforge.net/

Benjamin Goldberg

unread,

Aug 12, 2002, 2:58:37 AM8/12/02

to Nick Ing-Simmons, h...@crypt.org, perl5-...@perl.org, Arthur Bergman

Nick Ing-Simmons wrote:
>
> Benjamin Goldberg <gol...@earthlink.net> writes:
> >Nick Ing-Simmons wrote:
> >>
> >> >2) A mechanism like IO::Select which is able to see inside of the
> >> >perlio buffers, and which will, if necessary, perform C-level
> >> >read()s for you on filehandles which do have data pending, but
> >> >which don't yet have the record seperator.
> >>
> >> Perl/Tk has had various approximations to that for years.
> >> One that does a better job using layers will be along "soon".
> >> The main pain is that C's (POSIX's really) select() only tells you
> >> that reading one byte is possible. Doing non-blocking read is far
> >> more tricky to do portably.
> >
> >I was thinking something like this:
> >
> >sub PerlIO::Select::can_readline {
> ...

> > SELECTLOOP: {
> > ((my $nready), $timeout) =
> > CORE::select( (my $ready=$rvec), '', '', $timeout );
> > return if $nready <= 0;
> > while( $ready =~ /[^\0]+/g ) {
> > for my $fd ( $-[0] * 8 .. $+[0] * 8 ) {
> > next unless vec( $ready, $fd, 1 );
> > my $handle = $handles->[$fd];
> > my $bufref = GetPerlIOBuffer( $handle );
> > my $irs = $input_rec_seps->{$fd};
> > $irs = $/ unless exists $input_rec_seps->{$fd};
> > my $m = sysread $handle, $$bufref, LOTS, length $$bufref;
>
> And that sysread hangs. The select() just says a sysread() of one byte
> will not hang - nothing more.

Are you sure about that?

While I agree that that select() merely indicates that there is at least
one byte ready for reading, I think you're mistaken about what sysread
does. AFAIK, sysread() will read at least one byte (blocking if
necessary to do so), but if you request more bytes than are available,
it will read however many it has available, and return the number that
were read (and *not* block to get the full count requested).

It's the C-level fread(), and the perl-level read(), that block until
the total amount requested has been gotten.

> To avoid the hang you either have to sysread one byte (not even one
> char in a Unicode world!) at a time, or put handle into non-blocking
> mode. 1-byte at a time is horribly slow (as in _most_ cases there is a
> lot more than that). Non-blocking IO is messy - as I said.

I seriously doubt that you really have to sysread one byte at a time. I
will look for docs stating that things are one way or another, but I
think that I'm right.

Nick Ing-Simmons

unread,

Aug 12, 2002, 6:02:52 AM8/12/02

to ni...@unfortu.net, h...@crypt.org, Nick Ing-Simmons, perl5-...@perl.org, Arthur Bergman, gol...@earthlink.net

Nicholas Clark <ni...@unfortu.net> writes:
>
>It's not clear to me how the PerlIO vtable allows a layer to distinguish to
>the layer below whether it is merely asking for it to empty its buffer
>(in a the middle of a regular write) or hurry the data along as fast as
>possible. (I suspect "expedite" is the marketdroid term)

There is no mechanism - drat (you mentioned this before and I forgot).

>
>For example, PerlIOBuf_write does this for the non-linebuf case:
>

> if (b->ptr >= (b->buf + b->bufsiz))
> PerlIO_flush(f);
> }
>
>
>It's calling flush on itself to empty the buffer. Yes, it's true that it is
>"flushing" in one sense. But it's not really flushing in the fflush() sense.
>(ie please sync this data to disk; don't return until you've done it/
> please send this socket data at once; set the push flag, don't buffer it up
> so that you can send the full window at once/
> please pass this data out the other side of the compressor immediately, even
> if that gives sub-optimal compression; don't hang onto it for the best bulk
> transfer)

It is far from clear that fflush() has all those connotations or effects ;-)

>
>Am I missing a way to make this distinction?
>
>Nicholas Clark
--

Nick Ing-Simmons
http://www.ni-s.u-net.com/

Nicholas Clark

unread,

Aug 12, 2002, 6:19:05 AM8/12/02

to Nick Ing-Simmons, ni...@unfortu.net, h...@crypt.org, Nick Ing-Simmons, perl5-...@perl.org, Arthur Bergman, gol...@earthlink.net

On Mon, Aug 12, 2002 at 11:02:52AM +0100, Nick Ing-Simmons wrote:
> Nicholas Clark <ni...@unfortu.net> writes:
> >
> >It's not clear to me how the PerlIO vtable allows a layer to distinguish to
> >the layer below whether it is merely asking for it to empty its buffer
> >(in a the middle of a regular write) or hurry the data along as fast as
> >possible. (I suspect "expedite" is the marketdroid term)
>
> There is no mechanism - drat (you mentioned this before and I forgot).

But we have an API version in the release 5.8.0, don't we? So it would
be possible to update the flush vtable entry to take a flags argument,
and compensate for any existing binary-compiled PerlIO layers that
declare themselves to be using the current API?

> >For example, PerlIOBuf_write does this for the non-linebuf case:
> >
> > if (b->ptr >= (b->buf + b->bufsiz))
> > PerlIO_flush(f);
> > }
> >
> >
> >It's calling flush on itself to empty the buffer. Yes, it's true that it is
> >"flushing" in one sense. But it's not really flushing in the fflush() sense.
> >(ie please sync this data to disk; don't return until you've done it/
> > please send this socket data at once; set the push flag, don't buffer it up
> > so that you can send the full window at once/
> > please pass this data out the other side of the compressor immediately, even
> > if that gives sub-optimal compression; don't hang onto it for the best bulk
> > transfer)
>
> It is far from clear that fflush() has all those connotations or effects ;-)

Well, it's certainly not going to know how to play ball with a compression
library such as zlib. There's a niggly bit of my brain saying that I read
somewhere that there's no part of the sockets API that allows one to set the
PSH flag, and a man page for fflush says "Note that fflush only flushes
the user space buffers provided by the C library. To ensure that the data
is physically stored on disk the kernel buffers must be flushed too, eg
with sync(2) or fsync(2)"

So I think that fflush doesn't do any of them. But we're perl - we can do
better than C. :-)

Nicholas Clark

Nick Ing-Simmons

unread,

Aug 12, 2002, 9:02:31 AM8/12/02

to gol...@earthlink.net, h...@crypt.org, Nick Ing-Simmons, perl5-...@perl.org, Arthur Bergman

Benjamin Goldberg <gol...@earthlink.net> writes:
>> > my $m = sysread $handle, $$bufref, LOTS, length $$bufref;
>>
>> And that sysread hangs. The select() just says a sysread() of one byte
>> will not hang - nothing more.
>
>Are you sure about that?

No.

Across the whole slew of platforms perl runs on I am sure at least
one of them behaves that way for some kind of thing that fd may be
connected to. Or I would not have had all the hassle with Tk's
fileevent over the years. I would be delighted to be proved wrong.

>
>While I agree that that select() merely indicates that there is at least
>one byte ready for reading, I think you're mistaken about what sysread
>does. AFAIK, sysread() will read at least one byte (blocking if
>necessary to do so), but if you request more bytes than are available,
>it will read however many it has available, and return the number that
>were read (and *not* block to get the full count requested).

"man 2 read" on this Linux says it _may_ do that (for pipes and
terminals) but is rather vague. If I recall correctly my problems
have been with pipes and TCP (stream) sockets (particularly on Win32).

>
>It's the C-level fread(), and the perl-level read(), that block until
>the total amount requested has been gotten.

Agreed.

>
>I seriously doubt that you really have to sysread one byte at a time. I
>will look for docs stating that things are one way or another, but I
>think that I'm right.

The problem is almost certainly not going to be on a UNIX-oid with
reasonably solid BSD socket heritage. It is going to be Win32, MacOS, VMS
VOS, OS/2 or some other platfrom where POSIX-y calls are not native
and perhaps not really understood by folk that implemented the emulation.

(A classic example - a DOS C runtime where read's fd was treated as
index into FILE * table which then did ANSI C's fread() which was
implemented as native INT 21h calls.)

Nicholas Clark

unread,

Aug 12, 2002, 9:11:35 AM8/12/02

to Nick Ing-Simmons, gol...@earthlink.net, h...@crypt.org, Nick Ing-Simmons, perl5-...@perl.org, Arthur Bergman

On Mon, Aug 12, 2002 at 02:02:31PM +0100, Nick Ing-Simmons wrote:
> Benjamin Goldberg <gol...@earthlink.net> writes:

> >I seriously doubt that you really have to sysread one byte at a time. I
> >will look for docs stating that things are one way or another, but I
> >think that I'm right.
>
> The problem is almost certainly not going to be on a UNIX-oid with
> reasonably solid BSD socket heritage. It is going to be Win32, MacOS, VMS
> VOS, OS/2 or some other platfrom where POSIX-y calls are not native
> and perhaps not really understood by folk that implemented the emulation.

Or "worse", when the implementors know the POSIX spec, but fail to
"understand" that they should implement things in the "traditional" way
because many programs implicitly rely on too much.

Although the (in)ability of vendors to implement bits of the TCP stack
(eg shutdown over Unix domain sockets) would depress me if I let it.

Nicholas Clark

Benjamin Goldberg

unread,

Aug 13, 2002, 12:27:54 AM8/13/02

to p5p

Nick Ing-Simmons wrote:
>
> Benjamin Goldberg <gol...@earthlink.net> writes:
> >> > my $m = sysread $handle, $$bufref, LOTS, length $$bufref;
> >>
> >> And that sysread hangs. The select() just says a sysread() of one
> >> byte will not hang - nothing more.
> >
> >Are you sure about that?
>
> No.
>
> Across the whole slew of platforms perl runs on I am sure at least
> one of them behaves that way for some kind of thing that fd may be
> connected to. Or I would not have had all the hassle with Tk's
> fileevent over the years. I would be delighted to be proved wrong.

What kind of hassle did you have with Tk's fileevent?

> >While I agree that that select() merely indicates that there is at
> >least one byte ready for reading, I think you're mistaken about what
> >sysread does. AFAIK, sysread() will read at least one byte (blocking
> >if necessary to do so), but if you request more bytes than are
> >available, it will read however many it has available, and return the
> >number that were read (and *not* block to get the full count
> >requested).
>
> "man 2 read" on this Linux says it _may_ do that (for pipes and
> terminals) but is rather vague. If I recall correctly my problems
> have been with pipes and TCP (stream) sockets (particularly on Win32).

I've been told that (on *nix) select() will *always* indicate that an fd
for a disk file is readable -- I would assume that if you try to sysread
a huge number of bytes from such an fd, it will block until that many
bytes are read.

And I've been told that on Windows, select will only work properly with
TCP sockets, and not work right with pipes or disk files.

I'll concede that the fact that Windows's select doesn't work right with
pipes is a major annoyance, but *most* of the times that one would be
using select is with sockets, in which cases it should do the right
thing.

> >It's the C-level fread(), and the perl-level read(), that block until
> >the total amount requested has been gotten.
>
> Agreed.
>
> >I seriously doubt that you really have to sysread one byte at a time.
> >I will look for docs stating that things are one way or another, but
> >I think that I'm right.
>
> The problem is almost certainly not going to be on a UNIX-oid with
> reasonably solid BSD socket heritage. It is going to be Win32, MacOS,
> VMS VOS, OS/2 or some other platfrom where POSIX-y calls are not
> native and perhaps not really understood by folk that implemented the
> emulation.

On systems where it's supported, ioctl FIONREAD might allow us to know
how many bytes are available for reading, so that more than one byte
could be read, without blocking.

> (A classic example - a DOS C runtime where read's fd was treated as
> index into FILE * table which then did ANSI C's fread() which was
> implemented as native INT 21h calls.)

Ok, now that example is just bizarre :)

If it becomes necessary, we *could* have some platform specific code
which somehow reads data in a sensible (UNIX-oid) manner.

Richard Clamp

unread,

Aug 13, 2002, 12:03:30 AM8/13/02

to Nick Ing-Simmons, gol...@earthlink.net, h...@crypt.org, perl5-...@perl.org, Arthur Bergman

On Sun, Aug 11, 2002 at 10:45:37PM +0100, Nick Ing-Simmons wrote:
> Benjamin Goldberg <gol...@earthlink.net> writes:

> >I thought we were integrating via and tiehandle ...
>
> I don't know what "we" are doing but I was considering
> deprecating tiehandle, or re-implementing it in terms of layers or some
> other scheme to make it stop getting in the way.

You're not the only person to have had thoughts along these lines. I
was writing some tie->perlio mapping code, to bring perlio::via-like
semantics to older perls, then Arthur stepped in :)

Arthur convinced me that it would be equally useful to go the other
way, and to emulate TIEHANDLE in terms of perlio. This is my current
source of the dumb questions about ::via and binmode.

Right now I'm trying to minimally modify t/tiehandle.t and have it
work with my wrapper code[0]. I'm currently tussling with WRITE not
mapping cleanly back to PRINT/PRINTF/WRITE. Similarly READ won't
easily map into READ/READLINE/GETC calls. I suspect this can be
partly solved by doing the same kind of op tree examination that
Want.pm does but I've not pushed quite that far yet.

Any input would be most welcome.

[0] Available at: http://mirth.unixbeard.net/cgi-bin/viewcvs.cgi/PerlTIEIO/

--
Richard Clamp <rich...@unixbeard.net>

Nick Ing-Simmons

unread,

Aug 13, 2002, 5:03:51 AM8/13/02

to gol...@earthlink.net, p5p

Benjamin Goldberg <gol...@earthlink.net> writes:
>Nick Ing-Simmons wrote:
>>
>> Benjamin Goldberg <gol...@earthlink.net> writes:
>> >> > my $m = sysread $handle, $$bufref, LOTS, length $$bufref;
>> >>
>> >> And that sysread hangs. The select() just says a sysread() of one
>> >> byte will not hang - nothing more.
>> >
>> >Are you sure about that?
>>
>> No.
>>
>> Across the whole slew of platforms perl runs on I am sure at least
>> one of them behaves that way for some kind of thing that fd may be
>> connected to. Or I would not have had all the hassle with Tk's
>> fileevent over the years. I would be delighted to be proved wrong.
>
>What kind of hassle did you have with Tk's fileevent?

That it hung IO ops freezing the GUI.
byte-at-a-time read and non-blocking IO both solved it with other
snags.

>
>I've been told that (on *nix) select() will *always* indicate that an fd
>for a disk file is readable

True - it is.

>-- I would assume that if you try to sysread
>a huge number of bytes from such an fd, it will block until that many
>bytes are read.

But not for long.

>
>And I've been told that on Windows, select will only work properly with
>TCP sockets,

It is more complicated than that. There are are least two implementations
of sockets on Win32 (two different DLLs), and even with same DLL name
their behaviour differs between Win9X and WinNT families - and WinCE
is probably different again :-(.

Sockets can be in different modes and select only works "right" in
blocking mode.

I suspect though that on such a socket read() will wait for whole requested
length.

Win32 select() may also not work "right" for listen mode sockets.

>and not work right with pipes or disk files.

True.

Hence the "plan" for Tk on Win32 is to switch to native "Event" scheme
rather than use select(). But that means sockets must NOT be in blocking
mode and so read() will not work at all.

>
>I'll concede that the fact that Windows's select doesn't work right with
>pipes is a major annoyance, but *most* of the times that one would be
>using select is with sockets, in which cases it should do the right
>thing.

It _should_ but it doesn't.

>> The problem is almost certainly not going to be on a UNIX-oid with
>> reasonably solid BSD socket heritage. It is going to be Win32, MacOS,
>> VMS VOS, OS/2 or some other platfrom where POSIX-y calls are not
>> native and perhaps not really understood by folk that implemented the
>> emulation.
>
>On systems where it's supported, ioctl FIONREAD might allow us to know
>how many bytes are available for reading, so that more than one byte
>could be read, without blocking.

ioctl() is the most UNIXy of the fd family. Which ioctl's are honoured
depends on the device driver.

FIONREAD is variously implemented when I tried it as a solution I could
not even get it to be usable between Linux and Solaris :-(

>
>If it becomes necessary, we *could* have some platform specific code
>which somehow reads data in a sensible (UNIX-oid) manner.
--

Nick Ing-Simmons
http://www.ni-s.u-net.com/

Mark-Jason Dominus

unread,

Aug 13, 2002, 1:23:28 PM8/13/02

to Nick Ing-Simmons, h...@crypt.org, perl5-...@perl.org, Arthur Bergman

Perhaps I am late to this discussion, and am repeating something that
has already been said, but:

Nick Ing-Simmons wrote:
> Benjamin Goldberg <gol...@earthlink.net> writes:

> > SELECTLOOP: {
> > ((my $nready), $timeout) =
> > CORE::select( (my $ready=$rvec), '', '', $timeout );
> > return if $nready <= 0;
> > while( $ready =~ /[^\0]+/g ) {
> > for my $fd ( $-[0] * 8 .. $+[0] * 8 ) {
> > next unless vec( $ready, $fd, 1 );
> > my $handle = $handles->[$fd];
> > my $bufref = GetPerlIOBuffer( $handle );
> > my $irs = $input_rec_seps->{$fd};
> > $irs = $/ unless exists $input_rec_seps->{$fd};
> > my $m = sysread $handle, $$bufref, LOTS, length $$bufref;
>
> And that sysread hangs. The select() just says a sysread() of one byte
> will not hang - nothing more.
>
> To avoid the hang you either have to sysread one byte (not even one char
> in a Unicode world!) at a time, or put handle into non-blocking mode.

I am confident that you are mistaken. I believe that POSIX semantics
are that when select() indicates a descriptor is ready for reading,
the immediately next read() will *never* block.

That is why read() is allowed to read fewer than the number of bytes
requested, and returns the number of bytes actually read. If you ask
for a million bytes from a terminal, socket, or pipe, and only one is
available, it will not block; instead, it reads the one available byte
and returns 1. read() on a terminal, socket, or pipe will block only
when there is nothing at all available. On files, of course, it never
blocks at all.

It may be that my experience is too limited but I believe that is the
way it has been on every Unix system I have ever used. I just
double-checked it with terminals, pipes, and sockets under Linux 2.4.2
and SunOS 5.8, the two systems I have handy right now.

Unix being Unix, there may be some broken version somewhere on which
it is impossible to read reliably, but this behavior is so old and so
well-established that there might not be. In any case I think more
research should be done before we write off this possibility.

Nick Ing-Simmons

unread,

Aug 13, 2002, 4:40:58 PM8/13/02

to m...@plover.com, h...@crypt.org, Nick Ing-Simmons, perl5-...@perl.org, Arthur Bergman

Mark-Jason Dominus <m...@plover.com> writes:
>Perhaps I am late to this discussion, and am repeating something that
>has already been said, but:
>
>>

>> And that sysread hangs. The select() just says a sysread() of one byte
>> will not hang - nothing more.
>>
>> To avoid the hang you either have to sysread one byte (not even one char
>> in a Unicode world!) at a time, or put handle into non-blocking mode.
>
>I am confident that you are mistaken. I believe that POSIX semantics
>are that when select() indicates a descriptor is ready for reading,
>the immediately next read() will *never* block.

Good. That is two of you that agree with each other that POSIX
says it should not block.

>
>It may be that my experience is too limited but I believe that is the
>way it has been on every Unix system I have ever used.

Same here for _Unix_.

So it remains for me (eventually) or someone else to try this on
non UNIX system(s). Seeing as I don't normally do workrounds
for non-problems I _assume_ the Tk hackery was/is for Win32.
MS may even have fixed it by now...

Mark-Jason Dominus

unread,

Aug 13, 2002, 7:11:05 PM8/13/02

to Mark Mielke, Mark-Jason Dominus, Nick Ing-Simmons, h...@crypt.org, perl5-...@perl.org, Arthur Bergman

> I've heard of the existent of a race condition between select() and read(),
> but I have never seen it happen in real life.

I think you may be thinking of the following: Two processes are
reading from the same source. Both call select(). Select returns
'ready' to both processes. Process A reads the data. Process B tries
to read, but is blocked, even though it got a 'ready' condition.

When only one process has the socket/pipe/terminal open, I believe
that there is no race condition.

> Event-loop managed IO should use non-blocking file handles... they are
> far more predictable...

Perhaps this would be a good time for someone to do some real
research, rather than continuing to spread FUD. Does someone have the
Stevens book or the POSIX standard? If not I will try to obtain one
or the other.

> Unless the file is mounted via MVFS... but that's another story, and select()
> wouldn't help either...

And when reading from a custom device driver, the read() call may
cause the kernel to overwrite the process's text segment so that it is
forced into an infinite loop printing "Polly Wolly Doodle", regardless
of the results of select().

Arthur Bergman

unread,

Aug 14, 2002, 2:46:31 AM8/14/02

to Mark-Jason Dominus, Mark Mielke, Nick Ing-Simmons, h...@crypt.org, perl5-...@perl.org

On onsdag, augusti 14, 2002, at 01:11 , Mark-Jason Dominus wrote:

>
> Perhaps this would be a good time for someone to do some real
> research, rather than continuing to spread FUD. Does someone have the
> Stevens book or the POSIX standard? If not I will try to obtain one
> or the other.
>

I have both, what do you want to know?

Arthur

Nick Ing-Simmons

unread,

Aug 14, 2002, 1:02:58 PM8/14/02

to art...@contiller.se, h...@crypt.org, Nick Ing-Simmons, Mark Mielke, perl5-...@perl.org, Mark-Jason Dominus

If select() indicates that fd is readable can

read(fd,buffer,8192); // i.e. some large number of bytes

ever block?

>
>Arthur

Graham Barr

unread,

Aug 14, 2002, 1:05:27 PM8/14/02

to Nick Ing-Simmons, art...@contiller.se, h...@crypt.org, Mark Mielke, perl5-...@perl.org, Mark-Jason Dominus

I cannot remember reading a spec which explicitly stated it cannot.
Thats not to say I remember everything, but I only remember reading that they
state that read() *MAY* return with less than the requested number
of bytes. Which IMO, means that whether it blocks or not is
implementation dependant.

Graham.

Nicholas Clark

unread,

Aug 16, 2002, 4:29:42 PM8/16/02

to Benjamin Goldberg, p5p

On Tue, Aug 13, 2002 at 12:27:54AM -0400, Benjamin Goldberg wrote:
> And I've been told that on Windows, select will only work properly with
> TCP sockets, and not work right with pipes or disk files.

Now that 5.8 can fake socketpair() on Windows, will 5.10 have pipes
done as socketpairs, and so have select working on them?

Nicholas Clark

unread,

Aug 16, 2002, 4:47:34 PM8/16/02

to Benjamin Goldberg, Nicholas Clark, Arthur Bergman, perl5-...@perl.org, h...@crypt.org

On Fri, Aug 09, 2002 at 10:57:14PM -0400, Benjamin Goldberg wrote:
> Nicholas Clark wrote:

> > It's probably not unreasonable to limit $/ to non-greedy regexps,
> > because most of the time they're more likely to do what you want.
> > (fast).
>
> It's a bit difficult to identify a regex as being greedy or non-greedy;
> especially, consider when parts of it are greedy, but are limited by
> something, eg: qr/<[^>]*>/ is not really greedy, even though it might
> look like it is. Or: qr/BEGIN(?:(?!END).)*END/, which also isn't
> greedy.
>
> It would be easier to simply tell users, "don't use greedy regexen."
> Or, "If you *must* do that, stick a \u at the beginning."

I think we may be able to get away with working out which regexps can
be matched correctly without having to change the regexp engine or read
the entire file into memory with what I hope is the majority of regexps.

[explanation paragraph, for those not following the thread closely]

I believe the only danger comes from regexps that can match both a long
string within the file and also match a shorter substring. Because they
are greedy, by the rules of perl regexp matching the longer match should
be found first and returned. The problem with $/ comes if only the first
part of the long string that should match is in the PerlIO buffer passed
to the regexp engine, so the match on the long string fails. Normally with
this $/ as regexp idea this wouldn't be a problem, as a failed match would
simply cause more file to be read into the buffer and the match re-attempted.
However, if the greedy regexp can also match the shorter string which *is*
fully in the buffer then the regexp engine will return that shorter match,
which isn't correct for this IO case.

I believe that most regexps that $/ is set to won't use any characters that
could match "\n", except possibly at the end of the regexp. The main class
of regexp that won't are negated character ranges such as your [^>]*> above.
(For these we may want a regexp flag saying "non-greedy - honest" as a hint)
For regexps such as /^__[A-Z]+__$/ that we know can't match a newline in the
middle, we simply look to see if there is a newline in the PerlIO buffer
between the file pointer and the end of buffer. If there is, we call the
regexp engine, saying that the end of where it can match to is the last
newline we know of in the PerlIO buffer.
If there isn't a newline in the PerlIO buffer (or the regexp engine fails to
find a match) we read more data from disk (or whatever) until we find a
newline. (I'm envisaging reading a disk block, then scanning for the last
newline, rather than some sort of fgets())

For regexps that we can't figure out whether they are greedy and could
backtrack we fall back to reading until EOF and presenting the lot to the
regexp engine. And a flag the knowledgeable could add to their regexp saying
"I promise this regexp can't backtrack in a greedy way" or whatever is most
useful to the PerlIO/regexp system would be useful.

Benjamin Goldberg

unread,

Aug 16, 2002, 7:18:21 PM8/16/02

to Nicholas Clark, Nicholas Clark, Arthur Bergman, p5p, h...@crypt.org

Nicholas Clark wrote:
[snip]

> For regexps such as /^__[A-Z]+__$/ that we know can't match a newline
> in the middle, we simply look to see if there is a newline in the
> PerlIO buffer between the file pointer and the end of buffer. If there
> is, we call the regexp engine, saying that the end of where it can
> match to is the last newline we know of in the PerlIO buffer.
>
> If there isn't a newline in the PerlIO buffer (or the regexp engine
> fails to find a match) we read more data from disk (or whatever) until
> we find a newline. (I'm envisaging reading a disk block, then scanning
> for the last newline, rather than some sort of fgets())

This seems more like an optomization for the regex engine to do, not the
PerlIO using it.

> For regexps that we can't figure out whether they are greedy and could
> backtrack we fall back to reading until EOF and presenting the lot to
> the regexp engine. And a flag the knowledgeable could add to their
> regexp saying "I promise this regexp can't backtrack in a greedy way"
> or whatever is most useful to the PerlIO/regexp system would be
> useful.

Turn the 'is it nongreedy' question around ... what happens if we find
out that the qr regex that was placed in $/ *is* greedy (but can
backtrack to match a lesser part)? Do we carp, croak, ignore the
problem (and possibly match wrong), or do it "right" by deffering
running the regex until EOF is seen?

Nicholas Clark

unread,

Aug 16, 2002, 7:24:03 PM8/16/02

to Benjamin Goldberg, Nicholas Clark, Nicholas Clark, Arthur Bergman, p5p, h...@crypt.org

On Fri, Aug 16, 2002 at 07:18:21PM -0400, Benjamin Goldberg wrote:
> Nicholas Clark wrote:
> [snip]
> > For regexps such as /^__[A-Z]+__$/ that we know can't match a newline
> > in the middle, we simply look to see if there is a newline in the
> > PerlIO buffer between the file pointer and the end of buffer. If there
> > is, we call the regexp engine, saying that the end of where it can
> > match to is the last newline we know of in the PerlIO buffer.
> >
> > If there isn't a newline in the PerlIO buffer (or the regexp engine
> > fails to find a match) we read more data from disk (or whatever) until
> > we find a newline. (I'm envisaging reading a disk block, then scanning
> > for the last newline, rather than some sort of fgets())
>
> This seems more like an optomization for the regex engine to do, not the
> PerlIO using it.

But that would mean that the regexp engine would have to know how to
"get more" (in this case by doing IO)
If we do the decision making outside the regexp engine then the regexp
engine needs no change from the present, and there's no recursion danger
(PerlIO calling out to layers written in perl that in turn use regexps)

> > For regexps that we can't figure out whether they are greedy and could
> > backtrack we fall back to reading until EOF and presenting the lot to
> > the regexp engine. And a flag the knowledgeable could add to their
> > regexp saying "I promise this regexp can't backtrack in a greedy way"
> > or whatever is most useful to the PerlIO/regexp system would be
> > useful.
>
> Turn the 'is it nongreedy' question around ... what happens if we find
> out that the qr regex that was placed in $/ *is* greedy (but can
> backtrack to match a lesser part)? Do we carp, croak, ignore the
> problem (and possibly match wrong), or do it "right" by deffering
> running the regex until EOF is seen?

Don't know. I'd like to have the "do EOF" option at least available as this
would be most useful for processing disk files that have historically
been processed with undef $/ and a regexp on $_
Now we could readline() them.

But having the croak or carp available to anyone doing qr// $/ for a terminal
or socket or pipe would also be handy. Else they are going to sit forever
and wonder what's going on.

And in that case, most minimally we only need to carp/croak/redo if we
find that we did match after backtracking.

I don't think I've answered your question properly. Sorry.

Nicholas Clark

Benjamin Goldberg

unread,

Aug 16, 2002, 8:49:15 PM8/16/02

to Nicholas Clark, Nicholas Clark, Arthur Bergman, p5p, h...@crypt.org

Nicholas Clark wrote:
>
> On Fri, Aug 16, 2002 at 07:18:21PM -0400, Benjamin Goldberg wrote:
> > Nicholas Clark wrote:
> > [snip]
> > > For regexps such as /^__[A-Z]+__$/ that we know can't match a
> > > newline in the middle, we simply look to see if there is a newline
> > > in the PerlIO buffer between the file pointer and the end of
> > > buffer. If there is, we call the regexp engine, saying that the
> > > end of where it can match to is the last newline we know of in the
> > > PerlIO buffer.
> > >
> > > If there isn't a newline in the PerlIO buffer (or the regexp
> > > engine fails to find a match) we read more data from disk (or
> > > whatever) until we find a newline. (I'm envisaging reading a disk
> > > block, then scanning for the last newline, rather than some sort
> > > of fgets())
> >
> > This seems more like an optomization for the regex engine to do, not
> > the PerlIO using it.
>
> But that would mean that the regexp engine would have to know how to
> "get more" (in this case by doing IO)

I think you're misunderstanding me.

Suppose that $/ has qr/^__[A-Z]+__\n/m in it. PerlIO would simply read
a block, then (without looking at the contents of the buffer) call to
the regex engine, and if it fails, read another block. PerlIO shouldn't
bother checking if there's a newline in the buffer and whether or not
the regex pattern requires a newline -- the regex engine would handle
that kind of optomization. All(*) perlio should care about is whether
the pattern matched or didn't match.

If we want to, we can make the regex engine be able to say to itself,
"the pattern requires finding a newline, and the last newline in the
data is at position X; therefor, I shouldn't start any searches later
than at position X-minlen." ... this might help deal with PerlIO
better ... but it would be an optomization *independent* of whether
PerlIO is using regexen.

PerlIO shouldn't care about what kind of requirements a regex pattern
might have(*), and the regex engine should not care that it's being
called from PerlIO -- they should be entirely seperate things.

(*) PerlIO does of course care about that greedyness thing, which I
believe should be more properly described as being "prefix-free."

> If we do the decision making outside the regexp engine then the regexp
> engine needs no change from the present, and there's no recursion
> danger (PerlIO calling out to layers written in perl that in turn use
> regexps)

I'm not sure what you're saying here, but I'm fairly sure that I'm not
suggesting to make the regex engine especially different from how it is
at present.

> > > For regexps that we can't figure out whether they are greedy and
> > > could backtrack we fall back to reading until EOF and presenting
> > > the lot to the regexp engine. And a flag the knowledgeable could
> > > add to their regexp saying "I promise this regexp can't backtrack
> > > in a greedy way" or whatever is most useful to the PerlIO/regexp
> > > system would be useful.
> >
> > Turn the 'is it nongreedy' question around ... what happens if we
> > find out that the qr regex that was placed in $/ *is* greedy (but
> > can backtrack to match a lesser part)? Do we carp, croak, ignore
> > the problem (and possibly match wrong), or do it "right" by
> > deffering running the regex until EOF is seen?
>
> Don't know. I'd like to have the "do EOF" option at least available as
> this would be most useful for processing disk files that have
> historically been processed with undef $/ and a regexp on $_
> Now we could readline() them.
>
> But having the croak or carp available to anyone doing qr// $/ for a
> terminal or socket or pipe would also be handy. Else they are going to
> sit forever and wonder what's going on.
>
> And in that case, most minimally we only need to carp/croak/redo if we
> find that we did match after backtracking.

Finding out whether or not we backtracked would *definitely* require
changing how the regex engine works. Plus, that could only catch this
problem at readline() time.

Finding out whether or not the regex matches a prefix-free language
could be checked at the time we assign it to $/.

> I don't think I've answered your question properly. Sorry.

No, I think you have... I suppose that the best answer is: Offer all
four options, somehow.

Mark-Jason Dominus

unread,

Aug 17, 2002, 8:08:37 AM8/17/02

to Benjamin Goldberg, perl5-...@perl.org

Benjamin Goldberg:

> If we want to, we can make the regex engine be able to say to itself,
> "the pattern requires finding a newline, and the last newline in the
> data is at position X; therefor, I shouldn't start any searches later
> than at position X-minlen." ...

I believe it already does do that, or something close to it. Try some
-Dr searches, and look at the 'floating-anchored' discussion in the
output.

That Ilya is a very smart fellow.

Larry Wall

unread,

Aug 17, 2002, 4:21:54 PM8/17/02

to Mark-Jason Dominus, Benjamin Goldberg, perl5-...@perl.org

It would surely be nice if Perl 5 could take a cue from Perl 6 here
and attach regexes to "extensible" strings rather than continuing
to overload the long-known-to-be-lousy $/ interface. Even attaching
the regex to an I/O layer would be better than that.

Larry

Benjamin Goldberg

unread,

Aug 17, 2002, 8:25:59 PM8/17/02

to Larry Wall, Mark-Jason Dominus, perl5-...@perl.org

Attaching the regex to the I/O layer (instead of putting it in $/) is
essentially a means of having a per-filehandle input record seperator.

The problem with this is that legacy code may do 'local $/ = undef' and
expect it to have an effect.

Larry Wall

unread,

Aug 17, 2002, 9:13:52 PM8/17/02

to Benjamin Goldberg, Mark-Jason Dominus, perl5-...@perl.org

On Sat, 17 Aug 2002, Benjamin Goldberg wrote:
: Larry Wall wrote:
: >
: > It would surely be nice if Perl 5 could take a cue from Perl 6 here
: > and attach regexes to "extensible" strings rather than continuing
: > to overload the long-known-to-be-lousy $/ interface. Even attaching
: > the regex to an I/O layer would be better than that.
: >
: > Larry
:
: Attaching the regex to the I/O layer (instead of putting it in $/) is
: essentially a means of having a per-filehandle input record seperator.

Which is where it belongs in the long run, of course.

: The problem with this is that legacy code may do 'local $/ = undef' and

: expect it to have an effect.

For perfect compatibility, we *could* go as far as to tweak all
currently open filehandles when you modify $/. Or maybe we could just
emit a warning if you read from a new filehandle while $/ is not in
the default state, since legacy code won't generally be setting up
the irs on the handle. But I guarantee you that by the time we get
to Perl 6, we won't have that problem anymore, because $/ will be
totally gone. It would be good if Perl 5 could at least get to the
point of deprecating $/. Adding more functionality to $/ is not the
way to proceed, however. The less guesswork the p5-to-p6 translator
has to do, the better. Right now, deducing the intended scope of $/
modifications is much too close to the halting problem.

Larry

Nick Ing-Simmons

unread,

Aug 19, 2002, 4:43:00 AM8/19/02

to ni...@unfortu.net, p5p, Benjamin Goldberg

Nicholas Clark <ni...@unfortu.net> writes:
>On Tue, Aug 13, 2002 at 12:27:54AM -0400, Benjamin Goldberg wrote:
>> And I've been told that on Windows, select will only work properly with
>> TCP sockets, and not work right with pipes or disk files.
>
>Now that 5.8 can fake socketpair() on Windows, will 5.10 have pipes
>done as socketpairs, and so have select working on them?

That was the plan.

>
>Nicholas Clark