Back in December I asked a question about utf8 I/O. Leo responded
pointing me at the encoding filters. I then published a possible
implementation of PIO_utf8_read with a request for comments.
Since then I have been thinking about the testing and
implementation of I/O filters. It started with working out how to
create a suitable test set for the fragment I had written. That exposed
various shortcomings of the implementation and led me to think more
widely about the entire process. I am primarily thinking about file I/O,
but I can see no reason why this scheme cannot apply to any other form
of I/O.
1) The immediate result returned by the lowest level of a read operation
is an undifferentiated type which I am going to call "bytestream". This
type makes no assumptions about the internal encoding of its contents.
2) Trans-encoding from a bytestream string to a named charset and
encoding involves:
2.a) confirming that the bytestream converts to an integral number of
characters in the target encoding. The trans-encoding function should
return any trailing character fragments as a bytestream string. 
2.b) labelling the (possibly) truncated string with the target charset
and encoding.
3) I feel that it would be preferable if the read opcode specified N
characters to be read rather than N bytes. However to implement this the
PIO_*_read call would have to pass down the maximum byte size of a
character as well as the character count to the fundamental operation.
If N stays as bytes then the implementation will return a trans-encoding
dependent integral number of characters derived from no more than N
bytes of source data. It may also be desirable to limit the returned
string to no more than N bytes.
4) PIO_*_peek needs to include a parameter to specify the maximum byte
length of one character in the target charset / encoding so that the
fundamental operation can guarantee returning enough bytes to return a
character after trans-encoding.
5) Seeking through an encoding filter could be highly problematic.
Filters such as "utf8" that have a non-deterministic byte per character
ratio should politely refuse seeks.
6) Use of escape codes also adds a non-deterministic level to character
counts. The generation and normalisation of escape codes during
trans-encoding is very DWIM but the documents need to explicitly set a
policy on this behaviour. In general the use of HTML-style entity codes
is preferable to using C-style \nnn codes, as they can be normalised to
any encoding that supports them rather than requiring the programmer to
guess the original encoding.
7) The line buffered read function should be removed from the
fundamental operations and made into a filter layer similar to the "buf"
layer. There is no guarantee that the underlying data source is going to
conform to the line end notions of the current system and this should be
able to be compensated for.
8) There would be advantages to having a PIO_*_get_encoding function in
the I/O interface to allow enquiries about the returned encoding from
lower levels.
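To make point 2.a concrete, here is a minimal sketch in Python (not Parrot code) of a trans-encoding step for utf8 that decodes only whole characters and hands back any trailing fragment as raw bytes. The function name split_complete_utf8 is hypothetical and does not correspond to any existing PIO function.

```python
def split_complete_utf8(data):
    """Return (text, fragment): the decoded whole characters and the
    trailing bytes of an incomplete character, if any."""
    end = len(data)
    # The last character starts at most 4 bytes from the end, so walk
    # back until a byte that is not a continuation byte (10xxxxxx).
    for back in range(1, min(4, len(data)) + 1):
        b = data[-back]
        if b & 0xC0 != 0x80:
            if b >= 0xF0:
                need = 4
            elif b >= 0xE0:
                need = 3
            elif b >= 0xC0:
                need = 2
            else:
                need = 1
            if need > back:
                # The final character is truncated: keep it as a fragment.
                end = len(data) - back
            break
    # Malformed input still raises UnicodeDecodeError here.
    return data[:end].decode("utf-8"), data[end:]
```

The caller would prepend the returned fragment to the next block of bytes read from the lower layer before calling the function again.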
Okay some examples ...
$P0 = open "foo"
push $P0, 'ascii'
push $P0, 'by_line'
This would be a standard line oriented read/write.
$P0 = open "foo"
push $P0, 'utf16'
push $P0, 'by_line'
push $P0, 'utf8'
This could be used to read a Windows Unicode file while all internal
processing is done using the utf8 encoding. 'by_line' would need
initialisation with a non-default line end marker.
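As a rough illustration of how such a stack composes, the following Python sketch (not PIR) models each pushed layer as a generator wrapped around the one below it. The layer names decode_layer, by_line_layer and encode_layer are invented for this example, and the "\r\n" line end is an assumption about the input file.

```python
import codecs

def decode_layer(chunks, encoding):
    """Turn arbitrary byte chunks into text, buffering partial characters."""
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail

def by_line_layer(texts, line_end="\n"):
    """Re-chunk text into lines, using a configurable line end marker."""
    pending = ""
    for text in texts:
        pending += text
        while True:
            idx = pending.find(line_end)
            if idx < 0:
                break
            yield pending[:idx]
            pending = pending[idx + len(line_end):]
    if pending:
        yield pending

def encode_layer(lines, encoding):
    """Encode each line into the encoding used for internal processing."""
    for line in lines:
        yield line.encode(encoding)

def read_lines(chunks):
    """Stack the three layers: utf16 in, split by line, utf8 out."""
    text = decode_layer(chunks, "utf-16-le")
    lines = by_line_layer(text, "\r\n")
    return list(encode_layer(lines, "utf-8"))
```

Note that the line layer works on decoded text, so the line end marker is given in characters rather than bytes.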
$P0 = open "foo"
push $P0, 'ebcdic'
push $P0, 'ascii'
For mainframes.
$P0 = open "foo"
push $P0, 'encrypt_blowfish'
push $P0, 'adaptive_huffman'
push $P0, 'escaped_ascii'
push $P0, 'utf8'
You can figure it out .... 
Cheers,
Steve Gunnell
Yep.
> 2) Trans-encoding from a bytestream string to a named charset and
> encoding involves:
> 2.a) confirming that the bytestream converts to an integral number of
> characters in the target encoding. The trans-encoding function should
> return any trailing character fragments as a bytestream string. 
Or warn or throw an exception.
> 3) I feel that it would be preferable if the read opcode specified N
> characters to be read rather than N bytes.
Yep. If the user pushed a UTF8 input filter, it's pretty clear that he 
wants to deal with chars and not bytes.
> ... However to implement this the
> PIO_*_read call would have to pass down the maximum byte size of a
> character as well as the character count to the fundamental operation.
A utf8 input filter would read bytes one at a time from the underlying 
'buf' layer and convert N chars on the fly. A fixed-width encoding filter 
can just multiply N by bytes_per_char. I don't see a problem here.
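A minimal Python sketch of both cases, assuming the lower layer is any buffered object with a read(nbytes) method (io.BytesIO stands in for the 'buf' layer here). The function names are hypothetical and this is not the Parrot implementation.

```python
import io

def utf8_char_len(lead):
    """Byte length of the UTF-8 character starting with this lead byte."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead < 0xE0:
        return 2
    if 0xE0 <= lead < 0xF0:
        return 3
    if 0xF0 <= lead < 0xF8:
        return 4
    raise ValueError("not a UTF-8 lead byte: %#x" % lead)

def read_utf8_chars(src, n):
    """Read up to n characters, pulling only the bytes each one needs."""
    out = bytearray()
    for _ in range(n):
        lead = src.read(1)
        if not lead:
            break  # end of input
        rest = src.read(utf8_char_len(lead[0]) - 1)
        if len(rest) != utf8_char_len(lead[0]) - 1:
            raise EOFError("input ends inside a character")
        out += lead + rest
    return out.decode("utf-8")

def read_fixed_width_chars(src, n, bytes_per_char, encoding):
    """Fixed-width case: the byte count is simply n * bytes_per_char."""
    return src.read(n * bytes_per_char).decode(encoding)
```

Because the utf8 reader never consumes more bytes than the characters it returns, the lower layer's position always sits on a character boundary afterwards.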
> 4) PIO_*_peek needs to include a parameter to specify the maximum byte
> length of one character in the target charset / encoding so that the
> fundamental operation can guarantee returning enough bytes to return a
> character after trans-encoding.
Or PIO_peek is disabled for e.g. utf8 filters and returns an error.
> 5) Seeking through an encoding filter could be highly problematic.
> Filters such as "utf8" that have a non-deterministic byte per character
> ratio should politely refuse seeks.
Yep - same.
> 
> 6) Use of escape codes also adds a non-deterministic level to character
> counts. 
That's an entirely different problem and hasn't much in common with e.g. 
a utf8 input filter.
> 7) The line buffered read function should be removed from the
> fundamental operations and made into a filter layer similar to the "buf"
> layer. 
There is no line buffered read function in the layer_api. io_buf does 
exactly what you are proposing.
> 8) There would be advantages to having a PIO_*_get_encoding function in
> the I/O interface to allow enquiries about the returned encoding from
> lower levels.
I'm not sure about that.
> Cheers,
> 
> Steve Gunnell
leo
> 5) Seeking through an encoding filter could be highly problematic.
> Filters such as "utf8" that have a non-deterministic byte per character
> ratio should politely refuse seeks.
In theory it ought to be possible to seek back to any location you were
previously at (as returned by C<tell>).
For the specific case of UTF8, you can even tell whether an arbitrary
location in the stream is a valid point to seek to. That check needs only
a one-character look-ahead read (a bad plan on anything that is not a
file, mind you, unless you like blocking), or the error can be deferred
until the next read.
I don't know other variable width encodings well enough to know if any other
have equivalent abilities to self-synchronise the stream.
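The self-synchronising property of UTF-8 comes from continuation bytes always having the bit pattern 10xxxxxx, so a character boundary can be recognised from a single byte. A small Python sketch, assuming the data is otherwise well-formed UTF-8 (the function names are made up for illustration):

```python
def is_utf8_boundary(data, pos):
    """True if pos is a position where a character starts (or the end)."""
    if pos < 0 or pos > len(data):
        return False
    if pos == len(data):
        return True
    # Continuation bytes look like 10xxxxxx; anything else starts a character.
    return data[pos] & 0xC0 != 0x80

def resync_forward(data, pos):
    """Advance pos to the next character boundary at or after pos."""
    while pos < len(data) and data[pos] & 0xC0 == 0x80:
        pos += 1
    return pos
```

A seek validator built this way checks only the first byte at the target; reading the whole character would still be needed to catch a truncated or malformed sequence.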
Clearly as you say, fixed width encodings are fine, when dealing with an
entire file. But if you push a UCS-4 filter onto a stream after reading an
odd number of bytes, valid seek positions aren't going to be multiples of 4.
I guess a seek validator can be coded to know this, but it starts getting
fiddly. The other alternative would be that seek/tell locations are always
in bytes in the underlying stream, and purposefully ignore any many-to-1
filters atop them.
Nicholas Clark
In the case of mis-alignment I think it would be entirely reasonable to
give the user exactly what they asked for (if possible), or the filter
can throw an exception. It also sounds like we want to be able to
seek/tell both by character and by byte.
Cheers,
-J
--
It seems that "seek" is used in two ways:
    * returning to some previously identified point (including the start or
      end of the file)
    * moving a given number of characters relative to a
      known location
Clearly you can always do the first, just by using the underlying byte
offset without regard for the encoding. If you have a fixed number of bytes
per character then you can trivially do the second as well.
But if you have a variable-length encoding then you have to read through the
byte stream to get to the position you want; this might or might not be
desirable depending on the characteristics of the underlying stream.
Furthermore it makes (some) sense always to be able to seek *forwards* --
even on a tty device -- but not backwards.
So my suggestion is that we change the interface to "seek", and have
separate parameters for the "previously known position" and the
"character offset". The latter is obviously just an integer, but the
first is a black-box token -- maybe a PMC, but more likely a mangled
integer -- to ensure that the two args are distinguishable.
(Please excuse me as I discuss this in terms of a HLL rather than
Parrot...)
In other words, change this:
 $fpos = $io.tell();
 $io.seek(SEEK_SET, $fpos)
to this:
 ...
 $io.seek($fpos, 0)
or for brevity, just this:
 ...
 $io.seek($fpos)
Now SEEK_SET, SEEK_CUR and SEEK_END just become special cases of
"previously known positions". And I'm tempted to say that they should be
spelt "0", "undef" and "-1" respectively.
Thence it's fairly straightforward for the units of "seek" to be
whatever you find convenient: counting whole records, or lines of text,
or whatever.
Clearly this needs to be discussed in p6-lang, but having separated the
two parameter types, the filter can decide which it can implement, and
how.
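To show how the two parameter types stay distinguishable, here is a rough Python sketch of such an interface over a utf8 stream. The class names Position and Utf8Stream are invented, None plays the role of "undef" (the current position), and refusing backward character offsets is one possible choice a filter could make, not something settled above.

```python
import io

def _utf8_len(lead):
    """Byte length of the UTF-8 character starting with this lead byte."""
    if lead < 0x80:
        return 1
    if lead >= 0xF0:
        return 4
    if lead >= 0xE0:
        return 3
    if lead >= 0xC0:
        return 2
    raise ValueError("not at a character boundary")

class Position:
    """Black-box token for a previously known position."""
    def __init__(self, byte_offset):
        self._byte_offset = byte_offset

class Utf8Stream:
    def __init__(self, raw):
        self._raw = raw  # any seekable binary stream

    def tell(self):
        return Position(self._raw.tell())

    def read_char(self):
        lead = self._raw.read(1)
        if not lead:
            return ""  # end of input
        rest = self._raw.read(_utf8_len(lead[0]) - 1)
        return (lead + rest).decode("utf-8")

    def seek(self, pos=None, chars=0):
        """Go to a known position (None = stay here), then skip chars forward."""
        if chars < 0:
            raise ValueError("this filter only seeks forwards by character")
        if pos is not None:
            self._raw.seek(pos._byte_offset)
        for _ in range(chars):
            if not self.read_char():
                break
```

Returning to a token is a plain byte seek underneath, while the character offset is handled by reading through the stream, which is the cost described above for variable-length encodings.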
-Martin
Obviously there is another camp that feels that everything needs to be 
implemented in low level C and special PMCs for each HLL and funky APIs 
everywhere, and those are probably the same guys that think Parrot needs to 
worry about the issue you are discussing. I'm not in that camp, which is 
the reason I just watch nowadays. To me, the elegance of the pure VM has 
been lost on Parrot.
-Melvin
"Its 2006, do you know where your opcodes are?"
> Obviously there is another camp that feels that everything needs to be 
> implemented in low level C and special PMCs for each HLL and funky 
> APIs everywhere
- did I miss the high level C release?
- "special PMCs for each HLL": - just no: one general class that 
provides low-level basics, and:
   *if* HLL differs, well, of course it'll need some wrapper
- "APIs": - yes, usable by HLLs for interoperability, that's it
leo's 2 (EUR) c
and: "the elegance of the pure VM" hasn't been lost, it wasn't there 
yet, but we'll improve on that.