lsls(inout STR, in INT)
lsrs(inout STR, in INT)
and, of course, their appropriate permutations.
For those who haven't looked at bit.ops, lsl and lsr are logical shift
left and logical shift right. Doing this operation on strings (as bands,
bors and bxors do) would allow the full range of bit-manipulation to be
done quickly on strings-as-bitfields (though, of course, it's already
possible even without these operations).
I don't see shls and shrs being useful (or terribly meaningful), but
correct me if I'm wrong there.
Of course, there's the small matter that shifting left might grow your
string, but this should not be a major concern for Parrot. I don't think
shifting right should shrink the string.
--
Aaron Sherman <a...@ajs.com>
Senior Systems Engineer and Toolsmith
"It's the sound of a satellite saying, 'get me down!'" -Shriekback
Cool!
>Of course, there's the small matter that shifting left might grow your
>string, but this should not be a major concern for Parrot. I don't think
>shifting right should shrink the string.
I think left and right shift of strings should work the same way that
shifts on ints work--that is, it doesn't grow, bits just fall off
the end. You can decide whether to sign-extend or 0-extend, either
one's OK.
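A minimal sketch of the fixed-width behaviour described above, written in Python rather than as Parrot ops. The names lsl_bytes and lsr_bytes are hypothetical, and two things are assumptions rather than anything decided in this thread: the buffer is treated as one big-endian bit field (first byte most significant), and vacated bits are zero-filled.

```python
def lsl_bytes(data: bytes, count: int) -> bytes:
    """Logical shift left; the result has the same length as the input.

    Assumption: the buffer is one big-endian bit field. Bits pushed past
    the first byte are dropped, and zeros enter at the last byte.
    """
    if count < 0:
        raise ValueError("shift count must be non-negative")
    nbits = len(data) * 8
    mask = (1 << nbits) - 1          # keeps the result within the original width
    value = int.from_bytes(data, "big")
    return ((value << count) & mask).to_bytes(len(data), "big")


def lsr_bytes(data: bytes, count: int) -> bytes:
    """Logical shift right; zero-extends and never shrinks the buffer."""
    if count < 0:
        raise ValueError("shift count must be non-negative")
    value = int.from_bytes(data, "big")
    return (value >> count).to_bytes(len(data), "big")
```

With this choice the top bit of b"\x81\x01" simply falls off when shifted left by one, and the string stays two bytes long.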
--
Dan
--------------------------------------"it's like this"-------------------
Dan Sugalski even samurai
d...@sidhe.org have teddy bears and even
teddy bears get drunk
Have we[1] finished working out what a string is yet?
[1] And by "we", I mean "you"[2].
[2] And by "you", I mean "you" plural.
--
Bryan C. Warnock
bwarnock@(gtemail.net|raba.com)
Yep. Or at least we've worked out what Parrot's going to do--whether
everyone (or anyone) will be happy with it is a different matter
entirely.
Bitstring operations ought only be valid on binary data, though,
unless someone can give me a good reason why we ought to allow
bitshifting on Unicode. (And then give me a reasoned argument *how*,
too)
-----Original Message-----
From: Bryan C. Warnock [mailto:bryan....@raba.com]
Sent: Friday, April 30, 2004 9:08 AM
To: Dan Sugalski
Cc: Perl6 Internals List
Subject: Re: Bit ops on strings
On Thu, 2004-04-29 at 13:04, Dan Sugalski wrote:
> I think left and right shift of strings should work the same way that
> shifts on ints works--that is, it doesn't grow, bits just fall off
> the end. You can decide whether to sign-extend or 0-extend, either
> one's OK.
Have we[1] finished working out what a string is yet?
[1] And by "we", I mean "you"[2].
[2] And by "you", I mean "you" plural.
-------------------------------------------------------------------------
I have been following the discussion of strings on this list over the last few
weeks. It seems that there is somewhat of a disconnect in various definitions
of what is a "string". It seems as though there needs to be a hierarchy to
this with a little more clear definition. May I humbly propose the following:
1. String - low-level, abstract, base class (or in Perl6 terms role --
I think) which represents a "logically" contiguous series of Parrot Int
2. BinaryString - inherits from String, represents a "logically"
contiguous series of "bytes/bits"
3. TextString - inherits from String, represents a series of
characters (where character is an abstract thingy which in the concrete of a
specific "font" is a particular "glyph") -- I'm hand-waving at the concept of
"font" and "glyph" here, but, I think you get the idea.
4. RichTextString - inherits from TextString, represents a "String"
with various "properties" assigned to various substrings.
* Now, "String" should not have any methods/properties which do not
apply to both the definition of a "BinaryString" and a "TextString".
* Things like "LSR/LSL" should only apply to "BinaryString".
* Properties like "language", "encoding", etc. only apply to
"TextStrings". Same for methods like "substr" (though this could be a "String"
method like "slice" which gives a slice of either a binary or text string I
suppose).
Anyway, the important point in all this is we should not mix the concept of a
"TextString" and "BinaryString". These are two distinct things in abstract
terms. They should not be treated the same (IMHO).
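To make the proposal concrete, here is a rough sketch of the first three levels of the hierarchy in Python. Every class and method name below is illustrative only; nothing here corresponds to existing Parrot code. The point it tries to show is that the shift lives on BinaryString alone, while "language" lives on TextString alone, and "slice" is shared.

```python
class String:
    """Abstract base: a logically contiguous series of integers."""

    def __init__(self, items):
        self._items = list(items)

    def __len__(self):
        return len(self._items)

    def slice(self, start, stop):
        # Valid for both binary and text strings.
        return type(self)(self._items[start:stop])


class BinaryString(String):
    """A series of bytes; the only place bit shifts are defined."""

    def __init__(self, items):
        items = list(items)
        if any(not 0 <= b <= 0xFF for b in items):
            raise ValueError("BinaryString holds bytes only")
        super().__init__(items)

    def lsl(self, count):
        # Fixed-width shift over the whole buffer (big-endian is an assumption).
        raw = bytes(self._items)
        mask = (1 << (len(raw) * 8)) - 1
        value = (int.from_bytes(raw, "big") << count) & mask
        return BinaryString(value.to_bytes(len(raw), "big"))


class TextString(String):
    """A series of characters, with text-only properties such as language."""

    def __init__(self, text, language=None):
        super().__init__(ord(c) for c in text)
        self.language = language

    def text(self):
        return "".join(chr(c) for c in self._items)

    def slice(self, start, stop):
        return TextString(self.text()[start:stop], self.language)
```

Asking a TextString for lsl fails with an AttributeError, which is the separation argued for above.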
The information contained in this e-mail message is privileged and/or
confidential and is intended only for the use of the individual or entity
named above. If the reader of this message is not the intended recipient,
or the employee or agent responsible to deliver it to the intended
recipient, you are hereby notified that any dissemination, distribution or
copying of this communication is strictly prohibited. If you have received
this communication in error, please immediately notify us by telephone
(330-668-5000), and destroy the original message. Thank you.
> Bitstring operations ought only be valid on binary data, though,
> unless someone can give me a good reason why we ought to allow
> bitshifting on Unicode. (And then give me a reasoned argument *how*,
> too)
100% agree. If you want to play games with any other encoding, you may
proceed to write your own damn code ;-)
Let me start by saying that I have not drunk the Unicode Kool-Aid. I'm
not at all certain that the overhead required to do all of what Parrot
wants to do is warranted, BUT that's beside the point.
Parrot is doing things the way it's doing them, and the time for debate
was a few months or at latest weeks ago, as far as I can tell.
> I have been following the discussion of strings on this list over the last few
> weeks. It seems that there is somewhat of a disconnect in various definitions
> of what is a "string". It seems as though there needs to be a hierarchy to
> this with a little more clear definition. May I humbly propose the following:
>
> 1. String - low-level, abstract, base class (or in Perl6 terms role --
> I think) which represents a "logically" contiguous series of Parrot Int
You say that you think there should be a hierarchy, but you're just
throwing out broad concepts and applying them equally to terminology,
representation and implementation. As such, there is no good way to
respond to what you suggest, nor any way to determine how much work you
are proposing be performed in order to bend existing code to your
suggested paradigm.
A string is what Dan described in his various postings on strings. Nuff
said.
###################################
Aside from the rest of your message, and bearing no logical impact on
the rest of it, I'd like to call out:
> The information contained in this e-mail message is privileged and/or
> confidential and is intended only for the use of the individual or entity
> named above. If the reader of this message is not the intended recipient,
> or the employee or agent responsible to deliver it to the intended
> recipient, you are hereby notified that any dissemination, distribution or
> copying of this communication is strictly prohibited. If you have received
> this communication in error, please immediately notify us by telephone
> (330-668-5000), and destroy the original message. Thank you.
Need I point out http://www.goldmark.org/jeff/stupid-disclaimers/
> Now, we
> have people talking about doing "LSL/LSR" on "Strings". That is 100%
> inconsistent with that definition of a "String".
Not at all, and keep in mind that I didn't propose this out of the blue.
"bands", "bxors" and "bors" are existing string ops and have been for a
long time. I was just proposing rounding out the bit operator set. Go
check out the ops/bit.ops in CVS. It's even well documented.
I don't think Dan was being at all contradictory or inconsistent in his
string postings, given that those ops were already there. I may have
problems with the extent to which Parrot embraces abstraction, but
inconsistency is not one of them.
> 1. String - low-level, abstract, base class (or in Perl6 terms role --
> I think) which represents a "logically" contiguous series of Parrot Int
>
> 2. BinaryString - inherits from String, represents a "logically"
> contiguous series of "bytes/bits"
>
> 3. TextString - inherits from String, represents a series of
> characters (where character is an abstract thingy which in the
> concrete of a
> specific "font" is a particular "glyph") -- I'm hand-waving at the
> concept of
> "font" and "glyph" here, but, I think you get the idea.
Please, let's not ever mix the word "binary" and "string" in the same
sentence again! One language already made this mistake: the Haskell
standard library has many binary file reading/writing functions
operating with "strings" (which, in Haskell, is a list of characters or
[Char], where character = thingys which Jeff defined quite nicely),
rather than returning a "stream of bytes" ([Word8], in Haskell
terminology), which is binary data--what they should've returned.
Of course Parrot should have a function to reinterpret something of a
string type as raw binary data and vice versa, but don't mix binary
data with strings: they are completely different types, and raw binary
data should never be able to be put into a string register. Maybe some
blurring of binary data/strings should happen at the Perl layer, but
Parrot should keep them as distinct as possible, IMHO.
--
% Andre Pang : trust.in.love.to.save
I'm trying to make sure that keeping them separate is possible, but
it's important for everyone to remember that we're limited in what we
can do.
Parrot *can't* dictate semantics. That's not what we get to do. We're
constrained in what we can make happen by the semantics of the
languages we've declared as primary support languages, regardless of
whether we think those semantics are a good idea or not. (Heck,
regardless of whether the original designers of the languages think
the semantics were a good idea or not, since an installed base beats
hindsight every time)
So. Strings and byte buffers as semantically different? Good idea,
can't do it. Unicode everywhere universally? Arguably a good idea,
can't do it.
All we can do is make the best of what we can with the semantics we
have to provide, and layer on as much safety as is possible to let
folks transition over, debug better, and work cleaner without
breaking what has been declared to be The Answer.
Assuming I've an idea what's going on here (it's a bit difficult to
tease this apart with the quoting--I do so hate outlook) you're both
right, so if there's any shouting it ought be at me.
Parrot, at the very low levels, makes no distinction between strings
and buffers--as far as it's concerned they're the same thing, and
either can hang off an S register. (Ultimately, when *I* talk of
strings I mean "A thing I can hang off an S register", though I'm in
danger of turning into Humpty Dumpty here) That's part of the
problem. There are already bitwise operations on S-register things in
the core, which is OK.
The bitshift operations on S-register contents are valid, so long as
the thing hanging off the register supports it. Binary data ought
allow this. Most 8-bit string encodings will have to support it
whether it's a good idea or not, since you can do it now. If Jarkko
tells me you can do bitwise operations with unicode text now in Perl
5, well... we'll support it there, too, though we shan't like it at
all.
I *think* most of the variable-width encodings, and the character
sets that sit on top of them, can reasonably forbid this.
Or at least my unclarity is becoming clearer. :)
> >
>> The bitshift operations on S-register contents are valid, so long as
>> the thing hanging off the register support it. Binary data ought
>> allow this. Most 8-bit string encodings will have to support it
>> whether it's a good idea or not, since you can do it now. If Jarkko
>> tells me you can do bitwise operations with unicode text now in Perl
>> 5, well... we'll support it there, too, though we shan't like it at
>> all.
>>
>> I *think* most of the variable-width encodings, and the character
>> sets that sit on top of them, can reasonably forbid this.
>
><mode="dave barry">
> Since "text" "strings" are a proper "subset" of a "binary" "buffer",
>which is really what the "string" "registers" really "are", what we've
>"logically" got is this:
></mode>
>
>LAYER 1 2 3
> +-- Text Ops --- (Hosted Language)
> SREG --+
> +-- Bin Ops --- (Hosted Language)
>
>or maybe this:
>
> SREG --- Bin Ops ----------- (Hosted Language)
> +-- Text Ops --- (Hosted Language)
>
>where semantics are found in Layers 2 and 3. (Layer 3 could also be
>merged.)
Yep, that's basically it.
>Now I think that's more or less what Parrot has, right? Except that the
>Layer 2 semantics are tracked (and locked in?) at Layer 1? (To prevent
>the aforementioned bit-shifting of WTF strings.)
If you want, you could think of the S-register strings as mini-PMCs.
The encoding and charset stuff (we'll ignore language semantics for
the moment) are essentially small vtables that hang off the string,
and whatever we do with it mostly goes through those vtable functions.
Which sort of argues for putting the bitstring stuff in there
somewhere as well. (And may well argue for MMD on string operations,
but I think that makes my head hurt so I'm not going there right now)
Yeah, there's that bulk thing. I'd thought about making strings just
PMCs way back at the beginning, but there's that whole regex thing--I
knew that for speed, the code'd ultimately have to say something like
"Hey, you, string! Make yourself UTF-32, get normalized, and gimme
your buffer pointer dammit!" Without that low-level access we aren't
going to be able to get the speed we need, which kinda limits the
abstraction there.
> If you want, you could think of the S-register strings as mini-PMCs.
> The encoding and charset stuff (we'll ignore language semantics for
> the moment) are essentially small vtables that hang off the string,
I think it's the cleanest way of implementing all that string mess. Want
some {byte, code point, grapheme} string length: call
string->str_vtable->length. Want to SHR a Unicode string: the vtable
throws an exception. The question is: which of all these vtables depend
on the string and which are "environment" things. If there are a
reasonable number of the former, a vtable is the way to go IMHO.
And there are of course some binary ops like concat, but not many.
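A small sketch of the vtable idea, in Python for brevity. PString, StringOpError and the two vtables are invented for illustration and are not Parrot's actual structures; the dictionaries stand in for the small per-encoding function tables hanging off each string.

```python
class StringOpError(Exception):
    """Raised when an encoding's vtable refuses an operation."""


def _byte_length(buf):
    return len(buf)


def _utf8_length(buf):
    # Length in code points, not bytes.
    return len(buf.decode("utf-8"))


def _shr_bytes(buf, count):
    # Fixed-width logical shift right over the whole buffer.
    return (int.from_bytes(buf, "big") >> count).to_bytes(len(buf), "big")


def _shr_forbidden(buf, count):
    raise StringOpError("shift is not supported for this encoding")


# One small table per encoding; the string just carries a reference to it.
BINARY_VTABLE = {"length": _byte_length, "shr": _shr_bytes}
UTF8_VTABLE = {"length": _utf8_length, "shr": _shr_forbidden}


class PString:
    def __init__(self, buf, vtable):
        self.buf = buf
        self.vtable = vtable

    def length(self):
        return self.vtable["length"](self.buf)

    def shr(self, count):
        return PString(self.vtable["shr"](self.buf, count), self.vtable)
```

The same two bytes report length 2 under the binary vtable and length 1 under the UTF-8 one, and only the binary vtable lets the shift through.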
leo
> At 2:57 AM +1000 5/1/04, Andre Pang wrote:
>> Of course Parrot should have a function to reinterpret something of a
>> string type as raw binary data and vice versa, but don't mix binary
>> data with strings: they are completely different types, and raw
>> binary data should never be able to be put into a string register.
>> Maybe some blurring of binary data/strings should happen at the Perl
>> layer, but Parrot should keep them as distinct as possible, IMHO.
>
> I'm trying to make sure that keeping them separate is possible, but
> it's important for everyone to remember that we're limited in what we
> can do.
>
> Parrot *can't* dictate semantics. That's not what we get to do.
But your plan seems to be very much dictating semantics--treating a
whole class of reasonable string operations as "in that case, punt and
throw an exception". And it's not clear that the semantics it is
dictating in fact match any of the target languages (or in fact, any
existing language at all). The at-runtime association of character
set/encoding/language, and the semantics it implies, is what I'm
referring to here.
JEff
That's why it's overridable. I fully expect most languages will do so
by default, but the option is there to leave the exceptions on as a
debugging aid.
> And it's not clear that the semantics it is dictating in fact match
>any of the target languages (or in fact, any existing language at
>all). The at-runtime association of character set/encoding/language,
>and the semantics it implies, is what I'm referring to here.
Yep, but with the exceptions disabled things'll act the way they should.
We can and I don't like it at all :-) What they basically operate on
are the internal UTF-8 bit patterns, in other words utter crapola from
the viewpoint of traditional "bit strings". Especially "fun" was
getting the semantics of ~ to make any sense whatsoever. None of it is
anything I want to propagate anywhere.
Please correct me if I'm wrong here, but I'm going to lay out my
understanding as a set of assertions:
* Parrot will be able to convert any encoding to any other
encoding
* though, some conversions will result in an exception, that's
still a defined behavior
* We've agreed that only raw binary 8-bit strings make sense for
bit vector operations
So it seems to me that the "obvious" way to go is to have all bit-s
operations first convert to raw bytes (possibly throwing an exception)
and then proceed to do their work.
This means that UTF-8 strings will be handled just fine, and (as I
understand it) some subset of Unicode-at-large will be handled as well.
In other words, the burden goes on the conversion functions, not on the
bit ops.
It's not that it's going to be meaningful in the general case, but if
you have code like:
sub foo() { return "\x01"+|"\x02" }
I would expect to get the bit-string, "\x03" back even though strings
may default to Unicode in Perl 6.
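Roughly what I have in mind, sketched in Python instead of Parrot ops. The function names are made up, Latin-1 stands in for "raw bytes" because it maps code points 0x00-0xFF one-to-one onto bytes, and padding the shorter operand with zero bytes is my assumption, not something settled here.

```python
def to_raw_bytes(s: str) -> bytes:
    """Convert to raw bytes, failing if any code point is above 0xFF."""
    try:
        return s.encode("latin-1")
    except UnicodeEncodeError as exc:
        # The failure is a conversion error, not a special bit-op error.
        raise ValueError("code point above 0xFF in bit-string operand") from exc


def bor_strings(a: str, b: str) -> str:
    """Bitwise OR of two strings after converting both to raw bytes."""
    ra, rb = to_raw_bytes(a), to_raw_bytes(b)
    width = max(len(ra), len(rb))
    # Assumption: the shorter operand is padded with zero bytes on the right.
    ra, rb = ra.ljust(width, b"\x00"), rb.ljust(width, b"\x00")
    return bytes(x | y for x, y in zip(ra, rb)).decode("latin-1")
```

So bor_strings("\x01", "\x02") gives "\x03", and an operand containing "\u0100" fails during conversion, before any bit-op runs.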
You could put this on the shoulders of the client language (by saying
that the operands must be pre-converted), but that seems to be contrary
to Parrot's usual MO.
Let me know. I'm happy to do it either way, and I'll look at modifying
the other bit-string operators if they don't conform to the decision.
If these conversions croak if there are code points beyond \x{ff}, I'm
fine with it. But trying to mix \x{100} or higher just leads into silly
discontinuities (basically we would need to decide on a word width, and
I think that would be a silly move).
> This means that UTF-8 strings will be handled just fine, and (as I
Please don't mix encodings and code points. That strings might be
serialized or stored as UTF-8 should have no consequence with bitops.
> understand it) some subset of Unicode-at-large will be handled as well.
> In other-words, the burden goes on the conversion functions, not on the
> bit ops.
>
> It's not that it's going to be meaningful in the general case, but if
I'd rather have meaningful results.
> you have code like:
>
> sub foo() { return "\x01"+|"\x02" }
Please consider what happens when the operands have code points beyond 0xff.
> I would expect the get the bit-string, "\x03" back even though strings
> may default to Unicode in Perl 6.
Of course. But I would expect a horrible flaming death for
"\x{100}"|+"\x02".
> You could put this on the shoulders of the client language (by saying
> that the operands must be pre-converted, but that seems to be contrary
> to Parrot's usual MO.
>
> Let me know. I'm happy to do it either way, and I'll look at modifying
> the other bit-string operators if they don't conform to the decision.
>
--
Jarkko Hietaniemi <j...@iki.fi> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'. It is 'dead'." -- Jack Cohen
As for codepoints outside of \x00-\xff, I vote exception. I don't think
there's any other logical choice, but I think it's just an encoding
conversion exception, not a special bit-op exception (that's arm-waving,
I have not looked at Parrot's exception model yet... miles to go...)
> > This means that UTF-8 strings will be handled just fine, and (as I
>
> Please don't mix encodings and code points. That strings might be
> serialized or stored as UTF-8 should have no consequence with bitops.
What I meant was that UTF-8 IS going to be represented in a way that
will guarantee you won't get an exception when trying to do bit-ops. All
bets are off for many other encodings. While you're right that you might
get lucky, that wasn't really the point I was making. Many languages
(Perl included, I think) are going to encode strings as UTF-8 by
default, and this means that in the general case, we should not expect
exceptions to be thrown around any time we do a bit-op and 'A'|'B' will
still be 'C' :-)
> Of course. But I would expect a horrible flaming death for
> "\x{100}"|+"\x02".
Well, if you consider a string conversion exception to be horrible
flaming death, then I hate to see what you do with a divide-by-zero ;-)
None of your response sounds overly scary to me, so I'll start looking
at what Parrot does NOW for bit-string-ops and see if it needs to mutate
to fit this model. Then I'll add in the rest. Then I get to see what
evil Dan and Leo perform upon my patch ;-)
>> So it seems to me that the "obvious" way to go is to have all bit-s
>> operations first convert to raw bytes (possibly throwing an exception)
>> and then proceed to do their work.
>
> If these conversions croak if there are code points beyond \x{ff}, I'm
> fine with it. But trying to mix \x{100} or higher just leads into
> silly
> discontinuities (basically we would need to decide on a word width, and
> I think that would be a silly move).
Just FYI, the way I implemented bitwise-not so far, was to bitwise-not
code points 0x{00}-0x{FF} as uint8-sized things, 0x{100}-0x{FFFF} as
uint16-sized things, and > 0x{FFFF} as uint32-sized things (but then
bit-masking them with 0xFFFFF to make sure that they fell into a valid
code point range). That's pretty arbitrary, but if you bitwise-not as
though everything were 32-bits wide, you'll end up with a "string"
containing no assigned code points at all (they'll all be > 0x10FFFF).
But from a text point of view, bitwise-not on a string isn't a sensible
operation no matter how you slice it (that is, even for 0x{00}-0x{FF}),
so one flavor of arbitrary is just about as good as any other. We could
also make anything > 0x{FF} map to either 0x{00} or 0x{FF}, or mask it
with 0xFF to push it into that range. It's all pretty meaningless, as
text transformations go, and I can't imagine anyone using it for
anything, except maybe weak encryption.
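For reference, the width-dependent rule above can be sketched like this. It is a Python paraphrase of the description, not the actual Parrot C code, and the function names are hypothetical.

```python
def bnot_code_point(cp: int) -> int:
    """Bitwise-not a code point at a width chosen by its magnitude."""
    if cp <= 0xFF:
        return ~cp & 0xFF            # treated as a uint8
    if cp <= 0xFFFF:
        return ~cp & 0xFFFF          # treated as a uint16
    wide = ~cp & 0xFFFFFFFF          # treated as a uint32 ...
    return wide & 0xFFFFF            # ... then masked back into code point range


def bnot_code_points(cps):
    return [bnot_code_point(cp) for cp in cps]
```

For example, 0x41 becomes 0xBE and 0x100 becomes 0xFEFF. Without the final mask, 0x10000 would become 0xFFFEFFFF, far outside the Unicode range; with it, the result is 0xEFFFF.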
>> This means that UTF-8 strings will be handled just fine, and (as I
>
> Please don't mix encodings and code points. That strings might be
> serialized or stored as UTF-8 should have no consequence with bitops.
Exactly. And also realize that if you bitwise-not (or shift or
something similar) the bytes of a UTF-8 serialization of something, the
result isn't going to be valid UTF-8, so you'd be hard-pressed to lay
text semantics down on top of it.
>> understand it) some subset of Unicode-at-large will be handled as
>> well.
>> In other-words, the burden goes on the conversion functions, not on
>> the
>> bit ops.
>>
>> It's not that it's going to be meaningful in the general case, but if
>
> I'd rather have meaningful results.
Exactly--and, meaningful operations to begin with.
I'm beginning to wonder if we're going to be square-rooting strings,
and taking the array-th root of a hash.... :)
JEff
> Just FYI, the way I implemented bitwise-not so far, was to bitwise-not
> code points 0x{00}-0x{FF} as uint8-sized things, 0x{100}-0x{FFFF} as
> uint16-sized things, and > 0x{FFFF} as uint32-sized things (but then
> bit-masking them with 0xFFFFF to make sure that they fell into a valid
> code point range). That's pretty arbitrary, but if you bitwise-not as
> though everything were 32-bits wide, you'll end up with a "string"
> containing no assigned code points at all (they'll all be > 0x10FFFF).
> But from a text point of view, bitwise-not on a string isn't a sensible
> operation no matter how you slice it (that is, even for 0x{00}-0x{FF}),
> so one flavor of arbitrary is just about as good as any other. We could
> also make anything > 0x{FF} map to either 0x{00} or 0x{FF}, or mask it
> with 0xFF to push it into that range. It's all pretty meaningless, as
> text transformations go, and I can't imagine anyone using it for
> anything, except maybe weak encryption.
I think Dan and I were both thinking in terms of bit-vector operations
on byte-streams for any purpose that would require such a beast. In
Perl, you have the vec function to make this slightly easier.
This is one of those places where thinking about strings as text is
highly misleading. They're used for an awful lot more.
> Exactly. And also realize that if you bitwise-not (or shift or
> something similar) the bytes of a UTF-8 serialization of something, the
> result isn't going to be valid UTF-8, so you'd be hard-pressed to lay
> text semantics down on top of it.
How are you defining "valid UTF-8"? Is there a codepoint in UTF-8
between \x00 and \xff that isn't valid? Is there a reason to ever do
bitwise operations on anything other than 8-bit codepoints?
> I'm beginning to wonder if we're going to be square-rooting strings,
> and taking the array-th root of a hash.... :)
Strings are not numbers, but there's a heck of a lot of code out there
that treats existing strings as bit-vectors (note: bit vectors are not
numbers either), and that code needs to be supported, no?
Now, shift operations aren't usually part of the package, but I figured
that as long as we were going to have the rest of the bit-manipulators,
finishing off the set would be of value.
More to the point, I said all of this at the beginning of this thread.
You should not, at this point, be confused about the scope of what I
want to do, as it was very narrowly and clearly defined up-front.
Like, half of them? \x80 .. \xff are all invalid as UTF-8.
> bitwise operations on anything other than 8-bit codepoints?
I am very confused. THIS IS WHAT WE ALL SEEM TO BE SAYING. BITOPS ONLY
ON EIGHT-BIT DATA. AM I WRONG?
>
> On Sat, 2004-05-01 at 14:18, Jeff Clites wrote:
>
>> Exactly. And also realize that if you bitwise-not (or shift or
>> something similar) the bytes of a UTF-8 serialization of something,
>> the
>> result isn't going to be valid UTF-8, so you'd be hard-pressed to lay
>> text semantics down on top of it.
>
> How are you defining "valid UTF-8"? Is there a codepoint in UTF-8
> between \x00 and \xff that isn't valid? Is there a reason to ever do
> bitwise operations on anything other than 8-bit codepoints?
If you're dealing in terms of code points, then the UTF-8 encoding (or
any other) has nothing to do with it.
If you are dealing in terms of bytes, then there are byte sequences
which don't encode any code point in the UTF-8 encoding. By "valid
UTF-8", I'm referring to the definition of that encoding (and I should
have said, "well-formed")--see section 3.9, item D36 of the Unicode
Standard. In particular, bytes 0xC0, 0xC1, and 0xF5-0xFF cannot occur
in UTF-8.
But if you're speaking in terms of code points, that's not relevant,
but then neither is the encoding.
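A quick way to see the byte-level point, using Python's strict UTF-8 decoder as the well-formedness check. The helper name is hypothetical and the example is only an illustration.

```python
def is_well_formed_utf8(raw: bytes) -> bool:
    """True if the bytes decode under Python's strict UTF-8 decoder."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


original = "A".encode("utf-8")                 # the single byte 0x41
flipped = bytes(~b & 0xFF for b in original)   # bitwise-not of each byte: 0xBE

# Bytes that can never appear anywhere in well-formed UTF-8.
never_valid = [0xC0, 0xC1] + list(range(0xF5, 0x100))
```

Here "original" is well-formed but "flipped" is a lone continuation byte, so text semantics cannot be laid back on top of it.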
> More to the point, I said all of this at the beginning of this thread.
> You should not, at this point, be confused about the scope of what I
> want to do, as it was very narrowly and clearly defined up-front.
And yet, I am confused. You said near the beginning of the thread:
> On Fri, 2004-04-30 at 10:42, Dan Sugalski wrote:
>
>> Bitstring operations ought only be valid on binary data, though,
>> unless someone can give me a good reason why we ought to allow
>> bitshifting on Unicode. (And then give me a reasoned argument *how*,
>> too)
>
> 100% agree. If you want to play games with any other encoding, you may
> proceed to write your own damn code ;-)
Given that, I'm not sure how UTF-8 is coming into the picture.
JEff
>>> Dan Sugalski <d...@sidhe.org> 04/30/04 10:25 PM >>>
Heh, damn Ken Thompson and his placemat!
I am too new to UCS and UTF-8, and had thought it was always 8-bit. I
stand corrected, having read up on the UTF-8 and Unicode FAQ.
Jeff, yeah I have to take back my statement. If Perl defaults to UTF-8,
then it's not a valid assumption that a UTF-8 input string won't throw
an exception. I still think that's ok, and better than
representation-expanding to the larger representation and doing the
bit-op in that, since that means that bit-vectors would have to be
valid in enum_stringrep_one, _two and _four as sort of alternate
datastructures. I don't think we want to go there.
For everything else, as Jeff correctly points out, this has nothing to
do with encoding, except in the sense that the default encoding in a
language like Perl 6 (only one example) dictates what representation you
will have to expect to be the common case.
> > bitwise operations on anything other than 8-bit codepoints?
>
> I am very confused. THIS IS WHAT WE ALL SEEM TO BE SAYING. BITOPS ONLY
> ON EIGHT-BIT DATA. AM I WRONG?
No, it's not, and could you please not get emotional about this? It's
what you, Dan and I have been saying, but I was responding to Jeff who
said:
"Just FYI, the way I implemented bitwise-not so far, was to
bitwise-not code points 0x{00}-0x{FF} as uint8-sized things,
0x{100}-0x{FFFF} as uint16-sized things, and > 0x{FFFF} as
uint32-sized things (but then bit-masking them with 0xFFFFF to
make sure that they fell into a valid code point range)."
It was kind of important that I deal with the fact that I was proposing
a very different behavior for bit-shifting than exists currently for
boolean operations, I thought.
The question becomes should I CHANGE the existing bit-ops so that they
don't work on representations in two or four bytes for symmetry?
If this continues to be so contentious, I'm tempted to agree with the
nay-sayers and say that Parrot shouldn't do bit-vectors on strings, and
we should just implement a bit-vector class later on. Perl will just
have to suffer the overhead of translation. This just IS NOT important
enough to waste this many brain cells on.
I apologize for using UPPERCASE. My only excuse is that it was not
personally aimed at you: I have been griping about these things for
quite some time now, and I tend to pull out the clue-by-four rather
quickly these days, out of sheer frustration.
On May 1, 2004, at 4:54 PM, Aaron Sherman wrote:
> If Perl defaults to UTF-8
People need to realize also that although UTF-8 is a pretty good
interchange format, it's a really bad in-memory representation. This is
for at least 2 related reasons: (1) To get to the N-th logical
character, you need to start all the way at the beginning and scan
forward, so your access time is O(N). For instance, the 1000th
character of a string might start anywhere between byte 1000 and byte
4000. (2) Even once you've located the right byte position, you need to
do computational work to unwind the bytes into the value they
represent. A third reason is that Japanese text will take up three
bytes per character.
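To illustrate reason (1), here is a small sketch of what finding the N-th character looks like on a UTF-8 buffer. It assumes well-formed input, and utf8_char_offset is a made-up name.

```python
def utf8_char_offset(raw: bytes, index: int) -> int:
    """Byte offset at which character number `index` (0-based) starts.

    Assumes `raw` is well-formed UTF-8. There is no shortcut: every byte
    before the target has to be examined, so the cost is O(N).
    """
    seen = 0
    for pos, byte in enumerate(raw):
        if byte & 0xC0 != 0x80:      # anything but a continuation byte starts a character
            if seen == index:
                return pos
            seen += 1
    raise IndexError("character index out of range")
```

With a fixed-width representation the same lookup is a single multiplication, index times the unit size.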
> I still think that's ok, and better than
> representation-expanding to the larger representation and doing the
> bit-op in that, since that means that bit-vectors would have to be
> valid in enum_stringrep_one, _two and _four as sort of alternate
> datastructures. I don't think we want to go there.
I'm not sure it's relevant, since I think Dan's completely changing
everything, but my original intention was that rep_one v. two v. four
were just different ways of storing integers in optimally compact ways.
There wasn't supposed to be any externally-visible behavior difference
between them. For instance, you might end up with something in rep_four
which could have been represented in rep_one--if so, you'd be wasting a
bit of memory, but you should never be able to tell, in terms of the
API. With rep_one, you _know_ all of the numbers in your list have to
be < 256; with rep_four, they might be, but you'd have to check (and if
you check, you should downsize to rep_one probably). Those three
representation choices were just a space-based optimization--they
weren't supposed to lead to different behaviors.
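A sketch of that intent, in Python rather than the real C. The function names are invented, and little-endian packing is an arbitrary choice for the example. The widths 1, 2 and 4 correspond to rep_one, rep_two and rep_four.

```python
def smallest_rep(code_points):
    """Smallest unit width in bytes (1, 2 or 4) that holds every value."""
    top = max(code_points, default=0)
    if top < 0x100:
        return 1
    if top < 0x10000:
        return 2
    return 4


def pack(code_points, width=None):
    """Store the integers at the given width, or the smallest one that fits."""
    if width is None:
        width = smallest_rep(code_points)
    return width, b"".join(cp.to_bytes(width, "little") for cp in code_points)


def unpack(width, buf):
    """Recover the integers; the result does not depend on the width used."""
    return [int.from_bytes(buf[i:i + width], "little")
            for i in range(0, len(buf), width)]
```

Packing [0x41, 0x42] at width 4 wastes six bytes compared with width 1, but unpack returns the same list either way, which is the "no visible difference" property.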
> If this continues to be so contentious, I'm tempted to agree with the
> nay-sayers and say that Parrot shouldn't do bit-vectors on strings, and
> we should just implement a bit-vector class later on.
Yes, that does have the benefit of
clarity/simplicity/non-contentiousness.
> Perl will just have to suffer the overhead of translation.
Yep, though actually there's no reason why there couldn't be two
distinct PMCs, which just happen to look the same from a Perl5 point of
view. There wouldn't have to be a translation overhead, necessarily (at
least, in cases where you don't do both binary-ish and text-ish
operations on the same scalar).
It tends to be easier to have a distinction, and pretend it's not there
in certain circumstances, than to lack a distinction, and try to make
everything work out sensibly (text operations on binary data, binary
operations on textual data).
JEff
> I'm not sure it's relevant, since I think Dan's completely changing
> everything, but my original intention was that rep_one v. two v. four
> were just different ways of storing integers in optimally compact ways.
> There wasn't supposed to be any externally-visible behavior difference
> between them. For instance, you might end up with something in rep_four
> which could have been represented in rep_one--if so, you'd be wasting a
> bit of memory, but you should never be able to tell, in terms of the
> API. With rep_one, you _know_ all of the numbers in your list have to
> be < 256; with rep_four, they might be, but you'd have to check (and if
> you check, you should downsize to rep_one probably). Those three
> representation choices were just a space-based optimization--they
> weren't supposed to lead to different behaviors.
I may be misremembering what I've read here but I thought that Dan said
that for variable length encodings (such as shift-JIS) parrot would store
the byte(s) in memory in constant size 16 or 32 bit integers, rather than
the (external) variable length byte sequence, as this gives O(1) random
access, and avoids much coding pain.
However, he made no explicit comment about UTF8 (just another variable
length encoding), which would imply that parrot will be storing UTF8 in
this way. I was assuming instead that internally the UTF8 will immediately
get converted to either UTF-32BE or UTF-32LE (or 16 bit or 8 bit if possible),
as this avoids worrying about code points >= 0x200000 cropping up in UTF8
(which if I have it right are 5 or 6 bytes long, so would bust Dan's scheme).
The assumption is that UTF8 input will be validity checked at input time
(at least to do the code point splitting), so writing the values out as
32 bit quantities there and then will take virtually no more CPU, but
save lots later.
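
As a hedged illustration of "validate once, write out 32 bit values there and then", the sketch below leans on Python's own strict UTF-8 decoder rather than anything Parrot does. Note that this decoder follows the modern definition of UTF-8 (nothing above U+10FFFF, no surrogates), which is an assumption on my part about what "valid" should mean. The function names are invented.

```python
def utf8_to_utf32(raw: bytes) -> bytes:
    """Validate UTF-8 input and return it as little-endian UTF-32.

    bytes.decode() rejects malformed sequences with UnicodeDecodeError,
    so validation and code point splitting happen in the same pass.
    """
    text = raw.decode("utf-8")
    # The explicit "-le" codec writes no byte order mark.
    return text.encode("utf-32-le")


def code_point_at(buf: bytes, i: int) -> int:
    """O(1) random access: character i lives at bytes 4*i .. 4*i + 3."""
    if not 0 <= i < len(buf) // 4:
        raise IndexError(i)
    return int.from_bytes(buf[4 * i:4 * i + 4], "little")


buf = utf8_to_utf32("a\u65e5b".encode("utf-8"))
print(hex(code_point_at(buf, 1)))  # 0x65e5
```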
Or is this now all gone, because there will be Unicode everywhere and
strings will get converted at input IO time?
Nicholas Clark
Yup. UTF8 is Just another variable-width encoding. Do anything with
it and we convert it to a fixed-width encoding, in this case UTF32.
> Yup. UTF8 is Just another variable-width encoding. Do anything with
> it and we convert it to a fixed-width encoding, in this case UTF32.
Does this mean that we won't be verifying the validity of UTF8 on input?
And instead pitching exceptions at conversion time?
Nicholas Clark
No. We verify validity of strings whenever they're created or shift
formats, so in this case when we go from binary->utf8 on input we'll
do the validation and pitch a fit if there's a problem.
> At 12:30 PM +0100 5/25/04, Nicholas Clark wrote:
>>
>> I may be misremembering what I've read here but I thought that Dan said
>> that for variable length encodings (such as shift-JIS) parrot would store
>> the byte(s) in memory in constant size 16 or 32 bit integers, rather than
>> the (external) variable length byte sequence, as this gives O(1) random
>> access, and avoids much coding pain.
>>
>> However, he made no explicit comment about UTF8 (just another variable
>> length encoding), which would imply that parrot will be storing UTF8 in
>> this way.
>
> Yup. UTF8 is Just another variable-width encoding. Do anything with it
> and we convert it to a fixed-width encoding, in this case UTF32.
This has the unfortunate side-effect of wasting 50-75% of the storage
space in the common cases, of course.
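
One way to read the 50-75% figure, sketched under the assumption that the comparison is against the narrowest fixed width (1, 2 or 4 bytes per character) that could hold the string. The helper names are made up.

```python
def fixed_width_needed(text: str) -> int:
    """Narrowest fixed width, in bytes per character, that holds `text`."""
    top = max(map(ord, text), default=0)
    if top < 0x100:
        return 1
    if top < 0x10000:
        return 2
    return 4


def utf32_waste(text: str) -> float:
    """Fraction of a 4-bytes-per-character buffer that is padding."""
    return 1 - fixed_width_needed(text) / 4


print(utf32_waste("hello"))               # 0.75
print(utf32_waste("\u65e5\u672c\u8a9e"))  # 0.5
```

Text that fits in one byte per character wastes three quarters of a UTF32 buffer, and text that fits in two bytes wastes half.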
JEff
True. But variable length encodings suck performance-wise. Jarkko wrote a
caching layer for perl 5.8.1 to store pairs of UTF8 character offsets and
byte offsets, and even though there is now much more complexity, and usually
only one pair cached, the feeling was that it accelerated some operations by
a factor of 10.
It seems that the O(n) for random access hurts much more than the memory
usage. But you can't win.
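
To show why even a single cached pair helps, here is a toy version of the idea. This is not the perl 5.8.1 code, just a sketch that remembers the last (character index, byte offset) pair so a later lookup at or beyond that point resumes from there instead of rescanning from byte 0. It assumes valid UTF-8, and the class name Utf8Index is invented.

```python
class Utf8Index:
    """Character-to-byte offset lookup over UTF-8 with a one-entry cache."""

    def __init__(self, data: bytes):
        self.data = data
        self._cached = (0, 0)  # (character index, byte offset)

    def byte_offset(self, n: int) -> int:
        if n < 0:
            raise IndexError(n)
        char_i, off = self._cached
        if n < char_i:
            # The cache is past the target; fall back to a scan from the front.
            char_i, off = 0, 0
        data = self.data
        while char_i < n:
            if off >= len(data):
                raise IndexError(n)
            off += 1
            # Skip continuation bytes (0b10xxxxxx) of the current character.
            while off < len(data) and data[off] & 0xC0 == 0x80:
                off += 1
            char_i += 1
        if off >= len(data):
            raise IndexError(n)
        self._cached = (char_i, off)
        return off


idx = Utf8Index("a\u00e9\u65e5b".encode("utf-8"))
idx.byte_offset(2)  # scans from the start
idx.byte_offset(3)  # resumes from the cached pair
```

Sequential or nearly sequential access, which is the common pattern, then costs close to O(1) per lookup, but a backwards jump still pays the full scan.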
Jarkko's view was that if he were to implement Unicode in perl5 again, he'd
go internally for fixed width 32 bit, i.e. UCS-4/UTF-32 (IIRC).
The only thing that might be useful to cache on a UTF8 string is the highest
code point seen, so that we know whether to unpack to 8, 16 or 32 bit without
a scan. Presumably we can find this when we input validate on the
"conversion" from binary to UTF8.
Nicholas Clark
> On Tue, May 25, 2004 at 07:48:32PM -0700, Jeff Clites wrote:
>> On May 25, 2004, at 12:26 PM, Dan Sugalski wrote:
>
>>> Yup. UTF8 is Just another variable-width encoding. Do anything with it
>>> and we convert it to a fixed-width encoding, in this case UTF32.
>>
>> This has the unfortunate side-effect of wasting 50-75% of the storage
>> space in the common cases, of course.
>
> True. But variable length encodings suck performance wise.
Yes--that was the point I made previously in this thread. But my
proposed scheme was neither variable length nor egregiously wasteful of
space.
> The only thing that might be useful to cache on a UTF8 string is the highest
> code point seen, so that we know whether to unpack to 8, 16 or 32 bit without
> a scan. Presumably we can find this when we input validate on the
> "conversion" from binary to UTF8.
This is basically what I implemented.
JEff