On Apr 9, 2004, at 7:19 AM, Leopold Toetsch wrote:
> Jeff Clites <jcli...@mac.com> wrote:
>> I've sent my patch in through RT--it's [perl #28405]!
> Phew, that's huge. I'd really like to have smaller patches that do it

Yes, I know it got quite large--sorry about that, I know it makes
things more difficult. Mostly it was all-or-nothing, though--changes to
string.c, and then dealing with the consequence of those changes. (It
would have been smaller if I had not added the "interpreter" argument
to the string API which lacked it--updating the places they are used is
probably half of the patch or so.)
In addition to my responses below, I hope to have a better-organized
> But anyway, the patch must have been a lot of work so
> lets see and make the best out of it.
> Some questions:

The key idea is to move to a model in which a string is not represented
as a bag of bytes + associated encoding, but rather conceptually as an
array of (abstract) characters, which boils down to an array of
numbers, those numbers being the Unicode code point of the character.
The easiest way to do this would be to represent a string using an
array of 32-bit ints, but this is a large waste of space in the common
case of an all-ASCII string. So what I'm doing is sticking to the idea
of modeling a string as conceptually an array of 32-bit numbers, but
optimizing by using just a uint8_t if all of the characters happen to
have numerical value < 2^8, uint16_t if some are outside of that
range but still < 2^16, and finally uint32_t in the rare case it
contains characters above that range.
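
To make the storage scheme above concrete, here is a minimal sketch in C. The names (cp_string, cp_width_for, cp_string_init, cp_string_at) are made up for illustration and are not the actual struct or functions in the patch; the sketch only shows the idea of picking the narrowest element width and then indexing by character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical layout, not Parrot's actual STRING struct. */
typedef struct {
    size_t  length;  /* number of characters (code points) */
    uint8_t width;   /* bytes per character: 1, 2 or 4 */
    void   *data;    /* length * width bytes of storage */
} cp_string;

/* Smallest element width that can hold every code point in cps[0..n). */
static uint8_t cp_width_for(const uint32_t *cps, size_t n)
{
    uint8_t width = 1;
    for (size_t i = 0; i < n; i++) {
        if (cps[i] >= 0x10000u) return 4;   /* needs uint32_t */
        if (cps[i] >= 0x100u) width = 2;    /* needs at least uint16_t */
    }
    return width;
}

/* Build a string from code points; returns 0 on success, -1 if malloc fails. */
static int cp_string_init(cp_string *s, const uint32_t *cps, size_t n)
{
    s->length = n;
    s->width = cp_width_for(cps, n);
    s->data = malloc(n ? n * s->width : 1);
    if (!s->data) return -1;
    for (size_t i = 0; i < n; i++) {
        switch (s->width) {
        case 1:  ((uint8_t *)s->data)[i]  = (uint8_t)cps[i];  break;
        case 2:  ((uint16_t *)s->data)[i] = (uint16_t)cps[i]; break;
        default: ((uint32_t *)s->data)[i] = cps[i];           break;
        }
    }
    return 0;
}

/* Constant-time access to the i-th character, whatever the storage width. */
static uint32_t cp_string_at(const cp_string *s, size_t i)
{
    assert(i < s->length);
    switch (s->width) {
    case 1:  return ((const uint8_t *)s->data)[i];
    case 2:  return ((const uint16_t *)s->data)[i];
    default: return ((const uint32_t *)s->data)[i];
    }
}
```

An all-ASCII string ends up with one byte per character, while a single character at or above 2^16 pushes the whole string to four bytes per character. Callers only ever see code points through cp_string_at, so the width stays an internal detail.
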
So internally, strings don't have an associated encoding (or chartype
In my view, the fundamental question you can ask a string is "what's
In this model, an "encoding" is a serialization algorithm for strings,
So I get what the split is trying to capture, but I think it's
So in this model (to recap), if I have two files which contain the same
There are a couple of key strengths of this approach, from a
1) Indexing into a string is O(1). (With the existing model, if you
2) There's no memory allocation (and hence no GC) needed during string
3) The Boyer-Moore algorithm can be used for all string searches. (But
Incidentally, this closely matches the approach used by ObjC and Java.
> - Where is string->language?

I removed it from the string struct because I think that's the wrong
place for it (and it wasn't actually being used anywhere yet,
fortunately). The problem is that although many operations can be
language-dependent (sorting, for example), the operation doesn't depend
on the language of the strings involved, but rather on the locale of
the reader. The classic example is a list containing a mixture of
English and Swedish names; for an English reader the whole list should
be sorted in English order, and a Swedish reader would want to see them
in Swedish alphabetical order. (That example is from Richard Gillam's
book on Unicode.)
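
As a toy illustration of why the ordering belongs to the reader rather than to the string, the sketch below compares the same two strings under two reader locales. Everything here is an assumption made for the example (the reader_locale enum, the weight table, the function names), and the weights cover only three lower-case letters; real collation would use proper locale data rather than a hand-written switch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reader locales for the example. */
typedef enum { READER_ENGLISH, READER_SWEDISH } reader_locale;

/* Toy primary weight, meaningful for lower-case letters only.
 * Swedish places U+00E5, U+00E4 and U+00F6 after 'z' (in that order);
 * for an English reader they are treated like their base letters. */
static uint32_t toy_weight(uint32_t cp, reader_locale loc)
{
    switch (cp) {
    case 0xE5: return loc == READER_SWEDISH ? (uint32_t)'z' + 1 : (uint32_t)'a';
    case 0xE4: return loc == READER_SWEDISH ? (uint32_t)'z' + 2 : (uint32_t)'a';
    case 0xF6: return loc == READER_SWEDISH ? (uint32_t)'z' + 3 : (uint32_t)'o';
    default:   return cp;
    }
}

/* Compare two 0-terminated code point arrays for a given reader.
 * Returns <0, 0 or >0 like strcmp. The strings carry no language tag. */
static int compare_for_reader(const uint32_t *a, const uint32_t *b,
                              reader_locale loc)
{
    for (;; a++, b++) {
        uint32_t wa = *a ? toy_weight(*a, loc) : 0;
        uint32_t wb = *b ? toy_weight(*b, loc) : 0;
        if (wa != wb) return wa < wb ? -1 : 1;
        if (*a == 0) return 0;  /* both strings ended together */
    }
}
```

The same pair of strings can compare in opposite directions depending on the locale argument, which is why the locale is passed to the operation rather than stored in the string.
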
I'm assuming that was the intention of "language". If it was meant to
Also, we may want to re-create an API which allows us to obtain an
> With this string type how do we deal with anything beyond codepoints?

Hmm, what do you mean? Even prior to this, all of our operations ended
up relying on being able to either transcode two arbitrary strings to
the same encoding, or ask arbitrary strings what their N-th character
is. Ultimately, this carried the requirement that a string be
representable as a series of code points. I don't think that there is
anything beyond that which a string has to offer. But I'm curious as to
what you had in mind.
> And some misc remarks/questions:
> - string_compare seems just to compare byte,short, or int_32

Yep, exactly.
(There are compare-strings-using-a-particular-normalization-form
> - What happenend to external constant strings?

They should still work (or could). But the only cases in which we can
optimize, and actually use "in-place" a buffer handed to string_make,
is for a handful of encodings. But dealing with the "flags" argument
passed into string_make may still be incomplete.
> - What's the plan towards all the transcode opcodes? (And leaving these
> as a noop would have been simpler)

Basically there's no need for a transcode op on a string--it no longer
makes sense, there's nothing to transcode. What we need to add is an
"encode" which creates a bag of bytes from a string + an encoding, and
a "decode" which creates a string based on a bag of bytes + an
encoding. But we currently lack (as far as I can tell) a data type to
hold a naked bag of bytes. It's in my plans to create a PMC for that,
and once that exists to add the corresponding ops.
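
To sketch what "encode" and "decode" would mean in this model, here is a standalone C example using UTF-8 as the encoding. The function names and signatures are hypothetical, not the planned ops or any existing Parrot API; a plain uint8_t buffer stands in for the bag-of-bytes type that doesn't exist yet.

```c
#include <stddef.h>
#include <stdint.h>

/* "encode": code points + encoding (UTF-8 here) -> bytes.
 * Returns 0 on success, -1 if out is too small or a code point is not
 * encodable (a surrogate, or above U+10FFFF). */
static int utf8_encode(const uint32_t *cps, size_t n,
                       uint8_t *out, size_t cap, size_t *out_len)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t c = cps[i];
        if (c > 0x10FFFFu || (c >= 0xD800u && c <= 0xDFFFu)) return -1;
        size_t need = c < 0x80u ? 1 : c < 0x800u ? 2 : c < 0x10000u ? 3 : 4;
        if (cap - len < need) return -1;
        if (need == 1) {
            out[len++] = (uint8_t)c;
        } else {
            /* Lead byte: length marker plus the top payload bits. */
            static const uint8_t lead[5] = { 0, 0, 0xC0, 0xE0, 0xF0 };
            out[len++] = (uint8_t)(lead[need] | (c >> (6 * (need - 1))));
            /* Continuation bytes: 10xxxxxx, six payload bits each. */
            for (size_t k = need - 1; k-- > 0; )
                out[len++] = (uint8_t)(0x80u | ((c >> (6 * k)) & 0x3Fu));
        }
    }
    *out_len = len;
    return 0;
}

/* "decode": bytes + encoding (UTF-8 here) -> code points.
 * Returns 0 on success, -1 on malformed input or if out is too small. */
static int utf8_decode(const uint8_t *in, size_t n,
                       uint32_t *out, size_t cap, size_t *out_len)
{
    size_t i = 0, len = 0;
    while (i < n) {
        uint8_t b = in[i];
        size_t extra;
        uint32_t c, min;
        if (b < 0x80u)                 { c = b;         extra = 0; min = 0; }
        else if ((b & 0xE0u) == 0xC0u) { c = b & 0x1Fu; extra = 1; min = 0x80u; }
        else if ((b & 0xF0u) == 0xE0u) { c = b & 0x0Fu; extra = 2; min = 0x800u; }
        else if ((b & 0xF8u) == 0xF0u) { c = b & 0x07u; extra = 3; min = 0x10000u; }
        else return -1;                          /* invalid lead byte */
        if (n - i - 1 < extra || len == cap) return -1;
        for (size_t k = 1; k <= extra; k++) {
            if ((in[i + k] & 0xC0u) != 0x80u) return -1;
            c = (c << 6) | (in[i + k] & 0x3Fu);
        }
        /* Reject overlong forms, surrogates and out-of-range values. */
        if (c < min || c > 0x10FFFFu || (c >= 0xD800u && c <= 0xDFFFu)) return -1;
        out[len++] = c;
        i += extra + 1;
    }
    *out_len = len;
    return 0;
}
```

The point of the split is that the byte buffer and the string are different kinds of thing: the encoding is only named at the moment of conversion and is not remembered by the string afterwards.
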
> - hash_string seems not to deal with mixed encodings anymore.

Yep, since we're hashing based on characters rather than bytes, there's
no such thing as mixed encodings. That means that, to use my example
from above, a string representing the Japanese word for "sushi" will
hash to the same thing no matter what encoding may have been used to
represent it on disk.
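
A small sketch of what hashing by character looks like in practice. This is not the hash_string from the patch: the function names are hypothetical, and the mixing step is an FNV-1a-style loop chosen only because it is short. The property being shown is that the hash depends on the code points alone, not on how many bytes each one occupies in memory.

```c
#include <stddef.h>
#include <stdint.h>

/* Read the i-th code point from storage with 1, 2 or 4 bytes per character. */
static uint32_t char_at(const void *data, unsigned width, size_t i)
{
    switch (width) {
    case 1:  return ((const uint8_t *)data)[i];
    case 2:  return ((const uint16_t *)data)[i];
    default: return ((const uint32_t *)data)[i];
    }
}

/* FNV-1a-style hash over code points. Each character is always fed in
 * as four bytes (low byte first), so the storage width cannot leak into
 * the result. */
static uint32_t hash_chars(const void *data, unsigned width, size_t n)
{
    uint32_t h = 2166136261u;            /* FNV offset basis */
    for (size_t i = 0; i < n; i++) {
        uint32_t c = char_at(data, width, i);
        for (int shift = 0; shift < 32; shift += 8) {
            h ^= (c >> shift) & 0xFFu;
            h *= 16777619u;              /* FNV prime */
        }
    }
    return h;
}
```

So the same two characters stored as uint16_t in one string and as uint32_t in another produce the same hash value, and both can be used interchangeably as hash keys.
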
> - Why does PIO_putps convert to UTF-8?

Mostly for testing. We currently lack a way to associate an encoding
with an IO handle (or with an IO read or write), and if you want to
write out a string you have to pick an encoding--that's the recipe
needed to convert a string into something write-able. Until we've
developed that, I'm writing everything out in UTF-8, just as a sensible
thing to use to allow basically any string to be written out. Before
this, we were writing out the raw bytes used to internally represent
the string, which never really made sense.
> - Why does read/PIO_reads generate an UTF-8 first?

Similar to the above, but right now that API is only being called by
the freeze/thaw stuff, and its current odd state reflects what needed
to be done to keep it working (basically, to match the write by
PIO_putps). But ultimately, for the freeze-using-opcodes case we should
be accumulating our bytes into a raw byte buffer rather than sneaking
them into the body of a string, but since we don't yet have a data type
representing a raw byte buffer (as mentioned above), this was a
Fundamentally, I think we need two sorts of IO API (which may be
Of course, pass along any other questions/concerns you have.