A question about encoding

Luke Palmer

unread,

Apr 26, 2003, 11:26:24 PM4/26/03

to perl6-i...@perl.org

I'm working on the next iteration of string_str_search (because it's
still slower than Perl 5, and it needs to be faster!), and I have a
few questions about encodings.

What exactly does string_transcode do? I assume it changes a string
encoded one way into a different encoding. Are there any caveats to
this?

If two strings are in the same encoding and they are identical, will
their bit patterns be identical? Since I'm looking for an identical
substring, would it be impolite to just ignore the whole character
idea and look for an identical sequence of bytes?

Thanks,
Luke

Benjamin Goldberg

unread,

Apr 27, 2003, 2:52:24 AM4/27/03

to perl6-i...@perl.org

If you *know* that in that encoding, no character's byte pattern can
occur as a substring of another character's byte pattern, then yes, you
can do that. UTF8 has this property. IIRC, this is called the "infix-
free" property.

But if the encoding isn't like that, you can't. For example, imagine an
encoding in which each character is represented as an ascii "X",
followed by ascii decimal digits representing the char's ordinal value.
So the character "\n" is encoded as the bytes "X10", and the character
"d" is encoded as the bytes "X100". Clearly, if we try and look purely
at the bytes, we would be told that "\n" is a substring of "d". Not
Good.

More reasonably, consider UTF16 or UCS2 -- suppose you've got a string
with the chars "\x{1234}\x{5678}", and you want to know if "\x{3456}" is
a substring of that -- a simple byte comparison would say yes, when the
correct answer is no.

Assuming that you're using string_ord for getting characters, then a
better way to do the comparison would be to use the encoding vtable of
the strings, and the skip_forward and decode methods of those vtables.

And chartype_lookup_transcoder function that string_transcode uses, to
make sure that you've got characters which can be sensibly compared.

Look at the guts of string_transcode and string_compare for inspiration.

At a guess, something like this might work:

Buffer* failure = NULL;
Parrot_UInt len, len2, s_idx, t_idx, c;
const void *s, *t, *i;

CHARTYPE_TRANSCODER transcoder1 = (CHARTYPE_TRANSCODER)NULLfunc;
CHARTYPE_TRANSCODER transcoder2 = (CHARTYPE_TRANSCODER)NULLfunc;

len = string_length(str); /* searching for */
len2 = string_length(str2); /* searching within */

if( len > len2 ) return -1;

if( str->encoding == str2->encoding )
if( str->type == str2->type ) {
if( str1->encoding->index == enum_encoding_singlebyte ) {
/* memmem() is non-ansi! */
void * offset = memmem( str2->strstart + start,
str->strstart, len - start, len2 );
if( offset == NULL ) return -1;
return str1->strstart + start - offset;
}
}
}
s = str->encoding->skip_forward(str->strstart, start);
c = str->encoding->decode(s);
s = str->encoding->skip_forward(s, 1);
s_idx = 1;
if( str->encoding->index == enum_encoding_singlebyte ) {
s = memchr( s, c, len-1 );
if( s == NULL ) {
s = str->strend;
s_idx = len;
} else
s_idx = s - str->strstart;
} else {
while( s_idx < len ) {
if( c == str->encoding->decode(s) )
break;
s = str->encoding->skip_forward(s, 1);
++s_idx;
}
}
if( s_idx < len ) {
Parrot_block_DOD(interp);
failure = new_buffer_header(interp);
Parrot_allocate_zeroed(
interp, failure, sizeof(Parrot_UInt) * (len + 1));
Parrot_unblock_DOD(interp);
}
/* create failure dfa */
t = str->strstart; t_idx = 0;
if( str->encoding->index == enum_encoding_singlebyte ) {
while( s_idx < len ) {
while( t_idx > 0 && *s != *t ) {
t_idx = ((Parrot_UInt*)failure->bufstart)[t_idx];
t = str->strstart + t_idx;
}
if( *(s++) == *t )
++t;
((Parrot_UInt*)failure->bufstart)[++s_idx] = ++t_idx;
} else
((Parrot_UInt*)failure->bufstart)[++s_idx] = 0;
}
} else {
while( s_idx < len ) {
while( t_idx > 0 &&
str->encoding->decode(s) != str->encoding->decode(t) ) {
t_idx = ((Parrot_UInt*)failure->bufstart)[t_idx];
t = str->encoding->skip_forward(str->strstart, t_idx);
}
if( str->encoding->decode(s) == str->encoding->decode(t) ) {
t = str->encoding->skip_forward(t, 1);
((Parrot_UInt*)failure->bufstart)[++s_idx] = ++t_idx;
} else
((Parrot_UInt*)failure->bufstart)[++s_idx] = 0;
s = str->encoding->skip_forward(s, 1);
}
}

/* need to convert character type to some common format. */
transcoder1 = chartype_lookup_transcoder(str->type, str2->type);
if( !transcoder1 ) {
transcoder1 = chartype_lookup_transcoder(
str2->type, string_unicode_type);
transcoder2 = chartype_lookup_transcoder(
str->type, string_unicode_type);
}

/* Perform match */
s = str->strstart; s_idx = 0;
/* c = str->encoding->decode(s); This is what c already is. */
if( transcoder1 ) c = transcoder1(c);
t = str2->encoding->skip_forward(str2->strstart, start);
t_idx = start;
while( t_idx < len2 ) {
Parrot_UInt c2;
if( s_idx == len ) break;
c2 = str2->encoding->decode(t);
if( transcoder2 ) c2 = transcoder2(c2);
while( s_idx > 0 && c != c2 )
if( failure ) {
s_idx = ((UINTVAL*)failure->bufstart)[s_idx];
s = str->encoding->skip_forward(str->strstart, s_idx);
} else {
s_idx = 0;
s = str->strstart;
}
c = str->encoding->decode(s);
if( transcoder1 ) c = transcoder1(c);
}
if( c == c2 ) {
++s_idx;
s = str->encoding->skip_forward(s, 1);
}
t = str2->encoding->skip_forward(t, 1);
++t_idx;
}
if( s_idx == len )
return t_idx - s_idx;
else
return -1;
}

Ugh... really this needs to be unrolled into multiple seperate
functions:
First, two functions for creating the failure dfa (one if str is a
single byte encoding, another if it's a multi-byte encoding), then four
functions for doing the actual search: single/single, single/multibyte,
multibyte/single, and multibyte/multibyte.

--
$a=24;split//,240513;s/\B/ => /for@@=qw(ac ab bc ba cb ca
);{push(@b,$a),($a-=6)^=1 for 2..$a/6x--$|;print "$@[$a%6
]\n";((6<=($a-=6))?$a+=$_[$a%6]-$a%6:($a=pop @b))&&redo;}

Luke Palmer

unread,

Apr 27, 2003, 2:54:11 AM4/27/03

to gol...@earthlink.net, perl6-i...@perl.org

> > If two strings are in the same encoding and they are identical, will
> > their bit patterns be identical? Since I'm looking for an identical
> > substring, would it be impolite to just ignore the whole character
> > idea and look for an identical sequence of bytes?
>
> If you *know* that in that encoding, no character's byte pattern can
> occur as a substring of another character's byte pattern, then yes, you
> can do that. UTF8 has this property. IIRC, this is called the "infix-
> free" property.

You know, it doesn't matter. It's too slow anyway, because once I
find the matching string, I have to count backwards in linear time to
find out what character number it is.

> But if the encoding isn't like that, you can't. For example, imagine an
> encoding in which each character is represented as an ascii "X",
> followed by ascii decimal digits representing the char's ordinal value.
> So the character "\n" is encoded as the bytes "X10", and the character
> "d" is encoded as the bytes "X100". Clearly, if we try and look purely
> at the bytes, we would be told that "\n" is a substring of "d". Not
> Good.

Definitely.

> More reasonably, consider UTF16 or UCS2 -- suppose you've got a string
> with the chars "\x{1234}\x{5678}", and you want to know if "\x{3456}" is
> a substring of that -- a simple byte comparison would say yes, when the
> correct answer is no.
>
> Assuming that you're using string_ord for getting characters, then a
> better way to do the comparison would be to use the encoding vtable of
> the strings, and the skip_forward and decode methods of those vtables.

I would never use string_ord to get the characters. Not after I
looked at how it works. For nontrivial encodings (like UTF-8), it's a
linear time function. For fetching a character. That's wrong (but
the only way it can sensibly be done).

> And chartype_lookup_transcoder function that string_transcode uses, to
> make sure that you've got characters which can be sensibly compared.
>
> Look at the guts of string_transcode and string_compare for inspiration.
>
> At a guess, something like this might work:

> [oodles of code]

Wow. You did a lot of work there. Perhaps we could use a part of
that function for multibyte/multibyte.

After some benchmarking, I found that the "failure function" method I
implemented a couple weeks ago didn't gain much over the naive mathod.
So, in the patch that I'm about to send (yay! it just passed the
tests!), any multibyte search uses that naive method again, because it
uses less memory.

On the other hand, I wrote a singlebyte/singlebyte search that
screams. The benchmarks consistently show it >= 2.5 times faster than
Perl 5 (when compiled under -O2).

So, it's test writing time....

Luke