On 08/22/2016 07:05 AM, pa...@cpan.org wrote:
> On Sunday 21 August 2016 08:49:08 Karl Williamson wrote:
>> On 08/21/2016 02:34 AM, pa...@cpan.org wrote:
>>> On Sunday 21 August 2016 03:10:40 Karl Williamson wrote:
>>>> Top posting.
>>>>
>>>> Attached is my alternative patch. It effectively uses a different
>>>> algorithm to avoid decoding the input into code points, and to copy
>>>> all spans of valid input at once, instead of character at a time.
>>>>
>>>> And it uses only currently available functions.
>>>
>>> And that's the problem. As I already wrote in a previous email, a call to
>>> a function in a shared library cannot be optimized as heavily as an
>>> inlined function, and causes a slowdown. You are calling is_utf8_string_loc
>>> for non-strict mode, which is not inlined, so encode/decode in non-strict
>>> mode will be slower...
>>>
>>> Also, in is_strict_utf8_string_loc you are calling isUTF8_CHAR, which
>>> calls _is_utf8_char_slow, which in turn calls utf8n_to_uvchr, which
>>> cannot be inlined either...
>>>
>>> Therefore I think this is not a good approach...
>>>
>>
>> Then you should run your benchmarks to find out the performance.
>
> You are right, benchmarks are needed to show final results.
>
>> On valid input, is_utf8_string_loc() is called once per string. The
>> function call overhead and non-inlining should not be noticeable.
>
> Ah right, I misread it as being called once per valid sequence rather than
> once for the whole string. You are right.
It is called once per valid sequence. See below.
>
>> On valid input, is_utf8_char_slow() is never called. The parts that are
>> used can be inlined.
>
> Yes, but this function is there to be called primarily on unknown input,
> which does not have to be valid. If I know that the input is valid, then
> utf8::encode/decode is enough :-)
What process_utf8() does is to copy the alleged UTF-8 input to the
output, verifying along the way that it actually is legal UTF-8 (with 2
levels of strictness, depending on the input parameter), and taking some
actions (exactly what depends on other input parameters) if and when it
finds invalid UTF-8.
The way it works after my patch is like an instruction pipeline. You
start it up, and it stays in the pipeline until the next character in
the input is illegal or it reaches the end. When it finds illegal
input, it drops out of the pipeline, handles that, and starts up the
pipeline to process any remaining input. If the entire input string is
valid, a single instance of the pipeline is all that gets invoked.
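In rough outline, the loop looks something like the following (a
simplified sketch, not the actual Encode.xs code; is_utf8_string_loc()
and sv_catpvn() are real Perl API calls, while handle_invalid() is a
hypothetical stand-in for whatever actions the input parameters request
for a bad sequence):

    #include "EXTERN.h"
    #include "perl.h"

    /* Hypothetical malformation handler; assumed to return how many bytes
     * it consumed (at least 1, so the loop always makes progress). */
    static STRLEN handle_invalid(pTHX_ SV *dst, const U8 *bad, const U8 *end);

    static void
    copy_validating(pTHX_ const U8 *src, STRLEN len, SV *dst)
    {
        const U8 *s   = src;
        const U8 *end = src + len;

        while (s < end) {
            const U8 *fail;

            if (is_utf8_string_loc(s, end - s, &fail)) {
                /* Everything remaining is valid: copy it in one go. */
                sv_catpvn(dst, (const char *) s, end - s);
                return;
            }

            /* Copy the whole valid span preceding the failure at once ... */
            sv_catpvn(dst, (const char *) s, fail - s);

            /* ... then drop out of the "pipeline" to deal with the bad
             * bytes, and restart it on whatever input remains. */
            s = fail + handle_invalid(aTHX_ dst, fail, end);
        }
    }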
The use-case I envision is that the input is supposed to be valid UTF-8,
and the purpose of process_utf8() is to verify that that is in fact
true, and to take specified actions when it isn't. Under that use-case,
taking longer to deal with invalid input is not a problem. If that is
not your use-case, please explain what yours is.
And I think you misunderstand when is_utf8_char_slow() is called. It is
called only when the next byte in the input indicates that the only
legal UTF-8 that might follow would be for a code point that is at least
U+200000, almost twice as high as the highest legal Unicode code point.
Handling such code points is a Perl extension not found in other
languages. But the Perl core is not optimized for them, nor will it be.
My point is that is_utf8_char_slow() will only be called in very
specialized cases, and we need not make those cases perform as well as
the normal ones.
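For illustration, on ASCII platforms that means the slow path is
reachable only for start bytes that begin sequences of five or more
bytes (a sketch of the threshold only, not the actual isUTF8_CHAR
macro):

    #include <stdbool.h>
    #include <stdint.h>

    /* Code points below U+200000 fit in at most four UTF-8 bytes (start
     * bytes up through 0xF7), so only start bytes 0xF8 and above can begin
     * a sequence for U+200000 or higher. */
    static bool
    might_need_slow_path(uint8_t start_byte)
    {
        return start_byte >= 0xF8;
    }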
>> On invalid input, performance should be a minor consideration.
>
> See below...
See above. :)
And an optimizing compiler should figure that out and omit the call, so
we shouldn't have to do it manually. In the next few months I will be working
on some fixes to the :utf8 layer. That could lead to a core function
similar to this, but without relying on the compiler to figure things out.
>
> 2) Try to make an inline version of is_utf8_string_loc(). Maybe merge it
> with is_strict_utf8_string_loc()? That should boost the non-strict decoder
> for invalid sequences (it is now slower than the strict one).
I'll look at which functions in utf8.c should be inlined. This is a
candidate. But I doubt that that is why this is slower in this case.
Read blead's perlhack.pod for performance testing tool hints.
Your comments made me realize that the call to utf8n_to_uvchr() made
when the input is syntactically valid but is an illegal code point (like
a surrogate, etc.) could be replaced by the faster valid_utf8_to_uvchr(),
which until now has not been in the public API. I have
pushed to
http://perl5.git.perl.org/perl.git/shortlog/refs/heads/smoke-me/khw-encode
a version of blead which has this function made public, and inlined, in
case you want to test with it.
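The substitution itself is small; roughly (a simplified sketch, not the
actual Encode.xs diff; the real code also deals with malformation flags
and return values):

    #include "EXTERN.h"
    #include "perl.h"

    /* s points at a sequence the caller has already verified to be
     * syntactically well-formed; it merely encodes a disallowed code
     * point such as a surrogate. */
    static UV
    decode_known_good(const U8 *s, STRLEN *retlen)
    {
        /* Before: the general-purpose decoder, which re-checks the
         * well-formedness we have already established:
         *     return utf8n_to_uvchr(s, UTF8SKIP(s), retlen, 0);
         * After: skip the re-validation entirely: */
        return valid_utf8_to_uvchr(s, retlen);
    }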
There are some things in the error handling code that could be improved
upon. For example, memcpy() of a single character is slower than just
copying a byte at a time, as the function call overhead dwarfs the savings.
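Something like this is all that is needed there (just an illustration of
that point, not code from the patch):

    #include <stddef.h>

    /* For a single character's worth of bytes (at most the maximum UTF-8
     * sequence length), a plain loop avoids the memcpy() call overhead. */
    static void
    copy_one_char(unsigned char *dst, const unsigned char *src, size_t char_len)
    {
        while (char_len--)
            *dst++ = *src++;
    }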
>
> And maybe it would make sense to make all the needed functions part of the
> public API.
Yes
>
> Are you going to prepare pull request for Encode module?
I will, after everything is settled. This may include changes to
Devel::PPPort to make sure this works on all perls that Encode supports.
However, I think it would be good to get the extra tests of malformed
inputs into the distribution. (I haven't yet looked at your PR for flaws.
I'll do that, and hopefully will find none, and so will recommend that
your PR be pulled.)
>
>
> Anyway, how does it behave on EBCDIC platforms? And maybe another question:
> what should Encode::encode('UTF-8', $str) do on EBCDIC? Encode $str to
> UTF-8 or to UTF-EBCDIC?
It works fine on EBCDIC platforms. There are other bugs in Encode on
EBCDIC that I plan on investigating as time permits. Doing this has
fixed some of these for free. The uvuni() functions should in almost
all instances be uvchr(), and my patch does that.
On EBCDIC platforms, UTF-8 is defined to be UTF-EBCDIC (or vice versa if
you prefer), so $str will effectively be in the version of UTF-EBCDIC
valid for the platform it is running on (there are differences depending
on the platform's underlying code page).