Jakub Wilk <jw...@debian.org
>>The reason is the following (see
>> fribidi_utf8_to_unicode consumes at most 3 bytes for a single
>> unicode character, i.e. it does not handle unicode character above
> As far as I can see this is not true. In Debian, we allocate 4 bytes
> per characters. (An upstream version, which the Debian package is
> based on, is completely broken in this respect: it allocates a buffer
> of static size. See bug #570068)
Upstream is pretty much dead in this case. I've published our version on
PyPI. However, I didn't ask or inform the original authors about that.
>> For a 4 byte utf-8 sequence it will generate 2 unicode characters,
>> which overflows the logical buffer.
> I'm confused. What is "it" in your sentence? Why 2 Unicode characters?
"It" refers to the 4-byte UTF-8 sequence.
Here's the inner loop of "fribidi_utf8_to_unicode" from
| length = 0;
| while ((FriBidiStrIndex) (s - t) < len)
|   {
|     register unsigned char ch = *s;
|     if (ch <= 0x7f) /* one byte */
|       {
|         *us++ = *s++;
|       }
|     else if (ch <= 0xdf) /* 2 byte */
|       {
|         *us++ = ((*s & 0x1f) << 6) + (*(s + 1) & 0x3f);
|         s += 2;
|       }
|     else /* 3 byte */
|       {
|         *us++ =
|           ((int) (*s & 0x0f) << 12) +
|           ((*(s + 1) & 0x3f) << 6) + (*(s + 2) & 0x3f);
|         s += 3;
|       }
|   }
Assume you have a 4-byte UTF-8 sequence. One loop step consumes a maximum of
3 bytes of that 4-byte sequence (there's no "4 byte" case), leaving
1 byte of that sequence for further processing. This 1 byte will
generate another Unicode character. pyfribidi uses the length of the
Python unicode string as the buffer size, which is less than what
fribidi_utf8_to_unicode generates. And there you have your buffer
overflow.
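
To make the arithmetic concrete, here is a minimal, self-contained sketch of
the same decoding logic. It is not the fribidi source: the names decode_max3
and count_code_points are hypothetical, uint32_t stands in for FriBidiChar,
and I'm assuming the output count is incremented once per loop step. It only
illustrates that a 4-byte sequence yields more output values than the input
has characters.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the quoted loop: it knows 1-, 2- and 3-byte
 * sequences only. Returns the number of values written to `out`.
 * Assumption: the caller makes `out` large enough and guarantees that
 * s[len] is readable (e.g. a NUL terminator), because a stray
 * continuation byte is treated as a 2-byte lead and reads one byte ahead. */
static size_t decode_max3(const unsigned char *s, size_t len, uint32_t *out)
{
    const unsigned char *t = s;
    size_t length = 0;

    while ((size_t)(s - t) < len) {
        unsigned char ch = *s;
        if (ch <= 0x7f) {                 /* one byte */
            *out++ = *s++;
        } else if (ch <= 0xdf) {          /* 2 byte (also hit by 0x80..0xbf) */
            *out++ = ((uint32_t)(*s & 0x1f) << 6) + (*(s + 1) & 0x3f);
            s += 2;
        } else {                          /* 3 byte (also hit by 0xf0..0xf4) */
            *out++ = ((uint32_t)(*s & 0x0f) << 12)
                   + ((uint32_t)(*(s + 1) & 0x3f) << 6)
                   + (*(s + 2) & 0x3f);
            s += 3;
        }
        length++;
    }
    return length;
}

/* Number of characters in well-formed UTF-8: every byte that is not a
 * continuation byte (10xxxxxx) starts a new character. */
static size_t count_code_points(const unsigned char *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if ((s[i] & 0xc0) != 0x80)
            n++;
    }
    return n;
}
```

For the four bytes F0 9F 98 80 (U+1F600), count_code_points returns 1 while
decode_max3 returns 2: the first step eats F0 9F 98 through the "3 byte"
branch, and the leftover 80 falls into the "2 byte" branch and produces a
second value. An output buffer sized for one character is therefore one
element too small.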
To confirm the issue, you can add an assert and check that
fribidi_utf8_to_unicode's return value (the length of the string) equals
> Anyway I tried to double the buffer size (8 bytes per characters of
> original string) but this didn't fix the crash. So likely the problem
> lies somewhere else.
I'm pretty sure my analysis is correct, and I'm not quite sure what
you did here.