[reportlab-users] String shapes and encodings

Claude Paroz

Jun 6, 2022, 10:29:05 AM

to For users of Reportlab open source software

Hi,

In the spirit of Python3 strings being always Unicode, I think that
ReportLab String shape should behave the same and accept only Python
strings.
I admit this might be slightly backwards incompatible, but it could be a
first step in string handling simplification in ReportLab. The next step
could be a similar patch for platypus Paragraph.

Claude
--
www.2xlibre.net

0001-String-shapes-accept-Python-strings.patch

Robin Becker

Jun 7, 2022, 4:40:52 AM

to reportlab-users, Claude Paroz

I don't think the fact that python regards a specific encoding of glyphs to be strings has much relevance here. Most of
the external data is in byte form whether encoded as unicode utf8 etc etc.

When python started to provide a unicode encoding of glyphs reportlab had to support them because people wanted to use
them. Today people still want to use bytes.

If python said it was abandoning byte strings then that would be a reason to drop all support for them. That would
really annoy the gene analysts though :)

I don't think I would like to apply this patch anytime soon. If others have an opinion please speak up.
--
Robin Becker
_______________________________________________
reportlab-users mailing list
reportl...@lists2.reportlab.com
https://pairlist2.pair.net/mailman/listinfo/reportlab-users

Claude Paroz

Jun 7, 2022, 12:27:58 PM

to Robin Becker, reportlab-users

On 07.06.22 at 10:40, Robin Becker wrote:

> On 06/06/2022 15:28, Claude Paroz wrote:
>> Hi,
>>
>> In the spirit of Python3 strings being always Unicode, I think that
>> ReportLab String shape should behave the same and accept only Python
>> strings.
>> I admit this might be slightly backwards incompatible, but it could be
>> a first step in string handling simplification in ReportLab. The next
>> step could be a similar patch for platypus Paragraph.
>>
>> Claude
>
> I don't think the fact that python regards a specific encoding of glyphs
> to be strings has much relevance here. Most of the external data is in
> byte form whether encoded as unicode utf8 etc etc.
>
> When python started to provide a unicode encoding of glyphs reportlab
> had to support them because people wanted to use them. Today people
> still want to use bytes.

Of course, at a certain point in time, any digital content is a matter
of bytes. That's not what is discussed here.
The approach Python chose is to push for character conversion happening
at process boundaries, that is at input and output time. When you get
some string input, you have to know (or guess) the encoding and the idea
is to immediately convert to Unicode. Then during the whole string
lifetime in your program, it is Unicode (Python 3 str type). Then, at
some point you have to produce some output, and that's the time to
convert back to bytes with the encoding expected by the output
consumer.
This simplifies things *a lot* compared to the Python 2 world when you
never knew if you had to manipulate pure bytes or unicode, and had to
constantly test content in many parts of your code, as you can see in
ReportLab with the many isStr, isBytes, isUnicode, asNative, etc. uses
throughout the code base. I don't despise that, it was a "normal"
consequence of string status on Python 2.
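
The boundary-conversion approach described above can be sketched roughly as follows. This is only an illustration: the function names and the sample encodings are assumptions made for the example, not ReportLab code.

```python
def read_input(raw: bytes, encoding: str) -> str:
    """Input boundary: decode bytes to str as soon as they arrive."""
    return raw.decode(encoding)


def process(text: str) -> str:
    """Inside the program everything is str, so no bytes/unicode checks are needed."""
    return text.upper()


def write_output(text: str, encoding: str) -> bytes:
    """Output boundary: encode to whatever the consumer expects."""
    return text.encode(encoding)


raw_in = b"d\xe9j\xe0 vu"  # Latin-1 bytes from some external source (assumed)
text = process(read_input(raw_in, "latin-1"))
raw_out = write_output(text, "utf-8")
```

Only `read_input` and `write_output` ever need to know about an encoding; everything in between works on `str`.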

> If python said it was abandoning byte strings then that would be a
> reason to drop all support for them. That would really annoy the gene
> analysts though :)

This won't happen. Bytes, whether strings or any other content type, have
legitimate use cases, of course.

> I don't think I would like to apply this patch anytime soon. If others
> have an opinion please speak up.

I totally respect your maintainer choice. It was a (first-step) proposal
in order to simplify string handling and also to improve performance
through fewer function calls. I'm not angry if you refuse it, we can agree to
disagree :-)

Regards,

Claude
--
www.2xlibre.net

Lennart Regebro via reportlab-users

Jun 7, 2022, 10:31:22 PM

to reportlab-users, Lennart Regebro

On Tue, Jun 7, 2022 at 10:40 AM Robin Becker <ro...@reportlab.com> wrote:

> I don't think I would like to apply this patch anytime soon. If others have an opinion please speak up.

As I understand it, the encoding will simply be ignored if a unicode string is given, so keeping the possibility of passing in a bytestring and an encoding is not a problem, even though it's no longer necessary.

Claude Paroz

Jun 8, 2022, 2:51:54 AM

to reportlab-users

On 08.06.22 at 04:31, Lennart Regebro wrote:

No, the isUnicode check would force text input to be Unicode (a normal
Python string). The encoding parameter should be deprecated/removed at
some point.
So instead of String(b'd\xe9j\xe0', encoding='latin-1'), users should
pass String(b'd\xe9j\xe0'.decode('latin-1')).
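
As a rough illustration of what a str-only interface would mean for callers, here is a small sketch. `make_label` is a hypothetical stand-in for a shape constructor that accepts only `str`; it is not ReportLab's actual `String` class or the code from the patch.

```python
def make_label(text):
    """Hypothetical str-only constructor: rejects bytes instead of decoding them."""
    if not isinstance(text, str):
        raise TypeError("text must be str; decode bytes before calling")
    return text


raw = b"d\xe9j\xe0"  # Latin-1 encoded bytes, as in the example above

# The caller decodes explicitly at the call site.
label = make_label(raw.decode("latin-1"))

# Passing the raw bytes is rejected rather than silently decoded.
try:
    make_label(raw)
    rejected = False
except TypeError:
    rejected = True
```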

Claude
--
www.2xlibre.net

Robin Becker

Jun 8, 2022, 4:04:10 AM

to reportl...@lists2.reportlab.com

..........

> No, the isUnicode check would force text input to be Unicode (a normal Python string). The encoding parameter should be
> deprecated/removed at some point.
> So instead of String(b'd\xe9j\xe0', encoding='latin-1'), users should pass String(b'd\xe9j\xe0'.decode('latin-1')).
>
> Claude

For whatever reason we decided to allow either utf8 bytes or unicode as inputs to many of the reportlab functions/methods.

It seems to me that forcing the decode into the caller is wrong.

1) It's not always known whether the values passed are bytes or unicode, or
what a suitable encoding might be.
2) If we have to test to ensure the conversion, that code gets scattered everywhere rather than being in the callee.
3) Claude's desire to make the decode explicit at the call is not prevented by current code.

Our default works in a lot of places, but I agree that it won't suit many Windows users etc.
It might have been better to have a user-controllable default encoding that we could then set in many of the argument
definitions, i.e. func(.....,encoding=rl_config.default_byte_encoding,....)

If the decode fails we could fall back on chardet or similar.
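
A minimal sketch of that idea, under stated assumptions: `DEFAULT_BYTE_ENCODING` stands in for the suggested `rl_config.default_byte_encoding` setting (proposed above, not an existing option), `as_text` is a made-up helper name, and a fixed list of fallback encodings stands in for chardet-style detection.

```python
DEFAULT_BYTE_ENCODING = "utf-8"  # hypothetical stand-in for rl_config.default_byte_encoding
FALLBACK_ENCODINGS = ("cp1252", "latin-1")  # crude stand-in for chardet-style detection


def as_text(value, encoding=None):
    """Return value as str, doing any bytes decoding in the callee."""
    if isinstance(value, str):
        return value  # the encoding argument is ignored for str input
    if encoding is None:
        encoding = DEFAULT_BYTE_ENCODING
    # Try the requested (or default) encoding first, then the fallbacks in order.
    for candidate in (encoding, *FALLBACK_ENCODINGS):
        try:
            return value.decode(candidate)
        except UnicodeDecodeError:
            continue
    raise ValueError("could not decode bytes with any candidate encoding")
```

With this shape, a caller who wants the decode to be explicit can still pass `str` only, while callers holding bytes of uncertain origin get one central place where decoding and fallback happen.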

I think removing the ability to use bytes is not an improvement.
--
Robin Becker
