[reportlab-users] Encoding UTF-8 instead of PDFDoc

Koki Nomura

Mar 1, 2017, 12:05:34 AM
to reportl...@lists2.reportlab.com
Hi,

pdfdocEnc() in pdfdoc.py raises the UnicodeEncodeError shown below when I process a PDF file containing Unicode characters. I'm running my script on Python 3.6.0.

UnicodeEncodeError: 'charmap' codec can't encode character '\x00' in position 11: character maps to <undefined>

This error disappears when I change the encoding from extpdfdoc to utf-8 in this block of code.

if isPy3:
    def pdfdocEnc(x):
        return x.encode('extpdfdoc') if isinstance(x,str) else x

While I don't fully understand the 'extpdfdoc' encoding, could we change this encoding to utf-8, since the PDF specification allows Unicode as well as PDFDocEncoding?

Thanks,
Koki

Robin Becker

Mar 1, 2017, 7:23:24 AM
to reportlab-users
Hi Koki,

I'm not sure whether this is a good idea. The pdfdocEnc function is supposed to
accept either a bytestring or unicode. The output is 'supposed' to be acceptable
to PDF, and for that we would normally expect to use the pdfdoc standard encoding.
The extpdfdoc encoding just adds CR ('\r') and LF ('\n') identity mapped.
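
For illustration, the reported failure mode can be reproduced with a toy charmap codec. This is only a sketch: TOY_MAP and toy_encode are hypothetical names, and the table (printable ASCII plus identity-mapped CR and LF) is an assumption, not reportlab's actual extpdfdoc table. It shows that a strict charmap encode rejects any character that has no entry in the table, such as '\x00'.

```python
import codecs

# Toy charmap (an assumption for illustration, NOT reportlab's real table):
# printable ASCII plus CR and LF, each mapped to the same byte value.
TOY_MAP = {cp: cp for cp in range(0x20, 0x7F)}
TOY_MAP.update({0x0A: 0x0A, 0x0D: 0x0D})


def toy_encode(text):
    """Encode text with the toy charmap; unmapped characters raise."""
    encoded, _length = codecs.charmap_encode(text, "strict", TOY_MAP)
    return encoded


print(toy_encode("layer 1\n"))

error = None
try:
    # '\x00' has no entry in TOY_MAP, so strict encoding fails here.
    toy_encode("layer\x001")
except UnicodeEncodeError as exc:
    error = exc
    print(exc)
```

The exception reports the 'charmap' codec and the position of the first unmapped character, which matches the shape of the traceback quoted at the start of the thread.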

Can you give an example of where this is going wrong, i.e. what you passed to a
reportlab function to cause the problem?

PDF does allow different encodings in various places, but usually we end up
using either pdfdoc or sometimes UTF16. I don't think PDF allows utf8 in many
places; names are one case, and I believe that for some software URIs can be
encoded directly as utf8.
--
Robin Becker
_______________________________________________
reportlab-users mailing list
reportl...@lists2.reportlab.com
https://pairlist2.pair.net/mailman/listinfo/reportlab-users

Koki Nomura

Mar 1, 2017, 9:39:49 PM
to reportlab-users
Hi Robin,

Yes, using UTF-8 was the wrong idea, as the specification explicitly requires PDFDocEncoding or UTF-16BE for text strings. (I thought all Unicode encodings were acceptable.) So now my idea is to encode using UTF-16BE.

I've attached a script that causes the problem, together with two simple PDF files called ascii.pdf and cjk.pdf. The script is run as below.

$ python test.py

Changing the filename 'cjk.pdf' in the script to 'ascii.pdf' removes the error. The two PDF files are basically the same, except that ascii.pdf has an optional content group called 'layer 1' while cjk.pdf has a group whose name is in Japanese. These names are the default layer names in Adobe Illustrator, so I always hit the same problem when I edit PDF files made with Illustrator.

I changed the code block that raises the error as below and reinstalled reportlab; after that my script no longer raised errors.

# reportlab/pdfbase/pdfdoc.py
if isPy3:
    def pdfdocEnc(x):
        return x.encode('utf_16_be') if isinstance(x,str) else x

I didn't check the 'else' block for Python 2.x, but my script didn't raise any errors when I ran it with Python 2.7.12.
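
One caveat with a bare 'utf_16_be' encode, sketched here under assumptions: the PDF specification expects a Unicode text string to begin with the byte order mark FE FF, which 'utf_16_be' does not add, and encoding every str this way would also turn plain ASCII names into two bytes per character. The helper name encode_text_string is hypothetical, and the plain-ASCII check is only a stand-in for the range where PDFDocEncoding agrees with ASCII; this is not reportlab's implementation.

```python
import codecs


def _is_plain_ascii(text):
    # Simplified stand-in for "encodable as PDFDocEncoding": printable ASCII
    # plus tab, LF and CR. The real PDFDocEncoding covers more characters.
    return all(0x20 <= ord(c) <= 0x7E or c in "\t\n\r" for c in text)


def encode_text_string(x):
    """Hypothetical helper: return bytes for use as a PDF text string."""
    if not isinstance(x, str):
        return x  # already bytes: pass through unchanged, as pdfdocEnc does
    if _is_plain_ascii(x):
        return x.encode("ascii")
    # Unicode text strings: BOM (FE FF) followed by UTF-16BE data.
    return codecs.BOM_UTF16_BE + x.encode("utf_16_be")


ascii_name = encode_text_string("layer 1")
cjk_name = encode_text_string("レイヤー 1")
print(ascii_name)
print(cjk_name[:2] == codecs.BOM_UTF16_BE)
```

With this shape, ASCII layer names stay byte-for-byte as before, and only names outside that range switch to the BOM-prefixed UTF-16BE form.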

I'm using the pdfrw library (https://github.com/pmaupin/pdfrw) to read PDF files. Here is my environment:

- macOS 10.12.3
- Python 3.6.0
- pdfrw 0.2
- reportlab 3.3.32

Thanks!
Koki

On Wed, Mar 1, 2017 at 21:23, Robin Becker <ro...@reportlab.com> wrote:
ascii.pdf
cjk.pdf
test.py