wx.lib.pdfviewer - special characters issue

Werner

May 22, 2013, 9:39:36 AM

to wxpytho...@googlegroups.com

Hi,

I just noticed that the viewer has problems with special characters such
as �� etc etc.

The problem can be seen when one opens the attached .pdf in the viewer;
if I view it with e.g. Adobe Reader, the characters are fine.

Is this to do with the way PythonReports defines/uses the fonts?

Any hint on how to make this display correctly?

This is with 2.9.5 preview (classic).

Werner

Cellarbook wine list - portrait.pdf

David Hughes

May 22, 2013, 10:29:29 AM

to wxpytho...@googlegroups.com

On Wednesday, May 22, 2013 2:39:36 PM UTC+1, werner wrote:

Hi,

I just noticed that the viewer has problems with special characters such
as ï¿½ï¿½ etc etc.

It is probably for much the same reason as why they don't appear correctly here either ;-)

The problem can be seen when one opens the attached .pdf in the viewer,
if I view it with e.g. Adobe Reader the characters are fine.

Is this to do with the way PythonReports defines/uses the fonts?

Any hint on how to make this display correctly?

This is with 2.9.5 preview (classic).

Werner

I am just catching up with things after a few days away, but I will investigate as soon as I can.

David

werner

May 22, 2013, 10:42:03 AM

to wxpytho...@googlegroups.com

Hi David,

On Wednesday, 22 May 2013 16:29:29 UTC+2, David Hughes wrote:

On Wednesday, May 22, 2013 2:39:36 PM UTC+1, werner wrote:
Hi,

I just noticed that the viewer has problems with special characters such
as ï¿½ï¿½ etc etc.

How I do love all this encoding stuff.

In my "Sent" folder in Thunderbird it shows as it should ("a accent" and "e accent"), but the message which came in via the Google group is showing garbage.

Let's see if it works all the way when I reply to this on the Google group.
áé

It is probably for much the same reason as why they don't appear correctly here either ;-)

The problem can be seen when one opens the attached .pdf in the viewer,
if I view it with e.g. Adobe Reader the characters are fine.

Is this to do with the way PythonReports defines/uses the fonts?

Any hint on how to make this display correctly?

This is with 2.9.5 preview (classic).

Werner

I am just catching up with things after a few days away, but I will investigate as soon as I can

I am catching up too, was off for a few days.

Thanks for adding it to your list :)
Werner

Tim Roberts

May 22, 2013, 12:35:44 PM

to wxpytho...@googlegroups.com

werner wrote:

On Wednesday, May 22, 2013 2:39:36 PM UTC+1, werner wrote:
Hi,

I just noticed that the viewer has problems with special characters such
as ï¿½ï¿½ etc etc.

How I do love all this encoding stuff.

In my "Sent" folder in Thunderbird it shows as it should ("a accent" and "e accent"), but the message which came in via the Google group is showing garbage.

Let's see if it works all the way when I reply to this on the Google group.
áé

Now, hold on a minute. The last two characters here did show up as "a accent" and "e accent", but in my Thunderbird, the two characters in your original mail were the Hebrew letters "tet" and "alef" (U+05D8 and U+05D0). What did you actually type?

-- 
Tim Roberts, ti...@probo.com
Providenza & Boekelheide, Inc.

Chris Barker - NOAA Federal

May 22, 2013, 12:57:55 PM

to wxpytho...@googlegroups.com

On Wed, May 22, 2013 at 9:35 AM, Tim Roberts <ti...@probo.com> wrote:
> Now, hold on a minute. The last two characters here did show up as "a
> accent" and "e accent", but in my Thunderbird, the two characters in your
> original mail were the Hebrew letters "tet" and "alef" (U+05D8 and U+05D0).

Same for me in gmail web client...

Isn't this fun!

OT: Anyone know what the encoding story is with email? I'm sure the
original spec was ASCII only (probably 7 bit...), but are you now free
to use any (hopefully specified) encoding, or is it always UTF-8 or???

Just curious, really...

-CHB

--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

Chris....@noaa.gov

Robin Dunn

May 22, 2013, 1:10:33 PM

to wxpytho...@googlegroups.com

Chris Barker - NOAA Federal wrote:
> On Wed, May 22, 2013 at 9:35 AM, Tim Roberts<ti...@probo.com> wrote:
>> Now, hold on a minute. The last two characters here did show up as "a
>> accent" and "e accent", but in my Thunderbird, the two characters in your
>> original mail were the Hebrew letters "tet" and "alef" (U+05D8 and U+05D0).
>
> Same for me in gmail web client...
>
> Isn't this fun!
>
> OT: Anyone know what the encoding story is with email? I'm sure the
> original spec was ASCII only (probably 7 bit...), but are you now free
> to use any (hopefully specified) encoding, or is it always UTF-8 or???
>
> Just curious, really...

Many mail clients (or MUAs, "Mail User Agents") will let you choose from
a large list of encodings, and the way the text is put into the
"payload" of the email message as ascii is well defined by various RFCs.
You can also dig in to the stock email package docs and code in the
Python standard library and get all kinds of juicy details about it.

http://docs.python.org/2/library/email.html
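
As a rough illustration of what those RFCs describe, here is a minimal sketch using the Python 3 version of the same email package. The helper name `build_message` and the subject line are made up for the example. The body is forced to quoted-printable so that the ASCII-only payload is easy to see; a real mail client may pick a different transfer encoding.

```python
from email import message_from_string, policy
from email.message import EmailMessage


def build_message(text, charset):
    """Build a plain-text message whose body travels as quoted-printable ASCII."""
    msg = EmailMessage()
    msg["Subject"] = "encoding test"
    # The charset is recorded in the Content-Type header; the encoded bytes
    # of the body are then escaped so that only ASCII goes over the wire.
    msg.set_content(text, charset=charset, cte="quoted-printable")
    return msg


text = "\u00e1\u00e9"  # a-acute, e-acute
utf8_msg = build_message(text, "utf-8")
latin1_msg = build_message(text, "iso-8859-1")

for msg in (utf8_msg, latin1_msg):
    print(msg["Content-Type"])
    print(msg["Content-Transfer-Encoding"])
    print(msg.get_payload().strip())

# A receiving client reverses both steps, using the declared charset.
received = message_from_string(utf8_msg.as_string(), policy=policy.default)
print(received.get_content().strip())
```

The same two characters produce different payloads depending on the charset, which is why a wrong or missing charset declaration is enough to garble them at the receiving end.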

--
Robin Dunn
Software Craftsman
http://wxPython.org

Robin Dunn

May 22, 2013, 1:10:39 PM

to wxPython-users

Tim Roberts wrote:

> werner wrote:
>> In my "Sent" folder in Thunderbird it shows as it should "a accent"
>> and "e accent" but the message which came in via google group is
>> showing garbage.
>>
>> Let's see if it works all the way when I reply to this on the Google group.
>> áé
>
> Now, hold on a minute. The last two characters here did show up as "a
> accent" and "e accent", but in my Thunderbird, the two characters in
> your original mail were the Hebrew letters "tet" and "alef" (U+05D8 and
> U+05D0). What did you actually type?

I saw the Hebrew letters too. Perhaps there was some strangeness in the
message encoding settings for the original message or the mail client?
Anyway, Werner's first message used the UTF-8 encoding and the 2nd was
ISO-8859-1, if that helps.
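
To make the charset mismatch concrete, here is a small sketch of what happens when bytes written in one of those encodings are read as another. Which of these steps actually happened on the way through the list is not known; the Hebrew case in particular uses bytes chosen purely for illustration.

```python
text = "\u00e1\u00e9"  # a-acute, e-acute

utf8_bytes = text.encode("utf-8")         # two bytes per character
latin1_bytes = text.encode("iso-8859-1")  # one byte per character

# ISO-8859-1 bytes read as UTF-8 are invalid, so each one becomes U+FFFD,
# the replacement character.
as_replacement = latin1_bytes.decode("utf-8", errors="replace")

# If that U+FFFD pair is written out as UTF-8 and then read as ISO-8859-1,
# the result is the six-character garbage seen in the quoted text above.
double_garbled = as_replacement.encode("utf-8").decode("iso-8859-1")

# UTF-8 bytes read as ISO-8859-1 give two wrong characters per letter.
mojibake = utf8_bytes.decode("iso-8859-1")

# Illustration only: the bytes 0xE8 0xE0 read as ISO-8859-8 (Hebrew)
# come out as tet and alef.
hebrew = b"\xe8\xe0".decode("iso-8859-8")

for label, value in [
    ("replacement", as_replacement),
    ("double garbled", double_garbled),
    ("mojibake", mojibake),
    ("hebrew", hebrew),
]:
    print(label, ascii(value))
```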

werner

May 22, 2013, 1:14:15 PM

to wxpytho...@googlegroups.com

Hi Tim,

The same thing in the same way both times (in Thunderbird, and in the Firefox browser when it worked): hold down "Alt" and type 0225 for the "a accent" and 0233 for the "e accent". This is on Windows 7 with the keyboard configured as "UK English" (you will love this: I actually have a French keyboard, but have never configured it as a French one).
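
For reference, a tiny sketch of what those Alt codes stand for, assuming the Windows ANSI code page in use is Windows-1252 (the usual one for a UK English setup):

```python
# Alt+0nnn enters character number nnn of the ANSI code page,
# assumed here to be Windows-1252.
typed = bytes([225, 233]).decode("cp1252")
print(ascii(typed))
```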

Werner

David Hughes

May 23, 2013, 11:58:55 AM

to wxpytho...@googlegroups.com

On 22/05/2013 15:29, David Hughes wrote:

On Wednesday, May 22, 2013 2:39:36 PM UTC+1, werner wrote:
Hi,

I just noticed that the viewer has problems with special characters

<snip>

The problem can be seen when one opens the attached .pdf in the viewer,
if I view it with e.g. Adobe Reader the characters are fine.

Is this to do with the way PythonReports defines/uses the fonts?

Any hint on how to make this display correctly?

This is with 2.9.5 preview (classic).

Werner

Yes, when your attached pdf file is displayed in the viewer, the accented characters are all displayed incorrectly - see OriginalPdf.png. For example, e-acute is displayed as a dagger symbol. But, using the Wing debugger, the unicode strings in the PDF show the e-acute as \u2020 - which is indeed the Unicode character 'Dagger'

u'L\'\u2020volution du mill\u2020sime 2000 confirme bien sa r\u2020putation de "Mill\u2020sime du si\u2021cle". Un tr\u2021s grand potentiel qui commence'

Yet your attachment displays correctly in Adobe Reader. Now if I cut and paste the text out of Adobe Reader and inject it as a comment in one of my recipes, then display it using the viewer, that all displays correctly as well - see PastedText.png. Wing now reports the unicode strings in the PDF as being like:

u'\xe9volution du mill\xe9sime 2000 confirme bien sa'

which is what I would expect.

So, I think the viewer is behaving correctly as far as it goes - but I don't know what Adobe Reader is doing to make it work with the original data.
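
One way to read the \u2020 values is to work backwards to the character codes that were probably stored in the file. The sketch below assumes the text was decoded with a Windows-1252-like table (PDF's WinAnsiEncoding agrees with Windows-1252 for the dagger and double dagger); that assumption is not confirmed anywhere in this thread.

```python
# Text as reported by the debugger: daggers where accented letters should be.
reported = "L'\u2020volution du mill\u2020sime"

# Undo the assumed Windows-1252 decoding to recover the stored codes.
raw_codes = reported.encode("cp1252")
print(raw_codes)

# For comparison, the codes the correct text would have in Windows-1252.
expected = "L'\u00e9volution du mill\u00e9sime"
print(expected.encode("cp1252"))

# Dagger and double dagger sit at 0x86 and 0x87, nowhere near 0xE9 (e-acute),
# so the stored codes do not look like plain Windows-1252 text.
print(hex(raw_codes[2]), hex("\u2021".encode("cp1252")[0]))
```

If that reading is right, the stored codes only make sense together with a per-font table saying which code means which character.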

David

OriginalPdf.png

PastedText.png

Chris Barker - NOAA Federal

May 23, 2013, 2:12:05 PM

to wxpytho...@googlegroups.com

On Thu, May 23, 2013 at 8:58 AM, David Hughes <d...@forestfield.co.uk> wrote:
>> Is this to do with the way PythonReports defines/uses the fonts?
>>
>> Any hint on how to make this display correctly?

> But, using the Wing
> debugger, the unicode strings in the PDF show the e-acute as \u2020 - which
> is indeed the Unicode character 'Dagger'

If you're using Wing, then this is the string after it was decoded
into a Python unicode object, yes?

In which case, the wrong encoding is being used to decode it.

So the question is, how are strings encoded in PDF? From reading this thread:

http://stackoverflow.com/questions/128162/unicode-in-pdf

That's a hard question to answer, but presumably it is either:

All PDF text is encoded with a particular encoding
or
There is a way to specify the encoding in a particular document.

I suspect it's the latter, or you would have this problem all the
time. It could also be that PythonReports is using the wrong encoding
or specifying it incorrectly but as Adobe Reader is the reference
implementation, to some extent, if it works in Reader, it's right.

So you need to figure out how reader determines the encoding, and
emulate that. Maybe the specs will help:

http://www.adobe.com/devnet/pdf/pdf_reference.html

> So, I think the viewer is behaving correctly as far as it goes -

not really -- it's using the wrong encoding to decode the data in the
PDF -- that is not correct ( as long as you define correct as "same as
Adobe Reader" )

I'd make a tiny pdf with just a bit of non-ascii text in it, and take
a look at it. That may be easier than reading the spec!

-Chris

Chris Barker - NOAA Federal

May 23, 2013, 2:14:11 PM

to wxpytho...@googlegroups.com

BTW,

For what it's worth, Chrome's PDF viewer works OK too -- not sure if
that's Adobe under the hood....

Chris Barker - NOAA Federal

May 23, 2013, 2:24:53 PM

to wxpytho...@googlegroups.com

One more note....

$ grep -a Encoding Cellarbook\ wine\ list\ -\ portrait.pdf
/Encoding /WinAnsiEncoding

What is viewer using as an encoding when it decodes the pdf?

Though this may refer to encoding used for symbols in the PDF, rather
than text to display.

I also see this in there:

% 'toUnicodeCMap:AAAAAA+Arial-BoldMT': class PDFStream
7 0 obj
<< /Filter [ /FlateDecode ]
/Length 710 >>
stream
<<bunch of binary stuff....>>

Can't make much sense of that!

% Font Arial Bold subset 0
<< /BaseFont /AAAAAA+Arial-BoldMT
/FirstChar 0
/FontDescriptor 9 0 R
/LastChar 127
/Name /F2+0
/Subtype /TrueType
/ToUnicode 7 0 R

Given that I can't see any of the text in there when I look at it as
text (I think my terminal is set to utf-8), then it seems to be using
a multi-byte encoding of some sort -- but which one? (or it's
compressed or something -- I sure don't know anything about PDF...)

Again, I'd make a pdf with just a single paragraph of text and look at
/ experiment with that.
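
On the "bunch of binary stuff": /FlateDecode means the stream body is zlib-compressed, so it can be inflated with the standard library. The following is only a sketch: it scans the raw bytes for stream markers rather than parsing the PDF properly, and the test file contents (including the <86> to U+00E9 mapping) are invented for the demonstration.

```python
import re
import zlib

# Start of a stream body; the lookbehind skips the tail of "endstream".
STREAM_START = re.compile(rb"(?<!end)stream\r?\n")


def inflate_streams(pdf_bytes):
    """Return the decompressed body of every stream that zlib can inflate."""
    results = []
    for match in STREAM_START.finditer(pdf_bytes):
        # decompressobj stops at the end of the zlib data and ignores
        # whatever follows it (the "endstream" keyword and later objects).
        inflater = zlib.decompressobj()
        try:
            results.append(inflater.decompress(pdf_bytes[match.end():]))
        except zlib.error:
            # Not Flate data (or another filter is in use); skip it.
            continue
    return results


# Stand-in for a real file: one object holding a compressed CMap fragment.
cmap_text = b"1 beginbfchar\n<86> <00E9>\nendbfchar\n"
compressed = zlib.compress(cmap_text)
fake_pdf = (
    b"7 0 obj\n<< /Filter [ /FlateDecode ]\n/Length "
    + str(len(compressed)).encode("ascii")
    + b" >>\nstream\n"
    + compressed
    + b"\nendstream\nendobj\n"
)

streams = inflate_streams(fake_pdf)
print(streams[0].decode("ascii"))
```

To try it on a real file, read the PDF in binary mode and pass the bytes to `inflate_streams`; the ToUnicode stream should then be readable as text.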

-Chris

David Hughes

May 24, 2013, 10:02:25 AM

to wxpytho...@googlegroups.com

On 23/05/2013 19:12, Chris Barker - NOAA Federal wrote:
>> >So, I think the viewer is behaving correctly as far as it goes -
> not really -- it's using the wrong encoding to decode the data in the
> PDF -- that is not correct ( as long as you define correct as "same as
> Adobe Reader" )

The viewer itself doesn't do any encoding or decoding, it simply
receives text strings from pyPdf and draws them in a wx.DC.

I agree though that encoding is the problem. The PastedText example I
posted earlier works, I think, because it was written - via Reportlab -
using one of the standard fonts (helvetica) that Adobe provides.
Werner's pdf file contains references to AAAAAA+ArialMT, the definitions
of which seem to be embedded in the file and which, I guess, his text is
using.

The viewer doesn't currently handle embedded fonts (because I don't know
how to do it at the moment) and the problem is most likely that they are
encoded differently to the standard fonts.
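
As a sketch of what handling that different encoding could involve: an embedded subset font can carry a ToUnicode CMap that maps its private character codes back to Unicode. The code below parses only the simple `bfchar` entries of such a CMap (not `bfrange`) and applies them to one-byte codes. The CMap fragment and the function names are invented for illustration; the codes in Werner's file are not known here.

```python
import re

BFCHAR_SECTION = re.compile(r"beginbfchar(.*?)endbfchar", re.DOTALL)
BFCHAR_ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text):
    """Return {character code: unicode string} from the bfchar sections."""
    mapping = {}
    for section in BFCHAR_SECTION.findall(cmap_text):
        for code_hex, uni_hex in BFCHAR_ENTRY.findall(section):
            # Destination values are UTF-16BE code units written in hex.
            mapping[int(code_hex, 16)] = bytes.fromhex(uni_hex).decode("utf-16-be")
    return mapping


def decode_with_cmap(raw_codes, mapping):
    """Translate one-byte character codes; unmapped codes become U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in raw_codes)


# Invented CMap fragment for the demonstration.
sample_cmap = """
3 beginbfchar
<4C> <004C>
<27> <0027>
<86> <00E9>
endbfchar
"""

mapping = parse_bfchar(sample_cmap)
print(ascii(decode_with_cmap(b"L'\x86", mapping)))
```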

Werner, does PythonReports give you any choice which fonts you can use,
i.e. can you restrict it to use of the Adobe standard fonts? This
shouldn't make much difference to you in practice - Arial and Helvetica
are pretty much the same thing. Alternatively, it might make a
difference if the unicode(?) strings you pass it are encoded as 'latin-1'

Ideally, I would like to say that the viewer will be extended to handle
embedded fonts, but I have no idea what work and time would be involved.

--
Regards

David Hughes
Forestfield Software

werner

May 24, 2013, 10:39:10 AM

to wxpytho...@googlegroups.com

Hi David,

On 24/05/2013 16:02, David Hughes wrote:

...

> Werner, does PythonReports give you any choice which fonts you can
> use, i.e. can you restrict it to use of the Adobe standard fonts? This
> shouldn't make much difference to you in practice - Arial and
> Helvetica are pretty much the same thing. Alternatively, it might make
> a difference if the unicode(?) strings you pass it are encoded as
> 'latin-1'

I had a look at the font selection in the past but couldn't make it work
then - will give it another go.

All my data comes via SQLAlchemy (SA 0.8) out of a Firebird SQL DB which
uses the "UTF-8" character set. All the fields/columns use
"sa.Column(sa.Unicode(length=nn))" and I don't do any encoding/decoding,
so my guess is that it is also a font issue.

>
> Ideally, I would like to say that the viewer will be extended to
> handle embedded fonts, but I have no idea what work and time would be
> involved.

I would be happy to test this;-)

Thanks for having looked at it.

Werner

Tim Roberts

May 24, 2013, 12:32:50 PM

to wxpytho...@googlegroups.com

Chris Barker - NOAA Federal wrote:

> For what it's worth, Chrome's PDF viewer works OK too -- not sure if
> that's Adobe under the hood....

No. I was very surprised to learn that the built-in PDF viewer in
Firefox (pdf.js) is 100% JavaScript, interpreted right there in the
browser. It's an open source component. As far as I know, Chrome's
built-in viewer is a separate native plugin, not Adobe's code either.
I'm astonished that they are able to do as good a job as they do.

werner

May 24, 2013, 1:49:53 PM

to wxpytho...@googlegroups.com

Hi David,

On 24/05/2013 16:02, David Hughes wrote:

...

> Werner, does PythonReports give you any choice which fonts you can
> use, i.e. can you restrict it to use of the Adobe standard fonts? This
> shouldn't make much difference to you in practice - Arial and
> Helvetica are pretty much the same thing. Alternatively, it might make
> a difference if the unicode(?) strings you pass it are encoded as
> 'latin-1'

If I use "Helvetica" in PythonReports then I get exceptions, but they
are thrown from within ReportLab.

Could it be that the problem is within pyPdf and/or PyPDF2 (I normally
use the latter)?

I was experimenting a bit and see that there are also display issues with
PDFs generated by e.g. LibreOffice 4.x. A very simple one I did is
attached; here pdfviewer shows just a blank page.

Will try to dig around a bit more over the next few days.

Werner

testfromODF.pdf

testforodf.odt

David Hughes

Jun 11, 2013, 11:22:17 AM

to wxpytho...@googlegroups.com

On 24/05/2013 15:39, werner wrote:
> Hi David,
>
> On 24/05/2013 16:02, David Hughes wrote:
>
> ...
>
>>

>> Ideally, I would like to say that the viewer will be extended to
>> handle embedded fonts, but I have no idea what work and time would be
>> involved.
> I would be happy to test this;-)
>
> Thanks for having looked at it.
>
> Werner
>

I have now got a version of pdfviewer that works for all types of PDF.
Instead of pyPdf it uses python-fitz - the python bindings for the mupdf
library, which does all the work of extraction and rendering of the PDF
content.

I will be happy to provide you with a copy of all the source code, Werner,
but I would like to ask - Robin in particular - about the possibility
of providing it as an addition to, or a replacement for, the current
version of wx.lib.pdfviewer. My concern is that mupdf is released under
the GPL, specifically the GNU Affero General Public License version 3, and
how this would affect the wxPython licence of pdfviewer and any software
that uses it.

Robin Dunn

Jun 11, 2013, 5:48:16 PM

to wxpytho...@googlegroups.com

I've done a bit of research about this related to some work I've done at
Enthought. IMO it basically boils down to this: since DLLs (and
therefore Python extension modules) are, by their very nature,
dynamically loaded at runtime then using GPL'd DLLs (or whatever) from a
non-GPL'd program is allowed. What is still a very questionable issue
(and most likely not allowed) is distributing the GPL'd DLLs or other
binaries with the non-GPL'd program. In other words, using (dynamically
loaded) GPL with non-GPL is okay at runtime, distributing GPL in binary
form together with non-GPL is not okay. For example, if a developer
used py2exe to create an application that included python-fitz and
mupdf, then to be legally compliant the application would have to be
GPL. The alternative is that the developer would have to provide a way
for those to be downloaded and installed separately from their
application, and make that installer and whatever support code it uses
GPL too.

IANAL, this is just my interpretation, etc.

werner

Jun 12, 2013, 2:45:19 AM

to wxpytho...@googlegroups.com

Like Robin, IANAL.

It is a pity that they don't use the LGPL, which I believe does not have
the above issues.

So in my view please don't replace the existing wx.lib.pdfviewer with
this version, maybe have it as pdfviewer2 or pdfviewerAlt.

Werner

Karsten Hilbert

Jun 12, 2013, 6:22:13 AM

to wxpytho...@googlegroups.com

On Wed, Jun 12, 2013 at 08:45:19AM +0200, werner wrote:

> So in my view please don't replace the existing wx.lib.pdfviewer with
> this version, maybe have it as pdfviewer2 or pdfviewerAlt.

Has it been considered to make pdfviewer try several
backends and use the one that's available ?

Karsten
--
GPG key ID E4071346 @ gpg-keyserver.de
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346

David Hughes

Jun 12, 2013, 7:20:24 AM

to wxpytho...@googlegroups.com

On 12/06/2013 11:22, Karsten Hilbert wrote:
> On Wed, Jun 12, 2013 at 08:45:19AM +0200, werner wrote:
>
>> >So in my view please don't replace the existing wx.lib.pdfviewer with
>> >this version, maybe have it as pdfviewer2 or pdfviewerAlt.
> Has it been considered to make pdfviewer try several
> backends and use the one that's available ?

Good idea. I will try and merge the two versions of the program that I
now have and which have diverged somewhat.
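
A minimal sketch of the "try several backends" idea, assuming each backend is simply a module that may or may not be installed. The function name and the preference order are hypothetical and say nothing about how the merged pdfviewer will actually be organised.

```python
import importlib


def pick_backend(candidates):
    """Return the first module in candidates that imports cleanly, else None."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except Exception:
            # Missing or broken install: fall through to the next candidate.
            continue
    return None


# Hypothetical preference order: the mupdf binding first, then the
# pure-Python readers.
PDF_BACKENDS = ("fitz", "PyPDF2", "pyPdf")

backend = pick_backend(PDF_BACKENDS)
print(backend.__name__ if backend else "no PDF backend installed")
```

Because the GPL-licensed binding is only imported when it is already present, an arrangement like this would also keep it an optional, separately installed dependency.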
