Another encoding issue


chris

Oct 27, 2010, 8:23:11 AM
to Mappa - Topic Maps
Hi there,

I just ran into another encoding error.

While writing a little test to include in a module, I am loading the
topic map in LTM format from a string, for simplicity. The string
contains Unicode characters and the whole file is encoded in UTF-8.
If I load this, I get an error like this:
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/tm.reader.ltm-0.1.4-py2.6.egg/mio/reader/ltm/lexer.py", line 86, in t_error
    raise MIOException('Unexpected token "%r"' % t)
tm.mio._exceptions.MIOException: Unexpected token "LexToken(error,u'\x00\x00\x00@\x00\x00\x00"\x00\x00\x00u\x00\x00\x00t\x00\x00\x00f\x00\x00\x00-\x00\x00\x008\x00\x00\x00"

Here is the relevant part of my test file:

#!/usr/bin/env python -*- coding: utf-8 -*-

[...]

if __name__ == '__main__':
    from mappa.utils import *
    import codecs, mappa
    sys.stdout = codecs.getwriter('utf8')(sys.stdout)
    testmap=u'''
@"utf-8"
#VERSION "1.3"
#TOPICMAP ~ myfirst
[myfirst = "My first topic map"
= "My TM" / short-name
= "維習安的TM" / zh]
{myfirst, date, [[2010-10-07]]}
'''
    conn = mappa.connect()
    conn.loads(testmap, into="testmap.ltm", format="ltm")
    tm = conn.get("testmap.ltm")


Chris

Lars Heuer

Oct 27, 2010, 5:39:23 PM
to chris
Hi Chris,

[...]


> Here is the relevant part of my test file:

> #!/usr/bin/env python -*- coding: utf-8 -*-

I think this is the source of the failure. You should put

# -*- coding: utf-8 -*-

into the 2nd line of your source file and not into the same line as
the interpreter. Python does not detect the correct encoding of the
source file.

This works for me:

>>> import mappa
>>> conn = mappa.connect()
>>> src = 'http://cxtm-tests.svn.sourceforge.net/viewvc/cxtm-tests/trunk/ltm/in/utf-8.ltm'
>>> conn.load(src, into='http://www.example.org/my.map', format='ltm')
>>>

And the referenced file also uses UTF-8.

Best regards,
Lars
--
Semagia
<http://www.semagia.com>

Lars Heuer

Oct 27, 2010, 5:50:04 PM
to Lars Heuer
[...]
> This works for me:
[...]

> >>> conn.load(src, into='http://www.example.org/my.map', format='ltm')
> >>>

... and this too:

>>> import mappa
>>> from urllib import urlopen
>>> s = urlopen('http://cxtm-tests.svn.sourceforge.net/viewvc/cxtm-tests/trunk/ltm/in/utf-8.ltm').read()
>>> s
'@"utf-8"\n[hiragana = "\xe3\x81\xb2\xe3\x82\x89\xe3\x81\x8c\xe3\x81\xaa"\n @"http://psi.ontopia.net/iso/15924.xtm#hira"]\n'
>>> conn = mappa.connect()
>>> conn.loads(s, into='http://example.org/testmap', format='ltm')
>>>

I think it's an encoding issue of your Python file.

Christian Wittern

Oct 27, 2010, 8:46:15 PM
to ma...@googlegroups.com
Hi Lars,

On 2010-10-28 06:39, Lars Heuer wrote:
>
>> #!/usr/bin/env python -*- coding: utf-8 -*-
>>
> I think this is the source of the failure.

Unfortunately not; in this case you are wrong. Since the file does
contain non-ASCII UTF-8 characters, the Python interpreter would
immediately complain about them if the encoding declaration were not
recognized.


> You should put
>
> # -*- coding: utf-8 -*-
>
> into the 2nd line of your source file and not into the same line as
> the interpreter. Python does not detect the correct encoding of the
> source file.
>

In fact I tried this, just to be 100% sure -- it does not change anything.

> This works for me:
>
> >>> import mappa
> >>> conn = mappa.connect()
> >>> src = 'http://cxtm-tests.svn.sourceforge.net/viewvc/cxtm-tests/trunk/ltm/in/utf-8.ltm'
> >>> conn.load(src, into='http://www.example.org/my.map', format='ltm')
> >>>
>
>

Then it depends on what is meant by "works". Please also note that in
my example, I am using the string loading, which might have different
problems from the file loading. In fact, I have used the file loading
on my files all the time, so I can confirm that there is no problem there.

All the best,

Christian

--
Christian Wittern
Institute for Research in Humanities, Kyoto University
47 Higashiogura-cho, Kitashirakawa, Sakyo-ku, Kyoto 606-8265, JAPAN

Christian Wittern

Oct 27, 2010, 8:54:56 PM
to ma...@googlegroups.com
Hi Lars,

On 2010-10-28 06:50, Lars Heuer wrote:
>
> ... and this too:
>
> >>> import mappa
> >>> from urllib import urlopen
> >>> s = urlopen('http://cxtm-tests.svn.sourceforge.net/viewvc/cxtm-tests/trunk/ltm/in/utf-8.ltm').read()
> >>> s
> '@"utf-8"\n[hiragana = "\xe3\x81\xb2\xe3\x82\x89\xe3\x81\x8c\xe3\x81\xaa"\n @"http://psi.ontopia.net/iso/15924.xtm#hira"]\n'
>

This in fact proves that the file is being read as 8-bit bytes and not
decoded into Unicode. Try opening it using the codecs module:

>>> import codecs
>>> f=codecs.open('/tmp/utf-8.ltm', 'r', 'utf-8')
>>> s=f.read()
>>> s
u'@"utf-8"\n[hiragana = "\u3072\u3089\u304c\u306a"\n @"http://psi.ontopia.net/iso/15924.xtm#hira"]\n'

This is the correct representation of the file.
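
The difference between the two representations can be checked without mappa. The sketch below assumes Python 3 syntax (bytes literals and str) rather than the Python 2.6 used in this thread; the byte values are the ones from the topic name printed earlier.

```python
# The topic name as it appears in the undecoded byte string (UTF-8).
raw = b"\xe3\x81\xb2\xe3\x82\x89\xe3\x81\x8c\xe3\x81\xaa"

# Decoding turns the bytes into Unicode code points.
text = raw.decode("utf-8")

# Three bytes per character before decoding, one code point each after.
print(len(raw), len(text))
print(text == "\u3072\u3089\u304c\u306a")
```

A reader that only ever receives the first form never has to deal with decoded text, which is why loading the byte string does not exercise the failing path.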

If you now do this:


> >>> conn = mappa.connect()
> >>> conn.loads(s, into='http://example.org/testmap', format='ltm')
> >>>
>
>

You will get exactly the error I reported.

> I think it's an encoding issue of your Python file.
>

Which proves that this is an encoding issue within the mio reader somewhere.
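
One possible reading of the token in the original error report, offered as a guess and not confirmed anywhere in this thread: each character appears padded to four bytes, as if the Unicode string had been handed on in a fixed-width internal form instead of being encoded. The sketch below (Python 3 syntax assumed) copies the bytes from the truncated token and decodes them with a UTF-32 codec purely to make the four-byte pattern visible; it says nothing about what the mio reader actually does.

```python
# Bytes copied from the (truncated) token in the reported error.
token = (
    b'\x00\x00\x00@\x00\x00\x00"\x00\x00\x00u\x00\x00\x00t'
    b'\x00\x00\x00f\x00\x00\x00-\x00\x00\x008'
)

# Reading the bytes as fixed-width four-byte units recovers the start of
# the LTM encoding directive from the test string.
recovered = token.decode("utf-32-be")
print(recovered)
```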

All the best,

Chris, who has had his share of encoding issues to deal with:-)


Lars Heuer

Oct 28, 2010, 3:34:20 AM
to Christian Wittern
Hi Christian,

[...]


> u'@"utf-8"\n[hiragana = "\u3072\u3089\u304c\u306a"\n
> @"http://psi.ontopia.net/iso/15924.xtm#hira"]\n'

> This is the correct representation of the file.

You're right, of course.

> If you now do this:
>> >>> conn = mappa.connect()
>> >>> conn.loads(s, into='http://example.org/testmap', format='ltm')
>> >>>
>>
>>
> You will get exactly the error I reported.

Exactly :(


[...]


> who has had his share of encoding issues to deal with:-)

I'll fix it asap.

Christian Wittern

Oct 28, 2010, 3:45:26 AM
to ma...@googlegroups.com
On 2010-10-28 16:34, Lars Heuer wrote:
>
> who has had his share of encoding issues to deal with:-)
> I'll fix it asap.
>
Lars, I hope you did not misunderstand this. What I wanted to say was only
that, from the experience of many years of programming with East-Asian
characters, I know an encoding issue when I see one.

Cheers,

Lars Heuer

Oct 28, 2010, 3:46:13 AM
to Christian Wittern
Hi Christian,

[...]


> Lars, I hope you did not misunderstand this. What I wanted to say was only
> that, from the experience of many years of programming with East-Asian
> characters I know an encoding issue when I see one.

No, I did not misunderstand it. Probably I forgot a smiley. ;) It's a
bug and I'll fix it. :)

Christian Wittern

Oct 28, 2010, 3:55:12 AM
to ma...@googlegroups.com
On 2010-10-28 16:46, Lars Heuer wrote:
> No, I did not misunderstand it. Probably I forgot a smiley. ;) It's a
> bug and I'll fix it. :)
>
That's great, take your time.

Hacking topic maps in python is so much more fun than doing the same in Java:-)

All the best,

Christian

Lars Heuer

Oct 28, 2010, 5:53:23 AM
to Christian Wittern
Hi Christian,

[...]


>> It's a bug and I'll fix it. :)
>>
> That's great, take your time.

It's fixed in rev. 385:

>>> import codecs, mappa
>>> from urllib import urlopen
>>> s = codecs.getreader('utf-8')(urlopen('http://cxtm-tests.svn.sourceforge.net/viewvc/cxtm-tests/trunk/ltm/in/utf-8.ltm')).read()
>>> s
u'@"utf-8"\n[hiragana = "\u3072\u3089\u304c\u306a"\n @"http://psi.ontopia.net/iso/15924.xtm#hira"]\n'
>>> conn = mappa.connect()
>>> conn.loads(s, into='http://www.semagia.com/map', format='ltm')
>>> tm = conn.get('http://www.semagia.com/map')
>>> for topic in tm.topics:
...     for name in topic.names:
...         print name.value
...
ひらがな
>>>
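
The codecs.getreader call in the session above can be tried in isolation. This is a minimal sketch, assuming Python 3 and using an in-memory io.BytesIO as a stand-in for the HTTP response; the LTM content is shortened from the test file and mappa is not involved.

```python
import codecs
import io

# Stand-in for the urlopen() response: a byte stream holding UTF-8 data.
raw_stream = io.BytesIO(
    b'@"utf-8"\n[hiragana = "\xe3\x81\xb2\xe3\x82\x89\xe3\x81\x8c\xe3\x81\xaa"]\n'
)

# codecs.getreader returns a StreamReader class for the codec; calling it
# wraps the byte stream so that read() yields decoded text.
reader = codecs.getreader("utf-8")(raw_stream)
text = reader.read()
print(text)
```

Wrapping the stream this way produces the decoded string that is then passed to conn.loads, which is the case the fix is meant to handle.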

I'll prepare a tm release soon. Meanwhile you may copy
<https://code.google.com/p/mappa/source/browse/tm/trunk/src/tm/mio/_source.py>
into your site-packages/tm/mio/ folder.

> Hacking topic maps in python is so much more fun than doing the same in Java:-)

Python is more fun anyway. :) Well, until it comes to Unicode issues. ;)

Christian Wittern

Oct 28, 2010, 8:36:44 AM
to ma...@googlegroups.com
Hi Lars,

On 28 October 2010 18:53, Lars Heuer <he...@semagia.com> wrote:
>
> ひらがな

yes, that's right!

> >>>
>
> I'll prepare a tm release soon. Meanwhile you may copy
> <https://code.google.com/p/mappa/source/browse/tm/trunk/src/tm/mio/_source.py>
> into your
> site-packages/tm/mio/
> folder

Yep, this works for me now. Great.

>
>> Hacking topic maps in python is so much more fun than doing the same in Java:-)
>
> Python is more fun anyway. :) Well, until it comes to Unicode issues.
> ;)

But Java has its own set of Unicode issues, especially with so-called
wide characters. Python is more transparent and issues can be solved
as soon as they are understood.

Cheers,

Christian

--
Christian Wittern, Kyoto

Lars Heuer

Oct 28, 2010, 12:07:00 PM
to Christian Wittern
Hi Christian,

[...]


>> I'll prepare a tm release soon. Meanwhile you may copy
>> <https://code.google.com/p/mappa/source/browse/tm/trunk/src/tm/mio/_source.py>
>> into your
>> site-packages/tm/mio/
>> folder

> Yep, this works for me now. Great.

Goodie. Thanks for reporting the issue and for not trusting my initial
explanations :)
