python3, Umlauts, utf-8 issue

JDi

May 30, 2017, 10:35:02 PM
to antlr-discussion

I have a problem processing umlauts (utf-8) in the Python code produced by `antlr4 -Dlanguage=Python3`.


**How to reproduce:**

Download JSON.g4, json2xml.py and t.json from here:

https://github.com/jszheng/py3antlr4book/tree/master/08-JSON

Verify that everything is OK without umlauts:

`python3 json2xml.py t.json # is o.k.`

Copy t.json to t_uml.json and add a name containing an umlaut, so that line 5 now looks like this:

"admin": ["parrt", "tombu", "jürgen"],

In a hex editor the name jürgen looks like this: 6A C3 BC 72 67 65 6E, i.e. it is valid utf-8.
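The byte sequence can be cross-checked from Python itself. This is only a small sketch; the variable names are made up for illustration, and the umlaut is written as an escape so the snippet does not depend on the encoding of the source file:

```python
# "j\u00fcrgen" is "jürgen"; U+00FC is the letter u with diaeresis.
name = "j\u00fcrgen"

# Encode to UTF-8 and format the bytes the way a hex editor shows them.
encoded = name.encode("utf-8")
hex_dump = " ".join(f"{b:02X}" for b in encoded)

print(hex_dump)  # 6A C3 BC 72 67 65 6E
```

The umlaut takes two bytes (C3 BC), which is why the file is one byte longer than the number of characters suggests.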


```
python3 json2xml.py t_uml.json
Traceback (most recent call last):
  File "json2xml.py", line 64, in <module>
    input_stream = FileStream(sys.argv[1]) # Original
  File "/usr/local/lib/python3.5/dist-packages/antlr4/FileStream.py", line 20, in __init__
    super().__init__(self.readDataFrom(fileName, encoding, errors))
  File "/usr/local/lib/python3.5/dist-packages/antlr4/FileStream.py", line 27, in readDataFrom
    return codecs.decode(bytes, encoding, errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 169: ordinal not in range(128)
```
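The last frame of the traceback shows a plain `codecs.decode(bytes, encoding, errors)` call failing with the `'ascii'` codec, so the same failure can be reproduced without ANTLR. A minimal sketch, assuming only the standard library; the sample line is the one from t_uml.json and the variable names are illustrative:

```python
import codecs

# Line 5 of t_uml.json as the raw UTF-8 bytes found on disk.
raw = '"admin": ["parrt", "tombu", "j\u00fcrgen"],'.encode("utf-8")

# Decoding as ASCII fails on 0xC3, the first byte of the two-byte umlaut.
try:
    codecs.decode(raw, "ascii", "strict")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Decoding the very same bytes as UTF-8 succeeds.
text = codecs.decode(raw, "utf-8", "strict")

print(ascii_error)
print(text)
```

This suggests the bytes in the file are fine and that the codec chosen for decoding is what matters.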

Next I tried to decode the file as utf-8 explicitly. I replaced

` input_stream = FileStream(sys.argv[1])`

by

```
fp = codecs.open(sys.argv[1], 'rb', 'utf-8')
try:
    input_stream = fp.read()
finally:
    fp.close()
```

which I found on Stack Overflow.

```
python3 json2xml.py t_uml.json
Traceback (most recent call last):
  File "json2xml.py", line 75, in <module>
    tree = parser.json()
  File "/media/sf_Entwicklung/antlr/08-JSON-Umlaut/JSONParser.py", line 112, in json
    self.enterRule(localctx, 0, self.RULE_json)
  File "/usr/local/lib/python3.5/dist-packages/antlr4/Parser.py", line 358, in enterRule
    self._ctx.start = self._input.LT(1)
  File "/usr/local/lib/python3.5/dist-packages/antlr4/CommonTokenStream.py", line 61, in LT
    self.lazyInit()
  File "/usr/local/lib/python3.5/dist-packages/antlr4/BufferedTokenStream.py", line 186, in lazyInit
    self.setup()
  File "/usr/local/lib/python3.5/dist-packages/antlr4/BufferedTokenStream.py", line 189, in setup
    self.sync(0)
  File "/usr/local/lib/python3.5/dist-packages/antlr4/BufferedTokenStream.py", line 111, in sync
    fetched = self.fetch(n)
  File "/usr/local/lib/python3.5/dist-packages/antlr4/BufferedTokenStream.py", line 123, in fetch
    t = self.tokenSource.nextToken()
  File "/usr/local/lib/python3.5/dist-packages/antlr4/Lexer.py", line 111, in nextToken
    tokenStartMarker = self._input.mark()
AttributeError: 'str' object has no attribute 'mark'
```

The same error also occurs without umlauts, e.g.:

`python3 json2xml.py t.json`
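That the error shows up with and without umlauts points at the type of `input_stream` rather than at the encoding: `fp.read()` returns a plain `str`, while the lexer (per the last traceback frame) calls `mark()` on its input. A small sketch of just that mismatch, using only the standard library; `io.StringIO` stands in for the opened file and the sample text is hypothetical:

```python
import io

# Stand-in for the decoded file contents returned by fp.read().
text = io.StringIO('{"admin": ["j\u00fcrgen"]}').read()

# A plain str carries no stream methods such as mark().
print(type(text).__name__, hasattr(text, "mark"))

# Calling mark() on it fails the same way as inside Lexer.nextToken().
try:
    text.mark()
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

If reading the file by hand is preferred, wrapping the string in the runtime's stream class, e.g. `InputStream(text)` from the `antlr4` package, should give the lexer the interface it expects. I have not run that variant here, so treat it as an assumption.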


Now, to verify that the problem is not related to the grammar JSON.g4, I did:

```
antlr4 JSON.g4
javac *.java
run JSON json -gui t_uml.json
```

which displayed the three names as an array, with the umlaut rendered correctly.

**Conclusion**: The grammar is OK, but there is a problem in the generated Python modules.

Eric Vergnaud

May 31, 2017, 12:19:36 AM
to antlr-discussion
It seems you didn't set the encoding when instantiating FileStream.

JDi

May 31, 2017, 1:05:14 AM
to antlr-discussion
Thanks, my fault:

`FileStream(sys.argv[1], encoding='utf-8')`

solved it.