I have a problem processing umlauts (UTF-8) in the Python code generated by `antlr4 -Dlanguage=Python3`.
**How to reproduce:**
Download JSON.g4, json2xml.py and t.json from here:
https://github.com/jszheng/py3antlr4book/tree/master/08-JSON
Verify that everything is OK without umlauts:
`python3 json2xml.py t.json # is o.k.`
Copy t.json to t_uml.json and add a name containing an umlaut, so that line 5 now looks like this:
"admin": ["parrt", "tombu", "jürgen"],
In a hex editor the name jürgen looks like this: 6A C3 BC 72 67 65 6E, i.e. it is valid UTF-8 (ü is encoded as C3 BC). Running the script on this file fails:
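The byte sequence can be cross-checked with a few lines of plain Python. This is only a sanity check of the hex dump above and does not involve ANTLR. The variable names are illustrative.

```python
# Bytes of the name as shown in the hex editor.
raw = bytes.fromhex("6A C3 BC 72 67 65 6E")

# Decoding as UTF-8 succeeds and yields six characters from seven bytes,
# because the umlaut occupies two bytes (C3 BC).
name = raw.decode("utf-8")

# At least one byte is >= 0x80, so the data is not pure ASCII.
is_ascii_only = all(b < 0x80 for b in raw)

print(name, len(raw), len(name), is_ascii_only)
```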
```
python3 json2xml.py t_uml.json
Traceback (most recent call last):
  File "json2xml.py", line 64, in <module>
    input_stream = FileStream(sys.argv[1]) # Original
  File "/usr/local/lib/python3.5/dist-packages/antlr4/FileStream.py", line 20, in __init__
    super().__init__(self.readDataFrom(fileName, encoding, errors))
  File "/usr/local/lib/python3.5/dist-packages/antlr4/FileStream.py", line 27, in readDataFrom
    return codecs.decode(bytes, encoding, errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 169: ordinal not in range(128)
```
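The decoding failure itself can be reproduced without ANTLR. The traceback shows `FileStream` calling `codecs.decode(bytes, encoding, errors)` with the `'ascii'` codec, so the sketch below applies the same call to the modified line 5 only. The position reported here is relative to that single line, not to the whole file.

```python
import codecs

# Line 5 of t_uml.json as stored on disk: the umlaut is the byte pair C3 BC.
raw = b'"admin": ["parrt", "tombu", "j\xc3\xbcrgen"],'

# Same call as in FileStream.readDataFrom, with the 'ascii' codec.
try:
    codecs.decode(raw, "ascii", "strict")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# The identical bytes decode cleanly once the codec is 'utf-8'.
decoded = codecs.decode(raw, "utf-8", "strict")

print(ascii_error)
print(decoded)
```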
Next, I tried to decode the file as UTF-8 explicitly. I replaced
`input_stream = FileStream(sys.argv[1])`
with
```
fp = codecs.open(sys.argv[1], 'rb', 'utf-8')
try:
    input_stream = fp.read()
finally:
    fp.close()
```
which I found on Stack Overflow. This fails differently:
```
python3 json2xml.py t_uml.json
Traceback (most recent call last):
  File "json2xml.py", line 75, in <module>
    tree = parser.json()
  File "/media/sf_Entwicklung/antlr/08-JSON-Umlaut/JSONParser.py", line 112, in json
    self.enterRule(localctx, 0, self.RULE_json)
  File "/usr/local/lib/python3.5/dist-packages/antlr4/Parser.py", line 358, in enterRule
    self._ctx.start = self._input.LT(1)
  File "/usr/local/lib/python3.5/dist-packages/antlr4/CommonTokenStream.py", line 61, in LT
    self.lazyInit()
  File "/usr/local/lib/python3.5/dist-packages/antlr4/BufferedTokenStream.py", line 186, in lazyInit
    self.setup()
  File "/usr/local/lib/python3.5/dist-packages/antlr4/BufferedTokenStream.py", line 189, in setup
    self.sync(0)
  File "/usr/local/lib/python3.5/dist-packages/antlr4/BufferedTokenStream.py", line 111, in sync
    fetched = self.fetch(n)
  File "/usr/local/lib/python3.5/dist-packages/antlr4/BufferedTokenStream.py", line 123, in fetch
    t = self.tokenSource.nextToken()
  File "/usr/local/lib/python3.5/dist-packages/antlr4/Lexer.py", line 111, in nextToken
    tokenStartMarker = self._input.mark()
AttributeError: 'str' object has no attribute 'mark'
```
The same error occurs without umlauts, e.g.:
`python3 json2xml.py t.json `
To verify that the problem is not related to the grammar JSON.g4, I built and ran the Java version:
```
antlr4 JSON.g4
javac *.java
run JSON json -gui t_uml.json
```
This displayed the three names as an array, and the umlaut was rendered correctly.
**Conclusion**: The grammar is o.k., but there is a problem in the generated Python modules.