Hi @AIFDR/inasafe-developers, I am merging this PR, which adds unicode support to InaSAFE (using UTF-8 encoding). The current state of the PR is:
- We can handle layers with unicode characters (writing and reading keywords).
- We can run an analysis with hazard/exposure/aggregation layers that contain unicode characters (either in the layer path or in a field).
- Each tool runs without problems (e.g. when the output path for the buildings downloaded by the OSM downloader contains unicode characters).
- I have updated some test data to contain unicode characters.
- All the tests pass.
The ideal solution (and the simplest strategy) for unicode support would be to handle every string as unicode internally, encode it to a byte string only for I/O operations, and decode it back to unicode as soon as we read it (from a file). Here is a picture of the unicode sandwich (http://nedbatchelder.com/text/unipain/unipain.html#35). This should be our haiku: Bytes on the outside, unicode on the inside. Encode/decode at the edges.
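To make the sandwich concrete, here is a minimal sketch. The helper names (`read_text`, `write_text`, `make_title`) and the file name are made up for this illustration and are not part of InaSAFE; the code is written so that it should behave the same on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import shutil
import tempfile


def read_text(path, encoding='utf-8'):
    """Input edge: decode bytes to unicode as soon as they are read."""
    with io.open(path, 'rb') as handle:
        return handle.read().decode(encoding)


def write_text(path, text, encoding='utf-8'):
    """Output edge: encode unicode to bytes only when writing."""
    with io.open(path, 'wb') as handle:
        handle.write(text.encode(encoding))


def make_title(name):
    """The filling: unicode in, unicode out, no encoding logic here."""
    return u'Title: ' + name.upper()


tmp_dir = tempfile.mkdtemp()
try:
    path = os.path.join(tmp_dir, 'keywords.txt')  # hypothetical file name
    write_text(path, u'd\u00e9sa')       # unicode is encoded at the edge
    name = read_text(path)               # and decoded right after reading
    title = make_title(name)             # all processing stays in unicode
    with io.open(path, 'rb') as handle:
        raw = handle.read()              # what is actually on disk: bytes
finally:
    shutil.rmtree(tmp_dir)
```

Only `read_text` and `write_text` know about UTF-8; everything between them works on unicode and never needs to think about encodings.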
Since we are still using Python 2.7, and most of the code operates on byte strings, this ideal solution is hard to reach. So I added some helper functions that convert from unicode to byte string and from byte string to unicode (see safe.utilities.unicode). The idea is to use these functions as little as possible while covering as many problems as possible. When we move to Python 3 (I read somewhere on the QGIS mailing list that QGIS will switch to Python 3 at some point), we can simply remove these helper functions and their usages, since string literals (str) in Python 3 are already unicode.
Some parts of the code might not handle unicode yet, and when we develop new features in the future we need to make sure they are unicode aware. Here are some small strategies that could be useful when we hit unicode-related errors (feel free to add to the list):
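For readers who have not opened safe.utilities.unicode, conversion helpers of this kind could look roughly like the sketch below. This is an assumption about the general shape, not the actual InaSAFE code: the names `to_text` and `to_bytes` are hypothetical, and the real helpers may differ in name, signature, and error handling.

```python
# -*- coding: utf-8 -*-

# The text type is `unicode` on Python 2 and `str` on Python 3.
text_type = type(u'')


def to_text(value, encoding='utf-8'):
    """Return value as a unicode string, decoding byte strings if needed."""
    if isinstance(value, text_type):
        return value
    if isinstance(value, bytes):
        return value.decode(encoding)
    # Anything else (numbers, None, ...) is converted via its text form.
    return text_type(value)


def to_bytes(value, encoding='utf-8'):
    """Return value as a byte string, encoding unicode strings if needed."""
    if isinstance(value, bytes):
        return value
    return to_text(value).encode(encoding)


district = u'Kabupaten \u00c5'
encoded = to_bytes(district)   # bytes, safe to hand to byte-oriented I/O
decoded = to_text(encoded)     # back to unicode for internal use
```

Both functions are idempotent (passing an already converted value is a no-op), which is what lets them be dropped in at the I/O edges without first checking what type a caller passed.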
- When we save a unicode string to a file, convert it to a byte string first. For example:

        from safe.utilities.unicode import get_string

        # Encode at the edge: the file receives bytes, not unicode.
        with open(file_path, 'w') as f:
            f.write(get_string(unicode_strings))
- Casting a unicode string that contains non-ASCII characters to str will raise an error. See this change, where I removed the cast to str: https://github.com/AIFDR/inasafe/pull/1674/files#diff-a1a4af021d1e15208f44fea1842e6332R334
  The same applies to PyQGIS, which also returns strings as unicode. So avoid doing things like this:

        source = str(layer.source())  # BAD - NOT UNICODE AWARE
        source = layer.source()  # GOOD
- When you write new code and new tests, please add some non-ASCII characters to the test data (or to the test case).
- Don't forget to add the encoding declaration (UTF-8) to your Python module, e.g. `# -*- coding: utf-8 -*-` at the top of the file.
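As a rough illustration of the last few points, the sketch below shows a module with the encoding declaration, a check that mimics what `str()` does to unicode on Python 2 (an implicit ASCII encode), and a small test that uses non-ASCII data. `layer_source` is a made-up stand-in for the value `layer.source()` would return, and `ascii_cast_fails` and `UnicodePathTest` are hypothetical names; no QGIS is needed to run it.

```python
# -*- coding: utf-8 -*-
import unittest

# Made-up stand-in for a unicode path such as the one layer.source() returns.
layer_source = u'/data/b\u00e2timents/exposure.shp'


def ascii_cast_fails(text):
    """Return True if encoding text as ASCII raises an error.

    On Python 2, str(text) performs this ASCII encoding implicitly,
    which is why the cast breaks on non-ASCII characters.
    """
    try:
        text.encode('ascii')
    except UnicodeEncodeError:
        return True
    return False


class UnicodePathTest(unittest.TestCase):
    """Tests that deliberately use non-ASCII test data."""

    def test_non_ascii_source_survives_utf8_round_trip(self):
        encoded = layer_source.encode('utf-8')
        self.assertEqual(encoded.decode('utf-8'), layer_source)

    def test_ascii_cast_is_unsafe(self):
        self.assertTrue(ascii_cast_fails(layer_source))
        self.assertFalse(ascii_cast_fails(u'/data/buildings/exposure.shp'))
```

A test that only uses ASCII paths would pass even with the `str()` cast in place, which is the reason for putting non-ASCII characters into the test data.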
Perhaps other developers would like to add more strategies before we add this to our developer documentation.