Support Unicode on InaSAFE


Akbar Gumbira

Mar 5, 2015, 6:23:13 AM
to inasa...@googlegroups.com
Hi All,

This is a comment from my PR (https://github.com/AIFDR/inasafe/pull/1674) that I think is worth posting here, to get others' opinions and gather more comments before we add it to our developer documentation.

Copying it from there:

Hi @AIFDR/inasafe-developers, I am merging this unicode support for InaSAFE (using UTF-8 encoding). The state of the PR is:
- We can handle layers with unicode chars (writing and reading keywords)
- We can run an analysis with a hazard/exposure/aggregation layer that has unicode chars in it (either in the layer path or in a field)
- Each tool runs without problems (e.g. when the output path for buildings downloaded by the OSM downloader contains unicode chars)
- I have updated some test data to contain unicode chars
- All the tests pass

The ideal solution (and the simplest strategy) for unicode support would be to handle all strings as unicode internally, save them as byte strings for I/O operations, and turn them back into unicode right away when we read them (from a file). Here is a picture of the unicode sandwich (http://nedbatchelder.com/text/unipain/unipain.html#35). This should be our haiku: Bytes on the outside, unicode on the inside. Encode/decode at the edges.
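
A minimal sketch of the sandwich, assuming UTF-8 at both edges. The function name clean_name is made up for illustration and is not part of InaSAFE:

```python
# -*- coding: utf-8 -*-


def clean_name(raw_bytes):
    """Tidy up a name that arrives as UTF-8 bytes and leaves as UTF-8 bytes."""
    # Outside -> inside: decode the incoming bytes immediately.
    text = raw_bytes.decode('utf-8')
    # Inside: all processing happens on unicode, never on bytes.
    text = text.strip().upper()
    # Inside -> outside: encode only when the data leaves the program.
    return text.encode('utf-8')
```

The processing step never needs to know which encoding the data had on disk; only the two edges do.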

Since we are still using Python 2.7, and most of the code operates on byte strings, it's hard to reach this ideal solution. So I made some helper functions that convert from unicode to byte string and from byte string to unicode (see safe.utilities.unicode). The idea is to use these functions as little as possible while covering as many problems as possible. When we move to Python 3 (I read somewhere on the QGIS mailing list that QGIS will use Python 3 at some point), we just remove these helper functions and their usages (since a string literal, or str, in Python 3 is already unicode).
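
For readers who have not opened safe.utilities.unicode, conversion helpers of this kind could look roughly like the sketch below. This is not the InaSAFE implementation: the names to_unicode and to_bytes are hypothetical, and UTF-8 is assumed as the default encoding.

```python
# -*- coding: utf-8 -*-
import sys

# On Python 2 the text type is `unicode`; on Python 3 it is `str`.
if sys.version_info[0] >= 3:
    text_type = str
else:
    text_type = unicode  # noqa: F821 (only defined on Python 2)


def to_unicode(value, encoding='utf-8'):
    """Return value as a unicode (text) object (hypothetical helper)."""
    if isinstance(value, text_type):
        return value
    if isinstance(value, bytes):
        return value.decode(encoding)
    # Anything else (numbers, etc.) is converted via its text representation.
    return text_type(value)


def to_bytes(value, encoding='utf-8'):
    """Return value as a byte string (hypothetical helper)."""
    if isinstance(value, bytes):
        return value
    return to_unicode(value, encoding).encode(encoding)
```

Both functions leave a value alone when it is already of the requested type, so calling them twice is harmless.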


There might be some parts of the code that don't handle unicode yet (and when we develop new features, we need to make sure they are unicode aware), so here are some small strategies that could be useful when we get unicode-related errors (feel free to add to the list):
  • When we save a unicode string to a file, turn it into a byte string first, so you can do:
from safe.utilities.unicode import get_string

with open(file_path, 'w') as f:
    f.write(get_string(unicode_strings))

or, even better, use the codecs module:

import codecs

with codecs.open(file_path, 'w', encoding='utf-8') as f:
    f.write(unicode_strings)


  • Any string from the PyQt side is returned as unicode (QString is stored as unicode, which is good). Keep it as unicode and don't cast it to str unless we really need a str (in that case, use the get_string function from the unicode helper module). For example, in the OSM downloader the user can set the output path; if that path has some non-ASCII chars in it, doing this:

path = str(self.output_directory.text())

will raise a UnicodeEncodeError, because str() in Python 2 implicitly encodes the value with the ASCII codec. See this change, https://github.com/AIFDR/inasafe/pull/1674/files#diff-a1a4af021d1e15208f44fea1842e6332R334, where I removed the cast to str.

The same goes for PyQGIS: it also returns strings as unicode. So avoid doing things like this:

source = str(layer.source())  # BAD - NOT UNICODE AWARE
source = layer.source()  # GOOD

  • When you write new code and new tests, please add some non-ASCII chars to the test data (or to the test case).
  • Don't forget to add the encoding declaration (UTF-8, e.g. # -*- coding: utf-8 -*-) at the top of your Python module.
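
As a rough illustration of a test that uses non-ASCII data, here is a small round trip through a file whose name and content both contain non-ASCII chars. The function names, the file name and the sample text are made up for illustration. io.open is used only because it behaves the same way on Python 2.7 and Python 3, and the sketch assumes the file system accepts non-ASCII file names.

```python
# -*- coding: utf-8 -*-
import io
import os
import shutil
import tempfile


def write_and_read_back(text):
    """Write text to a file with a non-ASCII name and read it back."""
    temp_dir = tempfile.mkdtemp()
    try:
        # Hypothetical file name containing a non-ASCII character.
        file_path = os.path.join(temp_dir, u'mot-clé.txt')
        # Encode at the edge: the file on disk holds UTF-8 bytes.
        with io.open(file_path, 'w', encoding='utf-8') as f:
            f.write(text)
        # Decode at the edge: reading gives unicode back right away.
        with io.open(file_path, 'r', encoding='utf-8') as f:
            return f.read()
    finally:
        # Always clean up the temporary directory.
        shutil.rmtree(temp_dir)


def test_round_trip():
    """The text read back must equal the text written."""
    sample = u'Gedung sekolah: café, 東京'
    assert write_and_read_back(sample) == sample
```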

Perhaps other developers want to add some strategies before we add this to our developer documentation.

Regards
--

-------------------

Akbar Gumbira
Software Engineer
Geospatial, NLP, Data Mining, Machine Learning, Artificial Intelligence