UTF-8 special characters, å ä ö


josef....@gmail.com

May 24, 2016, 5:27:34 PM
to OpenREM
Hello,
I'm trying out 0.7.0b14 together with PostgreSQL and gunicorn. I have some test data that I run on new installations to make sure that everything is readable.

In one of the test RDSR files, coming from a Siemens Definition Flash, we have included Swedish characters. When trying to import that file I run into a byte encoding error:


  File "/openrem/venv/openrem7/bin/openrem_rdsr.py", line 23, in <module>
    rdsr(filename)
  File "/openrem/venv/openrem7/local/lib/python2.7/site-packages/celery/local.py", line 188, in __call__
    return self._get_current_object()(*a, **kw)
  File "/openrem/venv/openrem7/local/lib/python2.7/site-packages/celery/app/task.py", line 420, in __call__
    return self.run(*args, **kwargs)
  File "/openrem/venv/openrem7/local/lib/python2.7/site-packages/openrem/remapp/extractors/rdsr.py", line 869, in rdsr
    _rsdr2db(dataset)
  File "/openrem/venv/openrem7/local/lib/python2.7/site-packages/openrem/remapp/extractors/rdsr.py", line 831, in _rsdr2db
    _generalstudymoduleattributes(dataset,g)
  File "/openrem/venv/openrem7/local/lib/python2.7/site-packages/openrem/remapp/extractors/rdsr.py", line 801, in _generalstudymoduleattributes
    g.save()
  File "/openrem/venv/openrem7/local/lib/python2.7/site-packages/django/db/models/base.py", line 734, in save
    force_update=force_update, update_fields=update_fields)
  File "/openrem/venv/openrem7/local/lib/python2.7/site-packages/django/db/models/base.py", line 762, in save_base
    updated = self._save_table(raw, cls, force_insert, force_update, using, update_fields)
  File "/openrem/venv/openrem7/local/lib/python2.7/site-packages/django/db/models/base.py", line 827, in _save_table
    forced_update)
  File "/openrem/venv/openrem7/local/lib/python2.7/site-packages/django/db/models/base.py", line 877, in _do_update
    return filtered._update(values) > 0
  File "/openrem/venv/openrem7/local/lib/python2.7/site-packages/django/db/models/query.py", line 580, in _update
    return query.get_compiler(self.db).execute_sql(CURSOR)
  File "/openrem/venv/openrem7/local/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 1062, in execute_sql
    cursor = super(SQLUpdateCompiler, self).execute_sql(result_type)
  File "/openrem/venv/openrem7/local/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 840, in execute_sql
    cursor.execute(sql, params)
  File "/openrem/venv/openrem7/local/lib/python2.7/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
  File "/openrem/venv/openrem7/local/lib/python2.7/site-packages/django/db/utils.py", line 98, in __exit__
    six.reraise(dj_exc_type, dj_exc_value, traceback)
  File "/openrem/venv/openrem7/local/lib/python2.7/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
django.db.utils.DataError: invalid byte sequence for encoding "UTF8": 0xf6 0x64 0x65 0x72


The 0xf6 byte is the Swedish character 'ö' in ISO-8859-1 encoding.
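
To illustrate the point, here is a small Python 3 sketch (not part of OpenREM) that takes the four bytes quoted in the error message and tries to read them both ways. The variable names are only for illustration.

```python
# The four bytes quoted in the PostgreSQL error message.
raw = b"\xf6\x64\x65\x72"

# Read as ISO-8859-1 (Latin-1), every byte maps to exactly one character.
text = raw.decode("iso-8859-1")
print(text)  # öder

# Read as UTF-8, the byte 0xf6 cannot start a valid sequence,
# so the decode fails, much like the database insert did.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)  # False
```

So the bytes appear to be valid ISO-8859-1 text that is being passed on to the UTF8 database without conversion.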

Kind regards,
Josef Lundman

Ed McDonagh

May 24, 2016, 5:55:08 PM
to OpenREM
Hi Josef

Can you try upgrading to 0.7.0b15? I fixed some issues with non-ASCII text between beta 14 and 15!

I'm hoping that will resolve your problem.

You should just need to do a pip install openrem==0.7.0b15 followed by a database migration. See http://docs.openrem.org/en/latest/release-0.7.0.html#upgrading-from-version-0-7-0-beta-7-or-later for details.

Please do feed back how you get on!

Ed

josef....@gmail.com

May 26, 2016, 4:21:46 AM
to OpenREM
Hi Ed

I've now tried importing the file using the b15 version. The result is, I'm sad to say, the same as before.

I'll try again later using MySQL instead of PostgreSQL. I'll get back to you to let you know if that works.

/Josef

Ed McDonagh

May 26, 2016, 4:56:16 AM
to OpenREM, josef....@gmail.com
Oh dear. Are you able to send me an example? Preferably a QA procedure that you've inserted some of these characters in, or an anonymised RDSR that still has the characters and fails on import.

It does look like the database is complaining though - which field is it?

josef....@gmail.com

May 26, 2016, 5:30:22 AM
to OpenREM, josef....@gmail.com
I'll see if I can get you example files with different fields containing the special characters.

In this file I have several fields with Swedish characters, e.g. patient name, protocol, prescribing physician. However, I believe it is the prescribing physician that the error is reported for.

Ed McDonagh

May 27, 2016, 6:41:21 AM
to OpenREM, josef....@gmail.com
I've modified one of my Siemens CT RDSR files by adding all three of the letters å ä ö into the following fields:
  • ReferringPhysicianName
  • PhysiciansOfRecord
  • PatientName
I've then turned on storing of patient names, and imported the file into a PostgreSQL database using OpenREM 0.7.0b15. It worked fine!

I did have a problem when I searched for the patient in the interface, because the hashing function didn't like the non-ASCII characters, but I've fixed this in issue 400.

What Specific Character Set (0008,0005) is stated in your DICOM files, Josef?
Mine have ISO_IR 100, which is the alias DICOM uses for ISO-8859-1, which should be the encoding that we want.
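
A minimal Python 3 sketch of how that alias could be used when decoding, assuming a small hand-written mapping from Specific Character Set terms to Python codec names. The dictionary, the helper `decode_dicom_text` and the sample name are all hypothetical; this is not OpenREM or pydicom code.

```python
# Partial, hand-written mapping from DICOM Specific Character Set (0008,0005)
# terms to Python codec names. Illustrative only, far from complete.
DICOM_TO_PYTHON_CODEC = {
    "": "ascii",                # attribute absent or empty: default repertoire
    "ISO_IR 100": "iso8859_1",  # Latin alphabet No. 1
    "ISO_IR 192": "utf_8",      # Unicode in UTF-8
}


def decode_dicom_text(raw, specific_character_set):
    """Decode raw text bytes using the character set the file declares."""
    codec = DICOM_TO_PYTHON_CODEC[specific_character_set]
    return raw.decode(codec)


# Made-up sample value containing the byte 0xf6.
name = decode_dicom_text(b"S\xf6der^Anna", "ISO_IR 100")
print(name)  # Söder^Anna
```

Once the value is a proper text string, the database layer can encode it as UTF-8 itself.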

Can you check what encoding the database was created with? In the docs, UTF-8 is explicitly used:
sudo -u postgres createdb -T template1 -O openremuser -E 'UTF8' openremdb
I wonder if your database has defaulted to something else?

It would be good to get to the bottom of this, so that I can know if something needs to change before 0.7 is released!

Kind regards

Ed

Ed McDonagh

Jun 9, 2016, 11:40:20 AM
to Josef Lundman, OpenREM
Josef, I've finally looked in a hex editor at the DICOM file you sent me.

In my DICOM file with non-ASCII characters, the letter ö is represented by the hex c3b6, which is the correct UTF-8 encoding.
In your DICOM file, the letter ö is represented by the hex f6, which is the correct ISO/IEC 8859-1 encoding (which has an alias of ISO_IR 100).
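
The two hex values can be reproduced with a couple of lines of Python 3, shown here purely as a cross-check of what the hex editor displays:

```python
o_umlaut = "\u00f6"  # the letter ö

# Same character, two different byte representations.
utf8_hex = o_umlaut.encode("utf-8").hex()
latin1_hex = o_umlaut.encode("iso-8859-1").hex()

print(utf8_hex)    # c3b6
print(latin1_hex)  # f6
```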

The issue, I have established, is that get_value_kw doesn't decode the value if it is a PersonName!

On Fri, Jun 3, 2016 at 9:49 AM, Josef Lundman wrote:
Interesting! I used Pydicom to open the anonymized file, change the characters, and then save it again. Using Pydicom I get the correct characters both in the original file and in the anonymized one.


Ed McDonagh

Jun 9, 2016, 12:15:43 PM
to OpenREM, josef....@gmail.com
Josef - can you try making the changes in commit fe60233263ef, or replace your version of openrem/remapp/tools/get_values.py with the one from here, then import some of the files you have had trouble with please?

Ed

OpenREM

Jun 28, 2016, 8:46:33 AM
to OpenREM
Just an update to this thread to note that the solution was found and is included in release 0.7.1.