German "umlaute" are recoded

Heinrich Schwietering
Sep 16, 2012, 1:28:44 PM
to moz-ho...@googlegroups.com
Hi!

I am trying to use the hOCR edit plugin to correct German texts produced
by tesseract 3.02. The correction works fine, but when saving the
result, the umlauts are no longer in utf8, but show up as "&#246"
instead of "ö", "&#252" instead of "ü", etc. Is there a way to prevent
these conversions, or at least an easy way to recode them back into
"straight" utf8?

best regards

Heinrich Schwietering

Jim Garrison
Sep 20, 2012, 11:57:27 PM
to Heinrich Schwietering, moz-ho...@googlegroups.com
On 09/16/2012 10:28 AM, Heinrich Schwietering wrote:
> Hi!
>
> I am trying to use the hOCR edit plugin to correct german texts produced
> by tesseract 3.02. The correction works fine, but when saving the
> result, the umlauts are no longer in utf8, but show up as "&#246"
> instead of "ö", "&#252" instead of ü etc. Is there a way to prevent
> these conversions, or at least an easy way to recode it back into
> "straight" utf8?
>
> best regards
>
> Heinrich Schwietering
>

Hi there!

The code explicitly serializes using the "US-ASCII" character set (see
http://gitorious.org/moz-hocr-edit/moz-hocr-edit/blobs/master/chrome/content/editor.js#line482
for the precise line of code). In your case, you want to change that to
say "UTF-8" instead.
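
To illustrate why the target character set matters, here is a minimal
Python sketch. It is not taken from moz-hocr-edit, and the helper name
`serialize` is made up for this example. It only assumes that a
serializer restricted to US-ASCII has to write every non-ASCII character
as a numeric character reference, which matches the symptom described
above.

```python
def serialize(text: str, charset: str) -> bytes:
    """Encode text for output (hypothetical helper, not moz-hocr-edit code).

    Characters the charset cannot represent are written as numeric
    character references such as &#246; instead of raising an error.
    """
    return text.encode(charset, errors="xmlcharrefreplace")


sample = "Größe über"

# ASCII has no umlauts, so they are replaced by references.
print(serialize(sample, "ascii"))

# UTF-8 can represent every character, so nothing is replaced.
print(serialize(sample, "utf-8").decode("utf-8"))
```

With "ascii" the umlauts come out as references; with "utf-8" the text
round-trips unchanged.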

Long ago I did not specify a character set explicitly, but due to some
issues with the tagsoup serializer I made the program use US-ASCII in
commit 606d1263, which has proven to work well in all cases that I know
of. Technically, I /think/ it's safe to use UTF-8 generally in any case
where the character set is defined to be utf8 in the XML/HTML file, but
I have not confirmed this. In any case, I don't expect you to have
problems if you use utf8 all around (and if you do, please let me know).

If you are not feeling adventurous enough to modify the code, let me
know and I will make the character set a configurable option (modifiable
at about:config) and package a new version of moz-hocr-edit with this
feature. I will only need to change a few lines of code and can have
this ready relatively quickly.

Otherwise, there is likely some [X]HTML parsing utility that converts
encoded characters to their utf8 equivalents, but I am not aware of any
such program myself.
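
For recoding a file that was already saved this way, a small script may
be enough. The following is only a sketch under stated assumptions: the
function name `decode_non_ascii_refs` is made up, and it assumes the
references in the file are terminated with a semicolon ("&#246;"), as
well-formed [X]HTML requires. It deliberately leaves references to ASCII
characters (such as "&#60;" for "<") and named entities (such as
"&amp;") alone, so the markup itself is not changed.

```python
import re

# Decimal (&#246;) or hexadecimal (&#xF6;) numeric character references.
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def decode_non_ascii_refs(text: str) -> str:
    """Replace numeric references to non-ASCII characters with the characters."""

    def replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Keep ASCII references (they may protect markup such as "<" or "&"),
        # surrogate code points, and values outside the Unicode range as-is.
        if code_point < 128 or 0xD800 <= code_point <= 0xDFFF or code_point > 0x10FFFF:
            return match.group(0)
        return chr(code_point)

    return _NUMERIC_REF.sub(replace, text)


print(decode_non_ascii_refs("Gr&#246;&#223;e &#60;b&#62; &#xFC;ber &amp;"))
```

To apply it to a saved hOCR file, read the file as text, pass the content
through the function, and write the result back with UTF-8 as the output
encoding. The character set declared inside the file should then say
utf-8 as well.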

Cheers,
Jim