Great tool for working with unicode


Rob H.

May 1, 2009, 10:25:30 AM
to tesseract-ocr
I've been training OCR to recognize many characters spread throughout
the Unicode code space.
I found this handy web app invaluable for identifying some of the
"unprintable" Unicode characters.

I can copy/paste the character into the top left text area and hit
convert.
I am mainly interested in the "UTF-16 code units" text area on the
lower right side of the page, since these are the codes I'm using with
Tesseract.
http://rishida.net/scripts/uniview/conversion.php

If I don't recognize the UTF-16 code (which is less frequent now that
I've stared at them so much), then I can click the "View in Uniview"
link, which is above the top left text area. This will pop up another
web page which 99% of the time gives me a printable view of the Unicode
character.
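
For anyone who wants the same "UTF-16 code units" view without the web page, here is a minimal Python sketch. The function name `utf16_code_units` is my own and is not part of the web app or of Tesseract. It only illustrates how a character maps to its code units, including the surrogate pair used for characters above U+FFFF.

```python
def utf16_code_units(text: str) -> list[str]:
    """Return the UTF-16 code units of `text` as 4-digit uppercase hex strings."""
    data = text.encode("utf-16-be")  # big-endian, no byte-order mark
    # Each code unit is two bytes; a character above U+FFFF becomes two units.
    return [data[i:i + 2].hex().upper() for i in range(0, len(data), 2)]


print(utf16_code_units("\u24ba"))      # CIRCLED LATIN CAPITAL LETTER E
print(utf16_code_units("\U0001D11E"))  # a character outside the BMP
```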

Hope it helps!


PS: Does anyone know of a single font which is capable of drawing ALL
unicode characters defined by unicode.org? Currently, I'm using MS
Arial Unicode which does a halfway decent job, but it isn't complete.

74yrs old

May 1, 2009, 1:16:46 PM
to tesser...@googlegroups.com
In what way will it help with Tesseract OCR? If it does, please describe the step-by-step procedure you followed.

Rob H.

May 1, 2009, 8:57:05 PM
to tesseract-ocr
Well Tesseract 2.0 has support for unicode, but many times it can be
hard to understand the results of the OCR because the characters are
not printable in many fonts.

Typically in text editors (including Notepad++, UltraEdit, MS Word,
Notepad, etc.), an unrecognized character will be displayed as a
simple box. This is not readable.
So, to verify your results, especially while training, you need to
check how accurate the results came out.

So, if you are using unprintable characters and don't have a font
which recognizes them correctly, then this webapp will help you know
which character the OCR recognized.... unless you know off the top of
your head what hex value matches what characters you want.
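
A rough offline alternative, assuming Python is available: the standard `unicodedata` module can name each character in a piece of OCR output even when no installed font can draw it. The helper name `describe` is hypothetical. This is a sketch of the idea, not part of Tesseract.

```python
import unicodedata


def describe(text):
    """Return one line per character: its code point and Unicode name."""
    lines = []
    for ch in text:
        # Control characters have no name, so supply a placeholder.
        name = unicodedata.name(ch, "<no name>")
        lines.append(f"U+{ord(ch):04X}  {name}")
    return lines


for line in describe("\u24ba\u24bb"):
    print(line)
```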

Rob H.

May 1, 2009, 9:07:56 PM
to tesseract-ocr
Also, I got this e-mail from someone named Albert:
=========
Hi Rob,

Reply to your "ps"....

That doesn't make any sense to me. You are asking for a set of glyphs
that can represent every Unicode character in existence. Not
only would such a file be *HUGE* in size, but I can't see it as
serving any purpose to anyone (other than you, I guess)...

So you should stop looking for it.


-
Albert
=========

Arial Unicode covers ~50K of the ~140K characters defined at
unicode.org. This font file is 22 MB.
Wouldn't a complete Unicode font be around 70 MB?
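
The back-of-the-envelope arithmetic behind that guess, as a sketch. It assumes file size scales linearly with the number of characters covered, which is a simplification since glyph complexity varies a lot between scripts. The figures are the approximate ones quoted above.

```python
covered = 50_000   # characters covered by the existing font (approximate)
total = 140_000    # characters to cover (approximate)
size_mb = 22       # size of the existing font file in MB

# Linear scaling: same average number of bytes per character.
estimate_mb = size_mb * total / covered
print(round(estimate_mb, 1))
```

Under that assumption the estimate lands in the same ballpark as the 70 MB guess.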

If you need a general text viewer which can legibly show documents
that contain any number of the valid ~140K characters,
then a complete font would be useful.

Great advice Albert...*roll eyes*... "stop looking"... how about
something a little more constructive?
maybe you know a strategy of mixing fonts to enable an application to
view all the possible unicode characters?





74yrs old

May 2, 2009, 7:04:48 AM
to tesser...@googlegroups.com
Hi Rob,
I know about conversion.php, which I have been using for a long time for a Kannada project.
Will you kindly explain your experiment step by step, with a sample if any? I
want to have hands-on experience. BTW, which language were you training?
Regards,
sriranga(76yrs old)

Rob H.

May 2, 2009, 11:51:42 PM
to tesseract-ocr
I'm training Tess to recognize letters/numbers/symbols/etc. used for
geometrical tolerancing and annotations (ASME Standard Y14.5)
A lot of the characters used in the ASME standard come from all
over the Unicode tables (although the characters/words are from the
English language).

This is part of a data validation project and I'm using OCR as part of
the process.
Since OCR is not 100% accurate, some of the validation will need to be
done by hand (hopefully as little as possible).
If the person checking the annotation sees a "little box" (i.e. an
unprintable character) then it will slow down their job.
For the moment, I check unprintable characters using the web app which
I posted above.
Once this goes into production, there will be a font (purchased or
home-brewed) which can correctly draw all the letters/numbers/symbols/etc.



74yrs old

May 3, 2009, 5:35:32 AM
to tesser...@googlegroups.com
Thanks. Very good idea. Will you please upload a sample of the "little box"?

Rob H.

May 4, 2009, 9:16:49 AM
to tesseract-ocr
Copy and paste the following text into the basic notepad application.
It will show up as "little boxes".
There's a good chance that your web browser doesn't have a unicode
enabled font, so most of the following characters will display as
garbage.

The following characters are: circled E, circled F, circled L, circled
L, circled U, circled P, circled S, circled S, circled T, circled U

ⒺⒻⓁⓁⓊⓅⓈⓈⓉⓊ

Or you can copy/paste those into the web app and view them:
http://rishida.net/scripts/uniview/uniview.php?codepoints=24BA 24BB
24C1 24C1 24CA 24C5 24C8 24C8 24C9 24CA
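
The code point list in that URL can also be generated rather than typed by hand. A minimal sketch: the helper name `uniview_codepoints` is made up, and whether the web app still accepts this query format is an assumption based only on the URL above.

```python
def uniview_codepoints(text):
    """Return the code points of `text` as space-separated uppercase hex."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


# Base URL taken from the link above; the query format is assumed from it.
BASE = "http://rishida.net/scripts/uniview/uniview.php?codepoints="

sample = "\u24ba\u24bb\u24c1\u24c1\u24ca\u24c5\u24c8\u24c8\u24c9\u24ca"
print(BASE + uniview_codepoints(sample))
```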

Albert Law

May 4, 2009, 10:34:05 AM
to tesser...@googlegroups.com
Hi Rob,

Oh, I'm sorry you didn't interpret my advice as constructive. I can see it from your point of view where you have a task, and I'm
simply not helping. So here's a verbose version of my original answer.

What you are asking for is somewhat mysterious in purpose. Allow me to explain. Unicode doesn't specify what characters should
look like. Fonts specify how characters are visually represented. Hence, I see no reason why a font should exist that covers all
of the Unicode specification, because such a font would not be generally regarded as useful. This is doubly true when one considers
that fonts are tied to operating systems (or, in the case of Java, operating environments) and/or specific tasks (e.g. fixed-width
font use).

Furthermore, the Unicode specification is an ever-evolving beast. I may be incorrect, but I believe they are currently working on
extending the specification to cover ancient Asian characters which are no longer in any vernacular. Due to this disuse, font
makers (in this case, calligraphers) disagree on the exact visual representations.

Lastly, Unicode is not the only game in town (see GB18030). Your alternative font mapping might get a little messy at this point.

Moreover, you have indicated that you are currently using MS Arial Unicode. It may be wrong, but Unicode.org states that "the Arial
Unicode MS font ... is the most complete" [http://www.unicode.org/help/display_problems.html]. You may augment MS Arial Unicode
with "last resort" [http://www.unicode.org/policies/lastresortfont_eula.html], but I think that links to a Mac-OS-X-only solution.

Of course, what you really need to do is string several fonts together. This probably must be done manually in the code and
usually involves knowledge of the language being supplemented into MS Arial Unicode. Oh, there may be font collisions, so watch out.
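
To make "string several fonts together" concrete, here is a rough Python sketch of an ordered fallback chain. Everything in it is an assumption for illustration: the coverage ranges are invented, the second font name is hypothetical, and real coverage would have to be read from the font files themselves. Collisions are resolved by order, so the first font in the chain that covers a character wins.

```python
# Hypothetical coverage table: (font name, inclusive code point ranges).
# These ranges are illustrative only and do not describe any real font.
FALLBACK_CHAIN = [
    ("Arial Unicode MS", [(0x0020, 0x24FF)]),
    ("Symbol Font (hypothetical)", [(0x2500, 0x2BFF)]),
]
LAST_RESORT = "Last Resort"  # used when no font in the chain covers a character


def pick_font(ch):
    """Return the first font in the chain whose ranges contain `ch`."""
    cp = ord(ch)
    for font, ranges in FALLBACK_CHAIN:
        if any(lo <= cp <= hi for lo, hi in ranges):
            return font
    return LAST_RESORT


def split_runs(text):
    """Group consecutive characters that share a font into (font, text) runs."""
    runs = []
    for ch in text:
        font = pick_font(ch)
        if runs and runs[-1][0] == font:
            runs[-1] = (font, runs[-1][1] + ch)
        else:
            runs.append((font, ch))
    return runs


print(split_runs("A\u24ba\u2600B"))
```

A viewer would then draw each run with its own font, which keeps the number of font switches low.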

You know what? This is a problem already semi-solved (I believe there is no full-solution due to the ill-defined nature of the
problem) by Adobe in Acrobat PDF Reader. Though, the PDF's purpose was originally for printing so they "cheated" and had
file-embedded fonts. You should talk to a PDF expert and see how Adobe did it.

I hope you find this answer less of an eye-roller. Unfortunately, my suggestion remains "stop looking".



-
Albert

74yrs old

May 4, 2009, 1:24:04 PM
to tesser...@googlegroups.com
ⒺⒻⓁⓁⓊⓅⓈⓈⓉⓊ
U+24BA U+24BB U+24C1 U+24C1 U+24CA U+24C5 U+24C8 U+24C8 U+24C9 U+24CA U+000A

E F L L U P S S T U   http://rishida.net/scripts/uniview/conversion.php
000A     [control]
  0020     SPACE
  0045  E  LATIN CAPITAL LETTER E
  0020     SPACE
  0046  F  LATIN CAPITAL LETTER F
  0020     SPACE
  004C  L  LATIN CAPITAL LETTER L
  0020     SPACE
  004C  L  LATIN CAPITAL LETTER L
  0020     SPACE
  0055  U  LATIN CAPITAL LETTER U
  0020     SPACE
  0050  P  LATIN CAPITAL LETTER P
  0020     SPACE
  0053  S  LATIN CAPITAL LETTER S
  0020     SPACE
  0053  S  LATIN CAPITAL LETTER S
  0020     SPACE
  0054  T  LATIN CAPITAL LETTER T
  0020     SPACE
  0055  U  LATIN CAPITAL LETTER U
  000A     [control]
  000A     [control]

Rob H.

May 4, 2009, 2:55:24 PM
to tesseract-ocr
Thanks for the reply Albert. I think I'll stop looking ... for a
silver bullet and create a strategy which covers my set of glyphs.
(maybe the pdf solution will work).

I thought Unicode did specify what a character looks like (on a basic
level), and then fonts were responsible for their interpretation
(which can be completely off).
For example, "WingDings" is vastly different from what Unicode shows
in their PDF renderings. I assumed that the character drawn in those
unicode files were a basic rendition of what the character should look
like.

Do you have any experience creating fonts? I might create one... it
doesn't have to be pretty... just needs to help the user accomplish
their task of comparing text extracted from the UI vs text extracted
from the model.

Albert Law

May 4, 2009, 3:29:04 PM
to tesser...@googlegroups.com
Hi Rob,

I have no experience in creating a font set (font mapping? what would be the correct term?). However, I have to warn you that the
task you propose is gigantic.

Here are some people who take this task to heart. Check out:
1) DynaFonts: http://www.dynalab.com/Products/tabid/608/language/en-US/Default.aspx
2) GNU Free Font: http://www.gnu.org/software/freefont/index.html
3) Search Free Fonts: http://www.searchfreefonts.com/

Good luck.


-
Albert



74yrs old

Jun 11, 2009, 1:29:25 PM
to tesser...@googlegroups.com
Rob,
Able to view in this email itself. As per your problem:
"Typically in text editors (including Notepad++, UltraEdit, MS Word,
Notepad, etc.), an unrecognized character will be displayed as a
simple box. This is not readable.
So, to verify your results, especially while training, you need to
check how accurate the results came out."
Have you succeeded in solving the problem?
regards