Missing detailed documentation about Unicharset files


Albrecht Hilker

Jul 4, 2014, 12:40:51 AM
to tesser...@googlegroups.com
Hello

Generally it is very sad that there is no detailed documentation about Tesseract.

The only documentation about Unicharset file that I could find is this:
https://tesseract-ocr.googlecode.com/svn-history/r683/trunk/doc/unicharset.5.html

But this is completely insufficient and not understandable.

And unicharset_extractor.exe produces wrong and incomplete files.
So I have to edit them by hand.
But how ?

I need a detailed explanation of how to enter the values for the several min/max parameters.

The sparse documentation says that 128 is the x-height.
Does anybody think that with this information one is able to edit a Unicharset file ???

How do I enter the width of a character ?
How do I enter the minimum bottom and the maximum bottom value ?

And the example given on that page does not make any sense:

1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9

So this example says that
the character "1" has a min_bottom value of 59 and
the character "9" has a min_bottom value of 18.

Weird ? ? ?
Both numbers are aligned to the baseline!

Wouldn't it be more intelligent to define the min_bottom for "9" with a higher value to distinguish it from a lowercase "g" ??

And what about the other values ?
bearing, advance ?
Where do I get them from ?

The weirdest thing is that the training data may contain 32 fonts but there is only one Unicharset file!
If there was one Unicharset file per font I would understand.

But in a monospaced font the advance is equal for an "i" and a "W", while in Arial they are very different.
How do I create a Unicharset file that must fit for such different fonts ?

I need a detailed explanation with images (not only text!!) of how to obtain these values.






zdenko podobny

Jul 4, 2014, 3:25:04 AM
to tesser...@googlegroups.com
Can you please explain why you think that "unicharset_extractor.exe produces wrong and incomplete files"?

Zdenko



Albrecht Hilker

Jul 4, 2014, 1:02:43 PM
to tesser...@googlegroups.com

> Can you please explain why you think that "unicharset_extractor.exe produces wrong and incomplete files"?

Because this is definitely wrong:

90
NULL 0 NULL 0
A 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # A [41 ]A
B 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # B [42 ]A
C 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # C [43 ]A
D 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # D [44 ]A
E 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # E [45 ]A
F 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # F [46 ]A
G 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # G [47 ]A
H 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # H [48 ]A
I 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # I [49 ]A
J 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # J [4a ]A
K 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # K [4b ]A
L 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # L [4c ]A
M 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # M [4d ]A
N 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # N [4e ]A
O 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # O [4f ]A
P 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # P [50 ]A
Q 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # Q [51 ]A
R 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # R [52 ]A
S 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # S [53 ]A
T 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # T [54 ]A
U 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # U [55 ]A
V 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # V [56 ]A
W 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # W [57 ]A
X 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # X [58 ]A
Y 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # Y [59 ]A
Z 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0  # Z [5a ]A
a 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 1 0 0   # a [61 ]a
b 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 2 0 0      # b [62 ]a
c 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 3 0 0      # c [63 ]a
d 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 4 0 0      # d [64 ]a



1.)
The column "other_case" should contain the ID of the other-case letter.
For the lowercase letters they point correctly to the uppercase letters.
But the uppercase letters all have a value of -1, which is wrong.
Here should be the corresponding ID of the lowercase letter.

2.)
The script name is always NULL.
It should be Latin or Common.

3.)
All the min / max values are completely missing.
They are 0, 255 or 32767.
10 missing columns!

4.)
The last column "normed_form" is missing.
With the '#' a comment is starting.
But when reading this unicharset the '#' is misinterpreted as the "normed_form".
Here should be mostly the same letter as in the first column.



Here you see a unicharset extracted from a traineddata file with all columns filled correctly:

A 5 52,68,216,255,100,216,0,17,98,231 Latin 2 0 15 A    # A [41 ]A
B 5 62,68,216,255,91,227,0,27,106,227 Latin 23 0 102 B  # B [42 ]A

etc..

a 3 58,65,186,200,85,164,0,26,97,185 Latin 15 0 2 a     # a [61 ]a
b 3 58,64,216,255,87,180,0,25,100,200 Latin 102 0 23 b  # b [62 ]a
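
For reference, here is how I read such a line - a quick sketch I wrote, with the field order taken from the unicharset.5 page, so treat it as illustrative rather than authoritative:

    # Sketch: parse one line of a 3.0x unicharset.
    # Field order per unicharset.5: char, properties, the 10 metrics,
    # script, other_case, direction, mirror, normed_form.
    METRICS = ("min_bottom", "max_bottom", "min_top", "max_top",
               "min_width", "max_width", "min_bearing", "max_bearing",
               "min_advance", "max_advance")

    def parse_line(line):
        # Naive comment stripping - this would break on the line for
        # the '#' character itself, which is exactly the ambiguity
        # behind bug 4.) above.
        fields = line.split('#')[0].split()
        char = fields[0]
        # properties is a bitmask (written as hex in the file, if I
        # read the code right): 1=alpha, 2=lower, 4=upper, 8=digit,
        # 16=punctuation, so 'A' -> 5, 'a' -> 3, '9' -> 8.
        props = int(fields[1], 16)
        metrics = dict(zip(METRICS, (int(v) for v in fields[2].split(','))))
        script, other_case, direction, mirror, normed_form = fields[3:8]
        return char, props, metrics, script, int(other_case), normed_form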


Result:
The unicharset_extractor tool is very buggy.
I have to edit everything by hand.


So my question remains:

Where do I find detailed documentation of the Unicharset file ???

zdenko podobny

Jul 4, 2014, 2:47:13 PM
to tesser...@googlegroups.com
First of all - the source code is documentation...

Next - it is just your expectation that it is wrong ;-)
Did manually changing the values bring any improvement to OCR? I would not be surprised if those values are not used by the current version of Tesseract.


Zdenko



Albrecht Hilker

Jul 5, 2014, 6:34:05 PM
to tesser...@googlegroups.com
Hello zdenop

It is clear that you are not the right person to answer this question.
If YOU had ever looked into the source code you would have seen that these values ARE in use (in version 3.03).

A simple search for "unicharset.get_top_bottom" shows that they are used, for example:
-- to detect superscript and subscript in superscript.cpp
-- to calculate thresholds in ratngs.cpp
-- to detect bad blobs in fixxht.cpp

Is there nobody who can answer my question ??????????
This is really sad !

Nick White

Jul 9, 2014, 1:51:58 PM
to tesser...@googlegroups.com
Hi Albrecht,

On Thu, Jul 03, 2014 at 09:40:51PM -0700, Albrecht Hilker wrote:
> Generally it is very sad that there is no detailed documentation about
> Tesseract.

I agree. I do work on the documentation, but there is an awful lot
missing. I appreciate you taking the time to ask questions here so
we can help improve it.

> The only documentation about Unicharset file that I could find is this:
> https://tesseract-ocr.googlecode.com/svn-history/r683/trunk/doc/
> unicharset.5.html
>
> But this is completely insufficient and not understandable.

Yes, that's all there is, plus a very basic overview of the older
format in the TrainingTesseract3 wiki page, IIRC.

> And unicharset_extractor.exe produces wrong and incomplete files.

They are not really wrong, though they are not as complete as would
be ideal.

> So I have to edit them by hand.
> But how ?

The new training program set_unicharset_properties helps by setting
some more of the properties automatically. You can see how I'm using
it in my grc Makefile if you're interested[0].

However it doesn't set the dimensions of characters, as you've
noticed. I started looking into this a little while ago, but ran out
of time to go further (and you've clearly got further than I did
already - good job!)

We should figure out exactly what's required for each value
together, and then I will very happily document it properly.

I don't have time to look into your specific questions now, sorry,
but between us we should be able to figure it out in short order.

Thanks a lot for bringing this up; as I said, it has been bothering
me, but I hadn't found the time to do anything much about it.

More soon!

Nick

0. git clone http://ancientgreekocr.org/grc.git

Nick White

Jul 10, 2014, 11:25:12 AM
to tesser...@googlegroups.com
I'm just going to go through your numbered points here.

On Fri, Jul 04, 2014 at 10:02:43AM -0700, Albrecht Hilker wrote:
> 1.)
> The column "other_case" should contain the ID of the other-case letter.
> For the lowercase letters they point correctly to the uppercase letters.
> But the uppercase letters all have a value of -1, which is wrong.
> Here should be the corresponding ID of the lowercase letter.

The set_unicharset_properties tool sets this correctly.

> 2.)
> The script name is always NULL.
> It should be Latin or Common.

The set_unicharset_properties tool sets this correctly.

> 3.)
> All the min / max values are completely missing.
> They are 0, 255 or 32767.
> 10 missing columns!

Yes. They are missing, and as you rightly point out, that sucks.

> 4.)
> The last column "normed_form" is missing.
> The '#' starts a comment.
> But when reading this unicharset back, the '#' is misinterpreted as the
> "normed_form".
> It should mostly contain the same letter as in the first column.

Good spot that the unicharset_extractor's '#' is misinterpreted as
the normed_form. That is definitely a bug. The
set_unicharset_properties tool does set this correctly, though.

As far as I'm aware there's no good reason for unicharset_extractor
to be separate from set_unicharset_properties, though I haven't
looked at the code of either in depth yet.

> Here you see a unicharset extracted from a traineddata file with all columns
> filled correctly:

You can also see a bunch of unicharset files in training/langdata;
at the moment it seems like they're generated by
unicharset_extractor, run through set_unicharset_properties, and
then the metrics are set somehow, maybe by some tool, maybe by hand.

I'll ask on the dev list in a moment if there's such a tool, and if
it can be released (some of the training tools like this were
originally written for internal use by Google and do funky things
like depend on map-reduce, so have to be rewritten for us plebs ;))

Nick

Nick White

Jul 10, 2014, 1:14:14 PM
to tesser...@googlegroups.com
On Sat, Jul 05, 2014 at 03:34:05PM -0700, Albrecht Hilker wrote:
> Hello zdenop
>
> It is clear that you are not the right person to answer this question.
> If YOU had ever looked into the source code you would have seen that these
> values ARE in use (in version 3.03).

You're being pretty unfair on Zdenko here. He just made a guess
about whether the values are used: "I would not be surprised if
those values are not used by the current version of Tesseract" -
it turns out that guess was wrong (and thank you for doing the
investigation).

We answer quite a bit of email on this list, and don't always have
time to look in depth for each question we aren't sure about,
instead sometimes saying "check x, I suspect y," which is completely
OK.

Nick White

Jul 10, 2014, 1:27:33 PM
to tesser...@googlegroups.com
I have more thoughts on the unicharset metrics discussion.

> So this example says that
> the character "1" has a min_bottom value of 59 and
> the character "9" has a min_bottom value of 18.
>
> Weird ? ? ?
> Both numbers are aligned to the baseline!

I am guessing now (I'll take a look at the code later), but I
presume "baseline-normalized" isn't supposed to mean baseline = 0.

> Wouldn't it be more intelligent to define the min_bottom for "9"
> with a higher value to distinguish it from a lowercase "g" ??

Comparing the lines for 9 and g is useful:
9 8 0,66,200,255,89,156,0,39,104,173 Common 64 2 64 9 # 9 [39 ]0
g 3 0,43,188,212,88,176,0,32,100,210 Latin 93 0 54 g # g [67 ]a

So the min_bottom for both is 0, that's true. But don't forget that
in some fonts 9 does dip significantly below the baseline. And the
max_bottom is quite different, and probably more useful for the
differentiation here. It says the bottom of g hardly ever rises
above 43, whereas the bottom of 9 can quite happily rise up to 66
(which looks like it roughly corresponds to the baseline, given how
many other characters are about there). From that we can guess that
128 is the x-height, and 64 is roughly the baseline.
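
To make that concrete, my working assumption (not yet checked against the normalization code, so treat it as a sketch) is that a glyph y-coordinate in font space maps onto this scale as

    norm_y = 64 + (y - baseline) * 128 / x_height

For a hypothetical font with an x-height of 500 units, an old-style 9 whose bottom sits 180 units below the baseline would then get norm_bottom = 64 - 180 * 128 / 500, which comes to about 18 - just the sort of min_bottom value quoted in the first example above.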

More anon.

Nick

Paul

Jul 10, 2014, 5:07:32 PM
to tesser...@googlegroups.com
Maybe the numbers you are complaining about come from the possible use of "old style numerals", as in the font Georgia (see old-style-numerals.png). But this is only a guess.
Attachment: old-style-numerals.png

Nick White

Jul 10, 2014, 11:14:18 PM
to tesser...@googlegroups.com
OK, so I whipped up a program that uses Pango to get character
metrics information for a given font, of the sort that is useful for
Tesseract's unicharset file.

It takes a file with UTF-8 characters separated by newlines, and a
font description (in the same format as you provide to text2image;
pango's "font description" format). It outputs the character,
followed by the bottom, top, width, bearing, and advance values,
roughly calibrated to the co-ordinate system Tesseract uses.

This could be the basis for a tool that takes all the different
fonts used and gets the minimums and maximums for each value, but
first we should compare it to the sorts of values in the official
unicharset files to look for discrepancies.
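
To sketch what I mean (hypothetical glue code around charmetrics, assuming its output format of "char bottom top width bearing advance", one character per line):

    # Run charmetrics over several fonts and keep, per character, the
    # min and max of each value - roughly the min_*/max_* pairs that a
    # unicharset line wants.
    import subprocess

    fonts = ["Linux Libertine", "DejaVu Sans"]  # the fonts being trained
    ranges = {}  # char -> list of [min, max] per metric
    for font in fonts:
        out = subprocess.check_output(
            ["./charmetrics", "eng.unicharset.chars", font])
        for line in out.decode("utf-8").splitlines():
            char, *vals = line.split()
            vals = [int(v) for v in vals]
            r = ranges.setdefault(char, [[v, v] for v in vals])
            for pair, v in zip(r, vals):
                pair[0] = min(pair[0], v)
                pair[1] = max(pair[1], v)

    for char, r in sorted(ranges.items()):
        print(char, " ".join("%d,%d" % (lo, hi) for lo, hi in r))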

It is very very provisional; the output seems to be sensible from
light testing, but it's intended more as a base for further testing
and questioning than as a finished tool. Oh, and there will be bugs,
and you can probably crash it. Also it gives you no indication of
whether the asked-for font was loaded... Again, it's a proof-of-
concept; something to work with.

Attached is the code, plus the chars file for eng to play around
with.

Example runs:
./charmetrics eng.unicharset.chars 'Linux Libertine' | head -n 3
I 63 192 70 3 73
' 173 192 22 14 36
v 61 137 126 1 127

./charmetrics eng.unicharset.chars 'DejaVu Sans' | head -n 3
I 64 205 25 25 50
' 182 205 21 25 46
v 64 158 136 8 144
Attachments: charmetrics.c, eng.unicharset.chars

Albrecht Hilker

Jul 14, 2014, 12:38:27 PM
to tesser...@googlegroups.com
Hello Nick

After some days I came back here and was very surprised by all your posts.
Thanks for answering and taking the time.

I found another bug in the tool.
(As I received no answer here, I already posted it to the Issues:
http://code.google.com/p/tesseract-ocr/issues/detail?id=1251 )
_____________________________________________________

Apart from the 4 bugs I described in the forum there is another one:

While the downloaded traineddata files distinguish between punctuation and non-punctuation unichars like:

Punctuation: !"#%&'()*,-./:;?@[\]_{}
Others     : $+<=>|~º®«

the unicharset_extractor tool returns ALL non-alphanumeric characters as punctuation unichars.
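
(If I read the properties bitmask right - 1=alpha, 2=lower, 4=upper, 8=digit, 16=punctuation - this means symbols like '$' get the punctuation bit set when they should carry no flag at all.)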

_____________________________________________________

I think all the problems that I described can easily be fixed except the min/max values.

And I still don't understand the basic question:
How can we ever write ONE Unicharset file with font metrics for a whole bunch of completely different and contradicting fonts ?
If there was one unicharset file per font, it would be easier.
But ONE Unicharset file with min/max values for 358 fonts seems completely insane to me!
Did you know that the English and the Spanish traineddata for 3.02 were trained with 358 fonts ?
https://groups.google.com/forum/?fromgroups#!topic/tesseract-ocr/boQ188SeFsY

There are fonts that put the "9" below the baseline and others that do not.
How do we ever write a Unicharset for such different fonts ?
It simply doesn't make sense to me.
_______________________________________________

Why does Tesseract need these min/max values at all ?
Wouldn't it be much more intelligent to store this information directly in the feature data ?
So each character brings the information about its baseline, height, etc., along with the training data ?
These values could be easier to auto-generate.
_______________________________________________

And the other thing that I absolutely don't understand:
You are investigating about this topic now.
But where are the people who know ?
Is this only Ray ?

Google is one of the richest companies on earth.
Are they not able to pay one of the people who know to write documentation (at least part time) ?
One of the people who work on the code would need, let's say, a month to write good documentation for Tesseract, which is currently completely abandoned.


Albrecht Hilker

Jul 14, 2014, 4:10:07 PM
to tesser...@googlegroups.com
And there is another thing about Unicharset files that I don't understand:

When I download the traineddata files and extract the unicharset file from them I notice that some are extremely different from the ones on SVN in the folder training/langdata.

For example:
Bengali, Hebrew, Greek, Kannada, Malayalam, Tamil, Telugu, Thai.

These files differ significantly.
So for example Greek has a size of 9 kB in the traineddata file tesseract-ocr-3.02.ell.tar.gz and defines 151 characters.
But Greek.unicharset in the folder training/langdata has a size of 216 kB and defines 2820 unichars.
See attached file.

Isn't that weird ?
The greek alphabet does not have much more characters than the latin alphabet!
Where do they come from ?

This is another example that shows how important documentation is.
The poor users of Tesseract are left alone in the dark and there is nobody who turns on the light!


Attachment: Greek.zip

Nick White

Jul 15, 2014, 10:54:41 AM
to tesser...@googlegroups.com
Hi Albrecht,

On Mon, Jul 14, 2014 at 01:10:07PM -0700, Albrecht Hilker wrote:
> When I download the traineddata files and extract the unicharset file from them
> I notice that some are extremely different from the ones on SVN in the folder
> training/langdata.
>
> For example:
> Bengali, Hebrew, Greek, Kannada, Malayalam, Tamil, Telugu, Thai.
>
> These files differ significantly.
> So for example Greek has a size of 9 kB in the traineddata file
> tesseract-ocr-3.02.ell.tar.gz and defines 151 characters.
> But Greek.unicharset in the folder training/langdata has a size of 216 kB and
> defines 2820 unichars.

I am guessing, but it looks likely that Ray/Google has some internal
tools that replace any line in the extracted .unicharset with a
line from the "pregenerated" one in training/langdata. Ray said in
an email to the dev list some months back that he was planning to
update the training files a lot soon, so it will be interesting to
see what lands there.

> The Greek alphabet does not have many more characters than the Latin alphabet!
> Where do they come from ?

Well, if you include all the different combinations of diacritics
used in polytonic Greek there are a lot more characters - the first
350ish characters look like they're taken straight from the relevant
parts of the Unicode standard.

If you look slightly further down that file, you see loads of special
symbols, including some Hebrew. If you grep around, you'll see that
they're similar for quite a few of the unicharset files. I would
again venture a guess that they're just copied in case the training
decides to include more special characters in the future. But we'll
have to see the scripts making use of these files to be sure.

> This is another example that shows how important documentation is.
> The poor users of Tesseract are left alone in the dark and there is nobody who
> turns on the light!

Lots of cool stuff regarding training has landed from Ray's new
work, but not everything has, so things are particularly difficult
at the moment. Once more stuff makes it into the repository things
should get better.

I'll reply to your other email soon.

Nick

Nick White

Jul 15, 2014, 11:22:34 AM
to tesser...@googlegroups.com
Hi again,

On Mon, Jul 14, 2014 at 09:38:26AM -0700, Albrecht Hilker wrote:
> After some days I came back here and was very surprised by all your posts.
> Thanks for answering and taking the time.

As you may have noticed, there aren't too many people around here
who are comfortable looking into why things are the way they are -
I'm very happy to read and learn and take time to answer when people
have done so!

> I think all the problems that I described can easily be fixed except the min/
> max values.
>
> And I still don't understand the basic question:
> How can we ever write ONE Unicharset file with font metrics for a whole bunch
> of completely different and contradicting fonts ?
> If there was one unicharset file per font, it would be easier.
> But ONE Unicharset file with min/max values for 358 fonts seems completely
> insane to me!
> Did you know that the English and the Spanish traineddata for 3.02 were trained
> with 358 fonts ?
> https://groups.google.com/forum/?fromgroups#!topic/tesseract-ocr/boQ188SeFsY
>
> There are fonts that put the "9" below the baseline and others that do not.
> How do we ever write a Unicharset for such different fonts ?
> It simply doesn't make sense to me.

From browsing the code, it looks like the metrics are basically used
to score a few things, to determine whether a letter seems to be
subscript or superscript, to help determine x-height, and for table
detection.

One unicharset file for all fonts is indeed slightly problematic,
but presumably in general the sorts of shapes and sizes are common
enough for each character that it's still useful. Frankly part of
the reason it's done this way is probably historical, from back
before Tesseract was generally trained with many fonts.

> Why does Tesseract need these min/max values at all ?
> Wouldn't it be much more intelligent to store this information directly in the
> feature data ?
> So each character brings the information about its baseline, height, etc., along
> with the training data ?
> These values could be easier to auto-generate.

Sounds sensible to me.

> And the other thing that I absolutely don't understand:
> You are investigating about this topic now.
> But where are the people who know ?
> Is this only Ray ?

Yes, Ray is basically responsible for everything Tesseract. Other
people are brought in to do various things, but he is the one
continuous developer, to my knowledge.

Zdenko does regular fix-ups and improvements, but the bulk of the
work is done by Ray. And he works by making improvements in a
private repository, and periodically merging it back to the SVN
repository. It is not ideal, and certainly a community of interested
people openly bouncing ideas off one another would be nice, but that
doesn't happen a lot at the moment. It does a bit on the -dev list.

> Google is one of the richest companies on earth.
> Are they not able to pay one of the people who know to write documentation
> (at least part time) ?

Well Google has the advantage of having Ray, who can just explain
things to anyone there who wants to understand some part of
Tesseract. It would be nice for them to fund it more, but they don't
really *need* to. Google aren't the only profitable company using
Tesseract in their products, though. It would be nice if another
company sponsored someone to improve the documentation, or just gave
their employees enough free time to contribute back once they'd
figured something non-obvious out. To an extent that's what I do,
but it's all rather ad-hoc.

> One of the people who work on the code would need, let's say, a month to write
> good documentation for Tesseract, which is currently completely abandoned.

Well, I work on the Tesseract documentation, so I'd like to think of
it as not "completely abandoned" ;) I've been focused on more
end-user things, partly because they cover the sorts of questions I
see a lot on this mailing list, and partly because most people don't
want to think about the code at all.

You'd clearly like more details on how the code works, and how each
part of the training data is used to generate results. I'd like to
do more of this, not least because it would improve my understanding
of the codebase, but ultimately I have limited time and haven't got
around to it yet. Are there particular things you'd like
documented, that I could start on?

Nick

Nick White

Jul 15, 2014, 12:58:26 PM
to tesser...@googlegroups.com
Sorry for the noise. I've looked into this more, and discovered more
:)

On Tue, Jul 15, 2014 at 10:54:06AM -0400, Nick White wrote:
> On Mon, Jul 14, 2014 at 01:10:07PM -0700, Albrecht Hilker wrote:
> > When I download the traineddata files and extract the unicharset file from them
> > I notice that some are extremely different from the ones on SVN in the folder
> > training/langdata.
> >
> > For example:
> > Bengali, Hebrew, Greek, Kannada, Malayalam, Tamil, Telugu, Thai.
> >
> > These files differ significantly.
> > So for example Greek has a size of 9 kB in the traineddata file
> > tesseract-ocr-3.02.ell.tar.gz and defines 151 characters.
> > But Greek.unicharset in the folder training/langdata has a size of 216 kB and
> > defines 2820 unichars.
>
> I am guessing, but it looks likely that Ray/Google has some internal
> tools that replace any line in the extracted .unicharset with
> a line from the "pregenerated" one in training/langdata.

This tool actually already exists, and is set_unicharset_properties
in training/

I had been using it, but not paying attention to the --script_dir
argument. That gives a directory to look for .unicharset files in,
and adds any metrics found there to the unicharset file it writes.

Good news, eh?
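
So the incantation I should have been running is roughly this (from memory - double-check the flag names against the tool's usage output):

    set_unicharset_properties -U unicharset -O grc.unicharset \
        --script_dir=/path/to/langdata

where "unicharset" is what unicharset_extractor wrote, and the metrics get pulled from the matching .unicharset files under langdata.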

I need to write some manpages for the tools in training/ soon. For
my own sake, if no one else's ;)

Nick

Albrecht Hilker

Jul 17, 2014, 11:00:48 PM
to tesser...@googlegroups.com
Hello Nick

It is great that you are motivated to write documentation and that you answer the questions in the forum.

Nevertheless I read a post from Ray where he says that he receives millions of emails and the last thing he likes to do is write long texts (email responses or documentation). I think this is a fatal situation, because if he is the only one who really knows the code, he is predestined to write that documentation. But I understood that he is not motivated to do that. He is testing new classifiers rather than caring about what is already done.

If he doesn't like writing documentation, I think he should verbally explain what he knows to someone else, who then writes the documentation. But I doubt that this will ever happen. And if he retires one day it will be too late.

_________________________________

I studied the code of the set_unicharset_properties tool.
But this is a very basic tool. It only sets the basic properties.
The min/max values don't get touched and I'm sure that there must exist a tool (that is not published) that obtains them, because the han.unicharset has 23514 characters defined - all with min / max values set. Or do you think that someone has edited 23514 characters manually ?

Ok we are stuck at the same point.
Ray knows, but Ray is unavailable.
It is really a sad situation.
It is not the way open source projects should work.
_________________________________


> Are there particular things you'd like
> documentated, that I could start on?

I would like to generate unicharset files automatically, but I don't know how to calculate the min/max values.

So we have one person who is motivated (Nick) but does not know,
and another person who knows (Ray) but is not motivated to write documentation.
________________________

Indeed the documentation is totally incomplete.
If you look, for example, at the documentation of the MySQL server (which is excellent), you immediately see that Tesseract is at the other extreme end - light years away from that.

If you want an idea of where to start: I think a good starting point would be to explain what all these training files are good for and what they do exactly.
What is INTTEMP, what values does it contain exactly, how is it generated in the training process and how is it used in recognition ?
What is PFFMTABLE good for, NORMPROTO etc.

And then the DAWG files.
I still did not understand in which step of the recognition the Number DAWG is used. (Did you see the weird things it contains?)
And what is the PUNC DAWG good for, how is it used exactly ? How should I generate the values in it ?
What is the difference between a flat shape table and a clustered shapetable ?

There are millions of questions !


Nick White

Aug 6, 2014, 10:54:06 AM
to tesser...@googlegroups.com
Hi Albrecht,

Sorry for not replying sooner, I've been away.

> Nevertheless I read a post from Ray where he says that he receives
> millions of
> emails and the last thing he likes to do is write long texts (email responses
> or documentation). I think this is a fatal situation, because if he is the
> only one who really knows the code, he is predestined to write that
> documentation. But I understood that he is not motivated to do that. He is
> testing new classifiers rather than caring about what is already done.

Ah, but others can work to figure out how the code and tools work,
and slowly but surely piece together documentation. Also, Ray is
good at explaining when he has the time. I agree it isn't an ideal
situation, but I think we can fix it.


> I studied the code of the set_unicharset_properties tool.
> But this is a very basic tool. It only sets the basic properties.
> The min/max values don't get touched

This is wrong, actually. The unicharset.SetPropertiesFromOther()
function called in set_unicharset_properties copies all properties
from any copy of the character found in the script_dir. As I
mentioned in my previous message to this thread, set the script_dir
to the training/langdata directory and the data from all the
.unicharset files there will be pulled in as appropriate.

> I'm sure that there must exist a tool
> (that is not published) that obtains them, because the han.unicharset has 23514
> characters defined - all with min / max values set. Or do you think that
> someone has edited 23514 characters manually ?

Ultimately, yes, there must be an unpublished tool that obtains the
metrics that exist in the training/langdata directory. I suspect it
looks quite like the pango based proof of concept I attached to a
previous mail on this thread (charmetrics.c).

> It is not the way open source projects should work.

So, you pick yourself up and jump in! That's how open source
projects should work. Patches are welcomed :)

> > Are there particular things you'd like
> > documented, that I could start on?
>
> I would like to generate unicharset files automatically, but I don't know how
> to calculate the min/max values.

As I say, you can get good general figures by using the --script_dir
option with set_unicharset_properties. I think we're clear now on
the general definitions of all the fields.

To calculate the min/max values for specific fonts where they may be
very different, I recommend you try the charmetrics.c tool I posted,
and compare the output to what you get without it.

> If you want an idea of where to start: I think a good starting point would be
> to explain what all these training files are good for and what they do exactly.
> What is INTTEMP, what values does it contain exactly, how is it generated in
> the training process and how is it used in recognition ?
> What is PFFMTABLE good for, NORMPROTO etc.
>
> And then the DAWG files.
> I still did not understand in which step of the recognition the Number DAWG is
> used. (Did you see the weird things it contains?)
> And what is the PUNC DAWG good for, how is it used exactly ? How should I
> generate the values in it ?
> What is the difference between a flat shape table and a clustered shapetable ?

These are all good points, and good places to start, thank you.

My current plan for documentation is as follows:

- Rewrite and simplify TrainingTesseract3 on the wiki
- Write manpages for each tool in training/
- Document how each training file is used, and what it contains

Does that sound good to people? I'll take silence from the list to
mean "that sounds perfect in every way, you wonderful man." ;)

Nick

Shree Devi Kumar

Aug 6, 2014, 11:21:11 AM
to tesser...@googlegroups.com
> My current plan for documentation is as follows:
>
> - Rewrite and simplify TrainingTesseract3 on the wiki
> - Write manpages for each tool in training/
> - Document how each training file is used, and what it contains
>
> Does that sound good to people? I'll take silence from the list to
> mean "that sounds perfect in every way, you wonderful man." ;)

Thanks, Nick. That's great. You should probably have separate sections for training 3, 3.02, 3.03, 3.03.03, etc., since the method has changed quite a bit.

BTW, do you know if the new training tools can be compiled on Windows, or do I need to get access to Linux somewhere to give them a try?



Shree Devi Kumar
____________________________________________________________
Bhajan - Kirtan - Aarti @ http://bhajans.ramparivar.com




Nick White

Aug 6, 2014, 11:41:22 AM
to tesser...@googlegroups.com
On Wed, Aug 06, 2014 at 08:50:27PM +0530, Shree Devi Kumar wrote:
> My current plan for documentation is as follows:
>
> - Rewrite and simplify TrainingTesseract3 on the wiki
> - Write manpages for each tool in training/
> - Document how each training file is used, and what it contains
>
> Does that sound good to people? I'll take silence from the list to
> mean "that sounds perfect in every way, you wonderful man." ;)
>
>
>
> Thanks, Nick. That's great. You should probably have separate sections for
> training 3, 3.02, 3.03, 3.03.03, etc., since the method has changed quite
> a bit.
>
>
> BTW, do you know if the new training tools can be compiled on Windows, or do
> I need to get access to Linux somewhere to give them a try?

I don't think there's anything Linux-dependent about the new
training tools. They need pango, but that's available for Windows.

So it should be possible to compile them for Windows, but I don't
think anybody has done that yet. Somebody who knows their way around
Visual Studio could definitely help out by updating / adding things
to the vs2008 directory as appropriate (I think that's the right
place, but I know very little about building software on Windows).

Nick

zdenko podobny

Aug 6, 2014, 1:21:47 PM
to tesser...@googlegroups.com
Building the training tools on Windows is not a priority. But it should be possible to compile most of the tools with cygwin or msys&mingw.



Zdenko


Mark Ravina

Jun 17, 2017, 2:03:37 PM
to tesseract-ocr

I am also getting stuck on unicharset generation. I have appropriate box and tiff files, and I seem to be able to generate good training data, but when I run

> unicharset_extractor jpn.meiryo.exp0.box

I get a unicharset file full of zeros. Why is this?


Attachments: unicharset, jpn.meiryo.exp0.box, jpn.meiryo.exp0.tr