Training tesseract for new language - some problems


Tenzin Dendup

Sep 25, 2009, 10:07:37 AM
to tesser...@googlegroups.com
Hi once again,

I am trying to train tesseract for the Dzongkha language
(http://en.wikipedia.org/wiki/Dzongkha_language). I am using
tesseract-2.03 on Debian Squeeze. I have created a gray-scale training
image file. When I try to create the box file using "tesseract
dzo1.tif dzo1 batch.nochop makebox", I get output like this:

Using substitute bounding box at (562,2547)->(1376,2629)
Using substitute bounding box at (988,2372)->(1830,2440)
Using substitute bounding box at (560,1924)->(1374,2006)
Using substitute bounding box at (222,1841)->(2135,1916)
Using substitute bounding box at (220,1216)->(2132,1291)
Using substitute bounding box at (291,675)->(1370,757)

Afterwards, while editing the box file with tesseractTrainer.py, I
noticed that some 20 clearly separated blocks are grouped into one box
(please see the attached file 1.png).

Also, some of the image blocks (characters) do not fit exactly inside
their bounding boxes (attached file 2.png).

It would be very helpful if I could get some solutions to this. If
this has been discussed earlier on the list, I would be grateful for
a link to it.

Regards
Tenzin

1.png
2.png

Svetlin Nakov

Sep 25, 2009, 10:34:41 AM
to tesser...@googlegroups.com
I use my own script for generating training images and box files, but it
is for the Cyrillic alphabet only.

Regards,
Svetlin Nakov

neskie

Sep 26, 2009, 6:51:58 PM
to tesseract-ocr
Hi,

One thing you could do is open up the actual box file and split the
boxes manually. There's a description on the main tesseract training
page [1]. Here is the excerpt:

As you can see, the low double quote character has been expressed as
two single commas. The bounding boxes must be merged as follows:

* First number (left) take the minimum of the two lines (197)
* Second number (bottom) take the minimum of the two lines (496)
* Third number (right) take the maximum of the two lines (214)
* Fourth number (top) take the maximum of the two lines (508)

So I think in this case you would just have to keep your second and
fourth numbers (bottom and top) the same, and then your lefts and
rights would be based on the first box plus the character width.
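
To make the arithmetic concrete, here is a minimal C++ sketch of both
operations: the merge rule from the excerpt (minimum of the lefts and
bottoms, maximum of the rights and tops) and an equal-width split of one
wide box. The Box struct and the names MergeBoxes and SplitBox are made
up for illustration; they are not part of tesseract or
tesseractTrainer.py. The equal-width split is only an assumption that
suits evenly spaced glyphs, so uneven ones would still need adjusting by
hand.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Coordinates of one box-file entry, in box-file order:
// left, bottom, right, top. (Hypothetical helper type.)
struct Box {
  int left, bottom, right, top;
};

// Merge two boxes into the smallest box covering both:
// minimum of the lefts and bottoms, maximum of the rights and tops.
Box MergeBoxes(const Box& a, const Box& b) {
  return Box{std::min(a.left, b.left), std::min(a.bottom, b.bottom),
             std::max(a.right, b.right), std::max(a.top, b.top)};
}

// Split one wide box into `count` side-by-side boxes of nearly equal
// width. Bottom and top are kept; only left and right change.
std::vector<Box> SplitBox(const Box& box, int count) {
  assert(count > 0);
  std::vector<Box> parts;
  const int width = box.right - box.left;
  for (int i = 0; i < count; ++i) {
    const int l = box.left + width * i / count;
    const int r = box.left + width * (i + 1) / count;
    parts.push_back(Box{l, box.bottom, r, box.top});
  }
  return parts;
}
```

For a box that swallowed 20 characters, SplitBox(box, 20) gives a
starting point; each resulting line in the box file still needs its own
character in front of the four numbers.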

I see that you're using tesseractTrainer.py. I'm using the same
script. It works great, and I added a bit of functionality to spit out
a PDF with the text placed over the image. The split-box functionality
is not really working on my system, though; it gives an error. I would
like to fix it so that I can use it. I'll keep you posted.

[1] - http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract

Bob Spence

Nov 1, 2009, 2:48:44 AM
to tesseract-ocr
Tenzin

I have had some success training Tibetan using the Jomolhari font
on OS X. I found I needed to space the characters out a bit more when
they clumped together on the line. Then there is a bit more fiddling in
tesseractTrainer.py to merge boxes.

The merge function failed for me because the pygtk library didn't
handle unicode well. So, I wrote my own merge.

You will also have problems with sscanf reading the UTF-8 encoding of
some consonants. The 30-character Tibetan alphabet includes characters
encoded as E0 BD 85 and E0 BD A0. The file boxread.cpp uses sscanf with
%s to read UTF-8 strings. Unfortunately, the final bytes 0x85 and 0xA0
have the same values as U+0085 NEL (the next-line control character)
and U+00A0 NBSP (NO-BREAK SPACE), which sscanf treats as whitespace, so
only the first two bytes are read. So every scanf that does a %s read
of UTF-8 characters needs to be replaced by code that does a simple
scan, to avoid truncating the encoding.
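
To show where those bytes come from, here is a small C++ sketch that
encodes a code point as UTF-8 and reports its last byte. The code points
U+0F45 and U+0F60 are not named above; they are what the byte sequences
E0 BD 85 and E0 BD A0 decode to. EncodeUtf8 and LastUtf8Byte are
hypothetical helper names, and the encoder is deliberately limited to
code points up to U+FFFF (it does not reject surrogates).

```cpp
#include <cassert>
#include <string>

// Encode one Unicode code point as UTF-8. Limited to U+0000..U+FFFF,
// which is enough for the Tibetan block; surrogates are not rejected.
std::string EncodeUtf8(unsigned int cp) {
  assert(cp <= 0xFFFF);
  std::string out;
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}

// Return the last byte of the UTF-8 encoding as an unsigned value.
unsigned char LastUtf8Byte(unsigned int cp) {
  const std::string bytes = EncodeUtf8(cp);
  return static_cast<unsigned char>(bytes[bytes.size() - 1]);
}
```

LastUtf8Byte(0x0F45) is 0x85 and LastUtf8Byte(0x0F60) is 0xA0, which are
exactly the byte values a single-byte locale can classify as whitespace.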

I'm using 2.04 and trying to get 3.0 going, but the new unified file
format still needs to be cracked.

Here is my fix for the sscanf problem:

bool read_next_box(int target_page, FILE* box_file, char* utf8_str,
                   int* x_min, int* y_min, int* x_max, int* y_max) {
  static int line = 0;
  int count = 0;
  int page = 0;
  char buff[kBoxReadBufSize];  // boxfile read buffer
  char *uch;
  char *buffptr = buff;

  while (fgets(buff, sizeof(buff) - 1, box_file)) {
    line++;

    buffptr = buff;
    const unsigned char *ubuf =
        reinterpret_cast<const unsigned char*>(buffptr);
    if (ubuf[0] == 0xef && ubuf[1] == 0xbb && ubuf[2] == 0xbf)
      buffptr += 3;  // Skip unicode file designation.
    /* Check for blank lines in box file */
    while (*buffptr == ' ' || *buffptr == '\t')
      buffptr++;
    // Simply scan over the uch rather than use sscanf("%s"), which
    // fails on Tibetan consonants. Stop at the end of the string too,
    // so a line without a separator cannot run past the buffer.
    uch = buffptr;
    while (*buffptr != '\0' && *buffptr != ' ' && *buffptr != '\t')
      buffptr++;
    if (*buffptr != '\0') {
      *buffptr++ = 0;  // Terminate uch; buffptr now points at the numbers.
      count = sscanf(buffptr, "%d %d %d %d %d",
                     x_min, y_min, x_max, y_max, &page);
      if (count != 5) {
        page = 0;
        count = sscanf(buffptr, "%d %d %d %d",
                       x_min, y_min, x_max, y_max);
      }
      if (target_page >= 0 && target_page != page)
        continue;  // Not on the appropriate page.
      if (count == 4) {
#if debug_utf8
        // Print the hex codes of the utf8 code.
        int x;
        for (x = 0; buff[x] != '\0'; ++x)
          tprintf("[%02x]", (unsigned char)buff[x]);
        tprintf("\n");
        for (x = 0; uch[x] != '\0'; ++x)
          tprintf("[%02x]", (unsigned char)uch[x]);
        tprintf(" %d %d %d %d\n", *x_min, *y_min, *x_max, *y_max);
#endif
        // Validate UTF8 by making unichars with it.
        int used = 0;
        int uch_len = strlen(uch);
        while (used < uch_len) {
          UNICHAR ch(uch + used, uch_len - used);
          int new_used = ch.utf8_len();
          if (new_used == 0) {
            tprintf("Bad utf-8 char starting with 0x%02x at col %d, "
                    "line %d \n",
                    (unsigned char)uch[used], used + 1, line);
            count = 0;
            break;
          }
          used += new_used;
        }
        if (uch_len > UNICHAR_LEN) {
          tprintf("utf-8 string too long at line %d\n", line);
          count = 0;
        }
      }
      if (count < 4) {
        tprintf("Box file format error on line %i ignored\n", line);
      } else {
        strcpy(utf8_str, uch);
        return true;  // read a box ok
      }
    }
  }
  fclose(box_file);
  line = 0;
  return false;  // EOF
}
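
For anyone who wants to try the scanning idea outside the tesseract
source tree, here is a stripped-down C++ sketch of the same approach
with no tesseract dependencies. BoxLine and ParseBoxLine are
hypothetical names, not tesseract APIs, and the sketch leaves out the
page number, the BOM check and the UTF-8 validation that the function
above performs.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical container for one parsed box-file line.
struct BoxLine {
  std::string glyph;  // raw UTF-8 bytes of the character
  int left, bottom, right, top;
};

// Parse "<glyph> <left> <bottom> <right> <top>" from one line.
// The glyph ends at the first ASCII space or tab; bytes such as 0x85
// or 0xA0 are never treated as separators, whatever the locale says.
bool ParseBoxLine(const std::string& line, BoxLine* out) {
  assert(out != nullptr);
  const std::string::size_type start = line.find_first_not_of(" \t");
  if (start == std::string::npos) return false;  // empty or all blanks
  const std::string::size_type end = line.find_first_of(" \t", start);
  if (end == std::string::npos) return false;  // no coordinates
  out->glyph = line.substr(start, end - start);
  // The coordinates are plain ASCII digits, so sscanf is safe for them.
  return std::sscanf(line.c_str() + end, "%d %d %d %d", &out->left,
                     &out->bottom, &out->right, &out->top) == 4;
}
```

Feeding it a line that starts with the bytes E0 BD 85 should yield a
three-byte glyph and all four coordinates, which is the case that the
%s read truncated.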

74yrs old

Nov 1, 2009, 7:51:44 AM
to tesser...@googlegroups.com
Svetlin Nakov,
Is it possible to use your script for generating training images and box files for UTF-8 text? I intend to use it for Indic languages, namely Kannada. I have attached kan-U0C80.pdf for your research.
Awaiting your valuable guidance.
Regards,
-sriranga(77yrsold)
New Text Document (2).txt
kan-U0C80.pdf