Using latex to train tesseract

begemotv2718

Oct 23, 2007, 11:17:33 PM
to tesseract-ocr
As I understand it, the most painstaking part of training tesseract
is rewriting the box file. It is especially difficult for
languages that do not use the Latin script, since the recognized
characters in the box file produced by running tesseract makebox bear
only a vague resemblance to the original characters. It is desirable to
automate the process of finding the correspondence between the training
boxes produced by tesseract makebox and the letters in the original text.

I suggest using LaTeX for this purpose. The dvi file that LaTeX
produces contains information about the bounding boxes of all
the characters. It is easy to print the dvi file on a printer or to
render it into a tiff file (using dvips + ghostscript). After that it is
relatively easy to find the correspondence between the bounding boxes
of the symbols in the dvi file and the bounding boxes of the symbols
that tesseract finds in the tif file. This can be done automatically,
even if some small distortion was introduced into the tif file in the
printing-scanning-tesseracting process. I have written Perl scripts to
do the job.
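
To make the comparison concrete: positions in a dvi file written by TeX are measured in scaled points from the top of the page, while tesseract box files use pixels counted from the bottom-left corner of the image. Below is a minimal Python sketch of that conversion. It is not the Perl code mentioned above. The function names, the 300 dpi default, and the 1 inch page offset (the usual dvips convention) are assumptions for illustration, and magnification is ignored.

```python
SP_PER_PT = 65536      # TeX scaled points per TeX point
PT_PER_INCH = 72.27    # TeX points per inch


def sp_to_px(sp, dpi=300):
    """Convert a length in scaled points to whole pixels at the given resolution."""
    return round(sp / SP_PER_PT / PT_PER_INCH * dpi)


def dvi_box_to_image_box(h, v, width, height, depth,
                         page_height_px, dpi=300, offset_in=1.0):
    """Map a character at dvi position (h, v) to (left, bottom, right, top) pixels.

    h, v, width, height and depth are in scaled points. v grows downwards and
    (h, v) is the reference point on the baseline. The result uses the
    bottom-left origin of tesseract box files. offset_in is the assumed
    distance of the dvi origin from the top-left page corner, in inches.
    """
    off = round(offset_in * dpi)
    left = off + sp_to_px(h, dpi)
    right = off + sp_to_px(h + width, dpi)
    top_from_top = off + sp_to_px(v - height, dpi)
    bottom_from_top = off + sp_to_px(v + depth, dpi)
    return (left, page_height_px - bottom_from_top,
            right, page_height_px - top_from_top)
```

A scanned page will be shifted and slightly scaled relative to these ideal coordinates, so a real matcher still has to tolerate some offset.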

Besides, modern LaTeX distributions such as teTeX have decent
support for many languages, so this automation could be useful to
many people attempting to train tesseract.

Do you think this may be useful? I can post the Perl files, but I am
too lazy to write instructions for them if you decide that all this is
nonsense.

74yrs old

Oct 24, 2007, 2:20:51 AM
to tesser...@googlegroups.com
Will LaTeX work on MS Windows, and does it support Indian languages
like Kannada? If so, how do I use it?

Jeffrey Ratcliffe

Oct 24, 2007, 2:41:54 AM
to tesser...@googlegroups.com
On 24/10/2007, 74yrs old <withbl...@gmail.com> wrote:
> Will LaTeX work on MS Windows, and does it support Indian languages
> like Kannada? If so, how do I use it?

The MiKTeX project is an excellent LaTeX for M$ Windows. Whether it
supports Kannada, I don't know.

74yrs old

Oct 24, 2007, 3:20:45 AM
to tesser...@googlegroups.com
I want to automate training as suggested by begemotv2718. I would appreciate it if a suitable
program were available. I am not a programmer.

begemotv2718

Oct 24, 2007, 3:45:38 AM
to tesseract-ocr
For 74yrs old: by googling I found a project that supports the Kannada
language in LaTeX.
http://ptsg.eecs.berkeley.edu/%7Evenkates/kannada.html

Whether this project is user-friendly enough, I do not know.

The problem is that one needs to write an encoding translation
procedure to translate symbols from the TeX internal font
encoding into Unicode. That may require some programming skill and
a moderate amount of research.


74yrs old

Oct 24, 2007, 5:32:53 AM
to tesser...@googlegroups.com
At present I have the Baraha software (www.baraha.com). With the help of BarahaIME, I have to edit the box file generated by tesseract (i.e. by typing in Kannada script) in Windows.

I presume your point is that the characters in the text file generated by running "tesseract fontfile.tif fontfile batch.nochop makebox" do not resemble the original characters in the tiff image. Your suggestion to automate the process with the help of LaTeX is not clear to me. Do you mean that, with the help of LaTeX, the font image in the tiff file can be copied into the generated (i.e. output) text file in addition to the characters printed by default by tesseract? If so, it is a good idea.
I would like to see output samples generated by you using the Perl script and LaTeX.
-Sriranga(74yrsold)




begemotv2718

Oct 24, 2007, 3:03:34 PM
to tesseract-ocr
Well, the basic procedure for your case could be the following.
a) You install MiKTeX and a package for it that allows latex to
understand the Kannada language. As far as I know, there is a
package called itrans that works with this. However, you need to provide
input for it in Latin transliteration.

b) You prepare the training text (in transliteration) and process it
with itrans and latex. At this stage you get a .dvi file that is
typeset in the Kannada language and contains (in a cryptic form) all the
information about the character boxes of your text. You extract this
information using my Perl script:
dvitype <file.dvi> | perl script1.pl > file.texbox

c) You produce the training image for tesseract. You can do this
electronically using dvips + ghostscript:
dvips -o <ps.file> <dvi.file>
gs -r300x300 -dNOPAUSE -sOutputFile=<file.tif> -sDEVICE=tiffg4 <ps.file>
or by printing the dvi file on a printer and then scanning it.
You run tesseract <file.tif> <file> batch.nochop makebox, which writes
file.txt. You rename file.txt to file.box.


d) You produce the final box file for training tesseract using my second
Perl script:
perl correlatebox.pl file.texbox file.box > result_file.box
The script automatically finds the correspondence between the boxes in the
texbox file (produced from the dvi) and the boxes found by tesseract.
It then substitutes the correct character codes into the tesseract boxes to
produce the final box file for training. It detects possible problems,
such as a character split into two parts or two characters merged into
one box, and handles them. At this stage no human intervention is necessary.
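
The matching in step d) can be pictured with a small sketch. This is not correlatebox.pl (which is not shown in this thread) but an illustrative Python version under simple assumptions: boxes are (left, bottom, right, top) tuples in the same pixel coordinates, and two boxes belong together when their intersection covers at least half of the smaller one. Boxes that are linked, directly or through a shared partner, are grouped. A character that tesseract split into two boxes then gets one merged box, and two characters that tesseract joined get one box labelled with both characters.

```python
def area(box):
    """Area of a (left, bottom, right, top) box."""
    return (box[2] - box[0]) * (box[3] - box[1])


def overlaps(a, b, min_fraction=0.5):
    """True if the intersection covers at least min_fraction of the smaller box."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    if w <= 0 or h <= 0:
        return False
    return w * h >= min_fraction * min(area(a), area(b))


def correlate(tex_boxes, tess_boxes):
    """Relabel tesseract boxes with the characters known from the dvi side.

    tex_boxes:  list of (char, box) in reading order.
    tess_boxes: list of boxes found by tesseract.
    Returns a list of (label, box) in the order of tex_boxes.
    """
    n, m = len(tex_boxes), len(tess_boxes)
    parent = list(range(n + m))  # union-find over both sets of boxes

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, (_, tex_box) in enumerate(tex_boxes):
        for j, tess_box in enumerate(tess_boxes):
            if overlaps(tex_box, tess_box):
                parent[find(i)] = find(n + j)

    groups = {}
    for k in range(n + m):
        groups.setdefault(find(k), []).append(k)

    result = []
    for members in groups.values():
        label = "".join(tex_boxes[k][0] for k in members if k < n)
        boxes = [tess_boxes[k - n] for k in members if k >= n]
        if not label or not boxes:
            continue  # unmatched on one side: left for manual review
        result.append((label, (min(b[0] for b in boxes), min(b[1] for b in boxes),
                               max(b[2] for b in boxes), max(b[3] for b in boxes))))
    return result
```

The pairwise loop is quadratic in the number of boxes, which is acceptable for a single page. A real implementation would also need to estimate the shift and scale between the two coordinate systems first.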

e) You run tesseract in training mode and follow all the
remaining steps (running mftrain, cntrain, etc.; all this also requires
no human intervention and can be done with a shell script).

f) You get the result and test it.

All the operations except the initial text file preparation (and
scanning, if you choose that option) can be automated, so that you
just run a shell script, named something like latex-train.sh, with a
single text file as input and get as output all the tesseract files
(normproto, pffmtable, etc.) that result from training.
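
As an illustration of such a driver, here is a hedged Python sketch that only assembles the command lines in order. The commands themselves are the ones quoted in this thread. The function name, the default script names, and the choice to return strings rather than execute anything are assumptions of the sketch, and the later steps of training (beyond mftrain and cntrain) are left out.

```python
import shlex


def training_commands(stem, tex_script="script1.pl",
                      correlate_script="correlatebox.pl", dpi=300):
    """Return the shell commands of the pipeline, in order, for one input file.

    stem is the file name without extension, e.g. "sample" for sample.itx.
    """
    s = shlex.quote(stem)
    return [
        # transliterated text -> LaTeX source -> dvi
        f"itrans <{s}.itx >{s}.tex",
        f"latex {s}.tex",
        # character boxes as typeset by LaTeX
        f"dvitype {s}.dvi | perl {shlex.quote(tex_script)} > {s}.texbox",
        # training image rendered through PostScript
        f"dvips -o {s}.ps {s}.dvi",
        f"gs -r{dpi}x{dpi} -sOutputFile={s}.tif -sDEVICE=tiffg4 -dNOPAUSE {s}.ps quit.ps",
        # boxes as found by tesseract, written to <stem>.txt
        f"tesseract {s}.tif {s} batch.nochop makebox",
        # final box file with the correct character codes
        f"perl {shlex.quote(correlate_script)} {s}.texbox {s}.txt > {s}.box",
        # training proper
        f"tesseract {s}.tif junk nobatch box.train",
        f"mftrain {s}.tr",
        f"cntrain {s}.tr",
    ]
```

Printing the list gives a script that can be reviewed before it is run. Passing each entry to subprocess.run(cmd, shell=True, check=True) would execute the steps and stop at the first failure.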

However, all this relies quite heavily on a Unix operating system
environment: you need a working installation of latex, perl, and
ghostscript, and you need to be able to run all of them from the command
line. I am not sure that doing all this is easy for a Windows
user: although the possibility exists on that system, it does not
fit the general Windows philosophy of having a black-box style GUI
program that does everything.


74yrs old

Oct 25, 2007, 5:27:36 AM
to tesser...@googlegroups.com
Thanks for the detailed procedure. It appears you are using Ubuntu Linux.
If so, is it possible to forward a copy of the typescript generated on Ubuntu,
so that I can study it and try it on a LiveCD to gain hands-on experience?

begemotv2718

Oct 28, 2007, 5:39:14 AM
to tesseract-ocr
Well, actually I run not Ubuntu but Debian Linux (which is like a
parent of Ubuntu), as well as Mac OS on my laptop. Most
things do not depend much on which Unix-like system you use. By the
way, you may try installing Cygwin on your Windows box; this is the
easiest way to turn a Windows machine into an (almost) fully functional
Unix without sacrificing anything from Windows itself.

I experimented a little with the Kannada language on the samples from the
itrans package. The machinery seems to work OK. However, to work on this
further I need a person who knows the language.

I am going to upload the archive with some samples now.



74yrs old

Oct 28, 2007, 6:00:52 AM
to tesser...@googlegroups.com
Thank you very much for your research. If you are using Debian Linux, a typescript can be generated.
I do not know whether Cygwin will generate a typescript as Linux does. If you want, I shall provide
it either as a bmp file or as a text file. I could not understand what the itrans package is.
With regards,

begemotv2718

Oct 28, 2007, 6:29:10 AM
to tesseract-ocr
http://tesseract-ocr.googlegroups.com/web/latex_train_kannada.tgz

Some instructions.
To experiment with all this on a Unix system you need to have some
packages installed.
First of all, you need the itrans package, which lets you typeset the
Kannada language from a transliterated input file.
If you type
your_machine>sudo apt-get install itrans itrans-fonts itrans-doc
this will install the itrans package and some latex packages, since itrans
depends on them.

Secondly, you need the Font::TFM Perl package. Unfortunately it is not in
the standard distribution, so you need to run
your_machine>sudo cpan install Font::TFM
cpan will ask you several questions, for which you may give the default
answers, and it will finally install the Font::TFM package.
Before running cpan, you may want to install several dependencies of the
Font::TFM package via apt-get:
your_machine>sudo apt-get install libparse-yapp-perl libio-pty-perl libdate-manip-perl libxml-dom-xpath-perl
I am sorry for this inconvenience. Currently I am trying to rewrite
this part of my code in C; it will be easier then.

You may also need the texlive-extra-utils package:
sudo apt-get install texlive-extra-utils
This provides the dvitype program.

How to use all this
First of all, prepare an *.itx file with transliterated text. I included
two sample files for you: one contains something like an alphabet and the
other a sample of poetry that I found in the itrans documentation. The
transliteration scheme in these files should be similar to the one that
your Baraha software uses.

Then you process your file with itrans:

itrans <sample.itx >sample.tex

Then you process it with latex to get the dvi file:

latex sample.tex

After that you will have the sample.dvi file. You need to open it with
xdvi sample.dvi
in order to have the font files generated.

Then you obtain the texbox file, which contains the bounding boxes of
all characters as latex typeset them:
dvitype sample.dvi | perl scripts/kannada.pl > sample.texbox

Then you produce the tif file for tesseract:

dvips -o sample.ps sample.dvi
gs -r300x300 -sOutputFile=sample.tif -sDEVICE=tiffg4 -dNOPAUSE sample.ps quit.ps

Now you run tesseract to produce its own box file:
tesseract sample.tif sample batch.nochop makebox

Then you run my second program to get the final file for training tesseract:
perl scripts/correlatebox.pl sample.texbox sample.txt >sample.box
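
For reference, each line of a tesseract box file of this era holds a character followed by four integers: left, bottom, right and top, in pixels, with the origin at the bottom-left corner of the image. The Python sketch below reads and writes such lines. The function names are illustrative and it is not part of the posted scripts. Later tesseract versions append a page number, which this sketch does not handle.

```python
def parse_box_line(line):
    """Split 'char left bottom right top' into (char, (left, bottom, right, top))."""
    char, left, bottom, right, top = line.split()
    return char, (int(left), int(bottom), int(right), int(top))


def format_box_line(char, box):
    """Inverse of parse_box_line: build one box file line."""
    return "{} {} {} {} {}".format(char, *box)


def relabel_line(line, new_char):
    """Keep the coordinates of a box line but replace the recognized character."""
    _, box = parse_box_line(line)
    return format_box_line(new_char, box)
```

Relabelling is the only change the final box file needs: the coordinates come from tesseract, the character comes from the LaTeX side.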

If you have ImageMagick installed on your system, you may want to run
another script:
perl scripts/draw2.pl sample.tif sample_dir sample.box >sample.html
This produces an html file that you can open in your browser.
The file contains pictures of the characters cut out of the tif
file, together with the text representation of these characters.

The final training steps are as usual:
tesseract sample.tif junk nobatch box.train
mftrain sample.tr
cntrain sample.tr
etc.

begemotv2718

Oct 28, 2007, 6:42:41 AM
to tesseract-ocr
Notice that currently no real Unicode output is implemented. The
problem is that I do not know the language. You need to edit the
kannada_texenc.txt file in order to get Unicode output. Currently this
file is generated from the information on the latex font encoding. It
contains shortened names of characters with the corresponding latex
font codes (numbers). What you need to do is replace these names
with the UTF-8 codes of the actual characters. The names themselves
should be relatively straightforward: they are of the form
prefix_lettername_code, where the prefix is v for vowels, vm for dependent
vowels, cb for base consonants (this actually outputs only a skeleton
of the consonant, which makes it easy for latex to join this base with a
dependent vowel; I am not sure how this works in Unicode), and cc for
double(?) consonants.
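
To illustrate the kind of replacement meant here, a hedged Python sketch follows. The exact layout of kannada_texenc.txt is not shown in this thread, so the example names (cb_ka_107, vm_aa_201), their numeric codes, and the two-entry lookup table are hypothetical. Only the prefix_lettername_code pattern and the prefixes come from the description above.

```python
# Prefixes as described in the post above.
PREFIX_KINDS = {
    "v": "vowel",
    "vm": "dependent vowel",
    "cb": "base consonant",
    "cc": "double consonant",
}

# Hypothetical hand-made table: (prefix, lettername) -> Unicode character.
# A person who knows the language would have to fill this in completely.
UNICODE_FOR = {
    ("cb", "ka"): "\u0c95",  # KANNADA LETTER KA
    ("vm", "aa"): "\u0cbe",  # KANNADA VOWEL SIGN AA
}


def split_name(name):
    """Split 'prefix_lettername_code' into (prefix, lettername, code)."""
    prefix, rest = name.split("_", 1)
    letter, code = rest.rsplit("_", 1)
    if prefix not in PREFIX_KINDS:
        raise ValueError("unknown prefix in " + name)
    return prefix, letter, int(code)


def to_unicode(names):
    """Translate a sequence of names; unknown ones are kept visibly as <name>."""
    out = []
    for name in names:
        prefix, letter, _code = split_name(name)
        out.append(UNICODE_FOR.get((prefix, letter), "<" + name + ">"))
    return "".join(out)
```

Keeping unknown names visible in the output makes it easy to see which entries of the table are still missing.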

begemotv2718

Oct 28, 2007, 6:58:30 AM
to tesseract-ocr
itrans

their homepage:
http://www.aczoom.com/itrans/

Nguyen Huu Hoa

Nov 2, 2007, 12:19:07 AM
to tesser...@googlegroups.com
Hello begemotv2718,
I would like to train tesseract to understand Vietnamese using LaTeX. Could you explain the steps I should take to complete the entire training process?

As I understand from the scripts you uploaded some days ago, we need a latex file containing the text that serves as the sample in the training process. The next step is to feed that file to latex (as well as pdflatex) to create the dvi file, and then to use a Perl script to translate the dvi file into a texbox file. The problem is that I cannot translate the dvi file into a texbox file using the rus.pl or kannada.pl scripts. Could you help me write one for the Vietnamese language? For writing Vietnamese in latex, we use the font named vnr10 (for more information about the vnr10 font, see http://www.tug.org/TUGboat/Articles/tb24-1/thanh.pdf).

As for translating the dvi file into a texbox file, there is a Perl package called TeX::DVI::Parse that parses dvi directly instead of going through dvitype. If we used that package to parse the dvi file directly, the performance would be better, IMHO. But I don't know much about Perl, so I have no idea how to use it.

Thank you very much.

Hoa Nguyen


begemotv2718

Nov 3, 2007, 3:50:15 AM
to tesseract-ocr
Well, I prepared a replacement of rus.pl/kannada.pl for the Vietnamese
language: http://tesseract-ocr.googlegroups.com/web/vietnam.tgz
The Perl file should always be in the same directory as
vietnam-texenc.txt.



Nguyen Huu Hoa

Nov 4, 2007, 9:44:29 PM
to tesser...@googlegroups.com
Hello,
Thank you for your input. I will try to train tesseract for Vietnamese and report the results to the list.

Best regards,

Hoa Nguyen

74yrs old

Nov 5, 2007, 12:55:24 AM
to tesser...@googlegroups.com


On 11/4/07, 74yrs old <withbl...@gmail.com> wrote:
I am very happy and thankful to you for trying to solve the problem. I am familiar with MS Windows but not so familiar with the Linux command line, so it is difficult for me to understand. However, with the help of a typescript generated on Linux, I can follow similar steps to run the programs. I have so many things to learn from you.
I wish your programs worked in MS Windows itself (not Cygwin), because Linux is better than Cygwin.
Concepts:
Suppose the output is in Latin transliteration (without Kannada script), such as "kA" for "ಕಾ" (ಕ + ಾ), typed phonetically as "kA" (U+0C95 + U+0CBE).
In other words, U+0C95 is an independent consonant, whereas U+0CBE is a dependent vowel (not an independent vowel as in English). To clarify further, the dependent vowel (ಾ) sits on and overlaps the consonant (ಕ),
changing (ಕ) into (ಕಾ). I trust you understand how the glyph "ಕಾ" ("kA") is created in
Kannada script by combining a consonant with a dependent vowel.
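
The composition described here can be checked with a few lines of Python using only the standard unicodedata module. This is a small illustration, not part of any script in this thread, and the variable and function names are arbitrary.

```python
import unicodedata

ka = "\u0c95"        # the consonant
aa_sign = "\u0cbe"   # the dependent vowel sign
syllable = ka + aa_sign  # displayed as one combined glyph


def describe(text):
    """List the code point and official Unicode name of every character."""
    return ["U+{:04X} {}".format(ord(ch), unicodedata.name(ch)) for ch in text]
```

Here describe(syllable) lists two entries, one for the consonant and one for the vowel sign, even though the syllable is displayed as a single glyph. This is why one box in a box file may need to carry more than one code point.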

I presume that with your program the output for a Kannada glyph image (say ಕಾ) will first be
"kA" (in Latin transliteration), and only then, with the help of the itrans tool, is the output "kA"
converted to "ಕಾ". If so, the output "kA" can easily be converted to "ಕಾ" with the help of the BarahaIME tool (Windows version).
In a nutshell, if the output for the scanned glyph image "ಕಾ" is editable Latin text such as "kA", then I have no problem converting "kA" into editable Kannada script "ಕಾ" with the help of BarahaIME. I have also attached a screenshot showing how Baraha converts text typed in English into Kannada or any other Indian language.
Awaiting valuable guidance.
With Regards,
-sriranga(75yrs old)




On 11/4/07, Yury Adamov < begemo...@gmail.com> wrote:
Sorry for the late reply; I have somewhat neglected my mailbox.

You still need the Unix package named "itrans". It may be obsolete, but it is the only working way (known to me) to pass Indic script to the TeX/LaTeX engine, which is capable of producing the dvi file that you need in the following process. The other option could be the Omega LaTeX engine, which works with Unicode, but I do not know how to make it work with Indic fonts; you probably need some additional package for this (to be honest, I haven't tried this yet).

Sincerely yours, Yury (aka begemotv2718)


On 10/28/07, 74yrs old < withbl...@gmail.com> wrote:
Hi,
Thanks for the information. I visited the itrans home page and noticed this message:
"This page is here for historical purposes, this package is no longer under active development, nor is there any support available.

All major operating systems now support Unicode, and have built-in input methods to enter Indic script letters, so there is no need for pre-processors for Indic scripts.
- January 2006"

Now I understand that itrans is nothing but transliteration. I have the Baraha software.

Indic is supported in WinXP.









74yrs old

Nov 13, 2007, 6:15:36 AM
to tesser...@googlegroups.com
Successfully generated sample.dvi.
I tried to run dvitype sample.dvi | perl scripts/kannada.pl > sample.texbox, but it failed to generate the box file (png attached).
Guidance is requested.

74yrsold

Nov 14, 2007, 1:29:30 AM
to tesseract-ocr
Hello,
Have you succeeded in training Vietnamese and generating the box file
using the Perl script "dvitype sample.dvi | perl scripts/vietnamese.pl >
sample.texbox"?
In my case, I am struggling hard to generate the box file using that
Perl script.

I trust Begemotv will solve, at the earliest, the problem of
"automating the process of finding correspondence between the train
boxes produced by tesseract makebox and the letters in the original
text", which is an EXCELLENT Perl program for the benefit of trainers
of Tesseract OCR.
Regards


lauhlau

Nov 18, 2014, 2:03:51 PM
to tesser...@googlegroups.com
Hi,

I am trying to do what you did. I noticed that this topic is reeeeeaaaally old (7 years !).

But I could not download your script files script1.pl and correlatebox.pl (404 not found error).

Do you still have them anywhere ?

Thanks in advance