I suggest using LaTeX for this purpose. The dvi file that LaTeX
produces contains the bounding boxes of all the characters, and it is
easy to print a dvi file on a printer or to render it into a tiff file
(using dvips + ghostscript). After that it is relatively easy to find
the correspondence between the bounding boxes of symbols in the dvi
file and the bounding boxes of symbols in the tif file found by
tesseract. This can be done automatically, even if some small
distortion was introduced into the tif file in the
printing-scanning-tesseracting process. I have written perl scripts to
do the job. Besides, modern distributions of LaTeX, such as teTeX, have
decent support for different languages, so this automation can be
useful for many people attempting to train tesseract.
Do you think this may be useful? I can post the perl files, but I'm too
lazy to write instructions for them if you decide that all this is
nonsense.
The MiKTeX project is an excellent LaTeX distribution for M$ Windows.
Whether it supports Kannada, and whether it is user friendly enough, I
do not know.
The problem is that one needs to write an encoding translation
procedure to translate symbols from the TeX internal font encoding
into Unicode. That may require some programming skill and a moderate
amount of research.
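To give an idea of what such an encoding translation procedure involves, here is a minimal sketch in Python (not the perl code discussed in this thread). The table name FONT_TO_UNICODE and the function to_unicode are made up for illustration, and only a few slots of the standard cmr10 font are filled in; a table for a Kannada or Vietnamese font would have to be compiled from that font's documentation.

```python
# Illustrative lookup table: (TeX font name, character slot) -> Unicode text.
# Only the ligature slots of cmr10 are listed; a table for a Kannada or
# Vietnamese font has to be compiled from that font's own documentation.
FONT_TO_UNICODE = {
    ("cmr10", 0x0B): "ff",
    ("cmr10", 0x0C): "fi",
    ("cmr10", 0x0D): "fl",
    ("cmr10", 0x0E): "ffi",
    ("cmr10", 0x0F): "ffl",
}


def to_unicode(font, slot):
    """Return the Unicode text for one dvi character, or None if unknown."""
    text = FONT_TO_UNICODE.get((font, slot))
    if text is not None:
        return text
    # In cmr10 the unaccented letters and digits sit in their ASCII slots.
    if font == "cmr10" and slot < 128 and chr(slot).isalnum():
        return chr(slot)
    return None  # unknown slot: better to flag it than to guess
```

Returning None for an unknown slot makes missing table entries visible instead of silently producing wrong characters in the box file.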
On Oct 24, 2:20 am, "74yrs old" <withblessi...@gmail.com> wrote:
> I want to automate training as suggested by begemotv2718. It would be
> appreciated if a suitable program is available. I am not a programmer.
>
> On 10/24/07, Jeffrey Ratcliffe <jeffrey.ratcli...@gmail.com> wrote:
b) You prepare the training text (using transliteration) and process it
with itrans and latex. At this stage you get a .dvi file that is
typeset in the Kannada language and contains (in a cryptic form) all
the information about the character boxes of your text. You extract
this information using my perl script
dvitype <file.dvi>| perl script1.pl > file.texbox
c) You produce the training image for tesseract. You can do this
electronically using dvips + ghostscript
dvips -o <ps.file> <dvi.file>
gs -r300x300 -dNOPAUSE -sOutputFile=<file.tif> -sDEVICE=tiffg4 <ps.file>
or by printing the dvi file on a printer and then scanning it.
You run
tesseract <file.tif> <file> batch.nochop makebox
which writes file.txt, and you rename file.txt to file.box.
d) You produce the final box file for training tesseract using my
second perl script
perl correlatebox.pl file.texbox file.box > result_file.box
My script automatically finds the correlation between the boxes in the
texbox file (produced from dvi) and the boxes recognized by tesseract.
It then puts the correct character codes into the tesseract boxes to
produce the final box file for training. It detects possible problems,
such as a character split into two parts or two characters merged into
one, and handles them. At this stage no human intervention is necessary.
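The matching idea can be sketched roughly as follows in Python. This is not the code of correlatebox.pl, only an illustration under simplifying assumptions: both sets of boxes are taken to be already in the same pixel coordinates as (left, bottom, right, top), every tesseract box is assigned to the TeX box it overlaps most, and only the "character split into parts" case is handled (merged characters are not). The names overlap, correlate and to_box_line are made up for the example.

```python
def overlap(a, b):
    """Area of the intersection of two boxes (left, bottom, right, top)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0


def correlate(tex_boxes, tess_boxes):
    """Attach the known characters to the boxes found by tesseract.

    tex_boxes:  list of (char, box) taken from the dvi file
    tess_boxes: list of boxes found by tesseract (its guessed chars ignored)
    Returns a list of (char, box), one entry per matched TeX character.
    """
    groups = {}  # index of TeX box -> tesseract boxes assigned to it
    for tb in tess_boxes:
        best, best_area = None, 0
        for i, (_, box) in enumerate(tex_boxes):
            area = overlap(box, tb)
            if area > best_area:
                best, best_area = i, area
        if best is not None:  # boxes overlapping nothing are dropped
            groups.setdefault(best, []).append(tb)

    result = []
    for i in sorted(groups):
        parts = groups[i]
        # A character split into several pieces becomes one enclosing box.
        merged = (min(p[0] for p in parts), min(p[1] for p in parts),
                  max(p[2] for p in parts), max(p[3] for p in parts))
        result.append((tex_boxes[i][0], merged))
    return result


def to_box_line(char, box):
    """Format one entry as a box file line: char left bottom right top."""
    return f"{char} {box[0]} {box[1]} {box[2]} {box[3]}"
```

For instance, if tesseract reports the stem and the dot of an "i" as two separate boxes, both overlap the same TeX box and come out as a single line labelled "i".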
e) You run tesseract in training mode and follow all the remaining
steps (running mftrain, cntrain, etc.; none of this requires human
intervention either, and it can be done with a shell script).
f) You get the result and test it.
All the operations except the initial text file preparation (and
scanning, if you choose this option) can be automated, so that you
just run a shell script named something like latex-train.sh on a single
input text file and get as output all the tesseract files (normproto,
pffmtable, etc.) that result from training.
However, all this relies quite heavily on a Unix environment: you need
a working installation of latex, perl and ghostscript, and you need to
be able to run all of them from the command line. I am not sure that
this is easy for a Windows user: although such possibilities exist in
that system, they do not fit the general Windows philosophy of having
a black-box style GUI program that does everything.
On Oct 24, 4:32 am, "74yrs old" <withblessi...@gmail.com> wrote:
> At present I have the Baraha software (www.baraha.com). With the help of
> BarahaIME, I have to edit the box file generated by tesseract (i.e. by
> typing in Kannada script) in Windows.
>
> I presume your point is that the characters in the text file generated by
> running "tesseract fontfile.tif fontfile batch.nochop makebox" do not
> resemble the original characters in the tiff image. Your suggestion to
> automate the process with the help of LaTeX is not clear to me. Do you mean
> that, with the help of LaTeX, the font image in the tiff file can be copied
> into the generated (i.e. output) text file, in addition to the characters
> printed by default by tesseract? If so, it is a good idea.
> I would like to see output samples generated by you using the perl script
> and LaTeX.
> -Sriranga(74yrsold)
I experimented a little with the Kannada language on the samples from
the itrans package. The machinery seems to work OK. However, to work
further on this I need a person who knows the language.
I am going to download the archive with some samples now.
On Oct 25, 4:27 am, "74yrs old" <withblessi...@gmail.com> wrote:
> Thanks for the detailed procedure. It appears you are using Ubuntu Linux.
> If so, could you forward a copy of the typescript generated on Ubuntu, so
> that I can study it and try it on a LiveCD to get hands-on experience?
Some instructions.
To experiment with all this on a Unix system you need to have some
packages installed.
First of all, you need the itrans package, which lets you typeset the
Kannada language from a transliterated input file.
If you type
your_machine>sudo apt-get install itrans itrans-fonts itrans-doc
this will install the itrans package and some latex packages, since
itrans depends on latex.
Secondly, you need the Font::TFM perl package. Unfortunately it is not
in the standard distribution, so you need to run
your_machine>sudo cpan Font::TFM
cpan will ask you several questions, to which you may give the default
answers, and it will finally install the Font::TFM package.
Before running cpan, you may want to install several dependencies of
the Font::TFM package via apt-get
your_machine>sudo apt-get install libparse-yapp-perl libio-pty-perl libdate-manip-perl libxml-dom-xpath-perl
I am sorry for this inconvenience. I am currently trying to rewrite
this part of my code in C; it will be easier then.
You may also need the texlive-extra-utils package:
sudo apt-get install texlive-extra-utils
This provides the dvitype program.
How to use all this
First of all, prepare a *.itx file with transliterated text. I included
two sample files for you: one contains something like an alphabet and
the other a sample of poetry that I found in the itrans documentation.
The transliteration scheme in these files should be similar to the one
that your Baraha software uses.
Then you process your file with itrans
itrans <sample.itx >sample.tex
Then you process it with latex to get the dvi file
latex sample.tex
After that you will have the sample.dvi file. You need to open it with
xdvi sample.dvi
in order to have the font files generated.
Then you obtain the texbox file, which contains the bounding boxes of
all characters as latex typeset them.
dvitype sample.dvi | perl scripts/kannada.pl > sample.texbox
Then you produce the tif file for tesseract
dvips -o sample.ps sample.dvi
gs -r300x300 -sOutputFile=sample.tif -sDEVICE=tiffg4 -dNOPAUSE sample.ps quit.ps
Now you run tesseract to produce its own box file
tesseract sample.tif sample batch.nochop makebox
Then you run my second program to get the final file for training tesseract:
perl scripts/correlatebox.pl sample.texbox sample.txt >sample.box
If you have imagemagick installed on your system, you may want to run
another script
perl scripts/draw2.pl sample.tif sample_dir sample.box >sample.html
which produces an html file that you can open in your browser. This
file contains pictures of the characters cut out of the tif file,
together with the text representation of these characters.
Finally, the remaining training steps are the usual ones
tesseract sample.tif junk nobatch box.train
mftrain sample.tr
cntrain sample.tr
etc.
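The steps above can be collected into a small driver. The following is a sketch in Python rather than the latex-train.sh shell script mentioned earlier; the function names build_commands and run_pipeline are made up, the commands are copied from the walkthrough, and it assumes the font files already exist (the interactive xdvi step is left out) and that the scripts live in scripts/ under the current directory.

```python
import shlex
import subprocess


def build_commands(base, dvi_script="scripts/kannada.pl"):
    """Return the shell commands of the walkthrough for files named base.*"""
    b = shlex.quote(base)
    return [
        f"itrans <{b}.itx >{b}.tex",
        f"latex {b}.tex",
        f"dvitype {b}.dvi | perl {dvi_script} > {b}.texbox",
        f"dvips -o {b}.ps {b}.dvi",
        f"gs -r300x300 -sOutputFile={b}.tif -sDEVICE=tiffg4 -dNOPAUSE {b}.ps quit.ps",
        f"tesseract {b}.tif {b} batch.nochop makebox",
        f"perl scripts/correlatebox.pl {b}.texbox {b}.txt >{b}.box",
        f"tesseract {b}.tif junk nobatch box.train",
        f"mftrain {b}.tr",
        f"cntrain {b}.tr",
    ]


def run_pipeline(base):
    """Run every step in order, stopping at the first one that fails."""
    for command in build_commands(base):
        print("+", command)
        subprocess.run(command, shell=True, check=True)
```

Calling run_pipeline("sample") in the directory that holds sample.itx would then run the whole chain; build_commands("sample") alone just lists the commands, which is handy for checking them before anything is executed.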
their homepage:
http://www.aczoom.com/itrans/
On Nov 1, 11:19 pm, "Nguyen Huu Hoa" <huu...@gmail.com> wrote:
> Hello begemotv2718,
> I would like to train tesseract to understand Vietnamese using latex.
> Could you help me with the steps I should take to accomplish the entire
> training process?
>
> As I understand from the scripts you uploaded some days ago, we need
> a latex file that contains text to serve as samples in the training process.
> The next step is to feed that file to latex (as well as pdflatex) to create
> the dvi file, then to use a perl script to translate the dvi file into a
> texbox file. The problem is that I cannot translate the dvi file into a
> texbox file using the rus.pl or kannada.pl script. Could you help me write
> one for the Vietnamese language?
> For writing Vietnamese in latex, we use the font named vnr10 (for more
> information about the vnr10 font, see
> http://www.tug.org/TUGboat/Articles/tb24-1/thanh.pdf).
>
> And for translating the dvi file into a texbox file, there is a perl package
> called TeX::DVI::Parse that parses dvi directly instead of using dvitype.
> Thus, if we use that package to parse the dvi file directly, the performance
> would be better, IMHO. But I don't know much about perl, and therefore have
> no idea how to use it.
>
> Thank you very much.
>
> Hoa Nguyen
>
I am very happy and thankful to you for trying to solve the problem. I am familiar with MS Windows but not so familiar with the Linux command line, so it is difficult for me to understand. However, with the help of a typescript generated on Linux, I can follow similar steps to run the program. I have so many things to learn from you.
I wish your programs worked in MS Windows (not cygwin), because Linux itself is better than cygwin.
concepts:
Suppose the output is in Latin letters (without Kannada script), like "kA"
for "ಕಾ" (ಕ + ಾ), typed phonetically as "kA" (U+0C95 + U+0CBE).
In other words, U+0C95 is an independent consonant whereas U+0CBE is a dependent vowel (not an independent vowel as in English). To clarify further, the dependent vowel (ಾ) sits on/overlaps the consonant (ಕ) and converts (ಕ) into (ಕಾ). I trust you understand how the glyph "kA" is created in Kannada script by combining a consonant with a dependent vowel.
I presume that with your program, the output for a Kannada glyph image (say ಕಾ) will first be "kA" (in Latin form), and only then, with the help of the "itrans" tool, is the output "kA" converted to "ಕಾ". If so, the output "kA" can just as easily be converted to "ಕಾ" with the help of the BarahaIME tool (Windows version).
In a nutshell, if the output for the scanned image ("ಕಾ") is editable Latin text such as "kA", then there is no problem for me to convert "kA" into editable Kannada script as "ಕಾ" with the help of BarahaIME. I have also attached a screen-shot showing how Baraha converts text typed in English into Kannada or any other Indian language.
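The composition described above can be checked in a few lines of Python. This is only an illustration using the standard unicodedata module; the helper name describe is made up for the example.

```python
import unicodedata

consonant = "\u0c95"    # KANNADA LETTER KA, the independent consonant
vowel_sign = "\u0cbe"   # KANNADA VOWEL SIGN AA, the dependent vowel
syllable = consonant + vowel_sign  # displayed as one glyph, typed as "kA"


def describe(text):
    """List the code point and Unicode name of every character in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


for line in describe(syllable):
    print(line)
```

The syllable is stored as two code points even though it is drawn as a single shape, which is why a box file entry for it has to carry both characters.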
Awaiting valuable guidance.
With Regards,
-sriranga(75yrs old)
On 11/4/07, Yury Adamov < begemo...@gmail.com> wrote:
Sorry for the late reply, I actually somewhat neglected my mailbox.
You still need the Unix package named "itrans". Maybe it is obsolete, but it is the only working way (known to me) to pass Indic script to the TeX/LaTeX engine, which is capable of producing the dvi file that you need in the subsequent process. The other option could be to use the Omega LaTeX engine, which works with Unicode, but I do not know how to make it work with Indic fonts; probably you need some additional package for this (to be honest, I haven't tried this yet).
Sincerely yours, Yury (aka begemotv2718)
On 10/28/07, 74yrs old < withbl...@gmail.com> wrote:
Hi,
Thanks for the information. I visited the itrans home page and noticed the following message:
"This page is here for historical purposes, this package is no longer under active development, nor is there any support available. All major operating systems now support Unicode, and have built-in input methods to enter Indic script letters, so there is no need for pre-processors for Indic scripts.
- January 2006"
Now I understand that itrans is nothing but transliteration. I have the Baraha software.
Indic is supported in WinXP.
On 10/28/07, begemotv2718 <begemo...@gmail.com > wrote:
itrans
their homepage:
http://www.aczoom.com/itrans/
I trust Begemotv will solve the problem of "Automate the
process of finding correspondence between the train boxes produced by
tesseract makebox and the letters in the original text.", which is an
EXCELLENT perl program, for the benefit of trainers of tesseract OCR,
as early as possible.
Regards
On Nov 5, 7:44 am, "Nguyen Huu Hoa" <huu...@gmail.com> wrote:
> Hello,
> Thank you for your input. I will try to train tesseract on Vietnamese and
> post the result to the list.
>
> Best regards,
>
> Hoa Nguyen
>
> On 11/3/07, begemotv2718 <begemotv2...@gmail.com> wrote:
>
>
>
> > Well, I prepared a replacement of rus.pl/kannada.pl for the Vietnamese
> > language: http://tesseract-ocr.googlegroups.com/web/vietnam.tgz, the