Tamil Trained data

nkantan r

Mar 28, 2012, 3:31:53 PM
to tesseract-ocr
hi
i know there are two tamil trained data files, corresponding to the
Latha and Lohit fonts. going through the box and tif files, i
understand that the boxes for combined consonants (உயிர்மெய்) are
selected as individual glyphs (for eg: கே is boxed as separate ே and க
instead of a merged கே). Since the vowel sign ே comes before the base
consonant க, elaborate post-processing is required. such
post-processing could be written by a person who knows tamil as well
as cpp, but as of now it is altogether missing;

to elaborate further: குகூகெகே is read correctly but written out as
குகூெகேக; this is because the sequence is read as கு கூ ெ, க ே க; when
read unichar by unichar, க followed by ே is taken as the single
unichar கே; the net result is குகூெகேக
this becomes worse when single characters such as "கொ" "கோ" "கௌ" are
read as three characters in three boxes!
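
The reordering described above can be sketched in a few lines of Python rather than cpp. This is only an illustration and not part of tesseract: the name visual_to_logical is made up here, and the sketch assumes the whole input string is in visual (left-to-right box) order, so it would wrongly reorder text that is already in correct Unicode order. It moves the pre-base vowel signs ெ, ே and ை behind the consonant that follows them, then relies on Unicode NFC normalization to join split two-part signs such as ெ + ா into ொ.

```python
import unicodedata

# Vowel signs that are drawn to the LEFT of the consonant
# but stored AFTER it in Unicode text.
PRE_BASE_SIGNS = {"\u0bc6", "\u0bc7", "\u0bc8"}  # ெ ே ை


def is_tamil_consonant(ch):
    # Tamil consonants occupy U+0B95 (க) through U+0BB9 (ஹ).
    return "\u0b95" <= ch <= "\u0bb9"


def visual_to_logical(text):
    """Reorder visual-order OCR output into Unicode (logical) order.

    ASSUMPTION: every pre-base sign in `text` precedes its consonant.
    """
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in PRE_BASE_SIGNS and nxt and is_tamil_consonant(nxt):
            out.append(nxt)  # consonant first
            out.append(ch)   # then the vowel sign
            i += 2
        else:
            out.append(ch)
            i += 1
    # NFC composes ெ + ா into ொ, ே + ா into ோ, and ெ + ௗ into ௌ.
    return unicodedata.normalize("NFC", "".join(out))


print(visual_to_logical("குகூெகேக"))
```

If the assumption holds, the garbled string from the example comes back as குகூகெகே. The sketch does not handle a right-hand part that was recognised as the letter ள instead of the length mark ௗ.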

another major issue is the missing vowel ஔ, which is read as two
characters, ஒ and ள;

to avoid these issues, i am retraining the tamil alphabet in its
proper form; i have succeeded in doing this for one font (Latha
size 12), but while combining the language files i get:

Combining tessdata files
TessdataManager combined tess
Offset for type 0 is -1
Offset for type 1 is 108
Offset for type 2 is -1
Offset for type 3 is -1
Offset for type 4 is 17420
Offset for type 5 is -1
Offset for type 6 is -1
Offset for type 7 is 21008
Offset for type 8 is -1
Offset for type 9 is 31506
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1

C:\indicocr\tesseract301>

obviously the -1 above indicates something is wrong? nowhere on the
tesseract-ocr project page is it possible to get samples of

•tessdata/eng.config
•tessdata/eng.unicharset
•tessdata/eng.unicharambigs
•tessdata/eng.inttemp
•tessdata/eng.pffmtable
•tessdata/eng.normproto
•tessdata/eng.punc-dawg
•tessdata/eng.word-dawg
•tessdata/eng.number-dawg
•tessdata/eng.freq-dawg

There are 13 items listed in the combining output, while only 10 files
are listed above.

Though it is mentioned that unicharset, inttemp, pffmtable and normproto
are the four required files, apart from word-dawg and freq-dawg, there
is no mention of whether the other files, such as tam.config,
tam.unicharambigs etc., can be left absent or whether empty files are
required.
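
To see which pieces the combining step did not find, the offset lines can be matched against the file list. The Python sketch below is a guess at how to read that output, not documented behaviour: it assumes that type numbers 0 to 9 follow the same order as the ten files listed above (types 10 to 12 are left unnamed because the post does not list them) and that an offset of -1 means the component is absent. The names COMPONENTS, REQUIRED and missing_components are made up for this example.

```python
import re

# ASSUMPTION: type indices 0-9 follow the order of the file list in the post.
COMPONENTS = [
    "config", "unicharset", "unicharambigs", "inttemp", "pffmtable",
    "normproto", "punc-dawg", "word-dawg", "number-dawg", "freq-dawg",
]
# The four files the post names as required.
REQUIRED = ["unicharset", "inttemp", "pffmtable", "normproto"]


def missing_components(log_text):
    """Return the names of all components reported with offset -1."""
    missing = []
    for m in re.finditer(r"Offset for type\s+(\d+) is (-?\d+)", log_text):
        idx, offset = int(m.group(1)), int(m.group(2))
        if offset == -1:
            name = COMPONENTS[idx] if idx < len(COMPONENTS) else f"type {idx}"
            missing.append(name)
    return missing


# The combining output quoted in the post.
combine_log = """
Offset for type 0 is -1
Offset for type 1 is 108
Offset for type 2 is -1
Offset for type 3 is -1
Offset for type 4 is 17420
Offset for type 5 is -1
Offset for type 6 is -1
Offset for type 7 is 21008
Offset for type 8 is -1
Offset for type 9 is 31506
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
"""

missing = missing_components(combine_log)
required_missing = [name for name in missing if name in REQUIRED]
print("missing:", missing)
print("required but missing:", required_missing)
```

If the ordering assumption is right, inttemp and normproto are among the absent components, which would be consistent with a later failure on TESSDATA_INTTEMP.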

now, when i try to run tesseract with the tam.traineddata made above,
i get the error below:
===================================
C:\indicocr\tesseract301>tesseract image.tif testtxt -l tam
tessdata_manager.SeekToStart(TESSDATA_INTTEMP):Error:Assert failed:in
file ..\classify\adaptmatch.cpp, line 512

C:\indicocr\tesseract301>
=======================================

kindly advise what went wrong, and what needs to be done to get a proper
traineddata file. i am also really hopeful that the files used before
combining will be made available so that one can see the samples.

regards
rnkantan

nkantan r

Apr 1, 2012, 10:02:00 AM
to tesser...@googlegroups.com
hi all!
 
i am surprised that no one replied on this subject in this forum, but not shocked, as i find the interest level in tamil ocr is rather limited; the real cause of the error above is that the "fullstop" in my training image was treated as a zero; so the box file had "two" zeros and the number of unichars did not match.
 
while i have successfully trained tesseract (3.01) with suitable unicharambigs to generate correct ocr for simple computer passages, i am keen on sharing some of my notes on the quirky (that is, strange) ways the box files are used for training; though i have used my own traineddata for training pages of other fonts and even snapshots of real fonts from old books, i will initially be using the existing trained data here in this thread.
 
The first thing for would-be trainers to note is never to use just isolated letters in the image file; either club two or three letters together to form "words" with uneven spacing, or deliberately use more spacing between letters; to clarify further, use அஆஇ   ஈஉஊஎ ஐ  ஒஓஔ instead of அ ஆ இ ஈ உ ஊ எ ஏ ஐ ஒ ஓ ஔ ஃ.  i do not know the reason for this; it is just a strange way of tesseract.
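
The grouping advice above can be sketched as a small Python helper that clubs letters into word-like groups of uneven length before the training page is typeset. The function name group_letters, the group sizes and the fixed seed are arbitrary choices for illustration, not anything tesseract prescribes.

```python
import random


def group_letters(letters, min_len=1, max_len=4, seed=0):
    """Join letters into space-separated groups of uneven length."""
    rng = random.Random(seed)  # fixed seed so the layout is repeatable
    groups = []
    i = 0
    while i < len(letters):
        n = rng.randint(min_len, max_len)  # inclusive on both ends
        groups.append("".join(letters[i:i + n]))
        i += n
    return " ".join(groups)


vowels = ["அ", "ஆ", "இ", "ஈ", "உ", "ஊ", "எ", "ஏ", "ஐ", "ஒ", "ஓ", "ஔ"]
training_line = group_letters(vowels)
print(training_line)
```

The letters keep their alphabetical order; only the spacing between them changes.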
 
i created a file called tam.latha.exp0.tif (from a snapshot of a pdf file of a text file named tam.latha.exp0.odt). This contains all the tamil characters in the latha font, regular, size 10, spaced out but presented in alphabetical order. the file is enclosed below. i created the box file with the command below, using the existing trained data:

C:\indicocr\tesseract301>tesseract tam.latha.exp0.tif tam.latha.exp0 -l tam batch.nochop makebox
 
the created box file is enclosed after renaming it as tam.latha.exp0.orig.box (the reason for renaming is that i have since edited the file). If anybody opens the file in a box editor after renaming it to the original name, they will find the following:
a) there is no blob corresponding to ஃ  and ஹ் in the first part; also the boxes are created in a sequence different from the arrangement of letters: அ, ஆ, இ, ஈ , உ, ஊ, எ, ஏ, ஐ, ஒ, க், ங். ............ெ  ஔ. 
 THAT IS, THOUGH ஒ, ஓ, ஔ ARE IN SEQUENCE, THE BOXES/BLOBS ARE CREATED IN A DIFFERENT ORDER.  This wrong order happens only in the first part; the same set of letters is repeated at the bottom of the page, and there the BOXES/BLOBS are in the same sequence.  I manually edited the box file using the jTess editor, deleting the ஔ box and inserting boxes for ஔ and ஃ; I also deleted irrelevant boxes around the vowel variations. the edited file is enclosed below (tam.latha.exp0.box). Now that the box file is satisfactory, as seen in the jTess box editor, i attempted creating the tr file as below:
==================
C:\indicocr\tesseract301>tesseract tam.latha.exp0.tif tam.latha.exp0 nobatch box.train
Tesseract Open Source OCR Engine v3.01 with Leptonica
Page 0
APPLY_BOXES: boxfile line 14/α«â ((1099,3010),(1124,3040)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
   Boxes read from boxfile:    1151
   Boxes failed resegmentation:       1
APPLY_BOXES: Unlabelled word at :Bounding box=(2941,-1043)->(2966,-1027)
APPLY_BOXES: Unlabelled word at :Bounding box=(3010,-1071)->(3037,-1032)
   Found 1150 good blobs and 55 unlabelled blobs in 0 words.
   2 remaining unlabelled words deleted.
TRAINING ... Font name = latha
Generated training data for 103 words
=================
the generated Tr file is also enclosed;
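
The odd label printed after "boxfile line 14/" in the output above looks like the UTF-8 bytes of a Tamil character displayed by a console that does not use UTF-8. The Python sketch below reverses that, assuming (this is a guess, not something the output states) that the Windows console was using code page 437; the helper name undo_console_mojibake is made up for the example.

```python
def undo_console_mojibake(garbled, console_encoding="cp437"):
    """Recover UTF-8 text that a console displayed in a legacy code page.

    ASSUMPTION: the console code page was 437; other code pages
    would need a different `console_encoding`.
    """
    raw_bytes = garbled.encode(console_encoding)  # back to the original bytes
    return raw_bytes.decode("utf-8")              # read them as UTF-8


# The three characters shown in the APPLY_BOXES failure line: α « â
label = undo_console_mojibake("\u03b1\u00ab\u00e2")
print(label, hex(ord(label)))
```

Under that assumption the label decodes to ஃ (U+0B83), the character whose box was inserted by hand.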
 
my observations and questions:
1) the box (1099,3010),(1124,3040)  corresponds to ஃ  and was inserted manually; also, it is the 13th box and not on the 14th line!
2) what is meant by "boxes failed resegmentation"?
3) regarding the second message about the bounding boxes  (2941,-1043)->(2966,-1027) and (3010,-1071)->(3037,-1032): i am not able to identify any such boxes, and i am not sure about the negative values; do they represent boxes in the box file or some blob co-ordinates?
4) if we open the tr file in any editor, we find the letter ஔ after அ, ஆ, இ, ஈ , உ, ஊ, எ, ஏ, ஐ, ஒ, க், ங். ............ெ ;
5) this suggests that the image file is read again first and the blobs are then matched to the nearest boxes; the box file is not used directly to create the blobs on the tif image and generate the training data within the box boundaries. for a user, common sense says the box file would be used to create the blobs on the image file.
6)  more curious still, the same set of letters from the first part is repeated in the second half of the page, and there it is sequenced correctly in the box file automatically;  so the layout (linear arrangement of letters) probably does not matter?
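
One way to check observation 1 is to look up the reported coordinates in the box file directly. The Python sketch below parses box lines in the usual "symbol left bottom right top page" layout, where y is measured upward from the bottom edge of the image, and reports the line on which a given box sits. The sample lines and the image height of 3300 are hypothetical stand-ins; only the coordinates of the ஃ box are taken from the message above. The names Box, parse_box_lines, find_box and to_top_left are made up for the example.

```python
from typing import NamedTuple


class Box(NamedTuple):
    line_no: int  # 1-based line number in the box file
    symbol: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_lines(lines):
    """Parse box-file lines, skipping blank or malformed ones."""
    boxes = []
    for line_no, line in enumerate(lines, start=1):
        parts = line.split()
        if len(parts) < 5:
            continue
        left, bottom, right, top = map(int, parts[1:5])
        boxes.append(Box(line_no, parts[0], left, bottom, right, top))
    return boxes


def find_box(boxes, left, bottom, right, top):
    """Return the first box with exactly these coordinates, or None."""
    for box in boxes:
        if (box.left, box.bottom, box.right, box.top) == (left, bottom, right, top):
            return box
    return None


def to_top_left(box, image_height):
    """Convert to (left, top, right, bottom) with y measured from the top."""
    return (box.left, image_height - box.top,
            box.right, image_height - box.bottom)


# HYPOTHETICAL sample; a real run would read tam.latha.exp0.box instead.
SAMPLE = [
    "அ 100 3010 130 3040 0",
    "ஆ 150 3010 190 3040 0",
    "ஃ 1099 3010 1124 3040 0",
]

found = find_box(parse_box_lines(SAMPLE), 1099, 3010, 1124, 3040)
print(found.line_no, found.symbol, to_top_left(found, image_height=3300))
```

Running the same lookup on the real box file would show whether the line number tesseract reports counts from 0 or 1, or counts lines rather than boxes.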
 
===============
i then manually edited the tif file again, moving ஃ  closer to  க்  and ஔ closer to ஒ; this time the box file and tr file were created properly; for reference the tif file and box file are enclosed (tam.latha.exp1)
 
====
i would like some answers this time;
if anybody really wants to use and improve the revised trained data for testing, please feel free to write
 
regards
rnkantan
 
(PS since google in its wisdom does not want tif images, they are added as zip files!)
tam.latha.exp0.orig.box
tam.latha.exp0.box
tam.latha.exp0.tr
tam.latha.exp1.box
tam.latha.exp0.zip
tam.latha.exp1.zip

Ahamed Nishadh

Jul 3, 2013, 11:56:55 AM
to tesser...@googlegroups.com
hi all

has anyone developed a better training set for tamil on tesseract 3.02?

the current one made by google is good, but needs to be improved...

clyde

Sep 22, 2013, 3:05:03 PM
to tesser...@googlegroups.com
I had the same error: tessdata_manager.SeekToStart(TESSDATA_INTTEMP):Error:Assert failed:in 
file ..\classify\adaptmatch.cpp, line 555

how did you solve it? Pls help me


On Thursday, March 29, 2012 03:31:53 UTC+8, nkantan r wrote:

zdenko podobny

Sep 22, 2013, 3:20:09 PM
to tesser...@googlegroups.com
That error message means you did not follow tesseract training wiki (or you ignored error messages).

Zdenko


--
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesser...@googlegroups.com
To unsubscribe from this group, send email to
tesseract-oc...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
 
---
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Anupam Srivatsav

Mar 1, 2015, 11:00:03 AM
to tesser...@googlegroups.com
Dear rnkantan,

I am getting the same type of errors as you described.  Have you overcome them? If you have the traineddata, I would like to get it.
Thanks in advance.
Anupam.

Sriranga(81+yrsold)

Mar 1, 2015, 11:17:51 AM
to Michael Reimer
A similar type of problem also appears in the output .txt for the Kannada language.


er.pras...@gmail.com

Feb 12, 2018, 10:39:32 AM
to tesseract-ocr
Hi..

can i get the box files for those tif files, and also the trained data for the latha font?

ShreeDevi Kumar

Feb 12, 2018, 10:52:01 AM
to tesser...@googlegroups.com
That is a really old email regarding traineddata for 3.01.

You might get better results using the latest version of files from github.
