hi all!
I am surprised that no one replied on this subject in this forum, but not shocked, as I find the interest level in Tamil OCR is rather limited. The real error in the above was that the full stop in my training image was treated as a zero, so the box file had two zeros but the number of unichars did not match.
While I have successfully trained Tesseract (3.01) with suitable unicharambigs to produce correct OCR for simple computer-generated passages, I am keen on sharing some of my notes on the quirky (that is, strange) ways box files are used for training. Though I have used my own traineddata for training pages of other fonts and even snapshots of real fonts from old books, I will initially be using the existing trained data in this thread.
The first thing would-be trainers should note is never to use just single letters in the image file: either club two or three letters together to form "words" with uneven spacing, or deliberately use more spacing between letters. To clarify further, use அஆஇ ஈஉஊஎ ஐ ஒஓஔ instead of அ ஆ இ ஈ உ ஊ எ ஏ ஐ ஒ ஓ ஔ ஃ. I do not know the reason for this; it seems to be a quirk of Tesseract.
I created a file called tam.latha.exp0.tif (from a snapshot of a PDF of a text file named tam.latha.exp0.odt). It contains all the Tamil characters in the Latha font, regular, size 10, spaced out but presented in alphabetical order. The file is enclosed below. I created the box file with the command below, using the existing trained data:
C:\indicocr\tesseract301>tesseract tam.latha.exp0.tif tam.latha.exp0 -l tam batch.nochop makebox
The created box file is enclosed after renaming it to tam.latha.exp0.orig.box (the reason for renaming is that I have since edited the file). Anybody who opens the file in a box editor after renaming it back to the original name will find the following:
a) There is no blob corresponding to ஃ and ஹ் in the first part; also, the boxes are created in a sequence different from the arrangement of the letters: அ, ஆ, இ, ஈ, உ, ஊ, எ, ஏ, ஐ, ஒ, க், ங். ............ெ ஔ.
That is, though ஒ, ஓ, ஔ are in sequence in the image, the boxes/blobs are created in a different order. This wrong order happens only in the first part; the same set of letters is repeated at the bottom of the page, and there the boxes/blobs are in the same sequence as the letters. I manually edited the box file using the jTess editor, deleting the ஔ box and inserting boxes for ஔ and ஃ. I also deleted irrelevant boxes around the vowel variations. The edited file is enclosed below (tam.latha.exp0.box). Now that the box file was satisfactory, as seen in the jTess box editor, I attempted to create the tr file as below:
==================
C:\indicocr\tesseract301>tesseract tam.latha.exp0.tif tam.latha.exp0 nobatch box.train
Tesseract Open Source OCR Engine v3.01 with Leptonica
Page 0
APPLY_BOXES: boxfile line 14/α«â ((1099,3010),(1124,3040)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
Boxes read from boxfile: 1151
Boxes failed resegmentation: 1
APPLY_BOXES: Unlabelled word at :Bounding box=(2941,-1043)->(2966,-1027)
APPLY_BOXES: Unlabelled word at :Bounding box=(3010,-1071)->(3037,-1032)
Found 1150 good blobs and 55 unlabelled blobs in 0 words.
2 remaining unlabelled words deleted.
TRAINING ... Font name = latha
Generated training data for 103 words
=================
The generated tr file is also enclosed.
My observations and questions:
1) The box (1099,3010),(1124,3040) corresponds to ஃ and was manually inserted. Also, it is the 13th box, not on the 14th line!
2) What is meant by "Boxes failed resegmentation"?
3) The second message, regarding the bounding boxes (2941,-1043)->(2966,-1027) and (3010,-1071)->(3037,-1032): I am not able to identify any such boxes, and I am not sure about the negative values. Do they represent boxes in the box file or some blob coordinates?
4) If we open the tr file in any editor, we find that the letter ஔ comes after அ, ஆ, இ, ஈ, உ, ஊ, எ, ஏ, ஐ, ஒ, க், ங். ............ெ ;
5) This means the image file is read again first and then the blobs are matched to the nearest boxes; the box file is not used directly to create the blobs on the TIF image and generate the training data within the box boundaries. For a user, common sense suggests that the box file would be used to create the blobs on the image file.
6) More curious still, the same set of letters from the first part is repeated in the second half of the page, and there it is sequenced correctly in the box file automatically; so the layout (linear arrangement of letters) probably does not matter?
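For observation 1, a small script can help check where a given box actually sits in the box file. The sketch below assumes the usual Tesseract 3.x box line layout (symbol, left, bottom, right, top, page, separated by spaces); the names Box, parse_box_text and find_box are my own, and the sample data is made up for illustration. It reports both the physical line number and the count of box entries, since blank lines would make the two differ. It does not tell us how Tesseract itself numbers the lines.

```python
from typing import List, NamedTuple, Optional


class Box(NamedTuple):
    line_no: int  # 1-based physical line number in the file
    ordinal: int  # 1-based count of box entries (blank lines skipped)
    symbol: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_text(text: str) -> List[Box]:
    """Parse box-file text; lines with fewer than 5 fields are skipped."""
    boxes: List[Box] = []
    # Drop a leading byte-order mark, if the editor wrote one.
    for line_no, raw in enumerate(text.lstrip("\ufeff").splitlines(), start=1):
        parts = raw.split()
        if len(parts) < 5:
            continue
        left, bottom, right, top = (int(p) for p in parts[1:5])
        boxes.append(Box(line_no, len(boxes) + 1, parts[0], left, bottom, right, top))
    return boxes


def find_box(boxes: List[Box], left: int, bottom: int, right: int, top: int) -> Optional[Box]:
    """Return the first box whose coordinates match exactly, else None."""
    for box in boxes:
        if (box.left, box.bottom, box.right, box.top) == (left, bottom, right, top):
            return box
    return None


# Made-up sample: two entries with a blank line between them.
sample = "அ 100 3010 130 3040 0\n\nஃ 1099 3010 1124 3040 0\n"
hit = find_box(parse_box_text(sample), 1099, 3010, 1124, 3040)
print(hit.line_no, hit.ordinal)  # prints: 3 2
```

With a real file, the text could be read with open("tam.latha.exp0.box", encoding="utf-8").read() and passed to parse_box_text in the same way.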
===============
I then manually edited the TIF file again, moving ஃ closer to க் and ஔ closer to ஒ; this time the box file and tr file were created properly. For reference, the TIF file and box file are enclosed (tam.latha.exp1).
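To confirm that a regenerated box file follows the intended letter order, something like the following could be used. It is only a sketch: it assumes each box line starts with the symbol followed by at least four numeric fields, the function names are hypothetical, and the sample text is invented to mimic the ஒ, ஓ, ஔ ordering problem described above.

```python
from typing import List, Optional, Tuple


def symbols_in_box_text(text: str) -> List[str]:
    """Return the symbols of a box file in the order they appear."""
    symbols: List[str] = []
    for raw in text.lstrip("\ufeff").splitlines():
        parts = raw.split()
        if len(parts) >= 5:  # symbol plus at least four coordinates
            symbols.append(parts[0])
    return symbols


def first_mismatch(
    actual: List[str], expected: List[str]
) -> Optional[Tuple[int, Optional[str], Optional[str]]]:
    """Return (index, actual, expected) at the first difference, or None."""
    for i, (a, e) in enumerate(zip(actual, expected)):
        if a != e:
            return (i, a, e)
    if len(actual) != len(expected):
        # One list is a prefix of the other; report where it runs out.
        i = min(len(actual), len(expected))
        return (
            i,
            actual[i] if i < len(actual) else None,
            expected[i] if i < len(expected) else None,
        )
    return None


# Invented box text in which ஔ was boxed before ஓ.
sample = "ஒ 10 10 40 40 0\nஔ 90 10 130 40 0\nஓ 50 10 80 40 0\n"
expected_order = ["ஒ", "ஓ", "ஔ"]
print(first_mismatch(symbols_in_box_text(sample), expected_order))
```

A result of None would mean the box file and the expected sequence agree; anything else gives the position of the first box that is out of order.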
====
I would like some answers this time.
If anybody really wants to use and improve the revised trained data for testing, please feel free to write.
regards
rnkantan
(PS: since Google in its wisdom does not want TIF images, they are added as zip files!)