Traineddata for Latin-Indic


Shree

Aug 22, 2013, 5:40:11 AM
to tesser...@googlegroups.com
I had started training Tesseract to recognize texts that use Indic transliteration - please see http://www.unicode.org/cldr/charts/transforms/Latin-Indic.html for the diacritics used for the same.

After Ray's post regarding the upcoming merge and the next release, I am holding off on further training.

However, I wanted to check whether this is already available as part of another language's traineddata. I am attaching a sample image, a text file, and the unicharset for reference.
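
For anyone comparing a transliterated sample against an existing unicharset, a minimal Python sketch like the one below may help. It lists the distinct non-ASCII characters in a piece of text after NFC normalization, so that precomposed and decomposed forms of the same diacritic are counted once. The function name and the sample word are illustrative assumptions, not part of the attached files.

```python
import unicodedata


def non_ascii_inventory(text):
    """Map each distinct non-ASCII character to its Unicode name.

    The text is NFC-normalized first, so "m" + COMBINING DOT BELOW and the
    precomposed U+1E43 are treated as the same character.
    """
    normalized = unicodedata.normalize("NFC", text)
    return {
        ch: unicodedata.name(ch, "UNKNOWN")
        for ch in sorted(set(normalized))
        if ord(ch) > 127
    }


if __name__ == "__main__":
    # Hypothetical sample, written with decomposed diacritics on purpose.
    sample = "sam\u0323skr\u0323tam"
    for ch, name in non_ascii_inventory(sample).items():
        print(f"U+{ord(ch):04X} {name}")
```

Running the same function over the training text and over the characters in a candidate unicharset would show which diacritics are missing from the latter.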

Thanks,
Shree


ipa.unicharset
ipa.traineddata
latin-indic.txt
san.0chandas.exp0.png

Ray Smith

Aug 22, 2013, 10:21:45 AM
to tesser...@googlegroups.com
OCR of transliterated text as a special-purpose language is not available in any traineddata today.
Is this kind of text common?
If so, where is it typically used?


--
You received this message because you are subscribed to the Google Groups "tesseract-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-de...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Shree Devi Kumar

Aug 22, 2013, 1:27:09 PM
to tesser...@googlegroups.com
Ray,

Transliterated text is used by Indologists to represent Indic languages.

E.g., Indological books may be written mainly in English and use italicized transliterated text for Indic terms, verses, etc.

and
for a sample with Tamil text in transliteration

Complete Indic texts are also available in transliteration, e.g. see





Before Unicode, the texts were written using ISO 15919 ("Transliteration of Devanagari and related Indic scripts into Latin characters").

Please see the following links for more details.



Shree

Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


--
You received this message because you are subscribed to a topic in the Google Groups "tesseract-dev" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/tesseract-dev/bRD21wf3GxQ/unsubscribe.
To unsubscribe from this group and all its topics, send an email to tesseract-de...@googlegroups.com.

Shree Devi Kumar

Sep 3, 2013, 2:54:39 AM
to tesser...@googlegroups.com, tesser...@googlegroups.com
I updated tesseract to the latest version in svn, and now I am getting errors while running training:


D:\BuildFolder\testing\TRAINdata\v6-TransliterationOnly>echo off
tesseract 3.02.03
 leptonica-1.68 (Mar 14 2011, 10:43:03) [MSC v.1500 LIB Release 32 bit]
  libgif 4.1.6 : libjpeg 8c : libpng 1.4.3 : libtiff 3.9.4 : zlib 1.2.5

**** extracting unicharset *****
Extracting unicharset from ipa.sanskrit2003.exp994.box
Wrote unicharset file ./unicharset.
**** done extracting unicharset from *****
****   ipa.sanskrit2003.exp994.box ****
**** Training using following  .tr files *****
****   ipa.sanskrit2003.exp994.tr ****
****  NO Shapeclustering - Non Indic Language*****
**** Started MFTraining *****
Read shape table shapetable of 733 shapes
Reading ipa.sanskrit2003.exp994.tr ...

id < this->size():Error:Assert failed:in file ..\..\ccutil\unicharset.cpp, line 237

Has anyone else had this problem?


Additionally, with the Sanskrit language data, I am getting errors while running OCR on .png images; it worked fine earlier.

        1 file(s) copied.
tesseract 3.02.03
 leptonica-1.68 (Mar 14 2011, 10:43:03) [MSC v.1500 LIB Release 32 bit]
  libgif 4.1.6 : libjpeg 8c : libpng 1.4.3 : libtiff 3.9.4 : zlib 1.2.5

processing san.0s2003.exp0.tif
processing san.0s2003.exp8.tif
processing san.0sanskrit2003.exp0.tif
processing san.0sanskrit2003.exp8.tif
processing san.mnt.exp013.png
TIFFstream: Not a TIFF file, bad magic number 20617 (0x5089).
processing san.mnt.exp014.png
TIFFstream: Not a TIFF file, bad magic number 20617 (0x5089).
processing san.mnt.exp031.png
TIFFstream: Not a TIFF file, bad magic number 20617 (0x5089).
processing san.mnt.exp032.png
TIFFstream: Not a TIFF file, bad magic number 20617 (0x5089).
processing san.mnt.exp038.png
TIFFstream: Not a TIFF file, bad magic number 20617 (0x5089).
processing san.mnt.exp424.png
TIFFstream: Not a TIFF file, bad magic number 20617 (0x5089).
Press any key to continue . . .


Should I open issues for the above?









zdenko podobny

Sep 3, 2013, 3:52:24 AM
to tesser...@googlegroups.com, tesser...@googlegroups.com

What was your last working tesseract version? Did you use the svn version in the past?
"TIFFstream: Not a TIFF file" should be an error from leptonica, so please test the files with a leptonica program. If the problem persists, create an issue at the leptonica project.
It is strange that the output shows you are processing PNG files but the error is about TIFF...
Check that everything is OK with the filenames.
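
One way to check the files independently of tesseract and leptonica is to look at their first bytes. The value in the error, 20617 (0x5089), matches the first two bytes of the PNG signature (0x89 followed by 'P', 0x50) read as a little-endian 16-bit integer, which suggests the files are genuine PNGs being handed to a TIFF reader. The following is a minimal Python sketch of that check; the function names are illustrative and the script is not part of tesseract or leptonica.

```python
import struct
import sys

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# TIFF files start with a byte-order mark ("II" or "MM") and the number 42.
TIFF_SIGNATURES = (b"II*\x00", b"MM\x00*")


def sniff_format(header):
    """Guess the image format from the first bytes of a file."""
    if header.startswith(PNG_SIGNATURE):
        return "png"
    if header[:4] in TIFF_SIGNATURES:
        return "tiff"
    return "unknown"


def first_word_le(header):
    """Read the first two bytes as a little-endian unsigned 16-bit integer."""
    return struct.unpack("<H", header[:2])[0]


if __name__ == "__main__":
    # Shows how a PNG header produces the number reported in the error.
    print(first_word_le(PNG_SIGNATURE))  # 20617, i.e. 0x5089
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            header = handle.read(8)
        print(f"{path}: {sniff_format(header)}")
```

If every .png file is reported as "png", the files themselves are probably fine and the problem is more likely in how the build selects the image reader.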








Shree Devi Kumar

Sep 3, 2013, 4:23:24 AM
to tesser...@googlegroups.com
​>>​
What was your last working tesseract version? Did you use the svn version in the past?

Yes, I used the svn version in the past. The last version was r855 - I had asked at that time about updating the version number in the executable.


>> TIFFstream: Not a TIFF file
>> should be an error from leptonica. So please test it with a leptonica program. If the problem persists, create an issue at the leptonica project.
>> It is strange that the output shows you are processing PNG files but the error is about TIFF...
>> Check that everything is OK with the filenames.

The same files worked fine with the old version.

I am rolling back to r856 to see whether the problems persist.

Thanks,
Shree


Shree Devi Kumar

Sep 3, 2013, 4:58:43 AM
to tesser...@googlegroups.com
Zdenko,
The TIFF-related problem does not occur with r856. See the output from the new batch run.

tesseract 3.02

 leptonica-1.68 (Mar 14 2011, 10:43:03) [MSC v.1500 LIB Release 32 bit]
  libgif 4.1.6 : libjpeg 8c : libpng 1.4.3 : libtiff 3.9.4 : zlib 1.2.5

processing san.0s2003.exp0.tif
processing san.0s2003.exp8.tif
processing san.0sanskrit2003.exp0.tif
processing san.0sanskrit2003.exp8.tif
processing san.mnt.exp013.png
processing san.mnt.exp014.png
processing san.mnt.exp031.png
processing san.mnt.exp032.png
processing san.mnt.exp038.png
processing san.mnt.exp424.png

Press any key to continue . . .

>> id < this->size():Error:Assert failed:in file ..\..\ccutil\unicharset.cpp, line 237

The above problem was resolved by running the shapeclustering step before mftraining (thanks, SriRangaji). I had removed that step, thinking it was recommended only for Indic languages.
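
For reference, the step order that worked can be sketched as below. This is a minimal Python sketch, not the actual batch file: the `font_properties` file name is an assumption, the helper names are illustrative, and the flags follow the 3.02-era training instructions, so they should be checked against the installed tools.

```python
import subprocess


def training_commands(lang, tr_files, box_files, font_properties="font_properties"):
    """Return the training commands in the order that avoided the assert.

    shapeclustering runs before mftraining, so the shapetable that
    mftraining reads is built from the same unicharset and .tr files.
    """
    return [
        ["unicharset_extractor", *box_files],
        ["shapeclustering", "-F", font_properties, "-U", "unicharset", *tr_files],
        ["mftraining", "-F", font_properties, "-U", "unicharset",
         "-O", f"{lang}.unicharset", *tr_files],
        ["cntraining", *tr_files],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_all(training_commands(
        "ipa",
        ["ipa.sanskrit2003.exp994.tr"],
        ["ipa.sanskrit2003.exp994.box"],
    ))
```

By default the script only prints the commands; passing `dry_run=False` would execute them, which assumes the training tools are on the PATH.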





zdenko podobny

Sep 3, 2013, 9:09:07 AM
to Shree Devi Kumar, tesser...@googlegroups.com
I tried your files on Win XP SP3 32-bit and could not reproduce the TIFF error...

Zdenko


On Tue, Sep 3, 2013 at 11:27 AM, Shree Devi Kumar <shree...@gmail.com> wrote:
Sending a zip file with the batch files and sample png/tif and box files.

You'll need to modify the batch file to point to your location of tesseract - I had it at D:\BuildFolder\testing\tesseract

Shree



On Tue, Sep 3, 2013 at 2:39 PM, zdenko podobny <zde...@gmail.com> wrote:
Can you send me your batch file, san.mnt.exp014.png, san.mnt.exp031.png, and their box files?

Zdenko

