Extracting files from .tessdata

Ramon

Apr 26, 2010, 5:06:33 AM
to tesseract-ocr
Hi,
After some tests I realized that the best option for me is to put effort into
extending the Catalan dictionary that is in the svn repository (v3).
It would be very useful if you could do one of these:

-> provide the individual files that were combined to create the unified
cat.traineddata file (the utf8 files used to generate the dawg files would
also be great!).
-> show how to extract these files from cat.traineddata and how to
convert a dawg back to utf8 (if that is possible).

THANKS!

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to tesser...@googlegroups.com.
To unsubscribe from this group, send email to tesseract-oc...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en.

zdenko podobny

Apr 28, 2010, 9:55:20 AM
to tesser...@googlegroups.com
Hello Ramon,

to extend an existing language you need "tif/box pairs"; see http://code.google.com/p/tesseract-ocr/wiki/FAQ, in particular the entry "How do I add just one character or one font to my favourite language, without having to retrain from scratch?"

Unfortunately, tif/box pairs are provided only for the eng, deu, fra, ita, nld and spa languages. So you can either wait until somebody releases tif/box pairs for your language, or start training from scratch. I chose the second option, which is why I started testing the training process for tesseract 3.00.

BR,

Zdenko

Ramon

Apr 29, 2010, 3:30:05 AM
to tesseract-ocr
Hi Zdenko, thanks for your quick answer.

Along the lines you pointed out, I'm already using the tif/box pairs from
the Spanish language to train my Catalan .traineddata (since the Spanish
characters suit Catalan too).

But doing just this (with no words in the dictionary files), the dictionary
is not very good. I think the difference comes from the words used in
those dictionaries, which is why I'm asking for those utf8 files...

I don't know if you (or a developer) can provide them.

Thanks.

Ramon.


Zdenko Podobný

Apr 29, 2010, 5:49:53 PM
to tesser...@googlegroups.com
Hi Ramon,

I do not have the source files for the dawg dictionaries and I am not able to "decompile" them. Anyway, I think creating dictionaries is the easiest part of tesseract training: according to the wiki[1], the input is a simple utf-8 file with one word per line. This file is split into several files:
  • lang.punc    -> words with punctuation patterns
  • lang.number    -> words with number patterns
  • lang.freq    -> frequent words
  • lang.word    -> rest of the words
I believe you can get a list of words from other open-source projects (e.g. spellcheckers, dictionary projects such as apertium.org, or search for a free Catalan corpus; do not forget to check the license of the data first!) or you can create one from Wikipedia[2].
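
As a rough illustration of the split into frequent words and the rest, here is a minimal Python sketch. It is only an assumption of how one might do it, not the procedure used for the official language files: the function names and the `top_n` cutoff are made up for this example, and the punctuation and number pattern files are not handled at all.

```python
from collections import Counter


def split_word_list(words, top_n=1000):
    """Split an iterable of words into (frequent, rest), both sorted.

    top_n is an arbitrary cutoff chosen for illustration only.
    """
    counts = Counter(w.strip() for w in words if w.strip())
    frequent = {w for w, _ in counts.most_common(top_n)}
    rest = set(counts) - frequent
    return sorted(frequent), sorted(rest)


def write_word_list(path, words):
    """Write one word per line as UTF-8, the input format described above."""
    with open(path, "w", encoding="utf-8") as f:
        for w in words:
            f.write(w + "\n")


if __name__ == "__main__":
    # Tiny stand-in corpus; a real run would read words from a licensed corpus.
    sample = "el gat i el gos i el ocell".split()
    freq, rest = split_word_list(sample, top_n=2)
    write_word_list("lang.freq", freq)
    write_word_list("lang.word", rest)
```

The output names `lang.freq` and `lang.word` follow the list above; replace `lang` with your language code.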

dawg files are easy to create (a big input file can make this command run for a long time!):
$ wordlist2dawg [-t] word_list_file dawg_file unicharset_file

e.g. wordlist2dawg lang.punc lang.punc-dawg lang.unicharset

This command is valid for tesseract 3.00. wordlist2dawg in tesseract 2.04 does not take unicharset_file as input.
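
To run the same command over all four word lists, something like the following Python sketch could be used. It only assembles the tesseract 3.00 form of the command shown above (without the optional `-t`); the helper names are hypothetical, and the `-dawg` output naming simply copies the `lang.punc-dawg` example.

```python
import subprocess


def wordlist2dawg_command(word_list, unicharset):
    """Return the argv for: wordlist2dawg word_list_file dawg_file unicharset_file."""
    dawg_file = word_list + "-dawg"  # naming copied from the lang.punc-dawg example
    return ["wordlist2dawg", word_list, dawg_file, unicharset]


def build_dawgs(lang, run=False):
    """Build (and optionally run) one command per word list for a language."""
    commands = []
    for suffix in ("punc", "number", "freq", "word"):
        cmd = wordlist2dawg_command(f"{lang}.{suffix}", f"{lang}.unicharset")
        commands.append(cmd)
        if run:
            # Requires wordlist2dawg from tesseract 3.00 on the PATH.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in build_dawgs("cat"):
        print(" ".join(cmd))
```

With `run=False` nothing is executed, so the commands can be checked before a long run on a big word list.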

I hope there will be more details soon on http://www.sk-spell.sk.cx/tesseract-ocr-en.

[1] http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract
[2] http://wiki.apertium.org/wiki/Building_dictionaries

Zdenko

Ramon

May 20, 2010, 4:53:20 AM
to tesseract-ocr
Hi Zdenko,

After some tests, I realized I need the tif/box pairs that the
creators used to generate the Catalan tessdata file.

Do you know a way to contact them?

Ramon.


Jimmy O'Regan

May 21, 2010, 8:04:00 AM
to tesser...@googlegroups.com, tesseract-ocr
On 20 May 2010, at 09:53, Ramon <rsa...@gmail.com> wrote:

> Hi Zdenko,
>
> After some tests, I realized I need the tiff pair boxes that the
> creators used to generate Catalan tessdata file.
>
> Do you know a way to contact to them?

That might be difficult. As you said before, you might be able to reuse the Spanish files (the images are in the download section, as the boxtiff files), but I would recommend French instead: both French and Catalan use the grave accent, which Spanish does not.

http://tesseract-ocr.googlecode.com/files/boxtiff-2.01.fra.tar.gz
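
One way to compare candidate training sets is to check which characters of a Catalan word list never appear as labels in the box files. The Python sketch below is only an illustration: the function names are hypothetical, and it assumes each box-file line starts with the symbol followed by whitespace-separated coordinates.

```python
def box_symbols(box_lines):
    """Collect the symbols labelled in box-file lines.

    Assumes each line looks like '<symbol> <left> <bottom> <right> <top> ...'.
    """
    symbols = set()
    for line in box_lines:
        fields = line.split()
        if fields:
            symbols.add(fields[0])
    return symbols


def missing_characters(words, symbols):
    """Return the characters used in words that no box line is labelled with."""
    needed = {ch for word in words for ch in word}
    return sorted(needed - symbols)


if __name__ == "__main__":
    # Made-up box lines and word list, for illustration only.
    boxes = ["a 10 10 20 30", "é 22 10 30 32", "l 34 10 38 32"]
    print(missing_characters(["allà"], box_symbols(boxes)))
```

Running it once against the Spanish box files and once against the French ones would show which set leaves fewer Catalan characters uncovered.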

Zdenko Podobný

unread,
May 22, 2010, 5:11:36 AM5/22/10
to tesser...@googlegroups.com
Hello Ramon,

tesseract-ocr is developed by Google (see http://groups.google.com/group/tesseract-ocr/msg/7408c699e27db341). I hope that once all or some of the open issues are solved, the final version of tesseract-ocr 3.00 will be released with the tif+box files included...

Zd.