How to use my collected corpus and convert it to single-line TIF files

Ayub Rauf

Jan 7, 2020, 2:54:47 PM

to tesseract-ocr

Hi, I have been struggling with Tesseract training for about two days. The wiki left me confused and rather disappointed. Could someone please tell me a simple way to make single-line TIF files from ready-made texts? I'm running tesstrain (OCR-D) on Ubuntu 18.04 and am one step away from creating my model: I need to make single-line TIF files from my ready gt.txt texts, put them in the ground-truth folder, and then run the make training command. I searched a lot but couldn't find any way to do that. Please don't just point me to the Tesseract training wiki! I found tesstrain.sh online, but so far I haven't been able to get it working.

Thanks.

Shree Devi Kumar

Jan 8, 2020, 12:05:56 AM

to tesseract-ocr

Read your text file line by line.

Run text2image to create the box/tif pair, similar to the following:

text2image --fonts_dir="$unicodefontdir" --text="${linetext}" --strip_unrenderable_words --xsize=2500 --ysize=300 --leading=32 --margin=12 --exposure=0 --font="$fontname" --outputbase="${fontname// /_}.exp0"

Run tesseract to create the lstmf files, similar to the following:

tesseract "${fontname// /_}.exp0".tif "${fontname// /_}.exp0" -l "$lang" --psm 13 --dpi 300 lstm.train
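
For the tif plus gt.txt pairs mentioned in the question, the loop described above could be sketched roughly as follows. This is only a sketch under assumptions: the function name render_lines, the line_NNNNNN file naming, and the TEXT2IMAGE override variable are hypothetical, and it assumes that --text expects the path of a text file, so each line is first written to its own .gt.txt file and that file is passed to text2image.

```shell
#!/bin/sh
# Sketch: write one <outdir>/line_NNNNNN.gt.txt per non-empty corpus line
# and render a matching single-line .tif with text2image.
# TEXT2IMAGE is a hypothetical override for the text2image binary.

render_lines() {
    corpus=$1      # UTF-8 text file, one training line per row
    outdir=$2      # folder that will hold the .tif/.gt.txt pairs
    font=$3        # font name, e.g. "Amiri"
    fonts_dir=$4   # directory containing the font files

    mkdir -p "$outdir" || return 1
    n=0
    # The "|| [ -n ... ]" part keeps a last line that lacks a trailing newline
    while IFS= read -r line || [ -n "$line" ]; do
        # Skip empty lines so every image has ground-truth text
        [ -n "$line" ] || continue
        n=$((n + 1))
        base=$(printf '%s/line_%06d' "$outdir" "$n")
        printf '%s\n' "$line" > "$base.gt.txt"
        # Flags are taken from the command above. Note that
        # --strip_unrenderable_words may drop words the font cannot
        # render, in which case the .gt.txt no longer matches the image.
        "${TEXT2IMAGE:-text2image}" \
            --fonts_dir="$fonts_dir" \
            --font="$font" \
            --text="$base.gt.txt" \
            --outputbase="$base" \
            --strip_unrenderable_words \
            --xsize=2500 --ysize=300 \
            --leading=32 --margin=12 --exposure=0 < /dev/null || return 1
    done < "$corpus"
    # Report how many lines were rendered
    echo "$n"
}
```

Called as, for example, render_lines corpus.txt ./ground-truth "Amiri" ~/.fonts, it prints the number of lines processed. text2image also writes a .box file next to each .tif; the output folder name here is only an example.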

On Wed, Jan 8, 2020 at 1:24 AM Ayub Rauf <ayub....@gmail.com> wrote:

Hi, could someone please help me create single-line TIF files from texts and use them for training my model?
Thanks

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/47c002a2-9a79-431d-8ff5-8acce2e00941%40googlegroups.com.

--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Ayub Rauf

Jan 8, 2020, 1:22:02 AM

to tesseract-ocr

Hi Shree, thanks for your reply. I'll try it.


Ayub Rauf

Jan 8, 2020, 5:39:53 AM

to tesseract-ocr

Thanks, it helped, and I was able to create a multi-page TIF. But as far as I know, Tesseract 4 accepts a single-line TIF with its ground-truth text and doesn't need a box file. Am I right? That is, I only need the lstmf file, not the box file. Is that correct? Anyway, I'll find a splitter and get the data ready. Do you have a solution that can split and rename the files automatically, for both the multi-page TIF and the multi-line text?

Also, will those two paired files, the TIF and its ground-truth text, be enough to start creating my language model? When I try to train, it says:

"Tesseract couldn't load any languages!
Could not initialize tesseract."

When I searched for how to make a .traineddata file, I found tesstrain.sh, but I don't know how to run it or work with it. Could you please help me make a new traineddata file? I don't want to use an existing one.

Thanks


Shree Devi Kumar

Jan 8, 2020, 7:37:42 AM

to tesseract-ocr

If you want to train using text, then you also need to specify a set of fonts, e.g.:

~/tesseract/src/training/tesstrain.sh \
--fonts_dir ~/.fonts \
--lang ara \
--linedata_only \
--noextract_font_properties \
--langdata_dir ~/langdata \
--tessdata_dir ~/tessdata \
--fontlist "Amiri" \
"Amiri Bold Italic" \
"Amiri Bold" \
"Amiri Italic" \
--training_text ./ara.training_text \
--workspace_dir ~/tmp/ \
--save_box_tiff \
--output_dir ~/tesstutorial/araeval

This will create a set of lstmf files and a list of them, which can then be used for lstmtraining.
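
As an illustration of that next step, here is a hedged sketch of checking the generated list and then fine-tuning from an existing model. The function names, the COMBINE_TESSDATA and LSTMTRAINING override variables, and the iteration count are hypothetical; the file locations assume tesstrain.sh wrote <lang>.training_files.txt and <lang>/<lang>.traineddata under its --output_dir. Fine-tuning needs a float ("best") base model, since integer ("fast") models cannot be trained further.

```shell
#!/bin/sh
# Sketch only: paths and the iteration count are illustrative assumptions.

# Print how many listed .lstmf files exist; fail if the list is empty
# or if any listed file is missing.
check_training_list() {
    listfile=$1
    [ -s "$listfile" ] || { echo "empty or missing list: $listfile" >&2; return 1; }
    total=0
    while IFS= read -r f || [ -n "$f" ]; do
        [ -n "$f" ] || continue
        [ -f "$f" ] || { echo "missing: $f" >&2; return 1; }
        total=$((total + 1))
    done < "$listfile"
    echo "$total"
}

# Fine-tune from an existing float model using the generated list.
finetune() {
    base_traineddata=$1   # e.g. ara.traineddata from tessdata_best
    train_dir=$2          # the --output_dir given to tesstrain.sh
    lang=$3               # e.g. ara
    out_prefix=$4         # prefix for the checkpoints that get written

    check_training_list "$train_dir/$lang.training_files.txt" > /dev/null || return 1
    # Extract the LSTM network from the base model
    "${COMBINE_TESSDATA:-combine_tessdata}" -e "$base_traineddata" \
        "$train_dir/$lang.lstm" || return 1
    # Continue training that network on the new lstmf files
    "${LSTMTRAINING:-lstmtraining}" \
        --continue_from "$train_dir/$lang.lstm" \
        --old_traineddata "$base_traineddata" \
        --traineddata "$train_dir/$lang/$lang.traineddata" \
        --train_listfile "$train_dir/$lang.training_files.txt" \
        --model_output "$out_prefix" \
        --max_iterations 3000
}
```

With the command above, a call might look like finetune ~/tessdata_best/ara.traineddata ~/tesstutorial/araeval ara ~/tesstutorial/araeval/tuned, where the tessdata_best path is an assumption.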

If you don't want to use an existing traineddata file, then follow the instructions for training from scratch:

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#training-from-scratch

Training from scratch will take a long time: days or weeks.


Ayub Rauf

Jan 8, 2020, 8:38:22 AM

to tesseract-ocr

Training from scratch will take a long time (days or weeks)! Even if I want to train for only one font?
I want to train Kurdish written in Arabic script, but the Arabic-script traineddata contains a lot of characters that don't exist in Kurdish. Can you tell me a shortcut around that "long time - days/weeks"? I want to make the best possible traineddata for it.

Thanks again


Shree Devi Kumar

Jan 8, 2020, 9:02:48 AM

to tesseract-ocr

You can test with the attached traineddata file for Kurdish.


kur_araGS7Minus_fast.traineddata
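
One way to try such a file on a single-line image is sketched below. The function name ocr_line_with_model and the TESSERACT override variable are hypothetical; the sketch assumes the language code passed to -l is the file name without the .traineddata extension and that the folder holding the file can be used as --tessdata-dir.

```shell
#!/bin/sh
# Sketch: run a downloaded .traineddata file against one single-line image.
# TESSERACT is a hypothetical override for the tesseract binary.

ocr_line_with_model() {
    model=$1   # path to the .traineddata file
    image=$2   # single-line image, e.g. a .tif
    tessdata_dir=$(dirname "$model")
    lang=$(basename "$model" .traineddata)
    # "stdout" as the output base prints the recognized text to the terminal;
    # --psm 13 treats the image as a single raw text line.
    "${TESSERACT:-tesseract}" "$image" stdout \
        --tessdata-dir "$tessdata_dir" -l "$lang" --psm 13
}
```

For example: ocr_line_with_model ./kur_araGS7Minus_fast.traineddata line.tif, where line.tif stands for any single-line test image.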
