How to use my collected corpus and convert it to single-line tif files


Ayub Rauf

Jan 7, 2020, 2:54:47 PM
to tesseract-ocr
Hi, I have been struggling with Tesseract training for about two days, and reading the wiki has left me confused and disappointed. Could someone please tell me a simple way to make single-line tif files from ready-made text? I'm running tesstrain (OCR-D) on Ubuntu 18.04 and I'm one step away from creating my model: I need to make single-line tif files from my gt.txt texts, put them in the ground-truth folder, and then run the make training command. I searched a lot but couldn't find any way to do that. Please don't tell me to read the tesstrain wiki! I found tesstrain.sh on the net, but so far I haven't been able to get it working.
Thanks.

Shree Devi Kumar

Jan 8, 2020, 12:05:56 AM
to tesseract-ocr
Read your text file line by line.
Run text2image to create the box/tif pair for each line, similar to the following:

text2image --fonts_dir="$unicodefontdir" --text="${linetext}" --strip_unrenderable_words --xsize=2500 --ysize=300 --leading=32 --margin=12 --exposure=0 --font="$fontname" --outputbase="${fontname// /_}.exp0"

Then run tesseract to create the lstmf files, similar to the following:

tesseract "${fontname// /_}.exp0.tif" "${fontname// /_}.exp0" -l "$lang" --psm 13 --dpi 300 lstm.train
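
The "line by line" step above could be wrapped in a loop along the following lines. This is only a sketch under some assumptions: the function name make_line_images and the file naming scheme are made up for illustration, and it relies on --text taking the path of a text file (which is how text2image documents that option, as far as I know), so each line is first written to its own .gt.txt file. The text2image flags are the same ones used in the command above.

```sh
#!/bin/sh
# Sketch: render every non-blank line of a corpus as its own TIFF and keep
# the line itself as the matching .gt.txt transcription.
# make_line_images and the naming scheme are illustrative, not part of tesseract.
make_line_images() (
  corpus=$1     # plain-text corpus, one line of text per line
  outdir=$2     # folder that receives the .tif / .gt.txt pairs
  fontname=$3   # font name as text2image knows it
  fontsdir=$4   # directory that contains the font
  prefix=$(printf '%s' "$fontname" | tr ' ' '_')
  n=0
  mkdir -p "$outdir" || exit 1
  while IFS= read -r line || [ -n "$line" ]; do
    # Skip lines that are empty or contain only whitespace.
    case $line in
      *[![:space:]]*) ;;
      *) continue ;;
    esac
    n=$((n + 1))
    base=$(printf '%s/%s.%04d' "$outdir" "$prefix" "$n")
    # One transcription file per line; it also serves as the input to --text.
    printf '%s\n' "$line" > "$base.gt.txt"
    # stdin is detached so the tool cannot consume the rest of the corpus;
    # stdout goes to stderr so only the final count is printed on stdout.
    text2image --fonts_dir="$fontsdir" --text="$base.gt.txt" \
      --strip_unrenderable_words --xsize=2500 --ysize=300 \
      --leading=32 --margin=12 --exposure=0 \
      --font="$fontname" --outputbase="$base" < /dev/null >&2 || exit 1
  done < "$corpus"
  # Report how many lines were rendered.
  echo "$n"
)
```

It might be called as make_line_images corpus.txt ground-truth "Amiri" ~/.fonts, where all four arguments are placeholders. One caveat: with --strip_unrenderable_words the rendered image can end up with fewer words than the .gt.txt file, so it is probably worth spot-checking a few pairs before training.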


On Wed, Jan 8, 2020 at 1:24 AM Ayub Rauf <ayub....@gmail.com> wrote:
Hi, could someone please help me create single-line tif files from text and use them to train my model?
Thanks

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/47c002a2-9a79-431d-8ff5-8acce2e00941%40googlegroups.com.


--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Ayub Rauf

Jan 8, 2020, 1:22:02 AM
to tesseract-ocr
Hi Shree, thanks for your reply. I'll try it.



Ayub Rauf

Jan 8, 2020, 5:39:53 AM
to tesseract-ocr
Thanks, it helped and I was able to create a multi-page tif. But as far as I know, Tesseract 4 accepts a single-line tif with its ground-truth text and doesn't need a box file, am I right? That is, I only need the lstmf file, not the box file, is that right? Anyway, I'll find a splitter and get the data ready. Do you have any solution that can split and rename the files automatically, for both the multi-page tif and the multi-line text?
And will those paired files (the tif and its ground-truth text) be enough to start creating my language model? When I try to train, it says "Tesseract couldn't load any languages!
Could not initialize tesseract."
When I searched for how to make a .traineddata file I found tesstrain.sh, but I don't know how to run it or work with it. Please help me make a new traineddata if you can, because I don't want to use an existing traineddata!
Thanks



Shree Devi Kumar

Jan 8, 2020, 7:37:42 AM
to tesseract-ocr
If you want to train using text, then you also need to specify a set of fonts, e.g.

~/tesseract/src/training/tesstrain.sh \
  --fonts_dir ~/.fonts \
  --lang ara \
  --linedata_only \
  --noextract_font_properties \
  --langdata_dir ~/langdata \
  --tessdata_dir ~/tessdata \
  --fontlist "Amiri" \
  "Amiri Bold Italic" \
  "Amiri Bold" \
  "Amiri Italic" \
  --training_text ./ara.training_text \
  --workspace_dir ~/tmp/ \
  --save_box_tiff \
  --output_dir ~/tesstutorial/araeval

This will create a set of lstmf files and a file listing them, which can then be used for lstmtraining.
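
tesstrain.sh writes that list file itself. If the lstmf files were instead produced one at a time with the lstm.train command, a similar list has to be built by hand. A minimal sketch, assuming a made-up helper name (make_lstmf_list) and placeholder paths; the resulting file is the kind of list that lstmtraining reads through its --train_listfile option:

```sh
#!/bin/sh
# Sketch: collect every .lstmf file below a directory into a list file,
# one path per line. make_lstmf_list is an illustrative name only.
make_lstmf_list() (
  dir=$1        # directory that holds the .lstmf files (searched recursively)
  listfile=$2   # list file to write
  find "$dir" -type f -name '*.lstmf' | sort > "$listfile" || exit 1
  # Print the number of entries as a quick sanity check.
  wc -l < "$listfile" | tr -d ' '
)
```

For example, make_lstmf_list ~/tesstutorial/araeval my.training_files.txt would write the list and print the number of files found. The paths are written exactly as find reports them, so pass an absolute directory if lstmtraining will be started from a different working directory.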

If you don't want to use existing traineddata, then follow instructions to train from scratch -

Training from scratch will take a long time - days/weeks. 


Ayub Rauf

Jan 8, 2020, 8:38:22 AM
to tesseract-ocr
Training from scratch will take a long time (days/weeks)! Even if I want to train for only one font?
I want to train Kurdish written in Arabic script, but the Arabic-script traineddata contains a lot of characters that don't exist in Kurdish. Can you tell me a shortcut around that "long time - days/weeks"? I want to make the best possible traineddata for it.
Thanks again.

Shree Devi Kumar

Jan 8, 2020, 9:02:48 AM
to tesseract-ocr
You can test with the attached traineddata file for Kurdish.

kur_araGS7Minus_fast.traineddata