How to automate training/text2image?


Dan9er

Aug 31, 2017, 11:58:59 AM
to tesseract-ocr
Running 
training/text2image --text=npn_training_text.txt --outputbase=npn.Exo.exp0 --font='Exo' --fonts_dir=/usr/share/fonts

gives the desired output of two files:
  • npn.Exo.exp0.tif
  • npn.Exo.exp0.box
But running this command by hand for each of the 162 fonts I want to use is time-consuming and monotonous. I tried running this command:
training/text2image --text=npn_training_text.txt --outputbase=npn --fonts_dir=/usr/share/fonts  --find_fonts --min_coverage=1.0 --render_per_font=true

But that only made files in this format: npn.{fontName}.tif

How do I automate making .tif AND .box files? Do I have to change the --outputbase to something different or do I have to make a .sh script?

PS. I did run training/text2image --find_fonts with --render_per_font set to false, so I have a npn.fontlist.txt file on hand.
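
A minimal sketch of the .sh approach asked about above: call text2image once per font, reusing only the flags from the single-font command that already produced both files. It assumes npn.fontlist.txt holds one font name per line, and it replaces spaces in font names with underscores for the output base name; both are assumptions, not something confirmed in this thread. The function name render_fonts and the TEXT2IMAGE variable are made up for illustration.

```shell
#!/bin/sh
# Render one .tif/.box pair per font by calling text2image once per font.
# Assumption: the font list file holds one font name per line.

# Path to the text2image binary; override to point at another build.
TEXT2IMAGE="${TEXT2IMAGE:-training/text2image}"

render_fonts() {
    fontlist=$1      # e.g. npn.fontlist.txt
    text=$2          # e.g. npn_training_text.txt
    lang=$3          # e.g. npn
    fonts_dir=$4     # e.g. /usr/share/fonts

    while IFS= read -r font || [ -n "$font" ]; do
        # Skip blank lines.
        [ -n "$font" ] || continue
        # "Exo Bold" -> "Exo_Bold", used for the file name only.
        base=$(printf '%s' "$font" | tr ' ' '_')
        # </dev/null keeps the command from consuming the font list on stdin.
        "$TEXT2IMAGE" --text="$text" \
            --outputbase="$lang.$base.exp0" \
            --font="$font" \
            --fonts_dir="$fonts_dir" </dev/null || echo "failed: $font" >&2
    done < "$fontlist"
}

# Usage (commented out so the file can be sourced safely):
# render_fonts npn.fontlist.txt npn_training_text.txt npn /usr/share/fonts
```

A failing font is reported on stderr and the loop continues, so one bad font does not stop the other 161.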

ShreeDevi Kumar

Aug 31, 2017, 12:09:05 PM
to tesser...@googlegroups.com
Please see tesseract.sh script file in training directory.

It automates the whole training process.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/9d7df5ab-e1ad-43a6-9d7b-d7ba4ef39951%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

ShreeDevi Kumar

Aug 31, 2017, 12:11:38 PM
to tesser...@googlegroups.com
You can change the fontlist either in language-specific.sh or as a parameter when you run tesstrain.sh

Read the wiki pages regarding training for more info.

ShreeDevi Kumar

Aug 31, 2017, 12:13:42 PM
to tesser...@googlegroups.com
tesstrain.sh

Sorry about typo in earlier msg (autocorrect problem on phone)

Dan9er

Aug 31, 2017, 1:57:40 PM
to tesseract-ocr
SHIT, I installed 4.0dev from the GitHub repository (which has a new training method and the wiki is SUPER confusing on this!!), I meant to build 3.05.1! How do I uninstall the unstable build and git clone the 3.05 branch?


ShreeDevi Kumar

Aug 31, 2017, 2:09:18 PM
to tesser...@googlegroups.com
git branch 

will show you the branches

git checkout 3.05

will checkout the 3.05 branch

tesstrain.sh is available in 3.05 also.


ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


Dan9er

Sep 1, 2017, 9:05:03 AM
to tesseract-ocr
Before, I ran git clone https://github.com/tesseract-ocr/tesseract.git . I then built it and installed it. How do I clone the 3.05 branch?

Also, you didn't say ANYTHING about uninstalling Tesseract 4.

ShreeDevi Kumar

Sep 1, 2017, 10:08:58 AM
to tesser...@googlegroups.com
As far as I understand about git, the 3.05 branch would have already been cloned when you cloned the repository. 

Did you try the command 

git branch

in your tesseract folder?

You should get a response such as 


  3.05
* master

git checkout 3.05

should put you on 3.05 branch


git pull origin

should get you any updates on 3.05 branch
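
One detail worth spelling out: on a fresh clone, git branch lists only local branches, which usually means just the default one. The other branches exist as remote-tracking refs (origin/...), visible with git branch -r or git branch -a. A minimal sketch of the switch, assuming a single remote named origin; the function name switch_to_branch is made up for illustration.

```shell
#!/bin/sh
# List the remote-tracking branches of a clone and switch to one of them.

switch_to_branch() {
    repo_dir=$1   # directory created by git clone
    branch=$2     # e.g. 3.05

    # Refresh the remote-tracking refs (origin/...).
    git -C "$repo_dir" fetch origin || return 1

    # Show what the remote offers; the target should appear as origin/<branch>.
    git -C "$repo_dir" branch -r

    # With exactly one remote carrying <branch>, this creates a local
    # branch of the same name that tracks origin/<branch>.
    git -C "$repo_dir" checkout "$branch"
}

# Usage (commented out so the file can be sourced safely):
# switch_to_branch ~/Projects/tesseract/tesseract-ocr 3.05
```

To start from scratch instead, git clone --branch 3.05 followed by the repository URL checks out that branch directly.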

-----------

I usually build only the master branch, so I do not know the exact commands to uninstall 4.00.00alpha before building 3.05.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


Dan9er

Sep 1, 2017, 10:14:37 AM
to tesseract-ocr
dan9er@Inspiron-530s:~/Projects/tesseract/tesseract-ocr$ git branch
* master

ShreeDevi Kumar

Sep 1, 2017, 10:55:34 AM
to tesser...@googlegroups.com

ShreeDevi Kumar

Sep 1, 2017, 11:05:15 AM
to tesser...@googlegroups.com


Well, you can face problems if you install several versions of tesseract, even to different locations like /usr, /usr/local, /opt (or you have to be very careful and familiar with your system, linking to shared libs, etc.).
So I would suggest you install only one version of tesseract (and uninstall the former version before installing a new one).
If you want to have several versions of tesseract (e.g. you want to compare OCR results), I would suggest you compile them from source (e.g. in /usr/src) and not install them. If you want to test a particular version, you can run it this way:
/usr/src/tesseract-3.03/api/tesseract eurotext.tif eurotext
/usr/src/tesseract-ocr.3.02/api/tesseract eurotext.tif eurotext
/usr/src/tesseract-3.03/api/tesseract is a shell wrapper script, and it will take care that the correct shared library is used (without installation...).
Zdenko
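
A minimal sketch of that side-by-side comparison, assuming each build was compiled in its own source directory and exposes the api/tesseract wrapper as in the paths above. The function name run_builds and the result.<label> naming are made up for illustration.

```shell
#!/bin/sh
# Run several uninstalled Tesseract builds on the same image,
# writing one output file per build into the current directory.

run_builds() {
    image=$1; shift   # remaining arguments: paths to api/tesseract wrappers
    for bin in "$@"; do
        # /usr/src/tesseract-3.03/api/tesseract -> tesseract-3.03
        label=$(basename "$(dirname "$(dirname "$bin")")")
        # tesseract appends .txt to the output base it is given.
        "$bin" "$image" "result.$label" || echo "failed: $bin" >&2
    done
}

# Usage (commented out so the file can be sourced safely):
# run_builds eurotext.tif \
#     /usr/src/tesseract-3.03/api/tesseract \
#     /usr/src/tesseract-ocr.3.02/api/tesseract
```

The resulting result.<label>.txt files can then be compared with diff.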


ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Dan9er

Sep 2, 2017, 10:27:53 AM
to tesseract-ocr
There is a make uninstall target. I just ran sudo make uninstall and it uninstalled Tesseract 4 perfectly. However, I still have the training tools. Which make target should I use? training-uninstall?

Dan9er

Sep 2, 2017, 7:59:32 PM
to tesseract-ocr
It IS training-uninstall. LUL
training-uninstall:
	@cd "$(top_builddir)/training" && $(MAKE) uninstall
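
Putting the two targets from this thread together, a minimal sketch of a full uninstall, assuming it is pointed at the same build directory that make install was run from. The function name uninstall_tesseract and the SUDO override are made up for illustration.

```shell
#!/bin/sh
# Uninstall Tesseract and its training tools from a source build directory.

uninstall_tesseract() {
    build_dir=$1              # directory where make install was run
    sudo_cmd=${SUDO-sudo}     # set SUDO="" to run without sudo

    # Remove the main program and libraries, then the training tools.
    $sudo_cmd make -C "$build_dir" uninstall || return 1
    $sudo_cmd make -C "$build_dir" training-uninstall
}

# Usage (commented out so the file can be sourced safely):
# uninstall_tesseract ~/Projects/tesseract/tesseract-ocr
```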

ShreeDevi Kumar

Sep 3, 2017, 3:05:17 AM
to tesser...@googlegroups.com
Thanks!

I have added the info to wiki at


ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
