Can "traineddata" for Tesseract 3 be used for Tesseract 4?

chandra churh chatterjee

13 June 2018 09:16:30

to tesseract-ocr

I have trained Tesseract 3 with 64 fonts using the respective box and .tr files. Now I want to use the same training data to train Tesseract 4, after creating the starter traineddata by following these instructions: "

Using tesstrain

The setup for running tesstrain.sh is the same as for base Tesseract. Use --linedata_only option for LSTM training. Note that it is beneficial to have more training text and make more pages though, as neural nets don't generalize as well and need to train on something similar to what they will be running on. If the target domain is severely limited, then all the dire warnings about needing a lot of training data may not apply, but the network specification may need to be changed.

Training data is created using tesstrain.sh as follows: Note that your fonts location may vary.

training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain

The above command makes LSTM training data equivalent to the data used to train base Tesseract for English. For making a general-purpose LSTM-based OCR engine, it is woefully inadequate, but makes a good tutorial demo.

Now try this to make eval data for the 'Impact' font:

training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata \
  --fontlist "Impact Condensed" --output_dir ~/tesstutorial/engeval"

Now I want to continue training with my previous training data, but the problem is that it consists of .tr files and box files, whereas Tesseract 4 requires .lstmf files.

Any solution would be appreciated.

ShreeDevi Kumar

13 June 2018 11:08:07

to tesser...@googlegroups.com

If you have box/tiff pairs in Tesseract 4 format, you can generate the .lstmf files by running:

tesseract lang.file.exp0.tif lang.file.exp0 lstm.train

lstm.train is a config file.
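To apply the command above to a whole directory of pairs, a loop along these lines may help. This is only a minimal sketch: the function name `make_lstmf`, the optional extension argument, and the `DRY_RUN` variable are illustrative and not part of Tesseract, and it assumes every image already has a matching box file in Tesseract 4 format next to it.

```shell
# Hypothetical helper (not part of Tesseract): run the lstm.train config
# on every image/box pair found in a directory.
# Usage: make_lstmf DIR [EXT]    (EXT defaults to tif)
# Set DRY_RUN=1 to print the commands instead of running them.
make_lstmf() {
  dir=$1
  ext=${2:-tif}
  for img in "$dir"/*."$ext"; do
    [ -e "$img" ] || continue            # glob matched nothing
    base=${img%."$ext"}                  # e.g. lang.file.exp0
    if [ ! -e "$base.box" ]; then
      echo "skipping $img: no matching .box file" >&2
      continue
    fi
    # Same command as above, once per pair; the output base is the image
    # name without its extension.
    ${DRY_RUN:+echo} tesseract "$img" "$base" lstm.train
  done
}
```

Running `make_lstmf .` from the directory that holds the pairs processes them in place; setting `DRY_RUN=1` first shows which commands would be run without calling tesseract.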

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/f3d6c64e-7763-478e-b047-a64edd032d99%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

chandra churh chatterjee

14 June 2018 01:36:07

to tesser...@googlegroups.com

Can you tell me from which directory we have to run that command, and what the arguments should be if we are using our training data, which contains the following files:

-07-2016 12:45 11 digits.f4.exp0.txt
-a---- 08-07-2016 12:37 198 digits.f5.exp0.box
-a---- 08-07-2016 12:10 14044 digits.f5.exp0.jpg
-a---- 08-07-2016 12:45 16309 digits.f5.exp0.tr
-a---- 08-07-2016 12:45 11 digits.f5.exp0.txt
-a---- 08-07-2016 12:31 188 digits.f6.exp0.box
-a---- 23-06-2016 13:06 9824 digits.f6.exp0.jpg
-a---- 08-07-2016 12:45 17538 digits.f6.exp0.tr
-a---- 08-07-2016 12:45 11 digits.f6.exp0.txt
-a---- 08-07-2016 12:38 199 digits.f7.exp0.box
-a---- 08-07-2016 12:11 13178 digits.f7.exp0.jpg
-a---- 08-07-2016 12:45 16019 digits.f7.exp0.tr
-a---- 08-07-2016 12:45 11 digits.f7.exp0.txt
-a---- 08-07-2016 12:38 198 digits.f8.exp0.box
-a---- 23-06-2016 13:06 9485 digits.f8.exp0.jpg
-a---- 08-07-2016 12:45 17078 digits.f8.exp0.tr
-a---- 08-07-2016 12:45 11 digits.f8.exp0.txt
-a---- 08-07-2016 12:38 199 digits.f9.exp0.box
-a---- 08-07-2016 12:11 13411 digits.f9.exp0.jpg
-a---- 08-07-2016 12:45 15916 digits.f9.exp0.tr
-a---- 08-07-2016 12:45 11 digits.f9.exp0.txt
-a---- 08-07-2016 12:57 543 digits.font_properties
-a---- 08-07-2016 12:59 184521 digits.inttemp
-a---- 08-07-2016 13:00 4832 digits.normproto
-a---- 08-07-2016 12:59 84 digits.pffmtable
-a---- 08-07-2016 12:59 6520 digits.shapetable
-a---- 08-07-2016 13:01 196755 digits.traineddata
-a---- 08-07-2016 12:59 658 digits.unicharset
-a---- 08-07-2016 12:55 648 unicharset

How do I convert these files, and from where should I run the command you suggested?


chandra churh chatterjee

14 June 2018 05:56:01

to tesser...@googlegroups.com

How can the images listed above be converted into fonts, so that I can run the tesstrain.sh command, which generates image files along with box and .lstmf files?
