Getting the model output from my trained data


ZenMaster181

Oct 25, 2019, 1:48:53 AM
to tesseract-ocr
Hi, I am new to training Tesseract.
I have annotated the images and created .tif and .box files with jTessBoxEditor, and I also created my traineddata and unicharset files with the same tool.

I tried to get the net_spec with the command below:

training/combine_tessdata -d tessdata/xxx.traineddata

However, it only gave me this result:

/src/training$ combine_tessdata -d 1.traineddata
Version string:v4.0.0.20181030
1:unicharset:size=4447, offset=192
3:inttemp:size=344340, offset=4639
4:pffmtable:size=511, offset=348979
5:normproto:size=785, offset=349490
13:shapetable:size=76, offset=350275
23:version:size=15, offset=350351

How do I get the net_spec so that I can generate the model output?
I have also tried to create the traineddata with the commands from the Tesseract Git page. However, I get the following errors:

Warning: properties incomplete for index 72 = J
Warning: properties incomplete for index 73 = &
Warning: properties incomplete for index 74 = %
Warning: properties incomplete for index 75 = $
Warning: properties incomplete for index 76 = ’
Warning: properties incomplete for index 77 = #
Warning: properties incomplete for index 78 = ~
Warning: properties incomplete for index 79 = ^
Config file is optional, continuing...
Failed to read data from: tesstutorial/tessdata/my/my.config
Failed to read data from: tesstutorial/tessdata/radical-stroke.txt
I have attached my traineddata and the files created with jTessBoxEditor.
Please suggest a method to solve this issue. Thank you in advance.
1.traineddata
1.inttemp
1.normproto
1.pffmtable
1.shapetable
1.unicharset
unicharset

Shree Devi Kumar

Oct 25, 2019, 3:07:07 AM
to tesseract-ocr
You are mixing legacy Tesseract training and LSTM training.

The traineddata and other files from jtessboxeditor seem to be for the legacy engine.
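
One way to tell the two kinds of traineddata apart is to look for an lstm entry in the combine_tessdata -d listing. The following is a minimal sketch, assuming that LSTM models list a component named lstm; the listing posted earlier in this thread (re-split into lines here) shows only legacy components. has_lstm_component is a hypothetical helper name, not part of Tesseract.

```shell
#!/usr/bin/env bash
# Hypothetical helper: exit 0 if the component listing on stdin
# contains an entry named exactly "lstm" (e.g. "17:lstm:size=...").
has_lstm_component() {
  grep -Eq '^[0-9]+:lstm:'
}

# Listing from this thread, one component per line.
has_lstm_component <<'EOF' && echo "lstm component present" || echo "no lstm component (legacy only)"
Version string:v4.0.0.20181030
1:unicharset:size=4447, offset=192
3:inttemp:size=344340, offset=4639
4:pffmtable:size=511, offset=348979
5:normproto:size=785, offset=349490
13:shapetable:size=76, offset=350275
23:version:size=15, offset=350351
EOF
```

With Tesseract installed, the same check could be fed from the real tool, for example `combine_tessdata -d 1.traineddata 2>&1 | has_lstm_component`; the `2>&1` is there on the assumption that the listing may be written to stderr.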

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/5ac1f0de-a926-4d3f-8a1f-fcf1482ef91a%40googlegroups.com.

ZenMaster181

Oct 25, 2019, 3:32:41 AM
to tesseract-ocr
I am sorry to be a drag, but I do not know the difference. I have .tif, .box and .lstmf files and a .ttf font file.

Is it possible to create a traineddata file from these files using the command below?

src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain

Sorry for being a noob.
EBOS - 4718210.pdf-1.box

ZenMaster181

Oct 25, 2019, 3:35:02 AM
to tesseract-ocr
I have also cloned the Git repository as per your instructions under "Follow the instructions below to do the first time setup for TessTutorial".

Shree Devi Kumar

Oct 25, 2019, 4:04:27 AM
to tesseract-ocr
If you have the box and tiff files from jTessBoxEditor, you can use https://github.com/tesseract-ocr/tesstrain for training.

However, training is needed only in special cases.

Have you tried with existing traineddata files?



ZenMaster181

Oct 25, 2019, 4:23:57 AM
to tesseract-ocr
Yes, I have tried to generate the model output with the commands mentioned on that page. However, the traineddata file that I created with jTessBoxEditor does not contain the net_spec.
I ran this command to get the version and net_spec of my traineddata file. This is the result:

combine_tessdata -d 1.traineddata
Version string:v4.0.0.20181030
1:unicharset:size=4447, offset=192
3:inttemp:size=344340, offset=4639
4:pffmtable:size=511, offset=348979
5:normproto:size=785, offset=349490
13:shapetable:size=76, offset=350275
23:version:size=15, offset=350351

This does not specify the net_spec. As a temporary measure I used the net_spec of eng.traineddata; however, that gave me another error:

Deserialized Header failed : EBOS- 4718210.pdf-1.tif.lstmf (which was loaded from lstmf file list)

Please check the above file.
1.traineddata

Shree Devi Kumar

Oct 25, 2019, 5:13:32 AM
to tesseract-ocr


ZenMaster181

Oct 25, 2019, 5:27:11 AM
to tesseract-ocr
I got that. However, when I tried to create the traineddata file with this command:

src/training/tesstrain.sh \
  --fonts_dir /usr/share/fonts \
  --lang eng --linedata_only \
  --noextract_font_properties \
  --langdata_dir ../langdata \
  --tessdata_dir ./tessdata \
  --output_dir ~/tesstutorial/engtrain

I got this error:

ERROR: Could not find training text file tesstutorial/langdata/eng/eng/eng.training_text

I have checked that the file exists in the langdata directory.
But I have a doubt: how will the traineddata file work on my .tif file if I am not giving the above command any information about it?
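
The doubled eng/eng in the error path is consistent with the langdata directory having been given as the eng subdirectory rather than its parent. The following is a minimal sketch of that reasoning, assuming the script composes the path as <langdata_dir>/<lang>/<lang>.training_text; that layout is inferred from the error message and is not confirmed in this thread. training_text_path is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the training-text path from a langdata
# directory and a language code.
# Assumption: the path is <langdata_dir>/<lang>/<lang>.training_text.
training_text_path() {
  local langdata_dir="${1%/}" lang="$2"   # drop one trailing slash, if any
  printf '%s/%s/%s.training_text\n' "$langdata_dir" "$lang" "$lang"
}

# Parent directory passed: a single "eng" level.
training_text_path tesstutorial/langdata eng
# Language subdirectory passed: reproduces the doubled path from the error.
training_text_path tesstutorial/langdata/eng eng
```

If the second line matches the path in the error, checking which directory the script actually received would be the first thing to try.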

Shree Devi Kumar

Oct 25, 2019, 5:30:25 AM
to tesseract-ocr
You are mixing many different approaches for training.

If you have box/tiff pairs, use the Makefile from tesseract-ocr/tesstrain.

If you want to train from text and fonts, use tesseract-ocr/tesseract/src/training/tesstrain.sh.
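
Before training from box/tiff pairs, it can help to confirm that every image actually has its box file. The following is a minimal sketch in plain bash. unpaired_tifs is a hypothetical helper, and pairing .tif with .box by base name is an assumption based on the files mentioned in this thread, not a statement of what the tesstrain Makefile requires.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every .tif in the given directory that has
# no .box file with the same base name. Quoting keeps file names with
# spaces (such as "EBOS - 4718210.pdf-1.box") intact.
unpaired_tifs() {
  local dir="${1:-.}" tif
  for tif in "$dir"/*.tif; do
    [ -e "$tif" ] || continue                      # glob matched nothing
    [ -f "${tif%.tif}.box" ] || printf '%s\n' "$tif"
  done
}

# Usage: unpaired_tifs path/to/ground-truth-directory
```

An empty result means every .tif found has a matching .box.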




ZenMaster181

Oct 25, 2019, 5:30:47 AM
to tesseract-ocr
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
ERROR: Program tesseract failed. Abort.

Sorry, this was the error I got after running the script.
Should I upload my own font file, or create one with a tool?
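
The error above names its likely cause: TESSDATA_PREFIX should point at the directory that holds eng.traineddata. The following is a minimal sketch of that check. check_tessdata is a hypothetical helper, and eng is simply the language from the failing command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether <dir>/<lang>.traineddata exists.
# Pass "$TESSDATA_PREFIX" as the first argument.
check_tessdata() {
  local dir="${1:-}" lang="${2:-eng}"
  if [ -z "$dir" ]; then
    echo "TESSDATA_PREFIX is not set"
    return 1
  fi
  if [ ! -f "$dir/$lang.traineddata" ]; then
    echo "missing: $dir/$lang.traineddata"
    return 1
  fi
  echo "ok: $dir/$lang.traineddata"
}

# Usage: check_tessdata "$TESSDATA_PREFIX" eng
```

If the file is reported missing, exporting TESSDATA_PREFIX with the path of the tessdata directory that contains it, and then rerunning the script, would be the next step.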

ZenMaster181

Oct 25, 2019, 5:59:35 AM
to tesseract-ocr
Thanks for the reply in this regard. I will try to follow the procedure.