oem Detection

Ibr

Jun 13, 2017, 5:47:34 AM

to tesseract-ocr

Hi,

When I run detection with Tesseract 4.00.00alpha, I use this command:

tesseract image results -l ara --tessdata-dir ./tessdata --oem 1

Here --oem 1 means "Neural nets LSTM only". Tesseract has no argument for specifying where to find the LSTM files, so how does it find them? I used to keep the LSTM files inside the tesseract folder. After deleting them I ran detection again with --oem 1 (LSTM only), and it still worked. Does Tesseract search other folders for LSTM files? I had LSTM files in several different folders.

Thanks.

ShreeDevi Kumar

Jun 13, 2017, 7:36:54 AM

to tesser...@googlegroups.com

tesseract image results -l ara --tessdata-dir ./tessdata --oem 1

uses the LSTM files that are there in ara.traineddata in your tessdata directory.

Just placing lstm files in tesseract folder is not going to change anything.

You need to create a new traineddata with the new lstm files and then test with it.
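To make the lookup concrete: with -l ara --tessdata-dir ./tessdata, the file Tesseract opens is ./tessdata/ara.traineddata, and the LSTM model is one component packed inside that file. The sketch below only illustrates that path resolution. The helper names traineddata_path and has_traineddata are made up for this example and are not part of Tesseract.

```shell
#!/bin/sh
# Hypothetical helper (not a Tesseract tool): build the path of the
# traineddata file that "-l LANG --tessdata-dir DIR" refers to.
traineddata_path() {
  dir=${1%/}   # drop one trailing slash, if any
  lang=$2
  printf '%s/%s.traineddata\n' "$dir" "$lang"
}

# Succeed only if the language file that --oem 1 depends on is present.
has_traineddata() {
  [ -f "$(traineddata_path "$1" "$2")" ]
}
```

For example, has_traineddata ./tessdata ara && echo found reports whether the file used by the command above exists. Loose .lstm files elsewhere on disk play no part in this check.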

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/eefc8290-c407-4075-b845-4b226094e752%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ibr

Jun 13, 2017, 7:55:33 AM

to tesseract-ocr

Seems so. To add or merge the new LSTM files into the traineddata, is this the correct command to use?

training/combine_tessdata -o tessdata/jpn.traineddata ~/tesstutorial/eng_from_chi/.lstm

but that gave me the following:

TessdataManager can't determine which tessdata component is represented by lstmf
TessdataManager combined tesseract data files.
Offset for type 0 (.traineddataconfig                ) is 172
Offset for type 1 (.traineddataunicharset            ) is 2745
Offset for type 2 (.traineddataunicharambigs         ) is 283372
Offset for type 3 (.traineddatainttemp               ) is 288048
Offset for type 4 (.traineddatapffmtable             ) is 30906394
Offset for type 5 (.traineddatanormproto             ) is 30942955
Offset for type 6 (.traineddatapunc-dawg             ) is 31395690
Offset for type 7 (.traineddataword-dawg             ) is 31398292
Offset for type 8 (.traineddatanumber-dawg           ) is 32406214
Offset for type 9 (.traineddatafreq-dawg             ) is 32406256
Offset for type 10 (.traineddatafixed-length-dawgs    ) is -1
Offset for type 11 (.traineddatacube-unicharset       ) is -1
Offset for type 12 (.traineddatacube-word-dawg        ) is -1
Offset for type 13 (.traineddatashapetable            ) is 32407402
Offset for type 14 (.traineddatabigram-dawg           ) is -1
Offset for type 15 (.traineddataunambig-dawg          ) is -1
Offset for type 16 (.traineddataparams-model          ) is 33071948
Offset for type 17 (.traineddatalstm                  ) is 33072647
Offset for type 18 (.traineddatalstm-punc-dawg        ) is 43371656
Offset for type 19 (.traineddatalstm-word-dawg        ) is 43374258
Offset for type 20 (.traineddatalstm-number-dawg      ) is 44380188

any idea?

thanks

ShreeDevi Kumar

Jun 13, 2017, 8:03:40 AM

to tesser...@googlegroups.com

You have to be clear about which files you are combining.

The command you gave overwrites the Japanese traineddata. Is that what you want to do?

> training/combine_tessdata -o tessdata/jpn.traineddata

Look at the help for all the options of combine_tessdata.

Figure out which files (lstm, dawg, etc.) you want to combine.

Give the appropriate command options and files to create the new traineddata.
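The error message earlier in the thread ("can't determine which tessdata component is represented by lstmf") suggests that combine_tessdata identifies a component by the suffix of the file name. Assuming that is the rule, a pre-flight check could look like the sketch below. The suffix list is copied from the combine_tessdata listing earlier in this thread, and the function name is_tessdata_component is made up for illustration.

```shell
#!/bin/sh
# Component suffixes as they appear in the combine_tessdata listing
# posted earlier in this thread.
COMPONENTS="config unicharset unicharambigs inttemp pffmtable normproto
punc-dawg word-dawg number-dawg freq-dawg fixed-length-dawgs cube-unicharset
cube-word-dawg shapetable bigram-dawg unambig-dawg params-model lstm
lstm-punc-dawg lstm-word-dawg lstm-number-dawg"

# Hypothetical check (assumption: the text after the last dot of the file
# name decides the component). Returns 0 for a known suffix, 1 otherwise.
is_tessdata_component() {
  base=${1##*/}                  # strip any directory part
  case $base in
    *.*) suffix=${base##*.} ;;   # text after the last dot
    *)   return 1 ;;             # no dot, so no suffix to match
  esac
  for c in $COMPONENTS; do
    if [ "$c" = "$suffix" ]; then
      return 0
    fi
  done
  return 1
}
```

With this check, jpn.lstm and jpn.lstm-word-dawg pass, while eng.lstmf is rejected, which matches the error reported above: .lstmf files are training data, not a traineddata component.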

Ibr

Jun 13, 2017, 8:39:07 AM

to tesseract-ocr

Thanks for the response. I actually wrote the command wrong; I wanted to combine. I also didn't extract the lstm file before doing the combination, which brings up another question.

If I use tesstrain.sh, it will create .lstmf files, correct? But combine_tessdata -e creates an .lstm file, so what is the difference between the two?

I know that .lstmf files are a substitute for the .tr files. I would be grateful for a short explanation of both, since there is not much explanation of them on the web.

Thanks in advance

ShreeDevi Kumar

Jun 13, 2017, 9:28:21 AM

to tesser...@googlegroups.com

combine_tessdata -e

extracts the .lstm file from the traineddata that Google produced in the original training.

> tesstrain.sh it will create .lstmf files

Yes. These are created from the box/tiff pairs generated from the training text and fonts.

The lstmtraining program takes all of these .lstmf files (via the list file that contains all the .lstmf filenames) and creates intermediate .lstm files and _checkpoint files.

These can be converted to the final .lstm file for use in a traineddata.

The final .lstm file then has to be combined using combine_tessdata to create the new traineddata.
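The stages described above can be laid out as one sequence. The following is a dry-run sketch that only prints the commands, so nothing is overwritten by accident. The flags -e, -o, --continue_from, --train_listfile and --model_output appear in this thread; --stop_training and the _checkpoint file name follow the 4.00 alpha training tutorial as I understand it, so check them against the help output of your build. The names finetuned, final and the _new suffix are invented for this example.

```shell
#!/bin/sh
# Dry run: prints the commands for each stage instead of executing them.
# Arguments: language code, tessdata dir, list of .lstmf files, work dir.
# "finetuned", "final" and the "_new" suffix are made-up names.
print_finetune_steps() {
  lang=$1 tessdata=${2%/} listfile=$3 work=${4%/}
  # 1. Extract the existing LSTM model from the traineddata.
  echo "combine_tessdata -e $tessdata/$lang.traineddata $work/$lang.lstm"
  # 2. Train from the .lstmf files; this writes the intermediate
  #    .lstm files and the _checkpoint file.
  echo "lstmtraining --continue_from $work/$lang.lstm --train_listfile $listfile --model_output $work/finetuned"
  # 3. Convert the checkpoint into the final .lstm file.
  echo "mkdir -p $work/final"
  echo "lstmtraining --stop_training --continue_from $work/finetuned_checkpoint --model_output $work/final/$lang.lstm"
  # 4. Work on a copy so the original traineddata is not overwritten.
  echo "cp $tessdata/$lang.traineddata $tessdata/${lang}_new.traineddata"
  # 5. Replace the lstm component in the copy with the final model.
  echo "combine_tessdata -o $tessdata/${lang}_new.traineddata $work/final/$lang.lstm"
}
```

Running print_finetune_steps jpn tessdata jpn.training_files.txt work prints six command lines that can be reviewed and then run by hand. The final model keeps the .lstm suffix so that combine_tessdata can tell which component it replaces.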

Ibr

Jun 14, 2017, 3:27:30 AM

to tesseract-ocr

Thanks

Ibr

Jun 14, 2017, 9:17:24 AM

to tesseract-ocr

Is this the correct command to create the intermediate .lstm and _checkpoint files?

training/lstmtraining --model_output ~/tesstutorial/impact_from_small/impact \
--train_listfile ~/tesstutorial/jpntrain/jpn.training_files.txt \
--continue_from ~/tesstutorial/impact_from_full/jpn.lstm

As for --continue_from, it is mentioned that it can take a recognition model, which would be an .lstm file. If not, what is the existing model? When I run the command above it says:

Loaded file /home/ibr/tesstutorial/impact_from_full/jpn.traineddata, unpacking...
Failed to continue from: /home/ibr/tesstutorial/impact_from_full/jpn.traineddata

ShreeDevi Kumar

Jun 14, 2017, 9:49:51 AM

to tesser...@googlegroups.com

You need to extract the .lstm file from the traineddata,

e.g. (change the folder names to match your setup)

combine_tessdata -e ../tessdata/jpn.traineddata jpn.lstm

Extracting tessdata components from ../tessdata/jpn.traineddata
Wrote jpn.lstm
0:config:size=2573, offset=168
1:unicharset:size=280627, offset=2741
2:unicharambigs:size=4676, offset=283368
3:inttemp:size=30618346, offset=288044
4:pffmtable:size=36561, offset=30906390
5:normproto:size=452735, offset=30942951
6:punc-dawg:size=2602, offset=31395686
7:word-dawg:size=1007922, offset=31398288
8:number-dawg:size=42, offset=32406210
9:freq-dawg:size=1146, offset=32406252
13:shapetable:size=664546, offset=32407398
16:params-model:size=699, offset=33071944
17:lstm:size=10299009, offset=33072643
18:lstm-punc-dawg:size=2602, offset=43371652
19:lstm-word-dawg:size=1005930, offset=43374254
20:lstm-number-dawg:size=50, offset=44380184

Ibr

Jun 14, 2017, 9:58:47 AM

to tesseract-ocr

Yes, I already extracted the lstm file and passed it to the --continue_from argument: --continue_from ~/tesstutorial/impact_from_full/jpn.lstm

Shouldn't this step be enough?
Yet the error keeps coming:

Loaded file /home/ibr/tesstutorial/impact_from_full/jpn.lstm, unpacking...
Failed to continue from: /home/ibr/tesstutorial/impact_from_full/jpn.lstm

Thanks for the response

ShreeDevi Kumar

Jun 14, 2017, 10:53:35 AM

to tesser...@googlegroups.com

Check that the file is there:

ls -l /home/ibr/tesstutorial/impact_from_full/jpn.lstm
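The ls -l output shows both whether the file exists and how large it is. A scripted version of the same check might look like the sketch below. The function name check_model_file is made up, and the check covers only existence, readability and a non-zero size. It cannot tell whether the contents are a model that lstmtraining accepts.

```shell
#!/bin/sh
# Hypothetical pre-flight check for the file passed to --continue_from.
# Prints a reason and returns non-zero if the file is unusable.
check_model_file() {
  if [ ! -f "$1" ]; then
    echo "missing: $1"; return 1
  elif [ ! -r "$1" ]; then
    echo "not readable: $1"; return 1
  elif [ ! -s "$1" ]; then
    echo "empty: $1"; return 1
  fi
  echo "ok: $1"
}
```

For example: check_model_file /home/ibr/tesstutorial/impact_from_full/jpn.lstm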
