Traineddata inspector

Jozef M.

Sep 3, 2015, 5:33:33 AM

to tesser...@googlegroups.com

Dear all,

You can use the following web app to inspect some of the internals of traineddata files:
https://te-traineddata-ui.herokuapp.com

A few notes:
- this version does not parse Cube specifics or some of the newer files;
- free hosting limits apply, which means several parallel requests will kill it, so please be patient.
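
For anyone who prefers to look at a file offline, the container header can be read with a few lines of Python. This is only a sketch and not the code behind the web app. It assumes the layout used by the Tesseract 3.x TessdataManager: a 32-bit entry count followed by one 64-bit offset per entry, little-endian, with -1 marking an absent component, and components stored in ascending offset order. The function name read_offset_table is made up for illustration.

```python
import struct


def read_offset_table(data):
    """Return [(index, offset, size)] for components present in a traineddata blob.

    Assumed layout (Tesseract 3.x style, little-endian): int32 entry count,
    then one int64 offset per entry; an offset of -1 means "not present".
    """
    (count,) = struct.unpack_from("<i", data, 0)
    if not 0 < count <= 64:  # sanity bound chosen for this sketch only
        raise ValueError("unexpected entry count: %d" % count)
    offsets = struct.unpack_from("<%dq" % count, data, 4)
    present = [(i, off) for i, off in enumerate(offsets) if off >= 0]
    result = []
    for n, (i, off) in enumerate(present):
        # A component runs until the next present component or the end of the file.
        end = present[n + 1][1] if n + 1 < len(present) else len(data)
        result.append((i, off, end - off))
    return result
```

To try it, read a file with open(path, "rb").read() and pass the bytes in. The function only reports component indexes, offsets, and sizes; mapping an index to a component name (unicharset, ambigs, inttemp, and so on) depends on the Tesseract version and is deliberately left out.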

Best,

Jozef

zdenko podobny

Sep 4, 2015, 2:01:49 PM

to tesser...@googlegroups.com

Nice! Unfortunately, the limit is too low for eng.traineddata, but it should be OK for custom training.

Zdenko

Sriranga(81+yrsold)

Sep 5, 2015, 6:13:53 AM

to Michael Reimer

Unfortunately, it does not work for the Kan.traineddata file (34 MB), which was downloaded from GitHub.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8xaPJC-p6NDjV1TcS3y-i%3DiYg28oA%2BjPidM728SEYX2-w%40mail.gmail.com.

For more options, visit https://groups.google.com/d/optout.

jm

Sep 6, 2015, 6:23:02 AM

to tesseract-ocr

It works for both (eng/kan) now; occasionally you might hit the time/memory limits with larger files.

Best,
Jozef

Sriranga(82yrsold)

Sep 6, 2015, 2:44:49 PM

to tesser...@googlegroups.com

The inspector now works for the Kan.traineddata file. Thanks for the valuable guidance.

Tom Morris

Sep 7, 2015, 10:02:21 PM

to tesseract-ocr

That looks interesting. Is the source available? What's the significance of the different colors in the feature plots for characters?

The ambigs data looks suspect because the source and target of the replacements all seem to be the same, which seems unlikely.

For example,

           m ->     m
          rn ->    rn
           m ->     m

I would expect it to be something more like:

m -> rn
rn -> m

Is there a bug causing it to print the source instead of the target, or vice versa?
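
To cross-check what the UI shows, the ambigs component can be parsed independently. Below is a minimal sketch, assuming the tab-separated unicharambigs v1 layout described in the Tesseract training documentation (source count, source unichars, target count, target unichars, type flag). The name parse_ambigs and the sample text are my own illustration and are not taken from any real traineddata file.

```python
def parse_ambigs(text):
    """Parse unicharambigs text (assumed v1, tab-separated) into tuples.

    Each result is (source_unichars, target_unichars, ambig_type), where the
    unichars are tuples of strings and ambig_type is the trailing integer flag.
    """
    pairs = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 5:
            continue  # skips the "v1" version line and blank lines
        src_count, src, tgt_count, tgt, ambig_type = fields[:5]
        source = tuple(src.split(" "))
        target = tuple(tgt.split(" "))
        if len(source) != int(src_count) or len(target) != int(tgt_count):
            raise ValueError("count mismatch in line: %r" % line)
        pairs.append((source, target, int(ambig_type)))
    return pairs


# Made-up sample with the two rules I would expect to see.
sample = "v1\n1\tm\t2\tr n\t0\n2\tr n\t1\tm\t0\n"
for source, target, _ in parse_ambigs(sample):
    print("%s -> %s" % ("".join(source), "".join(target)))
```

If the parsed source and target differ for a file while the UI shows them as identical, that would point at the display rather than at the data.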

Tom

jm

Sep 9, 2015, 4:38:44 AM

to tesseract-ocr

On Tuesday, September 8, 2015 at 4:02:21 AM UTC+2, Tom Morris wrote:

> That looks interesting. Is the source available?

No (or not yet; it reuses code from our company, so this still needs to be sorted out).

> What's the significance of the different colors in the feature plots for characters?

Each protoset gets its own colour (all protos in a protoset share the same colour).

> The ambigs data looks suspect because the source and target of the replacements all seem to be the same, which seems unlikely.
>
> For example,
>            m ->     m
>           rn ->    rn
>            m ->     m
> I would expect it to be something more like:
>
> m -> rn
> rn -> m
>
> Is there a bug causing it to print the source instead of the target, or vice versa?

Fixed the typo in the UI, thanks.

> Tom

ShreeDevi Kumar

Sep 9, 2015, 8:44:49 AM

to tesser...@googlegroups.com

Hello Jozef,

Thank you for this tool. It is very helpful to have a visual look at inttemp.

I tried it with hin.traineddata (Devanagari script) as well as some custom-trained data. The inttemp display does not seem to correspond to the titles of the boxes. When I checked eng.traineddata, they seemed OK. I am wondering whether the problem is in the training or in the inspector UI for this Unicode range.

I would appreciate it if you could allow for hin, nep, mar, and san from https://github.com/tesseract-ocr/tessdata, all of which are Devanagari-script languages.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Ruwanka De Silva

Sep 18, 2015, 12:27:01 PM

to tesseract-ocr

Hi Jozef,

Thank you for the valuable tool. I am training Tesseract for the Sinhalese language, and your tool is very helpful for identifying which characters have not been trained well. However, I run into an issue when analyzing a traineddata file generated from multiple training images or from multiple fonts.

The issue is that the features (character glyph/feature map) of the characters and the corresponding Unicode labels do not match. They are correct, though, if the traineddata file was built from only a few training images and a single font. Is this a bug in the tool, or is the generated traineddata file somehow distorted? Please let me know what causes this.
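
One way to narrow this down locally (a sketch only, not a diagnosis): unpack the traineddata with combine_tessdata -u and list the id-to-character mapping from the resulting unicharset file, then compare it with the labels the inspector shows. The code assumes the plain-text unicharset layout, where the first line holds the entry count and each following line starts with the unichar. The helper name load_unichar_ids and the sample text are hypothetical, and the sample lines are simplified.

```python
def load_unichar_ids(text):
    """Map unichar id -> unichar string from unicharset text.

    Assumes the first line is the entry count and that an entry's id is its
    0-based position in the file; only the first field of each line is used.
    """
    lines = text.splitlines()
    count = int(lines[0].split()[0])
    ids = {}
    for unichar_id, line in enumerate(lines[1:1 + count]):
        ids[unichar_id] = line.split(" ")[0]
    return ids


# Made-up, simplified sample; real entries carry more fields per line.
sample = "3\nNULL 0 NULL 0\na 3 0,255,0,255 Latin 1\nb 3 0,255,0,255 Latin 2\n"
for unichar_id, unichar in load_unichar_ids(sample).items():
    print(unichar_id, unichar, ["U+%04X" % ord(c) for c in unichar])
```

Printing the code points next to each id makes it easier to spot entries that consist of several combined characters, which are common in Sinhalese and Devanagari text.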