Traineddata inspector


Jozef M.

Sep 3, 2015, 5:33:33 AM
to tesser...@googlegroups.com
Dear all,

you can use the following web app to inspect some of the internals of traineddata files:
https://te-traineddata-ui.herokuapp.com

A few notes:
- this version does not parse cube specifics or some of the newer files;
- free hosting limits apply, which means several parallel requests will kill it, so please be patient.

Best,
Jozef

zdenko podobny

Sep 4, 2015, 2:01:49 PM
to tesser...@googlegroups.com

Nice! Unfortunately, the limit is too low for eng.traineddata, but it should be OK for custom training.

Zdenko
 


Sriranga(81+yrsold)

Sep 5, 2015, 6:13:53 AM
to Michael Reimer
Unfortunately, it does not work for the Kan.traineddata file (34 MB), which was downloaded from GitHub.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8xaPJC-p6NDjV1TcS3y-i%3DiYg28oA%2BjPidM728SEYX2-w%40mail.gmail.com.

For more options, visit https://groups.google.com/d/optout.

jm

Sep 6, 2015, 6:23:02 AM
to tesseract-ocr
It works for both (eng/kan) now; occasionally you might hit the time/memory limits with larger files.

Best,
Jozef

Sriranga(82yrsold)

Sep 6, 2015, 2:44:49 PM
to tesser...@googlegroups.com
The traineddata inspector now works for the Kan.traineddata file. Thanks for the valuable guidance.


Tom Morris

Sep 7, 2015, 10:02:21 PM
to tesseract-ocr
That looks interesting.  Is the source available?  What's the significance of the different colors in the feature plots for characters?

The ambigs data looks suspect because the source and target for the replacements all seem to be the same, which seems unlikely.

For example,

m -> m
rn -> rn
m -> m

I would expect it to be something more like:

m -> rn
rn -> m

Is there a bug causing it to print the source instead of the target, or vice versa?

Tom
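
The sanity check described above can be sketched in a few lines of Python. This is only an illustration, assuming the replacement rows are available as plain "source -> target" text in the form shown in the message; `parse_pairs` and `identity_pairs` are hypothetical helper names, and the sketch does not read the ambigs component of a traineddata file itself.

```python
def parse_pairs(lines):
    """Parse 'source -> target' display rows into (source, target) tuples."""
    pairs = []
    for line in lines:
        if "->" not in line:
            continue  # skip blank lines and anything that is not a replacement row
        source, target = (part.strip() for part in line.split("->", 1))
        pairs.append((source, target))
    return pairs


def identity_pairs(pairs):
    """Return the pairs whose source and target are identical (suspicious rows)."""
    return [(source, target) for source, target in pairs if source == target]


# Rows as quoted in the message above (assumed display format).
shown = ["           m ->     m", "rn -> rn", "m -> m"]
expected = [" m -> rn", "rn -> m"]

print(identity_pairs(parse_pairs(shown)))     # every row maps to itself
print(identity_pairs(parse_pairs(expected)))  # no row maps to itself
```

If every row comes back as an identity pair, the display is more likely printing the same field twice than showing real ambiguity data.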

jm

Sep 9, 2015, 4:38:44 AM
to tesseract-ocr


On Tuesday, September 8, 2015 at 4:02:21 AM UTC+2, Tom Morris wrote:
> That looks interesting.  Is the source available?

No, or at least not yet; it reuses code from our company, so that needs to be sorted out first.

> What's the significance of the different colors in the feature plots for characters?

Each protoset gets a different colour (protos within a protoset share the same colour).

> The ambigs data looks suspect because the source and target for the replacements all seem to be the same, which seems unlikely.
>
> For example,
>
> m -> m
> rn -> rn
> m -> m
>
> I would expect it to be something more like:
>
> m -> rn
> rn -> m
>
> Is there a bug causing it to print the source instead of the target, or vice versa?

Fixed the typo in the UI, thanks.

ShreeDevi Kumar

Sep 9, 2015, 8:44:49 AM
to tesser...@googlegroups.com
Hello Jozef,

Thank you for this tool. It is very helpful to have a visual look at inttemp.

I tried it with hin.traineddata (Devanagari script) as well as some custom trained data. The inttemp display does not seem to correspond to the titles of the boxes. When I checked eng.traineddata, they seemed OK. I am wondering whether the problem is in the training or in the inspector UI for this Unicode range.

I would appreciate it if you could allow for hin, nep, mar and san from https://github.com/tesseract-ocr/tessdata, which are all Devanagari-script languages.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com



Ruwanka De Silva

Sep 18, 2015, 12:27:01 PM
to tesseract-ocr
Hi Jozef,

Thank you for the valuable tool. I am training Tesseract for the Sinhalese language, and your tool is very helpful for identifying which characters have not been trained well. However, I have an issue when analyzing a traineddata file that was generated from multiple training images or from multiple fonts.

The issue is that the features (character glyph/feature map) of the characters and the corresponding Unicode labels do not match. They are correct, though, if the traineddata file was built from only a few training images and a single font. Is this a bug in the tool, or is the generated traineddata file distorted somehow? Please let me know what causes this.

Thank you,
Ruwanka De Silva