Adding symbol character to chi_sim

superw...@gmail.com

Oct 11, 2017, 10:48:17 AM

to tesseract-ocr

We had an application that used Tesseract 3.05 to recognize Chinese document images. The results were good, but it was quite slow. We discovered the "chi_sim_fast" trained data for Tesseract 4.0, which is indeed much faster and slightly more accurate. However, version 3.05 can recognize the "*" character, while 4.0 recognizes it as a Chinese character instead. Is it possible to add "*" to the set of recognized characters and continue using the "chi_sim_fast" data? An example image is attached: the desired output is "年*十", while the actual result is "年友十". I have tried adding "-c tessedit_char_whitelist=*" to the command, but with no luck.
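For reference, the whitelist attempt described above can be sketched in Python as follows. This is only an illustration: the helper names, the "chi_sim" language name, and the use of "stdout" as the output base are assumptions, not the poster's actual command. Note also that a whitelist only restricts output to the listed characters; it cannot add a character the model was not trained on, and as far as I know the LSTM engine in Tesseract 4.0 does not honor this variable at all.

```python
import subprocess


def build_tesseract_command(image_path, lang="chi_sim", whitelist=None):
    """Build an argument list for the tesseract CLI (hypothetical helper).

    "stdout" as the output base makes tesseract print the text instead of
    writing a file. The lang value must match the traineddata file name.
    """
    cmd = ["tesseract", image_path, "stdout", "-l", lang]
    if whitelist is not None:
        # -c sets a config variable. This restricts the output to the
        # listed characters; it does not teach the model new ones.
        cmd += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return cmd


def run_ocr(image_path, **kwargs):
    """Run tesseract and return the recognized text (requires tesseract on PATH)."""
    result = subprocess.run(
        build_tesseract_command(image_path, **kwargs),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Calling `run_ocr("star.jpg", whitelist="*")` would reproduce the attempt; passing the arguments as a list avoids any shell quoting concerns around "*".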

Does anyone have an idea about this case, or will I need to train a data set of my own?

Thank you!

star.jpg

ShreeDevi Kumar

Oct 11, 2017, 11:38:12 AM

to tesser...@googlegroups.com

Please report this as an issue in the tessdata_fast repository so that Ray can include it in the next training.

You can also try the plus-minus fine-tuning training (fine-tuning to add a few characters) to see if that helps.
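A rough sketch of what the plus-minus fine-tuning steps might look like, expressed as command lists in Python. The flags follow the Tesseract 4 training documentation as I understand it; the helper name, all paths, the checkpoint prefix, and the iteration count are placeholders. It assumes the training data containing "*" and the new starter traineddata have already been generated, which is not shown here. As far as I know, the "fast" models are integer models and cannot be used as a starting point, so the float model from tessdata_best would be needed.

```python
from pathlib import Path


def plusminus_commands(lang, best_traineddata, new_traineddata,
                       list_file, work_dir, max_iterations=3600):
    """Return the three command lists for a plus-minus fine-tune (hypothetical helper).

    best_traineddata: existing float model, e.g. from tessdata_best
    new_traineddata:  starter traineddata whose unicharset includes the new character
    list_file:        text file listing the generated .lstmf training files
    """
    work = Path(work_dir)
    extracted = work / f"{lang}.lstm"
    prefix = work / "plusminus"  # lstmtraining writes <prefix>_checkpoint

    # 1. Extract the LSTM model from the existing traineddata.
    extract = ["combine_tessdata", "-e", str(best_traineddata), str(extracted)]

    # 2. Continue training with the extended character set.
    train = [
        "lstmtraining",
        "--model_output", str(prefix),
        "--continue_from", str(extracted),
        "--traineddata", str(new_traineddata),
        "--old_traineddata", str(best_traineddata),
        "--train_listfile", str(list_file),
        "--max_iterations", str(max_iterations),  # placeholder value
    ]

    # 3. Convert the final checkpoint into a usable traineddata file.
    finalize = [
        "lstmtraining",
        "--stop_training",
        "--continue_from", str(Path(f"{prefix}_checkpoint")),
        "--traineddata", str(new_traineddata),
        "--model_output", str(work / f"{lang}_plusminus.traineddata"),
    ]
    return [extract, train, finalize]
```

Each list can be passed to `subprocess.run(cmd, check=True)` in order once the training tools are installed.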

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/676b01e6-139a-4691-9841-78c2a4943b7e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

ShreeDevi Kumar

Oct 11, 2017, 2:15:22 PM

to tesser...@googlegroups.com

You could also try the Han traineddata files, which cover both English and chi_sim. They may have better support for characters such as * . ,

superw...@gmail.com

Oct 15, 2017, 10:37:58 PM

to tesseract-ocr

Thank you for the reply. I have tried Han as well, with no luck. I am training my own data for now.
