Adding a symbol character to chi_sim_fast in Tesseract 4.0


superw...@gmail.com

Oct 11, 2017, 10:48:17 AM
to tesseract-ocr
We had an application that used Tesseract 3.05 to recognize Chinese document images. The results were good, but recognition was slow. We discovered the "chi_sim_fast" trained data for Tesseract 4.0, which is indeed much faster and slightly more accurate. However, version 3.05 recognizes the "*" character, while 4.0 recognizes it as a Chinese character instead. Is it possible to add "*" to the set of recognized characters and keep using the "chi_sim_fast" data? An example image is attached: the desired output is "年*十", while the actual result is "年友十". I have tried adding "-c tessedit_char_whitelist=*" to the command, but with no luck.

Does anyone have an idea about this case, or will I need to train a data set of my own?

Thank you!
star.jpg
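
For anyone reproducing this, a minimal sketch of how the misrecognition can be pinned down programmatically. It compares the desired string with the OCR result using Python's standard `difflib` and lists the spans that differ. The function name `substituted_chars` is hypothetical; the two strings are the ones quoted in the post.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def substituted_chars(expected: str, actual: str) -> List[Tuple[str, str]]:
    """Return (expected, actual) pairs for every span where the two strings differ."""
    matcher = SequenceMatcher(None, expected, actual, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Covers replaced, inserted and deleted spans alike.
            pairs.append((expected[i1:i2], actual[j1:j2]))
    return pairs


if __name__ == "__main__":
    # Desired output vs. the result reported in the post.
    print(substituted_chars("年*十", "年友十"))  # [('*', '友')]
```

Running the same comparison over a larger set of ground-truth lines would show whether "*" is the only symbol the model is missing.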

ShreeDevi Kumar

Oct 11, 2017, 11:38:12 AM
to tesser...@googlegroups.com
Please file this as an issue in the tessdata_fast repository so that Ray can include it in the next training.

You can also try the "plus-minus" fine-tune training (fine tuning to add or remove a few characters) to see if that helps.
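
A rough sketch of what the plus-minus fine-tune invocation looks like, written as a Python helper that only builds the two command lines (`combine_tessdata -e` to extract the LSTM from an existing model, then `lstmtraining` continuing from it with the new character set). All paths, the helper name `fine_tune_commands`, and the iteration count are placeholders, not values from this thread. It also assumes the starting model is a float model: to my understanding the integer "fast" models cannot be used as a starting point for further training, so a non-fast chi_sim model would be needed here.

```python
from pathlib import Path
from typing import List


def fine_tune_commands(
    base_model: Path,       # existing float .traineddata to start from (placeholder)
    new_traineddata: Path,  # starter traineddata whose unicharset includes "*" (placeholder)
    listfile: Path,         # list of .lstmf training files (placeholder)
    work_dir: Path,
    max_iterations: int = 3600,  # arbitrary placeholder, tune for your data
) -> List[List[str]]:
    """Build the two commands used for plus-minus fine tuning, without running them."""
    lstm = work_dir / (base_model.stem + ".lstm")
    extract = ["combine_tessdata", "-e", str(base_model), str(lstm)]
    train = [
        "lstmtraining",
        "--model_output", str(work_dir / "plusminus"),
        "--continue_from", str(lstm),
        "--traineddata", str(new_traineddata),
        "--old_traineddata", str(base_model),
        "--train_listfile", str(listfile),
        "--max_iterations", str(max_iterations),
    ]
    return [extract, train]


if __name__ == "__main__":
    cmds = fine_tune_commands(
        Path("tessdata/chi_sim.traineddata"),
        Path("train/chi_sim/chi_sim.traineddata"),
        Path("train/chi_sim.training_files.txt"),
        Path("train"),
    )
    for cmd in cmds:
        print(" ".join(cmd))
```

The training text used to generate the `.lstmf` files needs to contain enough samples of "*" in context; each returned list can be passed to `subprocess.run` once the paths point at real files.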

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/676b01e6-139a-4691-9841-78c2a4943b7e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

ShreeDevi Kumar

Oct 11, 2017, 2:15:22 PM
to tesser...@googlegroups.com
You could also try the Han traineddata files, which cover both English and chi_sim. They may have better support for punctuation such as * . ,
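
A small sketch for comparing models on the same image: it builds one `tesseract` command per candidate traineddata so the outputs can be checked side by side. The helper name `ocr_command` is hypothetical, and the model name `HanS` and the `./tessdata` directory are assumptions about how the Han script file is named and installed locally; adjust them to match your setup.

```python
from typing import List, Optional


def ocr_command(image: str, lang: str, tessdata_dir: Optional[str] = None) -> List[str]:
    """Build a tesseract command that prints the recognized text to stdout."""
    # "stdout" as the output base makes tesseract write the text to standard output.
    cmd = ["tesseract", image, "stdout", "-l", lang]
    if tessdata_dir:
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


if __name__ == "__main__":
    # "chi_sim_fast" is the name used in the post; "HanS" is an assumed file name.
    for lang in ("chi_sim_fast", "HanS"):
        print(" ".join(ocr_command("star.jpg", lang, "./tessdata")))
```

Passing each list to `subprocess.run(cmd, capture_output=True, text=True)` returns the recognized text, which can then be compared against the expected "年*十".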

superw...@gmail.com

Oct 15, 2017, 10:37:58 PM
to tesseract-ocr
Thank you for the reply. I have tried Han as well, with no luck. I am training my own data for now.

