Chinese characters.

Jan Ploska

Mar 16, 2024, 12:41:13 PM

to tesseract-ocr

Hello,

I am making a transcript of YouTube videos using Tesseract.

The images I feed to Tesseract look like this:

The output is mostly correct, but sometimes the same character is recognized differently from one image to the next.

Example:

Input:

Output: 大叔中文 - CORRECT

Input:

Output: 今天不是3位大档 - INCORRECT

To prepare the images I use:

dilation,
cropping to the area of the image containing the characters,
adding borders.

For dilation I use a 2x2 kernel, and the border is 2px thick.
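The preprocessing steps described above can be sketched roughly as follows. This is a minimal, dependency-free illustration on a binary image stored as a list of rows, not the code actually used in the original post. The function names are hypothetical, and the anchor of the 2x2 kernel (extending up and to the left) is an assumption. In practice the same steps are usually done with OpenCV (`cv2.dilate`, `cv2.copyMakeBorder`).

```python
# Sketch of: dilation (2x2 kernel) -> crop to text -> 2px border.
# Image convention (an assumption): list of rows, 1 = text pixel, 0 = background.

def dilate_2x2(img):
    """Set a pixel if any pixel in the 2x2 window ending at it is set."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = max(
                img[yy][xx]
                for yy in (y - 1, y) if yy >= 0
                for xx in (x - 1, x) if xx >= 0
            )
    return out


def crop_to_text(img):
    """Crop to the bounding box of all text pixels; return img unchanged if empty."""
    rows = [y for y, row in enumerate(img) if any(row)]
    if not rows:
        return img
    cols = [x for x in range(len(img[0])) if any(row[x] for row in img)]
    return [row[cols[0]:cols[-1] + 1] for row in img[rows[0]:rows[-1] + 1]]


def add_border(img, px=2, value=0):
    """Pad the image with `px` background pixels on every side."""
    width = len(img[0]) + 2 * px
    pad_rows = [[value] * width for _ in range(px)]
    body = [[value] * px + list(row) + [value] * px for row in img]
    return pad_rows + body + [row[:] for row in pad_rows]


def preprocess(img):
    """Apply the three steps in the order listed in the post."""
    return add_border(crop_to_text(dilate_2x2(img)), px=2)
```

Note that dilation grows whichever pixels are treated as foreground, so with dark text on a light background it may thin the strokes instead of thickening them, depending on how the image is binarized.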

For the page segmentation mode I am currently experimenting with --psm 7 and --psm 13; --psm 7 seems to give slightly better results. The language is, of course, lang='chi_sim'.
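One way to compare the two page segmentation modes side by side is sketched below. It assumes the `pytesseract` wrapper and an installed `chi_sim` model, neither of which is confirmed by the post; the helper names `tesseract_config`, `run_tesseract` and `compare_psm` are hypothetical.

```python
def tesseract_config(psm):
    """Build the config string for a page segmentation mode, e.g. '--psm 7'."""
    return f"--psm {psm}"


def run_tesseract(image, psm, lang="chi_sim"):
    """OCR one image with the given mode (requires pytesseract and Tesseract)."""
    import pytesseract  # imported lazily so the other helpers work without it

    text = pytesseract.image_to_string(image, lang=lang, config=tesseract_config(psm))
    return text.strip()


def compare_psm(image, modes=(7, 13), ocr=run_tesseract):
    """Run the same image through each mode and return {mode: recognized text}."""
    return {psm: ocr(image, psm) for psm in modes}


# Example usage (the file name is hypothetical):
#   from PIL import Image
#   results = compare_psm(Image.open("frame_0001.png"))
#   for psm, text in results.items():
#       print(psm, repr(text))
```

Mode 7 treats the image as a single text line, while mode 13 is the "raw line" mode that bypasses some of Tesseract's layout handling, so running both on the same crops makes it easier to see which frames they disagree on.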

Could you give me any advice on how to improve the robustness of the output?

Thank you in advance,

Jan

ziyan xu

Jul 18, 2024, 1:32:25 PM

to tesseract-ocr

Hello, may I ask which version you are using? Would you mind sharing your chi_sim and chi_sim_vert files?

John

Jul 19, 2024, 12:28:37 AM

to tesseract-ocr

Is version

Danny

Aug 2, 2024, 9:33:21 PM

to tesseract-ocr

I had many similar issues, especially with input in Yuan (rounded) fonts. In the end I found the exact font used and ran additional training with that font.

Even after retraining some characters would be confused with others (like your case). To strengthen those, I included many instances of those characters in various combinations in the training data and ran the training again.

eg:

大叔中文

叔大中文

叔

叔叔

大中文叔

叔/叔

etc
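Generating such lines by hand gets tedious, so a small script can help. The sketch below is only one possible way to do it: `training_lines` is a hypothetical helper, and how the resulting lines are fed into training depends on the training setup, which the post does not describe.

```python
def training_lines(phrase, target):
    """Return text lines that show `target` in several positions and contexts."""
    others = [ch for ch in phrase if ch != target]
    lines = [phrase]
    # Place the target character at every position among the other characters.
    for i in range(len(others) + 1):
        lines.append("".join(others[:i] + [target] + others[i:]))
    # Add the character alone, doubled, and next to punctuation.
    lines += [target, target * 2, f"{target}/{target}"]
    # Remove duplicates while keeping the original order.
    return list(dict.fromkeys(lines))


if __name__ == "__main__":
    for line in training_lines("大叔中文", "叔"):
        print(line)
```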

Recognition got much better, but I still have an issue when there is an ellipsis or three dots after the text: in that case it doesn't output anything at all! See conversation here.

E.g., the image below produces no output at all. No idea why!
