Chinese characters.

Jan Ploska

Mar 16, 2024, 12:41:13 PM
to tesseract-ocr
Hello, 

I am making a transcript of YouTube videos using Tesseract.
The images I feed to Tesseract look like this:
aftercut29.0.jpg

The output is mostly correct, but sometimes the same characters produce different output.
Example: 
Input:
aftercut3.0.jpg
Output: 大中文 - CORRECT

Input:
aftercut10.5.jpg 
Output: 今天不是3位 大 - INCORRECT

To prepare the images I:
  • apply dilation
  • crop the image to the area containing the characters
  • add a border
For dilation I use a 2x2 kernel, and the border is 2 px thick.
For the page segmentation mode I am currently experimenting with --psm 7 and --psm 13; --psm 7 seems to give slightly better results. The language is, of course, lang='chi_sim'.
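For anyone wanting to reproduce the setup, here is a rough sketch of the steps described above, written in plain Python so that it runs without extra packages. It is not the poster's actual code: the helper names (`dilate_2x2`, `add_border`, `tesseract_command`), the list-of-rows image representation, the black border colour and the anchor position of the 2x2 window are all assumptions for illustration. In practice the same steps would normally be done with an image library such as OpenCV (`cv2.dilate`, `cv2.copyMakeBorder`).

```python
def dilate_2x2(img):
    """Max-filter a grayscale image (list of rows) with a 2x2 window.

    Assumption: the window covers the pixel itself plus its upper and
    left neighbours, so bright regions grow one pixel right and down.
    """
    height, width = len(img), len(img[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            ys = range(max(y - 1, 0), y + 1)
            xs = range(max(x - 1, 0), x + 1)
            row.append(max(img[j][i] for j in ys for i in xs))
        out.append(row)
    return out


def add_border(img, thickness=2, value=0):
    """Pad the image on all four sides with `thickness` pixels of `value`."""
    width = len(img[0])
    blank = [value] * (width + 2 * thickness)
    side = [value] * thickness
    top = [list(blank) for _ in range(thickness)]
    bottom = [list(blank) for _ in range(thickness)]
    middle = [side + list(row) + side for row in img]
    return top + middle + bottom


def tesseract_command(image_path, psm, lang="chi_sim"):
    """Build the argument list for the tesseract CLI (does not run it)."""
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


if __name__ == "__main__":
    # Tiny 3x3 test image with a single bright pixel in the middle.
    tiny = [[0, 0, 0], [0, 255, 0], [0, 0, 0]]
    prepared = add_border(dilate_2x2(tiny), thickness=2)
    print(len(prepared), "x", len(prepared[0]))
    # Compare the two segmentation modes mentioned in the post.
    for psm in (7, 13):
        print(" ".join(tesseract_command("aftercut10.5.jpg", psm)))
```

The command lists can be passed to `subprocess.run` to compare the two modes on the same prepared image; mode 7 treats the image as a single text line and mode 13 as a raw line.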

Could you give me any advice on how to improve the robustness of the output?

Thank you in advance,
Jan

ziyan xu

Jul 18, 2024, 1:32:25 PM
to tesseract-ocr
Hello, may I ask which version you are using? Would you mind sharing your chi_sim and chi_sim_vert files?

John

Jul 19, 2024, 12:28:37 AM
to tesseract-ocr
Is version

Danny

Aug 2, 2024, 9:33:21 PM
to tesseract-ocr
I had many similar issues, especially with input in Yuan (rounded) fonts. In the end I found the exact font used and ran additional training with that font.

Even after retraining, some characters would be confused with others (as in your case). To strengthen those, I included many instances of those characters in various combinations in the training data and ran the training again.
e.g.:
中文
中文


中文
叔/
etc
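To make the idea concrete, here is a minimal sketch of how such reinforcement lines could be generated. It is only an illustration, not the script I actually used: the function name `reinforcement_lines`, the character sets and the line lengths are made up, and how the lines get fed into training depends on whatever training tooling you use.

```python
import random


def reinforcement_lines(targets, context, per_target=5, seed=0):
    """Return text lines that embed each target character in varied context.

    targets: characters the model tends to confuse (hypothetical choice).
    context: other characters to surround them with (hypothetical choice).
    A fixed seed keeps the generated training text reproducible.
    """
    rng = random.Random(seed)
    lines = []
    for ch in targets:
        for _ in range(per_target):
            # Zero to three random context characters on each side.
            left = "".join(rng.choices(context, k=rng.randint(0, 3)))
            right = "".join(rng.choices(context, k=rng.randint(0, 3)))
            lines.append(left + ch + right)
    return lines


if __name__ == "__main__":
    # Characters taken from the examples earlier in this thread.
    for line in reinforcement_lines("大中文", "今天不是位", per_target=3):
        print(line)
```

The point is simply that each problem character shows up many times, alone and next to different neighbours, so the extra training sees it in more than one context.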

Recognition got much, much better, but I still have an issue when there is an ellipsis or three dots after the text, in which case it doesn't output anything at all! See conversation here.

E.g., the image below produces no output at all... No idea why!
bad_sub_243.png
