Reading text mixing different fonts, colors, languages,...


Hoang Pham Huy

Dec 21, 2023, 11:58:49 PM
to tesseract-ocr
I'm currently trying to read this Japanese image so I can translate it, but the result is rather odd. What should I do to improve it?

I'm only using this code to extract text from the image, with the Japanese tessdata_best models and some others:

```
import cv2
import pytesseract

def extract_text_from_image(self, image_path):
    # Load the image and OCR it with the combined language models
    img = cv2.imread(image_path)
    text = pytesseract.image_to_string(img, lang='jpn+jpn_vert+jpn_ver5+eng+osd+equ')
    return text.strip()
```


Screen Shot 2023-12-22 at 10.12.00.png

Zdenko Podobny

Dec 22, 2023, 12:02:24 AM
to tesser...@googlegroups.com
> What should I do to improve it?
Did you read the Tesseract documentation?

Zdenko



Ger Hobbelt

Dec 24, 2023, 11:01:18 PM
to tesseract-ocr
See also the discussion on the mailing list at https://groups.google.com/d/msgid/tesseract-ocr/f86e2d35-4c35-4643-835f-109994e46632n%40googlegroups.com?utm_medium=email&utm_source=footer

Also see https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md, which is the most important documentation page: it addresses all kinds of OCR result quality issues such as this one.





Hoang Pham Huy

Dec 26, 2023, 11:38:14 PM
to tesseract-ocr
Never mind: the config --oem 3 --psm 6 extracts the text really well, but for an image like the one below it merges two paragraphs into one. So I switched to --oem 3 --psm 4, which works great but skips a lot of text on the page. The problem I have now is that the images I read sometimes contain both kinds of text:
- Text read from left to right
- Text read from top to bottom

How can I detect which kind it is, so that I can switch between tessdata models? (If I remember correctly, jpn is used for left-to-right text and jpn_vert for top-to-bottom text.) Thanks
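One possible way to approach the question above, sketched under assumptions: OCR the same region once with jpn and once with jpn_vert, then keep the direction whose mean word confidence is higher. Comparing confidences is a heuristic, not a documented Tesseract method for detecting writing direction. The names `mean_confidence`, `pick_direction`, and `CANDIDATES`, as well as the PSM values, are illustrative choices; `pytesseract.image_to_data` and `Output.DICT` are existing pytesseract calls.

```python
from statistics import mean

# Candidate (lang, config) pairs. The PSM values are assumptions to tune:
# PSM 6 = single uniform block of text, PSM 5 = single block of vertical text.
CANDIDATES = {
    "horizontal": ("jpn", "--oem 3 --psm 6"),
    "vertical": ("jpn_vert", "--oem 3 --psm 5"),
}


def mean_confidence(data):
    """Average the word confidences of an image_to_data-style dict.

    Entries with a negative confidence or empty text are layout boxes
    without recognized words, so they are skipped.
    """
    scores = [
        float(conf)
        for conf, text in zip(data["conf"], data["text"])
        if float(conf) >= 0 and str(text).strip()
    ]
    return mean(scores) if scores else 0.0


def tesseract_backend(image, lang, config):
    """Default backend; pytesseract is imported lazily on first use."""
    import pytesseract

    return pytesseract.image_to_data(
        image, lang=lang, config=config, output_type=pytesseract.Output.DICT
    )


def pick_direction(image, backend=tesseract_backend):
    """Return the key of CANDIDATES with the highest mean confidence."""
    scores = {
        direction: mean_confidence(backend(image, lang, config))
        for direction, (lang, config) in CANDIDATES.items()
    }
    return max(scores, key=scores.get)
```

With the direction chosen, `lang, config = CANDIDATES[direction]` can be passed on to `pytesseract.image_to_string`. For a page that mixes both directions, this would have to run per cropped text region rather than on the whole image.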
Screen Shot 2023-12-26 at 10.28.28.png

