Japanese - Problems with vertical words


Jorge Castrillo

Jun 2, 2019, 5:22:42 AM
to tesseract-ocr
Hi everyone. I'm making a program that uses Tesseract to read a word from a manga page, selected with a snipping-tool-like interface, and then translates that word with JMdict.
The problem is that Tesseract gives weird results for small vertical selections. Let me explain in more detail:


Say I select a full horizontal line in Japanese, like the following one:

horizontal_full.jpg

The output, "元来日本語は漢文に倣い、文字を上", is perfect.

Getting a full vertical line gives no problems either:

vertical_full.jpg


This gives the same correct output. Now, if I select only single words, horizontal text causes no problems, but for vertical text the output is almost always wrong (except when the selection is a single kanji), like this:

nih-horizontal.jpg


nih-vertical.jpg


The first one returns 日本語, while the second one returns 髑升田.
They are both from the same file, with the same size and the same font, yet the results vary greatly.


Another example, this time from a manga:

ej2full.jpg


The output is 今日の勝敗よりも, again, correct.
But going word by word we start to have errors:

eje2-word1.jpg

Output 由」〉

and

ej2-word.jpg

Output 健雛

Why can it read a full line without problems but have so much trouble with vertical words? I am using psm 8 for words, but it only seems to work for horizontal ones, and I can't figure out why. I've been trying to find a solution all day, without success. I'm not an expert programmer by any means (this is more of a college project), but any insight would be really appreciated. Thank you for reading.
02-japanese-02.jpg
1558668461345 (copy).jpg

Shree Devi Kumar

Jun 3, 2019, 11:31:29 AM
to tesser...@googlegroups.com
Tesseract 4 has been trained on line images, so it gives better results for whole lines than for single words, as far as I have seen.
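
Since the models are trained on line images, a tightly cropped word may look quite unlike what the recognizer expects. One thing that is sometimes tried for tight crops is adding a plain white margin before recognition. The sketch below is only an illustration of that idea: the function name `pad_with_border` and the list-of-rows image representation are assumptions made for this example, and nothing in this thread confirms that padding fixes the vertical-word case.

```python
def pad_with_border(pixels, border, fill=255):
    """Return a copy of a grayscale image with a uniform border added.

    `pixels` is a list of rows, each row a list of 0-255 values.
    This representation is an assumption for the example; a real
    program would more likely use an imaging library.
    """
    if border < 0:
        raise ValueError("border must be non-negative")

    width = len(pixels[0]) if pixels else 0
    new_width = width + 2 * border

    # Blank rows above the original content.
    padded = [[fill] * new_width for _ in range(border)]

    # Original rows, with margin added on the left and right.
    for row in pixels:
        padded.append([fill] * border + list(row) + [fill] * border)

    # Blank rows below the original content.
    padded.extend([fill] * new_width for _ in range(border))
    return padded
```

For example, padding a 2x2 black square with a 1-pixel border produces a 4x4 image whose outer ring is white.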


Seokbong Choi

Jun 3, 2019, 4:28:29 PM
to tesser...@googlegroups.com
Are you using jpn_vert instead of jpn?
I have trained jpn_vert.
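
As a rough illustration of the suggestion above, here is a minimal sketch that picks the language model and page segmentation mode by text direction and calls the `tesseract` command-line tool. It assumes `tesseract` is on the PATH with the `jpn` and `jpn_vert` traineddata installed. The helper names `build_tesseract_args` and `recognize` are made up for this example, and using psm 5 (a single uniform block of vertically aligned text) for vertical crops is a guess at a reasonable starting point, not something confirmed in this thread.

```python
import subprocess


def build_tesseract_args(image_path, vertical, single_word=False):
    """Build a tesseract CLI argument list for a Japanese crop.

    Hypothetical helper: the choice of modes is an assumption.
    """
    lang = "jpn_vert" if vertical else "jpn"
    if vertical:
        psm = "5"  # single uniform block of vertically aligned text
    elif single_word:
        psm = "8"  # single word
    else:
        psm = "7"  # single text line
    # "stdout" as the output base makes tesseract print the text.
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", psm]


def recognize(image_path, vertical, single_word=False):
    """Run tesseract and return the recognized text, stripped."""
    args = build_tesseract_args(image_path, vertical, single_word)
    result = subprocess.run(
        args, capture_output=True, encoding="utf-8", check=True
    )
    return result.stdout.strip()
```

Calling `recognize("nih-vertical.jpg", vertical=True)` would then run the vertical model on the crop from the first post; whether it returns 日本語 depends on the installed model and the image.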



ultra

Aug 7, 2020, 12:51:11 AM
to tesseract-ocr
Hello zodiac,

I'm trying to train vertical Japanese, but the documentation doesn't cover vertical languages well.
Could you briefly describe the steps you took?
Did you train on line images with text files? Were they vertical or horizontal line images?

Thank you! :)

shree

Jan 8, 2021, 5:49:25 AM
to tesseract-ocr