Japanese - Problems with vertical words

Jorge Castrillo

Jun 2, 2019, 5:22:42 AM

to tesseract-ocr

Hi everyone. I'm making a program that uses tesseract to read a word from a manga page, selected with a snipping-tool-like interface, and then translates that word with JMdict.

The thing is, tesseract gives weird results for small vertical selections. I'll explain in more detail:

Say I get a full horizontal line in Japanese, like the following one:

The output "元来日本語は漢文に倣い、文字を上" is perfect.

Getting a full vertical line gives no problems either:

This gives the same correct output. Now, if I want to get only single words, horizontal text causes no problems, while for vertical text the output is almost always wrong (except when the selection is a single kanji), like this:

The first one returns 日本語 while the second one returns 髑升田.

They are both from the same file, same size, same font, yet the results vary greatly.

Another example, this time from a manga:

The output is 今日の勝敗よりも, again, correct.

But going word by word we start to have errors:

Output 由」〉

and

Output 健雛

Why is it that it can examine the full line without problem, but have so much trouble getting vertical words? I am using psm 8 for words, but it only seems to work with horizontal ones, and I can't get my head around it. I've been trying to find a solution to this all day, but without success. I'm not an expert programmer by any means, this is more of a college project, but any insight would be really, really appreciated. Thank you for reading.

02-japanese-02.jpg

1558668461345 (copy).jpg

Shree Devi Kumar

Jun 3, 2019, 11:31:29 AM

to tesser...@googlegroups.com

tesseract 4 has been trained on line images and hence gives better results for lines, as far as I have seen.
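
One workaround that follows from this, offered only as a rough sketch and not something tested in this thread: run OCR on the whole vertical line, then keep just the characters that fall inside the user's selection. The snippet below assumes you already have per-character boxes for the line in image coordinates (origin at the top-left). How you obtain them is up to you, and `CharBox` and `select_text` are hypothetical names made up for this example, not part of tesseract.

```python
from typing import List, NamedTuple


class CharBox(NamedTuple):
    """One recognized character and its bounding box (top-left origin)."""
    text: str
    left: int
    top: int
    right: int
    bottom: int


def select_text(boxes: List[CharBox], sel_left: int, sel_top: int,
                sel_right: int, sel_bottom: int) -> str:
    """Return the characters whose box centers lie inside the selection.

    Characters are joined top to bottom, which matches the reading order
    of a single vertical line.
    """
    picked = []
    for box in boxes:
        center_x = (box.left + box.right) / 2
        center_y = (box.top + box.bottom) / 2
        if sel_left <= center_x <= sel_right and sel_top <= center_y <= sel_bottom:
            picked.append(box)
    picked.sort(key=lambda b: b.top)
    return "".join(b.text for b in picked)
```

The idea is that the recognizer always sees a full line, which is the kind of input it was trained on, and the word-level cropping happens afterwards on the recognized result instead of on the image.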


Seokbong Choi

Jun 3, 2019, 4:28:29 PM

to tesser...@googlegroups.com

Are you using jpn_vert instead of jpn?

I have trained jpn_vert

https://github.com/zodiac3539/jpn_vert
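
For anyone wanting to try this, here is a minimal sketch of calling the tesseract command line with a vertical model from Python. It assumes a `jpn_vert.traineddata` file is installed where tesseract can find it, and it uses page segmentation mode 5 (a single uniform block of vertically aligned text) instead of psm 8. Whether that mode actually helps on very small selections is an assumption you would need to check on your own images. The function names are made up for this example.

```python
import shutil
import subprocess
from typing import List


def build_tesseract_args(image_path: str, lang: str = "jpn_vert",
                         psm: int = 5) -> List[str]:
    """Build the tesseract command line; 'stdout' prints the text instead of writing a file."""
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def ocr_vertical(image_path: str, lang: str = "jpn_vert", psm: int = 5) -> str:
    """Run tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_args(image_path, lang, psm),
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return result.stdout.strip()
```

Calling `ocr_vertical("selection.png")` (a hypothetical file name) would run `tesseract selection.png stdout -l jpn_vert --psm 5`.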


ultra

Aug 7, 2020, 12:51:11 AM

to tesseract-ocr

Hello zodiac,

I'm trying to train vertical Japanese, but the documentation for vertical languages is not great.

Could you briefly describe the steps you took?

Did you train on line images with matching text files? Were the line images vertical or horizontal?

Thank you! :)

shree

Jan 8, 2021, 5:49:25 AM

to tesseract-ocr

See https://groups.google.com/g/tesseract-ocr/c/GFHIZ8hO3c4/m/ieYUckMvBgAJ
