Black & white comic text recognition

Krzysztof Studnicki

May 23, 2019, 1:12:29 AM

to tesseract-ocr

Hello!

I'm trying to train Tesseract to give me text from manga pages.

So far I have had mixed results. I've tried the stock .traineddata file and self-made ones, but the accuracy is similar (I have only trained it on a couple of pages; I know that's not enough).

When I tried to get text from a whole page, it recognized many of the words, but there were far more random characters in between (it recognized letters in the drawings).

A cropped text cloud gives a much better result, almost 83% accuracy, but the best result comes when only the text is cropped with a small white border around it: 94%.

Is it possible to teach Tesseract to recognize text on such pages? I was thinking about preparing a dozen such pages with corresponding box files and using the process explained here.

I thought about doing more work and using some other software to recognize and crop the text clouds, but I feel that kind of defeats the purpose of using the full potential of Tesseract's neural network.

The question is whether it can be taught to search for text in a sea of drawings, and how. So far I'm going in circles with no end in sight...

I have included an image of part of a page that was processed with the command "tesseract IMAGE_NAME BOX_FILE_NAME batch.nochop makebox".

To check accuracy (and correct errors) I'm using QT Box Editor.
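
For reference, here is a minimal sketch of one way such accuracy figures could be computed. It assumes character-level accuracy based on edit distance between a hand-typed ground truth and the OCR output; the post does not say how the 83% and 94% were measured, so the metric and the function names `levenshtein` and `char_accuracy` are illustrative assumptions only.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def char_accuracy(truth, ocr):
    """Assumed metric: 1 - edit_distance / len(truth), clamped at 0."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1.0 - levenshtein(truth, ocr) / len(truth))
```

Calling `char_accuracy(ground_truth_text, tesseract_output)` on the same balloon before and after cropping would make the whole-page, cropped-cloud and text-only results directly comparable.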

whole page recognition.png

example_whole_page.png

Lorenzo Bolzani

May 24, 2019, 4:37:27 AM

to tesser...@googlegroups.com

Hi,

I do not think Tesseract's page segmentation can handle this kind of layout. It's more oriented towards paragraphs, tables and classic text layouts. Also, I think page segmentation is not based on neural networks.

I would try something like OpenCV EAST text detection in this case, or try to detect the white regions of the balloons with custom code (something like this).
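
As a rough illustration of the "white regions" idea, the sketch below flood-fills connected bright pixels in a grayscale image (given as a plain list of rows) and returns a padded bounding box for each region that is large enough. It is a pure-Python stand-in under stated assumptions, not the linked code: the function name `find_balloons`, the brightness threshold, the minimum area and the padding are all hypothetical values to tune, and in practice a library such as OpenCV would do the same job much faster.

```python
from collections import deque


def find_balloons(img, white=200, min_area=6, pad=1):
    """Return padded (top, left, bottom, right) boxes of large white regions.

    img is a list of rows of grayscale values in 0..255 (assumed input format).
    """
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if seen[y][x] or img[y][x] < white:
                continue
            # Breadth-first flood fill over 4-connected white pixels.
            queue = deque([(y, x)])
            seen[y][x] = True
            top, left, bottom, right, area = y, x, y, x, 0
            while queue:
                cy, cx = queue.popleft()
                area += 1
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and not seen[ny][nx] and img[ny][nx] >= white):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Skip specks; keep a small margin around each kept region.
            if area >= min_area:
                boxes.append((max(top - pad, 0), max(left - pad, 0),
                              min(bottom + pad, h - 1), min(right + pad, w - 1)))
    return boxes
```

Each returned box can then be cropped and passed to Tesseract on its own. On real pages the white gutters between panels would also be detected, so an extra filter on region size or shape would likely be needed.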

Also, the training document you are referring to is for Tesseract 3.x; training with 4.x is easier and there is no need to draw boxes. Again, this training has nothing to do with page segmentation (AFAIK).

Bye

Lorenzo
