Hello!
I'm trying to train Tesseract to give me text from manga pages.
So far I have mixed results. I've tried the stock .traineddata files and self-made ones, but accuracy is similar (I've only trained with a couple of pages, which I know isn't enough).
When I tried to get text from a whole page, it recognized many of the words, but there were even more random characters mixed in (it recognized "letters" in the drawings).
A cropped speech bubble gives a much better result - almost 83% accuracy - but the best comes from cropping just the text with a small white border around it: 94%.
Is it possible to teach Tesseract to recognize text on such pages? I was thinking about preparing a dozen such pages with corresponding box files and using the process explained here. I've also thought about doing extra work with some other software to detect and crop the text bubbles first, but I feel that kind of defeats the purpose of using the full potential of Tesseract's neural network.
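Since cropping the text with a white border gave the best accuracy, one option is to automate that crop before handing the image to Tesseract. Below is a minimal pure-Python sketch (no OpenCV) of the idea, assuming the page has already been binarized to 0/255 pixel values; the function name, threshold, and margin are my own illustrative choices, not from any library:

```python
# Sketch: given a binarized image (0 = black text, 255 = white background)
# as a list of pixel rows, find the bounding box of all dark pixels and
# return the crop surrounded by a white margin.

def crop_text_region(img, margin=10, dark=128):
    """Crop to the dark-pixel bounding box and pad with a white border."""
    rows = [y for y, row in enumerate(img) if any(p < dark for p in row)]
    cols = [x for x in range(len(img[0]))
            if any(row[x] < dark for row in img)]
    if not rows or not cols:
        return img  # nothing dark found; return the image unchanged
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    cropped = [row[left:right + 1] for row in img[top:bottom + 1]]
    # pad with a white border so Tesseract sees clean margins
    width = len(cropped[0]) + 2 * margin
    white_row = [255] * width
    padded = [white_row[:] for _ in range(margin)]
    for row in cropped:
        padded.append([255] * margin + row + [255] * margin)
    padded.extend(white_row[:] for _ in range(margin))
    return padded

# tiny synthetic example: a 5x5 white image with one dark pixel at (2, 2)
page = [[255] * 5 for _ in range(5)]
page[2][2] = 0
out = crop_text_region(page, margin=1)
# out is a 3x3 image: the 1x1 dark crop plus a 1-pixel white border
```

In practice you'd probably detect each bubble separately (e.g. with OpenCV contour detection) rather than one global bounding box, but the padding idea is the same.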
The question is whether it can be taught to search for text in a sea of drawings, and how. So far I'm going in circles and seeing no end...
I have included an image of part of a page that was processed using the command "tesseract IMAGE_NAME BOX_FILE_NAME batch.nochop makebox".
To check accuracy (and correct errors) I'm using QT Box Editor.