The results of the Tesseract scan were:
」ブ「.'~ー ' .'~ー ' .'~ー ' .'~ー ' .'~ー ' .'~ー '
鱒` 私、 お藁り見に来たんだ。
` ねえ、 あなたこの町の人でしょ ?
一人じゃ面臼くないもん。
The first line is obviously it trying to read the specks left over at the top, and aside from an extra kanji and an apostrophe the reading is right on. I tried it on another screenshot with different sprites and got:
)“シナ「とつせゅつへ` こうふんして
` ねつけなかったんで しょ?
ま 、 建国千年のお祭りだから
無理ないけど ・・・・・・
Which is again pretty close: aside from the extra )“, the つ in the first line is actually a う and the へ is actually a べ, which are easy to mix up. But the problem is that the filters are really specific to this game at the moment, and I was hoping to keep them more generalized. Also, there's no way to really tell whether a kanji in the reading has been placed there in error. I want to make a program that periodically tries to read text from the game as I'm playing and perform some functions on it. Any ideas? I may just end up looking into another route; this one seemed the simplest, but the errors could break the functionality I'm trying to achieve.
Tesseract OCR really only reads black text on a white background, so your approach of
processing the image to get that is good (and would fix about half of the other issues
people report here...)
The original text is white with a black drop shadow (offset down and to the right).
So process the image so that the originally white pixels come out black in the OCR
image and anything darker comes out white. (This may be what you've done already.)
This is a combination of inversion and binarization.
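As a minimal numpy sketch of that inversion-plus-binarization step (the function name is mine, and the fixed threshold of 200 is an assumption you'd tune for your screenshots):

```python
import numpy as np

def prep_for_ocr(gray, thresh=200):
    """Invert + binarize a grayscale frame for Tesseract.

    Bright (near-white) text pixels become black (0); everything
    darker, including the drop shadow, becomes white (255).
    """
    return np.where(gray >= thresh, 0, 255).astype(np.uint8)
```

Any pixel at or above the threshold is treated as text; everything else becomes background in one pass.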
These characters are fairly blocky due to the low-res original art. If they are still blocky
after conversion to black and white, you may be able to fill in the blocks by using a dilate
and erode sequence (standard image-processing ops; look them up) to close the gaps somewhat
intelligently. This may help the recognition rates.
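The dilate-then-erode sequence is known as morphological closing; OpenCV packages it as cv2.morphologyEx(img, cv2.MORPH_CLOSE, kernel). Here is a pure-numpy sketch of the idea (it assumes the text pixels are still white at this stage, so you'd run it before the inversion step):

```python
import numpy as np

def dilate(img, r=1):
    """Sliding-window maximum over a (2r+1)x(2r+1) square: grows white regions."""
    p = np.pad(img, r)  # black border
    h, w = img.shape
    out = np.zeros_like(img)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            np.maximum(out, p[dy:dy + h, dx:dx + w], out)
    return out

def erode(img, r=1):
    """Sliding-window minimum: shrinks white regions back."""
    p = np.pad(img, r, constant_values=255)  # white border so edges don't erode
    h, w = img.shape
    out = np.full_like(img, 255)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            np.minimum(out, p[dy:dy + h, dx:dx + w], out)
    return out

def close_gaps(img, r=1):
    """Dilate then erode: fills holes and gaps smaller than the kernel."""
    return erode(dilate(img, r), r)
```

A one-pixel gap between two strokes gets bridged by the dilation and the erosion then shrinks the strokes back to roughly their original extent.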
I can think of two approaches to address the specks at the top - either a noise elimination
image processing step or, maybe, a windowed approach to binarization. The simplest
binarization technique is the one you are already using - a fixed threshold value for deciding
black or white. A more complex approach is to vary the threshold value based on a window of
surrounding pixel values. Research "Sauvola binarization" for details on a proven algorithm.
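If you go the windowed route, skimage.filters.threshold_sauvola implements this directly. As a sketch of what it computes, here is the Sauvola formula T = m * (1 + k*(s/R - 1)) in plain numpy, where m and s are the local window mean and standard deviation (the window size of 15 and k = 0.2 are common defaults, not tuned values):

```python
import numpy as np

def sauvola_threshold(gray, window=15, k=0.2, R=128.0):
    """Per-pixel Sauvola threshold T = m * (1 + k*(s/R - 1)).

    Uses integral images for the window mean/std; window must be odd.
    """
    g = gray.astype(np.float64)
    r = window // 2
    p = np.pad(g, r, mode='edge')  # pad so every pixel has a full window
    # integral images of the padded image and its square
    s1 = np.pad(p, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    s2 = np.pad(p * p, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    h, w = g.shape
    n = window * window

    def winsum(ii):
        # window sum via the integral-image identity
        return (ii[window:window + h, window:window + w]
                - ii[window:window + h, 0:w]
                - ii[0:h, window:window + w]
                + ii[0:h, 0:w])

    m = winsum(s1) / n
    var = winsum(s2) / n - m * m
    s = np.sqrt(np.clip(var, 0, None))
    return m * (1.0 + k * (s / R - 1.0))

def binarize_sauvola(gray, window=15, k=0.2):
    """Pixels above their local Sauvola threshold -> white, below -> black."""
    t = sauvola_threshold(gray, window, k)
    return np.where(gray > t, 255, 0).astype(np.uint8)
```

Because the threshold follows the local statistics, isolated specks in an otherwise dark region no longer drag a single global threshold around.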
It's easier to figure out what image processing is needed before committing to extensive
programming work. Once you know which operations/algorithms are needed, you can call them
from a (hopefully) free, easy-to-use, and already-debugged library (e.g. OpenCV). To
experiment like this I use the demo program for Accusoft's ScanFix library; it lets you
process images with a sequence of pretty low-level ops. There are probably other "image
processing laboratory" apps available. A paint program or viewer (Paint.NET, IrfanView)
can do a lot of these processing ops, but often not in a way that gives you access to the
low-level details (like the choice of binarization algorithm).
Finally, no OCR system is perfect. If your project requires perfect OCR, then maybe rethink it.
(Or buy a commercial OCR engine that can recognize 99+%, though that's still not perfect...)
- Rich