need help removing garbage characters from my OCR

Alex Ryan

Jul 8, 2014, 1:31:39 AM
to tesser...@googlegroups.com
I'm trying to make a Words with Friends cheat for a university project, so I'm trying to OCR the tiles from a screenshot of the app. I have Tesseract 3.03 set up and running fine, but I'm not getting usable output. I've tried various training methods but so far haven't hit upon the right one, and I was hoping someone had some suggestions for me.

Here's a sample image if you are unfamiliar with the program:

http://i.imgur.com/kAzXxJP.jpg

I've trained Tesseract using each tile as a letter of a new font. But that doesn't seem to work, as it still sees the actual letter and the number on the tile as two different parts instead of as parts of the same character. I tried changing "textord_min_linesize" as suggested in the FAQ's solutions for diacritics, which is a similar issue to mine, but if I input a value higher than the default of 1.25 then it doesn't see anything at all in the picture and I get an "Empty page!!" message. I've tried various image preprocessing steps and they haven't helped either.

Ideally I'd like to be able to differentiate between a normal "J" tile with the small "10" in the top right corner (the score for that particular letter) and a "J" tile without a number, which means it was a "wild card" tile in the game, as I would like to keep track of those. But if I have to scrap that at this point I'm willing to, because I just want to get something to work. If I could get Tesseract to ignore all the tiny numbers and other noise and read only the letters, I would be pleased.

I also can't figure out how it's scanning the image. Sometimes it goes top to bottom, right to left, and other times it seems to go left to right, top to bottom. And sometimes it just seems to jump around.

I know what I'm trying to do is possible as there are various marketplace apps that accomplish this task, and some of them mention using Tesseract. I just can't for the life of me figure out how.

Sorry for the length of this post, I'm just desperate for any help and want to make sure I express myself correctly. I've spent at least 30 hours on this already, and while I have the whole training aspect down (which was incredibly confusing to me when I first started), I still don't feel any closer to actually having something useful, and the project deadline keeps getting closer.

My most humble and sincere thanks for any help or suggestions you may have.

Cheers,

Alex

Paul

Jul 8, 2014, 7:41:47 AM
to tesser...@googlegroups.com

Nick White

Jul 8, 2014, 12:25:51 PM
to tesser...@googlegroups.com
Hi Alex,

If you're up for some programming, you could recognise the squares
yourself, and pass each one separately to tesseract with the
PSM_SINGLE_CHAR segmentation type. That should help if Tesseract is
not segmenting each whole square separately.
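
A rough sketch of that per-square approach, in Python. It assumes the board is a fixed grid of equally sized tiles; the 60-pixel tile size and the helper names (tile_boxes, single_char_command) are made up for illustration. On the tesseract 3.x command line, PSM_SINGLE_CHAR corresponds to -psm 10.

```python
def tile_boxes(cols, rows, tile_w, tile_h, left=0, top=0):
    """Return (left, top, right, bottom) crop boxes for a grid, row by row."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            x = left + c * tile_w
            y = top + r * tile_h
            boxes.append((x, y, x + tile_w, y + tile_h))
    return boxes


def single_char_command(tile_image, output_base, lang):
    """Build a tesseract 3.x command line for one cropped tile.

    "-psm 10" asks Tesseract to treat the image as a single character.
    """
    return ["tesseract", tile_image, output_base, "-psm", "10", "-l", lang]


# Hypothetical 2x2 corner of a board with 60-pixel tiles.
for box in tile_boxes(2, 2, 60, 60):
    print(box)
```

Each box is in the (left, upper, right, lower) order that, for example, Pillow's Image.crop accepts, so every tile can be saved to its own file and run through the command built by single_char_command.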

If the board is always the same size, you could even do it by just
creating a .uzn file like this:
0 0 60 60 squarea1
60 0 60 60 squarea2
120 0 60 60 squarea3
etc.

That way you're completely controlling the section segmentation. I
suspect you'd get better results by feeding each square separately,
though, as you can then use PSM_SINGLE_CHAR, and Tesseract has a
better chance of taking the number in the corner into account.
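
For a fixed-size board, the .uzn lines shown above could be generated rather than typed by hand. This is only a sketch: it reads the columns as "left top width height label" (inferred from the example lines), and the naming of rows after the first (squareb1, squarec1, ...) is an assumption, as is the helper name uzn_lines.

```python
import string


def uzn_lines(cols, rows, tile=60):
    """Return .uzn lines ("left top width height label") for a tile grid."""
    lines = []
    for r in range(rows):
        for c in range(cols):
            # Row letter plus 1-based column number, e.g. squarea1, squarea2.
            label = "square{}{}".format(string.ascii_lowercase[r], c + 1)
            lines.append("{} {} {} {} {}".format(c * tile, r * tile, tile, tile, label))
    return lines


# First row of three 60-pixel squares, matching the example above.
print("\n".join(uzn_lines(3, 1)))
```

The result would be written to a file named after the image, with the extension replaced by .uzn.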

As Paul suggested, though, binarisation may be an issue too. You can
check how well it does by using the tessedit_write_images config
setting; see
https://code.google.com/p/tesseract-ocr/wiki/ImproveQuality#Image_processing

Nick

Alex Ryan

Jul 9, 2014, 1:36:50 AM
to tesser...@googlegroups.com
Thank you SO much for the replies guys!!

I read up on those binarization links, and that looks like it's going to be a bit out of my wheelhouse to implement. I see that there is a Python/OpenCV implementation of that paper, but I'm not sure I could get that going, as I'm not familiar with either. I looked at the image file it's using right before it processes it (via the tessedit_write_images config) and the quality is good and everything is sharp, so I'm not sure how much it would help (other than removing the gibberish): http://i.imgur.com/ljBtNMQ.jpg

I tried going a character at a time; however, for some reason I can't seem to get Tesseract to work when I give it just one character: it doesn't see anything. If I give it a block of 4 tiles then it works. I tried all the different pageseg_mode options as well.

In one of the links, though, I saw something about the -psm setting. When I run the OCR with -psm 6, all of a sudden it works perfectly! I'm really not sure what that setting does; I've tried doing some searches, but I'm still unclear. Can you guys shed some light on that? I made a box file with that setting and put together a new traineddata file from it. Now if I try to run it using that language with anything other than -psm 6, it crashes Tesseract. Is this something I need to be concerned about?

My plan now is to do another set of better box files to make a new language using -psm 6 with the current traineddata. I'm hoping that it will be able to distinguish between the normal and wild card tiles this way. *fingers crossed*

Thank you again for taking the time to reply; it was a very big help.

cheers,

Alex

Paul

Jul 9, 2014, 6:16:08 AM
to tesser...@googlegroups.com

How about using ImageJ (can be automated with macros) to create a better binary result of the image?

  1. Download and install ImageJ
  2. Open the image
  3. Split the color channels (Image -> Color -> Split Channels)
  4. Close the blue channel, since it has low contrast
  5. Select the green channel, do a global thresholding (Image -> Adjust -> Threshold...; Parameters: 50 and 255; see kAzXxJP_green.png)
  6. Select the red channel, do a global thresholding (Image -> Adjust -> Threshold...; Parameters: 35 and 255)
  7. Subtract the background (Process -> Subtract Background...; Rolling ball size: 5.0; see kAzXxJP_red.png)
  8. Merge the channels (Image -> Color -> Merge Channels; Leave settings as they are; see kAzXxJP_composite.png)
  9. Make it an RGB image (Image -> Type -> RGB Color)
  10. Make it an 8-bit image again (Image -> Type -> 8-bit; this will make it grayscale)
  11. Do another binarization (Image -> Adjust -> Threshold...; Parameters: 50 and 255; see kAzXxJP_binary.png)
Quite a lot of steps, but the result is quite good without programming anything. You can also record the steps by using Plugins -> Macros -> Record... although I did not try that.
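
For anyone who would rather script it, the channel split and global thresholding (steps 3 to 6) can be sketched in plain Python. This is not what ImageJ runs internally, just the same idea on a list-of-rows image of (r, g, b) tuples. The helper names are invented, the background subtraction and merge steps are left out, and which side of the threshold ends up white in ImageJ depends on its display settings.

```python
def split_channels(pixels):
    """Split rows of (r, g, b) tuples into three single-channel images."""
    red = [[p[0] for p in row] for row in pixels]
    green = [[p[1] for p in row] for row in pixels]
    blue = [[p[2] for p in row] for row in pixels]
    return red, green, blue


def threshold(channel, low, high=255):
    """Global threshold: 255 where low <= value <= high, otherwise 0."""
    return [[255 if low <= v <= high else 0 for v in row] for row in channel]


# Tiny made-up 1x2 image, thresholded with the parameters from the steps above.
image = [[(40, 60, 0), (30, 20, 0)]]
red, green, blue = split_channels(image)  # blue is dropped (low contrast)
print(threshold(red, 35))    # red channel, parameters 35 and 255
print(threshold(green, 50))  # green channel, parameters 50 and 255
```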

Hope it helps,
Paul
kAzXxJP_green.png
kAzXxJP_red.png
kAzXxJP_composite.png
kAzXxJP_binary.png

Nick White

Jul 9, 2014, 12:56:20 PM
to tesser...@googlegroups.com
On Wed, Jul 09, 2014 at 03:16:08AM -0700, Paul wrote:
> How about using ImageJ (can be automated with macros) to create a better binary
> result of the image.

Thanks for mentioning this; I hadn't heard of it and it sounds very
useful. I added a link to the ImproveQuality wiki page.

Nick

Alex Ryan

Jul 9, 2014, 7:17:47 PM
to tesser...@googlegroups.com
Paul, I haven't gotten a chance to play around with that yet, but thanks for linking it; I might very well have to go that route.

I am having a very confusing issue, though, that I'm hoping someone can shed some light on.

I've been testing my language traineddata on a bunch of different boards, and with no apparent rhyme or reason Tesseract sometimes outputs perfect results and other times total garbage, even though the files it's seeing seem the same. It also changes depending on whether I add the "-psm 6" flag. It makes sense that there would be a change, but I don't understand why it's changing the way that it is. (I now know that -psm 6 treats the image as a single uniform block of text.)

Examples

Here is the output when it's working how I want it to.

This is the .tif file Tesseract sees, which I captured via the "tessedit_write_images 1" config:

http://i.imgur.com/uQdrEsQ.jpg

Here is how it detects the characters (viewed in jTessBoxEditor) with the "tesseract image.tif image -psm 6 -l lang batch.nochop makebox" command, with the resulting output of "tesseract image.tif output -psm 6 -l lang" shown alongside:

http://i.imgur.com/Abzq2LC.jpg

It has near-perfect recognition with only a couple of minor errors, the boxes are clearly drawn around both the letter and the score, and in the case of the wild card tiles it correctly detects them and recognizes them as lowercase characters (which is what I trained it to do). If I remove the -psm 6 flag, nothing at all is detected and I get an "Empty page!!" output.

Now here is another .tif file that is, as far as I can tell, functionally identical (grabbed via the write_images config):

http://i.imgur.com/ui1u8qk.jpg

This time, though, character recognition is terrible and it's not recognizing that the letter and score parts of a tile are the same character. This uses the identical "tesseract image.tif image -psm 6 -l lang batch.nochop makebox" command, with the resulting output of "tesseract image.tif output -psm 6 -l lang" shown alongside:

http://i.imgur.com/anqdXGk.jpg

Curiously, however, if I do the same thing without the -psm 6 flag, it does a decent job (not as good as in the first example, though) and gets most of the letters right, but now it reads the .tif from top to bottom and right to left. When I make a box file, though, it draws the boxes the same way, which I don't understand because it's definitely detecting the characters differently.
("tesseract image.tif image -l lang batch.nochop makebox" and "tesseract image.tif output -l lang")

http://i.imgur.com/o1Id32L.jpg
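
To keep the with/without -psm comparisons consistent across boards, the commands quoted above could be built by a small helper. A minimal sketch: the function name tesseract_command is invented, and the flags are only the ones already used in this thread (the 3.x-style -psm, -l, and the batch.nochop and makebox configs).

```python
def tesseract_command(image, output_base, lang, psm=None, configs=()):
    """Build a tesseract 3.x command line; omit -psm to use the default mode."""
    cmd = ["tesseract", image, output_base]
    if psm is not None:
        cmd += ["-psm", str(psm)]
    cmd += ["-l", lang]
    cmd += list(configs)
    return cmd


# The four variants compared in this post.
makebox = ("batch.nochop", "makebox")
print(tesseract_command("image.tif", "image", "lang", psm=6, configs=makebox))
print(tesseract_command("image.tif", "output", "lang", psm=6))
print(tesseract_command("image.tif", "image", "lang", configs=makebox))
print(tesseract_command("image.tif", "output", "lang"))
```

Each list can be handed to subprocess.call as-is, which makes it easy to loop over every board image and record which ones fail.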

I am so confused. What is going on? I have about 4 screens it recognizes perfectly and 7 or so that come out as garbage, and the effect of -psm on each is identical to what I described here. I don't see any functional differences between them. Tile distribution doesn't seem to matter, and how much border I leave around the board doesn't seem to matter. It just detects some and refuses to detect others. It never flip-flops either: if it works on a board, it always works, and if it doesn't, it never does.

Here is my traineddata file if it helps: http://www.idspispopd.net/fnl.traineddata

Any ideas? I'm starting to go mad :)

thanks!

Alex

Nick White

Jul 10, 2014, 10:45:54 AM
to tesser...@googlegroups.com
Hi Alex,

One quick thought, if you're still using .uzn, it's only loaded with
certain psm levels (it is with -psm 6, but not -psm 3, the default).
And it's loaded from <imagename_without_extension>.uzn. So if you
have any .uzn files lying around, they will be applied with
-psm 6, but not if you don't explicitly state the -psm.
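
A quick way to check for stray .uzn files, sketched in Python. It relies only on the naming rule above (image name with its extension replaced by .uzn); the function names and the example file names are hypothetical.

```python
from pathlib import Path


def uzn_path_for(image_path):
    """Return the .uzn path that would be picked up for this image."""
    return Path(image_path).with_suffix(".uzn")


def stray_uzn_files(image_paths):
    """Return the .uzn files that actually exist next to the given images."""
    candidates = [uzn_path_for(p) for p in image_paths]
    return [p for p in candidates if p.exists()]


# Hypothetical image names; prints any zone files sitting beside them.
print(stray_uzn_files(["board1.tif", "board2.tif"]))
```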

Nick

Nick White

Jul 10, 2014, 2:18:50 PM
to tesser...@googlegroups.com
On Tue, Jul 08, 2014 at 10:36:50PM -0700, Alex Ryan wrote:
> In one of the links tho I saw something about -psm setting. When I run the OCR
> with -psm 6 all of a sudden it worked perfect!!! Im really not sure what that
> setting does, ive tried doing some searches, but im still unclear. Can you guys
> shed some light on that? I made a box file from that setting and put together a
> new traineddata file with that. Now if I try and run it using that language on
> anything other than -psm 6 it crashes tesseract. Is this something I need to be
> concerned about?

Are you still able to reproduce this crash? If so, can you open a
bug in the issue tracker, attaching the training data and image file
that crash it?

Thanks,

Nick

Alex Ryan

Jul 10, 2014, 2:30:55 PM
to tesser...@googlegroups.com
Nick,

In searching I found out what was causing that crash. When I combined my files to make that particular traineddata file I omitted the shapetable. I recombined them with the shapetable and it doesn't crash on the default psm.

Regarding the .uzn files, I double-checked and there aren't any in the directory, so that can't be it, although that would have made sense.

thanks for all the help!

Alex Ryan

Jul 11, 2014, 6:06:29 PM
to tesser...@googlegroups.com
Just wanted to follow up.

I wrote some simple code to preprocess the image, because I realized I will be processing basically the same image every time, so it's foolish to rely on Tesseract's binarization technique, which was designed for a different and more general purpose. So basically I just turned every pixel white that wasn't part of a letter, and when I send that to Tesseract I get flawless output with the language data I trained. Thanks so much for the replies, Paul and Nick. I learned a lot and it pointed me in the right direction! Cheers!

Nick White

Jul 12, 2014, 12:45:46 PM
to tesser...@googlegroups.com
Great, good work, I'm glad it's working so well for you now!

Paul

Jul 13, 2014, 1:19:33 PM
to tesser...@googlegroups.com
Yes it's very useful if you are familiar with basic image processing, but it can be very confusing if you are not. :)

Paul

Paul

Jul 13, 2014, 1:24:06 PM
to tesser...@googlegroups.com
Do you simply filter out any color other than brown and white or is your algorithm more sophisticated? If it is, it would be great if you could share the basic idea.

Paul

Alex Ryan

Jul 14, 2014, 5:06:30 AM
to tesser...@googlegroups.com
Paul,

Thankfully the RGB values of the pixels I want aren't used anywhere else in the area of the image I want scanned. So yeah, I simply turn the colors I want black and all other colors white.
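
The idea comes down to something like this sketch. It is not my actual code: the RGB values in LETTER_COLORS are placeholders rather than the real tile colors, and the image is represented as a plain list of rows of (r, g, b) tuples so that no imaging library is needed.

```python
# Placeholder values only; the real letter colors are not given in this thread.
LETTER_COLORS = {(60, 30, 10), (250, 250, 250)}

BLACK = (0, 0, 0)
WHITE = (255, 255, 255)


def isolate_letters(pixels, letter_colors=LETTER_COLORS):
    """Turn pixels whose color is in letter_colors black and all others white."""
    return [[BLACK if p in letter_colors else WHITE for p in row] for row in pixels]


# Tiny made-up 1x3 strip: one letter pixel, two background pixels.
strip = [[(60, 30, 10), (200, 150, 90), (1, 2, 3)]]
print(isolate_letters(strip))
```

An exact color match like this only works because the source is a screenshot; a photo or a lossy JPEG would need some tolerance around each color.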

Paul

Jul 14, 2014, 2:44:05 PM
to tesser...@googlegroups.com
:)

I wonder why I didn't come up with that idea before doing the complex ImageJ processing.