Extract Graphics from Video and get text with OCR

Keith Reilly

Sep 15, 2015, 3:36:37 AM

to tesseract-ocr

Okay, so my project is to extract the text embedded in video. After experimenting with ImageMagick I was able to isolate the text and put it on a white background. I thought that would be the hard part, but every command-line OCR program I try does a poor job of converting what I have. In the sample image, f2.png, you can see what I'm working with. I only need the network name and the date. I used this ImageMagick command:

convert f1.png f2.png f3.png f4.png f5.png f6.png f7.png -evaluate-sequence Min -threshold 60% -negate output.png

I thought that was a pretty good result: a clean image with decent text. Tesseract is about 50% accurate on it. My question is this: can I train Tesseract without the full alphabet? These clips are all labeled by network, and Vanderbilt only records a few, so I'll have FOX, ABC, CBS, NBC, and CNN. That's not many letters to train with. Also, could anyone point me to instructions for getting the training tools installed on Mac OS X? MacPorts doesn't have the training part. I did install v3 from source, but the training programs won't compile. Any help is appreciated.
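For readers who want to see what the ImageMagick command above does to the pixels, here is a minimal pure-Python sketch. It assumes each frame is a grayscale 2D list with values from 0 to 255 and that the threshold keeps pixels strictly brighter than 60% of full scale; the function name and data layout are hypothetical, not part of ImageMagick.

```python
def min_threshold_negate(frames, threshold=0.6, max_value=255):
    """Combine frames by per-pixel minimum, threshold, then invert.

    frames: list of equally sized 2D lists of grayscale values (0..max_value).
    Returns a 2D list containing only 0 (black) and max_value (white).
    """
    height, width = len(frames[0]), len(frames[0][0])
    cutoff = threshold * max_value
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Overlay text stays bright in every frame, so it survives the
            # minimum; moving video is dark in at least one frame and drops out.
            darkest = min(frame[y][x] for frame in frames)
            is_text = darkest > cutoff
            # Negate: text becomes black (0), everything else white.
            row.append(0 if is_text else max_value)
        result.append(row)
    return result


# Two tiny 1x3 "frames": only the first pixel is bright in both.
frames = [[[255, 200, 10]], [[250, 30, 240]]]
print(min_threshold_negate(frames))
```

The more frames are combined, the more likely it is that changing background pixels fall below the cutoff somewhere, which is why stacking seven frames helps.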

f2.png

output.png

Dmitri Silaev

Sep 15, 2015, 4:58:25 AM

to tesser...@googlegroups.com

Good work extracting the text, but the result is not yet good enough for Tesseract. Try blurring your result image until the characters become less blocky. That way you probably won't need training.
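As a rough illustration of the blurring step, here is a small box blur in pure Python. It is a stand-in for whatever blur the image tool provides (ImageMagick's own blur is Gaussian, not a box filter), and it assumes a grayscale 2D list; the function name and radius parameter are hypothetical.

```python
def box_blur(image, radius=1):
    """Return a copy of a grayscale 2D list blurred with a square box filter."""
    height, width = len(image), len(image[0])
    blurred = []
    for y in range(height):
        row = []
        for x in range(width):
            total = count = 0
            # Average over the neighbourhood, clipped at the image border.
            for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    total += image[ny][nx]
                    count += 1
            row.append(total // count)
        blurred.append(row)
    return blurred


# A single black pixel on white gets spread into its neighbours.
image = [[255, 255, 255],
         [255, 0, 255],
         [255, 255, 255]]
for row in box_blur(image):
    print(row)
```

Blurring turns hard stair-step edges into gray ramps, which is closer to the scanned print that Tesseract expects.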

A completely different approach is to use fixed pattern matching. Go find my post about pulling text out of game screenshots. You'll need to write the program yourself, though.

The last thing I'd try is training. The wiki is your friend.

-Dmitri


--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/52275c37-543e-4b85-ab44-6c51f890ca6b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Keith Reilly

Sep 15, 2015, 4:23:44 PM

to tesseract-ocr

Thanks for the feedback. I worked a little on getting better results from ImageMagick and have better text now. This is with an ImageMagick blur at 1x1 to get rid of the jaggies. Tesseract is about 85% accurate now. I saw your post on extracting game text, I think: https://groups.google.com/forum/#!topic/tesseract-ocr/ZsYvAIHWumA That gave me the idea to crop the two areas I need and stitch them back together, as seen above. This let me lower the threshold, since I don't have to worry so much about other pixels showing up now that it's cropped. But I don't think your preferred method from the game text extraction post will work here. Let me list the reasons why, and if I'm wrong please let me know.

1) The character generator changes the shade of white depending on what the video behind it looks like.
2) Different video clips will have been processed with a different character generator, so where the text is displayed in the video might shift a little.
3) The encoding method leaves heavy compression artifacts.

In a specific game you would always expect the pixels at a given coordinate to be the same if it's displaying the letter "A", for example. So if you compare your control sample to what was extracted from the game being played, you can see whether they are identical. But in my case the letter "A" from one video would be mathematically different from the letter "A" in the next, so a comparison won't work. Correct? If not, just tell me. I am a novice at this; I've never tried to extract text before. I appreciate the tip about not training Tesseract. That saved me a lot of time, since I thought that was the way to go.

ShreeDevi Kumar

Sep 15, 2015, 11:38:23 PM

to tesser...@googlegroups.com

If you only need to recognize a limited set of letters and numbers, also look at the whitelist.
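A sketch of how the whitelist could be built for this case follows. The Tesseract config variable is tessedit_char_whitelist; passing it with -c and writing to stdout are assumptions about the installed Tesseract version, and including digits plus "/" for the date is a guess at the date format. The helper names are hypothetical, and the sketch only builds the argument list without running anything.

```python
import string

NETWORKS = ["FOX", "ABC", "CBS", "NBC", "CNN"]  # names listed earlier in the thread


def build_whitelist(words, extra=string.digits + "/"):
    """Collect the unique characters Tesseract should be allowed to output."""
    return "".join(sorted(set("".join(words)) | set(extra)))


def tesseract_args(image_path, whitelist):
    """Build a command line as an argument list; nothing is executed here."""
    return ["tesseract", image_path, "stdout",
            "-c", "tessedit_char_whitelist=" + whitelist]


whitelist = build_whitelist(NETWORKS)
print(whitelist)
print(tesseract_args("output.png", whitelist))
```

The resulting list can be handed to subprocess.run once the flags have been checked against the local tesseract help output.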

- sent from my phone. excuse the brevity and typos.

Dmitri Silaev

Sep 16, 2015, 12:13:38 AM

to tesser...@googlegroups.com

Text color - you need to replicate, or at least account for, the logic behind the color selection so that you extract as many correct pixels as possible.
Text position - just work with the cropped text.
High compression - see below.

When you use fixed pattern matching, the patterns are fixed, but the matching doesn't have to be. You can use "fuzzy" matching, e.g. accept a match when a defined percentage of pixels agree with the pattern.

Another "big thing" that came to mind is to rectify the italics by unshifting the respective scanlines. This would bring the characters closer to what Tesseract is trained on.
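One way to read "unshifting scanlines" is a horizontal shear: each row is moved left by an amount proportional to its height above the bottom row. The sketch below assumes a 2D list image, text leaning to the right, and a non-negative slant value found by trial and error; the function and parameter names are hypothetical.

```python
def unshear_rows(image, slant, background=255):
    """Shift each scanline left in proportion to its height above the bottom row.

    slant: horizontal pixels of lean per row, assumed >= 0 (tuned by eye).
    """
    height, width = len(image), len(image[0])
    result = []
    for y, row in enumerate(image):
        # The top row leans furthest right, so it gets the largest shift.
        shift = round(slant * (height - 1 - y))
        shifted = row[shift:] + [background] * shift
        result.append(shifted[:width])
    return result


# A right-leaning diagonal stroke (0 = ink) becomes a vertical one.
italic = [[255, 255, 0, 255],
          [255, 0, 255, 255],
          [0, 255, 255, 255]]
for row in unshear_rows(italic, slant=1):
    print(row)
```

An image tool's shear operation should give a comparable result with interpolation, which is smoother than whole-pixel shifts.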

-Dmitri


Keith Reilly

Sep 21, 2015, 3:26:33 PM

to tesseract-ocr

So your idea of skewing the image to fix the italics was a good one. I'm getting more accurate results.

Now, with fixed pattern matching, are you referring to using tools like OpenCV? I've never done anything like that before. I think with the rectified italics I can get the results I need. Since I'm looking for a network and a date, there are only a certain number of possibilities (FOX, ABC, CBS), so if Tesseract comes close I could probably write a script that figures out what it is supposed to be. This is the path I'll pursue. Dmitri, thanks for your input and advice, and Shree, thanks for pointing out the whitelist. I didn't know that existed; I'm sure my results will get better once I get it to work.
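The "script that figures out what it is supposed to be" could be as small as the sketch below, which snaps a near-miss OCR result to the closest known network name using Python's standard difflib module. The 0.6 cutoff and the function name are assumptions to be tuned against real Tesseract output.

```python
import difflib

NETWORKS = ["FOX", "ABC", "CBS", "NBC", "CNN"]  # names listed earlier in the thread


def snap_to_network(ocr_text, cutoff=0.6):
    """Return the known network name closest to the OCR output, or None."""
    cleaned = ocr_text.strip().upper()
    matches = difflib.get_close_matches(cleaned, NETWORKS, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for raw in ["F0X", "c8s ", "NBG", "XYZ"]:
    print(repr(raw), "->", snap_to_network(raw))
```

Returning None for poor matches keeps bad reads visible so they can be checked by hand instead of being silently assigned to the wrong network.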

Keith

Dmitri Silaev

Sep 23, 2015, 5:21:41 PM

to tesser...@googlegroups.com

Glad the italics deskewing worked well.

I'm not referring to OpenCV; its methods are probably overkill for such a trivial problem. Imagine you overlay a rectangular black/white stencil (a character template) on an area of the black/white image and check whether the stencil exactly matches that area. Try every stencil you have. Found a match? Then you've found a character. Move on to the next fixed position (this works because your images use a monospace font), and so on. That would be "fixed" pattern matching. It would work on a Nintendo game screenshot. But you have JPEG artifacts and other complications, so allow for a bit of discrepancy: don't require a perfect match, but allow, say, up to 15% of the pixels to differ as long as the rest match. That's what I called "fuzzy" matching.
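The stencil approach described above could be sketched as follows. This is a toy version under several assumptions: the image is a single text line stored as a 2D list of 0/1 pixels, every stencil has the same height as the image, and characters sit in cells of equal width. The function names and the tiny 3x3 stencils are hypothetical; real templates would be cut from sample frames.

```python
def mismatch_fraction(image, stencil, left):
    """Fraction of stencil pixels that differ from the image area starting at column left."""
    height, width = len(stencil), len(stencil[0])
    differing = sum(
        1
        for y in range(height)
        for x in range(width)
        if image[y][left + x] != stencil[y][x]
    )
    return differing / (height * width)


def read_fixed_cells(image, stencils, cell_width, max_mismatch=0.15):
    """Read the line cell by cell; '?' marks cells where no stencil is close enough."""
    text = []
    for left in range(0, len(image[0]) - cell_width + 1, cell_width):
        # Try every stencil and keep the one with the fewest differing pixels.
        score, char = min(
            (mismatch_fraction(image, stencil, left), char)
            for char, stencil in stencils.items()
        )
        text.append(char if score <= max_mismatch else "?")
    return "".join(text)


stencils = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
}
# Cell 1: an "L" with one noisy pixel, cell 2: a clean "I", cell 3: a solid block.
image = [[1, 0, 0, 0, 1, 0, 1, 1, 1],
         [1, 0, 1, 0, 1, 0, 1, 1, 1],
         [1, 1, 1, 0, 1, 0, 1, 1, 1]]
print(read_fixed_cells(image, stencils, cell_width=3))
```

With realistic stencil sizes the 15% tolerance absorbs compression noise while still separating different letters; if the text position shifts between clips, the same scoring could be repeated over a few pixel offsets per cell.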

Tesseract is not used in the above method at all. It takes time to program.

I know it's tempting to use Tesseract as a free off-the-shelf tool, but that comes at the cost of lower accuracy. What I suggested gives an accuracy close to 100%.

The choice is yours.

Best regards,
Dmitri Silaev
www.CustomOCR.com
