Re: [tesseract-ocr] Help OCR'in an image

Allistair

Jul 12, 2016, 5:14:01 AM

to tesser...@googlegroups.com

In my opinion, having a very fixed layout/template like this gives you more control over how you perform the OCR. Rather than giving Tesseract the entire spreadsheet, why not program a preprocessing stage that extracts the text you want cleanly into a new image, given that you know all the (X, Y, WIDTH, HEIGHT) rectangle locations for such an input image?
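
A minimal sketch of such a preprocessing step in plain Java, assuming the cell rectangles are already known. The class name `CellCropper` and any coordinates passed to it are illustrative only and are not taken from the actual schedule.

```java
import java.awt.Graphics2D;
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

final class CellCropper {
    /** Copies one known rectangle of the page into its own image. */
    static BufferedImage crop(BufferedImage page, Rectangle cell) {
        // Clamp to the page bounds so a slightly-off template does not throw.
        Rectangle r = cell.intersection(
                new Rectangle(0, 0, page.getWidth(), page.getHeight()));
        if (r.isEmpty()) {
            throw new IllegalArgumentException("Rectangle lies outside the page: " + cell);
        }
        BufferedImage out = new BufferedImage(r.width, r.height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = out.createGraphics();
        // Draw the sub-region at the origin of the new image (a real copy, not a shared view).
        g.drawImage(page.getSubimage(r.x, r.y, r.width, r.height), 0, 0, null);
        g.dispose();
        return out;
    }
}
```

Each cropped image can then be handed to the OCR engine on its own, so the table borders and the graphics on the right never reach it.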

On 11 July 2016 at 22:00, Raphael Budd <woder...@gmail.com> wrote:

Hey everyone,

I've got this PDF document which is a schedule. I'm trying to extract the text from it via Tesseract, but I'm not getting very good results.

I've tried a lot of different things. In my inexperienced opinion the image seems very high quality, as I can zoom in a lot without seeing pixels. I've also tried converting the PDF to TIFF and adding a grayscale filter (all via Java).

I've attached both the end result and the original PDF here, along with a sample of the output. Any help making the output better would be appreciated.

The TIFF file is too big to attach; see this link: http://wltd.org/Daily%20schedule-14.tiff

---Begin text---
008 KIERA MCG 3:00 PM 11:00 PM TRWN 8.00 —
718 KYLE s 11:00 PM 7:00 AM MT 8.00 < —
686 JOSEPH e 11:00 PM 5:00 AM MT 6.00 — >
718 KYLE s 11:00 PM 7:00 AM MT 8.00 — >
656 CHANDLER A 1:00 PM 4:00 PM MB 3.00 —
720 TYLER D 11:00 PM 7:00 AM T|_ F 8.00 < —
720 TYLER D 11:00 PM 7:00 AM T|_ F 8.00 — >
052 SH ELLY L 5:30 AM 2:00 PM FLRIFFIMGR F 8.50 _:I
Riley M 372 8:00 AM 4:00 PM FLR F 8.00 —
‘ Raphael B602 4:00 PM 12:00 AM FLRIMGR F 8.00 ‘ —:| I
‘ Kevin G 652 11:00 AM 7:00 PM g$Y$IWNIMNY$I F 8.00 ‘ I:-:| I
Joseph C 191 8:00 AM 4:00 PM ADMIBKIMB F 8.00 -:—
2014 ROXANA T 11:00 AM 7:00 PM ADM F 8.00 _

--END TEXT---

As you can see, Tesseract becomes quite creative in its attempt at parsing this. Earlier in the document it even parsed the letter "N" as "|\|": creative, but useless for parsing!

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/f77f8dd8-f6d2-4f6b-b5fe-5510fac4f878%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Allistair

Jul 12, 2016, 9:12:21 AM

to tesser...@googlegroups.com

Yes, you should think of it in those terms. That would remove the noise you are seeing to the right hand side of your result as Tesseract likes to turn shapes into text if it can ;)

Even if you add new rows into the left side, the x,y top corner intervals are still consistent enough to just keep going down the image creating rectangles of input. At some point those rectangles will be white rectangles - you can easily check to see if a rectangle is full of white pixels or anything non-white to control when rows have ended etc.
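
The white-rectangle check could look roughly like this. It is only a sketch: the class name `BlankCheck` and the `whiteThreshold` parameter are made up for illustration, and the rectangle is assumed to lie fully inside the image.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

final class BlankCheck {
    /**
     * Returns true if every pixel in the rectangle is at least as bright as
     * whiteThreshold on all three channels. Assumes area lies inside img.
     */
    static boolean isBlank(BufferedImage img, Rectangle area, int whiteThreshold) {
        for (int y = area.y; y < area.y + area.height; y++) {
            for (int x = area.x; x < area.x + area.width; x++) {
                int rgb = img.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                if (r < whiteThreshold || g < whiteThreshold || b < whiteThreshold) {
                    return false; // found something non-white, so the row is not empty
                }
            }
        }
        return true;
    }
}
```

Walking down the image one row height at a time and stopping at the first blank rectangle gives a simple end-of-table test. A threshold slightly below 255 tolerates scanner or compression noise.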

The only thing I can see disrupting your template is the title "Managers". If you have variants where there could be zero or many of these titles for different sections, then your x,y finding method will need to be more complex. However, it seems like it could be easy to spot these sections, as you have a chunk of white space between the bottom-most border of the upper section and the bold black header area of the next box.

1. Extract the left-hand portion of the image with the boxes
2. Identify a pixel column that provides structural table information (not where text would be encountered) - you have plenty of these due to the layout
3. Apply logic to find section headers (pixelN and pixelN+1 are black)
4. Apply logic to find rows (pixelN == grey)
5. Find your rectangles of text based on fixed column widths and the previous row-finding logic

Something like that :)
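
A rough sketch of the column scan described in those steps. The brightness cut-offs `BLACK_MAX` and `GREY_MAX` and the class name `ColumnScanner` are assumptions for illustration; the real values would need tuning against the actual scan.

```java
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

final class ColumnScanner {
    // Assumed brightness cut-offs (0-255); tune them against the real image.
    static final int BLACK_MAX = 60;
    static final int GREY_MAX = 200;

    static int brightness(int rgb) {
        return (((rgb >> 16) & 0xFF) + ((rgb >> 8) & 0xFF) + (rgb & 0xFF)) / 3;
    }

    /** Returns the y position where each grey run (row divider) starts in column x. */
    static List<Integer> findRowDividers(BufferedImage img, int x) {
        List<Integer> dividers = new ArrayList<>();
        boolean inGrey = false;
        for (int y = 0; y < img.getHeight(); y++) {
            int v = brightness(img.getRGB(x, y));
            boolean grey = v > BLACK_MAX && v <= GREY_MAX;
            if (grey && !inGrey) {
                dividers.add(y); // first pixel of a new grey run
            }
            inGrey = grey;
        }
        return dividers;
    }

    /** Returns the y position where each run of at least two black pixels starts. */
    static List<Integer> findSectionHeaders(BufferedImage img, int x) {
        List<Integer> headers = new ArrayList<>();
        for (int y = 0; y + 1 < img.getHeight(); y++) {
            boolean blackPair = brightness(img.getRGB(x, y)) <= BLACK_MAX
                    && brightness(img.getRGB(x, y + 1)) <= BLACK_MAX;
            boolean startsRun = y == 0 || brightness(img.getRGB(x, y - 1)) > BLACK_MAX;
            if (blackPair && startsRun) {
                headers.add(y);
            }
        }
        return headers;
    }
}
```

The divider positions, combined with the fixed column widths, give the rectangles to crop for each row.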

On 12 July 2016 at 13:41, Raphael Budd <woder...@gmail.com> wrote:

I could, the only issue is that based on the number of people scheduled the box can grow, which would change all the x,y coords...

What can easily be done is to narrow down the scope of the OCR by only taking the horizontal table part and omitting the rest. I'm guessing that might also help?

Thanks for the help by the way!

Allistair

Jul 12, 2016, 9:13:57 AM

to tesser...@googlegroups.com

In fact I should say, extract the whole left bit first and see how you get on.

If you continue to find issues it's probably the borders of your table interfering by being too close to the text, hence the reason I am saying totally clearing the table away might help.

You might try just removing the dividing grey lines as a 2nd test - find and replace all grey pixels with white in this case.
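
A sketch of that grey-to-white replacement, assuming "grey" means a mid-range brightness. The `darkMax` and `greyMax` bounds and the class name `GreyLineRemover` are illustrative and would need tuning.

```java
import java.awt.image.BufferedImage;

final class GreyLineRemover {
    /**
     * Replaces mid-grey pixels with white, in place. Pixels at or below darkMax
     * (text) and above greyMax (already near white) are left untouched.
     */
    static void whitenGrey(BufferedImage img, int darkMax, int greyMax) {
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int rgb = img.getRGB(x, y);
                int v = (((rgb >> 16) & 0xFF) + ((rgb >> 8) & 0xFF) + (rgb & 0xFF)) / 3;
                if (v > darkMax && v <= greyMax) {
                    img.setRGB(x, y, 0xFFFFFFFF); // opaque white
                }
            }
        }
    }
}
```

One caveat: anti-aliased text edges are grey too, so an aggressive range can thin the glyphs. It is worth comparing the OCR output before and after.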

If that still does not work then you will need to address the fact borders are getting in the way and do something drastic as I've suggested.

Raphael Budd

Jul 13, 2016, 10:44:25 PM

to tesseract-ocr

So I added really strong preprocessing that chops up the schedule; however, it is being weird.

Output of the attached image is:

721 BENJI B 7:00 AM 3:00 PIVI DT 8.00

Once again, almost perfect, but the M becoming IVI is just a deal breaker, and having to do post-processing on that is going to be hell because there is no guarantee that IVI will never legitimately appear.

Daily schedule-11348.tiff

Allistair C

Jul 14, 2016, 3:46:38 AM

to tesser...@googlegroups.com

Have you tried resizing your image to be larger? Try 2x larger - it can sometimes help. Is this happening to all Ms or just this one?
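
For reference, a small sketch of upscaling with the standard Java 2D API. The class name `Upscaler` is hypothetical, and bicubic interpolation is just one reasonable choice, not something Tesseract requires.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

final class Upscaler {
    /** Returns a copy of src enlarged by an integer factor. */
    static BufferedImage scale(BufferedImage src, int factor) {
        BufferedImage out = new BufferedImage(
                src.getWidth() * factor, src.getHeight() * factor, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = out.createGraphics();
        // Bicubic interpolation keeps glyph edges smoother than nearest-neighbour.
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BICUBIC);
        g.drawImage(src, 0, 0, out.getWidth(), out.getHeight(), null);
        g.dispose();
        return out;
    }
}
```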

Raphael Budd

Jul 14, 2016, 9:33:06 AM

to tesseract-ocr

It just seems to happen to this one. It's super weird because all the other ones work perfectly fine!

I'll try resizing to 2x though.

Raphael Budd

Jul 14, 2016, 9:48:59 AM

to tesseract-ocr

So I have added the scaling, and with the scaling it makes the mistake of somehow interpreting "11:00" as "11 :00", something my program doesn't take too kindly to.
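
If this particular error keeps showing up, one possible workaround is to close up stray spaces around the colon before parsing. This is only a sketch: the class name `TimeSpacingFix` is made up, and it assumes a colon between digits always belongs to a clock time in these rows.

```java
import java.util.regex.Pattern;

final class TimeSpacingFix {
    // A digit, optional whitespace, a colon, optional whitespace, then two digits.
    private static final Pattern SPLIT_TIME = Pattern.compile("(\\d)\\s*:\\s*(\\d{2})");

    /** Rewrites "11 :00" or "11: 00" as "11:00"; well-formed times are unchanged. */
    static String closeUpTimes(String line) {
        return SPLIT_TIME.matcher(line).replaceAll("$1:$2");
    }
}
```

For example, `TimeSpacingFix.closeUpTimes("11 :00 PM 7:00 AM")` returns `"11:00 PM 7:00 AM"`.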

I'm not sure what else I can do to make it work as I feel I'm already spoon feeding the text to it, unless there is noise on the image or something I'm not aware of?

Thanks

Allistair

Jul 14, 2016, 9:52:27 AM

to tesser...@googlegroups.com

I'm afraid that's about the limit of what I can suggest - there are a great many "engine settings" available that can be tweaked to alter the OCR but they are not very well documented. Perhaps someone more familiar with these kinds of mistakes can try and help. Did the scaling fix the M issue even though it caused the new issue?

OCR is not, and should never be considered, perfect or reliable in my opinion, and today it generally needs a helping hand - you might be expecting too much :)

One more suggestion - if you know the font being used for your sheet you could do some dedicated training to generate a training file for Tesseract.

Cheers

Raphael Budd

Jul 15, 2016, 12:48:21 AM

to tesseract-ocr

Alright, so I did a bunch of testing and found that, weirdly enough, running Tesseract from the console produces 100% accuracy with my preprocessing, just not when I call it through the API in Java. I now suspect an old version of Tesseract is screwing things up. If that is the case, hopefully there is a more up-to-date version of Tess4J; otherwise it is going to be really painful to drive the command line from Java (possible, but a pain in the ****).

Raphael Budd

Jul 15, 2016, 1:54:49 AM

to tesseract-ocr

Update: I really don't understand the difference between these two installs. I am using the absolute latest version of Tess4J and it just does not work, whereas literally the SAME IMAGE works with the Tesseract command line. I've 100% confirmed that this is the behaviour every time: I can set up the Java app and the console at the same time on the same image, and the console gets it completely right while Tess4J screws it up. I should add that they are using the same tessdata, as I copy-pasted it over.

Anyway, I've had a long night of messing with this and that's enough for my poor soul.

(thanks for helping me through this, by the way. Getting closer and closer to the goal)

Allistair

Jul 15, 2016, 5:09:32 AM

to tesser...@googlegroups.com

Did you find out the versions being used?

Tess4J changelog suggests:

Recompile Tesseract 3.04.01 DLL against Leptonica 1.73

How does that compare with your CLI?

Is any config file or option being injected anywhere? Are you passing the same page segmentation mode (psm) parameter in both, or using automatic? I would recommend choosing one and setting it explicitly.
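
To make the comparison explicit on the CLI side, the command can be built with a fixed psm value. This is a sketch only: the class name `TesseractCommand` and the file names in the usage note are hypothetical. The psm flag is passed in as a parameter because 3.x builds spell it `-psm` while 4.x and later use `--psm`.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

final class TesseractCommand {
    /**
     * Builds: <exe> <image> <outputBase> -l <lang> <psmFlag> <psm>
     * Tesseract writes the recognised text to <outputBase>.txt.
     */
    static List<String> build(String exe, String image, String outputBase,
                              String lang, String psmFlag, int psm) {
        List<String> cmd = new ArrayList<>();
        cmd.add(exe);
        cmd.add(image);
        cmd.add(outputBase);
        cmd.add("-l");
        cmd.add(lang);
        cmd.add(psmFlag);             // "-psm" on 3.x, "--psm" on 4.x and later
        cmd.add(String.valueOf(psm)); // e.g. 7 = treat the image as a single text line
        return cmd;
    }

    /** Runs the command, forwarding its console output, and returns the exit code. */
    static int run(List<String> cmd) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        return p.waitFor();
    }
}
```

For a cropped single row, something like `TesseractCommand.build("tesseract", "row.tiff", "row", "eng", "-psm", 7)` would pin the mode. As far as I know, Tess4J exposes a matching `setPageSegMode` setter, so the same value can be set on both sides.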

Cheers

Raphael Budd

Jul 15, 2016, 7:56:03 AM

to tesseract-ocr

So the CLI says this when I run tesseract.exe --version

tesseract 3.05.00dev

leptonica-1.73 (Feb 5 2016, 01:13:58) [MSC v.1900 LIB Release x86]

libgif 5.1.2 : libjpeg 9 : libpng 1.6.19 : libtiff 4.0.2 : zlib 1.2.8 : libwebp 0.3.1

For Tess4J I'm just running the latest compiled binary. Did you mean that I should recompile the binary for a different release?

Allistair

Jul 15, 2016, 8:09:42 AM

to tesser...@googlegroups.com

So that sounds like you're running a non-master version (dev) and Tess4J is running latest master 3.04 (https://github.com/tesseract-ocr/tesseract) - as shown in its changelog http://tess4j.sourceforge.net/changelog.html

Eradicate the difference and then see if you see different results.

Raphael Budd

Jul 15, 2016, 10:47:44 AM

to tesseract-ocr

Alright, I have returned from testing. The Tesseract CLI version 3.02 returns the correct string; Tess4J version 3.2.0 returns the incorrect string. They are both running on default settings with nothing configured.

I'm kind of stumped now, I don't know what else could be wrong or why my java implementation isn't working as well as the CLI version.

Raphael Budd

Jul 15, 2016, 11:49:04 AM

to tesseract-ocr

Okay, so after all of that I did something with Maven (not really sure what) and now it works. I have 100% accuracy and everything is amazing. Thanks for all the help! Just as an aside for anyone reading this who is also trying to do this: cutting out the individual rows and removing the borders makes Tesseract a lot happier than just throwing the entire document at it. I went from maybe around 60% accuracy to 100% with some preprocessing. I also had to scale the image up a lot, but it works great now.

Allistair

Jul 15, 2016, 11:59:33 AM

to tesser...@googlegroups.com

Great stuff. My parting advice is don't think it will always be 100% perfect. I hope it will but you could get a weird person name that brings 2 letters together just close enough to make Tesseract get it wrong. I would maybe do further testing against lots of test images - of course it depends on your onward promises and system dependencies - is this just a helper tool or does it need to be 100% accurate 100% of the time :) Great you got somewhere with it.
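
One cheap safety net, sketched below, is to cross-check each recognised row against itself. It assumes the trailing decimal column is the shift length in hours, which matches the sample rows quoted earlier in this thread but is a guess about the schedule format. The class name `ShiftSanityCheck` is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ShiftSanityCheck {
    // Two clock times, then (later) a decimal hour count, e.g. "7:00 AM 3:00 PM DT 8.00".
    private static final Pattern ROW = Pattern.compile(
            "(\\d{1,2}):(\\d{2}) (AM|PM) (\\d{1,2}):(\\d{2}) (AM|PM) .*?(\\d+\\.\\d{2})");

    static int toMinutes(int hour, int minute, String amPm) {
        int h = hour % 12;              // 12 AM becomes 0; 12 PM becomes 0 + 12 below
        if (amPm.equals("PM")) {
            h += 12;
        }
        return h * 60 + minute;
    }

    /** True if the row parses and the printed hours equal end minus start (wrapping past midnight). */
    static boolean looksConsistent(String row) {
        Matcher m = ROW.matcher(row);
        if (!m.find()) {
            return false;               // e.g. "PIVI" instead of "PM" fails to parse
        }
        int start = toMinutes(Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)), m.group(3));
        int end = toMinutes(Integer.parseInt(m.group(4)), Integer.parseInt(m.group(5)), m.group(6));
        int worked = ((end - start) + 24 * 60) % (24 * 60);
        long printed = Math.round(Double.parseDouble(m.group(7)) * 60);
        return worked == printed;
    }
}
```

Rows that fail the check can be flagged for a human to look at instead of being trusted blindly. If the real schedule subtracts unpaid breaks, the comparison would need adjusting.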

Cheers
