Allow me to elaborate in this thread on some general image-processing questions. I also include one example solution at the end to justify this email.
Personally, I do not think these questions belong in this forum, because tesseract already does a great job at segmentation when you have no additional information about the input document set. Can it be improved? Definitely, but the price/performance ratio is poor, and I would rather see the authors/committers focus on other things than the handling of very specific documents.
That being said, if you really want high(er) precision, you simply have to do image processing.
I have seen references to OpenCV quite a lot, but no matter how great that library is, for document image processing my suggestion is to use Leptonica (https://github.com/DanBloomberg/leptonica/). Yes, the one tesseract uses internally. That library is very powerful and super fast, even without CPU/GPU magic. I have to admit I do not understand why it is not much more popular and more widely used by anyone who is, or has to be, at least a bit serious about document image processing.
The basic keywords you should understand before attempting any processing are: connected components, basic morphological operations (dilate, erode, open, close), structuring elements, and seed fills. With their rather simple usage, many questions in this forum could be answered (at least in a hardcoded way). The reason there are only a few helpful answers might be that writing them takes considerable time; I believe some people have internal frameworks where this can be done easily, but they cannot share them.
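To make the first two keywords concrete, here is a toy, pure-Python sketch (not Leptonica; the function names and the tiny bitmap are mine, for illustration only) of 4-connected component labeling and a 3x3 brick dilation. Leptonica's `pixConnComp` and `pixDilateBrick` do the same things, far faster, on real images; note how dilation merges nearby blobs into one component, which is a common trick for grouping letters into words or lines:

```python
# Toy illustration of connected components and morphological dilation
# on a tiny binary "image" (1 = black/ink pixel). Illustration only;
# for real documents use Leptonica (pixConnComp, pixDilateBrick, ...).
from collections import deque

def connected_components(grid):
    """Count 4-connected groups of 1-pixels via flood fill."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 1 and not seen[r][c]:
                count += 1
                seen[r][c] = True
                q = deque([(r, c)])
                while q:
                    y, x = q.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            q.append((ny, nx))
    return count

def dilate(grid):
    """Dilation with a 3x3 brick structuring element."""
    rows, cols = len(grid), len(grid[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = r + dy, c + dx
                    if 0 <= ny < rows and 0 <= nx < cols and grid[ny][nx]:
                        out[r][c] = 1
    return out

img = [[1, 1, 0, 0, 1],
       [1, 0, 0, 0, 1],
       [0, 0, 0, 0, 0],
       [0, 1, 1, 0, 0]]
print(connected_components(img))          # 3 separate blobs
print(connected_components(dilate(img)))  # 1: dilation merged them
```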
Furthermore, the current (LSTM-based) traineddata are very good, but you will find (even simple) examples where they do not perform well, and you have to either do image processing or retrain (or use an older version that relies on different properties). Have a look at these simple images:
Download the Latin best traineddata and do OCR for both images, e.g.,
tesseract -l Latin --psm 8 --oem 1 ./t1.png stdout
and you should get `MMEA` vs `MEA`.
Well, this might not be the best example, but I hope it illustrates the point.
Answer to the original question
In order to keep this message "short", I will stop here and point you to a
and
The code uses Leptonica. It prepares the image by scaling, deskewing, and binarizing it, then (very) roughly tries to find possible letter descenders of Latin text on a line (here you could traverse the line column by column and look for black pixels above/below the baseline), finds the lines, and computes the result. It is far from perfect, but the result is usable.
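To illustrate just the binarization step of that pipeline, here is a toy pure-Python global Otsu threshold (a sketch only; the function name and sample values are mine, and real code would use Leptonica's thresholding routines on an actual PIX rather than a Python list):

```python
# Toy global Otsu thresholding on a flat list of 8-bit gray values.
# Illustration only; real document binarization should use Leptonica.
def otsu_threshold(pixels):
    """Return the threshold maximizing between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    sum_b = 0.0   # running sum of background-class values
    w_b = 0       # running background-class pixel count
    best_t, best_var = 0, -1.0
    for t in range(256):
        w_b += hist[t]
        if w_b == 0:
            continue
        w_f = total - w_b
        if w_f == 0:
            break
        sum_b += t * hist[t]
        m_b = sum_b / w_b
        m_f = (sum_all - sum_b) / w_f
        var = w_b * w_f * (m_b - m_f) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

gray = [12, 15, 10, 200, 210, 205, 14, 198]   # dark ink vs light paper
t = otsu_threshold(gray)
binary = [1 if p <= t else 0 for p in gray]   # 1 = "ink" pixel
print(binary)  # prints [1, 1, 1, 0, 0, 0, 1, 0]
```

With the image binarized, the descender search described above reduces to scanning columns of this 0/1 data below the detected baseline.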
Kind Regards,
Jozef