Auto cropping for OCR

Andrew Ollett

May 10, 2024, 2:07:50 PM

to sanskrit-p...@googlegroups.com

Hi everyone,

Has anyone found a solution for isolating different texts from OCR output when they are printed on the same page? (I am thinking of texts and commentaries, like this example: https://archive.org/details/Anandashram_Samskrita_Granthavali_Anandashram_Sanskrit_Series/ASS_097_Mimamsadarsana_with_Tantravartika__Sabarabhashya_Part_6_-_Subbasastri_1934/page/n69/mode/2up.) I have used various jugaad solutions (based on the text and commentary being in different languages, or having different font sizes, etc.), but it would be nice to preprocess the images using OpenCV or something similar. If anyone has experimented with this, please do share your experiences (and code if possible). I am thinking of image processing with OpenCV, followed by GCV (Google Cloud Vision) for OCR.

Andrew

विश्वासो वासुकिजः (Vishvas Vasuki)

May 10, 2024, 8:47:20 PM

to sanskrit-p...@googlegroups.com

I don't think so. If you find a solution, please let us know.

--
You received this message because you are subscribed to the Google Groups "sanskrit-programmers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sanskrit-program...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/sanskrit-programmers/CAANHO16jSgZcqZ-2y-EOneS5QEX4z3cg2_Or4%2BVUKUxwLN26Yw%40mail.gmail.com.

--
Vishvas /विश्वासः

Anunad Singh

May 10, 2024, 10:15:14 PM

to sanskrit-p...@googlegroups.com

I think what is needed is an OCR tool that does not mix up the multicolumn text on a page. A few such OCR tools exist. I used one about a year ago, but I have forgotten which one it was. Its output was quite satisfactory.

-- AnunAda

karthika

May 11, 2024, 1:54:10 AM

to sanskrit-p...@googlegroups.com, Andrew Ollett

Tesseract or Surya OCR may work.

On 2024-05-10 23:37, Andrew Ollett wrote:
> Hi everyone,
>
> Has anyone found a solution for isolating different texts from OCR output
> when they are printed on the same page? (I am thinking of texts and
> commentaries, like this example
> <https://archive.org/details/Anandashram_Samskrita_Granthavali_Anandashram_Sanskrit_Series/ASS_097_Mimamsadarsana_with_Tantravartika__Sabarabhashya_Part_6_-_Subbasastri_1934/page/n69/mode/2up>).
> I have used various jugaad solutions (based on the text and commentary
> being in different languages, or having different font sizes, etc.) but it
> would be nice to preprocess the images using OpenCV or something. If anyone
> has experimented with this, please do share your experiences (and code if
> possible). I am thinking of image processing with OpenCV > GCV for OCR.
>
> Andrew

--
Karthika N J
PhD student
(Teaching Assistant),
CSE, IIT Bombay.

Andrew Ollett

May 11, 2024, 11:30:33 AM

to karthika, sanskrit-p...@googlegroups.com

If anyone is still interested, this is indeed possible with OpenCV. I am attaching an image of the bounding boxes which can be used to crop the image to size. Here is a gist:

https://gist.github.com/aso2101/d66252772a34a4f61b617e0d8f3b132a

vol4-044-new.tif
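
For readers who cannot open the gist, here is a minimal sketch of one common way to get per-line bounding boxes with OpenCV and turn a group of them into a crop rectangle. It is not the code from the gist above: the function names, the `kernel_width` default, and the threshold-then-dilate approach are all assumptions made for illustration, and the kernel size would need tuning for a given scan.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height), the shape cv2.boundingRect returns


def find_line_boxes(image_path: str, kernel_width: int = 60) -> List[Box]:
    """Return one bounding box per printed line, top to bottom (hypothetical helper)."""
    import cv2  # imported here so union_box below also works without OpenCV installed

    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(image_path)
    # Otsu threshold, inverted so that ink is white on a black background.
    _, ink = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # A wide, short kernel smears the glyphs of one line into a single blob.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_width, 3))
    smeared = cv2.dilate(ink, kernel, iterations=1)
    # [-2] picks the contour list under both the OpenCV 3 and OpenCV 4 return shapes.
    contours = cv2.findContours(smeared, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]
    boxes = [cv2.boundingRect(c) for c in contours]
    return sorted(boxes, key=lambda b: b[1])


def union_box(boxes: List[Box], pad: int = 0) -> Box:
    """Smallest box covering all given boxes, grown by `pad` pixels and clamped at 0."""
    left = max(min(x for x, _, _, _ in boxes) - pad, 0)
    top = max(min(y for _, y, _, _ in boxes) - pad, 0)
    right = max(x + w for x, _, w, _ in boxes) + pad
    bottom = max(y + h for _, y, _, h in boxes) + pad
    return (left, top, right - left, bottom - top)
```

With `x, y, w, h = union_box(selected_boxes, pad=5)`, the crop itself is the NumPy slice `image[y:y + h, x:x + w]` on the array returned by `cv2.imread`; the slice simply stops at the page edge if the padded box runs past it.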

Shreevatsa R

May 11, 2024, 2:37:38 PM

to sanskrit-p...@googlegroups.com, Andrew Ollett

I'm working on something related, and going about it a bit differently: you can get the bounding boxes of each word from the Google OCR (or Tesseract or whatever) response itself.

In the case of Google OCR, this is in jsonResponse.responses[0].textAnnotations, in elements of the array after the first one. A gist from the thing I'm working on (will share when ready, hopefully in a few days):

https://gist.github.com/shreevatsa/f9b2029f43eb68a2aec42154513e3c87#file-ocr-ts-L58-L73

(I use this bounding box data to identify lines of the text, then manually select a group of lines and hit a button to group those lines into regions corresponding to the different texts. I'm fine doing it manually because it takes only a couple of seconds per region and if it's a text I care about I may not mind glancing over it anyway. An earlier version of this was what I used to extract matching regions — verse and footnote — for this or this, for example: the same thing can be used for different texts or text vs commentary.)
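
A rough Python sketch of the same idea, for anyone who wants to try it before the tool is shared. It is not the TypeScript from the gist. It assumes the response has been parsed into a dict, that each entry in `textAnnotations` carries `description` and `boundingPoly.vertices` (my understanding of the Vision API JSON, where a zero coordinate may be omitted), and it uses a deliberately simple line-grouping heuristic. The helper names are made up for illustration.

```python
from typing import Dict, List, Tuple

Word = Tuple[str, int, int, int, int]  # (text, left, top, right, bottom)


def word_boxes(response: Dict) -> List[Word]:
    """Extract axis-aligned word boxes from a parsed OCR response (hypothetical helper)."""
    annotations = response["responses"][0].get("textAnnotations", [])
    words = []
    for ann in annotations[1:]:  # element 0 holds the text of the whole page
        vertices = ann["boundingPoly"]["vertices"]
        xs = [v.get("x", 0) for v in vertices]  # assumption: missing key means 0
        ys = [v.get("y", 0) for v in vertices]
        words.append((ann["description"], min(xs), min(ys), max(xs), max(ys)))
    return words


def group_into_lines(words: List[Word]) -> List[List[Word]]:
    """Group words into lines: a word joins the current line when its vertical
    centre falls inside that line's vertical extent. Assumes a roughly unskewed page."""
    lines: List[List[Word]] = []
    for word in sorted(words, key=lambda w: (w[2] + w[4]) / 2):
        centre = (word[2] + word[4]) / 2
        if lines:
            top = min(w[2] for w in lines[-1])
            bottom = max(w[4] for w in lines[-1])
            if top <= centre <= bottom:
                lines[-1].append(word)
                continue
        lines.append([word])
    # Within each line, order the words left to right.
    return [sorted(line, key=lambda w: w[1]) for line in lines]
```

Each returned line is a list of words in reading order, so the line's own bounding box is just the min/max over its words.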

विश्वासो वासुकिजः (Vishvas Vasuki)

May 11, 2024, 8:24:38 PM

to sanskrit-p...@googlegroups.com, Andrew Ollett

On Sun, 12 May 2024 at 00:07, Shreevatsa R <shree...@gmail.com> wrote:

> for this or this, for example: the same thing can be used for different texts or text vs commentary.)

So nice - thanks for sharing!

Something like:

(not proofread) → Searchable text (not proofread)

would make its use more obvious.

Shreevatsa R

May 12, 2024, 10:45:11 AM

to sanskrit-p...@googlegroups.com, Andrew Ollett

On Sat, 11 May 2024 at 17:24, विश्वासो वासुकिजः (Vishvas Vasuki) <vishvas...@gmail.com> wrote:

> On Sun, 12 May 2024 at 00:07, Shreevatsa R <shree...@gmail.com> wrote:
>> for this or this, for example: the same thing can be used for different texts or text vs commentary.)
>
> So nice - thanks for sharing!
> Something like:
> (not proofread) → Searchable text (not proofread)
> would make its use more obvious.

Done, thanks :-)

Andrew, I took a closer look at the Gist you shared (thank you), and I really like the idea of classifying lines by their widths. Even for my manual approach, I think this classification can serve as a good default or starting point, depending on the book in question. Thanks for sharing.

I'll also repeat, in case it helps, that these bounding boxes can be found in the OCR response as well: finding the horizontal lines on the page, and their respective bounding boxes, is something any OCR tool has to do anyway, and most of them return this data. (Tesseract includes bounding boxes of individual lines too, while Google OCR only has each "block", but that will serve too.) For the page that you shared as an example (input attached), the Google OCR response (screenshot from here attached) unfortunately includes the commentary below the line in the same block 6 (lol), but if you look at the JSON (or the orange underlines in the screenshot), there are *paragraph* bounding boxes and individual words' bounding boxes, which carry the information needed to identify the widths of lines.
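
To make the width idea concrete, here is a small sketch that works on line boxes from any source (OpenCV contours or OCR-response boxes). It is only an illustration under stated assumptions, not the logic of Andrew's gist: the 0.9 `ratio` cut-off and the labels "wide" and "narrow" are arbitrary choices, and the last, short line of a paragraph will be labelled "narrow" even when it belongs to the wide text, so the result is best treated as a default to correct by hand.

```python
from itertools import groupby
from typing import List, Tuple

LineBox = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def classify_by_width(lines: List[LineBox], ratio: float = 0.9) -> List[str]:
    """Label each line "wide" or "narrow" relative to the widest line on the page."""
    if not lines:
        return []
    widest = max(right - left for left, _, right, _ in lines)
    return ["wide" if (right - left) >= ratio * widest else "narrow"
            for left, _, right, _ in lines]


def group_regions(lines: List[LineBox], labels: List[str]) -> List[Tuple[str, LineBox]]:
    """Merge vertically consecutive lines that share a label into one region box."""
    ordered = sorted(zip(lines, labels), key=lambda pair: pair[0][1])  # top to bottom
    regions = []
    for label, run in groupby(ordered, key=lambda pair: pair[1]):
        boxes = [box for box, _ in run]
        regions.append((label, (min(b[0] for b in boxes), min(b[1] for b in boxes),
                                max(b[2] for b in boxes), max(b[3] for b in boxes))))
    return regions
```

Each region box can then be cropped out and sent to OCR separately, or used to split an existing OCR response into the main text and the commentary.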

ASS_097_Mimamsadarsana_with_Tantravartika__Sabarabhashya_Part_4_-_Subbasastri_1932_0043.jpg

Screenshot 2024-05-12 at 07.39.31.png
