Auto cropping for OCR


Andrew Ollett

May 10, 2024, 2:07:50 PM
to sanskrit-p...@googlegroups.com
Hi everyone,

Has anyone found a solution for isolating different texts from OCR output when they are printed on the same page? (I am thinking of texts and commentaries, like this example). I have used various jugaad solutions (based on the text and commentary being in different languages, or having different font sizes, etc.) but it would be nice to preprocess the images using OpenCV or something. If anyone has experimented with this, please do share your experiences (and code if possible). I am thinking of image processing with OpenCV > GCV for OCR.

Andrew

विश्वासो वासुकिजः (Vishvas Vasuki)

May 10, 2024, 8:47:20 PM
to sanskrit-p...@googlegroups.com
I don't think so. If you find a solution, please let us know.

--
You received this message because you are subscribed to the Google Groups "sanskrit-programmers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sanskrit-program...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/sanskrit-programmers/CAANHO16jSgZcqZ-2y-EOneS5QEX4z3cg2_Or4%2BVUKUxwLN26Yw%40mail.gmail.com.


--
--
Vishvas /विश्वासः

Anunad Singh

May 10, 2024, 10:15:14 PM
to sanskrit-p...@googlegroups.com
I think what is needed is an OCR tool that does not mix up multi-column text on a page. A few such OCR tools exist. I used one about a year ago, but I have forgotten which one it was. Its output was quite satisfactory.

-- AnunAda

karthika

May 11, 2024, 1:54:10 AM
to sanskrit-p...@googlegroups.com, Andrew Ollett
Tesseract or Surya OCR may work.

On 2024-05-10 23:37, Andrew Ollett wrote:
> Hi everyone,
>
> Has anyone found a solution for isolating different texts from OCR output
> when they are printed on the same page? (I am thinking of texts and
> commentaries, like this example
> <https://archive.org/details/Anandashram_Samskrita_Granthavali_Anandashram_Sanskrit_Series/ASS_097_Mimamsadarsana_with_Tantravartika__Sabarabhashya_Part_6_-_Subbasastri_1934/page/n69/mode/2up>).
> I have used various jugaad solutions (based on the text and commentary
> being in different languages, or having different font sizes, etc.) but it
> would be nice to preprocess the images using OpenCV or something. If anyone
> has experimented with this, please do share your experiences (and code if
> possible). I am thinking of image processing with OpenCV > GCV for OCR.
>
> Andrew

--
Karthika N J
PhD student
(Teaching Assistant),
CSE, IIT Bombay.

Andrew Ollett

May 11, 2024, 11:30:33 AM
to karthika, sanskrit-p...@googlegroups.com
If anyone is still interested, this is indeed possible with OpenCV. I am attaching an image of the bounding boxes which can be used to crop the image to size. Here is a gist:

vol4-044-new.tif
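
The gist is not reproduced in this thread. As a rough, dependency-free sketch of the general idea only (not necessarily what the gist does), line bounding boxes can be found from a row projection of a binarised page and then used for cropping. The names `find_line_boxes` and `crop` are hypothetical, and the input is assumed to be a list of rows with 1 for ink and 0 for background; with OpenCV one would work on a thresholded image array instead.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def find_line_boxes(ink: List[List[int]], min_height: int = 1) -> List[Box]:
    """Return one bounding box per horizontal band of rows that contain ink."""
    boxes: List[Box] = []
    start = None  # first row of the band currently being scanned
    # The extra empty row closes a band that touches the bottom edge.
    for y, row in enumerate(ink + [[]]):
        has_ink = any(row)
        if has_ink and start is None:
            start = y
        elif not has_ink and start is not None:
            if y - start >= min_height:
                band = ink[start:y]
                cols = [x for r in band for x, v in enumerate(r) if v]
                x0, x1 = min(cols), max(cols)
                boxes.append((x0, start, x1 - x0 + 1, y - start))
            start = None
    return boxes


def crop(ink: List[List[int]], box: Box) -> List[List[int]]:
    """Cut the given box out of the page."""
    x, y, w, h = box
    return [row[x:x + w] for row in ink[y:y + h]]
```

On a real scan, specks and rules between text and commentary would also produce bands, so `min_height` (an assumed parameter) would need tuning per book.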

Shreevatsa R

May 11, 2024, 2:37:38 PM
to sanskrit-p...@googlegroups.com, Andrew Ollett
I'm working on something related, and going about it a bit differently: you can get the bounding boxes of each word from the Google OCR (or Tesseract or whatever) response itself.

In the case of Google OCR, this is in jsonResponse.responses[0].textAnnotations, in elements of the array after the first one. A gist from the thing I'm working on (will share when ready, hopefully in a few days):

(I use this bounding box data to identify lines of the text, then manually select a group of lines and hit a button to group those lines into regions corresponding to the different texts. I'm fine doing it manually because it takes only a couple of seconds per region and if it's a text I care about I may not mind glancing over it anyway. An earlier version of this was what I used to extract matching regions — verse and footnote — for this or this, for example: the same thing can be used for different texts or text vs commentary.)
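
The snippet from that message is not preserved here. A minimal sketch of reading word boxes out of a saved Google OCR JSON response might look like the following, assuming the usual `responses[0].textAnnotations` layout in which each entry has a `description` and a `boundingPoly.vertices` list. The function names are hypothetical, and the line grouping is a simple heuristic rather than the method used in the tool described above.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)
Word = Tuple[str, Box]


def vertices_to_box(vertices: List[Dict[str, int]]) -> Box:
    # A coordinate equal to 0 may be missing from the JSON, hence .get(..., 0).
    xs = [v.get("x", 0) for v in vertices]
    ys = [v.get("y", 0) for v in vertices]
    return (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))


def word_boxes(response: dict) -> List[Word]:
    annotations = response["responses"][0].get("textAnnotations", [])
    # The first element covers the whole page, so only the rest are words.
    return [
        (a["description"], vertices_to_box(a["boundingPoly"]["vertices"]))
        for a in annotations[1:]
    ]


def group_into_lines(words: List[Word]) -> List[List[Word]]:
    """Group words whose vertical centre falls inside the previous line's extent."""
    lines: List[List[Word]] = []
    for word in sorted(words, key=lambda item: item[1][1] + item[1][3] / 2):
        _, (_, y, _, h) = word
        centre = y + h / 2
        if lines:
            top = min(b[1] for _, b in lines[-1])
            bottom = max(b[1] + b[3] for _, b in lines[-1])
            if top <= centre <= bottom:
                lines[-1].append(word)
                continue
        lines.append([word])
    # Within each line, order the words from left to right.
    return [sorted(line, key=lambda item: item[1][0]) for line in lines]
```

Skewed scans or tightly set lines would probably need a stricter test than the centre-inside-extent rule used here.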




विश्वासो वासुकिजः (Vishvas Vasuki)

May 11, 2024, 8:24:38 PM
to sanskrit-p...@googlegroups.com, Andrew Ollett
On Sun, 12 May 2024 at 00:07, Shreevatsa R <shree...@gmail.com> wrote:
> for this or this, for example: the same thing can be used for different texts or text vs commentary.)

So nice - thanks for sharing!
Something like:
(not proofread) → Searchable text (not proofread)
would make its use more obvious.

Shreevatsa R

May 12, 2024, 10:45:11 AM
to sanskrit-p...@googlegroups.com, Andrew Ollett
On Sat, 11 May 2024 at 17:24, विश्वासो वासुकिजः (Vishvas Vasuki) <vishvas...@gmail.com> wrote:
> On Sun, 12 May 2024 at 00:07, Shreevatsa R <shree...@gmail.com> wrote:
>> for this or this, for example: the same thing can be used for different texts or text vs commentary.)
>
> So nice - thanks for sharing!
> Something like:
> (not proofread) → Searchable text (not proofread)
> would make its use more obvious.

Done, thanks :-)

Andrew, I took a closer look at the Gist you shared (thank you), and I really like the idea of classifying lines by their widths. Even for my manual approach, I think this classification can serve as a good default or starting point, depending on the book in question.
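
To make the width idea concrete, here is a small sketch under stated assumptions; it is not the code from the Gist. It assumes line boxes are already available as `(x, y, width, height)` tuples, labels a line "wide" when it reaches an assumed fraction (`ratio`) of the widest line on the page, and merges consecutive lines with the same label into one region. All names are hypothetical.

```python
from itertools import groupby
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def classify_by_width(boxes: List[Box], ratio: float = 0.85) -> List[str]:
    """Label each line 'wide' or 'narrow' relative to the widest line."""
    widest = max(w for _, _, w, _ in boxes)
    return ["wide" if w >= ratio * widest else "narrow" for _, _, w, _ in boxes]


def merge(boxes: List[Box]) -> Box:
    """Smallest box that contains all the given boxes."""
    x0 = min(x for x, _, _, _ in boxes)
    y0 = min(y for _, y, _, _ in boxes)
    x1 = max(x + w for x, _, w, _ in boxes)
    y1 = max(y + h for _, y, _, h in boxes)
    return (x0, y0, x1 - x0, y1 - y0)


def regions(boxes: List[Box], ratio: float = 0.85) -> List[Tuple[str, Box]]:
    """Merge runs of same-label lines, top to bottom, into labelled regions."""
    if not boxes:
        return []
    ordered = sorted(boxes, key=lambda b: b[1])
    labels = classify_by_width(ordered, ratio)
    result = []
    for label, run in groupby(zip(labels, ordered), key=lambda pair: pair[0]):
        result.append((label, merge([box for _, box in run])))
    return result
```

One known weakness: the short last line of a full-width paragraph is labelled "narrow", so the result is best treated as a default to correct by hand. Which label corresponds to the text and which to the commentary depends on the book.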

I'll also repeat in case it helps that these bounding boxes will also be found in the OCR response: after all, finding the horizontal lines on the page, and their respective bounding boxes, is what any OCR tool needs to do too, and most of them will return this data. (Tesseract includes bounding boxes of individual lines too, while Google OCR only has each "block", but that will serve too.) For the page that you shared as example (input attached), the Google OCR response (screenshot from here attached) unfortunately includes the commentary below the line in the same block 6 (lol), but if you look at the JSON (or the orange underlines in the screenshot), there are *paragraph* bounding boxes and individual words' bounding boxes that have the information to identify the widths of lines.
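
As an illustration of where those paragraph and word boxes sit in the JSON, the sketch below walks the `fullTextAnnotation` hierarchy (pages, blocks, paragraphs, words, symbols), assuming the response was saved with that field present. The function names are hypothetical, and the block index simply counts blocks within each page.

```python
from typing import Dict, Iterator, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def to_box(vertices: List[Dict[str, int]]) -> Box:
    # A coordinate equal to 0 may be missing from the JSON, hence .get(..., 0).
    xs = [v.get("x", 0) for v in vertices]
    ys = [v.get("y", 0) for v in vertices]
    return (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))


def paragraph_boxes(response: dict) -> Iterator[Tuple[int, Box, List[Tuple[str, Box]]]]:
    """Yield (block index, paragraph box, [(word text, word box), ...])."""
    annotation = response["responses"][0].get("fullTextAnnotation", {})
    for page in annotation.get("pages", []):
        for block_index, block in enumerate(page.get("blocks", [])):
            for para in block.get("paragraphs", []):
                words = [
                    (
                        # A word's text is the concatenation of its symbols.
                        "".join(s["text"] for s in word.get("symbols", [])),
                        to_box(word["boundingBox"]["vertices"]),
                    )
                    for word in para.get("words", [])
                ]
                yield block_index, to_box(para["boundingBox"]["vertices"]), words
```

Paragraphs that share a block index but differ clearly in box width would then be candidates for splitting into text and commentary.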


 
ASS_097_Mimamsadarsana_with_Tantravartika__Sabarabhashya_Part_4_-_Subbasastri_1932_0043.jpg
Screenshot 2024-05-12 at 07.39.31.png