Mathematical equation detection & recognition


Владимир Калачихин

May 18, 2020, 11:07:56 AM
to tesseract-ocr
What is the current state of mathematical equation detection and recognition in Tesseract?
I am ready to devote my time to training Tesseract on math symbols, but why hasn't anyone done it yet?

Владимир Калачихин

May 20, 2020, 8:27:14 AM
to tesseract-ocr
The "equ" math/equation detection module is not present in Tesseract 4, but the traineddata file is.
Does this mean that I must retrain the equ module from scratch?

Владимир Калачихин

May 27, 2020, 9:25:10 AM
to tesseract-ocr
Heh, the "equ" language is not listed in language-specific.sh, so training Tesseract 4 on math symbols is impossible.

A general question:

Is there a practical way to create a language model from scratch, for a new, unknown language?

Weslley Torres

May 27, 2020, 11:02:59 AM
to tesseract-ocr
Hi, 

I have a similar situation. In my case I "just" need to detect the equation in the picture; I don't need to "read" it. Knowing the location is enough for me, just like in the paper you mentioned, "A Simple Equation Region Detector for Printed Document Images in Tesseract". However, it seems this feature is OFF by default, as stated here: https://groups.google.com/forum/#!msg/tesseract-ocr/_V7pOll2kPo/JKkJGJMNqUAJ

Did you manage to detect the area of equations in a picture?

Kind regards, 

Владимир Калачихин

May 27, 2020, 12:20:43 PM
to tesseract-ocr
Hi Weslley
On Wednesday, May 27, 2020 at 18:02:59 UTC+3, Weslley Torres wrote:

Did you manage to detect the area of equations in a picture?


I did it with a naive approach, by consolidating areas that contain badly recognized symbols:

Снимок экрана в 2020-05-18 00-10-39.png

It is not good enough for me, so I intend to repeat the approach with image tightness from the article above.
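The "consolidate areas" step can be illustrated with a minimal sketch. This is not the attached script; it only assumes that badly recognized symbols are already available as pixel boxes `(left, top, right, bottom)`, and the `gap` distance of 10 pixels is an arbitrary illustrative value.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def boxes_close(a: Box, b: Box, gap: int) -> bool:
    """True if the boxes overlap or lie within `gap` pixels of each other."""
    return (a[0] - gap <= b[2] and b[0] - gap <= a[2]
            and a[1] - gap <= b[3] and b[1] - gap <= a[3])


def merge_boxes(boxes: List[Box], gap: int = 10) -> List[Box]:
    """Repeatedly merge nearby boxes until no two remaining boxes are close."""
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        result: List[Box] = []
        for box in merged:
            for i, other in enumerate(result):
                if boxes_close(box, other, gap):
                    # Replace the existing region with the union of both boxes.
                    result[i] = (min(box[0], other[0]), min(box[1], other[1]),
                                 max(box[2], other[2]), max(box[3], other[3]))
                    changed = True
                    break
            else:
                result.append(box)
        merged = result
    return merged
```

A larger `gap` produces fewer, bigger regions, which may pull neighbouring plain text into an equation area; the right value likely depends on scan resolution.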

Weslley Torres

May 27, 2020, 3:22:38 PM
to tesseract-ocr
Hi!! 

I think what you accomplished is good enough for me. 
Do you mind sharing your code/script?

Kind regards

Владимир Калачихин

May 27, 2020, 5:01:48 PM
to tesseract-ocr
This is not production code, just a sketch.
badBlocks.py

Weslley Torres

May 27, 2020, 5:19:26 PM
to tesseract-ocr
Thank you very much, I will have a look at it =).

Kind regards, 

Weslley Torres

May 27, 2020, 7:42:23 PM
to tesseract-ocr
Hi, 

You have probably done this already, but in any case, on line 40, try this:

ocrData = pytesseract.image_to_data(thresh, output_type=Output.DICT, config='--tessdata-dir /new/folder/address/Share/ --oem 0 -c textord_equation_detect=1', lang='equ')

Please create a folder containing the files "equ.traineddata" and "eng.traineddata" from this link: https://github.com/tesseract-ocr/tessdata
You might need the configs folder too, but try without it first.

Also try lang='eng+equ'.
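To compare the language settings suggested above side by side, something like the following sketch could be used. The helper names `build_config`, `mean_confidence` and `compare_languages` are hypothetical, and mean word confidence is only a rough proxy for quality. The flags are the ones from the call above; a tessdata path containing spaces would need quoting.

```python
def build_config(tessdata_dir, equation_detect=True):
    """Assemble the Tesseract options used in the call above (legacy engine)."""
    parts = ["--tessdata-dir", tessdata_dir, "--oem", "0"]
    if equation_detect:
        parts += ["-c", "textord_equation_detect=1"]
    return " ".join(parts)


def mean_confidence(ocr):
    """Average confidence over recognized words; rows with conf -1 are not words."""
    confs = [float(c) for c in ocr["conf"] if float(c) >= 0]
    return sum(confs) / len(confs) if confs else 0.0


def compare_languages(image, tessdata_dir, langs=("eng", "equ", "eng+equ")):
    """Run OCR once per language setting and return the mean confidence of each."""
    # Imported here so the helpers above work without Tesseract installed.
    import pytesseract
    from pytesseract import Output

    config = build_config(tessdata_dir)
    return {
        lang: mean_confidence(
            pytesseract.image_to_data(
                image, lang=lang, config=config, output_type=Output.DICT
            )
        )
        for lang in langs
    }
```

Calling `compare_languages(thresh, '/new/folder/address/Share/')` would return a dictionary mapping each language setting to its mean confidence, assuming both traineddata files are in that folder.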


Please let me know whether your results improve or not.

Kind regards, 

On Wednesday, May 27, 2020 at 23:01:48 UTC+2, Владимир Калачихин wrote:

Владимир Калачихин

May 28, 2020, 5:26:08 AM
to tesseract-ocr
Hi Weslley!
On Thursday, May 28, 2020 at 2:42:23 UTC+3, Weslley Torres wrote:
You have probably done this already, but in any case..

Yes, I did.
The equations are recognized very badly, with or without textord_equation_detect=1. This works with the legacy engine only; as I understand it, the LSTM engine does not support the "equ" language at all.

Weslley Torres

May 28, 2020, 7:59:05 AM
to tesseract-ocr
Hi.. 

Yes, indeed, the equations are recognised very badly =/. You are correct that "equ" only works with the legacy engine, but I thought we should use "equ" instead of "eng" for equation detection. I mean, how would "eng" recognise Greek letters? And Greek letters are commonly used in equations.

In any case, I am still learning how to use Tesseract so I might be saying bullshit.  

I will let you know if I manage to improve the results. Please keep me posted if you improve your results too =).

Kind regards, 

Владимир Калачихин

May 28, 2020, 8:07:31 AM
to tesseract-ocr
On Thursday, May 28, 2020 at 14:59:05 UTC+3, Weslley Torres wrote:
I thought we should use "equ" instead of "eng" for equation detection. I mean, how would "eng" recognise Greek letters? And Greek letters are commonly used in equations.

No. The basic idea of my naive equation detection approach is that equations are recognized badly while plain text is recognized well. So I just collect the badly recognized blocks.
These blocks include pictures and tables too, of course.
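A minimal sketch of this idea, not the attached badBlocks.py: it assumes the dictionary returned by `pytesseract.image_to_data(..., output_type=Output.DICT)` and flags every block whose mean word confidence falls below a threshold. The function name `bad_blocks` and the threshold of 60 are illustrative assumptions.

```python
from collections import defaultdict


def bad_blocks(ocr, min_conf=60.0):
    """Return one bounding box (left, top, right, bottom) per badly recognized block."""
    per_block = defaultdict(list)
    for i, text in enumerate(ocr["text"]):
        conf = float(ocr["conf"][i])
        if conf < 0 or not str(text).strip():
            continue  # page, block, paragraph and line rows carry no word
        left, top = ocr["left"][i], ocr["top"][i]
        box = (left, top, left + ocr["width"][i], top + ocr["height"][i])
        per_block[ocr["block_num"][i]].append((conf, box))

    flagged = []
    for block_num in sorted(per_block):
        words = per_block[block_num]
        mean = sum(conf for conf, _ in words) / len(words)
        if mean < min_conf:
            boxes = [box for _, box in words]
            # Union of all word boxes in the block.
            flagged.append((min(b[0] for b in boxes), min(b[1] for b in boxes),
                            max(b[2] for b in boxes), max(b[3] for b in boxes)))
    return flagged
```

Because the only signal is low confidence, the flagged regions cannot be told apart from pictures or tables without an additional filter.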