Re: Math / equation detection module for Tesseract 3.02


Igor

Jan 30, 2015, 6:13:31 PM
to tesser...@googlegroups.com

I have tried to OCR this simple image using the math module with this command:

tesseract.exe simple.bmp output -l equ
↕↕⊣⊳↗∫

But as you can see, the output is garbage.
When I use tesseract with the English dictionary I get a better result:

tesseract.exe simple.bmp output -l eng
11+X

Is it possible to OCR fractions that span several lines using the math module, or is this module only an optimization for in-line math symbols?


Matan Safriel

May 15, 2015, 6:48:49 AM
to tesser...@googlegroups.com
Same here (on Linux in my case).
Version 3.03.

I also tried dropping in the equ.traineddata from the latest downloadable equations add-on, which might not be compatible with this version. The results are the same with and without it.

Errors during the run:
Bad properties for index 1, char ⊒: 255,0 255,0 0,32767 0,32767 0,32767
Bad properties for index 2, char ⊟: 255,0 255,0 0,32767 0,32767 0,32767
Bad properties for index 3, char ⇓: 255,0 255,0 0,32767 0,32767 0,32767
Bad properties for index 4, char ≆: 255,0 255,0 0,32767 0,32767 0,32767
Bad properties for index 5, char ⊅: 255,0 255,0 0,32767 0,32767 0,32767
Bad properties for index 6, char ↧: 255,0 255,0 0,32767 0,32767 0,32767
Bad properties for index 7, char ⇥: 255,0 255,0 0,32767 0,32767 0,32767
Bad properties for index 8, char ∅: 255,0 255,0 0,32767 0,32767 0,32767
Bad properties for index 9, char ⋕: 255,0 255,0 0,32767 0,32767 0,32767
.
.
.
.

Tom Morris

May 15, 2015, 2:47:13 PM
to tesser...@googlegroups.com
A few notes on equation detection:

- In the 3.02 announcement, the feature is listed as "experimental equation detector" (emphasis added).
- There's no documentation on what's actually in the "equ" trained data file.
- The equation detector appears to be turned off by default: https://code.google.com/p/tesseract-ocr/source/browse/ccmain/tesseractclass.cpp#486

Note also that the feature is called detection, not recognition, so it's entirely possible it's just something to help out with page segmentation.
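
If the detector really is off by default, one way to experiment is to switch it on through a config file. The sketch below only writes such a file and builds the command line; it does not run anything. It assumes the variable at the linked source line is named textord_equation_detect and that tesseract accepts a config file path as a trailing argument. The file name equ_detect and the helper function are made up for illustration, and I have not verified that this changes the output.

```python
import os
import tempfile


def build_equation_detect_command(image, output_base, lang="eng", config_dir=None):
    """Write a config file enabling equation detection and return (config_path, argv).

    Nothing is executed here. The variable name textord_equation_detect is an
    assumption based on the tesseractclass.cpp line linked above.
    """
    config_dir = config_dir or tempfile.mkdtemp()
    # "equ_detect" is a hypothetical file name chosen for this sketch.
    config_path = os.path.join(config_dir, "equ_detect")
    with open(config_path, "w") as f:
        # Tesseract config files are plain "name value" lines.
        f.write("textord_equation_detect 1\n")
    # Config files are passed as trailing arguments after the language option.
    argv = ["tesseract", image, output_base, "-l", lang, config_path]
    return config_path, argv


config_path, argv = build_equation_detect_command("simple.bmp", "output")
print(" ".join(argv))
```

The printed command can be pasted into a shell. Given that the feature is detection rather than recognition, I would expect it to affect how regions are segmented, not to turn a stacked fraction into readable text.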

Tom

Tom Morris

May 15, 2015, 3:02:05 PM
to tesser...@googlegroups.com
p.s. For anyone interested in the topic, here's a master's thesis on math equation detection and segmentation (the work was done using Tesseract):

Matan Safriel

May 16, 2015, 4:44:05 AM
to tesser...@googlegroups.com
Thanks Tom,

I guess turning the feature on in tesseract might indeed help its performance :-)
As for the thesis, it is very thorough, informative, well written and objectively self-evaluated. From its Table 14, I am not really able to convince myself that plugging its MEDS module into tesseract would perform better than tesseract 3.02 does as is. Table 14 seems to show a trade-off between increasing true positives and decreasing false positives. According to that table, the false discovery rate doubles when using the MEDS module.
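
To make the false discovery rate point concrete, here is a minimal sketch of the usual definition, FDR = FP / (FP + TP). The counts are hypothetical and chosen only to show how the rate can double while true positives still go up; they are not the numbers from Table 14.

```python
def false_discovery_rate(true_positives, false_positives):
    """FDR = FP / (FP + TP): the share of reported detections that are wrong."""
    detections = true_positives + false_positives
    if detections == 0:
        return 0.0  # no detections reported, so nothing can be a false discovery
    return false_positives / detections


# Hypothetical counts for illustration only, NOT taken from the thesis.
baseline = false_discovery_rate(true_positives=80, false_positives=20)
with_module = false_discovery_rate(true_positives=90, false_positives=60)
print(baseline, with_module)
```

In this made-up case the second configuration finds more true equations, yet a larger share of what it reports is wrong, which is the kind of trade-off I mean.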

Matan


