Recognition of chemical formulas


Vadim Fedorov

Dec 17, 2018, 12:48:39 PM
to tesseract-ocr
Hello everyone,

I need some advice. Would it make sense to train a separate model (datafile) exclusively for recognizing chemical formulas?
With the default model for English, the following formula

test5.png

is recognized as "CONH(CH5)3N(CoHs)o" by the LSTM engine, so the subscripts come out wrong. My intuition is that a model trained only on chemical formulas would handle this better.
What do you think?

Shree Devi Kumar

Dec 17, 2018, 1:13:47 PM
to tesser...@googlegroups.com
Please take a look at the related issues regarding subscripts/superscripts (in the langdata or tessdata repos).

As far as I understand, the normalization routines currently in use convert them to regular numbers. Hence, training did not seem to help in my fine-tuning trial.

However, you can give it a try and share your results.
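
For illustration, the conversion described above resembles Unicode compatibility normalization. The sketch below uses Python's standard `unicodedata` module with the NFKC form; whether Tesseract's training pipeline applies exactly this form is an assumption here, not something confirmed in this thread. The helper name `normalize_formula` is made up for the example.

```python
import unicodedata


def normalize_formula(text: str) -> str:
    """Apply Unicode NFKC compatibility normalization to a string.

    Assumption: this only approximates what a training pipeline might do
    to ground-truth text; it is not Tesseract's own routine.
    """
    return unicodedata.normalize("NFKC", text)


# Ground truth written with Unicode subscript digits (U+2086, U+2085).
formula = "C\u2086H\u2085"

# After normalization the subscript digits become ordinary ASCII digits,
# so the model never sees a distinct "subscript" target to learn.
print(normalize_formula(formula))
```

If the ground truth is flattened like this, a fine-tuned model can at best learn to read a small, lowered glyph as the plain digit, which is in fact what the original question is asking for.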



Vadim Fedorov

Dec 18, 2018, 10:29:19 AM
to tesseract-ocr
Thank you, I'll take a look. However, my problem here is more about the subscripts not being recognized as numbers. On which data did you try to fine-tune?

Shree Devi Kumar

Dec 18, 2018, 11:28:15 AM
to tesser...@googlegroups.com
OK. In that case, try fine-tuning with single-line images of chemical equations using the ocrd-train project.

I had used tessdata_best/eng.traineddata for fine-tuning.
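
Before starting such a run, a small pre-flight check on the ground truth can save time. The sketch below assumes the layout described in the ocrd-train README at the time: each line image is a `.tif` file, paired with a single-line transcription in a file of the same name ending in `.gt.txt`. The function name `find_unpaired_lines` and the directory passed to it are hypothetical; adjust them to your checkout.

```python
from pathlib import Path


def find_unpaired_lines(ground_truth_dir):
    """Return (image name, problem) pairs for unusable ground-truth lines.

    Assumption: images are *.tif and transcriptions are *.gt.txt files
    with the same base name, one text line each.
    """
    problems = []
    for image in sorted(Path(ground_truth_dir).glob("*.tif")):
        transcript = image.with_suffix(".gt.txt")
        if not transcript.exists():
            problems.append((image.name, "missing .gt.txt"))
            continue
        lines = transcript.read_text(encoding="utf-8").strip().splitlines()
        if len(lines) != 1:
            problems.append((image.name, "transcription is not a single line"))
    return problems


if __name__ == "__main__":
    # Hypothetical location of the line images and transcriptions.
    for name, problem in find_unpaired_lines("data/ground-truth"):
        print(f"{name}: {problem}")
```

For chemical formulas, the transcription would hold the formula as you want it recognized, for example plain digits in place of subscripts.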

Art Rhyno

Dec 18, 2018, 4:33:45 PM
to tesser...@googlegroups.com

Tesseract’s API allows you to get at the character-level coordinates. One idea is to look at the vertical position of the characters and try to identify the subscripts by their position. If detected, you could extract the glyph programmatically and run Tesseract on it as a single character, which might give more accurate output.
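
A rough sketch of the position check, in Python. It assumes the character boxes arrive as box-file style text, one `char left bottom right top page` entry per line with the origin at the bottom-left, which is the format that, for example, pytesseract's `image_to_boxes` returns. The names `CharBox`, `parse_boxes`, `find_subscripts` and the `drop_ratio` threshold are invented for this example, and the heuristic is crude: letters with descenders (such as the g in Mg) may be flagged too.

```python
from statistics import median
from typing import List, NamedTuple


class CharBox(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int


def parse_boxes(box_text: str) -> List[CharBox]:
    """Parse 'char left bottom right top page' lines into CharBox tuples."""
    boxes = []
    for line in box_text.strip().splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip malformed or whitespace-only entries
        left, bottom, right, top = map(int, parts[1:5])
        boxes.append(CharBox(parts[0], left, bottom, right, top))
    return boxes


def find_subscripts(boxes: List[CharBox], drop_ratio: float = 0.15) -> List[CharBox]:
    """Flag characters that sit lower than the full-height characters.

    y grows upward here, so "lower" means a smaller coordinate.
    drop_ratio is a guessed tolerance, expressed as a fraction of the
    tallest character's height.
    """
    if not boxes:
        return []
    ref_height = max(b.top - b.bottom for b in boxes)
    # Characters close to the tallest one define the baseline and cap line.
    full = [b for b in boxes if (b.top - b.bottom) >= 0.8 * ref_height]
    baseline = median(b.bottom for b in full)
    cap_top = median(b.top for b in full)
    tolerance = drop_ratio * ref_height
    return [
        b for b in boxes
        if b.bottom < baseline - tolerance and b.top < cap_top - tolerance
    ]


# Made-up boxes for "CH3" with the digit set lower and smaller.
sample = "C 10 20 30 50 0\nH 32 20 52 50 0\n3 54 12 66 32 0"
print([b.char for b in find_subscripts(parse_boxes(sample))])
```

Each flagged box could then be cropped from the image, with some padding, and passed back to Tesseract in a single-character page segmentation mode, ideally restricted to digits.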

art
