Recognition of chemical formulas

Vadim Fedorov

Dec 17, 2018, 12:48:39 PM

to tesseract-ocr

Hello everyone,

I need some advice. Would it make sense to train a separate model (traineddata file) exclusively for recognizing chemical formulas?

With the default model for English the following formula

is recognized as "CONH(CH5)3N(CoHs)o" by the LSTM engine, so the mistakes are in the subscripts. My intuition is that a model trained only on chemical formulas would handle them better.

What do you think?

Shree Devi Kumar

Dec 17, 2018, 1:13:47 PM

to tesser...@googlegroups.com

Please take a look at the related issues regarding subscripts/superscripts (in the langdata or tessdata repos).

As far as I understand, the normalization routines currently in use convert them to regular digits. Hence, training did not seem to help in my fine-tuning trial.

However, you can give it a try and share your results.
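To illustrate why such normalization would defeat subscript training: the thread does not say which routine Tesseract applies, so the sketch below simply assumes it behaves like Unicode NFKC compatibility normalization. Under that assumption, ground-truth text written with subscript characters collapses to plain digits before the model ever sees it. The function name `normalize_label` is hypothetical.

```python
import unicodedata


def normalize_label(text: str) -> str:
    """Apply NFKC normalization (an assumed stand-in for the training-time routine)."""
    return unicodedata.normalize("NFKC", text)


# Ground truth written with Unicode subscript digits (U+2082, U+2083).
ground_truth = "CH₃CH₂OH"
normalized = normalize_label(ground_truth)

# After normalization the subscript digits are ordinary ASCII digits,
# so the training target no longer distinguishes "₂" from "2".
print(repr(ground_truth), "->", repr(normalized))
```

If the real routine works this way, a fine-tuned model can at best learn to output "2" for a subscript glyph, not a distinct subscript character.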

Vadim Fedorov

Dec 18, 2018, 10:29:19 AM

to tesseract-ocr

Thank you, I'll take a look. However, my problem is rather that the subscripts are not recognized as numbers in the first place. Which data did you use for fine-tuning?

Shree Devi Kumar

Dec 18, 2018, 11:28:15 AM

to tesser...@googlegroups.com

OK. In that case, try fine-tuning with single-line images of chemical equations using the ocrd-train project.

I used tessdata_best/eng.traineddata as the starting model for fine-tuning.
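Before starting such a run, it may help to check that every line image has a transcription. The sketch below assumes the ocrd-train convention of a `.tif` line image paired with a same-named `.gt.txt` file holding its single-line text; the function name and the example file names are hypothetical.

```python
from typing import Iterable, List, Tuple

IMAGE_SUFFIX = ".tif"    # assumed line-image extension
TEXT_SUFFIX = ".gt.txt"  # assumed transcription extension


def unpaired_lines(filenames: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (images without a transcription, transcriptions without an image)."""
    names = list(filenames)
    images = {n[: -len(IMAGE_SUFFIX)] for n in names if n.endswith(IMAGE_SUFFIX)}
    texts = {n[: -len(TEXT_SUFFIX)] for n in names if n.endswith(TEXT_SUFFIX)}
    return sorted(images - texts), sorted(texts - images)


# Hypothetical directory listing: eq002 lacks text, eq003 lacks an image.
listing = ["eq001.tif", "eq001.gt.txt", "eq002.tif", "eq003.gt.txt"]
missing_text, missing_image = unpaired_lines(listing)
print("images without text:", missing_text)
print("text without images:", missing_image)
```

For a real directory, the listing could come from something like `(p.name for p in pathlib.Path(ground_truth_dir).iterdir())`, where `ground_truth_dir` is wherever the training pairs are kept.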

Art Rhyno

Dec 18, 2018, 4:33:45 PM

to tesser...@googlegroups.com

Tesseract’s API gives you access to character-level coordinates. One idea is to look at the vertical position of each character and identify subscripts by where they sit relative to the rest of the line. Once a subscript is detected, you could extract the glyph programmatically and run Tesseract on it as a single character, which might give more accurate output.
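A minimal sketch of the position-based idea, working on character boxes in Tesseract's box format (`char left bottom right top page`, with the origin at the bottom-left of the image). The heuristic and its thresholds (`drop`, `max_top`) are assumptions chosen for illustration, not values from Tesseract: the tallest glyphs are taken to be capitals sitting on the baseline, and a glyph counts as a subscript if it dips clearly below that baseline while its top stays well under cap height. Letters with descenders, such as the "g" in "Mg", can trigger false positives.

```python
from statistics import median
from typing import List, NamedTuple


class CharBox(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int


def parse_boxes(box_text: str) -> List[CharBox]:
    """Parse box-format lines: 'char left bottom right top page'."""
    boxes = []
    for line in box_text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        left, bottom, right, top = map(int, parts[1:5])
        boxes.append(CharBox(parts[0], left, bottom, right, top))
    return boxes


def find_subscripts(boxes: List[CharBox], drop: float = 0.15,
                    max_top: float = 0.75) -> List[CharBox]:
    """Return boxes that sit low on the line (y grows upward)."""
    if not boxes:
        return []
    tallest = max(b.top - b.bottom for b in boxes)
    # Treat near-full-height glyphs as the reference for baseline and cap height.
    tall = [b for b in boxes if b.top - b.bottom >= 0.8 * tallest]
    baseline = median(b.bottom for b in tall)
    cap_height = median(b.top for b in tall) - baseline
    if cap_height <= 0:
        return []
    return [
        b for b in boxes
        if b.bottom < baseline - drop * cap_height   # dips below the baseline
        and b.top < baseline + max_top * cap_height  # stays under cap height
    ]


# Hypothetical boxes for "C2H5" with the digits set as subscripts.
sample = "C 10 20 30 50 0\n2 32 12 42 32 0\nH 44 20 66 50 0\n5 68 12 78 32 0\n"
print([b.char for b in find_subscripts(parse_boxes(sample))])
```

With the pytesseract wrapper, `image_to_boxes` returns text in this format. Each flagged box could then be cropped (flipping the y coordinates if the imaging library uses a top-left origin) and passed back to Tesseract in single-character mode (`--psm 10`).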