Extracting molecular labels from biological pathway images


Alexander Pico

Apr 27, 2015, 3:26:23 PM
to tesser...@googlegroups.com
I am trying to identify the molecules in pathway images. This should be relatively simple for clear, high-res images like the one attached, but my attempts with Tesseract so far are pretty dismal...

It found 9 of 25 molecules. I even have the luxury of knowing in advance all the words I'd like to extract, and I tried supplying these as eng.user-words, but there was no improvement.

I suspect I need to find the magic combination of parameter settings or perhaps image pre-processing.  Any suggestions?

Thanks!
 - Alex
F1.large2.jpg
F1.large2.txt

Art Rhyno.

Apr 27, 2015, 7:44:56 PM
to tesser...@googlegroups.com

Hi Alex,

 

You might consider a template matching toolkit like OpenCV [1]. I haven’t used it with words, but I suspect it would work well in this kind of situation. OpenCV can also be used to remove basic shapes, such as circles and so on, but having a list of the words you want is a huge advantage.
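
As a rough illustration of why a known word list helps, here is a minimal sketch that snaps noisy OCR tokens onto that list after recognition. This is not template matching and not something tried in this thread; it uses Python's standard difflib, and the function name, the sample gene symbols, and the 0.7 cutoff are all made up for the example.

```python
import difflib

def match_tokens(tokens, known, cutoff=0.7):
    """Map noisy OCR tokens onto a known vocabulary (hypothetical helper).

    Comparison is case-insensitive; a token is kept only if its best
    match in `known` has a similarity ratio of at least `cutoff`.
    """
    by_upper = {k.upper(): k for k in known}
    found = set()
    for tok in tokens:
        best = difflib.get_close_matches(tok.upper(), list(by_upper), n=1, cutoff=cutoff)
        if best:
            found.add(by_upper[best[0]])
    return sorted(found)

# Invented sample data: OCR often confuses 1/l and 3/S.
known_words = ["AKT1", "MTOR", "TP53", "PIK3CA"]
ocr_tokens = ["AKTl", "mTOR", "TP5S", "the", "PIK3CA"]
matches = match_tokens(ocr_tokens, known_words)
```

A loose cutoff will also merge identifiers that differ by a single character (AKT1 vs. AKT2, say), so the threshold would need tuning against real data.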

 

art

---

1. http://docs.opencv.org/


Dmitri Silaev

Apr 29, 2015, 5:17:43 PM
to tesser...@googlegroups.com
Hi Alexander,

Tweaking Tesseract parameters won't help you at all. Preprocessing will, though: you need to remove as much of the graphics as possible, leaving only text. The major steps required for this:
1. Threshold image so that all shades of gray become black
2. Label connected components (CCs)
3. Erase CCs that are too big in either X or Y direction, or both (bigger than an average character). This will leave only text
4. Crop regions containing dense text
5. Process these regions one by one with Tesseract to produce final results
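
Steps 1 to 3 above can be sketched in plain Python on a tiny grayscale grid. This is only an illustration of the idea, not the ImageMagick approach mentioned in this post; the function names, the cutoff of 200, and the size limits are assumptions chosen for the example, and a real image would call for an image library rather than nested lists.

```python
from collections import deque

def threshold(gray, cutoff=200):
    """Step 1: every pixel darker than `cutoff` becomes ink (1), the rest background (0)."""
    return [[1 if px < cutoff else 0 for px in row] for row in gray]

def label_components(binary):
    """Step 2: 8-connected component labelling via breadth-first search.

    Returns a list of components, each a list of (row, col) pixels.
    """
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for r in range(h):
        for c in range(w):
            if binary[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue, pixels = deque([(r, c)]), []
                while queue:
                    y, x = queue.popleft()
                    pixels.append((y, x))
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and binary[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                components.append(pixels)
    return components

def erase_large_components(binary, max_w, max_h):
    """Step 3: blank every component whose bounding box exceeds max_w or max_h."""
    out = [row[:] for row in binary]
    for pixels in label_components(binary):
        ys = [p[0] for p in pixels]
        xs = [p[1] for p in pixels]
        if max(xs) - min(xs) + 1 > max_w or max(ys) - min(ys) + 1 > max_h:
            for y, x in pixels:
                out[y][x] = 0
    return out

# Toy 6x10 image: a long gray line (graphics) and a small dark blob (a "character").
gray = [[255] * 10 for _ in range(6)]
gray[0] = [128] * 10
for y in (3, 4):
    for x in (2, 3):
        gray[y][x] = 0

binary = threshold(gray)
cleaned = erase_large_components(binary, max_w=3, max_h=3)
```

In practice max_w and max_h would be derived from the typical character size in the image rather than hard-coded.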

This can be done with e.g. ImageMagick and shell/batch scripting. I can show how if you're interested. For some clues on that, see my post in this thread: https://groups.google.com/forum/#!msg/tesseract-ocr/STHaLGYsiCo/pCT2kxMgwI8J

Best regards,
Dmitri Silaev
www.CustomOCR.com






Alexander Pico

May 3, 2015, 1:46:34 AM
to tesser...@googlegroups.com
Thank you both for the helpful replies. I will certainly look into OpenCV. That's the second independent recommendation I've had for that tool for this particular problem. I also started to dive into preprocessing with ImageMagick. Your blog post was VERY helpful. Unfortunately, my ultimate set of pathway images is quite diverse, and I can't handle it case by case, so there will only be a few things I can reliably apply across all cases.

So far, here are some numbers for those who are interested...

I took 4,000 pathway images (more complicated and diverse than the simple case above) and applied both Adobe Acrobat's OCR and Tesseract with custom user-words:
* Adobe found 2,366 unique human gene identifiers
* Tesseract found 2,199 unique human gene identifiers

And the sets were not completely overlapping, resulting in a combined total of 3,187 unique identifiers.  That's less than 1 per image, and of course the results were heavily skewed. Adobe's best performance was 44 hits from a single pathway, but it failed to find a single hit on 1,600 pathways. Tesseract's best was 31, but it failed on 1,201 pathways.
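
For anyone who wants the overlap spelled out, it follows from the three counts above by inclusion-exclusion. The sketch below assumes the "combined total" is the union of the two sets over the same 4,000 images; the variable names are just for illustration.

```python
adobe = 2366        # unique identifiers found by Adobe Acrobat's OCR
tesseract = 2199    # unique identifiers found by Tesseract
combined = 3187     # union of the two sets
images = 4000

# |A and T| = |A| + |T| - |A or T|
overlap = adobe + tesseract - combined
only_adobe = adobe - overlap
only_tesseract = tesseract - overlap

# Unique identifiers per image across the whole corpus.
per_image = combined / images

print(overlap, only_adobe, only_tesseract, round(per_image, 2))
```

So 1,378 identifiers were found by both tools, 988 only by Adobe, and 821 only by Tesseract, or roughly 0.8 unique identifiers per image overall.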

Tom Morris

May 6, 2015, 2:41:13 PM
to tesser...@googlegroups.com
You might consider looking at some of the papers on text detection in natural images and using the techniques from the later stages in the pipeline.  These are similar to what Dmitri outlined, but reviewing what others have done might give you ideas on additional ways to filter and group connected components (e.g. aspect ratio, inter-CC spacing, etc.).  This Microsoft Research paper describes their pipeline. Obviously you don't need all the front-end edge detection stuff.
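
To make the filtering and grouping idea concrete, here is a minimal sketch that takes connected-component bounding boxes, drops very elongated ones, and merges horizontal neighbours on the same line into word boxes. It is not taken from any of the papers; the function name and both thresholds are assumptions picked for the example, and it relies on the text being horizontal.

```python
def group_boxes_into_words(boxes, max_gap_ratio=0.6, max_aspect=10.0):
    """Group character boxes into word boxes (hypothetical helper).

    boxes: list of (x, y, w, h) with w > 0 and h > 0.
    max_aspect: boxes more elongated than this are treated as graphics.
    max_gap_ratio: largest horizontal gap, as a fraction of character
    height, that still counts as the same word.
    """
    chars = [b for b in boxes if max(b[2] / b[3], b[3] / b[2]) <= max_aspect]
    chars.sort(key=lambda b: b[0])  # left to right
    words = []
    for x, y, w, h in chars:
        for i, (wx, wy, ww, wh) in enumerate(words):
            gap = x - (wx + ww)                          # inter-CC spacing
            overlap = min(y + h, wy + wh) - max(y, wy)   # shared vertical extent
            if gap <= max_gap_ratio * max(h, wh) and overlap > 0.5 * min(h, wh):
                nx, ny = min(x, wx), min(y, wy)
                words[i] = (nx, ny,
                            max(x + w, wx + ww) - nx,
                            max(y + h, wy + wh) - ny)
                break
        else:
            words.append((x, y, w, h))
    return words

# Invented boxes: three close characters, one distant character, one long line.
sample = [(0, 0, 5, 8), (6, 0, 5, 8), (12, 1, 5, 8), (40, 0, 5, 8), (0, 30, 100, 2)]
word_boxes = group_boxes_into_words(sample)
```

Each resulting word box could then be cropped and passed to Tesseract on its own.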

Extracting information from diagrams in publications is an increasingly popular topic.  Your task sounds pretty similar to what's described in this paper in Bioinformatics http://bioinformatics.oxfordjournals.org/content/28/5/739.short
Google Scholar has a list of related papers.  If the two diagrams that you present are representative, your task may be easier since the text is always horizontal.


On Sunday, May 3, 2015 at 1:46:34 AM UTC-4, Alexander Pico wrote:

So far, here are some numbers for those who are interested...

I took 4,000 pathway images (more complicated and diverse than the simple case above) and applied both Adobe Acrobat's OCR and Tesseract with custom user-words:
* Adobe found 2,366 unique human gene identifiers
* Tesseract found 2,199 unique human gene identifiers

And the sets were not completely overlapping, resulting in a combined total of 3,187 unique identifiers.  That's less than 1 per image, and of course the results were heavily skewed. Adobe best performance was 44 hits from a single pathway, but it failed to find a single hit on 1,600 pathways. Tesseract's best was 31, but failed on 1,201 pathways.

What's the denominator, i.e. how many identifiers were there to find?  Is there a one-to-one correspondence between "pathway" and "image"? I'm guessing yes, but I want to check that the change in terminology isn't significant.

Tom

Tom Morris

May 6, 2015, 3:11:31 PM
to tesser...@googlegroups.com