Hi Alex,
You might consider a template-matching toolkit such as OpenCV [1]. I haven't used it with words, but I suspect it would work well in this kind of situation. OpenCV can also be used to remove basic shapes such as circles, but having a list of the words you want is a huge advantage.
art
---
So far, here are some numbers for those who are interested. I took 4,000 pathway images (more complicated and diverse than the simple case above) and applied both Adobe Acrobat's OCR and Tesseract with custom user-words:
* Adobe found 2,366 unique human gene identifiers
* Tesseract found 2,199 unique human gene identifiers

The two sets did not completely overlap, giving a combined total of 3,187 unique identifiers. That's less than 1 per image, and of course the results were heavily skewed. Adobe's best performance was 44 hits from a single pathway, but it failed to find a single hit on 1,600 pathways. Tesseract's best was 31, but it failed on 1,201 pathways.
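For anyone who wants the overlap spelled out, the figures above imply the following breakdown. This is plain inclusion-exclusion arithmetic on the reported counts, assuming the combined total is the union of the two sets. The variable names are just for illustration.

```python
# Counts reported above.
adobe_unique = 2366
tesseract_unique = 2199
combined_unique = 3187
images = 4000

# Inclusion-exclusion: |A ∩ T| = |A| + |T| - |A ∪ T|
shared = adobe_unique + tesseract_unique - combined_unique
adobe_only = adobe_unique - shared
tesseract_only = tesseract_unique - shared

# Average unique identifiers per image, and the share of images with no hits.
per_image = combined_unique / images
adobe_miss_rate = 1600 / images
tesseract_miss_rate = 1201 / images

print(shared, adobe_only, tesseract_only)  # 1378 988 821
print(round(per_image, 3))                 # 0.797
```

So roughly 1,378 identifiers were found by both tools, while 988 came only from Adobe and 821 only from Tesseract.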