Text extraction problem with Tesseract for an image

exete...@gmail.com

Nov 24, 2016, 2:16:35 AM

to tesseract-ocr

Tesseract is extracting the text "Figure" as "ngure".

How can I get the "Figure" text from the image above using Tesseract?

Figure 1figure supplement 1 Vera et al.tiff

Allistair

Nov 24, 2016, 3:39:53 AM

to tesser...@googlegroups.com

By figure text, do you mean "Figure 1: figure supplement 1 Vera et al."?

If so, I would take a two-pass approach: crop out the clearly separated figure text at the top right, resample the crop to a Tesseract-friendly resolution (300 DPI here), and then OCR it.

It worked for me (macOS, ImageMagick, Tesseract 3.04.01):

➜ ocr convert -crop 720x100+0+0 Figure1.jpg Figure1_Crop.jpg

➜ ocr convert -density 72 -resample 300x300 Figure1_Crop.jpg Figure1_Resampled.jpg

➜ ocr tesseract Figure1_Resampled.jpg fig1

Tesseract Open Source OCR Engine v3.04.01 with Leptonica

Warning in pixReadMemJpeg: work-around: writing to a temp file

➜ ocr cat fig1.txt

Figure 1: ﬁgure supplement 1 Vera et al.
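
If you need to repeat this for several images, the three commands above can be wrapped in a small shell function. This is only a sketch: the names ocr_region, run and DRY_RUN are made up for illustration, the output file names are my own convention, and the crop geometry still has to be chosen per image (720x100+0+0 is just what fit this one). The convert and tesseract arguments are the same ones used above.

```shell
#!/usr/bin/env bash
# Sketch of the crop -> resample -> OCR steps as a reusable function.
# ocr_region, run and DRY_RUN are hypothetical names, not part of
# ImageMagick or Tesseract.

# Run a command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: ocr_region INPUT GEOMETRY OUTBASE
#   INPUT    source image, e.g. Figure1.jpg
#   GEOMETRY ImageMagick crop geometry, e.g. 720x100+0+0
#   OUTBASE  output base name; Tesseract writes OUTBASE.txt
ocr_region() {
  local input="$1" geometry="$2" outbase="$3"
  local crop="${outbase}_crop.jpg"
  local resampled="${outbase}_resampled.jpg"

  # Pass 1: cut out the strip that contains the caption text.
  run convert -crop "$geometry" "$input" "$crop"
  # Pass 2: declare the crop as 72 DPI and resample it to 300 DPI.
  run convert -density 72 -resample 300x300 "$crop" "$resampled"
  # OCR the enlarged crop.
  run tesseract "$resampled" "$outbase"
}
```

After sourcing the file, "DRY_RUN=1 ocr_region Figure1.jpg 720x100+0+0 fig1" prints the three commands without running them, which is a cheap way to check the geometry and file names before running it for real.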

