text extraction problem with tesseract for the image


exete...@gmail.com

Nov 24, 2016, 2:16:35 AM
to tesseract-ocr
Tesseract is extracting the text "Figure" as "ngure".

How can I get the Figure text from the attached image using Tesseract?
Figure 1figure supplement 1 Vera et al.tiff

Allistair

Nov 24, 2016, 3:39:53 AM
to tesser...@googlegroups.com
By figure text, do you mean "Figure 1: figure supplement 1 Vera et al."?

If so, I would take a two-pass approach: crop out the clearly separated top right figure text, resize it to a Tesseract-friendly resolution, then OCR it.

It worked for me (macOS, ImageMagick, Tesseract 3.04.01):

➜  ocr convert -crop 720x100+0+0 Figure1.jpg Figure1_Crop.jpg
➜  ocr convert -density 72 -resample 300x300 Figure1_Crop.jpg Figure1_Resampled.jpg
➜  ocr tesseract Figure1_Resampled.jpg fig1
Tesseract Open Source OCR Engine v3.04.01 with Leptonica
Warning in pixReadMemJpeg: work-around: writing to a temp file
➜  ocr cat fig1.txt
Figure 1: figure supplement 1 Vera et al.
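
For repeated use, the three steps above could be wrapped in a small script. This is only a sketch under stated assumptions: the function names `resampled_px` and `ocr_region` are hypothetical, the source density is assumed to be 72 DPI as in the command above, and the `convert` and `tesseract` invocations simply mirror the ones shown in the session. The arithmetic helper illustrates why the resample step helps: going from 72 to 300 DPI scales the 720x100 crop to roughly 3000x417 pixels, so the glyphs are about four times larger when Tesseract sees them.

```shell
#!/bin/sh
# Hypothetical helper names; only the convert and tesseract commands
# come from the session above.

# Pixel size after resampling: round(size * target_dpi / source_dpi).
resampled_px() {
    # $1 = size in pixels, $2 = source DPI, $3 = target DPI
    echo $(( ($1 * $3 + $2 / 2) / $2 ))
}

# Crop a region, resample it from an assumed 72 DPI to 300 DPI, then OCR it.
ocr_region() {
    # $1 = input image, $2 = crop geometry (WxH+X+Y), $3 = output base name
    tmp_dir=$(mktemp -d) || return 1
    convert -crop "$2" "$1" "$tmp_dir/crop.jpg" &&
    convert -density 72 -resample 300x300 "$tmp_dir/crop.jpg" "$tmp_dir/resampled.jpg" &&
    tesseract "$tmp_dir/resampled.jpg" "$3"
    status=$?
    rm -rf "$tmp_dir"    # clean up intermediate images either way
    return $status
}
```

With the functions loaded, `ocr_region Figure1.jpg 720x100+0+0 fig1` should write the recognized text to fig1.txt, and `resampled_px 720 72 300` prints the expected width of the resampled crop. The crop geometry will differ for other images, so it is left as a parameter.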
