Dropped single character words


Clinton Graham

Aug 24, 2017, 15:06:57
to tesseract-ocr
Do you have any simple suggestions for improving OCR quality where tesseract is missing single character words like "a" and "I"?

I'm using the default packages available in Ubuntu:
tesseract 3.03
 leptonica-1.70
  libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0

I've also tried updating Ubuntu, building later 3.x sources:
tesseract 3.05.01
 leptonica-1.74.4
  libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8

I'm using a command line run of simply:
tesseract -psm 1 -l eng $f $f pdf

I've also tried -psm 6 based on another forum post (though some of my input will be multicolumn).

In either case, the first paragraph of my TIFF (attached) is consistently read without any of its single character words:

Honors Award {Presentation to Robert H. Ivy, M.D., D.D.S., Sc.D., F_‘.A.C.S. At the business meeting .of the American Cleft Palate Association on May 6, 1961 in Montreal, Canada, an Honors and Awards Committee was established and its duties were set forth. The Executive Committee then selected Dr. Robert Ivy to be the first recipient of an Honors Award. An HOnors and Awards Committee was then selected by the President; serve as the current chairman. It therefore becomes personal honor and privilege to me to be able to present this first award to good friend. Dr. Ivy has had long and brilliant career in the field of plastic surgery with particular interest in the cleft lip and palate patient. It will be possible for us to mention only very few of Dr. Ivy’s many accomplishments in our allotted time here today. would, therefore, like to recommend to you two publications which will give you more insight into the life of our honored guest.

I'm hoping this sample and description are also representative of other dropped characters, such as single numerals in pagination and, in some instances, single initials.
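One rough way to spot pages affected by this is to count the standalone one-character tokens in the OCR text: ordinary English prose almost always contains a few ("a", "I", single digits), so a long paragraph with none is suspicious. The following is only a minimal sketch of that check in Python. The function name `single_char_words` is made up for illustration and is not part of tesseract.

```python
import string


def single_char_words(text):
    """Return the standalone one-character alphanumeric tokens in text.

    Hypothetical helper for illustration; not part of tesseract.
    Tokens are split on whitespace and stripped of surrounding
    punctuation, so "6," counts as "6" but "M.D.," does not count.
    """
    found = []
    for token in text.split():
        core = token.strip(string.punctuation)
        if len(core) == 1 and core.isalnum():
            found.append(core)
    return found


# A sentence from the OCR output above: no single-character words survive.
ocr_line = ("It therefore becomes personal honor and privilege to me "
            "to be able to present this first award to good friend.")
print(single_char_words(ocr_line))
```

An empty list for a full paragraph of prose is a hint, not proof, that words were dropped; comparing against the page image is still needed.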

Unfortunately, I don't have a lot of time to devote to this project, so is there anything easy and obvious that I'm missing?

Thanks,

- Clinton Graham

Systems Developer

University of Pittsburgh | University Library System

412-383-1057


00030001.tif

ShreeDevi Kumar

Aug 24, 2017, 15:12:03
to tesser...@googlegroups.com
You can try building the latest GitHub source for 4.0alpha and testing with best/eng.traineddata from the tessdata repository.
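For anyone scripting the comparison between builds, here is a hedged Python sketch that assembles the same command line as in the original post, optionally pointing at a separate traineddata directory. The helper names and the example directory are hypothetical. As far as I understand, 3.05 and later accept `--psm`, while older builds such as 3.03 only take the single-dash `-psm` form used above; check `tesseract --help` on your build.

```python
import subprocess


def build_tesseract_cmd(image, outbase, psm=1, lang="eng",
                        tessdata_dir=None, legacy_psm_flag=False):
    """Assemble a tesseract command line as a list of arguments.

    Hypothetical helper for illustration. tessdata_dir, if given, is
    assumed to be a directory that contains <lang>.traineddata.
    """
    cmd = ["tesseract", image, outbase, "-l", lang]
    if tessdata_dir:
        cmd += ["--tessdata-dir", tessdata_dir]
    # Older 3.x builds use -psm; newer builds use --psm (assumption, see above).
    cmd += ["-psm" if legacy_psm_flag else "--psm", str(psm)]
    cmd.append("pdf")  # same output config as the original command
    return cmd


def run_ocr(image, outbase, **kwargs):
    """Run tesseract and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_tesseract_cmd(image, outbase, **kwargs), check=True)


# Example (the directory is a placeholder for wherever the "best" model is saved):
# run_ocr("00030001.tif", "00030001", psm=1, tessdata_dir="/path/to/tessdata_best")
```

Keeping the command construction separate from the call makes it easy to print and compare the exact arguments used for each build.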

-- Excuse the brevity, msg sent from phone.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/e0b62d2b-2e27-4732-b4fe-8d5b78c52d98%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

ShreeDevi Kumar

Aug 24, 2017, 15:15:22
to tesser...@googlegroups.com
There is an unofficial PPA package available with the latest code, if you do not want to build it.

-- Excuse the brevity, msg sent from phone.

ShreeDevi Kumar

Aug 25, 2017, 7:54:25
to tesser...@googlegroups.com

Clinton Graham

Aug 25, 2017, 8:14:58
to tesseract-ocr
Thanks for the suggestion.  The 4.0 alpha does seem to be providing better results out of the box.  I pulled the Windows installer:
tesseract 4.00.00alpha
 leptonica-1.74.1
  libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.6.20 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.3 : libopenjp2 2.1.0

Enjoy,


- Clinton Graham
Systems Developer
University of Pittsburgh | University Library System

412-383-1057



ShreeDevi Kumar

Aug 25, 2017, 8:50:23
to tesser...@googlegroups.com

ShreeDevi
____________________________________________________________
Bhajan - Kirtan - Aarti @ http://bhajans.ramparivar.com
