Tesseract doesn't detect text properly.


momo898

Jul 8, 2022, 1:20:40 AM
to tesseract-ocr
Hi,
I have a simple image and ran Tesseract OCR 4.x on it (with the eng language), but it didn't detect the text properly.
test.bmp
The OCR result is below. The page-number separators ("...") and the page numbers in particular come out wrong.
Is there any way to improve the accuracy for this?
Also, OCR on a full-page image feels slow. Is there any option to make it faster?
Thanks
Jason
---
Table of Contents

LD. [introductions eee cccecceecsesensesenecsecensesenecseceeseeeecseceeseseecnecsesesereneceesesereneceeenseseneseeneneeeen
1.1. Purpose of this document... ccc ccc ceecsetecsesceecseecsescsecsesescscsececneseeenesseessseenans
1.2. U.S. Electronic Submission Back ground 00.0.0... ccc ee ccetecseseeeceetetseeeneeeeenseseneceeeneeeenees
1.3. CDISC i ccceccccesecsteseststessnsessssessssssecssseecisseecnsseecisseecesseecesseecasseecesseesesseecesseeeaseseeesasess

1.3.1. Operational DataModel (ODM).......0ccccccececeteseeesesesesesesesesesereseseresesesereseneeeneees
1.3.2. Study Data Tabulation M odel (SDTM) ou... cccccceececeseeteseseeresesesesesererseereneeeees

1.3.3. Analysis Dataset Model (ADaM) .u.......ccccccccccccccccccsecccceceneseceecestseceecestaceeeensuseeenenaas


test.zip

ArtmanDC

Jul 8, 2022, 2:17:07 PM
to tesseract-ocr
I got similarly poor results with v5.0.1.20220118:
Table of Contents
LD. [introduction ccc cccccccccccccescescescescessessessessessessessessessessesseeesseseecescseceeseeseesiscasesseseeeseesees
1.1. Purpose of this document... ccc ccc esceecsesecsesceecsesensescseceesecsesciesseecseensssenesseeenans
1.2. U.S. Electronic Submission Back ground 0.0.0.0... ccc ce ccececsesceeceetetseeeeceeenseeenecseneneeeeness
1.3. CDS cccccccccccccseccessescsecsecsecsesecsessessessesesscsecsescseseesessessessesesssecsecsessesessessessesessseesesaees
1.3.1. Operational DataModel (ODM)... ccc ccecceteseseteseseseseseseseseseseseseseseeerereeesereneees
1.3.2. Study Data Tabulation M odel (SDTM) .0....ccccccccecececeseseteseeseresesesesesesereeerereaetes
1.3.3. Analysis Dataset Model (ADaM) .o.......ccccccccccccccccccccccsccecenessceecestseceeerstsceeeensusceeeesas
#   #   #   #   #   #   #   #   #   #   #   #  

In other contexts I have gotten gibberish like this with rows of dots, or even standard ellipses (...).
Your original is very low res, which may be the issue. In my case I'm working with scanned microfilm, a less than ideal source.
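If resolution is indeed the culprit, one thing worth trying is upscaling the image before OCR. The sketch below is only an illustration, not a confirmed fix for this file: it assumes the Pillow and pytesseract packages are installed, that the scan is roughly 96 DPI (a guess; the real value is not known from this thread), and that about 300 DPI is a reasonable target. The helper names `upscaled_size` and `ocr_upscaled` are made up for this example.

```python
import math


def upscaled_size(width, height, source_dpi, target_dpi=300):
    """Return (new_width, new_height) that brings an image from
    source_dpi up to target_dpi. Never shrinks the image."""
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    factor = max(1.0, target_dpi / source_dpi)
    return math.ceil(width * factor), math.ceil(height * factor)


def ocr_upscaled(path, source_dpi=96, target_dpi=300, psm=6):
    """Upscale the image at `path`, then run Tesseract on it.

    source_dpi=96 is an assumption about the scan, not a measured value.
    psm=6 asks Tesseract to treat the page as a single uniform block of text.
    """
    # Third-party imports stay local so upscaled_size() works without them.
    from PIL import Image
    import pytesseract

    with Image.open(path) as img:
        new_size = upscaled_size(img.width, img.height, source_dpi, target_dpi)
        # Convert to grayscale and resample with a high-quality filter.
        big = img.convert("L").resize(new_size, Image.LANCZOS)

    # Tell Tesseract which resolution the resampled image now represents.
    config = f"--psm {psm} --dpi {target_dpi}"
    return pytesseract.image_to_string(big, lang="eng", config=config)
```

Calling `print(ocr_upscaled("test.bmp"))` would then show whether the dot leaders and page numbers come out any cleaner. Whether this helps with rows of dots specifically is untested here; the page segmentation mode and the assumed source DPI are both knobs to experiment with.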
Good luck!