How are you specifying the output format? For example, if you use the default pdf config file, it includes the line:
tessedit_pageseg_mode 1
which may override your intended -psm flag.
leohocr) that contains only:
load_system_dawg 0
load_freq_dawg 0
tessedit_create_hocr 1
tesseract clean01.tif t01_3 -c tessedit_pageseg_mode=3 leohocr
tesseract clean01.tif t01_5 -c tessedit_pageseg_mode=5 leohocr
[...]
tesseract clean01.tif t01_11 -c tessedit_pageseg_mode=11 leohocr
tesseract clean01.tif t01_12 -c tessedit_pageseg_mode=12 leohocr
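If you want to script the sweep rather than typing each command, something like the following might do. It is only a sketch: it assumes the same clean01.tif input, the leohocr config file and the t01_N output naming used above, and the helper names psm_command and run_sweep are made up for illustration. The list of modes is whatever you pass in; only the four shown explicitly above are used here.

```python
import subprocess


def psm_command(image, psm, config="leohocr", prefix="t01"):
    """Build the argv for one tesseract run with an explicit page segmentation mode."""
    return [
        "tesseract", image, f"{prefix}_{psm}",
        "-c", f"tessedit_pageseg_mode={psm}",
        config,
    ]


def run_sweep(image, modes, dry_run=True):
    """Return one command per mode; also execute them unless dry_run is set."""
    commands = [psm_command(image, psm) for psm in modes]
    if not dry_run:
        for argv in commands:
            # check=True raises if tesseract exits with an error
            subprocess.run(argv, check=True)
    return commands


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them
    for argv in run_sweep("clean01.tif", [3, 5, 11, 12]):
        print(" ".join(argv))
```

Passing the mode with -c on the command line keeps it visible in one place, so it is easier to see which value each output file was produced with.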
Having said that, you probably have more information than tesseract about the page layout, so you may want to try doing page segmentation yourself and feeding the resulting columns or cells to tesseract for recognition individually.
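As a rough illustration of doing the segmentation yourself, one simple option is a vertical projection profile: look for runs of empty pixel columns and split the page there. The sketch below assumes the page has already been binarized and loaded as a list of rows (1 = ink, 0 = background); the function name find_columns and the min_gap parameter are hypothetical, and cropping each range and passing it to tesseract is left out.

```python
def find_columns(binary, min_gap=2):
    """Return (start, end) x-ranges that contain ink, split wherever at
    least min_gap consecutive pixel columns are empty. end is exclusive."""
    if not binary:
        return []
    width = len(binary[0])
    # Vertical projection: does pixel column x contain any ink at all?
    has_ink = [any(row[x] for row in binary) for x in range(width)]

    spans = []
    start = None      # left edge of the span currently being built
    last_ink = None   # most recent pixel column that had ink
    for x, ink in enumerate(has_ink):
        if not ink:
            continue
        if start is None:
            start = x
        elif x - last_ink - 1 >= min_gap:
            # The empty run is wide enough: close the span and start a new one
            spans.append((start, last_ink + 1))
            start = x
        last_ink = x
    if start is not None:
        spans.append((start, last_ink + 1))
    return spans
```

On a real scan min_gap would need tuning so that gaps between letters are ignored while gaps between columns or table cells are not.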
Hi Leo,
Your example has such good contrast that you might consider using the colors to identify single characters. I have attached a quick sample of what I mean. I used OpenCV and defer greatly to the blog post I reference at the top of the script, but the idea would be to try to catch single characters using OpenCV’s “inRange” function. I would run tesseract on the image first, then weed out blobs for further processing based on the coordinates of what tesseract has already detected, and finally use single character mode on what’s left. Feel free to ping me if you are interested in this approach.
art
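The "weed out blobs" step could look roughly like this. This is not the attached script, just a sketch of the idea under a few assumptions: the word boxes come from tesseract's hOCR output (ocrx_word spans with a bbox in the title attribute), and the candidate blobs are (x0, y0, x1, y1) boxes you have already obtained some other way, for example from a color mask. The function names are made up for illustration.

```python
import re

# hOCR keeps each word box in a title attribute such as
# title='bbox 36 92 96 116; x_wconf 95'
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
WORD_TAG_RE = re.compile(r"<span[^>]*class=['\"]ocrx_word['\"][^>]*>")


def hocr_boxes(hocr_text):
    """Extract (x0, y0, x1, y1) boxes from the ocrx_word spans in hOCR text."""
    boxes = []
    for tag in WORD_TAG_RE.findall(hocr_text):
        match = BBOX_RE.search(tag)
        if match:
            boxes.append(tuple(int(v) for v in match.groups()))
    return boxes


def overlaps(a, b):
    """True if two (x0, y0, x1, y1) boxes share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def leftover_blobs(blobs, word_boxes):
    """Keep only the blobs that tesseract has not already covered with a word."""
    return [b for b in blobs if not any(overlaps(b, w) for w in word_boxes)]
```

Each leftover box could then be cropped and run through tesseract with tessedit_pageseg_mode=10, which treats the image as a single character.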
For sure, best of luck!
art