Train for big letters at the beginning of sentences (pic)


tlit...@gmail.com

Aug 4, 2020, 7:07:22 AM
to tesseract-ocr
Hello,

Is it possible to train Tesseract to recognize the larger letters at the beginning of sentences? It seems that Tesseract always misses them.

Thanks in advance.
big_low.jpeg

Zdenko Podobny

Aug 4, 2020, 1:39:21 PM
to tesser...@googlegroups.com
Not sure what you mean...

tesseract big_low.jpeg - --psm 6
Warning: Invalid resolution 0 dpi. Using 70 instead.
FY, MINERS.—TO LET, ON LEASE, on such terms as may
be agreed on, the MINERALS in the ESTATE of KNOCKSHINNOCK, lying in
the parish of New Cumnock, and county of Ayr. Acdead vein has been lately discovered


The problem is only with the initial TO, which is IMO caused by the T being two lines tall and followed by smaller letters.

Zdenko



tlit...@gmail.com

Aug 5, 2020, 4:58:20 AM
to tesseract-ocr
That's right, it is that initial "TO". This is just a fraction of the text; there are dozens of examples like "TO" on a single page. But since it spans two lines, I assume there's nothing I can do?

Tom Morris

Aug 5, 2020, 12:12:32 PM
to tesseract-ocr
The technical term for these is "drop-caps," which is useful to know if you want to Google for it.

It's pretty dated now, but Ray's 2007 description of the line finding algorithm says: "Assuming that page layout analysis has already provided text regions of a roughly uniform text size, a simple percentile height filter removes drop-caps and vertically touching characters." [Emphasis added]
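To make the idea of a height filter concrete, here is a minimal sketch in Python. It is not Tesseract's actual code: it uses the median (the 50th percentile) as the reference height, and the `ratio` cut-off and the sample boxes are assumptions chosen only for illustration.

```python
from statistics import median


def split_by_height(boxes, ratio=1.8):
    """Split (x0, y0, x1, y1) boxes into (normal, oversized) by height.

    `ratio` is an assumed cut-off, not a value taken from Tesseract.
    """
    heights = [y1 - y0 for _, y0, _, y1 in boxes]
    typical = median(heights)  # reference height of ordinary glyphs
    normal, oversized = [], []
    for box, height in zip(boxes, heights):
        if height > ratio * typical:
            oversized.append(box)  # drop-cap or vertically touching glyphs
        else:
            normal.append(box)
    return normal, oversized


# Hypothetical component boxes: one two-line-tall initial plus four small glyphs.
boxes = [
    (0, 0, 40, 60),
    (45, 5, 60, 30),
    (62, 5, 78, 30),
    (80, 6, 95, 30),
    (45, 35, 60, 60),
]
normal, oversized = split_by_height(boxes)
```

A filter like this explains why the large initial never reaches the recognizer: it is set aside before the text lines are fitted.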

It looks like the commercial package Omnipage supports drop caps. Teaching Tesseract to recognize them would involve tweaking the internal segmentation and line finding algorithms, not additional training. Another approach would be to do your own segmentation to identify them and recognize them separately as single letters.
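The "do your own segmentation" approach could look roughly like the sketch below, assuming the drop-cap bounding box is already known. The page is modelled as plain rows of grayscale integers so the example stays self-contained; `crop_and_mask` is a hypothetical helper, not part of Tesseract or any library. The crop could then be saved and recognized on its own (Tesseract's `--psm 10` treats an image as a single character), while the masked page is recognized as usual.

```python
def crop_and_mask(page, box, fill=255):
    """Return (crop, masked_page) for a grayscale page stored as rows of ints.

    `box` is (x0, y0, x1, y1) with exclusive right and bottom edges.
    The input page is left unmodified.
    """
    x0, y0, x1, y1 = box
    crop = [row[x0:x1] for row in page[y0:y1]]
    masked = [list(row) for row in page]  # copy each row
    for y in range(y0, y1):
        for x in range(x0, x1):
            masked[y][x] = fill  # paint the drop-cap area white
    return crop, masked


# Tiny fake page: white background, a 2x3 "drop cap", one body-text pixel.
page = [[255] * 6 for _ in range(4)]
for y in range(3):
    for x in range(2):
        page[y][x] = 0
page[1][4] = 0

crop, masked = crop_and_mask(page, (0, 0, 2, 3))
```

The recognized letter would still have to be put back in front of the first line's text, which is left out here.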

There's some general background which may be interesting/useful here: https://how-ocr-works.com/OCR/line-segmentation.html

Tom


tlit...@gmail.com

Aug 6, 2020, 8:00:46 AM
to tesseract-ocr
Okay, I see. Very interesting articles, thank you. Since I don't know any other method for line segmentation, I used the hOCR output from Tesseract and then hocr-tools, which I dug out of some older GitHub issues, and that's how I generated line images for ground truth. Then I manually checked about 500-800 files and trained with them. There were lots of "misses" in the line segmentation, with 2 to 4 lines being "cut" as a single line image, so I corrected all of them. I also used some of those big "drop-caps" as line images, but not many of them.
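Line images that swallowed 2 to 4 text lines can be flagged automatically before the manual check. The sketch below reads the `bbox` values of `ocr_line` elements from hOCR and reports lines much taller than the median; the `ratio` threshold and the sample markup are assumptions for illustration, and other line-level classes that may appear in hOCR are ignored.

```python
import re
from html.parser import HTMLParser
from statistics import median

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class LineBoxes(HTMLParser):
    """Collect the bounding boxes of all elements with class 'ocr_line'."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocr_line" in (attrs.get("class") or "").split():
            match = BBOX.search(attrs.get("title") or "")
            if match:
                self.boxes.append(tuple(int(v) for v in match.groups()))


def suspicious_lines(hocr, ratio=1.6):
    """Return line boxes taller than `ratio` times the median line height."""
    parser = LineBoxes()
    parser.feed(hocr)
    heights = [y1 - y0 for _, y0, _, y1 in parser.boxes]
    if not heights:
        return []
    typical = median(heights)
    return [b for b, h in zip(parser.boxes, heights) if h > ratio * typical]


# Made-up hOCR fragment: two ordinary lines and one that spans two text lines.
sample = """
<div class='ocr_page'>
<span class='ocr_line' id='line_1_1' title="bbox 10 10 500 40; baseline 0 -5">a</span>
<span class='ocr_line' id='line_1_2' title="bbox 10 50 500 81; baseline 0 -5">b</span>
<span class='ocr_line' id='line_1_3' title="bbox 10 90 500 160; baseline 0 -5">c</span>
</div>
"""
flagged = suspicious_lines(sample)
```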

I have never done anything like this, so I'm sure I made some mistakes, since the OCR quality barely improved and the error rate won't go below 0.5 - 0.3%. The images are scans of old books from the 1800s in TIF format at 231 DPI, grayscale, dual pages. Some of them are slightly skewed, which I tried to correct with many different methods, and there's always a drawback. Mind you, sometimes only part of the text is skewed, and that is mixed with page skew, so it is pretty difficult to automate with software. I also tried textcleaner as well as manual ImageMagick tools for binarization and resampling to 400-600 DPI, with resampling being one of the most useful things I tried. Otsu (auto-threshold) degrades the image quality too much: the font loses its sharpness and parts of the letters go missing. Kapur is much better, but it's inconsistent and also slightly loses some font precision, and images that have darker spots turn almost entirely black with Kapur.
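For reference, Otsu's method picks one global threshold per image by maximizing the between-class variance of the gray-level histogram. The sketch below is the textbook version in plain Python, not ImageMagick's implementation, and the sample pixel values are invented. Because the threshold is the same for the whole page, uneven illumination or stains can plausibly push faint strokes to the wrong side of it.

```python
def otsu_threshold(pixels):
    """Return the threshold t (0-255) maximizing between-class variance.

    Pixels with value <= t form one class, the rest form the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * n for i, n in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue  # no pixels in the dark class yet
        w_light = total - w_dark
        if w_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def binarize(pixels, t):
    """Map pixels <= t to black (0) and the rest to white (255)."""
    return [0 if p <= t else 255 for p in pixels]


# Invented data: dark ink at level 10, paper at level 200.
pixels = [10] * 50 + [200] * 50
t = otsu_threshold(pixels)
binary = binarize(pixels, t)
```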

Filip
