Okay, I see. Very interesting articles, thank you. Since I don't know any other method for line segmentation, I used the hOCR output from Tesseract and then hocr-tools (I dug that up in some older GitHub issues) to generate the line images for the ground truth. Then I manually checked about 500-800 files and trained with them. There are lots of "misses" in the line segmentation, with 2 to 4 lines being cut out as a single line image, so I corrected all of those. I also kept the big drop-caps as line images, but there aren't many of them.
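For anyone trying the same thing: the cropping step boils down to reading the `ocr_line` bounding boxes from the hOCR file and cutting them out of the page image. This is only a minimal sketch of roughly what hocr-tools does for that part (with placeholder file names), not the exact command I ran:

```python
# Minimal sketch: crop line images + text from a Tesseract hOCR file.
# File names are placeholders, not my actual paths.
from lxml import html
from PIL import Image
import re

HOCR = "page_001.hocr"   # from: tesseract page_001.tif page_001 hocr
SCAN = "page_001.tif"    # the matching page scan

page = Image.open(SCAN)
doc = html.parse(HOCR)

# hOCR marks every text line with class="ocr_line" and a title attribute
# like 'bbox 120 340 1580 392; baseline ...'
for i, el in enumerate(doc.xpath('//*[@class="ocr_line"]')):
    m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", el.get("title", ""))
    if not m:
        continue
    x0, y0, x1, y1 = map(int, m.groups())
    page.crop((x0, y0, x1, y1)).save(f"line_{i:04d}.png")

    # keep the recognised text next to the crop as the ground-truth file
    # that then gets checked and corrected by hand
    text = " ".join(t.strip() for t in el.itertext() if t.strip())
    with open(f"line_{i:04d}.gt.txt", "w", encoding="utf-8") as f:
        f.write(text + "\n")
```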
I've never done anything like this before, so I'm sure I made some mistakes, since the OCR quality barely improved and the error rate won't go below 0.3-0.5%. The images are scans of old books from the 1800s in TIF format, 231 DPI, grayscale, dual pages. Some of them are slightly skewed, which I tried to correct with many different methods, and there's always a drawback. Often only part of the text is skewed, mind you, mixed with page-level skew, which is pretty difficult to handle automatically with software.

I also tried textcleaner as well as manual ImageMagick processing for binarization and resampling to 400-600 DPI, with that resampling being one of the most useful things I tried. `-auto-threshold OTSU` destroys/degrades the image quality too much: the font loses all sharpness and parts of the letters disappear. Kapur is much better, but it's inconsistent and also loses a bit of font precision, and images with darker spots come out basically all black with Kapur.
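For reference, this is roughly the kind of pipeline I mean, wrapped in Python so the threshold methods are easy to swap and compare; the folder names, deskew threshold, and target DPI here are just placeholders, not the settings I ended up with:

```python
# Rough sketch of an ImageMagick preprocessing pass (grayscale -> page-level
# deskew -> resample -> global auto-threshold), driven via subprocess.
# Requires ImageMagick 7 ("magick") on PATH; values below are assumptions.
import subprocess
from pathlib import Path

def preprocess(src: Path, dst: Path, dpi: int = 600, method: str = "Kapur") -> None:
    """Run one page through the grayscale/deskew/resample/threshold chain."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "magick", str(src),
            "-colorspace", "Gray",
            "-deskew", "40%",            # fixes whole-page skew only, not per-line skew
            "-resample", str(dpi),       # upsampling from 231 DPI helped the most
            "-auto-threshold", method,   # "OTSU" or "Kapur"
            str(dst),
        ],
        check=True,
    )

for tif in sorted(Path("scans").glob("*.tif")):        # "scans" is a placeholder folder
    preprocess(tif, Path("preprocessed") / tif.name)
```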