Hello Tesseract Gurus!
I'm working on pdf2text extraction for legal documents. I've done some searching and found tips to improve quality, but I was wondering if someone here can provide info beyond the basics. On the image processing side I've been: resizing to 600 dpi, correcting for skew angle and [denoising with a median filter, contrast stretching, dilating with a small structuring element and Otsu thresholding]. All of those things improved the results only for a subset of the documents, and given how widely the documents vary in acquisition quality, noise level and contrast, I've realized the imaging pipeline is not one-size-fits-all. I'm considering creating parallel image processing pipelines, running OCR on all of them and just picking the best result.
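To make the "run several pipelines and keep the best" idea concrete, here is a minimal sketch of the selection step only. Everything in it is hypothetical naming on my part: `best_pipeline_result`, the `pipelines` dict, and the `ocr` and `score` callables are placeholders, and the real preprocessing and Tesseract calls would be plugged in by whoever uses it.

```python
from typing import Callable, Dict, Tuple, TypeVar

Image = TypeVar("Image")


def best_pipeline_result(
    image: Image,
    pipelines: Dict[str, Callable[[Image], Image]],
    ocr: Callable[[Image], str],
    score: Callable[[str], float],
) -> Tuple[str, str, float]:
    """Run every preprocessing pipeline, OCR each result, keep the best text.

    Returns (pipeline_name, text, score_value). On a tie, the pipeline that
    comes first in the dict's insertion order wins.
    """
    if not pipelines:
        raise ValueError("at least one pipeline is required")

    best_name, best_text, best_score = "", "", float("-inf")
    for name, preprocess in pipelines.items():
        # Hypothetical hooks: `preprocess` would wrap the image operations,
        # `ocr` would wrap the Tesseract call, `score` is a text-quality metric.
        text = ocr(preprocess(image))
        value = score(text)
        if value > best_score:
            best_name, best_text, best_score = name, text, value
    return best_name, best_text, best_score
```

Keeping the pipelines as named callables means a new variation (for example, the same steps without dilation) is one more dict entry, and the winning name can be logged per document to see which variations actually earn their cost.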
1. Can anyone comment on some good variations of an imaging pipeline for 'high variance dirty-text'? Or alternatively, can anyone think of an imaging pipeline that would cover a wider range of document quality?
As far as Tesseract parameters go, I've put together a small parameter exploration loop that evaluates each iteration with a text-quality metric (M2, based on the number of dictionary words, using pyenchant). Here is an example for the linesize value parameter for a single document:
+++For linesize value, 1.25, M2 value is 0.661157024793
+++For linesize value, 1.35, M2 value is 0.661157024793
+++For linesize value, 1.45, M2 value is 0.644628099174
+++For linesize value, 1.55, M2 value is 0.611111111111
+++For linesize value, 1.65, M2 value is 0.0
+++For linesize value, 1.75, M2 value is 0.693693693694
+++For linesize value, 1.85, M2 value is 0.672413793103
+++For linesize value, 1.95, M2 value is 0.631578947368
+++For linesize value, 2.05, M2 value is 0.0
Here the best linesize value for that document is 1.75, which yields good results (the ). From this info,
2. Can anyone recommend some good parameters to apply this method with? Any other tips on combining parameters into something more general, or any other exploration tips? Thanks for reading! Any other potentially useful tips or info would be greatly appreciated, whether it's on the image processing or on the Tesseract parameters.
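For anyone who wants to reproduce the exploration loop described above, here is a rough sketch under stated assumptions. `dictionary_word_ratio` is only a stand-in for M2 (the exact definition may differ), `sweep_parameter` is a hypothetical helper, and treating a score of 0.0 as a failed run is my guess at what the zero rows mean. In practice `is_word` could be `enchant.Dict("en_US").check`, and `run_ocr` could wrap something like `pytesseract.image_to_string(image, config=f"-c textord_min_linesize={value}")`, assuming that is the Tesseract variable behind "linesize".

```python
import re
from typing import Callable, Dict, Iterable, Optional, Tuple

_WORD = re.compile(r"[A-Za-z]+")


def dictionary_word_ratio(text: str, is_word: Callable[[str], bool]) -> float:
    """Fraction of alphabetic tokens accepted by `is_word` (0.0 if no tokens)."""
    tokens = _WORD.findall(text)
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if is_word(token)) / len(tokens)


def sweep_parameter(
    values: Iterable[float],
    run_ocr: Callable[[float], str],
    score: Callable[[str], float],
) -> Tuple[Dict[float, float], Optional[float]]:
    """Score the OCR output for each parameter value and pick the best one.

    Returns (scores_by_value, best_value). Values scoring 0.0 are treated as
    failed runs (an assumption); best_value is None if every run failed.
    """
    scores = {value: score(run_ocr(value)) for value in values}
    valid = {value: s for value, s in scores.items() if s > 0.0}
    best = max(valid, key=valid.get) if valid else None
    return scores, best
```

Because `run_ocr` is just a callable of one value, the same helper can be reused for other parameters, or nested to try a small grid of two parameters at a time.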
Best, Luis.