Re: Improve Current Tesseract Results

Tom Morris

Jan 13, 2021, 10:24:15 AM

to tesseract-ocr

I suspect your problem has more to do with the tabular format and the ruling lines than with the fact that it's Korean or with the image quality. You might want to search the archive for other threads discussing tabular data and/or line removal. There's a Leptonica tutorial on line removal (http://www.leptonica.org/line-removal.html), but table OCR is a little specialized.
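To make the line-removal idea concrete, here is a minimal pure-Python sketch. It is not the method from the Leptonica tutorial (which uses morphological operations on the real image); it only assumes the page has already been binarized into rows of 0/1 values (1 = ink) and clears any horizontal or vertical run of ink at least `min_len` pixels long. The function name `remove_long_runs` and the `min_len` threshold are made up for illustration, and the threshold would need tuning so that it is longer than any stroke in the text.

```python
def remove_long_runs(img, min_len):
    """Return a copy of a binary image (list of rows, 1 = ink) with every
    horizontal or vertical ink run of at least min_len pixels cleared.
    Hypothetical helper: a crude stand-in for morphological line removal."""
    h = len(img)
    w = len(img[0]) if h else 0
    mask = [[False] * w for _ in range(h)]  # pixels judged to be table lines

    # Mark long horizontal runs, row by row.
    for y in range(h):
        x = 0
        while x < w:
            if img[y][x]:
                start = x
                while x < w and img[y][x]:
                    x += 1
                if x - start >= min_len:
                    for i in range(start, x):
                        mask[y][i] = True
            else:
                x += 1

    # Mark long vertical runs, column by column.
    for x in range(w):
        y = 0
        while y < h:
            if img[y][x]:
                start = y
                while y < h and img[y][x]:
                    y += 1
                if y - start >= min_len:
                    for i in range(start, y):
                        mask[i][x] = True
            else:
                y += 1

    # Clear the marked pixels; the input image is left untouched.
    return [[0 if mask[y][x] else img[y][x] for x in range(w)]
            for y in range(h)]
```

Runs are measured on the original image, so a pixel where a horizontal and a vertical line cross is removed with both. Real scans are rarely perfectly straight, so on actual data a deskew step or a morphological approach such as the one in the tutorial is likely to work better than exact run lengths.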

Tom

On Wednesday, January 13, 2021 at 8:12:58 AM UTC-5 Glenn wrote:

Hello, I am currently working on this Korean dataset and am having trouble getting all the values read correctly. Two of the problems are that the pictures are slightly wonky and that the text is in Korean.

I cropped the image and converted it to greyscale to try to improve it, but it still looks slightly blurry. I'm not sure this is the best approach, and I can crop to a larger image if that helps.

The current problem is that the results are not very good. The default settings give me a jumble. Although I found that psm 4 works best, the output still does not look very good, and it seems like Tesseract just breaks halfway through.

How can I improve this? I was thinking of cutting the image into slices and reading each one, but I am still not sure that would fix it. Is the image quality just not good enough?
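The slicing idea can be sketched as follows. This is only an illustration under stated assumptions: the image is already binarized into rows of 0/1 values (1 = ink) with the table lines removed, and rows of text are separated by at least `min_gap` blank pixel rows. The function name `text_bands` and the `min_gap` parameter are hypothetical. It returns the row ranges that contain ink, so each band could then be cropped and passed to Tesseract separately, for example with a single-line page segmentation mode such as `--psm 7`.

```python
def text_bands(img, min_gap=1):
    """Return (top, bottom) row ranges, bottom exclusive, that contain ink.
    A band ends once min_gap consecutive blank rows follow it.
    Hypothetical helper for slicing a cleaned binary image into text rows."""
    bands = []
    start = None  # first row of the band currently being built
    blank = 0     # consecutive blank rows seen since the last ink row
    for y, row in enumerate(img):
        if any(row):
            if start is None:
                start = y
            blank = 0
        elif start is not None:
            blank += 1
            if blank >= min_gap:
                # Close the band at the last row that contained ink.
                bands.append((start, y - blank + 1))
                start = None
                blank = 0
    if start is not None:
        # The image ended while a band was still open.
        bands.append((start, len(img) - blank))
    return bands
```

A larger `min_gap` merges rows that sit close together, which may help when characters have detached parts above or below the main line of text.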

Thank you
