Tesseract OCR numbers in figures not found


MaSei

Oct 20, 2020, 12:20:48 PM

to tesseract-ocr

I want to extract numbers from an image. The numbers are usually placed around a figure and sometimes inside it. I'm using Tesseract for this task. Tesseract works quite well for documents with a lot of text, but I have not found parameters that give good results here. I tried different page segmentation modes (PSM_SPARSE_TEXT should in theory work best), all engine modes, a character whitelist, disabling table detection, disabling the dictionary, and so on.

Usually the images look like the attached 'NumbersWithFigure'.

Using a 'cleaned' image like the attached 'OnlyNumbers' did not noticeably improve the results either.
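For reference, one kind of further cleanup that could be tried on such an image is upscaling followed by a fixed-threshold binarization. The following is only a minimal sketch using plain java.awt. The class name NumberImagePrep, the method prepare, and the scale and threshold values are hypothetical and chosen for illustration; whether this helps on the attached images is untested.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.awt.image.WritableRaster;

// Hypothetical helper, not part of Tess4J: enlarges and binarizes an image before OCR.
public class NumberImagePrep {

    /**
     * Upscales src by an integer factor, converts it to grayscale,
     * and maps every pixel to pure black or pure white.
     */
    public static BufferedImage prepare(BufferedImage src, int scale, int threshold) {
        int w = src.getWidth() * scale;
        int h = src.getHeight() * scale;

        // Draw the source into a larger grayscale image using bicubic interpolation.
        BufferedImage gray = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BICUBIC);
        g.drawImage(src, 0, 0, w, h, null);
        g.dispose();

        // Fixed threshold: gray levels below it become black (0), all others white (255).
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        WritableRaster in = gray.getRaster();
        WritableRaster res = out.getRaster();
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int level = in.getSample(x, y, 0);
                res.setSample(x, y, 0, level < threshold ? 0 : 255);
            }
        }
        return out;
    }
}
```

The result of something like NumberImagePrep.prepare(image, 3, 128) could then be passed to getWords in place of the original image.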

I'm using Tess4j to access Tesseract with Java like this:

// Default language is eng, default OEM is TessOcrEngineMode.OEM_DEFAULT
Tesseract1 tesseract = new Tesseract1();
tesseract.setTessVariable("textord_tabfind_find_tables", "0"); // table detection disabled
tesseract.setTessVariable("tessedit_enable_doc_dict", "0"); // don't use dictionary
tesseract.setTessVariable("tessedit_char_whitelist", "0123456789"); // only numbers
tesseract.setTessVariable("load_system_dawg", "0"); // system dictionary will not be loaded
tesseract.setPageSegMode(TessPageSegMode.PSM_SPARSE_TEXT);
tesseract.setDatapath(new File("./tessdata/").getAbsolutePath());
System.out.println("Words: " + tesseract.getWords(entry.getValue(), TessPageIteratorLevel.RIL_WORD));
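In case the whitelist is not honored by the engine mode in use (an assumption, not something verified for this setup), the recognized text could also be filtered for digits afterwards. A minimal sketch in plain Java; the class DigitFilter and the method extractNumbers are hypothetical names, and the input is assumed to be the raw text of the recognized words:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical post-processing helper, not part of Tess4J.
public class DigitFilter {

    // One or more consecutive ASCII digits.
    private static final Pattern DIGITS = Pattern.compile("[0-9]+");

    /** Returns every run of digits found in the OCR text, in order of appearance. */
    public static List<String> extractNumbers(String ocrText) {
        List<String> numbers = new ArrayList<>();
        Matcher m = DIGITS.matcher(ocrText);
        while (m.find()) {
            numbers.add(m.group());
        }
        return numbers;
    }
}
```

This only discards non-digit characters; it cannot recover digits that the engine missed in the first place.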

Any ideas (parameters and/or links to specialized training data)?

I've also posted this question on StackOverflow (here), but maybe I'll have more luck here :-)
