Dear all,
we are attempting to read bank statements with tesseract (via tess4j, version 4.6.0 using libtesseract 4.1.3). These statements are standardized letters in which the crucial information for us appears at pre-defined locations. Among other information, we are interested in extracting the ISIN (International Securities Identification Number), which is an alphanumeric code consisting of a two-letter country code, nine arbitrary letters or digits and a numeric check digit.
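To make the format concrete, here is a minimal sketch of how an ISIN candidate can be validated. The class and method names (IsinCheck, isValid) are made up for illustration and are not part of tess4j. The check-digit step uses the standard ISIN scheme: letters are expanded to numbers (A=10 ... Z=35) and the Luhn algorithm is applied to the resulting digit string.

```java
import java.util.regex.Pattern;

/** Illustrative helper (hypothetical name): validates the shape and check digit of an ISIN. */
final class IsinCheck {
    // Two uppercase letters, nine uppercase letters or digits, one final digit.
    private static final Pattern SHAPE = Pattern.compile("[A-Z]{2}[A-Z0-9]{9}[0-9]");

    static boolean isValid(String candidate) {
        if (candidate == null || !SHAPE.matcher(candidate).matches()) {
            return false;
        }
        // Expand letters to their numeric value (A=10 ... Z=35); digits stay as they are.
        StringBuilder digits = new StringBuilder();
        for (char c : candidate.toCharArray()) {
            if (c >= 'A' && c <= 'Z') {
                digits.append(c - 'A' + 10);
            } else {
                digits.append(c);
            }
        }
        // Luhn: walking from the right, double every second digit
        // (the rightmost digit, i.e. the check digit, is not doubled).
        int sum = 0;
        boolean doubleIt = false;
        for (int i = digits.length() - 1; i >= 0; i--) {
            int d = digits.charAt(i) - '0';
            if (doubleIt) {
                d *= 2;
                if (d > 9) {
                    d -= 9;
                }
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }
}
```

With this, IsinCheck.isValid("IE00BG0J4C88") accepts the ISIN as printed in the document, while a 13-character misread such as IE0O0BG0J4C88 is already rejected by the shape check.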
When attempting to extract this information with tesseract, we observe recurring read errors such as:
- zeros in the ISIN's padding appear as 0O combinations in tesseract's output. For example, IE00BG0J4C88 in the document is read as IE0O0BG0J4C88.
- the check digit is misread as a letter, e.g. I or J for 1, S for 5, etc.
- letters in the country code (first two characters of the ISIN) are misinterpreted as digits, e.g. 1E instead of IE, F1 instead of FI.
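The second and third error types depend on the position within the ISIN, which the following sketch tries to make explicit. It is only an illustration of the confusions listed above, not something tesseract or tess4j provides: the class IsinConfusions and its methods are hypothetical, and the mapping tables contain only the examples mentioned in this mail (they would need to be extended for real use). It does not address the inserted-character case (0O for 0).

```java
/** Illustrative helper (hypothetical name): undoes position-dependent letter/digit confusions. */
final class IsinConfusions {

    // Positions 0 and 1 (country code) must be letters: map look-alike digits back.
    static char toLetter(char c) {
        switch (c) {
            case '1': return 'I';   // "1E" -> "IE", "F1" -> "FI"
            default:  return c;
        }
    }

    // Position 11 (check digit) must be a digit: map look-alike letters back.
    static char toDigit(char c) {
        switch (c) {
            case 'I':
            case 'J': return '1';
            case 'S': return '5';
            default:  return c;
        }
    }

    // Applies the corrections to a 12-character candidate; other inputs are returned unchanged.
    static String fixPositions(String candidate) {
        if (candidate == null || candidate.length() != 12) {
            return candidate;
        }
        char[] out = candidate.toCharArray();
        out[0] = toLetter(out[0]);
        out[1] = toLetter(out[1]);
        out[11] = toDigit(out[11]);
        return new String(out);
    }
}
```

For example, fixPositions("1E00BG0J4C88") yields "IE00BG0J4C88". The nine characters in the middle may be letters or digits, so no such position rule applies to them.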
These problems appear seemingly at random in documents from different banks, which use different fonts. Preliminary tests with a user patterns file in which we specify a pattern for the ISIN have had no effect: the OCR result is exactly the same as without the custom pattern file. Our pattern file contains this line:
\A\A\c\c\c\c\c\c\c\c\c\d
and we use it by setting the "user_patterns_file" variable like so:
import net.sourceforge.tess4j.Tesseract;

Tesseract tesseract = new Tesseract();
tesseract.setTessVariable("user_patterns_file", "path/to/my.pattern");
Anyhow, my questions:
- is this the correct way to configure user patterns with tess4j? Related to that, do user patterns work when using tesseract 4.1.3 in LSTM mode (as we do currently)? I am aware of a number of issues (see https://github.com/tesseract-ocr/tesseract/issues/403 and https://github.com/tesseract-ocr/tesseract/issues/960) and of PR https://github.com/tesseract-ocr/tesseract/pull/2328, which attempted to add it for LSTM, but I am not sure what the current status is.
- is using a pattern the right way to improve tesseract's accuracy for alphanumeric identifiers like an ISIN? Does this yield positive results even when the identifier is part of a longer text and not the only thing present in the image?
- what other approaches exist to improve tesseract's accuracy when recognizing alphanumeric identifiers? I am aware of user dictionaries, but doubt that this is feasible for us given the large number of existing ISINs (> 3 million).
Thanks in advance for any hints,
Stefan