Extracting alphanumeric identifiers (ISINs)

302 views
Skip to first unread message

Stefan Bretzel

unread,
Jun 23, 2022, 9:30:36 AM6/23/22
to tesseract-ocr
Dear all,
we are attempting to read bank statements with tesseract (via tess4j, version 4.6.0 using libtesseract 4.1.3). These statements are formalized letters where the crucial information for us appears at pre-defined locations. Among other information, we are interested in extracting the ISIN (international securities identifier), which is a alphanumeric code consisting of a two-letter country code, nine arbitrary letters
or digits and a numeric check digit.

When attempting to extract this information with tesseract, we observe patterns of read errors by tesseract such as

- zeros in the ISIN's padding appear as 0O combinations in tesseract's output. For example IE00BG0J4C88 in the document is read as IE0O0BG0J4C88
- the check-digit is misread as a letter. E.g. I or J for 1, S for 5 etc.
- letters in the country code (first two characters of the ISIN) are misinterpreted as digits, e.g. 1E instead of IE, F1 instead of FI.

These problems appear arbitrarily for such documents coming from different banks using different fonts. Preliminary tests using a user patterns file where we specify a pattern for the ISIN have had no effect, the ocr result is exactly the same as without custom pattern file. Our pattern file contains this line:

\A\A\c\c\c\c\c\c\c\c\c\d

and we use it by setting the "user_patterns_file" variable like so

Tesseract tesseract = new Tesseract();
tesseract.setTessVariable("user_patterns_file", "path/to/my.pattern");

Anyhow, my questions:

- is this the correct way to configure user patterns with tess4j? Related to that, do user patterns work when using tesseract 4.1.3 in LSTM mode (as we do currently)? I am aware of a number of issues (see https://github.com/tesseract-ocr/tesseract/issues/403 resp.
  https://github.com/tesseract-ocr/tesseract/issues/960) and PR https://github.com/tesseract-ocr/tesseract/pull/2328 that attempted to add it for LSTM but am not sure what the current status is.
- is using a pattern the right way to go to augment tesseract's accuracy for alphanumeric identifiers like an ISIN? Does this yield positive results even when the alphanumeric
  identifier is part of a longer text and not the only thing that is present in the picture?
- what other approaches to improve tesseract's accuracy when recognizing alphanumeric characters exist? I am aware of user dictionaries, but have my doubts this is a feasible approach   for us given the large number of existing ISINs (> 3 million).

Thanks in advance for any hints,
Stefan

Zdenko Podobny

unread,
Jun 23, 2022, 10:58:18 AM6/23/22
to tesser...@googlegroups.com
Can please provide some examples of input images? 
It would be much easier for other user to test your problem and suggest some solution.

Zdenko


št 23. 6. 2022 o 15:30 'Stefan Bretzel' via tesseract-ocr <tesser...@googlegroups.com> napísal(a):
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/d6756bbe-7d58-4bdd-98c6-f08ca91bd615n%40googlegroups.com.

Stefan Bretzel

unread,
Jun 25, 2022, 8:40:32 AM6/25/22
to tesseract-ocr
Hi zdenop,
thanks for the quick reply.

I've attached two (artificial -- can't post real-world scans due to legal/data protection reasons) examples illustrating the problem.

Zweitschrift_Muster.pdf comes close to what we try to OCR in the real-world. The ISIN is DE00A1DAHHQ0 but tesseract reads DEOOA1DAHHQO (all zeros are read as O).
While the first two Os might be indeed legal (an ISIN allows nine alphanumeric characters after the country code), the O at the end is definately wrong
as the last character must always be a digit. I had hoped to give tesseract a hint by providing a pattern. Besides the ISIN we extract further information
from the document, such as the execution date (Nov 26th, 2021 at 7:45), the rate and amount (220,00 resp. 22000,00) as well as the depot number (789789789).

multiple_ISINs.pdf contains a number of ISINs for which we have observed the same issue:

Found                         Expected
FRO0000127771      FR0000127771 -> additional O
IEOO0B3RBWM25    IE00B3RBWM25 -> OO0 instead of 00
NLO0011794037      NL0011794037 -> O00 instead of 00
DEOOA1DAHHQO    DE00A1DAHHQ0 -> double O instead of double 0, O instead of 0 at the end
DEOOAO0QWMPJ6  DE00A0QWMPJ6 -> double O instead of double 0

Cheers,
Stefan
Zweitschrift_Muster.pdf
multiple_ISINs.pdf

Zdenko Podobny

unread,
Jun 26, 2022, 10:10:42 AM6/26/22
to tesser...@googlegroups.com
Hello Stefan,

recognizing such codes (e.g. no words) is difficult since some letters could be easily replaced (e.g zero with capital O, 1 with l ).

I had a discussion with one commercial provider of data extraction from invoices (based on commercial OCR engines) and their claim that you always need a human validator for fields like numbers, codes, etc, as any OCR is not 100% accurate, but your accounting/service must be 100% correct. So they push for getting and processing data and they try to avoid data extraction from images as much as possible...

Regarding your examples:
I expect that you are extracting ISIN information from the image and not pdf.  Image preprocessing hints - resize the image so the capital letter has a size 30-33 points [1]
I got there result (with tessdata_best) (tesseract ISINs_rs.png -):
FR0000127771
IEOOB3RBWM25
NL0011794037
DEOOA1DAHHQO
DEOOAOQWMPJE
DEOOOA1IML7]J1
IE00BG0J4C88

The good point is that ISIN could be validated ([2], [3]), so can automatically check for OCR output and maybe do automatic post-processing (replace common problem "O" with "0" and validate once again) before asking for human validation/correction. 

If the font and letter size are the same on all documents, maybe you can consider making your own custom OCR just for ISIN.


Zdenko


so 25. 6. 2022 o 14:40 'Stefan Bretzel' via tesseract-ocr <tesser...@googlegroups.com> napísal(a):
ISINs_rs.png

Stefan Bretzel

unread,
Jun 29, 2022, 2:40:30 AM6/29/22
to tesseract-ocr
Hi Zdenko,
indeed, avoiding OCR completely for this kind of documents would be desirable though not possible for us.

As a matter of fact, we already use ISIN validation and do quite a lot of postprocessing on the extracted ISIN as an attempt to compensate for OCR misreadings. In doing so, we use ISIN validation already quite extensively. The problem with it is that Luhn algorithm through which the check digit is calculated is not sensitive enough to catch all character transpositions and mixups, so the danger is that we accidentally "create" one or more valid ISINs by these postprocessing steps that are rather far away from the ISIN found on the statement or which are equally similar to the found string so that we basically have a "draw". That's why I had hoped improving tesseract's output might give us less opportunity for this kind of problems.

I didn't know that enlarging the font-size could play a crucial role. I'll try and evaluate how much this helps in our case.

In any case, thanks for your hints and your time,
Stefan
Reply all
Reply to author
Forward
0 new messages