--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/6f5f957e-4f33-419f-aba6-2e8a3f6f8d92n%40googlegroups.com.
Whitelist the digits so that the letter O cannot be confused with the digit 0.
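A minimal sketch of what that could look like from Python, calling the tesseract command line. It assumes the numeric field has already been cropped to its own image (the file name field.png is hypothetical) and that the installed Tesseract version honours tessedit_char_whitelist. The helper names are made up for illustration.

```python
import subprocess
from typing import List


def build_tesseract_cmd(image_path: str,
                        whitelist: str = "0123456789",
                        psm: int = 7) -> List[str]:
    """Build a tesseract command that only allows the whitelisted characters.

    PSM 7 treats the image as a single text line, which suits a cropped field.
    """
    return [
        "tesseract", image_path, "stdout",
        "--psm", str(psm),
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


def ocr_digits(image_path: str) -> str:
    """Run tesseract on a cropped numeric field and return the recognized text."""
    result = subprocess.run(build_tesseract_cmd(image_path),
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

A digits-only whitelist would only be appropriate for purely numeric fields; an identifier that mixes letters and digits would need a wider whitelist or a different approach.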

It is not a “super quality” parameter, but one possible approach for critical numbers and other content where a dictionary does not help is to target individual characters. Tesseract can report each character together with a confidence value, either through the API or in hOCR output with "-c hocr_char_boxes=1". Using the glyph coordinates and a filter such as a confidence between 90 and 98 percent, it might be possible to get closer to 99 percent by extracting those individual glyphs and re-running them in single character mode (PSM 10). This, of course, adds a lot more overhead, but it can help with tricky recognition, like distinguishing between "O" and "0".
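As a rough sketch of the filtering step, the snippet below pulls per-character boxes and confidences out of hOCR text and keeps the ones in the uncertain range. It assumes the character spans look like <span class='ocrx_cinfo' title='x_bboxes L T R B; x_conf C'>, which should be checked against the hOCR your Tesseract version actually writes. The names Glyph, parse_char_boxes and uncertain are invented for this example.

```python
import html
import re
from typing import List, NamedTuple, Tuple


class Glyph(NamedTuple):
    char: str
    bbox: Tuple[int, int, int, int]  # left, top, right, bottom in pixels
    conf: float                      # confidence in percent


# Assumed layout of a character span in hOCR written with hocr_char_boxes=1.
CINFO_RE = re.compile(
    r"<span class=['\"]ocrx_cinfo['\"] "
    r"title=['\"]x_bboxes (\d+) (\d+) (\d+) (\d+); x_conf ([\d.]+)['\"]>"
    r"(.*?)</span>"
)


def parse_char_boxes(hocr: str) -> List[Glyph]:
    """Return every character span found in the hOCR text."""
    glyphs = []
    for left, top, right, bottom, conf, char in CINFO_RE.findall(hocr):
        bbox = (int(left), int(top), int(right), int(bottom))
        glyphs.append(Glyph(html.unescape(char), bbox, float(conf)))
    return glyphs


def uncertain(glyphs: List[Glyph],
              low: float = 90.0, high: float = 98.0) -> List[Glyph]:
    """Keep glyphs whose confidence lies in the range worth re-checking."""
    return [g for g in glyphs if low <= g.conf <= high]
```

The 90 to 98 range is only the starting point suggested above; the thresholds would need tuning against real invoices.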
art
From: tesser...@googlegroups.com <tesser...@googlegroups.com>
On Behalf Of A Nederpelt
Sent: Friday, September 22, 2023 8:25 AM
To: tesseract-ocr <tesser...@googlegroups.com>
Subject: Re: [tesseract-ocr] quality of recognition of customer invoices
Well, I have approximately 3,000 customers for our software at the moment. We OCR a lot of invoices; for example, one customer processes approx. 10,000 documents a month.
So open source is worth it. I want Tesseract, since it is free to use.
I believe open source is the future.
So, can somebody help me optimize it?
By "lots of CPU usage" I mean cases where it needs more CPU for some parameter like "super quality". I want to use that parameter.
On Friday, September 22, 2023 at 14:03:53 UTC+2, desal...@gmail.com wrote:
The CPU usage is unusual. I have a pretty old Mac (from 2011) and have been running Tesseract on it just fine.
As for accuracy: if your project is limited in scale, the commercial tools would definitely perform better for you. But if you have long-running, extensive projects, Tesseract is worth the time spent developing (training) it.
On Friday, September 22, 2023 at 2:50:50 PM UTC+3 powe...@gmail.com wrote:
Well, the problem is why it chooses:
NLOO7900000B01

Not sure I am following. hOCR is just an output format, so the results should be the same. The trick would be to use the coordinates to extract the glyphs for problem characters, like the two Os below, and then run single character mode on the resulting images. I put a simple demo of this approach here [1]. You would probably want to test whether the approach consistently catches problem characters, and then use the API to get better performance in production.
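For illustration only (this is not the demo linked above), a sketch of the crop-and-recheck step. It assumes Pillow and pytesseract are installed and that a glyph bounding box is already known. The padding value, the digit whitelist and the function names are assumptions chosen for the example.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # left, top, right, bottom in pixels


def padded_box(bbox: Box, image_size: Tuple[int, int], pad: int = 4) -> Box:
    """Grow the glyph box by `pad` pixels on each side, clamped to the image."""
    left, top, right, bottom = bbox
    width, height = image_size
    return (max(0, left - pad), max(0, top - pad),
            min(width, right + pad), min(height, bottom + pad))


def recognize_glyph(image_path: str, bbox: Box,
                    whitelist: str = "0123456789") -> str:
    """Crop one glyph from the page and re-run it in single character mode."""
    # Imported here so the box arithmetic above works without these packages.
    from PIL import Image
    import pytesseract

    config = f"--psm 10 -c tessedit_char_whitelist={whitelist}"
    with Image.open(image_path) as page:
        glyph = page.crop(padded_box(bbox, page.size))
        return pytesseract.image_to_string(glyph, config=config).strip()
```

A little padding around the box tends to matter, since a glyph cropped tight to its edges gives the recognizer very little context. Spawning one Tesseract process per glyph is slow, which is why the API would be the better fit in production.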
art
---
From: tesser...@googlegroups.com <tesser...@googlegroups.com>
On Behalf Of A Nederpelt
Sent: Monday, September 25, 2023 3:46 AM
To: tesseract-ocr <tesser...@googlegroups.com>
Subject: Re: [tesseract-ocr] quality of recognition of customer invoices
Well, the strange thing is that the hOCR output shows different characters.

