Not able to force a specific sequence length


Fernando

Nov 22, 2019, 4:08:16 AM
to tesseract-ocr
Hello everyone!
I am trying to use tesseract-ocr (through pytesseract) to detect some specific codes. The input is always a single word at a time.
The codes always have the same length (8 characters), and I want the output to contain only 8-character sequences.

I have tried all the solutions described in the manual (https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#CONFIGFILE), without success.

In more detail, I tried to:

  • Create a CONFIGFILE that refers to a user patterns file
  • Pass the patterns file directly with the --user-patterns option
I also tried a few different regular expressions (I read that tesseract supports only a subset of regex syntax).
The ideal regex would be something like ^.{8}$, because I only want to constrain the length, not the set of characters (any Unicode character is allowed).

I also tried some very general patterns that I read are supported, such as \d, which should return only sequences of digits, but it seems to be ignored.

Am I missing something, or is it not possible to force a specific output sequence length?
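
To make the goal concrete, here is a minimal sketch of the behaviour I am after, written as a post-processing filter in Python rather than as anything tesseract enforces itself. It assumes pytesseract's image_to_string and page segmentation mode 8 (treat the image as a single word); the helper names keep_if_length and read_code are hypothetical, introduced only for illustration.

```python
from typing import Optional


def keep_if_length(text: str, expected_len: int = 8) -> Optional[str]:
    """Return the stripped OCR text if it has exactly expected_len characters, else None."""
    # OCR output usually ends with a newline or form feed, so strip surrounding whitespace.
    candidate = text.strip()
    # len() counts Unicode code points, so any character set is accepted.
    return candidate if len(candidate) == expected_len else None


def read_code(image, expected_len: int = 8) -> Optional[str]:
    """Run OCR on a single-word image and keep the result only if the length matches."""
    # Imported lazily so the length filter above can be used without pytesseract installed.
    import pytesseract

    # Assumption: --psm 8 (single word) fits the input, since each image holds one code.
    raw = pytesseract.image_to_string(image, config="--psm 8")
    return keep_if_length(raw, expected_len)
```

This only rejects results of the wrong length after recognition; it does not make tesseract prefer 8-character readings, which is what I was hoping the pattern file would do.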

Thank you in advance