Unable to get orientation with node-tesseract: "Warning, detects only orientation with -l eng" and "Error, OSD requires a model for the legacy engine"


Oliver Saintilien

Jan 11, 2024, 1:25:53 PM
to tesseract-ocr
I keep getting an error saying that I have to set the TESSDATA_PREFIX environment variable, which I did, both in the user variables and in the system variables. However, after doing that I get another error. I attached screenshots to make my setup and issue as clear as possible. I'm using node-tesseract-ocr - npm (npmjs.com).

[Attachments: Screenshot 2024-01-11 131619.png, Screenshot 2024-01-11 131802.png, Screenshot 2024-01-11 131330.png]

Zdenko Podobny

Jan 11, 2024, 1:34:35 PM
to tesser...@googlegroups.com
The subject of your email states something different than your email text. Can you please clarify?


Zdenko



--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/ff0dfe5f-969c-412f-bc2b-1bca7358afa0n%40googlegroups.com.

Oliver Saintilien

Jan 11, 2024, 10:59:00 PM
to tesseract-ocr

When I do
```js
const tesseract = require("node-tesseract-ocr")

tesseract
  .recognize(`C:\\Users\\osain\\OneDrive\\Desktop\\1992 Spring\\Document_20240109_0014.jpg`, {
    lang: "eng",
    oem: 1, // LSTM engine only
    psm: 0, // orientation and script detection (OSD) only
  })
  .then((text) => {
    console.log(text)
  })
  .catch((error) => {
    // Print the failure message as text so it can be pasted into a post
    console.log(error.message)
  })
```
I was expecting to get some orientation info about the image (whether it is sideways, upside down, etc.), but instead I get the error you see in my subject and in the screenshot. Changing psm to 3 extracts the text perfectly, but when I change it to 0 I get that error. I got the number codes for psm from here: Improving the quality of the output | tessdoc (tesseract-ocr.github.io)

Zdenko Podobny

Jan 12, 2024, 1:11:56 AM
to tesser...@googlegroups.com
Unfortunately you don't.

Instead of showing irrelevant information, make sure that tesseract (outside of the wrapper) provides the expected results.

You claim "I keep getting an error that I have to set the TESSDATA_PREFIX", but your only relevant screenshot (which is barely readable) shows that this is not true.
Please do not post screenshots; send the relevant logs as text, or copy the text from the console.

Zdenko



Zdenko Podobny

Jan 12, 2024, 8:36:42 AM
to tesser...@googlegroups.com
tesseract executable problem:

For TESSDATA_PREFIX you use a path with a space and you did not escape it properly. That is why you get an error about a file that does not exist ("C:\Program/eng.traineddata").
Solutions:
a) use a path without special characters such as spaces
b) learn how to properly escape paths in environment variables

When you solve this problem you will face the same problem (Error, OSD requires a model for the legacy engine) as with node-tesseract-ocr (which seems to handle paths correctly) ;-)
I guess the problem is that OSD needs the legacy engine while you restrict tesseract to the LSTM engine only. So you need to change your options to allow use of the legacy engine. I am not sure whether OSD also needs an eng.traineddata with legacy components, but you will see.
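The option conflict described above can be caught before tesseract is even called. The following is only a minimal sketch: `checkOsdOptions` is a hypothetical helper (not part of node-tesseract-ocr), and it flags only the one combination reported as failing in this thread (psm 0 together with oem 1). The OEM names follow the list printed by `tesseract --help-oem`.

```js
// OEM values as listed by `tesseract --help-oem`.
const OEM = { LEGACY_ONLY: 0, LSTM_ONLY: 1, LEGACY_AND_LSTM: 2, DEFAULT: 3 }
// PSM 0 means "orientation and script detection (OSD) only".
const PSM_OSD_ONLY = 0

// Hypothetical helper: returns a hint string for the combination that
// fails in this thread, or null when no known conflict is detected.
function checkOsdOptions(options) {
  if (options.psm === PSM_OSD_ONLY && options.oem === OEM.LSTM_ONLY) {
    return "psm 0 (OSD only) with oem 1 (LSTM only) fails with " +
      "'OSD requires a model for the legacy engine'; try oem 0"
  }
  return null
}

console.log(checkOsdOptions({ lang: "eng", oem: 1, psm: 0 }))
console.log(checkOsdOptions({ lang: "eng", oem: 0, psm: 0 }))
```

Whether oem 2 or oem 3 also work for OSD depends on the installed traineddata and is not covered by this check.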

KR,

Zdenko


On Fri, Jan 12, 2024 at 14:08, Oliver Saintilien <osaint...@gmail.com> wrote:
Sorry for the confusion. When I do

tesseract
  .recognize(`C:\\Users\\osain\\OneDrive\\Desktop\\1992 Spring\\Document_20240109_0014.jpg`, {
    lang: "eng",
    oem: 1,
    psm: 0,
     
  })

I get 
Command failed: tesseract "C:\Users\osain\OneDrive\Desktop\1992 Spring\Document_20240109_0014.jpg" stdout -l eng --oem 1 --psm 0
Warning, detects only orientation with -l eng
Error, OSD requires a model for the legacy engine

How do I fix this error? I am using tesseract through this wrapper: node-tesseract-ocr - npm (npmjs.com). I hear you when you say to make sure tesseract (outside of the wrapper) is providing expected results. But that's the thing: when I set psm to 0, I expect to get back orientation data. However, when I set psm to other numbers such as 3 or 1, it returns the text from the image.

Something else I tried was this 
const tesseract = require("node-tesseract-ocr")

tesseract
  .recognize(`C:\\Users\\osain\\OneDrive\\Desktop\\1992 Spring\\Document_20240109_0014.jpg`, {
    lang: "eng",
    oem: 1,
    psm: 0,
    "tessdata-dir": "C:\\Program Files\\Tesseract-OCR\\tessdata"
  }) 

That's when I get the error about the TESSDATA_PREFIX env var. I have pasted it below:
 
Command failed: tesseract "C:\Users\osain\OneDrive\Desktop\1992 Spring\Document_20240109_0014.jpg" stdout -l eng --oem 1 --psm 3 --tessdata-dir C:\Program Files\Tesseract-OCR\tessdata
Error opening data file C:\Program/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

Ger Hobbelt

Jan 12, 2024, 10:42:05 AM
to tesseract-ocr
Adding to Zdenko's answer: what you need to do is fix/patch node-tesseract-ocr (or file a bug report there and see if someone else does it for you; since this is open source, I suggest fork + fix + pull request at node-tesseract-ocr instead ;-) ) so that it correctly converts paths with spaces, as specified in the JS config struct, into correctly escaped, operating-system-dependent command-line arguments for the tesseract executable that node-tesseract-ocr invokes.
The quickest fix would be to wrap the --tessdata-dir path argument in double quotes, which fixes most path issues (including yours) on MS Windows, as long as the path itself is not adversarial, i.e. does not contain a double quote of its own.

In other words: currently node-tesseract-ocr produces this commandline, as reported by you:

tesseract "C:\Users\osain\OneDrive\Desktop\1992 Spring\Document_20240109_0014.jpg" stdout -l eng --oem 1 --psm 3 --tessdata-dir C:\Program Files\Tesseract-OCR\tessdata

which is interpreted like this (extra newlines added to show the arguments separated):

tesseract
 "C:\Users\osain\OneDrive\Desktop\1992 Spring\Document_20240109_0014.jpg"
 stdout 
 -l eng
 --oem 1
 --psm 3
 --tessdata-dir C:\Program 
Files\Tesseract-OCR\tessdata

so tesseract receives this and gets a damaged path PLUS a surplus argument it apparently ignored: "Files\Tesseract-OCR\tessdata".

What SHOULD have been generated by node-tesseract-ocr is this (with extra newlines again):


tesseract
 "C:\Users\osain\OneDrive\Desktop\1992 Spring\Document_20240109_0014.jpg"
 stdout 
 -l eng
 --oem 1
 --psm 3
 --tessdata-dir "C:\Program Files\Tesseract-OCR\tessdata"

as was intended in the js code.
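To make the quoting fix concrete, here is a minimal sketch of how a wrapper could build that command line. `quoteArg` and `buildCommand` are hypothetical names for illustration, not the actual node-tesseract-ocr internals, and the sketch only handles the simple case discussed here: it refuses arguments that already contain a double quote.

```js
// Hypothetical helper: wrap an argument in double quotes when it contains
// whitespace. Arguments that already contain a double quote are rejected
// rather than escaped, since that is the "adversarial" case left out here.
function quoteArg(value) {
  const text = String(value)
  if (text.includes('"')) {
    throw new Error(`cannot safely quote argument: ${text}`)
  }
  return /\s/.test(text) ? `"${text}"` : text
}

// Hypothetical helper: turn an options object into a tesseract command line.
// "lang" maps to -l; every other key is passed through as --key value.
function buildCommand(imagePath, options) {
  const parts = ["tesseract", quoteArg(imagePath), "stdout"]
  for (const [key, value] of Object.entries(options)) {
    const flag = key === "lang" ? "-l" : `--${key}`
    parts.push(flag, quoteArg(value))
  }
  return parts.join(" ")
}

console.log(
  buildCommand("C:\\Users\\osain\\OneDrive\\Desktop\\1992 Spring\\Document_20240109_0014.jpg", {
    lang: "eng",
    oem: 1,
    psm: 3,
    "tessdata-dir": "C:\\Program Files\\Tesseract-OCR\\tessdata",
  })
)
```

An alternative design is to avoid building a single command string at all and pass the arguments as an array to Node's `child_process.execFile`, which does not go through a shell by default.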


HTH,

Ger


Oliver Saintilien

Jan 12, 2024, 8:21:03 PM
to tesseract-ocr
Great, it works like a charm now. Thanks very much for your help.

Oliver Saintilien

Jan 12, 2024, 9:38:52 PM
to tesseract-ocr
Oh right, for those facing a similar issue, what I did was:
1. Replace the eng.traineddata file with the eng.traineddata found here: tesseract-ocr/tessdata: Trained models with fast variant of the "best" LSTM models + legacy models (github.com). I didn't delete the original file but renamed it.
2. Test the orientation command directly with tesseract in the terminal, like so: tesseract "C:\Users\osain\OneDrive\Desktop\2000\Document_20240110_0001.jpg" stdout --psm 0 --oem 0

If this command works in the terminal, then it will work in the node wrapper version. Here is how I called it:
tesseract.recognize(path, {
    oem: 0,
    psm: 0,
    lang: "eng"
  })
  .then((data) => {
    return data
  })
  .catch((error) => {
    console.log(error.message)
  })
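For anyone who wants to use the result programmatically: with --psm 0, tesseract prints its findings as "key: value" lines. The sketch below parses such text into an object. The sample string is an assumption about what that output looks like (it is not copied from this thread), and `parseOsd` is a hypothetical helper.

```js
// Hypothetical helper: parse "key: value" lines into an object.
// Values that look numeric are converted to numbers; others stay strings.
function parseOsd(output) {
  const result = {}
  for (const line of output.split(/\r?\n/)) {
    const idx = line.indexOf(":")
    if (idx === -1) continue // skip lines without a key/value separator
    const key = line.slice(0, idx).trim()
    const raw = line.slice(idx + 1).trim()
    if (key === "") continue
    const num = Number(raw)
    result[key] = raw !== "" && !Number.isNaN(num) ? num : raw
  }
  return result
}

// Assumed sample of OSD output, for illustration only.
const sample = [
  "Page number: 0",
  "Orientation in degrees: 180",
  "Rotate: 180",
  "Orientation confidence: 8.56",
  "Script: Latin",
  "Script confidence: 2.00",
].join("\n")

const osd = parseOsd(sample)
console.log(osd["Orientation in degrees"], osd["Script"])
```

In the wrapper call above, the string passed to `.then` could be handed to such a parser instead of being returned as-is.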

Zdenko Podobny

Jan 13, 2024, 8:18:33 AM
to tesser...@googlegroups.com
You do not need to rename the traineddata files. You can move them to a tessdata subdirectory, e.g. tessdata/fast or tessdata/best, and then select them with "-l best/eng" or "-l fast/eng".
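As a sketch of how that subdirectory layout could be used from the node wrapper: this assumes the wrapper passes `lang` through to `-l` unchanged, as the command lines earlier in the thread suggest, and that a tessdata/fast directory containing eng.traineddata exists. `langInSubdir` is a hypothetical helper.

```js
// Hypothetical helper: build the value for tesseract's -l option when the
// traineddata file lives in a subdirectory of tessdata (e.g. tessdata/fast).
function langInSubdir(subdir, lang) {
  // Use a forward slash, as in "-l fast/eng", rather than path.join,
  // which would produce a backslash on Windows.
  return `${subdir}/${lang}`
}

// Options object in the shape used elsewhere in this thread;
// here it selects the model stored in tessdata/fast for text extraction.
const fastOptions = {
  lang: langInSubdir("fast", "eng"),
  oem: 1,
  psm: 3,
}

console.log(fastOptions.lang)
```

The same pattern would apply to a subdirectory holding a traineddata file with legacy components for the psm 0 / oem 0 case.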

Zdenko

