Convert a .tiff file to a text file

George Varghese

Jan 30, 2019, 2:34:42 PM

to tesseract-ocr

I am using tesseract v4 to convert a .tiff file to text, first page only. The script, run from the command line on Windows 2012, takes almost 8 seconds to convert just the first page with the configuration below. CPU usage also shoots up to 80% during that time.

-c tessedit_page_number=1

In reality, I only want to convert the first 30 lines to a text file output.

Are there any config options to look at only the first 30 lines of the .tiff file, or any other parameters that will decrease the CPU usage? It is OK even if the OCR conversion takes 15 seconds to run, as long as I do not get this CPU spike.

Zdenko Podobny

Jan 31, 2019, 2:58:51 PM

to tesser...@googlegroups.com

It is not clear to me what you want to achieve. To me it looks like a case for a custom solution using the tesseract API (C, C++, Python, maybe others).

If you can use only the tesseract executable and your "30 lines" always have the same location (or you know their location in advance), you can have a look at the usage of uzn files.

Regarding speed/resources: you did not describe what you are doing. E.g. if you are running tesseract in a loop, you always wait for tesseract initialization...

Also, it is not clear which version of tesseract you use, which language data, etc.

Zdenko

On Wed, Jan 30, 2019 at 20:34, George Varghese <geo...@gmail.com> wrote:

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/33849fa5-bcd2-4cd7-b5f4-be43bd9c0220%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

George Varghese

Jan 31, 2019, 5:34:49 PM

to tesseract-ocr

I am using tesseract v4.0.0.20181030, leptonica-1.76.0.

In short: I am using the command line to convert a .tiff file to a .txt file; no loop or custom solution is used. Yes, the first 30 lines have the same location, and I am specifying to OCR only the first page.

You mentioned the usage of uzn files. I am not aware of such a config -c parameter.

I would appreciate a link to any documentation.

Zdenko Podobny

Jan 31, 2019, 5:48:31 PM

to tesser...@googlegroups.com

https://groups.google.com/forum/#!topic/tesseract-ocr/e3lqpY0pMpw

https://groups.google.com/forum/#!topic/tesseract-ocr/UidqCx6OE0Q

https://github.com/OpenGreekAndLatin/greek-dev/wiki/uzn-format

https://github.com/jsoma/tesseract-uzn

...

PS: I hope it works with tesseract 4 too ;-) I have not tested it yet, but

Zdenko

On Thu, Jan 31, 2019 at 23:34, George Varghese <geo...@gmail.com> wrote:


George Varghese

Jan 31, 2019, 8:35:25 PM

to tesseract-ocr

Does not work in Tesseract 4.

On Wednesday, January 30, 2019 at 11:34:42 AM UTC-8, George Varghese wrote:

Zdenko Podobny

Feb 1, 2019, 3:21:23 AM

to tesser...@googlegroups.com

What does not work? uzn? It works with tesseract 4; I just tested it.

If you are really interested in help, please be specific and detailed about what you did and what you use, and provide examples for reproducing the problem.

Zdenko

On Fri, Feb 1, 2019 at 2:35, George Varghese <geo...@gmail.com> wrote:


George Varghese

Feb 1, 2019, 12:52:44 PM

to tesseract-ocr

The UZN did not work. Attached is a screenshot of the .tif file, with some confidential info removed.

My command was: tesseract doc.tif doc.uzn output -l eng --oem 1 --psm 4 -c tessedit_page_number=1

The doc.uzn file was in the same folder as the .tif file and contained:

20 40 400 200 text

On Wednesday, January 30, 2019 at 11:34:42 AM UTC-8, George Varghese wrote:

George Varghese

Feb 1, 2019, 12:53:56 PM

to tesseract-ocr

On Wednesday, January 30, 2019 at 11:34:42 AM UTC-8, George Varghese wrote:

SCREENCAPTURE.JPG

Zdenko Podobny

Feb 1, 2019, 12:57:16 PM

to tesser...@googlegroups.com

It works, but your command is wrong... Did you read the links I posted?

It should be:

tesseract doc.tif doc --psm 4

Zdenko

On Fri, Feb 1, 2019 at 18:52, George Varghese <geo...@gmail.com> wrote:


Zdenko Podobny

Feb 1, 2019, 1:03:48 PM

to tesser...@googlegroups.com

In this case the command should be:

tesseract.exe SCREENCAPTURE.JPG output --psm 4

and the attached SCREENCAPTURE.uzn file must be in the same location as SCREENCAPTURE.JPG.
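
For anyone scripting this, here is a minimal sketch in Python of the setup described above. It assumes each line of a .uzn file has the form "left top width height label" in pixels (matching the doc.uzn line quoted earlier in the thread), and that tesseract picks up the file with the image's base name and a .uzn extension from the image's folder when run with --psm 4. The function names are hypothetical helpers, not part of tesseract.

```python
from pathlib import Path


def uzn_path_for(image_path):
    """Return the .uzn path sharing the image's folder and base name."""
    return Path(image_path).with_suffix(".uzn")


def uzn_line(left, top, width, height, label="text"):
    """Format one zone; assumed layout is 'left top width height label'."""
    return f"{left} {top} {width} {height} {label}"


def write_uzn(image_path, zones):
    """Write one line per zone next to the image and return the file path."""
    path = uzn_path_for(image_path)
    path.write_text("\n".join(uzn_line(*z) for z in zones) + "\n")
    return path


def tesseract_command(image_path, output_base, psm=4):
    """Build the command line; the .uzn file is NOT passed as an argument."""
    return ["tesseract", str(image_path), str(output_base), "--psm", str(psm)]
```

For example, write_uzn("doc.tif", [(20, 40, 400, 200)]) creates doc.uzn beside doc.tif, and the list from tesseract_command("doc.tif", "doc") can then be handed to subprocess.run.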

Zdenko

On Fri, Feb 1, 2019 at 18:53, George Varghese <geo...@gmail.com> wrote:


SCREENCAPTURE.uzn

George Varghese

Feb 1, 2019, 2:28:28 PM

to tesseract-ocr

My command is tesseract doc.tif doc --psm 4

On Wednesday, January 30, 2019 at 11:34:42 AM UTC-8, George Varghese wrote:

George Varghese

Feb 1, 2019, 2:35:55 PM

to tesseract-ocr

Thanks for the help; it is working as expected. I really appreciate it.

The uzn range is honored by tesseract. I need to fine-tune the range a little more, but it is working completely as desired.

The server CPU no longer spikes to 90% as it did before; it now stays in the low 30-40% range. Also, -c tessedit_page_number=1 is working as expected. I only need to look at the first page; sometimes we have more than one page, but with the same header information.

These are fax documents, and the fax software converts each document to a .tif file.

George V

On Wednesday, January 30, 2019 at 11:34:42 AM UTC-8, George Varghese wrote:

George Varghese

Jan 3, 2020, 8:15:50 PM

to tesseract-ocr

Is there any way to limit tesseract to two virtual cores on a virtual machine that has 8 cores? OS environment: Windows 2012 R2.

On Wednesday, January 30, 2019 at 11:34:42 AM UTC-8, George Varghese wrote:

George Varghese

Jan 6, 2020, 5:48:17 PM

to tesseract-ocr

The reason I want to do this:

I found that other processes running on the same server sometimes exit with code 255 and do not complete. If I can limit tesseract to 2 cores, the rest would be available for other processes.

On Wednesday, January 30, 2019 at 11:34:42 AM UTC-8, George Varghese wrote:

Shree Devi Kumar

Jan 7, 2020, 2:08:00 AM

to tesseract-ocr

Have you tried OMP_THREAD_LIMIT=1?
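
A minimal sketch of how this could be applied from a Python wrapper, setting the variable only for the tesseract child process. OMP_THREAD_LIMIT is a standard OpenMP environment variable; the helper names are hypothetical, and the sketch assumes an OpenMP-enabled tesseract build that is on PATH. Whether this lowers CPU usage enough on a given server would need to be measured.

```python
import os
import subprocess


def env_with_thread_limit(limit=1, base_env=None):
    """Copy the environment and set OMP_THREAD_LIMIT for the child process."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = str(limit)
    return env


def run_tesseract(image, output_base, limit=1):
    """Run tesseract with the options used in this thread and a thread cap."""
    cmd = ["tesseract", image, output_base, "--psm", "4",
           "-c", "tessedit_page_number=1"]
    return subprocess.run(cmd, env=env_with_thread_limit(limit), check=True)
```

Directly in a Windows command prompt, the same effect comes from running set OMP_THREAD_LIMIT=1 before the tesseract command in the same session.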


--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

George Varghese

Jan 7, 2020, 12:35:21 PM

to tesser...@googlegroups.com

Thanks, I will try that.
