convert a .tiff file to text file

George Varghese

Jan 30, 2019, 2:34:42 PM
to tesseract-ocr
I am using Tesseract v4 to convert only the first page of a .tiff file to text. The script, run from the command line on Windows 2012, takes almost 8 seconds to convert that one page with the configuration below, and CPU usage shoots up to 80% during that time.

 -c tessedit_page_number=1

In reality, I only want to convert the first 30 lines to a text file output. 

Is there any config option to look only at the first 30 lines of the .tiff file, or any other parameter that will decrease CPU usage? It is OK even if the OCR conversion takes 15 seconds, as long as there is no CPU spike.

Zdenko Podobny

Jan 31, 2019, 2:58:51 PM
to tesser...@googlegroups.com
It is not clear to me what you want to achieve. To me it looks like a case for a custom solution using the Tesseract API (C, C++, Python, maybe others).

If you can use only the tesseract executable and your "30 lines" always have the same location (or you know their location in advance), you can have a look at the usage of uzn files.

Regarding speed/resources: you did not describe what you are doing. For example, if you are running tesseract in a loop, you always wait for tesseract initialization on each run.
Also, it is not clear which version of tesseract you use, which language data, etc.


Zdenko


On Wed, Jan 30, 2019 at 20:34, George Varghese <geo...@gmail.com> wrote:
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/33849fa5-bcd2-4cd7-b5f4-be43bd9c0220%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

George Varghese

Jan 31, 2019, 5:34:49 PM
to tesseract-ocr
I am using tesseract v4.0.0.20181030 with leptonica-1.76.0.

In short, I am using the command line to convert a .tiff file to a .txt file; no loop or custom solution is involved. Yes, the first 30 lines always have the same location, and I am specifying that only the first page should be OCRed.

You mentioned the usage of uzn files. I am not aware of such a -c config parameter.

I would appreciate a link to any documentation.

Zdenko Podobny

Jan 31, 2019, 5:48:31 PM
to tesser...@googlegroups.com

On Thu, Jan 31, 2019 at 23:34, George Varghese <geo...@gmail.com> wrote:

George Varghese

Jan 31, 2019, 8:35:25 PM
to tesseract-ocr
Does not work in Tesseract 4. 



On Wednesday, January 30, 2019 at 11:34:42 AM UTC-8, George Varghese wrote:

Zdenko Podobny

Feb 1, 2019, 3:21:23 AM
to tesser...@googlegroups.com
What does not work? uzn? It works with tesseract 4; I just tested it.

If you are really interested in help, please be specific and detailed about what you did and what you use, and provide examples for reproducing the problem.

Zdenko


On Fri, Feb 1, 2019 at 2:35, George Varghese <geo...@gmail.com> wrote:

George Varghese

Feb 1, 2019, 12:52:44 PM
to tesseract-ocr
The UZN file did not work. I have attached a screenshot of the .tif file, with some confidential info removed.

My command was: tesseract doc.tif doc.uzn output -l eng --oem 1 --psm 4 -c tessedit_page_number=1

The doc.uzn file was in the same folder as the .tif file and contained:

20 40 400 200 text


On Wednesday, January 30, 2019 at 11:34:42 AM UTC-8, George Varghese wrote:

George Varghese

Feb 1, 2019, 12:53:56 PM
to tesseract-ocr


On Wednesday, January 30, 2019 at 11:34:42 AM UTC-8, George Varghese wrote:
SCREENCAPTURE.JPG

Zdenko Podobny

Feb 1, 2019, 12:57:16 PM
to tesser...@googlegroups.com
It works, but your command is wrong. Did you read the link I posted?
It should be:
tesseract doc.tif doc --psm 4

Zdenko


On Fri, Feb 1, 2019 at 18:52, George Varghese <geo...@gmail.com> wrote:

Zdenko Podobny

Feb 1, 2019, 1:03:48 PM
to tesser...@googlegroups.com
In this case the command should be:
tesseract.exe SCREENCAPTURE.JPG output --psm 4

and the attached SCREENCAPTURE.uzn file must be in the same location as SCREENCAPTURE.JPG.

Zdenko


On Fri, Feb 1, 2019 at 18:53, George Varghese <geo...@gmail.com> wrote:
SCREENCAPTURE.uzn
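
A minimal Python sketch of the setup described above, for readers who want to script it. The helper names are hypothetical, and the meaning of the four numbers in a zone line (left, top, width, height in pixels, followed by a label) is an assumption based on the usual UNLV zone format; only the sample line "20 40 400 200 text" and the commands themselves come from this thread.

```python
from pathlib import Path


def zone_line(left, top, width, height, label="text"):
    """Format one zone as a line of a .uzn file (assumed pixel coordinates)."""
    return f"{left} {top} {width} {height} {label}"


def uzn_path_for(image_path):
    """The zone file sits next to the image and shares its base name."""
    return Path(image_path).with_suffix(".uzn")


def write_uzn(image_path, zones):
    """Write one line per (left, top, width, height) tuple next to the image."""
    path = uzn_path_for(image_path)
    path.write_text("\n".join(zone_line(*zone) for zone in zones) + "\n")
    return path


def tesseract_command(image_path, output_base, page=None):
    """Build the command from the thread; the .uzn file is NOT an argument."""
    cmd = ["tesseract", str(image_path), str(output_base), "--psm", "4"]
    if page is not None:
        # Passed through unchanged; verify which page number your build
        # treats as the first page.
        cmd += ["-c", f"tessedit_page_number={page}"]
    return cmd


if __name__ == "__main__":
    print(zone_line(20, 40, 400, 200))
    print(" ".join(tesseract_command("doc.tif", "doc", page=1)))
```

Calling write_uzn("doc.tif", [(20, 40, 400, 200)]) would create doc.uzn beside the image, and the list returned by tesseract_command can be passed to subprocess.run on a machine where tesseract is on the PATH.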

George Varghese

Feb 1, 2019, 2:28:28 PM
to tesseract-ocr
My command is tesseract doc.tif doc --psm 4





On Wednesday, January 30, 2019 at 11:34:42 AM UTC-8, George Varghese wrote:

George Varghese

Feb 1, 2019, 2:35:55 PM
to tesseract-ocr
Thanks for the help; it is working as expected. I really appreciate it.

The uzn range is honored by tesseract. I need to fine-tune the range a little more, but it is working completely as desired.

The server CPU no longer spikes to 90% as it did before; it now stays in the 30-40% range. Also, -c tessedit_page_number=1 is working as expected. I only need to look at the first page; sometimes we have more than one page, but with the same header information.

These are fax documents, and the fax software converts each document to a .tif file.

George V



On Wednesday, January 30, 2019 at 11:34:42 AM UTC-8, George Varghese wrote:

George Varghese

Jan 3, 2020, 8:15:50 PM
to tesseract-ocr
Is there any way to limit tesseract to two virtual cores on a virtual machine that has 8 cores? The OS environment is Windows 2012 R2.


On Wednesday, January 30, 2019 at 11:34:42 AM UTC-8, George Varghese wrote:

George Varghese

Jan 6, 2020, 5:48:17 PM
to tesseract-ocr

The reason I want to do this:

I found that other processes running on the same server sometimes get an exit code of 255 and do not complete. If I can limit tesseract to 2 cores, the rest remain available for other processes.
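
One possible way to pin a process to specific cores on Windows is the /affinity switch of the cmd.exe start command, which takes a hexadecimal core mask. This technique is not proposed anywhere in this thread and was not tested on the server described here; the sketch below only builds the command line, and its helper names are hypothetical.

```python
def affinity_mask(cores):
    """Return the hex mask (without 0x) that selects the given core indexes."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return format(mask, "X")


def pinned_command(cmd, cores=(0, 1)):
    """Wrap a command so cmd.exe starts it restricted to the given cores.

    Assumes Windows: 'start' is a cmd.exe builtin, the empty string is the
    window title, /b avoids a new window and /wait blocks until it exits.
    """
    prefix = ["cmd", "/c", "start", "", "/b", "/wait",
              "/affinity", affinity_mask(cores)]
    return prefix + list(cmd)


if __name__ == "__main__":
    print(pinned_command(["tesseract", "doc.tif", "doc", "--psm", "4"]))
```

With cores=(0, 1) the mask is 3, so tesseract would be allowed to run only on the first two virtual cores and the other six stay free for other processes.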





On Wednesday, January 30, 2019 at 11:34:42 AM UTC-8, George Varghese wrote:

Shree Devi Kumar

Jan 7, 2020, 2:08:00 AM
to tesseract-ocr
Have you tried setting OMP_THREAD_LIMIT=1?
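
A small Python sketch of how that suggestion could be applied to a single tesseract run without changing the environment of the whole server. OMP_THREAD_LIMIT is a standard OpenMP environment variable, so this is assumed to have an effect only if the tesseract build in use was compiled with OpenMP; the function names are hypothetical.

```python
import os
import subprocess


def env_with_thread_limit(limit=1, base_env=None):
    """Copy the environment and cap the number of OpenMP threads."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = str(limit)
    return env


def run_tesseract(cmd, limit=1):
    """Run a tesseract command list with the thread cap applied to it only."""
    return subprocess.run(cmd, env=env_with_thread_limit(limit), check=True)
```

For example, run_tesseract(["tesseract", "doc.tif", "doc", "--psm", "4"]) would start tesseract with OMP_THREAD_LIMIT=1. In a Windows command prompt the equivalent is to run set OMP_THREAD_LIMIT=1 before the tesseract command.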


--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

George Varghese

Jan 7, 2020, 12:35:21 PM
to tesser...@googlegroups.com