Serious recognition speed issues

Bastian

Nov 7, 2011, 7:40:21 AM
to ABBYY OCR for Linux
Hi,

we have a quite serious problem regarding the recognition speed of
ABBYY OCR 9. Unfortunately, the application doesn't autonomously scale
on multi-core systems. My first inquiry on this led to the advice of
manually starting multiple abbyyocr9 processes in order to fully
utilize multi-core systems and speed up the OCR process.

Hence, we've implemented a routine that splits PDF documents with
embedded images into single-page PDF documents, starts n abbyyocr9
processes (one per page, where n is the number of pages), and assembles
the recognized single pages back into one PDF document.
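
In shell terms, the routine does roughly the following (a simplified
sketch with placeholder options, not our actual code; it assumes pdftk
and abbyyocr9 are on the PATH):

#!/bin/bash
# Split the multi-page PDF into single-page PDFs: pg_0001.pdf, pg_0002.pdf, ...
pdftk input.pdf burst output pg_%04d.pdf

# Start one abbyyocr9 process per page in the background.
for page in pg_*.pdf; do
    abbyyocr9 -if "$page" -f PDF -of "ocr_$page" &
done
wait   # wait for all OCR processes to finish

# Reassemble the recognized pages into one document.
pdftk ocr_pg_*.pdf cat output recognized.pdf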

Our problem is that, surprisingly, performing OCR on an n-page
document with abbyyocr9 on a quad-core machine using our split-up
approach takes about the same amount of time as passing the whole
multi-page document to a single abbyyocr9 process.

Some numbers:

Ten page document, single process: 61 seconds
Ten one-page documents, ten parallel processes (one for each page): 55 seconds

Two page document, single process: 17 seconds
Two one-page documents, two parallel processes (one for each page): 28 seconds (!!)

So, there is hardly any noticeable speed increase when starting
multiple abbyyocr9 processes. But, and that's even stranger, each
process utilizes one core at 100%. So four processes put more or less
100% load on the machine, yet result in absolutely no speed increase
compared to a load of 25% (when only one core is utilized).

Can you explain this behavior and give us a hint on how we can
efficiently speed up the extraction process?

Regards

Bastian

Tomato

Nov 7, 2011, 8:23:31 AM
to ABBYY OCR for Linux
Hi Bastian,

OCR has significant initialization overhead, so splitting into only
1-2 pages per process does not make much sense. That is why you see
such strange results. Could you please run a similar experiment with a
larger number of pages, say 10 pages per process or more? It should
give you a better effect from parallelization.

Regards,
Andrey

Bastian

Nov 11, 2011, 5:35:47 AM
to ABBYY OCR for Linux
Hi Andrey,

we've already tried it:

"Ten page document, single process: 61 seconds
Ten one-page documents, ten parallel processes (one for each page): 55
seconds "

I've also tried splitting into bunches of three pages, just out of
interest. No change; rather, it becomes even worse: the same document
(10 pages), converted in four bunches (3+3+3+1 pages) on a 4-core
machine, requires more than 60 seconds to convert.

In my understanding, the OCR process should speed up significantly when
splitting a document and processing it in parallel with more than one
OCR process. Still, all cores are fully utilized, but the runtime for
processing the whole document doesn't decrease. And requiring nearly
one minute to process a ten page document is a real bottleneck for our
application.

Thanks and best regards

Bastian

Bastian

Nov 14, 2011, 5:44:23 AM
to ABBYY OCR for Linux
Hi,

I've now tried splitting a 23-page document into three nearly equal
bunches (8+8+7); the execution time is more or less the same as when
splitting it into 23 single documents (>120 seconds) or when
processing it as one large document. Still, the three processes fully
utilize three CPU cores.

Any ideas on how we can perform "real" parallel extraction in order to
speed up the OCR process?

Thanks

Bastian

Svetlana

Nov 15, 2011, 3:22:01 AM
to ABBYY OCR for Linux
Hello Bastian,

I have performed several tests with a 427-page PDF on my dual-core
Ubuntu machine.
The command pattern I used is:

abbyyocr9 -rl German English -if ./input.pdf -f Text -tet UTF8 -of ./output.txt

Please see below my results:

• Sequential processing of 427-page PDF

Processing time: 7 min 23 sec

• Sequential processing of 214-page PDF (first half of test PDF)

Processing time: 3 min 40 sec

• Sequential processing of 213-page PDF (second half of test PDF)

Processing time: 3 min 46 sec

• Parallel processing of 214-page and 213-page PDFs (both halves of
test PDF)

To start parallel processing, I have used the following command:

abbyyocr9 -rl German English -if ./Part1.pdf -f Text -tet UTF8 -of ./Part1.txt & abbyyocr9 -rl German English -if ./Part2.pdf -f Text -tet UTF8 -of ./Part2.txt

Start time: 14:32:37
Part1 finish time: 14:36:22 -> processing time: 3 min 45 sec
Part2 finish time: 14:36:31 -> processing time: 3 min 54 sec

Total processing time: 3 min 54 sec

As you can see, my results show that the total processing time is
almost halved compared with sequential processing of the entire
427-page PDF when both halves of the document are processed in
parallel on two cores.

Could you please try to use my command pattern for parallel processing
and inform me about your results?

How do you start parallel processing in your environment?

Looking forward to hearing from you.

Best regards,
Svetlana

Bastian

Nov 22, 2011, 10:49:53 AM
to ABBYY OCR for Linux
Dear Svetlana,

please excuse the long delay in giving feedback. I also have to
apologize for our presumption regarding speed problems in ABBYY OCR 9
when starting more than one process in parallel. In fact, we located
the problem in our own codebase: it caused each OCR process to be
performed twice. That certainly explains our strange experiences when
parallelizing the OCR calls.

As small compensation, here are some of our benchmarks for other
users; maybe they help to optimize other applications as well
(quad-core machine):

23 page document, single document (no split, 1*23): 185 seconds
23 page document, split into 5 parts (4*5 + 1*3): 64 seconds
23 page document, split into 23 separate single page documents (23*1): 59 seconds
23 page document, split into 8 parts (7*3 + 1*2): 59 seconds
23 page document, split into 3 parts (2*10 + 1*3): 105 seconds

Still, it's a pity that the OCR app doesn't utilize multiple cores on
its own, as the memory footprint isn't negligible. More than 200 MB
per process can easily be occupied, which means memory usage of >2 GB
for 10 parallel processes - something to keep in mind when
parallelizing ABBYY OCR.

Thank you very much and best regards

Bastian

Chris Edeling

Feb 9, 2013, 6:06:44 AM
to abbyy-ocr...@googlegroups.com
Bastian reported his fastest result on a quad-core machine as:
23 page document, split in 23 separate single page documents (23*1) - 59 seconds

I have followed Svetlana's advice regarding the command pattern for parallel processing (cmd1 & cmd2, etc.)

and have achieved good results.  I use a quad-core Mac Mini Server with a CentOS operating system.  It has 4 GB RAM and a 2 GHz Intel Core i7.

22 page document, single PDFs as input: 35.2 seconds
22 page document, single TIFs as input: 28.5 seconds

It appears that the best speed is achieved using single-page TIFF images and Svetlana's command pattern, for as many pages as you wish.

STEPS FOLLOWED TO GET THESE RESULTS

Although my initial documents are usually PDFs, I split them into single TIFFs using Ghostscript.

E.g., in the .php script that calls abbyyocr9, I first build the Ghostscript command like this:


$inputfile="/var/www/html/dev/ABBYY/test/[FILENAME].pdf";
$outputfile=[path for outputfiles]."/split.%03d.tif";

// tiffg4 device at 300 dpi, one TIFF per page (%03d in the output name)
$gs_tiff_cmd="gs -sDEVICE=tiffg4 -r300 -o "
    .$outputfile
    ." "
    .$inputfile;

Remember that ABBYY prefers 300 dpi, so we use the  -r300 key

I then run that command like this, catching the output in the $answers array

$answers=array();
$numpages=0;
$rxp_page="/Page/";                  // Ghostscript prints one "Page N" line per page it renders
exec($gs_tiff_cmd,$answers,$last);   // run gs and capture its output lines in $answers
foreach($answers as $key=>$line){
    if(preg_match($rxp_page,$line)==1){
        $numpages++;                 // count the "Page ..." lines
    }
}

AND $numpages reports how many pages there are.
The single-page TIFFs are in the [path for outputfiles]
with filenames split.001.tif, split.002.tif, etc.

NOW DO THE OCR

$abby_cmd="abbyyocr9 [KEYS] -if '/path for outputfiles/split.001.tif' [output type, flags and destination]  & ";
$abby_cmd.="abbyyocr9 [KEYS] -if '/path for outputfiles/split.002.tif' [output type, flags and destination]  & ";

For every TIFF page, add another command line, using ".=" to concatenate them.
Every output destination filename must also be incremented, e.g. /path/001.pdf, /path/002.pdf, etc.

and finally 

exec($abby_cmd);
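
For readers not using PHP, the same idea as a plain shell loop - a
sketch only, with OUTDIR and the abbyyocr9 options as placeholders:

#!/bin/bash
# Start one abbyyocr9 process per single-page TIFF in the background,
# numbering the output PDFs 001.pdf, 002.pdf, ..., then wait for all of them.
OUTDIR="/path/for/outputfiles"
i=1
for tif in "$OUTDIR"/split.*.tif; do
    out=$(printf '%s/%03d.pdf' "$OUTDIR" "$i")
    abbyyocr9 -if "$tif" -f PDF -of "$out" &
    i=$((i+1))
done
wait   # all pages are recognized once this returns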

Hope this helps others

Chris

Bastian

Mar 12, 2013, 6:22:40 AM
to abbyy-ocr...@googlegroups.com
Hi,

interesting, thanks! How long does the Ghostscript step take to split a PDF into single TIFFs? Actually, if this improves speed significantly without adding much preprocessing time, maybe ABBYY should keep it in mind for their next version ;)

We're currently experimenting with an SSD in the hope of improving the runtime (no results yet).

Best regards

Bastian

Chris Edeling

Mar 12, 2013, 7:01:57 AM
to abbyy-ocr...@googlegroups.com
The Ghostscript splitting is very fast (less than a second to split a 30-page document) and reduces the time that ABBYY takes compared to giving it a PDF to recognise.   My application will in any event disallow uploads larger than 2 MB, so the TIFF splitting time is no problem at all.  A 2 MB PDF seldom has more than 30 pages, but even if it has 100 pages the splitting time is fast enough.

I suspect that ABBYY in any event splits a multi-page PDF document into single images (I don't know in what format) and that this step is skipped by ABBYY if it is given single-page TIFFs.

I am now struggling to develop code to manage the parallel processes when the jobs are given to the different CPUs.  Any help will be appreciated.  I have looked at Gearman, which seems well designed for that, but have been unable to install Gearman on CentOS 6.3 64-bit.  Some curl issues.

I am convinced that background processing must be used because the OCR takes too long to wait for and return results to the user immediately.
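
A minimal shell-level sketch of that kind of background processing
(paths and options are placeholders): the web request just starts the
job and returns, and the application later polls for a marker file.

# Detach the OCR job from the request; stdout/stderr go to a log,
# and a .done marker is created when the job finishes.
nohup sh -c 'abbyyocr9 -if /path/to/split.001.tif -f PDF -of /path/to/001.pdf \
             && touch /path/to/job.done' >/path/to/job.log 2>&1 &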

Regards


Chris  






Chris Edeling

Oct 18, 2014, 3:42:37 AM
to abbyy-ocr...@googlegroups.com
Faster parallel processing using GNU parallel on Linux:

The first part of the command is "parallel -k --load 40%", which regulates the load on the cores.  I set it at 40% to avoid server overload.

The second part is " abbyyocr11 -adt -if {}.png -f PDF -ptem ImageOnText -pfs MinSize   -pdwl -of {}.pdf -f TextVersion10Defaults  -tel -tpl  -of {}.txt ::: "

This is using CLI 11.  The important parts for parallel are the braces {}, which are placeholders for the filenames in the list that follows, and the :::, which means that the list of filenames follows.  All the other parts are standard abbyyocr11 commands and options; use whatever you like.

The full command with the list of files to process is:

parallel -k --load 40% abbyyocr11 -adt -if {}.png -f PDF -ptem ImageOnText -pfs MinSize   -pdwl -of {}.pdf -f TextVersion10Defaults  -tel -tpl  -of {}.txt ::: 1004.001 1004.002 1004.003 1004.004 1004.005 1004.006 1004.007 1004.008 1004.009 1004.010 1004.011 1004.012 1004.013 1004.014 1004.015 1004.016 1004.017 1004.018 1004.019 1004.020 1004.021 1004.022 1004.023 1004.024 1004.025 1004.026 1004.027 1004.028 1004.029 1004.030 1004.031 1004.032 1004.033 1004.034 1004.035 1004.036 1004.037 1004.038 1004.039 1004.040 1004.041 1004.042 1004.043 1004.044 1004.045 1004.046 1004.047

NOTE that the list has no file extensions, only the names without extension, because I want .pdf and .txt output files.  My input files were all .png.  That is why the command uses -if {}.png (to form the .png input filenames), -of {}.pdf to name the .pdf output files, and -of {}.txt for the .txt files.
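
If you don't want to type the file list by hand, GNU parallel's {.}
placeholder (the input filename with its extension stripped) should let
you pass the .png files directly via a shell glob - something like this
as a sketch of the same job (not re-timed):

# {}  = full input filename, e.g. 1004.001.png
# {.} = the same name with the extension stripped, e.g. 1004.001
parallel -k --load 40% \
  abbyyocr11 -adt -if {} \
    -f PDF -ptem ImageOnText -pfs MinSize -pdwl -of {.}.pdf \
    -f TextVersion10Defaults -tel -tpl -of {.}.txt \
  ::: 1004.*.png

The {.} form saves maintaining a separate extension-less list; everything
else is unchanged from the command above.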

The job took 57 seconds and produced 47 pdfs and 47 txt files with good quality OCR.  Server: Mac mini running CentOS Linux 6.5.  RAM 4GB.  Processor: Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz, 8 cores


Svetlana

Oct 20, 2014, 2:28:41 AM
to abbyy-ocr...@googlegroups.com
Hello Chris,

Please note that the abbyyocr11 tool supports built-in multi-processing with the option -mpm / --multiProcessingMode used with the Parallel parameter:

abbyyocr11 -mpm Parallel <...>

With this option, pages of a multi-page input file (PDF or TIFF) are distributed across the CPU cores allowed by your abbyyocr11 license.
By default, abbyyocr11 licenses do not have a CPU core limitation (unless you ordered a special license), so all available CPU cores are used.
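
For example, combined with the recognition options used earlier in this
thread, a full call would look something like this (a sketch):

abbyyocr11 -mpm Parallel -rl German English -if ./input.pdf -f PDF -of ./output.pdf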

Please try using this option and inform me about the results.

Best regards,
Svetlana

Chris Edeling

Oct 22, 2014, 2:39:50 AM
to abbyy-ocr...@googlegroups.com
Hi Svetlana

I use single pages for all output and do not use the option -mpm.

I have achieved fantastic parallel processing speeds as explained in my last post under the topic

EXPORT TO SINGLE PAGES

and gratefully acknowledge your good advice in that regard

The GNU parallel package for Linux is mature and flexible and works very well.  Using it in combination with the ABBYY -ipn option, as advised by you, has provided the solution that I require.
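
(For later readers: as the code posted further down in this thread
shows, -ipn takes a zero-based page number and makes abbyyocr11 process
only that page of a multi-page input, so each page can be handed to its
own parallel job. A minimal sketch:

# Recognize only page 0, i.e. the first page, of a multi-page PDF.
abbyyocr11 -ipn 0 -if ./input.pdf -f PDF -of ./page.001.pdf
)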


Chris

Chris Edeling

Oct 22, 2014, 2:43:01 AM
to abbyy-ocr...@googlegroups.com
Bastian

You may be interested to read my last post under the topic EXPORT TO SINGLE PAGES

With CLI version 11 I no longer need the Ghostscript splitting and achieve the fast results in another way, as explained in EXPORT TO SINGLE PAGES

Chris

Stephan Jau

Dec 25, 2014, 7:50:04 AM
to abbyy-ocr...@googlegroups.com
Hi there

I'm currently evaluating ABBYY OCR 11 and I also noticed that it's slow. However, I think that's due to the trial licence that I have. I use this command as of now:

abbyyocr11 -mpm Parallel -unopc -rl German French English -if test.pdf -f PDF -of mpm_unopc.pdf

htop also reports that only one core is being used.

As said, I suspect it's because I have a trial licence for testing.

However, when I get a volume-based licence (e.g. the 12k pages per year one), will I then be able to use all cores and hyperthreading?

Svetlana

Dec 26, 2014, 1:15:01 AM
to abbyy-ocr...@googlegroups.com
Hi Stephan,

If your document test.pdf is single-page, then it will be processed on a single core. Only multi-page documents can be distributed across several cores.
In addition, the -unopc option does not work on Linux - by default all cores, including hyper-threaded ones, are allowed for processing.

There is also no difference between a trial license and a normal license in terms of CPU core usage.

Best regards,
Svetlana

Stephan

Dec 26, 2014, 2:59:53 AM
to abbyy-ocr...@googlegroups.com

Hi Svetlana

Thanks for your answer. It's a 20-page document. I tried with and without the -unopc option. But still, I only see one core working - out of 4 cores.

Very weird, though..... I just tried it again and now it used all cores...


root@test:~# cat test.sh                                                                                                                                      
#!/bin/bash

pdftk test.pdf dump_data | grep NumberOfPages

abbyyocr11 -mpm Parallel -rl German French English -if test.pdf -f PDF -of mpm_unopc.pdf
root@test:~# time ./test.sh 
NumberOfPages: 20

real    0m20.509s
user    0m39.878s
sys     0m1.704s

Got a new test licence and expanded the 20 pages to 80 pages:

root@test:~# time ./test.sh 
NumberOfPages: 80

real    1m55.865s
user    2m55.981s
sys     0m5.514s


I guess I hit some bug yesterday....

Although, I noticed that in the beginning all cores were used and then after a while it just dropped back to a single core... Not sure what's going on there.

Anyway, using all cores the speed is good and the recognition is great.

Svetlana

Dec 26, 2014, 4:02:40 AM
to abbyy-ocr...@googlegroups.com
Dear Stephan,

Processing with abbyyocr11 includes the following stages:

1. Image opening
2. Preprocessing
3. Analysis
4. Recognition
5. Document synthesis
6. Export

Only stages 2, 3 and 4 can be distributed to multiple CPU cores. The other stages are performed on a single core, which is why you may see all cores busy during recognition and only one core towards the end of a job.
If you have further questions, please don't hesitate to ask.

Best regards,
Svetlana

Stephan

Dec 26, 2014, 4:36:01 AM
to abbyy-ocr...@googlegroups.com
Hi Svetlana

Thanks for the explanation. When it's using multi-core it's nice :)

Stephan

Chris Edeling

Dec 26, 2014, 6:08:56 AM
to abbyy-ocr...@googlegroups.com
Dear Stephan

I am pleased that your problem seems to have resolved itself.

I have had good and very fast results with version 11 on Linux, and Svetlana was very good at advising and troubleshooting.  You may want to see my earlier posts about using GNU Parallel, because I found that most useful.  For my particular needs I must have single pages, but I don't suppose that should make much difference to speed.  When I got it right, the average speed was about one second per page - of course using multicore (Mac mini server running Linux - I think it has 8 cores).

I also use cron jobs to limit server overload, because OCR is such a resource-intensive process that we don't want our server to burn out.


Best

Chris Edeling


Stephan

Dec 26, 2014, 9:46:43 AM
to abbyy-ocr...@googlegroups.com
Well, just for fun, I'll run another experiment, splitting that 80-page PDF up into 4x 20 pages...

As Svetlana said, some parts of the process can't be distributed to multiple cores. However, if I assign a core to each part and then combine them at the end with pdftk again, it should be faster... The question is only how much faster... It'll use a lot more RAM, since there will be four processes running, so I want to check whether it's worth doing that :)

Chris Edeling

Dec 26, 2014, 9:54:11 AM
to abbyy-ocr...@googlegroups.com
I found in some limited cases that pdftk did not work, but as I recall that was when I had used Ghostscript to split the pages.  It seems that some software that produces PDFs does not make them standards compliant.

I am surprised to hear that only certain stages go multi-core.

Perhaps you may also want to consider what I did, i.e. split the entire doc into single pages, then join them all together with pdftk.

Best

Chris


Stephan

Dec 26, 2014, 10:34:10 AM
to abbyy-ocr...@googlegroups.com
Hi Chris

It seems the splitting isn't working anymore on v11.

AAABBBBYBYB YCYLYI OLCIR C101LC IRf o r0 1CLRi1 1nfu1xo rf
[some more weird chars]

Error: FREngine internal error: Access to /tmp/ABBYY FineREader Engine 11 was denied.
Error: FREngine internal error: Access to /tmp/ABBYY FineREader Engine 11 was denied.
Error: FREngine internal error: Access to /tmp/ABBYY FineREader Engine 11 was denied.

It seems it tries to access the same tmp space and fails to do so.....


The script I tried:

#!/bin/bash

pdftk /root/test.pdf dump_data | grep NumberOfPages

# split into 4 parts, each 20 pages
pdftk /root/test.pdf cat 1-20 output /root/p1.pdf &
pdftk /root/test.pdf cat 21-40 output /root/p2.pdf &
pdftk /root/test.pdf cat 41-60 output /root/p3.pdf &
pdftk /root/test.pdf cat 61-end output /root/p4.pdf &

# Wait for background jobs to finish
wait

# Do the OCR
abbyyocr11 -rl German French English -if /root/p1.pdf -f PDF -of /root/f1.pdf &
abbyyocr11 -rl German French English -if /root/p2.pdf -f PDF -of /root/f2.pdf &
abbyyocr11 -rl German French English -if /root/p3.pdf -f PDF -of /root/f3.pdf &
abbyyocr11 -rl German French English -if /root/p4.pdf -f PDF -of /root/f4.pdf &

# wait for background jobs to finish
wait

# combine the four parts to final pdf
pdftk /root/f1.pdf /root/f2.pdf /root/f3.pdf /root/f4.pdf cat output final.pdf

Chris Edeling

Dec 26, 2014, 10:58:37 AM
to abbyy-ocr...@googlegroups.com
Maybe try to adapt and use some of this PHP code, which works fine for me.
Many of the vars will be useless to you, because I use all single pages.
But you should be able to concat them later with pdftk.


$NumberOfPages = getnumpages($target_pdf,$ThisFile['infile_id']); // see func below
$cum_has_pages = $cum_has_pages + $NumberOfPages;
$CoreFilename = $ThisFile['fump'] ; // I have rules to make filenames
$target_dir= $src_path;
$target_filename = $src_file;
$target_filenameSansExt=$CoreFilename;
$parallelCommands=array(); // I make many of these, one per page

for($pagenumber=0 ; $pagenumber < $NumberOfPages ; $pagenumber++){
    $ThisPgIPN = $pagenumber;      // abbyy gets page numbers with ZERO base
    $ThisPgNum = $pagenumber + 1;  // I want them numbered from 1, not from zero
    $ThisPaddedPgNum = str_pad($ThisPgNum,3,"0",STR_PAD_LEFT); // I use this to name my single pages in sequence
    $ThisPageAbbyCmnd = "\"abbyyocr11 "
        ." -ipn ".$ThisPgIPN." " // this makes abbyy process only the target page in the multipage pdf - it is fast
        ." -if ".$target_dir."/".$target_filenameSansExt.".pdf "
        ." -f TextUnicodeDefaults "
        ." -tel "
        ." -trl "
        ." -of ".$target_dir."/".$target_filenameSansExt.".".$ThisPaddedPgNum.".txt "
        ." -f PDF "
        ." -ptem ImageOnText "
        ." -pfs MinSize "
        ." -pdwl "
        ." -of ".$target_dir."/".$target_filenameSansExt.".".$ThisPaddedPgNum.".pdf"
        ."\""
        ;
    $parallelCommands[] = $ThisPageAbbyCmnd; // add this one to my array of commands to run in parallel later
}

$TheParallelComnd="parallel -k --load 40%  ::: "; // start parallel command
$TheParallelComnd.= implode(' ',$parallelCommands) ." > /dev/null "; // add for every page
exec($TheParallelComnd); // run the parallel command


function getnumpages($target_pdf,$infile_id){
    // Ask Ghostscript for the page count; with -q it prints just the number.
    $ThePageCountCmnd="gs -q -dNODISPLAY -c \"(".$target_pdf.") (r) file runpdfbegin pdfpagecount = quit\"";
    exec($ThePageCountCmnd,$answers);
    $NumberOfPages = $answers[0];
    return $NumberOfPages;
}

Stephan

Dec 26, 2014, 11:38:01 AM
to abbyy-ocr...@googlegroups.com
Hmmmm, you don't set a different tmp dir either, yet it works...

I tried to see what happens if I set TMPDIR="/tmp/tmp/" and then run abbyyocr11 from the CLI.... it still created its tmp directories in /tmp/ instead of /tmp/tmp/.

Stephan

Dec 26, 2014, 11:41:58 AM
to abbyy-ocr...@googlegroups.com
Ok, I just found another cli switch :)   -tmp   ;) will retry.

Stephan

Dec 26, 2014, 11:57:17 AM
to abbyy-ocr...@googlegroups.com
Didn't help either...

There's still a "/tmp/ABBYY FineReader Engine 11" directory, which contains a PID folder and a PID.lock file. However, the actual tmp paths were created:


-rw-------  1 root root    0 Dez 26 17:53 AbbyOCR11.6D5UiBRbpL
-rw-------  1 root root    0 Dez 26 17:53 AbbyOCR11.71TkAph2SP
-rw-------  1 root root    0 Dez 26 17:53 AbbyOCR11.EHudpbBcjO
-rw-------  1 root root    0 Dez 26 17:53 AbbyOCR11.m8rJCcagB1
drwxrwxrwx  3 root root 4096 Dez 26 17:53 ABBYY FineReader Engine 11
drwxrwxrwx  2 root root 4096 Dez 26 17:53 AbbyyMtx
drwxrwxrwt  2 root root 4096 Dez 26 17:52 .ICE-unix
drwxrwxrwt  2 root root 4096 Dez 26 17:52 .X11-unix





I'm now using this script:



#!/bin/bash

pdftk /root/test.pdf dump_data | grep NumberOfPages

doSplitting()
{

        # split into 4 parts, each 20 pages
        pdftk /root/test.pdf cat 1-20 output /root/p1.pdf &
        pdftk /root/test.pdf cat 21-40 output /root/p2.pdf &
        pdftk /root/test.pdf cat 41-60 output /root/p3.pdf &
        pdftk /root/test.pdf cat 61-end output /root/p4.pdf &

        # Wait for background jobs to finish
        wait

        tmpDir=$(mktemp /tmp/AbbyOCR11.XXXXXXXXXX)
        abbyyocr11 -tmp "${tmpDir}" -rl German French English -if /root/p1.pdf -f PDF -of /root/f1.pdf &
        tmpDir=$(mktemp /tmp/AbbyOCR11.XXXXXXXXXX)
        abbyyocr11 -tmp "${tmpDir}" -rl German French English -if /root/p2.pdf -f PDF -of /root/f2.pdf &
        tmpDir=$(mktemp /tmp/AbbyOCR11.XXXXXXXXXX)
        abbyyocr11 -tmp "${tmpDir}" -rl German French English -if /root/p3.pdf -f PDF -of /root/f3.pdf &
        tmpDir=$(mktemp /tmp/AbbyOCR11.XXXXXXXXXX)
        abbyyocr11 -tmp "${tmpDir}" -rl German French English -if /root/p4.pdf -f PDF -of /root/f4.pdf &

        # wait for background jobs to finish
        wait

        # combine the four parts to final pdf
        pdftk /root/f1.pdf /root/f2.pdf /root/f3.pdf /root/f4.pdf cat output final.pdf

}

doNormal()
{
        abbyyocr11 -rl German French English -if /root/test.pdf -f PDF -of /root/output.pdf
}


doSplitting


Svetlana

Dec 29, 2014, 8:07:34 AM
to abbyy-ocr...@googlegroups.com
Hello Stephan,

Unfortunately, I couldn't reproduce your issue concerning "Access to /tmp/ABBYY FineREader Engine 11 was denied."
I hope (though I still cannot guarantee it) that the next release of the abbyyocr11 tool will be launched by February. When it is available, please try testing your scenario with it.

Merry Christmas and Happy New Year!

Best regards,
Svetlana

Stephan

Dec 31, 2014, 4:57:37 PM
to abbyy-ocr...@googlegroups.com
I always encounter the strangest of problems. Well, I have it working with -mpm and was only wondering if actually splitting up and combining would be faster.

What do you mean by next release of abbyyocr11 tool? Will there be a version 12 coming?

Michael Fuchs

Jan 2, 2015, 11:11:47 AM
to abbyy-ocr...@googlegroups.com
Hello Stephan,

The CLI OCR application is based on FineReader Engine 11 for Linux (the SDK for developers).
ABBYY publishes updates on a regular basis. The maintenance releases come with new features, API adjustments and fixes.

From time to time the CLI tool will also get an update based on the latest maintenance releases. ABBYY does not map all new features to the CLI tool, but there will be some improvements based on what has changed and on feedback from our customers.

Just to be clear, the major product version (V11) will not be changed.

Best regards
Michael, ABBYY EU

