Trouble uploading with OCR


declan...@ryerson.ca

Mar 26, 2019, 3:00:51 PM
to overview-users
I'm a fourth-year journalist at Ryerson University. I'm trying to organize an access to information request I just got back that has over 7,000 pages of PDF files. I keep getting "Failed with error 'do-convert-single-file exited with status code 1'" when I try to upload with the OCR option.

I've tried a few different things, like cutting the file into 1,000-page and then 500-page chunks before uploading, but that didn't work. I was able to upload the 500-page file, but only without OCR. I was able to get OCR on a 105-page file I cut out of the main file, with either the "One file is one document" or the "one page is one document" setting.

Do you have any suggestions? My first intention is to separate the documents into relevant groupings. Second, I want to see if there are duplicates. Third, I want to actually read them.

Jonathan Stray

Mar 26, 2019, 3:05:07 PM
to overview-users
Hi Declan. Not sure exactly what is going on, but if the one file is 7000 pages then it's likely that Overview is running out of memory trying to process it. Do you have some way to slice the file into smaller chunks before uploading? You could try the pdfseparate tool that comes with poppler for this: http://www.ubuntubuzz.com/2016/01/how-to-split-pdf-with-pdfseparate.html

  - Jonathan
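
The slicing suggested above can be sketched in Python. This is only a minimal sketch, not something from the thread: the helper names (chunk_ranges, split_commands, run_all), the output file names, and the 100-page chunk size are all assumptions (100 is picked only because a 105-page file was reported to OCR successfully). It assumes poppler's pdfseparate (with its -f/-l first/last page options) and pdfunite are installed, and that you already know the total page count.

```python
import subprocess
from pathlib import Path


def chunk_ranges(total_pages, chunk_size):
    """Return 1-based, inclusive (first, last) page ranges covering the file."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    return [
        (first, min(first + chunk_size - 1, total_pages))
        for first in range(1, total_pages + 1, chunk_size)
    ]


def split_commands(pdf_path, total_pages, chunk_size, out_dir):
    """Build (but do not run) the poppler commands for each chunk.

    pdfseparate writes one PDF per page, so each chunk needs a second
    step: pdfunite joins that chunk's pages back into a single file.
    File names here are hypothetical.
    """
    out_dir = Path(out_dir)
    pattern = str(out_dir / "page-%d.pdf")  # %d is replaced by the page number
    commands = []
    for first, last in chunk_ranges(total_pages, chunk_size):
        pages = [str(out_dir / f"page-{n}.pdf") for n in range(first, last + 1)]
        chunk = str(out_dir / f"chunk-{first:05d}-{last:05d}.pdf")
        commands.append(
            ["pdfseparate", "-f", str(first), "-l", str(last), str(pdf_path), pattern]
        )
        commands.append(["pdfunite", *pages, chunk])
    return commands


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

With an existing output directory named chunks, run_all(split_commands("request.pdf", 7000, 100, "chunks")) would produce 70 chunk files; the intermediate single-page files are left behind and can be deleted afterwards.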

declan...@ryerson.ca

Mar 26, 2019, 3:14:42 PM
to overview-users
I did try separating into smaller groups: first ~1000 pages, which didn't work, and then ~500, which didn't work for OCR but worked for non-OCR. I was hoping Overview would be able to automate separating things into relevant groups, but it looks like I'll have to do it manually. I may also try poppler, but I've never used it before.

Thanks!

Jonathan Stray

Mar 26, 2019, 3:19:43 PM
to overvie...@googlegroups.com
It can separate pages as you've seen, but it sometimes has trouble with large files, especially if they are scans. Hope this works out for you!

- Jonathan

Adam Hooper

Mar 27, 2019, 7:50:19 AM
to overvie...@googlegroups.com
Hi Declan,

"Exit code 1" usually means "out of memory." Overview has a problem OCR-ing hi-res images. These often crop up in PDFs, and they often don't even have large file sizes. A single hi-res image will ruin the whole OCR process.

I recommend you try using OCRmyPDF to preprocess your files before uploading them to Overview. OCRmyPDF may succeed with OCR where Overview fails; Overview will still help you read the documents.

Enjoy life,
Adam
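
As a rough illustration of the preprocessing step, here is a minimal Python sketch that runs OCRmyPDF over a folder of already-split PDFs before they are uploaded. It uses only the basic "ocrmypdf INPUT OUTPUT" command-line form; the helper names (ocr_commands, run_ocr) and the directory layout are assumptions, and it presumes the ocrmypdf command is installed and on the PATH.

```python
import subprocess
from pathlib import Path


def ocr_commands(in_dir, out_dir):
    """Build one `ocrmypdf INPUT OUTPUT` command per PDF in in_dir, sorted by name."""
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    return [
        ["ocrmypdf", str(pdf), str(out_dir / pdf.name)]
        for pdf in sorted(in_dir.glob("*.pdf"))
    ]


def run_ocr(in_dir, out_dir):
    """Run the commands one file at a time and return the inputs that failed.

    Processing files individually means one problematic file (for example,
    one containing a very high-resolution image) does not stop the rest.
    """
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    failed = []
    for cmd in ocr_commands(in_dir, out_dir):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            failed.append(cmd[1])
    return failed
```

Calling run_ocr("chunks", "ocr") would write the OCRed copies to the ocr directory and return the paths of any files that OCRmyPDF could not process, so those can be inspected separately.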

Declan Keogh

Mar 27, 2019, 8:15:22 AM
to overvie...@googlegroups.com
Hi Adam,

Thanks for this! I was actually wondering about this. If I run OCR in Adobe before I upload the file, would that transfer into Overview? I did run an OCR already and it took about 4-5 hours.

Adam Hooper

Mar 27, 2019, 11:23:03 AM
to overvie...@googlegroups.com
Yes, Overview works great with OCRed files.

Enjoy life,
Adam