OCR error on .doc files

Charles McEvoy

Sep 19, 2012, 1:50:58 PM
to mayan...@googlegroups.com
I have a fresh install of Mayan EDMS on Ubuntu Server, done with
the fabric install file. A small amount of configuration
later and it is working *almost* perfectly...

Just getting 'error' with .doc files in the OCR queue -
pdf files are handled fine.

I'm not sure how to generate / find log files to see what is going wrong.

LibreOffice and unoconv are installed and found in the settings list.

Thanks for such a great product; I've been looking
for something like this for ages.

Roberto Rosario

Sep 20, 2012, 3:14:32 AM
to mayan...@googlegroups.com, cha...@mcevoy.com
Hi Charles,

Check to see if unpaper and tesseract version 3.x are installed.  The error log is limited to what the called binaries (unoconv, tesseract, LibreOffice) report back to Mayan, so the log may sometimes show very sparse error messages.  Thanks!   I'm glad it is what you were looking for :)

--Roberto

Charles McEvoy

Sep 20, 2012, 3:58:43 AM
to mayan...@googlegroups.com
Thanks.
Tesseract 3.02 and unpaper 0.3 are installed.
Sorry to be ignorant, but I don't know how to generate or find the logfiles -
could you point me the right way? Google hasn't helped this time!
Charles

Roberto Rosario

Sep 26, 2012, 10:42:46 AM
to mayan...@googlegroups.com, cha...@mcevoy.com
Hi,

Go to the Tools menu, then the OCR button, to see the queue of documents waiting for OCR processing.  Processed documents are deleted from the queue, and the ones with errors remain in the queue with the error message they produced.  If the error was internal, an exception is raised and Mayan stores the text of the exception; if it came from an external executable, Mayan tries to capture the text from STDERR or STDOUT, if the executable provides anything.  If only the word 'error' appears, the error most likely occurred while running an external binary that returned no error message (typical with tesseract 2.x), and it has to be diagnosed by hand.
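
The capture behaviour described above can be sketched roughly as follows. This is only an illustration and not Mayan's actual code: the helper name `run_and_capture` and the bare 'error' fallback string are assumptions, and it relies on nothing beyond Python's standard `subprocess` module.

```python
import subprocess
import sys


def run_and_capture(argv):
    """Run an external binary; return None on success, else an error message.

    Hypothetical helper for illustration, not taken from Mayan's source.
    """
    try:
        proc = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except OSError as exc:
        # The binary is missing or not executable: keep the exception text.
        return str(exc)

    if proc.returncode == 0:
        return None

    # Prefer STDERR, fall back to STDOUT, and finally to a bare 'error'
    # when the binary exits non-zero without printing anything.
    message = proc.stderr.strip() or proc.stdout.strip()
    return message or "error"


if __name__ == "__main__":
    # A command that fails without writing any output.
    print(run_and_capture([sys.executable, "-c", "import sys; sys.exit(1)"]))
```

A binary that exits with a failure code but prints nothing leaves only the fallback word to store, which would be consistent with seeing just 'error' in the queue.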

Try converting one of the documents giving you the error by hand:

libreoffice --headless --convert-to pdf <file> --outdir /tmp

If it converts correctly, run the resulting PDF file through unpaper:

unpaper --overwrite --no-multi-pages </tmp/pdf file> </tmp>

Then do the OCR on the corresponding output files from unpaper:

tesseract <unpaper /tmp file input>

Hopefully one of these steps will give an error message that points in the right direction to fix it.

Also try creating a simple test .docx document (e.g. lorem ipsum), upload it to Mayan and see if it gets OCRed.

--Roberto

Steve Kersley

Nov 28, 2012, 12:15:01 PM
to mayan...@googlegroups.com
Apologies for hijacking this topic, but I'm having what looks like a very similar issue.  Did you ever get your problem solved?
Office files (both Excel and Word) fail to OCR and, as with the original post, just show 'error' in the OCR queue.

Following the advice in this topic, I have run the commands manually.  LibreOffice runs fine and converts the file to a PDF, which I've opened and which displays correctly:
libreoffice --headless --convert-to pdf e2582b36-bbfd-4f00-b162-506bafa58f7e --outdir /tmp
convert /tmp/e2582b36-bbfd-4f00-b162-506bafa58f7e -> /tmp/e2582b36-bbfd-4f00-b162-506bafa58f7e.pdf using writer_pdf_Export

Unpaper however fails, and searching the Internet has given no specific answers.
unpaper --overwrite --no-multi-pages e2582b36-bbfd-4f00-b162-506bafa58f7e.pdf  /tmp
Processing sheet: e2582b36-bbfd-4f00-b162-506bafa58f7e.pdf -> /tmp
*** error: input file format using magic '%P' is unknown.
*** error: Cannot load image e2582b36-bbfd-4f00-b162-506bafa58f7e.pdf.
*** error: sheet size unknown, use at least one input file per sheet, or force using --sheet-size.

I've tried specifying A4 as sheet size but that made no difference.  I spotted a setting 'COMMON_DEFAULT_PAPER_SIZE' which defaults to Letter, so I changed that to A4 (not that it would make a difference when running the commands manually on the command line!).  This was with a Word .docx; I get exactly the same error with an Excel .xlsx file.  In both cases there is nothing clever about the files: just basic text and standard styles in the Word doc, and a short list of names and dates in the Excel file.  I have also uploaded a PDF and a text file, which both parsed correctly, generated thumbnails and had their text extracted, so is it just Office files?

Anyone have any clues?
Server is Ubuntu 12.04.  LibreOffice 3.5, Tesseract 3.x and Unpaper 0.3 are installed from the repository.  Mayan 0.12.2 is installed in a virtualenv by hand rather than with the fabric file, but using pip and installing the specific versions of packages listed in requirements/production.txt.

Cheers,
Steve.

Steve Kersley

Nov 29, 2012, 6:59:38 AM
to mayan...@googlegroups.com
Following up my own post, I think the unpaper error message is a red herring.

As far as I can see from looking through the source of unpaper, it *only* inputs/outputs ppm/pbm/pgm format files, and can't read or write a PDF.  So either I'm using the wrong version of unpaper or Roberto's suggested manual tests were wrong?

What exactly is the file format pipeline for importing an Office file so that I can check that I have all of the right tools, and they operate properly?

Is there any way to get more information than just 'error'?  I've tried enabling DEBUG=True, but I don't appear to be getting any more output in the apache error logs, and it doesn't seem to be generating any other logfile that I've found.  

Cheers,
Steve.

Lau Llobet

Dec 5, 2012, 8:07:00 PM
to Mayan EDMS
Hi Charles, Roberto and Steve,

I'm loving this software. I'm actually planning to start a business
digitizing files for small businesses, and this software is the one
I like most.

I'm having the same problem as you two: a simple 'error' given by the
binaries in the OCR queue.

Following Roberto's advice, I'm stuck at running unpaper on a PDF, with
the same error about "the magic %P". unpaper doesn't handle PDF! So
maybe Roberto can give us another way to check what is going on inside
Mayan, so we can simulate it by hand.

As far as I can see, there's no PDF output file from the "document as an
image" in the temporary folder, just an empty file called IHAKtmp,
so I guess the problem is at the first step, which should be the
LibreOffice jpg to pdf conversion. That may make sense, since we are
all using the same versions of unpaper and tesseract and we may not be
using the same LibreOffice.

I'm in a hurry trying to figure out which is the best software for my
company, and I would happily make a donation once I have it working
locally.


Also, while trying to solve this issue I've come to these observations:


1
--------------------------
Tesseract has to have its language training files under /usr/local/
in order to work,

like this:

lau@lau-H61M-D2-B3:/usr/local/share/tessdata$ ls
cat.traineddata   eng.cube.fold  eng.cube.params     eng.tesseract_cube.nn
configs           eng.cube.lm    eng.cube.size       eng.traineddata
eng.cube.bigrams  eng.cube.nn    eng.cube.word-freq  tessconfigs


2
--------------------------
Running tesseract on a .jpg straight from the scan gives EXTREMELY
better results than giving it a ppm "cleaned" by unpaper. In the
first case only 5 words on the page were mistaken; with the cleaned ppm,
tesseract gave only 3 comprehensible words in the whole page. No PDF
(a jpg converted via LibreOffice) is accepted by tesseract, which gives:

lau@lau-H61M-D2-B3:/tmp$ tesseract tarja.pdf tessed
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
Error in pixReadStream: Unknown format: no pix returned
Error in pixRead: pix not read
Unsupported image type.

3
----------------------------
Having a metadata tag indicating a language in Mayan and using it to
set the language flag of tesseract can improve results a lot! (50
words per page) If my project ends up using Mayan I will try to
program this feature.

Roberto Rosario

Dec 12, 2012, 9:46:41 PM
to mayan...@googlegroups.com
The conversion logic is complex and I had to look at the code.  You are right, there are two steps missing from my suggestion.  The logic is:
Office doc -> PDF
PDF -> JPG
JPG -> PPM
PPM -> TIFF
TIFF -> Tesseract

Why the TIFF step?  Because the old Tesseract (<2.0) only supported TIFF files.  I think the new version (3.02) supports more formats, so that is something to look at when the time comes to refactor the converter.
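
For anyone wanting to reproduce the chain by hand, here is a rough sketch that only builds the command line for each stage and does not run anything. The libreoffice and tesseract invocations follow the commands posted earlier in this thread. Everything else is an assumption for illustration: using ImageMagick's `convert` for the image conversions, placing unpaper between the PPM and TIFF steps, the intermediate file names, and the function name `ocr_pipeline_commands`. Mayan's converter may do these steps differently.

```python
import posixpath


def ocr_pipeline_commands(source, workdir="/tmp"):
    """Return one argv list per pipeline stage, in order. Nothing is executed.

    Hypothetical helper: tool choices and file names are assumptions.
    """
    base = posixpath.splitext(posixpath.basename(source))[0]
    pdf = posixpath.join(workdir, base + ".pdf")
    jpg = posixpath.join(workdir, base + ".jpg")
    ppm = posixpath.join(workdir, base + ".ppm")
    clean = posixpath.join(workdir, base + "-clean.ppm")
    tif = posixpath.join(workdir, base + ".tif")
    out = posixpath.join(workdir, base)  # tesseract appends .txt itself

    return [
        # Office doc -> PDF (command as posted earlier in the thread)
        ["libreoffice", "--headless", "--convert-to", "pdf", source,
         "--outdir", workdir],
        # PDF -> JPG (assumes ImageMagick; "[0]" selects the first page)
        ["convert", pdf + "[0]", jpg],
        # JPG -> PPM, the format unpaper reads
        ["convert", jpg, ppm],
        # Clean-up and de-skew with unpaper (assumed position in the chain)
        ["unpaper", "--overwrite", ppm, clean],
        # PPM -> TIFF
        ["convert", clean, tif],
        # TIFF -> text
        ["tesseract", tif, out],
    ]


if __name__ == "__main__":
    for argv in ocr_pipeline_commands("/tmp/test.docx"):
        print(" ".join(argv))
```

Running the printed commands one at a time should show which stage fails and what that binary prints when it does.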

Mayan stores anything written to STDERR after executing the binaries for OCR, so the simple 'error' message is what Mayan is getting from the command line.  I'm already using PBS (http://pypi.python.org/pypi/pbs) in some places to call binaries and am planning to use it in the converter to simplify things and hopefully capture more error information when things go wrong on the command line behind the scenes.

--Roberto

Roberto Rosario

Dec 12, 2012, 10:04:09 PM
to mayan...@googlegroups.com

Hi Lau,

I'm glad!  A lot of people are doing it and I'm very happy my software keeps creating commercial opportunities!

Yeah, my recommendation is missing two steps.  I had to convert to TIFF as mentioned above, and UNPAPER does de-skewing, which is why I added it to the workflow at the time.  I'm interested in testing the new Tesseract to see if it can cope with skewed images better, so I can remove UNPAPER or make it optional with a config option.  According to the Wikipedia page for Tesseract (http://en.wikipedia.org/wiki/Tesseract_%28software%29), it only added support for other file formats from 3.00 onwards.  The good thing is that it now has hOCR support, which is interesting because it makes text and image highlighting possible, as it provides a correlation of image coordinates to recognized text.

I added per-language support at one time, but sometimes documents have more than one language in them, so I implemented per-page language; but sometimes pages have more than one language in them (this happens in Puerto Rico a lot: English and Spanish are the main languages of the island and are intermixed).  In my experience this yielded poor results for the language other than the one selected, so I removed the feature.  But I'm open to giving it another go with the new version of Tesseract.
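
As a rough sketch of what passing a language to tesseract could look like: the `-l` option selects the language, and as far as I know Tesseract 3.02 and later also accept several codes joined with `+` for mixed-language pages. The function name `tesseract_command` is made up for illustration, and each language code needs its traineddata file installed in tessdata.

```python
def tesseract_command(image, output_base, languages=("eng",)):
    """Build a tesseract argv with one or more language codes.

    Hypothetical helper; it only builds the command and runs nothing.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    # Codes joined with '+' ask tesseract to load several languages at once
    # (assumed to need Tesseract 3.02 or later).
    return ["tesseract", image, output_base, "-l", "+".join(languages)]


if __name__ == "__main__":
    print(" ".join(tesseract_command("/tmp/page.tif", "/tmp/page", ("eng", "spa"))))
```

Whether loading both languages actually gives better results on intermixed English and Spanish pages than a single language would need testing on real documents.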

--Roberto

Roberto Rosario

Dec 17, 2012, 2:09:59 PM
to mayan...@googlegroups.com
During a recent installation of Mayan, word processing documents (.docx) were being detected as zip/compressed files and OCR was failing on them.  .docx files are in fact compressed files containing several XML files.  Upgrading the libmagic1 file allowed the 'file' command to detect the document as a "Microsoft Word 2007+" file, and upon reuploading, Mayan was able to OCR the documents correctly.  This could be one of the causes of the OCR failures being experienced in this thread.  Check to see if the 'file' command correctly detects the document type.

This is the current list of file MIME types Mayan will pass to LibreOffice for conversion to PDF if detected: https://github.com/rosarior/mayan/blob/master/apps/converter/office_converter.py#L17
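
To see why a .docx can look like a plain zip, and to cross-check what the 'file' command reports, here is a small sketch that opens the archive and looks at the member names. It does not use libmagic and is not how Mayan detects types; the function name `guess_ooxml_mime` and the rule of checking for `[Content_Types].xml` plus a `word/`, `xl/` or `ppt/` folder are my own assumptions.

```python
import zipfile

# Folder prefix inside the archive -> registered OOXML MIME type
OOXML_TYPES = {
    "word/": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "xl/": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "ppt/": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def guess_ooxml_mime(path):
    """Return an OOXML MIME type, 'application/zip', or None if not a zip.

    Hypothetical helper for manual checks only.
    """
    if not zipfile.is_zipfile(path):
        return None
    with zipfile.ZipFile(path) as archive:
        names = archive.namelist()
    # OOXML packages carry a [Content_Types].xml part next to the content folder.
    if "[Content_Types].xml" in names:
        for prefix, mime in OOXML_TYPES.items():
            if any(name.startswith(prefix) for name in names):
                return mime
    return "application/zip"


if __name__ == "__main__":
    import sys
    for filename in sys.argv[1:]:
        print(filename, guess_ooxml_mime(filename))
```

Because it looks at every member name rather than at a fixed position in the file, the result does not depend on the order in which the archive entries were written.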


Alek Geldenberg

Aug 4, 2013, 9:58:34 AM
to mayan...@googlegroups.com
Roberto,

Could you kindly post what exactly you did to "upgrade libmagic1 file"?  I have read some posts about changing the /etc/magic file to include the content of msooxml.  I tried to upgrade it in two ways:

#   Correct the mimetype with the registered ones:
#     http://technet.microsoft.com/en-us/library/cc179224.aspx
>>>>&26         string          word/           Microsoft Word 2007+
!:mime application/vnd.openxmlformats-officedocument.wordprocessingml.document
>>>>&26         string          ppt/            Microsoft PowerPoint 2007+
!:mime application/vnd.openxmlformats-officedocument.presentationml.presentation
>>>>&26         string          xl/             Microsoft Excel 2007+
!:mime application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
>>>>&26         default         x               Microsoft OOXML
!:strength +10


and this way:

#   Correct the mimetype with the registered ones:
#     http://technet.microsoft.com/en-us/library/cc179224.aspx
>>>>&26         string          word/           Microsoft Word 2007+ !:mime application/vnd.openxmlformats-officedocument.wordprocessingml.document
>>>>&26         string          ppt/            Microsoft PowerPoint 2007+ !:mime application/vnd.openxmlformats-officedocument.presentationml.presentation
>>>>&26         string          xl/             Microsoft Excel 2007+ !:mime application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
>>>>&26         default         x               Microsoft OOXML !:strength +10



Here is the output of the file command:
$ file testfile.docx

/etc/magic, 31: Warning: description `Microsoft Word 2007+ !:mime application/vnd.openxmlformats-offi' truncated
/etc/magic, 32: Warning: description `Microsoft PowerPoint 2007+ !:mime application/vnd.openxmlformat' truncated
/etc/magic, 33: Warning: description `Microsoft Excel 2007+ !:mime application/vnd.openxmlformats-off' truncated

However, none of that worked.  The docx files are still uploaded into Mayan as zip files.

I am running Ubuntu 12.04 LTS.


I hope you can help me with this issue.

Alek Geldenberg

Aug 4, 2013, 10:03:55 AM
to mayan...@googlegroups.com
Correction:

Changing the /etc/magic file as I described fixed uploading files made by MS Office.  However, docx files created by LibreOffice are still recognized as zip files.  I would love to know how /etc/magic has to be modified so that docx, xlsx and pptx files created by LibreOffice are also properly recognized.

Youri Lacan-Bartley

Aug 5, 2013, 5:49:07 AM
to mayan...@googlegroups.com
Hi Alek,

I imagine Roberto meant upgrading libmagic1 proper, not actually modifying the /etc/magic config file which to my understanding is specific to the file(1) command.
Which version of libmagic1 are you using?

I seem to have MS Office documents being correctly detected with version 5.11-2.

Alek Geldenberg

Aug 6, 2013, 2:20:17 PM
to mayan...@googlegroups.com
After I modified the content of the /etc/magic file as shown in my previous posts, Mayan no longer has a problem processing docx files produced by Microsoft Word.  It does, however, have a problem processing docx files produced by LibreOffice, which is not a big deal, since it can process .odt files.