pdftoxml Error

Andrey Tomashevskiy

Feb 9, 2014, 3:40:55 PM
to scrap...@googlegroups.com
Hi all,

Sorry if this is a simple problem, but I have been unable to find a solution anywhere and I am new to Python in general. I am having a lot of trouble getting the scraperwiki pdftoxml command to work on my computer. I am using Windows 8 with Python 2.7. Whenever I attempt to run the "pdftoxml" function from the scraperwiki module, I keep getting a "The system cannot find the path specified" error. I am guessing this has something to do with pdftohtml not working properly, but I am not sure how to fix it. I'd really appreciate any help.

Francis Irving

Feb 10, 2014, 2:35:21 AM
to scrap...@googlegroups.com

Are you using ScraperWiki Classic or new ScraperWiki?

Can you give a link to the scraper?

Andrey Tomashevskiy

Feb 11, 2014, 4:34:08 PM
to scrap...@googlegroups.com
I am not using the online ScraperWiki interface, if that is what you mean. I installed the scraperwiki module on my computer using pip. I also downloaded pdftohtml from http://sourceforge.net/projects/pdftohtml/ and placed the executable in a folder on my PATH. I tried following the example code at https://gist.github.com/psychemedia/5800840 but cannot get past line 11, since I get the "The system cannot find the path specified" error.
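
Stripped down, the part that fails is the scraperwiki.pdftoxml() call itself, something like this (the file name here is only an example):

import scraperwiki

# example file name; pass the raw PDF bytes to pdftoxml
with open("example.pdf", "rb") as f:
    pdfdata = f.read()

# this is the call that raises "The system cannot find the path specified"
xmldata = scraperwiki.pdftoxml(pdfdata)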

Alexander Dunbar

May 19, 2014, 11:55:32 AM
to scrap...@googlegroups.com
I am having a similar issue, but I'm not getting any error. When I try to run:

import urllib2
import scraperwiki

pdfdata = urllib2.urlopen(url)  # url: the PDF's address
xmldata = scraperwiki.pdftoxml(pdfdata.read())

xmldata is empty all the time.

Steven Maude

May 21, 2014, 8:05:11 AM
to scrap...@googlegroups.com
Need more information to be able to give you much help!

1. Which URL are you trying to access?

2. Do you have pdftohtml working on your machine?

If no to 2, what OS are you using?

If yes to 2, what happens if you download and save the PDF as input.pdf (e.g. via browser or wget) and then run:
pdftohtml -xml -nodrm -zoom 1.5 -enc UTF-8 -noframes input.pdf output.xml
(This is the command that scraperwiki.pdftoxml() executes.)

Does this generate an output.xml?
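
If it's easier, you can run the same command from Python and check the result; here's a rough sketch (it assumes input.pdf is in the current directory and pdftohtml is on your PATH):

import os
import subprocess

# the same command that scraperwiki.pdftoxml() runs, invoked directly
cmd = ["pdftohtml", "-xml", "-nodrm", "-zoom", "1.5",
       "-enc", "UTF-8", "-noframes", "input.pdf", "output.xml"]
ret = subprocess.call(cmd)

print "return code:", ret
print "output.xml exists:", os.path.exists("output.xml")

It's also worth checking that what you downloaded really is a PDF (the data should start with "%PDF") and not, say, an HTML error page.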

Nick Evershed

Oct 29, 2014, 1:49:21 AM
to scrap...@googlegroups.com
I'm having the same issue as Alexander, and have just confirmed pdftohtml is installed correctly and working.

pdftohtml version 0.40
scraperwiki freshly installed via pip

This code works fine on the ScraperWiki website, but xmldata is empty when I try to run it locally:

import scraperwiki
import requests

# the PDF's URL isn't shown in the post; r is the fetched response
r = requests.get(url)

xmldata = scraperwiki.pdftoxml(r.content)

print xmldata


Steven Maude

Oct 29, 2014, 5:49:40 AM
to scrap...@googlegroups.com
Hi Nick,

The pdftoxml function isn't particularly well documented or supported outside of Linux, I think. It's really just a wrapper around pdftohtml that takes in downloaded data, saves it as a temporary file, runs pdftohtml on that file, and saves the output as another temporary file.

If you're running on Windows, you could try a pdftoxml function that I
hacked to work on Windows a few months ago:

https://gist.github.com/StevenMaude/88def892b0cbfa8ae818#file-pdf_to_html-py-L42-L54

Paste in the pdftoxml() function (lines 42-54; they should be highlighted) and replace xmldata = scraperwiki.pdftoxml(r.content) with xmldata = pdftoxml(r.content).

This won't do the nice thing of using temporary files and cleaning up
after itself (you'll have to do that!), but it should at least work.
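
Roughly, the idea of that function (this is just a sketch of the approach, not the exact code from the gist) is to write the PDF bytes out to a file, shell out to pdftohtml, and read the XML back in:

import subprocess

def pdftoxml(pdfdata):
    # rough Windows-friendly sketch: fixed file names, no temp files, no cleanup
    with open("input.pdf", "wb") as f:
        f.write(pdfdata)
    subprocess.call(["pdftohtml", "-xml", "-nodrm", "-zoom", "1.5",
                     "-enc", "UTF-8", "-noframes", "input.pdf", "output.xml"])
    with open("output.xml", "rb") as f:
        return f.read()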

If not, the other thing you can do to diagnose the problem is to paste the pdftoxml() from here into your script:
https://github.com/scraperwiki/scraperwiki-python/blob/master/scraperwiki/utils.py#L41-L59
but then comment out or remove line 52 (the line: cmd = cmd + " >/dev/null 2>&1") as this suppresses any errors. You'll then hopefully get a better idea of what's going on.
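
In your pasted copy, the change is just commenting that one line out:

# cmd = cmd + " >/dev/null 2>&1"  # commented out so pdftohtml's errors are visible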

Another alternative is to try the PDF at https://pdftables.com - I just checked, and it does a mostly decent job of handling it.

Hope that helps,

Steve

Nick Leaton

Oct 29, 2014, 5:53:44 AM
to scrap...@googlegroups.com
Steve, 

Wrong Nick. 

Autocorrect for you. 

Nick

--
Nick

Gasper Zejn

Oct 29, 2014, 6:45:00 AM
to scrap...@googlegroups.com
Another alternative you could try is installing pypdf2xml [1] and then running:

from StringIO import StringIO
from pypdf2xml import pdf2xml
xmldata = pdf2xml(StringIO(r.content))


pypdf2xml is a replacement I wrote for scraperwiki.pdftoxml, because that used to have some problems with certain Unicode characters. It uses pdfminer, which is pure Python, so that may help you a bit in a Windows environment.
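
A self-contained version of the same thing (the URL is only a placeholder):

import requests
from StringIO import StringIO
from pypdf2xml import pdf2xml

# placeholder URL; substitute the PDF you are actually scraping
r = requests.get("http://example.com/some.pdf")
xmldata = pdf2xml(StringIO(r.content))
print xmldata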

kr,
Gasper Zejn

[1] https://github.com/zejn/pypdf2xml