Yes, it's possible! Take a look at:
http://sage.math.washington.edu/home/drake/foo.pdf
(and related files in my sage.math home directory). You can use pdftk
(or Acrobat) to extract a worksheet from that pdf.
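For anyone who wants to script the pdftk route, here is a minimal Python sketch. It assumes pdftk is installed and on the PATH, and it relies on pdftk's unpack_files operation. The helper names and the default output directory are made up for illustration.

```python
import subprocess


def pdftk_unpack_command(pdf_path, output_dir):
    """Build the pdftk command line that unpacks all attachments of a PDF."""
    return ["pdftk", pdf_path, "unpack_files", "output", output_dir]


def unpack_attachments(pdf_path, output_dir="."):
    """Run pdftk; raises CalledProcessError if pdftk reports a failure."""
    subprocess.check_call(pdftk_unpack_command(pdf_path, output_dir))
```

Calling unpack_attachments("foo.pdf") should then write any attached files, including the worksheet, into the current directory.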
This complements Rob Beezer's recent ideas about making LaTeX and Sage
work together.
I think it would be great if we could get the notebook server to do this
to uploaded PDFs, but I thought I would ask what people think first.
Also, I don't know enough about the notebook server to add this
functionality and don't know how hard it might be.
Comments? Ideas?
Dan
1. http://groups.google.com/group/sage-support/msg/3ea7ed2eeab0824a
--
--- Dan Drake <dr...@kaist.edu>
----- KAIST Department of Mathematical Sciences
------- http://mathsci.kaist.ac.kr/~drake
I can also use PDFMiner to extract the worksheet. The nice thing is
that PDFMiner is written in pure Python. See
http://www.unixuser.org/~euske/python/pdfminer/
This extracts the Sage worksheet from the above PDF:
python -m tools.dumppdf -i20 -b foo.pdf > embedded-worksheet.sws
The -b flag selects binary mode, and -i20 says to extract the content
of object 20 in the PDF file. We know it is object 20 from looking at
object 23. Here is the output for object 23, interspersed with my
comments (lines starting with "#"):
$ python -m tools.dumppdf -i23 foo.pdf
# This is an XML representation of a dictionary, where each key is
# followed by its value.
<dict size="9">
  <key>C</key>
  <value><list size="3">
    <number>1</number>
    <number>0.9255</number>
    <number>0.7765</number>
  </list></value>
  # Here is where we find the file information. FS = "File Specification"
  <key>FS</key>
  <value><dict size="3">
    # EF = Embedded File
    <key>EF</key>
    <value><dict size="1">
      # F = File (in this case, it's an internal reference)
      <key>F</key>
      <value><ref id="20"/></value>
    </dict></value>
    <key>Type</key>
    <value><literal>Filespec</literal></value>
    # F = File; it's the filename
    <key>F</key>
    <value><string size="22">embedded-worksheet.sws</string></value>
  </dict></value>
  <key>Name</key>
  <value><literal>PushPin</literal></value>
  <key>AP</key>
  <value><dict size="3">
    <key>R</key>
    <value><ref id="21"/></value>
    <key>D</key>
    <value><ref id="21"/></value>
    <key>N</key>
    <value><ref id="21"/></value>
  </dict></value>
  <key>F</key>
  <value><number>4</number></value>
  <key>M</key>
  <value><string size="16">D:20081218143900</string></value>
  # We should scan for this key/value pair. This tells us that this
  # object contains info for a file attachment.
  <key>Subtype</key>
  <value><literal>FileAttachment</literal></value>
  <key>Type</key>
  <value><literal>Annot</literal></value>
  <key>Rect</key>
  <value><list size="4">
    <number>253.98</number>
    <number>216.522</number>
    <number>277.98</number>
    <number>230.522</number>
  </list></value>
</dict>
See p. 683 of the PDF 1.7 spec
(http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf from
http://www.adobe.com/devnet/pdf/pdf_reference_archive.html)
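Reading the dump by eye can also be automated. The sketch below parses dumppdf-style XML (with the "#" comment lines removed) using only the Python standard library, and returns the attached filename together with the id of the embedded-file object. The function names and the trimmed sample are illustrative assumptions, not part of PDFMiner.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the object 23 dump, without the "#" comment lines.
SAMPLE = """
<dict size="3">
<key>FS</key>
<value><dict size="3">
<key>EF</key>
<value><dict size="1">
<key>F</key>
<value><ref id="20"/></value>
</dict></value>
<key>Type</key>
<value><literal>Filespec</literal></value>
<key>F</key>
<value><string size="22">embedded-worksheet.sws</string></value>
</dict></value>
<key>Subtype</key>
<value><literal>FileAttachment</literal></value>
<key>Type</key>
<value><literal>Annot</literal></value>
</dict>
"""


def dict_entries(dict_elem):
    """Map each <key> text to the single child of the <value> after it."""
    children = list(dict_elem)
    entries = {}
    for key, value in zip(children[0::2], children[1::2]):
        entries[key.text] = value[0]
    return entries


def attachment_info(xml_text):
    """Return (filename, object id) for a FileAttachment dump, else None."""
    annot = dict_entries(ET.fromstring(xml_text))
    subtype = annot.get("Subtype")
    if subtype is None or subtype.text != "FileAttachment":
        return None
    filespec = dict_entries(annot["FS"])
    embedded = dict_entries(filespec["EF"])
    return filespec["F"].text, int(embedded["F"].get("id"))


print(attachment_info(SAMPLE))
```

For the sample above this reports the filename embedded-worksheet.sws and object id 20, which is the number passed to -i in the extraction command.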
So basically, it looks like we need to scan the pdf file for objects of
subtype FileAttachment, look at the FS key to find the filename, make
sure the filename ends in .sws, and then extract the internal object we
get from the EF key.
Sounds pretty easy, if we have something like pdfminer in Sage.
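As a rough sketch of that selection logic, independent of any particular PDF library: assume the PDF objects have already been parsed and resolved into plain Python dicts keyed by object id. That is a simplification, since a real parser returns its own object types and indirect references. The function name and the toy data are hypothetical.

```python
def find_sws_attachment(objects):
    """Return (filename, embedded object id) of the first .sws attachment.

    `objects` maps object ids to already-parsed PDF dictionaries, with
    indirect references represented as plain integer object ids.
    """
    for objid in sorted(objects):
        obj = objects[objid]
        if not isinstance(obj, dict):
            continue
        # Only file attachment annotations are of interest.
        if obj.get("Subtype") != "FileAttachment":
            continue
        filespec = obj.get("FS", {})
        filename = filespec.get("F", "")
        # Ignore attachments that are not Sage worksheets.
        if not filename.endswith(".sws"):
            continue
        embedded = filespec.get("EF", {})
        if "F" in embedded:
            return filename, embedded["F"]
    return None


# Toy stand-in for the relevant objects of foo.pdf.
objects = {
    20: b"worksheet bytes would go here",
    23: {
        "Type": "Annot",
        "Subtype": "FileAttachment",
        "FS": {
            "Type": "Filespec",
            "F": "embedded-worksheet.sws",
            "EF": {"F": 20},
        },
    },
}

print(find_sws_attachment(objects))
```

The returned object id is the one whose stream data would then be written out as the worksheet.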
Thanks,
Jason
Here is a short Python script that extracts the embedded worksheet from
the above PDF file and writes it to stdout. To run it, save it as
sage.py in the tools directory of the pdfminer distribution above, cd
to the pdfminer directory, and do:
python -m tools.sage foo.pdf > embedded.sws
Here's the file:
from pdflib.pdfparser import PDFDocument, PDFParser
import sys

stdout = sys.stdout
doc = PDFDocument()
# Read the PDF named on the command line (e.g. foo.pdf).
fp = open(sys.argv[1], 'rb')
parser = PDFParser(doc, fp)
doc.initialize()
for xref in doc.xrefs:
    for objid in xref.objids():
        try:
            obj = doc.getobj(objid)
        except Exception:
            # Skip objects that cannot be loaded.
            continue
        if (isinstance(obj, dict) and 'Type' in obj
                and obj['Type'].name == "Annot"):
            if ('Subtype' in obj
                    and obj['Subtype'].name == "FileAttachment"):
                # We have an attached file!
                filespec = obj['FS']
                # Look for embedded file; we could try to extract the
                # filename too, but that is platform dependent. See page
                # 182 (Section 3.10.2) of
                # http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf.
                if 'EF' in filespec:
                    fileobj = filespec['EF']['F']
                    stdout.write(fileobj.resolve().get_data())
                    # Just output the first file found.
                    sys.exit()
Thanks,
Jason
I don't think it will be that hard. Those are famous last words, though.
>
> Comments? Ideas?
>
This is now #4825
Jason
> This is a response to an idea that William mentioned recently [1]. He
> asked if it's possible to embed a Sage worksheet into a PDF so that one
> could upload the PDF to a Sage notebook server, which would then extract
> the worksheet and let you edit it.
>
> Yes, it's possible! Take a look at:
>
> http://sage.math.washington.edu/home/drake/foo.pdf
>
> (and related files in my sage.math home directory). You can use pdftk
> (or Acrobat) to extract a worksheet from that pdf.
>
> This complements Rob Beezer's recent ideas about making LaTeX and Sage
> work together.
This is very cool!
>
> I think it would be great if we could get the notebook server to do this
> to uploaded PDFs, but I thought I would ask what people think first.
> Also, I don't know enough about the notebook server to add this
> functionality and don't know how hard it might be.
>
> Comments? Ideas?
It would be neat if the worksheet could be generated from the .tex
source, with perhaps extra examples, so the author doesn't have to do
something totally separate/manually keep them in sync. But perhaps
you're already thinking along these lines.
- Robert
This is very cool. My only concern is that I don't see what I'll get if the
server unpacks the PDF automatically. But I suppose this is the same problem
with sws files anyway.
Cheers,
Martin
--
name: Martin Albrecht
_pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
_www: http://www.informatik.uni-bremen.de/~malb
_jab: martinr...@jabber.ccc.de
I see many ideas in converting latex>worksheet, pdf>worksheet and
worksheet>pdf. It would be cool if all those converged to a single
solution. Maybe the sphinx solutions could help.
Ronan
> The pdftk route worked fine for me. I'll add that KPDF (KDE's pdf
> viewer) falls into the "scant support" category. Not much of a
> surprise there.
Okular, the KDE-4 pdf viewer, has (some) support for attached files,
but it doesn't seem to see the attached worksheet in this case.
Franco
--