My name is Anil Kumar. I am planning to develop a tool that converts
a Microsoft Word or PDF file to an XML file.
I know how to parse an XML file using the tDOM package. I just want to
know how to convert a Word or PDF document to an XML document
using Tcl. Is there any package available?
Thanks in advance!
Thanks to the entire comp.lang.tcl group, because you helped me when
I started learning tDOM.
Regards,
Anil A Kumar
Word is already XML.
Otherwise, see roundtrip inside docbook-xsl on SourceForge;
there is a Word template which allows you to import and export
Word documents from and to DocBook.
-roger
Yes - It's called "Microsoft Word"
...but it won't help.
The problem is that the Word document model is very complicated and
your document is unlikely to be in a form that is easy to handle.
Most of the elements will just be getting in the way.
Your best bet might be to save as HTML and parse that, since it's
simpler but may well still have some structure that you can use in the
form of <div>, <h1>, etc.
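A minimal sketch of the "save as HTML and parse" idea, assuming the tDOM package is installed and that its forgiving -html parser copes with the markup Word writes. The proc name htmlOutline and the choice of tags to collect are made up for illustration.

```tcl
package require tdom

# Return a list of {tag text} pairs for the headings and paragraphs
# found in an HTML string, in document order.
proc htmlOutline {html} {
    # -html selects tDOM's forgiving HTML parser instead of strict XML.
    set doc [dom parse -html $html]
    set root [$doc documentElement]
    set result {}
    foreach node [$root selectNodes {//h1 | //h2 | //h3 | //p}] {
        # string(.) gathers the text of nested elements such as <span>.
        set text [string trim [$node selectNodes string(.)]]
        if {$text ne ""} {
            lappend result [list [$node nodeName] $text]
        }
    }
    $doc delete
    return $result
}
```

Each pair can then be written out as whatever XML element suits the target format.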
> Anil A Kumar wrote:
>
>>Hi all,
>>
>>My name is Anil Kumar. I am planning to develop a tool, which converts
>>a microsoft word/pdf formatted file to a xml file.
>>
>>I know parsing an xml file using tDom package. I just want to know how
>>to convert a word/pdf formatted document to an xml formatted document
>>using TCL. Is there any package available?
>>
Roger
> word is already xml.
This is only true if the Word document is in docx (OOXML) format.
> Otherwise see roundtrip inside the docbook-xsl on sourceforge,
> there is a word template which allows you to import and export
> word documents from and to docbook.
Another approach might be to use the tcom package in Tcl to navigate the
document using Word's COM (ActiveX) object model, and use this to create a
new file, writing the relevant XML tags according to what you find in the
Word structure and moving the associated data from Word to your new
file.
The above is just a thought - I've never used the tcom package or created
xml documents this way. However, in the past I have used Word VBA to
navigate a word document and create an HTML document from the data there.
The principle of doing this navigation using the object model via tcl and
the tcom package should, I'd imagine, yield similar possibilities...
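A rough, untested sketch of what that navigation might look like. It assumes Windows, an installed copy of Word, the tcom package, and that tcom supplies defaults for the optional arguments of Documents.Open, Close and Quit. The proc names wordToXml and xmlEscape and the <document>/<para> tags are made up for illustration.

```tcl
# Windows only: needs the tcom package and an installed copy of Word.
package require tcom

# Escape the characters that are special in XML text content.
proc xmlEscape {text} {
    return [string map {& &amp; < &lt; > &gt;} $text]
}

# Walk the paragraphs of a Word document through COM and return them
# wrapped in simple (made-up) XML tags.
proc wordToXml {path} {
    set word [::tcom::ref createobject "Word.Application"]
    $word Visible 0
    set doc [[$word Documents] Open [file nativename [file normalize $path]]]
    set xml "<document>\n"
    ::tcom::foreach para [$doc Paragraphs] {
        # Range.Text ends with a paragraph mark; trim it off.
        set text [string trimright [[$para Range] Text] "\r\n"]
        if {$text ne ""} {
            append xml "  <para>[xmlEscape $text]</para>\n"
        }
    }
    append xml "</document>"
    # Assumption: the optional arguments of Close and Quit may be omitted.
    $doc Close
    $word Quit
    return $xml
}
```

A fuller version would also look at each paragraph's style to decide which tag to emit, rather than using <para> for everything.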
Rob.
I will look into tcom and update you all.
Thanks and regards,
Anil A Kumar
COM does work in principle (I have used it in the opposite direction,
from C++, in the past).
Another approach:
http://msdn.microsoft.com/en-us/library/aa537167(office.11).aspx
Generate .fo (XSL-FO) from a Word document,
then pass it through an FO processor such as FOP (fop.apache.org)
to generate SVG, and import the SVG into Tk.
The Word document can then be rendered in Tk, and it will preserve
its structure.
-roger
The best thing to do is use Word 2003 or Word 2007. Both of these
versions of Word can save their documents as XML, which can then be
transformed using XSLT (either TclXML or tDOM may provide the
transformation infrastructure). With Office 2007, the Word document is
actually a Zip file, which can be mounted as a virtual filesystem using
Tcl's VFS.
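A minimal sketch of the Word 2007 route, assuming the tclvfs zip driver (vfs::zip) and tDOM are installed. It uses XPath rather than XSLT to keep the example short; the proc names readDocumentXml and docxParagraphs are made up for illustration.

```tcl
package require vfs::zip
package require tdom

# Read word/document.xml out of a .docx file by mounting the file
# as a zip virtual filesystem on top of its own path.
proc readDocumentXml {docx} {
    set mnt [vfs::zip::Mount $docx $docx]
    set fh [open [file join $docx word document.xml] r]
    fconfigure $fh -encoding utf-8
    set xml [read $fh]
    close $fh
    vfs::zip::Unmount $mnt $docx
    return $xml
}

# Return the text of each non-empty paragraph (w:p) as a Tcl list.
proc docxParagraphs {xml} {
    # WordprocessingML namespace used inside .docx files.
    set ns [list w http://schemas.openxmlformats.org/wordprocessingml/2006/main]
    set doc [dom parse $xml]
    set root [$doc documentElement]
    set paragraphs {}
    foreach p [$root selectNodes -namespaces $ns {//w:p}] {
        # A paragraph's text is split over several runs (w:r/w:t).
        set text ""
        foreach t [$p selectNodes -namespaces $ns {.//w:t}] {
            append text [$t text]
        }
        if {$text ne ""} {
            lappend paragraphs $text
        }
    }
    $doc delete
    return $paragraphs
}
```

Something like `docxParagraphs [readDocumentXml report.docx]` (report.docx being a placeholder file name) would then give the paragraph texts to wrap in your own XML.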
HTHs,
Steve Ball
> The best thing to do is use Word 2003 or Word 2007. Both of these
> versions of Word save their documents as XML which can then be
> transformed using XSLT (either TclXML or tDOM may provide the
> transformation infrastructure). With Office 2007, the Word document is
> actually a Zip file which can be mounted as a virtual filesystem using
> Tcl's VFS.
Wow, that is pretty interesting. Do you have any example scripts
available to do this stuff?
>BTW, I (mostly) wrote and currently maintain the DocBook roundtripping
>system. It does specifically target DocBook, but not for general-
>purpose conversion of documents; just for roundtripping. However, it
>can be adapted for general-purpose conversion.
Thanks!
I don't use roundtrip myself, since I write directly
in DocBook, but using roundtrip for word processor
text conversion appears to be the best approach for getting structured
documents. Styles are the only reliable structural element
in a Word document.
What's needed is a sort of
style mapper in Word itself (in VBA) that replaces the
current styles with those originating from DocBook.
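This is not the VBA mapper described above, just the same mapping idea sketched on the Tcl side, once paragraph styles and text have been extracted by some other means. The style names, the element names in the table, and the proc names are all assumptions for illustration.

```tcl
# Hypothetical table mapping Word style names to DocBook element names.
set styleMap {
    "Heading 1" title
    "Normal"    para
    "Code"      programlisting
}

# Escape the characters that are special in XML text content.
proc xmlEscape {text} {
    return [string map {& &amp; < &lt; > &gt;} $text]
}

# Turn a flat list of style/text pairs into one XML element per
# paragraph; styles missing from the map fall back to $default.
proc stylesToXml {paragraphs styleMap {default para}} {
    set out {}
    foreach {style text} $paragraphs {
        if {[dict exists $styleMap $style]} {
            set tag [dict get $styleMap $style]
        } else {
            set tag $default
        }
        lappend out "<$tag>[xmlEscape $text]</$tag>"
    }
    return [join $out \n]
}

puts [stylesToXml {"Heading 1" "Tcl & XML" Normal "Some text"} $styleMap]
# prints:
# <title>Tcl &amp; XML</title>
# <para>Some text</para>
```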
The FO route is also interesting: it makes it possible
to design chunks of text in a word processor and reuse them WYSIWYG
in a Tk interface, for example in a sophisticated online help system.
-roger
>The best thing to do is use Word 2003 or Word 2007. Both of these
>versions of Word save their documents as XML which can then be
>transformed using XSLT (either TclXML or tDOM may provide the
>transformation infrastructure). With Office 2007, the Word document is
>actually a Zip file which can be mounted as a virtual filesystem using
>Tcl's VFS.
>
>HTHs,
>Steve Ball
Sounds like a lot of work.