
Is there any package to convert a Word-formatted document to XML?


Anil A Kumar

Nov 23, 2009, 12:32:27 AM
to 401...@gmail.com
Hi all,

My name is Anil Kumar. I am planning to develop a tool that converts
a Microsoft Word or PDF file to an XML file.

I know how to parse an XML file with the tDOM package. I just want to
know how to convert a Word or PDF document to an XML document
using Tcl. Is there any package available?

Thanks in advance!

Thanks to the entire comp.lang.tcl group, because you helped me when
I started learning tDOM.

Regards,
Anil A Kumar

Arndt Roger Schneider

Nov 23, 2009, 6:07:19 AM
Anil A Kumar wrote:

Word is already XML.
Otherwise, see roundtrip inside docbook-xsl on SourceForge:
there is a Word template that lets you import and export
Word documents from and to DocBook.

-roger

Nick Hounsome

Nov 23, 2009, 6:12:02 AM

Yes. It's called "Microsoft Word"

...but it won't help.

The problem is that the Word document model is very complicated, and
your document is unlikely to be in a form that is easy to handle.
Most of the elements will just get in the way.
Your best bet might be to save as HTML and parse that. It's simpler,
but it may well still have some structure that you can use, in the
form of <div>, <h1>, etc.
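
A minimal sketch of the "save as HTML and parse that" idea, assuming tDOM is installed. The HTML string is a made-up stand-in for a file saved from Word, and the set of tags picked out (h1 to h3 and p) is only an illustration; real Word HTML is much noisier than this.

```tcl
package require tdom

# Hypothetical stand-in for a file produced by Word's "Save as HTML".
set html {<html><body>
<h1>Introduction</h1>
<div class="Section1"><p class="MsoNormal">First paragraph.</p></div>
<h2>Details</h2>
<p class="MsoNormal">Second paragraph.</p>
</body></html>}

# -html selects tDOM's forgiving HTML parser instead of the strict XML one.
set doc  [dom parse -html $html]
set root [$doc documentElement]

# Collect {tag text} pairs for headings and paragraphs.
set items {}
foreach node [$root selectNodes {//*[self::h1 or self::h2 or self::h3 or self::p]}] {
    lappend items [list [$node nodeName] [string trim [$node asText]]]
}
$doc delete

foreach item $items {
    puts "[lindex $item 0]: [lindex $item 1]"
}
```

From a list like this it is a short step to writing out whatever XML vocabulary you want.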

Rob

Nov 23, 2009, 6:39:14 AM
Arndt Roger Schneider wrote:

> Anil A Kumar wrote:
>
>>Hi all,
>>
>>My name is Anil Kumar. I am planning to develop a tool, which converts
>>a microsoft word/pdf formatted file to a xml file.
>>
>>I know parsing an xml file using tDom package. I just want to know how
>>to convert a word/pdf formatted document to an xml formatted document
>>using TCL. Is there any package available?
>>

Roger

> word is already xml.

This will only be true if the Word document is in docx (OOXML) format.

> Otherwise see roundtrip inside the docbook-xsl on sourceforge,
> there is a word template which allows you to import and export
> word documents from and to docbook.

Another approach might be to use the tcom package in Tcl to navigate the
document through Word's COM (ActiveX) object model, and to write a new
file as you go: create the relevant XML tags according to what you find
in the Word structure, and copy the associated data from Word into the
new file.

The above is just a thought. I've never used the tcom package or created
XML documents this way. However, in the past I have used Word VBA to
navigate a Word document and create an HTML document from the data there.
Doing the same navigation through the object model from Tcl, using the
tcom package, should, I'd imagine, yield similar possibilities...
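
An untested sketch of what that could look like, assuming Windows with Word and the tcom package installed. The proc names (xmlEscape, wordToXml) and the output vocabulary (<document>, <para style="...">) are made up for illustration; Paragraphs, Range, Text, Style and NameLocal are taken from Word's object model, and how tcom hands back the Style object is an assumption.

```tcl
# Escape the characters that are special in XML text and attribute values.
proc xmlEscape {s} {
    return [string map {& &amp; < &lt; > &gt; \" &quot;} $s]
}

# Walk a Word document paragraph by paragraph and emit simple XML.
# Hypothetical helper; needs Windows, Word and tcom at call time.
proc wordToXml {docPath} {
    package require tcom

    set word [::tcom::ref createobject "Word.Application"]
    $word Visible 0
    set doc [[$word Documents] Open $docPath]

    set out "<document>\n"
    ::tcom::foreach para [$doc Paragraphs] {
        # Assumption: Style comes back as an object handle with a NameLocal property.
        set style [[$para Style] NameLocal]
        # Range.Text ends with the paragraph mark (and a cell mark inside tables).
        set text [string trimright [[$para Range] Text] "\r\x07"]
        append out "  <para style=\"[xmlEscape $style]\">[xmlEscape $text]</para>\n"
    }
    append out "</document>\n"

    # 0 is wdDoNotSaveChanges; expr forces an integer rather than a string.
    $doc Close [expr {0}]
    $word Quit
    return $out
}
```

Usage would be along the lines of `puts [wordToXml {C:\docs\report.doc}]`, where the path is again just a placeholder.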

Rob.

Anil A Kumar

Nov 23, 2009, 7:31:08 AM
Sounds good, Rob!

I will look into tcom and update you all.

Thanks and regards,
Anil A Kumar

Arndt Roger Schneider

Nov 23, 2009, 9:07:52 AM
Rob wrote:

COM: yes, that works in principle (I have used it in the opposite
direction, from C++, in the past).

Another approach:

http://msdn.microsoft.com/en-us/library/aa537167(office.11).aspx

Generate .fo from the Word document,
then pass it through an FO processor such as FOP (fop.apache.org)
to generate SVG, and import the SVG into Tk.

Albeit the word document can be rendered in Tk, it will preserve
its structure.


-roger

Steve Ball

Nov 23, 2009, 5:58:39 PM
BTW, I (mostly) wrote and currently maintain the DocBook roundtripping
system. It specifically targets DocBook and is meant for roundtripping,
not for general-purpose conversion of documents. However, it can be
adapted for general-purpose conversion.

The best thing to do is use Word 2003 or Word 2007. Both of these
versions of Word can save their documents as XML, which can then be
transformed using XSLT (either TclXML or tDOM may provide the
transformation infrastructure). With Office 2007, the Word document is
actually a Zip file, which can be mounted as a virtual filesystem using
Tcl's VFS.
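
A rough sketch of both steps, assuming tDOM and (for the Zip part) tclvfs are installed. The proc name readDocxBody, the sample XML, the stylesheet and the target elements (article, title, para) are all made up for illustration. The one assumption about Word is that the main part of a .docx lives at word/document.xml inside the Zip.

```tcl
package require tdom

# Hypothetical helper: mount a .docx as a Zip VFS and return its main part.
proc readDocxBody {path} {
    package require vfs::zip
    set mnt [vfs::zip::Mount $path $path]
    set fh [open [file join $path word document.xml] r]
    fconfigure $fh -encoding utf-8
    set xml [read $fh]
    close $fh
    vfs::zip::Unmount $mnt $path
    return $xml
}

# Tiny made-up WordprocessingML sample, standing in for [readDocxBody my.docx].
set wml {<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Introduction</w:t></w:r></w:p>
    <w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>
  </w:body>
</w:document>}

# Illustrative stylesheet: paragraphs with a Heading* style become <title>,
# everything else becomes <para>. Paragraph styles carry the structure.
set xslt {<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    exclude-result-prefixes="w">
  <xsl:output method="xml"/>
  <xsl:template match="/">
    <article><xsl:apply-templates select="//w:p"/></article>
  </xsl:template>
  <xsl:template match="w:p[starts-with(w:pPr/w:pStyle/@w:val, 'Heading')]">
    <title><xsl:apply-templates select=".//w:t"/></title>
  </xsl:template>
  <xsl:template match="w:p">
    <para><xsl:apply-templates select=".//w:t"/></para>
  </xsl:template>
</xsl:stylesheet>}

set doc    [dom parse $wml]
set sheet  [dom parse $xslt]
set result [$doc xslt $sheet]

set xml [[$result documentElement] asXML -indent none]
puts $xml
```

For a real file, replace the sample string with `[readDocxBody report.docx]` (the file name is a placeholder). Word 2003 XML uses a different namespace and layout, so the stylesheet would need adjusting for it.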

HTH,
Steve Ball

tom.rmadilo

Nov 23, 2009, 8:43:42 PM
On Nov 23, 2:58 pm, Steve Ball <Steve.B...@explain.com.au> wrote:

> The best thing to do is use Word 2003 or Word 2007. Both of these
> versions of Word save their documents as XML which can then be
> transformed using XSLT (either TclXML or tDOM may provide the
> transformation infrastructure). With Office 2007, the Word document is
> actually a Zip file which can be mounted as a virtual filesystem using
> Tcl's VFS.

Wow, that is pretty interesting. Do you have any example scripts
available to do this stuff?

Arndt Roger Schneider

Nov 24, 2009, 4:50:43 AM
Steve Ball wrote:

>BTW, I (mostly) wrote and currently maintain the DocBook roundtripping
>system. It does specifically target DocBook, but not for general-
>purpose conversion of documents; just for roundtripping. However, it
>can be adapted for general-purpose conversion.

Thanks!

I don't use roundtrip myself, since I write directly in DocBook.
Still, using roundtrip for word-processor text conversion appears
to be the best approach to get structured documents. The styles
are the only reliable structural element in a Word document.

What's needed is a sort of style mapper in Word itself,
written in VBA, that replaces the current styles with
those originating from DocBook.

The FO route is also interesting: it makes it possible
to design text chunks in a word processor and reuse them WYSIWYG
in a Tk interface, for example in a sophisticated online help system.


-roger

>The best thing to do is use Word 2003 or Word 2007. Both of these
>versions of Word save their documents as XML which can then be
>transformed using XSLT (either TclXML or tDOM may provide the
>transformation infrastructure). With Office 2007, the Word document is
>actually a Zip file which can be mounted as a virtual filesystem using
>Tcl's VFS.
>
>HTHs,
>Steve Ball

Sounds like a lot of work.
