
Structure of document


Teresa Rippeon

Feb 5, 2002, 3:59:20 PM
I would like to look at the collections that are available in the
document and determine the "structure" of the document. In the example
below, I'd like to get the first paragraph in the paragraphs collection
and then know that a table comes next, and then know that a paragraph
follows the table, etc.

Is there a way to do this? I thought that the Parent property or ID
property would somehow do this for me.

Example of document:

This is the first paragraph

Table - Row 1, Cell 1 | Row 1, Cell 2 | Row 1, Cell 3 | Row 1, Cell 4 | Row 1, Cell 5
        Row 2, Cell 1


This is the second paragraph

This is the third paragraph - now I want a list:

1. This is the first item in the list
2. This is the second item in the list
3. This is the third item in the list

This is the 4th paragraph.

Thanks in advance for any help that you can provide.

Steve Hudson

Feb 5, 2002, 4:07:21 PM
G'day Teresa,

you need to investigate the <range>.Information(LOTS_OF_OPTIONS)
property.
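
As a rough illustration of that suggestion, the sketch below walks the document paragraph by paragraph and uses Range.Information(wdWithInTable) to notice where a table starts. The Sub name and the Debug.Print output format are made up for this example, and the list check via ListFormat.ListType is an addition that is not part of the Information property. Treat it as a starting point under those assumptions, not a complete structure walker (nested tables, for instance, are not handled).

```vba
Sub ListDocumentStructure()
    ' Hypothetical sketch: report the order of paragraphs, tables and
    ' list items in the main story of the active document.
    Dim para As Paragraph
    Dim inTable As Boolean
    Dim wasInTable As Boolean

    For Each para In ActiveDocument.Paragraphs
        ' True when the paragraph lies inside a table cell
        inTable = para.Range.Information(wdWithInTable)
        If inTable Then
            ' Report each table once, at its first paragraph
            If Not wasInTable Then Debug.Print "Table"
        ElseIf para.Range.ListFormat.ListType <> wdListNoNumbering Then
            ' Numbered or bulleted paragraph; ListString is its label
            Debug.Print "List item: " & para.Range.ListFormat.ListString
        Else
            Debug.Print "Paragraph"
        End If
        wasInTable = inTable
    Next para
End Sub
```

The output goes to the Immediate window of the VBA editor, one line per structural element in document order.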

Have a ripper day.


Teresa Rippeon <teresa....@mantech-stc.com> was spinning a yarn
that went like this:

Steve Hudson, Word Heretic
HDK List MVP
Word tools: her...@tdfa.com
Please post replies/further questions to the newsgroup so that all may benefit.
If I don't provide enough information, please feel free to ask for more :-)

Dave Rado

Feb 5, 2002, 4:45:27 PM
Hi Teresa

Considering that tables *contain* paragraphs, this could get mighty
complicated! And then what about collections like Lists - a paragraph may or
may not be a member of a List; and the paragraphs that make up a single List
can be non-contiguous - so your "structure" could quickly look like
spaghetti junction.
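
To see that overlap in a real document, a small sketch along these lines prints how many of the document's paragraphs sit inside the first table and how many paragraphs each List claims. The Sub name and the output wording are invented for the example; the collections themselves (Paragraphs, Tables, Lists, ListParagraphs) are the ones discussed above.

```vba
Sub ShowOverlappingCollections()
    ' Hypothetical sketch: show that the Paragraphs, Tables and Lists
    ' collections overlap rather than partition the document.
    Dim doc As Document
    Dim lst As List
    Dim n As Long

    Set doc = ActiveDocument
    Debug.Print "Paragraphs in document: " & doc.Paragraphs.Count
    If doc.Tables.Count > 0 Then
        ' Paragraphs inside table cells are counted in both places
        Debug.Print "  of which inside table 1: " & _
            doc.Tables(1).Range.Paragraphs.Count
    End If
    For Each lst In doc.Lists
        n = n + 1
        Debug.Print "List " & n & ": " & lst.ListParagraphs.Count & _
            " paragraphs, range starts at " & lst.Range.Start
    Next lst
End Sub
```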

What are you going to use this information for?

Regards

Dave

"Teresa Rippeon" <teresa....@mantech-stc.com> wrote in message
news:3C6047A8...@mantech-stc.com...

Teresa Rippeon

Feb 5, 2002, 5:09:08 PM
I am attempting to convert the contents of Word documents to XML tagged
documents. I just need tagging of Paragraphs, Tables, Lists, Figures. This all
fits within the context of a larger DTD.

I previously had RTF documents that I converted to XML tagged documents, but it
seemed much simpler to go the route of converting the Word Document directly to
XML tagged documents, versus going through the process of converting the Word
document to RTF then to XML. It seemed like a good idea at the beginning...

Thanks.

Teresa

Dave Rado wrote:

--
Teresa Rippeon
Mantech
9189 Red Branch Road
Columbia, MD 21045
410-772-3452


Klaus Linke

Feb 6, 2002, 2:06:50 AM
Teresa Rippeon <teresa....@mantech-stc.com> wrote:
> I am attempting to convert the contents of a Word Documents
> to XML tagged documents. I just need tagging of Paragraphs, Tables,
> Lists, Figures. This all fits within the context of a larger DTD.
>
> I previously had RTF documents that I converted to XML tagged
> documents, but it seemed much simpler to go the route of converting
> the Word Document directly to XML tagged documents, versus going
> through the process of converting the Word document to RTF then to
> XML. It seemed like a good idea at the beginning...


Hi Teresa,

If you tried this before, you know already that converting Word
documents to XML doesn't make much sense if the docs aren't formatted
with styles. OTOH, if you use list styles for example, you don't have
to worry about lists, because they will be tagged automatically when
you tag the paragraphs.
You'll also know already that most of the time it doesn't make much
sense to tag *everything* (else, it would be easier to save as HTML
and take it from there).

There are quite a few commercial/shareware/freeware utilities to do
the job.
If you search the Word newsgroups for "XML" with Google, you'll find
evaluation software and free downloads.
I have checked out only a few; often they seemed veeery slow, or very
limited in the features they support.

A converter by Microsoft looks promising, but I wasn't able to
evaluate it (because I still work under Win98):
Search the MSDN library
http://msdn.microsoft.com/library/default.asp
for "Export a Word Document to XML".

If you want to do it yourself, I've posted some code below that tags simple
tables, paragraph styles, character styles, bold, and italic.

The code is a shortened version; there is much room for improvement
(tag foot-/endnotes, comments, sections, chapters, pictures..., change
tags so they are valid XML tags, build a DTD, tag "upper" Unicode
characters as &#xXXXX; ...).
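
One of those improvements, tagging "upper" Unicode characters as &#xXXXX;, could look roughly like the sketch below. The function name EscapeForXml is made up for this example, and escaping &, < and > in the same pass is an added assumption about what a text-content escaper would usually need. It treats each UTF-16 code unit on its own, so characters outside the Basic Multilingual Plane (surrogate pairs) would need extra handling.

```vba
Function EscapeForXml(ByVal s As String) As String
    ' Hypothetical sketch: escape XML-special characters and replace
    ' characters above 127 with numeric character references.
    Dim i As Long
    Dim code As Long
    Dim ch As String
    Dim result As String

    For i = 1 To Len(s)
        ch = Mid$(s, i, 1)
        ' AscW returns a signed Integer; mask it into the range 0..65535
        code = AscW(ch) And &HFFFF&
        Select Case code
            Case 38
                result = result & "&amp;"
            Case 60
                result = result & "&lt;"
            Case 62
                result = result & "&gt;"
            Case Is > 127
                result = result & "&#x" & Hex$(code) & ";"
            Case Else
                result = result & ch
        End Select
    Next i
    EscapeForXml = result
End Function
```

For example, EscapeForXml("Caf" & ChrW(233)) returns Caf&#xE9;. A function like this would be applied to text content only, not to text that already contains the inserted tags.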

Good luck with your project!
Klaus


Sub WordToXML()
    ' Tags character styles, paragraph styles,
    ' and bold/italic manual formatting that isn't
    ' applied on top of character styles;
    ' puts in very simple HTML table tags.

    Dim myStyle As Style
    Dim myStyleName As String

    Call TagTables
    Call FixVbCrAndVbTab
    ActiveWindow.View.Type = wdNormalView
    Selection.HomeKey Unit:=wdStory

    ' Tag character styles first,
    ' so they are nested in paragraph style tags:
    For Each myStyle In ActiveDocument.Styles
        If myStyle.InUse = True Then
            If myStyle.Type = wdStyleTypeCharacter Then
                If myStyle <> _
                    ActiveDocument.Styles(wdStyleDefaultParagraphFont) Then
                    myStyleName = myStyle.NameLocal & ">"
                    Selection.Find.ClearFormatting
                    Selection.Find.Style = myStyle
                    Selection.Find.Replacement.ClearFormatting
                    Selection.Find.Replacement.Style = _
                        ActiveDocument.Styles(wdStyleDefaultParagraphFont)
                    With Selection.Find
                        .Text = ""
                        .Replacement.Text = "<" & myStyleName _
                            & "^&" & "</" & myStyleName
                        .Forward = True
                        .Wrap = wdFindContinue
                        .Format = True
                        .MatchCase = True
                        .MatchWholeWord = False
                        .MatchWildcards = False
                        .MatchSoundsLike = False
                        .MatchAllWordForms = False
                    End With
                    Selection.Find.Execute _
                        Replace:=wdReplaceAll
                End If
            End If
        End If
    Next myStyle

    ' Paragraph styles:
    For Each myStyle In ActiveDocument.Styles
        If myStyle.InUse = True Then
            If myStyle.Type = wdStyleTypeParagraph Then
                If myStyle <> _
                    ActiveDocument.Styles(wdStyleNormal) Then
                    myStyleName = myStyle.NameLocal & ">"
                    Selection.Find.ClearFormatting
                    Selection.Find.Style = myStyle
                    Selection.Find.Replacement.ClearFormatting
                    Selection.Find.Replacement.Style = _
                        ActiveDocument.Styles(wdStyleNormal)
                    With Selection.Find
                        .Text = "([!^13]@)^13"
                        .Replacement.Text = "<" & myStyleName _
                            & "\1" & "</" & myStyleName & "^p"
                        .Forward = True
                        .Wrap = wdFindContinue
                        .Format = True
                        .MatchCase = True
                        .MatchWholeWord = False
                        .MatchWildcards = True
                        .MatchSoundsLike = False
                        .MatchAllWordForms = False
                    End With
                    Selection.Find.Execute _
                        Replace:=wdReplaceAll
                End If
            End If
        End If
    Next myStyle
    Call TagBoldAndItalic
End Sub

Private Sub FixVbCrAndVbTab()
    ' Set para marks and tabs to DPF (Default Paragraph Font), so that
    ' character styles and manual font formatting
    ' are neatly nested in para styles
    Selection.HomeKey Unit:=wdStory

    Selection.Find.ClearFormatting
    Selection.Find.Replacement.ClearFormatting
    Selection.Find.Replacement.Style = _
        ActiveDocument.Styles(wdStyleDefaultParagraphFont)
    With Selection.Find
        ' Wildcard search: any paragraph mark (^13) or tab (^9)
        .Text = "[^13^9]"
        .Replacement.Text = "^&"
        .Forward = True
        .Wrap = wdFindContinue
        .Format = True
        .MatchCase = True
        .MatchWholeWord = False
        .MatchWildcards = True
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
End Sub

Sub TagBoldAndItalic()

    Selection.HomeKey Unit:=wdStory
    With Selection.Find
        ' Tags should be in DPF (for proper nesting of tags)
        .Wrap = wdFindContinue
        .Format = True
        .MatchCase = True
        .MatchWholeWord = False
        .Forward = True

        ' Reset every existing <...> tag to Default Paragraph Font
        .ClearFormatting
        .Replacement.ClearFormatting
        .Replacement.Style = _
            ActiveDocument.Styles(wdStyleDefaultParagraphFont)
        .Text = "\<[!\<\>]@\>"
        .Replacement.Text = "^&"
        .MatchWildcards = True
        .Execute Replace:=wdReplaceAll

        ' Bold runs are tagged as <em> here; change the tag name
        ' if your DTD uses something else for bold
        .ClearFormatting
        .Font.Bold = True
        .Replacement.ClearFormatting
        .Replacement.Font.Bold = False
        .Text = ""
        .Replacement.Text = "<em>^&</em>"
        .MatchWildcards = False
        .Execute Replace:=wdReplaceAll

        ' Italic runs are tagged as <i>
        .ClearFormatting
        .Font.Italic = True
        .Replacement.ClearFormatting
        .Replacement.Font.Italic = False
        .Text = ""
        .Replacement.Text = "<i>^&</i>"
        .MatchWildcards = False
        .Execute Replace:=wdReplaceAll
    End With
End Sub

Sub TagTables()
    ' Very simple HTML table tags without colspan.
    ' Doesn't allow for different paragraph styles in a single cell.
    Dim myTable As Table
    Dim myCell As Cell
    Dim rngCell As Range
    Dim rngRow As Range
    Dim myString As String
    Dim myPara As Paragraph
    Dim SIwdStartOfRangeRowNumber As Long
    Dim SIwdEndOfRangeRowNumber As Long
    Dim rowspan As Long

    For Each myTable In ActiveDocument.Tables
        ' Replace ¶ with tags in cells:
        With myTable.Range.Find
            .ClearFormatting
            .Replacement.ClearFormatting
            .Forward = True
            .Wrap = wdFindStop
            .Format = True
            .MatchCase = True
            .MatchWholeWord = False
            .MatchWildcards = False
            .Text = "^p"
            .Replacement.Text = "<CR/>"
            .Execute Replace:=wdReplaceAll
        End With
        ' Tag cells:
        For Each myCell In myTable.Range.Cells
            myCell.Select
            Set rngCell = myCell.Range
            SIwdStartOfRangeRowNumber = _
                Selection.Information(wdStartOfRangeRowNumber)
            SIwdEndOfRangeRowNumber = _
                Selection.Information(wdEndOfRangeRowNumber)
            rowspan = 0
            If SIwdStartOfRangeRowNumber <> _
                SIwdEndOfRangeRowNumber Then
                rowspan = SIwdEndOfRangeRowNumber - _
                    SIwdStartOfRangeRowNumber
            End If
            rowspan = rowspan + 1
            myString = "<td"
            If rowspan > 1 Then
                ' Quote the attribute value so the tag is valid XML
                myString = myString & " rowspan=""" & rowspan & """"
            End If
            myString = myString & ">"
            ' Chr(182) (the ¶ character) is a temporary separator
            rngCell.InsertBefore myString & Chr(182)
            rngCell.InsertAfter "<CR/>" & "</td>"
        Next myCell
        With ActiveDocument.Bookmarks
            .Add Range:=myTable.Range, Name:="table"
        End With

        myTable.ConvertToText Separator:=Chr(182)
        ' Tag rows:
        Selection.GoTo What:=wdGoToBookmark, _
            Name:="table"
        For Each myPara In Selection.Paragraphs
            Set rngRow = myPara.Range.Duplicate
            rngRow.MoveEnd wdCharacter, -1
            rngRow.InsertBefore "<tr>" & Chr(182)
            rngRow.InsertAfter Chr(182) & "</tr>"
        Next myPara
        ' Tag table:
        Selection.InsertBefore "<table>" & Chr(182)
        Selection.MoveEnd wdCharacter, -1
        Selection.InsertAfter "</table>" & vbCr
    Next myTable

    Selection.WholeStory

    With Selection.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .Forward = True
        .Wrap = wdFindContinue
        .Format = True
        .MatchCase = True
        .MatchWholeWord = False
        .MatchWildcards = False
        ' Turn the placeholders back into paragraph marks
        .Text = "<CR/>"
        .Replacement.Text = "^p"
        .Execute Replace:=wdReplaceAll
        .Replacement.ClearFormatting
        .Replacement.Style = _
            ActiveDocument.Styles(wdStyleNormal)
        .Text = Chr(182)
        .Replacement.Text = "^p"
        .Execute Replace:=wdReplaceAll
        .Text = "</tr>"
        .Replacement.Text = "^&"
        .Execute Replace:=wdReplaceAll
    End With
    ' Format tags in DPF:
    With Selection.Find

        .ClearFormatting
        .Replacement.ClearFormatting
        .Replacement.Style = _
            ActiveDocument.Styles(wdStyleDefaultParagraphFont)
        .Forward = True
        .Wrap = wdFindContinue
        .Format = True
        .MatchWildcards = True

        .Text = "\<[!\<\>]@\>"
        .Replacement.Text = "^&"
        .Execute Replace:=wdReplaceAll

    End With

End Sub
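
WordToXML uses each style's NameLocal verbatim as the tag name, so a style such as "Heading 1" would produce a tag containing a space, which is not a valid XML name. One possible way to handle the "change tags so they are valid XML tags" item is sketched below. The function name StyleNameToTag and the choice of an underscore as the replacement character are assumptions made for this example, and it only allows ASCII name characters, which is stricter than XML requires.

```vba
Function StyleNameToTag(ByVal styleName As String) As String
    ' Hypothetical sketch: turn a Word style name into a string that
    ' is safe to use as an XML element name (ASCII subset only).
    Dim i As Long
    Dim ch As String
    Dim result As String

    For i = 1 To Len(styleName)
        ch = Mid$(styleName, i, 1)
        If ch Like "[A-Za-z0-9_.-]" Then
            result = result & ch
        Else
            ' Spaces, commas and anything else become underscores
            result = result & "_"
        End If
    Next i
    ' An XML name must not start with a digit, "." or "-"
    If Not (result Like "[A-Za-z_]*") Then result = "_" & result
    StyleNameToTag = result
End Function
```

For example, StyleNameToTag("Heading 1") returns Heading_1. To use it, the two lines that build myStyleName in WordToXML would call StyleNameToTag(myStyle.NameLocal) instead of using NameLocal directly.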


Teresa Rippeon

Feb 7, 2002, 9:07:40 AM
Klaus -

Thanks for all of your suggestions. I totally agree with your comments
about converting Word documents to XML. I had previously worked on a
project where we were trying to convert Word documents to XML using
BladeRunner products. We had Word templates to define styles and then
"map" various styled objects to our DTD elements.

However, in this project, the conversion is actually very simple. We're
just taking the entire document and putting everything into paragraphs,
lists, or tables (and figures). So, I was hoping that this simpler
approach would lend itself to the conversion to XML.

I think your sample code will be very helpful, and I will check out the
converter from Microsoft. Thanks very much!

Teresa

Klaus Linke wrote:

--
