reading text out of word docs

Keith G Hicks

Dec 23, 2009, 10:31:00 AM

I started working on a program to read text out of some well-organized Word
docs. I've done this sort of thing in VBA but not quite this extensively, and
I'm not great with Word automation. I know enough to be dangerous. LOL. I
need to open the doc (got that part done), locate certain phrases that are
in all of them, and then read some text after those phrases into variables so
I can post them to a SQL db. The part I'm struggling with is how to read the
doc. I'm not changing the docs in any way. They are deposited into a folder
on the network and I open and read them as they arrive. Setting up the
watcher for this in general is not a problem. I just need help reading the
docs in VB.NET.

Here's some of what I have so far:

oWord = CreateObject("Word.Application")
oWord.Visible = True
oDoc = oWord.Documents.Open("C:\SomeWordDoc.doc", , True)

Dim rng As Word.Range

With oWord.Selection
.HomeKey(wdStory)
rng = .Range
End With

>>> Here's a point where I'm stuck. I can find the phrase "Issue date:"
>>> but then I need to read the text AFTER that (but not including the
>>> phrase itself)
>>> For example, the line in the doc might read "Issue date: March 25, 2009"
>>> I need to extract the "March 25, 2009" part.

rng.Find.Text = "Issue date:"
If rng.Find.Execute() Then
'MsgBox("found")
rng = oWord.Selection.Range
rng.End = rng.Next(wdLine, 1).End ' rng.MoveEnd(wdLine)
MsgBox(rng)
Else
MsgBox("Not found")
End If

>>> Then the next line below that doesn't have anything to cue me into that
>>> line. I just need the entire line below the date noted above. How do
>>> I move to the next line and read the entire line?

'move to linebelow "Issue Date:" to get county
>>> The line below "Issue Date:" would be like this: "Orange County"

Help with the above will really get me started well on this. I'd really
appreciate it.

Thanks,

Keith

Keith G Hicks

Dec 25, 2009, 8:59:33 AM

Is this not possible?

"Keith G Hicks" <k...@comcast.net> wrote in message
news:Owi61T%23gKH...@TK2MSFTNGP04.phx.gbl...

mayayana

Dec 25, 2009, 9:55:06 AM

> Is this not possible?
>

You probably should ask in an MS Word group:
microsoft.public.word.*

You might be using VB.Net but the code you're
working on is MS Word object model. It will only
make sense to people who use MS Word and who
have experience with MS Word/Office automation.

Family Tree Mike

Dec 25, 2009, 10:54:56 AM

I'm sure it is possible within word, but I would grab all the text, and
use regular expressions to search for the pattern you want. You only
seem to be using word, as that is the form of the original doc.

--
Mike
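A minimal sketch of the grab-the-text-and-use-a-regex idea, assuming the whole document is already in a string. The helper name ExtractIssueFields, the sample text, and the pattern are illustrative assumptions, not code from this thread. Word returns paragraph marks as carriage returns, so the pattern treats CR, LF, or CRLF as a line end.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module IssueDateSketch
    ' Hypothetical helper: returns {text after "Issue date:", next line},
    ' or Nothing when the phrase is not found.
    Function ExtractIssueFields(ByVal docText As String) As String()
        ' Group 1: rest of the "Issue date:" line.
        ' Group 2: the line directly below it (empty if there is none).
        Dim pattern As String = "Issue date:[ \t]*([^\r\n]*)(?:(?:\r\n|\r|\n)([^\r\n]*))?"
        Dim m As Match = Regex.Match(docText, pattern, RegexOptions.IgnoreCase)
        If Not m.Success Then Return Nothing
        Return New String() {m.Groups(1).Value.Trim(), m.Groups(2).Value.Trim()}
    End Function

    Sub Main()
        ' Made-up sample text; a real string would come from the Word document.
        Dim sample As String = "Notice" & vbCr &
                               "Issue date: March 25, 2009" & vbCr &
                               "Orange County" & vbCr
        Dim fields As String() = ExtractIssueFields(sample)
        If fields IsNot Nothing Then
            Console.WriteLine(fields(0) & " | " & fields(1))
        End If
    End Sub
End Module
```

This only works when a fixed phrase marks the value. Lines that have to be found by position need a different approach.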

Keith G Hicks

Dec 25, 2009, 12:23:47 PM

I did actually start trying that out yesterday. I'm taking the entire Word
doc into a string variable. I hadn't started the RegEx part but I think
you're right. That's probably the best way to go. I was hoping though that
someone out there had a better, less brute-force way to do this.

"Family Tree Mike" <FamilyT...@ThisOldHouse.com> wrote in message
news:O3UMaqXh...@TK2MSFTNGP04.phx.gbl...

Keith G Hicks

Dec 25, 2009, 12:22:42 PM

The problem is that there is no word vb.net group. Only vba. And as we all
know, they are very different. I did post a note there asking people to look
in this post if they have any ideas and so far nothing there either.

"mayayana" <mayaX...@rcXXn.com> wrote in message
news:%23xEFHIX...@TK2MSFTNGP04.phx.gbl...

mayayana

Dec 25, 2009, 3:04:38 PM

> The problem is that there is no word vb.net group.
> Only vba. And as we all
> know, they are very different.

Yes, that's what I meant. MS Office automation
is COM. You've got a COM object model, which is
adaptable to any COM-centric language. VB.Net is not
COM, so there's no direct translation. If it were me
I'd ask only in the Word group, get the VB/VBA code,
then figure out how to translate that to .Net. Even if
you were using a COM-centric language like VB or
VBScript, the Word group would still be the place
to ask, because your question is not about a language.
It's about the object model of the Word.Application
automation object.

Also, this may not help, but if you're dealing
only with .doc files (not .docx) and you're considering
just dealing with the text string as Family Tree Mike
suggested -- the .doc spec. has been published.
I think this is it:

http://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/Word97-2007BinaryFileFormat(doc)Specification.pdf

I downloaded it when it was first released and wrote
a VBScript to extract text from .doc files. It seems
to work quite dependably. The details of plain text
storage in .doc files (as opposed to formatting, images,
etc.) are not very complex.

Keith G Hicks

Dec 26, 2009, 12:39:21 PM

Moving this post to word.vba.general.

Keith

"Keith G Hicks" <k...@comcast.net> wrote in message
news:Owi61T%23gKH...@TK2MSFTNGP04.phx.gbl...

William LaMartin

Dec 26, 2009, 10:41:22 PM

I just gave this a try. It appears not difficult for a doc file--much
different for a docx file (which I didn't attempt).

From looking at the byte data of several files, I observed that

1. The body text starts at byte number 2562

2. The body text ends when you encounter the first 0 decimal value byte.

3. So simply read in the data between those two points.

I tried this on about six files. It worked for them. I can't guarantee
that it will work for all since I couldn't decipher in the Word file
documentation, for which someone posted the link, exactly where the text
began and its length. I simply looked at a few files.

"Keith G Hicks" <k...@comcast.net> wrote in message
news:Owi61T%23gKH...@TK2MSFTNGP04.phx.gbl...

mayayana

Dec 26, 2009, 11:08:44 PM

> I just gave this a try. It appears not difficult for a doc file--much
> different for a docx file (which I didn't attempt).
>
> From looking at the byte data of several files, I observed that
>
> 1. The body text starts at byte number 2562
>
> 2. The body text ends when you encounter the first 0 decimal value byte.
>
> 3. So simply read in the data between those two points.
>

It's somewhat more involved than that, but not too
bad. See here for a VBScript version:

http://www.jsware.net/jsware/scripts.php5#desk

You can pretty much see the text if you just open
a Word .doc in Notepad, but it needs to be cleaned up.

Keith G Hicks

Dec 27, 2009, 8:59:01 AM

Are you guys just trying to show me how to read the doc as text? If that's
what you're trying to show me, that part's easy:

Dim oWord As Word.Application
Dim oDoc As Word.Document

oWord = CreateObject("Word.Application")
oWord.Visible = True

oDoc = oWord.Documents.Open("c:\SomeWordFile.doc", , True)
oWord.Selection.WholeStory()
Dim wholeText As String = oWord.Selection.Text

I was going to do that and use RegEx to find everything I need but I got
answers to how to read the file as a word doc (not as just text) in the
word.vba.general newsgroup. Reading this as text and using RegEx is a
problem due to the fact that I can't use RegEx to find everything. I need to
find specific line #'s as well. I need all the info on line 4 and the info
on that line will vary to the point that RegEx would be impractical. Greg
Maxey in the other newsgroup gave me some sample code. I put it into .net
and it got me going in the right direction.

Thanks.
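Greg Maxey's sample is not reproduced in this thread. The following is only a rough sketch of one way to do this through the Word object model from VB.NET. It assumes a project reference to Microsoft.Office.Interop.Word, and it assumes each "line" in these documents is its own paragraph (ends with a paragraph mark). The file path and variable names are placeholders.

```vbnet
Imports System
Imports Word = Microsoft.Office.Interop.Word

Module ReadWordFields
    Sub Main()
        Dim app As New Word.Application()
        Dim doc As Word.Document = Nothing
        Try
            ' Placeholder path; the third argument opens the file read-only.
            doc = app.Documents.Open("C:\SomeWordDoc.doc", , True)

            ' Search the document body with a Range instead of the Selection.
            Dim rng As Word.Range = doc.Content
            rng.Find.ClearFormatting()
            If rng.Find.Execute("Issue date:") Then
                ' After a successful Execute, rng covers only the found phrase.
                Dim para As Word.Paragraph = rng.Paragraphs(1)

                ' Text from the end of the phrase to the end of its paragraph.
                Dim issueDate As String =
                    doc.Range(rng.End, para.Range.End).Text.Trim()

                ' The paragraph directly below (assumed to hold the county).
                Dim county As String = ""
                Dim nextPara As Word.Paragraph = para.Next()
                If nextPara IsNot Nothing Then
                    county = nextPara.Range.Text.Trim()
                End If

                Console.WriteLine(issueDate & " | " & county)
            Else
                Console.WriteLine("Not found")
            End If

            ' A line located by position: the Paragraphs collection is 1-based.
            If doc.Paragraphs.Count >= 4 Then
                Console.WriteLine(doc.Paragraphs(4).Range.Text.Trim())
            End If
        Finally
            ' Close without saving and shut down the Word instance.
            If doc IsNot Nothing Then doc.Close(False)
            app.Quit()
        End Try
    End Sub
End Module
```

Working with a Range rather than the Selection avoids depending on where the cursor happens to be. Trim() strips the trailing paragraph mark that Word includes in each paragraph's text.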

"mayayana" <mayaX...@rcXXn.com> wrote in message

news:e%23AHQoqh...@TK2MSFTNGP05.phx.gbl...

William LaMartin

Dec 27, 2009, 10:01:39 AM

In my three steps, I omitted a crucial 4th step. So my method for doc files
should read

1. The body text starts at byte number 2562

2. The body text ends when you encounter the first 0 decimal value byte.

3. So simply read in the data between those two points.

4. In that data only retain those bytes that are less than 123 and greater
than 31 along with line feeds and carriage returns.

That will give you the text and show where the line breaks are. No RegEX
needed as far as I can see to identify the lines.
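The four steps above can be sketched as follows. This is only an illustration of the heuristic as described: the start offset of 2562 was observed on a handful of files, it is unclear from the post whether it is 0-based or 1-based, and the approach will not recover text that a .doc file stores as 16-bit characters. The function name ExtractText and the synthetic byte array are assumptions made for the example.

```vbnet
Imports System
Imports System.Text
Imports Microsoft.VisualBasic

Module DocByteSketch
    ' Hypothetical helper applying steps 1-4 to a byte array.
    ' startOffset is a parameter because 2562 is only an observed value.
    Function ExtractText(ByVal data As Byte(), ByVal startOffset As Integer) As String
        Dim sb As New StringBuilder()
        For i As Integer = startOffset To data.Length - 1
            Dim b As Byte = data(i)
            ' Step 2: the body text ends at the first zero byte.
            If b = 0 Then Exit For
            ' Step 4: keep bytes 32..122 plus line feed (10) and carriage return (13).
            If (b > 31 AndAlso b < 123) OrElse b = 10 OrElse b = 13 Then
                sb.Append(Chr(b))
            End If
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        ' Synthetic data standing in for a file: 4 "header" bytes, then
        ' "Hi", a control byte (7), CR, "OK", a zero byte, and trailing data.
        Dim data As Byte() = {1, 2, 3, 4, 72, 105, 7, 13, 79, 75, 0, 88}
        Dim text As String = ExtractText(data, 4)

        ' Split on CR/LF to get the individual lines.
        Dim lines As String() = text.Split(
            New Char() {ControlChars.Cr, ControlChars.Lf},
            StringSplitOptions.RemoveEmptyEntries)
        Console.WriteLine(String.Join("|", lines))
    End Sub
End Module
```

For a real file the bytes would come from System.IO.File.ReadAllBytes, and a given line is then just an index into the split array (lines(3) for the fourth non-empty line).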

In your alternate VBA approach, you are using late binding. You might want
to modify this to use early binding as below, where you have set a reference
in your project to the .net Microsoft.Office.Interop.Word, ver. 12. Using
that, you can also read docx files.

The code below displays any word file in a rich text box.

Me.OpenFileDialog1.Title = "Select Word Document"
Me.OpenFileDialog1.FileName = ""
Me.OpenFileDialog1.Filter = "Word Doc (*.doc)|*.doc|Word docx (*.docx)|*.docx"
If Me.OpenFileDialog1.ShowDialog() = Windows.Forms.DialogResult.OK Then
    Path = Me.OpenFileDialog1.FileName

    Dim oWord As New Microsoft.Office.Interop.Word.Application
    ' Documents.Open returns the document, so no New is needed here.
    ' The third argument (True) opens the file read-only.
    Dim oDoc As Microsoft.Office.Interop.Word.Document = oWord.Documents.Open(Path, , True)
    oWord.Selection.WholeStory()
    Me.rtbText.Text = oWord.Selection.Text

    ' Close without saving and shut down the hidden Word instance.
    oDoc.Close(False)
    oWord.Quit()
End If

"Keith G Hicks" <k...@comcast.net> wrote in message

news:uBySBzvh...@TK2MSFTNGP02.phx.gbl...