
How to find data in MS Word Document and extract it to a txt/csv f


Arun Kumar

Oct 8, 2009, 3:14:02 AM
Hi,

I have a set of around 1000 Word documents, and I want to extract the email
addresses from all of them and collect them in a txt/csv file. I can already
do find and replace in PowerShell (using Selection.Find.Execute()), but I am
not sure how to just find text matching "@" and write it to a txt file.
Please help!

With Regards,
Arun

Peter Schneider

Oct 8, 2009, 7:21:44 AM
Hi!

Which Word version do you use?

greetings, Peter Schneider

"Arun Kumar" <Arun...@discussions.microsoft.com> wrote in message
news:ABB64A14-8461-45CD...@microsoft.com...

Uwe Ziegenhagen

Oct 8, 2009, 1:19:20 PM
Arun Kumar wrote:

I would try to find a way to convert all the documents to TXT, combine them
into one file, and search through that single file.

You can use regular expressions in PowerShell to pull out all the email
addresses, using a pattern found on the web (Google will help).

Uwe
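
A minimal sketch of the regular-expression step suggested above, assuming the document text is already available as a plain string. The sample text and the pattern are illustrative assumptions; the pattern is deliberately loose and is not a full email-address validator.

```powershell
# Hypothetical sample text standing in for the converted document contents.
$text = 'Contact alice@example.com or bob.smith@mail.example.org for details.'

# Loose illustrative pattern (an assumption, not a complete address grammar):
# local part, "@", then a domain with at least one dot.
$pattern = '[\w.+-]+@[\w-]+(?:\.[\w-]+)+'

# [regex]::Matches returns every match; Value holds the matched string.
$emails = [regex]::Matches($text, $pattern) | ForEach-Object { $_.Value }
$emails
```

For real files, the string could come from something like Get-Content on the combined TXT file, joined into one string before matching.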

Sean Kearney

Oct 9, 2009, 12:08:02 AM
Is that because the "@" is a special character? I seem to remember dealing
with that with a funky password on a SAN back in the DOS days. I think you
had to put a backtick (`) before the special character for the command line
to ignore it and treat it as text.

Does anybody know if that works in PowerShell, or what the equivalent of the
backtick is in PowerShell?
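
On the escaping question: in PowerShell the backtick is itself the escape character, and "@" needs no escaping inside a quoted string or a regular expression. A short sketch with made-up values shows both points; the variable names are only for illustration.

```powershell
# "@" is literal inside both single- and double-quoted strings.
$single = 'user@example.com'
$double = "user@example.com"

# The backtick escapes the next character in a double-quoted string,
# here keeping "$" from starting a variable expansion.
$escaped = "Price: `$5"

# "@" is not a regex metacharacter, so -match needs no escaping either.
$hasAt = $single -match '@'
```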

Robert Robelo

Oct 9, 2009, 6:25:00 PM
Try this filter...

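filter Get-EmailAddressFromWord {
    begin {
        # Start one Word instance and build the email pattern once
        $wd = New-Object -ComObject Word.Application
        [regex]$RegexEmail = '[\w-]+(?:\.[\w-]+)*@[\w-]+(?:\.[\w-]+)*'
    }
    process {
        # Open each piped file, select its whole content, and read the text
        $doc = $wd.Documents.Open($_.FullName)
        $doc.Select()
        $text = $wd.Selection.Text
        # Emit every address found in this document
        $RegexEmail.Matches($text) | ForEach-Object { $_.Value }
        $doc.Close([ref]$true)
    }
    end { $wd.Quit() }
}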

# pipe your docs to the filter and write the output to a file
Get-ChildItem ~\Documents *.doc | Get-EmailAddressFromWord > Emails.txt

--
Robert
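
Since the original question also mentioned CSV output, here is a hedged sketch of how the addresses emitted by the filter might be de-duplicated and written as a one-column CSV. The sample addresses, the Email column name, and the Emails.csv path are assumptions for illustration.

```powershell
# Hypothetical sample output standing in for what the filter emits.
$found = 'alice@example.com', 'bob@example.org', 'Alice@Example.com'

# Sort-Object -Unique compares case-insensitively by default,
# so addresses that differ only in case collapse into one entry.
$unique = $found | Sort-Object -Unique

# Wrap each string in an object so Export-Csv writes an "Email" column.
$rows = $unique | Select-Object @{ Name = 'Email'; Expression = { $_ } }
$rows | Export-Csv -Path Emails.csv -NoTypeInformation
```

In practice, $found would be replaced by the pipeline that feeds the documents to Get-EmailAddressFromWord.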

Robert Robelo

Oct 9, 2009, 6:41:23 PM
One correction.

# change this line
$doc.Close([ref]$true)

# to...
$doc.Close([ref]$false)

--
Robert