Beginning PS user needs help....

Charlie Hoffpauir

Mar 4, 2016, 10:43:41 PM

I have some Genealogy text that's formatted like this:

100001 GOODEN, Teresa (Louis & Lorence GREEN) b. 8 April 1829 in
Houston, Tx; bt. 6 July 1933 (Laf. Ch.: St. Paul Ch.: v.4, p.98)
100002 JOSEPH (Charlotte - carterone libre de Mr. Phazin NORMAND) b.
7 April 1786; bt. 25 April 1786 Spons: Joseph PRADIE - de Plaquemine
& Felicite DAUTERIVE, carterone libre. Fr. GEFFROTIN (SM Ch.: v.3,
#108)
100003 SLAUGHTURN, Francis bt. 18 Aug. 1846 at age 71 years [1775]
Fr. C. TRUYEUR, S.J. (GC Ch.: v.1, p.249)
100004 ROMAN, Jacques - de la paroisse des allemands - St. Charles
(Jacques & Marie Josephe DAIGLE) m. 3 May 1777 Marie Louise PATIN
(LSAR: Opel: 1777-34)

That is, each record starts with a six-digit record number, followed
by names and text. I want to extract each record number and the
CAPITALIZED surnames in that record to a file, which I'll then be
able to use to set up a relational database that uses the surnames
to make lookups faster. (The full text is 390,000 records long.)

I can extract the record numbers using:
Get-Content B:\Gen.txt | %{-split $_} | ?{[char]::IsNumber($_[6])} > B:\Results.txt

And I can extract the surnames (at least most of them) using a
similar...

Get-Content B:\Gen.txt | %{-split $_} | ?{[char]::IsUpper($_[3])} > B:\Results.txt

But I can't figure out how to get them both accomplished on the same
pass, so that I end up with something that looks like this:

100001
GOODEN
GREEN
100002
JOSEPH
NORMAND
PRADIE
DAUTERIVE
GEFFROTIN
100003
SLAUGHTURN
TRUYEUR
etc.

Also, the eventual goal is to end up with something that can be
imported into MS Access. The text file to import would look like this:

100001 GOODEN
100001 GREEN
100002 JOSEPH
100002 NORMAND
100002 PRADIE
100002 DAUTERIVE
100002 GEFFROTIN
100003 SLAUGHTURN
100003 TRUYEUR

Not knowing very much PS yet, I can get from the first list to the
second using Excel... but if it's not too hard to do it in PS
directly, that would simply be great.

Marcel Müller

Mar 5, 2016, 3:51:32 AM

Hi,

I've put this together, which should suit your needs:

get-content genealogy.txt |
    foreach {
        # split each line into pieces (on spaces, commas and periods)
        $line = $_.split(" ").split(",").split(".")
        # loop through the elements of each line
        # (element 0 is the record number, so start at 1)
        for ($i = 1; $i -lt $line.length; $i++)
        {
            # if there's a match on an uppercase string
            # minimum length: 3
            # maximum length: 20 (adjust to your needs)
            if ($line[$i] -cmatch "^[A-Z]{3,20}$")
            {
                # output the first element (number) and the matched string
                # change this to write to a file
                write-host $line[0] $line[$i]
            }
        }
    }

--
Marcel

Charlie Hoffpauir

Mar 5, 2016, 9:47:02 AM

On Sat, 5 Mar 2016 09:51:30 +0100, Marcel Müller <sysb...@gmail.com>
wrote:

Thanks Marcel, that's getting really, really close, and I can probably
work with it as-is. Operating on the original text, the code was
missing any capitalized surname that had a parenthesis attached, but I
can easily edit the text and remove all parentheses before running the
code. But the code is outputting two surnames for each record number:
the first surname that appears, followed by the next one (see the
listing for record number 100002 below).

Example:

100001 GOODEN GREEN
100002 JOSEPH NORMAND
100002 JOSEPH PRADIE
100002 JOSEPH DAUTERIVE
100002 JOSEPH GEFFROTIN
100003 SLAUGHTURN TRUYEUR
100004 ROMAN DAIGLE
100004 ROMAN PATIN
100005 CALLAHAN WEST
100005 CALLAHAN CALLAGHAN
100005 CALLAHAN BRAPHY
100005 CALLAHAN MONSON
100005 CALLAHAN BUHOT

Is there any easy modification to fix that?

Marcel Müller

Mar 5, 2016, 2:26:09 PM

On 05.03.2016 at 15:46, Charlie Hoffpauir wrote:

> Thanks Marcel, that's getting really, really close, and I can probably
> work with it as-is. Operating on the original text, the code was
> missing any capitalized surname that had a Parenthesis attached, but I
> can easily edit the text and remove all parentheses before running the
> code. But the code is picking up two surnames for each line number...
> the first surname that appears, and the next one... (see the listing
> for line number 100002 below)
>
> Example:
>
> 100001 GOODEN GREEN
> 100002 JOSEPH NORMAND

[...]

>
> any easy modification to fix that?
>

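You don't necessarily have to edit your original file; you can just as
well add more .split()'s after the ones I put into the foreach-loop,
e.g. .split("(").split(")"). Just be careful to escape a double quote
correctly: PowerShell's escape character is the backtick, so write "`""
(or simply '"').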

For the duplicates, I suspect the fields are not separated by spaces,
but by a tab or something else. You might want to check with an editor
that can display those special characters; you can then either replace
them or add an appropriate .split(). Using such an editor you can also
make sure that all lines start with your record numbers and that there
are no carriage returns / line feeds in between.
(I use Notepad++, which can easily display all those special characters,
but feel free to use anything else.)
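
The same check can also be sketched in PowerShell itself. This is only
an illustration: the two sample lines are made up from the records in
this thread (the first one has a tab after the record number), and the
variable names are placeholders.

```powershell
# Hypothetical sample lines; the first has a tab after the record number.
$sample = @(
    "100001`tGOODEN, Teresa (Louis & Lorence GREEN)",
    "100003 SLAUGHTURN, Francis bt. 18 Aug. 1846"
)

# Lines that contain a tab character anywhere
$withTabs = @($sample | Where-Object { $_ -match '\t' })

# Lines that do not start with a six-digit number followed by a space
# (catches wrapped continuation lines as well as tab-separated numbers)
$badStart = @($sample | Where-Object { $_ -notmatch '^\d{6} ' })

$withTabs
$badStart
```

To run it against the real file, replace $sample with
(Get-Content genealogy.txt) in both pipelines.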

To check the line-splitting you can add a "write-host $line[$i]" into
the for-loop. I did that when I wrote up the script to see whether the
split worked well.

HTH

--
Marcel

Charlie Hoffpauir

Mar 5, 2016, 3:39:48 PM

On Sat, 5 Mar 2016 20:26:07 +0100, Marcel Müller <sysb...@gmail.com>
wrote:

Thanks, with those corrections it's working great! The problem as you
suggested is that I put a tab after the record number.... changing it
to a space fixed everything.

Marcel Müller

Mar 6, 2016, 3:00:29 AM

On 05.03.2016 at 21:39, Charlie Hoffpauir wrote:
[...]

> Thanks, with those corrections it's working great! The problem as you
> suggested is that I put a tab after the record number.... changing it
> to a space fixed everything.

Glad to have helped.

Beware that the regular expression I used only matches if the string
consists solely of uppercase letters; depending on your input source,
you might need more .split() calls in case you have leading or trailing
brackets, parentheses or other characters, e.g. "JONES)" or "SMITH'".
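
An alternative that avoids the extra .split() calls is to let the
regular expression find the uppercase runs inside the whole line,
instead of splitting first and then testing each piece. The following
is a minimal sketch, not a tested solution for the full file: it
assumes one record per line, reuses the 3 to 20 letter limit from the
script above, and uses two sample records from this thread (the first
with a tab after the number). $records, $id and $pairs are placeholder
names.

```powershell
# Sample records from the thread; one record per line is assumed.
$records = @(
    "100001`tGOODEN, Teresa (Louis & Lorence GREEN) b. 8 April 1829 in Houston, Tx",
    "100003 SLAUGHTURN, Francis bt. 18 Aug. 1846 at age 71 years [1775] Fr. C. TRUYEUR, S.J. (GC Ch.: v.1, p.249)"
)

$pairs = foreach ($record in $records) {
    # Only handle lines that begin with a six-digit number plus whitespace
    if ($record -match '^(\d{6})\s') {
        $id = $Matches[1]
        # [regex]::Matches is case-sensitive by default; \b marks word
        # boundaries, so "GREEN)" and a tab before "GOODEN" are no problem
        foreach ($m in [regex]::Matches($record, '\b[A-Z]{3,20}\b')) {
            "$id $($m.Value)"
        }
    }
}

$pairs
```

For the real data, $records could be replaced by
(Get-Content genealogy.txt) and the result written out with
$pairs | Set-Content B:\Results.txt. Note that this pattern also picks
up uppercase abbreviations of three or more letters, such as the LSAR
in record 100004, so the output may still need some filtering.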

--
Marcel