Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Parse irregular data, dump into delimited text file

9 views
Skip to first unread message

d...@rren.cymraeg.org

unread,
Nov 27, 2005, 1:10:59 PM11/27/05
to
I've been given an MS Word document containing information to input into
a database. I've knocked it into shape using various unix tools eg.
sed, cut etc. so now I have data in a plain text file like this :

name|address|postcode|telephone

The address field contains data in an irregular form, eg.
12, the high street, town, place, biggerplace
The vicarage, town place

I need to be able to format address field above ready for importing
into another database. In this new database, I have 3 fields for
address (address1, address2, address3).

So my problem is how to cut this address data and then put it back in a
text file with delimiter. address3 should contain only one word,
however, address1 and address2 may contain more than one word. When
filling in the fields, data should be added from left to right, or in
the order address1 then address2 then address3. If there is an address
field left with a blank, that is not a problem as it will be handled by
the mailmerge software.

Help on this much appreciated.

Darren

--
Aberdare Blog http://blog.aberdare.org/ Only Blog In The Valley

Ed Morton

unread,
Nov 27, 2005, 1:44:29 PM11/27/05
to

Let's start with this:

awk 'BEGIN{FS=OFS="|"}{
c = split($2,addr,",")
addr3 = addr[c--]
addr2 = addr[c--]
addr1 = ""
for (i=1;i<=c;i++) {
addr1 = addr1 addr[i]
}
$2 = addr1 "," addr2 "," addr3
}1' file

Follow up with some sample input, expected output and rationale for any
problems with the above.

Ed.

Bill Marcum

unread,
Nov 27, 2005, 1:49:36 PM11/27/05
to
On 27 Nov 2005 18:10:59 GMT, d...@rren.cymraeg.org
How do you define address1 and address2?

--
Higher education helps your earning capacity. Ask any college professor.

d...@rren.cymraeg.org

unread,
Nov 27, 2005, 2:04:31 PM11/27/05
to
Bill Marcum <bma...@iglou.com> wrote:

> How do you define address1 and address2?

Well, address1, address2, and address3 may contain anything one would
normally associate with an address.

d...@rren.cymraeg.org

unread,
Nov 27, 2005, 2:06:02 PM11/27/05
to
Ed Morton <mor...@lsupcaemnt.com> wrote:

> Let's start with this:
>
> awk 'BEGIN{FS=OFS="|"}{
> c = split($2,addr,",")
> addr3 = addr[c--]
> addr2 = addr[c--]
> addr1 = ""
> for (i=1;i<=c;i++) {
> addr1 = addr1 addr[i]
> }
> $2 = addr1 "," addr2 "," addr3
> }1' file
>
> Follow up with some sample input, expected output and rationale for any
> problems with the above.

Sample input as per my original post.

The output I get using the above is :

1, A Streetname, Blahblah, Foofoo|,,

Don't assume I know anything awk.

Ed Morton

unread,
Nov 27, 2005, 2:55:29 PM11/27/05
to
d...@rren.cymraeg.org wrote:
> Ed Morton <mor...@lsupcaemnt.com> wrote:
>
>
>>Let's start with this:
>>
>>awk 'BEGIN{FS=OFS="|"}{
>> c = split($2,addr,",")
>> addr3 = addr[c--]
>> addr2 = addr[c--]
>> addr1 = ""
>> for (i=1;i<=c;i++) {
>> addr1 = addr1 addr[i]
>> }
>> $2 = addr1 "," addr2 "," addr3
>>}1' file
>>
>>Follow up with some sample input, expected output and rationale for any
>>problems with the above.
>
>
> Sample input as per my original post.

You didn't have any sample input in your original post. All you had was
an example of the address field:

12, the high street, town, place, biggerplace
The vicarage, town place

> The output I get using the above is :


>
> 1, A Streetname, Blahblah, Foofoo|,,

That seems unlikely at best if the above really were the address fields
you used in your sample input. Assuming you actually had some other
input that produced that output - is the above output what you wanted or
not?

> Don't assume I know anything awk.

awk is simple. If you have any experience with an ALGOL-based language
(e.g. Pascal, Ada, C), you'll be able to understand and modify any awk
solution posted. Don't assume we know anything about your actual input
or expected output. So far you haven't posted any complete sample input
or output records, described the output field separators (input
separators for your new database), or provided any other details.

Ed.

d...@rren.cymraeg.org

unread,
Nov 27, 2005, 3:19:27 PM11/27/05
to
Ed Morton <mor...@lsupcaemnt.com> wrote:

>> Sample input as per my original post.
>
> You didn't have any sample input in your original post. All you had was
> an example of the address field:

Okay...

> 12, the high street, town, place, biggerplace
> The vicarage, town place

I have records like the following ...

name1|12, the high street, town, place, biggerplace|postcode1|telephone1
name2|The vicarage, town, place|postcode2|telephone2

My problem is breaking up field two above into useful chunks and
slotting them into three other fields.

>> The output I get using the above is :
>>
>> 1, A Streetname, Blahblah, Foofoo|,,
>
> That seems unlikely

Well, unlikely or not, that is the output. If you want the exact data I
have, don't bother replying as I can't print that.

Janis Papanagnou

unread,
Nov 27, 2005, 4:13:37 PM11/27/05
to
d...@rren.cymraeg.org wrote:
> Ed Morton <mor...@lsupcaemnt.com> wrote:
>
>>>Sample input as per my original post.
>>
>>You didn't have any sample input in your original post. All you had was
>>an example of the address field:
>
> Okay...
>
>>12, the high street, town, place, biggerplace
>>The vicarage, town place
>
> I have records like the following ...
>
> name1|12, the high street, town, place, biggerplace|postcode1|telephone1
> name2|The vicarage, town, place|postcode2|telephone2
>
> My problem is breaking up field two above into useful chunks and
> slotting them into three other fields.
>
>>>The output I get using the above is :
>>>
>>>1, A Streetname, Blahblah, Foofoo|,,

Hmm, that should not be the case. Are you executing the awk program from
within a Unix shell?

Ed's program produces with your data...

name1|12 the high street town, place, biggerplace|postcode1|telephone1


name2|The vicarage, town, place|postcode2|telephone2

If you want '|' instead of the ',' change the respective program line to

$2 = addr1 "|" addr2 "|" addr3

>>That seems unlikely
>
> Well, unlikely or not, that is the output. If you want the exact data I
> have, don't bother replying as I can't print that.

If you get the wrong output, try putting the awk program

BEGIN{FS=OFS="|"}{
c = split($2,addr,",")
addr3 = addr[c--]
addr2 = addr[c--]
addr1 = ""
for (i=1;i<=c;i++) {
addr1 = addr1 addr[i]
}

$2 = addr1 "|" addr2 "|" addr3
}1

in an own file "yourfile" and call it from the shell (works also on WinDOS
shells:

awk -f yourfile yourdata


Janis

William James

unread,
Nov 27, 2005, 4:52:39 PM11/27/05
to

This input

name1|12, high street, town, place, biggerplace|postcode1|telephone1


name2|The vicarage, town, place|postcode2|telephone2

name3|The vicarage, place|postcode2|telephone2

produces this output

name1|12 high street|town, place|biggerplace|postcode1|telephone1
name2|The vicarage|town|place|postcode2|telephone2
name3|The vicarage|place||postcode2|telephone2

The language is Ruby:

# Read each line of the file given on the command line.
ARGF.each { |line|
# Remove the newline at the end of the string.
line.chomp!

# Split the string into an array.
array = line.split( "|" )

# Split the address field on commas, removing surrounding
# whitespace.
address = array[1].split( /\s*,\s*/ )

# I'm assuming that if the first part of the address is
# a number, it should be combined with the next part.
if address[0] =~ /^\d+$/
address[0..1] = address[0..1].join( " " )
end

if address.size > 3
# Combine all but the first and last part into one part.
address[1..-2] = address[1..-2].join( ", " )
else
# If we have fewer than 3 parts, tack on an empty string.
address.push( "" ) while address.size < 3
end

array[1] = address

# Assuming that the ouput field-separator is "|".
puts array.flatten.join( "|" )
}

Ed Morton

unread,
Nov 27, 2005, 5:05:01 PM11/27/05
to
d...@rren.cymraeg.org wrote:
> Ed Morton <mor...@lsupcaemnt.com> wrote:
>
>
>>>Sample input as per my original post.
>>
>>You didn't have any sample input in your original post. All you had was
>>an example of the address field:
>
>
> Okay...
>
>
>>12, the high street, town, place, biggerplace
>>The vicarage, town place
>
>
> I have records like the following ...
>
> name1|12, the high street, town, place, biggerplace|postcode1|telephone1
> name2|The vicarage, town, place|postcode2|telephone2
>
> My problem is breaking up field two above into useful chunks and
> slotting them into three other fields.

I gave you a script to do that.

>
>>>The output I get using the above is :
>>>
>>>1, A Streetname, Blahblah, Foofoo|,,
>>
>>That seems unlikely
>
>
> Well, unlikely or not, that is the output.

That is not the output from the input you posted.

If you want the exact data I
> have, don't bother replying as I can't print that.

What I was looking for was a sample input and expected output FOR THAT
INPUT so I could help you. You've now finally provided some sample input
but you still haven't provided the expected output or even told us if
the script I posted earlier produces the output you wanted or, if not,
what it is you'd like to see different.

You seem to be developing an attitude so I appologise for bothering you
and I'll now take your advice and stop replying. Good luck...

Ed.

Mark Hobley

unread,
Nov 27, 2005, 5:08:03 PM11/27/05
to
d...@rren.cymraeg.org wrote:

> Well, unlikely or not, that is the output. If you want the exact data I
> have, don't bother replying as I can't print that.

You are telling us that you have abstract input and giving us examples of
abstract output.

If we can't see the input or output, it makes it very difficult to help you.

You could make up a couple of false non-existing test addresses that do not
parse properly, then post the input and the resulting output, and the output
that you want to get.

That way it would be clear to us all, and we can advise you on the best
solution.

Regards,

Mark.

--
Mark Hobley
393 Quinton Road West
QUINTON
Birmingham
B32 1QE

Telephone: (0121) 247 1596
International: 0044 121 247 1596

Email: markhobley at hotpop dot donottypethisbit com

http://markhobley.yi.org/

d...@rren.cymraeg.org

unread,
Nov 28, 2005, 6:20:32 AM11/28/05
to
Many thanks to William James for the ruby code which does the job.
0 new messages