
awk or sed: basic? question


Edward Westin

Sep 28, 1999, 3:00:00 AM
I am trying to get started with awk and sed. As an exercise with an
unstructured file (i.e., fields not clearly defined, target sometimes in the
third field, sometimes in the second, etc.) I was trying to extract all email
addresses from a file (this is definitely not with the intention of
spamming). It is fairly easy to get entire lines with an @ mark, but how
do I get either sed or awk to output only the text that matched the pattern
and not the entire line? At first I thought this would be a simple
task, but I have been going at this for hours now and am beginning to wonder
if you have to know the exact field ($n) in order to extract this type of
info with sed or awk. I am not very strong in programming, so this has
become quite a task. Any comments would be most appreciated.

Best Regards,
Ed Westin

PEZ

Sep 29, 1999, 3:00:00 AM
In article <slrn7v1eke...@DosLinux.localnet>,

man awk
Look for substr(), match(), RSTART and RLENGTH.

If you don't run on Unix maybe you can start here:
http://www.cl.cam.ac.uk/texinfodoc/gawk_11.html#SEC110
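A minimal sketch of how match(), RSTART, RLENGTH and substr() can fit together for this job. It assumes, purely for illustration, that an address is a run of letters, digits, hyphen, underscore and period on either side of an @. The function name extract_addresses and the sample input are made up, not from the awk documentation.

```shell
# Print every substring that looks like an address, one per line.
# Reads the named files, or standard input when none are given.
extract_addresses() {
    awk '{
        line = $0
        # match() sets RSTART (start position) and RLENGTH (length)
        # of the leftmost match, or returns 0 when there is none.
        while (match(line, /[-0-9A-Za-z_.]+@[-0-9A-Za-z_.]+/)) {
            addr = substr(line, RSTART, RLENGTH)
            sub(/[.]+$/, "", addr)      # drop sentence-ending periods
            print addr
            # continue searching in the rest of the line
            line = substr(line, RSTART + RLENGTH)
        }
    }' "$@"
}

printf 'My address is <someone@example.com>.\nno address here\n' |
    extract_addresses
```

Because the loop works on the whole line, it does not matter which field the address falls in.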

Regards,
/Peter
--
-= Spam safe(?) e-mail address: pez68 at netscape.net =-



Charles Demas

Sep 29, 1999, 3:00:00 AM
In article <slrn7v1eke...@DosLinux.localnet>,

Edward Westin <ewe...@yahoo.com> wrote:
>I am trying to get started with awk and sed. As an exercise with an
>unstructured file (ie. fields not clearly defined, target sometimes in the
>third field sometimes in the second, etc) I was trying to extract all email
>addresses from a file (this is definitely not with the intention of
>spamming). It is fairly easy to get entire lines with an @ mark, but how
>do I get either sed or awk to output the result of the matching pattern
>only and not the entire line. At first I thought this would be a simple
>task, but I have been going at this for hours now and am beginning to wonder
>if you have to know the exact field ($n) in order to extract this type of
>info with sed or awk. I am not very strong in programming so this has
>become quite a task. Any comments would be most appreciated.


This is a much more complicated task than it may at first seem.

The first thing you must determine is what form an email address
may take. This is defined in one of the RFCs, but IIRC, an
address _MAY_ include spaces inside double quotes.

After you've found the forms that a valid address may take, then
you can eliminate all characters that may not be in a valid address.

If one were to assume that a proper email address could contain
only alphanumeric characters, underscore, hyphen, at sign, and period,
but not space, and must contain an at sign, then it is more
straightforward, something like:

grep '@' infile |
sed 's/[^-a-z0-9A-Z.@]/ /g' |
tr ' ' '\012' |
sed -e 's/\.*$//' |
sed -n 's/@/@/p'

the functions of the lines are:

1 get only lines containing the at sign
2 change all unusable characters to spaces
3 change all spaces to newlines
4 remove periods at the end of lines
5 print only lines containing at signs

As I said, this is not a complete solution, but it shows a general
starting approach.

It could be simplified, BTW, I just didn't bother. :-)
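One possible simplification, offered as a sketch rather than as the pipeline the poster had in mind: tr can turn every run of characters outside the allowed set (spaces, tabs, angle brackets, newlines) into a single newline, which covers steps 2 and 3 in one pass and makes the initial grep unnecessary. The function name extract_tokens and the sample input are hypothetical, and the same simplified notion of an address is assumed (underscore included).

```shell
# Read text on standard input, print address-like tokens one per line.
extract_tokens() {
    # -c: complement the set, -s: squeeze repeated newlines into one
    tr -cs '0-9A-Za-z_.@-' '\n' |
    sed 's/\.*$//' |            # remove periods at the end of a token
    grep '@'                    # keep only tokens containing an at sign
}

printf 'Mail <someone@example.com>.\tOr a_b@host.example.org\n' |
    extract_tokens
```

The trailing hyphen in the tr set is a literal hyphen, not a range.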


Chuck Demas
Needham, Mass.

--
Eat Healthy | _ _ | Nothing would be done at all,
Stay Fit | @ @ | If a man waited to do it so well,
Die Anyway | v | That no one could find fault with it.
de...@tiac.net | \___/ | http://www.tiac.net/users/demas

Charles Demas

Sep 29, 1999, 3:00:00 AM
In article <7ss2e1$4...@news-central.tiac.net>,

I left the underscore out of line 2 above. It should be:


grep '@' infile |
sed 's/[^-a-z0-9A-Z.@_]/ /g' |
tr ' ' '\012' |
sed -e 's/\.*$//' |
sed -n 's/@/@/p'

Harlan Grove

Sep 29, 1999, 3:00:00 AM
In article <7ss2e1$4...@news-central.tiac.net>, de...@sunspot.tiac.net (Charles
Demas) writes:
...

>If one were to assume that a proper email address could contain
>only alphanumeric characters, underscore, hyphen, at sign, and period,
>but not space, and must contain an at sign, then it is more
>straightforward, something like:
>
>grep '@' infile |
>sed 's/[^-a-z0-9A-Z.@]/ /g' |
>tr ' ' '\012' |
>sed -e 's/\.*$//' |
>sed -n 's/@/@/p'
>
>the functions of the lines are:
>
>1 get only lines containing the at sign
>2 change all unusable characters to spaces
>3 change all spaces to newlines
>4 remove periods at the end of lines
>5 print only lines containing at signs
>
>As I said, this is not a complete solution, but it shows a general
>starting approach.
>
>It could be simplified, BTW, I just didn't bother. :-)

What about tabs?

Why sed -n 's/@/@/p' ? Wouldn't sed -n '/@/p' work? Or grep '@' ?


I think (i.e., untested) this could be simplified to

awk '/@/ { for (f = 1; f <= NF; ++f)
  if ($f ~ /^[-0-9A-Za-z_.]+@[-0-9A-Za-z_.]+$/) { sub(/[.]$/, "", $f); print $f }
}' infile

which has the advantage of being just a single process. And if the original
poster needs to handle double-quoted names with embedded spaces, sed won't
suffice.

Charles Demas

Sep 29, 1999, 3:00:00 AM
In article <19990929020254...@ngol02.aol.com>,

But, you didn't eliminate all the characters that aren't allowed in the
email address in your script.

It won't print anything for this line:

My addresss is <de...@tiac.net>.

Because you looked for a field starting with only certain characters.
Adding a gsub will solve that. :-)

gawk '/@/ {gsub(/[^-0-9A-Za-z_.@]/," ");
  for (f = 1; f <= NF; ++f){
    if ($f ~ /^[-0-9A-Za-z_.]+@[-0-9A-Za-z_.]+$/) {
      sub(/[.]$/, "", $f); print $f }}}' infile


though this might be simpler:

gawk '/@/ {gsub(/[^-0-9A-Za-z_.@]/," ");
  for (f = 1; f <= NF; ++f){
    if($f ~ "@"){
      sub(/[.]$/, "", $f);
      print $f }}}' infile
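To see the difference the gsub makes, here is a small side-by-side sketch. The function names without_gsub and with_gsub and the sample line are invented for illustration; plain awk is used, on the assumption that nothing here needs gawk.

```shell
line='My address is <someone@example.com>.'

# Anchored test only: the field "<someone@example.com>." starts with "<"
# and ends with ">.", so it fails the ^...$ match and nothing is printed.
without_gsub() {
    printf '%s\n' "$1" | awk '{
        for (f = 1; f <= NF; ++f)
            if ($f ~ /^[-0-9A-Za-z_.]+@[-0-9A-Za-z_.]+$/) print $f
    }'
}

# gsub on $0 first: disallowed characters become spaces, awk re-splits
# the record, and the address ends up in a field of its own.
with_gsub() {
    printf '%s\n' "$1" | awk '{
        gsub(/[^-0-9A-Za-z_.@]/, " ")
        for (f = 1; f <= NF; ++f)
            if ($f ~ /^[-0-9A-Za-z_.]+@[-0-9A-Za-z_.]+$/) print $f
    }'
}

without_gsub "$line"
with_gsub "$line"
```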

Chuck Demas

Edward Westin

Sep 29, 1999, 3:00:00 AM
Charles, Harlan, and Pete:

Thank you very much for your help. You have given me excellent ideas for
dealing with this and similar tasks in sed and awk. Instinct, and the
extremely powerful feel of these tools, sort of tells me that it is time
well spent to get at least a basic mastery of them before continuing with
Perl, which I had been using for about two months.

Much Obliged,
Ed Westin

Harlan Grove

Sep 30, 1999, 3:00:00 AM
In article <7ssc1b$s...@news-central.tiac.net>, de...@sunspot.tiac.net (Charles
Demas) writes:

<snip>

>But, you didn't eliminate all the characters that aren't allowed in the
>email address in your script.
>
>It won't print anything for this line:
>
> My addresss is <de...@tiac.net>.
>
>Because you looked for a field starting with only certain characters.
>Adding a gsub will solve that. :-)
>
>gawk '/@/ {gsub(/[^-0-9A-Za-z_.@]/," ");
> for (f = 1; f <= NF; ++f){
> if ($f ~ /^[-0-9A-Za-z_.]+@[-0-9A-Za-z_.]+$/) {
> sub(/[.]$/, "", $f); print $f }}}' infile
>
>
>though this might be simpler:
>
>gawk '/@/ {gsub(/[^-0-9A-Za-z_.@]/," ");
> for (f = 1; f <= NF; ++f){
> if($f ~ "@"){
> sub(/[.]$/, "", $f);
> print $f }}}' infile
>

Actually it would be easier still to make all characters that can't occur in an
e-mail address into field separators, i.e., FS = "[^-0-9A-Za-z_.@]+". Getting much
cleverer: if I can assume gawk, and if [.@] can't occur in either first or last
position (or can they?), then make the RECORD separator RS =
"[.@]*[^-0-9A-Za-z_.@]+[.@]*". Then

gawk 'BEGIN { RS = "[.@]*[^-0-9A-Za-z_.@]+[.@]*" } /@/' infile
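The field-separator variant mentioned first is not spelled out above; a minimal sketch of it might look like the following. The function name extract_by_fs and the sample input are hypothetical. A regular expression as FS is standard awk, so this version should not depend on gawk.

```shell
# Every run of characters that cannot be part of an address is a field
# separator, so each field is already a candidate token.
extract_by_fs() {
    awk -F'[^-0-9A-Za-z_.@]+' '{
        for (f = 1; f <= NF; ++f)
            if ($f ~ /@/) {
                addr = $f
                sub(/[.]+$/, "", addr)   # strip trailing periods
                print addr
            }
    }' "$@"
}

printf 'My address is <someone@example.com>.\n' | extract_by_fs
```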

Now for an academic question: can perl's $\ be a regexp? The camel book says it
can be multicharacter, but it doesn't say it can be a regexp.

Colin R. DeVilbiss

Sep 30, 1999, 3:00:00 AM
Harlan Grove <hrl...@aol.comzzzzzz> wrote:
> In article <7ssc1b$s...@news-central.tiac.net>, de...@sunspot.tiac.net (Charles
> Demas) writes:

> <snip>

> Now for an academic question: can perl's $\ be a regexp? The camel book says it
> can be multicharacter, but it doesn't say it can be a regexp.

...I think you have your slash backwards...

perl's $/ (which is aliased to $RS if use()ing English) can be multi-
character, but is not a regexp.

perl's $\ ($ORS) is multicharacter, but, being just output, is also not
a regex.

from perldoc perlvar:

Remember: the value of $/ is a string, not a regexp. AWK has to be
better for something :-)
--Larry Wall

Colin DeVilbiss
crde...@mtu.edu

Charles Demas

Oct 1, 1999, 3:00:00 AM
In article <19990930030844...@ngol06.aol.com>,

Harlan Grove <hrl...@aol.comzzzzzz> wrote:
>In article <7ssc1b$s...@news-central.tiac.net>, de...@sunspot.tiac.net (Charles
>Demas) writes:
>
><snip>
>
>>But, you didn't eliminate all the characters that aren't allowed in the
>>email address in your script.
>>
>>It won't print anything for this line:
>>
>> My addresss is <de...@tiac.net>.
>>
>>Because you looked for a field starting with only certain characters.
>>Adding a gsub will solve that. :-)
>>
>>gawk '/@/ {gsub(/[^-0-9A-Za-z_.@]/," ");
>> for (f = 1; f <= NF; ++f){
>> if ($f ~ /^[-0-9A-Za-z_.]+@[-0-9A-Za-z_.]+$/) {
>> sub(/[.]$/, "", $f); print $f }}}' infile
>>
>>
>>though this might be simpler:
>>
>>gawk '/@/ {gsub(/[^-0-9A-Za-z_.@]/," ");
>> for (f = 1; f <= NF; ++f){
>> if($f ~ "@"){
>> sub(/[.]$/, "", $f);
>> print $f }}}' infile
>>
>
>Actually it would be easier still to make all characters that can't occur in an
>e-mail into field separators, ie, FS = "[^-0-9A-Za-z_.@]+" . Getting much
>cleverer if I can assume gawk, if [.@] can't occur in either first or last
>position (or can they?), then make the RECORD separator RS =
>"[.@]*[^-0-9A-Za-z_.@]+[.@]*". Then
>
>gawk 'BEGIN { RS = "[.@]*[^-0-9A-Za-z_.@]+[.@]*" } /@/' infile

For the record, I tried the above one-liner on a sample file, and it
didn't work. I didn't bother to try to figure out why.

Harlan Grove

Oct 1, 1999, 3:00:00 AM
In article <7t119a$4...@news-central.tiac.net>, de...@sunspot.tiac.net (Charles
Demas) writes:

>>gawk 'BEGIN { RS = "[.@]*[^-0-9A-Za-z_.@]+[.@]*" } /@/' infile
>
>For the record, I tried the above one-liner on a sample file, and it
>didn't work. I didn't bother to try to figure out why.

And further for the record, I tried it on the e-mail cc Chuck Demas sent me,
and it pulled all the 'e-mail addresses', which, given the definition above,
included parts of the message IDs.

It almost seems Chuck doesn't want this to work.

This _might_ be a functionality difference between Win32 and unix versions of
gawk. I ran this under Windows95 on DOS text files (CR-LF line termination
sequence in the file itself). If this is a portability issue, it might be nice
to confirm.

Patrick TJ McPhee

Oct 1, 1999, 3:00:00 AM
In article <19990930223202...@ngol08.aol.com>,
Harlan Grove <hrl...@aol.comzzzzzz> wrote:
% In article <7t119a$4...@news-central.tiac.net>, de...@sunspot.tiac.net (Charles
% Demas) writes:
%
% >>gawk 'BEGIN { RS = "[.@]*[^-0-9A-Za-z_.@]+[.@]*" } /@/' infile
% >
% >For the record, I tried the above one-liner on a sample file, and it
% >didn't work. I didn't bother to try to figure out why.
%
% And further for the record, I tried it on the e-mail cc Chuck Demas sent me,
% and it pull all the 'e-mail addresses', which given the definition above pulled
% part of the message IDs.

This worked OK on my inbox using gawk and mawk, but not using nawk. nawk
doesn't seem to like [] in RS, although it does accept some REs.

--

Patrick TJ McPhee
East York Canada
pt...@interlog.com

Charles Demas

Oct 1, 1999, 3:00:00 AM
In article <oPVI3.1191$oY.3...@cac1.rdr.news.psi.ca>,

My usual shell account (at TIAC.NET) has an older version
of gawk installed:

Gnu Awk (gawk) 2.15, patchlevel 6

The script in question does not work as desired on that version.

On another shell account, at a different ISP, I have access to a
later version of gawk, version 3.0.3, and the script works just fine
on that version.

Unfortunately, I have no real control over what is available or
installed at TIAC.NET, so I cannot just "upgrade it" to
the latest version. :~(
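The results reported in this thread would be consistent with some awks treating RS as a regular expression and others using only its first character. A quick probe along these lines could tell which kind a given awk is. This is a sketch; the function name awk_has_regex_rs is made up, and the awk to test is passed as an optional argument.

```shell
# Succeeds if the given awk (default: awk) splits records on a regexp RS.
awk_has_regex_rs() {
    # With regexp RS support, "a1b" becomes two records ("a" and "b").
    # An awk that uses only the first character of RS ("[") sees one.
    n=$(printf 'a1b\n' |
        "${1:-awk}" 'BEGIN { RS = "[0-9]" } END { print NR }' 2>/dev/null)
    [ "${n:-0}" -ge 2 ]
}

if awk_has_regex_rs; then
    echo "awk: RS is treated as a regexp"
else
    echo "awk: regexp RS not supported; use FS or a tr pipeline instead"
fi
```

On an awk that fails the probe, an approach that splits on FS or preprocesses with tr avoids the dependency on RS altogether.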
