Best Regards,
Ed Westin
man awk
Look for substr(), match(), RSTART and RLENGTH.
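A minimal sketch of how those pieces fit together, assuming an address is any run of letters, digits, hyphen, underscore and period on both sides of an at sign (a simplification, not the RFC definition). The function name and the example.com sample text are made up for illustration.

```shell
# Hypothetical helper: print every address-like token found on stdin.
extract_with_match() {
    awk '{
        line = $0
        # match() sets RSTART (start position) and RLENGTH (length)
        # for the leftmost match in line; it returns 0 if there is none.
        while (match(line, /[-0-9A-Za-z_.]+@[-0-9A-Za-z_.]+/)) {
            print substr(line, RSTART, RLENGTH)
            # Continue scanning after the token just printed.
            line = substr(line, RSTART + RLENGTH)
        }
    }'
}

printf 'Mail alice@example.com or <bob@example.org> today\n' | extract_with_match
```

Note that this sketch does not strip a period that directly follows an address at the end of a sentence.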
If you don't run on Unix maybe you can start here:
http://www.cl.cam.ac.uk/texinfodoc/gawk_11.html#SEC110
Regards,
/Peter
--
-= Spam safe(?) e-mail address: pez68 at netscape.net =-
This is a much more complicated task than it may at first seem.
The first thing you must determine is what form an email address
may take. This is defined in one of the RFCs, but IIRC, an
address _MAY_ include spaces inside double quotes.
After you've found the forms that a valid address may take, then
you can eliminate all characters that may not be in a valid address.
If one were to assume that a proper email address could contain
only alphanumeric characters, underscore, hyphen, at sign, and period,
but not space, and must contain an at sign, then it is more
straightforward, something like:
grep '@' infile |
sed 's/[^-a-z0-9A-Z.@]/ /g' |
tr ' ' '\012' |
sed -e 's/\.*$//' |
sed -n 's/@/@/p'
The functions of the lines are:
1 get only lines containing the at sign
2 change all unusable characters to spaces
3 change all spaces to newlines
4 remove periods at the end of lines
5 print only lines containing at signs
As I said, this is not a complete solution, but it shows a general
starting approach.
It could be simplified, BTW, I just didn't bother. :-)
Chuck Demas
Needham, Mass.
--
Eat Healthy | _ _ | Nothing would be done at all,
Stay Fit | @ @ | If a man waited to do it so well,
Die Anyway | v | That no one could find fault with it.
de...@tiac.net | \___/ | http://www.tiac.net/users/demas
I left the underscore out of line 2 above. It should be:
grep '@' infile |
sed 's/[^-a-z0-9A-Z.@_]/ /g' |
tr ' ' '\012' |
sed -e 's/\.*$//' |
sed -n 's/@/@/p'
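To see what the five stages do, here is the corrected pipeline wrapped in a function that reads standard input instead of infile. The function name and the sample lines are invented for illustration.

```shell
# Stages: 1 keep lines with an at sign, 2 turn unusable characters into
# spaces, 3 put one token per line, 4 strip trailing periods,
# 5 print only tokens that contain an at sign.
extract_addresses_sed() {
    grep '@' |
    sed 's/[^-a-z0-9A-Z.@_]/ /g' |
    tr ' ' '\012' |
    sed -e 's/\.*$//' |
    sed -n 's/@/@/p'
}

printf '%s\n' \
    'My address is <alice_b@example.com>.' \
    'no address on this line' \
    'Write to bob-smith@mail.example.org.' |
extract_addresses_sed
```

On this input the pipeline prints alice_b@example.com and bob-smith@mail.example.org, one per line. The angle brackets are handled by stage 2 and the sentence-ending period by stage 4.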
What about tabs?
Why sed -n 's/@/@/p' ? Wouldn't sed -n '/@/p' work? Or grep '@' ?
I think (ie, untested) this could be simplified to
awk '/@/ { for (f = 1; f <= NF; ++f)
    if ($f ~ /^[-0-9A-Za-z_.]+@[-0-9A-Za-z_.]+$/) { sub(/[.]$/, "", $f); print $f }
}' infile
which has the advantage of being just a single process. And if the original
poster needs to handle double-quoted names with embedded spaces, sed won't
suffice.
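Both questions about the sed pipeline can be checked on a small made-up input. This is only a sketch with an invented sample line: a tab is not in the allowed character set, so stage 2 already turns it into a space, and the three candidate final stages select the same tokens.

```shell
# Tokens as they look after stages 2 and 3, starting from tab-separated input.
tokens=$(printf 'see\tcarol@example.net\tfor details\n' |
    sed 's/[^-a-z0-9A-Z.@_]/ /g' |
    tr ' ' '\012')

# Three candidate final stages applied to the same tokens.
out_subst=$(printf '%s\n' "$tokens" | sed -n 's/@/@/p')
out_addr=$(printf '%s\n' "$tokens" | sed -n '/@/p')
out_grep=$(printf '%s\n' "$tokens" | grep '@')

printf '%s\n' "$out_subst" "$out_addr" "$out_grep"
```

All three variables end up holding carol@example.net.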
But, you didn't eliminate all the characters that aren't allowed in the
email address in your script.
It won't print anything for this line:
My address is <de...@tiac.net>.
Because you looked for a field starting with only certain characters.
Adding a gsub will solve that. :-)
gawk '/@/ {gsub(/[^-0-9A-Za-z_.@]/," ");
for (f = 1; f <= NF; ++f){
if ($f ~ /^[-0-9A-Za-z_.]+@[-0-9A-Za-z_.]+$/) {
sub(/[.]$/, "", $f); print $f }}}' infile
though this might be simpler:
gawk '/@/ {gsub(/[^-0-9A-Za-z_.@]/," ");
for (f = 1; f <= NF; ++f){
if($f ~ "@"){
sub(/[.]$/, "", $f);
print $f }}}' infile
Chuck Demas
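The reason the gsub helps is that changing $0 makes awk split the record into fields again. A small sketch on an invented sample line (dave@example.com is a placeholder) prints the field count before and after the substitution, followed by the field that now holds the bare address.

```shell
result=$(printf 'My address is <dave@example.com>.\n' |
awk '{
    before = NF                       # 4 fields; the last is <dave@example.com>.
    gsub(/[^-0-9A-Za-z_.@]/, " ")     # changes $0, so the fields are split again
    print before, NF, $4
}')

printf '%s\n' "$result"
```

This prints 4 5 dave@example.com: the brackets became spaces, so the address is now a field of its own and the trailing period is a separate fifth field.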
Thank you very much for your help. You have given me excellent ideas for
dealing with this and similar tasks in sed and awk. Instinct, and the
extremely powerful feel of these tools, sort of tells me that it is time
well spent to get at least a basic mastery of them before continuing with
Perl, which I had been using for about two months.
Much Obliged,
Ed Westin
<snip>
>But, you didn't eliminate all the characters that aren't allowed in the
>email address in your script.
>
>It won't print anything for this line:
>
> My address is <de...@tiac.net>.
>
>Because you looked for a field starting with only certain characters.
>Adding a gsub will solve that. :-)
>
>gawk '/@/ {gsub(/[^-0-9A-Za-z_.@]/," ");
> for (f = 1; f <= NF; ++f){
> if ($f ~ /^[-0-9A-Za-z_.]+@[-0-9A-Za-z_.]+$/) {
> sub(/[.]$/, "", $f); print $f }}}' infile
>
>
>though this might be simpler:
>
>gawk '/@/ {gsub(/[^-0-9A-Za-z_.@]/," ");
> for (f = 1; f <= NF; ++f){
> if($f ~ "@"){
> sub(/[.]$/, "", $f);
> print $f }}}' infile
>
Actually it would be easier still to make all characters that can't occur in an
e-mail into field separators, ie, FS = "[^-0-9A-Za-z_.@]+" . Getting much
cleverer if I can assume gawk, if [.@] can't occur in either first or last
position (or can they?), then make the RECORD separator RS =
"[.@]*[^-0-9A-Za-z_.@]+[.@]*". Then
gawk 'BEGIN { RS = "[.@]*[^-0-9A-Za-z_.@]+[.@]*" } /@/' infile
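For comparison, the field-separator idea could be sketched like this. It uses the same simplified character set as above, the function name and sample lines are made up, and trailing periods are stripped from a copy of the field rather than from the field itself.

```shell
# Hypothetical helper: every run of characters that cannot occur in an
# address (under the simplified definition) acts as a field separator.
extract_with_fs() {
    awk 'BEGIN { FS = "[^-0-9A-Za-z_.@]+" }
    /@/ {
        for (f = 1; f <= NF; ++f)
            if ($f ~ /@/) {
                addr = $f
                sub(/[.]+$/, "", addr)    # drop sentence-ending periods
                print addr
            }
    }'
}

printf '%s\n' \
    'My address is <erin@example.com>.' \
    'nothing on this line' \
    '(frank@example.org)' |
extract_with_fs
```

This only needs a regexp FS, not a regexp RS, so it may be the safer choice where the record-separator version is not available.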
Now for an academic question: can perl's $\ be a regexp? The camel book says it
can be multicharacter, but it doesn't say it can be a regexp.
> <snip>
> Now for an academic question: can perl's $\ be a regexp? The camel book says it
> can be multicharacter, but it doesn't say it can be a regexp.
...I think you have your slash backwards...
perl's $/ (which is aliased to $RS if use()ing English) can be multi-
character, but is not a regexp.
perl's $\ ($ORS) is multicharacter, but, being just output, is also not
a regexp.
from perldoc perlvar:
Remember: the value of $/ is a string, not a regexp. AWK has to be
better for something :-)
--Larry Wall
Colin DeVilbiss
crde...@mtu.edu
For the record, I tried the above one-liner on a sample file, and it
didn't work. I didn't bother to try to figure out why.
>>gawk 'BEGIN { RS = "[.@]*[^-0-9A-Za-z_.@]+[.@]*" } /@/' infile
>
>For the record, I tried the above one-liner on a sample file, and it
>didn't work. I didn't bother to try to figure out why.
And further for the record, I tried it on the e-mail cc Chuck Demas sent me,
and it pulled all the 'e-mail addresses', which, given the definition above,
included parts of the message IDs.
It almost seems Chuck doesn't want this to work.
This _might_ be a functionality difference between Win32 and unix versions of
gawk. I ran this under Windows95 on DOS text files (CR-LF line termination
sequence in the file itself). If this is a portability issue, it might be nice
to confirm.
This worked OK on my inbox using gawk and mawk, but not using nawk. nawk
doesn't seem to like [] in RS, although it does accept some REs.
--
Patrick TJ McPhee
East York Canada
pt...@interlog.com
My usual shell account (at TIAC.NET) has an older version
of gawk installed:
Gnu Awk (gawk) 2.15, patchlevel 6
The script in question does not work as desired on that version.
On another shell account, at a different ISP, I have access to a
later version of gawk, version 3.0.3, and the script works just fine
on that version.
Unfortunately, I have no real control over what is available or
installed at TIAC.NET, so I cannot just "upgrade it" to
the latest version. :~(
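Since the reports above suggest that regexp RS support differs between awk versions, one way around it is to probe the installed awk first and fall back to a field-separator loop. This is only a sketch: both function names are hypothetical, the probe assumes that an awk without regexp RS uses just the first character of RS, and the fallback is meant to approximate the RS one-liner, not to be proven equivalent to it.

```shell
# Hypothetical probe: succeeds if the given awk (default: awk) treats a
# multi-character RS as a regexp. With regexp support, "a,b;c" is three
# records; an awk that uses only the first character "[" sees one record.
awk_has_regex_rs() {
    rs_probe_count=$(printf 'a,b;c\n' |
        "${1:-awk}" 'BEGIN { RS = "[,;]" } { n++ } END { print n }')
    [ "$rs_probe_count" = "3" ]
}

# Hypothetical wrapper: use the RS one-liner where it works, otherwise
# split on FS and trim stray periods and at signs from both ends.
extract_portable() {
    if awk_has_regex_rs awk; then
        awk 'BEGIN { RS = "[.@]*[^-0-9A-Za-z_.@]+[.@]*" } /@/'
    else
        awk 'BEGIN { FS = "[^-0-9A-Za-z_.@]+" }
        {
            for (f = 1; f <= NF; ++f)
                if ($f ~ /@/) {
                    addr = $f
                    gsub(/^[.@]+|[.@]+$/, "", addr)
                    if (addr ~ /@/) print addr
                }
        }'
    fi
}

printf 'Mail <gina@example.com>.\nnone here\n' | extract_portable
```

Either branch should print gina@example.com for this sample line.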