Parsing for email addresses

galileo228

Feb 15, 2010, 6:34:35 PM

Hey all,

I'm trying to write python code that will open a textfile and find the
email addresses inside it. I then want the code to take just the
characters to the left of the "@" symbol, and place them in a list.
(So if galil...@gmail.com was in the file, 'galileo228' would be
added to the list.)

Any suggestions would be much appreciated!

Matt

Jonathan Gardner

Feb 15, 2010, 6:49:31 PM

On Feb 15, 3:34 pm, galileo228 <mattbar...@gmail.com> wrote:
>
> I'm trying to write python code that will open a textfile and find the
> email addresses inside it. I then want the code to take just the
> characters to the left of the "@" symbol, and place them in a list.
> (So if galileo...@gmail.com was in the file, 'galileo228' would be
> added to the list.)
>
> Any suggestions would be much appreciated!
>

You may want to use regexes for this. For every match, split on '@'
and take the first bit.

Note that the actual specification for email addresses is far more
than a single regex can handle. However, for almost every single case
out there nowadays, a regex will get what you need.
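
A minimal Python 3 sketch of that suggestion follows. The pattern is a deliberately loose assumption chosen for illustration (it is not from this thread and is not RFC-compliant), and `local_parts` is a hypothetical helper name.

```python
import re

# Deliberately loose pattern (an assumption for illustration only):
# a run of characters that are not whitespace, '@', '<' or '>', then '@',
# then a domain containing at least one dot.
EMAIL_RE = re.compile(r"[^\s@<>]+@[^\s@<>]+\.[^\s@<>]+")

def local_parts(text):
    """Return the part before '@' for every address-like match in text."""
    return [match.split("@")[0] for match in EMAIL_RE.findall(text)]

sample = "Contact alice@example.com or bob.smith@mail.example.org today."
print(local_parts(sample))
```

Splitting on '@' and taking the first piece is enough here because the pattern itself never lets an '@' into the local part.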

Tim Chase

Feb 15, 2010, 7:35:21 PM

to Jonathan Gardner, pytho...@python.org

You can even capture the local part as the regex finds each match.
As Jonathan mentions, finding RFC-compliant email addresses can be a
hairy/intractable problem. But you can get a pretty close
approximation:

import re

r = re.compile(r'([-\w._+]+)@(?:[-\w]+\.)+(?:\w{2,5})', re.I)
#                                        ^
# if you want to allow local domains like
# user@localhost
# then change the "+" marked with the "^"
# to a "*" and the "{2,5}" to "+" to unlimit
# the TLD. This will change the outcome
# of the last test "jim@com" to True

for test, expected in (
        ('j...@example.com', True),
        ('j...@sub.example.com', True),
        ('@example.com', False),
        ('@sub.example.com', False),
        ('@com', False),
        ('jim@com', False),
        ):
    m = r.match(test)
    if bool(m) ^ expected:
        print "Failed: %r should be %s" % (test, expected)

emails = set()
for line in file('test.txt'):
    for match in r.finditer(line):
        emails.add(match.group(1))
print "All the emails:",
print ', '.join(emails)
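
The code above is Python 2 (print statements and the `file()` builtin). A rough Python 3 equivalent of the collection loop might look like the sketch below. The function name `collect_local_parts` and the sample lines are assumptions added for illustration; the regex is the one from this post.

```python
import re

# Same pattern as in the post above.
r = re.compile(r'([-\w._+]+)@(?:[-\w]+\.)+(?:\w{2,5})', re.I)

def collect_local_parts(lines):
    """Collect the captured local part of every match in an iterable of lines."""
    emails = set()
    for line in lines:
        for match in r.finditer(line):
            emails.add(match.group(1))
    return emails

# Hypothetical sample input standing in for the contents of a file.
sample_lines = [
    "To: Jim <jim@example.com>, ann@sub.example.com\n",
    "no address on this line\n",
]
print("All the emails:", ", ".join(sorted(collect_local_parts(sample_lines))))

# With a real file (the name test.txt comes from the post above):
# with open("test.txt") as handle:
#     print(collect_local_parts(handle))
```

Taking any iterable of lines, rather than a filename, keeps the function easy to try on a list before pointing it at a file.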

-tkc

Ben Finney

Feb 15, 2010, 8:01:03 PM

galileo228 <mattb...@gmail.com> writes:

> I'm trying to write python code that will open a textfile and find the
> email addresses inside it. I then want the code to take just the
> characters to the left of the "@" symbol, and place them in a list.

Email addresses can have more than one ‘@’ character. In fact, the
quoting rules allow the local-part to contain *any ASCII character* and
remain valid.

> Any suggestions would be much appreciated!

For a brief but thorough treatment of parsing email addresses, see RFC
3696, “Application Techniques for Checking and Transformation of Names”
<URL:http://www.ietf.org/rfc/rfc3696.txt>, specifically section 3.
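
To illustrate the point about multiple '@' characters, here is a small Python 3 sketch. The quoted address is a made-up example and `local_part` is a hypothetical helper; it only shows why cutting at the last '@' is safer than cutting at the first, and it does not validate anything.

```python
def local_part(address):
    """Return everything before the last '@' in address."""
    return address.rsplit('@', 1)[0]

quoted = '"odd@local"@example.com'  # hypothetical address with a quoted '@'

print(quoted.split('@')[0])   # naive: cuts at the first '@'
print(local_part(quoted))     # cuts at the last '@'
print(local_part('jim@example.com'))
```

For ordinary addresses both approaches agree; they differ only when the local part itself contains an '@'.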

--
\ “What I have to do is see, at any rate, that I do not lend |
`\ myself to the wrong which I condemn.” —Henry Thoreau, _Civil |
_o__) Disobedience_ |
Ben Finney

galileo228

Feb 16, 2010, 1:58:27 PM

Hey all, thanks as always for the quick responses.

I actually found a very simple way to do what I needed to do. In
short, I needed to take an email which had a large number of addresses
in the 'to' field, and place just the identifiers (everything to the
left of @domain.com), in a python list.

I simply highlighted all the addresses and placed them in a text file
called emails.txt. Then I had the following code which placed each
line in the file into the list 'names':

[code]
fileHandle = open('/Users/Matt/Documents/python/results.txt','r')
names = fileHandle.readlines()
[/code]

Now, the 'names' list has values looking like this:
['aa...@domain.com\n', 'bb...@domain.com\n', etc]. So I ran the following code:

[code]
for x in names:
    st_list.append(x.replace('@domain.com\n',''))
[/code]

And that did the trick! 'Names' now has ['aaa12', 'bbb34', etc].
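
For anyone reusing this approach, a self-contained Python 3 sketch is below. Two things are assumptions added here rather than taken from the code above: the result list has to be created before `append()` is called, and the last line of a file may lack a trailing newline, so the sketch strips whitespace first instead of matching '\n' inside `replace()`. The helper name `strip_domain` and the sample data are hypothetical.

```python
def strip_domain(lines, domain='@domain.com'):
    """Strip surrounding whitespace, then drop the shared domain from each line."""
    st_list = []  # must exist before append() is called
    for x in lines:
        st_list.append(x.strip().replace(domain, ''))
    return st_list

# Hypothetical stand-in for fileHandle.readlines(); the last entry has no
# trailing newline, as can happen at the end of a file.
names = ['aaa12@domain.com\n', 'bbb34@domain.com']
print(strip_domain(names))
```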

Obviously this only worked because all of the domain names were the
same. If they were not then based on your comments and my own
research, I would've had to use regex and the split(), which looked
massively complicated to learn.

Thanks all.

Matt


Tim Chase

Feb 16, 2010, 3:15:30 PM

to galileo228, pytho...@python.org

galileo228 wrote:
> [code]
> fileHandle = open('/Users/Matt/Documents/python/results.txt','r')
> names = fileHandle.readlines()
> [/code]
>
> Now, the 'names' list has values looking like this: ['aa...@domain.com
> \n', 'bb...@domain.com\n', etc]. So I ran the following code:
>
> [code]
> for x in names:
>     st_list.append(x.replace('@domain.com\n',''))
> [/code]
>
> And that did the trick! 'Names' now has ['aaa12', 'bbb34', etc].
>
> Obviously this only worked because all of the domain names were the
> same. If they were not then based on your comments and my own
> research, I would've had to use regex and the split(), which looked
> massively complicated to learn.

The complexities stemmed from several factors that, with more
details, could have made the solutions less daunting:

(a) you mentioned "finding" the email addresses -- this makes
it sound like there's other junk in the file that has to be
sifted through to find "things that look like an email address".
If the sole content of the file is lines containing only email
addresses, then "find the email address" is a bit like [1]

(b) you omitted the detail that the domains are all the same.
Even if they're not the same, (a) reduces the problem to a much
easier task:

s = set()
for line in file('results.txt'):
    s.add(line.rsplit('@', 1)[0].lower())
print s

If it was previously a CSV or tab-delimited file, Python offers
batteries-included processing to make it easy:

import csv
f = file('results.txt', 'rb')
r = csv.DictReader(f) # CSV
# r = csv.DictReader(f, delimiter='\t') # tab delim
s = set()
for row in r:
    s.add(row['Email'].lower())
f.close()

or even

f = file(...)
r = csv.DictReader(...)
s = set(row['Email'].lower() for row in r)
f.close()
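
The snippets above are Python 2. In Python 3 the `file()` builtin no longer exists, and the csv module reads text rather than bytes, so a rough equivalent might look like this sketch. The sample data is made up for illustration; only the 'Email' column name comes from the code above.

```python
import csv
import io

# Hypothetical sample data standing in for results.txt.
sample = io.StringIO("Name,Email\nJim,Jim@Example.com\nAnn,ann@example.org\n")

reader = csv.DictReader(sample)  # pass delimiter="\t" for tab-delimited input
addresses = {row["Email"].lower() for row in reader}
print(sorted(addresses))

# With a real file, the csv documentation recommends newline="":
# with open("results.txt", newline="") as f:
#     addresses = {row["Email"].lower() for row in csv.DictReader(f)}
```

The `with` block closes the file automatically, which replaces the explicit `f.close()` calls.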

Hope this gives you more ideas to work with.

-tkc

[1]
http://jacksmix.files.wordpress.com/2007/05/findx.jpg

galileo228

Feb 16, 2010, 7:07:57 PM

Tim -

Thanks for this. I actually did intend to have to sift through other
junk in the file, but then figured I could just cut and paste emails
directly from the 'to' field, thus making life easier.

Also, in this particular instance, the domain names were the same, and
thus I was able to figure out my solution, but I do need to know how
to handle the same situation when the domain names are different, so
your response was most helpful.

Apologies for leaving out some details.

Matt

On Feb 16, 3:15 pm, Tim Chase <python.l...@tim.thechases.com> wrote:
> galileo228 wrote:
> > [code]
> > fileHandle = open('/Users/Matt/Documents/python/results.txt','r')
> > names = fileHandle.readlines()
> > [/code]
>
> > Now, the 'names' list has values looking like this: ['aa...@domain.com
> > \n', 'bb...@domain.com\n', etc]. So I ran the following code:
>
> > [code]
> > for x in names:
> >     st_list.append(x.replace('...@domain.com\n',''))