python re - a not needed

kepes.krisztian

Dec 16, 2004, 4:06:42 AM
to pytho...@python.org
Hi!

I want to extract information from an HTML file, and I need to match every character except <.
"Every character" means anything above chr(31), including those above chr(128) (Hungarian accented letters).
The .* pattern is too greedy: it consumes the < characters too.

If I could use a "not", I could simply define a regexp like this:
[not<]*</a>

It would capture everything in the href.

I wrote this program, but I think it is too complex:

import re

l=[]
for i in range(33,65):
    if i<>ord('<') and i<>ord('>'):
        l.append('\\'+chr(i))
s='|'.join(l)
all='\w|\s|\%s-\%s|%s'%(chr(128),chr(255),s)
sre='<Subj>([%s]{1,1024})</d>'%all
#sre='<Subj>([?!\\<]{1,1024})</d>'
s='<Subj>xmvccv ÁÁÁ sdfkdsfj eirfie</d><A></d>'


print sre
print s
cp=re.compile(sre)
m=cp.search(s)
print m.groups()

Does Python's regexp syntax have an exclusion or "not" construct? How do I use it?

Thanks for your help,
kk

Peter Otten

Dec 16, 2004, 4:21:22 AM
kepes.krisztian wrote:

You could try these regexps or variants thereof:

"<Subj>([^<]*)"

A leading '^' inside the brackets negates the character set, so it matches
any character that is not listed after the '^'.
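
As a minimal sketch of the negated character set, assuming Python 3 (which this thread predates) and reusing the sample string from the question; the variable names are only illustrative:

```python
import re

# Sample input taken from the original question.
s = '<Subj>xmvccv ÁÁÁ sdfkdsfj eirfie</d><A></d>'

# [^<]* matches any run of characters that are not '<',
# so the group stops right before the first closing tag.
pattern = re.compile(r'<Subj>([^<]*)')
m = pattern.search(s)
subject = m.group(1) if m else None
print(subject)
```

Unlike '.', a negated set also matches newline characters, so the captured text may span several lines.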

"<Subj>(.*?)<"

The '?' makes the preceding '*' non-greedy, i.e. the following '<' will
match the first '<' character encountered after '<Subj>'.
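
To make the difference concrete, here is a small sketch comparing the greedy and non-greedy forms, again assuming Python 3 and the sample string from the question; the names greedy and lazy are only illustrative:

```python
import re

# Sample input taken from the original question.
s = '<Subj>xmvccv ÁÁÁ sdfkdsfj eirfie</d><A></d>'

# Greedy: .* grabs as much as possible, then backs up to the LAST '<'.
greedy = re.search(r'<Subj>(.*)<', s).group(1)

# Non-greedy: .*? grabs as little as possible, stopping at the FIRST '<'.
lazy = re.search(r'<Subj>(.*?)<', s).group(1)

print(greedy)
print(lazy)
```

The greedy group runs on through the later tags, while the non-greedy group contains only the subject text.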

Peter

Max M

Dec 16, 2004, 4:52:26 AM
kepes.krisztian wrote:

> I want to extract information from an HTML file, and I need to match every character except <.
> "Every character" means anything above chr(31), including those above chr(128) (Hungarian accented letters).
> The .* pattern is too greedy: it consumes the < characters too.

Rather than writing ad-hoc HTML parsers, use BeautifulSoup.

http://www.crummy.com/software/BeautifulSoup/

It will most likely do what you want in 2 or 3 lines of code.
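
BeautifulSoup is a third-party package, so as a rough sketch of the parser-based approach the example below swaps in the standard-library html.parser module from Python 3 instead. It is longer than the BeautifulSoup version would be. The class name SubjCollector is hypothetical, and the input is the sample string from the question:

```python
from html.parser import HTMLParser


class SubjCollector(HTMLParser):
    """Collect the text that directly follows each <Subj> start tag."""

    def __init__(self):
        super().__init__()
        self.in_subj = False
        self.subjects = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so compare against 'subj'.
        self.in_subj = (tag == 'subj')

    def handle_endtag(self, tag):
        # Any closing tag ends the subject text, even the mismatched </d>.
        self.in_subj = False

    def handle_data(self, data):
        if self.in_subj:
            self.subjects.append(data)


parser = SubjCollector()
parser.feed('<Subj>xmvccv ÁÁÁ sdfkdsfj eirfie</d><A></d>')
parser.close()
print(parser.subjects)
```

The parser does the tag handling, so no regexp has to decide where a tag starts or ends.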

--

hilsen/regards Max M, Denmark

http://www.mxm.dk/
IT's Mad Science

Paul Rubin

Dec 16, 2004, 5:00:17 AM
Max M <ma...@mxm.dk> writes:
> Rather than writing ad-hoc HTML parsers, use BeautifulSoup.
>
> http://www.crummy.com/software/BeautifulSoup/

Hey, I like that. Thanks.
