I want to extract information from an HTML page, and I need to match all characters except <.
"All characters" means everything above chr(31), plus the range above chr(128) (Hungarian accented letters).
The .* is very greedy: it eats the < characters too.
If I could use a "not", I could simply define a regexp like this:
[not<]*</a>
That would capture everything inside the link, up to the closing tag.
I wrote this program, but I think it is too complex:
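import re

# Collect the ASCII punctuation and digit characters chr(33)..chr(64),
# leaving out '<' and '>'. re.escape() makes each one safe inside [...].
l = []
for i in range(33, 65):
    if i != ord('<') and i != ord('>'):
        l.append(re.escape(chr(i)))
# Inside [...] a '|' is a literal character, so join without a separator.
s = ''.join(l)
# Word characters, whitespace, the range chr(128)-chr(255), and the list above.
allowed = r'\w\s%s-%s%s' % (chr(128), chr(255), s)
sre = '<Subj>([%s]{1,1024})</d>' % allowed
#sre = '<Subj>([?!\\<]{1,1024})</d>'
s = '<Subj>xmvccv ÁÁÁ sdfkdsfj eirfie</d><A></d>'
print(sre)
print(s)
cp = re.compile(sre)
m = cp.search(s)
print(m.groups())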
Does Python's regexp syntax have a negation, or a "not" function? How do I use it?
Thanks for the help,
kk
You could try these regexps or variants thereof:
"<Subj>([^<]*)"
A leading '^' negates the character set, so it matches any single
character except those listed after the '^'.
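For illustration, here is a minimal sketch of the negated set applied to the sample string from the question. The trailing `</d>` in the pattern is my addition, taken from that sample; adjust it to whatever closing tag your real data uses.

```python
import re

# Sample input copied from the question; note that it closes with </d>.
s = '<Subj>xmvccv ÁÁÁ sdfkdsfj eirfie</d><A></d>'

# [^<]* matches any run of characters that are not '<', accented
# letters included, so the group ends right before the closing tag.
m = re.search(r'<Subj>([^<]*)</d>', s)
subject = m.group(1)
print(subject)
```

Unlike '.', a negated set such as [^<] also matches newlines, so this works for text that spans several lines.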
"<Subj>(.*?)<"
The '?' makes the preceding '*' non-greedy, i.e. the following '<' will
match the first '<' character encountered after "<Subj>" in the string to be searched.
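A small side-by-side sketch may make the difference clearer. It uses the sample string from the question; the variable names `greedy` and `lazy` are just for illustration.

```python
import re

# Sample input copied from the question.
s = '<Subj>xmvccv ÁÁÁ sdfkdsfj eirfie</d><A></d>'

# Greedy: .* grabs as much as it can, then backs up to the LAST '<'.
greedy = re.search(r'<Subj>(.*)<', s).group(1)

# Non-greedy: .*? grabs as little as it can, stopping at the FIRST '<'.
lazy = re.search(r'<Subj>(.*?)<', s).group(1)

print(repr(greedy))
print(repr(lazy))
```

The greedy version runs on through the first closing tag and the `<A>` tag, while the non-greedy version stops at the end of the subject text.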
Peter
> I want to extract information from an HTML page, and I need to match all characters except <.
> "All characters" means everything above chr(31), plus the range above chr(128) (Hungarian accented letters).
> The .* is very greedy: it eats the < characters too.
Instead of writing ad-hoc HTML parsers, use BeautifulSoup.
http://www.crummy.com/software/BeautifulSoup/
It will most likely do what you want in 2 or 3 lines of code.
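To show what the parser-based approach looks like, here is a rough sketch that uses the standard library's html.parser rather than BeautifulSoup itself, so it runs without installing anything. It is longer than the BeautifulSoup equivalent would be. The class name `LinkTextCollector` and the sample HTML are made up for this example.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_a = False
        self._href = None
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "a":
            self._in_a = True
            self._href = dict(attrs).get("href")
            self._parts = []

    def handle_data(self, data):
        # Keep only the text that appears inside an open <a> element.
        if self._in_a:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_a:
            self.links.append((self._href, "".join(self._parts)))
            self._in_a = False


# Invented sample input with Hungarian accented text.
html = '<p><a href="/hirek">Árvíztűrő tükörfúrógép</a> és <a href="/b">más</a></p>'

parser = LinkTextCollector()
parser.feed(html)
parser.close()
links = parser.links
print(links)
```

The parser decides where the text ends, so there is no need to describe "every character except <" by hand.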
--
hilsen/regards Max M, Denmark
http://www.mxm.dk/
IT's Mad Science
Hey, I like that. Thanks.