I want to extract information from an HTML page, but I need to match every character except <. "Every character" means everything above chr(31), including the ones above chr(128) (Hungarian accented letters). The .* pattern is very greedy: it eats the < characters too.
If I could use a "not", I could simply define a regexp like [not<]*</a>
That would match everything in the href.
I wrote this program, but I think it is too complex:
import re

l=[]
for i in range(33,65):
    if i<>ord('<') and i<>ord('>'):
        l.append('\\'+chr(i))
s='|'.join(l)
all='\w|\s|\%s-\%s|%s'%(chr(128),chr(255),s)
sre='<Subj>([%s]{1,1024})</d>'%all
#sre='<Subj>([?!\\<]{1,1024})</d>'
s='<Subj>xmvccv ÁÁÁ sdfkdsfj eirfie</d><A></d>'
print sre
print s
cp=re.compile(sre)
m=cp.search(s)
print m.groups()
Does Python's regexp syntax have an exclusion or "not" operator? If so, how do I use it?
kepes.krisztian wrote:
> I want to get infos from a html, but I need all chars except <.
> All chars is: over chr(31), and over (128) - hungarian accents.
> The .* is very hungry, it is eat < chars too.
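For the literal question: the regexp syntax does have a "not", written as a negated character class, [^<]. A non-greedy quantifier, .*?, is the other usual way to stop .* from eating too much. Below is a minimal sketch in Python 3 syntax. The sample string is modeled on the one in the post, with ÁÁÁ standing in for the accented letters, and the variable names are only illustrative.

```python
import re

# Sample input modeled on the original post (note the odd </d> closing tags).
s = '<Subj>xmvccv ÁÁÁ sdfkdsfj eirfie</d><A></d>'

# Greedy .* runs to the LAST </d>, swallowing the '<' characters in between.
greedy = re.search(r'<Subj>(.*)</d>', s).group(1)

# [^<] means "any single character except '<'", so the match
# cannot run past the start of the next tag.
negated = re.search(r'<Subj>([^<]{1,1024})</d>', s).group(1)

# .*? is the non-greedy form: it stops at the FIRST possible </d>.
lazy = re.search(r'<Subj>(.*?)</d>', s).group(1)

print(repr(greedy))
print(repr(negated))
print(repr(lazy))
```

The negated class also matches newlines, which a plain . does not unless re.DOTALL is given, so the two forms are not fully interchangeable on multi-line input.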
Instead of writing an ad-hoc HTML parser, use BeautifulSoup.
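To illustrate the parser-based approach without a third-party install, here is a rough sketch that uses the standard library's html.parser as a stand-in for BeautifulSoup (it is not BeautifulSoup's API). The class name TagTextCollector and the sample markup are made up for this example, and the syntax is Python 3.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text inside every occurrence of one tag (hypothetical helper)."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag.lower()  # HTMLParser reports tag names in lowercase
        self.depth = 0          # > 0 while we are inside the wanted tag
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            self.texts.append('')  # start a new text entry for this element

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Text outside the wanted tag is ignored.
        if self.depth:
            self.texts[-1] += data


collector = TagTextCollector('a')
collector.feed('<p><a href="/x">Első link</a> and <a href="/y">második</a></p>')
collector.close()
link_texts = collector.texts
print(link_texts)
```

The parser handles accented characters and stray < characters in text on its own, so no hand-built character class is needed.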