
splitting file/content into lines based on regex termination


bruce

Nov 7, 2013, 12:15:48 PM
to pytho...@python.org
hi.

got a test file with the sample content listed below:

the content is one long string, and needs to be split into separate lines

I'm thinking the pattern to split on should be a regex that matches terminators like::
<br>#45 / 58#0#
or
<br>#9 / 58#0#
but I have no idea how to make this happen!!

if i read the content into a buf -> s

import re
dat = re.compile("what goes here??").split(s)

--i'm not sure what goes in the compile() to get the process to work..

thoughts/comments would be helpful.

thanks


test dat::
10116#000#C S#S#100##001##DAY#Fund of Computing#Barrett,
William#3#MWF<br>#08:00am<br>#08:50am<br>#3718 HBLL <br>#45 /
58#0#10116#000#C S#S#100##002##DAY#Fund of Computing#Barrett,
William#3#MWF<br>#09:00am<br>#09:50am<br>#3718 HBLL <br>#9 /
58#0#10178#000#C S#S#124##001##DAY#Computer Systems#Roper,
Paul#3#MWF<br>#11:00am<br>#11:50am<br>#1170 TMCB <br>#41 /
145#0#10178#000#C S#S#124##002##DAY#Computer Systems#Roper,
Paul#3#MWF<br>#2:00pm<br>#2:50pm<br>#1170 TMCB <br>#40 /
120#0#01489#002#C S#S#142##001##DAY#Intro to Computer
Programming#Burton, Robert <div class='instructors'>Seppi, Kevin<br
/></div><span

bruce

Nov 7, 2013, 12:45:48 PM
to pytho...@python.org
update...

dat=re.compile("<br>#(\d+) / (\d+)#(\d+)#").split(s)

almost works..

except i get
m = 10116#000#C S#S#100##001##DAY#Fund of Computing#Barrett,
William#3#MWF<br>#08:00am<br>#08:50am<br>#3718 HBLL
m = 45
m = 58
m = 0
m = 10116#000#C S#S#100##002##DAY#Fund of Computing#Barrett,
William#3#MWF<br>#09:00am<br>#09:50am<br>#3718 HBLL
m = 9
m = 58
m = 0

and what i want is:
m = 10116#000#C S#S#100##001##DAY#Fund of Computing#Barrett,
William#3#MWF<br>#08:00am<br>#08:50am<br>#3718 HBLL 45 / 58,0
m = 10116#000#C S#S#100##002##DAY#Fund of Computing#Barrett,
William#3#MWF<br>#09:00am<br>#09:50am<br>#3718 HBLL 9 / 58,0


so i'd like the values matched by the regex to be appended to
the split lines

thoughts/comments??

thanks

MRAB

Nov 7, 2013, 1:13:14 PM
to pytho...@python.org
The split method also returns what's matched in any capture groups,
i.e. "(\d+)". Try omitting the parentheses:

dat = re.compile(r"<br>#\d+ / \d+#\d+#").split(s)
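To see the difference, here is a minimal sketch on a shortened, made-up string in the same shape as the test data (the "rec1"/"rec2"/"rec3" bodies are placeholders, not real records). It is written for Python 3; the thread itself uses Python 2 print statements.

```python
import re

# Made-up sample: three record bodies separated by two terminators.
s = "rec1 HBLL <br>#45 / 58#0#rec2 HBLL <br>#9 / 58#0#rec3"

# With capture groups, the captured text is interleaved into the result.
with_groups = re.split(r"<br>#(\d+) / (\d+)#(\d+)#", s)

# Without capture groups, only the text between terminators is returned.
without_groups = re.split(r"<br>#\d+ / \d+#\d+#", s)

print(with_groups)
# ['rec1 HBLL ', '45', '58', '0', 'rec2 HBLL ', '9', '58', '0', 'rec3']
print(without_groups)
# ['rec1 HBLL ', 'rec2 HBLL ', 'rec3']
```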

You should also be using raw string literals as above (r"..."). It
doesn't matter in this instance, but it might in others.
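
A small illustration of a case where it does matter, using "\b" as an example of my own choosing (it does not appear in the pattern above). In a normal string literal "\b" is the backspace character, while in a raw string it reaches the regex engine as the word-boundary escape:

```python
import re

plain = "\bcat\b"    # backspace + "cat" + backspace: 5 characters
raw = r"\bcat\b"     # backslash, "b", "cat", backslash, "b": 7 characters

print(len(plain), len(raw))            # 5 7
print(re.findall(plain, "a cat sat"))  # [] (no backspace characters in the text)
print(re.findall(raw, "a cat sat"))    # ['cat']
```

"\d" happens to survive in a normal string because Python does not recognise it as a string escape, which is why the original pattern worked anyway.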

bruce

Nov 7, 2013, 1:45:36 PM
to pytho...@python.org
hi.

thanks for the reply.

tried what you suggested. what I see now is that the lines print
out, but the regex data doesn't appear at all. my initial try gave me
the line, then the captured items, followed by the next line, etc...

what I then tried was to do a capture/findall of the regex, and
combine the outputs in separate loops, which will be ugly but will
work....

import re
import sys

ff = "byu2.dat"
#fff = "sdsu2.dat"
with open(ff, "r") as myfile:
    s = myfile.read()


s = s.replace("&nbsp", "")

#with open(fff,"w") as myfile2:
#    myfile2.write(s)
#<br>#45 / 58#0#
#<br>#45 / 58#0#
#dat1=re.compile("<br>#(\d+) / (\d+)#(\d+)#").search(s).findall()
# dat1: one tuple of the three captured numbers per terminator
dat1 = re.findall(r"<br>#(\d+) / (\d+)#(\d+)#", s)
# dat: lines interleaved with the captured numbers
dat = re.compile(r"<br>#(\d+) / (\d+)#(\d+)#").split(s)
# dat2: lines only (no capture groups, as MRAB suggested)
dat2 = re.compile(r"<br>#\d+ / \d+#\d+#").split(s)
#dat=re.split('("<br>#(\d+) / (\d+)#(\d+)#")',s)
#dat=re.compile("<br>#(\d+)").split(s)


for m in dat:
    if m:
        print("m = " + m)

#sys.exit()

print("dat1")
print(dat1)
print(len(dat1))
print("dat2a")
#sys.exit()

# for m in dat1:
#     if m:
#         print "m = "+m
#
# #sys.exit()

for m in dat2:
    if m:
        print("m = " + m)

#sys.exit()

sys.exit()
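
The findall/split combination described above does not need separate loops; the two lists can be walked together with zip(). This is only a sketch on a made-up sample string (the "rec" bodies are placeholders), written for Python 3, and it assumes every line to keep is followed by a terminator, so any trailing text after the last terminator is dropped.

```python
import re

# Made-up sample in the same shape as the test data.
s = "rec1 HBLL <br>#45 / 58#0#rec2 HBLL <br>#9 / 58#0#rec3"

# Line bodies: split on the terminator, no capture groups.
lines = re.split(r"<br>#\d+ / \d+#\d+#", s)
# Terminator values: one (a, b, c) tuple per terminator.
counts = re.findall(r"<br>#(\d+) / (\d+)#(\d+)#", s)

# zip() stops at the shorter list, so the unterminated "rec3" is skipped.
combined = [
    "{}{} / {},{}".format(line, a, b, c)
    for line, (a, b, c) in zip(lines, counts)
]

for m in combined:
    print("m = " + m)
# m = rec1 HBLL 45 / 58,0
# m = rec2 HBLL 9 / 58,0
```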


the test data is pasted to -->>> http://bpaste.net/show/kYzBUIfhc5023phOVmcu/

thanks
!!

Piet van Oostrum

Nov 9, 2013, 8:05:03 PM
bruce <bado...@gmail.com> writes:

> hi.
>
> thanks for the reply.
>
> tried what you suggested. what I see now is that the lines print
> out, but the regex data doesn't appear at all. my initial try gave me
> the line, then the captured items, followed by the next line, etc...

exp = re.compile(r"(<br>#\d+\s*/\s*\d+#\d+#)")

exp.split(s)
=>
['10116#000#C S#S#100##001##DAY#Fund of Computing#Barrett,\nWilliam#3#MWF<br>#08:00am<br>#08:50am<br>#3718 HBLL ',
'<br>#45 /\n58#0#',
'10116#000#C S#S#100##002##DAY#Fund of Computing#Barrett,\nWilliam#3#MWF<br>#09:00am<br>#09:50am<br>#3718 HBLL ',
'<br>#9 /\n58#0#',
'10178#000#C S#S#124##001##DAY#Computer Systems#Roper,\nPaul#3#MWF<br>#11:00am<br>#11:50am<br>#1170 TMCB ',
'<br>#41 /\n145#0#',
'10178#000#C S#S#124##002##DAY#Computer Systems#Roper,\nPaul#3#MWF<br>#2:00pm<br>#2:50pm<br>#1170 TMCB ',
'<br>#40 /\n120#0#',
"01489#002#C S#S#142##001##DAY#Intro to Computer\nProgramming#Burton, Robert <div class='instructors'>Seppi, Kevin<br\n/></div>"]
--
Piet van Oostrum <pi...@vanoostrum.org>
WWW: http://pietvanoostrum.com/
PGP key: [8DAE142BE17999C4]
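
Following on from the split output in Piet's reply, the alternating record/terminator items can be paired up to build the lines bruce originally asked for ("... HBLL 45 / 58,0"). This is a sketch rather than anything posted in the thread: it uses the first two records of the test data, Python 3 syntax, a second helper pattern (term) of my own, and the assumption that the embedded newlines are just mail wrapping and can be replaced by spaces.

```python
import re

# First two records of the test data, with the line wrapping kept as "\n".
s = ("10116#000#C S#S#100##001##DAY#Fund of Computing#Barrett,\n"
     "William#3#MWF<br>#08:00am<br>#08:50am<br>#3718 HBLL <br>#45 /\n"
     "58#0#10116#000#C S#S#100##002##DAY#Fund of Computing#Barrett,\n"
     "William#3#MWF<br>#09:00am<br>#09:50am<br>#3718 HBLL <br>#9 /\n"
     "58#0#")

# Piet's pattern: one group around the whole terminator keeps it in the result.
exp = re.compile(r"(<br>#\d+\s*/\s*\d+#\d+#)")
# Helper pattern (an addition for this sketch) to pull the numbers back out.
term = re.compile(r"<br>#(\d+)\s*/\s*(\d+)#(\d+)#")

parts = exp.split(s)  # [record, terminator, record, terminator, '']

records = []
# Even indexes hold record bodies, odd indexes hold their terminators.
for body, tail in zip(parts[0::2], parts[1::2]):
    a, b, c = term.match(tail).groups()
    # Assumption: newlines are wrapping artifacts, so join with a space.
    records.append("{}{} / {},{}".format(body.replace("\n", " "), a, b, c))

for m in records:
    print("m = " + m)
```

Because zip() stops at the shorter slice, the empty string left after the final terminator is ignored.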