splitting file/content into lines based on regex termination

bruce

Nov 7, 2013, 12:15:48 PM

to pytho...@python.org

hi.

got a test file with the sample content listed below:

the content is one long string, and needs to be split into separate lines

I'm thinking the pattern to split on should be a kind of regex like::
 #45 / 58#0#
or
 #9 / 58#0
but i have no idea how to make this happen!!

if i read the content into a buf -> s

import re
dat = re.compile("what goes here??").split(s)

--i'm not sure what goes in the compile() to get the process to work..

thoughts/comments would be helpful.

thanks

test dat::
10116#000#C S#S#100##001##DAY#Fund of Computing#Barrett,
William#3#MWF #08:00am #08:50am #3718 HBLL #45 /
58#0#10116#000#C S#S#100##002##DAY#Fund of Computing#Barrett,
William#3#MWF #09:00am #09:50am #3718 HBLL #9 /
58#0#10178#000#C S#S#124##001##DAY#Computer Systems#Roper,
Paul#3#MWF #11:00am #11:50am #1170 TMCB #41 /
145#0#10178#000#C S#S#124##002##DAY#Computer Systems#Roper,
Paul#3#MWF #2:00pm #2:50pm #1170 TMCB #40 /
120#0#01489#002#C S#S#142##001##DAY#Intro to Computer
Programming#Burton, Robert <div class='instructors'>Seppi, Kevin </div><span

bruce

Nov 7, 2013, 12:45:48 PM

to pytho...@python.org

update...

dat=re.compile(" #(\d+) / (\d+)#(\d+)#").split(s)

almost works..

except i get
m = 10116#000#C S#S#100##001##DAY#Fund of Computing#Barrett,
William#3#MWF #08:00am #08:50am #3718 HBLL
m = 45
m = 58
m = 0
m = 10116#000#C S#S#100##002##DAY#Fund of Computing#Barrett,
William#3#MWF #09:00am #09:50am #3718 HBLL
m = 9
m = 58
m = 0

and what i want is:
m = 10116#000#C S#S#100##001##DAY#Fund of Computing#Barrett,
William#3#MWF #08:00am #08:50am #3718 HBLL 45 / 58,0
m = 10116#000#C S#S#100##002##DAY#Fund of Computing#Barrett,
William#3#MWF #09:00am #09:50am #3718 HBLL 9 / 58,0

so i'd have the results of the "compile/regex process" to be added to
the split lines

thoughts/comments??

thanks

MRAB

Nov 7, 2013, 1:13:14 PM

to pytho...@python.org

On 07/11/2013 17:45, bruce wrote:
> update...
>
> dat=re.compile(" #(\d+) / (\d+)#(\d+)#").split(s)
>
> almost works..
>
> except i get
> m = 10116#000#C S#S#100##001##DAY#Fund of Computing#Barrett,
> William#3#MWF #08:00am #08:50am #3718 HBLL
> m = 45
> m = 58
> m = 0
> m = 10116#000#C S#S#100##002##DAY#Fund of Computing#Barrett,
> William#3#MWF #09:00am #09:50am #3718 HBLL
> m = 9
> m = 58
> m = 0
>
> and what i want is:
> m = 10116#000#C S#S#100##001##DAY#Fund of Computing#Barrett,
> William#3#MWF #08:00am #08:50am #3718 HBLL 45 / 58,0
> m = 10116#000#C S#S#100##002##DAY#Fund of Computing#Barrett,
> William#3#MWF #09:00am #09:50am #3718 HBLL 9 / 58,0
>
>
> so i'd have the results of the "compile/regex process" to be added to
> the split lines
>
> thoughts/comments??
>
> thanks
>

The split method also returns what's matched in any capture groups,
i.e. "(\d+)". Try omitting the parentheses:

dat = re.compile(r" #\d+ / \d+#\d+#").split(s)

You should also be using raw string literals as above (r"..."). It
doesn't matter in this instance, but it might in others.
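
To illustrate the point about capture groups, here is a minimal sketch. The string s below is a shortened, made-up stand-in for the test data, not the real file; only the two patterns come from this thread.

```python
import re

# Shortened, hypothetical stand-in for the real data (assumption).
s = "A#3718 HBLL #45 / 58#0#B#3718 HBLL #9 / 58#0#"

# With capture groups, split() interleaves the captured text with the pieces.
with_groups = re.compile(r" #(\d+) / (\d+)#(\d+)#").split(s)

# Without capture groups, only the text between separators comes back.
without_groups = re.compile(r" #\d+ / \d+#\d+#").split(s)

print(with_groups)
# ['A#3718 HBLL', '45', '58', '0', 'B#3718 HBLL', '9', '58', '0', '']
print(without_groups)
# ['A#3718 HBLL', 'B#3718 HBLL', '']
```

The trailing empty string appears because the sample ends with a separator, which is why a loop over the result usually skips empty items.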

bruce

Nov 7, 2013, 1:45:36 PM

to pytho...@python.org

hi.

thanks for the reply.

tried what you suggested. what I see now, is that I print out the
lines, but not the regex data at all. my initial try, gave me the
line, and then the next items , followed by the next line, etc...

what I then tried, was to do a capture/findall of the regex, and
combine the outputs in separate loops, which will be ugly but will
work....

import re
import sys

ff = "byu2.dat"
#fff = "sdsu2.dat"
with open(ff, "r") as myfile:
    s = myfile.read()

s = s.replace("&nbsp", "")

#with open(fff,"w") as myfile2:
#    myfile2.write(s)
# the separator looks like:  #45 / 58#0#
#dat1=re.compile(" #(\d+) / (\d+)#(\d+)#").search(s).findall()
# findall gives one tuple of the three numbers per separator
dat1 = re.findall(r" #(\d+) / (\d+)#(\d+)#", s)

# split with capture groups: lines interleaved with the captured numbers
dat = re.compile(r" #(\d+) / (\d+)#(\d+)#").split(s)

# split without capture groups: lines only
dat2 = re.compile(r" #\d+ / \d+#\d+#").split(s)
#dat=re.split('(" #(\d+) / (\d+)#(\d+)#")',s)
#dat=re.compile(" #(\d+)").split(s)

for m in dat:
    if m:
        print("m = " + m)

#sys.exit()

print("dat1")
print(dat1)
print(len(dat1))
print("dat2a")
#sys.exit()

# for m in dat1:
#     if m:
#         print("m = " + m)
#
# #sys.exit()

for m in dat2:
    if m:
        print("m = " + m)

#sys.exit()

sys.exit()

the test data is pasted to -->>> http://bpaste.net/show/kYzBUIfhc5023phOVmcu/

thanks
!!
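
For reference, the findall-plus-split combination described above could be sketched roughly as follows. This is only an illustration under assumptions: the sample string is a shortened stand-in for the real data, and the names sep, bodies, fields and records are made up for the example.

```python
import re

# Shortened, hypothetical stand-in for the real data (assumption).
s = ("10116#000#C S#MWF #08:00am #3718 HBLL #45 / 58#0#"
     "10116#000#C S#MWF #09:00am #3718 HBLL #9 / 58#0#")

sep = re.compile(r" #(\d+) / (\d+)#(\d+)#")

# One tuple of the three captured numbers per separator.
fields = sep.findall(s)

# The text between separators, without the captured numbers.
bodies = re.compile(r" #\d+ / \d+#\d+#").split(s)

# zip() stops at the shorter list, so the trailing empty piece is dropped.
records = ["%s %s / %s,%s" % (body, a, b, c)
           for body, (a, b, c) in zip(bodies, fields)]

for m in records:
    print("m = " + m)
```

This pairs the n-th line with the n-th set of numbers in a single pass instead of two separate loops.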


Piet van Oostrum

Nov 9, 2013, 8:05:03 PM

bruce <bado...@gmail.com> writes:

> hi.
>
> thanks for the reply.
>
> tried what you suggested. what I see now, is that I print out the
> lines, but not the regex data at all. my initial try, gave me the
> line, and then the next items , followed by the next line, etc...

exp = re.compile(r"( #\d+\s*/\s*\d+#\d+#)")

exp.split(s)
=>
['10116#000#C S#S#100##001##DAY#Fund of Computing#Barrett,\nWilliam#3#MWF #08:00am #08:50am #3718 HBLL ',
' #45 /\n58#0#',
'10116#000#C S#S#100##002##DAY#Fund of Computing#Barrett,\nWilliam#3#MWF #09:00am #09:50am #3718 HBLL ',
' #9 /\n58#0#',
'10178#000#C S#S#124##001##DAY#Computer Systems#Roper,\nPaul#3#MWF #11:00am #11:50am #1170 TMCB ',
' #41 /\n145#0#',
'10178#000#C S#S#124##002##DAY#Computer Systems#Roper,\nPaul#3#MWF #2:00pm #2:50pm #1170 TMCB ',
' #40 /\n120#0#',
"01489#002#C S#S#142##001##DAY#Intro to Computer\nProgramming#Burton, Robert <div class='instructors'>Seppi, Kevin<br\n/></div>"]
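
Since the whole separator sits in a single capture group, the split result alternates between record text and separator, so the two can be glued back together. A minimal sketch of that idea, assuming a shortened made-up sample string and hypothetical names (parts, records, leftover):

```python
import re

# Shortened, hypothetical stand-in for the real data (assumption).
s = ("10116#000#C S#MWF #08:00am #3718 HBLL #45 /\n58#0#"
     "10116#000#C S#MWF #09:00am #3718 HBLL #9 /\n58#0#"
     "01489#002#C S#142 trailing")

exp = re.compile(r"( #\d+\s*/\s*\d+#\d+#)")
parts = exp.split(s)

# Even indexes hold record text, odd indexes hold the matched separators.
bodies = parts[0::2]
seps = parts[1::2]

# Rejoin each record with its separator and collapse embedded newlines/spaces.
records = [" ".join((body + sep).split()) for body, sep in zip(bodies, seps)]

# Whatever follows the last separator is an incomplete record.
leftover = parts[-1]

for m in records:
    print("m = " + m)
```

The whitespace collapsing is optional; it is there only because the sample data wraps in the middle of a record.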
--
Piet van Oostrum <pi...@vanoostrum.org>
WWW: http://pietvanoostrum.com/
PGP key: [8DAE142BE17999C4]