Need help with a program

evilweasel

unread,

Jan 28, 2010, 10:07:04 AM1/28/10

to

Hi folks,

I am a newbie to python, and I would be grateful if someone could
point out the mistake in my program. Basically, I have a huge text
file similar to the format below:

AAAAAGACTCGAGTGCGCGGA 0
AAAAAGATAAGCTAATTAAGCTACTGG 0
AAAAAGATAAGCTAATTAAGCTACTGGGTT 1
AAAAAGGGGGCTCACAGGGGAGGGGTAT 1
AAAAAGGTCGCCTGACGGCTGC 0

The text is nothing but DNA sequences, and there is a number next to
it. What I will have to do is, ignore those lines that have 0 in it,
and print all other lines (excluding the number) in a new text file
(in a particular format called as FASTA format). This is the program I
wrote for that:

seq1 = []
list1 = []
lister = []
listers = []
listers1 = []
a = []
d = []
i = 0
j = 0
num = 0

file1 = open(sys.argv[1], 'r')
for line in file1:
if not line.startswith('\n'):
seq1 = line.split()
if len(seq1) == 0:
continue

a = seq1[0]
list1.append(a)

d = seq1[1]
lister.append(d)

b = len(lister)
for j in range(0, b):
if lister[j] == 0:
listers.append(j)
else:
listers1.append(j)

print listers1
resultsfile = open("sequences1.txt", 'w')
for i in listers1:
resultsfile.write('\n>seq' + str(i) + '\n' + list1[i] + '\n')

But this isn't working. I am not able to find the bug in this. I would
be thankful if someone could point it out. Thanks in advance!

Cheers!

Alf P. Steinbach

unread,

Jan 28, 2010, 10:17:41 AM1/28/10

to

* evilweasel:

What do you mean by "isn't working"?

> I am not able to find the bug in this. I would
> be thankful if someone could point it out. Thanks in advance!

What do you expect as output, and what do you actually get as output?

Cheers,

- Alf

Mark Dickinson

unread,

Jan 28, 2010, 10:22:00 AM1/28/10

to

On Jan 28, 3:07 pm, evilweasel <karthikramaswam...@gmail.com> wrote:
> Hi folks,
>
> I am a newbie to python, and I would be grateful if someone could
> point out the mistake in my program.

<snip>

> for j in range(0, b):
> if lister[j] == 0:

At a guess, this line should be:

if lister[j] == '0':
...

--
Mark

Krister Svanlund

unread,

Jan 28, 2010, 10:28:11 AM1/28/10

to evilweasel, pytho...@python.org

I'm not totaly sure what you want to do but try this (python2.6+):

newlines = []

with open(sys.argv[1], 'r') as f:
text = f.read();
for line in text.splitlines():
if not line.strip() and line.strip().endswith('1'):
newlines.append('seq'+line)

with open(sys.argv[2], 'w') as f:
f.write('\n'.join(newlines))

Krister Svanlund

unread,

Jan 28, 2010, 10:31:49 AM1/28/10

to evilweasel, pytho...@python.org

On Thu, Jan 28, 2010 at 4:28 PM, Krister Svanlund
<krister....@gmail.com> wrote:
> On Thu, Jan 28, 2010 at 4:07 PM, evilweasel
> <karthikra...@gmail.com> wrote:

> I'm not totaly sure what you want to do but try this (python2.6+):
>
> newlines = []
>
> with open(sys.argv[1], 'r') as f:
> text = f.read();
> for line in text.splitlines():
> if not line.strip() and line.strip().endswith('1'):

newlines.append('seq'+line.strip()[:-1].strip())

>
> with open(sys.argv[2], 'w') as f:
> f.write('\n'.join(newlines))
>

Gah, made some errors

D'Arcy J.M. Cain

unread,

Jan 28, 2010, 10:44:37 AM1/28/10

to evilweasel, pytho...@python.org

On Thu, 28 Jan 2010 07:07:04 -0800 (PST)
evilweasel <karthikra...@gmail.com> wrote:
> I am a newbie to python, and I would be grateful if someone could

Welcome.

> point out the mistake in my program. Basically, I have a huge text
> file similar to the format below:

You don't say how it isn't working. As a first step you should read
http://catb.org/~esr/faqs/smart-questions.html.

> The text is nothing but DNA sequences, and there is a number next to
> it. What I will have to do is, ignore those lines that have 0 in it,

Your code doesn't completely ignore them. See below.

> and print all other lines (excluding the number) in a new text file
> (in a particular format called as FASTA format). This is the program I
> wrote for that:
>
> seq1 = []
> list1 = []
> lister = []
> listers = []
> listers1 = []
> a = []
> d = []
> i = 0
> j = 0
> num = 0

This seems like an awful lot of variables for such a simple task.

>
> file1 = open(sys.argv[1], 'r')
> for line in file1:

This is good. You aren't trying to load the whole file into memory at
once. If the file is huge as you say then that would have been bad. I
would have made one small optimization that saves one assignment and
one extra variable.

for line in open(sys.argv[1], 'r'):

> if not line.startswith('\n'):
> seq1 = line.split()
> if len(seq1) == 0:
> continue

This is redundant and perhaps not even correct at the end of the file.
It assumes that the last line ends with a newline. Look at what
'\n'.split() gives you and see if you can't improve the above code.

Another small optimization - "if seq1" is better than "if len(seq1)".

>
> a = seq1[0]
> list1.append(a)

Aha! I may have found your bug. Are you mixing tabs and spaces?
Don't do that. Either always use spaces or always use tabs. My
suggestion is to use spaces and choose a short indent such as three or
even two but that's a religious issue.

>
> d = seq1[1]
> lister.append(d)

You can also do "a, d = seq1". Of course you must be sure that you
have two fields. Perhaps that's guaranteed for your input but a quick
sanity test wouldn't hurt here.

However, I don't understand all of the above. It may also be a source
of problems. You say the files are huge. Are you filling up memory
here? You did the smart thing reading the file but you lose it here.
In any case, see below.

> b = len(lister)
> for j in range(0, b):

Go lookup zip()

> if lister[j] == 0:

I think that you will find that lister[j] is "0", not 0.

> listers.append(j)
> else:
> listers1.append(j)

Why are you collecting the input? Just toss the '0' ones and write the
others lines directly to the output.

Hope this helps with this script and in further understanding the power
and simplicity of Python. Good luck.

--
D'Arcy J.M. Cain <da...@druid.net> | Democracy is three wolves
http://www.druid.net/darcy/ | and a sheep voting on
+1 416 425 1212 (DoD#0082) (eNTP) | what's for dinner.

Krister Svanlund

unread,

Jan 28, 2010, 10:39:16 AM1/28/10

to pytho...@python.org

On Thu, Jan 28, 2010 at 4:31 PM, Krister Svanlund

<krister....@gmail.com> wrote:
> On Thu, Jan 28, 2010 at 4:28 PM, Krister Svanlund
> <krister....@gmail.com> wrote:
>> On Thu, Jan 28, 2010 at 4:07 PM, evilweasel
>> <karthikra...@gmail.com> wrote:

I'm trying this again:

newlines = []

with open(sys.argv[1], 'r') as f:
text = f.read();

for line in (l.strip() for l in text.splitlines()):
if line:
line_elem = line.split()
if len(line_elem) == 2 and line_elem[1] == '1':
newlines.append('seq'+line_elem[0])

evilweasel

unread,

Jan 28, 2010, 10:50:31 AM1/28/10

to

I will make my question a little more clearer. I have close to 60,000
lines of the data similar to the one I posted. There are various
numbers next to the sequence (this is basically the number of times
the sequence has been found in a particular sample). So, I would need
to ignore the ones containing '0' and write all other sequences
(excluding the number, since it is trivial) in a new text file, in the
following format:

>seq59902
TTTTTTTATAAAATATATAGT

>seq59903
TTTTTTTATTTCTTGGCGTTGT

>seq59904
TTTTTTTGGTTGCCCTGCGTGG

>seq59905
TTTTTTTGTTTATTTTTGGG

The number next to 'seq' is the line number of the sequence. When I
run the above program, what I expect is an output file that is similar
to the above output but with the ones containing '0' ignored. But, I
am getting all the sequences printed in the file.

Kindly excuse the 'newbieness' of the program. :) I am hoping to
improve in the next few months. Thanks to all those who replied. I
really appreciate it. :)

nn

unread,

Jan 28, 2010, 11:13:09 AM1/28/10

to

People have already given you some pointers to your problem. In the
end you will have to "tweak the details" because only you have access
to the data not us.

Just as example here is another way to do what you are doing:

with open('dnain.dat') as infile, open('dnaout.dat','w') as outfile:
partgen=(line.split() for line in infile)
dnagen=(str(i+1)+'\n'+part[0]+'\n'
for i,part in enumerate(partgen)
if len(part)>1 and part[1]!='0')
outfile.writelines(dnagen)

Arnaud Delobelle

unread,

Jan 28, 2010, 12:00:54 PM1/28/10

to

nn <prue...@latinmail.com> writes:

I think that generator expressions are overrated :) What's wrong with:

with open('dnain.dat') as infile, open('dnaout.dat','w') as outfile:

for i, line in enumerate(infile):
parts = line.split()
if len(parts) > 1 and parts[1] != '0':
outfile.write(">seq%s\n%s\n" % (i+1, parts[0]))

(untested)

--
Arnaud

John Posner

unread,

Jan 28, 2010, 12:23:36 PM1/28/10

to evilweasel

Your program is a good first try. It contains a newbie error (looking
for the number 0 instead of the string "0"). But more importantly,
you're doing too much work yourself, rather than letting Python do the
heavy lifting for you. These practices and tools make life a lot easier:

* As others have noted, don't accumulate output in a list. Just write
data to the output file line-by-line.

* You don't need to initialize every variable at the beginning of the
program. But there's no harm in it.

* Use the enumerate() function to provide a line counter:

for counter, line in enumerate(file1):

This eliminates the need to accumulate output data in a list, then use
the index variable "j" as the line counter.

* Use string formatting. Each chunk of output is a two-line string, with
the line-counter and the DNA sequence as variables:

outformat = """seq%05d
%s
"""

... later, inside your loop ...

resultsfile.write(outformat % (counter, sequence))

HTH,
John

Jean-Michel Pichavant

unread,

Jan 28, 2010, 12:49:02 PM1/28/10

to evilweasel, pytho...@python.org

Using regexp may increase readability (if you are familiar with it).
What about

import re

output = open("sequences1.txt", 'w')

for index, line in enumerate(open(sys.argv[1], 'r')):
match = re.match('(?P<sequence>[GATC]+)\s+1')
if match:
output.write('seq%s\n%s\n' % (index, match.group('sequence')))

Jean-Michel

D'Arcy J.M. Cain

unread,

Jan 28, 2010, 1:03:44 PM1/28/10

to Jean-Michel Pichavant, pytho...@python.org, evilweasel

On Thu, 28 Jan 2010 18:49:02 +0100
Jean-Michel Pichavant <jeanm...@sequans.com> wrote:
> Using regexp may increase readability (if you are familiar with it).

If you have a problem and you think that regular expressions are the
solution then now you have two problems. Regex is really overkill for
the OP's problem and it certainly doesn't improve readability.

Jean-Michel Pichavant

unread,

Jan 28, 2010, 1:21:42 PM1/28/10

to D'Arcy J.M. Cain, pytho...@python.org, evilweasel

D'Arcy J.M. Cain wrote:
> On Thu, 28 Jan 2010 18:49:02 +0100
> Jean-Michel Pichavant <jeanm...@sequans.com> wrote:
>
>> Using regexp may increase readability (if you are familiar with it).
>>
>
> If you have a problem and you think that regular expressions are the
> solution then now you have two problems. Regex is really overkill for
> the OP's problem and it certainly doesn't improve readability.
>
>

It depends on the reader ability to understand a *simple* regexp.
It is also strange to get such answer after taking so much precautions,
so let me quote myself:

"Using regexp *may* increase readability (*if* you are *familiar* with it)."

I honestly find it quite readable in the sample code I provided and
spares all the if-len-startwith-strip logic, but If the OP does not
agree, fine with me. But there's no need to get certain that I'm
completly wrong.

JM

Steven Howe

unread,

Jan 28, 2010, 1:28:54 PM1/28/10

to pytho...@python.org

On 01/28/2010 09:49 AM, Jean-Michel Pichavant wrote:

> Using regexp may increase readability (if you are familiar with it).

> What about
>
> import re
>
> output = open("sequences1.txt", 'w')
>
> for index, line in enumerate(open(sys.argv[1], 'r')):
> match = re.match('(?P<sequence>[GATC]+)\s+1')
> if match:
> output.write('seq%s\n%s\n' % (index, match.group('sequence')))
>
>
> Jean-Michel

Finally!

After ready 8 or 9 messages about find a line ending with '1', someone
suggests Regex.
It was my first thought.

Steven

Mensanator

unread,

Jan 28, 2010, 1:38:17 PM1/28/10

to

And as a first thought, it is, of course, wrong.

You don't want lines ending in '1', you want ANY non-'0' amount.

Likewise, you don't want to exclude lines ending in '0' because
you'll end up excluding counts of 10, 20, 30, etc.

You need a regex that extracts ALL the numeric characters at the end
of the
line and exclude those that evaluate to 0.

>
> Steven

MRAB

unread,

Jan 28, 2010, 2:59:44 PM1/28/10

to pytho...@python.org

Steven Howe wrote:
> On 01/28/2010 09:49 AM, Jean-Michel Pichavant wrote:
>> evilweasel wrote:

>> Using regexp may increase readability (if you are familiar with it).
>> What about
>>
>> import re
>>
>> output = open("sequences1.txt", 'w')
>>
>> for index, line in enumerate(open(sys.argv[1], 'r')):
>> match = re.match('(?P<sequence>[GATC]+)\s+1')
>> if match:
>> output.write('seq%s\n%s\n' % (index, match.group('sequence')))
>>
>>
>> Jean-Michel
>
> Finally!
>
> After ready 8 or 9 messages about find a line ending with '1', someone
> suggests Regex.
> It was my first thought.
>

I'm a great fan of regexes, but I never though of using them for this
because it doesn't look like a regex type of problem to me.

nn

unread,

Jan 28, 2010, 4:22:10 PM1/28/10

to

Nothing really,
After posting I was thinking I should have posted a more
straightforward version like the one you wrote. Now there is! It
probably is more efficient too. I just have a tendency to think in
terms of pipes: "pipe this junk in here, then in here, get output".
Probably damage from too much Unix scripting.Since I can't resist the
urge to post crazy code here goes the bonus round (don't do this at
work):

open('dnaout.dat','w').writelines(
'seq%s\n%s\n'%(i+1,part[0])
for i,part in enumerate(line.split() for line in open('dnain.dat'))

Arnaud Delobelle

unread,

Jan 28, 2010, 4:42:17 PM1/28/10

to

nn <prue...@latinmail.com> writes:
> After posting I was thinking I should have posted a more
> straightforward version like the one you wrote. Now there is! It
> probably is more efficient too. I just have a tendency to think in
> terms of pipes: "pipe this junk in here, then in here, get output".
> Probably damage from too much Unix scripting.

This is funny, I did think *exactly* this when I saw your code :)

--
Arnaud

Johann Spies

unread,

Jan 29, 2010, 4:23:54 AM1/29/10

to pytho...@python.org

On Thu, Jan 28, 2010 at 07:07:04AM -0800, evilweasel wrote:
> Hi folks,
>
> I am a newbie to python, and I would be grateful if someone could
> point out the mistake in my program. Basically, I have a huge text
> file similar to the format below:
>
> AAAAAGACTCGAGTGCGCGGA 0
> AAAAAGATAAGCTAATTAAGCTACTGG 0
> AAAAAGATAAGCTAATTAAGCTACTGGGTT 1
> AAAAAGGGGGCTCACAGGGGAGGGGTAT 1
> AAAAAGGTCGCCTGACGGCTGC 0

I know this is a python list but if you really want to get the job
done quickly this is one method without writing python code:

$ cat /tmp/y

AAAAAGACTCGAGTGCGCGGA 0
AAAAAGATAAGCTAATTAAGCTACTGG 0
AAAAAGATAAGCTAATTAAGCTACTGGGTT 1
AAAAAGGGGGCTCACAGGGGAGGGGTAT 1
AAAAAGGTCGCCTGACGGCTGC 0

$ grep -v 0 /tmp/y > tmp/z
$ cat /tmp/z
AAAAAGATAAGCTAATTAAGCTACTGGGTT 1
AAAAAGGGGGCTCACAGGGGAGGGGTAT 1

Regards
Johann
--
Johann Spies Telefoon: 021-808 4599
Informasietegnologie, Universiteit van Stellenbosch

"My son, if sinners entice thee, consent thou not."
Proverbs 1:10

Steven D'Aprano

unread,

Jan 29, 2010, 5:04:33 AM1/29/10

to

On Fri, 29 Jan 2010 11:23:54 +0200, Johann Spies wrote:

> On Thu, Jan 28, 2010 at 07:07:04AM -0800, evilweasel wrote:
>> Hi folks,
>>
>> I am a newbie to python, and I would be grateful if someone could point
>> out the mistake in my program. Basically, I have a huge text file
>> similar to the format below:
>>
>> AAAAAGACTCGAGTGCGCGGA 0
>> AAAAAGATAAGCTAATTAAGCTACTGG 0
>> AAAAAGATAAGCTAATTAAGCTACTGGGTT 1
>> AAAAAGGGGGCTCACAGGGGAGGGGTAT 1
>> AAAAAGGTCGCCTGACGGCTGC 0
>
> I know this is a python list but if you really want to get the job done
> quickly this is one method without writing python code:
>
> $ cat /tmp/y
> AAAAAGACTCGAGTGCGCGGA 0
> AAAAAGATAAGCTAATTAAGCTACTGG 0
> AAAAAGATAAGCTAATTAAGCTACTGGGTT 1
> AAAAAGGGGGCTCACAGGGGAGGGGTAT 1
> AAAAAGGTCGCCTGACGGCTGC 0
> $ grep -v 0 /tmp/y > tmp/z
> $ cat /tmp/z
> AAAAAGATAAGCTAATTAAGCTACTGGGTT 1
> AAAAAGGGGGCTCACAGGGGAGGGGTAT 1

That will do the wrong thing for lines like:

AAAAAGATAAGCTAATTAAGCTACTGGGTT 10

--
Steven

Johann Spies

unread,

Jan 29, 2010, 8:21:40 AM1/29/10

to pytho...@python.org

On Fri, Jan 29, 2010 at 10:04:33AM +0000, Steven D'Aprano wrote:
> > I know this is a python list but if you really want to get the job done
> > quickly this is one method without writing python code:
> >
> > $ cat /tmp/y
> > AAAAAGACTCGAGTGCGCGGA 0
> > AAAAAGATAAGCTAATTAAGCTACTGG 0
> > AAAAAGATAAGCTAATTAAGCTACTGGGTT 1
> > AAAAAGGGGGCTCACAGGGGAGGGGTAT 1
> > AAAAAGGTCGCCTGACGGCTGC 0
> > $ grep -v 0 /tmp/y > tmp/z
> > $ cat /tmp/z
> > AAAAAGATAAGCTAATTAAGCTACTGGGTT 1
> > AAAAAGGGGGCTCACAGGGGAGGGGTAT 1
>
> That will do the wrong thing for lines like:
>
> AAAAAGATAAGCTAATTAAGCTACTGGGTT 10

In that case change the grep to ' 0$' then only the lines with a
singel digit '0' at the end of the line will be excluded.

One can do the same using regulare expressions in Python and it will
probably a lot slower on large files.

D'Arcy J.M. Cain

unread,

Jan 29, 2010, 9:32:52 AM1/29/10

to Johann Spies, pytho...@python.org

On Fri, 29 Jan 2010 11:23:54 +0200

Johann Spies <jsp...@sun.ac.za> wrote:
> I know this is a python list but if you really want to get the job
> done quickly this is one method without writing python code:

> [...]

> $ grep -v 0 /tmp/y > tmp/z

There's plenty of ways to do it without writing Python. C, C++, Perl,
Forth, Awk, BASIC, Intercal, etc. So what? Besides, your solution
doesn't work. You want "grep -vw 0 /tmp/y > tmp/z" and even then it
doesn't meet the requirements. It extracts the lines the OP wants but
doesn't reformat them. It also assumes a Unix system or at least
something with grep installed so it isn't portable.

If you want to see how the same task can be done in many different
languages see http://www.roesler-ac.de/wolfram/hello.htm.

nn

unread,

Jan 29, 2010, 9:41:27 AM1/29/10

to

I would rather use awk for this:

awk 'NF==2 && $2!~/^0$/ {printf("seq%s\n%s\n",NR,$1)}' dnain.dat

but I think that is getting a bit off topic...

Aahz

unread,

Feb 2, 2010, 6:07:05 PM2/2/10

to

In article <mailman.1551.1264701...@python.org>,

D'Arcy J.M. Cain <da...@druid.net> wrote:
>
>If you have a problem and you think that regular expressions are the
>solution then now you have two problems. Regex is really overkill for
>the OP's problem and it certainly doesn't improve readability.

If you're going to use a quote, it works better if you use the exact
quote and attribute it:

'Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems.' --Jamie Zawinski
--
Aahz (aa...@pythoncraft.com) <*> http://www.pythoncraft.com/

import antigravity

Nobody

unread,

Feb 3, 2010, 10:03:22 AM2/3/10

to

On Tue, 02 Feb 2010 15:07:05 -0800, Aahz wrote:

>>If you have a problem and you think that regular expressions are the
>>solution then now you have two problems. Regex is really overkill for
>>the OP's problem and it certainly doesn't improve readability.
>
> If you're going to use a quote, it works better if you use the exact
> quote and attribute it:
>
> 'Some people, when confronted with a problem, think "I know, I'll use
> regular expressions." Now they have two problems.' --Jamie Zawinski

He may have mixed that one up with a different (and more generic) saying:

"If you think that X is the solution to your problem, then you don't
understand X and you don't understand your problem."

For most values of X.