I am a newbie to python, and I would be grateful if someone could
point out the mistake in my program. Basically, I have a huge text
file similar to the format below:
AAAAAGACTCGAGTGCGCGGA 0
AAAAAGATAAGCTAATTAAGCTACTGG 0
AAAAAGATAAGCTAATTAAGCTACTGGGTT 1
AAAAAGGGGGCTCACAGGGGAGGGGTAT 1
AAAAAGGTCGCCTGACGGCTGC 0
The text is nothing but DNA sequences, and there is a number next to
it. What I will have to do is, ignore those lines that have 0 in it,
and print all other lines (excluding the number) in a new text file
(in a particular format called as FASTA format). This is the program I
wrote for that:
seq1 = []
list1 = []
lister = []
listers = []
listers1 = []
a = []
d = []
i = 0
j = 0
num = 0
file1 = open(sys.argv[1], 'r')
for line in file1:
if not line.startswith('\n'):
seq1 = line.split()
if len(seq1) == 0:
continue
a = seq1[0]
list1.append(a)
d = seq1[1]
lister.append(d)
b = len(lister)
for j in range(0, b):
if lister[j] == 0:
listers.append(j)
else:
listers1.append(j)
print listers1
resultsfile = open("sequences1.txt", 'w')
for i in listers1:
resultsfile.write('\n>seq' + str(i) + '\n' + list1[i] + '\n')
But this isn't working. I am not able to find the bug in this. I would
be thankful if someone could point it out. Thanks in advance!
Cheers!
What do you mean by "isn't working"?
> I am not able to find the bug in this. I would
> be thankful if someone could point it out. Thanks in advance!
What do you expect as output, and what do you actually get as output?
Cheers,
- Alf
<snip>
> for j in range(0, b):
> if lister[j] == 0:
At a guess, this line should be:
if lister[j] == '0':
...
--
Mark
I'm not totaly sure what you want to do but try this (python2.6+):
newlines = []
with open(sys.argv[1], 'r') as f:
text = f.read();
for line in text.splitlines():
if not line.strip() and line.strip().endswith('1'):
newlines.append('seq'+line)
with open(sys.argv[2], 'w') as f:
f.write('\n'.join(newlines))
Gah, made some errors
Welcome.
> point out the mistake in my program. Basically, I have a huge text
> file similar to the format below:
You don't say how it isn't working. As a first step you should read
http://catb.org/~esr/faqs/smart-questions.html.
> The text is nothing but DNA sequences, and there is a number next to
> it. What I will have to do is, ignore those lines that have 0 in it,
Your code doesn't completely ignore them. See below.
> and print all other lines (excluding the number) in a new text file
> (in a particular format called as FASTA format). This is the program I
> wrote for that:
>
> seq1 = []
> list1 = []
> lister = []
> listers = []
> listers1 = []
> a = []
> d = []
> i = 0
> j = 0
> num = 0
This seems like an awful lot of variables for such a simple task.
>
> file1 = open(sys.argv[1], 'r')
> for line in file1:
This is good. You aren't trying to load the whole file into memory at
once. If the file is huge as you say then that would have been bad. I
would have made one small optimization that saves one assignment and
one extra variable.
for line in open(sys.argv[1], 'r'):
> if not line.startswith('\n'):
> seq1 = line.split()
> if len(seq1) == 0:
> continue
This is redundant and perhaps not even correct at the end of the file.
It assumes that the last line ends with a newline. Look at what
'\n'.split() gives you and see if you can't improve the above code.
Another small optimization - "if seq1" is better than "if len(seq1)".
>
> a = seq1[0]
> list1.append(a)
Aha! I may have found your bug. Are you mixing tabs and spaces?
Don't do that. Either always use spaces or always use tabs. My
suggestion is to use spaces and choose a short indent such as three or
even two but that's a religious issue.
>
> d = seq1[1]
> lister.append(d)
You can also do "a, d = seq1". Of course you must be sure that you
have two fields. Perhaps that's guaranteed for your input but a quick
sanity test wouldn't hurt here.
However, I don't understand all of the above. It may also be a source
of problems. You say the files are huge. Are you filling up memory
here? You did the smart thing reading the file but you lose it here.
In any case, see below.
> b = len(lister)
> for j in range(0, b):
Go lookup zip()
> if lister[j] == 0:
I think that you will find that lister[j] is "0", not 0.
> listers.append(j)
> else:
> listers1.append(j)
Why are you collecting the input? Just toss the '0' ones and write the
others lines directly to the output.
Hope this helps with this script and in further understanding the power
and simplicity of Python. Good luck.
--
D'Arcy J.M. Cain <da...@druid.net> | Democracy is three wolves
http://www.druid.net/darcy/ | and a sheep voting on
+1 416 425 1212 (DoD#0082) (eNTP) | what's for dinner.
I'm trying this again:
newlines = []
with open(sys.argv[1], 'r') as f:
text = f.read();
for line in (l.strip() for l in text.splitlines()):
if line:
line_elem = line.split()
if len(line_elem) == 2 and line_elem[1] == '1':
newlines.append('seq'+line_elem[0])
>seq59902
TTTTTTTATAAAATATATAGT
>seq59903
TTTTTTTATTTCTTGGCGTTGT
>seq59904
TTTTTTTGGTTGCCCTGCGTGG
>seq59905
TTTTTTTGTTTATTTTTGGG
The number next to 'seq' is the line number of the sequence. When I
run the above program, what I expect is an output file that is similar
to the above output but with the ones containing '0' ignored. But, I
am getting all the sequences printed in the file.
Kindly excuse the 'newbieness' of the program. :) I am hoping to
improve in the next few months. Thanks to all those who replied. I
really appreciate it. :)
People have already given you some pointers to your problem. In the
end you will have to "tweak the details" because only you have access
to the data not us.
Just as example here is another way to do what you are doing:
with open('dnain.dat') as infile, open('dnaout.dat','w') as outfile:
partgen=(line.split() for line in infile)
dnagen=(str(i+1)+'\n'+part[0]+'\n'
for i,part in enumerate(partgen)
if len(part)>1 and part[1]!='0')
outfile.writelines(dnagen)
I think that generator expressions are overrated :) What's wrong with:
with open('dnain.dat') as infile, open('dnaout.dat','w') as outfile:
for i, line in enumerate(infile):
parts = line.split()
if len(parts) > 1 and parts[1] != '0':
outfile.write(">seq%s\n%s\n" % (i+1, parts[0]))
(untested)
--
Arnaud
Your program is a good first try. It contains a newbie error (looking
for the number 0 instead of the string "0"). But more importantly,
you're doing too much work yourself, rather than letting Python do the
heavy lifting for you. These practices and tools make life a lot easier:
* As others have noted, don't accumulate output in a list. Just write
data to the output file line-by-line.
* You don't need to initialize every variable at the beginning of the
program. But there's no harm in it.
* Use the enumerate() function to provide a line counter:
for counter, line in enumerate(file1):
This eliminates the need to accumulate output data in a list, then use
the index variable "j" as the line counter.
* Use string formatting. Each chunk of output is a two-line string, with
the line-counter and the DNA sequence as variables:
outformat = """seq%05d
%s
"""
... later, inside your loop ...
resultsfile.write(outformat % (counter, sequence))
HTH,
John
import re
output = open("sequences1.txt", 'w')
for index, line in enumerate(open(sys.argv[1], 'r')):
match = re.match('(?P<sequence>[GATC]+)\s+1')
if match:
output.write('seq%s\n%s\n' % (index, match.group('sequence')))
Jean-Michel
If you have a problem and you think that regular expressions are the
solution then now you have two problems. Regex is really overkill for
the OP's problem and it certainly doesn't improve readability.
"Using regexp *may* increase readability (*if* you are *familiar* with it)."
I honestly find it quite readable in the sample code I provided and
spares all the if-len-startwith-strip logic, but If the OP does not
agree, fine with me. But there's no need to get certain that I'm
completly wrong.
JM
Finally!
After ready 8 or 9 messages about find a line ending with '1', someone
suggests Regex.
It was my first thought.
Steven
And as a first thought, it is, of course, wrong.
You don't want lines ending in '1', you want ANY non-'0' amount.
Likewise, you don't want to exclude lines ending in '0' because
you'll end up excluding counts of 10, 20, 30, etc.
You need a regex that extracts ALL the numeric characters at the end
of the
line and exclude those that evaluate to 0.
>
> Steven
Nothing really,
After posting I was thinking I should have posted a more
straightforward version like the one you wrote. Now there is! It
probably is more efficient too. I just have a tendency to think in
terms of pipes: "pipe this junk in here, then in here, get output".
Probably damage from too much Unix scripting.Since I can't resist the
urge to post crazy code here goes the bonus round (don't do this at
work):
open('dnaout.dat','w').writelines(
'seq%s\n%s\n'%(i+1,part[0])
for i,part in enumerate(line.split() for line in open('dnain.dat'))
This is funny, I did think *exactly* this when I saw your code :)
--
Arnaud
I know this is a python list but if you really want to get the job
done quickly this is one method without writing python code:
$ cat /tmp/y
AAAAAGACTCGAGTGCGCGGA 0
AAAAAGATAAGCTAATTAAGCTACTGG 0
AAAAAGATAAGCTAATTAAGCTACTGGGTT 1
AAAAAGGGGGCTCACAGGGGAGGGGTAT 1
AAAAAGGTCGCCTGACGGCTGC 0
$ grep -v 0 /tmp/y > tmp/z
$ cat /tmp/z
AAAAAGATAAGCTAATTAAGCTACTGGGTT 1
AAAAAGGGGGCTCACAGGGGAGGGGTAT 1
Regards
Johann
--
Johann Spies Telefoon: 021-808 4599
Informasietegnologie, Universiteit van Stellenbosch
"My son, if sinners entice thee, consent thou not."
Proverbs 1:10
> On Thu, Jan 28, 2010 at 07:07:04AM -0800, evilweasel wrote:
>> Hi folks,
>>
>> I am a newbie to python, and I would be grateful if someone could point
>> out the mistake in my program. Basically, I have a huge text file
>> similar to the format below:
>>
>> AAAAAGACTCGAGTGCGCGGA 0
>> AAAAAGATAAGCTAATTAAGCTACTGG 0
>> AAAAAGATAAGCTAATTAAGCTACTGGGTT 1
>> AAAAAGGGGGCTCACAGGGGAGGGGTAT 1
>> AAAAAGGTCGCCTGACGGCTGC 0
>
> I know this is a python list but if you really want to get the job done
> quickly this is one method without writing python code:
>
> $ cat /tmp/y
> AAAAAGACTCGAGTGCGCGGA 0
> AAAAAGATAAGCTAATTAAGCTACTGG 0
> AAAAAGATAAGCTAATTAAGCTACTGGGTT 1
> AAAAAGGGGGCTCACAGGGGAGGGGTAT 1
> AAAAAGGTCGCCTGACGGCTGC 0
> $ grep -v 0 /tmp/y > tmp/z
> $ cat /tmp/z
> AAAAAGATAAGCTAATTAAGCTACTGGGTT 1
> AAAAAGGGGGCTCACAGGGGAGGGGTAT 1
That will do the wrong thing for lines like:
AAAAAGATAAGCTAATTAAGCTACTGGGTT 10
--
Steven
In that case change the grep to ' 0$' then only the lines with a
singel digit '0' at the end of the line will be excluded.
One can do the same using regulare expressions in Python and it will
probably a lot slower on large files.
There's plenty of ways to do it without writing Python. C, C++, Perl,
Forth, Awk, BASIC, Intercal, etc. So what? Besides, your solution
doesn't work. You want "grep -vw 0 /tmp/y > tmp/z" and even then it
doesn't meet the requirements. It extracts the lines the OP wants but
doesn't reformat them. It also assumes a Unix system or at least
something with grep installed so it isn't portable.
If you want to see how the same task can be done in many different
languages see http://www.roesler-ac.de/wolfram/hello.htm.
I would rather use awk for this:
awk 'NF==2 && $2!~/^0$/ {printf("seq%s\n%s\n",NR,$1)}' dnain.dat
but I think that is getting a bit off topic...
If you're going to use a quote, it works better if you use the exact
quote and attribute it:
'Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems.' --Jamie Zawinski
--
Aahz (aa...@pythoncraft.com) <*> http://www.pythoncraft.com/
import antigravity
>>If you have a problem and you think that regular expressions are the
>>solution then now you have two problems. Regex is really overkill for
>>the OP's problem and it certainly doesn't improve readability.
>
> If you're going to use a quote, it works better if you use the exact
> quote and attribute it:
>
> 'Some people, when confronted with a problem, think "I know, I'll use
> regular expressions." Now they have two problems.' --Jamie Zawinski
He may have mixed that one up with a different (and more generic) saying:
"If you think that X is the solution to your problem, then you don't
understand X and you don't understand your problem."
For most values of X.