Help: Content extraction

Amy Lee

May 9, 2008, 10:30:39 PM

Hello,

I have a problem processing my sequence file. The file content looks
like this:

>seq1
ACGGTC
ACTG
>seq2
CGATCC
ACCTC
>seq3
......

I would like to write every sequence to its own file. For example, the
file "seq1" would contain
>seq1
ACGGTC
ACTG
and the file "seq2" would contain
>seq2
CGATCC
ACCTC
and so on.

However, I'm only a newbie in Perl and I don't know how to do this. Could
anyone post some sample code? I'd rather not use BioPerl: although it's
quite useful, the other machines I work on don't have this package
installed.

Thank you very much~

Regards,

Amy Lee

Jürgen Exner

May 9, 2008, 11:43:39 PM

Amy Lee <openlin...@gmail.com> wrote:
>I have a problem while I'm processing my sequence file.

I know text files, binary files, random access files, sequential files,
but I've never heard of a sequence file.

>The file content
>is like this.
>
>>seq1
>ACGGTC
>ACTG
>>seq2
>CGATCC
>ACCTC
>>seq3
>......
>
>And I hope make every sequence into a single file. For example, a file

What is a sequence?

>"seq1" content is
>>seq1
>ACGGTC
>ACTG
>And a file "seq2" content is
>>seq2
>CGATCC
>ACCTC
>and so on.

How is this desired content different from the original content? They
seem to be identical to me.

>However, I'm only a newbie in perl, I don't know what to do. So could
>anyone post some sample codes to do that?

Probably not without a much improved specification.

jue

Amy Lee

May 10, 2008, 3:16:49 AM

Jue,

Most of my work is processing DNA, so I save DNA sequences in a format
called FastA, which is what you saw in my first post. Say my file is
called dna.fasta; its content is

>seq1
ACGGTC
ACTG

>seq2
CGATCC
ACCTC
>seq3
......

"seq1", "seq2", "seq3" and in general "seqx" are the names of the
sequences; you could call them markers. The lines under each "seqx" are
the DNA sequence itself. My point is quite simple: I want to extract every
sequence from dna.fasta and save each one as a separate file.

Here's an example.

From dna.fasta I would get 3 sequence files named after the markers:
seq1, seq2 and seq3. The content of seq1 is
>seq1
ACGGTC
ACTG
The content of seq2 is
>seq2
CGATCC
ACCTC
And so on. That way I can deal with my sequences easily.

Jürgen Exner

May 10, 2008, 8:11:30 AM

Amy Lee <openlin...@gmail.com> wrote:
>My most work is to process DNA so I save DNA sequences as a format called
>FastA as you've seen before. And you could call my file dna.fasta, the
>content is
>
>>seq1
>ACGGTC
>ACTG
>>seq2
>CGATCC
>ACCTC
>>seq3
>......

From your previous description I thought those were 3 separate files.
Obviously I was wrong.

>The "seq1" "seq2" "seq3" and "seqx" is the names of these sequences. I can
>say, it's a mark. And under "seqx" it's DNA sequences. My point is quite
>simple, I wanna extract every sequences as a file saved. I mean I can
>extract sequences for dna.fasta and make a single file for every sequences.

So you want to split the file at each ">seq*" marker.

Well, then why not just loop (while (<>)) through the input file and
whenever you encounter such a marker (m//) close() the current output
file and open() a new one?
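
For illustration, that loop might be sketched roughly as follows. This is only a sketch under some assumptions: split_fasta is a made-up helper name, the sequence name is taken to be the first run of non-blank characters after ">", and every name is assumed to be usable as a file name as-is.

```perl
use strict;
use warnings;

# Hypothetical helper: split the FASTA file at $in_path into one file per
# sequence, written to $out_dir and named after the text following ">".
sub split_fasta {
    my ($in_path, $out_dir) = @_;
    $out_dir = '.' unless defined $out_dir;

    open my $in, '<', $in_path or die "Cannot open '$in_path': $!";
    my ($out, @names);
    while ( my $line = <$in> ) {
        if ( $line =~ /^>(\S+)/ ) {
            # New marker: close the current output file, open the next one.
            if ($out) { close $out or die "Cannot close output: $!" }
            my $path = "$out_dir/$1";
            open $out, '>', $path or die "Cannot open '$path': $!";
            push @names, $1;
        }
        # Lines before the first marker have no file to go to; skip them.
        print {$out} $line if $out;
    }
    if ($out) { close $out or die "Cannot close output: $!" }
    close $in;
    return @names;
}

# Usage: perl split_fasta.pl dna.fasta [output_dir]
split_fasta(@ARGV) if @ARGV;
```

Opening with '>' overwrites an existing file of the same name, so running it twice gives the same result instead of appending duplicates.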

jue

John W. Krahn

May 10, 2008, 8:21:05 AM

Amy Lee wrote:
>
> I have a problem while I'm processing my sequence file. The file content
> is like this.
>
> >seq1
> ACGGTC
> ACTG
> >seq2
> CGATCC
> ACCTC
> >seq3
> ......
>
> And I hope make every sequence into a single file. For example, a file
> "seq1" content is
> >seq1
> ACGGTC
> ACTG
> And a file "seq2" content is
> >seq2
> CGATCC
> ACCTC
> and so on.

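while ( <> ) {
    # A header line such as ">seq1" starts a new record; $1 is its name.
    if ( /^>(.+)/ ) {
        open my $OUT, '>>', $1 or die "Cannot open '$1': $!";
        # Make the new file the default handle for the bare print below.
        select $OUT;
    }
    print;
}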

John
--
Perl isn't a toolbox, but a small machine shop where you
can special-order certain sorts of tools at low cost and
in short order. -- Larry Wall

Amy Lee

May 10, 2008, 11:01:21 AM

Thank you very much~
I've solved this problem.

Regards,

Amy Lee

May 10, 2008, 11:03:47 AM

Anyway, could you tell me how to find out the usage of the "select" function?

Thank you.

Amy Lee

May 10, 2008, 11:04:58 AM

Yes, you are right, and the code works for my task.

Thank you again~

Amy

Jürgen Exner

May 10, 2008, 12:03:09 PM

Amy Lee <openlin...@gmail.com> wrote:
>Anyway, could you tell me how to find out the usage of "select" function?

The usage of each perl function is described in the first line(s) of the
manual page for this function. It doesn't explicitly say "Usage" as in
Unix man pages, but it has the same format:
select FILEHANDLE

Sometimes, if a function is overloaded, there may be additional usages
farther down the page, too, e.g.
select RBITS,WBITS,EBITS,TIMEOUT
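
The manual page itself can be shown with "perldoc -f select". As a rough illustration of the first form, here is a minimal sketch; the temporary file from File::Temp is only a stand-in for a real output file and is not part of the code posted earlier in this thread.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file, used here only so the example has somewhere to write.
my ($out, $path) = tempfile();

my $previous = select $out;   # $out is now the default output handle
print "goes to the file\n";   # no handle given, so this is written to $out
select $previous;             # restore the earlier default (normally STDOUT)
close $out or die "Cannot close '$path': $!";

print "goes to standard output\n";
```

select returns the handle that was selected before, which is why saving it and passing it back restores the old default.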

jue