Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Convert FASTA file's line length more simply?

353 views
Skip to first unread message

Robert Mesibov

unread,
Dec 31, 2015, 2:57:40 AM12/31/15
to
Hi, all and best wishes for the New Year!

I'm posting this question here on comp.lang.awk rather than Stack Overflow because on SO it will attract non-AWK solutions. I already have an AWK solution, but I'm wondering if there's a simpler or more elegant AWK one.

This FASTA file

$ cat fasta
>Sequence_1
MTEITAAMVKELRESTGAGM
MDCKNALLVSVKVSDDFTIA
AMRPSYLSYEDLDMTIPQFA
SRKQLSDAILKEAEEKIKEE
LKAQMGQFYVMDDKKTVEQV
IAEKEKEFGGK
>Sequence_2
SATVSEINSETDFVAKNDQF
IALTKDTTAATIGENLVVRR
FATLKAGANGVVNGYIH

contains 2 sequences of letters, each introduced by a line beginning with '>'. The sequences are broken into lines of 20 characters each, with a bit left over at the end (sequence 1 has 111 letters, 2 has 57 letters).

I wondered how easy it would be to change the number of characters per line in the sequences, say from 20 to 25. The only way I could think of to do this would be to first concatenate the letters in each sequence into the full, unbroken sequence, then chop that full sequence into 25-character chunks.

I did it this way:

Step 1 - make unbroken sequences by storing the broken bits in a variable (foo):

$ awk '/>/ {if (foo) print foo; foo=""; print} !/>/ {foo=foo$0} END {print foo}' fasta
>Sequence_1
MTEITAAMVKELRESTGAGMMDCKNALLVSVKVSDDFTIAAMRPSYLSYEDLDMTIPQFASRKQLSDAILKEAEEKIKEELKAQMGQFYVMDDKKTVEQVIAEKEKEFGGK
>Sequence_2
SATVSEINSETDFVAKNDQFIALTKDTTAATIGENLVVRRFATLKAGANGVVNGYIH

--

Step 2 - make 25-character chunks from the unbroken sequences (I used gsub) in a separate AWK command:

$ awk '/>/ {if (foo) print foo; foo=""; print} !/>/ {foo=foo$0} END {print foo}' fasta | awk '!/>/ {gsub(/.{25}/,"&\n")} 1'
>Sequence_1
MTEITAAMVKELRESTGAGMMDCKN
ALLVSVKVSDDFTIAAMRPSYLSYE
DLDMTIPQFASRKQLSDAILKEAEE
KIKEELKAQMGQFYVMDDKKTVEQV
IAEKEKEFGGK
>Sequence_2
SATVSEINSETDFVAKNDQFIALTK
DTTAATIGENLVVRRFATLKAGANG
VVNGYIH

--

Step 3 - combine the two commands:

$ awk '/>/ && foo {gsub(/.{25}/,"&\n",foo); print foo; foo=""} />/ && !foo {print} !/>/ {foo=foo$0} END {gsub(/.{25}/,"&\n",foo); print foo}' fasta
>Sequence_1
MTEITAAMVKELRESTGAGMMDCKN
ALLVSVKVSDDFTIAAMRPSYLSYE
DLDMTIPQFASRKQLSDAILKEAEE
KIKEELKAQMGQFYVMDDKKTVEQV
IAEKEKEFGGK
>Sequence_2
SATVSEINSETDFVAKNDQFIALTK
DTTAATIGENLVVRRFATLKAGANG
VVNGYIH

My question - is there a simpler or more elegant way to do this?

Kenny McCormack

unread,
Dec 31, 2015, 3:49:24 AM12/31/15
to
In article <fd8957e8-9bde-4c54...@googlegroups.com>,
Robert Mesibov <robert....@gmail.com> wrote:
...
Convert:
>$ cat fasta
>>Sequence_1
>MTEITAAMVKELRESTGAGM
>MDCKNALLVSVKVSDDFTIA
>AMRPSYLSYEDLDMTIPQFA
>SRKQLSDAILKEAEEKIKEE
>LKAQMGQFYVMDDKKTVEQV
>IAEKEKEFGGK
>>Sequence_2
>SATVSEINSETDFVAKNDQF
>IALTKDTTAATIGENLVVRR
>FATLKAGANGVVNGYIH
...
into:
>>Sequence_1
>MTEITAAMVKELRESTGAGMMDCKN
>ALLVSVKVSDDFTIAAMRPSYLSYE
>DLDMTIPQFASRKQLSDAILKEAEE
>KIKEELKAQMGQFYVMDDKKTVEQV
>IAEKEKEFGGK
>>Sequence_2
>SATVSEINSETDFVAKNDQFIALTK
>DTTAATIGENLVVRRFATLKAGANG
>VVNGYIH

$ gawk4 'BEGIN { RS=">";FS="\n" } NR > 1 { print ">"$1;print gensub(/.{25}/,"&\n","g",gensub(/\n/,"","g",substr($0,length($1)+1))) }' fasta

--
Watching ConservaLoons playing with statistics and facts is like watching a
newborn play with a computer. Endlessly amusing, but totally unproductive.

Kenny McCormack

unread,
Dec 31, 2015, 3:51:57 AM12/31/15
to
In article <n62q6j$rno$1...@news.xmission.com>,
Kenny McCormack <gaz...@shell.xmission.com> wrote:
...
>$ gawk4 'BEGIN { RS=">";FS="\n" } NR > 1 { print ">"$1;print
>gensub(/.{25}/,"&\n","g",gensub(/\n/,"","g",substr($0,length($1)+1))) }' fasta

This seems to have line-wrapped. It should be all one line, as:

$ gawk4 'BEGIN { RS=">";FS="\n" } NR > 1 { print ">"$1;print gensub(/.{25}/,"&\n","g",gensub(/\n/,"","g",substr($0,length($1)+1))) }' fasta

--
If you think you have any objections to anything I've said above, please
navigate to this URL:

http://www.xmission.com/~gazelle/Truth

This should clear up any misconceptions you may have.

Robert Mesibov

unread,
Dec 31, 2015, 4:10:42 AM12/31/15
to
> $ gawk4 'BEGIN { RS=">";FS="\n" } NR > 1 { print ">"$1;print gensub(/.{25}/,"&\n","g",gensub(/\n/,"","g",substr($0,length($1)+1))) }' fasta

Many thanks for the super-quick reply and the nested gensubs are neat, but it's taken me longer to figure out how this works than it did to hack up my 'concatenate the lines in a variable' strategy. I guess what I'm hoping for would be something even simpler and more transparent in its working. Maybe that's not possible...

Kenny McCormack

unread,
Dec 31, 2015, 4:34:12 AM12/31/15
to
In article <cd687f56-f780-46f0...@googlegroups.com>,
C'est la vie.

Incidentally, a purist might point out a possible problem with my solution,
which is that if the length of the "sequence" part is an exact multiple of 25
(e.g., 25, 50, 75, etc), then the output will be "double spaced" (if you
see what I mean...). This is easily fixed should it prove to be a problem
for the end user.

--
"Remember when teachers, public employees, Planned Parenthood, NPR and PBS
crashed the stock market, wiped out half of our 401Ks, took trillions in
TARP money, spilled oil in the Gulf of Mexico, gave themselves billions in
bonuses, and paid no taxes? Yeah, me neither."

Janis Papanagnou

unread,
Dec 31, 2015, 7:52:31 AM12/31/15
to
> [...]
>
> My question - is there a simpler or more elegant way to do this?

Elegance probably lies in the eyes of the beholder. So I just point to another
approach: Concatenate the data on the fly and if it exceeds a limit print it.
E.g.:

/^>/ { if(r) print r; r=""; print; next }
{ d=r $0; l=length(d) }
l<25 { r=d; next }
{ print substr(d,1,25); r=substr(d,26) }
END { if(r) print r }


Janis

Anton Treuenfels

unread,
Dec 31, 2015, 9:28:47 AM12/31/15
to

"Robert Mesibov" <robert....@gmail.com> wrote in message
news:fd8957e8-9bde-4c54...@googlegroups.com...
==============================

Untested:

BEGIN {
RS = ">"
FS = "\n"
linelen = 25
}

{
print ">"
while ( length($0) > linelen ) {
print substr( $0, 1, linelen )
$0 = substr( $0, linelen+1 )
}
if ( $0 )
print $0
}

- Anton Treuenfels

Bruce Horrocks

unread,
Dec 31, 2015, 1:20:26 PM12/31/15
to
On 31/12/2015 09:10, Robert Mesibov wrote:
>> $ gawk4 'BEGIN { RS=">";FS="\n" } NR > 1 { print ">"$1;print gensub(/.{25}/,"&\n","g",gensub(/\n/,"","g",substr($0,length($1)+1))) }' fasta
>
> Many thanks for the super-quick reply and the nested gensubs are neat, but it's taken me longer to figure out how this works than it did to hack up my 'concatenate the lines in a variable' strategy. I guess what I'm hoping for would be something even simpler and more transparent in its working. Maybe that's not possible...
>

Here is a variation on Kennys's which avoids that nesting...

BEGIN { RS=">";FS="\n";OFS="" }
{
print "> " $1
$1 = ""
gsub(/.{25}/, "&\n")
print
}


--
Bruce Horrocks
Surrey
England
(bruce at scorecrow dot com)

Kenny McCormack

unread,
Dec 31, 2015, 1:40:07 PM12/31/15
to
In article <568571E7...@scorecrow.com>,
Bruce Horrocks <07....@scorecrow.com> wrote:
...
>Here is a variation on Kenny's which avoids that nesting...
>
>BEGIN { RS=">";FS="\n";OFS="" }
> {
> print "> " $1
> $1 = ""
> gsub(/.{25}/, "&\n")
> print
> }

Your output isn't quite correct, although it is unlikely to matter much in
practice. One of the differences is probably just a typo in your source
code (the space after the ">"); the other is likely caused by the
uncertainly that occurs in AWK when the input line ($0) is reconstructed.
I try to avoid solutions that depend on input-line-reconstruction for this
reason. It seems to be an area rife with dark corners.

Anyway, here's a diff between the output of my version and yours:

1c1,3
< >Sequence_1
---
> >
>
> > Sequence_1
7c9
< >Sequence_2
---
> > Sequence_2

P.S. Actually, now that I think about it some more, I think that the
reason for the extra lines in your output is simply because you forgot the
"NR > 1" condition/qualifier.

--

"If God wanted us to believe in him, he'd exist."

(Linda Smith on "10 Funniest Londoners", TimeOut, 23rd June, 2005.)

Robert Mesibov

unread,
Dec 31, 2015, 4:14:26 PM12/31/15
to
$ awk 'BEGIN {RS=">";FS="\n";OFS=""} NR>1 {print "> "$1; $1=""; gsub(/.{25}/,"&\n"); print}' fasta
> Sequence_1
MTEITAAMVKELRESTGAGMMDCKN
ALLVSVKVSDDFTIAAMRPSYLSYE
DLDMTIPQFASRKQLSDAILKEAEE
KIKEELKAQMGQFYVMDDKKTVEQV
IAEKEKEFGGK
> Sequence_2
SATVSEINSETDFVAKNDQFIALTK
DTTAATIGENLVVRRFATLKAGANG
VVNGYIH

Here's Bruce Horrocks' suggestion, modified, and I like it a lot better than my original solution! Many thanks!

Robert Mesibov

unread,
Dec 31, 2015, 4:19:42 PM12/31/15
to
> Elegance probably lies in the eyes of the beholder. So I just point to another
> approach: Concatenate the data on the fly and if it exceeds a limit print it.
> E.g.:
>
> /^>/ { if(r) print r; r=""; print; next }
> { d=r $0; l=length(d) }
> l<25 { r=d; next }
> { print substr(d,1,25); r=substr(d,26) }
> END { if(r) print r }
>
>
> Janis

Many thanks, Janis, for the alternative of on-the-fly concatenation.

You're right that 'elegance' is a matter of taste, but wouldn't you say that karakfa's suggestion is more elegantly simple than shelter's in this Stack Overflow thread?:

http://stackoverflow.com/questions/34482123/join-every-240-lines-of-a-large-file-consisting-of-different-numbers-in-cshell-s

Robert Mesibov

unread,
Jan 1, 2016, 1:33:34 AM1/1/16
to
Both my original solution and the simpler McCormack-Horrocks solution will produce a trailing blank line if the total number of characters in the sequence is an exact multiple of the desired number of characters per line - as Kenny McCormack pointed out.

A FASTA file with blank lines can be 'restored to continuity' by piping the result to my favourite Blank Line Remover, awk 'NF', but that adds another command.

Janis Papanagnou's solution never adds a trailing blank line after a sequence, but it uses 3 variables and its logic is a little harder to follow.

I haven't yet succeeded in getting Anton Treuenfels' solution to work, but I'll keep at it.

This problem isn't as simple as I thought it might be...

Janis Papanagnou

unread,
Jan 1, 2016, 3:49:46 AM1/1/16
to
On 31.12.2015 22:19, Robert Mesibov wrote:
>> Elegance probably lies in the eyes of the beholder. So I just point to another
>> approach: Concatenate the data on the fly and if it exceeds a limit print it.
>> E.g.:
>>
>> /^>/ { if(r) print r; r=""; print; next }
>> { d=r $0; l=length(d) }
>> l<25 { r=d; next }
>> { print substr(d,1,25); r=substr(d,26) }
>> END { if(r) print r }
>>
>>
>> Janis
>
> Many thanks, Janis, for the alternative of on-the-fly concatenation.
>
> You're right that 'elegance' is a matter of taste, but wouldn't you say
> that karakfa's suggestion is more elegantly simple than shelter's in this
> Stack Overflow thread?:

Definitely.

You should just be ware that the SO-cases don't compare with your task.
The SO-task is primitive, and soleyly depends on the separators (FS, RS),
and the separators are applied in a natural way, separating fields/records
(as as opposed to other solutions in this thread that are relying on RS).
Your task has the "complexity" of inhomogeneous data (header/data lines).

Janis

>
> http://stackoverflow.com/questions/34482123/join-every-240-lines-of-a-large-file-consisting-of-different-numbers-in-cshell-s
>


Kenny McCormack

unread,
Jan 1, 2016, 6:09:53 AM1/1/16
to
In article <9ec76e25-9340-4eeb...@googlegroups.com>,
Robert Mesibov <robert....@gmail.com> wrote:
>Both my original solution and the simpler McCormack-Horrocks solution
>will produce a trailing blank line if the total number of characters in
>the sequence is an exact multiple of the desired number of characters
>per line - as Kenny McCormack pointed out.

# Input file modified to include a 25 char sequence...
# Note also that the line with all the gensubs is all one line - it looks
# like it gets broken up by the posting software.
$ gawk4 'BEGIN { RS=">";FS="\n" }
NR > 1 { print ">"$1
print gensub(/\n$/,"",1,gensub(/.{25}/,"&\n","g",gensub(/\n/,"","g",substr($0,length($1)+1)))) }
' fasta
>Sequence_1
MTEITAAMVKELRESTGAGMMDCKN
ALLVSVKVSDDFTIAAMRPSYLSYE
DLDMTIPQFASRKQLSDAILKEAEE
KIKEELKAQMGQFYVMDDKKTVEQV
IAEKEKEFGGK
>Sequence_2
SATVSEINSETDFVAKNDQFIALTK
DTTAATIGENLVVRRFATLKAGANG
>Sequence_3
SATVSEINSETDFVAKNDQFIALTK
DTTAATIGENLVVRRFATLKAGANG
VVNGYIH
$

(You're welcome!)

--
Religion is what keeps the poor from murdering the rich.

- Napoleon Bonaparte -

Bruce Horrocks

unread,
Jan 1, 2016, 8:52:12 PM1/1/16
to
On 01/01/2016 06:33, Robert Mesibov wrote:
> Both my original solution and the simpler McCormack-Horrocks solution
> will produce a trailing blank line if the total number of characters
> in the sequence is an exact multiple of the desired number of
> characters per line - as Kenny McCormack pointed out.

The trailing blank line is easily fixed with an extra gsub:
$ awk 'BEGIN {RS=">";FS="\n";OFS=""} NR>1 {print "> "$1; $1="";
gsub(/.{25}/,"&\n"); gsub(/\n$/, ""); print}' fasta

Robert Mesibov

unread,
Jan 1, 2016, 10:44:56 PM1/1/16
to
> The trailing blank line is easily fixed with an extra gsub:
> $ awk 'BEGIN {RS=">";FS="\n";OFS=""} NR>1 {print "> "$1; $1="";
> gsub(/.{25}/,"&\n"); gsub(/\n$/, ""); print}' fasta

Many thanks, Bruce. Still nice and compact and reasonably easy to understand, and a lot better than my original effort. And way better than this UNIX pipeline approach, which only uses AWK to split 'fasta' into component sequences - wait! I'm not being serious! It's an after-celebration New Year brain bubble!

$ awk '/>/ {++n} {print > "Sequence_"n}' fasta; for i in Seq*; do var=$(sed 's/\(>.*\)/\1@/' "$i" | paste -s -d '' | sed 's/@/\n/'); echo "$var" | head -n 1; echo "$var" | tail -n +2 | fold -w 25; done; rm Seq*
>Sequence_1
MTEITAAMVKELRESTGAGMMDCKN
ALLVSVKVSDDFTIAAMRPSYLSYE
DLDMTIPQFASRKQLSDAILKEAEE
KIKEELKAQMGQFYVMDDKKTVEQV
>Sequence_2
SATVSEINSETDFVAKNDQFIALTK
DTTAATIGENLVVRRFATLKAGANG
VVNGYIH

Robert Mesibov

unread,
Jan 1, 2016, 10:46:20 PM1/1/16
to
> VVNGYIH
> >Sequence_2
> SATVSEINSETDFVAKNDQFIALTK
> DTTAATIGENLVVRRFATLKAGANG
> VVNGYIH

Edited.

Robert Mesibov

unread,
Jan 1, 2016, 10:48:07 PM1/1/16
to
> > IAEKEKEFGGK
> > >Sequence_2
> > SATVSEINSETDFVAKNDQFIALTK
> > DTTAATIGENLVVRRFATLKAGANG
> > VVNGYIH
>
> Edited.

and again, to show the correct output.

Luuk

unread,
Jan 2, 2016, 4:06:07 AM1/2/16
to
On 31-12-15 08:57, Robert Mesibov wrote:
> Hi, all and best wishes for the New Year!
>
> I'm posting this question here on comp.lang.awk rather than Stack Overflow because on SO it will attract non-AWK solutions. I already have an AWK solution, but I'm wondering if there's a simpler or more elegant AWK one.
>
> This FASTA file
>
> $ cat fasta
>> Sequence_1
> MTEITAAMVKELRESTGAGM
> MDCKNALLVSVKVSDDFTIA
> AMRPSYLSYEDLDMTIPQFA
> SRKQLSDAILKEAEEKIKEE
> LKAQMGQFYVMDDKKTVEQV
> IAEKEKEFGGK
>> Sequence_2
> SATVSEINSETDFVAKNDQF
> IALTKDTTAATIGENLVVRR
> FATLKAGANGVVNGYIH
>
> contains 2 sequences of letters, each introduced by a line beginning with '>'. The sequences are broken into lines of 20 characters each, with a bit left over at the end (sequence 1 has 111 letters, 2 has 57 letters).
>
> I wondered how easy it would be to change the number of characters per line in the sequences, say from 20 to 25. The only way I could think of to do this would be to first concatenate the letters in each sequence into the full, unbroken sequence, then chop that full sequence into 25-character chunks.
>
> I did it this way:
>
..
> My question - is there a simpler or more elegant way to do this?
>


This is another way:

gawk '/>/{a=$0; b[a]=""}
!/>/{b[a]=b[a] $0 }
END{ for (i in b) { print i;
print gensub(/([A-Z]{25})/,"\\1\n","g",b[i])}}' fasta


Janis Papanagnou

unread,
Jan 2, 2016, 4:22:19 AM1/2/16
to
On 02.01.2016 10:06, Luuk wrote:
> [...]
>> My question - is there a simpler or more elegant way to do this?
>>
>
> This is another way:
>
> gawk '/>/{a=$0; b[a]=""}
> !/>/{b[a]=b[a] $0 }
> END{ for (i in b) { print i;
> print gensub(/([A-Z]{25})/,"\\1\n","g",b[i])}}' fasta

Note that for-in loops produce unordered results.

Janis

Luuk

unread,
Jan 2, 2016, 4:23:50 AM1/2/16
to
aaaargh, 50% chance that i have these 2 lines in the wrong order...

;)

Luuk

unread,
Jan 2, 2016, 11:11:43 AM1/2/16
to
Ok, a version with the ordering correct... ;)

file: fasta.awk
function fasta(a,f) {
print a;
print gensub(/([A-Z]{25})/,"\\1\n","g",f)
}
/>/{
if (a) { fasta(a,f) }
a=$0; f=""
}
!/>/{
f=f $0
}
END {
fasta(a,f)
}


$ awk -f fasta.awk
>Sequence_1
MTEITAAMVKELRESTGAGMMDCKN
ALLVSVKVSDDFTIAAMRPSYLSYE
DLDMTIPQFASRKQLSDAILKEAEE
KIKEELKAQMGQFYVMDDKKTVEQV
IAEKEKEFGGK
>Sequence_2
SATVSEINSETDFVAKNDQFIALTK
DTTAATIGENLVVRRFATLKAGANG
VVNGYIH



BTW: It would be 'nice' if the '25' could also be a parameter for this
function....

Robert Mesibov

unread,
Jan 2, 2016, 4:12:01 PM1/2/16
to
Suffers from Blank Line Syndrome: if the total number of characters in a sequence is an integral multiple of the desired number of characters per line, a blank line is added after the sequence.
0 new messages