Hi,
I'm trying to use a program called GAA, which merges two different
assemblies. To use it, I have to create a file of my assembled contigs
whose names have the form Contig1.1, where the number before the point
is the scaffold number or ID and the number after the point is the
number or ID of the contig within that scaffold.
For example, if my scaffold1 is composed of contig1, contig2 and
contig3, I need to write:
>Contig1.1
agagagaga
>Contig1.2
ctgatcgaht
>Contig1.3
agctagctgat
The sequences here are those of contig1, contig2 and contig3
themselves (not the sequence of the scaffold they belong to).
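
To make the naming concrete, here is a minimal Python sketch of how I
understand the format. The function names gaa_records and format_fasta
are my own illustration and not part of GAA, and I am assuming that the
contigs of a scaffold are simply numbered 1, 2, 3, ... in order.

```python
def gaa_records(scaffold_id, contig_seqs):
    """Yield (header, sequence) pairs named Contig<scaffold>.<index>.

    Hypothetical helper: numbering contigs from 1 in list order is an
    assumption based on the example above.
    """
    for index, seq in enumerate(contig_seqs, start=1):
        yield f">Contig{scaffold_id}.{index}", seq


def format_fasta(records):
    """Join (header, sequence) pairs into FASTA text, one line each."""
    return "\n".join(f"{header}\n{seq}" for header, seq in records)


# The three contigs of scaffold1 from the example above.
example = format_fasta(
    gaa_records(1, ["agagagaga", "ctgatcgaht", "agctagctgat"])
)
print(example)
```

This prints the same six lines as the example above.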
In the case of ABySS, I can find out which contigs are part of which
scaffolds from the file ${name}-5.path (because it is a paired-end
assembly). For example:
3104040 2600412+ 2560124+ 984306+ 2431009+ 137177+ 483429+ 1822299+
1175151+ 207527- 1736600+ 102457+ 2451822-
3104041 2603296+ 142480+ 1813486- 1345543- 1992572+ 2608396-
3104042 2563086- 594643+ 1092622- 2615901- 1845317+ 1976512+ 2285303+
3104043 2619681+ 1767980+ 765902+ 2234704+ 1620807+ 220843+ 1034286-
951594- 1937453- 1887306- 382806- 654898-
Here I have 4 scaffolds (3104040, 3104041, 3104042, 3104043) and the
list of contigs that compose each one. But in the file
${name}-contig.fa I can find only the sequences of the scaffolds (the
IDs 3104040, 3104041, 3104042 and 3104043 appear in the headers). If I
take a contig ID, let's say 2600412 (which is part of scaffold
3104040), I don't find it in ${name}-contig.fa as the name of any
contig/scaffold. I assume this is because ${name}-contig.fa contains
the sequences of the scaffolds and not the sequences of the contigs
that form those scaffolds.
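
The bookkeeping part seems easy enough. Below is a minimal Python
sketch of how the path file could be turned into the names GAA wants.
It assumes that a bare number starts a new scaffold and that every
number followed by + or - is a contig of that scaffold (I take the sign
to be the orientation). Any other token is skipped. The function name
parse_paths is my own. It only produces the names, so the contig
sequences are still missing.

```python
import re

CONTIG_TOKEN = re.compile(r"^(\d+)([+-])$")   # e.g. 2600412+
SCAFFOLD_TOKEN = re.compile(r"^\d+$")         # e.g. 3104040


def parse_paths(text):
    """Map scaffold ID -> list of (contig ID, sign) in path order.

    Works token by token, so a record that was wrapped over two lines
    (as in this mail) is still read as one scaffold.
    """
    paths = {}
    current = None
    for token in text.split():
        contig = CONTIG_TOKEN.match(token)
        if contig and current is not None:
            paths[current].append((contig.group(1), contig.group(2)))
        elif SCAFFOLD_TOKEN.match(token):
            current = token
            paths[current] = []
        # Tokens of any other shape are ignored (assumption).
    return paths


sample = """\
3104040 2600412+ 2560124+ 984306+ 2431009+ 137177+ 483429+ 1822299+
1175151+ 207527- 1736600+ 102457+ 2451822-
3104041 2603296+ 142480+ 1813486- 1345543- 1992572+ 2608396-
"""

paths = parse_paths(sample)
for scaffold_id, contigs in paths.items():
    for index, (contig_id, sign) in enumerate(contigs, start=1):
        # GAA-style name, original contig ID, sign from the path file
        print(f"Contig{scaffold_id}.{index}\t{contig_id}\t{sign}")
```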
So my question is: how can I obtain or generate a file with the
individual contigs (contig ID + sequence)?
Thanks in advance and I hope somebody can help me with this soon.
Best regards
Alexandra