How to obtain the file of contigs?

Alexandra J. Roth Schulze

Mar 19, 2012, 9:22:15 PM
to ABySS
Hi,

I'm trying to use a program called GAA, which merges two different
assemblies. To use it, I have to create a file of my assembled contigs
with headers in the format Contig1.1, where the number before the dot
is the scaffold number or ID and the number after the dot is the number
or ID of a contig that is part of that scaffold. For example, if my
scaffold1 is composed of contig1, contig2 and contig3, I need to write:
>Contig1.1
agagagaga
>Contig1.2
ctgatcgaht
>Contig1.3
agctagctgat

The sequences here are those of contig1, contig2 and contig3 (not of
the scaffold) that make up the scaffold.
In the case of ABySS, I can find out which contigs are part of which
scaffolds from the file ${name}-5.path (because it is a paired-end
assembly). For example:
3104040 2600412+ 2560124+ 984306+ 2431009+ 137177+ 483429+ 1822299+ 1175151+ 207527- 1736600+ 102457+ 2451822-
3104041 2603296+ 142480+ 1813486- 1345543- 1992572+ 2608396-
3104042 2563086- 594643+ 1092622- 2615901- 1845317+ 1976512+ 2285303+
3104043 2619681+ 1767980+ 765902+ 2234704+ 1620807+ 220843+ 1034286- 951594- 1937453- 1887306- 382806- 654898-

Here I have 4 scaffolds (3104040, 3104041, 3104042, 3104043) and the
list of contigs that compose each scaffold. But in the file
${name}-contig.fa I can find only the sequences of the scaffolds (I can
find the IDs 3104040, 3104041, 3104042, 3104043 in the headers). If I
take a contig ID, let's say 2600412 (which is part of scaffold
3104040), I don't find it in the ${name}-contig.fa file as the name of
any contig or scaffold. I assume this is because the file
${name}-contig.fa holds the sequences of the scaffolds and not the
sequences of the contigs that form those scaffolds.

So my question is: how can I obtain or generate a file with the
individual contigs (contig ID + sequence)?

Thanks in advance and I hope somebody can help me with this soon.

Best regards

Alexandra

Shaun Jackman

Mar 20, 2012, 6:59:38 PM
to Alexandra J. Roth Schulze, ABySS
Hi Alexandra,

Attached are two scripts, faunscaffold and fatoagp. The former splits scaffolds into scaftigs, and the latter outputs an AGP file, which indicates where the scaftigs are placed in the scaffolds.
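To illustrate what splitting scaffolds into scaftigs involves, here is a minimal Python sketch. It is not the attached faunscaffold script; it assumes that gaps in a scaffold are written as runs of N, and the function name `split_scaffold` and the Contig<scaffold>.<n> naming (taken from the GAA format described above) are chosen only for this example. The start and end positions it reports are 1-based and inclusive, which is the kind of placement information an AGP file records.

```python
import re


def split_scaffold(scaffold_id, sequence):
    """Split a scaffold sequence at runs of N (assumed to mark gaps).

    Returns a list of (name, start, end, scaftig_sequence) tuples, where
    start and end are 1-based inclusive positions within the scaffold.
    """
    scaftigs = []
    # Every maximal stretch of non-N characters is treated as one scaftig.
    for index, match in enumerate(re.finditer(r"[^Nn]+", sequence), start=1):
        name = f"Contig{scaffold_id}.{index}"
        scaftigs.append((name, match.start() + 1, match.end(), match.group()))
    return scaftigs


if __name__ == "__main__":
    for name, start, end, seq in split_scaffold("1", "ACGTNNNNGGCCNAT"):
        print(f">{name} {start}-{end}")
        print(seq)
```

Note that this splits at every N, including a single ambiguous base; a real script may require a minimum gap length before splitting.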

The sequences you’re looking for are in the intermediate files, ${name}-[345].fa
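As a rough sketch of how the contig sequences could be combined with the .path file to produce the Contig<scaffold>.<n> headers that GAA expects, something like the following Python might work. It rests on several assumptions: each path line is a scaffold ID followed by contig IDs with a + or - orientation suffix (as in the example above), the contig ID is the first word of each FASTA header, a - suffix means the contig is reverse-complemented in the scaffold, and any token without a +/- suffix (for instance a gap entry) can be skipped. All function names here are hypothetical.

```python
def read_fasta(lines):
    """Return {sequence_id: sequence}; the ID is the first word of the header."""
    records, current = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse-complement a DNA sequence (other symbols are left unchanged)."""
    return seq.translate(_COMPLEMENT)[::-1]


def contigs_for_gaa(path_lines, contigs):
    """Yield (header, sequence) pairs named Contig<scaffold>.<n>."""
    for line in path_lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        scaffold_id, members = fields[0], fields[1:]
        number = 0
        for member in members:
            if member[-1] not in "+-":
                continue  # assumption: not an oriented contig, so skip it
            contig_id, strand = member[:-1], member[-1]
            seq = contigs[contig_id]  # raises KeyError if the ID is missing
            if strand == "-":
                seq = reverse_complement(seq)
            number += 1
            yield f"Contig{scaffold_id}.{number}", seq
```

With real data, `read_fasta` would be given an open handle on the intermediate FASTA file that contains the contig IDs listed in the path file, and `contigs_for_gaa` an open handle on the .path file; each yielded pair can then be written out as a `>header` line followed by the sequence.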

Cheers,
Shaun

faunscaffold
fatoagp

Alexandra J. Roth Schulze

Jul 27, 2012, 3:33:49 AM
to abyss...@googlegroups.com
Hi Again Shaun,

How are you? I just want to confirm that there is no way to obtain the original reads that were included in the final abyss-pe assembly other than mapping the original reads to the final assembly with a mapping program. Please let me know if there is another way; if not, I'll use the alignment method. Do you think Bowtie is OK?

Thanks in advance

Best regards,

Alexandra

Tony Raymond

Jul 31, 2012, 1:40:17 PM
to Alexandra J. Roth Schulze, abyss...@googlegroups.com
Hi Alexandra,

Shaun has left the GSC and will be starting graduate studies in the fall, so he may not get back to you here. I'm happy to answer any questions you have.

I can't think of any way to find the reads used in an assembly other than mapping the reads back to the assembly. ABySS is distributed with an aligner called abyss-map that would work well for this type of analysis. It is an FM-index aligner that Shaun developed, and it is similar to the BWA fastmap command in that both find the longest exact sequence match between the query and target sequences. Setting the minimum alignment length to your k value would find all the reads that were used in the assembly.
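To make the "longest exact match" idea concrete, here is a toy Python sketch. It is not how abyss-map or BWA fastmap are implemented (they use an FM-index, which scales to real data); it is a simple quadratic-time comparison of one read against one contig on the forward strand only, and the function names and the `k` parameter are chosen just for this illustration.

```python
def longest_exact_match(query, target):
    """Return (length, query_start, target_start) of the longest exact match.

    Positions are 0-based. Uses a simple dynamic-programming scan over
    all pairs of positions, so it is only suitable for short sequences.
    """
    best = (0, 0, 0)
    previous = [0] * (len(target) + 1)
    for i, q in enumerate(query, start=1):
        current = [0] * (len(target) + 1)
        for j, t in enumerate(target, start=1):
            if q == t:
                # Extend the match that ended at the previous base pair.
                current[j] = previous[j - 1] + 1
                if current[j] > best[0]:
                    best = (current[j], i - current[j], j - current[j])
        previous = current
    return best


def read_matches_contig(read, contig, k):
    """True if the read shares an exact match of at least k bases with the contig."""
    length, _, _ = longest_exact_match(read, contig)
    return length >= k


if __name__ == "__main__":
    print(longest_exact_match("TTACGTAGG", "CCACGTACC"))
    print(read_matches_contig("TTACGTAGG", "CCACGTACC", k=5))
```

A real analysis would also need to consider the reverse complement of each read, which this sketch leaves out.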

Tony