String Graph Format

James Lindsay

Jun 22, 2012, 5:39:30 PM
to sga-...@googlegroups.com
Hi All,
I was interested in looking at the string graph constructed by the "overlap" command. I was wondering how the string graph is serialised to disk, i.e. which files are written and how they are formatted. Thanks for the help!
James

Lee Mendelowitz

Jun 22, 2012, 9:14:07 PM
to sga-...@googlegroups.com
Hi James,

I've been working a little bit with this file so I can fill you in on some of the details, and Jared can probably add more.

The string graph files produced by "sga overlap" and "sga assemble" have the extension *.asqg.gz. You can decompress one into a plain text file with:

gunzip -c file.asqg.gz > file.asqg

You can view it with less or a text editor (although this file can be large).
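
Since the file can be large, it may be handier to peek at it without writing a decompressed copy. Here is a minimal Python sketch using only the standard gzip module; the helper name head_asqg is made up for illustration and is not part of SGA:

```python
import gzip
from itertools import islice

def head_asqg(path, n=5):
    """Return the first n lines of a gzip-compressed ASQG file (hypothetical helper)."""
    # "rt" opens the stream in text mode; lines are decompressed lazily,
    # so only the start of the file is read.
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]
```

Calling head_asqg("file.asqg.gz", 10) would return the first ten lines as a list of strings.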

The first line is a header line (beginning with HT)
Lines that begin with VT are vertex records and include the contig name and the sequence
Lines that begin with ED are edge records and include information about the overlap in 10 fields:
1. contig 1 name
2. contig 2 name
3. contig 1 overlap start (0-based)
4. contig 1 overlap end (inclusive)
5. contig 1 length
6. contig 2 overlap start (0-based)
7. contig 2 overlap end (inclusive)
8. contig 2 length
9. contig 2 is reverse (1 for reverse, 0 for forward)
10. number of differences in the overlap (0 for perfect overlaps, which is the default).

Contig 1 is always in the forward direction. If contig 2 is reverse and the overlap is perfect, then (in Python notation):
contig1[s1:e1+1] == rev_comp(contig2[s2:e2+1]),
where s1, e1 and s2, e2 are the starting and ending indices listed in the edge record.
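
To make the field order and the overlap relation concrete, here is a rough Python sketch. It assumes the fields of an ED line are separated by whitespace (tabs or spaces), that any extra tokens after the tenth field can be ignored, and that a forward overlap satisfies the same relation without the reverse complement. The names Edge, parse_edge, rev_comp and overlap_is_consistent are made up for illustration and are not part of SGA:

```python
from collections import namedtuple

# The ten ED fields, in the order listed above.
Edge = namedtuple("Edge", "name1 name2 s1 e1 len1 s2 e2 len2 is_rc num_diff")

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def rev_comp(seq):
    """Reverse complement of a DNA string (other characters are left as-is)."""
    return seq.translate(_COMPLEMENT)[::-1]

def parse_edge(line):
    """Parse one ED line into an Edge; assumes whitespace-separated fields."""
    fields = line.split()
    if not fields or fields[0] != "ED":
        raise ValueError("not an ED record")
    if len(fields) < 11:
        raise ValueError("expected 10 fields after the ED tag")
    s1, e1, len1, s2, e2, len2, is_rc, num_diff = map(int, fields[3:11])
    return Edge(fields[1], fields[2], s1, e1, len1, s2, e2, len2,
                bool(is_rc), num_diff)

def overlap_is_consistent(edge, seq1, seq2):
    """Check the perfect-overlap relation; only meaningful when num_diff == 0."""
    part1 = seq1[edge.s1:edge.e1 + 1]   # end coordinate is inclusive
    part2 = seq2[edge.s2:edge.e2 + 1]
    if edge.is_rc:
        part2 = rev_comp(part2)
    return part1 == part2
```

The two sequences would come from the VT records whose names match fields 1 and 2 of the edge.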

Hopefully that helps! Also, you can draw overlaps from an asqg file using the "sga oview" command.

Best,

Lee

Jared Simpson

Jun 23, 2012, 4:16:50 AM
to sga-...@googlegroups.com
Great explanation Lee, thanks.

Here are a couple of tips that make viewing graphs easier. You can use the sga subgraph command to extract a subgraph surrounding a given vertex.

For example:

sga subgraph -d 5 read-12345 graph.asqg.gz

This will write out a new graph file containing just read-12345 and all nodes that are within a distance of 5 of it.
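
As a rough illustration of what "within a distance" means, here is a breadth-first search sketch in Python. It assumes distance is counted in edges (hops) and treats the overlap graph as undirected; it is not SGA's own implementation, and the function name vertices_within is made up for illustration:

```python
from collections import defaultdict, deque

def vertices_within(edges, root, max_dist):
    """Map each vertex reachable within max_dist hops of root to its hop count.

    edges is an iterable of (name1, name2) pairs, e.g. the first two
    fields of each ED record.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    dist = {root: 0}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        if dist[v] == max_dist:
            continue  # do not expand beyond the distance limit
        for w in neighbours[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist
```

Keeping only the VT and ED records whose vertex names appear in the returned mapping would give the corresponding subgraph.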

You can also use the sga-asqg2dot.pl script to write the graph in DOT format for use with the graphviz tools:

zcat subgraph.asqg.gz | sga-asqg2dot.pl > subgraph.dot

Jared

James Lindsay

Jun 26, 2012, 2:26:50 PM
to sga-...@googlegroups.com
Thanks for the replies! I think I understand it a bit better now.

James