SOAPdenovo and paired end or unpaired FASTA files


wsg

May 2, 2013, 5:21:23 PM
to bgi-...@googlegroups.com
I have a subset of an indexed MiSeq run, obtained by exporting reads from Tablet that had been aligned against a reference sequence using BWA/samtools. The resulting file is in FASTA format and, as near as I can tell, unpaired. I'm attempting to perform a de novo assembly of this small (3300 kb) FASTA dataset with SOAPdenovo, but SOAPdenovo cannot read the FASTA file and gets stuck in an endless loop while importing the data in the pregraph step. This is a small-scale run to work out the bugs before scaling up to a larger data set.

1) My data appears as:
>M00542_7_000000000-A3C86_1_2103_28825_15020_pos=33_len=89
GTGGATTCACAATCCACTGCCTTGATCCACTTGGCTACATCCGCCCCTTATCCAGCTAAAGGATTTTTTTCTTTTTTCC
ATTGATCATT
>M00542_7_000000000-A3C86_1_1102_26169_15631_pos=125_len=90
CTATTTATTCTGACCTCCGTACTTCGATCGAGATATTGGACATAGAATGCCACTCTTTAAAAAGGAAAAAAGGAGTAAT
CAGCTGTGACA
...
up to ~14,000 reads

2) My config file is:

#maximal read length
max_rd_len=260
[LIB]
#average insert size
avg_ins=420
#if sequence needs to be reversed 
reverse_seq=0
#in which part(s) the reads are used
asm_flags=3
#in which order the reads are used while scaffolding
rank=1
#fasta file (unpaired ends)
f=/home/fasta/1700-sorted_bam.fasta

3) and my output error looks like:

Version 2.04: released on July 13th, 2012
Compile Apr 25 2013 16:59:53

********************
Pregraph
********************

Parameters: pregraph -s soap-fasta.config -K 63 -R -o testfasta 

In soap-fasta.config, 1 lib(s), maximum read length 260, maximum name length 256.

8 thread(s) initialized.
Import reads from file:
 /home/fasta/1700-sorted_bam.fasta
--- 100000000th reads.
--- 200000000th reads.
--- 300000000th reads.
...
--- 108400000000th reads.

...and on until I "kill -9" the run.

lizhenyu

May 2, 2013, 9:56:09 PM
to bgi-...@googlegroups.com
Hi,
 
Please check whether the read sequence is in one line.
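
One way to run that check is sketched below in Python. It is only an illustration, not part of SOAPdenovo: the function name `count_wrapped_records` is made up for this example, and it assumes a plain-text FASTA stream in which every record starts with a `>` header line.

```python
import io


def count_wrapped_records(handle):
    """Return (total_records, wrapped_records) for a FASTA text stream.

    A record counts as "wrapped" when its sequence spans more than one line.
    """
    total = 0
    wrapped = 0
    seq_lines = 0  # non-empty sequence lines seen in the current record
    for line in handle:
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if seq_lines > 1:
                wrapped += 1
            total += 1
            seq_lines = 0
        elif line.strip():
            seq_lines += 1
    # Account for the last record in the stream.
    if seq_lines > 1:
        wrapped += 1
    return total, wrapped


if __name__ == "__main__":
    # First record from the original post: its sequence is split over two lines.
    example = io.StringIO(
        ">M00542_7_000000000-A3C86_1_2103_28825_15020_pos=33_len=89\n"
        "GTGGATTCACAATCCACTGCCTTGATCCACTTGGCTACATCCGCCCCTTATCCAGCTAAAGGATTTTTTTCTTTTTTCC\n"
        "ATTGATCATT\n"
    )
    total, wrapped = count_wrapped_records(example)
    print(f"{wrapped} of {total} records have wrapped sequences")
```

To check a real file, pass an open file handle instead of the `io.StringIO` example.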
 
 
2013-05-03

BGI  Zhenyu Li

From: wsg
Sent: 2013-05-03  09:42:21
Cc:
Subject: [BGI-SOAP:767] SOAPdenovo and paired end or unpaired FASTA files

wsg

May 3, 2013, 12:32:19 PM
to bgi-...@googlegroups.com
The FASTA sequences were on multiple lines. I removed the end-of-line characters from the DNA sequence portion of each record and reran the assembly. The same 'loop' occurred:

Version 2.04: released on July 13th, 2012
Compile Apr 25 2013     16:59:53

********************
Pregraph
********************

Parameters: pregraph -s soap-fasta.config -K 63 -R -o fastatest

In soap-fasta.config, 1 lib(s), maximum read length 260, maximum name length 25$

8 thread(s) initialized.
Import reads from file:
 /home/fasta/bamwindow-star.fa
--- 100000000th reads.
--- 200000000th reads.
--- 300000000th reads.
--- 400000000th reads.
etc.

wsg

May 3, 2013, 1:14:39 PM
to bgi-...@googlegroups.com
SUCCESS!

It appears Tablet places asterisks as place-holders in some of the sequences I was checking. Removing these, placing each DNA sequence on a single line, and converting all DOS end-of-line characters to UNIX format (dos2unix) allowed SOAPdenovo to run properly. I'll determine later whether these place-holders were biologically significant. For now SOAPdenovo is running. Thank you for pointing me in the right direction.
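
For anyone hitting the same problem, the three clean-up steps described above can be sketched in Python as follows. This is a rough illustration rather than the exact commands used in this thread: `clean_fasta` is a name made up for the example, and it simply drops every `*` without deciding whether the place-holders carry meaning.

```python
import io


def clean_fasta(lines):
    """Yield cleaned FASTA lines: one header line, then one sequence line.

    - strips DOS/UNIX line endings (the job dos2unix does on a whole file)
    - removes '*' place-holders from sequence lines
    - joins wrapped sequence lines into a single line per record
    Any sequence text that appears before the first header is dropped.
    """
    header = None
    chunks = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # Emit the previous record before starting a new one.
            if header is not None:
                yield header
                yield "".join(chunks)
            header = line
            chunks = []
        elif line.strip():
            chunks.append(line.strip().replace("*", ""))
    # Emit the final record.
    if header is not None:
        yield header
        yield "".join(chunks)


if __name__ == "__main__":
    # Made-up record with DOS line endings, a wrapped sequence and an asterisk.
    messy = io.StringIO(">read1\r\nACGT*AC\r\nGGTT\r\n")
    for out_line in clean_fasta(messy):
        print(out_line)
```

The generator works on any iterable of lines, so an open input file can be passed in and the results written to a new file one line at a time.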

Thanks!

wsg



Topulaneus Hattum

Mar 11, 2014, 1:04:38 PM
to bgi-...@googlegroups.com
wsg wrote: "It appears Tablet places some asterisks as place-holders in some of the sequences I was checking. Removing these, ... allowed SOAPdenovo to run properly."

Hi wsg, are you saying that the asterisks were in the nucleotide sequences?

I am having a similar problem running SOAPdenovo r223 on a pair of gzipped FASTQ files. There are about 100 million 150 bp read pairs in the two files. SOAPdenovo was reporting "--- 18600000000th reads" before I killed it after 60 hours.

I extracted a small subset of the pairs (a couple thousand), and SOAPdenovo ran fine on those.

An online search revealed your post of a similar problem. It sounds like I need to "fix" my reads to remove asterisks but I want to understand better what you actually changed.
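
Before changing anything, it may help to see whether the reads contain unexpected characters at all. The Python sketch below is a diagnostic only; nothing in this thread establishes that stray characters are the cause in the FASTQ case. It assumes standard 4-line FASTQ records, and the names `ALLOWED`, `unexpected_characters` and `scan_fastq_gz` are made up for this example.

```python
import gzip
from collections import Counter

# Characters treated as normal in a read; anything else is reported.
ALLOWED = set("ACGTNacgtn")


def unexpected_characters(lines):
    """Count characters outside ALLOWED on the sequence line of each record.

    Assumes 4-line FASTQ records (header, sequence, '+', qualities).
    A carriage return left over from DOS line endings is counted too.
    """
    counts = Counter()
    for i, line in enumerate(lines):
        if i % 4 == 1:  # the second line of every record is the sequence
            counts.update(ch for ch in line.rstrip("\n") if ch not in ALLOWED)
    return counts


def scan_fastq_gz(path):
    """Run the check on a gzipped FASTQ file."""
    # newline="" keeps '\r' characters so that DOS line endings show up.
    with gzip.open(path, "rt", newline="") as handle:
        return unexpected_characters(handle)


if __name__ == "__main__":
    # Made-up record containing an asterisk and a DOS line ending.
    record = ["@read1\n", "AC*GT\r\n", "+\n", "IIIII\n"]
    print(unexpected_characters(record))
```

An empty result would suggest the sequence lines are clean and the cause lies elsewhere.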

Thanks.
