SOAPdenovo and paired end or unpaired FASTA files

wsg

May 2, 2013, 5:21:23 PM

to bgi-...@googlegroups.com

I have a subset of an indexed MiSeq run, obtained by extracting reads from Tablet that had been aligned against a reference sequence using BWA/samtools. The resulting file is in FASTA format and, as near as I can tell, unpaired. I'm attempting to perform a de novo assembly of this small (3300 kb) FASTA dataset with SOAPdenovo, but SOAPdenovo cannot read the FASTA file and gets stuck in a loop while importing the data in pregraph. This is a small-scale run to work out the bugs before scaling up to a larger data set.

1) My data appears as:

>M00542_7_000000000-A3C86_1_2103_28825_15020_pos=33_len=89

GTGGATTCACAATCCACTGCCTTGATCCACTTGGCTACATCCGCCCCTTATCCAGCTAAAGGATTTTTTTCTTTTTTCC

ATTGATCATT

>M00542_7_000000000-A3C86_1_1102_26169_15631_pos=125_len=90

CTATTTATTCTGACCTCCGTACTTCGATCGAGATATTGGACATAGAATGCCACTCTTTAAAAAGGAAAAAAGGAGTAAT

CAGCTGTGACA

...

up to ~14,000 reads

2) My config file is:

#maximal read length

max_rd_len=260

[LIB]

#average insert size

avg_ins=420

#if sequence needs to be reversed

reverse_seq=0

#in which part(s) the reads are used

asm_flags=3

#in which order the reads are used while scaffolding

rank=1

#fasta file (unpaired ends)

f=/home/fasta/1700-sorted_bam.fasta

3) and my output error looks like:

Version 2.04: released on July 13th, 2012

Compile Apr 25 2013 16:59:53

********************

Pregraph

********************

Parameters: pregraph -s soap-fasta.config -K 63 -R -o testfasta

In soap-fasta.config, 1 lib(s), maximum read length 260, maximum name length 256.

8 thread(s) initialized.

Import reads from file:

/home/fasta/1700-sorted_bam.fasta

--- 100000000th reads.

--- 200000000th reads.

--- 300000000th reads.

...

--- 108400000000th reads.

...and on until I "kill -9" the run.

lizhenyu

May 2, 2013, 9:56:09 PM

to bgi-...@googlegroups.com

Hi,

Please check whether each read sequence is on a single line.
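
One way to run that check is sketched below. It counts the FASTA records whose sequence is wrapped over more than one line. The function name `count_multiline_records` and the demo data are made up for illustration and are not part of SOAPdenovo.

```python
import io


def count_multiline_records(handle):
    """Return (total_records, records_whose_sequence_spans_several_lines)."""
    total = 0
    wrapped = 0
    seq_lines = 0  # sequence lines seen so far for the current record
    for raw in handle:
        line = raw.strip()
        if not line:
            continue  # blank lines are ignored
        if line.startswith(">"):
            # Close out the previous record before starting a new one.
            if total and seq_lines > 1:
                wrapped += 1
            total += 1
            seq_lines = 0
        else:
            seq_lines += 1
    # Close out the last record in the file.
    if total and seq_lines > 1:
        wrapped += 1
    return total, wrapped


# Small in-memory example: r1 is wrapped over two lines, r2 is not.
demo = io.StringIO(">r1\nACGT\nAC\n>r2\nGGGG\n")
total, wrapped = count_multiline_records(demo)
print(f"{wrapped} of {total} records span more than one line")
```

For a real file, pass an open file object instead of the `io.StringIO` demo.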

2013-05-03

BGI Zhenyu Li

From: wsg

Sent: 2013-05-03 09:42:21

To: bgi-...@googlegroups.com

Cc:

Subject: [BGI-SOAP:767] SOAPdenovo and paired end or unpaired FASTA files

wsg

May 3, 2013, 12:32:19 PM

to bgi-...@googlegroups.com

The FASTA sequences were on multiple lines. I removed the end-of-line characters from the DNA sequence portion and reran, but the same 'loop' occurred:

Version 2.04: released on July 13th, 2012

Compile Apr 25 2013 16:59:53

********************

Pregraph

********************

Parameters: pregraph -s soap-fasta.config -K 63 -R -o fastatest

In soap-fasta.config, 1 lib(s), maximum read length 260, maximum name length 25$

8 thread(s) initialized.

Import reads from file:

/home/fasta/bamwindow-star.fa

--- 100000000th reads.

--- 200000000th reads.

--- 300000000th reads.

--- 400000000th reads.

etc.

wsg

May 3, 2013, 1:14:39 PM

to bgi-...@googlegroups.com

SUCCESS!

It appears Tablet places asterisks as placeholders in some of the sequences I was checking. Removing these, placing each DNA sequence on one line, and converting all DOS end-of-line characters to UNIX format (dos2unix) allowed SOAPdenovo to run properly. I'll determine later whether these placeholders were biologically significant. For now SOAPdenovo is running. Thank you for pointing me in the right direction.
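
A minimal sketch of those three clean-up steps in Python, for anyone who hits the same problem. The function names and any file paths passed to them are hypothetical. It drops the asterisks outright, and whether that is biologically appropriate is not settled in this thread.

```python
def clean_fasta(lines):
    """Yield cleaned FASTA lines: one header and one sequence line per record.

    Strips DOS/UNIX line endings, joins wrapped sequence lines,
    and removes '*' placeholder characters from the sequence.
    """
    header = None
    chunks = []
    for raw in lines:
        line = raw.rstrip("\r\n").strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header
                yield "".join(chunks)
            header = line
            chunks = []
        elif header is not None:
            chunks.append(line.replace("*", ""))
    if header is not None:
        yield header
        yield "".join(chunks)


def clean_fasta_file(src_path, dst_path):
    """Rewrite src_path to dst_path with UNIX line endings (paths are examples)."""
    # newline="" keeps any '\r' in the input so clean_fasta can strip it;
    # newline="\n" forces UNIX line endings in the output on every platform.
    with open(src_path, newline="") as src, \
            open(dst_path, "w", newline="\n") as dst:
        for line in clean_fasta(src):
            dst.write(line + "\n")


# In-memory example with DOS line endings, a wrapped sequence and asterisks.
for line in clean_fasta([">r1\r\n", "AC*GT\r\n", "TT*\r\n", ">r2\r\n", "GG\r\n"]):
    print(line)
```

The cleaned file, not the original, is then the one to reference with `f=` in the config.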

Thank you,

wsg

Topulaneus Hattum

Mar 11, 2014, 1:04:38 PM

to bgi-...@googlegroups.com

It appears Tablet places some asterisks as place-holders in some of the sequences I was checking. Removing these, ... allowed SOAPdenovo to run properly.

Hi wsg, are you saying that the asterisks were in the nucleotide sequences?

I am having a similar problem running SOAPdenovo r223 on a pair of gzipped FASTQ files. There are about 100 million 150 bp read pairs in the two files. SOAPdenovo was reporting "--- 18600000000th reads" before I killed it after 60 hours.

I extracted a small subset of the pairs (a couple thousand), and SOAPdenovo ran fine on those.

An online search revealed your post of a similar problem. It sounds like I need to "fix" my reads to remove asterisks but I want to understand better what you actually changed.
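
Before changing anything, it may help to confirm whether the reads contain asterisks or other unexpected characters at all. The sketch below scans the sequence lines of a FASTQ file and tallies every character outside A, C, G, T and N. It assumes plain four-line FASTQ records, and the function names are made up for illustration. Nothing in this thread shows that asterisks are the cause in the FASTQ case, so this is a diagnostic only.

```python
import gzip
from collections import Counter

ALLOWED = set("ACGTNacgtn")


def scan_fastq(handle, allowed=ALLOWED):
    """Return (records_scanned, Counter of unexpected characters, first bad records).

    Assumes four lines per record, with the sequence on the second line.
    A stray carriage return from DOS line endings is reported as '\\r'.
    """
    unexpected = Counter()
    bad_records = []  # 1-based numbers of the first few offending records
    n = 0
    for i, raw in enumerate(handle):
        if i % 4 != 1:
            continue  # only sequence lines are inspected
        n += 1
        seq = raw.rstrip("\n")
        odd = [c for c in seq if c not in allowed]
        if odd:
            unexpected.update(odd)
            if len(bad_records) < 10:
                bad_records.append(n)
    return n, unexpected, bad_records


def scan_fastq_gz(path):
    """Scan a gzipped FASTQ file; newline='' keeps '\\r' visible to the scan."""
    with gzip.open(path, "rt", newline="") as fh:
        return scan_fastq(fh)


# In-memory example: the second record has an asterisk and a DOS line ending.
demo = ["@r1\n", "ACGT\n", "+\n", "IIII\n", "@r2\n", "AC*T\r\n", "+\n", "IIII\n"]
print(scan_fastq(demo))
```

If the scan comes back clean for both files, the cause is probably something other than the characters in the reads.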

Thanks.
