STAR never finishes when aligning reads to a large genome reference?

Zoe

Nov 2, 2016, 10:59:08 AM
to rna-star
Hi Alex,

I am trying to align wheat reads against the Triticum aestivum genome (version TGACv1) from Ensembl Plants (http://plants.ensembl.org/Triticum_aestivum/Info/Index).
I tried STAR versions 2.4.2a and 2.5.2b. Both hang at a certain stage for a few days and never complete (please see the Log.out: https://drive.google.com/open?id=0B9WQGMiXLe7GeWh6MlJvSXVaT1k ). There is no error message. BTW, generating the index worked without any problem.

Version 2.4.2a worked very well when I aligned the reads to a smaller wheat genome reference; it took about 2-3 hours to complete one sample.

In the new study, I used the "masked genomic DNA" as the reference. The total length of this genome is 13427354022 bases, of which 80% are N's, and it contains a lot of scaffolds. I am wondering if the problem is caused by the genome size.

The parameters used for alignment are the following:

STAR-2.5.2b/bin/Linux_x86_64_static/STAR STAR
--runMode alignReads
--runThreadN 16
--limitBAMsortRAM 470000000000
--limitIObufferSize 500000000
--limitSjdbInsertNsj 5000000
--outReadsUnmapped Fastx
--outSAMtype BAM SortedByCoordinate
--outSAMmode Full
--outSAMstrandField intronMotif
--outFilterIntronMotifs RemoveNoncanonical
--chimSegmentMin 20
--quantMode TranscriptomeSAM GeneCounts
--outBAMsortingThreadN 0
--outSAMattributes All

 --genomeDir ${ref_index}
--readFilesIn reads1.fa reads2.fa
--outFileNamePrefix myResults/

Could you please point out what the problem is and give me some suggestions regarding the parameters?

Thanks in advance!

Zoe

BTW, the Log.out file is too big to attach.

Zoe

Nov 3, 2016, 3:32:19 PM
to rna-star
BTW, I am running the job on a server with 32 cores and 512 GB of RAM.

Alexander Dobin

Nov 3, 2016, 5:19:28 PM
to rna-star
Hi Zoe,

Is there anything written in the Log.progress.out file?
If not, please try to map a very small subset of reads with --readMapNumber 1000.

I can think of a couple of reasons that may cause very slow mapping:
1. Masked genome. If the reads come from loci that are masked, STAR will have a hard time trying to fit them into the rest of the genome.
You could try to map to the unmasked genome, though it will significantly increase the suffix array size: by a factor of about 5, since 80% of the bases are currently masked. It should still fit under 150GB of RAM.
2. Large number of contigs. The speed is reduced when the number of contigs is > 50-100k, while you have ~700k. To mitigate this, you can keep the 50k longest contigs as is, and concatenate the short contigs into one big super-contig. If this helps, you would need to write a script to transform the super-contig alignments into contig coordinates.
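The super-contig idea in point 2 could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not a tested pipeline: the function names `build_super_contig` and `super_to_contig`, the N spacer between concatenated contigs, and its length are all hypothetical choices that do not come from STAR itself. Positions here are 0-based, while SAM/BAM coordinates are 1-based, so a real conversion script would need to adjust for that.

```python
from bisect import bisect_right


def build_super_contig(contigs, keep_longest, spacer_len=200):
    """Split contigs into 'kept' (longest) and one concatenated super-contig.

    contigs: list of (name, sequence) tuples.
    keep_longest: how many of the longest contigs to leave untouched.
    spacer_len: number of N's placed between concatenated contigs
                (an assumption, meant to discourage alignments that
                run across contig boundaries).
    Returns (kept, super_sequence, offsets), where offsets is a list of
    (start, name, length) tuples sorted by 0-based start position.
    """
    ranked = sorted(contigs, key=lambda c: len(c[1]), reverse=True)
    kept = ranked[:keep_longest]
    short = ranked[keep_longest:]

    spacer = "N" * spacer_len
    pieces = []
    offsets = []
    pos = 0
    for name, seq in short:
        if pieces:
            # Separate this contig from the previous one.
            pieces.append(spacer)
            pos += spacer_len
        offsets.append((pos, name, len(seq)))
        pieces.append(seq)
        pos += len(seq)
    return kept, "".join(pieces), offsets


def super_to_contig(offsets, super_pos):
    """Convert a 0-based super-contig position to (contig_name, 0-based pos).

    Returns None when the position falls inside a spacer or outside
    the super-contig.
    """
    starts = [start for start, _, _ in offsets]
    i = bisect_right(starts, super_pos) - 1
    if i < 0:
        return None
    start, name, length = offsets[i]
    local = super_pos - start
    if local >= length:
        return None
    return name, local
```

An alignment that starts in one short contig and ends in another (or in a spacer) has no valid contig coordinates, so a real script would have to decide whether to drop or flag such records.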

Cheers
Alex

Zoe

Nov 4, 2016, 8:22:49 AM
to rna-star
Hi Alex,

I will try the unmasked genome and let you know. BTW, the Log.progress.out file contains only the headers; please see the attachment.
Thanks,
Zoe
Log.progress.out

Ziying Liu

Nov 4, 2016, 4:04:17 PM
to rna-star
Hi Alex,

When I used a small subset of reads with --readMapNumber 1000, it worked well. What do you suggest next? Would splitting the input data and aligning the pieces separately be a way to go?

Thanks,
Zoe

P.S. I can try the unmasked genome, but I think it will cause problems for the subsequent data analysis.


Alexander Dobin

Nov 4, 2016, 6:00:04 PM
to rna-star
Hi Zoe,

The 1000-read test tells us that STAR can map in principle, so there is no catastrophic error with the genome, FASTQ files, etc. However, either
(i) the mapping is very slow, which is why you cannot see any progress, or
(ii) something bad happens in the middle of the FASTQ file.

Please try to increase the number of mapped reads to 10,000 and then 100,000, and so on, until the mapping becomes really slow, and then post the Log.final.out file for the largest number of reads you could map.
We will see if the mapping speed is indeed very low. If the mapping rate is very low as well, it will tell us that the masking is the likely problem.
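The escalation test above could be scripted along these lines. This is a rough Python sketch under assumptions: the helper names `build_star_command` and `parse_log_final` and the output prefixes are hypothetical, and the parser assumes that the statistics lines in Log.final.out have the form `label | value`. It only builds and prints the command lines, it does not run STAR.

```python
def build_star_command(genome_dir, read_files, out_prefix, n_reads, threads=16):
    """Return a STAR argument list that maps only the first n_reads reads."""
    return [
        "STAR",
        "--runMode", "alignReads",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", *read_files,
        "--outFileNamePrefix", out_prefix,
        "--readMapNumber", str(n_reads),
    ]


def parse_log_final(text):
    """Parse 'label | value' lines of a Log.final.out file into a dict.

    Section headers without a '|' separator are skipped. Values are kept
    as strings (for example '80.00%'), since their formats differ.
    """
    stats = {}
    for line in text.splitlines():
        if "|" in line:
            key, value = line.split("|", 1)
            stats[key.strip()] = value.strip()
    return stats


if __name__ == "__main__":
    # Read counts from the suggestion above; the genome directory and
    # output prefixes are placeholders (the output directories must exist
    # before STAR is actually run).
    for n in (1000, 10000, 100000):
        cmd = build_star_command(
            "ref_index", ["reads1.fa", "reads2.fa"], f"test_{n}/", n
        )
        print(" ".join(cmd))
```

After each run, the mapping speed and the uniquely mapped percentage from the parsed Log.final.out are the two numbers to compare across read counts.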

Cheers
Alex
