Excessive run time for particular FASTQ orangutan sequences


Sol Katzman

Jun 13, 2017, 6:55:09 PM
to rna-star
Dear Alex,

STAR 2.5.3a

or

STAR 2.5.1b

Target is orangutan (assembly ponAbe2) with an Ensembl gene splice junction database (sjdb, 100bp overhang).

Numerous samples, each with 5M to 10M sequenced paired-end 130x100bp reads.

Running with 8 threads and about 32 GB of RAM.

Most jobs completed in an hour or so. (I have used this target many times in the past
with no problems.)

One job took 24 hours. (It was actually slower after I upgraded from 2.5.1b to 2.5.3a.)

I have laboriously narrowed this down to a small file of 111 reads.

A different set of 111 reads from the same sample completes in about a minute.

The "slow" set has been running for hours. Maybe it will have finished by the time you see this.

I will send Alex a URL for access to the fast and slow data, the STAR parameters,
and the ponAbe2 target.

I hope you can look into this.

Thanks,
/Sol Katzman
UCSC


Alexander Dobin

Jun 19, 2017, 4:40:50 PM
to rna-star
Hi Sol,

I think this is caused by one of the known problems with STAR (but also an opportunity for a significant speed-up in the future) - it spends too much time on a few reads that are hard to align. I will check these particular reads to see if this is the case, and will see if I can tweak the parameters to fix it.

Cheers
Alex

Sol Katzman

Jun 19, 2017, 7:26:28 PM
to rna-star
Dear Alex,

Thanks for looking into it.

The set of 111 sequences that contains the problem finally finished after more than 48 hours:

Jun 13 15:36:47 ..... started STAR run
Jun 13 15:36:47 ..... loading genome
Jun 13 15:37:43 ..... started mapping
Jun 15 19:31:40 ..... started sorting BAM
Jun 15 19:31:40 ..... finished successfully

It is hard to imagine what "hard to align" means. Can you explain that a little?

I have mapped probably 20 or more samples to the same target
without ever encountering this kind of response.

I imagine that simply checking for forward progress every few minutes
and discarding such a problem sequence would be a good fix.

/Sol.

Alexander Dobin

Jun 26, 2017, 4:14:30 PM
to rna-star
Hi Sol,

I have reproduced the slow mapping of these reads, though it was not as drastic as in your case: it only :) took ~1 hour to map these 111 reads.
The reason behind such slow mapping (I checked read #4 from your file, which looks like the first slow read) is the large number (50) of short seeds for this read in one alignment window. This is very rare for short reads; typically, the number of seeds per window (a genomic region of ~500kb) is <5.

STAR tries to stitch all possible combinations of seeds in a window, and this takes a long time that grows exponentially with the number of seeds. This is not the best approach to stitching the seeds; for instance, in STARlong I use a DP algorithm whose time is quadratic in the number of seeds. I am planning to implement the DP algorithm for short reads in the future as well.
In the meantime, there is a parameter, --seedPerWindowNmax, which controls the maximum number of seeds per window. Unfortunately, its default of 50 is unrealistically large.
If you set it to a smaller value, say 30, you will avoid the mapping slowdown without losing sensitivity (fewer than 100 alignments lost).
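[Editor's note] The exponential-vs-quadratic contrast above can be sketched with a toy model. This is not STAR's actual code; the seed representation (read position, genome position, length) and both functions are illustrative assumptions:

```python
from itertools import combinations

# Toy model of seed stitching in one alignment window.
# A seed is (read_start, genome_start, length); these tuples and both
# functions are hypothetical illustrations, not STAR internals.

def exhaustive_stitch_count(seeds):
    """Candidates an exhaustive stitcher must examine: every non-empty
    subset of seeds, i.e. 2^n - 1 combinations for n seeds."""
    return sum(1 for k in range(1, len(seeds) + 1)
                 for _ in combinations(seeds, k))

def dp_longest_chain(seeds):
    """Quadratic DP alternative: longest collinear chain of seeds
    (seeds in the same order on the read and on the genome)."""
    seeds = sorted(seeds)
    best = [1] * len(seeds)
    for j in range(len(seeds)):
        for i in range(j):
            if seeds[i][0] < seeds[j][0] and seeds[i][1] < seeds[j][1]:
                best[j] = max(best[j], best[i] + 1)
    return max(best, default=0)

# Four seeds; the last one breaks collinearity with the first three.
seeds = [(0, 0, 20), (25, 30, 15), (45, 50, 20), (50, 10, 15)]
print(exhaustive_stitch_count(seeds))  # 2^4 - 1 = 15 candidates
print(dp_longest_chain(seeds))         # longest collinear chain: 3
```

With 50 seeds, the exhaustive count is 2^50 - 1 (~10^15 combinations), while the DP examines only ~50^2 seed pairs, which is the source of the quadratic speed-up mentioned above.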

Cheers
Alex

Sol Katzman

Jul 18, 2017, 2:05:12 PM
to rna-star
Dear Alex,

Thanks for the workaround.

I tried a range of values (25, 30, 35, 40, 45) for --seedPerWindowNmax,

running on the same small data set, in rapid succession, on the same computing resources.

The mapping step took only 1-2 seconds for each of 25, 30, 35, 40, and 45.

Then the "blowup" using 50.

Now, intellectually, I know:

"There is no 'knee' in an exponential curve"

http://www.abarry.org/knee.htm

But it feels like a knee. Or maybe it's not precisely exponential growth.
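[Editor's note] The "feels like a knee" effect has a simple toy explanation: with a small enough constant, an exponential stays below measurement resolution until it suddenly doesn't. The constant here is hypothetical, not measured from STAR:

```python
# Toy model: if each extra seed roughly doubles the stitching work,
# per-window time ~ c * 2^n. The constant c below is an invented
# illustration, not a measured STAR cost.
c = 1e-13  # hypothetical seconds per seed combination

for n in (25, 30, 35, 40, 45, 50):
    print(n, f"{c * 2**n:.6f} s")
```

At n = 25 the toy time is a few microseconds; at n = 50 it is on the order of a hundred seconds. A smooth exponential, but every point below the threshold is indistinguishable from "instant", so the last doubling looks like a knee.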

No matter. Good workaround. Thanks again.

/Sol.
--------------------------------------------------------
 ====== seedPerWindow25 ====
Finished loading the genome: Tue Jul 18 10:19:07 2017
Jul 18 10:19:09 ..... started sorting BAM
 ====== seedPerWindow30 ====
Finished loading the genome: Tue Jul 18 10:32:34 2017
Jul 18 10:32:35 ..... started sorting BAM
 ====== seedPerWindow35 ====
Finished loading the genome: Tue Jul 18 10:33:02 2017
Jul 18 10:33:03 ..... started sorting BAM
 ====== seedPerWindow40 ====
Finished loading the genome: Tue Jul 18 10:33:30 2017
Jul 18 10:33:31 ..... started sorting BAM
 ====== seedPerWindow45 ====
Finished loading the genome: Tue Jul 18 10:33:58 2017
Jul 18 10:34:00 ..... started sorting BAM

Alexander Dobin

Aug 28, 2017, 3:54:28 PM
to rna-star
Hi Sol,

The compute time scales (roughly) exponentially with the number of seeds per window. However, the --seedPerWindowNmax parameter has a threshold effect on the running time: one particular read (the 4th from the start of your file) has exactly 50 seeds in one of its windows, and that window takes an "exponentially" long time to compute. If you set --seedPerWindowNmax 49 (or below), this 50-seed window is not considered at all, so the compute time drops drastically.

Another question is why I did not set this parameter to a more reasonable value to begin with. The problem is that exponential growth of the run time with the number of seeds in a window is the worst-case scenario. In most cases, the seeds in each window split into smaller "collinear" subsets that are incompatible with each other (i.e., the order of the seeds on the genome is reversed with respect to their order in the read sequence). In principle, I should limit the number of seeds in these subsets to avoid the exponential mapping time. I hope to implement this in the future.
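[Editor's note] The collinear-subset idea can be sketched as follows. The greedy partition and the (read_pos, genome_pos) seed tuples are illustrative assumptions, not STAR's internal data structures:

```python
# Toy sketch: split a window's seeds into collinear subsets. The
# exponential blow-up is then governed by the largest subset, not the
# total seed count. (Illustrative only, not STAR internals.)

def collinear_partition(seeds):
    """Greedily assign each seed (read_pos, genome_pos) to the first
    subset whose last seed it extends collinearly."""
    groups = []
    for s in sorted(seeds):
        for g in groups:
            if g[-1][0] < s[0] and g[-1][1] < s[1]:
                g.append(s)
                break
        else:
            groups.append([s])
    return groups

# Six seeds forming two interleaved, mutually incompatible diagonals:
seeds = [(0, 100), (10, 110), (20, 120), (0, 500), (10, 510), (20, 520)]
groups = collinear_partition(seeds)
print([len(g) for g in groups])  # → [3, 3]

# Worst-case stitching cost ~ 2^(largest subset), not 2^(all seeds):
print(2 ** max(len(g) for g in groups), "vs", 2 ** len(seeds))  # → 8 vs 64
```

Capping the subset size rather than the raw per-window seed count would bound the worst case without discarding windows that merely contain many small, mutually incompatible diagonals.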

Cheers
Alex