Poor sample quality, or STAR issue?

daheeba...@gmail.com

Apr 3, 2018, 9:04:46 AM

to rna-star

Hi there!

I'm trying to troubleshoot a dataset. I have 60 paired-end fastq files (from 30 samples). With my pipeline (fastqc > trimmomatic > fastqc > STAR > RSEM), 28 samples map without problems, with >90% of reads mapping uniquely. Two samples, however, do not map well: approximately 19% of reads map uniquely and the remaining ~80% fall into the 'too short' category.

When looking at the fastqc reports for these two troublesome samples, I saw that the quality wasn't as good as for the others and there were many fails. Most fails/warnings for the other samples disappeared after trimming, but the two troublesome samples seem to have a lot of over-represented sequences (not expected from the experimental design) and an extremely high level of sequence duplication (clear peak at >10k).

To try to resolve this I have tried the following:

a. Re-trim but with a smaller sliding window to remove 'bad tiles' and re-run STAR as default.
    i.e. ILLUMINACLIP:<pathToFolder>/TruSeq3-PE-2.fa:2:30:10:4 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36
    Result: This improved fastqc statistics considerably, but still only 19% mapped. The length of the mapping was 198 (100bp paired end data) so this looks OK and was similar in all subsequent trials

b. Try running STAR with just one of the trimmed paired end reads at a time
    Result: 19% mapped. Mapped length was 99.4.

c. Try running STAR without trimming at all to see if pair-mate mismatches were to blame
    Result: 19% mapped

d. Try running STAR with adjusted mapping parameters
    i.e. --outFilterScoreMinOverLread 0.5 --outFilterMatchNminOverLread 0 \
    Result: 19% mapped

e. Take 1000 random reads from the unmapped.out.mate1 & mate2 files from a. and BLAST to see what is not mapping in the 80%+ reads in the output from STAR
    Result: no hits to anything on NCBI
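
For anyone wanting to repeat step e, here is a minimal Python sketch of drawing a uniform random sample of reads from a FASTQ file in a single pass (reservoir sampling) and converting it to FASTA for BLAST. It is an illustration only and not necessarily how the 1000 reads above were picked. The function names are made up for this example, and the file name in the comment is an assumption about how the unmapped-read output is named.

```python
import random
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ iterator."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return  # end of input (or a truncated final record)
        header, seq, _plus, qual = (line.rstrip("\n") for line in record)
        yield header, seq, qual


def reservoir_sample(records, k, seed=0):
    """Return up to k records chosen uniformly at random in one pass."""
    rng = random.Random(seed)
    sample = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = rec
    return sample


def to_fasta(records):
    """Format sampled FASTQ records as FASTA text (leading '@' dropped)."""
    lines = []
    for header, seq, _qual in records:
        name = header[1:] if header.startswith("@") else header
        lines.append(">" + name)
        lines.append(seq)
    return "\n".join(lines) + "\n"


# Example usage (file names are assumptions):
# with open("Unmapped.out.mate1") as fq:
#     picked = reservoir_sample(read_fastq(fq), 1000)
# with open("unmapped_sample.fa", "w") as fa:
#     fa.write(to_fasta(picked))
```

Sampling mate1 and mate2 with the same seed keeps the pairs matched, provided both files list the reads in the same order.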

Any ideas on how to investigate further? Could there have been a problem at the library preparation step or a PCR step? Library preparation and sequencing were outsourced to BGI, so I did not do this myself. I have three samples per treatment group, so dropping one of the three for two treatment groups wouldn't be the end of the world, but I'm unsure whether unbalanced numbers of biological replicates have an impact downstream. Thanks in advance for any help.

Deb

Alexander Dobin

Apr 3, 2018, 1:19:53 PM

to rna-star

Hi Deb,

If BLAST does not find any hits for 80% of the reads, it's a really bad sign. My guess would be that these are some kind of adapter sequences that Trimmomatic is not aware of. I would ask BGI for a full list of adapters used in library construction/sequencing, and try to match the unmapped reads against them.
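
As a rough illustration of that matching step, the Python sketch below counts how many reads contain the start of each adapter on either strand. It uses exact substring matching only, so it will miss adapters with sequencing errors. The function names, the `min_len` cutoff, and the adapter entry in the usage comment are hypothetical placeholders; the real sequences would have to come from BGI.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (unknown bases become N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp.get(base, "N") for base in reversed(seq.upper()))


def count_adapter_hits(reads, adapters, min_len=12):
    """Count reads containing the first min_len bases of each adapter.

    reads    -- iterable of read sequences (strings)
    adapters -- dict mapping adapter name to adapter sequence
    A read is counted once per adapter if the seed, or its reverse
    complement, occurs anywhere in the read as an exact substring.
    """
    seeds = {}
    for name, adapter in adapters.items():
        seed = adapter.upper()[:min_len]
        seeds[name] = (seed, reverse_complement(seed))
    hits = {name: 0 for name in adapters}
    for read in reads:
        read = read.upper()
        for name, (fwd, rev) in seeds.items():
            if fwd in read or rev in read:
                hits[name] += 1
    return hits


# Example usage (the adapter sequence is a made-up placeholder, not a real one):
# adapters = {"hypothetical_adapter": "AAGTCGGATCGTAGCC"}
# hits = count_adapter_hits(unmapped_read_sequences, adapters)
```

If a large share of the unmapped reads hits one adapter, adding that sequence to the Trimmomatic adapter file and re-trimming would be the obvious next thing to try.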

You can also try the least stringent mapping, with --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0, on the unmapped reads to see what portion of these reads can be mapped. I am afraid that in the end only the reads that mapped (19%) will be useful.
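
A minimal sketch of setting up that relaxed re-mapping run from Python is below. It only assembles the STAR argument list. The genome directory, input file names, output prefix, and thread count are assumptions to be replaced with your own values, and the helper name `relaxed_star_command` is made up for this example.

```python
import shlex


def relaxed_star_command(genome_dir, mate1, mate2=None,
                         prefix="relaxed_", threads=4):
    """Build a STAR command line with both length-relative filters set to 0."""
    cmd = [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", mate1,
    ]
    if mate2:
        cmd.append(mate2)  # second mate for paired-end input
    cmd += [
        # Remove the minimum score / matched-bases requirements that are
        # expressed as a fraction of the read length.
        "--outFilterScoreMinOverLread", "0",
        "--outFilterMatchNminOverLread", "0",
        "--outFileNamePrefix", prefix,
    ]
    return cmd


# Paths below are placeholders, not taken from the thread.
cmd = relaxed_star_command("/path/to/star_index",
                           "Unmapped.out.mate1", "Unmapped.out.mate2")
print(shlex.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

The uniquely mapped percentage in the resulting Log.final.out then shows how much of the unmapped fraction can be recovered at all.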

Cheers