Problem with IGV --> F2R1 and F1R2

1,562 views
Skip to first unread message

Baptiste

unread,
Apr 24, 2016, 11:37:51 PM4/24/16
to rna-star
Hello everyone,

I am currently in internship to validate my master. I am working on data that I received from a biologist. She told me that the data were generated with RNA-seq, paired-end library FR-firstrand. So With this information I expect to see something like F2R1 on IGV.

Anyway, I mapped fastq file with STAR, seeing that I am nice intern student, I keep the command line:

/home/baptiste/Documents/Biotools/star \
        --runThreadN 60 \
        --runMode genomeGenerate \
        --genomeDir /Genomes/StarIndexes/ \
        --genomeFastaFiles /Genomes/FileFasta/hg38.fa  \
        --sjdbGTFfile /Genomes/FileGTF/gencode.v23.annotation.gtf \
        --sjdbOverhang 30

/home/baptiste/Documents/Biotools/star \
        --runThreadN 60 \
        --genomeDir Path/to/StarIndexes \
        --readFilesIn File_read1.fastq.gz File_read2.fastq.gz
        --readFilesCommand zcat \
        --outFileNamePrefix Path/to/Output/Sample1_STAR

Then I convert the SAM into BAM and I generated the BAI files. (Whew, it's finished ! )
Seeing that I am still a great student who cares about what he did, I visualized my results with IGV.

But the Suprised ! I see on IGV that I have reads F2R1 and reads F1R2.
-->   <--        -->     <--
F1    R2        F2      R1

But problem: I saw (somewhere ) that when you found F2R1 and F1R2, it means that your library is FR-unstrand. It could explain why when I try to see my bedgraph with UCSC, I see that strand - and strand + are quite similar (because I used the options "strand" with bedtools). Would she try to trick me ?
Can someone help me to understand ? Thank you for your help.

Baptiste

Alexander Dobin

unread,
Apr 25, 2016, 5:24:03 PM4/25/16
to rna-star
Hi Baptiste,

fr-firststrand library in Tophat/Cufflinks lingo means that the read sequence is on the opposite strand to the original RNA.
Hence,
-->     <--
F2      R1
configuration is for RNA transcribed from the +strand of DNA
and
-->   <--      
F1    R2       
is for RNA transcribed from the -strand.

If you see both configurations for all genes, it's likely that the data is unstranded.

I have not used bedTools in a while, but I think it's supposed to take proper care of the strands in the BAM file with PE reads.
You can also use STAR to get bedGraph files:
STAR --runMode inputAlignmentsFromBAM --inputBAMfile Aligned.sortedByCoord.out.bam --outWigType bedGraph --outWigStrand Stranded

Cheers
Alex

Baptiste

unread,
Apr 25, 2016, 9:56:22 PM4/25/16
to rna-star
Hi Alexander,

One again, thank you very much for your reply, it helps me.
I will try to use STAR for the bed files but the thing is,
if the problem comes from STAR, is it possible that STAR generates this can of error ? Or it is really obvious that the data are unstranded and it is impossible that STAR do this.
But really thank you for your help, I am glad I found this forum.

Baptiste

Alexander Dobin

unread,
Apr 27, 2016, 3:22:32 PM4/27/16
to rna-star
Hi Baptiste,

it's unlikely that this is a STAR problem, this part of the algortihm is quite stable.
Yet another way to check strandedness of the libraries is to calculate the number of reads per gene with --quantMode GeneCounts, and then compare columns 3 and 4 of the ReadsPerGene.out.tab file, which count reads per gene
with opposite strandedness. If values in one of the columns are much larger than in the other, the data is stranded, if the values are similar - the data is unstranded.

Cheers
Alex

Baptiste

unread,
Apr 27, 2016, 9:12:23 PM4/27/16
to rna-star
Hi Alexander,

Thanks for your reply.
I will try to use this parameter and see if my data are stranded or not.

Baptiste

Baptiste

unread,
Apr 27, 2016, 10:10:52 PM4/27/16
to rna-star
Hi Alexander,

This is the output of the --quantMode GeneCounts:
N_unmapped      68986391        68986391        68986391
N_multimapping  9513186 9513186 9513186
N_noFeature     99133281        103052918       107361768
N_ambiguous     1372844 716508  174154
ENSG00000223972.5       16      16      7
ENSG00000227232.5       56      47      9
ENSG00000278267.1       0       0       0
ENSG00000243485.3       29      27      7
ENSG00000274890.1       0       0       0
ENSG00000237613.2       32      19      13
ENSG00000268020.3       0       0       0
ENSG00000240361.1       0       0       0
ENSG00000186092.4       0       0       0
ENSG00000238009.6       6       10      3

This is my command line:
star \
        --runThreadN 100 \
        --quantMode GeneCounts \
        --sjdbGTFfile /Genomes/FileGTF/gencode.v23.annotation.gtf \
        --genomeDir /StarIndexes/ \
        --readFilesIn /File_read1.fastq.gz /File_read2.fastq.gz \
        --readFilesCommand zcat \
        --outFileNamePrefix /Test_Stranded_VS_Unstranded


So I am quite confused because it means that the data are stranded since the column 3 and the column 4 are different. I guess that the problem should come from samtools to convert the sam file, to sort the bam file, and to create the bai file.

Baptiste


Alexander Dobin

unread,
Apr 28, 2016, 12:07:47 PM4/28/16
to rna-star
Hi Baptiste,

I would suggest that you make a scatter plot of column 3 vs column 4.
For unstranded data, you should see most points close to the diagonal, while for stranded data it should close to X or Y axes.
Looking at the few genes in your example, it seems like there is some bias towards column 3, however, there are quite a lot of read in the column 4.
A typical example for stranded data will look like this:
N_noFeature     19765516        93881114        20587796
N_ambiguous     5386350 84654   2443827
ENSG00000241860.5       23      0       23
ENSG00000279457.2       66      0       66
ENSG00000228463.7       16      0       16
ENSG00000237094.10      23      0       23
ENSG00000225972.1       10      3       7
ENSG00000225630.1       1173    219     954
ENSG00000276171.1       3       0       3
ENSG00000237973.1       8201    99      8643
ENSG00000278791.1       0       528     13
ENSG00000229344.1       119     1       118
ENSG00000240409.1       3       1       2
ENSG00000248527.1       15474   546     14928
ENSG00000198744.5       113     4       109

In this case, for most genes col4 >> col3, while N_noFeature col2~col4<<col3

In your case, all columns are very similar for N_noFeature, so it looks like you have mostly unstranded data.
How many uniquely mapped reads do you have? It seems that you have a very large number of noFeature (i.e. non-exonic) reads - intronic or intergenic, is this what you expected?

Cheers
Alex

Baptiste

unread,
May 2, 2016, 10:10:06 PM5/2/16
to rna-star

Hi Alexander,

This is my results. It seems that the data are unstranded. (because of the symmetry ?) 
Anyway this is the log.final.out from star for this file:

                    Number of input reads |       190321335
                      Average input read length |       202
                                    UNIQUE READS:
                   Uniquely mapped reads number |       111821758
                        Uniquely mapped reads % |       58.75%
                          Average mapped length |       197.60
                       Number of splices: Total |       4193634
            Number of splices: Annotated (sjdb) |       3603747
                       Number of splices: GT/AG |       3829863
                       Number of splices: GC/AG |       40369
                       Number of splices: AT/AC |       5347
               Number of splices: Non-canonical |       318055
                      Mismatch rate per base, % |       0.68%
                         Deletion rate per base |       0.02%
                        Deletion average length |       1.72
                        Insertion rate per base |       0.01%
                       Insertion average length |       1.89
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       9513186
             % of reads mapped to multiple loci |       5.00%
        Number of reads mapped to too many loci |       1494483
             % of reads mapped to too many loci |       0.79%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       0.00%
                 % of reads unmapped: too short |       34.43%
                     % of reads unmapped: other |       1.03%


I am running also a TOPHAT to see if I can obtain the same results. But I re-checked the protocol for this data and it is written paire-end, fr-stranded.
Otherwise, thank you again to take your time to help me.

Best

Baptiste


Alexander Dobin

unread,
May 3, 2016, 12:53:38 PM5/3/16
to rna-star
Hi Baptiste,

it does indeed look mostly unstranded. Maybe the protocol did not work properly.
Please let me know if Tophat says otherwise.

Cheers
Alex 

Baptiste

unread,
May 25, 2016, 7:47:20 AM5/25/16
to rna-star
Hi Alexander,

Sorry for the late reply. With the sam from Tophat, I confirm that I obtained the same results as STAR.
The data are unstranded. But the sequencing protocol was saying  that it was stranded. Anyway, it was a good exercise for me and this is something that I can report in my thesis for my master.
So thank you again for your time and your help.

Cheers,

Baptiste 
Reply all
Reply to author
Forward
0 new messages