Difference in hg19 and GRCh37 reference genomes

1,497 views
Skip to first unread message

Gregory Stone

unread,
Mar 7, 2017, 12:53:43 PM3/7/17
to Subread
I would like to preface this by saying that I've grown to love featureCounts, and think its documentation and reporting is the best.
 I was wondering if a marked difference has been observed when using featureCounts with the hg19 and ensembl GRCh37 reference genomes. I aligned my reads using STAR and got similar results using each reference genome. I then go to use featureCounts and get vastly difference results. Using the hg19 genome in featureCounts I get about 40-50% successfully assigned fragments, whereas when using the GRCh37 genome I get about 70-80% successfully assigned fragments. I'm concerned that the significant difference means I'm implementing something wrong in featureCounts, and am confused as to which output I should trust. Any advice or guidance would be greatly appreciated. Thank you!

Mathias Lesche

unread,
Mar 8, 2017, 10:37:20 AM3/8/17
to Subread
There shouldn't be a difference between these two version. I think they are created from the same source. Anything else would be just wrong. However, Ensembl has different version from the same assembly like repeat regions are soft-masked or hard-masked where repeats are replaces by N. In order to see what went wrong can you print the output of the samtools flagstat of both alignments (to GRCh37 and hg19). Furthermore, the commands for featureCounts and the printout of the summary tables?

John Blischak

unread,
Mar 8, 2017, 11:37:02 AM3/8/17
to Subread
Hi Gregory,

I suspect there is a mismatch between the chromosome names in the reference genome compared to those in the annotation file used by featureCounts.

While hg19 and GRCh37 are the same genome build, UCSC appends "chr" to the beginning of the chromosome names, e.g. chr1, chr2, etc. On the other hand, Ensembl leaves the chromosomes as is: 1, 2, etc.  Another difference is the mitochondrial genome, which UCSC labels chrM and Ensembl labels MT.

What are the chromosome names in the annotation file you are using with feature counts?

John

Mathias Lesche

unread,
Mar 8, 2017, 12:10:45 PM3/8/17
to Subread
I had the same thought as well. Usually we recommend to use the assembly and annotation from the same site (like Ensembl) or to relabel the chromosome. I prefer the UCSC labels but use everything from Ensembl. If the names were wrong in reference and annotation, then I would I expect to see 0 counts for hg19 build and not 40-50 %

Gregory Stone

unread,
Mar 9, 2017, 10:51:40 AM3/9/17
to Subread
Thank you very much for the thoughtful replies. While the chromosome names are different between ensembl and hg19, I didn't mix and match. I did two runs, one with ensembl assembly and annotation, and one with hg19 assembly and annotation, so I don't believe that the difference in chromosome labeling would be the issue, unless featureCounts expects one or the other. The following is the output from the same sample file of samtools flagstat, the first when aligned with hg19 (assembly and annotation) and the second with ensembl:
29991285 + 702821 in total (QC-passed reads + QC-failed reads)
2278391 + 64385 secondary
0 + 0 supplementary
0 + 0 duplicates
29991285 + 702821 mapped (100.00% : 100.00%)
27712894 + 638436 paired in sequencing
13871636 + 315364 read1
13841258 + 323072 read2
27665798 + 627952 properly paired (99.83% : 98.36%)
27665798 + 627952 with itself and mate mapped
47096 + 10484 singletons (0.17% : 1.64%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)



30213813 + 705544 in total (QC-passed reads + QC-failed reads)
2286418 + 64605 secondary
0 + 0 supplementary
0 + 0 duplicates
30213813 + 705544 mapped (100.00% : 100.00%)
27927395 + 640939 paired in sequencing
13978944 + 316580 read1
13948451 + 324359 read2
27880148 + 630366 properly paired (99.83% : 98.35%)
27880148 + 630366 with itself and mate mapped
47247 + 10573 singletons (0.17% : 1.65%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)



I'm unfortunately having trouble pulling individual files from the featureCounts summary, so the following is the summary from the hg19 and then ensembl for the first couple files from each:
|| Load annotation file /groups/shared_databases/igenome/Homo_sapiens/UCS ... ||
||    Features : 392959                                                       ||
||    Meta-features : 23228                                                   ||
||    Chromosomes/contigs : 47                                                ||
||                                                                            ||
|| Process BAM file /n/data1/hms/genetics/seidman/danny/gls/STAR/STAR_hg1 ... ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Read re-ordering is performed.                                          ||
||                                                                            ||
||    Total fragments : 33961661                                              ||
||    Successfully assigned fragments : 14868327 (43.8%)                      ||
||    Running time : 3.29 minutes                                             ||
||                                                                            ||
|| Process BAM file /n/data1/hms/genetics/seidman/danny/gls/STAR/STAR_hg1 ... ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Read re-ordering is performed.                                          ||
||                                                                            ||
||    Total fragments : 33899360                                              ||
||    Successfully assigned fragments : 18155450 (53.6%)                      ||
||    Running time : 3.44 minutes                                             ||





|| Load annotation file /groups/shared_databases/igenome/Homo_sapiens/Ens ... ||
||    Features : 1309155                                                      ||
||    Meta-features : 62069                                                   ||
||    Chromosomes/contigs : 244                                               ||
||                                                                            ||
|| Process BAM file /n/data1/hms/genetics/seidman/danny/gls/final_workflo ... ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Read re-ordering is performed.                                          ||
||                                                                            ||
||    Total fragments : 34164741                                              ||
||    Successfully assigned fragments : 27688203 (81.0%)                      ||
||    Running time : 0.52 minutes                                             ||
||                                                                            ||
|| Process BAM file /n/data1/hms/genetics/seidman/danny/gls/final_workflo ... ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Read re-ordering is performed.                                          ||
||                                                                            ||
||    Total fragments : 34067700                                              ||
||    Successfully assigned fragments : 28410464 (83.4%)                      ||
||    Running time : 0.65 minutes                                             ||



Any help would be greatly appreciated. Thank you!

Mathias Lesche

unread,
Mar 9, 2017, 11:07:07 AM3/9/17
to Subread
There are a couple of things that I can see in the output of featuerCounts

hg19:
||    Features : 392959
||    Meta-features : 23228                                                   
||    Chromosomes/contigs : 47   

Ensembl:
||    Features : 1309155                                                      ||
||    Meta-features : 62069                                                   ||
||    Chromosomes/contigs : 244       

First, it looks like you have used used all available chromosomes from the assemblies. Maybe you should restrict the alignment to the 'normal chromsosomes' chr1, chr2, ... chrM and X and Y. I usually use the soft masked primary assembly from Ensembl. Secondly, the amount of features differes between hg19 (UCSC) and Ensembl. hg19 lists 392959 whereas Ensembl lists 1309155. The same goes for the Meta-features which are the genes. So that is most likely the explanation for the differences in assigned fragments to features. Ensembl just list more genes (confirmed, predicted etc) and UCSC is more stringent. What you could do is extract a list of genes from annotation and look at the overlap and difference.

Gregory Stone

unread,
Mar 9, 2017, 11:51:14 AM3/9/17
to Subread
Ok great, thank you so much for the help
Reply all
Reply to author
Forward
0 new messages