featureCounts output

653 views
Skip to first unread message

kistina...@gmail.com

unread,
Sep 23, 2020, 2:43:31 AM9/23/20
to Subread
Dear Yang and Wei Shi,

Sorry, a bit of an amateur question.. at which part should I edit to have the output summary list genes by its ensembl_gene_id instead of listing the hgnc symbols?

Many thanks,
Kistina

Yang LIAO

unread,
Sep 23, 2020, 2:47:06 AM9/23/20
to Subread
Hi Kistina,

Do you mean to have the ensembl gene ids as the identities of genes in the output read-count table from featureCounts?

Are you using the featureCounts program in the command line or the featureCounts() function in R?

Cheers,
Yang

kistina...@gmail.com

unread,
Sep 23, 2020, 3:05:43 AM9/23/20
to Subread
Yes,for the identities of genes in the output read-count table from featureCounts.  

I am using it in R.

Many thanks.

Yang LIAO

unread,
Sep 23, 2020, 3:09:08 AM9/23/20
to Subread
If you used the GTF file downloaded from the Ensembl website, then the default gene identities in the returned R object from featureCounts() are the Ensembl gene ids (eg "ENSG00000223972"). On the other hand, if you want to use the gene symbols as gene identities, you need to specify an extra option when running featureCounts.

Can you provide the command that you ran in R for counting reads?

Cheers,
Yang

kistina...@gmail.com

unread,
Sep 23, 2020, 3:11:05 AM9/23/20
to Subread
Code is:

counts_output<-featureCounts(bam.files,
                             annot.ext = "/run/media/johnalex/Sheffield_C/Kistina/Reference/Homo_sapiens.GRCh38.79.gtf",
                             isGTFAnnotationFile = TRUE,
                             useMetaFeatures = TRUE,
                             allowMultiOverlap = TRUE,
                              largestOverlap = TRUE,
                             isPairedEnd = TRUE,
                             requireBothEndsMapped = FALSE,
                             nthreads = 1)

colnames(counts_output$counts)= paste(sub("\\_.*", "",reads1))

#### save data
setwd("/run/media/johnalex/Sheffield_C/Kistina/Data")
write.table(counts_output,file="transcript_counts.txt",sep = "\t", row.names = TRUE)  

kistina...@gmail.com

unread,
Sep 23, 2020, 3:13:03 AM9/23/20
to Subread
From what you said, it should be the ensembl gene IDs isn't it, as the annotation file is an Ensembl reference.

Yang LIAO

unread,
Sep 23, 2020, 3:24:53 AM9/23/20
to Subread
 FearureCounts by default uses the "gene_id" field in the GTF file as the gene identities. If you haven't edited the annotation file downloaded from Ensembl, then the gene identities are the ensembl gene ids.

You may take a look at the annotation information in the returned object:

> head(counts_output$annotation)

, to see the values in the "GeneID" column.

kistina...@gmail.com

unread,
Sep 23, 2020, 5:12:36 AM9/23/20
to Subread
Sorry, Yang. The GTF file that I used wasn't the one I shared, but it is the NCBI one (GCF_000001405.39_GRCh38.p13_genomic.gtf). I had a look at this file and yes, the gene_id is the gene symbol. 

If I may add some details to what my issue actually is:

My index was built using NCBI 's GRCh38_latest_genomic.fna.
For the counts, I used the NCBI gtf file mentioned above.
The first time I ran my data, the output read-count table listed down as Ensemble_gene_ids

MR1.bam MR2.bam MR3.bam MR4.bam MR5.bam MR6.bam MR7.bam MR8.bam MR9.bam MR10.bam MR11.bam MR12.bam MR13.bam MR14.bam MR15.bam MR16.bam MR17.bam MR18.bam MR19.bam MR20.bam
ENSG00000223972 3 5 7 8 5 2 6 2 2 3 0 4 2 1 3 3 5 1 4 0
ENSG00000227232 148 249 188 162 185 188 205 144 144 238 97 184 222 157 175 177 132 176 134 136
ENSG00000278267 9 9 5 8 8 11 14 3 7 25 8 11 5 10 13 6 6 17 3 6
ENSG00000243485 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0

When I ran another set of data in a different setting but using the exact same index and gtf file, I seem to be getting gene symbols.
"A096CD.bam" "A096SGN.bam" "A096VD.bam" "A096VGN.bam" "A204NB.bam" "A204OV.bam" "A205CD.bam" "A205SGN.bam" "A205VD.bam" "A205VGN.bam" "A211NB.bam" "A211OV.bam" "A216NB.bam" "A216OV.bam" "A218CD.bam" "A218SGN.bam" "A218VD.bam" "A218VGN.bam" "A220CD.bam" "A220SGN.bam" "A220VD.bam" "A220VGN.bam" "A226NB.bam" "A226OV.bam" "A228CD.bam" "A228SGN.bam" "A228VD.bam" "A228VGN.bam" "A237CD.bam" "A237SGN.bam" "A237VD.bam" "A237VGN.bam" "A240CD.bam" "A240SGN.bam" "A240VD.bam" "A240VGN.bam" "A242CD.bam" "A242SGN.bam" "A242VD.bam" "A242VGN.bam" "A245CD.bam" "A245SGN.bam" "A245VD.bam" "A245VGN.bam" "A247CD.bam" "A247SGN.bam" "A247VD.bam" "A247VGN.bam"
"DDX11L1" 0 0 0 3 0 0 3 0 0 0 0 0 0 0 0 0 3 1 2 0 0 1 0 0 0 0 0 1 1 0 1 1 0 0 0 0 0 2 0 0 0 0 1 0 0 0 0 0
"WASH7P" 75 93 135 56 104 72 159 86 183 122 70 107 16 67 88 42 59 51 79 37 73 54 133 116 193 145 102 102 235 125 263 187 126 101 122 82 115 51 96 101 190 79 66 100 166 107 97 149
"MIR6859-1" 0 0 0 0 0 0 2 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 1 0 0 0 8 2 0 0
"MIR1302-2HG" 0 3 5 5 0 3 0 2 2 1 3 1 0 0 0 0 3 1 3 2 4 2 2 0 2 1 1 1 4 3 0 1 4 0 0 0 0 1 2 0 2 3 4 0 3 1 12 1

I am quite confident that I didn't change anything to my codes or to the gtf file, so I am a bit confused as to why the difference in the output reads and thought perhaps there was some update to the package that I would now need specify when I run the featureCounts.

Many thanks.

Yang LIAO

unread,
Sep 23, 2020, 7:06:58 AM9/23/20
to Subread
FeatureCounts doesn't convert gene symbols to ensembl gene ids, so I don't think you can have Ensembl gene ids by using that NCBI gtf file. Also, if you used the same gtf file and didn't change the options of featureCounts, it should not change the gene identities in the resulted counting tables.

Can you show like the first 50 rows in your NCBI gtf file?

kistina...@gmail.com

unread,
Sep 23, 2020, 7:39:00 AM9/23/20
to Subread
The NCBI gtf:

#gtf-version 2.2
#!genome-build GRCh38.p13
#!genome-build-accession NCBI_Assembly:GCF_000001405.39
#!annotation-date 09/05/2019
#!annotation-source NCBI Homo sapiens Updated Annotation Release 109.20190905
NC_000001.11 BestRefSeq gene 11874 14409 . + . gene_id "DDX11L1"; db_xref "GeneID:100287102"; db_xref "HGNC:HGNC:37102"; description "DEAD/H-box helicase 11 like 1"; gbkey "Gene"; gene "DDX11L1"; gene_biotype "transcribed_pseudogene"; pseudo "true";
NC_000001.11 BestRefSeq exon 11874 12227 . + . gene_id "DDX11L1"; transcript_id "NR_046018.2"; db_xref "GeneID:100287102"; gbkey "misc_RNA"; gene "DDX11L1"; product "DEAD/H-box helicase 11 like 1"; exon_number "1";
NC_000001.11 BestRefSeq exon 12613 12721 . + . gene_id "DDX11L1"; transcript_id "NR_046018.2"; db_xref "GeneID:100287102"; gbkey "misc_RNA"; gene "DDX11L1"; product "DEAD/H-box helicase 11 like 1"; exon_number "2";
NC_000001.11 BestRefSeq exon 13221 14409 . + . gene_id "DDX11L1"; transcript_id "NR_046018.2"; db_xref "GeneID:100287102"; gbkey "misc_RNA"; gene "DDX11L1"; product "DEAD/H-box helicase 11 like 1"; exon_number "3";
NC_000001.11 BestRefSeq gene 14362 29370 . - . gene_id "WASH7P"; db_xref "GeneID:653635"; db_xref "HGNC:HGNC:38034"; description "WASP family homolog 7, pseudogene"; gbkey "Gene"; gene "WASH7P"; gene_biotype "transcribed_pseudogene"; gene_synonym "FAM39F"; gene_synonym "WASH5P"; pseudo "true";
NC_000001.11 BestRefSeq exon 29321 29370 . - . gene_id "WASH7P"; transcript_id "NR_024540.1"; db_xref "GeneID:653635"; gbkey "misc_RNA"; gene "WASH7P"; product "WASP family homolog 7, pseudogene"; exon_number "1";
NC_000001.11 BestRefSeq exon 24738 24891 . - . gene_id "WASH7P"; transcript_id "NR_024540.1"; db_xref "GeneID:653635"; gbkey "misc_RNA"; gene "WASH7P"; product "WASP family homolog 7, pseudogene"; exon_number "2";
NC_000001.11 BestRefSeq exon 18268 18366 . - . gene_id "WASH7P"; transcript_id "NR_024540.1"; db_xref "GeneID:653635"; gbkey "misc_RNA"; gene "WASH7P"; product "WASP family homolog 7, pseudogene"; exon_number "3";
NC_000001.11 BestRefSeq exon 17915 18061 . - . gene_id "WASH7P"; transcript_id "NR_024540.1"; db_xref "GeneID:653635"; gbkey "misc_RNA"; gene "WASH7P"; product "WASP family homolog 7, pseudogene"; exon_number "4";
NC_000001.11 BestRefSeq exon 17606 17742 . - . gene_id "WASH7P"; transcript_id "NR_024540.1"; db_xref "GeneID:653635"; gbkey "misc_RNA"; gene "WASH7P"; product "WASP family homolog 7, pseudogene"; exon_number "5";
NC_000001.11 BestRefSeq exon 17233 17368 . - . gene_id "WASH7P"; transcript_id "NR_024540.1"; db_xref "GeneID:653635"; gbkey "misc_RNA"; gene "WASH7P"; product "WASP family homolog 7, pseudogene"; exon_number "6";
NC_000001.11 BestRefSeq exon 16858 17055 . - . gene_id "WASH7P"; transcript_id "NR_024540.1"; db_xref "GeneID:653635"; gbkey "misc_RNA"; gene "WASH7P"; product "WASP family homolog 7, pseudogene"; exon_number "7";
NC_000001.11 BestRefSeq exon 16607 16765 . - . gene_id "WASH7P"; transcript_id "NR_024540.1"; db_xref "GeneID:653635"; gbkey "misc_RNA"; gene "WASH7P"; product "WASP family homolog 7, pseudogene"; exon_number "8";
NC_000001.11 BestRefSeq exon 15796 15947 . - . gene_id "WASH7P"; transcript_id "NR_024540.1"; db_xref "GeneID:653635"; gbkey "misc_RNA"; gene "WASH7P"; product "WASP family homolog 7, pseudogene"; exon_number "9";
NC_000001.11 BestRefSeq exon 14970 15038 . - . gene_id "WASH7P"; transcript_id "NR_024540.1"; db_xref "GeneID:653635"; gbkey "misc_RNA"; gene "WASH7P"; product "WASP family homolog 7, pseudogene"; exon_number "10";
NC_000001.11 BestRefSeq exon 14362 14829 . - . gene_id "WASH7P"; transcript_id "NR_024540.1"; db_xref "GeneID:653635"; gbkey "misc_RNA"; gene "WASH7P"; product "WASP family homolog 7, pseudogene"; exon_number "11";
NC_000001.11 BestRefSeq gene 17369 17436 . - . gene_id "MIR6859-1"; db_xref "GeneID:102466751"; db_xref "HGNC:HGNC:50039"; db_xref "miRBase:MI0022705"; description "microRNA 6859-1"; gbkey "Gene"; gene "MIR6859-1"; gene_biotype "miRNA"; gene_synonym "hsa-mir-6859-1";
NC_000001.11 BestRefSeq exon 17369 17436 . - . gene_id "MIR6859-1"; transcript_id "NR_106918.1"; db_xref "GeneID:102466751"; gbkey "precursor_RNA"; gene "MIR6859-1"; product "microRNA 6859-1"; exon_number "1";
NC_000001.11 BestRefSeq exon 17369 17391 . - . gene_id "MIR6859-1"; transcript_id "MIR6859-1"; db_xref "GeneID:102466751"; db_xref "miRBase:MIMAT0027619"; gbkey "ncRNA"; gene "MIR6859-1"; product "hsa-miR-6859-3p"; exon_number "1";
NC_000001.11 BestRefSeq exon 17409 17431 . - . gene_id "MIR6859-1"; transcript_id "MIR6859-1_1"; db_xref "GeneID:102466751"; db_xref "miRBase:MIMAT0027618"; gbkey "ncRNA"; gene "MIR6859-1"; product "hsa-miR-6859-5p"; exon_number "1";
NC_000001.11 Gnomon gene 29926 31295 . + . gene_id "MIR1302-2HG"; db_xref "GeneID:107985730"; db_xref "HGNC:HGNC:52482"; gbkey "Gene"; gene "MIR1302-2HG"; gene_biotype "lncRNA";
NC_000001.11 Gnomon exon 29926 30039 . + . gene_id "MIR1302-2HG"; transcript_id "XR_001737835.1"; db_xref "GeneID:107985730"; gbkey "ncRNA"; gene "MIR1302-2HG"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 8 samples with support for all annotated introns"; product "MIR1302-2 host gene"; exon_number "1";
NC_000001.11 Gnomon exon 30564 30667 . + . gene_id "MIR1302-2HG"; transcript_id "XR_001737835.1"; db_xref "GeneID:107985730"; gbkey "ncRNA"; gene "MIR1302-2HG"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 8 samples with support for all annotated introns"; product "MIR1302-2 host gene"; exon_number "2";
NC_000001.11 Gnomon exon 30976 31295 . + . gene_id "MIR1302-2HG"; transcript_id "XR_001737835.1"; db_xref "GeneID:107985730"; gbkey "ncRNA"; gene "MIR1302-2HG"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 8 samples with support for all annotated introns"; product "MIR1302-2 host gene"; exon_number "3";  

Wei Shi

unread,
Sep 23, 2020, 7:33:04 PM9/23/20
to Subread
You can set GTF.attrType = "db_xref" in your featureCounts command to get Entrez gene IDs, instead of gene symbols for the counts. 
Reply all
Reply to author
Forward
0 new messages