Normalization of counts


mette Joergensen

Jan 19, 2021, 8:39:27 AM
to NGLess
Hi
In the documentation for the count function it says that the {normed} parameter results in the counts being divided by the size of the feature. Is the size of the feature measured in kb? I have an example where the raw count is around 3000 and the normalized value is around 0.0006, meaning the gene would have to be more than 5 Mb long if the length is measured in bases. Or am I misunderstanding something?
Best,
Mette  

Luis Pedro Coelho

Jan 20, 2021, 4:20:50 AM
to NGLess List
That is correct: {normed} should divide by the size of the feature (note that the feature may be longer than a gene if you are not using seqname: it's the sum of the sizes of all the genes with a given annotation).

I also just noticed that we never documented that we additionally support {fpkm} (fragments per kilobase per million mapped fragments), but I have added that to the documentation now.
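A minimal sketch of how the two normalizations relate (plain Python for illustration, not NGLess internals; the example numbers come from the first message, and the total-fragment count is an assumption):

```python
def normed(raw_count, feature_length_bases):
    # {normed}: raw count divided by the summed length (in bases) of all
    # genes carrying the annotation
    return raw_count / feature_length_bases

def fpkm(raw_count, feature_length_bases, total_mapped_fragments):
    # FPKM: fragments per kilobase of feature per million mapped fragments
    return (raw_count
            / (feature_length_bases / 1_000)
            / (total_mapped_fragments / 1_000_000))

print(normed(3000, 5_000_000))  # 0.0006, matching the numbers in the thread
print(fpkm(3000, 5_000_000, 1_000_000))  # 0.6
```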

HTH
Luis

Luis Pedro Coelho | Fudan University | http://luispedro.org
--
You received this message because you are subscribed to the Google Groups "NGLess" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ngless+un...@googlegroups.com.

mette Joergensen

Jan 20, 2021, 8:17:43 AM
to NGLess
Thanks Luis. Good to know that you also support fpkm. 
My real question was whether the size of the feature is measured in bases or kilobases. I'm working with bacteria and the features are KEGG Orthology groups, so it puzzled me that most features were longer than an average genome; but now that I think about it, it would make even less sense if the size were measured in kb. Based on the raw and normalized counts I get the size of the K00001 feature to be 3,609,405 bases; can that be correct?

Luis Pedro Coelho

Jan 20, 2021, 10:20:38 PM
to NGLess List



It can actually be correct, as it is the sum of the lengths of all sequences annotated as K00001 in the database.
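As a sanity check (plain arithmetic, not an NGLess feature), the implied feature length is simply the raw count divided by the normalized value; with the round numbers from the first message:

```python
raw_count = 3000       # example raw count from the first message
normed_value = 0.0006  # corresponding {normed} value
# implied summed length of all genes carrying the annotation, in bases
implied_length = raw_count / normed_value
print(round(implied_length))  # 5000000
```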

HTH,
Luis



mette Joergensen

Jan 21, 2021, 2:49:10 AM
to NGLess

When I think about it, I don't think it makes sense to first sum all the reads mapping to genes with a specific function and then divide by the total length of those genes. In my opinion it makes more sense to first length-normalize each gene and then sum the normalized values. The two can give hugely different results. A simple example: if we have three genes, all 1 kb long, with the same function, and 100 reads mapping to the first gene, then with your normalization the result is 100/3000 ≈ 0.033, while if you divide first and then sum, the result is 100/1000 + 0/1000 + 0/1000 = 0.1. Yours would punish common functions that are coded for by many genes, even though most of the genes are most likely not present in the sample (and hence don't have a chance of being mapped to). Maybe there are cases where your normalization makes sense, but for my case I think I will do the normalization myself.
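The arithmetic in the example above, as a quick sketch (illustrative Python with the hypothetical three-gene numbers, not NGLess code):

```python
# Three genes of 1 kb sharing one annotation; 100 reads map to the first only.
gene_lengths = [1000, 1000, 1000]  # bases
gene_counts = [100, 0, 0]

# Strategy 1 ({normed}-style): sum counts, then divide by the summed length
sum_then_divide = sum(gene_counts) / sum(gene_lengths)

# Strategy 2 (proposed): length-normalize each gene, then sum
divide_then_sum = sum(c / l for c, l in zip(gene_counts, gene_lengths))

print(round(sum_then_divide, 3))  # 0.033
print(divide_then_sum)            # 0.1
```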