Normalization of counts


mette Joergensen

Jan 19, 2021, 8:39:27 AM
to NGLess
Hi
In the documentation for the count function it says that the {normed} parameter results in the counts being divided by the size of the feature. Is the size of the feature measured in kb? I have an example where the raw count is around 3000 and the normalized value is around 0.0006, so 3000 / 0.0006 = 5,000,000, meaning the gene would have to be more than 5 Mb long if the length is measured in bases. Or am I misunderstanding something?
Best,
Mette  

Luis Pedro Coelho

Jan 20, 2021, 4:20:50 AM
to NGLess List
That is correct: {normed} divides by the size of the feature (note that the feature may be longer than a single gene if you are not using seqname: it is the sum of the sizes of all the genes with a given annotation).

I also just noticed that we never documented the fact that we also support {fpkm} (fragments per kilobase per million mapped fragments), but I have added that to the documentation now.
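For concreteness, here is a minimal sketch in plain Python of what the two options compute (an illustration, not the NGLess implementation; the variable names are made up), assuming feature_length is the summed length in bases of all genes carrying the annotation and total_fragments is the number of mapped fragments:

    # Illustration only, not NGLess source code.
    def normed(raw_count, feature_length):
        # {normed}: the raw count divided by the feature size in bases
        return raw_count / feature_length

    def fpkm(raw_count, feature_length, total_fragments):
        # {fpkm}: fragments per kilobase of feature per million mapped fragments
        return (raw_count / (feature_length / 1000.0)) / (total_fragments / 1e6)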

HTH
Luis

Luis Pedro Coelho | Fudan University | http://luispedro.org

mette Joergensen

Jan 20, 2021, 8:17:43 AM
to NGLess
Thanks Luis. Good to know that you also support {fpkm}.
My real question was whether the size of the feature is measured in bases or kilobases. I'm working with bacteria and the features are KEGG Orthology groups, which is why it puzzled me that most features were longer than an average genome; but now that I think about it, it would make even less sense if the size were measured in kb. Based on the raw and normalized counts I get the size of the K00001 feature to be 3,609,405 bases. Can that be correct?

Luis Pedro Coelho

Jan 20, 2021, 10:20:38 PM
to NGLess List

It can actually be correct, as it represents the sum of the lengths of all sequences annotated as K00001 in the database.
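As a back-of-the-envelope check (my own sketch, using the rounded numbers from your first message), you can recover the implied feature length from the two values:

    # Illustration only: with {normed} = raw / feature_length,
    # the feature length falls out as raw / normed.
    raw_count = 3000          # approximate raw count from the first message
    normed_count = 0.0006     # the corresponding {normed} value
    feature_length = raw_count / normed_count
    print(feature_length)     # 5000000.0 bases, i.e. ~5 Mb summed over all genes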

HTH,
Luis

mette Joergensen

Jan 21, 2021, 2:49:10 AM
to NGLess

When you think about it, I don't think it makes sense to first sum all the reads mapping to genes with a specific function and then divide by the total length of those genes. In my opinion it makes more sense to first length-normalize each gene and then sum the normalized values. The two give hugely different results. A simple example: if we have three genes, all 1 kb long, with the same function, and 100 reads mapping to the first gene, then with your normalization the result will be 100/3000 ≈ 0.033, while if you divide first and then sum, the result will be 100/1000 + 0/1000 + 0/1000 = 0.1. Yours penalizes common functions that are encoded by many genes, even though most of the genes are most likely not present in the sample (and hence have no chance of being mapped to). Maybe there are cases where your normalization makes sense, but for my case I think I will do the normalization myself.
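In code, the difference between the two schemes looks like this (a sketch of the example above):

    # Three 1 kb genes with the same function; 100 reads map to the first gene.
    counts = [100, 0, 0]            # reads per gene
    lengths = [1000, 1000, 1000]    # gene lengths in bases

    # NGLess {normed}: sum the counts, then divide by the summed length
    sum_then_divide = sum(counts) / sum(lengths)                    # ~0.033

    # My suggestion: length-normalize each gene, then sum
    divide_then_sum = sum(c / l for c, l in zip(counts, lengths))   # 0.1

    print(sum_then_divide, divide_then_sum)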