Then a few days ago, while catching up with back issues of SCIENCE, in
the issue of 2003.Jul.04, on page 53, I found this wonderful quote:
"Given the intricacies of RNA editing, complex regulatory networks,
genetic redundancy, and molecular pathways, it is meaningless to
identify any one concrete natural object as the gene." Although that
sounds extreme, I believe it's the right way of thinking. Comments?
As for guessing the number of "genes" (approx. 24000 or maybe 30000) in
the human genome, I suppose it's a good thing the guesses in the
correct range were so sparse that a winner could be picked without
getting into vicious arguments about what exactly counted as one gene
vs. two genes.
Setting aside definitions of a gene in terms of coding for phenotype,
and considering only definitions in terms of evolution by mutation
(point, crossover, duplication, deletion, etc.) and natural selection,
a year or two ago I devised a fuzzy definition of a gene: any segment
of genome (usually DNA, sometimes RNA, as in some viruses) which is
long enough that it doesn't arise by chance but short enough that it
can last many generations before accidentally being split by meiotic
crossover or altered by chance mutation. Specifically, the exact-copy
fecundity of that segment of genome must be greater than 1. This
definition of course yields genes within genes within genes, simply
because a wide range of lengths of genome segment satisfy it.
Indeed an entire chromosome can count as a single gene if crossing over
is infrequent in the particular sequence or the chromosome is
sufficiently short that it misses crossing over most of the time and
simultaneously the mutation rate is low enough to miss that particular
chromosome most of the time. (Whether there in fact is any chromosome
of any living species satisfying those conditions, I don't know,
probably not, but maybe?)
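To put rough numbers on the "exact-copy fecundity" idea, here is a minimal sketch. The model is an assumption of this sketch, not something stated above: each base mutates independently with probability `mu` per generation, a crossover breakpoint falls at each junction between adjacent bases independently with probability `c`, and each carrier leaves `offspring` copies on average. The function names and all rates are hypothetical, and the lower bound ("long enough that it doesn't arise by chance") is not modeled.

```python
def exact_copy_fecundity(length, mu, c, offspring):
    """Expected number of intact copies of a segment after one generation.

    Assumed model (not from the post): independent per-base mutation
    probability `mu`, independent per-junction crossover probability `c`,
    and `offspring` copies per carrier on average.
    """
    if length < 1:
        raise ValueError("length must be at least 1")
    p_no_mutation = (1.0 - mu) ** length
    p_no_crossover = (1.0 - c) ** (length - 1)
    return offspring * p_no_mutation * p_no_crossover


def max_gene_length(mu, c, offspring, limit=1 << 40):
    """Largest length whose exact-copy fecundity exceeds 1 (0 if none).

    The search is capped near `limit`, which matters only when mu and c
    are so small that the segment is effectively never broken.
    """
    if exact_copy_fecundity(1, mu, c, offspring) <= 1.0:
        return 0
    lo, hi = 1, 2
    # Fecundity falls as length grows, so double until it drops to <= 1 ...
    while hi < limit and exact_copy_fecundity(hi, mu, c, offspring) > 1.0:
        lo, hi = hi, hi * 2
    # ... then binary-search between the last passing and first failing length.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if exact_copy_fecundity(mid, mu, c, offspring) > 1.0:
            lo = mid
        else:
            hi = mid
    return lo
```

Under this model every length from 1 up to `max_gene_length(...)` qualifies, which is the "genes within genes" nesting; a whole chromosome would qualify only if its length fell below that bound for its own `mu` and `c`.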
How to perform matching calculations on such overlapping and nested
"genes" of varying length? My idea for the past few years has been
overlapping segments of power-of-two lengths feeding into ProxHash (a
hashing function from data space into high-dimensional real-number
space that is continuous in the epsilon-delta sense you all remember
from calculus and metric spaces). That way only four-fold coverage for
any given power of two (length of genome segment) is needed to ensure
that mis-phase won't prevent matching.
Larger powers of two (gs-length, i.e. genome-segment length) can be
used to efficiently trace large
unchanged genome-segments through a few generations before they are
mutated, thereby tracking a whole set of codons etc. simultaneously as
they co-replicate, while smaller powers of two (gs-length) can be used
at greater cost to trace shorter genome-segments, even smaller than a
single phenotype-gene, through more generations. After building a set
of nearest-neighbor (in hash space) links between gs in our database,
and also links between whole and part (adjacent powers of two
gs-length), software can then look at the gs at each end of a link to
find the actual match of identical base sequences, i.e. establish
pointwise alignment wherever an exact match occurs, and then fill in
the gaps wherever a SNP occurs. At this point we have vectors of
exact-alignment links, which can be tracked as groups forward and
backward in time, letting us easily identify where insertions (from
copies, or from retro-coding) or deletions have occurred, and set up
fuzzy links to show paths through such changes. By combining these
various kinds of links, we obtain a directed graph of alignment
stretching from the current time all the way back to the last common
ancestor pool of all life on Earth. Perhaps we'll discover that all
present-day DNA, coding and non-coding, aligns directly from a very
small pool of maybe five or ten codons that were in life 3900 million
years ago, except for one medium-size segment of DNA that suddenly came
to Earth approx. 3400 million years ago via some meteor from Mars.
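The windowing and linking steps can be sketched as follows. ProxHash itself is not specified here, so this sketch swaps in a simple stand-in, k-mer count vectors compared by L1 distance, chosen only because a small change in the sequence moves the vector a small distance. The stride of one quarter of the window length is one assumed reading of "four-fold coverage", and every function name is hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def windows(genome, length):
    """Yield (start, segment) for windows of `length` at stride length // 4.

    Assumed reading of "four-fold coverage": every base away from the
    ends falls inside four overlapping windows.
    """
    stride = max(1, length // 4)
    for start in range(0, len(genome) - length + 1, stride):
        yield start, genome[start:start + length]


def kmer_vector(segment, k=3):
    """Stand-in for ProxHash (assumption): count of each k-mer in the segment."""
    return Counter(segment[i:i + k] for i in range(len(segment) - k + 1))


def distance(u, v):
    """L1 distance between two k-mer count vectors."""
    return sum(abs(u[key] - v[key]) for key in u.keys() | v.keys())


def nearest_links(genome_a, genome_b, length, k=3):
    """Link each window of genome_a to its nearest window of genome_b.

    Brute force for clarity; a real database would need a spatial index.
    Returns a list of (start_a, start_b, distance) tuples.
    """
    vecs_b = [(s, kmer_vector(seg, k)) for s, seg in windows(genome_b, length)]
    if not vecs_b:
        return []
    links = []
    for start_a, seg_a in windows(genome_a, length):
        vec_a = kmer_vector(seg_a, k)
        start_b, dist = min(
            ((sb, distance(vec_a, vb)) for sb, vb in vecs_b),
            key=lambda pair: pair[1],
        )
        links.append((start_a, start_b, dist))
    return links


def exact_alignment(genome_a, start_a, genome_b, start_b, length):
    """Return (pos_a, pos_b, size) of the longest identical run
    shared by the two linked windows, in genome coordinates."""
    seg_a = genome_a[start_a:start_a + length]
    seg_b = genome_b[start_b:start_b + length]
    matcher = SequenceMatcher(None, seg_a, seg_b, autojunk=False)
    m = matcher.find_longest_match(0, len(seg_a), 0, len(seg_b))
    return start_a + m.a, start_b + m.b, m.size
```

Running `nearest_links` once per power-of-two length gives the per-scale links; `exact_alignment` then turns each link into a pointwise match. Filling SNP gaps and building the whole-to-part links between adjacent lengths are left out of this sketch.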
---
> For many months, several years I believe, I've advocated mapping
> genomes completely (that part already done for a dozen species and in
> the works for a dozen more and then more being planned) and then
> matching every segment of DNA without regard to supposed boundaries of
> genes or codons etc., to thereby get a true picture of the evolutionary
> history of DNA without the bias of preconceived ideas of genes etc.
>
> Then a few days ago, while catching up with back issues of SCIENCE, in
> the issue of 2003.Jul.04, on page 53, I found this wonderful quote:
> "Given the intricacies of RNA editing, complex regulatory networks,
> genetic redundancy, and molecular pathways, it is meaningless to
> identify any one concrete natural object as the gene." Although that
> sounds extreme, I believe it's the right way of thinking. Comments?
It's a stupid way of thinking. I define a gene as a DNA sequence
that is transcribed. (There are a few exceptions to this definition
but it works very well.) RNA editing and the rest don't have any
effect on my ability to recognize what I define as a gene.
You can try as many different definitions as you like but I think
I'll stick with one that works, thank you very much.
Larry Moran
---
>
>It's a stupid way of thinking. I define a gene as a DNA sequence
>that is transcribed. (There are a few exceptions to this definition
>but it works very well.) RNA editing and the rest don't have any
>effect on my ability to recognize what I define as a gene.
>
>You can try as many different definitions as you like but I think
>I'll stick with one that works, thank you very much.
Well, ok, but that is something of a copout. So you have defined the
word gene to be a transcription unit. But what we ultimately want is a
unit of function (making a protein?) -- whatever you call it.
Anyway... what do you mean by what is "transcribed"? Pol II
transcription does not have a well-defined terminus. Message end is
determined not by transcription but by polyadenylation, which may
vary. But the actual extent of transcription goes beyond that.
bob
---
> Well, ok, but that is something of a copout.
No it's not. Your original claim was that it's impossible to define
a gene. I gave you a definition that seems to work pretty well. Why
is that a copout?
> So you have defined the word gene to be a transcription unit.
Not exactly. I defined a gene as a region of DNA that is transcribed.
That may not be the same as a "transcription unit." I'd have to know
what you mean by "transcription unit" before I could agree. Do you
have a definition?
> But what we ultimately want is a unit of function (making a protein?)
> -- whatever you call it.
Okay. If that's what you want then try and come up with a word for
it. "Gene" is already taken. How about "protein-encoding region of
DNA"?
> Anyway... what do you mean by what is "transcribed"? Pol II
> transcription does not have a well-defined terminus. Message end is
> determined not by transcription but by polyadenylation, which may
> vary. But the actual extent of transcription goes beyond that.
This is only one of many difficulties with my preferred definition.
I think we all know how to criticize any definition of a gene since
we're all pretty knowledgeable about biology. The fact that there are
exceptions to every rule (and definition) in biology does not mean
that it's hopeless to try and come up with meaningful terms. There's
a useful molecular definition of "gene" and it seems to work well
in most cases. I don't think it's helpful to abandon the word just
because it can't be precisely defined to cover all possibilities in
every living species.
The real problem here is that you, and the authors you quote, want
to define a gene in terms of making a particular protein. This kind
of definition is a holdover from the olden days before the discovery
of functional RNAs. We need to recognize that some genes don't encode
proteins and any reasonable definition has to take this into account.
(Unless, of course, we want to stop talking about tRNA "genes",
ribosomal RNA "genes", or "genes" for snRNAs.)
Larry Moran
---