Following on from yesterday is yet another discussion on the Gene. This constitutes a brief overview of the article Genomics Confounds Gene Classification (Michael Seringhaus and Mark Gerstein), from American Scientist (Nov 2008).
The article clarifies the concept of a gene by rendering it more complex. That sounds somewhat perverse, but the notion of a gene has always been rubbery, possibly because of efforts to render simple something that is more complicated than taxonomically-inclined people would like to deal with.
Quickly recapping, the human genome consists of 23 pairs of chromosomes located in the nucleus of nearly every human cell. Those chromosomes - long strands of DNA - contain some three billion items of information (base pairs). Sequences of these are used to build up proteins, which are the biochemical fundaments of metabolism and life. Thus comes the concept of "one gene - one protein" - that the most atomic process constitutes the encoding of a protein, which ultimately determines a human characteristic, and is thus a "basic unit of heredity".
Via Wikipedia comes the claim that there are an estimated 20,000 to 25,000 "protein-coding genes" in the human genome; the Wikipedia collaborators thus nail down the Gene between the articles on Human genome and Gene. This leaves a number of troubling questions, however, including the function of large swathes of the genome - (only) some of which may be "junk DNA" and may have been inserted by viruses into germ line cells (thus fostering inheritance).
Seringhaus and Gerstein's central premise is that the gene, as "biology's basic unit", is "not nearly so uniform nor as discrete as once was thought" - so "biologists must adapt their methods of classifying genes and their products".
Part of the problem is that the encoding process is more complicated than simply reading a contiguous strand of DNA data. That has been recognised already by conceptualising introns, segments of data that are removed from the ultimate coding process (both main strands of theory posit these as junk). There are also control sequences that govern, inter alia, the beginning and end of the transcription process. However, the overall coding process has been found to involve serious convolutions of the DNA strand. Transcription is not purely a sequential process: exons (coding segments) are non-adjacent and "important control regions can occur tens of thousands of nucleotide pairs away from the targeted coding region - with uninvolved genes sometimes positioned in between"... "the physical qualities of DNA, its ability to loop and bend, bring distant regulatory components close".
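To make the point about non-adjacent exons concrete, here is a toy sketch of my own (not from the article). The sequence, the exon coordinates and the function name splice are all invented for illustration, and it ignores the real detour through RNA: it simply shows that the coding text is assembled from separated pieces rather than read off in one contiguous run.

```python
def splice(dna, exons):
    """Join the exon slices of dna, given as (start, end) half-open index pairs.

    Hypothetical helper: real splicing acts on an RNA transcript, not on DNA.
    """
    return "".join(dna[start:end] for start, end in sorted(exons))


# Invented toy sequence: upper case marks exons, lower case marks introns.
dna = "ATGGCC" + "gtaagt" + "TTTAAA" + "cccagg" + "GGGTGA"

# Assumed exon coordinates for the toy sequence above.
exons = [(0, 6), (12, 18), (24, 30)]

mature = splice(dna, exons)
print(mature)  # ATGGCCTTTAAAGGGTGA
```

The control regions the authors describe, sitting tens of thousands of nucleotide pairs away, are not modelled here at all; that is rather the point of the article.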
The other complication concerns the functionality of a "gene". Seringhaus and Gerstein: "Function in the genetic sense initially was inferred from the phenotypic effects of genes... but a phenotypic effect doesn't capture function on the molecular level. To really elucidate the importance of a gene, it's vital to understand the detailed biochemistry of its products." But each protein, each enzyme, can have a variety of biochemical effects. "Deciding which qualities of a gene and its products to record, report and classify is not trivial". This leads to the system of classification called Gene Ontology (GO). Much more complicated than a simple hierarchy, it uses a Directed Acyclic Graph structure, where each node can have multiple parents - resulting in a rather messy-looking chart of interconnected nodes. This, and the "flood of new genomic data", mean a "large volume of data" which can "paralyze the most dedicated team. Precisely this problem is occurring in biology today".
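The difference between a simple hierarchy and a Directed Acyclic Graph is easier to see in a small sketch. The one below is my own illustration: the term names are simplified and hypothetical (they are not real GO identifiers, nor the real GO layout), and ancestors is an invented helper. The only thing it is meant to show is a node with two parents, which a plain tree cannot represent.

```python
# Hypothetical, simplified terms: each term maps to a list of its parents.
# In a tree every list would hold at most one entry; in a DAG it may hold more.
PARENTS = {
    "molecular function": [],
    "binding": ["molecular function"],
    "catalytic activity": ["molecular function"],
    "ATP binding": ["binding"],
    "kinase activity": ["catalytic activity"],
    "ATP-dependent kinase": ["ATP binding", "kinase activity"],  # two parents
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


print(sorted(ancestors("ATP-dependent kinase")))
```

A gene product annotated with the bottom term inherits both lines of ancestry at once, which is where the messy-looking chart comes from.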
The solution may be found in the semantic web project mentioned yesterday, where indefinite amounts of information and, importantly, relationships, can be stored. Simplicity vanishes, but information can be retained in toto, and compiled collaboratively into a resource that can be mined for meaning. And intuitively, such complexity makes more sense.