Thursday, July 16, 2009
Web: easy as 1.0, 2.0, 3.0
I have a memory of a W3 Consortium seminar in Sydney several years back. It discussed their efforts to put meaning into web content via the Semantic Web, using a concept/relationship mapping language called OWL (the Web Ontology Language), which builds on RDF (the Resource Description Framework) and XML syntax.
They intended to arrive at a structure that was universally navigable mechanistically (by computer), yet retain for each specialty area its own language/concepts. Yes, it was developed by academics, for academic applications.
This was before the concept of Web 2.0 was sufficiently popularised to gain a solid meaning. At the time, I believe they used the term Web 2.0 to describe their endeavour.
Times change, meanings change. The term Web 2.0 has been usurped for another purpose, and it looks like the W3 Consortium is now using Web 3.0 instead. At the current state of play, the simplest description I have seen of the evolution of the web (from Jean-Michel Texier via Peter Thomas) is as follows:
* Web 1.0 was for authors [to be read]
* Web 2.0 is for users [fosters interaction]
* Web 3.0 is also for machines [fosters automation]
In effect, Web 3.0 should enable more rigorous discovery and collation of information from the far corners of the web - something like what Google should be, if it had the full smarts. However, it would only work where web content authors added the underlying semantic markup, so it's more likely to be taken up for knowledge- and information-building purposes such as research, reference materials and databases. But this is a deceptively powerful paradigm, and the sky's the limit for assembling useful meaning. Today's Google would look like a paper telephone directory by comparison... but by then, Google would have evolved to make full use of it: a fully referenced assembler of knowledge, rather than isolated lumps of unverified information.
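To make that concrete, here is a minimal sketch of what that kind of markup lets a machine do: read a few published statements and answer a precise question about them. It assumes Python's rdflib library and a made-up example.org vocabulary - neither comes from the post itself.

```python
# A minimal sketch, assuming Python's rdflib and an invented example.org
# vocabulary, of machine-readable statements a page author might publish
# and the precise question a machine could then answer about them.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/vocab/")   # hypothetical vocabulary
g = Graph()

# Statements published alongside the page content.
g.add((EX.article42, RDF.type, EX.ResearchPaper))
g.add((EX.article42, EX.topic, Literal("semantic web")))
g.add((EX.article42, EX.citedBy, EX.article99))

# A structured question, rather than a keyword match.
query = """
    PREFIX ex: <http://example.org/vocab/>
    SELECT ?paper WHERE { ?paper a ex:ResearchPaper ; ex:topic "semantic web" . }
"""
for row in g.query(query):
    print(row.paper)   # -> http://example.org/vocab/article42
```

The point is the query: it asks for research papers about a topic, not for pages that happen to contain a phrase.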
PS: If you're interested in Data Quality in a technical, database sense, see my latest tech post. (This one was intended for a general audience!)
Labels: IT, semantic web, technology, web 2.0
Wednesday, February 11, 2009
Death of the Gene pt 2 (what is a gene? pt 397)
Following on from yesterday is yet another discussion on the Gene. This constitutes a brief overview of the article Genomics Confounds Gene Classification (Michael Seringhaus and Mark Gerstein), from American Scientist (Nov 2008).
The article clarifies the concept of a gene by rendering it more complex. Which sounds somewhat perverse, but the notion of a gene has always been rubbery, possibly due to efforts to simplify something that is more complicated than taxonomically-inclined people would like to deal with.
Quickly recapping: the human genome consists of 23 pairs of chromosomes located in the nucleus of nearly every human cell. Those chromosomes - long strands of DNA - contain some three billion "letters" (base pairs) of information. Sequences thereof are used to build proteins, which are the biochemical fundaments of metabolism and life. Thus comes the concept of "one gene - one protein": that the most atomic process constitutes the encoding of a protein, which ultimately determines a human characteristic, and is thus a "basic unit of heredity".
Via Wikipedia comes the claim that there are an estimated 20,000 to 25,000 "protein-coding genes" in the human genome; the Wikipedia collaborators thus nail down the Gene between the articles on Human genome and Gene. This leaves a number of troubling questions, however, including the function of large swathes of the genome - (only) some of which may be "junk DNA" and may have been inserted by viruses into germ line cells (thus fostering inheritance).
Seringhaus and Gerstein's central premise is that the gene, as "biology's basic unit", is "not nearly so uniform nor as discrete as once was thought" - so "biologists must adapt their methods of classifying genes and their products".
Part of the problem is that the encoding process is more complicated than simply reading a contiguous strand of DNA data. That has been recognised already by conceptualising introns, segments of data that are removed from the ultimate coding process (both main strands of theory posit these as junk). There are also control sequences that govern, inter alia, the beginning and end of the transcription process. However, the overall coding process has been found to involve serious convolutions of the DNA strand. Transcription is not purely a sequential process: exons (coding strands) are non-adjacent and "important control regions can occur tens of thousands of nucleotide pairs away from the targeted coding region - with uninvolved genes sometimes positioned in between"... "the physical qualities of DNA, its ability to loop and bend, bring distant regulatory components close".
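A toy sketch of just one of those points - that the coding segments are not adjacent on the strand - might help. The sequence and coordinates below are entirely made up; real splicing and regulation are far more involved.

```python
# Made-up strand: three coding segments (exons) separated by two introns.
exon1, intron1, exon2, intron2, exon3 = "ATGGCG", "TTTTTTTT", "GATTACA", "AAAAAAAA", "GGCTGA"
dna = exon1 + intron1 + exon2 + intron2 + exon3

# Hypothetical exon coordinates as (start, end) slices into the strand.
exons = [(0, 6), (14, 21), (29, 35)]

# "Splice": keep only the exon segments, in order, discarding the introns.
transcript = "".join(dna[start:end] for start, end in exons)
print(transcript == exon1 + exon2 + exon3)   # True - the coding parts were never contiguous
```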
The other complication is over the functionality of a "gene". Seringhaus and Gerstein: "Function in the genetic sense initially was inferred from the phenotypic effects of genes... but a phenotypic effect doesn't capture function on the molecular level. To really elucidate the importance of a gene, it's vital to understand the detailed biochemistry of its products." But each protein, each enzyme, can have a variety of biochemical effects. "Deciding which qualities of a gene and its products to record, report and classify is not trivial". This leads to the system of classification called Gene Ontology (GO). Much more complicated than a simple hierarchy, it uses a Directed Acyclic Graph structure, in which each node can have multiple parents - resulting in a rather messy-looking chart of interconnected nodes. This, and the "flood of new genomic data", mean a "large volume of data" which can "paralyze the most dedicated team. Precisely this problem is occurring in biology today".
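For readers who haven't met a directed acyclic graph before, here is a minimal sketch of the structural point: a term can sit under several parents at once, unlike in a strict hierarchy. The terms and links below are illustrative stand-ins loosely in the flavour of GO, not actual entries from the ontology.

```python
# Each term maps to its parent terms; one term deliberately has two parents.
dag = {
    "metal ion binding":     ["ion binding"],
    "ion binding":           ["binding"],
    "catalytic activity":    ["molecular function"],
    "binding":               ["molecular function"],
    "zinc-dependent enzyme": ["metal ion binding", "catalytic activity"],  # two parents
}

def ancestors(term, graph):
    """Collect every term reachable by following parent links upward."""
    found = set()
    stack = [term]
    while stack:
        for parent in graph.get(stack.pop(), []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

print(sorted(ancestors("zinc-dependent enzyme", dag)))
# -> ['binding', 'catalytic activity', 'ion binding', 'metal ion binding', 'molecular function']
```

Classifying a gene product thus means recording all the paths it sits on, not filing it in a single pigeonhole.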
The solution may be found in the semantic web project mentioned yesterday, where indefinite amounts of information and, importantly, relationships can be stored. Simplicity vanishes, but information can be retained in toto, and compiled collaboratively in a form that can be mined for meaning. And intuitively, such complexity makes more sense.
Labels: DNA, gene, genetics, semantic web
Tuesday, February 10, 2009
The Semantic Web (Death of the Gene, part 1)
Two recent articles - in American Scientist and New Scientist - purport to sound the death knell of our understanding of genetics. Interestingly enough, the New Scientist article is the more sensationalist, whereas the American Scientist piece is the more meaningful.
First, however, a diversion into computer science.
I first encountered the concept of the Semantic Web about four years ago, through a seminar presented by the W3 Consortium. The Semantic Web was envisaged as a successor to the worldwide web, something to better enable collaboration.
Web pages, written in Hypertext Markup Language, represent a rather unstructured way to navigate information. True enough, linkages are made from one concept to another. But on the whole the effect is a rather haphazard journey, with no intrinsic meaning underpinning one's meanderings.
In contrast, the Semantic Web is intended to be a network of information in which the navigational links are imbued with specifically defined relationships, such that they can be machine-read. Web pioneer Tim Berners-Lee has referred to this as a Giant Global Graph, in contrast to the worldwide web. Descriptive relationships are facilitated by languages designed for depicting data: the Resource Description Framework (RDF), the Web Ontology Language (OWL), and particularly XML (Extensible Markup Language), which is already in heavy use for defining data in a very wide range of contexts.
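As a small sketch of what such machine-readable relationships look like once written down, here is how a few could be declared with Python's rdflib. The ex: terms are invented for illustration; only RDF, RDFS and OWL are the real W3C vocabularies mentioned above.

```python
# A small sketch using rdflib: a couple of classes, a typed link between them,
# and serialisation into a form any other machine on the web could read.
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

EX = Namespace("http://example.org/bio/")   # invented vocabulary for the example
g = Graph()
g.bind("ex", EX)

# Relationships a machine can follow, rather than bare hyperlinks.
g.add((EX.Protein, RDF.type, OWL.Class))
g.add((EX.Enzyme, RDF.type, OWL.Class))
g.add((EX.Enzyme, RDFS.subClassOf, EX.Protein))
g.add((EX.catalyses, RDF.type, OWL.ObjectProperty))

# The same graph can be published as Turtle (or RDF/XML) alongside a page.
print(g.serialize(format="turtle"))
```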
Why do this? That was the question that occurred to me at the seminar. The applications proposed were restricted to scientific fields such as pharmaceutics and bibliographic data - somewhat esoteric to me.
But this set of design and representational principles is starting to make sense in fields where collaboration is necessary simply because the field is constantly burgeoning, updating faster than any traditional publishing method, and too large for any one person or group to keep track of. Thus, an ontology: a precise specification for a knowledge-classification system.
That could easily be a description of Wikipedia. Such an endeavour is not possible without the web, simply because it calls for such a vast community of contributors.
The same could apply to a more structured discipline, where formal relationships may be just as important as the single instance or 'article'. The ensuing structures, spread out over a large number of web sites, could then be data-mined for meaning.
There is increasing need for this in genetics, as we start to see the concept of a gene break down, and the need to build a large number of relationships out of a genetic code with billions of letters.
Labels: DNA, genetics, semantic web, XML
Monday, May 29, 2006
Tech: Database the world with XML (Semantic web, part 2)
(part 1 was Semantic web, super web.)
I have a vision: I want to see the whole digital world databased.
Why? Databases are wonderfully associative tools. We can make connections, sort, and list. We can gain new insights into our information with rapid querying and analysis tools (business intelligence tools in particular).
Now, databases are rather inefficient for storing information, as a colleague pointed out to me. But once upon a time, relational databases were said to be impractical in the real world for much the same reason. Then precipitous drops in CPU and storage costs brought the theoretical into the real world, to the point where you’d be hard-pressed to find a database not predicated on the relational model.
My vision will prevail (although I’m in for a bit of a wait). The web will become a virtual database, thanks to semantic web and XML technology. We will see a gradual take-up of the concept, through the markup of new and existing pages in XML, which will define each web page semantically, giving machine-readable meaning to the information on the page. Search engines will need to be more powerful to process that meaning and to integrate an open set of disparate pages. This is the power of the semantic web paradigm; this is how true integration will happen.
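A sketch of that "virtual database" idea, under assumptions that are mine rather than anything in this post: two imaginary pages each publish a few statements as Turtle text, and a crawler merges them into one graph it can query across using Python's rdflib.

```python
# Two imaginary pages' worth of published metadata (made-up data and vocabulary).
from rdflib import Graph

page_one = """
@prefix ex: <http://example.org/vocab/> .
ex:alice ex:wrote ex:paper1 .
ex:paper1 ex:topic "gene ontology" .
"""

page_two = """
@prefix ex: <http://example.org/vocab/> .
ex:paper2 ex:cites ex:paper1 .
ex:paper2 ex:topic "semantic web" .
"""

# Merge both sources into a single graph - the collation step a search engine would do.
web = Graph()
for page in (page_one, page_two):
    web.parse(data=page, format="turtle")   # in practice: metadata fetched from the pages

# One query now spans both sources.
query = """
    PREFIX ex: <http://example.org/vocab/>
    SELECT ?author ?citing WHERE {
        ?author ex:wrote ?paper .
        ?citing ex:cites ?paper .
    }
"""
for row in web.query(query):
    print(row.author, row.citing)   # alice's paper turns out to be cited by paper2
```

Neither page knows about the other; the join happens entirely in the merged graph, which is exactly the database-like behaviour being argued for here.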
Finally, the whole of human knowledge will be integrated, and we’ll all be experts on everything… whoops, getting ahead of myself here. (We only think we’re experts.)
Seriously, there’s no reason we won’t go down this path. Of course, beyond a certain point much of this information will remain specific and privatised, sensitive to organisations or individuals. Yet what remains in the public domain – even now – is powerful. We just need the tools in place to boost the value of this chaotic, cluttered web.
Labels: business intelligence, database, semantic web