Unicorns and cannonballs, palaces and piers, trumpets, towers and tenements, wide oceans full of tears...
Tuesday, February 10, 2009
The Semantic Web (Death of the Gene, part 1)
First, however, a diversion into computer science.
I first encountered the concept of the Semantic Web about four years ago, through a seminar presented by the World Wide Web Consortium (W3C). The Semantic Web was envisaged as a successor to the World Wide Web, something to better enable collaboration.
Web pages, written in Hypertext Markup Language (HTML), represent a rather unstructured way to navigate information. True enough, linkages are made from one concept to another. But on the whole the effect is an aimless journey, with no intrinsic meaning underpinning one's meanderings.
In contrast, the Semantic Web is intended to be a network of information in which the navigational links are imbued with specifically defined relationships, such that they can be machine-read. Web pioneer Tim Berners-Lee has referred to this as a Giant Global Graph, in contrast to the World Wide Web. Descriptive relationships are facilitated by languages designed for depicting data: the Resource Description Framework (RDF), the Web Ontology Language (OWL), and particularly XML (Extensible Markup Language), which is already in heavy use for defining data in a very wide range of contexts.
Why do this? was the question that occurred to me at that seminar. The applications proposed were restricted to scientific fields such as pharmaceutical and bibliographic data, which seemed somewhat esoteric to me.
But this set of design and representational principles is starting to make sense in fields where collaboration is a necessity: where the body of knowledge is constantly burgeoning, updating faster than any traditional publishing method can manage, and too large for any one person or group to maintain. Thus, an ontology: a precise specification for a knowledge-classification system.
That could easily be a description of Wikipedia. Such an endeavour is not possible without the web, simply because it calls for such a vast community of contributors.
The same could apply to a more structured discipline, where structured relationships may be just as important as the single instance or 'article'. The ensuing structures, spread out over a large number of web sites, could then be data-mined for meaning.
There is increasing need for this in genetics, as we start to see the concept of a gene break down, and the need to build a large number of relationships out of a genetic code with billions of letters.
Thursday, August 17, 2006
Tech: Just what does IBM do?
Of course, it used to be the hardware giant: in the 1960s and 70s, IBM simply defined the mainframe computers that once formed the backbone of large enterprises. Then the mainframe market was slowly crippled over the 1980s and 90s by the rise of microcomputers: mainframes were effectively a victim of the success of... IBM-format PCs. And the Dells of this world have demonstrated that the PC sector is a dangerously low-margin market.
Over time, IBM turned to technical/business services and software. Some of that software is its own technology, like DB2 (the new release is discussed here: it includes native XML support and autonomic memory management).
But IBM has also been buying up software companies left, right and centre, specifically to give it a full vertical business software offering. In effect, it is seeking to lock in large enterprises by providing a full range of software and services across business needs. This is a common trend among the larger software companies, which is why, for example, Microsoft and Oracle have equally been on the takeover warpath for several years.
Like Apple, IBM deserves credit for successfully re-inventing itself more than once. Whereas it once exploited its hegemonic dominance of the mainframe computer market to extract monopolistic profits (hence the epithet Incapacitating Business for Megabucks), it now operates much more competitively across a range of markets, leveraging off its brand name rather than its monolithic presence - something Microsoft is taking note of, as its own dominant position is being eroded by Linux, OpenOffice, and other open source offerings.
2009 update: Q2 2009 revenue was reported at $23.6 billion - and that's just the one quarter. This is made up of:
- 57% services - made up of technical services (39%) and business services (18%)
- 22% software
- 17% hardware
Tuesday, June 27, 2006
Tech: XML for DBAs
I’m not a Database Administrator, by temperament. But I have to be across some of the issues, working in business intelligence.
An interesting use of XML in this month’s Oracle magazine. You want to archive data from previous years, but the table’s schema may change over time. You want to drop the archived data from the live table, yet be able to restore it again later.
The solution is to archive the data into XML format. This involves extracting and wrapping it in XML tags. The data can be placed in an archival database as XML, able to be mined later.
Here, Oracle uses two specific functions: XMLFOREST to convert relational data into XML, and XMLELEMENT to wrap user-defined tags around it. However, I can’t see any difficulty achieving the desired results in other ways, if a given database product doesn’t have such functions.
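As a minimal sketch of the approach (the table and column names here are my own, not from the article, and an XMLType column is assumed for the archive):

    -- Hypothetical archive table: one self-describing XML document per archived row.
    CREATE TABLE emp_archive (
      archived_at  DATE,
      doc          XMLTYPE
    );

    -- XMLFOREST turns the selected columns into XML elements;
    -- XMLELEMENT wraps them in a user-defined root tag.
    INSERT INTO emp_archive (archived_at, doc)
    SELECT SYSDATE,
           XMLELEMENT("Employee",
             XMLFOREST(e.empno    AS "EmpNo",
                       e.ename    AS "Name",
                       e.sal      AS "Salary",
                       e.hiredate AS "HireDate"))
    FROM   emp e
    WHERE  e.hiredate < DATE '2000-01-01';

Because each archived row is a self-describing document, it survives later schema changes to the source table; restoring is a matter of querying the documents back out.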
Simple.
Thursday, June 15, 2006
Tech: DB2: an impressive new database release
This blog’s been slow because I attended a particularly good course on the new release of DB2 – called Viper on pre-release; now simply DB2 9. The main presenter was one Boris Bialek, who was particularly knowledgeable and entertaining.
Technology competitors are constantly leapfrogging each other. It’s hard to say that product X is consistently better than product Y when, the following year, product Y brings out a new release that trumps the competition.
Having said that, I think DB2 now has the edge that IBM’s competitors (Oracle, and Microsoft with SQL Server) will be struggling to match – Microsoft in particular, since they’ve only just released a major upgrade, SQL Server 2005.
In fact, IBM has made a realistic effort to keep Microsoft at bay with the release of an Express version, which is effectively free – but limited to 2 CPUs and 4 gigabytes of data. This is enough for them to prevent revenue leakage, but at the same time provide a small entry point for developers and small business.
In order of merit, the chief points about the new DB2 are:
- Native XML support – this is not the half-baked implementation found in its competitors; it’s a true hybrid relational/XML database, storing XML documents intact – no shredding – and providing proper indexing on the XML fields. Each XML document is a field in a row.
- Autonomics – memory management is now largely automatic: just set the upper threshold for DB2’s entire memory needs, and its internal management will produce optimal results – across multiple instances – better than a DBA’s manual tuning efforts. It’s so good that IBM plans to retire all the other memory-management parameters.
- Free at entry level with Express versions
- Backup/restore – a host of improvements to handle partial/failure situations
- Range partitioning – ability to partition tables by ranges of key values (see the sketch after this list)
- Granular security – Label Based Access Control allows administrators to define access levels within tables
- Granular compression – data compression can be defined down to the row level (note that this is not an exact parallel to the granular security above)
- other improvements – including capacity improvements at page level and below
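To make a few of these concrete, here’s a minimal DDL sketch of the sort of thing the new release allows – the database, table and column names are my own invention, and the exact clauses should be checked against the DB2 9 documentation:

    -- Self-tuning memory: set one overall limit and let DB2 manage the rest
    -- (SELF_TUNING_MEM and DATABASE_MEMORY are the DB2 9 parameters involved).
    UPDATE DATABASE CONFIGURATION FOR salesdb
      USING SELF_TUNING_MEM ON DATABASE_MEMORY AUTOMATIC;

    -- A hybrid relational/XML table: the invoice document is stored intact
    -- in a native XML column; the table is range-partitioned by date and
    -- has row compression switched on.
    CREATE TABLE sales (
      sale_id    INTEGER       NOT NULL,
      sale_date  DATE          NOT NULL,
      amount     DECIMAL(10,2),
      invoice    XML
    )
    PARTITION BY RANGE (sale_date)
      (STARTING '2006-01-01' ENDING '2006-12-31' EVERY 3 MONTHS)
    COMPRESS YES;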
Other databases will have some of these features already, but true native XML support is a first for a relational (non-specialised XML) database. The support for XPath and XQuery constructs is good – very good – as is XML schema support. All way better than anything currently on the market.
That hybrid model may cause some rethinking of the general concepts of relational databases. XQuery and SQL constructs can be embedded within each other, but you can’t precisely treat fields within XML documents as database fields – the document structures are too flexible. First Normal Form is instantly broken if tables and XML documents are treated as a continuum.
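As an illustration of that embedding – again a hypothetical sketch, reusing the sales table from above – an XQuery expression can sit inside SQL via XMLEXISTS (or XMLQUERY), or XQuery can be the outer language, reaching into a column with db2-fn:xmlcolumn:

    -- SQL outside, XQuery inside: find sales whose invoice document
    -- contains a line item for a given product.
    SELECT sale_id, amount
    FROM   sales
    WHERE  XMLEXISTS('$d/invoice/item[product = "widget"]'
                     PASSING invoice AS "d");

    -- XQuery outside, pulling documents out of the relational store.
    XQUERY
      for $i in db2-fn:xmlcolumn('SALES.INVOICE')/invoice
      where $i/item/product = "widget"
      return $i/customer/name;

Inside the document, items and products repeat freely – which is exactly why the fields can’t simply be treated as relational columns.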
Interesting to see where this will all take us. And good to see the technology is there. Although this is invisible to most people, the world is already exploding with XML.
Tuesday, May 30, 2006
Tech: Innovative IBM database and BI releases
Of course, since IBM lost its mantle to Microsoft as the most monolithic and pervasive entity in the I.T. world, it’s been working hard to re-invent itself. It has even sold its PC hardware business (to Lenovo, a Chinese company) – the very business that fostered microcomputer standardisation, allowed Microsoft to gain pre-eminence, and ate away at IBM’s traditional mainframe business. Its business is currently split between software, services, and mainframe hardware. Mainframes are now a niche market, and it’s their software innovation that garners attention.
On imminent release is Viper, software technology for their DB2 database platform which, amongst other things, allows for “native” XML databasing. The presentation I attended last week gave me the impression it permits admixtures of relational data with XML-defined data, but I’d be quite cautious about that until I could see it in action.
This is quite a dramatic initiative*, providing some enabling technology for the Semantic Web (discussed here and here). For me, the significance lies not simply in its ability to handle XML – which can be done in proof-of-concept by any number of vendors – but that it can do this natively, as an integral part of its DB2 product.
Also announced is IBM’s Content Discovery for Business Intelligence. Although this is a part of their WebSphere (application server) product range, in concept it permits pervasive business intelligence across an organisation’s structured and unstructured data. Provided, I presume, the unstructured data has been sufficiently tagged (manually or automatically). The announcement is careful not to include the term “data mining”, so I’m a bit suspicious of its “discovery” nomenclature. Business Intelligence involves specific query, analysis, and reporting functions, whereas data mining is more a discovery of trends – the difference between asking specific questions and asking what patterns are in the data.
We’ll find out the full story when the dust settles. Still, access to unstructured data is nothing to be sneezed at. And if Viper can’t immediately database extant web pages, be sure that that’s the direction they’re going.
*1-June: In fact, it's been said that this is not so dramatic after all - that Oracle has had native XML support for some time. I guess it comes down to how genuine that "native" label is, and how they mix XML and non-XML data. Comments welcome.
(Viper also adds range partitioning, which I can see being particularly useful in a data warehouse/business intelligence context.)