
Tuesday, December 19, 2006

It's like a Data Warehousing genie

I met with someone this week who was working on a large Data Warehouse/Business Intelligence project.


At the large enterprise level, it's hard to get close to the ideal in a Data Warehouse. The task is positively Herculean: schedule feeds from a large number of data sources, cleanse the data, integrate it, and store it. The data integration task alone would be daunting, and could conceivably take forever. Why? Because if you do the right thing and involve all interested business units, it could be a never-ending talkfest, especially if there are competing interests. And much as people talk about eliminating stovepipes (around the various business areas), there are always going to be competing interests: people concerned with building, maintaining, and expanding their domains.

So I was unsurprised to hear there was a certain amount of descoping. Not enough, to my mind, to match the realities of a large business environment.


But on the positive side, every business or technical issue I raised (on a broad level) was matched with a plan of action, or at least a firm awareness of the issue. So I'm happy to let them run with it – I'm not an enterprise architect.

But on the even more positive side, look what they were promising the data user! An array of different tools to access the DW, from (almost raw) SQL coding to (vanilla) data mining tools, and various options in between. But wait – that's not all. They were also offering the data user access to several levels of data, from modelled representations to raw tables, even as far as allowing access to the [ETL] staging area.


What does that mean? On one level, if the tool you're using doesn't give you the productivity you're after, you can go for something more GUI-based. Conversely, if it doesn't give you the level of control you're after, you can move closer to SQL. And if you're not confident about the data you're receiving, you can drill behind the model, to the tables, then to the staging data.


What better access could you ask for? It's like wish fulfilment.


Prototypes are a matter of months away. I'm keeping my fingers crossed.

Thursday, September 28, 2006

Tech: Don't waste your time on a data warehouse!

Dratz did an interesting post on data warehousing.

In effect: why bother?

In concept, a data warehouse is laudable: one place where you can store all the organisation's data, integrated across business units...

But why? Do you have a business case for it? A good reason you can't extract the intelligence you want from the existing databases?

Well maybe you can't. But maybe you can. Dratz pointed out that the "if you build it they will come" philosophy is the ruin of many a technologist. It just doesn't necessarily happen.

At my workplace, there's a data warehouse. But does it contain feeds from all the company's repositories? Noooo. In fact, much as you'd like to think the organisation is all one big happy family, there remain concepts of ownership. Competition, even.

Break down the information silos? Look at the business silos first. Work on the business issues, the internal politics. And tell me integration is axiomatic.


All I'm saying is, think more carefully about what you're doing. If there's a green light, start small. Don't try to bring on board business units that aren't yet interested – or that jealously guard their IP. Build a data warehouse by all means, but only as and when there are specific business returns.

Monday, August 07, 2006

Tech: semantic integration

Have a look at a thoughtful article on semantic integration, flagged from Wikipedia.

In a nutshell, this is a form of integration that focuses on data, and relies on ontologies (effectively metadata by domain, or data dictionaries) being published in a relatively standardised format.

The ultimate goal would surely be a product which sucks all a company’s metadata into one source – regardless of original format – then enables updating from that source. Now there’s a business opportunity.

Tuesday, June 27, 2006

Tech: XML for DBAs


“XML is not for every use or occasion. But when you need flexibility and neutrality, it can be ideal” – Ari Kaplan

I’m not a Database Administrator, by temperament. But I have to be across some of the issues, working in business intelligence.

There’s an interesting use of XML in this month’s Oracle Magazine. You want to archive data from previous years, but the table’s schema may change over time. You want to drop the archived data from the live table, but be able to restore it again later.

The solution is to archive the data into XML format. This involves extracting and wrapping it in XML tags. The data can be placed in an archival database as XML, able to be mined later.

Here, Oracle uses two specific functions: XMLFOREST, to convert relational data into XML elements, and XMLELEMENT, to wrap a user-defined tag around them. However, I can’t see any difficulty achieving the desired results in other ways, if a given database product doesn’t have such functions.
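
As a rough sketch of the idea (the table and column names are my own invention, not from the article):

    -- Hypothetical example: wrap each 2005 order in XML before archiving,
    -- so the data survives later changes to the orders table's schema.
    INSERT INTO orders_archive (doc)
    SELECT XMLELEMENT(NAME "order",
             XMLFOREST(o.order_id, o.customer_id, o.order_date, o.total_amount))
    FROM   orders o
    WHERE  EXTRACT(YEAR FROM o.order_date) = 2005;

The archived rows can then be deleted from the live table, and parsed back out of the XML whenever they’re needed.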

Simple.

Tuesday, June 20, 2006

Tech: Why use a snowflake schema?

The gurus of Data Warehousing are, of course, Ralph Kimball and Bill Inmon, in whatever order you like. (Kimball's company is the Kimball Group.)

The Snowflake Schema gets mentioned an awful lot, but is understood less and implemented relatively rarely. Even Wikipedia’s entry is uncharacteristically terse (any takers for improving it?).

Kimball has a good article on this architecture, and where it would be useful.

Here's a quick FAQ on BI/DW, but to recap: data warehouses experience different usage patterns from normal transaction databases; they typically involve a lot more reads, fewer writes and updates, and few or no deletions. Most writes happen at ETL time, in batches. Because of this, the efficiency issues are different from those of production databases; and because the data is loaded from source databases that are already normalised, data redundancy has already been addressed upstream. Typically, data warehouses are centred around a fine-tuned fact table with dimension tables coming off it in a star.

The snowflake involves removing “low cardinality” attributes from a dimension table into secondary dimension tables. Literature seems to differ on whether the resultant schema is normalised or not, so you can take your pick (data warehousing is not an exact science, despite Inmon and Kimball – who often differ anyway).
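
To make the split concrete, here’s a minimal sketch (table and column names invented for illustration): a product dimension whose low-cardinality brand attributes have been snowflaked out into a subdimension.

    -- The fact table sits at the centre; dim_product comes off it in the star;
    -- dim_brand is the snowflaked subdimension hanging off dim_product.
    CREATE TABLE dim_brand (
      brand_key     INTEGER PRIMARY KEY,
      brand_name    VARCHAR(50),
      brand_country VARCHAR(50)
    );

    CREATE TABLE dim_product (
      product_key  INTEGER PRIMARY KEY,
      product_name VARCHAR(100),
      brand_key    INTEGER REFERENCES dim_brand (brand_key)  -- the snowflake link
    );

    CREATE TABLE fact_sales (
      product_key INTEGER REFERENCES dim_product (product_key),
      date_key    INTEGER,
      quantity    INTEGER,
      amount      DECIMAL(10,2)
    );

In the pure star version, brand_name and brand_country would simply sit as columns in dim_product.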

Even if you have a snowflake, Kimball strongly favours presenting information users with a simpler star schema representation – unless the snowflake is demonstrably simple in concept.

Kimball’s first example of a useful snowflake is a customer table which mixes browsers with actual shoppers – shoppers would have many more attributes, so it’s best to delineate early. He advocates separating out into subdimensions those attributes that are specific to actual buyers – who will have much more data recorded about them.

A somewhat similar example is a set of financial products. The varying products may have completely different sets of attributes, so each product-specific set is separated out into a different dimension table – again, subdimensions.

Finally, he deals with calendar dimensions, where an organisation crosses jurisdictional boundaries and needs to deal with a variety of settings relating to holidays, fiscal periods, etc. From the base calendar dimension come subdimensions specifying those different jurisdictions and their attributes.

All that discussion ignores issues of normalisation or denormalisation, although literature tends to indicate snowflakes are normalised and stars (which are really a special case of a snowflake) are not.

The ultimate lesson is that a snowflake is not a trivial solution, and shouldn’t get used willy-nilly.

Thursday, June 15, 2006

Tech: DB2: an impressive new database release

This blog’s been slow because I attended a particularly good course on the new release of DB2 – called Viper on pre-release; now simply DB2 9. The main presenter was one Boris Bialek, who was particularly knowledgeable and entertaining.

Technology competitors are constantly leapfrogging each other. It’s hard to say that product X is consistently better than product Y when, the following year, product Y brings out a new release that trumps the competition.

Having said that, I think DB2 now has an edge that IBM’s competitors (Oracle, and Microsoft’s SQL Server) will be struggling to match – Microsoft in particular, since they’ve only just released a major upgrade, SQL Server 2005.

In fact, IBM has made a realistic effort to keep Microsoft at bay with the release of an Express version, which is effectively free – but limited to 2 CPUs and 4 gigabytes of data. This is enough for them to prevent revenue leakage, but at the same time provide a small entry point for developers and small business.

In order of merit, the chief points about the new DB2 are:

  • Native XML support – this is not the half-baked implementation of its competitors; it’s a true hybrid relational/XML database, storing XML documents intact – no shredding – and providing proper indexing on the XML fields. Each XML document is a field in a row (see the sketch after this list).
  • Autonomics – memory management is now largely automatic: just set the upper threshold of DB2’s entire memory needs, and its internal management will produce optimal results – among multiple instances – better than a DBA’s manual tuning efforts. It’s so good, IBM plan to get rid of all other memory management parameters.
  • Free at entry level with Express versions
  • Backup/restore – a host of improvements to handle partial/failure situations
  • Range partitioning – ability to partition tables by key values
  • Granular security – Label Based Access Control allows administrators to define access levels within tables
  • Granular compression – data compression can be defined down to the row level (note that this is not an exact parallel to granular security)
  • Other improvements – including capacity improvements at page level and below
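
For the native XML point, here’s a minimal sketch of what a hybrid table might look like (the table is my own invention; check the DB2 9 documentation for exact syntax):

    -- A relational table with an XML column: each row holds an intact XML document.
    CREATE TABLE customers (
      id   INTEGER NOT NULL PRIMARY KEY,
      name VARCHAR(100),
      doc  XML    -- stored natively, not shredded into relational columns
    );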


Other databases will have some of these features already, but the true native XML support is a first for a relational (non-specialised XML) database. The support for XPath and XQuery structures is good – very good – as is XML schema support. All way better than anything currently on the market.

That hybrid model may cause some rethinking of the general concepts of relational databases. XQuery and SQL constructs can be embedded within each other, but you can’t precisely treat fields within XML documents as database fields – the document structures are too flexible. First Normal Form is instantly broken if tables and XML documents are treated as a continuum.
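
As an illustration of that embedding (continuing the hypothetical customers table from the sketch above – treat the syntax as indicative, not gospel):

    -- SQL and XQuery side by side: a relational column next to an XPath result.
    SELECT c.id,
           XMLQUERY('$d/customer/name/text()' PASSING c.doc AS "d") AS cust_name
    FROM   customers c
    WHERE  XMLEXISTS('$d/customer[country = "AU"]' PASSING c.doc AS "d");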

Interesting to see where this will all take us. And good to see the technology is there. Although this is invisible to most people, the world is already exploding with XML.

Thursday, June 01, 2006

Tech: Gartner's Database market shares

Apropos of my comments about databasing everything, I note that the latest Gartner database report says the market (ie, relational DBMSs) grew by 8% last year.

Is that all? Well, that’s the year-on-year increase in sales of new systems (as I read it), not a measure of the amount of information stored in databases.

However, of more interest is the market share. As you’d expect, Oracle’s far ahead, followed by IBM, with Microsoft managing only about three-fifths of IBM’s share. The actual proportions are 49% to Oracle, 22% to IBM, with Microsoft at 13% and Teradata at only 3%. Surprisingly, Teradata’s market share has been static. Its tools were somewhat rudimentary when last I played with it, but it is quite an industrial-strength product for high-end data warehousing – quite a growth area. Probably an indication that IBM and Oracle are successfully maintaining their presence in that market segment.

What about open source? Growth is significant, but from quite a small base, so products like MySQL and Ingres are not quite on the radar yet.

Year-on-year movement in market share is also surprisingly static. There has been a slight increase from Microsoft, at the expense of most others, but the figures show they’re having more of a struggle than they bargained for in trying to match the products – and brand names – of Oracle and IBM. Still, those figures definitely count for something: as Microsoft’s entry point is considerably lower than all their direct competitors’, unit sales must have shown some oomph.


Of more relevance to my earlier comments, I guess, would be number of installations, amount of data held in each system, and year-on-year changes in those figures. If I hear anything useful, I’ll publish here.

Tuesday, May 30, 2006

Tech: Innovative IBM database and BI releases

Two recent initiatives from IBM are promising in terms of information integration and discovery.

Of course, since IBM lost to Microsoft its mantle as the most monolithic and pervasive entity in the I.T. world, it’s been working hard to re-invent itself. It has even sold (to Lenovo, a Chinese company) its PC hardware business – the very business that fostered microcomputer standardisation, allowed Microsoft to gain pre-eminence, and ate away at IBM’s traditional mainframe business. Its business is currently split between software, services, and mainframe hardware. Mainframes are now a niche market, and it’s the software innovation that garners attention.

On imminent release is Viper, software technology for their DB2 database platform which, amongst other things, allows for “native” XML databasing. The presentation I attended last week gave me the impression it permits admixtures of relational data with XML-defined data, but I’d be quite cautious about that until I could see it in action.

This is quite a dramatic initiative*, providing some enabling technology for the Semantic Web (discussed here and here). For me, the significance lies not simply in its ability to handle XML – which can be done in proof-of-concept by any number of vendors – but that it can do this natively, as an integral part of its DB2 product.

Also announced is IBM’s Content Discovery for Business Intelligence. Although this is a part of their WebSphere (application server) product range, in concept it permits pervasive business intelligence throughout an organisation’s structured and unstructured data. Provided, I presume, the unstructured data has been sufficiently tagged (manually or automatically). The announcement is careful not to include the term “data mining” so I’m a bit suspicious of its “discovery” nomenclature. Business Intelligence involves specific query, analysis, and reporting functions, whereas data mining is more a discovery of trends – the difference between asking specific questions and asking what patterns are in the data.

We’ll find out the full story when the dust settles. Still, access to unstructured data is nothing to be sneezed at. And if Viper can’t immediately database extant web pages, be sure that that’s the direction they’re going.

*1-June: In fact, it's been said to be not that dramatic, that Oracle has had native XML support for some time. I guess it's down to how genuine that "native" label is, and how they mix XML and non-XML. Comments welcome.
(Viper also adds range partitioning, which I can see being particularly useful in a data warehouse/business intelligence context.)
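A quick sketch of how range partitioning might look for a warehouse fact table (the table is hypothetical, and the syntax is approximated from the course material, so check the documentation):

    -- Partition a year of sales by quarter; old ranges can later be detached cheaply.
    CREATE TABLE sales (
      sale_date DATE NOT NULL,
      amount    DECIMAL(10,2)
    )
    PARTITION BY RANGE (sale_date)
      (STARTING FROM ('2006-01-01') ENDING ('2006-12-31') EVERY (3 MONTHS));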

Monday, May 29, 2006

Tech: Database the world with XML (Semantic web, part 2)

(part 1 was Semantic web, super web.)


I have a vision: I want to see the whole digital world databased.

Why? Databases are wonderfully associative tools. We can make connections, sort, and list. We can gain new insights into our information with rapid querying and analysis tools (business intelligence tools in particular).

Now, databases are rather inefficient for storing information, as a colleague pointed out to me. But once upon a time, relational databases were said to be impractical in the real world for much the same reason. Then precipitous drops in CPU and storage costs brought the theoretical into the real world, to the point where you’d be hard-pressed to find a database not predicated on the relational model.

My vision will prevail (although I’m in for a bit of a wait). The web will become a virtual database, thanks to semantic web and XML technology. We will see a gradual take-up of the concept, through the markup of new and existing pages in XML, which will define each web page semantically, giving machine-readable meaning to the information on the page. Search engines will need to be more powerful to process that meaning, to integrate an open set of disparate pages. This is the power of the semantic web paradigm; this is how true integration will happen.

Finally, the whole of human knowledge will be integrated, and we’ll all be experts on everything… whoops, getting ahead of myself here. (We only think we’re experts.)

Seriously, there’s no reason we won’t go down this path. Of course, beyond a certain point much of this information will remain specific and privatised, sensitive to organisations or individuals. Yet what remains in the public domain – even now – is powerful. We just need the tools in place to boost the value of this chaotic, cluttered web.