Tuesday, June 20, 2006

Tech: Why use a snowflake schema?

The gurus of Data Warehousing are, of course, Ralph Kimball and Bill Inmon, in whatever order you like. (Kimball's company is the Kimball Group.)

The Snowflake Schema gets mentioned an awful lot, but is understood less and implemented relatively rarely. Even Wikipedia’s entry is uncharacteristically terse (any takers for improving it?).

Kimball has a good article on this architecture, and where it would be useful.

Here's a quick FAQ on BI/DW, but to recap: data warehouses experience different usage patterns from normal transaction databases; they typically involve a lot more reads, fewer writes and updates, and few or no deletions. Most writes are at ETL time, batched. Because of this: a) efficiency issues are different from production databases; and because data is loaded from a database that is already normalised, so data redundancy has already been addressed. Typically, data warehouses are centred around a fine-tuned fact table with dimension tables coming off it in a star.

The snowflake involves removing from a dimension table “low cardinality” attributes, into secondary dimension tables. Literature seems to differ on whether the resultant schema is normalised or not, so you can take your pick (data warehousing is not an exact science, despite Inmon and Kimball – who often differ anyway).

Even if you have a snowflake, Kimball strongly favours presenting information users with a simpler star schema representation – unless the snowflake is demonstrably simple in concept.

Kimball’s first example of a useful snowflake is a customer table which mixes browsers with actual shoppers – shoppers would have many more attributes, so it’s best to delineate early. He advocates separating out into subdimensions those dimensions that are specific to actual buyers – who will have much more data recorded about them.

A somewhat similar example is a set of financial products. The varying products may have completely different sets of attributes, so each product-specific set is separated out into a different dimension tables – again, subdimensions.

Finally, he deals with calendar dimensions, where an organisation crosses jurisdictional boundaries, and needs to deal with a variety of settings relating to holidays, fiscal periods, etc. From the base calendar dimension comes subdimensions specifying those different jurisdictions and their attributes.

All that discussion ignores issues of normalisation or denormalisation, although literature tends to indicate snowflakes are normalised and stars (which are really a special case of a snowflake) are not.

The ultimate lesson is that a snowflake is not a trivial solution, and shouldn’t get used willy-nilly.

No comments: