Friday, July 13, 2007

What is unstructured data?

This blog is devoted to the study of unstructured data, or semi-structured data, or complex data, or whatever we are calling it this week: data that hasn’t been sledgehammered into an RDBMS. I’m working on a position piece on the nomenclature; in the meantime, here are some thoughts from Josh Berkus of Database Soup:

As a database geek, the instance of Silicon Valley linguistic quackery which is my pet peeve du jour is "unstructured data," and its sibling "semi-structured data."  This was brought particularly to my attention last week when I started a data warehouse project which involves the digestion and analysis of a few million web pages, which was classified an "unstructured data archive," a name I quickly changed.  But a quick round of Google searches will reveal quite a buzz in the Valley around these vague terms.

To be perfectly clear: "unstructured data" is an oxymoron.   Unstructured bits, characters, and words are not data, they are gibberish.  Or noise.  Or, to use a data processors' term, garbage.  While the market for garbage processing seems rather confined to the Silicon Valley Toxics Coalition, somehow vendors of "unstructured data processing" have been able to raise millions in venture capital.   So to what, exactly, are these vendors referring?

