Today’s article on O’Reilly Radar by Joseph Hellerstein is a concise synopsis of the state of the art in large-scale data analysis. It compares the Relational Database paradigm that dominates Enterprise IT with the emerging (with a bullet!) MapReduce / Hadoop technologies.
Professor Hellerstein, of UC Berkeley, lives this stuff as a leading researcher on databases and distributed systems. He also writes the data beta blog and is an advisor to Greenplum, one of the start-ups mentioned in the article that is commercializing MapReduce tech.
The article discusses how some companies, such as Aster Data and Greenplum, are promoting hybrid Relational Database / MapReduce Data Warehouse products (and they are companies with proprietary tech and nary a free download link on their home pages). These may get some traction in the Enterprise but, with any success, will eventually get squashed and/or assimilated by Oracle and thus stay in the IT Realm (IMHO).
The more interesting space is the multiverse of open source tools: those pushing the evolution of the underlying Hadoop MapReduce, and the growing set being layered on top of Hadoop, such as Hive, originally developed by Facebook Engineering, and Pig, started by Yahoo Research. Both are sets of tools, including a query language interface, for doing ad-hoc analysis of massive data sets.
Hellerstein calls all of this a renaissance in computer science research and urges folks to look towards standardizing the upper layers of the Hadoop hierarchy, particularly the query language.
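To make the query-language point a bit more concrete, here is a minimal sketch of the kind of ad-hoc aggregate these tools target, written in plain Python rather than against Hadoop, Hive, or Pig. The log records, field names, and helper functions are all hypothetical and exist only for illustration; the SQL in the comment is roughly how a relational system would express the same question.

```python
from collections import defaultdict

# Hypothetical page-view log records: (user, url). Illustration only.
LOG = [
    ("alice", "/home"),
    ("bob", "/home"),
    ("alice", "/about"),
    ("carol", "/home"),
]


def map_phase(records):
    """Emit a (url, 1) pair for every record."""
    for _user, url in records:
        yield url, 1


def shuffle_phase(pairs):
    """Group emitted values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def count_views(records):
    # Roughly equivalent to: SELECT url, COUNT(*) FROM log GROUP BY url;
    return reduce_phase(shuffle_phase(map_phase(records)))


if __name__ == "__main__":
    print(count_views(LOG))
```

The three phases run in a single process here; the appeal of the MapReduce model is that the map and reduce steps can be spread across many machines. A query layer in the spirit of Hive or Pig lets an analyst write something close to the one-line question and leaves the generation of the map and reduce steps to the system.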
There is a debate brewing among data systems cognoscenti as to the best way to do data analysis at this scale. The old guard in the Enterprise IT camp tends to favor relational databases and the SQL language, while the web upstarts have rallied around the MapReduce programming model popularized at Google, and cloned in open source as Apache Hadoop. Hadoop is in wide use at companies like Yahoo! and Facebook, and gets a lot of attention in tech blogs as the next big open source project. But if you mention Hadoop in a corporate IT shop you are often met with blank stares — SQL is ubiquitous in those environments. There is still a surprising disconnect between these developer communities, but I expect that to change over the next year or two.
We are at the beginning of what I call The Industrial Revolution of Data. We’re not quite there yet, since most of the digital information available today is still individually “handmade”: prose on web pages, data entered into forms, videos and music edited and uploaded to servers. But we are starting to see the rise of automatic data generation “factories” such as software logs, UPC scanners, RFID, GPS transceivers, video and audio feeds. These automated processes can stamp out data at volumes that will quickly dwarf the collective productivity of content authors worldwide.
Meanwhile, disk capacities are growing exponentially, so the cost of archiving this data remains modest. And there are plenty of reasons to believe that this data has value in a wide variety of settings. The last step of the revolution is the commoditization of data analysis software, to serve a broad class of users.