The taken-for-granted ability to enter a word or a phrase into Google Search and have it return, in seconds, the most relevant results from the largest set of information in the history of civilization is one of the unacknowledged wonders of the modern world. It’s a world-changing technology, the result of ingenious software developed by Google’s engineers. Although the actual search and ranking algorithms are a closely guarded secret, many of the principles that underlie the back-end technology have been published as white papers and taken up by projects that have created open-source implementations, available for anyone to use. This week we’re going to take a look at some of those projects.
Dremel/Drill
In the original white paper, Dremel is described as “a scalable, interactive ad-hoc query system for analysis of read-only nested data.” According to Google, Dremel is capable of running SQL-like queries over petabytes of data in seconds, and Google’s BigQuery service is purportedly based on it. Dremel has an advantage over other technologies like Hadoop – which we’ll look at below – in that it doesn’t run queries as batch jobs, so there is far less lag between query and result, making Dremel better suited to real-time, interactive analysis.
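Much of that speed comes from how Dremel lays data out: the white paper describes storing nested, repeated fields column by column, so a query that touches one field reads only that field’s values rather than whole records. Here’s a toy sketch of the idea in plain Java – it has nothing to do with Dremel’s actual implementation, and the record shape is entirely made up – showing how extracting a single nested path amounts to reading one flat column:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // A toy nested record: a document with repeated names, each carrying
    // repeated language codes. Purely hypothetical shape for illustration.
    class Name {
        List<String> languageCodes;
        Name(String... codes) { this.languageCodes = Arrays.asList(codes); }
    }

    class Document {
        List<Name> names;
        Document(Name... names) { this.names = Arrays.asList(names); }
    }

    public class ColumnarSketch {
        public static void main(String[] args) {
            List<Document> records = Arrays.asList(
                new Document(new Name("en-us", "en"), new Name("en-gb")),
                new Document(new Name("fr"))
            );

            // A row-oriented system would scan whole records to answer
            // something like "SELECT COUNT(Name.Language.Code) FROM records".
            // A columnar store keeps all Name.Language.Code values together,
            // so that query reads just this one flat column:
            List<String> codeColumn = new ArrayList<>();
            for (Document d : records)
                for (Name n : d.names)
                    codeColumn.addAll(n.languageCodes);

            System.out.println("values in column: " + codeColumn.size()); // 4
        }
    }

Dremel also stores “repetition” and “definition” levels alongside each value so that the original records can be reassembled, but this column-at-a-time access pattern is the heart of why its ad-hoc scans over huge data sets can come back in seconds.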
In recent days, the Hadoop vendor MapR has submitted an implementation of Dremel’s ideas, known as Drill, to the Apache Incubator program. Development is at a fairly early stage and APIs have yet to be established, but it’s expected that, in time, Apache Drill will become an essential tool for the sort of real-time analytics that Big Data is making increasingly necessary.
GFS and MapReduce/Hadoop
Anyone following the tech news in recent months can’t fail to be familiar with Hadoop. It’s been massively successful: it’s been predicted that the market for Hadoop-related software will be worth $813 million within the next four years. Based on the Google technologies GFS (Google File System) and MapReduce, Hadoop allows the processing of huge data sets across many servers. Like Drill, Hadoop is an Apache project, and it’s used by many corporations with large-scale data processing requirements, such as Yahoo (one of its major developers) and Facebook, which claims to have the world’s largest Hadoop cluster, storing 30 petabytes of data.
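The programming model Hadoop borrows from MapReduce is deliberately simple: a map function turns each input record into key-value pairs, a reduce function aggregates all the values that share a key, and the framework takes care of spreading the work across servers. The canonical example is counting words. Here’s a minimal sketch against Hadoop’s Java API – the input and output paths come from the command line, and the job name is arbitrary:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: emit (word, 1) for every word in this mapper's input split.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts for each word; the framework groups by key.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Because each mapper runs independently over its own chunk of the distributed file system, the same job scales from a single machine to thousands. The trade-off, as noted above, is that it runs as a batch job – you submit it and wait – rather than answering queries interactively.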
BigTable/Apache Accumulo
Accumulo is a column-oriented key-value store originally developed by the National Security Agency. Accumulo is based on Google’s BigTable, which Google uses in-house for data storage. Built on top of Hadoop and various other Apache projects, including ZooKeeper and Thrift, Accumulo is written in Java.
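Like BigTable, Accumulo stores sorted key-value pairs in which the key combines a row, a column family and qualifier, and a timestamp. Its signature addition – unsurprising given its origins – is a visibility label on every cell, controlling who may read it. Here’s a minimal write sketch against Accumulo’s Java client API; the instance name, ZooKeeper host, table, and credentials are all placeholders, and the method signatures follow the 1.4-era client:

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.ColumnVisibility;
    import org.apache.hadoop.io.Text;

    public class AccumuloWriteSketch {
        public static void main(String[] args) throws Exception {
            // Connect via ZooKeeper; instance name, host, and credentials
            // are placeholders for a real deployment's values.
            ZooKeeperInstance instance =
                new ZooKeeperInstance("myinstance", "zkhost:2181");
            Connector connector =
                instance.getConnector("username", "password".getBytes());

            // Buffer up to 1 MB of changes, flush at least once a minute,
            // and write with 2 threads.
            BatchWriter writer =
                connector.createBatchWriter("mytable", 1000000L, 60000L, 2);

            // A Mutation holds all changes to a single row,
            // which Accumulo applies atomically.
            Mutation m = new Mutation(new Text("row1"));
            m.put(new Text("family"), new Text("qualifier"),
                  new ColumnVisibility("public"),
                  new Value("some data".getBytes()));

            writer.addMutation(m);
            writer.close(); // flushes any remaining buffered mutations
        }
    }

Reads work along the same lines: a Scanner is created from the same Connector with a set of authorizations, and cells whose visibility labels don’t match those authorizations are simply never returned.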
Of course, there are many more technologies that have been produced by Google, or based on Google’s concepts, but these are three of the most important for large-scale distributed data storage and querying. Even as Hadoop use grows within the enterprise, its underlying technologies have already been superseded within Google. No doubt, as time goes on, those companies that need to process data at a scale that was, until recently, unimaginable will continue to innovate and create better software. It’s an exciting time to be in the data business.