On Wednesday I attended STF to learn about Big Data. It was a packed house again, and I nearly didn’t get a seat (thanks Tiger).
First on deck was Avkash Chauhan (who also spoke at the October 2012 STF), a Senior Engineer with Windows Azure and HDInsight at Microsoft, whose talk was titled “Data Visualization: Tools and Techniques”. He demonstrated how visualizations can be used to understand large data sets. He started out with a visualization showing tweets related to the Egyptian Revolution. He explained how the various connections and distribution of nodes could be used to visually understand what was happening on Twitter at that time. I can’t find the exact graphic he used but this video is fairly similar:
Chauhan went on to demonstrate how to use two different open-source visualization tools, NodeXL (an Excel plugin) and Gephi, to create similar visualizations. Both tools allow you to import public data from sources like YouTube or Twitter, or use your own data. You can then toggle hundreds of controls to create exactly the right kind of visualizations to help you understand your data.
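To give a feel for the kind of "own data" you might feed into tools like these, here is a minimal Python sketch (not something shown in the talk). It builds a tiny who-mentions-whom edge list, counts connections per node, and writes the edges out as CSV. The `mentions` data, the function names, and the `Source`/`Target` column headers are all my own assumptions for illustration; check each tool's import documentation for the format it actually expects.

```python
import csv
import io
from collections import Counter

# Hypothetical (author, mentioned_user) pairs, made up for illustration.
mentions = [
    ("alice", "bob"),
    ("carol", "bob"),
    ("dave", "bob"),
    ("bob", "alice"),
    ("erin", "carol"),
]


def degree_counts(edges):
    """Count how many edges touch each node (incoming plus outgoing)."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return degree


def to_edge_csv(edges):
    """Serialize the edges as CSV text with assumed Source/Target headers."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    writer.writerows(edges)
    return buffer.getvalue()


degree = degree_counts(mentions)
# The most connected node is the "hub" that would stand out in a graph layout.
hub, hub_degree = degree.most_common(1)[0]
print(hub, hub_degree)
print(to_edge_csv(mentions))
```

Nodes with a high count here are the ones that would show up as large, central hubs in a graph layout, which is roughly the kind of pattern the visualizations make obvious at a glance.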
The second speaker, Arpit Gupta, spoke about “Philosophy of Big Data” (which you can watch on YouTube). In many ways this was your typical subject overview talk about Big Data, but by using humour and really great examples Gupta made this far more valuable and interesting than most overview presentations.
In one slide he listed sixteen different industries that involve applications of Big Data. He asked the audience to pick two of them to talk about. He proceeded to give some really interesting examples from the worlds of Fraud & Security and Search Quality, but it seemed like he could easily have talked about the other fourteen topics as well. He also brought an interesting perspective to the table, explaining that Big Data was really a new marketing term for something that had been around a long time. The main change (besides the invention of the term) is that only recently has it been cheap enough to do really good analysis on all that data.
Next we heard “Big Data – 10x Better” from Ying Li, the Chief Scientist and co-founder of Concurix. The “10x Better” refers to Concurix’s main goal: to create an operating system specifically engineered for data centers, one that will deliver at least a 10x price/performance improvement over current Linux and Windows servers. They are doing this by focusing on improving the usage of multiple cores. According to Li, current operating systems and software platforms are not very well suited to leverage multi-core, with net performance actually getting worse beyond about 8 cores. Concurix believes they can radically improve that situation.
To that end, Li has been doing research on various machine and OS configurations by benchmarking calculations of the Mandelbrot Set. She gave some very detailed information and visualizations showing the behavior of a multi-core system in terms of core utilization, garbage collection, etc. During Q&A I had to ask whether the Mandelbrot Set was a good place to start, given that it is such an embarrassingly parallel problem, and whether that meant that Concurix did not care about figuring out how to improve multi-core use for less parallel problems (which is a much more difficult problem). She responded that the Mandelbrot Set was just a starting point and that they definitely did intend to work on improvements for less parallel types of software problems.
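For anyone wondering why I called the Mandelbrot Set embarrassingly parallel, here is a minimal Python sketch (my own, not Concurix's benchmark). Every pixel, and so every row, is computed without looking at any other, so the work can be handed to any number of workers with no coordination. The grid bounds, the function names, and the `map_fn` parameter are assumptions I made for illustration.

```python
def escape_time(c, max_iter):
    """Return how many iterations z = z*z + c takes to leave the radius-2 disk."""
    z = 0j
    for i in range(max_iter):
        if abs(z) > 2.0:
            return i
        z = z * z + c
    return max_iter  # never escaped: treated as inside the set


def render_row(task):
    """Compute one row of the image; it depends on nothing but its own inputs."""
    y, width, height, max_iter = task
    imag = -1.5 + 3.0 * y / (height - 1)
    return [
        escape_time(complex(-2.0 + 3.0 * x / (width - 1), imag), max_iter)
        for x in range(width)
    ]


def render(width, height, max_iter=50, map_fn=map):
    """Render the grid row by row; map_fn decides how rows are distributed."""
    tasks = [(y, width, height, max_iter) for y in range(height)]
    return list(map_fn(render_row, tasks))


grid = render(9, 7, max_iter=30)
for row in grid:
    print("".join("#" if n == 30 else "." for n in row))
```

Because the rows are independent, swapping the built-in `map` for the `map` method of a `concurrent.futures.ProcessPoolExecutor` (created under an `if __name__ == "__main__":` guard) spreads them across cores without touching the math. Less parallel problems have no such easy split, which was the point of my question.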
Finally Jim Caputo, Engineering Manager for BigQuery at Google, gave his talk “Big Data for the Masses: How We Opened Up the Doors to Google’s Dremel”. He opened with some great stats regarding Big Data at Google, such as the fact that YouTube currently has 72 hours of video uploaded every minute. He went on to talk about how BigQuery provides an SQL-like ad-hoc query interface over these kinds of very large data sets. As an example he ran a query against one of the sample BigQuery data sets (which is apparently not publicly available yet). This query across 14+ billion rows, 1TB of data in 12 tables, returned its results in just 30 seconds.
Caputo went on to explain the technology behind BigQuery, called Dremel, and how it differed from Bigtable and MapReduce. Dremel achieves a much lower latency by using a completely column-oriented storage approach and a totally diskless data flow. This means that queries involving just a few columns need only touch storage on machines where those columns are stored. It also means that machines involved later on in processing the query won’t be doing any disk I/O.
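The column-oriented idea can be illustrated with a toy Python sketch. This is emphatically not how Dremel is implemented; the `ColumnStore` class, its methods, and the sample rows are all hypothetical. It only shows the basic effect described in the talk: when each column is stored separately, a query that uses two columns never reads the others.

```python
class ColumnStore:
    """Toy column store: one list per column, plus a record of columns read."""

    def __init__(self, rows):
        self.columns = {}
        self.columns_read = set()
        # Assumes every row has the same keys, so all columns stay aligned.
        for row in rows:
            for name, value in row.items():
                self.columns.setdefault(name, []).append(value)

    def read(self, name):
        """Fetch one column and note that it was touched."""
        self.columns_read.add(name)
        return self.columns[name]

    def sum_where(self, value_col, filter_col, wanted):
        """Sum value_col over the rows where filter_col equals wanted."""
        values = self.read(value_col)
        filters = self.read(filter_col)
        return sum(v for v, f in zip(values, filters) if f == wanted)


# Hypothetical sample rows, made up for illustration.
rows = [
    {"video_id": "v1", "title": "Cats", "country": "US", "views": 120},
    {"video_id": "v2", "title": "Dogs", "country": "CA", "views": 45},
    {"video_id": "v3", "title": "Birds", "country": "US", "views": 300},
    {"video_id": "v4", "title": "Fish", "country": "DE", "views": 80},
]

store = ColumnStore(rows)
us_views = store.sum_where("views", "country", "US")
print(us_views, sorted(store.columns_read))
```

In a row-oriented layout the same query would have to scan every full row, including `video_id` and `title`, just to pick out two fields. With billions of rows and wide tables, skipping the unused columns is a large part of where the speed comes from.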
Data can be uploaded directly and quickly into BigQuery. An online console can be used to query this data immediately. Developers can also create their own interfaces using the BigQuery API directly or via one of the many available client libraries.
Finally we heard from a representative of this month’s sponsor, ComputeNext. He described ComputeNext as “Expedia for Cloud services”. They compare IaaS providers on performance, pricing, availability, and other metrics, and provide ways of easily accessing those providers. This could be a very useful service for companies requiring a varying range of services, especially if spread across many countries where available offerings might differ.