THE NEXT FIVE VIDEOS
Mark Miller SOLR: PEAK PERFORMANCE
Andrzej Bialecki DISTRIBUTED SEARCH OF HETEROGENEOUS COLLECTIONS WITH SOLR
Nick Burch THE OTHER APACHE TECHNOLOGIES YOUR BIG DATA SOLUTION NEEDS!
Chris Wensel COMMON MAPREDUCE PATTERNS
Stefan Groschupf HADOOP: A REALITY CHECK
Interview: Chris Wensel
by Matthias Ringwald
After his talk on common MapReduce patterns at Berlin Buzzwords 2011, I had the opportunity to interview Chris Wensel. He is the author of Cascading, an alternative API to MapReduce. As a co-founder of Scale Unlimited, he has trained and mentored companies like Apple, Sun Microsystems and Hewlett-Packard.
Q: You have been coaching several companies on using Hadoop. Which scenarios are typical for Hadoop, and when do you want to use MapReduce?
A: There are several reasons for using Hadoop. One is that the data is so big that it does not fit anywhere else. The second is the efficiency of the infrastructure. We have a simple distributed file system and a unified execution framework, and the value there is that execution and storage are collocated, so you can fully utilize the cores and the disks simultaneously and gain considerable cost efficiency. Anyone with a big enough project can work with Hadoop. You see people who adopted Hadoop because they had no other choice, but once they had it they saw that the efficiency was good. Sometimes one problem does not justify the use of Hadoop, but very often there are two dozen other problems similar to it. All of these problems together justify the use of Hadoop, as it offers this efficiency and simplicity. A side effect is that if a disk or a machine fails, the process still runs. So anything you can represent in the abstraction of Hadoop is suitable. Hadoop also makes sense for ETL workloads. Often data is loaded from Teradata, processed in Hadoop and then loaded back into Teradata, because the very expensive Teradata machines should not be overutilized. In this use case it is not about storing data but about taking load off more expensive machines. So we are seeing a wide range of applications.
Q: Sometimes it seems like Hadoop is a kind of hype and everybody wants to use it. Do you also have some examples where you would not want to use Hadoop?
A: If you need quick response times or if you ask the same question over and over again, I strongly recommend getting Greenplum, Teradata or any other database. If you need responses that quickly, a database is the right tool. Hadoop makes more sense for things like scoring data, upfront cleanup in ETL or finding bugs in data.
Q: You have mentioned Cascading as one of your projects. Can you please describe it a bit more?
A: Cascading started in 2007. It is a Java-based API that implements common MapReduce patterns you would otherwise implement by hand, like joins or aggregations, as well as optimizations like partial aggregations. So if you realize you need partial aggregation, you can very easily add it to your job. Cascading also allows you to write unit tests for the functionality of your code. If you want to test your functions, you do not care about integration. You just want to make sure your calculation is right, and you do not care where your data comes from or goes to. Integration can then be tested later in the development process.
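To make the idea of partial aggregation concrete, here is a minimal sketch in plain Python. It does not use the Cascading API or Hadoop. All function names are hypothetical and chosen only for illustration. Each simulated mapper sums word counts locally before emitting them, so the reduce step receives one partial count per distinct word per input split instead of one record per occurrence.

```python
from collections import Counter, defaultdict


def map_with_partial_aggregation(lines):
    """Map step with map-side partial aggregation (hypothetical helper).

    Rather than emitting one (word, 1) pair per occurrence, the mapper
    sums counts locally and emits one (word, partial_count) pair per
    distinct word in its input split.
    """
    partial = Counter()
    for line in lines:
        for word in line.lower().split():
            partial[word] += 1
    return list(partial.items())


def shuffle(mapper_outputs):
    """Group the partial counts from all mappers by word."""
    grouped = defaultdict(list)
    for output in mapper_outputs:
        for word, count in output:
            grouped[word].append(count)
    return grouped


def reduce_counts(grouped):
    """Reduce step: sum the partial counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(splits):
    """Run the simulated job over a list of input splits."""
    mapper_outputs = [map_with_partial_aggregation(split) for split in splits]
    return reduce_counts(shuffle(mapper_outputs))


if __name__ == "__main__":
    splits = [["hadoop hadoop solr"], ["hadoop nutch"]]
    print(word_count(splits))
```

Because each step is an ordinary function, the calculation can be checked in isolation with a small in-memory input, without caring where the data comes from or goes to, which is the kind of unit test described above.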
Q: How do you see the future of Hadoop? Do you think it will have a much greater impact on data warehousing and data analysis?
A: There are many implementations relying on the same premise: a simple distributed file system and simple distributed execution. Then you will have tools that can work with both. So it will become more canonical and there will be multiple variants. I guess we will see minor improvements on the file system layer, and we will see minor improvements on the MapReduce layer, like MapReduceReduce.
Interview: Stefan Groschupf, Datameer
by Christoph Nagel
Christoph: Thank you for the refreshing talk at Berlin Buzzwords. Please tell us a little bit about yourself and your involvement with Hadoop.
Stefan Groschupf: Fascinated by the political idea of free information and search, I was one of the first co-developers of Nutch in 2003. Before that, I developed a search engine for the University of Halle and launched a startup in order to avoid the German military service. For Nutch I implemented a plugin system because I believe that extensible systems have a greater chance of becoming successful, and the result can be seen in the wide adoption of Nutch. After the "birth" of Hadoop I helped with coding and also contributed the logos for Hadoop as well as Nutch. In the meantime I have contributed to and started a few startups based on Hadoop, like Scale Unlimited, 101Tec and now Datameer. The last one builds BI applications on Hadoop with a strong focus on business users and is backed by two venture capital firms.
Christoph: The title of your talk was "Hadoop: A reality check". Please give us a short summary.
Stefan Groschupf: As we can see, technology changes completely nearly every 15 years. But the amount of data grows constantly. Data volume outpaces Moore's law. On the other hand, 75% of data is unstructured, and there are only a few products available that can handle this trend. Traditional systems like SQL databases get into trouble because optimization data structures like B-trees can't always be distributed, and so the systems don't scale, but luckily Hadoop addresses these problems.