THE NEXT FIVE VIDEOS
Mark Miller SOLR: PEAK PERFORMANCE
Andrzej Bialecki DISTRIBUTED SEARCH OF HETEROGENEOUS COLLECTIONS WITH SOLR
Nick Burch THE OTHER APACHE TECHNOLOGIES YOUR BIG DATA SOLUTION NEEDS!
Chris Wensel COMMON MAPREDUCE PATTERNS
Stefan Groschupf HADOOP: A REALITY CHECK
Mark Miller SOLR: PEAK PERFORMANCE from ntc GmbH on Vimeo.
Andrzej Bialecki DISTRIBUTED SEARCH OF HETEROGENEOUS COLLECTIONS WITH SOLR from ntc GmbH on Vimeo.
Nick Burch THE OTHER APACHE TECHNOLOGIES YOUR BIG DATA SOLUTION NEEDS! from ntc GmbH on Vimeo.
Chris Wensel COMMON MAPREDUCE PATTERNS from ntc GmbH on Vimeo.
Interview: Chris Wensel 
by Matthias Ringwald
After his talk on common MapReduce patterns at Berlin Buzzwords 2011, I
had the opportunity to interview Chris Wensel. He is the author of
Cascading, an alternative API to MapReduce. As a co-founder of Scale
Unlimited, he has trained and mentored companies like Apple, Sun
Microsystems and Hewlett-Packard.
Q: You have been coaching several companies on using Hadoop. What
scenario is typical for Hadoop, and when do you want to use MapReduce?
A: There are several reasons for using Hadoop. One is that the data is
so big that it does not fit anywhere else. The second is the efficiency
of the infrastructure. We have a simple distributed file system and a
unified execution framework, and the value there is that execution and
storage are co-located, so you can fully utilize the cores and the
disks simultaneously and get significant cost efficiency.
Anyone with a project big enough can work with Hadoop. You see people
who adopted Hadoop because they had no other choice, but once they had
it they saw that the efficiency was good. Sometimes one problem does
not justify the use of Hadoop, but very often there are two dozen other
problems similar to it. Together, these problems justify the use of
Hadoop, as it offers this efficiency and simplicity. A side effect is
that if a disk or a machine fails, the process still runs. So anything
you can represent in the abstraction of Hadoop is suitable. Hadoop also
makes sense for ETL loads. Often data is loaded from Teradata,
processed in Hadoop and then loaded back into Teradata, because the
very expensive Teradata machines should not be overutilized. In this
use case it is not about storing data but about taking load off more
expensive machines. So we are seeing a wide range of applications.
Q: Sometimes it seems like Hadoop is some kind of hype and everybody
wants to use it. Do you also have some examples where you do not want
to use Hadoop?
A: If you need quick response times or if you ask the same question over
and over again, I strongly recommend getting Greenplum, Teradata or
any other database. When responses have to be that quick, a database is
the right tool. Hadoop makes more sense for things like scoring data,
upfront cleanup in ETL or finding bugs in data.
Q: You have mentioned Cascading as one of your projects. Can you please
describe it a bit more?
A: Cascading started in 2007. It is a Java-based API that implements
common MapReduce patterns you would otherwise implement by hand, such
as joins or aggregations, as well as optimizations such as partial
aggregation. So if you realize you need partial aggregation, you can
very easily add it to your job. Cascading also allows you to write
unit tests for the functionality of your code. If you want to test
your functions, you do not care about integration; you just want to
make sure your calculation is right, and you do not care where your
data comes from or goes to. Integration can then be tested later in the
development process.
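To make the idea of partial aggregation more concrete, here is a minimal sketch in plain Python. It does not use the Cascading API or Hadoop; all function names are hypothetical and chosen only for illustration. Each simulated mapper pre-sums its counts locally, so a word is emitted at most once per input split before the reduce step adds up the partial results.

```python
from collections import Counter, defaultdict


def map_with_partial_aggregation(lines):
    """Map side: count words locally so each word is emitted once per split."""
    local_counts = Counter()
    for line in lines:
        local_counts.update(line.split())
    return list(local_counts.items())


def shuffle(mapper_outputs):
    """Group the partial counts from all mappers by word."""
    grouped = defaultdict(list)
    for output in mapper_outputs:
        for word, partial in output:
            grouped[word].append(partial)
    return grouped


def reduce_counts(grouped):
    """Reduce side: sum the partial counts for each word."""
    return {word: sum(partials) for word, partials in grouped.items()}


def word_count(splits):
    """Run the simulated map, shuffle and reduce steps over all input splits."""
    mapper_outputs = [map_with_partial_aggregation(split) for split in splits]
    return reduce_counts(shuffle(mapper_outputs))


# Two hypothetical input splits, each a list of text lines.
splits = [
    ["hadoop solr hadoop", "hadoop nutch"],
    ["solr hadoop"],
]
print(word_count(splits))
```

Because the functions only take and return ordinary Python values, they can be unit tested without caring where the data comes from or goes to, which is the kind of testing described in the answer above.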
Q: How do you see the future of Hadoop? Do you think it will gain
much more influence on data warehousing and data analysis?
A: There are many implementations relying on the same premise: a simple
distributed file system and simple distributed execution. Then you
will have tools that can work with both. So it will become more
canonical and there will be multiple variants. I guess we will see minor
improvements on the file system layer, and we will see minor
improvements on the MapReduce layer like MapReduceReduce.
Stefan Groschupf HADOOP: A REALITY CHECK from ntc GmbH on Vimeo.
Interview: Stefan Groschupf, Datameer 
by Christoph Nagel
Christoph: Thank you for the refreshing talk at Berlin Buzzwords. Please tell us a little bit about yourself and your involvement with Hadoop.
Stefan Groschupf: Fascinated by the political idea of free information and search, I was one of the first co-developers of Nutch in 2003. Before that I developed a search engine for the University of Halle and launched a startup in order to avoid German military service. For Nutch I implemented a plugin system because I believe that extensible systems have a greater chance of becoming successful, and the result can be seen in the wide adoption of Nutch. After the "birth" of Hadoop I helped with coding and also contributed the logos for Hadoop as well as Nutch. In the meantime I have contributed to and started a few startups based on Hadoop, such as Scale Unlimited, 101Tec and now Datameer. The last one builds BI applications on Hadoop with a strong focus on business users and is backed by two venture capital firms.
Christoph: The title of your talk was "Hadoop: A reality check". Please give us a short summary.
Stefan Groschupf: As we can see, technology changes completely roughly every 15 years, while the amount of data grows constantly. Data volume outpaces Moore's law. At the same time, 75% of data is unstructured, and there are only a few products available that can handle this trend. Traditional systems like SQL databases get into trouble because optimization data structures like B-trees can't always be distributed, so these systems don't scale. Luckily, Hadoop addresses these problems.