THE NEXT FIVE VIDEOS
Mark Miller SOLR: PEAK PERFORMANCE
Andrzej Białecki DISTRIBUTED SEARCH OF HETEROGENEOUS COLLECTIONS WITH SOLR
Nick Burch THE OTHER APACHE TECHNOLOGIES YOUR BIG DATA SOLUTION NEEDS!
Chris Wensel COMMON MAPREDUCE PATTERNS
Stefan Groschupf HADOOP: A REALITY CHECK
Mark Miller SOLR: PEAK PERFORMANCE from ntc GmbH on Vimeo.
Andrzej Białecki DISTRIBUTED SEARCH OF HETEROGENEOUS COLLECTIONS WITH SOLR from ntc GmbH on Vimeo.
Nick Burch THE OTHER APACHE TECHNOLOGIES YOUR BIG DATA SOLUTION NEEDS! from ntc GmbH on Vimeo.
Chris Wensel COMMON MAPREDUCE PATTERNS from ntc GmbH on Vimeo.
Interview: Chris Wensel
by Matthias Ringwald
After his talk on common MapReduce patterns at Berlin Buzzwords 2011, I
had the opportunity to interview Chris Wensel. He is the author of
Cascading, an alternative API to MapReduce. As a co-founder of Scale
Unlimited, he has trained and mentored companies like Apple, Sun
Microsystems and Hewlett-Packard.
Q: You have been coaching several companies on using Hadoop. What
scenarios are typical for Hadoop, and when do you want to use MapReduce?
A: There are several reasons for using Hadoop. One is that the data is so
big that it does not fit anywhere else. The second is the efficiency of the
infrastructure. We have a simple distributed file system and a unified
execution framework, and the value there is that execution and storage
are co-located, so you can fully utilize the cores and the disks
simultaneously and get big cost efficiencies.
Anyone with a project big enough can work with Hadoop. You see people
who adopted Hadoop because they had no other choice, but once they had
it they saw that the efficiency was good. Sometimes one problem alone
does not justify the use of Hadoop, but very often there are two dozen
other problems similar to it. Taken together, these problems justify the
use of Hadoop, because it offers this efficiency and simplicity. A side
effect is that if a disk or a machine fails, the process still runs. So
anything you can represent in Hadoop's abstraction is suitable. Hadoop
also makes sense for ETL loads. Often data is loaded from Teradata,
processed in Hadoop and then loaded back into Teradata, because the very
expensive Teradata machines should not be overutilized. In this use case
it is not about storing data but about taking load off the more expensive
machines. So we are seeing a wide range of applications.
Q: Sometimes it seems like Hadoop is something of a hype and everybody
wants to use it. Do you also have examples of when you would not want to
use Hadoop?
A: If you need quick response times, or if you ask the same question over
and over again, I strongly recommend getting Greenplum, Teradata or any
other database. So if you need very fast responses, use a database. Hadoop
makes more sense for things like scoring data, upfront cleanup in ETL or
finding bugs in data.
Q: You have mentioned Cascading as one of your projects. Can you please
describe it a bit more?
A: Cascading started in 2007. It is a Java-based API that implements the
common MapReduce patterns you would otherwise write by hand, like joins
and aggregations, as well as optimizations like partial aggregations. So
if you realize you need a partial aggregation, you can very easily add it
to your job. Cascading also allows you to write unit tests for the
functionality of your code. When you test your functions you do not care
about integration; you just want to make sure your calculation is right,
not where your data comes from or goes to. Integration can then be tested
later in the development process.
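To make the contrast with hand-written MapReduce concrete, here is a minimal sketch of a Cascading flow in the spirit of the word-count example from the Cascading 1.x user guide. The HDFS paths, class name and tokenizing regex are placeholders chosen for illustration, not anything taken from the talk, and class names may differ between Cascading versions.

```java
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.Aggregator;
import cascading.operation.Function;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;

public class WordCountFlow {
  public static void main(String[] args) {
    // Source and sink taps: read raw text lines from HDFS, write (word, count)
    // pairs back out. Both paths are placeholders for this sketch.
    Tap source = new Hfs(new TextLine(new Fields("line")), "hdfs:/input/docs");
    Tap sink = new Hfs(new TextLine(new Fields("word", "count")),
        "hdfs:/output/wordcount", SinkMode.REPLACE);

    // Pipe assembly: split each line into words, group by word, count per group.
    // The GroupBy/Every pair is the aggregation pattern you would otherwise
    // hand-code as a map phase emitting (word, 1) and a reduce phase summing.
    Pipe assembly = new Pipe("wordcount");
    Function splitter = new RegexGenerator(new Fields("word"), "\\S+");
    assembly = new Each(assembly, new Fields("line"), splitter);
    assembly = new GroupBy(assembly, new Fields("word"));
    Aggregator count = new Count(new Fields("count"));
    assembly = new Every(assembly, count);

    // Plan and run the flow; Cascading translates the assembly into MapReduce jobs.
    Properties properties = new Properties();
    FlowConnector.setApplicationJarClass(properties, WordCountFlow.class);
    Flow flow = new FlowConnector(properties).connect("word-count", source, sink, assembly);
    flow.complete();
  }
}
```

A hand-written equivalent would need its own Mapper, Reducer and driver classes; here the same pattern is a few lines of pipe assembly, and individual Functions or Aggregators can be unit-tested in isolation, which matches the testing workflow described above.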
Q: How do you see the future of Hadoop? Do you think it will have a much
bigger impact on data warehousing and data analysis?
A: There are many implementations relying on the same premise: a simple
distributed file system and a simple distributed execution framework.
Then you will have tools that can work with both. So it will become more
canonical, and there will be multiple variants. I guess we will see minor
improvements on the file system layer, and minor improvements on the
MapReduce layer, like MapReduceReduce.
Stefan Groschupf HADOOP: A REALITY CHECK from ntc GmbH on Vimeo.
Interview: Stefan Groschupf, Datameer
by Christoph Nagel
Christoph: Thank you for the refreshing talk at Berlin Buzzwords. Please tell us a little bit about yourself and your involvement with Hadoop.
Stefan Groschupf: Fascinated by the political idea of free information and search, I was one of the first co-developers of Nutch in 2003. Before that I developed a search engine for the University of Halle and launched a startup in order to avoid the German military service. For Nutch I implemented a plugin system, because I believe that extensible systems have a greater chance of becoming successful, and the result can be seen in the wide adoption of Nutch. After the "birth" of Hadoop I helped with the coding and also contributed the logos for Hadoop as well as Nutch. Since then I have contributed to and started a few startups based on Hadoop, like Scale Unlimited, 101Tec and now Datameer. The latter builds BI applications on Hadoop with a strong focus on business users and is backed by two venture capital firms.
Christoph: The title of your talk was "Hadoop: A reality check". Please give us a short summary.
Stefan Groschupf: As we can see, technology changes almost completely about every 15 years, but the amount of data grows constantly; data volume is growing faster than Moore's law. On top of that, 75% of data is unstructured, and there are only a few products available that can handle this trend. Traditional systems like SQL databases get into trouble because optimized data structures like B-trees can't always be distributed, so those systems don't scale. Luckily, Hadoop addresses these problems.