Monday Upload - the next videos of the conference

Shevek TIME SERIES OR CAUSAL ANALYSIS WITHOUT LIMITS!
Sylvain Lebresne REAL-TIME ANALYTICS WITH CASSANDRA
Thomas Koch ZOOKEEPER - THE UNSUNG HERO
Kai Voigt Hadoop
Jean Daniel Crayens
Jakob Homan
Chris Male
Friso van Vollenhoven
Devaraj Das MAKING HADOOP SECURE
Uwe Schindler HEAVY COMMITTING: FLEXIBLE INDEXING IN LUCENE 4.0

Shevek TIME SERIES OR CAUSAL ANALYSIS WITHOUT LIMITS! from ntc GmbH on Vimeo.

Interview: Shevek by Tillmann Fiehn
Time Series or Causal Analysis Without Limits!
Shevek works for Karmasphere, a US company that develops and distributes a big data analysis tool built on top of Hadoop's distributed file system and MapReduce engine. One very interesting talk at Berlin Buzzwords this year was "Time Series or Causal Analysis Without Limits!", given by Shevek. It covered the class of time series analysis algorithms and how they can be brought to a MapReduce world. These algorithms involve a data-intensive matching of series over some range of offsets. Block-wise operations make this manageable, and the cost can be estimated as an I/O cost that is a linear function of the file system block size and the matching window size. Here is an excerpt of the interview I had with Shevek about that talk:
Q: Hello Shevek. Thank you for that interesting talk. Although it is about time series analysis on MapReduce, you describe it in terms of a Map-Reduce-Scatter-Gather pattern. You pointed out that you like to think in this pattern. Are you using this pattern internally?
A: Yes, I certainly do when describing things. It is not exported into products yet. Google itself uses it as well. Dividing algorithms into Partition and Combine and all of these is not actually useful. MRSG is a description of a computation in both structure and type, so we can describe all these transformations in the form of a type system.
Q: Can you explain the benefits of a formal type system?
A: Well, in Hadoop the standard scatter operation has been broken up into two halves, Map and Partition, for reasons of acceptability and understandability. The scatter operation takes a value of some type and creates a key-value pair where the key is used for partitioning. Remember the history of MapReduce, which came out of a web crawler. That implementation of scatter-gather implemented all their requirements. It doesn't implement all of mine, though.
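(Editor's illustration, not part of the interview.) The scatter step described in this answer can be sketched in a few lines of Python. This is only a toy, single-process sketch: the names `scatter`, `partition`, `gather` and `run`, the word-count task, and the byte-sum partitioner are all assumptions made for illustration, not Hadoop's API or Shevek's formalism.

```python
from collections import defaultdict


def scatter(record):
    """Take one input value and emit (key, value) pairs.

    The key is what later decides the partition. Splitting a line into
    words is just an illustrative choice of task.
    """
    for word in record.split():
        yield word.lower(), 1


def partition(key, num_partitions):
    """Hypothetical partitioner: a deterministic bucket index from the key."""
    return sum(key.encode("utf-8")) % num_partitions


def gather(key, values):
    """Combine all values that were routed to the same key."""
    return key, sum(values)


def run(records, num_partitions=2):
    """Scatter every record, group by partition and key, then gather."""
    buckets = [defaultdict(list) for _ in range(num_partitions)]
    for record in records:
        for key, value in scatter(record):
            buckets[partition(key, num_partitions)][key].append(value)
    result = {}
    for bucket in buckets:
        for key, values in bucket.items():
            out_key, out_value = gather(key, values)
            result[out_key] = out_value
    return result
```

In this sketch, `scatter` and `partition` are kept as two functions to mirror the Map and Partition halves mentioned in the answer; conceptually they form a single step that decides where each value goes.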
Q: Do you find it easy to map the pattern of MRSG into MapReduce?
A: It is easy to map it to MapReduce. It is not easy to map it to Hadoop's implementation of MapReduce. I recommend the Google FlumeJava paper where you find an algorithm to split these generic descriptions into MapReduce jobs.
Q: How did you come to face problems of the time series class?
A: I was not faced with such problems. I was trying to find Big Data questions which were as different as possible from the thing I had in front of me. Hadoop allows you to write TF/IDF style algorithms very easily. Time series analysis was a good test case for "is MRSG the correct formalism?" and I actually changed the formalism as a result of those studies, as a result of my interpretation of the Kernighan mini language philosophy.
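(Editor's illustration, not part of the interview.) The block-wise matching of two series over a range of offsets, as summarized in the introduction to the talk, might look roughly like the sketch below. It is a minimal sketch under stated assumptions: the score is a plain sum of products, the function names are hypothetical, and the read count in `values_read_per_block` is our own simple estimate rather than the cost model presented in the talk.

```python
def block_partials(x, y, start, block_size, max_offset):
    """Partial sums of x[i] * y[i + k] for i in one block, k in 0..max_offset.

    Only x[start:start + block_size] and the matching slice of y, extended
    by max_offset values, are touched.
    """
    end = min(start + block_size, len(x))
    partial = [0.0] * (max_offset + 1)
    for i in range(start, end):
        for k in range(max_offset + 1):
            if i + k < len(y):
                partial[k] += x[i] * y[i + k]
    return partial


def offset_scores(x, y, block_size, max_offset):
    """Process the series block by block and add up the per-block partials."""
    totals = [0.0] * (max_offset + 1)
    for start in range(0, len(x), block_size):
        partial = block_partials(x, y, start, block_size, max_offset)
        totals = [t + p for t, p in zip(totals, partial)]
    return totals


def values_read_per_block(block_size, max_offset):
    """Illustrative read count: one block of x, one block of y, plus the overlap.

    An assumption for this sketch; it grows linearly in both the block
    size and the matching window.
    """
    return 2 * block_size + max_offset
```

Because each block only needs its own data plus an overlap of `max_offset` values, the per-block partial sums can be computed independently and combined afterwards, which is what makes the computation fit a scatter-gather style.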

Sylvain Lebresne REAL-TIME ANALYTICS WITH CASSANDRA from ntc GmbH on Vimeo.

Thomas Koch ZOOKEEPER - THE UNSUNG HERO from ntc GmbH on Vimeo.

Kai Voigt Hadoop from ntc GmbH on Vimeo.

Jean Daniel Crayens from ntc GmbH on Vimeo.

Jakob Homan from ntc GmbH on Vimeo.

Chris Male from ntc GmbH on Vimeo.

Friso van Vollenhoven from ntc GmbH on Vimeo.

Devaraj Das MAKING HADOOP SECURE from ntc GmbH on Vimeo.

Uwe Schindler HEAVY COMMITTING: FLEXIBLE INDEXING IN LUCENE 4.0 from ntc GmbH on Vimeo.

jge's blog

Berlin Buzzwords 2011 is a conference for developers and users of open source software projects, focussing on the issues of scalable search, data analysis in the cloud and NoSQL databases. Berlin Buzzwords presents more than 30 talks and presentations by international speakers on the three tags "search", "store" and "scale".