Monday Upload - the next videos of the conference
Shevek TIME SERIES OR CAUSAL ANALYSIS WITHOUT LIMITS!
Sylvain Lebresne REAL-TIME ANALYTICS WITH CASSANDRA
Thomas Koch ZOOKEEPER - THE UNSUNG HERO
Kai Voigt Hadoop
Jean Daniel Crayens
Jakob Homan
Chris Male
Friso van Vollenhoven
Devaraj Das MAKING HADOOP SECURE
Uwe Schindler HEAVY COMMITTING: FLEXIBLE INDEXING IN LUCENE 4.0
Interview: Shevek by Tillmann Fiehn
Time Series or Causal Analysis Without Limits!
Shevek works for Karmasphere, a US company that develops and distributes a big data analysis tool built on top of Hadoop's distributed file system and MapReduce engine. One very interesting talk at bbuzz this year was "Time Series or Causal Analysis Without Limits!", given by Shevek. It covered the class of time series analysis algorithms and how they can be brought into the MapReduce world. These algorithms involve a data-intensive matching of series over some range of offsets. This can be handled with block-wise operations, and the cost can be estimated as an I/O cost that is a linear function of the file system block size and the matching window size. Here is an excerpt of the interview I had with Shevek about that talk:
Q: Hello Shevek. Thank you for that interesting talk. Although it is about time series analysis on MapReduce, you describe it in terms of a Map-Reduce-Scatter-Gather (MRSG) pattern. You pointed out that you like to think in this pattern. Do you use this pattern internally?
A: Yes, I certainly do when describing computations. It has not been exported into products yet. Google itself uses it as well. Dividing algorithms into Partition and Combine and all of these is not actually useful. MRSG is a description of a computation in both structure and type, so we can describe all these transformations in the form of a type system.
Q: Can you explain the benefits of a formal type system?
A: Well, in Hadoop the standard operation of scatter has been broken up into the two halves of Map and Partition for reasons of acceptability and understandability. The scatter operation takes a value of some type and creates a key-value pair where the key is used for partitioning. Remember the history of MapReduce, which came out of a web crawler. That implementation of scatter-gather implemented all their requirements. It doesn't implement all of mine, though.
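To make the typed view of scatter and gather a little more concrete, here is a minimal Python sketch. It is not taken from the talk or from any Karmasphere or Hadoop API; the names `Scatter`, `Gather`, `scatter_gather`, `scatter_words` and `gather_count` are hypothetical, and the in-memory grouping only stands in for the partitioning a real cluster would do.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

V = TypeVar("V")  # type of an input value
K = TypeVar("K")  # type of the partitioning key
W = TypeVar("W")  # type of a scattered value
R = TypeVar("R")  # type of a gathered result

# Scatter: one input value becomes zero or more (key, value) pairs.
Scatter = Callable[[V], Iterable[Tuple[K, W]]]
# Gather: all values that share a key are combined into one result.
Gather = Callable[[K, List[W]], R]


def scatter_gather(
    inputs: Iterable[V],
    scatter: Callable[[V], Iterable[Tuple[K, W]]],
    gather: Callable[[K, List[W]], R],
) -> Dict[K, R]:
    """Group scattered pairs by key, then gather each group (single process only)."""
    groups: Dict[K, List[W]] = defaultdict(list)
    for value in inputs:
        for key, out in scatter(value):
            groups[key].append(out)  # the key decides which group a pair lands in
    return {key: gather(key, values) for key, values in groups.items()}


def scatter_words(line: str) -> Iterable[Tuple[str, int]]:
    """Hypothetical scatter: emit (word, 1) for every word in a line."""
    for word in line.split():
        yield word.lower(), 1


def gather_count(word: str, ones: List[int]) -> int:
    """Hypothetical gather: count the occurrences of one word."""
    return sum(ones)


counts = scatter_gather(["to be or not", "to do"], scatter_words, gather_count)
```

In this sketch the type of the scatter function alone states what the key and value types are, which is the kind of information a formal type system over such transformations could check.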
Q: Do you find it easy to map the pattern of MRSG into MapReduce?
A: It is easy to map it to MapReduce. It is not easy to map it to Hadoop's implementation of MapReduce. I recommend the Google FlumeJava paper, where you will find an algorithm for splitting these generic descriptions into MapReduce jobs.
Q: How did you come to be faced with problems of the time series class?
A: I was not faced with such problems. I was trying to find Big Data questions which were as different as possible from the thing I had in front of me. Hadoop allows you to write TF/IDF-style algorithms very easily. Time series analysis was a good test case for "is MRSG the correct formalism?" and I actually changed the formalism as a result of that study, and as a result of my interpretation of the Kernighan mini-language philosophy.
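For readers unfamiliar with this class of problems, the matching of two series over a range of offsets mentioned in the introduction can be sketched block-wise as follows. This is only an illustration under stated assumptions, not the algorithm or the cost model from the talk: the function name `block_offset_scores`, the product-sum score and the element counter are all hypothetical. The point it shows is that each block of one series needs its own span of the other series plus `max_offset` extra elements, so the data read per block grows linearly with the block size and the window size.

```python
from typing import Dict, List, Tuple


def block_offset_scores(
    x: List[float], y: List[float], block_size: int, max_offset: int
) -> Tuple[Dict[int, float], int]:
    """Sum x[i] * y[i + d] for every offset d in 0..max_offset, one block at a time.

    Also returns how many elements were read, as a stand-in for I/O cost.
    """
    scores = {d: 0.0 for d in range(max_offset + 1)}
    elements_read = 0
    for start in range(0, len(x), block_size):
        x_block = x[start:start + block_size]
        # The y slice reaches max_offset elements past the end of the x block.
        y_block = y[start:start + block_size + max_offset]
        elements_read += len(x_block) + len(y_block)
        for i, xv in enumerate(x_block):
            for d in range(max_offset + 1):
                if i + d < len(y_block):  # only fails near the end of y
                    scores[d] += xv * y_block[i + d]
    return scores, elements_read


# Hypothetical data: y repeats the events in x two steps later.
x = [0, 1, 0, 0, 1, 0, 0, 0]
y = [0, 0, 0, 1, 0, 0, 1, 0]
scores, elements_read = block_offset_scores(x, y, block_size=4, max_offset=3)
best_offset = max(scores, key=scores.get)
```

Here `best_offset` is the offset with the highest score, and processing the data in blocks of 4 gives the same scores as processing it in a single block of 8.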