Sean Owen, Director of Data Science at Cloudera

Biography: Sean Owen

Sean is Director of Data Science at Cloudera in London. Before Cloudera, he founded Myrrix Ltd (now known as the Oryx project) to commercialize large-scale real-time recommenders on Apache Hadoop. He is an Apache Spark contributor and was a committer and VP for Apache Mahout. He is co-author of Advanced Analytics with Spark and Mahout in Action. Previously, Sean was a senior engineer at Google.

Twitter: @sean_r_owen

Presentation: A Taste of Random Decision Forests on Apache Spark

Track: Hadoop / Time: Friday 10:20 - 11:10 / Location: Veilingzaal

Apache Spark continues to gain momentum as the new processing paradigm for Apache Hadoop, and it gives the data scientist a lot to like: it is natively distributed, provides an interactive REPL, offers Python APIs in addition to its native Scala API, and includes a library of machine learning algorithms, MLlib.

Spark 1.2 includes an implementation of random decision forests, an important and popular ensemble classifier/regressor algorithm. This talk will introduce Spark, Scala and random decision forests to the curious, and demonstrate the process of analyzing a real-world data set with them. The session will cover loading data and understanding the data set, and introduce ideas like training and test set evaluation, ensemble methods, feature types, and supporting concepts like impurity and entropy.
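To give a feel for the supporting concepts mentioned above, here is a minimal sketch of two common impurity measures, entropy and Gini impurity, and of the information gain a decision tree uses to score a candidate split. It is plain Python with no Spark dependency, and it is not the code from the talk or MLlib's implementation; the function names (`entropy`, `gini`, `information_gain`) and the toy labels are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels (0.0 if empty)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels (0.0 if empty)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def information_gain(parent, left, right, impurity=entropy):
    """Impurity of the parent minus the size-weighted impurity of its two children."""
    total = len(parent)
    weighted = (len(left) / total) * impurity(left) \
        + (len(right) / total) * impurity(right)
    return impurity(parent) - weighted


# Hypothetical toy data: 4 "yes" and 4 "no" examples at a tree node.
parent = ["yes"] * 4 + ["no"] * 4

# A mediocre split: each side is still a 3-to-1 mix of classes.
left = ["yes"] * 3 + ["no"]
right = ["yes"] + ["no"] * 3

print("parent entropy:", entropy(parent))
print("parent gini:", gini(parent))
print("gain of mixed split:", information_gain(parent, left, right))
print("gain of perfect split:", information_gain(parent, ["yes"] * 4, ["no"] * 4))
```

A split that separates the classes more cleanly leaves purer children and so yields a larger gain. A decision tree greedily picks the split with the highest gain at each node, and a random decision forest combines many such trees, each built with some randomness in the data and features it sees.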

Attendees will:

Become familiar with Spark basics using its Scala API
Understand the decision tree and random decision forest algorithms
See a simple, narrated data science workflow in action on a real data set