Presentation: "Apache Mahout's new DSL for Distributed Machine Learning on SPARK"

Track: NoSQL & Smart Big Data Analytics / Time: Thursday 14:30 - 15:20 / Location: Hall 1

I will talk about software that connects two very complex worlds: machine learning and distributed data processing. Six years ago, the Apache Mahout project started out to build a library for scalable machine learning based on Hadoop's MapReduce paradigm.

I will look back at what I learned as a developer on Mahout, with respect to both software engineering and running a community-driven open-source project. After that, I will talk about a major rewrite that is currently underway. Mahout will provide an easy-to-use, declarative Scala DSL for linear algebraic operations, and an optimizer that translates programs written in the DSL for execution on modern data processing systems such as Apache Spark.

Sebastian Schelter is a PhD student at the Database Systems and Information Management Group (DIMA) of TU Berlin with Prof. Volker Markl. He is engaged in Open Source as a member of the Apache Software Foundation. He is a committer and PMC member at Apache Mahout, a library of scalable data mining algorithms, and Apache Giraph, a distributed system for large-scale graph processing. Furthermore, he is a fellow of the Free Software Foundation Europe.

His research aims at improving the technology for performing large-scale data analysis on parallel processing platforms. In terms of use cases, he focuses on enabling Collaborative Filtering with billions of interactions and Graph Mining on graphs with billions of vertices and edges. Furthermore, he is part of the development team of Stratosphere, a database-inspired parallel processing stack for large-scale analytics.