GOTO Berlin is a vendor-independent international software development conference with more than 60 top speakers and 400 attendees. The conference covers topics such as Java, Open Source, Agile, Architecture, Design, Web, Cloud, New Languages and Processes.

Presentation: "Apache Mahout's new DSL for Distributed Machine Learning on SPARK"

Track: NoSQL & Smart Big Data Analytics / Time: Thursday 14:30 - 15:20 / Location: Hall 1

I will talk about software that connects two very complex worlds: machine learning and distributed data processing. Six years ago, the Apache Mahout project started out to build a library for scalable machine learning based on Hadoop's MapReduce paradigm.

I will look back at what I have learned as a developer on Mahout, both about software engineering and about running a community-driven open-source project. After that, I will talk about a major rewrite that is currently underway. Mahout will provide an easy-to-use, declarative Scala DSL for linear algebraic operations, and an optimizer that translates programs written in the DSL for execution on modern data processing systems such as Apache Spark.

Sebastian Schelter, PhD student at the Database Systems and Information Management Group

Biography: Sebastian Schelter

Sebastian Schelter is a PhD student at the Database Systems and Information Management Group (DIMA) of TU Berlin with Prof. Volker Markl. He is engaged in Open Source as a member of the Apache Software Foundation. He is a committer and PMC member at Apache Mahout, a library of scalable data mining algorithms, and Apache Giraph, a distributed system for large-scale graph processing. Furthermore, he is a fellow of the Free Software Foundation Europe.

His research aims at improving the technology for performing large-scale data analysis on parallel processing platforms. In terms of use cases, his focus is on enabling Collaborative Filtering with billions of interactions and Graph Mining on graphs with billions of vertices and edges. Furthermore, he is part of the development team of Stratosphere, a database-inspired parallel processing stack for large-scale analytics.