Data modeling has long been constrained by scale; sampling still rules the day for ad hoc analytics. Scale brings much-needed change to the modeling world. In this talk we present the predictive power of using sophisticated algorithms on big datasets. With large data sizes comes the particularly hard problem of unbalanced data with multiple asymmetrically rare classes. Missing features pose unique problems for most classification and regression algorithms, and proper handling can lead to greater predictive power. In the race for better predictions, H2O makes practical techniques accessible to everyone through an easy-to-use software product.
H2O is an open source math and machine learning engine for big data that brings distribution and parallelism to powerful algorithms while keeping the widely used R and JSON as its API. It integrates neatly into the popular data ecosystems of Hadoop, Amazon S3, NoSQL, and SQL. We briefly discuss design choices in the implementation of Distributed Random Forest and Generalized Linear Modeling, and in bringing speed and scale to R, the vox populi of data science. We take a peek at the elegant, Lego-like infrastructure that brings fine-grained parallelism to math over simple distributed arrays.
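As a rough illustration of what "math over simple distributed arrays" can look like, the sketch below computes a mean by mapping over chunks of an array and reducing the partial results. It is not H2O's API: the function names (`split_into_chunks`, `map_chunk`, `combine`, `chunked_mean`) and the use of a local thread pool in place of a cluster are assumptions made purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(values, chunk_size):
    """Cut a list into consecutive chunks (hypothetical stand-in for a distributed array)."""
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


def map_chunk(chunk):
    """Per-chunk work: return a partial (sum, count) pair."""
    return (sum(chunk), len(chunk))


def combine(a, b):
    """Reduce step: merge two partial (sum, count) pairs."""
    return (a[0] + b[0], a[1] + b[1])


def chunked_mean(values, chunk_size=4, workers=2):
    """Mean of a non-empty list, computed chunk by chunk in parallel."""
    chunks = split_into_chunks(values, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    total, count = reduce(combine, partials, (0.0, 0))
    return total / count


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
print(chunked_mean(data))
```

Because the reduce step is associative, the partial results can be merged in any grouping, which is what lets this kind of computation spread across many workers.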
A short data-hacking demo presents the life cycle of data science: powerful data manipulation via R at scale, interactive summarization over large datasets, modeling using Elastic Net (GLM), grid search for the best parameters, and low-latency scoring.
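To make the Elastic Net and grid search steps concrete, here is a minimal sketch that is not taken from the talk or from H2O. It assumes the common elastic net objective (1/2n) * sum((y - b*x)^2) + lam * (alpha*|b| + (1-alpha)/2 * b^2), restricted to a single feature with no intercept so the slope has a closed form. All function names and the toy data are hypothetical.

```python
from itertools import product


def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def fit_elastic_net_1d(xs, ys, alpha, lam):
    """Closed-form elastic net slope for one feature, no intercept.

    alpha mixes the L1 (alpha=1) and L2 (alpha=0) penalties; lam scales both.
    """
    n = len(xs)
    z = sum(x * y for x, y in zip(xs, ys)) / n   # mean of x*y
    xx = sum(x * x for x in xs) / n              # mean of x^2
    return soft_threshold(z, lam * alpha) / (xx + lam * (1.0 - alpha))


def mse(xs, ys, b):
    """Mean squared error of the prediction b*x."""
    return sum((y - b * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grid_search(train, valid, alphas, lambdas):
    """Try every (alpha, lam) pair; return (validation_mse, alpha, lam, slope) of the best."""
    best = None
    for alpha, lam in product(alphas, lambdas):
        b = fit_elastic_net_1d(train[0], train[1], alpha, lam)
        score = mse(valid[0], valid[1], b)
        if best is None or score < best[0]:
            best = (score, alpha, lam, b)
    return best


# Toy data, invented for this example.
train = ([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8])
valid = ([5.0, 6.0], [10.1, 11.9])
best = grid_search(train, valid, alphas=[0.0, 0.5, 1.0], lambdas=[0.0, 0.1, 1.0])
print(best)
```

With `lam = 0` the fit reduces to ordinary least squares. A larger `lam` shrinks the slope, and with `alpha = 1` a sufficiently large penalty sets it exactly to zero.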
Petr Maj is co-founder and CEO of ReactorLabs, s.r.o., which specializes in research and development of a new generation of tools for data scientists. ReactorLabs cooperates with top universities in the EU and US to turn research outcomes into industrial-strength applications and frameworks of practical importance.
Petr has worked on optimizations and runtime architectures for both static and dynamic programming languages, as well as on systems and embedded development. Before ReactorLabs, he worked on the core of the H2O analytics engine for 0xdata in the US and on compilers for the Cell and ARM architectures of the Sony Playstation 3 and Sony PS Vita gaming consoles for SnSys in the UK. He holds an MSc degree with highest honors from the Czech Technical University and has also studied at the University of Bristol, UK, and Purdue University in the US.
Petr has co-authored several papers (OOPSLA, PLDI, EuroSys, ISMM, useR!) and currently helps organize ECOOP 2015 in Prague.
Tomas is a hacker at 0xdata and a maker of H2O. He currently holds the world record for the fastest high-scale Generalized Linear Model on the planet, clocking an insane 1 billion rows in under 1 second for a binomial classification. He is continually in the process of finishing his PhD in behavioral malware detection at Binghamton University. He received his undergraduate degree from the Czech Technical University. Tomas has worked at IBM Research and the Agent Technology Group. He has participated in several projects related to malware detection and protection funded by the US Air Force. Specifically, he developed a system for modeling software behavior using compressed graphs of the system calls made on the system.
Tomas also created a sandbox with simulated user activity for the safe execution of malware, with advanced algorithms for extracting the behavior of malware injected into other processes. Outside of work, Tomas spends his time mountain biking, skiing, and snowboarding.