Lecture #9.1 - Data Processing - Apache Spark ML API.pdf

Full Transcript

MODERN DATA ARCHITECTURES FOR BIG DATA II
APACHE SPARK ML API

Agenda
- Introduction
- Spark ML
- Spark ML API
- Additional Resources

TIME TO TURN OSBDET ON
We'll use the course environment by the end of the lesson.

1. INTRODUCTION

DISCLAIMER
The following slides are NOT intended to be a machine learning/data science course. The goal of these slides is to explain how to use Apache Spark in the different steps of the machine learning workflow.

BUILDING UP ON THE FOUNDATIONS
The Spark ML API is built on top of the Structured APIs.

WHAT IS ML?
ML is the process of learning patterns and relationships in data without being explicitly programmed. The result of this process is an ML model.

Data -> (training) -> Machine Learning Algorithm -> (results) -> ML Model

WHAT IS AN ML MODEL?
An ML model is equivalent to a function in mathematics or computer programming. It takes one or more variables as input, called features, and returns an output, called a prediction.

Data (features) -> (inference) -> ML Model -> (results) -> Prediction

WHY SPARK FOR ML?
There are many tools for doing ML, but:
- What if our data is too big?
- What if we need to train multiple models in parallel?

Spark works by distributing data and computations across multiple workers. This way, Spark ML makes it possible to:
- Scale out, to work with huge datasets
- Speed up, to train models faster

MACHINE LEARNING WORKFLOW
Machine learning typically requires the following steps:
- Problem Definition
- Data Gathering
- Data Preparation
- Model Training
- Model Evaluation
- Model Selection

2. SPARK ML

WHAT IS MLLIB?
Spark has two machine learning packages:
- spark.mllib is the original machine learning API, based on the RDD API (in maintenance mode since Spark 2.0)
- spark.ml is the newer API, based on DataFrames

*However, we use “MLlib” as an umbrella term to refer to both machine learning library packages.

MLlib provides tools for:
- ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
- Featurization: feature extraction, transformation, dimensionality reduction, and selection
- Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
- Persistence: saving and loading models and pipelines
- Utilities: linear algebra, statistics, data handling, etc.

SPARK MLLIB CONCEPTS
The machine learning workflow in Spark.

MLlib standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline, or workflow. This section covers the key concepts introduced by the Pipelines API, where the pipeline concept is mostly inspired by the scikit-learn project.

MLlib provides a common language for defining the different parts of the end-to-end machine learning pipeline:
- Transformers
- Estimators
- Evaluators
- Pipelines

Transformer: a class that transforms one DataFrame into another DataFrame via the .transform() method. E.g., an ML model is a Transformer that transforms a DataFrame with features into a DataFrame with predictions.

Estimator: a class that can be fit on a DataFrame to produce a Transformer via the .fit() method. E.g., a learning algorithm is an Estimator that trains on a DataFrame and produces a model.

Evaluator: a class that measures how a given model performs via the .evaluate() method, which returns the value of the performance metric. E.g., an ML model is applied to a test set and its performance is measured with a selected metric (accuracy, recall, area under the ROC curve).

Pipeline: a class that chains multiple Transformers and Estimators together into a single step to specify an ML workflow. The concept of pipelines is common across many ML frameworks as a way to organize a series of operations to apply to your data.
You can sequence a series of transformation and model training steps as a unified, repeatable step. Data preparation pipelines often have multiple steps, and it becomes cumbersome to remember not only which steps to apply but also the order in which to apply them.

3. SPARK ML API

EXPLORE THE API IN JUPYTER NOTEBOOK
Jump to OSBDET and explore the Spark ML API.

4. ADDITIONAL RESOURCES
- Summary
- Spark ML Guide
- Machine Learning Mindmap
