Lecture #12.1 - Spark in Production Scenarios.pdf
MODERN DATA ARCHITECTURES FOR BIG DATA II
APACHE SPARK IN PRODUCTION SCENARIOS

AGENDA
Structured API best practices
Structured API under the hood
Spark applications
Azure Databricks
Success stories

TIME TO TURN OSBDET ON!
We'll use the course environment by the end of the lesson.

1. STRUCTURED API BEST PRACTICES

TRANSFORMATIONS & ACTIONS
A Spark data analysis is just some transformations plus one action.

RESOURCES ARE LIMITED
Unfortunately, resources for projects are limited due to:
  Budgetary limitations → constrained Big Data environment
  Demanding scenarios → reaching the limits of technology
Once you've proved the viability of your analysis you must:
  Review your code and apply data analysis optimizations
  Optimize the format of the data if possible (text vs binary)
  Optimize the structure of the data if possible (row vs columnar)
* Picture from the "A brief introduction to column-oriented databases" article

DATA ANALYSIS OPTIMIZATIONS
Optimizations will try to achieve the following:
  Remove as much unnecessary data as possible
  Avoid as much data exchange between nodes as possible
These optimizations can be introduced by:
  1. Filtering out as many rows as you can (filter, where)
  2. Removing the columns you don't need (select)
  3. Starting with narrow transformations (ex. new columns)
  4. Moving on to wide transformations (ex. aggregations)
  5. Caching DataFrames if you'll use them more than once
(A combined PySpark sketch of these steps appears after the INITIATING SPARK APPLICATIONS slide below.)

2. STRUCTURED API UNDER THE HOOD

TRANSFORMATIONS DON'T DUPLICATE DATA
New DataFrames don't mean data is duplicated.
Our data analysis will be optimized "automagically":
  1. Our PySpark code is converted into a Logical Plan
  2. The Logical Plan is then converted into a Physical Plan
  3. Multiple optimizations are applied along the way by the Catalyst Optimizer
The Physical Plan is all about RDD transformations.

LOOKING INTO THE PLANS
The DataFrame's explain method can show you the plans:
  simple - only the physical plan
  extended - both logical and physical plans
  codegen - physical plan and generated code if available
  cost - logical plan and statistics if available
  formatted - physical plan outline and node details

THERE IS A SPARK UI TO MONITOR JOBS
The Spark UI lets you look at the progress of your data analysis.

3. SPARK APPLICATIONS

BUILDING SPARK APPLICATIONS
Jupyter Notebooks are great for interactive analytics.
Batch & stream processing often are not interactive:
  They need to happen at times humans are not available (eventually we sleep)
  Jobs are triggered based on dynamic conditions that we cannot foresee
Spark Applications are the way to go to address those scenarios.
Jupyter Notebooks can be translated into Spark Applications.

WHAT'S A SPARK APPLICATION?
It's just a Python script with PySpark code in it.
A Spark Application with PySpark might start like this:

    if __name__ == '__main__':
        from pyspark.sql import SparkSession

        spark = SparkSession.builder \
            .master("local") \
            .appName("Bikes") \
            .getOrCreate()

INITIATING SPARK APPLICATIONS
Send Spark Applications to a Spark cluster with spark-submit.
Spark Applications are initiated on OSBDET like this:

    osbdet@osbdet:~$ export PYSPARK_PYTHON=/usr/bin/python3
    osbdet@osbdet:~$ $SPARK_HOME/bin/spark-submit --master local your_pyspark_application.py

Additional arguments might be needed (ex. --packages).
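To make the ideas above concrete, here is a minimal, non-authoritative sketch that strings together the five optimization steps from section 1 on a hypothetical trips DataFrame (the file name and column names are made up for the example), and then uses explain to look at the logical and physical plans described in section 2:

    # Minimal sketch of the best-practice ordering (hypothetical data).
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("OptimizationSketch").getOrCreate()

    trips = spark.read.csv("trips.csv", header=True, inferSchema=True)

    result = (trips
        .where(F.col("city") == "Madrid")           # 1. filter rows first
        .select("station_id", "duration_min")       # 2. keep only the needed columns
        .withColumn("duration_h",                   # 3. narrow transformation
                    F.col("duration_min") / 60)
        .groupBy("station_id")                      # 4. wide transformation last
        .agg(F.avg("duration_h").alias("avg_duration_h"))
        .cache())                                   # 5. cache if reused later

    result.explain("extended")   # logical and physical plans from the Catalyst Optimizer
    result.show()                # the single action that triggers execution

Note that nothing is executed until the final action (show); explain only prints the plans that the Catalyst Optimizer has produced.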
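In the same spirit, here is a minimal sketch of what a complete Spark Application script could look like once a notebook analysis is translated; the script name, input path and columns are hypothetical, not the course's bikes example:

    # bike_counts_job.py - hypothetical, minimal Spark Application sketch
    if __name__ == '__main__':
        from pyspark.sql import SparkSession
        from pyspark.sql import functions as F

        spark = SparkSession.builder \
            .master("local") \
            .appName("BikeCounts") \
            .getOrCreate()

        # Read the input data (path and columns are illustrative only)
        stations = spark.read.csv("stations.csv", header=True, inferSchema=True)

        counts = (stations
            .where(F.col("city") == "Madrid")
            .groupBy("district")
            .count())

        # Persist the result and shut the session down cleanly
        counts.write.mode("overwrite").csv("station_counts")
        spark.stop()

It would be launched with the same spark-submit command shown above.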
BIKES STATIONS ANALYSIS APPLICATION
The Bikes Stations' graph analysis as a Spark Application.

RUNNING BIKES STATIONS ANALYSIS
Run the Bikes Stations' graph analysis by typing the following:

    osbdet@osbdet:~$ export PYSPARK_PYTHON=/usr/bin/python3
    osbdet@osbdet:~$ $SPARK_HOME/bin/spark-submit --master local \
        --packages "graphframes:graphframes:0.8.2-spark3.2-s_2.12" \
        bike_stations_analysis_job.py

4. AZURE DATABRICKS

USING MICROSOFT AZURE
Azure is a cloud computing service created by Microsoft.
Lifecycle of applications/services on managed data centers:
  Development
  Testing
  Production
  Deployment
Microsoft Azure provides services classified as follows:
  Software as a Service (SaaS) → ex. Microsoft Dynamics 365
  Platform as a Service (PaaS) → ex. Databricks
  Infrastructure as a Service (IaaS) → ex. Azure VMs

USING AZURE DATABRICKS
Azure Databricks is an Apache Spark-based analytics platform.
It's optimized for the Microsoft Azure cloud services platform.

BIKES STATIONS ANALYSIS ON DATABRICKS
Let's run a graph analysis on a production-class cluster.
We'll accomplish it by going through the following steps:
  Create an account and log into Azure
  Create an Azure Databricks service
  Create a Databricks workspace
  Enter the Databricks workspace and create a Spark cluster
  Install the GraphFrames Spark package on the cluster
  Upload the two CSV files to the storage layer (ex. DBFS)
  Import the Jupyter Notebook into the workspace
  Execute the analysis

BIKES STATIONS ANALYSIS ON DATABRICKS, STEP BY STEP
  1. Create an Azure account to try it out.
  2. Choose the option that fits your preference the best*.
     * I've used pay-as-you-go, but the other one will work as well.
  3. You might want to use your IE account.
  4. Complete the sign-up process.
  5. Search for Azure Databricks in the Azure Portal.
  6. Once identified, create your Azure Databricks service.
  7. Create an Azure Databricks workspace within the service*.
     * You need to create a new resource group called mda2_course (a different name will work too).
  8. Be patient, the process will take a while.
  9. After a while, your workspace will be ready to use.
  10. Access the workspace and use the Azure Databricks service.
  11. It's all set up to create a proper Spark cluster.
  12. You have to be patient again, it'll take a while to be ready.
  13. Time to install the GraphFrames Spark package.
  14. Let's identify DBFS, the storage layer, to upload the CSV files.
  15. Import the notebook with the analysis into the workspace*.
      * Right-click the workspace to get the contextual menu with the import option.
  16. Open the uploaded notebook and execute the analysis.

5. SUCCESS STORIES

CLEARSENSE
Clearsense provides streaming analytics solutions.

SHELL
Shell is taking an AI-first approach for business needs.

CONGRATS, WE'RE DONE!
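For reference, picking up steps 14-16 of the Databricks walkthrough above: once the two CSV files are uploaded to DBFS, the imported notebook can load them roughly along these lines. The file names here are hypothetical, /FileStore/tables is simply the default DBFS upload location, and Databricks notebooks already provide the spark session:

    # Hypothetical sketch of reading the uploaded CSV files from DBFS
    # inside a Databricks notebook (spark is predefined there).
    stations = spark.read.csv("/FileStore/tables/stations.csv",
                              header=True, inferSchema=True)
    trips = spark.read.csv("/FileStore/tables/trips.csv",
                           header=True, inferSchema=True)

    stations.printSchema()
    trips.show(5)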