Databricks Data Science & Engineering Workspace PDF
Document Details
Uploaded by CostEffectiveClavichord
2024
Alexandre Bergere
Summary
Introduction to Databricks Data Science and Engineering Workspace, covering introductory big data concepts. Presented on 29/10/2024 by Alexandre Bergere, this document offers insight into the tools and platform for data science and engineering.
Full Transcript
Get Started with Databricks Data Science & Engineering Workspace
Introduction to Big Data, presented by Alexandre Bergere (29/10/2024)

About the speaker
Alexandre Bergere - independent Data Architect, Head of Data & AI Engineer and Partner at DataGalaxy, Delta & openLineage lover.
o Data expert: audit, modeling, urbanization, BI projects - Azure, Big Data, Spark, openLineage & Delta
o Trainer (BI, Big Data, MongoDB, NoSQL, SQL, Panorama de la data, Cloud Computing) at ESAIP, ESME, ECS, NEOMA, ESEO, EFREI and Sorbonne Paris
o Speaker (Pass Summit 2018, Spark AI Tour 2020, ...)
o Cloud & data tech articles on Medium, LinkedIn, SlideShare and GitHub
o Past and current roles include Avanade (Sr Analyst, Data Analyst), OpenClassrooms (Mentor), DataRedKite (CTO & Co-Founder), Datalex (freelance Data Architect) and DataGalaxy (Head of Data & AI Engineer)
o alexandrebergere.com

Don't get lost - markers used throughout the deck: Your turn!, Practice, Demo, Important

Learning Objectives
1. Understand Spark
2. Understand the core components of the Databricks Lakehouse platform
3. Use PySpark in Azure Databricks

Understand Spark

Apache Spark
Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
https://spark.apache.org/ (see also the Forbes article)

Apache Spark is a distributed data processing framework:
o Code in multiple languages
o The driver program uses a SparkContext to coordinate processing across multiple executors on worker nodes
o Executors run in parallel to process data in a distributed file system

o The Driver is the JVM in which our application runs.
o The secret to Spark's performance is parallelism. Scaling vertically is limited to a finite amount of RAM, threads and CPU speed, whereas scaling horizontally means we can simply add new "nodes" to the cluster almost endlessly.
o We parallelize at two levels:
  o The first level of parallelization is the Executor - a Java virtual machine running on a node, typically one instance per node.
  o The second level of parallelization is the Slot - the number of slots is determined by the number of cores and CPUs of each node.
o Each Executor has a number of Slots to which parallelized Tasks can be assigned by the Driver.

Spark API

Query Execution
We can express the same query using any interface. The Spark SQL engine generates the same query plan used to optimize and execute on our Spark cluster.
Resilient Distributed Datasets (RDDs) are the low-level representation of datasets processed by a Spark cluster. In early versions of Spark, you had to write code manipulating RDDs directly. In modern versions of Spark you should instead use the higher-level DataFrame APIs, which Spark automatically compiles into low-level RDD operations.
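To make the query-execution point concrete, here is a minimal sketch. It assumes the spark session described on the next slide and the products table (with name and price columns) used in the examples later in this deck; the same query written against the DataFrame API and as SQL should produce equivalent optimized plans.

# Minimal sketch, assuming the "spark" session (next slide) and a registered
# "products" table with "name" and "price" columns (used later in this deck).
df_api = (spark.table("products")
               .select("name", "price")
               .where("price < 200"))

df_sql = spark.sql("SELECT name, price FROM products WHERE price < 200")

# explain() prints the plan generated by the Spark SQL engine; the two
# queries should yield equivalent optimized/physical plans.
df_api.explain()
df_sql.explain()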
SparkSession
The SparkSession class is the single entry point to all functionality in Spark using the DataFrame API. In Azure Synapse & Databricks notebooks, the SparkSession is created for you and stored in a variable called spark.

from pyspark.sql import SparkSession

# Spark session & context
spark = SparkSession.builder.master("local").getOrCreate()
sc = spark.sparkContext

# Test PySpark
spark.range(5).show()

SparkSession Methods
Below are several additional methods we can use to create DataFrames. All of these can be found in the documentation for SparkSession.
o sql - returns a DataFrame representing the result of the given query
o table - returns the specified table as a DataFrame
o read - returns a DataFrameReader that can be used to read data in as a DataFrame
o range - creates a DataFrame with a column containing elements in a range from start to end (exclusive), with a step value and a number of partitions
o createDataFrame - creates a DataFrame from a list of tuples, primarily used for testing

DataFrames
Queries using methods in the DataFrame API return results in a DataFrame. A DataFrame is a distributed collection of data grouped into named columns.

budget_df = (spark.table("products")
             .select("name", "price")
             .where("price < 200")
             .orderBy("price"))

We can use display() to output the results of a dataframe.

display(budget_df)

Analyse data with Spark

%pyspark
# Load data
df = spark.read.load("/data/products.csv", format="csv", header=True)
# Manipulate dataframe
counts_df = df.select("ProductID", "Category").groupBy("Category").count()
# Display dataframe
display(counts_df)

Example output:
Category        count
Headsets        3
Wheels          14
Mountain Bikes  32
...

Using SQL expressions in Spark

%pyspark
# Create a temporary view
df.createOrReplaceTempView("products")

%pyspark
# Use the spark.sql method for inline SQL queries that return a dataframe
bikes_df = spark.sql("SELECT ProductID, ProductName, ListPrice \
                      FROM products \
                      WHERE Category IN ('Mountain Bikes', 'Road Bikes')")
display(bikes_df)

%sql
-- Use SQL to query tables in the metastore
SELECT Category, COUNT(ProductID) AS ProductCount
FROM products
GROUP BY Category
ORDER BY Category

Spark Local Mode
One option, commonly used for local development, is to run Spark in Local Mode. In Local Mode, both the Driver and one Executor share the same JVM. This is an ideal scenario for experimentation, prototyping, and learning!

Running PySpark on Jupyter Notebook with Docker on Mac: A Step-by-Step Guide
This article demonstrates how to run PySpark seamlessly on Jupyter Notebook using Docker, specifically tailored for Mac users. Docker, with its containerization magic, ensures a consistent and reproducible environment, while Jupyter Notebook provides an interactive and user-friendly interface for data exploration and analysis.
https://medium.com/p/0b2e3bad1930
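As a complement to the Local Mode and Docker notes above, here is a minimal local-mode sketch; it assumes PySpark is installed on the local machine (for example via pip) rather than run inside Databricks.

from pyspark.sql import SparkSession

# Local Mode sketch: the Driver and a single Executor share one JVM.
# Assumes a local PySpark installation (e.g. pip install pyspark).
spark = (SparkSession.builder
         .appName("local-experiments")
         .master("local[*]")   # one worker thread (slot) per available core
         .getOrCreate())

print(spark.version)
print(spark.sparkContext.defaultParallelism)   # roughly the number of local slots

spark.range(10).selectExpr("id", "id * 2 AS doubled").show()

spark.stop()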
Understand the core components of the Databricks Lakehouse platform

What is Databricks?
One platform to unify all your data, analytics and AI workloads.
The Lakehouse (Data Intelligence) Platform brings together business/SQL analytics, data warehousing, data engineering, data streaming, and data science & ML on a single platform:
o Governance and security: Unity Catalog provides fine-grained governance for data and AI, covering table ACLs as well as files and blobs
o Delta Lake: data reliability and performance
o Cloud data lake: all structured and unstructured data (structured tables; unstructured files such as logs, text, images, video), combining data lake and data warehouse capabilities

Databricks Workspace and Services

What is Azure Databricks?
Databricks is a fully managed, cloud-based data analytics platform:
o Built on Apache Spark
o Web-based portal
Provisioned as an Azure resource:
o Standard tier
o Premium tier
o Trial

Azure Databricks workloads
o Data Science and Engineering - use notebooks to run Apache Spark code to manipulate and explore data
o Machine Learning - train predictive models using SparkML and other machine learning frameworks
o SQL - store and query data in relational tables using SQL (only available in Premium tier workspaces)

Key concepts
1. Apache Spark clusters provide highly scalable parallel compute for distributed data processing
2. Databricks File System (DBFS) provides distributed shared storage for data lakes
3. Notebooks provide an interactive environment for combining code, notes, and images
4. The metastore provides a relational abstraction layer, enabling you to define tables based on data in files
5. Delta Lake builds on the metastore to enable common relational database capabilities
6. SQL Warehouses provide relational compute endpoints for querying data in tables

Databricks File System
Databricks File System (DBFS) is a distributed file system mounted into a Databricks workspace and available on Databricks clusters. DBFS is an abstraction on top of scalable object storage and offers the following benefits:
o Allows you to mount storage objects so that you can seamlessly access data without requiring credentials.
o Allows you to interact with object storage using directory and file semantics instead of storage URLs.
o Files persist to object storage, so you won't lose data after you terminate a cluster.
https://docs.databricks.com/user-guide/databricks-file-system.html#dbfs
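The short sketch below illustrates these DBFS points from inside a Databricks notebook (dbutils and display are only predefined there; /databricks-datasets is the sample-data folder shipped with Databricks workspaces).

# DBFS sketch: list and read files using paths instead of storage URLs.
# Only runs inside a Databricks notebook, where dbutils and display exist.
files = dbutils.fs.ls("/databricks-datasets")
display(files)

# DBFS paths can be passed directly to Spark readers and writers.
readme = spark.read.text("/databricks-datasets/README.md")
readme.show(5, truncate=False)

# The %fs magic is a shortcut for the same filesystem commands, e.g.:
# %fs ls /databricks-datasets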
Azure Databricks - Connect Azure storage to Databricks
https://medium.com/datalex/azure-databricks-connect-azure-storage-to-databricks-285cac910fa0

(Break - session resumes at 16:05)

Demo - Navigate the Workspace UI

Enable student credits

Create an Azure Databricks Workspace (walkthrough over several slides)

Use Databricks Community Edition

Compute Resources

Clusters
A cluster is a collection of VM instances that distributes workloads across workers. Two main types:
o All-purpose clusters for interactive development
o Job clusters for automating workloads

Create a Spark cluster
Create a cluster in the Azure Databricks portal, specifying:
o Cluster name
o Cluster mode (standard, high-concurrency, or single-node)
o Databricks Runtime version
o Worker and driver node VM configuration
o Autoscaling and automatic shutdown

Cluster Types
All-purpose clusters:
o Analyze data collaboratively using interactive notebooks
o Create clusters from the Workspace or the API
o Configuration information is retained for up to 70 clusters for up to 30 days
Job clusters:
o Run automated jobs
o The Databricks job scheduler creates job clusters when running jobs
o Configuration information is retained for up to 30 of the most recently terminated clusters

Cluster Configuration - Cluster Mode
o Standard (Multi Node): default mode for workloads developed in any supported language (requires at least two VM instances)
o Single Node: low-cost single-instance cluster catering to single-node machine learning workloads and lightweight exploratory analysis
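Once a notebook is attached to a cluster, a quick sketch like the one below can show what the chosen mode, worker count and VM sizes translate into at the Spark level; the exact values depend on the configuration selected above.

# Inspect the cluster this notebook is attached to; values depend on the
# cluster mode, number of workers and VM sizes chosen in the configuration UI.
print("Spark version:", spark.version)
print("Master:", spark.sparkContext.master)
# Roughly the total number of slots (cores) available for parallel tasks:
print("Default parallelism:", spark.sparkContext.defaultParallelism)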
Cluster Configuration - Databricks Runtime Version
o Standard: Apache Spark and many other components and updates, providing an optimized big data analytics experience
o Machine learning: adds popular machine learning libraries like TensorFlow, Keras, PyTorch, and XGBoost
o Photon: an optional add-on to optimize SQL workloads

Cluster Configuration - Access Mode

Cluster Configuration - Cluster Policies
Cluster policies can help to achieve the following:
o Standardize cluster configurations
o Provide predefined configurations targeting specific use cases
o Simplify the user experience
o Prevent excessive use and control cost
o Enforce correct tagging

Cluster Configuration - Cluster Access Control

DE 1.1: Create and Manage Interactive Clusters
o Use the Clusters UI to configure and deploy a cluster
o Edit, terminate, restart, and delete clusters

Develop Code with Notebooks

Databricks Notebooks
Collaborative, reproducible, and enterprise ready.

Use Spark in notebooks
Interactive notebooks offer:
o Syntax highlighting and error support
o Code auto-completion
o Interactive data visualizations
o The ability to export results

Notebook magic commands
Use these to override default languages, run utilities/auxiliary commands, etc.
o %python, %r, %scala, %sql - switch languages in a command cell
o %sh - run shell code (only runs on the driver node, not worker nodes)
o %fs - shortcut for dbutils filesystem commands
o %md - Markdown for styling the display
o %run - execute a remote notebook from a notebook
o %pip - install new Python libraries

dbutils (Databricks Utilities)
Perform various tasks with Databricks using notebooks. Available within Python, R, or Scala notebooks. (A short usage sketch follows after the lab overviews below.)
o fs - manipulates the Databricks filesystem (DBFS) from the console, e.g. dbutils.fs.ls()
o secrets - provides utilities for leveraging secrets within notebooks, e.g. dbutils.secrets.get()
o notebook - utilities for the control flow of a notebook, e.g. dbutils.notebook.run()
o widgets - methods to create and get the bound value of input widgets inside notebooks, e.g. dbutils.widgets.text()
o jobs - utilities for leveraging jobs features, e.g. dbutils.jobs.taskValues.set()

Git Versioning with Databricks Repos

Databricks Repos CI/CD Integration

CI/CD workflows with Git and Repos
https://docs.databricks.com/repos/ci-cd-techniques-with-repos.html

DE 1.2: Databricks Notebook Operations
o Attach a notebook to a cluster to execute a cell in a notebook
o Set the default language for a notebook
o Describe and use magic commands
o Create and run SQL, Python, and markdown cells
o Export a single notebook or a collection of notebooks
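Here is the usage sketch referred to above for the notebook utilities. The widget name, its default value and the secret scope/key are illustrative placeholders, and the code only runs inside a Databricks notebook where dbutils is predefined.

# Hedged usage sketch for dbutils; the widget name, default value and the
# secret scope/key below are placeholders, not values from this course.
dbutils.widgets.text("source_dir", "/databricks-datasets", "Source directory")
source_dir = dbutils.widgets.get("source_dir")

# List the chosen directory with the filesystem utilities.
for f in dbutils.fs.ls(source_dir):
    print(f.path, f.size)

# Secrets are read by scope and key (both must already exist):
# token = dbutils.secrets.get(scope="my-scope", key="my-key")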
DE 1.3L: Get Started with the Databricks Platform - Your turn!
o Rename a notebook and change the default language
o Attach a cluster
o Use the %run magic command
o Run Python and SQL cells
o Create a Markdown cell

Use PySpark in Azure Databricks

DataFrameReader
Interface used to load a DataFrame from external storage systems, for example:

spark.read.parquet("path/to/files")

DataFrameReader is accessible through the SparkSession attribute read. This class includes methods to load DataFrames from different external storage systems.
Quick function review:
o csv(path)
o jdbc(url, table, ..., connectionProperties)
o json(path)
o load(path)
o orc(path)
o parquet(path)
o table(tableName)
o text(path)
o textFile(path)
Configuration methods:
o option(key, value)
o options(map)
o schema(schema)
o format(source)

RDD Operations
An Action in Spark is any operation that does not return an RDD. Evaluation is executed when an action is taken. Actions trigger the scheduler, which builds a directed acyclic graph (DAG) as a plan of execution. The plan of execution is created by working backward to define the series of steps required to produce the final distributed dataset (each partition).
Transformations are functions that return another RDD.
o TRANSFORMATION (RDD → RDD): applies a transformation and generates a new RDD out of an existing RDD.
o ACTION (RDD → value): generates a value out of an RDD. No processing happens until an action is performed.

Transformation
Transformations always return a DataFrame (or in some cases, such as Scala & Java, a Dataset[Row]). Transformations are operations on RDDs, DataFrames, or Datasets that produce a new distributed dataset from an existing one. They are generally lazy, meaning they are not executed immediately but instead build a logical execution plan.

Spark is Lazy
Transformations are lazy in nature: they are only executed when we call an action, not immediately.
o Spark is not forced to load all the data at the first step.
o It is easier to parallelize operations: N different transformations can be processed on a single data element, on a single thread, on a single machine.
o Most importantly, it allows the framework to automatically apply various optimizations.

Action
In contrast to transformations, Actions either return a result or write to disk. For example:
o The number of records in the case of count()
o An array of objects in the case of collect() or take(n)
Common actions (method - return type - description):
o collect() - Collection - returns an array that contains all of the Rows in this Dataset
o count() - Long - returns the number of rows in the Dataset
o first() - Row - returns the first row
o foreach(f) - (none) - applies a function f to all rows
o foreachPartition(f) - (none) - applies a function f to each partition of this Dataset
o head() - Row - returns the first row
o reduce(f) - Row - reduces the elements of this Dataset using the specified binary function
o show(..) - (none) - displays the top 20 rows of the Dataset in tabular form
o take(n) - Collection - returns the first n rows in the Dataset
o toLocalIterator() - Iterator - returns an iterator that contains all of the Rows in this Dataset
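To make the lazy-evaluation behaviour concrete, the sketch below chains two transformations and then triggers them with actions; the CSV path and column names are assumed, mirroring the products examples used earlier in this deck.

# Lazy-evaluation sketch; the path and column names are assumed and mirror
# the products examples used earlier in this deck.
df = (spark.read
           .option("header", True)                 # DataFrameReader configuration
           .csv("/data/products.csv"))

cheap = df.where("ListPrice < 200")                # transformation: builds the plan
names = cheap.select("ProductName", "ListPrice")   # transformation: still lazy

# Only the actions below trigger Spark jobs that actually scan the file.
print(names.count())
names.show(5)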
Modify and save dataframes
o Load the source file into a dataframe
o Use dataframe methods and functions to transform the data: filter rows, modify column values, derive new columns, drop columns
o Write the modified data, specifying the required file format

from pyspark.sql.functions import year, col

# Load data
df = spark.read.load("/data/orders.csv", format="csv", header=True)

# Add Year column, derived from OrderDate
df = df.withColumn("Year", year(col("OrderDate")))

# Save transformed data
df.write.mode("overwrite").parquet("/data/orders.parquet")

Partition data files
o Partition data by one or more columns
o Distributes data to improve performance and scalability

df.write.partitionBy("Year").mode("overwrite").parquet("/data")

This produces one subfolder per partition value, e.g. /data/Year=2020 and /data/Year=2021.

Transform data with SQL
o Use the metastore to define tables and views
o Use SQL to query and transform the data
o Save transformed data as an external table
o Dropping an external table does not delete the data files

# Create a view in the metastore
df.createOrReplaceTempView("sales_orders")

# Use SQL to transform data and return a dataframe
new_df = spark.sql("SELECT OrderNo, OrderDate, Year(OrderDate) AS Year FROM sales_orders")

# Save the dataframe as an external table
new_df.write.partitionBy("Year").saveAsTable("transformed_orders",
    format="parquet", mode="overwrite", path="/transformed_orders")

DE 0.00 - Module Introduction - Your turn!