Transform Data With Spark PDF

Document Details

Uploaded by CostEffectiveClavichord

2024

Alexandre Bergere

Tags

big data, spark, data transformations, data analysis

Summary

This presentation introduces big data transformations using Spark. It covers topics like data extraction, cleaning, and complex transformations. The presentation was given by Alexandre Bergere on October 28, 2024.

Full Transcript

Transform Data with Spark
28/10/2024, Introduction to Big Data by Alexandre Bergere

Alexandre BERGERE
alexandrebergere.com
Data Expert (Audit, Modeling, Urbanization, BI projects – Azure, Big Data, Spark, openLineage & Delta lover)
Trainer (BI, Big Data, MongoDB, NoSQL, SQL, Cloud Computing)
Head of Data & AI Engineer & Partners at DataGalaxy, Data Architect independent
Delta & openLineage lover
Speaker (Pass Summit 2018, Spark AI Tour 2020, …)
Cloud & data tech articles (medium, LinkedIn, slideshare)

[Career timeline graphic: Avanade (Sr Anls, Data Analyst); OpenClassroom (Mentor); DataRedKite (CTO & Co-Founder); Github; Trainer at ESAIP, ESME, ECS, NEOMA, ESEO, EFREI, Sorbonne Paris (BI, Big Data, MongoDB, NoSQL, data landscape overview, Cloud computing); Datalex (Data Architect Freelance); DataGalaxy (Head of Data & AI Engineer). Date ranges on the slide: 2021 - x, 2016 - x, 2019 - x, 2016 - 2019, 2019 - x, 2020 - 2023.]

Don't get lost
o Your turn!
o Practice
o Demo
o Important

Learning Objectives
1 Extract data from a variety of file formats and data sources using Spark
2 Apply a number of common transformations to clean data using Spark
3 Reshape and manipulate complex data using advanced built-in functions in Spark
4 Leverage UDFs for reusable code and apply best practices for performance in Spark

Data Objects in the Lakehouse

Data objects in the Lakehouse

Managed Tables

External Tables

Data objects in the Lakehouse

Extracting Data

Query files directly

Using SQL expressions in Spark

    %pyspark
    # Register the DataFrame as a temporary view so it can be queried by name
    df.createOrReplaceTempView("products")

    %pyspark
    # Use spark.sql method for inline SQL queries that return a dataframe
    bikes_df = spark.sql("SELECT ProductID, ProductName, ListPrice \
                          FROM products \
                          WHERE Category IN ('Mountain Bikes', 'Road Bikes')")
    display(bikes_df)

    %sql
    -- Use SQL to query tables and views by name
    SELECT Category, COUNT(ProductID) AS ProductCount
    FROM products
    GROUP BY Category
    ORDER BY Category

DE 2.1: Querying Files Directly
o Use Spark SQL to directly query JSON data files
o Leverage text and binaryFile methods to review raw file contents

DE 2.2: Providing Options for External Sources
o Use Spark SQL to configure options for extracting data from external sources
o Create tables against external data sources for various file formats
o Describe behavior when querying tables defined against external RDBMS sources

DE 2.3L: Extract Data Lab
Your turn!

DE 2.4: Cleaning Data
o Summarize datasets and describe NULL behaviors
o Retrieve and remove duplicates
o Validate datasets for expected counts, missing values, and duplicate records
o Apply date_format and regexp_extract to clean and transform data

Complex Transformations

Interact with Nested Data
Use built-in syntax to traverse nested data with Spark SQL.

    -- Use ":" (colon) syntax in queries to access subfields in JSON strings
    SELECT value:device, value:geo...

    -- Use "." (dot) syntax in queries to access subfields in STRUCT types
    SELECT value.device, value.geo...

Complex Types
Nested data types storing multiple values:
o Array: arbitrary number of elements of same data type
o Map: set of key-value pairs
o Struct: ordered (fixed) collection of column(s) and any data type

    -- Example table with complex types
    CREATE TABLE employees (name STRING, salary FLOAT, subordinates ARRAY, deductions MAP, address STRUCT )

Explode
Explode outputs the elements of an array field into a separate row for each element.

    SELECT user_id, event_timestamp, event_name, explode(items) AS item
    FROM events

Each item in the items array above is exploded into its own row, resulting in the 3 rows below

Flatten
o collect_set returns an array of unique values from a field for each group of rows
o flatten returns an array that flattens multiple arrays into one

    SELECT user_id,
           collect_set(event_name) AS event_history,
           array_distinct(flatten(collect_set(items.item_id))) AS cart_history
    FROM events
    GROUP BY user_id

Collection example
o collect_set returns an array with duplicate elements eliminated
o collect_list returns an array with duplicate elements intact

Parse JSON strings into structs
Create the schema to parse the JSON strings by providing an example JSON string from a row that has no nulls.

DE 2.5: Complex Transformations
o Use : and . syntax to traverse nested data in strings and structs
o Use .* syntax to flatten and query struct types
o Parse JSON string fields
o Flatten/unpack arrays and structs

DE 2.6L: Reshape Data Lab
Your turn!

DE 2.7A: SQL UDFs and Control Flow (Optional)

DE 2.7B: Python UDFs (Optional)

Resuming at 16:05
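The semantics of explode, collect_set, collect_list, flatten and array_distinct described in the Explode, Flatten and Collection example slides can be sketched in plain Python, without a Spark session. This is only an illustration under stated assumptions: the helper functions below are hypothetical stand-ins written for this example (they are not Spark's API), and the sample events data is invented. One difference worth keeping in mind is that Spark does not guarantee the element order of collect_set, whereas this sketch keeps first-seen order.

```python
from itertools import chain

# Invented sample data, shaped loosely like the "events" table on the slides.
events = [
    {"user_id": "u1", "event_name": "cart", "items": [{"item_id": "A"}, {"item_id": "B"}]},
    {"user_id": "u1", "event_name": "purchase", "items": [{"item_id": "B"}]},
    {"user_id": "u2", "event_name": "cart", "items": [{"item_id": "C"}]},
    {"user_id": "u1", "event_name": "cart", "items": []},
]


def explode(rows, column, alias):
    """Emit one row per array element; a row with an empty array emits nothing."""
    return [
        {**{k: v for k, v in row.items() if k != column}, alias: element}
        for row in rows
        for element in row[column]
    ]


def collect_list(values):
    """Keep every value, duplicates included."""
    return list(values)


def array_distinct(values):
    """Drop duplicate elements, keeping first-seen order."""
    distinct = []
    for value in values:
        if value not in distinct:
            distinct.append(value)
    return distinct


def collect_set(values):
    """Keep each distinct value once (order is unspecified in Spark)."""
    return array_distinct(values)


def flatten(arrays):
    """Merge an array of arrays into a single array."""
    return list(chain.from_iterable(arrays))


def summarize(rows):
    """Mimic the GROUP BY user_id query from the Flatten slide."""
    groups = {}
    for row in rows:
        groups.setdefault(row["user_id"], []).append(row)
    summary = {}
    for user_id, group in groups.items():
        item_id_arrays = collect_set(
            [item["item_id"] for item in row["items"]] for row in group
        )
        summary[user_id] = {
            "event_history": collect_set(row["event_name"] for row in group),
            "cart_history": array_distinct(flatten(item_id_arrays)),
        }
    return summary


exploded = explode(events, "items", "item")
summary = summarize(events)

for row in exploded:
    print(row)
print(summary)
```

The nesting in summarize mirrors the SQL on the Flatten slide: collect_set gathers one array of item ids per event, flatten merges those arrays, and array_distinct removes the ids that appear in more than one event.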
