Spark Chapter 6: Working with Different Types of Data

25 Questions

Spark's structured operations only involve working with Booleans and Numbers.

False

DataFrame is a submodule of the Dataset class.

True

DataFrameStatFunctions holds methods related to working with null data.

False

org.apache.spark.sql.functions is a package that contains functions only for working with Strings.

False

The majority of Spark's functions are unique to Spark and cannot be found in SQL and analytics systems.

False

The explode function takes a column that consists of strings and creates one row per character in the string.

False

The split function is used to combine multiple columns into a single array column.

False

The map function is used to create a new column with a constant value.

False

The explode function can be used with SQL queries.

True

The explode function can only be used with Scala and Python APIs.

False

The result of the explode function is a single row with an array of values.

False

The printSchema method is used to display the data of the DataFrame.

False

The createOrReplaceTempView method is used to create a permanent table.

False

Spark has unique support for working with JSON data and can parse JSON objects directly from strings.

True

The get_json_object function is used to extract JSON objects with multiple levels of nesting.

False

The jsonDF.selectExpr method is used to parse JSON objects from a string column.

True

The json_tuple function is used to extract JSON objects with multiple levels of nesting.

False

Spark's SQL syntax supports JSON operations using the json_tuple function.

True

Spark can only operate on JSON data that is stored in a column of type JSON.

False

The to_json function can be used to convert a StructType into a JSON string without any additional parameters.

True

The from_json function does not require a schema to be specified.

False

The to_json function can only be used in Scala.

False

The result of the from_json function is a single row with an array of values.

False

The to_json function is used to parse JSON data into a StructType.

False

The from_json function can only be used with JSON data that is stored in a column named 'newJSON'.

False

Study Notes

Working with Different Types of Data

  • This chapter covers building expressions, which are the foundation of Spark's structured operations.
  • It reviews working with various types of data, including:
    • Booleans
    • Numbers
    • Strings
    • Dates and timestamps
    • Handling null
    • Complex types
    • User-defined functions

Where to Look for APIs

  • Key places to look for transformations in Spark:
    • DataFrame (Dataset) methods
    • Column methods
    • org.apache.spark.sql.functions package
    • SQL and DataFrame functions

Reading in a DataFrame

  • Example code to read in a CSV file and register the temporary view (the later SQL examples query this view as dfTable):
    • In Scala: val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/data/retail-data/by-day/2010-12-01.csv"); df.createOrReplaceTempView("dfTable")
    • In Python: df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/data/retail-data/by-day/2010-12-01.csv"); df.createOrReplaceTempView("dfTable")

Exploding a Column of Text

  • The explode function takes a column that consists of arrays and creates one row per value in the array.
  • Example code to explode a column of text (split, explode, and col are imported from org.apache.spark.sql.functions in Scala and pyspark.sql.functions in Python):
    • In Scala: df.withColumn("splitted", split(col("Description"), " ")).withColumn("exploded", explode(col("splitted"))).select("Description", "InvoiceNo", "exploded").show(2)
    • In Python: df.withColumn("splitted", split(col("Description"), " ")).withColumn("exploded", explode(col("splitted"))).select("Description", "InvoiceNo", "exploded").show(2)
    • In SQL: SELECT Description, InvoiceNo, exploded FROM (SELECT *, split(Description, " ") as splitted FROM dfTable) LATERAL VIEW explode(splitted) as exploded

Working with Maps

  • Maps are created with the map function (create_map in Python) from key-value pairs of columns.
  • Example code to create a map:
    • In Scala: df.select(map(col("Description"), col("InvoiceNo")).alias("complex_map")).show(2)
    • In Python: df.select(create_map(col("Description"), col("InvoiceNo")).alias("complex_map")).show(2)

Working with JSON

  • Spark has unique support for working with JSON data.
  • You can operate directly on strings of JSON in Spark and parse from JSON or extract JSON objects.
  • Example code to create a JSON column:
    • In Scala: val jsonDF = spark.range(1).selectExpr("'{\"myJSONKey\" : {\"myJSONValue\" : [1, 2, 3]}}' as jsonString")
    • In Python: jsonDF = spark.range(1).selectExpr("'{\"myJSONKey\" : {\"myJSONValue\" : [1, 2, 3]}}' as jsonString")
  • Example code to use the get_json_object function to inline query a JSON object:
    • In Scala: jsonDF.select(get_json_object(col("jsonString"), "$.myJSONKey.myJSONValue[1]") as "column", json_tuple(col("jsonString"), "myJSONKey")).show(2)
    • In Python: jsonDF.select(get_json_object(col("jsonString"), "$.myJSONKey.myJSONValue[1]").alias("column"), json_tuple(col("jsonString"), "myJSONKey")).show(2)
  • Example code to turn a StructType into a JSON string using the to_json function:
    • In Scala: df.selectExpr("(InvoiceNo, Description) as myStruct").select(to_json(col("myStruct")))
    • In Python: df.selectExpr("(InvoiceNo, Description) as myStruct").select(to_json(col("myStruct")))
  • Example code to parse a JSON string using the from_json function (unlike to_json, from_json requires a schema; parseSchema is a StructType describing the two string fields and must be defined first):
    • In Scala: val parseSchema = new StructType(Array(new StructField("InvoiceNo", StringType, true), new StructField("Description", StringType, true))); df.selectExpr("(InvoiceNo, Description) as myStruct").select(to_json(col("myStruct")).alias("newJSON")).select(from_json(col("newJSON"), parseSchema), col("newJSON")).show(2)
    • In Python: parseSchema = StructType([StructField("InvoiceNo", StringType(), True), StructField("Description", StringType(), True)]); df.selectExpr("(InvoiceNo, Description) as myStruct").select(to_json(col("myStruct")).alias("newJSON")).select(from_json(col("newJSON"), parseSchema), col("newJSON")).show(2)

This chapter covers building expressions and working with various data types in Spark, including booleans, numbers, strings, dates, and timestamps, as well as handling null and complex types.
