25 Questions
Spark's structured operations only involve working with Booleans and Numbers.
False
DataFrame is a specialized form of the Dataset class (an alias for Dataset[Row]).
True
DataFrameStatFunctions holds methods related to working with null data.
False
org.apache.spark.sql.functions is a package that contains functions only for working with Strings.
False
The majority of Spark's functions are unique to Spark and cannot be found in SQL and analytics systems.
False
The explode function takes a column that consists of strings and creates one row per character in the string.
False
The split function is used to combine multiple columns into a single array column.
False
The map function is used to create a new column with a constant value.
False
The explode function can be used with SQL queries.
True
The explode function can only be used with Scala and Python APIs.
False
The result of the explode function is a single row with an array of values.
False
The printSchema method is used to display the data of the DataFrame.
False
The createOrReplaceTempView method is used to create a permanent table.
False
Spark has unique support for working with JSON data and can parse JSON objects directly from strings.
True
The get_json_object function is used to extract JSON objects with multiple levels of nesting.
False
The jsonDF.selectExpr method is used to parse JSON objects from a string column.
True
The json_tuple function is used to extract JSON objects with multiple levels of nesting.
False
Spark's SQL syntax supports JSON operations using the json_tuple function.
True
Spark can only operate on JSON data that is stored in a column of type JSON.
False
The to_json function can be used to convert a StructType into a JSON string without any additional parameters.
True
The from_json function does not require a schema to be specified.
False
The to_json function can only be used in Scala.
False
The result of the from_json function is a single row with an array of values.
False
The to_json function is used to parse JSON data into a StructType.
False
The from_json function can only be used with JSON data that is stored in a column named 'newJSON'.
False
Study Notes
Working with Different Types of Data
- This chapter covers building expressions, which are the foundation of Spark's structured operations.
- It reviews working with various types of data, including:
- Booleans
- Numbers
- Strings
- Dates and timestamps
- Handling null
- Complex types
- User-defined functions
Where to Look for APIs
- Key places to look for transformations in Spark:
- DataFrame (Dataset) methods
- Column methods
- org.apache.spark.sql.functions package
- SQL and DataFrame functions
Reading in a DataFrame
- Example code to read in a CSV file and create a temporary view:
- In Scala:
val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/data/retail-data/by-day/2010-12-01.csv")
- In Python:
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/data/retail-data/by-day/2010-12-01.csv")
Exploding a Column of Text
- The explode function takes a column that consists of arrays and creates one row per value in the array.
- Example code to explode a column of text:
- In Scala:
df.withColumn("splitted", split(col("Description"), " ")).withColumn("exploded", explode(col("splitted"))).select("Description", "InvoiceNo", "exploded").show(2)
- In Python:
df.withColumn("splitted", split(col("Description"), " ")).withColumn("exploded", explode(col("splitted"))).select("Description", "InvoiceNo", "exploded").show(2)
- In SQL:
SELECT Description, InvoiceNo, exploded FROM (SELECT *, split(Description, " ") as splitted FROM dfTable) LATERAL VIEW explode(splitted) as exploded
Working with Maps
- Maps are created by using the map function and key-value pairs of columns.
- Example code to create a map:
- In Scala: Not shown
- In Python: Not shown
Working with JSON
- Spark has unique support for working with JSON data.
- You can operate directly on strings of JSON in Spark and parse from JSON or extract JSON objects.
- Example code to create a JSON column:
- In Scala:
val jsonDF = spark.range(1).selectExpr("'{\"myJSONKey\" : {\"myJSONValue\" : [1, 2, 3]}}' as jsonString")
- In Python:
jsonDF = spark.range(1).selectExpr("'{\"myJSONKey\" : {\"myJSONValue\" : [1, 2, 3]}}' as jsonString")
- Example code to use the get_json_object function to inline query a JSON object:
- In Scala:
jsonDF.select(get_json_object(col("jsonString"), "$.myJSONKey.myJSONValue[1]") as "column", json_tuple(col("jsonString"), "myJSONKey")).show(2)
- In Python:
jsonDF.select(get_json_object(col("jsonString"), "$.myJSONKey.myJSONValue[1]").alias("column"), json_tuple(col("jsonString"), "myJSONKey")).show(2)
- Example code to turn a StructType into a JSON string using the to_json function:
- In Scala:
df.selectExpr("(InvoiceNo, Description) as myStruct").select(to_json(col("myStruct")))
- In Python:
df.selectExpr("(InvoiceNo, Description) as myStruct").select(to_json(col("myStruct")))
- Example code to parse a JSON string using the from_json function (which requires a schema):
- In Scala:
df.selectExpr("(InvoiceNo, Description) as myStruct").select(to_json(col("myStruct")).alias("newJSON")).select(from_json(col("newJSON"), parseSchema), col("newJSON")).show(2)
- In Python:
df.selectExpr("(InvoiceNo, Description) as myStruct").select(to_json(col("myStruct")).alias("newJSON")).select(from_json(col("newJSON"), parseSchema), col("newJSON")).show(2)
This chapter covers building expressions and working with various data types in Spark, including booleans, numbers, strings, dates, and timestamps, as well as handling null and complex types.