Spark Core Concepts


12 Questions

What is the primary function of Spark's DAG Scheduler?

To break down RDD operations into a DAG of tasks

What is the purpose of 'Execution Memory' in Spark's memory hierarchy?

To hold intermediate results of computations such as shuffles, joins, sorts, and aggregations

What happens in Spark's 'Disk Spill Mode'?

Spark spills data to disk when memory is limited

What is the difference between 'Cache' and 'Persist' in Spark?

Cache stores an RDD with the default storage level (memory only), while persist lets you choose a StorageLevel spanning memory and/or disk

What is the purpose of 'StorageLevel' in Spark?

To define the storage behavior for an RDD

What type of RDD operation triggers computation and returns a result?

Action

What is the primary function of Azure Databricks' connectors in data engineering?

To ingest data from various sources

What is the benefit of using Apache Spark on Azure Databricks?

It provides automatic scaling and termination of Spark clusters

What is the purpose of Azure Databricks' workflows and jobs in data engineering?

To create, schedule, and manage data pipelines

What is the benefit of using Azure Databricks' data storage solutions?

It allows for data storage and management in various formats

What is a feature of Spark clusters on Azure Databricks?

Automatic scaling and termination

What is a benefit of using Azure Databricks' collaborative workspaces?

It supports collaborative workspaces for data engineers, data scientists, and data analysts

Study Notes

Spark Core

  • Resilient Distributed Datasets (RDDs): the fundamental data structure in Spark, an immutable, partitioned collection of records that can be processed in parallel across multiple nodes
  • RDD Operations: two types of operations:
    • Transformations: operations that create a new RDD, e.g., map, filter, groupBy
    • Actions: operations that trigger computation and return a result, e.g., count, collect, reduce
  • DAG (Directed Acyclic Graph) Scheduler: Spark's scheduling engine, responsible for turning RDD operations into executable work
    • Breaks the lineage of RDD operations down into a DAG of stages and tasks
    • Schedules tasks on the cluster, taking dependencies and node availability into account

Memory Management

  • Memory Hierarchy: Spark uses a hierarchical memory management system to optimize performance
    • Execution Memory: used for intermediate results of computations such as shuffles, joins, sorts, and aggregations
    • Storage Memory: used for storing persisted RDDs
    • System Memory: used for JVM overhead and other system-level memory allocation
  • Memory Modes:
    • In-Memory Mode: Spark stores data in memory for faster access
    • Disk Spill Mode: Spark spills data to disk when memory is limited
  • Cache and Persist:
    • Cache: stores an RDD with the default storage level (MEMORY_ONLY) for faster reuse
    • Persist: stores an RDD with a chosen storage level, in memory and/or on disk, for later reuse
    • StorageLevel: defines the storage behavior for an RDD, e.g., MEMORY_ONLY, DISK_ONLY, MEMORY_AND_DISK


Azure Databricks Overview

  • Fast, easy, and collaborative analytics platform based on Apache Spark
  • Provides a managed platform for data engineering, data science, and data analytics

Data Engineering with Azure Databricks

  • Ingest data from various sources (e.g., Azure Storage, Azure Data Lake Storage, Azure Cosmos DB) using Azure Databricks' connectors
  • Process and transform large datasets using Apache Spark for scalability and performance
  • Create, schedule, and manage data pipelines using Azure Databricks' workflows and jobs
  • Store and manage data in various formats (e.g., CSV, JSON, Avro) using Azure Databricks' data storage solutions (e.g., Databricks File System (DBFS))

Apache Spark on Azure Databricks

  • Create and manage Spark clusters on Azure Databricks with automatic scaling and termination
  • Supports multiple Apache Spark versions for flexibility and compatibility
  • Leverage Spark's APIs (e.g., DataFrame, Dataset, RDD) for data processing, machine learning, and graph processing
  • Use popular Spark libraries, such as MLlib (machine learning), GraphFrames (graph processing), and Structured Streaming (streaming data processing)

Additional Features

  • Supports collaborative workspaces for data engineers, data scientists, and data analysts
  • Provides enterprise-grade security features, such as Azure Active Directory integration and encryption
  • Integrates with other Azure services (e.g., Azure Synapse Analytics, Azure Storage) and popular data tools (e.g., Jupyter, Git)

Understand the fundamental concepts of Spark Core, including Resilient Distributed Datasets (RDDs), RDD Operations, and DAG Scheduler.
