Parallel & Distributed Systems Meeting 4

Questions and Answers

What is horizontal scaling in the context of databases?

Horizontal scaling involves adding more machines or nodes to a distributed system to handle increased load.

What are some common use cases for Apache Cassandra?

Common use cases include real-time analytics, IoT applications, and handling large volumes of data across multiple nodes.

Describe the significance of Cassandra's architecture.

Cassandra's architecture is designed to ensure high availability and fault tolerance through its distributed nature and peer-to-peer configuration.

How does Apache Cassandra handle configuration of clusters?

Cassandra allows configuration of clusters through YAML configuration files that define key parameters for nodes and replication.

What is the motivation behind using Apache Cassandra in modern applications?

The motivation stems from its ability to provide high write and read throughput while ensuring fault tolerance and scalability.

Study Notes

Course Information

  • Course title: Parallel & Distributed Systems
  • Meeting number: 4
  • Institution: Full Mark Academy

Motivation for Using Cassandra

  • All user-collected metrics are stored on Cassandra.
  • User data increased, demanding a more efficient storage solution.
  • Two main goals for the redesign: smaller storage footprint and consistent read/write performance as membership grows.

Netflix & Cassandra

  • Cassandra is used for its scalability, lack of single points of failure, and cross-regional deployments.
  • A global Cassandra cluster can serve requests in several locations at once while asynchronously replicating data between them.

Motivation (General)

  • If data isn't growing, then business isn't either.
  • DBMS: Database Management System
  • SQL: Structured Query Language, the standardized language for accessing databases.
  • Database Management Systems (DBMS) examples include: SQLite, MySQL, SQL Server, and PostgreSQL

SQL Example

  • Tables depict data in a structured format, with columns specifying data types (e.g. integer, varchar, date, enum).
  • Example table structure includes data elements like ID, Name, Mood, Birth Date, and Color.
  • Related data in other tables is linked to these columns through foreign keys (shown in the schema as -id, -id_aboie, etc.).
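
A minimal sketch of such a table, using Python's built-in sqlite3 module. The table names (owner, pet), the mood values, and the sample rows are assumptions made up for illustration; only the column list (ID, Name, Mood, Birth Date, Color) comes from the notes. SQLite has no enum type, so a CHECK constraint stands in for it.

```python
import sqlite3

# In-memory database; all names and values below are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript("""
CREATE TABLE owner (
    id   INTEGER PRIMARY KEY,
    name VARCHAR(50) NOT NULL
);
CREATE TABLE pet (
    id         INTEGER PRIMARY KEY,
    name       VARCHAR(50) NOT NULL,
    mood       TEXT CHECK (mood IN ('happy', 'sad', 'sleepy')),  -- stands in for an enum
    birth_date DATE,
    color      VARCHAR(20),
    owner_id   INTEGER REFERENCES owner(id)                      -- foreign key
);
""")

conn.execute("INSERT INTO owner (id, name) VALUES (1, 'Ada')")
conn.execute(
    "INSERT INTO pet (name, mood, birth_date, color, owner_id) VALUES (?, ?, ?, ?, ?)",
    ("Rex", "happy", "2020-05-17", "brown", 1),
)

# Follow the foreign key from pet to owner with a join.
rows = conn.execute(
    "SELECT pet.name, pet.mood, owner.name "
    "FROM pet JOIN owner ON pet.owner_id = owner.id"
).fetchall()
print(rows)
```

Inserting a pet whose owner_id does not exist in owner would be rejected, which is the guarantee a foreign key provides.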

CAP Theorem

  • A distributed system cannot guarantee all three desirable properties at the same time: consistency, availability, and partition tolerance.
  • Different database systems favor different combinations of these properties (e.g., Dynamo and Cassandra for AP; PostgreSQL and MongoDB for CP).

NoSQL Database

  • A NoSQL database (Not Only SQL) provides a mechanism to store and retrieve data differently than relational databases (which use tables).

Database Scaling

  • Vertical scaling: increase the size of the instance (RAM, CPU)
  • Horizontal scaling: add more instances.

NoSQL vs. SQL

  • NoSQL:
    • Supports simple query languages.
    • Doesn't have a fixed schema.
    • Is "eventually consistent".
    • Doesn't support transactions.
  • SQL:
    • Supports powerful query languages.
    • Has a fixed schema.
    • Follows ACID (Atomicity, Consistency, Isolation, and Durability).
    • Supports transactions.
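
To make the "supports transactions" and atomicity points concrete, here is a small sketch with Python's built-in sqlite3 module. The account table, the names, and the amounts are assumptions for illustration. The second transfer fails halfway through, and the database rolls back the half that already ran.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money atomically: both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, "alice", "bob", 30)
failed = transfer(conn, "alice", "bob", 500)  # would push alice below zero
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(ok, failed, balances)
```

After the failed transfer, bob's balance is unchanged even though his update ran first: the whole transaction was undone.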

When to Use NoSQL

  • Scalability is a concern.
  • Absence of a consistent schema is a concern.
  • Being in fashion is a concern.

Apache Cassandra

  • An open-source, distributed/decentralized storage system for managing large amounts of structured data.
  • Its design is based on Amazon's Dynamo and Google's BigTable.
  • Differs significantly from relational database management systems.
  • Column-oriented database (more precisely, a wide-column store).
  • Popular with large companies (e.g. Facebook, Twitter, Cisco, Rackspace, eBay, Netflix).

Columnar-Store

  • Column-oriented databases (under the hood) store all values from each column together.
  • Row-oriented databases store all values from each row together.
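
The difference between the two layouts can be sketched with plain Python data structures. The records below are made-up sample data, and real storage engines use on-disk formats rather than lists and dicts; the sketch only shows which values end up next to each other.

```python
records = [
    {"id": 1, "name": "Rex", "color": "brown"},
    {"id": 2, "name": "Mia", "color": "black"},
    {"id": 3, "name": "Tom", "color": "brown"},
]

# Row-oriented: the values of one record sit next to each other.
row_store = [tuple(r.values()) for r in records]

# Column-oriented: the values of one column sit next to each other.
column_store = {key: [r[key] for r in records] for key in records[0]}

# Fetching a whole record touches one entry of the row store...
second_record = row_store[1]
# ...while scanning a single column touches one list of the column store.
brown_count = column_store["color"].count("brown")

print(row_store)
print(column_store)
print(second_record, brown_count)
```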

Cassandra Features

  • NoSQL distributed database.
  • Geographic distribution across multiple data centers.
  • Hybrid cloud & multi-cloud capabilities supporting on-premise installations.
  • Continuous availability & zero downtime.
  • High read/write performance.
  • Peer-to-peer architecture allowing any node to handle reads/writes for high availability.
  • Straightforward horizontal scalability: the cluster can be expanded quickly by adding commodity servers.

Cassandra Applications

  • Back-end development, analytics, monitoring applications, messaging, time series-based applications, and storage

Cassandra Use Cases

  • High throughput, high-volume applications.
  • Mission-critical cases requiring database availability.
  • Globally distributed applications due to security or business requirements (e.g. GDPR compliance).
  • Data storage requirements with hundreds of millions of events.
  • Efficient time series data storage.

Cassandra Structure

  • "Distributed" means Cassandra operates across multiple machines but appears as a single entity to users.
  • Cassandra consists of multiple nodes (which are individual operating instances)
  • Nodes communicate using a gossip protocol, enabling all nodes to discover and track each other.
  • Nodes operate in a masterless architecture: no node is a designated master, so the cluster keeps working when individual nodes fail.

Cassandra Architecture

  • Data is distributed transparently across nodes, based on each row's partition key.
  • Any node can receive a request and either return data if it has it, or forward to the node containing the data.
  • The node that receives a request acts as its coordinator: it routes the operation to the nodes that hold the data and collects their responses.
  • High availability across multiple nodes, ensuring data resilience despite node failures.
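
The request routing described above can be sketched as a toy in-memory cluster. Everything here is an assumption for illustration: the Cluster class, the node names, one token per node, and SHA-256 as the hash (a stand-in, not Cassandra's partitioner). Replication is left out, so each key lives on exactly one node.

```python
import bisect
import hashlib

def token(key):
    """Place a string on the ring. SHA-256 is only a stand-in hash for this sketch."""
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)

class Cluster:
    def __init__(self, node_names):
        self.ring = sorted((token(name), name) for name in node_names)
        self.tokens = [t for t, _ in self.ring]
        self.storage = {name: {} for name in node_names}  # per-node local data

    def owner(self, key):
        # First node clockwise from the key's token, wrapping past the end.
        index = bisect.bisect_left(self.tokens, token(key)) % len(self.ring)
        return self.ring[index][1]

    def write(self, coordinator, key, value):
        # The coordinator stores the value on whichever node owns the key.
        self.storage[self.owner(key)][key] = value

    def read(self, coordinator, key):
        target = self.owner(key)
        route = "local" if target == coordinator else "forwarded to " + target
        return self.storage[target].get(key), route

cluster = Cluster(["node-a", "node-b", "node-c"])
cluster.write("node-a", "user:42", "Ada")
for coordinator in ["node-a", "node-b", "node-c"]:
    print(coordinator, cluster.read(coordinator, "user:42"))
```

Whichever node the client contacts, the read returns the same value; only the route differs.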

Cassandra Components

  • Hierarchy includes: cluster, data center(s), rack(s), server(s), and node(s).
  • A node is a storage unit within a server. A single server can host more than one such unit as virtual nodes (vnodes), e.g. 256 by default in older Cassandra releases.
  • A rack is a group of servers, typically 10-30 physical units.
  • A data center is the infrastructure that houses multiple racks.
  • A cluster spans one or more data centers.
  • A server is the software and hardware unit that hosts the node.
  • Within a data center, servers communicate over a LAN at roughly 1 Gbps.

Cassandra Installation

  • Steps for Cassandra installation
    1. Verify Docker setup.
    2. Check if Cassandra image exists.
    3. Download the latest version of the Cassandra image.
    4. Run the container.
    5. Check container status.
    6. Check Cassandra logs.
    7. Stop/Remove the container.
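
The steps above map onto standard Docker CLI commands. The sketch below only builds and prints the commands rather than running them; the container name my-cassandra and the cassandra:latest tag are assumptions. Step 7 is split into separate stop and rm commands.

```python
import shlex

def cassandra_docker_steps(name="my-cassandra", image="cassandra:latest"):
    """Return (step number, argv) pairs for the installation steps."""
    repository = image.split(":")[0]
    return [
        (1, ["docker", "--version"]),                       # verify Docker setup
        (2, ["docker", "images", repository]),              # is the image already local?
        (3, ["docker", "pull", image]),                     # download the image
        (4, ["docker", "run", "--name", name, "-d", image]),  # run detached
        (5, ["docker", "ps", "--filter", f"name={name}"]),  # container status
        (6, ["docker", "logs", name]),                      # Cassandra logs
        (7, ["docker", "stop", name]),                      # stop the container
        (7, ["docker", "rm", name]),                        # remove the container
    ]

steps = cassandra_docker_steps()
for number, argv in steps:
    print(number, shlex.join(argv))
```

To execute a step for real, each argv list can be passed to subprocess.run(argv, check=True) on a machine where Docker is installed.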

Configuration

  • Use COPY or RUN in a Dockerfile, or basic Linux commands (such as sed or cat), to place configuration into the container.
  • Use environment variables when running the Cassandra image to pass those settings to the spawned container.

Where To Store Data

  • Option 1: Let Docker manage the database files in a Docker volume on the host system. The files are not directly accessible to tools on the local machine.
  • Option 2: Create a data directory on the host (outside docker container). Mount that directory to the internal directory in the container. This allows tools to directly access the files outside of the container.
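
A sketch of how environment variables and a mounted data directory (option 2) combine into one docker run command. The function name, container name, and host path are hypothetical. The variable names and the /var/lib/cassandra data path are the ones commonly documented for the official cassandra image, but they should be checked against the image version in use.

```python
import shlex

def cassandra_run_command(name, host_data_dir, env, image="cassandra:latest"):
    """Build the argv for 'docker run' with settings and a bind-mounted data directory."""
    argv = ["docker", "run", "--name", name, "-d"]
    for key, value in sorted(env.items()):
        argv += ["-e", f"{key}={value}"]  # passed to the container as environment variables
    # Bind-mount the host directory over the container's data directory (assumed path).
    argv += ["-v", f"{host_data_dir}:/var/lib/cassandra"]
    argv.append(image)
    return argv

command = cassandra_run_command(
    name="my-cassandra",                 # hypothetical container name
    host_data_dir="/srv/cassandra-data",  # hypothetical host path
    env={"CASSANDRA_CLUSTER_NAME": "demo", "CASSANDRA_NUM_TOKENS": 16},
)
print(shlex.join(command))
```

Because the data lives in the host directory, it survives removal of the container and can be inspected with ordinary tools on the host.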

Data Replication in Cassandra

  • Hardware failures and network outages can happen at any time, so data needs backup copies.
  • Replication ensures there is no single point of failure: data is copied (replicated) across several nodes.
  • Cassandra places replicas according to the replication strategy and the replication factor (RF), which is the number of copies kept of each piece of data.

Replication Strategy

  • Simple Strategy is suited to a single data center: after the first replica has been placed, the remaining replicas are saved on the next nodes in clockwise order around the ring.
  • Network Topology Strategy is the best choice when using multiple data centers. It places replicas in separate physical locations (data centers and racks).
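
The clockwise placement of Simple Strategy can be sketched on a toy ring. The tokens (0, 25, 50, 75) and node names are invented for illustration, and the function is a simplification rather than Cassandra's implementation. Network Topology Strategy, which also takes data centers and racks into account, is not modeled.

```python
import bisect

def simple_strategy_replicas(ring, key_token, rf):
    """ring: (token, node) pairs. Return rf distinct nodes, walking clockwise."""
    ring = sorted(ring)
    tokens = [t for t, _ in ring]
    # The first replica is the first node whose token is >= the key's token.
    start = bisect.bisect_left(tokens, key_token) % len(ring)
    replicas = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in replicas:
            replicas.append(node)
        if len(replicas) == rf:
            break
    return replicas

ring = [(0, "A"), (25, "B"), (50, "C"), (75, "D")]
first = simple_strategy_replicas(ring, key_token=30, rf=3)
wrapped = simple_strategy_replicas(ring, key_token=80, rf=2)
print(first)    # ['C', 'D', 'A']
print(wrapped)  # ['A', 'B']
```

The second call shows the wrap-around: a key whose token lies past the last node is placed on the first node of the ring.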

Keyspaces in Cassandra

  • Keyspaces are containers used to group tables within a Cassandra database.
  • They act like schemas in relational databases.
  • Each keyspace can have different replication strategies and configuration settings
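
Keyspaces are created with a CQL CREATE KEYSPACE statement that names the replication strategy. The helper below only builds the statement text; the helper itself, the keyspace names, and the data center names dc1 and dc2 are hypothetical. Running the statements requires cqlsh or a driver connected to a cluster.

```python
def cql_literal(value):
    """Quote strings for CQL; leave numbers as they are."""
    return f"'{value}'" if isinstance(value, str) else str(value)

def create_keyspace_cql(name, strategy, **options):
    """Build a CREATE KEYSPACE statement for the given replication settings."""
    settings = {"class": strategy, **options}
    body = ", ".join(f"'{key}': {cql_literal(val)}" for key, val in settings.items())
    return f"CREATE KEYSPACE IF NOT EXISTS {name} WITH replication = {{{body}}};"

single_dc = create_keyspace_cql("metrics", "SimpleStrategy", replication_factor=3)
multi_dc = create_keyspace_cql("metrics_global", "NetworkTopologyStrategy", dc1=3, dc2=2)
print(single_dc)
print(multi_dc)
```

With Network Topology Strategy, each data center gets its own replica count, which is how one keyspace can keep three copies in one location and two in another.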

Nodetool

  • A command-line tool used for monitoring Cassandra health.
  • Provides details such as node status, IP address, data center, and rack, showing where each node sits in the cluster.

Gossip Protocol

  • A peer-to-peer message exchange between nodes for distributing information about each other's status and locating data.
  • State/status information is exchanged periodically and contributes to resilience and responsiveness.
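
A heavily simplified simulation of the idea: each node starts out knowing only itself, and in every round each node exchanges its view with one randomly chosen peer. The data layout (a heartbeat counter per known node), the one-peer-per-round rule, and the node names are assumptions for this sketch, not Cassandra's actual protocol details.

```python
import random

def gossip_round(views, rng):
    """Each node picks one random peer; both end up with the merged view."""
    names = list(views)
    for name in names:
        peer = rng.choice([n for n in names if n != name])
        merged = dict(views[name])
        for node, heartbeat in views[peer].items():
            # Keep the freshest heartbeat seen for every known node.
            merged[node] = max(merged.get(node, 0), heartbeat)
        views[name] = dict(merged)
        views[peer] = dict(merged)

def simulate(names, seed=0, max_rounds=100):
    rng = random.Random(seed)
    views = {n: {n: 1} for n in names}  # initially each node knows only itself
    rounds = 0
    while rounds < max_rounds and any(len(v) < len(names) for v in views.values()):
        gossip_round(views, rng)
        rounds += 1
    return rounds, views

names = ["n1", "n2", "n3", "n4", "n5", "n6"]
rounds, views = simulate(names)
print("rounds needed:", rounds)
print("n1 knows:", sorted(views["n1"]))
```

No node ever contacts every other node directly, yet after a few rounds all of them know about the whole cluster.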

Description

Explore the motivations behind using Cassandra in parallel and distributed systems, with a focus on its scalability and performance. This quiz covers key concepts related to database management systems and their role in handling user data efficiently. Test your understanding of the integration between Netflix and Cassandra, and the importance of SQL in these systems.
