Week05_Intro_to_Cassandra-2.pdf

MONASH INFORMATION TECHNOLOGY FIT3176 – Advanced Database Design Week 5 – Introduction to Column-Oriented Databases Acknowledgement of Country I would like to begin by acknowledging the traditional custodians of the land on which we are meeting today, the people of the Kulin Nation, and pay my respects to their elders past and present. Ask questions anytime at: http://pollev.com/fit3176 Unit Roadmap MongoDB Intro to Intro to Data NoSQL MongoDB Modelling 01 02 03 Cassandra Intro to MonogDB Cassandra 3 Data Cassandra Aggregation & Modeling Indexing 07 06 05 04 NoSQL Polyglot Neo4j 1 Neo4j 2 Neo4j 3 Theory Persistence 08 09 10 11 12 Ask questions anytime at: http://pollev.com/fit3176 Reviewing Learning Outcomes Learning Outcomes: 01 Describe various non-relational database systems, including NoSQL; 02 Compare and contrast between relational and non- relational database design and model; 03 Design database systems using document-store and column-store design techniques; Learning outcome breakdown: 04 Explain transactions systems in non-relational 01. Describe various non-relational database systems, including NoSQL: systems; Introduction to Column-Oriented Databases 05 Implement and manipulate document-store and 03. Design database systems using column-store design techniques: column-store database systems; Introduction to Cassandra Cassandra’s Architecture 06 Investigate graph model in database systems; Setting up Cassandra 07 Demonstrate graph query processing. Ask questions anytime at: http://pollev.com/fit3176 Outline Introduction to Column-Oriented Databases Introduction to Cassandra Cassandra Features and Architecture Cassandra’s Peer-to-Peer Approach Communication between Peers Read and Write Cassandra Query Language (CQL) Using CQL Shell Using Python Wrap-Up Ask questions anytime at: http://pollev.com/fit3176 Recall from week 1… NoSQL Databases NoSQL Category Example Databases Key-value databases Dynamo, Riak, Redis, Voldemort, ArangoDB, Aerospike, Apache Ignite Document-oriented databases MongoDB, CouchDB, OrientDB, RavenDB, ArangoDB, BaseX, Clusterpoint, Couchbase, Cosmos DB, RethinkDB, MarkLogic Column-oriented databases Cassandra, Hbase, Hypertable, Accumulo, Scylla Graph databases Neo4j, ArangoDB, GraphBase, AllegroGraph, InfiniteGraph, Apache Giraph, MarkLogic, OrientDB, Virtuoso Ask questions anytime at: http://pollev.com/fit3176 Wide Column databases https://db-engines.com/en/ranking/wide+column+store Ask questions anytime at: http://pollev.com/fit3176 Image taken from Recall from Week 01… Column-oriented databases ▪ Originated from Google’s BigTable product, HBase, and Cassandra. ▪ Also known as column family database. ▪ Each row may have a different set of columns ▪ Columns that are often read together are stored together, called super columns ▪ All of the super columns are grouped together to create a column family. A column family is conceptually similar to a table in the relational model. ▪ Does not require a fixed schema ▪ Not unusual for rows to contain millions of columns Ask questions anytime at: http://pollev.com/fit3176 Row-oriented systems ▪ A row-based system is designed to efficiently return data for an entire row, which matches the common use case in which users wish to retrieve information about a particular object or entity. ▪ Queries like “Find average score” may need to read the whole data. ID Name Age Country Score Data stored in disk as: 1 Alice 30 USA 5 Row 1: [1, Alice, 30, USA, 5] 2 Bob 25 UK 10 Row 2: [2, Bob, 25, UK, 10] Row 3: [3, Charlie, 35, Canada, 7] 3 Charlie 35 Canada 7 Ask questions anytime at: http://pollev.com/fit3176 Column-Oriented Databases ▪ In a column-oriented database (also called columnar database), all values of a column are placed together on the disk, so the previous example table would be stored in this way. ▪ Queries like “Find average score” needs to read only one column. ▪ Relational databases can be both row and column oriented. Data stored in disk as: ID Name Age Country Score CustomerID Column: [1, 2, 3] 1 Alice 30 USA 5 Name Column: [Alice, Bob, Charlie] 2 Bob 25 UK 10 Age Column: [30, 25, 35] Country Column: [USA, UK, Canada] 3 Charlie 35 Canada 7 Scores Column: [5, 10, 7] Ask questions anytime at: http://pollev.com/fit3176 Column-Oriented Databases ▪ A column-oriented DBMS is a DBMS that stores data tables as sections of columns of data, rather than as rows of data as in most DBMS implementations. ▪ Column with many null values, known as sparse data, can be dealt with more efficiently, without wasting storage capacity for the empty cells. Ask questions anytime at: http://pollev.com/fit3176 Column-Oriented Databases Advantages Disadvantages Speed up query processing and aggregation Retrieving all attributes pertaining to a single operations. entity becomes less efficient Efficient storage space consumption (data can Join operations will be slowed down be easily compressed). considerably (every column must be scanned in order to find values belonging to a certain foreign record identifier). What if you need to often access multiple related columns together, e.g., street, postcode, state? Use column families (the reason these are also called column family databases or wide column databases). Ask questions anytime at: http://pollev.com/fit3176 In the Beginning, There Was Google BigTable ▪ The following are core features of Google BigTable: Developers have dynamic control over columns. Data values are indexed by row identifier, column name, and a time stamp. Data modelers and developers have more control over location of data. Reads and writes of a row are atomic. Ask questions anytime at: http://pollev.com/fit3176 Column-Oriented Databases ▪ Rows are composed of several column families. Each family consists of a set of related columns. ▪ For example, an address column family might contain: Street, City, State, province, zip, postcode, country Ask questions anytime at: http://pollev.com/fit3176 Column-Oriented Databases ▪ Data values are indexed by row identifier, column name, and time stamp. The row identifier is analogous to a primary key in a relational database. ▪ Multiple versions of a column value can exist. The latest version is returned by default when the column value is queried. Ask questions anytime at: http://pollev.com/fit3176 Column-Oriented vs. Key-Value Databases Ask questions anytime at: http://pollev.com/fit3176 Column-Oriented vs. Document-Oriented Ask questions anytime at: http://pollev.com/fit3176 Terminology ▪ Keyspace A keyspace is the top-level data structure that logically holds column families, row keys, and related data structures. Typically, there is one keyspace per application. A keyspace is analogous to a schema in a relational database. Ask questions anytime at: http://pollev.com/fit3176 Terminology ▪ Row Key A row key uniquely identifies a row and has a role in determining the order in which data is stored. It serves some of the same purposes as a primary key in a relational database. ▪ In Cassandra, rows are stored in an order determined by an object known as a partitioner. ▪ Cassandra uses a random partitioner by default. As the name implies, the partitioner randomly distributes rows across nodes. Cassandra also provides an order-preserving partitioner, which can provide lexicographic ordering. Ask questions anytime at: http://pollev.com/fit3176 Terminology ▪ Column A column, along with a row key and version stamp, uniquely identifies values. ▪ Columns have three parts: A column name A time stamp or other version stamp A value Ask questions anytime at: http://pollev.com/fit3176 Terminology ▪ Column Families Column families are collections of related columns. Columns that are frequently used together should be grouped into the same column family. Column families are stored in a keyspace. Each row in a column family is uniquely identified by a row key. Column families are analogous to relational database tables: They store multiple rows and multiple columns. Ask questions anytime at: http://pollev.com/fit3176 Outline Introduction to Column-Oriented Databases Introduction to Cassandra Cassandra Features and Architecture Cassandra’s Peer-to-Peer Approach Communication between Peers Read and Write Cassandra Query Language (CQL) Using CQL Shell Using Python Wrap-Up Ask questions anytime at: http://pollev.com/fit3176 Apache Cassandra ▪ A wide-column store database. ▪ A free, open source, distributed data storage system that differs sharply from relational database management systems (RDBMSs). ▪ Originated at Facebook in 2008 to solve its inbox search problem. ▪ A database designed for high availability, scalability, and consistency. Ask questions anytime at: http://pollev.com/fit3176 Apache Cassandra Features ▪ Distributed Databases can be spread across multiple servers. ▪ Decentralized Every node in the cluster is identical. ▪ Elastically scalable Read and write throughput both increase linearly as new machines are added, with no downtime or interruption to application. ▪ Highly available & Fault-tolerant Data is automatically replicated to multiple nodes, across multiple data centers. Ask questions anytime at: http://pollev.com/fit3176 Cassandra is a distributed database Monolithic Architecture Cassandra Distributed Architecture Ask questions anytime at: http://pollev.com/fit3176 Monolithic Architecture ▪ All functions reside on a single machine ▪ Problems: single point of failure Ask questions anytime at: http://pollev.com/fit3176 Cassandra is a decentralized database Master-Slave Model Cassandra’s Decentralized Model Ask questions anytime at: http://pollev.com/fit3176 The Master-slave model ▪ Many distributed databased still attempt to maintain ACID guarantees, leading to the master- slave model. ▪ For data consistency aspect: Many servers handling requests, but only master server can perform writes ▪ Problem: a failure of the write master Ask questions anytime at: http://pollev.com/fit3176 Using sharding to scale writes ▪ Sharding an approach that enables higher write volumes the data is partitioned into groups of keys One or more masters can own a known set of keys ▪ Problems: Requires manual shuffling of data Failure points of the master nodes Ask questions anytime at: http://pollev.com/fit3176 Handling the death of the leader ▪ Principle is that a new master is appointed when the previous one fails. ▪ Solution to increase availability in a master-slave system. ▪ Employ a master failover protocol. ▪ The protocol vary among implementations. Ask questions anytime at: http://pollev.com/fit3176 Undesirable traits of leader election ▪ Applications must understand the database topology. ▪ Writes are difficult to scale. ▪ A failover dramatically increases the complexity of the system in general, and especially so for multisite databases. ▪ Adding capacity requires reshuffling data with a potential for downtime. Ask questions anytime at: http://pollev.com/fit3176 Cassandra is an elastically scalable database ▪ Scalability – can continue serving a greater number of requests with little degradation in performance. Vertical scaling: adding more hardware capacity and memory. Horizontal scaling: adding more machines that have all or some of the data on them so that no one machine has to bear the entire burden of serving requests. ▪ Elastic scalability – refers to a special property of horizontal scalability, which means your cluster can seamlessly scale up and scale back down Ask questions anytime at: http://pollev.com/fit3176 High availability and fault-tolerance Users Cassandra Ask questions anytime at: http://pollev.com/fit3176 Outline Introduction to Column-Oriented Databases Introduction to Cassandra Cassandra Features and Architecture Cassandra’s Peer-to-Peer Approach Communication between Peers Read and Write Cassandra Query Language (CQL) Using CQL Shell Using Python Wrap-Up Ask questions anytime at: http://pollev.com/fit3176 Cassandra’s peer-to-peer approach ▪ All nodes in a Cassandra cluster can accept reads and writes, no matter where the data being written or requested belongs in the cluster. ▪ Internode communication takes place by means of a gossip protocol, which allows all nodes to quickly receive updates without the need for a master coordinator Ask questions anytime at: http://pollev.com/fit3176 Cassandra’s peer-to-peer approach ▪ There are several advantages to the peer-to-peer approach: 1. Simplicity. 2. No node can be a single point of failure. 3. Scaling up and down is fairly straightforward: Servers are added or removed from the cluster. Ask questions anytime at: http://pollev.com/fit3176 Cassandra’s peer-to-peer approach ▪ Because peer-to-peer networks do not have a single master coordinating server, the servers in the cluster are responsible for managing a number of operations that a master server would handle, including the following: Sharing information about the state of servers in the cluster Ensuring nodes have the latest version of data Ensuring write data is stored when the server that should receive the write is unavailable ▪ Cassandra has protocols to implement all of these functions. Ask questions anytime at: http://pollev.com/fit3176 Outline Introduction to Column-Oriented Databases Introduction to Cassandra Cassandra Features and Architecture Cassandra’s Peer-to-Peer Approach Communication between Peers Read and Write Cassandra Query Language (CQL) Using CQL Shell Using Python Wrap-Up Ask questions anytime at: http://pollev.com/fit3176 Problem with sending messages to everyone ▪ The number of messages sent in a complete server-to-server communication protocol grows more rapidly each time a server is added to the cluster. Ask questions anytime at: http://pollev.com/fit3176 Solution: Gossip Protocol ▪ Gossip protocol: o Is a peer-to-peer communication protocol. o Periodically exchanges state information about themselves and other nodes in the cluster. o Runs every second on a timer and exchanges state message to a few other nodes. o Used as an automatic mechanism for replication in distributed databases o Older information is overwritten with the most current state for a particular node Ask questions anytime at: http://pollev.com/fit3176 How the Gossiper works ▪ The gossiper node will choose a random node in the cluster and initialize a gossip session with it. o Each round of gossip requires three messages. ▪ The gossip initiator sends its chosen friend a Gossip-DigestSyn. ▪ When the friend receives this message, it returns a GossipDigestAck. ▪ When the initiator receives the ack message from the friend, it sends the friend a GossipDigestAck2 to complete the round of gossip. Ask questions anytime at: http://pollev.com/fit3176 Outline Introduction to Column-Oriented Databases Introduction to Cassandra Cassandra Features and Architecture Cassandra’s Peer-to-Peer Approach Communication between Peers Read and Write Cassandra Query Language (CQL) Using CQL Shell Using Python Wrap-Up Ask questions anytime at: http://pollev.com/fit3176 Cassandra – Read ▪ Any node in a Cassandra cluster can handle a client request. ▪ All nodes have information about the state of the cluster and can act as a proxy for a client, forwarding the request to the appropriate node in the cluster. Ask questions anytime at: http://pollev.com/fit3176 Cassandra – Write ▪ If a node is unavailable, then other nodes can receive write requests on its behalf and forward them to the intended node when it becomes available. ▪ Such nodes store “hints” that they have to send the data to the node when they become available ▪ This is called “Hinted handoff”. Ask questions anytime at: http://pollev.com/fit3176 When to Use Column-Oriented Databases? ▪ Column family databases are appropriate choices for large-scale database deployments that require a large number of servers or multi data center availability. Cassandra supports multi data centre deployment, including multi data centre replication. ▪ Write-intensive operations, such as those found in social networking applications, are good candidates for using column family databases. Cassandra’s peer-to-peer architecture with support for hinted handoff means the database will always be able to accept write operations as long as at least one node is functioning and reachable. Ask questions anytime at: http://pollev.com/fit3176 Outline Introduction to Column-Oriented Databases Introduction to Cassandra Cassandra Features and Architecture Cassandra’s Peer-to-Peer Approach Communication between Peers Read and Write Cassandra Query Language (CQL) Using CQL Shell Using Python Wrap-Up Ask questions anytime at: http://pollev.com/fit3176 CQL- Cassandra Query Language ▪ A simple way to manipulate the data you have stored. ▪ Syntax is similar to SQL. ▪ CQL has no concept of GROUP or JOIN, and very limited implementation of ORDER BY. Ask questions anytime at: http://pollev.com/fit3176 Using CQL Shell ▪ To learn about the current cluster you’re working in: - DESCRIBE CLUSTER; ▪ To list all available keyspaces in the cluster: - DESCRIBE KEYSPACES; ▪ Change keyspaces: - USE ▪ To show all tables: - DESCRIBE TABLES; Ask questions anytime at: http://pollev.com/fit3176 Using CQL Shell ▪ To create a keyspace: - CREATE KEYSPACE my_keyspace WITH replication = {'class':'SimpleStrategy’, 'replication_factor':1}; ▪ Can also use CREATE KEYSPACE IF NOT EXISTS my_keyspace… ▪ To check the newly created keyspace: - DESCRIBE KEYSPACE my_keyspace Ask questions anytime at: http://pollev.com/fit3176 Using CQL Shell ▪ To create a table: - CREATE TABLE user ( first_name text, last_name text, PRIMARY KEY (first_name) ) ; ▪ Can also use CREATE TABLE IF NOT EXISTS user… ▪ To get the description of the newly created table: - DESCRIBE TABLE user; Ask questions anytime at: http://pollev.com/fit3176 Using CQL Shell ▪ To write a value (INSERT command): - INSERT INTO user (first_name, last_name ) VALUES ('Bill', 'Nguyen'); ▪ To read (SELECT command): - SELECT * FROM user WHERE first_name='Bill'; ▪ To count (SELECT COUNT command): - SELECT COUNT (*) FROM user; Ask questions anytime at: http://pollev.com/fit3176 Using CQL Shell ▪ To delete a column value (DELETE command): - DELETE last_name FROM user WHERE first_name='Bill'; ▪ To delete the entire row (DELETE command): - DELETE FROM user WHERE first_name='Bill'; Ask questions anytime at: http://pollev.com/fit3176 Using CQL Shell ▪ To remove all data from a table (TRUNCATE command): - TRUNCATE user; ▪ To delete table schema (DROP TABLE command): - DROP TABLE user; - Can also use DROP TABLE IF EXISTS user Ask questions anytime at: http://pollev.com/fit3176 Outline Introduction to Column-Oriented Databases Introduction to Cassandra Cassandra Features and Architecture Cassandra’s Peer-to-Peer Approach Communication between Peers Read and Write Cassandra Query Language (CQL) Using CQL Shell Using Python Wrap-Up Ask questions anytime at: http://pollev.com/fit3176 Cassandra Using Python from cassandra.cluster import Cluster # Connect to Cassandra cluster = Cluster(['127.0.0.1']) # use the IP session = cluster.connect() # Create a keyspace. Triple quotes are for multi-line strings session.execute(""" CREATE KEYSPACE IF NOT EXISTS my_keyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} """) # Using the keyspace (equivalent to USE in cqlsh) session.set_keyspace('my_keyspace') Ask questions anytime at: http://pollev.com/fit3176 Cassandra Using Python # Create the 'user' table session.execute(""" CREATE TABLE IF NOT EXISTS user ( first_name text, last_name text, PRIMARY KEY (first_name) ) """) Ask questions anytime at: http://pollev.com/fit3176 Cassandra Using Python # Insert records into the 'user' table session.execute("INSERT INTO user (first_name, last_name) VALUES ('Bill', 'Nguyen')") session.execute("INSERT INTO user (first_name, last_name) VALUES ('Alice', 'Johnson')") session.execute("INSERT INTO user (first_name, last_name) VALUES ('Charlie', 'Smith')") # Select and display all records from the 'user' table rows = session.execute("SELECT * FROM user") for row in rows: print(row.first_name,row.last_name) Ask questions anytime at: http://pollev.com/fit3176 Cassandra Using Python # Delete the 'last_name' column for the user with first_name='Bill' session.execute("DELETE last_name FROM user WHERE first_name='Bill'") # Delete the entire record for 'Bill' session.execute("DELETE FROM user WHERE first_name='Bill'") # Truncate the 'user' table (delete all rows) session.execute("TRUNCATE user") # Drop the 'user' table session.execute("DROP TABLE IF EXISTS user") Ask questions anytime at: http://pollev.com/fit3176 Outline Introduction to Column-Oriented Databases Introduction to Cassandra Cassandra Features and Architecture Cassandra’s Peer-to-Peer Approach Communication between Peers Read and Write Cassandra Query Language (CQL) Using CQL Shell Using Python Wrap-Up Ask questions anytime at: http://pollev.com/fit3176 Poll Everywhere Competition Ask questions anytime at: http://pollev.com/fit3176 Concluding Remarks What we learned: Fundamentals of Column-Oriented Databases Cassandra features and architecture CQL using CQL Shell and Python To Do Before Next Week: Complete Week 06 Lab Go through this Cassandra Tutorial (ignore the topics that we have not covered yet) Coming Up Next: Cassandra Data Modeling Ask questions anytime at: http://pollev.com/fit3176 References C. Carlos and M. Steven, Database systems: Design, implementation, & management. Cengage Learning, 12th ed., 2016. C. Y. Kan, Cassandra Data Modeling and Analysis. Birmingham, UK: Packt Publishing, 1st ed., 2014. D. Sullivan, NoSQL for Mere Mortals. Michigan, USA: Addison-Wesley Professional, 1st ed., 2015. J. Carpenter and E. Hewitt, Cassandra: The Definitive Guide. Sebastopol, CA: O’Reilly Media, Inc., 2nd ed., 2016. N. Neeraj, Mastering Apache Cassandra. Birmingham, UK: Packt Publishing, 1st ed., 2013. R. Strickland, Cassandra 3.x High Availability. Birmingham, UK: Packt Publishing, 2nd ed., 2016. S. Alapati, Expert Apache Cassandra Administration. Texas, USA: Apress, 1st ed., 2018. W. Lemahieu, S. V. Broucke and B. Baesens, Principles of Database Management: The Practical Guide to Storing, Managing and Analyzing Big and Small Data. New York, NY, USA: Cambridge University Press, 1st ed., 2018. Ask questions anytime at: http://pollev.com/fit3176

Week05_Intro_to_Cassandra-2.pdf

Document Details

Related

Full Transcript