INF2003 Database Systems NoSQL AY24/25 Trimester 1 Lecture Notes PDF
Document Details
Uploaded by ConscientiousDeciduousForest
Singapore Institute of Technology (SIT)
2024
Zhang Wei
Tags
Summary
These lecture notes cover various database systems, including relational databases, NoSQL databases (key-value, document, and graph), and strategies for migrating data between them. The document also discusses specific database technologies like Redis and MongoDB.
Full Transcript
SIT Restricted INF2003: Database Systems NoSQL AY24/25 Trimester 1 Zhang Wei SIT Nov. 5, 2024 SIT Restricted Relational DB: Advantages and Gap Benefits of relational database. Consistency, concurrency, recovery. Mathematically background ->...
SIT Restricted INF2003: Database Systems NoSQL AY24/25 Trimester 1 Zhang Wei SIT Nov. 5, 2024 SIT Restricted Relational DB: Advantages and Gap Benefits of relational database. Consistency, concurrency, recovery. Mathematically background -> solid and rigorous. SQL for management and communication. Mature ecosystem with tools and services. Fact: Centralized. You may have noticed MySQL database is in one disk. Good: fast local join, easy management, etc. Bad: scalability/partition, robustness/replication, task dispatch, etc. Reality: big data, geo-distribution, diversity, etc. SIT Restricted NoSQL Simplify data models to improve the generalizability. Deal with diverse data, diverse locations, diverse systems, etc. Name: Non-SQL? Neo SQL? No SQL? Not Only SQL? Main categories: Key-value database, e.g., Redis, Memcached, and SimpleDB. Document database, e.g., MongonDB, CouchDB, and OrientDB. Columnar database, e.g., Cassandra and Hbase. Graph database, e.g., Neo4j. Key-value database: simplest NoSQL. key-value store: a simple hash table. All the access to the database are via keys. Operations: e.g., get key’s value, put value to a key, and delete key. No complex query filters. Join outside the database, like in Python. Efficient (for its simplicity). Distribute everywhere. SIT Restricted Key-Value Database: Redis Redis: (possibly) the most popular key-value database. https://redis.io/ C++ based (potentially fast). Open-source. Rich data structures. e.g., strings, hashes, lists, sets, sorted sets, bitmaps. In-memory operations: faster than disks. Able to be customized, e.g., for speed or for reliability. Data persistency: save the updated database in a disk periodically. Bindings to many languages. e.g., Python, Java, C, C#, C++, MATLAB, R, VB, Erlang, Perl, and PHP. A wide range of operations: redis> SET key2003 “DBSys” redis> DEL key2003 key1234 GET: get the value of a key in O(1). “OK” (integer) 1 SET: set key to hold a value in O(1). redis> GET key2003 redis> GET key2003 DEL: delete a key in O(1). “DBSys” (nil) Many others: https://redis.io/commands SIT Restricted Document Database What documents? Not really about the docs we are familiar with. More like the heterogenous metadata of some items. How to put different things together? Quite common, e.g., items in Lazada/Shopee. Not suitable to store in relational database. Calligraphy Calligrapher Year Chinese Lanting Xu Wang Xizhi 353CE 兰亭集序 Name Brand Year Memory Jetson Nano Nvidia 2020 SDRAM SIT Restricted Document Database No relation schema. Store data in JSON, BSON, or XML documents. Documents can be nested, can be indexed. Similar format to many applications. Name a few? Potentially much less translation and fast. For SQL, more data assembling and disassembling. { Flexibility with changeable data structures. ItmeType: “Computer”, ItemName: “Jetson Nano”, For SQL databases, would be tough to pass admin control. Brand: “Nvidia”, Year: “2020”, Memory: “SDRAM” } { ItmeType: “Book”, ItemName: “Lanting Xu”, Author: “Wang Xizhi”, Year: “335”, Chinese: “兰亭集序” } SIT Restricted SQL -> NoSQL Migration Do you really want to migrate? Possible reasons: Business scale-up. May not need SQL->NoSQL. Application change. Diff. philosophies. Be careful. Scheme 1: 1-1 mapping. (most popular) 1 table in SQL -> 1 collection in NoSQL (like MongoDB). Tools available, e.g., AWS DMS, Python w/ related libs, SQL Server export. Sample code available in Google/GitHub. Scheme 2: Denormalization. Reverse the decomposition process -> small to big. Moderate modification, moderate benefit. Scheme 3: forget migration and design from the beginning. Suitable for dramatic application changes. Significant workload, but good for future operation and maintenance. SIT Restricted MongoDB Founded in 2007, A document-oriented, NoSQL database. Written in C++. Hash-based, schema-less database. Key is a basic data type but, stored as strings. Document identifiers (_id, system reserved) will be created for each doc. Can insert the “same” data multiple times. Why? Supports APIs (drivers) in many languages. e.g., JS, Python, Ruby, Perl, Java, JavaScala, C#, C++, Haskell, and Erlang. Functionality: Secondary indexes. Query language via an API. Atomic writes and fully-consistent reads. Master-slave replication with automated failover (replica sets). Built-in horizontal scaling via automated range-based partitioning of data. We may not be aware of. No joins (imagine ERD w/t links, shall write scripts to do so if needed). SIT Restricted MongoDB Simple queries (you will see a few later). Applicable to most web applications. Easy and fast integration of data. But, not suitable for heavy and complex transactions systems. Relational DBMS MongoDB Database Database Table, View Collection Row Document (BSON) Column Field Index Index Join Embedded Document Foreign Key Reference Partition Shard (Distributing Data) SIT Restricted MongoDB - CRUD Create: db.collection_name.insert(). db.ppl.insert({{_id: 11, type: “Avengers”, name: “Spider Man” }) db.ppl.insert({type: “Transformers”, cat: “Autobot”, name: “Optimus Prime”}) Omit the _id field to have MongoDB generate a unique key. Read: db.collection_name.find( , ) or findOne. Find all: db.ppl.find(), no criteria specified. db.ppl.find({type: {“Avengers”, “Transformers”}). Update: db.collection_name.update( , ) db.ppl.update({type: “Avengers”, {$set: {type: “The Avengers”}}) Delete: db.collection_name.remove(, ) Delete all records from a collection or matching a criterion. How to delete all? - specifies to delete only 1 record matching the criterion. Example: db.ppl.remove(type: /^s/ } ) - remove all starting with s. https://docs.mongodb.com/manual/reference/method/js-collection/ SIT Restricted Columnar Database Database is a collection of key/value pairs. Sounds familiar? Key 3 parts: row key, column key, and time-stamp. Why include time-stamp? 3-part key may not be the key point, which shall be column-wise storage. MySQL: 22, Dustin, 1000, 45, 31, Lubber, 1001, 55, 58 … Columnar: 22, 31, 58, 74, Dustin, Lubber, Rusty, Rusty, 1000, 1001, 1001, … Flexible schema: column set flexible, may differ from row-to-row. Data compression. Self-indexing. sid name pts age Rows can be constructed from columns. 22 Dustin 1000 45 Popular ones: HBase, Cassandra, Hypertable, 31 Lubber 1001 55 58 Rusty 1001 35 Amazon SimpleDB, Clouddata, Google bigtable, etc. 74 Rusty 1000 35 Not always good, e.g., insert. Why? Facebook MySQL -> Cassandra 50GB: W 300->0.12ms, R 350->15ms. SIT Restricted Graph Database Remember why we use ERD for conceptual database design? Similar motivation, as many applications have native graph structure. Graph database: a database with an explicit graph structure. Each node knows its adjacent nodes. As long as community is more or less unchanged, do not care about how big the whole world is. (local steps based on the links) Apply graph algorithms easily. https://en.wikipedia.org/wiki/Category:Graph_algorithms SIT Restricted Graph Database Node: can contain properties. Relationship: connect two nodes. Can contain types, e.g., ACTED_IN, DIRECTED. Can contain properties, e.g., roles=[‘Forrest’]. Property: are named values where the name is a string. e.g., ‘name, Tom Hanks’ Label for each node, e.g., person, movie. Used to select the same type of nodes. RDBMS Graph Database Tables Graphs Rows Nodes Columns & data Properties & values Constraints Relationships Joins Traversal SIT Restricted Discussion and Summary When do we use NoSQL? When you feel RDBMS is too restrictive without much flexibility. When you do not care about ACID. When your data is too large to be centralized. When you have diverse data sources. When you have high requirement on data availability. No schema, no unused cell, no datatype (implicit), etc. Think carefully for database migration. NoSQL is not always good, e.g., banking and finance prefer RDBMS.