Data Architecture Lecture PDF
Document Details
Uploaded by FearlessFife
Summary
This lecture covers the fundamentals of data architecture, including different data storage formats and object services. It examines topics such as object file stores, random access file systems, data consistency models (ACID/BASE), and scaling strategies using sharding and read replicas. The lecture also briefly touches upon the concept of relational databases and their trade-offs compared to NoSQL alternatives, along with real-world examples like Google's search engine.
Full Transcript
What is data? Any thoughts?
Data is any collection of (discrete or continuous) values that convey information.
Btw, have you heard the quote “Data is the new oil”? What do you think it means?

What is architecture? Any thoughts?
The art or practice of designing and constructing something. E.g.:
o Buildings
o Software

What is data architecture?
A data architecture describes the whole lifecycle of how data is managed, i.e., from collection through to transformation, distribution, and consumption. It sets the blueprint for data and the way it flows through all systems.

Cloud Storage (AWS S3 / Azure Blob / GCP Storage)
A service for storing your objects.
o An object is an immutable piece of data consisting of a file of any format.
o You store objects in containers called buckets.
o All buckets are associated with a project, and you can group your projects under an organization.
o Buckets can have lifecycles (more on that later).
NOTE: Legislation is also a key input to determine the lifecycle of data. E.g., GDPR. Do you know what it is?

Examples of Lifecycles: GCP
o Standard: Good for hot data that is accessed frequently.
o Nearline: Good for data that is stored for at least 30 days (i.e., data that you plan to access once per month or less frequently).
o Coldline: Low-cost storage option for infrequently accessed (i.e., cold) data that is stored for at least 90 days.
o Archive: The coldest of the storage classes. Designed for archive data and disaster recovery data that is expected to be accessed once per 365 days or less.

Examples of Lifecycles: AWS
What do you think of the AWS data lifecycle compared to GCP?
The beauty of the CS/IT world is that, despite the (obvious!) differences between competing technologies (e.g., AWS, GCP, Azure, etc.), they are typically based on the same principles (e.g., the cloud).

NOTE: Make sure you understand what throughput is!
(as well as its “inverse”, response time).
Btw, have you noticed that read IOPS gradually become larger than write IOPS? Why?

(proactively!)
Definition: The operational activity of managing the IT infrastructure of an information system, delegated, in whole or in part, to an external partner (aka managed services provider or MSP).

(HIGH_SCALE_SSD)
Btw, talking about jargon/terms, what is latency? Why does it matter?

Why do we consider the above data as structured?

Why does relational matter? Any thoughts?
o It is one of the two most widely used database models! (~50%)
o It involves the Structured Query Language (SQL)!
o The other most widely used database model defines itself as the “opposite” of relational (i.e., as non-relational). Thus, to properly assess when non-relational should be used, one first needs to know when relational works (and when it does not!).
o Oftentimes, other data-intensive systems draw inspiration from SQL to develop their own query languages! Example: Rapid7 Insight IDR (cybersec tool).

ACID vs BASE
ACID refers to a standard set of properties that guarantee database transactions are processed reliably.
o Atomicity: All operations in a transaction succeed or every operation is rolled back.
o Consistency: On the completion of a transaction, the database is structurally sound.
o Isolation: Transactions do not contend with one another. Contentious access to data is moderated by the database so that transactions appear to run sequentially.
o Durability: The results of applying a transaction are permanent, even in the presence of (system) failures.

For many use cases, ACID is far more pessimistic (i.e., more worried about data safety) than the use case requires. Thus, (NoSQL) databases loosen the requirements for immediate consistency, data freshness, and accuracy to gain other benefits, like scale and resilience.
o Basic Availability: The DB appears to work most of the time.
o Soft-state: Stores do not have to be write-consistent, nor do different replicas have to be mutually consistent all the time.
o Eventual consistency: Stores exhibit consistency at some later point (e.g., lazily at read time).

Btw, what is a cache and why is it important?

The simplest way of storing data is assigning a value to a variable or a key:

Sharding - Risks
Difficulties in Data Distribution
o E.g., a poor sharding key choice can lead to uneven data distribution.
Transaction and Join Complexity
o E.g., multi-shard transactions, ACID properties, and joins are all complicated and often less efficient.
Maintenance and Operational Overhead
o More complex backup, recovery, monitoring, and optimization processes are needed.
As a result of the above:
o Increased System/Application Complexity.
o Data Consistency Issues.
o Cost Implications.
Btw, what is the difference between horizontal and vertical scaling?

Real World Example: Google’s Search
Background: Have you heard of the WebCrawler, AltaVista, or Yahoo search engines? (probably not!)
In the 90’s, there were many search engines, but none was significantly better than the others… Google’s founders saw this business opportunity and aimed to address it. Thus, they released the initial Google search engine, which became a huge success!
From a data point of view, the first half of the process involves mainly write operations (i.e., getting new/updated websites).
From a data point of view, the second half of the process involves mainly read operations (thus, techniques like caching and read replicas should certainly be used).
Btw, have you ever experienced a “broken” link in the results of Google, despite the fact that the website is working? The above diagram might give you ideas of the possible things that might have caused it!

That is all, folks!
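
Follow-up on the sharding risks: the point about a poor sharding key leading to uneven data distribution can be tried out with a small sketch. This is only an illustration under stated assumptions, not how any particular database shards data: the shard count, the `shard_for` helper, and the made-up user records (90% of users in one country) are all hypothetical.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4  # hypothetical cluster size, chosen only for this sketch


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a sharding key to a shard number using a stable hash.

    SHA-256 is used instead of Python's built-in hash() because the
    built-in hash of strings is salted per process.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Made-up records: (user_id, country). 9 out of 10 users share one country.
users = [(f"user-{i}", "IE" if i % 10 else "FR") for i in range(10_000)]

# Option A: shard by user_id (many distinct values, so rows spread out).
by_user_id = Counter(shard_for(user_id) for user_id, _ in users)

# Option B: shard by country (two distinct values, so at most two shards
# receive data and one of them holds at least 90% of the rows).
by_country = Counter(shard_for(country) for _, country in users)

print("sharded by user_id:", dict(sorted(by_user_id.items())))
print("sharded by country:", dict(sorted(by_country.items())))
```

Running it shows the rows spread roughly evenly in option A, while option B leaves some shards empty and one shard "hot". Which key is right in practice also depends on the queries: a key that spreads data well may still force multi-shard joins.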