Lecture #13.1 - Data Architectures.pdf

MODERN DATA ARCHITECTURES FOR BIG DATA II
BIG DATA ARCHITECTURES

AGENDA
Big Data Architectures
  Lambda Architecture
  Kappa Architecture
  Lakehouse Architecture
  Data Mesh
Deployment Models
  On-premise
  Public Cloud
  Hybrid
Roles in Big Data

1. BIG DATA ARCHITECTURES

WHAT ARE BIG DATA ARCHITECTURES?
Patterns defining how to connect different technologies.
They have to be designed with analytical needs in mind:
  Current and future needs via capacity planning → what & when
  Data Architecture Roadmap to address future needs over time
  Adequate sizing leads to sustainable growth → less risk to the project

BIG DATA ARCHITECTURES MIGHT VARY
There are different ways to design an architecture*.
* Picture from the Big Data Architecture – Blueprint (Part 1 – Basics) article

OUR BELOVED DATA VALUE CHAIN
At the end of the day, it all comes down to what we have studied so far.

COMMON BIG DATA ARCHITECTURES*
Over the last few years, companies have faced similar challenges.
Some patterns have proved to work well in certain scenarios.
Those patterns have been documented & named:
  Lambda Architecture
  Kappa Architecture
  Delta Architecture
* Picture from the Data Architecture: An Epic Battle with the Powerpuff Girls and the Villain MO JO JO JO - Lambda, Kappa, and Delta (Revisiting Childhood) article

1.1 LAMBDA ARCHITECTURE

WHAT IS THE LAMBDA ARCHITECTURE?
Introduced by Nathan Marz & James Warren in 2013:
  Big Data - Principles & best practices of scalable realtime data systems
Data is processed by batch & streaming systems in parallel.
Results are combined at query time → very complete insights (see the sketch after section 1.2).
The suggested pattern outlines three layers:
  Batch Layer - this layer is responsible for:
    1. Storing the master copy of the dataset
    2. Precomputing batch views on that master dataset → address business needs
  Serving Layer - distributed datastore where batch views are stored
  Speed Layer - complements the Batch Layer by:
    1. Providing updates while batch views are being created (minutes to hours)
    2. Updating real-time views with recent data not yet available in batch views

LAMBDA ARCHITECTURE'S LAYERS
Visual representation of the Lambda Architecture.

LAMBDA ARCHITECTURE IMPLEMENTATION
Each company uses the most convenient technologies.

LAMBDA ARCHITECTURE DATA VALUE CHAIN
And, of course, it can be expressed with our data value chain.

1.2 KAPPA ARCHITECTURE

WHAT IS THE KAPPA ARCHITECTURE?
Introduced by Jay Kreps in 2014:
  Questioning the Lambda Architecture
Data, historical & new, is processed only by streaming systems.
"Historical" data in this context is in the range of days (e.g. 30 days).
The suggested pattern outlines two layers:
  Real-Time Layer - this layer is responsible for:
    1. Storing the master copy of the events
    2. Reprocessing real-time views with relevant events → address business needs
    3. Creating/updating real-time views with recent events
  Serving Layer - distributed datastore where real-time views are stored

KAPPA ARCHITECTURE'S LAYERS
Visual representation of the Kappa Architecture.

KAPPA ARCHITECTURE IMPLEMENTATION
Each company uses the most convenient technologies.

KAPPA ARCHITECTURE DATA VALUE CHAIN
And, of course, it can be expressed with our data value chain.
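To make the contrast between the two patterns concrete, here is a minimal, illustrative Python sketch. It is not taken from the lecture: the page-view data and the function names `lambda_query` and `kappa_reprocess` are made up for illustration. It shows a Lambda-style serving layer merging a precomputed batch view with a speed-layer view at query time, and a Kappa-style reprocessing step that rebuilds the view by replaying the retained event log through the same streaming logic.

```python
from collections import defaultdict

# Lambda: the serving layer answers queries by merging a precomputed batch
# view (complete but hours old) with a speed-layer view (recent events only).
# The page-view counts below are made-up illustrative data.
batch_view = {"page_a": 10_000, "page_b": 4_200}
speed_view = {"page_a": 37, "page_c": 5}

def lambda_query(page: str) -> int:
    # Merge both views at query time; for additive metrics a simple sum works.
    return batch_view.get(page, 0) + speed_view.get(page, 0)

# Kappa: there is no batch layer. The same streaming logic rebuilds the view
# by replaying the retained event log (e.g. a Kafka topic kept for ~30 days).
def kappa_reprocess(event_log):
    view = defaultdict(int)
    for event in event_log:  # historical and new events, replayed in order
        view[event["page"]] += 1
    return dict(view)

if __name__ == "__main__":
    print(lambda_query("page_a"))                                     # 10037
    print(kappa_reprocess([{"page": "page_a"}, {"page": "page_b"}]))  # {'page_a': 1, 'page_b': 1}
```

The trade-off the sketch highlights: Lambda maintains two code paths (batch and streaming) to combine completeness with freshness, while Kappa keeps a single streaming code path and relies on replaying the log whenever views need to be recomputed.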
1.3 LAKEHOUSE ARCHITECTURE

WHAT IS THE LAKEHOUSE ARCHITECTURE?
Introduced by the Databricks team in 2021:
  Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics
It's a Data Management System characterized by:
  Low-cost & directly-accessible storage
  Unifying batch & streaming data flows
  Providing the traditional analytical DBMS experience:
    ACID transactions - as in relational databases
    Data versioning - track changes made to data over time
    Auditing - track who is using the data
    Indexing - auxiliary structures (e.g. Bloom filters) to speed up data access
    Caching - keep as much in-use data in memory as possible
    Query optimization - ensure the best performance for every query
Lakehouses → key benefits of data lakes & data warehouses.

LAKEHOUSE ARCHITECTURE IN CONTEXT
40 years of evolution to get to the Lakehouse Architecture.

LAKEHOUSE ARCH. IMPLEMENTATION
Delta Lake is a Lakehouse implementation powered by Spark.

LAKEHOUSE ARCH. DATA VALUE CHAIN
And, of course, it can be expressed with our data value chain.

LAKEHOUSE ARCHITECTURE BONUS
Real-time events to a "Data Warehouse" with low latency*.
* All the details (PySpark code included) in From Kafka to Delta Lake using Apache Spark Structured Streaming
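As a companion to the bonus slide above, here is a hedged PySpark sketch of a Kafka-to-Delta flow. It is not the referenced article's code: the broker address, topic name, table paths and Spark configuration are placeholders, and it assumes the spark-sql-kafka and delta-spark packages are available on the cluster. It streams events from a Kafka topic into a Delta table (ACID, incremental appends) and then shows Delta's data versioning with a time-travel read.

```python
from pyspark.sql import SparkSession

# Spark session with the Delta Lake extensions enabled (placeholder setup).
spark = (
    SparkSession.builder
    .appName("kafka-to-delta-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Read real-time events from a Kafka topic (broker and topic are placeholders).
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
)

# Append the stream into a Delta table; the transaction log provides ACID writes.
query = (
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/events")  # placeholder path
    .outputMode("append")
    .start("/tmp/delta/events")                               # placeholder path
)

# Data versioning ("time travel"): read the table as it was at version 0.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")
```

Because the Delta table is just Parquet files plus a transaction log on low-cost storage, the same table that receives streaming appends can also be read in batch by downstream jobs, which is the unification of batch & streaming flows described above.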
1.4 DATA MESH

DATA MESH, DATA AS A PRODUCT
Introduced by Zhamak Dehghani in 2019:
  How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh
A new paradigm moving away from centralized data lakes/EDWs.
Domain data sets → logical grouping of data (e.g. marketing):
  Domain data teams provide data sets to the rest of the organization
  Data sets must be discoverable & understandable (Data Governance)
  Data sets must have KQIs & adoption KPIs (Data Quality/Curation)
All in all, a shift towards treating domain data sets as products.

CENTRALIZED TO DISTRIBUTED OWNERSHIP
Data Mesh → Distributed Data Architecture with:
  Centralized Governance & Standardization for interoperability
  Shared self-serve Data Infrastructure
The Data Mesh doesn't replace the Data Lake/EDW:
  Data Lake/EDW or, generally speaking, a Data Hub → nodes on the mesh
  Data Hubs provide data sets as products in a distributed fashion
* Picture from How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh

DATA MESH EXAMPLE
The following picture could be an example of a Data Mesh.

2. DEPLOYMENT MODELS

2.1 ON-PREMISE

WHAT ARE ON-PREMISE DATA CENTERS?*
IT infrastructure within the organization's physical premises:
  Setting up & maintaining IT infrastructure with the company's own resources
Benefits of Big Data solutions on On-Premise Data Centers:
  1. Data control & protection → organizations handle data & security needs directly
  2. Predictable Performance & Low-Latency Access to Data → it's all local
  3. Customizable Infrastructure → any hardware & software to meet needs
Drawbacks of Big Data solutions on On-Premise Data Centers:
  1. High initial costs & capital expenses → money upfront on tools & infra
  2. Limited scale & possibility of overprovisioning → lack of infra elasticity
  3. IT management → dedicated operations team to maintain the solution
* More info on What is On-Premises Data Centers vs. Cloud Computing?

ON-PREMISE BIG DATA SOLUTIONS
Big Data solutions are packaged as Hadoop Distributions.
There have been tens of Hadoop Distributions over time.
The last three On-Premise Big Data Solutions until 2019 were:

2.2 PUBLIC CLOUD

WHAT ARE PUBLIC CLOUD DATA CENTERS?*
IT infrastructure outside the organization's physical premises:
  Available as IaaS (Infrastructure), PaaS (Platform) or SaaS (Software)
Benefits of Big Data solutions on Public Cloud Data Centers:
  1. Cost-Effectiveness & Flexible Pricing → from a CapEx to an OpEx model
  2. Easy Scaling & Adaptability → adjust resources to match demand
  3. Global Access & Collaboration → accessible anywhere via the Internet
Drawbacks of Big Data solutions on Public Cloud Data Centers:
  1. Security & Privacy → confidence in security measures (certifications)
  2. Compliance in a Global Context → legal requirements where the data lives
  3. Internet Connectivity & Downtime Risk → impact of outages on the business
* More info on What is On-Premises Data Centers vs. Cloud Computing?

PUBLIC CLOUD BIG DATA SERVICES
Focus on the three main providers: AWS, Microsoft Azure & GCP.

2.3 HYBRID

WHAT ARE HYBRID DEPLOYMENTS?
Relying solely on cloud infrastructure has drawbacks.
Cloud backlash movement → computing moving back on-premises*.
Hybrid deployments combine different options:
  On-Premise Solutions - greater control of data & cost in some cases
  Public Cloud Services - elasticity to quickly match demand
  Private Cloud Services - the best of both worlds: an "On-Prem Cloud"
* Picture from the Cloudera Data Platform Private Cloud - What is it? article

3. ROLES IN BIG DATA

FOUR MAIN ROLES IN BIG DATA
There are four main roles related to the Big Data ecosystem*:
  Data Scientists - get valuable insights & knowledge from data sets
  Data Analysts - data analysis & interpretation to make decisions
  Data Engineers - build data pipelines to transform & transport data
  Data Architects - build the overall data architecture of an organization
* More info on Data Scientist vs Data Analyst vs Data Engineer vs Data Architect

WHAT DOES A DATA SCIENTIST DO?
Extract insights & knowledge from large & complex data sets.
Work closely with business stakeholders to identify business challenges.
Identify patterns, trends & insights by using:
  Analytical methods
  Machine learning algorithms
  Deep learning algorithms
* Picture from the 11 Data Scientist Skills Employers Want to See in 2022 article

WHAT DOES A DATA ANALYST DO?
Also known as Business Analysts due to their domain expertise.
Use data to address challenges related to business operations.
Help stakeholders make data-driven decisions by using:
  Tools like Excel, SQL or Tableau for data analysis & visualization
  Strong problem-solving skills
* Picture from The 7 Data Analyst Skills You Need in 2023 article

WHAT DOES A DATA ENGINEER DO?
Build data pipelines to transform & transport data.
Apply Data Governance practices → metadata, quality, security, ...
Establish the foundations for scientists & analysts by mastering:
  Programming skills
  SQL, NoSQL & other Big Data technologies for data preparation
* Picture from the What Is a Data Engineer: Role Description, Responsibilities, Skills, and Background article

WHAT DOES A DATA ARCHITECT DO?
Build the overall data architecture of an organization.
Define data models, flows & the storage layer to support the business.
Develop strategies in the following areas:
  Data Integration
  Data Warehousing
  Data Migration
* Picture from the How to Become a Data Architect article

CONGRATS, WE'RE DONE!
