Lecture #19.1 - Course Outline
MODERN DATA ARCHITECTURES FOR BIG DATA II - COURSE OUTLINE

THEORETICAL CONTENT
The following picture summarizes what we've gone through:

LABS
We also complemented theory with some labs:

MODERN DATA ARCHITECTURES, BIG DATA
Believe it or not, we've gone through all of this:

1. STREAMING DATA STORAGE WITH KAFKA

STREAMING DATA STORAGE
Keep the Data Value Chain context in mind at all times.

KAFKA'S CORE CONCEPTS
Consider Kafka a massively scalable message queue. Ranging from the more functional to the more operational:
- Messages & Schemas - the actual streaming data and its format
- Producers & Consumers - add and use streaming data within the pipeline
- Topics & Partitions - organize and make streaming data available
- Brokers & Clusters - the nodes' organization of the distributed system
(See the producer/consumer sketch at the end of this section.)

WHAT IS IT, WHERE DOES IT COME FROM?
An event streaming solution with a different real-time approach:
- Modern distributed system → massively scalable
- True storage system → data is:
  - Replicated multiple times and in different places if needed
  - Persisted so it can be reused as many times, and by as many applications, as needed
  - Kept around (data retention) as long as the business needs
- Raises the level of abstraction → allows computation with less code
Developed at LinkedIn back in 2010* because:
- There was a need for low-latency ingestion of large amounts of event data
- Data ingestion tools at the time were not designed for real-time use cases
- Traditional real-time systems were overkill and unable to scale
* Kafka's origin story at LinkedIn

2. STREAMING DATA INTRODUCTION

REAL-TIME SYSTEMS CLASSIFICATION
Real-time systems have been around for decades. They've become popular lately, causing ambiguity and debate. The following two dimensions will help us clarify:
- Latency - how long it takes to process data on average
- Tolerance for delay - how critical it is to get insights late

REAL-TIME SYSTEMS CLASSIFICATION
In the Data Analytics space it's only soft and near real-time.

THREE MAIN CONCEPTS
Real-time Big Data projects usually deal with:
- Event vs Processing Time → two timestamps to consider
- Windows of data → a technique to limit unbounded data
- Message Delivery Semantics → how to deal with critical data

EVENT VS PROCESSING/STREAM TIME
Messages/events can be timed in two different ways:
- Event time - when the event occurs
- Processing/Stream time - when the event enters the streaming system
More often than not, Event time != Processing time:
- Processing-time lag* - the delay between when an event occurs and when it's processed
- Event-time skew* - how far behind the event is when it's processed
* They're just two ways of looking at the same thing; processing-time lag and event-time skew at any given point in time are identical.

WINDOWS OF DATA
Streams of data are infinite in size → the data doesn't fit in memory.
Window of data - the amount of data processed at a given time.
Attributes present in all windowing techniques:
- Trigger policy - when all the data in the window has to be processed
- Eviction policy - when an event has to be removed from the window

WINDOWS OF DATA
Two main categories of windowing strategies (see the toy tumbling-window sketch at the end of this section):
- Tumbling windows - based on the number of events in the window
- Time-based windows - three different techniques

MESSAGE DELIVERY SEMANTICS
Related to producers, brokers, consumers and messages. Three semantic guarantees when it comes to message delivery:
- At most once - a message may get lost, but there are no duplicates at the consumer
- At least once - the message gets in, but there may be duplicates at the consumer
- Exactly once - the message gets in and there are no duplicates at the consumer
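As a minimal sketch of the producer/consumer concepts above, the following assumes the kafka-python client (not necessarily the one used in the labs) and a broker on localhost:9092; the topic name "clickstream" and the message fields are made up for illustration.

# Minimal sketch, assuming the kafka-python package and a broker on localhost:9092.
# The topic name "clickstream" and the message fields are hypothetical.
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: adds streaming data (messages) to a topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",  # wait for all in-sync replicas -> at-least-once delivery from the producer side
)
producer.send("clickstream", {"user": "u1", "action": "login"})
producer.flush()

# Consumer: uses streaming data from the same topic, starting from the earliest offset.
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.partition, message.offset, message.value)

Note how the acks setting relates to the delivery semantics above: with acks="all" and consumer retries, the pipeline behaves as at-least-once, so consumers may still see duplicates.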
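And a toy, pure-Python sketch (not course code) of a count-based tumbling window, just to make the trigger and eviction policies concrete:

# Toy count-based tumbling window: the trigger policy fires when the window
# holds `size` events, and the eviction policy then removes all of them at once.
def tumbling_window(events, size=3):
    window = []
    for event in events:
        window.append(event)
        if len(window) == size:    # trigger policy: the window is full
            yield list(window)     # process the whole window
            window.clear()         # eviction policy: drop all events
    if window:                     # leftover events at the end of a finite stream
        yield list(window)

for batch in tumbling_window(range(1, 8), size=3):
    print(batch)  # [1, 2, 3], [4, 5, 6], [7]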
3. APACHE SPARK STRUCTURED STREAMING API

BUILDING UP ON THE FOUNDATIONS
Spark Streaming is built on top of the core APIs we've learned.

SPARK STREAMING DESIGN OPTIONS
Spark Streaming's design options are summarized as follows:
- Declarative API → what to do with events vs how to do it
- Event & processing time → full timing information on the events
- Micro-batch & continuous execution → until Spark 2.3, only micro-batch:
  - Micro-batch = ↑ throughput & ↑ latency
  - Continuous execution = ↓ throughput & ↓ latency

UNBOUNDED/INFINITE TABLES
A stream of data → modeled as an unbounded/infinite table:
- New data arriving is continuously appended to the Input Table*
* Our Spark applications take this Input Table and transform it into a Result Table.

STRUCTURED STREAMING APPLICATIONS
They rely on the Structured/High-level APIs we studied. The transformations we studied, and some more, are supported.
Steps to create and run a streaming application/job:
1. Write the code implementing the business logic
2. Connect that code to the:
   - Source → where the events are coming from (e.g. Kafka, file, socket,...)
   - Sink/Destination → where the insights are sent to (e.g. Kafka, file, console,...)
3. Start the application/job on the Spark cluster, which:
   - Will evaluate the code continuously as new events arrive
   - Will produce insights incrementally as the code evaluates events
   - Will run for some time or indefinitely, depending on the setup

KEY ELEMENTS OF STRUCTURED STREAMING
Streaming applications/jobs rely on the following elements (see the streaming sketch at the end of this section):
- Transformations & actions → the business logic to consider
- Input sources → where the events come from
- Output sinks & modes → where and how the results/insights go
- Triggers → when to check for new available data
- Event-time processing → deal with delayed data

SPARK CONTINUOUS APPLICATIONS
Visually, Spark fits at the core of company analytics like this*:
* A Beginner's Guide to Spark Streaming For Data Engineers - an article based on DStreams (RDDs), but still a recommended read.

4. APACHE SPARK GRAPH API

BUILDING UP ON THE FOUNDATIONS
Spark GraphX/GraphFrames are built on the core APIs we studied.

GRAPHS ARE DATA STRUCTURES
Graphs are advanced data structures made up of:
- Nodes/Vertices - the main represented entities (e.g. bike stations, airports,...)
- Edges - the existing relationships between entities (e.g. trips, routes,...)
Nodes & edges can have attributes to describe them better.
Graphs can be classified in two groups based on edge navigability:
- Undirected graphs - how edges are traversed is not relevant
- Directed graphs - directional edges; A to B is different from B to A

GRAPH ANALYTICS
Data analysis of the relationships in a graph or network. Typical graph analytics use cases*:
- Social media & social network graphs
- Recommendation engines
- Fraud detection
- IT infrastructure monitoring
* Graph Database Use Cases: https://neo4j.com/use-cases/

GRAPHFRAMES*
An Apache Spark package** providing DataFrame-based graphs. Built on top of DataFrames, the high-level API, which we love.
They provide & extend GraphX functionality (see the GraphFrames sketch at the end of this section):
- Motif finding, for pattern search within graphs
- Additional graph processing algorithms beyond those in GraphX
* More details in the GraphFrames User Guide.
** A PySpark version of the API can be found in the GraphFrames Python API docs.
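A minimal PySpark Structured Streaming sketch of the source → business logic → sink flow described above, including an event-time tumbling window with a watermark for delayed data. The topic name, broker address and event schema are assumptions, and running the Kafka source also requires the Spark Kafka connector package on the cluster.

# Minimal Structured Streaming sketch (hypothetical topic/server names and schema).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

schema = (StructType()
          .add("user", StringType())
          .add("action", StringType())
          .add("event_time", TimestampType()))

# Source: read events from a Kafka topic as an unbounded Input Table.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "clickstream")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Business logic: count actions per 5-minute tumbling window (event time),
# tolerating events that arrive up to 10 minutes late (watermark).
counts = (events
          .withWatermark("event_time", "10 minutes")
          .groupBy(window(col("event_time"), "5 minutes"), col("action"))
          .count())

# Sink + output mode + trigger: write updated counts to the console every minute.
query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .trigger(processingTime="1 minute")
         .start())
query.awaitTermination()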
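And a short GraphFrames sketch showing vertices, directed edges and motif finding. It assumes the graphframes package is available on the cluster; the station/trip data is made up.

# Minimal GraphFrames sketch (assumes the graphframes package is installed).
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("graphframes-sketch").getOrCreate()

# Vertices (stations) and directed edges (trips) -- made-up sample data.
vertices = spark.createDataFrame(
    [("a", "Station A"), ("b", "Station B"), ("c", "Station C")],
    ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b", 12), ("b", "c", 7), ("c", "a", 20)],
    ["src", "dst", "trip_minutes"])

g = GraphFrame(vertices, edges)

# Degree of each station, plus a simple motif: round trips x -> y -> x.
g.inDegrees.show()
g.find("(x)-[e1]->(y); (y)-[e2]->(x)").show()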
5. APACHE SPARK MACHINE LEARNING API

BUILDING UP ON THE FOUNDATIONS
Spark Machine Learning is built on the core APIs we studied.

SPARK MACHINE LEARNING FLOW
The Machine Learning flow in Spark looks like this:

PIPELINE ELEMENTS
The pipeline elements are inspired by Scikit-Learn (see the pipeline sketch at the end of this section):
- Transformers - turn a DataFrame into another one, usually adding columns (the model)
- Estimators - abstract the concept of a learning algorithm (training)
- Evaluators - metrics evaluating the performance of trained models
- Pipelines - chain multiple elements together to specify the flow

6. APACHE SPARK IN PRODUCTION SCENARIOS

TRANSFORMATIONS & ACTIONS
A Spark data analysis is just some transformations & one action.

TRANSFORMATIONS DON'T DUPLICATE DATA
New DataFrames don't mean data is duplicated. Our data analysis will be optimized 'automagically':
1. Our PySpark code is converted into a Logical Plan
2. The Logical Plan is then converted into a Physical Plan
3. Multiple optimizations are applied along the way by the Catalyst Optimizer
The Physical Plan is all about RDD transformations.

DATA ANALYSIS OPTIMIZATIONS
The optimizations will try to achieve the following:
- Remove as much unnecessary data as possible
- Avoid as much data exchange between nodes as possible
These optimizations can be encouraged by (see the sketch at the end of this section):
1. Filtering as many rows as you can (filter, where)
2. Removing the columns you don't need (select)
3. Starting with narrow transformations (e.g. new columns)
4. Moving on to wide transformations (e.g. aggregations)
5. Caching DataFrames if you'll use them more than once

BUILDING SPARK APPLICATIONS
Jupyter Notebooks are great for interactive analytics, but batch & stream processing often aren't interactive:
- They need to happen at times when humans are not available (eventually we sleep)
- Jobs are triggered based on dynamic conditions that we cannot foresee
Spark Applications are the way to go to address those scenarios. Jupyter Notebooks can be translated into Spark Applications (see the application skeleton at the end of this section).

USING AZURE DATABRICKS
Azure Databricks is an Apache Spark-based analytics platform. It's optimized for the Microsoft Azure Cloud Services platform.
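A minimal sketch of the four pipeline elements (Transformer, Estimator, Pipeline, Evaluator) using Spark ML. The column names and data are made up, and the model is fitted and evaluated on the same tiny DataFrame only to keep the flow visible; a real flow would use a train/test split.

# Minimal Spark ML pipeline sketch (hypothetical columns and data).
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("ml-sketch").getOrCreate()

data = spark.createDataFrame(
    [(1.0, 3.0, 0.0), (2.0, 1.0, 1.0), (0.5, 4.0, 0.0), (3.0, 0.5, 1.0)],
    ["f1", "f2", "label"])

assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")  # Transformer
lr = LogisticRegression(featuresCol="features", labelCol="label")          # Estimator
pipeline = Pipeline(stages=[assembler, lr])                                # Pipeline

model = pipeline.fit(data)           # training produces a fitted model (itself a Transformer)
predictions = model.transform(data)  # the model adds prediction columns

evaluator = BinaryClassificationEvaluator(labelCol="label")                # Evaluator
print("AUC:", evaluator.evaluate(predictions))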
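The recommended ordering of transformations, plus how to inspect the plans Catalyst produces, can be sketched like this; the DataFrame and column names are hypothetical.

# Filter early, select only needed columns, narrow before wide, cache reused results,
# and use explain() to inspect the logical & physical plans built by Catalyst.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg

spark = SparkSession.builder.appName("optimization-sketch").getOrCreate()

trips = spark.createDataFrame(
    [("a", "b", 12, 2023), ("b", "c", 7, 2024), ("a", "c", 25, 2024)],
    ["origin", "dest", "minutes", "year"])

result = (trips
          .filter(col("year") == 2024)               # 1. filter rows early
          .select("origin", "minutes")               # 2. keep only the columns you need
          .withColumn("hours", col("minutes") / 60)  # 3. narrow transformation
          .groupBy("origin")                         # 4. wide transformation (shuffle)
          .agg(avg("hours").alias("avg_hours")))

result.cache()        # 5. cache if the result is used more than once
result.explain(True)  # show the logical & physical plans produced by the Catalyst Optimizer
result.show()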
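Finally, a minimal skeleton of the kind of standalone Spark Application a notebook can be translated into. The file name, paths and column name are placeholders; it would be launched with something like spark-submit job.py <input> <output>.

# Minimal Spark Application skeleton (hypothetical paths and columns).
import sys
from pyspark.sql import SparkSession

def main(input_path, output_path):
    spark = SparkSession.builder.appName("batch-job").getOrCreate()
    df = spark.read.parquet(input_path)                    # batch source
    summary = df.groupBy("category").count()               # business logic
    summary.write.mode("overwrite").parquet(output_path)   # batch sink
    spark.stop()

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])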
7. BIG DATA ARCHITECTURES

WHAT ARE BIG DATA ARCHITECTURES?
Patterns defining how to connect different technologies. They have to be designed with analytical needs in mind:
- Current and future needs via capacity planning → what & when
- A Data Architecture Roadmap to address future needs over time
- Adequate sizing leads to sustainable growth → less risk for the project

COMMON BIG DATA ARCHITECTURES*
Over the last years, companies have faced similar challenges. Some patterns have proved to work well in certain scenarios. Those patterns have been documented & named:
- Lambda Architecture
- Kappa Architecture
- Delta Architecture
* Picture from the "Data Architecture: An Epic Battle with the Powerpuff Girls and the Villain MO JO JO JO - Lambda, Kappa, and Delta (Revisiting Childhood)" article

LAMBDA ARCHITECTURE'S LAYERS
Visual representation of the Lambda Architecture:

KAPPA ARCHITECTURE'S LAYERS
Visual representation of the Kappa Architecture:

LAKEHOUSE ARCHITECTURE IN CONTEXT
40 years of evolution to get to the Lakehouse Architecture:

CENTRALIZED TO DISTRIBUTED OWNERSHIP
Data Mesh → a distributed data architecture with:
- Centralized governance & standardization for interoperability
- A shared, self-serve data infrastructure
The Data Mesh doesn't replace the Data Lake/EDW:
- A Data Lake/EDW or, generally speaking, a Data Hub → a node on the mesh
- Data Hubs provide data sets as products in a distributed fashion
* Picture from How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh.

WHAT ARE ON-PREMISE DATA CENTERS?*
IT infrastructure within the organization's physical premises: the IT infrastructure is set up & maintained with the company's own resources.
Benefits of Big Data solutions on on-premise data centers:
1. Data control & protection → organizations handle data & security needs directly
2. Predictable performance & low-latency access to data → it's all local
3. Customizable infrastructure → any hardware & software to meet needs
Drawbacks of Big Data solutions on on-premise data centers:
1. High initial costs & capital expenses → money upfront on tools & infra
2. Limited scale & possibility of overprovisioning → lack of infra elasticity
3. IT management → a dedicated operations team to maintain the solution
* More info in What is On-Premises Data Centers vs. Cloud Computing?

WHAT ARE PUBLIC CLOUD DATA CENTERS?*
IT infrastructure outside the organization's physical premises: available as IaaS (Infrastructure), PaaS (Platform) or SaaS (Software).
Benefits of Big Data solutions on public cloud data centers:
1. Cost-effectiveness & flexible pricing → from a CapEx to an OpEx model
2. Easy scaling & adaptability → adjust resources to match demand
3. Global access & collaboration → accessible anywhere via the Internet
Drawbacks of Big Data solutions on public cloud data centers:
1. Security & privacy → confidence in security measures (certifications)
2. Compliance in a global context → legal requirements where the data lives
3. Internet connectivity & downtime risk → impact of outages on the business
* More info in What is On-Premises Data Centers vs. Cloud Computing?

WHAT ARE HYBRID DEPLOYMENTS?
Relying solely on cloud infrastructure has drawbacks. The cloud backlash movement → computing back on-premises*. Hybrid deployments combine different options:
- On-Premise Solutions - greater control of data, & of cost in some cases
- Public Cloud Services - elasticity to quickly match demand
- Private Cloud Services - the best of both worlds: an "On-Prem Cloud"
* Picture from the Cloudera Data Platform Private Cloud - What is it? article

FOUR MAIN ROLES IN BIG DATA
There are four main roles related to the Big Data ecosystem*:
- Data Scientists - get valuable insights & knowledge from data sets
- Data Analysts - data analysis & interpretation to make decisions
- Data Engineers - build data pipelines to transform & transport data
- Data Architects - build the overall data architecture of an organization
* More info in Data Scientist vs Data Analyst vs Data Engineer vs Data Architect

CONGRATS, WE'RE DONE!