Lecture 2: Scalable Data Systems

Summary

This presentation discusses reliable, scalable, and maintainable data-intensive applications. It covers the non-functional requirements of data systems (reliability, scalability, and maintainability), the limits of scaling traditional databases, and big data systems, including the properties of data, the fact-based model, and a generalized big data architecture.

Full Transcript


Reliable, Scalable and Maintainable Data Applications
Pravin Y Pawar

Data Processing Applications: Data-Intensive Applications
Deal with:
- huge amounts of data
- complex data
- fast-moving data
Typically built with several building blocks, such as:
- Databases
- Caches
- Search indexes
- Stream processing
- Batch processing
Source: The topics are adapted from "Designing Data-Intensive Applications" by Martin Kleppmann

Data Systems Described
- Many different tools are integrated together for a data processing task
- Users are unaware of the seamless integration of these tools / systems
- The boundaries between the categories of tools are diminishing
- No single tool can fit all data processing requirements
- We need to make sure this integration works properly end to end

Non-Functional Requirements for Data Systems: Three Requirements
- Reliability: the system should continue to work correctly even in cases of failure
- Scalability: the system should be able to cope with increases in load
- Maintainability: the system should continue to work correctly and be adaptable to changes in the future

Reliability Described
The system should continue to work correctly, even when things go wrong:
- Should carry out the expected operation correctly
- Should handle wrong user inputs
- Should prevent unauthorized access and usage
Fault
- Something that can go wrong: one component of the system deviating from its expected behavior
- Fault-tolerant / resilient: a system that can anticipate faults and deal with them properly
Failure
- The system as a whole stops providing service

Reliability (2): Faults
Hardware faults
- Hard disk crashes, RAM faults, network issues, etc.
- Mitigated by adding redundancy to individual components
Software faults
- Software bugs, multi-threading issues
- Happen under very unusual sets of circumstances and are hard to reproduce
- No straightforward answer; we need to rethink the assumptions made while designing the system
Human errors
- Developers design the system; operators maintain it
- 10-25% of outages are caused by wrong configurations made by operators
- Design systems that minimize opportunities for error
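The slides name fault tolerance but give no code. Below is a minimal sketch, assuming a transient network fault, of anticipating a fault and dealing with it by retrying with exponential backoff; the names `with_retries` and `fetch_page_stats` are hypothetical, not from the lecture.

```python
import random
import time

def with_retries(operation, max_attempts=3, base_delay=0.1):
    """Retry a flaky operation with exponential backoff.

    Anticipating faults and handling them is what the slides
    call being fault-tolerant / resilient.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise  # the fault escalates into a failure for the caller
            # back off exponentially, with jitter to avoid retry storms
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

# hypothetical usage: a network call that occasionally fails
def fetch_page_stats():
    if random.random() < 0.3:
        raise ConnectionError("transient network fault")
    return {"page": "/home", "hits": 42}

print(with_retries(fetch_page_stats))
```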
Scalability Described
- A system's ability to cope with load
- Load impacts the performance of the system
- Load is defined differently for different data systems: requests per second, number of concurrent users, etc.
- How does the system react when load increases?
  - What happens if the system's resources are kept the same?
  - What additions need to be made to the system's resources?

Maintainability Described
- It is easy to write software, but very difficult to maintain it
- Involves:
  - Bug fixing
  - Keeping existing systems operational
  - Detecting the root cause of failures
  - Adapting to new platforms
- Legacy systems need to be maintained, as they are critical for business operations
- Data systems should be designed in such a way that:
  - They are easily operable
  - They are simple to understand
  - They are easy to adapt to new changes

Maintainability (2): Approaches
Operable
- Systems that are easy to work with should be developed
- Appropriate documentation should be provided
- Monitoring capabilities should be present
- Support for automation and integration with other tools should be possible
Simple
- Complexity slows everything down, makes maintenance hard, and makes the system more vulnerable to errors
- A simpler system does not mean compromising on features
- Use abstraction: it hides a lot of complexity behind a clean interface
- Finding good abstractions is hard
Extensible
- Making changes should be easy
- Adding new features and accommodating new requirements should be possible

Scaling with Traditional Databases
Pravin Y Pawar

Web Analytics Application Example
- Designing an application to monitor page hits for a portal
- Every time a user visits a portal page in a browser, the server side keeps track of that visit
- Maintains a simple database table that holds information about each page hit
- If a user visits the same page again, the page hit count is increased by one
- Uses this information to analyze which pages are popular among users
Source: Adapted from Big Data by Nathan Marz

Scaling with an Intermediate Layer: Using a Queue
The portal is very popular, with many users visiting it:
- Many users concurrently visit the pages of the portal
- Every time a page is visited, the database must be updated to record the visit
- A database write is a heavy operation
- Database writes are now a bottleneck
Solution (sketched below):
- Use an intermediate queue between the web server and the database
- The queue holds messages until they can be processed
- Messages will not be lost
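A minimal sketch of the queue idea, assuming an in-process queue and a batching consumer; in production the queue would be a message broker such as Kafka or RabbitMQ. The names `record_page_view` and `flush_to_database` are illustrative, not from the lecture.

```python
import queue
from collections import Counter

# web servers enqueue page views instead of writing to the database directly
page_view_queue = queue.Queue()

def record_page_view(url):
    """Called by the web tier: a cheap, fast enqueue instead of a heavy DB write."""
    page_view_queue.put(url)

def flush_to_database():
    """Consumer side: drain the queue and issue one aggregated write
    per URL instead of one write per page view."""
    batch = Counter()
    while not page_view_queue.empty():
        batch[page_view_queue.get()] += 1
    for url, count in batch.items():
        # stand-in for: UPDATE pageviews SET hits = hits + %s WHERE url = %s
        print(f"{url}: +{count} hits")

for _ in range(3):
    record_page_view("/home")
record_page_view("/about")
flush_to_database()   # /home: +3 hits, /about: +1 hits
```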
Scaling with Database Partitions: Using Database Shards
The application is too popular:
- Users are using it very heavily, increasing the load on the application
- Maintaining the page view count is becoming difficult even with a queue
Solution: use database partitions (sketched below)
- Data is divided into partitions (shards) hosted on multiple machines
- Database writes are parallelized
- Scalability increases, but so does complexity
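A hedged sketch of how writes could be routed to shards by hashing the key; the lecture names the technique but not an implementation, so the shard count and function names here are assumptions.

```python
import hashlib

NUM_SHARDS = 4  # assumed; each shard would live on its own machine

def shard_for(url):
    """Route a key to a shard by hashing it.

    Note the repartitioning problem from the slides: changing
    NUM_SHARDS reassigns most keys, so data must be migrated.
    """
    digest = hashlib.md5(url.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def increment_page_hit(url):
    shard = shard_for(url)
    # stand-in for a write to the database hosted on that shard
    print(f"shard {shard}: UPDATE pageviews SET hits = hits + 1 WHERE url = '{url}'")

increment_page_hit("/home")
increment_page_hit("/about")
```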
Issues Begin: Bottlenecks
- Disks are prone to failure, so a partition can become inaccessible
- Managing a large number of shards is complicated
- Repartitioning is required again when the load increases
- Application code becomes buggier as complexity increases
- It is difficult to recover from mistakes made either by application code or by humans

Rise of Big Data Systems: How They Help
The main issues with traditional data processing applications:
- Hard to make them scalable
- Hard to keep them simple
This is because everything is managed by application code, which is prone to mistakes due to buggy implementations.
New-age systems, aka big data systems:
- Handle high data volumes arriving at a very fast rate from a variety of sources
- Are aware of their distributed nature, and hence capable of working with each other
- The application does not need to bother with common issues like sharding, replication, etc.
- Scalability is achieved by horizontal scaling: just add new machines
- Developers focus more on application logic than on maintaining the environment

Big Data Systems
Pravin Y Pawar

Data Growth / Data Generation (figure)
Source: https://www.guru99.com/what-is-big-data.html

Big Data Systems: What?
A new paradigm for data analytics: Data -> Information -> Insight

Data Classification: Types of Data
- Structured: databases
- Semi-structured: XML, web pages
- Unstructured: images
Data Usage Pattern (usage stats)
- Structured: 10%
- Semi-structured: 10%
- Unstructured: 80%

Big Data Defined
Big data, as characterized by Gartner, has 3 V's: Volume, Velocity and Variety.

Sources of Big Data: New-Age Data Sources
Source: https://www.guru99.com/what-is-big-data.html

Big Data Ecosystem: Landscape of Big Data Systems (figure)
Source :

Desired Properties of Big Data Systems
Pravin Y Pawar

Data Systems Defined
A data system answers users' questions based on data accumulated over a period of time or in real time, for example:
- What are the sales figures for the last 6 months?
- How is the health of the data center at this moment?
Data is different from information: information is derived from data. Data systems make use of data to answer users' queries:
Query = function(all data)
Source: Big Data, Nathan Marz

Properties of Big Data Systems
- Fault tolerance: correct behavior of the system even in case of failures
- Low latency: both read and write response times should be as low as possible
- Scalability: easy to manage the load just by adding additional machines
- Extensibility: easy to update the system or add new features to it
- Maintainability: easy to keep the system running without facing critical issues
- Debuggability: it should be possible to diagnose the system's health if it behaves inappropriately

Data Model of Big Data Systems
Pravin Y Pawar

Properties of Data: Three Properties
Rawness
- Fine-grained data
- Many interesting queries can be answered
- Unstructured data is rawer than structured data
Immutability
- No deletions of or updates to data, only new additions
- The original data is untouched
- Easy to recover from failures / mistakes
- No indexing required, enforcing simplicity
Eternity
- A consequence of immutability
- Data is always pure and true
- Achieved by adding a timestamp to the data
Source: Big Data, Nathan Marz

Fact-based Model for Data: Facts
- The fundamental unit of data
- A fact cannot be derived from anything else
- A fact is atomic and timestamped
- Can be made unique by adding a unique identifier to it
Example facts:
- I work for BITS Pilani
- My DOB is 1 January 1982
- I currently live in Hyderabad
- I was living in Pune between 2010 and 2015
- I am interested in subjects like data mining, data science, etc.

Fact-based Model for Data (2): Benefits
The fact-based model:
- Stores raw data as atomic facts
- Makes facts immutable by adding a timestamp value to them
- Makes facts uniquely identifiable by attaching an identifier
Benefits:
- Data is queryable at any time
- Data is tolerant to human errors
- Data can be stored in both structured and unstructured formats

Fact-based Model for Data (3): Structure / Schema
- A graph schema captures the structure of a dataset stored using the fact-based model
- Depicts the relationships between nodes, edges and properties
- Nodes: entities in the system, for example, a person
- Edges: relationships between entities, for example, person A knows person B
- Properties: information about entities
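A minimal sketch of the fact-based model and of Query = function(all data), assuming simple immutable records for facts; the record layout and function names are illustrative, not taken from Marz's book.

```python
import time
import uuid
from dataclasses import dataclass, field

@dataclass(frozen=True)  # frozen -> facts are immutable
class Fact:
    subject: str      # node, e.g. a person
    predicate: str    # edge or property name
    obj: str          # related entity or property value
    timestamp: float = field(default_factory=time.time)
    fact_id: str = field(default_factory=lambda: str(uuid.uuid4()))

# the master dataset only ever grows: no updates, no deletes
facts = [
    Fact("pravin", "works_for", "BITS Pilani"),
    Fact("pravin", "lives_in", "Pune", timestamp=1.0),
    Fact("pravin", "lives_in", "Hyderabad", timestamp=2.0),  # new fact; the old one stays untouched
]

def current_city(all_facts, person):
    """Query = function(all data): derive the latest city from the raw facts."""
    moves = [f for f in all_facts if f.subject == person and f.predicate == "lives_in"]
    return max(moves, key=lambda f: f.timestamp).obj

print(current_city(facts, "pravin"))  # Hyderabad
```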
Generalized Architecture of Big Data Systems
Pravin Y Pawar

The big data architecture style is designed to handle the ingestion, processing, and analysis of data that is too large or complex for traditional database systems.
Source: https://docs.microsoft.com/en-us/azure/architecture/guide/architecture-styles/big-data

Big Data Applications: Workloads
Big data solutions typically involve one or more of the following types of workload:
- Batch processing of big data sources at rest
- Real-time processing of big data in motion
- Interactive exploration of big data
- Predictive analytics and machine learning

Big Data Systems: Components
Most big data architectures include some or all of the following components:
- Data sources: all big data solutions start with one or more data sources, such as databases, files, IoT devices, etc.
- Data storage: data for batch processing operations is typically stored in a distributed file store that can hold high volumes of large files in various formats.
- Batch processing: because the data sets are so large, a big data solution often must process data files using long-running batch jobs to filter, aggregate, and otherwise prepare the data for analysis. Usually these jobs involve reading source files, processing them, and writing the output to new files.
- Real-time message ingestion: if the solution includes real-time sources, the architecture must include a way to capture and store real-time messages for stream processing.
- Stream processing: after capturing real-time messages, the solution must process them by filtering, aggregating, and otherwise preparing the data for analysis. The processed stream data is then written to an output sink.
- Analytical data store: many big data solutions prepare data for analysis and then serve the processed data in a structured format that can be queried using analytical tools. The analytical data store used to serve these queries can be a Kimball-style relational data warehouse, as seen in most traditional business intelligence (BI) solutions.
- Analysis and reporting: the goal of most big data solutions is to provide insights into the data through analysis and reporting.
- Orchestration: most big data solutions consist of repeated data processing operations, encapsulated in workflows, that transform source data, move data between multiple sources and sinks, load the processed data into an analytical data store, or push the results straight to a report or dashboard. To automate these workflows, you can use an orchestration technology such as Azure Data Factory or Apache Oozie and Sqoop.

Big Data Architecture: Usage
Consider this architecture style when you need to:
- Store and process data in volumes too large for a traditional database
- Transform unstructured data for analysis and reporting
- Capture, process, and analyze unbounded streams of data in real time, or with low latency

Big Data Architecture: Benefits
- Technology choices: a variety of technology options are available, both open source and from vendors.
- Performance through parallelism: big data solutions take advantage of parallelism, enabling high-performance solutions that scale to large volumes of data.
- Elastic scale: all of the components in the big data architecture support scale-out provisioning, so you can adjust your solution to small or large workloads, and pay only for the resources you use.
- Interoperability with existing solutions: the components of the big data architecture are also used for IoT processing and enterprise BI solutions, enabling you to create an integrated solution across data workloads.

Big Data Architecture: Challenges
Things to ponder:
- Complexity: big data solutions can be extremely complex, with numerous components to handle data ingestion from multiple data sources. It can be challenging to build, test, and troubleshoot big data processes.
- Skillset: many big data technologies are highly specialized, and use frameworks and languages that are not typical of more general application architectures. On the other hand, big data technologies are evolving new APIs that build on more established languages.
- Technology maturity: many of the technologies used in big data are evolving. While core Hadoop technologies such as Hive and Pig have stabilized, emerging technologies such as Spark introduce extensive changes and enhancements with each new release.

Thank You!
