Big Data Intro.pptx.pdf
Uploaded by PreEminentSpring
INTRODUCTION TO BIG DATA

1. INTRODUCTION TO BIG DATA.
2. DATA STORAGE.
3. ETL Vs ELT.
4. DATA WAREHOUSE.
5. DATA MODELING.
6. LEVELS OF ABSTRACTION.
7. SCHEMA AND ITS TYPES.

INTRODUCTION TO BIG DATA

WHAT IS DATA
Data: A set of facts, numbers, words, sounds, or even pictures that can be recorded and stored. It is the raw material from which information and knowledge are derived through analysis.

Data: Raw, unprocessed, and unorganized material.
Information: Processed and organized data.
Knowledge: The understanding derived from information, and its application.

TYPES OF DATA
Structured data: The most organized type. Imagine a spreadsheet with rows and columns: each piece of data has a predefined format and fits into a specific field. It is easy to search and analyze.
Unstructured data: The opposite of structured data. It has no predefined format and can take many forms, such as text documents, emails, social media posts, images, and videos. It is harder to analyze.
Semi-structured data: Falls somewhere in between. It has some internal organization but does not follow a strict format, as in JSON files that use key-value pairs. It requires specific tools for processing and analysis.

WHAT IS BIG DATA
Big Data: Data that is generated frequently, in high volume, and in multiple forms. It is not just about the size of the data, but also its variety and the speed at which it is generated.

BIG DATA CHARACTERISTICS
The 3 V's: The three key characteristics that determine whether data can be classified as “Big Data”.
1. Volume: The amount of data generated. For example, Facebook stores more than 250 billion images in total.
This number grows every day as people keep posting on the platform.
2. Velocity: The speed at which data is generated. For example, Twitter generates more than 500 million tweets per day.
3. Variety: The different types of data. For example, Instagram generates a variety of data formats such as photos, videos, and text.

Are these three characteristics enough to start working on the data and conducting analysis?

4. Veracity: The quality, reliability, and accuracy of data. Poor data quality can lead to incorrect insights and decisions.
5. Value: The usefulness and importance of the data, and how it can be used to gain benefits and insights. It is the most important “V” from the business perspective.

DATA STORAGE
Data Storage: The process of saving digital information on a medium (such as a hard drive or a cloud service) so it can be accessed, managed, and retrieved later. Data can be stored in several different ways, depending on the type of data and its intended use.

MULTI TEMPERATURE STORAGE
Hot Storage: Frequently accessed data that requires fast read and write speeds. Use cases: real-time and very active databases.
Warm Storage: Occasionally accessed data that doesn't require the same speed as hot storage. Use cases: historical data that is still relevant for regular reporting and analysis.
Cold Storage: Rarely accessed data that is kept for long-term retention. Use cases: archived and old historical data.

DATA STORAGE
Data Lake
Data Warehouse
Data Mart

Data Lake: A large, flexible storage repository that can hold both structured and unstructured data at scale. It allows organizations to store raw, unprocessed data from various sources, such as sensors and social media, in its native format.
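To make the idea of storing raw data in its native format concrete, here is a minimal sketch in Python. It assumes a local directory stands in for the lake; the helper names `land_raw` and `read_sensor_temperatures`, the source names, and the sample payloads are all hypothetical and chosen only for illustration. Payloads are written unchanged, and fields are interpreted only when someone reads them.

```python
import json
import tempfile
from pathlib import Path


def land_raw(lake_dir: Path, source: str, payload: bytes, suffix: str) -> Path:
    """Write a payload into the lake unchanged, grouped by source name."""
    target_dir = lake_dir / source
    target_dir.mkdir(parents=True, exist_ok=True)
    # Number files by how many already exist so earlier drops are not overwritten.
    target = target_dir / f"{len(list(target_dir.iterdir())):06d}{suffix}"
    target.write_bytes(payload)
    return target


def read_sensor_temperatures(lake_dir: Path) -> list[float]:
    """Interpret the raw sensor files only now, at read time."""
    temps = []
    for path in sorted((lake_dir / "sensors").glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        if "temp_c" in record:  # tolerate records that have a different shape
            temps.append(float(record["temp_c"]))
    return temps


with tempfile.TemporaryDirectory() as tmp:
    lake = Path(tmp)
    # Hypothetical raw inputs from two kinds of sources, stored as-is.
    land_raw(lake, "sensors", b'{"device": "a1", "temp_c": 21.5}', ".json")
    land_raw(lake, "sensors", b'{"device": "b2", "humidity": 40}', ".json")
    land_raw(lake, "social", "Loving the new release!".encode("utf-8"), ".txt")

    temperatures = read_sensor_temperatures(lake)
    stored_files = sorted(
        p.relative_to(lake).as_posix() for p in lake.rglob("*") if p.is_file()
    )
```

Nothing is validated or reshaped on the way in, which is why the two sensor records can have different fields. A real data lake would normally sit on distributed or cloud object storage rather than a temporary directory.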
In a data lake, the schema and structure of the data are not predefined, giving organizations the flexibility to extract insights from diverse data sources.

Data Warehouse: A highly structured and optimized repository designed for storing, organizing, and querying structured data for analytical purposes. Data is usually extracted from various sources, transformed into a consistent format, and loaded into the data warehouse for analysis.
A data warehouse follows a predefined schema, enforces data quality and consistency, and is optimized for complex querying and reporting tasks. It is ideal for businesses looking to make informed decisions based on historical data, as it provides a single source of truth for standardized reporting and analytics.

Data Lake                    Data Warehouse
stores all the raw data      stores mainly structured data
can be petabytes             relatively small
difficult to analyze         optimized for data analysis
schema on read               schema on write
no predefined purpose        read-only queries
ELT                          ETL

Data Mart: A specialized and focused subset of a data warehouse, designed to cater to the specific analytical needs of a particular business unit, department, or user group within an organization.

ETL Vs ELT

ETL                                           ELT
Data is transformed before loading            Data is loaded and transformed later
Slow                                          Fast
Provides accessible and clean data that       Takes advantage of powerful resources
is ready for analytics                        in huge data stores
Well known and documented                     Less documentation and experience

ETL
Extract, Transform and Load: The process of combining data from multiple sources into a large, central repository called a data warehouse.
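The extract, transform, load order can be sketched with a small Python example. This is only an illustration under stated assumptions: the two sources (`CSV_SOURCE`, `API_SOURCE`), their field names, and the `orders` table are hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source 1: a flat-file export.
CSV_SOURCE = "order_id,amount_usd,order_date\n1,19.99,2024-01-05\n2,5.00,2024-01-06\n"
# Hypothetical source 2: records as an API might return them
# (amount in cents, date written as day/month/year).
API_SOURCE = [{"id": 3, "amount_cents": 1250, "date": "07/01/2024"}]


def extract():
    """Extract: pull rows from each source without changing them."""
    csv_rows = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    return csv_rows, API_SOURCE


def transform(csv_rows, api_rows):
    """Transform: bring both sources into one consistent format."""
    unified = []
    for row in csv_rows:
        unified.append((int(row["order_id"]), float(row["amount_usd"]), row["order_date"]))
    for row in api_rows:
        day, month, year = row["date"].split("/")
        unified.append((row["id"], row["amount_cents"] / 100, f"{year}-{month}-{day}"))
    return unified


def load(conn, rows):
    """Load: write the cleaned rows into the central repository."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, amount_usd REAL, order_date TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # stand-in for a real warehouse
load(warehouse, transform(*extract()))
loaded = warehouse.execute(
    "SELECT order_id, amount_usd, order_date FROM orders ORDER BY order_id"
).fetchall()
```

In an ELT variant of the same sketch, the raw rows would be loaded first and the unit and date conversions would run inside the target store afterwards.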
ETL STEPS
1. Extract
Data Sources: Begin by identifying the sources from which data needs to be extracted, e.g. databases, flat files (like CSV or Excel), APIs, or web services.
Data Extraction: The process of retrieving data from various sources and collecting it in a structured format for further analysis, processing, or storage.

1.1 Extract: Extraction Types
1. Update notification: The source system notifies you when a data record changes. EX: databases and web applications.
2. Incremental extraction: The system checks for changes at periodic intervals, such as once a week or once a month.
3. Full extraction: Some systems can't identify data changes or give notifications, so all the data has to be extracted again.

1.2 Extract: Extraction Methods
1. Database Queries: SQL queries are commonly used to extract data from relational databases.
2. ETL Tools: These tools automate the extraction process by connecting to various data sources, performing transformations, and loading data into a target destination.
3. Streaming Data: Continuous extraction of real-time data streams from sources like sensors, IoT devices, or social media platforms.
4. Web Scraping: Extracting relevant information from websites by parsing their HTML pages.
5. File Parsing: Reading and parsing files in different formats (such as CSV, JSON, XML) to extract data.
6. API Extraction: Application Programming Interfaces (APIs) allow access to data from web services or software applications, enabling programmatic extraction of data.

2. Transform
2.1 Transform: Basic data transformation
Data cleansing: Removes errors and maps source data to the target data format.
Data deduplication: Identifies and removes duplicate records as part of data cleansing.
Data format revision: Converts data, such as measurement units and date/time values, into a consistent format.

2.2 Transform: Advanced data transformation
Splitting: Divides a column or data attribute into multiple columns in the target system.
Derivation: Derives new values from existing values.
Encryption: Protects sensitive data to comply with data privacy requirements.

3. Load
Once extracted and transformed, the data is typically loaded into a target destination, such as a data warehouse or analytical platform, where it can be used for reporting and decision-making.

DATA WAREHOUSE
Data Warehouse: A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process. It is a decision support database that is maintained separately from the organization’s operational database.
DW CHARACTERISTICS
Subject-oriented: Provides a simple and concise view around particular subject issues by excluding data that is not useful in the decision support process.
Integrated: Constructed by integrating multiple, heterogeneous data sources. EX: relational databases, flat files.
Time-variant: The time horizon of a data warehouse is much longer than that of operational systems.
  Operational database: current-value data.
  Data warehouse: provides information from a historical perspective (e.g., the past 5-10 years).
Nonvolatile: A physically separate store of data transformed from the operational environment. Operational updates of data do not occur in the data warehouse environment.

DATA MODELING
Data Modeling: The process of creating a visual representation of a system's data and its relationships. Its purpose is to help in understanding data structures and to improve communication among stakeholders.
[Figures: Simple ERD; Control Flow and Data Flow]
Data Modeling in Data Warehouse: Designs the structure of data storage and ensures that data is organized, accessible, and useful for analytical purposes.

LEVELS OF ABSTRACTION
Levels of Abstraction: Different stages of detail in the modeling process. They help in refining the data warehouse design from high-level concepts to detailed, implementable structures.
1- View Level: The highest level of abstraction. Its purpose is to present data in a format that is understandable and accessible to users.
2.1- Conceptual Level: Focuses on understanding business requirements and defining high-level entities and relationships. EX: ER Diagram.
2.2- Logical Level: Translates business requirements into detailed, technology-independent data models. EX: Dimensional Data Models: “star and snowflake schema”.
3- Physical Level: Implements the logical models, focusing on physical storage, performance, and optimization.

DATA INDEPENDENCE
Data Independence: The ability to modify the model at one level without impacting the higher levels.
Advantages:
Simplified Maintenance.
Enhanced Flexibility.

SCHEMA AND ITS TYPES

SCHEMA
Schema: Defines the structure and organization of data and how data relates logically within a data warehouse. It contains:
Fact Table.
Dimension.

Fact Table: The central table in a star or snowflake schema within a data warehouse. It contains:
Measurements: Quantitative data that can be analyzed.
Foreign keys: Keys that link to dimension tables.

Dimension Tables: Contain descriptions of the objects in a fact table and provide information about dimensions such as values, characteristics, and keys.

DATA WAREHOUSE SCHEMA TYPES

TYPES OF SCHEMA
1. Star Schema: A simple data warehouse schema that organizes data into a central fact table surrounded by dimension tables.
Characteristics: Easy to understand and use.
[Figure: STAR SCHEMA]

2. Snowflake Schema: A more complex variation of the star schema, where dimension tables are normalized into multiple related tables.
Characteristics: More complex schema with multiple joins, which can impact query performance.
[Figure: SNOWFLAKE SCHEMA]

3. Galaxy Schema: A complex data warehousing schema that consists of multiple fact tables sharing common dimension tables.
Characteristics: More complex schema with multiple joins, which can impact query performance.
[Figure: GALAXY SCHEMA]

Why are there so many types of schema?
Different schema types exist to address different needs, complexities, and performance considerations in data management and analysis.

By: Arwa ElSharawy Arwa Hossam
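As a worked illustration of the star schema described in this section, the following Python sketch builds one fact table and two dimension tables in an in-memory SQLite database. The table names (`fact_sales`, `dim_product`, `dim_date`), their columns, and the sample rows are assumptions made up for this example; they do not come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes of the things being measured.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

-- Fact table: measurements plus foreign keys that link to the dimensions.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    quantity INTEGER,
    revenue REAL
);

INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
INSERT INTO dim_date VALUES (10, 2024, 1), (11, 2024, 2);
INSERT INTO fact_sales VALUES (1, 10, 2, 2000.0), (2, 10, 1, 150.0), (1, 11, 1, 1000.0);
""")

# A typical analytical query: join the fact table to each dimension it needs,
# then aggregate a measurement by descriptive attributes.
revenue_by_category = conn.execute("""
    SELECT p.category, d.year, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.year
    ORDER BY p.category
""").fetchall()
```

Each dimension is reached with a single join from the fact table. In a snowflake version of this sketch, `category` would move into its own table and the query would need one more join, which is the extra complexity the characteristics above refer to.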