Big Data: A Deep Dive for Electronics Engineering
Ric Marvin G. Dimabayao
Summary
This document provides an overview of big data, covering historical context, its evolution through three phases (structured, unstructured, and mobile/sensor-based), and the four key characteristics (volume, velocity, variety, and veracity). It explains how big data is used in different industries, providing examples of structured and unstructured data. The document also includes an activity to test the reader's understanding of the different big data phases.
Full Transcript
UNLEASHING THE POWER OF BIG DATA: A DEEP DIVE FOR ELECTRONICS ENGINEERING
RIC MARVIN G. DIMABAYAO

WHAT IS BIG DATA?
Big data refers to extremely large and diverse collections of structured, unstructured, and semi-structured data that continue to grow exponentially over time. These datasets are so huge and complex in volume, velocity, and variety that traditional data management systems cannot store, process, and analyze them.

HISTORICAL CONTEXT AND EVOLUTION OF BIG DATA

Origin and Popularization
- The term "Big Data" has been in use since the early 1990s.
- John R. Mashey, associated with Silicon Graphics, is often credited with popularizing the term.
- Big Data is now a well-established field in both academia and industry.

Early Use of Data
- Historical efforts to use data for decision-making and gaining advantages date back to ancient civilizations.
- The Library of Alexandria (around 300 B.C.) is an early example of attempting to collect and use data.
- The library contained an estimated 40,000 to 400,000 scrolls (equivalent to around 100,000 books).

Historical Perspective
- Big Data combines the mature field of statistics with the relatively young domain of computer science.
- It builds upon knowledge from mathematics, statistics, and data analysis.

Roman Empire's Data Analysis
- The Roman military used detailed statistical analysis for predictive purposes.
- Early forms of predictive data analysis helped efficiently deploy Roman armies.
- These techniques provided military advantages.
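The distinction between structured, semi-structured, and unstructured data in the definition above can be made concrete with a minimal Python sketch. The records and text here are invented purely for illustration:

```python
import csv
import io
import json

# Structured: rows with a fixed schema, as in a relational table (invented sample).
structured = list(csv.DictReader(io.StringIO(
    "name,city\nAna,Manila\nBen,Cebu"
)))

# Semi-structured: self-describing but flexible, e.g. JSON with optional fields.
semi_structured = json.loads('{"user": "Ana", "tags": ["iot", "sensors"]}')

# Unstructured: free text with no predefined data model; structure must be inferred.
unstructured = "Loved the product, but shipping to Cebu took two weeks."

print(structured[0]["city"])        # fields are directly addressable: Manila
print(semi_structured["tags"])      # keys and shapes may vary per record
print("shipping" in unstructured)   # only crude text search without further parsing
```

Structured fields can be queried directly; semi-structured data needs its keys inspected; unstructured text requires parsing or mining before it yields anything beyond keyword matches.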
BIG DATA PHASES
- Big Data Phase 1 – Structured Content
- Big Data Phase 2 – Web-Based Unstructured Content
- Big Data Phase 3 – Mobile and Sensor-Based Content

BIG DATA PHASE 1 – STRUCTURED CONTENT
- Originated from the longstanding domain of database management.
- Relies heavily on the storage, extraction, and optimization techniques common in RDBMSs.
- Techniques like SQL and ETL were professionalized in the 1970s.
- Database management and data warehousing systems are fundamental to modern Big Data solutions; quick storage and retrieval of data are core requirements for Big Data analysis.
- Relational database management technology is integral to Big Data solutions.
- Key technologies and characteristics from this phase are embedded in solutions from leading IT vendors like Microsoft, Google, and Amazon.

BIG DATA PHASE 2 – WEB-BASED UNSTRUCTURED CONTENT
- Early 2000s: the Internet and web applications began generating large amounts of data.
- Web applications stored data in relational databases.
- Unstructured data from IP-specific search and interaction logs provided new insights into internet user behavior.
- The expansion of web traffic and online stores led companies like Yahoo, Amazon, and eBay to analyze customer behavior through click rates, IP-specific location data, and search logs.
- HTTP-based web traffic increased semi-structured and unstructured data, so organizations needed new approaches and storage solutions to analyze these new data types.
- The growth of social media data increased the need for tools and technologies to extract meaningful information from unstructured data.
- New technologies like network analysis, web mining, and spatial-temporal analysis were developed to analyze large quantities of web-based unstructured data.

BIG DATA PHASE 3 – MOBILE AND SENSOR-BASED CONTENT
- The current phase of Big Data evolution, driven by mobile technology and devices.
- In 2011, mobile devices and tablets surpassed laptops and PCs in number.
- An estimated 10 billion internet-connected devices existed in 2020.
- Mobile devices generate data every second, including behavioral and location-based GPS data, allowing tracking of movement, physical behavior, and health-related data in real time.
- Sensor-based internet-enabled devices contribute to increased data creation. The Internet of Things (IoT) includes connected TVs, thermostats, wearables, and refrigerators.
- With the continuous growth of the IoT, the race to extract valuable information from new data sources is ongoing.

A SUMMARY OF THE EVOLUTION OF BIG DATA AND ITS KEY CHARACTERISTICS PER PHASE

Phase 1: Structured Data
- Highly organized and formatted data.
- Easily searchable and manageable.
- Stored in relational databases.
- Examples: customer records with fields like name, address, phone number, email; sales transaction data with product ID, quantity, price, date; inventory data with item number, description, quantity on hand; financial data like account balances and transaction history.

Phase 2: Unstructured Data
- Unorganized and without a predefined data model.
- Difficult to search and manage.
- Often textual or multimedia in nature.
- Examples: social media posts (text, images, videos); emails and other text documents; audio and video files; satellite imagery and sensor data (before advanced processing).

Phase 3: Mobile and Sensor-Based Content
- Generated by mobile devices and sensors.
- High volume, velocity, and variety.
- Often real-time or near-real-time.
- Examples: location data from smartphones; sensor data from wearables (heart rate, steps, sleep patterns); social media posts from mobile devices; images and videos captured by smartphones; IoT device data (temperature, humidity, air quality).

Guessing Game Activity: "Data Phase Detective"
Objective: Identify the phase of Big Data for each example.
1. I'll present a data example.
2. You'll decide which phase of Big Data it fits into.
   Type 1 if the sample is Phase 1: Structured Data
   Type 2 if the sample is Phase 2: Unstructured Data
   Type 3 if the sample is Phase 3: Mobile and Sensor-Based Content
3. We'll then discuss the answer and why it fits that particular phase.

"DATA PHASE DETECTIVE"
EXAMPLE 1: A database containing customer information with fields for name, address, phone number, and email.
Answer: 1 or Structured Data (Phase 1)

EXAMPLE 2: A collection of customer reviews and feedback from an online shopping website.
Answer: 2 or Unstructured Data (Phase 2)

EXAMPLE 3: Sensor data from a smart thermostat, including temperature readings and humidity levels over time.
Answer: 3 or Mobile and Sensor-Based Content (Phase 3)

EXAMPLE 4:
Answer: 2 or Unstructured Data (Phase 2)

EXAMPLE 5:
Answer: 1 or Structured Data (Phase 1)

THE 4 V'S OF BIG DATA
Big data is characterized by its volume, velocity, variety, and veracity. These four dimensions, often referred to as the "4 V's," define the challenges and opportunities presented by these massive datasets.

01 Volume
Volume refers to how much data is actually collected. An analyst must determine what data, and how much of it, needs to be collected for a given purpose. To imagine the possibilities, consider a social media site where people write updates, like photos, review businesses, watch videos, search for new items, and interact in some way with just about everything they see on their screens.
Each of these interactions generates data about that person that can be fed into algorithms.

02 Variety
Variety refers to how many points of reference are used to collect data. If data is collected from a single source, that information may be skewed in some way: it will not represent a broad population or a wide trend. In some cases, as with velocity, that is fine. A pet microchipping service, for example, may only want to target data from a neighborhood social networking site. A movie company, on the other hand, may want to target several social media sites and people of various age groups, so it would need more points of reference to decide on the best places to do business.

03 Velocity
Velocity in big data refers to how fast data can be generated, gathered, and analyzed. Big data does not always have to be used imminently, but in some fields there is a great advantage to receiving up-to-the-second information about rates and being able to act accordingly. In other businesses, the data trend over time is more important for making predictions or solving lingering problems.

04 Veracity
Veracity relates to how reliable data is. An analyst wants to ensure that the data they look at is valid and comes from a trusted source. This is determined by where the data comes from and how it is collected. Data collected from native sites rather than third parties is necessary for reliable results. Additionally, testing measures must be properly designed to ensure that the data results in the desired information and is not extraneous.

CHALLENGES OF MANAGING AND PROCESSING BIG DATA

Data Storage Issues
- Scalability Concerns: Big data is constantly growing, demanding storage solutions that can effortlessly expand to accommodate increasing volumes. Traditional storage systems often struggle to keep pace with this rapid growth.
- Cost of Storage Solutions: Storing vast amounts of data incurs substantial expenses. Organizations must carefully evaluate different storage options (e.g., cloud, on-premises) to find cost-effective solutions.

Data Processing Challenges
- Handling Real-Time Data Processing: Many business decisions require immediate insights from data. Processing large volumes of data in real time is computationally intensive and demands specialized technologies.
- Ensuring Data Quality and Accuracy: Big data often originates from diverse sources, leading to inconsistencies and errors. Cleaning and preparing data for analysis is time-consuming and crucial for reliable results.

Security and Privacy
- Protecting Sensitive Data: Big data often contains sensitive personal information. Safeguarding this data from unauthorized access, breaches, and cyberattacks is paramount.
- Compliance with Regulations: Various industries have strict data privacy regulations. Adhering to these laws while managing big data can be complex and costly.

EXCEL POWER QUERY

Power Query is a powerful tool that helps you transform and prepare data for analysis. It is part of Microsoft Excel and Power BI, allowing you to efficiently clean, shape, and combine data from various sources.

Importance of Power Query

1. Simplifies Data Preparation
- Efficient Data Gathering: Power Query streamlines the process of gathering data from multiple sources, including databases, spreadsheets, and cloud services, reducing the need for manual data collection.
- Automated Data Cleaning: It provides tools for automating data cleaning tasks, such as removing duplicates, handling missing values, and correcting data formats, significantly reducing the time spent on these repetitive tasks.

2.
Enhances Data Analysis Capabilities
- Advanced Data Transformation: Power Query offers robust tools for cleaning, shaping, and transforming data, including capabilities for filtering rows, adding calculated columns, and pivoting and unpivoting data.
- Data Integration: It allows seamless integration of data from various sources, enabling users to combine datasets in meaningful ways, which leads to more comprehensive and accurate insights.
- Improved Accuracy: By automating and standardizing data preparation tasks, Power Query helps ensure that the data used in analysis is accurate and consistent, leading to more reliable results.

The Four Phases of Power Query

1. Connect: Users connect to the data source(s) from which they want to extract data. Power Query supports many data sources, including databases, files, web pages, and more. Users can also specify any required authentication or authorization details during this phase.

2. Transform: Once the data is loaded into Power Query, users can use various data transformation tools to clean, reshape, and transform the data to meet their specific needs. Common transformation tasks include removing duplicates, filtering data, merging data, splitting columns, and pivoting data.

3. Combine: Power Query also allows users to combine data from multiple sources using various techniques. Users can merge tables, or append or join data using a common key. This phase is beneficial for integrating data from different sources into a single, unified view.

4. Load: Users specify where to load the transformed data. They can load the data into an Excel worksheet or a Power BI report, or create a connection to the data source so that the data is automatically refreshed whenever the source data changes.

Key Features of Power Query

Data Connectivity
Power Query excels in connecting to a wide range of data sources, including:
- From Files: Excel files (Workbook), Text or CSV files, XML files, and JSON files.
- From Databases: SQL Server, Microsoft Access, SQL Server Analysis Services.
- From Other Sources: Excel Tables/Ranges, Web, Microsoft Query, OData feeds.

User-Friendly Interface
Power Query offers a user-friendly interface that simplifies the process of data transformation. Users can perform complex data transformations with just a few clicks, without needing to write code. The Power Query Editor lets users load data, apply transformations, and preview the results at each step.

Power Query Editor: https://www.simplilearn.com/tutorials/excel-tutorial/power-query-in-excel#what_is_power_query

Data Transformation Tools
Power Query provides a robust set of tools for cleaning and shaping data:
- Text formatting functions
- Splitting a column using delimiters
- Transposing a data table
- Removing duplicates

Data Integration
Data integration in Power Query involves combining data from different sources into a single, unified view. This is important because it allows us to gather all the relevant information we need in one place, making our data analysis more effective and comprehensive.

Automation
Automating data refreshes means setting up your data systems to update themselves without you having to do it manually. This is particularly useful when dealing with large datasets that change frequently.

Scalability
Power Query is designed to handle large datasets by leveraging efficient data loading and transformation capabilities. Scalability is crucial when working with Big Data, as it allows the system to process and analyze increasing volumes of data seamlessly.
- Data Import and Transformation: Power Query enables users to import large volumes of data from various sources and perform transformations without impacting performance. Its ability to handle data efficiently means it can scale up as your data grows.
- Query Folding: Power Query utilizes query folding, which pushes data transformation operations back to the data source server. This approach ensures that only necessary data is transferred and processed, enhancing scalability and performance.

Handling Large Datasets Effectively
Handling large datasets is a common challenge in Big Data analysis, but Power Query offers several features to manage this efficiently:
- Incremental Data Refresh: This feature allows you to refresh only new or changed data instead of reloading the entire dataset. It is particularly useful for large datasets where processing everything afresh would be time-consuming.
- Data Compression: Power Query compresses data during the import process, reducing the size of the dataset in memory and improving performance.
- Efficient Data Transformation: With Power Query, you can perform data transformations and clean-ups directly within the tool, reducing the need for additional processing and making it easier to work with large volumes of data.

Examples of Big Data Scenarios
- Healthcare Data Analysis: In the healthcare sector, Power Query can be used to aggregate patient data from various electronic health record (EHR) systems, enabling analysis of patient outcomes, treatment effectiveness, and operational efficiency.
- Financial Services: Financial institutions use Power Query to integrate and analyze data from various financial systems, including transaction data, market data, and customer information, to detect fraud, analyze market trends, and make investment decisions.

Real-Time Data Processing
Connecting to Real-Time Data Sources: Power Query can connect to live data sources such as streaming data services, real-time databases, and online APIs. This capability ensures that your analysis is based on the most current data available.
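The incremental data refresh idea described above, loading only rows newer than a high-water mark instead of reloading everything, can be pictured with a minimal Python sketch. The source table and refresh logic are invented for illustration; Power Query implements this internally rather than exposing it as code:

```python
from datetime import datetime

# Hypothetical source table of (timestamp, temperature) rows that grows over time.
source = [
    (datetime(2024, 1, 1, 10, 0), 21.5),
    (datetime(2024, 1, 1, 10, 5), 21.7),
    (datetime(2024, 1, 1, 10, 10), 22.1),
]

cache = []          # rows already loaded into the local model
last_loaded = None  # high-water mark: timestamp of the newest row loaded so far

def refresh(cache, last_loaded):
    """Append only rows newer than the high-water mark; return the new mark."""
    new_rows = [r for r in source if last_loaded is None or r[0] > last_loaded]
    cache.extend(new_rows)
    return max((r[0] for r in cache), default=last_loaded)

last_loaded = refresh(cache, last_loaded)  # first refresh loads all 3 rows
source.append((datetime(2024, 1, 1, 10, 15), 22.4))
last_loaded = refresh(cache, last_loaded)  # second refresh loads only the 1 new row
print(len(cache))  # 4
```

The second refresh touches one row instead of four, which is the whole benefit when the dataset has millions of rows rather than four.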
Use Cases of Real-Time Data Analysis
- IoT Monitoring: In the Internet of Things (IoT) landscape, real-time data from sensors and devices can be analyzed to monitor system performance, detect anomalies, and trigger alerts.
- Social Media Analytics: Real-time analysis of social media feeds helps businesses track brand mentions, sentiment, and emerging trends as they happen.
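The IoT monitoring use case above, detecting anomalies and triggering alerts on streaming sensor readings, can be sketched in Python with a simple rolling-average threshold. The readings, window size, and threshold are invented for illustration; real monitoring systems use more sophisticated detectors:

```python
from collections import deque

def detect_anomalies(readings, window=5, threshold=3.0):
    """Flag readings that deviate from the rolling mean of the previous
    `window` readings by more than `threshold`."""
    recent = deque(maxlen=window)  # sliding window of the most recent readings
    alerts = []
    for i, value in enumerate(readings):
        if len(recent) == window:
            mean = sum(recent) / window
            if abs(value - mean) > threshold:
                alerts.append((i, value))  # this is where an alert would fire
        recent.append(value)
    return alerts

# Simulated temperature stream from a smart thermostat (degrees Celsius).
stream = [21.0, 21.2, 21.1, 21.3, 21.2, 21.1, 27.5, 21.2, 21.3]
print(detect_anomalies(stream))  # [(6, 27.5)]
```

The spike to 27.5 °C at index 6 is flagged because it sits far above the rolling mean of the preceding readings, while the ordinary fluctuations pass silently.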