Big Data Processing - 2411 - Data Management PDF
Document Details
Uploaded by BountifulSatire4895
EAE Business School
2024
Summary
This document is a chapter on data management from a Big Data Processing course, likely at undergraduate level, offered by EAE Business School in 2024. It covers fundamental concepts, data storage, governance, integration, cleaning, and preparation in a big data context.
Full Transcript
01. Data management: basic concepts and fundamentals.

What does DATA MANAGEMENT mean?

Data Management is the process of collecting, storing, organizing, and maintaining data to ensure it’s accessible, accurate, and ready for analysis. It means understanding how to handle data throughout its lifecycle, from raw data collection to processing and storage, all the way to preparing it for decision-making insights.

Key Concepts
1. Data Collection: Gathering data from various sources, like customer databases, sales records, or social media, ensuring it’s relevant and comprehensive for the business problem at hand.
2. Data Storage: Using systems (like databases or data warehouses) to store data securely and systematically. This includes cloud storage solutions that make large-scale data management feasible and scalable.
3. Data Cleaning and Preparation: Ensuring data quality by removing duplicates, fixing errors, and handling missing values so that analyses are accurate and reliable.
4. Data Governance and Security: Establishing policies for data access, privacy, and compliance to protect sensitive information and meet regulatory requirements.
5. Data Integration: Combining data from multiple sources, like CRM systems or marketing platforms, to get a holistic view for analysis.
6. Data Access and Analytics: Making data accessible to the right people at the right time, often through dashboards or analytics tools, to support data-driven decision-making.

Understanding these elements helps to effectively use data as a strategic asset, making it easier to derive insights and make informed business decisions.

1. Data Collection
What It Is: This is the foundational step where you gather data relevant to your business questions or objectives.
Data can come from internal sources (like sales records, CRM databases, financial systems) or external sources (like market research, social media, or economic data).

Key Steps:
- Identify Data Sources: Determine where the data comes from, including transaction systems, customer feedback forms, IoT devices, or third-party APIs. In business analytics, data sources should align with your business needs.
- Define Data Types: Decide if you need structured data (like tables in a database) or unstructured data (like social media posts). Structured data is easier to analyze, while unstructured data often requires pre-processing but can reveal insights like customer sentiment.
- Select Collection Methods: Common methods include automated data pipelines (for transactional or real-time data), surveys (for customer preferences), and web scraping (for collecting publicly available data). Choose methods based on accuracy, reliability, and ease of integration.
- Ensure Ethical and Legal Compliance: Be mindful of data privacy laws like GDPR in Europe or CCPA in California. Always obtain data responsibly and, where applicable, anonymize it to protect individual privacy.

Why It Matters: Good data collection practices ensure you have reliable data that represents the real-world phenomenon you’re studying. It’s the backbone of any analysis and ultimately impacts the accuracy and quality of your insights.

2. Data Storage
What It Is: After collecting data, you need a safe, organized space to store it. Storage solutions vary based on data size, type, and access requirements.

Key Steps:
- Choose Storage Solutions: Options include databases (like MySQL or PostgreSQL for relational data), data warehouses (like Snowflake for large, structured data), and data lakes (like Amazon S3 for raw or semi-structured data).
- Consider Cloud Storage: Cloud solutions (e.g., AWS, Google Cloud, Azure) offer scalable, cost-effective storage and make data accessible from anywhere.
They’re also convenient for big data projects that require storage flexibility.
- Organize Data Structure: Organize your data logically. Data should be easy to locate and access, so apply structures (e.g., database schemas, table names) that facilitate analysis.
- Ensure Data Backup and Security: Backup mechanisms prevent data loss, and security measures (like encryption and access controls) protect against unauthorized access.

Why It Matters: Proper storage ensures data is accessible, secure, and ready for analysis. It allows for efficient processing and retrieval, especially as the volume of data grows.

3. Data Cleaning and Preparation
What It Is: Also known as “data wrangling,” this step ensures that data is in good shape for analysis by fixing errors, standardizing formats, and filling in missing information.

Key Steps:
- Remove Duplicates: Identify and eliminate any duplicate records that could distort analysis.
- Fix Data Quality Issues: Correct inconsistencies (e.g., formatting differences) and errors (like typos or outliers).
- Handle Missing Data: Decide how to address gaps. You can remove incomplete rows, fill in missing values (with averages, for example), or use imputation techniques.
- Transform Data for Analysis: Sometimes, data must be transformed into a suitable format (e.g., converting dates into standard formats or splitting text fields) for further processing.

Why It Matters: Clean, high-quality data leads to accurate, reliable analysis, reducing the risk of misleading conclusions.

4. Data Governance and Security
What It Is: This step involves setting policies and standards to manage data access, privacy, and security, ensuring that only authorized users can access specific data.

Key Steps:
- Define Access Controls: Use role-based access to restrict data based on user roles, keeping sensitive information secure.
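Looking back at the Transform Data for Analysis step of the cleaning stage, converting dates into a standard format might look like this in Python; the input formats are assumed for illustration:

```python
from datetime import datetime

# Dates arrive in mixed formats from different source systems (assumed
# formats for illustration); standardize everything to ISO 8601.
KNOWN_FORMATS = ("%d/%m/%Y", "%Y-%m-%d", "%b %d, %Y")

def standardize_date(raw):
    """Try each known input format and return the date as YYYY-MM-DD."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")

print(standardize_date("31/01/2024"))    # 2024-01-31
print(standardize_date("Jan 31, 2024"))  # 2024-01-31
```

A real pipeline would log or quarantine unparseable dates rather than raise, but the idea of normalizing to one canonical format is the same.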
- Set Privacy and Compliance Standards: Data must comply with laws and regulations (e.g., GDPR, HIPAA) and should respect user privacy.
- Create a Data Usage Policy: Outline how data should be used, shared, and stored within the organization. Policies help prevent misuse and ensure data integrity.
- Implement Security Protocols: Use encryption, secure passwords, and regular audits to protect data against breaches.

Why It Matters: Governance and security ensure data is protected, usable, and compliant with legal standards, safeguarding both the business and its stakeholders.

5. Data Integration
What It Is: Data integration involves combining data from multiple sources into a cohesive, centralized format, enabling a comprehensive view of the business.

Key Steps:
- Establish Common Data Definitions: Ensure that data fields are consistent across sources. For example, if “customer ID” appears in multiple databases, it should follow the same format.
- Use ETL Tools: Extract, Transform, Load (ETL) tools like Talend or Informatica facilitate data extraction, cleaning, and loading into a central repository.
- Ensure Data Synchronization: Data should be updated regularly across systems so that information remains accurate and up-to-date.
- Resolve Data Conflicts: Handle discrepancies (e.g., different names for the same customer across databases) to ensure that all data aligns.

Why It Matters: Integrated data gives a complete picture of business operations, supporting better analytics and decision-making by combining insights from multiple data sources.

6. Data Access and Analytics
What It Is: The final step is to make data accessible to users (often through dashboards or reports) so they can analyze and derive actionable insights.

Key Steps:
- Implement Business Intelligence (BI) Tools: Tools like Power BI, Tableau, or Looker visualize data for business users, making insights accessible and digestible.
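Returning to the integration stage above: establishing common definitions and resolving conflicts can be illustrated with a toy merge in Python. The sources, field names, and the "CRM wins on conflicts" rule are all hypothetical:

```python
# Two sources describing the same customers (hypothetical fields).
crm = {
    "C001": {"name": "Acme S.L.", "segment": "enterprise"},
    "C002": {"name": "Beta Corp", "segment": "smb"},
}
billing = {
    "C001": {"name": "ACME SL", "balance": 1200.0},  # conflicting name spelling
    "C003": {"name": "Gamma Ltd", "balance": 310.0},
}

def integrate(primary, secondary):
    """Union of both sources keyed by customer ID; primary wins on conflicts."""
    merged = {}
    for cid in primary.keys() | secondary.keys():
        merged[cid] = {**secondary.get(cid, {}), **primary.get(cid, {})}
    return merged

unified = integrate(crm, billing)
print(unified["C001"])  # CRM name kept, billing balance added
```

Dedicated ETL tools do the same job at scale, with persistence, scheduling, and lineage tracking on top.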
- Ensure Role-Based Data Access: Only authorized users should access specific data, maintaining privacy and security.
- Enable Self-Service Analytics: Providing tools and resources for business users to perform their analyses can enhance decision-making across departments.
- Measure Key Metrics and KPIs: Define relevant metrics, such as customer retention rates or sales growth, to monitor performance and drive decisions.

Why It Matters: Accessible data puts insights in the hands of decision-makers, so that analysis translates into timely, informed action across the organization.

DATA LIFECYCLE

Data Life Cycle

Example: GEOTRACKING COMPANY
1. Devices in trucks send data to the cloud.
2. App to show near real time:
   - Truck/driver distance (km)
   - Time in each status (driving, working, waiting & available)
   - Documents
3. ETL pipeline to aggregate all collected valuable information.
4. Data storage: data warehouse for analytical purposes; reporting services.

Deep review of the tool
- Information needed: distance, speed, GPS, brake, fuel consumption…
- Information available, from sensors:
  - Truck: CANBUS
  - Trailer: CANBUS, thermograph

DATA STORAGE

Data Storage
1. Introduction to Data Storage
2. Relational Databases: SQL
3. Non-Relational Databases: NoSQL
4. Data Warehouses
5. Data Lakes
6. Case Study
7. Wrap-Up and Q&A

1. Introduction to Data Storage
“Data is key to good decisions, but great data unlocks powerful insights that propel impactful actions”

Goal: efficient data storage for business insights

Key Points:
- There are different storage systems: databases, data warehouses, data lakes.
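As a taste of what the geotracking company’s ETL pipeline might compute, truck distance can be aggregated from raw GPS fixes with the haversine great-circle formula. The route coordinates below are illustrative, not real telemetry:

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two GPS fixes."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Illustrative GPS track for one truck (Barcelona -> Zaragoza -> Madrid).
track = [(41.3874, 2.1686), (41.6488, -0.8891), (40.4168, -3.7038)]

# Sum the leg-by-leg distances between consecutive fixes.
total_km = sum(haversine_km(*p, *q) for p, q in zip(track, track[1:]))
print(round(total_km), "km")
```

Real CANBUS odometer readings would be more accurate; GPS-derived distance is the fallback when only position fixes are available.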
- Choosing the right storage for efficient data access and analysis.

Discussion: personal experiences with data storage (e.g., spreadsheets, cloud storage).

2. Relational Databases: SQL
Relational databases store data whose relationships and schema are predefined; they are designed to support ACID transactions and to preserve referential integrity and data consistency. They are used in traditional applications, data warehouses, ERP, CRM, and e-commerce.

Database structure: tables, rows, columns, primary and foreign keys.
SQL basics: SELECT, INSERT, UPDATE, DELETE.

Products:
- MySQL, PostgreSQL, MariaDB, Oracle, SQL Server
- Amazon Aurora, Amazon RDS, Amazon

Assistance: AI that helps to create a database.

Discussion: simple SQL exercise on a sample database (e.g., employee or transportation data).

Why the traditional approach cannot cope: Data Mining vs Big Data
Relational Database Management Systems (RDBMS):
- Cannot handle terabytes and petabytes of data.
- Require ever more powerful machines (more processors and memory), which makes scaling them up barely viable.
- 80% of collected data is semi-structured or unstructured, which an RDBMS cannot analyze.
- Cannot keep up with the speed at which data arrives.

Distributed Storage
Storage and processing of big volumes of data with speed and low cost.
DISTRIBUTED COMPUTING

What is BIG DATA? DISTRIBUTED STORAGE

DATA STORAGE SOLUTIONS
Most commonly used Big Data storage solutions:
- Hadoop
- Elasticsearch
- MongoDB
- HBase
- Cassandra
- Neo4j
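Back to the SQL basics listed earlier: all four statements (SELECT, INSERT, UPDATE, DELETE) can be tried without installing a server, using Python’s built-in sqlite3 module. The employees table and its rows are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")

# INSERT: add rows (parameter placeholders avoid SQL injection)
conn.executemany("INSERT INTO employees (name, dept) VALUES (?, ?)",
                 [("Ana", "Sales"), ("Luis", "IT")])

# UPDATE: change an existing row
conn.execute("UPDATE employees SET dept = ? WHERE name = ?", ("Marketing", "Ana"))

# DELETE: remove a row
conn.execute("DELETE FROM employees WHERE name = ?", ("Luis",))

# SELECT: read what is left
rows = conn.execute("SELECT name, dept FROM employees").fetchall()
print(rows)  # [('Ana', 'Marketing')]
conn.close()
```

The same statements run unchanged (modulo dialect details) against MySQL, PostgreSQL, or any other relational product listed above.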
DATA CLEANING AND PREPARATION

Data Cleaning and Preparation
Overview: Often called "data wrangling," this process focuses on getting your data ready for analysis by correcting errors, harmonizing formats, and dealing with missing values.

Key Processes:
1. Eliminate Duplicates: Detect and remove repeated entries that could skew analytical results.
2. Resolve Data Quality Issues: Address inconsistencies such as mismatched formats or errors, including typos and extreme outliers.
3. Address Missing Values: Determine how to handle gaps in your data. Options include removing incomplete entries, substituting missing values (e.g., with averages), or applying advanced imputation techniques.
4. Prepare Data for Analysis: Modify and format data as needed, such as standardizing date formats, splitting combined text fields, or reorganizing datasets for easier analysis.

Importance: Accurate, well-prepared data ensures dependable insights and minimizes the risk of drawing incorrect conclusions.

Beyond Basics: Once the foundational cleaning is done, advanced techniques can further optimize your data for analysis and enhance the quality of your insights.

Key Advanced Steps:
1. Feature Engineering: Create new variables or modify existing ones to uncover additional insights. Example: deriving age from a date of birth or calculating a profitability ratio.
2. Normalization and Scaling: Adjust numerical data to a common scale without distorting relationships. Example: scaling income data to fall between 0 and 1 for machine learning models.
3. Outlier Treatment: Use statistical methods to identify and handle outliers that could bias results. Approaches: winsorization, clipping, or applying robust statistical techniques.
4. Data Enrichment: Integrate additional data sources to provide more context. Example: augmenting sales data with weather information for trend analysis.
5. Automating the Process: Leverage tools or scripts (e.g., Python, R) to automate repetitive cleaning and preparation tasks, improving efficiency and consistency.

Why It’s Critical: Advanced preparation ensures your data is not just clean but also tailored to the analytical techniques you plan to use, enabling deeper insights and more effective decision-making.

Steps:
1. What do we want? Why are we doing this analysis?
2. What data do we have? Understanding the data:
   A. Structured and semi-structured data
   B. Unstructured data

For structured and semi-structured data, the first steps will be performed with an ETL tool that we all have on our computers but never knew we had. We are going to dive into a tool that everybody has: Power Query for Excel. What we are going to learn is also valid for a business intelligence tool like Power BI.

Steps (extended):
1. What do we want? Why are we doing this analysis?
2. What data do we have? Understanding the data:
   A. Structured and semi-structured data
   B. Unstructured data
3. Beyond the basics: outlier treatment, new variables, data enrichment
4. Automate: R/Python scripts
5. Connect those scripts with workflows or cron jobs/scheduled tasks

DATA GOVERNANCE AND SECURITY

4. DATA GOVERNANCE AND SECURITY
Establishing policies for data access, privacy, and compliance ensures sensitive information is protected and regulatory requirements are consistently met. This foundation supports trustworthy, ethical, and efficient use of data across the organization.

ACCESS CONTROL: Defines clear permissions and roles to ensure that only authorized individuals can access specific datasets, minimizing the risk of unauthorized use or breaches.
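Role-based access control, as in the ACCESS CONTROL point above, can be sketched in a few lines. The roles, datasets, and policy table are hypothetical:

```python
# Map each role to the datasets it may read (hypothetical policy).
ROLE_PERMISSIONS = {
    "analyst": {"sales", "marketing"},
    "hr_manager": {"employees"},
    "admin": {"sales", "marketing", "employees"},
}

def can_access(role, dataset):
    """Grant access only if the role's policy explicitly lists the dataset."""
    return dataset in ROLE_PERMISSIONS.get(role, set())

print(can_access("analyst", "sales"))      # True
print(can_access("analyst", "employees"))  # False
```

Note the deny-by-default design: an unknown role gets an empty permission set, so nothing is accessible unless the policy says so.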
PRIVACY: Implements robust measures to safeguard personal and sensitive information, adhering to frameworks such as GDPR, HIPAA, or CCPA to build trust with stakeholders.

COMPLIANCE: Aligns data management practices with legal and industry regulations, ensuring that data usage meets all standards for security, auditability, and accountability.

TRANSPARENCY: Creates clear documentation and audit trails for how data is collected, stored, shared, and processed, promoting organizational integrity and readiness for regulatory audits.

RISK MANAGEMENT: Proactively identifies vulnerabilities and establishes protocols to prevent, detect, and respond to potential data breaches or misuse.

Through strong governance and security practices, organizations can not only protect their data assets but also foster a culture of responsibility, mitigate risks, and enhance the reliability of their data-driven initiatives.

DATA INTEGRATION

5. DATA INTEGRATION
Combining data from multiple sources, such as CRM systems, marketing platforms, or financial databases, creates a unified and holistic view of organizational information. This enables seamless analysis, improved collaboration, and more strategic decision-making.

CONNECTIVITY: Ensures smooth and reliable access to various data sources, facilitating real-time or scheduled synchronization to keep information up-to-date across systems.

TRANSFORMATION: Standardizes, cleanses, and enriches data during integration, ensuring consistency, accuracy, and usability for downstream analytics and reporting.

VISIBILITY: Provides stakeholders with a comprehensive, 360-degree view of operations, customers, and performance by breaking down silos between disparate data systems.

EFFICIENCY: Automates data workflows and reduces manual effort, accelerating the availability of integrated data for timely insights and action.
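The TRANSPARENCY point from the governance section above calls for audit trails of how data is used. A minimal in-memory sketch (a real system would persist entries to durable, tamper-evident storage):

```python
from datetime import datetime, timezone

audit_log = []  # in a real system this would be durable storage

def record_access(user, dataset, action):
    """Append one audit entry: who did what, to which data, and when."""
    audit_log.append({
        "user": user,
        "dataset": dataset,
        "action": action,
        "at": datetime.now(timezone.utc).isoformat(),
    })

record_access("ana", "sales", "read")
record_access("luis", "employees", "export")

# An auditor can later reconstruct exactly how data was used.
print([(e["user"], e["dataset"], e["action"]) for e in audit_log])
```

Combined with the access-control policy, such a log supports both prevention (who may act) and accountability (who did act).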
By enabling the seamless merging of data, integration supports scalable analytics, drives operational excellence, and aligns cross-functional objectives, all while ensuring compliance and governance of data assets.

DATA ACCESS AND ANALYTICS

6. DATA ACCESS AND ANALYTICS
Making data accessible to the right people at the right time ensures that stakeholders can act on accurate, timely information, often through dashboards, reports, or analytics tools, fostering a culture of data-driven decision-making across the organization.

NOTIFICATIONS: Serve as proactive alerts, ensuring that users are immediately informed of significant changes, anomalies, or opportunities within the data, enabling swift action and reducing response times.

REPORTS: Provide structured, in-depth insights with formulas, graphs, and data visualizations, summarizing historical trends and key performance metrics to inform strategy and operations.

DASHBOARDS: Interactive dashboards consolidate and visualize data in real time, making complex information more accessible and actionable. By bringing businesses closer to their data, dashboards empower users to monitor performance, identify patterns, and make informed decisions with agility.

This approach to data access and analytics aligns with broader data management goals, including ensuring data integrity, compliance, and scalability to support evolving business needs.
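The NOTIFICATIONS idea above, a proactive alert when a KPI crosses a threshold, can be sketched like this. The retention figures and the 90% threshold are invented for illustration:

```python
# Daily customer-retention KPI (invented values) and an alert threshold.
retention_by_day = {"Mon": 0.93, "Tue": 0.91, "Wed": 0.84, "Thu": 0.95}
THRESHOLD = 0.90

def alerts(kpis, threshold):
    """Return a message for each day where the KPI drops below the threshold."""
    return [f"ALERT: retention {value:.0%} on {day} below {threshold:.0%}"
            for day, value in kpis.items() if value < threshold]

for msg in alerts(retention_by_day, THRESHOLD):
    print(msg)  # one alert fires, for Wednesday
```

In a BI tool such as Power BI or Tableau the same rule would be configured as a data alert on a dashboard tile rather than coded by hand.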