Topic 02: Understanding Data, Data Wrangling, Data Visualization

Document Details

Uploaded by StylishArithmetic2104

Engr. Ranzel Dimaculangan

Tags

data visualization, data analysis, data wrangling, big data

Summary

This document provides an overview of data visualization and analysis topics, including types of data, big data concepts, and data wrangling techniques. It's suitable for an introductory-level course on data analysis and visualization in the field of engineering.

Full Transcript

Industrial Engineering
Understanding Data
Engr. Ranzel Dimaculangan

Agenda
1. Types of Data
2. Big Data
3. Data Quality
4. Data Collection Methods
5. Data Ethics
6. Data Wrangling
7. Data Visualization

01 Types of Data

1. Quantitative Data
2. Qualitative Data
3. Binary Data
4. Time-Series Data
5. Spatial Data
6. Textual Data
7. Structured vs. Unstructured Data

Quantitative Data (Numerical Data)
Quantitative data represents numerical values that quantify an attribute or characteristic. It can be further divided into:
1. Discrete Data: can only take specific, distinct values and often involves counting things (e.g., number of students in a class, number of cars in a parking lot).
2. Continuous Data: can take any value within a given range and often involves measurements (e.g., height, weight, temperature). It is continuous because it can be infinitely subdivided into smaller increments.

Qualitative Data (Categorical Data)
Qualitative data represents categories or labels rather than numbers. It can be further divided into:
1. Nominal Data: categories with no intrinsic ordering. Examples include gender, nationality, or the type of car (e.g., sedan, SUV, truck).
2. Ordinal Data: categories with a meaningful order, but the intervals between the categories are not necessarily equal. Examples include rankings (e.g., first, second, third) or levels of satisfaction (e.g., satisfied, neutral, dissatisfied).

Binary Data
Binary data is a type of qualitative data with only two categories or states, typically represented as 0 and 1, true and false, or yes and no. Examples include whether a switch is on or off, or whether an email is spam or not.

Time-Series Data
Time-series data is collected over time, usually at regular intervals. It is crucial in areas like economics, finance, and meteorology. Examples include daily stock prices, hourly temperature readings, and monthly sales figures.

Spatial Data (Geospatial Data)
Spatial data relates to the physical location and shape of objects. It is often used in geographic information systems (GIS) and involves coordinates like latitude and longitude. Examples include maps, satellite imagery, and location-based data.

Textual Data
Textual data consists of words, sentences, or entire documents. It is typically unstructured and requires techniques like natural language processing (NLP) to analyze. Examples include emails, social media posts, and customer reviews.

Structured vs. Unstructured Data
Structured Data: organized in a predefined manner, usually in tables with rows and columns. Examples include databases and spreadsheets.
Unstructured Data: data without a predefined format or structure, including text, images, audio, and video files. Analyzing unstructured data often requires more complex processing methods.
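As a minimal sketch of how these data types look in practice, the pandas snippet below builds a tiny table with one column per type; the column names and values are invented for illustration.

```python
import pandas as pd

# Invented sample records, one column per data type discussed above.
df = pd.DataFrame({
    "students":     [32, 28, 35],                        # discrete (counts)
    "height_cm":    [162.5, 171.0, 158.2],               # continuous (measurements)
    "car_type":     ["sedan", "SUV", "truck"],           # nominal (no order)
    "satisfaction": ["satisfied", "neutral", "dissatisfied"],  # ordinal
    "is_spam":      [False, True, False],                # binary
    "recorded":     ["2024-01-01", "2024-01-02", "2024-01-03"],  # time-series
})

# Declare categorical columns explicitly.
df["car_type"] = df["car_type"].astype("category")
df["satisfaction"] = pd.Categorical(
    df["satisfaction"],
    categories=["dissatisfied", "neutral", "satisfied"],
    ordered=True,  # ordinal: order matters, intervals do not
)
df["recorded"] = pd.to_datetime(df["recorded"])

print(df.dtypes)
```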
02 Big Data

Big Data refers to extremely large and complex datasets that are beyond the capabilities of traditional data processing tools to manage, analyze, and store effectively. These datasets are characterized by:
1. Volume
2. Velocity
3. Variety
4. Veracity
5. Value

Volume
Description: The sheer size of data being generated and collected is immense, often measured in terabytes, petabytes, or even exabytes.
Example: Social media platforms like Facebook generate petabytes of data daily, including posts, comments, likes, and photos.

Velocity
Description: The speed at which data is generated, collected, and processed. With the rise of IoT devices, sensors, and social media, data is created in real time or near real time.
Example: Financial markets generate data at high velocity as stock prices, trading volumes, and news are updated every second.

Variety
Description: Big Data comes in various formats, including structured, semi-structured, and unstructured data. This variety requires different approaches to storage and analysis.
Types:
- Structured Data: organized in rows and columns, like databases and spreadsheets.
- Unstructured Data: data without a predefined structure, like text, images, videos, and social media posts.
- Semi-Structured Data: data that doesn't fit neatly into tables but has some organizational properties, like XML or JSON files (a short JSON example follows at the end of this Big Data section).
Example: An organization might collect structured data from customer transactions, unstructured data from social media interactions, and semi-structured data from email logs.

Veracity
Description: Veracity refers to the quality, accuracy, and trustworthiness of the data. Big Data can include noise, inconsistencies, and uncertainty, making it challenging to derive meaningful insights.
Example: Data collected from social media might contain inaccurate or biased information, requiring careful filtering and validation.

Value
Description: The potential value that can be derived from analyzing Big Data. It is not just about having a large volume of data, but about extracting meaningful insights that lead to better decision-making, innovation, and competitive advantage.
Example: Retailers analyze purchasing patterns and customer behavior to optimize inventory, personalize marketing efforts, and improve customer experience.

Sources of Big Data
Big Data comes from various sources, including but not limited to:
- Social Media: platforms like Twitter, Facebook, and Instagram generate vast amounts of user-generated content daily.
- Internet of Things (IoT): connected devices, such as smart thermostats, wearable fitness trackers, and industrial sensors, continuously collect and transmit data.
- Transaction Data: every time a customer makes a purchase, a transaction is recorded, contributing to massive datasets in e-commerce and retail.
- Web Data: clickstreams, website visits, and browsing history generate large volumes of data about user behavior online.
- Machine Data: logs from servers, applications, and networks provide a wealth of information about system performance and usage.
- Healthcare Data: medical records, imaging data, and data from wearable health devices contribute to Big Data in healthcare.
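Since Variety above names JSON as a semi-structured format, here is a minimal sketch of flattening JSON records into a structured table with pandas; the records and field names are invented for illustration.

```python
import json
import pandas as pd

# Invented semi-structured JSON: nested fields, no fixed schema.
raw = '''
[
  {"user": "a01", "likes": 12, "profile": {"country": "PH", "age": 29}},
  {"user": "a02", "likes": 3,  "profile": {"country": "US"}}
]
'''

records = json.loads(raw)

# json_normalize flattens the nested "profile" object into columns;
# fields missing from a record (a02 has no age) become NaN.
df = pd.json_normalize(records)
print(df)
#   user  likes profile.country  profile.age
# 0  a01     12              PH         29.0
# 1  a02      3              US          NaN
```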
03 Data Quality

Data quality refers to the condition of a dataset and how well it meets the requirements of its intended use. High-quality data is crucial for accurate analysis, decision-making, and reporting.

Key Dimensions of Data Quality
1. Accuracy
2. Completeness
3. Consistency
4. Timeliness
5. Validity
6. Uniqueness
7. Integrity
8. Relevance
9. Accessibility
10. Reliability

Accuracy
Definition: how closely data reflects the real-world values or events it is supposed to represent.
Example: A dataset recording temperatures should have readings that match the actual temperatures at the time of measurement. If the temperature was 25°C and the dataset shows 35°C, the data is inaccurate.

Completeness
Definition: the extent to which all required data is present in the dataset.
Example: If a customer database is supposed to include phone numbers, but 20% of the records are missing phone numbers, the dataset is incomplete.

Consistency
Definition: the data is uniform and reliable across different datasets or within the same dataset.
Example: If a product's price is listed as $100 in one system and $90 in another, the data is inconsistent. Consistent data should match across different systems and time periods.

Timeliness
Definition: how up-to-date the data is and whether it is available when needed.
Example: Sales data should be updated daily to reflect the most recent transactions. If the data is a week old, it may no longer be relevant or useful for real-time decision-making.

Validity
Definition: the data conforms to the expected formats, rules, or constraints.
Example: If a field for birth dates contains future dates, those records are invalid, since a birth date should not be in the future. Valid data adheres to defined business rules and data types.

Uniqueness
Definition: each record is distinct and not duplicated within a dataset.
Example: A customer database should not have multiple records for the same individual. Duplicate records can lead to inaccurate analysis and reporting.

Integrity
Definition: the correctness and reliability of the relationships between different data elements within a dataset.
Example: In a relational database, if a foreign key in one table references a nonexistent primary key in another table, this violates referential integrity. Data with high integrity correctly reflects all defined relationships.

Relevance
Definition: the data is applicable and useful for the specific context or purpose for which it is being used.
Example: In a study on urban transportation, data on rural agricultural practices might be irrelevant, even if it is accurate and complete.

Accessibility
Definition: how easily data can be accessed and used by those who need it, without unnecessary barriers.
Example: A dataset stored in a format that requires specialized software may be less accessible than one stored in a common format like CSV or Excel, even if both contain high-quality data.

Reliability
Definition: the consistency of data over time and across different contexts.
Example: A sensor that provides consistent readings under the same conditions is reliable. Unreliable data may produce different results under similar conditions, leading to uncertainty.
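Several of these dimensions can be checked mechanically. Below is a minimal sketch in pandas, on an invented customer table with deliberate quality problems, that audits completeness, uniqueness, and validity.

```python
import pandas as pd

# Invented records: a missing phone, a duplicated ID, a future birth date.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "phone":       ["0917-555-0101", None, None, "0917-555-0404"],
    "birth_date":  ["1990-05-01", "1985-11-23", "1985-11-23", "2030-01-01"],
})
df["birth_date"] = pd.to_datetime(df["birth_date"])

# Completeness: share of records missing a required field.
missing_phone = df["phone"].isna().mean()
print(f"Missing phone numbers: {missing_phone:.0%}")

# Uniqueness: duplicated customer IDs.
dupes = df["customer_id"].duplicated().sum()
print(f"Duplicate customer IDs: {dupes}")

# Validity: birth dates must not lie in the future.
invalid = (df["birth_date"] > pd.Timestamp.today()).sum()
print(f"Future birth dates: {invalid}")
```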
04 Data Collection Methods

Data collection methods are techniques used to gather information from various sources for analysis, interpretation, and decision-making. The choice of method depends on the research objective, the nature of the data, and the resources available.

Methods covered in this section:
1. Surveys and Questionnaires
2. Interviews
3. Observation
4. Experiments
5. Focus Groups
6. Document and Content Analysis
7. Case Studies
8. Sensor and Instrument Data
9. Big Data Collection
10. Secondary Data Collection

Surveys and Questionnaires
Description: Surveys involve asking people questions through a structured form or questionnaire to gather data on their opinions, behaviors, or characteristics.
Types: Online Surveys (distributed via email or web platforms); Telephone Surveys (conducted over the phone); Face-to-Face Surveys (in-person interviews); Paper Surveys (distributed physically).
Pros: Can reach a large audience; standardized questions provide consistent data; cost-effective (especially online).
Cons: Response rates may vary; possible bias in responses; limited depth of responses.

Interviews
Description: Interviews involve direct, face-to-face or remote conversations with individuals to gather in-depth information on a specific topic.
Types: Structured Interviews (follow a strict script of questions); Semi-Structured Interviews (a general framework with flexibility in questioning); Unstructured Interviews (open-ended, with no predefined questions).
Pros: Provide deep, qualitative insights; flexible; can clarify and probe responses.
Cons: Time-consuming; require skilled interviewers; potential interviewer bias.

Observation
Description: Observation involves collecting data by watching and recording behaviors, events, or conditions as they occur in their natural setting.
Types: Participant Observation (the researcher actively engages in the environment being studied); Non-Participant Observation (the researcher observes without interaction); Structured Observation (predetermined criteria and behaviors are observed); Unstructured Observation (open-ended, observing whatever occurs in the environment).
Pros: Captures real-world behavior; provides context; unobtrusive methods can reduce bias.
Cons: Time-consuming; observer bias; difficult to replicate; may miss non-visible factors.

Experiments
Description: Experiments involve manipulating one or more variables to observe the effect on another variable, often used to establish cause-and-effect relationships.
Types: Laboratory Experiments (conducted in a controlled environment); Field Experiments (conducted in a natural environment); Quasi-Experiments (lack random assignment to groups but still manipulate variables).
Pros: Can establish causality; controlled conditions; replicable.
Cons: May lack external validity (generalizability); ethical concerns; sometimes expensive and complex.

Focus Groups
Description: Focus groups involve guided discussions with a small group of participants to explore their attitudes, feelings, and beliefs about a specific topic.
Pros: Generate rich, qualitative data; interactive; can uncover group dynamics and shared views.
Cons: May be influenced by dominant participants; not generalizable; require skilled moderators.

Document and Content Analysis
Description: This method involves analyzing existing documents, records, and content to extract relevant data.
Types: Textual Analysis (examining written documents like reports, books, articles, or emails); Content Analysis (analyzing media content such as videos, social media posts, and websites).
Pros: Useful for historical data; cost-effective; unobtrusive.
Cons: Limited to existing content; may require interpretation; potential for biased sources.

Case Studies
Description: Case studies involve an in-depth examination of a specific individual, group, event, or situation to explore and understand complex issues in real-world contexts.
Pros: Provide detailed, context-rich data; useful for exploring new or complex phenomena.
Cons: Not generalizable; time-consuming; may be subject to researcher bias.
Sensor and Instrument Data
Description: This method involves collecting data using physical instruments or sensors to measure and record variables like temperature, pressure, motion, or biochemical signals.
Types: Environmental Sensors (for monitoring conditions like weather, pollution, or water quality); Wearable Devices (for tracking health metrics like heart rate, steps, or sleep patterns).
Pros: Provides precise, real-time data; useful for continuous monitoring; minimal human error.
Cons: Expensive equipment; requires technical expertise; potential for data overload.

Big Data Collection
Description: Involves gathering vast amounts of data from various digital sources, often in real time, to analyze patterns, trends, and associations.
Types: Web Scraping (extracting data from websites; a minimal sketch appears after "Choosing the Right Method" below); Social Media Monitoring (collecting data from platforms like Twitter, Facebook, etc.); Sensor Networks (collecting data from IoT devices).
Pros: Captures large-scale data; useful for predictive analytics; can handle complex, unstructured data.
Cons: Requires advanced processing tools; privacy concerns; potential for noisy data.

Secondary Data Collection
Description: This method involves using existing data collected by others, such as government reports, company records, or previously conducted research.
Pros: Cost-effective; time-saving; useful for historical analysis; often large-scale.
Cons: May not perfectly align with research needs; limited control over data quality; may be outdated.

Choosing the Right Method
The selection of a data collection method depends on the research objectives, the nature of the data required, the resources available (time, budget, and personnel), and the desired level of accuracy and detail. In many cases, a combination of methods is used to gather comprehensive and reliable data.
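As an illustration of the web scraping type listed under Big Data Collection, here is a minimal sketch using the requests and BeautifulSoup libraries; the URL and CSS class are hypothetical placeholders, and any real scraping should respect the site's terms of service and robots.txt.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical product-listing page; replace with a real, permitted URL.
URL = "https://example.com/products"

response = requests.get(URL, timeout=10)
response.raise_for_status()  # fail loudly on HTTP errors

soup = BeautifulSoup(response.text, "html.parser")

# "product-name" is an assumed CSS class; inspect the actual page
# to find the right selector for the items you want.
names = [tag.get_text(strip=True) for tag in soup.select(".product-name")]
print(names)
```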
05 Data Ethics

Data ethics is the branch of ethics that evaluates the moral issues related to data collection, sharing, analysis, and use. As data becomes increasingly central to decision-making in both public and private sectors, understanding and applying ethical principles to data practices is crucial.

Key concepts and considerations in data ethics:
1. Privacy
2. Informed Consent
3. Transparency
4. Fairness
5. Accountability
6. Data Ownership
7. Data Minimization
8. Security
9. Purpose Limitation
10. Avoiding Harm
11. Ethical Use of AI and Automation
12. Human Dignity

Privacy
Description: Privacy concerns the right of individuals to control their personal information. Ethical data practices must respect individuals' privacy and ensure that personal data is not collected, used, or shared without consent.
Example: Companies should ask for explicit consent before collecting personal information, like email addresses or health data, and clearly explain how this data will be used.

Informed Consent
Description: Informed consent involves ensuring that individuals are fully aware of what data is being collected, why it is being collected, how it will be used, and who will have access to it before they agree to participate.
Example: A mobile app that collects location data should inform users what data is being collected, how it will be used (e.g., to provide personalized recommendations), and seek explicit permission before starting to collect the data.

Transparency
Description: Transparency involves being open and clear about data practices, including what data is collected, how it is processed, who has access to it, and for what purposes it is used. This helps build trust with data subjects and stakeholders.
Example: Organizations should publish clear privacy policies and data handling practices, and notify users of any changes to these practices.

Fairness
Description: Fairness in data ethics refers to ensuring that data collection, analysis, and use do not lead to discrimination, bias, or unjust outcomes. It involves treating all data subjects equitably and avoiding practices that could lead to harmful or unequal treatment.
Example: An algorithm used for job recruitment should be designed and tested to ensure it doesn't unfairly favor or disadvantage certain groups based on race, gender, or age.

Accountability
Description: Accountability involves ensuring that individuals and organizations responsible for data management and use are held accountable for their actions. It includes implementing mechanisms to audit data practices, address breaches, and rectify any harm caused.
Example: A company that experiences a data breach should take responsibility, notify affected individuals promptly, and take steps to prevent future breaches.

Data Ownership
Description: Data ownership refers to who owns the data and the rights that come with ownership. Ethical considerations involve respecting the ownership rights of data subjects and ensuring that their data is used in ways that align with their expectations and consent.
Example: A social media platform should not assume ownership of user-generated content and must respect users' rights to their data, including the right to delete it or transfer it to another service.

Data Minimization
Description: Data minimization is the principle of collecting only the data that is strictly necessary for a specific purpose. This reduces the risk of privacy breaches and minimizes the amount of data that needs to be protected.
Example: An online retailer should collect only the information necessary to process a transaction, such as payment details and shipping address, rather than unnecessary personal details like age or gender.

Security
Description: Data security involves protecting data from unauthorized access, breaches, and other threats. Ethical data practices require implementing robust security measures to safeguard personal and sensitive information.
Example: Encrypting sensitive data, using secure methods for data transmission, and regularly updating security protocols are key practices to ensure data security. (A minimal pseudonymization sketch appears at the end of this section.)

Purpose Limitation
Description: Purpose limitation means that data should only be used for the specific purpose for which it was collected, and not for unrelated purposes without additional consent from the data subject.
Example: If a company collects data to improve service quality, it should not use that data for targeted advertising unless users have explicitly agreed to this additional use.

Avoiding Harm
Description: Data practices should avoid causing harm to individuals or groups, whether through privacy breaches, misuse of data, or the perpetuation of bias and inequality.
Example: A health app should ensure that the data it collects is securely stored and not shared with third parties who might use it to discriminate against individuals (e.g., insurance companies denying coverage based on health data).

Ethical Use of AI and Automation
Description: With the rise of AI and machine learning, ensuring that these technologies are used ethically is crucial. This includes avoiding bias in algorithms, ensuring transparency in automated decision-making, and being accountable for the outcomes generated by AI systems.
Example: An AI system used in criminal justice should be transparent, explainable, and designed to avoid perpetuating existing biases against certain demographics.

Human Dignity
Description: Data practices should respect the inherent dignity of all individuals, recognizing that they are more than just data points. This includes treating data subjects with respect and ensuring that data practices do not undermine their autonomy or rights.
Example: A company collecting data from vulnerable populations, like children or the elderly, should take extra care to protect their dignity and interests.

Challenges in Data Ethics
- Surveillance: The collection of data through monitoring, tracking, and surveillance can infringe on privacy rights and lead to a loss of individual autonomy.
- Bias in Data and Algorithms: Data often reflects the biases present in society, and if not carefully managed, these biases can be perpetuated or even amplified by algorithms, leading to unfair treatment of certain groups.
- Data Monetization: The practice of monetizing personal data raises ethical concerns about consent, ownership, and the potential for exploitation.
- Data Breaches: The increasing frequency of data breaches has highlighted the importance of robust data security and the ethical responsibility organizations have to protect the data they collect.

Regulations and Guidelines
- General Data Protection Regulation (GDPR): A comprehensive data protection law in the European Union that sets strict rules on data privacy and security, including requirements for consent, data minimization, and the right to be forgotten.
- Ethical Guidelines for AI: Organizations like the European Commission and IEEE have developed guidelines for the ethical use of AI, emphasizing principles like transparency, accountability, and fairness.
- Data Privacy Act of 2012 (Republic Act No. 10173): The primary law in the Philippines governing the collection, processing, and storage of personal data. It aims to protect individual privacy rights while ensuring the free flow of information to promote innovation and growth.
- Cybercrime Prevention Act of 2012 (Republic Act No. 10175): Addresses crimes committed through electronic means, including offenses related to data privacy, such as illegal access, data interference, and cybersquatting.
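To make the security and data minimization principles concrete, here is a minimal sketch of pseudonymizing a personal identifier with a salted hash before analysis; this is one common technique, not a complete security program, and the salt handling shown is illustrative only.

```python
import hashlib
import secrets

# In practice the salt must be generated once, stored securely,
# and never published alongside the data.
SALT = secrets.token_bytes(16)

def pseudonymize(identifier: str) -> str:
    """Replace a personal identifier with a salted SHA-256 digest."""
    return hashlib.sha256(SALT + identifier.encode("utf-8")).hexdigest()

emails = ["ana@example.com", "ben@example.com"]
tokens = [pseudonymize(e) for e in emails]
print(tokens)  # stable tokens usable for grouping, without exposing emails
```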
06 Data Wrangling

Data wrangling, also known as data munging, is the process of cleaning, transforming, and organizing raw data into a more structured and usable format for analysis. Key steps include data cleaning, data transformation, data integration, and data formatting.

Data Cleaning
Description: Identifying and correcting errors, inconsistencies, and inaccuracies in the data. This step involves handling missing values, outliers, and duplicates.
Tasks:
- Handling Missing Values: filling in missing data using methods like mean imputation, removing rows with missing data, or using interpolation techniques.
- Removing Duplicates: identifying and removing duplicate records that can skew analysis results.
- Correcting Errors: fixing data entry errors, such as incorrect spelling, formatting issues, or incorrect values.

Data Transformation
Description: Modifying data into the desired format or structure. This may involve normalization, standardization, or other transformations to make the data suitable for analysis.
Tasks:
- Normalization: rescaling numeric data to a common scale without distorting differences in ranges.
- Standardization: transforming data so that it has a mean of zero and a standard deviation of one.
- Encoding Categorical Variables: converting categorical data (e.g., "Yes" or "No") into a numerical format (e.g., 1 or 0) so that it can be used in machine learning algorithms.
- Binning: transforming continuous numerical variables into discrete categorical bins for grouped analysis.
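A minimal sketch of these cleaning and transformation tasks in pandas, on an invented sales table; min-max normalization and z-score standardization are written out by hand so the formulas stay visible.

```python
import pandas as pd

# Invented raw records with a duplicate row and a missing value.
df = pd.DataFrame({
    "region": ["North", "South", "South", "North"],
    "member": ["Yes", "No", "No", "No"],
    "sales":  [120.0, 95.0, 95.0, None],
})

# --- Cleaning ---
df["sales"] = df["sales"].fillna(df["sales"].mean())  # mean imputation
df = df.drop_duplicates()                             # remove duplicate records

# --- Transformation ---
s = df["sales"]

# Normalization: rescale to the [0, 1] range.
df["sales_norm"] = (s - s.min()) / (s.max() - s.min())

# Standardization: mean 0, standard deviation 1.
df["sales_std"] = (s - s.mean()) / s.std()

# Encoding: map a Yes/No column to 1/0.
df["member_flag"] = df["member"].map({"Yes": 1, "No": 0})

# Binning: cut the continuous variable into labeled bins.
df["sales_bin"] = pd.cut(df["sales"], bins=3, labels=["low", "mid", "high"])

print(df)
```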
Data Integration
Description: Combining data from different sources or datasets into a cohesive whole. This can involve merging, joining, or concatenating datasets.
Tasks:
- Merging: combining datasets with a common identifier or key (e.g., merging sales data with customer data using customer ID).
- Joining: combining datasets based on a shared column, often used in relational databases.
- Concatenating: stacking datasets on top of each other when they share the same structure or columns.

Data Formatting
Description: Structuring the data in a way that is compatible with the tools or software used for analysis. This may involve converting data types, reordering columns, or renaming variables.
Tasks:
- Data Type Conversion: converting data types (e.g., from text to numeric) to ensure compatibility with analysis tools.
- Column Reordering: arranging columns in a specific order to facilitate analysis or visualization.
- Renaming Variables: assigning meaningful names to variables to improve clarity and understanding.
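A minimal sketch of integration and formatting in pandas, using invented sales and customer tables.

```python
import pandas as pd

sales      = pd.DataFrame({"cust_id": [1, 2], "amount": ["150", "80"]})
customers  = pd.DataFrame({"cust_id": [1, 2], "name": ["Ana", "Ben"]})
more_sales = pd.DataFrame({"cust_id": [3], "amount": ["210"]})

# Integration: concatenate same-structure tables, then merge on the shared key.
combined = pd.concat([sales, more_sales], ignore_index=True)
merged = combined.merge(customers, on="cust_id", how="left")
# cust_id 3 has no customer record, so its name is NaN (left-join behavior).

# Formatting: convert text to numeric, rename for clarity, reorder columns.
merged["amount"] = pd.to_numeric(merged["amount"])
merged = merged.rename(columns={"cust_id": "customer_id", "amount": "sales_php"})
merged = merged[["customer_id", "name", "sales_php"]]

print(merged)
```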
07 Data Visualization

Data visualization is the graphical representation of data and information using visual elements like charts, graphs, maps, and diagrams. The goal of data visualization is to make complex data more accessible, understandable, and actionable by presenting it in a visual format that can quickly convey insights and patterns.

Types of Data Visualization
- Bar Charts: used to compare quantities across different categories. Example: comparing the sales figures of different products in a store.
- Line Graphs: show trends over time or continuous data. Example: tracking stock prices over several months.
- Pie Charts: represent parts of a whole, showing the proportion of different categories. Example: visualizing the market share of different smartphone brands.
- Histograms: display the distribution of a single variable, showing the frequency of data within certain ranges. Example: showing the distribution of ages among a group of people.
- Scatter Plots: illustrate the relationship between two variables, often used to identify correlations. Example: plotting hours studied against exam scores to see if there is a positive correlation.
- Heatmaps: use color to represent data values in a matrix, often used to show intensity or frequency. Example: visualizing the frequency of website visits across different times of the day.
- Box Plots: show the distribution of a dataset by highlighting the median, quartiles, and potential outliers. Example: comparing the salary distribution across different departments in a company.
- Geospatial Maps: display data geographically, often used for location-based data. Example: mapping the incidence of diseases in different regions.
- Tree Maps: represent hierarchical data using nested rectangles, with the size and color of each rectangle representing different attributes. Example: visualizing the market share of companies within an industry.

[Figure: tree map example; image not included in the transcript]
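A minimal matplotlib sketch of two of the chart types above, with invented figures.

```python
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3.5))

# Bar chart: compare quantities across categories.
products = ["A", "B", "C"]
sales = [120, 95, 140]          # invented sales figures
ax1.bar(products, sales)
ax1.set_title("Sales by product")
ax1.set_ylabel("Units sold")

# Line graph: show a trend over time.
months = ["Jan", "Feb", "Mar", "Apr"]
price = [101.2, 104.8, 99.5, 107.3]  # invented stock prices
ax2.plot(months, price, marker="o")
ax2.set_title("Stock price over time")
ax2.set_ylabel("Price")

plt.tight_layout()
plt.show()
```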
Thanks! Do you have any questions?
[email protected]

CREDITS: This presentation template was created by Slidesgo and includes icons by Flaticon and infographics & images by Freepik.