
IT INST3 – Data Science Analytics
Chapter II – Data Collection and Management
Subject Teacher: Edward B. Panganiban, Ph.D.
Republic of the Philippines
Isabela State University, Echague, Isabela
College of Computing Studies, Information and Communication Technology

Chapter II – Data Collection and Management

Data collection and management is the process of gathering and organizing data so that it can be used to answer questions, make decisions, or solve problems. It is an essential part of many research projects, business operations, and government initiatives.

The data collection process typically involves the following steps:
1. Define the research question or problem. What do you want to learn from the data?
2. Identify the data sources. Where will you get the data?
3. Choose the data collection methods. How will you collect the data?
4. Collect the data. This may involve conducting surveys, interviews, or experiments.
5. Clean and prepare the data. This involves removing errors and inconsistencies from the data.
6. Store and manage the data. This involves organizing the data so that it can be easily accessed and analyzed.
7. Analyze the data. This involves using statistical methods to draw insights from the data.
8. Communicate the results. This involves sharing the findings of the data analysis with others.

Data management is the ongoing process of organizing, storing, and protecting data. It is important to ensure that the data is accessible to authorized users and protected from unauthorized access, use, or disclosure. Data collection and management is a complex and challenging process, but it is essential for organizations to make informed decisions and achieve their goals.

Here are some of the benefits of data collection and management:
- Improved decision-making: Data can help organizations make better decisions by providing insights into their operations, customers, and markets.
- Increased efficiency: Data can help organizations identify areas where they can improve efficiency, such as by reducing costs or increasing productivity.
- Enhanced innovation: Data can help organizations identify new opportunities for innovation by providing insights into customer needs and trends.
- Improved compliance: Data can help organizations comply with regulations by providing a record of their activities.
- Enhanced security: Data can help organizations protect themselves from security threats by providing a way to track and monitor access to sensitive data.

There are many different methods of data collection, each with its own advantages and disadvantages. Some of the most common methods include:
- Surveys: Surveys are a popular way to collect data from a large number of people. They can be conducted online, by mail, or in person.
- Interviews: Interviews are a more in-depth way to collect data from a smaller number of people. They can be conducted face-to-face, over the phone, or by video chat.
- Experiments: Experiments are used to test cause-and-effect relationships. They involve manipulating one variable and observing the effects on another variable.
- Observational studies: Observational studies are used to collect data without interfering with the subjects being studied. They can be conducted in a natural setting or in a laboratory.

The best data collection method for a particular project will depend on the research question, the budget, and the time constraints.
Data management is an ongoing process that involves organizing, storing, and protecting data. There are many different data management tools and techniques available, and the best choice will depend on the specific needs of the organization. Data collection and management is a critical part of many research projects, business operations, and government initiatives. By following the steps outlined above, organizations can ensure that they are collecting and managing data in a way that is efficient, effective, and compliant.

Data Source

A data source is the location from which data originates. It can be internal or external to an organization.

Internal data sources include:
- Transactional data: This is data about the day-to-day operations of an organization, such as sales, orders, and inventory.
- Customer data: This is data about customers, such as demographics, purchase history, and contact information.
- Employee data: This is data about employees, such as compensation, performance reviews, and training history.
- Financial data: This is data about the financial performance of an organization, such as revenue, expenses, and profits.

External data sources include:
- Government data: This is data collected by governments, such as census data and economic data.
- Industry data: This is data collected by industry associations, such as market research data and pricing data.
- Academic data: This is data collected by universities and research institutions, such as scientific data and medical data.
- Social media data: This is data collected from social media platforms, such as user posts, comments, and likes.

The choice of data source will depend on the specific needs of the organization. For example, if an organization is trying to understand its customer behavior, it might use customer data from its CRM system. If an organization is trying to forecast demand, it might use economic data from a government agency.

When choosing a data source, it is important to consider the following factors:
- Relevance: The data should be relevant to the research question or problem that the organization is trying to solve.
- Accuracy: The data should be accurate and reliable.
- Timeliness: The data should be up-to-date.
- Accessibility: The data should be easy to access and use.
- Cost: The cost of obtaining the data should be reasonable.

Data sources can be classified into two main categories: primary and secondary. Primary data is data that is collected for the first time. It is typically collected through surveys, interviews, experiments, or observations. Secondary data is data that has already been collected and is available for reuse. It can be found in books, journals, government reports, and online databases. Primary data is often more accurate and reliable than secondary data, but it can be more expensive and time-consuming to collect. Secondary data is less expensive and time-consuming to collect, but it may not be as accurate or reliable as primary data. The best data source for a particular project will depend on the research question, the budget, and the time constraints.

Data Collection and APIs

APIs, or application programming interfaces, are a way for software applications to communicate with each other. They provide a set of rules and instructions that allow applications to request and receive data from each other.
APIs can be used to collect data from a variety of sources, including:
- Web services: Web services are websites that provide data through APIs. For example, the Google Maps API allows you to get information about geographic locations, such as their latitude and longitude.
- Databases: APIs can be used to access data from databases. For example, the MySQL API allows you to query MySQL databases.
- Sensor devices: APIs can be used to collect data from sensor devices, such as temperature sensors or GPS sensors.
- Social media platforms: APIs can be used to collect data from social media platforms, such as Twitter or Facebook.

To collect data using an API, you will need to:
1. Find the API that you want to use. There are many APIs available, so you will need to do some research to find one that meets your needs.
2. Get an API key. Most APIs require you to obtain an API key before you can use them. This key is used to authenticate your requests and to prevent unauthorized access to the data.
3. Understand the API documentation. The API documentation will tell you how to use the API to request and receive data.
4. Make API requests. Once you understand the API documentation, you can start making API requests to collect data.

Data collection using APIs can be a quick and easy way to get the data that you need. However, it is important to be aware of the limitations of APIs. For example, some APIs may only allow you to access a limited amount of data, or they may charge a fee for access.

Here are some of the benefits of using APIs to collect data:
- Speed: APIs can be used to collect data quickly and easily.
- Efficiency: APIs can automate the data collection process, which can save time and resources.
- Flexibility: APIs can be used to collect data from a variety of sources.
- Scalability: APIs can be scaled to meet the needs of large data sets.

Here are some of the challenges of using APIs to collect data:
- Cost: Some APIs may charge a fee for access.
- Security: APIs can be a security risk if they are not properly secured.
- Complexity: APIs can be complex to use, especially if you are not familiar with them.

Overall, APIs can be a valuable tool for data collection. However, it is important to weigh the benefits and challenges before deciding whether or not to use them.
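For illustration only, the request-and-receive pattern described above can be sketched in a few lines of Python using the requests library. The endpoint URL, API key, and parameter names below are hypothetical placeholders rather than a specific real service; always follow the documentation of the API you actually use.

    import requests

    # Hypothetical REST endpoint and API key -- replace with the real
    # service's base URL and the credentials issued to you.
    BASE_URL = "https://api.example.com/v1/measurements"
    API_KEY = "your-api-key-here"

    def fetch_page(page):
        """Request one page of records and return the parsed JSON payload."""
        response = requests.get(
            BASE_URL,
            params={"page": page, "per_page": 100},
            headers={"Authorization": f"Bearer {API_KEY}"},
            timeout=30,
        )
        response.raise_for_status()  # fail loudly on quota or authentication errors
        return response.json()

    if __name__ == "__main__":
        records = []
        for page in range(1, 4):  # respect the provider's rate limits
            records.extend(fetch_page(page))
        print(f"Collected {len(records)} records")

Most providers also enforce rate limits and quotas, so real collection scripts usually add retry logic and pauses between requests.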
Exploring and Fixing Data

Data exploration is the process of understanding the data by summarizing its main characteristics and identifying patterns and outliers. It is an important first step in data analysis, as it helps to ensure that the data is clean and ready for further analysis. There are many different methods of data exploration, but some of the most common include:
- Data profiling: This involves summarizing the main characteristics of the data, such as the number of records, the number of variables, the data types, and the distribution of the values.
- Data visualization: This involves using charts and graphs to visualize the data, which can help to identify patterns and outliers.
- Statistical analysis: This involves using statistical tests to identify relationships between variables.

Data fixing is the process of identifying and correcting errors in the data. It is an important step in data preparation, as it ensures that the data is accurate and reliable. There are many different methods of data fixing, but some of the most common include:
- Data cleaning: This involves removing errors from the data, such as typos, missing values, and inconsistent values.
- Data imputation: This involves filling in missing values with estimates.
- Data transformation: This involves converting the data into a different format, such as converting categorical data into numerical data.

Data exploration and data fixing are essential steps in data analysis. By understanding the data and fixing any errors, you can ensure that your analysis is accurate and reliable.

Here are some of the things to look for when exploring and fixing data:
- Missing values: Are there any missing values in the data? If so, how many? And are they missing randomly or systematically?
- Outliers: Are there any outliers in the data? Outliers are data points that are significantly different from the rest of the data. They can be caused by errors or by legitimate variation in the data.
- Duplicate values: Are there any duplicate values in the data? Duplicates can occur when data is entered incorrectly or when two different records refer to the same entity.
- Inconsistent values: Are there any inconsistent values in the data? Inconsistent values are data points that have different values for the same variable. They can occur when data is entered incorrectly or when two different records refer to the same entity.
- Incorrect data types: Are there any data points that are stored in the wrong data type? For example, a date value might be stored as a string.
- Corrupt data: Is there any corrupt data in the file? Corrupt data is data that is damaged or unreadable.

Once you have identified any problems with the data, you can take steps to fix them. For example, you can remove records with missing values, impute the missing values, or convert data types. By taking the time to explore and fix the data, you can ensure that your analysis is accurate and reliable.
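As a concrete illustration of these checks, the following Python sketch uses the pandas library to profile a dataset and apply the basic fixes described above. The file name and column names ("survey.csv", "score", "date") are hypothetical placeholders to adapt to your own data.

    import pandas as pd

    # Illustrative only: "survey.csv" is a placeholder file name.
    df = pd.read_csv("survey.csv")

    # Profile the data: shape, types, and value distributions.
    print(df.shape)
    print(df.dtypes)
    print(df.describe(include="all"))

    # Look for common problems.
    print(df.isna().sum())         # missing values per column
    print(df.duplicated().sum())   # exact duplicate rows

    # Simple fixes: drop duplicates, impute a numeric column with its mean,
    # and coerce a column stored as text into a proper date type.
    df = df.drop_duplicates()
    df["score"] = df["score"].fillna(df["score"].mean())
    df["date"] = pd.to_datetime(df["date"], errors="coerce", dayfirst=True)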
Data storage management: What is it and why is it important?

Effective data storage management is more important than ever, as security and regulatory compliance have become even more challenging and complex over time. Enterprise data volumes continue to grow exponentially. So how can organizations effectively store it all? That's where data storage management comes in. Effective management is key to ensuring organizations use storage resources effectively, and that they store data securely in compliance with company policies and government regulations. IT administrators and managers must understand what procedures and tools encompass data storage management to develop their own strategy.

Organizations must keep in mind how storage management has changed in recent years. The COVID-19 pandemic increased remote work, the use of cloud services and cybersecurity concerns such as ransomware. Even before the pandemic, all those elements saw major surges, and after the pandemic they will remain prominent.

With this guide, explore what data storage management is, who needs it, advantages and challenges, key storage management software features, security and compliance concerns, implementation tips, and vendors and products.

What data storage management is, who needs it and how to implement it

Storage management ensures data is available to users when they need it. Data storage management is typically part of the storage administrator's job. Organizations without a dedicated storage administrator might use an IT generalist for storage management.

The data retention policy is a key element of storage management and a good starting point for implementation. This policy defines the data an organization retains for operational or compliance needs. It describes why the organization must keep the data, the retention period and the process of disposal. It helps an organization determine how it can search and access data. The retention policy is especially important now as data volumes continually increase, and it can help cut storage space and costs.

The task of data storage management also includes resource provisioning and configuration, unstructured and structured data, and evaluating how needs might change over time. To help with implementation, a management tool that meets organizational needs can ease the administrative burden that comes with large amounts of data. Features to look for in a management tool include storage capacity planning, performance monitoring, compression and deduplication.

Advantages and challenges of data storage management

Data storage management has both advantages and challenges. On the plus side, it improves performance and protects against data loss. With effective management, storage systems perform well across geographic areas, time and users. It also ensures that data is safe from outside threats, human error and system failures. Proper backup and disaster recovery are pieces of this data protection strategy.

An effective management strategy provides users with the right amount of storage capacity. Organizations can scale storage space up and down as needed. The storage strategy accommodates constantly changing needs and applications. Storage management also makes administration easier by centralizing it, so admins can oversee a variety of storage systems. These benefits lead to reduced costs as well, as admins are able to better utilize storage resources. Benefits of data storage management include more efficient operations and optimized resource utilization.

Challenges of data storage management include persistent cyberthreats, data management regulations and a distributed workforce. These challenges illustrate why it's so important to implement a comprehensive plan: a storage management strategy should ensure organizations protect their data against data breaches, ransomware and other malware attacks; lack of compliance could lead to hefty fines; and remote workers must know they'll have access to files and applications just as they would in a traditional office environment.

Distributed and complex systems present a hurdle for data storage management. Not only are workers spread out, but systems run both on premises and in the cloud. An on-premises storage environment could include HDDs, SSDs and tapes. Organizations often use multiple clouds. New technologies, such as AI, can benefit organizations but also increase complexity. Unstructured data -- which includes documents, emails, photos, videos and metadata -- has surged, and this also complicates storage management. Unstructured data challenges include volume, new types and how to gain value.
Although some organizations might not want to spend the time to manage unstructured data, in the end it saves money and storage space. Vendors such as Aparavi, Dell EMC, Pure Storage and Spectra Logic offer tools for this type of management. Object storage can provide high performance but also has challenges, including the infrastructure's scale-out nature and potentially high latency. Organizations must address issues with metadata performance and cluster management.

Data storage management strategies

Storage management processes and practices vary, depending on the technology, platform and type. Here are some general methods and services for data storage management:
- storage resource management software
- consolidation of systems
- multiprotocol storage arrays
- storage tiers
- strategic SSD deployment
- hybrid cloud
- scale-out systems
- archive storage of infrequently accessed data
- elimination of inactive virtual machines
- deduplication
- disaster recovery as a service
- object storage

Organizations may consider incorporating standards-based storage management interfaces as part of their management strategy. The Storage Management Initiative Specification and the Intelligent Platform Management Interface are two veteran models, while Redfish and Swordfish have emerged as newer options. Interfaces offer management, monitoring and simplification.

As far as media type, it's tempting to go all-flash because of its performance. However, to save money, try a hybrid drive option that incorporates high-capacity HDD and high-speed SSD technology. Organizations also must choose among object, block and file storage. Block storage is the default type for HDDs and SSDs, and it provides strong performance. File storage places files in folders and offers simplicity. Object storage efficiently organizes unstructured data at a comparatively low cost. NAS is another worthwhile option for storing unstructured data because of its organizational capabilities and speed. Understand how object, block and file storage compare.

Storage security

With threats both internal and external, storage security is as important as ever to a management strategy. Storage security ensures protection and availability by enabling data accessibility for authorized users and protecting against unauthorized access.

A storage security strategy should have tiers. Security risks are so varied, from ransomware to insider threats, that organizations must protect their data storage in a number of ways. Proper permissions, monitoring and encryption are key to cyberthreat defense. Offline storage -- for example, tape backup -- that isn't connected to a network is a strong way to keep data safe. If attackers can't reach the data, they can't harm it. While it's not feasible to keep all data offline, this type of storage is an important aspect of a strong storage security strategy.

Another aspect is off-site storage, one form of which is cloud storage. Organizations shouldn't assume that this keeps their data entirely safe. Users are responsible for their data, and cloud storage is still online and thus open to some risk.
The surge in remote workers produced a new level of storage security complications, including the following risks:
- less secure home office environments
- use of personal devices for work
- misuse of services and applications
- less formal work habits
- adjustments to working from home
- more opportunities for malicious insiders

Endpoint security, encryption, access controls and user training help protect against these new storage security issues.

Data storage compliance

Compliance with regulations has always been important, but the need has increased in the last few years with laws such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act. These laws specifically address data and storage, so it's incumbent on organizations to comprehend them and ensure compliance. GDPR can spur enterprises into adopting practices that deliver long-term competitive advantages.

Data storage management helps organizations understand where they have data, which is a major piece of compliance. Compliance best practices include documentation, automation, anonymization and use of governance tools. Immutable data storage also helps achieve compliance. Immutability ensures retained data -- for example, legal holds -- doesn't change. Vendors such as AWS, Dell EMC and Wasabi provide immutable storage. However, organizations should still retain more than one copy of this data, as immutability doesn't protect against physical threats, such as natural disasters.

Data storage technology, vendors and products

Key features for overall data storage management providers include resource provisioning, process automation, load balancing, capacity planning and management, predictive analytics, performance monitoring, replication, compression, deduplication, snapshotting and cloning.

Recent trends among vendors include services for cloud storage and the container management platform Kubernetes. Top storage providers can support a range of different platforms. And though Kubernetes is more specialized, it has gained traction: vendors such as Diamanti, NetApp and Pure Storage provide Kubernetes services.

Some form of cloud management is essentially table stakes for storage vendors. A few vendors, including Cohesity and Rubrik, have made cloud data management a hallmark of their platforms. Many organizations use more than one cloud, so multi-cloud data management is crucial. Managing data storage across multiple clouds is complex, but vendors such as Ctera, Dell EMC, NetApp and Nutanix can help. Cloud management components include automation and orchestration; security; governance and compliance; performance monitoring; and cost management.

The future of data storage management

Data storage administrators must be ready for a consistently evolving field. Cloud storage was trending up before the pandemic and has skyrocketed since -- and once organizations go to the cloud, they typically stay there. As a result, admins must understand the various forms of cloud storage management, including multi-cloud, hybrid cloud, cloud-native data and cloud data protection. Hyper-convergence, composable infrastructure and computational storage are also popular frameworks.
In addition, admins must be aware of other new and emerging technologies that help storage management, from automation to machine learning.

Lesson Overview
1. Importance of Data Collection and Preparation
2. Basic Data Quality Assessment
3. Ethical Considerations in Data Collection
4. Importing Data from Various Sources
5. Data Cleaning and Preprocessing in Excel
6. Handling Missing Data and Outliers

Lecture Outline
Each topic is covered with a theoretical lecture followed by practical laboratory activities. The activities focus on applying the concepts in Microsoft Excel, allowing students to work with datasets related to their fields (Mathematics or Social Science).

1. Importance of Data Collection and Preparation
Theoretical Lecture
Objective: Understand why data collection and preparation are crucial in data science.
Key Points:
o Foundation of Data Science: Data collection is the first step in data analysis; the quality of collected data determines the reliability of results.
o Minimizing Errors: Proper preparation helps identify and correct errors early in the analysis process.
o Efficiency: Well-organized data saves time during the analysis stage.
o Consistency: Consistent data allows for accurate comparisons across different datasets or studies.
Laboratory Activities
Task: Create a Data Collection Plan using Excel.
Steps:
1. Define Data Requirements: Identify the type of data needed for analysis based on the student's major (e.g., test scores for Mathematics or survey responses for Social Science).
2. Design a Data Collection Template in Excel: Set up a spreadsheet with columns for data sources, collection methods, data types, and validation rules.
3. Document the Collection Process: Write a step-by-step plan for collecting the data, including tools and procedures to be used.

2. Basic Data Quality Assessment
Theoretical Lecture
Objective: Learn to assess data quality using dimensions such as completeness, accuracy, and consistency.
Key Points:
o Completeness: Ensures all necessary data is present.
o Accuracy: Data must be correct and free from errors.
o Consistency: Data should follow the same format and standards.
o Validity: Data should represent what it is supposed to measure.
o Uniqueness: Avoids duplicate records that can skew analysis results.
Laboratory Activities
Task: Perform a Data Quality Assessment using Excel.
Steps:
1. Import Data into Excel: Students import a dataset they have collected.
2. Check for Completeness:
▪ Use COUNTA and IF formulas to identify missing data.
3. Validate Data Accuracy:
▪ Use Data Validation to set rules for data entry (e.g., numerical ranges, text length).
4. Ensure Consistency:
▪ Use Find and Replace to standardize text entries (e.g., "Male" and "Female" vs. "M" and "F").
5. Check for Duplicates:
▪ Use the Remove Duplicates feature to identify and remove duplicates.
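The lab above is carried out in Excel. For students who also work in Python, the same quality dimensions can be sketched with pandas; the file name and column names below ("collected_data.csv", "gender", "score") are hypothetical placeholders, not part of the lab dataset.

    import pandas as pd

    # Hypothetical dataset -- replace with the file collected in the lab.
    df = pd.read_csv("collected_data.csv")

    # Completeness: count missing cells per column.
    print(df.isna().sum())

    # Uniqueness: count exact duplicate rows.
    print("duplicate rows:", df.duplicated().sum())

    # Consistency: standardize a text column to a single coding
    # (values not in the mapping become blank and can then be reviewed).
    df["gender"] = (
        df["gender"].str.strip().str.upper()
        .map({"MALE": "M", "FEMALE": "F", "M": "M", "F": "F"})
    )

    # Accuracy / validity: flag values outside an allowed range (e.g., scores 0-100).
    invalid = df[(df["score"] < 0) | (df["score"] > 100)]
    print("out-of-range scores:", len(invalid))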
3. Ethical Considerations in Data Collection
Theoretical Lecture
Objective: Understand ethical implications when collecting and handling data.
Key Points:
o Informed Consent: Participants must be informed about the data collection process.
o Data Privacy and Security: Protecting sensitive information is critical.
o Anonymity and Confidentiality: Ensure that participants' identities remain confidential.
o Avoiding Bias: Collect data in a manner that avoids bias and misrepresentation.
Laboratory Activities
Task: Develop an Ethical Data Collection Plan.
Steps:
1. Define the Purpose of the Data Collection: Document the objectives and intended use of the data.
2. Create a Consent Form Template in Excel: Include sections for study information, participant rights, and signature fields.
3. Design Data Storage and Access Control Measures: Plan how to secure data in Excel (e.g., using password protection and hidden sheets).

4. Importing Data from Various Sources
Theoretical Lecture
Objective: Learn how to import data from different formats into Excel.
Key Points:
o Data Formats: Understanding common formats such as CSV, JSON, and Excel workbooks.
o Data Import Techniques: Using Excel's Get Data feature for importing data from various sources.
o Combining Data: Methods for combining multiple datasets into one using Excel tools like Power Query.
Laboratory Activities
Task: Import Data from Various Sources.
Steps:
1. Import from CSV Files:
▪ Go to Data > Get Data > From Text/CSV and select a file.
2. Import from Web Pages:
▪ Use Get Data > From Web to import data directly from a URL.
3. Combine Multiple Datasets:
▪ Use Power Query to merge or append datasets from different files.

5. Data Cleaning and Preprocessing in Excel
Theoretical Lecture
Objective: Learn essential data cleaning and preprocessing techniques.
Key Points:
o Removing Duplicates: Ensures unique data points.
o Handling Missing Data: Using techniques like mean imputation or data interpolation.
o Standardizing Data Formats: Ensuring consistency in data types (e.g., date formats).
o Data Validation: Setting up rules to prevent incorrect data entry.
Laboratory Activities
Task: Clean and Preprocess Data in Excel.
Steps:
1. Remove Duplicates:
▪ Select the data range and go to Data > Remove Duplicates.
2. Handle Missing Data:
▪ Use IF and ISBLANK functions to fill in missing values.
3. Standardize Data Formats:
▪ Use TEXT functions and Data Validation to ensure consistent formatting.
4. Create Data Validation Rules:
▪ Go to Data > Data Validation to set up entry restrictions (e.g., only allowing numbers between 0 and 100).

6. Handling Missing Data and Outliers
Theoretical Lecture
Objective: Learn methods to handle missing data and outliers.
Key Points:
o Handling Missing Data: Options include deletion, mean/mode imputation, and interpolation.
o Identifying Outliers: Use visual tools like box plots and scatter plots.
o Dealing with Outliers: Deciding whether to remove, cap, or transform outliers based on their impact on the analysis.
Laboratory Activities
Task: Handle Missing Data and Outliers in Excel.
Steps:
1. Identify Missing Data:
▪ Use Conditional Formatting to highlight missing values.
2. Impute Missing Data:
▪ Apply the AVERAGEIF function to replace missing values.
3. Identify Outliers:
▪ Use Excel's box plot feature to visualize and detect outliers.
4. Transform or Remove Outliers:
▪ Use IF statements to cap outliers or remove them based on a defined threshold.
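The missing-data options named in the key points above (deletion, mean imputation, interpolation) can also be illustrated outside Excel. The short Python (pandas) sketch below uses made-up values and is only meant to show the three options side by side.

    import pandas as pd

    # Illustrative values only -- a numeric column with two gaps.
    scores = pd.Series([85, 78, None, 92, 80, None, 72, 89], dtype="float64")

    # Option 1: deletion -- drop the rows with missing values.
    dropped = scores.dropna()

    # Option 2: mean imputation (median or mode work the same way).
    mean_filled = scores.fillna(scores.mean())

    # Option 3: interpolation -- estimate each gap from its neighbours.
    interpolated = scores.interpolate()

    print(mean_filled.tolist())
    print(interpolated.tolist())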
Summary and Assessment
1. Review Key Concepts: Summarize the importance of data collection, quality assessment, ethical considerations, importing data, cleaning, and handling missing data and outliers.
2. Quiz: Conduct a short quiz using Excel to test understanding.
3. Assignment: Each student collects a dataset, performs a quality assessment, cleans the data, handles missing data and outliers, and submits a report detailing the steps and insights.

Lab Activity 1: Importing and Assessing Data Quality
Dataset: Sales Data (CSV)
Description: This dataset includes sales information with some inconsistencies and missing values. It helps students practice importing data and performing basic data quality assessments.
Example Data:
OrderID   Date         Product    Quantity   Price
1001      15/01/2024   Widget A   10         20.5
1002      16/01/2024   Widget B   25
1003      17/01/2024   Widget C   5          15
1004      18/01/2024              10
1005      19/01/2024   Widget A   8          21
1006      20/01/2024   Widget B   12         24
1007                   Widget C   7          16
Lab Activity Instructions:
- Import the CSV file into Excel.
- Assess the quality of the data by calculating summary statistics (mean, median, mode) for numerical columns.
- Use conditional formatting to highlight missing values and inconsistencies.

Lab Activity 2: Data Cleaning and Preprocessing
Dataset: Customer Feedback (Excel)
Description: This dataset includes customer feedback with various issues such as extra spaces, inconsistencies in text, and varying formats. It is used for data cleaning and preprocessing.
Example Data:
CustomerID   Date         Feedback             Rating
1            15/01/2024   Great service!       5
2            16/01/2024   Poor service         2
3            17/01/2024   Average experience   3
4            18/01/2024   Excellent service    4
5            19/01/2024   Great service!       5
6            20/01/2024   Poor service
7            21/01/2024   Excellent service    4
Lab Activity Instructions:
- Open the Excel file and clean the data using functions like TRIM(), CLEAN(), and SUBSTITUTE().
- Standardize the feedback text (e.g., remove extra spaces, correct inconsistencies).
- Normalize the rating scale if necessary.

Lab Activity 3: Handling Missing Data and Outliers
Dataset: Employee Performance (CSV)
Description: This dataset includes employee performance metrics with some missing values and potential outliers. Students will practice handling missing data and identifying outliers.
Example Data:
EmployeeID   Age   Department   PerformanceScore
E001         29    Sales        85
E002         34    HR           78
E003         28    Sales
E004         45    IT           92
E005         31    HR           80
E006         29    Sales        115
E007               IT           88
E008         40    HR           72
E009         35    IT           89
E010         50    Sales        105
Lab Activity Instructions:
- Import the CSV file into Excel.
- Identify missing data and decide on appropriate methods for imputation (mean, median, etc.).
- Use a boxplot to visualize potential outliers in the PerformanceScore column.
- Handle outliers based on the chosen method (e.g., transformation or removal).
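For Lab Activity 3, the box-plot logic can also be reproduced programmatically as a cross-check of the Excel chart. The sketch below applies the common 1.5 × IQR fence to the PerformanceScore column; "employee_performance.csv" is a placeholder file name for the example data above.

    import pandas as pd

    # Placeholder file name for the Lab Activity 3 dataset.
    perf = pd.read_csv("employee_performance.csv")
    scores = perf["PerformanceScore"]

    # Impute the missing score with the column mean before checking for outliers.
    scores = scores.fillna(scores.mean())

    # 1.5 * IQR fences -- the same rule a box plot uses for its whiskers.
    q1, q3 = scores.quantile(0.25), scores.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    outliers = perf[(scores < lower) | (scores > upper)]
    print(outliers[["EmployeeID", "PerformanceScore"]])

    # One option for handling them: cap (winsorize) the values at the fences.
    perf["PerformanceScore_capped"] = scores.clip(lower=lower, upper=upper)

Whether a flagged value is removed, capped, or kept should still be a judgment call based on whether it reflects an error or legitimate variation, as discussed in the lecture.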
