Lecture 2 - By Japhet Moise H.
Data Collection and Acquisition

Description of key terms
1. Data: Data is a collection of information gathered through observations, measurements, research or analysis. It may consist of facts, numbers, names, figures or even descriptions of things, and it can be organized in the form of graphs, charts or tables. Put simply, data refers to raw facts.
2. Information: The meaning assigned to data within some context for the use of that data.
3. Dataset: A collection of data taken from a single source or intended for a single project.
4. Data warehouse: A system that stores and analyzes data drawn from multiple sources.
5. Big data: Large and diverse datasets that are huge in volume and also grow rapidly in size over time.

Identification of sources of data
1. IoT sensors
2. Cameras
3. Computers
4. Smartphones
5. Social data
6. Transactional data

Description of the 6 V's of Big Data
The 6 V's of Big Data are a framework used to characterize the key challenges and opportunities associated with large-scale datasets.
1. Volume: The sheer amount of data generated. Big data sets are typically massive in size, often exceeding terabytes or even petabytes.
2. Velocity: The speed at which data is generated and processed. Big data often arrives at a rapid pace, requiring real-time or near-real-time analysis.
3. Variety: The diversity of data types. Big data can include structured data (like databases), semi-structured data (like XML or JSON), and unstructured data (like text, images, and videos).
4. Veracity: The quality and accuracy of the data. Big data sets often contain errors, inconsistencies, or biases that need to be addressed before analysis.
5. Value: The potential benefits that can be derived from analyzing the data. Big data can provide valuable insights into business operations, customer behavior, and market trends.
6. Variability: The dynamic nature of data flow within large datasets.

Description of types of data
· Structured data: Organized in a predefined format (e.g., databases, spreadsheets).
· Unstructured data: Not organized in a predefined format (e.g., text, images, audio).
· Semi-structured data: Partially structured (e.g., XML, JSON).

Gathering Machine Learning Datasets
∙ Web scraping: Extracting data from websites using automated tools.
∙ APIs: Interacting with APIs to retrieve data from online services (a short Python sketch follows the visualization tools list below).
∙ Surveys and questionnaires: Gathering data directly from individuals.
∙ Sensor data: Collecting data from physical sensors.
∙ Data purchases: Acquiring data from commercial data providers.

Data visualization tools
1. Tableau: A powerful and user-friendly tool for creating interactive dashboards and visualizations.
2. Power BI: Microsoft's business intelligence tool with strong integration with Office products.
3. Qlik Sense: Offers associative exploration, enabling users to discover relationships between data points.
4. Plotly: A Python library for creating interactive plots and graphs.
5. Matplotlib: A Python library for creating static plots and graphs.
6. Seaborn: A Python library built on top of Matplotlib, offering a higher-level interface for creating attractive statistical visualizations (see the plotting sketch below).
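
Of the dataset-gathering methods listed earlier, APIs are the easiest to illustrate in a few lines of Python. The sketch below is a minimal example only: the endpoint (api.example.com), its parameters, and the field layout are hypothetical placeholders, not a real service; the general pattern is to request JSON records and save them locally for later cleaning and analysis.

# Minimal sketch of gathering data through a web API.
# The URL, parameters, and record fields are hypothetical placeholders.
import csv
import requests

API_URL = "https://api.example.com/v1/measurements"   # assumed endpoint, not a real service

def fetch_measurements(limit=100):
    """Request JSON records from the (assumed) API and return them as a list of dicts."""
    response = requests.get(API_URL, params={"limit": limit}, timeout=10)
    response.raise_for_status()            # fail loudly on HTTP errors
    return response.json()                 # assumes the service returns a JSON array of records

def save_as_csv(records, path="measurements.csv"):
    """Keep a local copy of the raw records so later cleaning steps can start from a file."""
    if not records:
        return
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=records[0].keys())
        writer.writeheader()
        writer.writerows(records)

if __name__ == "__main__":
    data = fetch_measurements(limit=50)
    save_as_csv(data)
    print(f"Collected {len(data)} records")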
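
Staying with the Python libraries from the visualization tools list, the sketch below plots a small made-up table once with Matplotlib and once with Seaborn. The monthly sales numbers are invented purely for illustration.

# Minimal sketch of Matplotlib and Seaborn on a small made-up dataset.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical monthly sales figures, purely for demonstration.
df = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "sales": [120, 135, 150, 110, 170, 165],
})

# Matplotlib: a plain static line chart.
plt.figure(figsize=(6, 3))
plt.plot(df["month"], df["sales"], marker="o")
plt.title("Monthly sales (Matplotlib)")
plt.xlabel("Month")
plt.ylabel("Sales")
plt.tight_layout()
plt.savefig("sales_matplotlib.png")
plt.close()

# Seaborn: the same data as a bar chart, using Seaborn's default styling.
sns.set_theme()
plt.figure(figsize=(6, 3))
ax = sns.barplot(data=df, x="month", y="sales")
ax.set_title("Monthly sales (Seaborn)")
plt.tight_layout()
plt.savefig("sales_seaborn.png")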
Description of characteristics of quality data
1. Accuracy: The degree to which information correctly reflects an event, location, person, or other entity.
2. Completeness: Data is considered "complete" when it fulfills expectations of comprehensiveness.
3. Consistency: At many companies, the same information may be stored in more than one place. If that information matches, it is considered "consistent."
4. Timeliness: Is the information available right when it is needed? That data quality dimension is called "timeliness."
5. Validity: Information that conforms to a specific format or follows business rules. To meet this data quality dimension, you must confirm that all of your information follows the required format or rules.
6. Uniqueness: "Unique" information means that there is only one instance of it appearing in a database.
7. Relevance: The extent to which data is useful and meaningful for a specific purpose.

The importance of data cleaning
1. Accuracy and Reliability: Ensures the data you work with is correct and dependable, which is crucial for making informed decisions.
2. Better Decision-Making: Clean data leads to more accurate insights, helping organizations make better strategic choices.
3. Efficiency: Streamlines the data analysis process by removing irrelevant or redundant information, making datasets easier to manage.
4. Enhanced Data Quality: Maintains high data quality, making the data more useful and valuable for analysis.
5. Compliance and Risk Management: Helps organizations comply with regulations and manage risks by ensuring data is handled properly.
6. Cost Savings: Prevents errors and reduces costs associated with correcting mistakes and dealing with poor data quality.

Data cleaning techniques
1. Removing Duplicates: Identifying and eliminating duplicate records to prevent redundancy and ensure each entry is unique.
2. Handling Missing Values: Addressing missing data by either filling in the gaps with appropriate values (imputation) or removing incomplete records, depending on the context.
3. Standardizing Data: Ensuring consistency in data formats, such as dates, addresses, and names, to make the data uniform and easier to analyze.
4. Correcting Errors: Identifying and fixing errors in the data, such as typos, incorrect values, or inconsistencies.
5. Validating Data: Checking data against predefined rules or criteria to ensure it meets the required standards and is within acceptable ranges.
6. Filtering Outliers: Identifying and handling outliers that may skew the analysis, either by removing them or adjusting their values.
7. Normalization: Transforming data onto a common scale without distorting differences in the ranges of values, which is particularly useful for numerical data.
8. Data Enrichment: Enhancing the dataset by adding relevant information from external sources to provide more context and improve analysis.
A short pandas sketch combining several of these techniques follows the closing note.

Thank you!!!!
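
As a closing illustration of the data cleaning techniques above, here is a minimal pandas sketch. The table, column names, and thresholds (for example, treating ages outside 0-120 as invalid) are assumptions made up for this example, not rules from the lecture.

# Minimal sketch of common cleaning steps on a small made-up table.
import pandas as pd

# Hypothetical raw records: one duplicate row, one missing age, one implausible age.
raw = pd.DataFrame({
    "name":   ["Alice", "Bob", "Bob", "Cara", "Dan"],
    "age":    [34, 29, 29, None, 210],
    "joined": ["2021-01-05", "2021-02-05", "2021-02-05", "2021-03-10", "2021-04-01"],
})

df = raw.copy()

# 1. Removing duplicates.
df = df.drop_duplicates()

# 2. Handling missing values: impute the missing age with the median.
df["age"] = df["age"].fillna(df["age"].median())

# 3. Standardizing data: convert date strings into proper datetime values.
df["joined"] = pd.to_datetime(df["joined"])

# 4./5. Correcting errors and validating: keep only ages in a plausible range (assumed rule).
df = df[df["age"].between(0, 120)]

# 6. Filtering outliers: drop rows more than 3 standard deviations from the mean age.
zscores = (df["age"] - df["age"].mean()) / df["age"].std()
df = df[zscores.abs() <= 3]

# 7. Normalization: min-max scale age into the range [0, 1].
df["age_scaled"] = (df["age"] - df["age"].min()) / (df["age"].max() - df["age"].min())

print(df)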
