UNIT 5: Data Literacy – Data Collection to Data Analysis

Title: Data Literacy – Data Collection to Data Analysis
Approach: Team discussion, web search, case studies

Summary: This unit will introduce students to the basics of data literacy, focusing on data collection and its sources, levels of measurement, statistical analysis of data, matrices, and data preprocessing. Students will learn how to collect different types of data, how to store data effectively, and how to visualise it.

Learning Objectives:
1. To understand the importance of data literacy in AI.
2. To explore various data collection methods and their applications.
3. To analyse data using basic statistical analysis techniques.
4. To identify matrices and their role in representing data such as images.
5. To understand the preparation of data to suit the models.

Key Concepts:
1. What is Data Literacy
2. Data Collection
3. Exploring Data
4. Statistical Analysis of Data
5. Representation of Data; Python Programs for Statistical Analysis and Data Visualization
6. Knowledge of Matrices
7. Data Pre-processing
8. Data in Modelling and Evaluation

Learning Outcomes: Students will be able to
1. Explain the importance of data literacy in AI.
2. Identify different data collection methods and their applications.
3. Apply basic data analysis techniques to analyse data.
4. Visualize data using different techniques.

Pre-requisites: Basic computer skills and basic maths skills

1. WHAT IS DATA LITERACY?

Data can be defined as a representation of facts or instructions about some entity (students, school, sports, business, animals, etc.) that can be processed or communicated by humans or machines. It is widely known that Artificial Intelligence (AI) is essentially data-driven: AI involves converting large amounts of raw data into actionable information that carries practical value and is usable.

Data literacy means being able to find and use data effectively. This includes skills like collecting data, organizing it, checking its quality, analysing it, understanding the results, and using it ethically. Data may be structured, semi-structured, or unstructured. It should be collected, organized, and analysed properly to know whether the input for AI models is valid and appropriate. AI data analysis involves using AI techniques and data science to improve the processes of cleaning, inspecting, and modelling both structured and unstructured data. The primary objective is to extract valuable information that can support decision-making and drawing conclusions.

2. DATA COLLECTION

Data collection allows you to capture a record of past events so that data analysis can find recurring patterns. From those patterns, you build predictive models using machine learning algorithms that look for trends and predict future changes. Data collection means pooling data by scraping, capturing, and loading it from multiple sources, including offline and online sources.

Collecting or creating high volumes of data can be the hardest part of a machine learning project, especially at scale. How much data you need depends on how many features there are in the dataset. It is recommended to collect as much data as possible for good predictions; you can begin with small batches of data and see how the model performs. The most important thing to consider during data collection is diversity. Diverse data will help your model cover more scenarios, so when deciding how much data you need, you should cover all the scenarios in which the model will be used.
The quantity of data also depends on the complexity of your model. If the task is as simple as licence plate detection, you can expect predictions with small batches of data; but if you are working on higher-stakes applications of Artificial Intelligence like medical AI, you need to consider huge volumes of data.

Before collecting the data, data scientists must understand the problem, its preferable solution, and the data requirements. Based on these requirements, sources of data are identified and data is collected. Data is the main ingredient of any project, and it is required throughout the project's development; hence the process of identifying data requirements, collecting data, and analysing it is done iteratively.

There are mainly two sources of data collection: primary and secondary.

Primary sources are sources created specifically to collect data for analysis. Some examples are given below.

Survey: Gathering data from a population through interviews, questionnaires, or online forms. Useful for measuring opinions, behaviours, and demographics. Example: a researcher uses a questionnaire to understand consumer preferences for a new product.

Interview: Direct communication with individuals or groups to gather information. It can be structured, semi-structured, or unstructured. Example: an organization conducts an online survey to collect employee feedback about job satisfaction.

Observation: Watching and recording behaviours or events as they occur. Often used in ethnographic research or when direct interaction is not possible. Example: observing children's play patterns in a schoolyard to understand social dynamics.

Experiment: Manipulating variables to observe their effects on outcomes. Used to establish cause-and-effect relationships. Example: testing the effectiveness of different advertising campaigns on a group of people.

Marketing campaign (using data): Utilizing customer data to predict behaviour and optimize campaign performance. Example: a company personalizes email marketing campaigns based on past customer purchases.

Questionnaire: A specific tool used within surveys, a list of questions designed to gather data from respondents. It can collect quantitative (numerical) or qualitative (descriptive) information. Example: a questionnaire might ask respondents to rate their satisfaction on a scale of 1 to 5 and also provide open-ended feedback.

Secondary data sources are sources where the data is already stored and ready for use. Data given in books, journals, newspapers, websites, internal transactional databases, etc. can be reused for data analysis. Some methods of collecting secondary data are:

Social media data tracking: Collecting data from social media platforms, such as user posts, comments, and interactions. Example: analysing social media sentiment to understand audience reception of a new product launch.

Web scraping: Using automated tools to extract specific content and data from websites. Example: scraping product information and prices from e-commerce websites for price comparison. (A small sketch follows this list.)

Satellite data tracking: Gathering information about the Earth's surface and atmosphere using satellites. Example: monitoring weather patterns and environmental changes using satellite imagery.

Online data platforms: Websites offering pre-compiled datasets for various purposes. Examples: Kaggle, GitHub, etc.
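To make web scraping concrete, here is a minimal sketch using the third-party requests and beautifulsoup4 packages, which are not introduced in this unit; the URL and the tag/class names are hypothetical placeholders, not taken from the text.

```python
# A minimal web-scraping sketch, assuming requests and beautifulsoup4
# are installed (pip install requests beautifulsoup4).
# The URL and the class names below are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/products"   # placeholder URL
response = requests.get(url, timeout=10)
response.raise_for_status()            # stop if the request failed

soup = BeautifulSoup(response.text, "html.parser")

# Extract every product name and price, assuming the page marks them
# with the classes "product-name" and "product-price".
for item in soup.find_all("div", class_="product"):
    name = item.find("span", class_="product-name").get_text(strip=True)
    price = item.find("span", class_="product-price").get_text(strip=True)
    print(name, price)
```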
3. EXPLORING DATA

Exploring data is about "getting to know" the data and its values: whether they are typical or unusual, spread out or extreme. More importantly, during exploration one gets an opportunity to identify and correct any problems in the data that would affect the conclusions drawn during analysis.

Levels of Measurement

The way a set of data is measured is called its level of measurement. Not all data can be treated equally, so it makes sense to classify datasets based on different criteria. Some data are quantitative and some qualitative; some datasets are continuous and some are discrete. Qualitative data can be nominal or ordinal, and quantitative data can be split into two groups: interval and ratio.

[Figure: classification of data into quantitative (discrete, continuous) and qualitative (nominal, ordinal). Source: https://slideplayer.com/slide/8137745/]

1. Nominal Level

Nominal variables are categories, such as car brands (Mercedes, BMW, Audi) or the four seasons (winter, spring, summer, autumn). They are not numbers: they cannot be used in calculations and have no order or rank. The nominal level of measurement is the simplest or lowest of the four ways to characterize data; nominal means "in name only". Eye colour, yes/no responses to a survey, gender, and smartphone companies all sit at the nominal level of measurement. Even some things with numbers associated with them, such as the number on the back of a cricketer's T-shirt, are nominal, since the numbers are used as "names" for individual players on the field and not for any calculation.

2. Ordinal Level

Ordinal data is made up of groups and categories which follow a strict order. For example, suppose you are asked to rate a meal at a restaurant and the options are: unpalatable, unappetizing, just okay, tasty, and delicious. Although the restaurant has used words rather than numbers, it is clear that these preferences are ordered from negative to positive (low to high); thus the data is qualitative and ordinal. However, the differences between the values cannot be measured, and like nominal data, ordinal data cannot be used in calculations.

Consider a hotel-industry survey where responses about the hotels are recorded as "excellent", "good", "satisfactory", and "unsatisfactory". These responses are ordered from the most desired to the least desired, but the difference between two responses cannot be measured. Another common example is a grading system where letters are used to grade a service or good: you can order grades so that an A is higher than a B, but without any other information there is no way of knowing how much better an A is than a B.

3. Interval Level

Data measured using the interval scale is similar to ordinal data in that it has a definite ordering, but the differences between values can be measured. What the interval scale lacks is a true starting point, a meaningful zero. Temperature scales like Celsius (°C) and Fahrenheit (°F) use the interval scale. In both temperature scales, 40° is equal to 100° minus 60°: differences make sense. But 0° does not, because in both scales 0 is not the absolute lowest temperature; temperatures like −20 °F and −30 °C exist and are colder than 0. Interval-level data can be used in calculations, but ratio comparisons cannot be made: 80 °C is not four times as hot as 20 °C (nor is 80 °F four times as hot as 20 °F). The ratio of 80 to 20 (four to one) has no meaning.
4. Ratio Scale Level

Ratio scale data is like interval scale data, but it has a true zero point, so ratios can be calculated. For example, the scores of four multiple-choice statistics final exam questions were recorded as 80, 68, 20 and 92 (out of a maximum of 100 marks). The grades are computer generated. The data can be put in order from lowest to highest: 20, 68, 80, 92. The differences between the data have meaning: the score 92 is more than the score 68 by 24 points. Ratios can also be calculated, because the smallest possible score is 0: 80 is four times 20, so the score of 80 is four times better than the score of 20. We can therefore add, subtract, multiply and divide ratio-level variables. Another example is the weight of a person: it has a real zero point (zero weight means the person has no weight), and weights can be added, subtracted, multiplied and divided for comparisons.

Activity-1: Student Health Survey – fill in the responses and mention the appropriate level of measurement.

Activity-2: Indicate whether the variable is ordinal or not. If it is not ordinal, write the variable type.
❖ Opinion about a new law (favour or oppose) _____________________________
❖ Letter grade in an English class (A, B, C, etc.) _____________________________
❖ Student rating of teacher on a scale of 1 – 10 _____________________________

4. STATISTICAL ANALYSIS OF DATA

Measures of Central Tendency

Statistics is the science of data: a collection of mathematical techniques that helps to extract information from data. From the AI perspective, statistics transforms observations into information that you can understand and share. Statistics usually deals with large datasets, and central tendency is used for understanding and analysing the data. "Central tendency" is the summary of a dataset in a single value that represents the entire distribution of the data domain (or dataset).

We can perform statistical analysis using the Python programming language. For that we have to import the statistics library into the program. Some important functions which we will use in the programs in this module are:

mean() → returns the mean of the data
median() → returns the median of the data
mode() → returns the mode of the data
variance() → returns the variance of the data
stdev() → returns the standard deviation of the data

Mean

In statistics, the mean (more technically the arithmetic mean or sample mean) can be estimated from a sample of examples drawn from the domain. It is the quotient obtained by dividing the total of the values of a variable by the total number of observations or items:

M = ∑fx / n

where M = mean, ∑fx = sum total of the scores, f = frequency of the distribution, x = scores, and n = total number of cases.

Example-1: For the set S = {5, 10, 15, 20, 30},
Mean of S = (5 + 10 + 15 + 20 + 30) / 5 = 80 / 5 = 16

Example-2: Calculate the mean of the following grouped data.

Class     Frequency
2 – 4         3
4 – 6         4
6 – 8         2
8 – 10        1

Using the class midpoints x = 3, 5, 7, 9:
M = ∑fx / n = (3×3 + 4×5 + 2×7 + 1×9) / (3 + 4 + 2 + 1) = (9 + 20 + 14 + 9) / 10 = 52 / 10 = 5.2

Program-1: There are 25 students in a class. Their heights are given below. Write a Python program to find the mean.
heights → 145, 151, 152, 149, 147, 152, 151, 149, 152, 151, 147, 148, 155, 147, 152, 151, 149, 145, 147, 152, 146, 148, 150, 152, 151
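The original listing for Program-1 appears in the source only as a screenshot; a minimal sketch using the built-in statistics module described above:

```python
# Program-1 (sketch): mean of the 25 students' heights
# using Python's built-in statistics module.
import statistics

heights = [145, 151, 152, 149, 147, 152, 151, 149, 152, 151,
           147, 148, 155, 147, 152, 151, 149, 145, 147, 152,
           146, 148, 150, 152, 151]

print("Mean height:", statistics.mean(heights))  # Mean height: 149.56
```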
Median

The median is another measure of central tendency. It is the positional value of the variable which divides the group into two equal parts: one part comprising all values greater than the median, and the other all values smaller than the median.

Example-3: The following series shows marks in mathematics of students learning AI:
17, 32, 35, 15, 21, 41, 32, 11, 10, 20, 27, 28, 30
We arrange this data in ascending order:
10, 11, 15, 17, 20, 21, 27, 28, 30, 32, 32, 35, 41
As 27 is position-wise in the middle of this data, Median = 27.

Program-2: There are 25 students in a class. Their heights are given below. Write a Python program to find the median.
heights → 145, 151, 152, 149, 147, 152, 151, 149, 152, 151, 147, 148, 155, 147, 152, 151, 149, 145, 147, 152, 146, 148, 150, 152, 151

Mode

The mode is another important measure of central tendency of a statistical series. It is the value which occurs most frequently in the data series; it corresponds to the highest bar in a bar chart or histogram.

Example-4: Ages of 18 students of a class:
22, 24, 17, 18, 17, 19, 18, 21, 20, 21, 20, 23, 22, 22, 22, 22, 21, 24
We arrange this series in ascending order:
17, 17, 18, 18, 19, 20, 20, 21, 21, 21, 22, 22, 22, 22, 22, 23, 24, 24
An inspection of the series shows that 22 occurs most frequently, hence Mode = 22.

Program-3: Write a program to find the mode (using the same heights data as above).

In summary, when do we use mean, median and mode?

Mean: The mean is a good measure of central tendency when a data set contains values that are relatively evenly spread, with no exceptionally high or low values.

Median: The median is a good measure of the central value when the data include exceptionally high or low values. The median is also the most suitable measure of average for data classified on an ordinal scale.

Mode: The mode is used when you need to find the peak of a distribution, and there may be more than one peak. For example, it is important to print more copies of the most popular books, because printing different books in equal numbers would cause a shortage of some books and an oversupply of others.

Variance and Standard Deviation

Measures of central tendency (mean, median and mode) provide the central value of a data set. Variance and standard deviation are measures of dispersion (like quartiles, percentiles and ranges): they provide information on the spread of the data around the centre. Let us understand these two using an example.

Measure the height (at the shoulder) of 5 dogs, in millimetres. Their heights are 600 mm, 470 mm, 170 mm, 430 mm and 300 mm. First calculate the mean:

Mean = (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394 mm

Now find the deviation of each dog's height from the mean height, square the deviations, and find their average. This average is the variance:

Variance = [ (206)² + (76)² + (−224)² + (36)² + (−94)² ] / 5 = 108520 / 5 = 21704

The standard deviation is the square root of the variance:

Standard deviation = √21704 ≈ 147.32

In general, for n values x with mean μ:
Variance σ² = ∑(x − μ)² / n, and Standard deviation σ = √σ²

Some important facts about variance and standard deviation: a small variance indicates that the data points tend to be very close to the mean, and to each other, while a high variance indicates that the data points are very spread out from the mean, and from one another. Likewise, a low standard deviation indicates that the data points tend to be very close to the mean, and a high standard deviation indicates that the data points are spread out over a large range of values.

Program-4: Write a program to find the variance and standard deviation (using the same heights data as above).
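As with Program-1, the listings for Programs 2–4 are screenshots in the source; a combined sketch using the same statistics module:

```python
# Programs 2-4 (sketch): median, mode, variance and standard
# deviation of the same 25 heights, using the statistics module.
import statistics

heights = [145, 151, 152, 149, 147, 152, 151, 149, 152, 151,
           147, 148, 155, 147, 152, 151, 149, 145, 147, 152,
           146, 148, 150, 152, 151]

print("Median:", statistics.median(heights))              # Median: 150
print("Mode:", statistics.mode(heights))                  # Mode: 152
print("Variance:", statistics.variance(heights))          # sample variance
print("Standard deviation:", statistics.stdev(heights))   # sample std. dev.
```

Note that statistics.variance() and statistics.stdev() compute the sample variance and standard deviation (dividing by n − 1); the population versions used in the dog example above, which divide by n, are statistics.pvariance() and statistics.pstdev().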
5. REPRESENTATION OF DATA

According to Wikipedia, "Statistics is the discipline that concerns the collection, organization, analysis, interpretation and presentation of data." To achieve this, statisticians summarize a large amount of data in a format that is compact and produces meaningful information. Without displaying values for each observation (from populations), it is possible to represent the data in brief, while keeping its meaning intact, using certain techniques called data representation. It can also be defined as a technique for presenting large volumes of data in a manner that enables the user to interpret the important data with minimum effort and time.

Data representation techniques are broadly classified in two ways:

Non-graphical technique: Tabular form and case form. This is the older format of data representation, not suitable for large datasets; non-graphical techniques are also not very suitable when our objective is to make decisions after analysing a set of data.

Graphical technique: The visual display of statistical data in the form of points, lines, bars and other geometrical forms. For a complex and large quantity of data, the human brain is more comfortable dealing with a visual format; graphical or pictorial representation of data using graphs, charts, etc. is known as data visualization. It is not possible to discuss the construction of all types of diagrams and maps here, primarily due to time constraints. We will therefore describe the most commonly used graphs and the way they are drawn:

Line graphs
Bar diagrams
Pie diagrams
Scatter plots
Histograms

Data visualization is possible in Python using the library Matplotlib. It is a comprehensive library that can be used to create a wide variety of plots, including line plots, bar charts, histograms, scatter plots, and more. Matplotlib is also highly customizable, allowing users to control the appearance of their plots in great detail. pyplot is a submodule of Matplotlib that provides a MATLAB-like interface to the library, along with a number of convenience functions that make it easy to create simple plots.

Installing Matplotlib:
pip install matplotlib
or
python -m pip install -U matplotlib

In the program we have to import the library:
import matplotlib.pyplot

Some of the common functions and their descriptions are given below:

title() – adds a title to the chart/graph
xlabel() – sets the label for the X-axis
ylabel() – sets the label for the Y-axis
xlim() – sets the value limits for the X-axis
ylim() – sets the value limits for the Y-axis
xticks() – sets the tick marks on the X-axis
yticks() – sets the tick marks on the Y-axis
show() – displays the graph on the screen
savefig("address") – saves the graph at the address specified as the argument
figure(figsize=(width, height)) – determines the size of the plot in which the graph is drawn; the size is supplied as a tuple to the figsize attribute

[The source lists the available marker styles and graph colour codes as images; refer to the Matplotlib documentation for the full lists.]
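A minimal sketch showing how the functions above fit together; the data values here are made up purely for illustration:

```python
# Minimal Matplotlib sketch: plot made-up values and decorate the axes.
import matplotlib.pyplot as plt

values = [3, 7, 5, 9]                 # illustrative data only

plt.figure(figsize=(6, 4))            # size of the plot in inches
plt.plot(values, color='g', marker='o')
plt.title("A minimal plot")
plt.xlabel("Index")
plt.ylabel("Value")
plt.show()                            # display the graph on screen
```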
1. Line Graph

A line graph is a powerful tool used to represent continuous data along a numbered axis. It allows us to visualize trends and changes in data points over time, and is suitable for data that can take on any value within a specific range. The line can slope upwards, indicating an increase, or downwards, signifying a decrease, reflecting the changes in the data over time.

Example-5: Kavya's AI marks for 5 consecutive tests are given below. Draw a line graph to analyse her performance.

Test-1   Test-2   Test-3   Test-4   Test-5
  25       34       49       40       48

Activity-3: Construct a simple line graph to represent the rainfall data of Kerala as shown in the table below.

Month          JAN  FEB  MAR  APR  MAY  JUN   JUL   AUG   SEP   OCT   NOV   DEC
Rainfall (cm)  7.5  6.3  3.5  1.8  1.2  25.8  19.7  20.3  15.9  22.4  18.6  11.2

A line chart is plotted in Python using the function plot(). The colour of the line can be set by giving a colour code inside the plot function. Attributes used inside the plot() function are:

linewidth – sets the width of the line
linestyle – determines the style of the line (solid, dashed, dotted, dashdot)
marker, markersize, markeredgecolor – determine the marker's shape, size and edge colour respectively

Program-5: Write a program to draw a line chart (use Example-5).

Program-6: Write a program to draw a line chart to visualize the comparative rainfall data for 12 months in Tamil Nadu using the CSV file "rainfall.csv".

(Sketches for this section's programs follow the Histogram discussion below; the output screenshots in the source are not reproduced here.)

2. Bar Graph

A bar chart or bar graph presents categorical data with rectangular bars whose heights or lengths are proportional to the values they represent. It is a good way to show relative sizes: the relative sizes of the bars allow easy comparison between different categories.

Example-6: Create a bar graph to illustrate the distribution of students from various schools who attended a seminar on "Deep Learning". The total number of students from each school is provided below.

Oxford Public School: 123
Delhi Public School: 87
Jyothis Central School: 105
Sanskriti School: 146
Bombay Public School: 34

A bar chart is plotted in Python using the function bar(). Attributes used inside the bar() function are:

color – determines the colour of the bars
edgecolor – determines the colour of the bar edges
width – determines the width of the bars

Program-7: Write a program to draw a bar chart to visualize the comparative rainfall data for 12 months in Tamil Nadu using the CSV file "rainfall.csv".

3. Histogram

Histograms are graphical representations of data distribution, with vertical rectangles depicting the frequencies of different value ranges. They are drawn on a natural scale, making it easy to interpret the central tendency, such as the mode, of the data. Despite their simplicity and ease of understanding, histograms have a limitation: they can only represent one data distribution per axis.

Example-7: Given a dataset containing the heights of girls in class XII, construct a histogram to visualize the distribution of heights.
141, 145, 142, 147, 144, 148, 141, 142, 149, 144, 143, 149, 146, 141, 147, 142, 143

Solution: To draw a histogram from this, we first need to organize the data into intervals, also called logical ranges or bins. After computing the number of girls in each interval, draw the graph. A histogram is plotted in Python using the function hist().
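Sketches for the plotting programs above. The Example-5 and Example-7 data come from the text; the CSV-based programs assume a hypothetical rainfall.csv with "Month" and "Rainfall" columns, since the file itself is not part of the transcript:

```python
# Sketches for Programs 5-7 and Example-7. The rainfall.csv layout
# (columns "Month" and "Rainfall") is an assumption, not given in the text.
import csv
import matplotlib.pyplot as plt

# Program-5 (sketch): line chart of Kavya's marks from Example-5.
tests = ["Test-1", "Test-2", "Test-3", "Test-4", "Test-5"]
marks = [25, 34, 49, 40, 48]
plt.plot(tests, marks, color='b', marker='o')
plt.title("Kavya's AI marks")
plt.xlabel("Test")
plt.ylabel("Marks")
plt.show()

# Programs 6 and 7 (sketch): read the monthly rainfall from rainfall.csv.
months, rainfall = [], []
with open("rainfall.csv") as f:
    for row in csv.DictReader(f):
        months.append(row["Month"])
        rainfall.append(float(row["Rainfall"]))

plt.plot(months, rainfall, color='g')                  # Program-6: line chart
plt.title("Monthly rainfall in Tamil Nadu")
plt.xlabel("Month")
plt.ylabel("Rainfall (cm)")
plt.show()

plt.bar(months, rainfall, color='c', edgecolor='k')    # Program-7: bar chart
plt.title("Monthly rainfall in Tamil Nadu")
plt.xlabel("Month")
plt.ylabel("Rainfall (cm)")
plt.show()

# Example-7 (sketch): histogram of the girls' heights with 3 cm bins.
heights = [141, 145, 142, 147, 144, 148, 141, 142, 149,
           144, 143, 149, 146, 141, 147, 142, 143]
plt.hist(heights, bins=[141, 144, 147, 150], edgecolor='k')
plt.title("Heights of girls in class XII")
plt.xlabel("Height (cm)")
plt.ylabel("Number of girls")
plt.show()
```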
4. Scatter Graph

Scatter plots visually represent relationships between two variables by plotting data points along the x and y axes. They reveal correlations, whether positive or negative, within paired data, showcasing trends and patterns. Essentially, scatter plots illustrate connections between variables through ordered pairs, making them useful for analysing paired numerical data and situations where the dependent variable varies across different values of the independent variable. Their strength lies in their ability to clearly depict trends, clusters, and relationships within datasets.

Example-8: A student had a hypothesis for a science project: the more students studied Math, the better their Math scores would be. He took a poll in which he asked students the average number of hours they studied per week during a given semester, then found the overall percentage they received in their Math classes. (His data table is shown as an image in the source and is not reproduced here.) To understand this data, he decided to make a scatter plot. The independent variable, or input data, is the study time, because the hypothesis is that the Math grade depends on the study time; the Math grade is therefore the dependent variable, or output data. The input data is plotted on the x-axis and the output data on the y-axis.

A scatter plot is plotted using the function scatter().

Program-8: Write a program to draw a scatter chart to visualize the comparative rainfall data for 12 months in Tamil Nadu using the CSV file "rainfall.csv". (A sketch follows the Pie Chart section below.)

5. Pie Chart

A pie chart is a circular graph divided into segments or sections, each representing a relative proportion or percentage of the total; each segment resembles a slice of pie, hence the name. Pie charts are commonly used to visualize data from a small table, but it is recommended to limit the number of categories to seven to maintain clarity. Zero values cannot be depicted in a pie chart. While useful for illustrating compositions or comparing parts of a whole, pie charts can be difficult to interpret and to compare with data from other charts, and they are not suitable for showing changes over time.

Pie charts find applications in various domains such as business, education, and personal finance. In business, they can indicate the success or failure of products or services; in education, they can depict time allocations for different subjects; at home, they can help visualize monthly expenses relative to income.

Example-9: Below is a pie chart drawn with the periods allotted to each subject in a week.

Subject           Periods Allotted
English                  6
Maths                    8
Science                  8
Social Science           7
AI                       3
PE                       2

A pie chart is plotted using the function pie().

Program-9: Write a program to draw a pie chart to visualize the comparative rainfall data for 12 months in Tamil Nadu using the CSV file "rainfall.csv".
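Sketches for Programs 8 and 9, under the same assumption about the rainfall.csv layout; the Example-9 periods data from the text is also plotted:

```python
# Programs 8-9 (sketch): scatter and pie charts. The rainfall.csv
# layout (columns "Month" and "Rainfall") is an assumption.
import csv
import matplotlib.pyplot as plt

months, rainfall = [], []
with open("rainfall.csv") as f:
    for row in csv.DictReader(f):
        months.append(row["Month"])
        rainfall.append(float(row["Rainfall"]))

# Program-8: scatter chart of monthly rainfall.
plt.scatter(months, rainfall, color='r', marker='*')
plt.title("Monthly rainfall in Tamil Nadu")
plt.xlabel("Month")
plt.ylabel("Rainfall (cm)")
plt.show()

# Program-9: pie chart of each month's share of the rainfall.
plt.pie(rainfall, labels=months, autopct="%1.1f%%")
plt.title("Share of annual rainfall by month")
plt.show()

# Example-9: pie chart of weekly periods per subject.
subjects = ["English", "Maths", "Science", "Social Science", "AI", "PE"]
periods = [6, 8, 8, 7, 3, 2]
plt.pie(periods, labels=subjects, autopct="%1.0f%%")
plt.title("Periods allotted per subject in a week")
plt.show()
```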
6. INTRODUCTION TO MATRICES

The knowledge of matrices is necessary in all branches of mathematics; the matrix is one of the most powerful tools in mathematics. In mathematics, a matrix (plural: matrices) is a rectangular arrangement of numbers, arranged in tabular form as rows and columns. Matrices play a huge role in the computer vision domain of AI: on a computer, an image is represented as a combination of pixels, and this is represented mathematically as matrices!

Let us understand with the help of an example. Consider:
Aditi bought 25 pencils and 5 erasers
Adit bought 10 pencils and 2 erasers
Manu bought 5 pencils and 1 eraser

The above information can be arranged in tabular form as follows:

         Pencils   Erasers
Aditi       25        5
Adit        10        2
Manu         5        1

and represented as the matrix

[25   5]
[10   2]
[ 5   1]

where the entries in each row represent the number of pencils and erasers bought by Aditi, Adit and Manu respectively. Alternatively,

[25  10   5]
[ 5   2   1]

where the entries in each column represent the number of pencils and erasers bought by Aditi, Adit and Manu respectively.

We denote matrices by capital letters, for example

A = [ 5   15]
    [−7   √2]
    [12    0]

Order of a matrix

A matrix with m rows and n columns is called a matrix of order m × n, or simply an m × n matrix (read as "m by n matrix"). So the matrix A in the above example is a 3 × 2 matrix, and its number of elements is m × n = 3 × 2 = 6. Each individual element is written aij, where i represents the row and j the column; in general, aij is the element lying in the ith row and jth column, also called the (i, j)th element of the matrix:

P = [a11  a12]
    [a21  a22]
    [a31  a32]

Operations on Matrices

1. Addition of matrices – The sum of two matrices is obtained by adding the corresponding elements of the given matrices. The two matrices have to be of the same order. Example:

A = [3   2]     B = [6  3]
    [4  −1]         [5  9]
    [2   0]         [3  2]

A + B = [3+6   2+3 ]   = [9  5]
        [4+5  −1+9 ]     [9  8]
        [2+3   0+2 ]     [5  2]

2. Difference of matrices – The difference A − B is defined as a matrix where each element is obtained by subtracting the corresponding elements (aij − bij). Matrices A and B must be of the same order. Example:

A = [−2   1]    B = [−1  3]
    [ 6  10]        [ 2  9]
    [ 5   3]        [ 3  1]

A − B = [−2−(−1)   1−3 ]   = [−1  −2]
        [ 6−2     10−9 ]     [ 4   1]
        [ 5−3      3−1 ]     [ 2   2]

3. Transpose of a matrix – A matrix obtained by interchanging the rows and columns. The transpose of a matrix A is denoted by A′ or Aᵀ. Example:

A = [8  7]      Aᵀ = [8  2  4]
    [2  5]           [7  5  6]
    [4  6]

Order of A = 3 × 2; order of Aᵀ = 2 × 3.

Applications of matrices in AI

Matrices are used throughout the field of machine learning for computing:

Image processing – Digital images can be represented using matrices. Each pixel on the image has a numerical value representing its intensity. A grayscale or black-and-white image has pixel values ranging from 0 to 255: smaller values closer to zero represent darker shades, whereas bigger ones closer to 255 represent lighter or white shades. So, in a computer, every image is kept as a matrix of integers called a channel.

Recommender systems use matrices to relate users to the products they purchased or viewed.

In Natural Language Processing, vectors depict the distribution of a particular word in a document. Vectors are one-dimensional matrices.
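A minimal sketch of the three matrix operations above using NumPy, a library not introduced in this unit (plain nested lists would work too); the matrices are taken from the addition example:

```python
# Matrix addition, subtraction and transpose with NumPy,
# using the matrices from the addition example above.
import numpy as np

A = np.array([[3, 2], [4, -1], [2, 0]])
B = np.array([[6, 3], [5, 9], [3, 2]])

print(A + B)      # [[9 5] [9 8] [5 2]]
print(A - B)      # [[-3 -1] [-1 -10] [-1 -2]]
print(A.T)        # transpose: a 2 x 3 matrix
```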
7. DATA PREPROCESSING

Data preprocessing is a crucial step in the machine learning process, aimed at making datasets more machine-learning-friendly. It involves several processes to clean, transform, reduce, integrate, and normalize data:

1. Data Cleaning

Missing data: Missing data occurs when values are absent from the dataset, which can happen for various reasons. Strategies for handling missing data include deleting rows or columns with missing values, imputing missing values with estimates, or using algorithms that can handle missing data.

Outliers: Outliers are data points that significantly differ from the rest of the data, often due to errors or rare events. Dealing with outliers involves identifying and removing them, transforming the data, or using robust statistical methods to reduce their impact.

Inconsistent data: Data with typographical errors, mismatched data types, etc. is corrected so that consistency among the data is maintained.

Duplicate data: Duplicate records are identified and removed to ensure data integrity.

2. Data Transformation

Categorical variables are converted to numerical variables. New features are identified, and existing features are modified if needed.

3. Data Reduction

Dimensionality reduction, i.e. reducing the number of features of the data set, is done. If the data set is too large to handle, sampling techniques are applied.

4. Data Integration and Normalization

If data is stored in multiple sources or formats, it is merged or aggregated. The data is then normalized to ensure that all features have a similar scale and distribution, which can improve machine learning models.

5. Feature Selection

The most relevant features, those that contribute the most to the target variable, are selected, and irrelevant data is removed.

8. DATA IN MODELLING & EVALUATION

After the data is pre-processed, it is split into two parts: a training dataset and a testing dataset. The training set is used to train the machine learning models, while the testing set is used to evaluate the performance of the trained models. While modelling, appropriate machine learning algorithms are chosen based on the nature of the problem (e.g. classification, regression, clustering) and the characteristics of the dataset. Techniques such as train-test split, cross-validation, and error analysis are employed to estimate the model's generalization ability and identify areas for improvement. The train-test split trains the model on its training set and evaluates it on the test set; cross-validation checks that the model's performance is consistent across different subsets of the data. (A sketch of a train-test split appears after this section.)

Different types of evaluation techniques are applied to the model depending on the data. For classification problems, metrics like accuracy, precision, recall, F1-score, and the ROC curve are commonly used. For regression problems, metrics like mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and R-squared are often used.

In today's world, knowing how to work with data is important. As Artificial Intelligence becomes more and more common, understanding data helps us use information better; it is like having a map to find your way through a big city. Being good with data helps us make smart decisions and use technology wisely.
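A minimal sketch of a train-test split and accuracy evaluation using scikit-learn, a library not introduced in this unit; the dataset and classifier here are illustrative choices, not prescribed by the text:

```python
# Train-test split and evaluation sketch with scikit-learn.
# The iris dataset and the k-nearest-neighbours classifier are
# illustrative assumptions, not taken from the text.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)                 # train on the training set

y_pred = model.predict(X_test)              # evaluate on the test set
print("Accuracy:", accuracy_score(y_test, y_pred))
```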
EXERCISES

A. Multiple-choice questions

1. Which of the following best defines data literacy?
A) The ability to read and write data
B) The ability to find and use data effectively
C) The ability to analyse data using AI
D) The ability to collect and store data securely

2. What is the purpose of data preprocessing?
A) To make data more complex
B) To make data less accessible
C) To clean and prepare data for analysis
D) To increase the size of the dataset

3. How can missing data be handled in a dataset?
A) By ignoring it
B) By replacing missing values with estimates
C) By deleting rows or columns with missing values
D) By converting missing values to zero

4. Which of the following statements about the quantity of data needed for machine learning projects is true?
A) More data is always better for good predictions.
B) Small batches of data are sufficient for complex models.
C) Data quantity depends solely on the number of features.
D) Data diversity is not essential for model performance.

5. Which of the following is an example of a primary source of data collection?
A) Web scraping
B) Social media data tracking
C) Surveys
D) Kaggle datasets

6. What method of data collection involves direct communication with individuals or groups to gather information?
A) Observations
B) Experiments
C) Interviews
D) Marketing campaigns

7. Which of the following is an example of ratio scale data?
A) Grading students' exam papers as "A," "B," "C," "D," and "F"
B) Measuring the temperature in Celsius
C) Rating a meal at a restaurant as "unpalatable," "unappetizing," "just okay," "tasty," and "delicious"
D) Recording the weight of a person in kilograms

8. What is the distinguishing feature of ratio scale data?
A) It involves categories without a specific order
B) It has a zero point and allows for ratios to be calculated
C) It involves categories with a strict order but no measurable differences between categories
D) It has a definite order, but the differences between categories cannot be measured

9. Which statistical measure is most suitable for data sets with evenly spread values and no exceptionally high or low values?
A) Mean
B) Median
C) Mode
D) Variance

10. What is the term used to describe the graphical or pictorial representation of data?
A) Statistical summary
B) Data organization
C) Data visualization
D) Data interpretation

B. Short answer questions

1. Explain the concept of data literacy and its importance in today's digital age.
2. What is data preprocessing?
3. What is data visualization and why is it important?
4. How does a line graph differ from a bar graph?
5. When would you use a scatter plot?
6. What is data?
7. What do you mean by web scraping?
8. If a matrix has 6 elements, what are the possible orders it can have?
9. Construct a 3 × 2 matrix where each element is given by aij = i × j.
10. Find the transpose of the matrix
B = [5  −1  4]
    [2   3  6]

C. Long answer questions

1. Discuss the advantages and limitations of using a pie chart in data visualization. Provide examples to illustrate your points.
2. Explain the terms mean, median and mode.
3. Explain the four levels of measurement.
4. Given two matrices A and B, calculate A + B and B − A.

D. Python Programs

1. The ages of a group of people in a community are: 25, 28, 30, 35, 40, 45, 50, 55, 60, 65. Write a program to calculate the mean, median, and mode of the ages.
2. A company recorded the daily temperatures (in degrees Celsius) for five consecutive days: 20°C, 22°C, 25°C, 18°C, and 23°C. Determine the variance and standard deviation of the temperatures.
3. Plot a line chart representing the weekly number of customer inquiries received by a customer service centre:
Week 1: 150 inquiries
Week 2: 170 inquiries
Week 3: 180 inquiries
Week 4: 200 inquiries
4. Plot a bar chart representing the number of books sold by different genres in a bookstore:
Fiction: 120 books
Mystery: 90 books
Science Fiction: 80 books
Romance: 110 books
Biography: 70 books
5. Visualize the distribution of different types of transportation used by commuters in a city using a pie chart:
Car: 40%
Public Transit: 30%
Walking: 20%
Bicycle: 10%