CS322 Data Analysis - Spring 2024 PDF
Nile University
2024
Dr. Noha Gamal, Dr. Shaimaa Mohamed
Summary
This document presents course material for CS322 - Data Analysis, Spring 2024, at Nile University. It details the course outline, grading policy, calendar, and syllabus, along with various data analysis topics.
Full Transcript
CS322 – Data Analysis
Dr. Noha Gamal and Dr. Shaimaa Mohamed
Spring 2024

Instructors
Dr. Noha Gamal, Assistant Professor
[email protected]
Office: 203
Office hours: Monday 4:30 to 5:30; Wednesday 10:30 to 12:00
Dr. Shaimaa Mohamed, Lecturer
Office: visiting instructor
Office hours: Thursday 11:00 to 1:00

TAs
Eng. Aly Abdelmegeid
[email protected]
Office: 220
Office hours: Monday 11:00 to 1:00
Eng. Mohamed Ibrahim
[email protected]

Grading Policy (THIS GRADE SCHEME IS SUBJECT TO CHANGE)
Lecture Attendance: 4.5%
Assignments: 10% (4 assignments, top three)
Tutorial and Lab: 10.5% (7 × 1.5)
Quizzes: 5%
Project: 20%
Midterm: 20%
Final: 30%
Cheating or copying is not negotiable; there is no second chance in case of confirmed cheating.

Handouts: Lectures + Labs
Textbooks:
○ Data Mining: Concepts and Techniques, Jiawei Han, Micheline Kamber, and Jian Pei
○ Statistical Analysis Handbook, Michael John De Smith

Calendar and Syllabus
Date (Sunday) | Lecture | Lab | Assignments
11 Feb | Topic 1: Course Introduction and Data Collection | - | -
18 Feb | Topic 2: Data Sampling | Numpy and Pandas | -
25 Feb | Topic 3: Data Preprocessing Pt.1 | Sampling | Assignment 1 (Sampling Problem)
3 Mar | Topic 3: Data Preprocessing Pt.2 | Data Cleaning and Wrangling | Assignment 1 Deadline
10 Mar | Topic 3: Data Preprocessing Pt.3 (Quiz 1) | EDA | Assignment 2 (EDA Problem); project registration (teams of three students)
17 Mar | Topic 4: Exploratory Data Analysis | Data Visualization | Assignment 2 Deadline
24 Mar | Topic 5: Inferential Statistics (Simple Regression) | Analytical Problem Solving (sample problems on all topics) | -
30 Mar-11 Apr | Midterm Period
14 Apr | - | - | -
21 Apr | Topic 6: Inferential Statistics (Multiple Regression) | Simple Regression and Web scraping | -
28 Apr | Topic 7: Cluster Analysis Pt.1 (Quiz 2) | MLR | Assignment 3 (MLR)
5 May | Topic 7: Cluster Analysis Pt.2 | Clustering: Kmeans | Assignment 3 Deadline + Assignment 4 (Clustering - kmeans)
12 May | Statistical Testing & Revision | Clustering: HC | Assignment 4 Deadline
19 May | - | Analytical Problem Solving (sample problems on all topics) | Project Submission, Discussion, and Oral Presentation
26 May | Study Week

Lec 1 - Introduction

Outline
○ Data and Information
○ What is Data Analysis?
○ Why is Data Analysis Important?
○ Applications of Data Analysis
○ Data Analysis Methods
○ What is Statistical Analysis?
○ Steps of Statistical Analysis
○ Elements of Statistical Analysis
○ Statistical Analysis Methods
○ Data Types and Categories

What is Data?
Data: distinct pieces of raw facts or actions, usually formatted in a specific way, that can be analyzed to obtain more insightful information.
Data forms: numbers, text, images, audio, video, databases, etc.
What is the difference between data and information?
Raw facts or figures are known as data. When data is processed through statistical analysis so that conclusions can be drawn from it, the result is known as information.

What is Data Analysis?
Data analysis is the process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, drawing conclusions, and supporting decision-making. It involves applying various techniques and methods to understand patterns, extract insights, and make informed decisions based on data.
From this definition, can you tell the purpose of analyzing data?
○ To obtain usable and useful information.
○ To help organizations harness the power of data, enabling them to make decisions, optimize processes, and gain a competitive edge.
○ To depict insightful highlights, make reliable predictions, and support decision-making in various fields, such as healthcare, marketing, and education.

Data analysis is crucial in several processes.
[Figures: not recoverable from the transcript]

Real-World Examples: Applications of Data Analysis
○ E-commerce Recommendations: Companies like Amazon use data analysis to recommend products to users based on their browsing history and purchase behavior.
○ Healthcare Predictive Analytics: Healthcare providers use data analysis to predict patient outcomes, optimize treatment plans, and identify potential health risks.
○ Financial Fraud Detection: Banks and financial institutions use data analysis to detect unusual patterns and anomalies in financial transactions, which helps identify fraud early.
○ Sports Analytics: Teams in various sports analyze player performance, game statistics, and opposing team strategies to make informed decisions during matches.

Real-World Examples (contd.)
○ Traffic Management: City planners use data analysis to optimize traffic flow, reduce congestion, and improve transportation infrastructure based on real-time and historical data.
○ Climate Change Research: Scientists analyze vast datasets to understand climate patterns, track changes over time, and make predictions about the impact of climate change.
In essence, data analysis is a powerful tool that empowers individuals and organizations to extract meaningful insights from data, leading to more effective decision-making across various domains.

Data Analysis Methods
Top techniques used to analyze data

Statistical Analysis
○ A scientific tool used in AI and ML that helps collect and analyze large amounts of data, identify common patterns and trends, and convert them into meaningful information.
○ It helps make inferences about populations from sample data.
○ This is done using statistics. Statistics is a branch of mathematics that involves collecting, organizing, analyzing, interpreting, and presenting data.
○ It encompasses a broad range of techniques for summarizing and interpreting data.

1- Descriptive Analysis - Exploratory Data Analysis (EDA)
It focuses on exploring and understanding the data without preconceived hypotheses. It involves visualizations, summary statistics, and data profiling techniques to uncover patterns, relationships, and interesting features. It helps generate hypotheses for further analysis. (Topics 1, 2, 3, 4)

Data Analysis Methods (contd.)

2- Inferential Analysis - Regression Analysis
It is a powerful method for understanding the relationship between a dependent variable and one or more independent variables. By fitting a regression model, you can make predictions, examine potential cause-and-effect relationships, and uncover trends within your data. (Topics 5, 6)

3- Inferential Analysis - Clustering Analysis
It is an unsupervised learning method that groups similar data points. K-means clustering and hierarchical clustering are examples. This technique is used for anomaly detection and pattern recognition. (Topic 7)

4- Inferential Analysis - Classification Analysis
It assigns data points to predefined categories or classes. It is often used in applications like image recognition and disease diagnosis.
Popular algorithms include decision trees, support vector machines, and neural networks.

Statistical Analysis

Statistics as a Tool
○ Collecting Data: Statistics begins with the collection of data, which can take various forms such as surveys, experiments, or observations. Data is the raw information that is collected.
○ Organizing and Summarizing Data: Once data is collected, statistics helps in organizing and summarizing it. This involves creating tables, charts, and summary measures (mean, median, standard deviation, etc.) to provide a clear and concise representation of the information.
○ Descriptive Statistics: Descriptive statistics involve methods for summarizing and describing the main features of a dataset. This step helps in gaining insights into the characteristics of the data.
○ Inferential Statistics: Inferential statistics are used to make inferences and predictions about a population based on a sample of data. This involves generalizing findings from a sample to the larger population.

Steps of Statistical Analysis
1. Collecting data (e.g., survey)
2. Presenting data (e.g., charts and tables)
3. Characterizing data (e.g., average)
Why? Data analysis supports decision-making.

Statistical Analysis Elements
○ Experimental unit: the object upon which we collect data (e.g., a student)
○ Population: all items of interest (e.g., CS students in Egypt)
○ Variable: a characteristic of an individual experimental unit (e.g., overall grade)
○ Sample: a subset of the units of a population (e.g., CS students at NU)

Statistical Analysis Elements: an example
1. Experimental Unit
Example: Individual people
Explanation: The experimental unit is the object upon which we collect data. Here, it is each individual person, and we measure their height.
2. Population
Example: All people in a city
Explanation: The population is the entire group of items of interest. In this case, it is all the people in a city.
3. Variable
Example: Height (a characteristic of individual people)
Explanation: The variable is the characteristic of an individual experimental unit that we are interested in measuring. In this case, it is the height of each person.
4. Sample
Example: A randomly selected group of 100 people
Explanation: The sample is a subset of the units of the population. It is a smaller group that we use to make inferences about the larger population. In this case, it is the 100 people randomly selected from the entire city population.

Statistical Analysis Methods
Descriptive Statistics
○ Involves methods for summarizing and describing the main features of a dataset. This step helps in gaining insights into the characteristics of the data.
○ Describing sets of data (mean, median, standard deviation).
Inferential Statistics
○ Involves making predictions or inferences based on a sample of data. These methods are crucial for drawing conclusions from data and assessing the significance of findings.
○ Drawing conclusions (making estimates, decisions, predictions, etc. about sets of data based on sampling). It is used to make inferences or generalizations about the broader population.

Descriptive Statistics
Involves
○ Collecting Data
○ Presenting Data
○ Characterizing Data
Purpose
○ Describe Data
[Figure: bar chart of $ values (0 to 50) for Q1 to Q4, summarized by mean X̄ = 30.5, STD = 7.4]

Four Elements of Descriptive Statistical Problems
1. The population or sample of interest
2. One or more variables (characteristics of the population or sample units) that are to be investigated
3. Tables, graphs, or numerical summary tools
4. Identification of patterns in the data

Scenario: Describing Heights of Male and Female Populations
1. Description: You want to describe the heights of male and female populations separately without making any specific claims about their averages being different.
2. Data Collection: You collect data on the heights of a sample of males and females.
3. Descriptive Statistics: You calculate descriptive statistics such as mean, median, and standard deviation for both male and female populations.
4. Interpretation: You provide a summary of the central tendency (mean or median) and the spread (standard deviation) of heights for both males and females.
5. Visualization: You may create visual representations like histograms or box plots to illustrate the distribution of heights for each gender.
6. Key Observations: You observe key characteristics of each distribution, such as whether one has a wider range of heights or if there are noticeable patterns.
In simpler terms, you are like a reporter describing the heights of males and females without making any specific claims about whether they are the same or different. Descriptive statistics help you provide a clear and concise summary of the height data for each gender, allowing others to understand the characteristics of each group independently.

Inferential Statistics
Involves
○ Estimation
○ Hypothesis Testing
Purpose
○ Make decisions about population characteristics

Scenario: Heights of Male and Female Populations
1. Setting up the Case:
Null Hypothesis (H0): The average height of males is the same as the average height of females.
Alternative Hypothesis (Ha): The average height of males is different from the average height of females.
2. Collecting Evidence: You collect data on the heights of a sample of males and females.
3. Checking the Evidence: Using statistical methods, you calculate the average height for both males and females in your sample.
4. Decision Time: If the average heights are significantly different, you might decide to reject the idea that the average heights are the same for males and females.
5. Interpreting the Verdict: If you reject the idea of equal average heights, you might conclude that there is a difference in average heights between males and females. If you don't have enough evidence to reject the idea of equal average heights, you do not reject the null hypothesis.

Five Elements of Inferential Statistical Problems
Suppose the average height of men is estimated to lie in [65, 75] and the average height of women in [60, 70]. This is an insight drawn from the data sample. If you repeated the experiment using another sample of the population, would you get the same inference or insight? With what probability (90%, 95%, 97%, 99%)? That probability is the confidence level of the inference, and the ranges above are confidence intervals.
*A confidence interval expresses the uncertainty about the inference.
The five elements:
1. The population of interest
2. One or more variables (characteristics of the population units) that are to be investigated
3. The sample of population units
4. The inference about the population based on information contained in the sample
5. A measure of reliability for the inference

Types of Data
○ Quantitative data are measurements that are recorded on a naturally occurring numerical scale.
○ Qualitative data are measurements that cannot be measured on a natural numerical scale; they can only be classified into one of a group of categories.

Quantitative Data
Measured on a numeric scale.
○ Number of defective items in a lot.
○ Salaries of CEOs of oil companies.
○ Ages of employees at a company.

Qualitative Data
Classified into categories.
○ College major of each student in a class.
○ Gender of each employee at a company.
○ Method of payment (cash, check, credit card).
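
The distinction between quantitative and qualitative data matters in practice because it decides which summaries make sense. The sketch below is one minimal way to illustrate this in Python, not part of the original slides: the function names `classify_variable` and `summarize` and all sample values are hypothetical. It assumes that a column whose values are all numbers is quantitative, which is a simplification: numeric codes such as ID numbers or postal codes are really qualitative.

```python
from collections import Counter
from statistics import mean


def classify_variable(values):
    """Return 'quantitative' if every value is a number, else 'qualitative'."""
    # bool is a subclass of int in Python, so it is excluded explicitly.
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "quantitative" if all_numeric else "qualitative"


def summarize(values):
    """Numeric summary for quantitative data, category counts otherwise."""
    if classify_variable(values) == "quantitative":
        return {"mean": mean(values), "min": min(values), "max": max(values)}
    return {"counts": dict(Counter(values))}


# Hypothetical sample values, invented for illustration only.
columns = {
    "defective_items_in_lot": [4, 0, 2, 7],
    "employee_age": [21, 52, 38, 45],
    "college_major": ["CS", "Math", "CS", "Physics"],
    "payment_method": ["cash", "check", "credit card", "cash"],
}

for name, values in columns.items():
    print(name, classify_variable(values), summarize(values))
```

Averaging makes sense for ages or defect counts, while for majors or payment methods only counts per category are meaningful, which is why the two branches return different kinds of summary.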

Count of students in 322
Alex is 200 km away

Collecting Data
Recall: Statistics is a tool for converting data into information.
Data → Statistics → Information
But where does data come from? How is it gathered? How do we ensure it is accurate? Is the data reliable? Is it representative of the population from which it was drawn?

Methods of Collecting Data
There are many methods used to collect or obtain data for statistical analysis. Some of the most popular methods are:
○ Direct Observation
○ Questionnaires
○ Surveys
○ Experiments

Direct Observations
Observing organizational behaviours in their functional settings is one of the most direct ways to collect data. Observation can range from complete participant observation, where the practitioner becomes a member of the group under study, to more detached observation, where the practitioner casually observes and notes occurrences of specific kinds of behaviours.

Direct Observations (Pros and Cons)
Pros
○ They are free of the biases inherent in self-report data.
○ They put the practitioner directly in touch with the behaviours in question.
○ They involve real-time data, describing behaviour occurring in the present rather than the past.
Cons
○ It is difficult to interpret the meaning underlying the observations (they are not structured).
○ Observers must decide which people to observe and choose time periods, territory, and events.

Surveys
A survey solicits information from people, e.g. Gallup polls, pre-election polls, marketing surveys.
The response rate (i.e. the proportion of all people selected who complete the survey) is a key survey parameter.
Surveys may be administered in a variety of ways, e.g.
○ Personal Interview
○ Telephone Interview
○ Self-Administered Questionnaire
○ Internet

Questionnaire Design
Over the years, a lot of thought has been put into the science of designing survey questions. Key design principles:
○ Keep the questionnaire as short as possible.
○ Ask short, simple, and clearly worded questions.
○ Start with demographic questions to help respondents get started comfortably.
○ Use dichotomous (yes/no) and multiple-choice questions.
○ Use open-ended questions cautiously.
○ Avoid leading questions.
○ Pre-test the questionnaire on a small number of people.
○ Think about the way you intend to use the collected data when preparing the questionnaire.

Questionnaires (Pros and Cons)
Pros
○ Questionnaires are one of the most efficient ways to collect data.
○ They contain fixed-response questions about various features of an organization.
○ These online or paper-and-pencil measures can be administered to large numbers of people simultaneously.
○ They can be analysed quickly.
Cons
○ Responses are limited to the questions asked in the instrument.
○ They provide little opportunity to probe for additional data or ask for points of clarification.
○ They tend to be impersonal.
○ They often elicit response biases: respondents tend to answer in a socially acceptable manner.

The Conclusion in Collecting Data
Each method has advantages and problems. No single method can fully measure the variables of interest.
Examples:
○ Questionnaires and surveys are open to self-report biases, such as respondents' tendency to give socially desirable answers rather than honest opinions.
○ Observations are susceptible to observer biases, such as seeing what one wants to see rather than what is actually there.
(Solution) Use more than one method.
Because of the biases inherent in any data-collection method, it is best to use more than one method when collecting diagnostic data.
The data from the different methods can be compared, and if they are consistent, it is likely that the variables are being validly measured.

Do You Remember?
The Probability Density Function (PDF) and Cumulative Distribution Function (CDF) are both concepts related to probability distributions, but they serve different purposes.

1. Probability Density Function (PDF)
○ Definition: The PDF, denoted $f(x)$ for a continuous random variable $X$, describes the relative likelihood (density) of the variable near a specific value $x$.
○ Interpretation: $f(x)\,dx$ approximates the probability that the variable falls within an infinitesimally small interval of width $dx$ around $x$. The probability of any single exact value is zero.
○ Integration: The integral of the PDF over a range gives the probability that the variable falls within that range: $P(a \le X \le b) = \int_a^b f(x)\,dx$.
○ Example: For a continuous distribution, such as the normal distribution, the PDF provides the shape of the curve.
○ Example equation (normal distribution with mean $\mu$ and standard deviation $\sigma$): $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$

2. Cumulative Distribution Function (CDF)
○ Definition: The CDF, denoted $F(x)$ for a random variable $X$, gives the probability that the variable is less than or equal to a specified value: $F(x) = P(X \le x)$.
○ Interpretation: It provides the cumulative probability up to a certain point on the distribution.
○ Calculation: The CDF is obtained by integrating the PDF. Mathematically, $F(x) = \int_{-\infty}^{x} f(t)\,dt$.
○ Example: For a continuous distribution, the CDF is a monotonically non-decreasing function that goes from 0 to 1.
○ Example equation (normal distribution): $F(x) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)\right]$ (where erf is the error function)

In summary, the PDF provides the probability density at a specific point, while the CDF gives the cumulative probability up to a certain point. The PDF is used to understand the relative likelihood of individual values, and the CDF is used to analyze the cumulative probability distribution.
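
The relationship between the PDF and the CDF of the normal distribution can be checked numerically. The following is a minimal sketch using only Python's standard library, not part of the original slides: the function names (`normal_pdf`, `normal_cdf`, `interval_probability`, `integrate_pdf`) are hypothetical, and the defaults `mu=0.0`, `sigma=1.0` (the standard normal) are an assumption made for the example. It evaluates the textbook normal PDF, the erf-based CDF, and confirms that integrating the PDF over an interval matches the difference of CDF values.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for a normal variable, written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def interval_probability(a, b, mu=0.0, sigma=1.0):
    """P(a <= X <= b), computed as F(b) - F(a)."""
    return normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)


def integrate_pdf(a, b, mu=0.0, sigma=1.0, steps=10_000):
    """Approximate the integral of the PDF over [a, b] with the midpoint rule."""
    width = (b - a) / steps
    total = sum(normal_pdf(a + (i + 0.5) * width, mu, sigma) for i in range(steps))
    return total * width


print(f"pdf(0)              = {normal_pdf(0.0):.6f}")
print(f"cdf(0)              = {normal_cdf(0.0):.6f}")
print(f"P(-1 <= X <= 1)     = {interval_probability(-1.0, 1.0):.6f}")
print(f"integral of the pdf = {integrate_pdf(-1.0, 1.0):.6f}")
```

The last two printed values agree to several decimal places, which is the "CDF is the integral of the PDF" statement in numeric form. Note also that `normal_pdf(0.0)` is a density, not a probability: for small `sigma` a density can exceed 1, whereas CDF values always stay between 0 and 1.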