DA 102 Data Analysis Basics PDF
Summary
This document is a presentation or course material on data analysis, covering various data types, issues, and analytical techniques. It introduces different types of data, including qualitative, quantitative, and boolean values. The document also presents examples of data analysis methods.
Full Transcript
Bachelor of Science (Honours) in Data Science and Artificial Intelligence
DA 102 Data Analysis Basics
Introduction

Learning Objectives
01 Know various types of data
02 Issues associated with data
03 Understand the broad groups of analytics
04 Understand what is descriptive analytics
05 Understand what is predictive analytics
06 Tools used to perform analytics

About Data

Data
Definition: a collection of information, facts, or values.
Data is used as the basis for computation, analysis and decision making.
Data can be in the form of numbers, text, images, audio or video.
Data is the raw material that is processed, organized and interpreted to extract meaning and generate insights.
Data is a fundamental concept that plays a crucial role in various fields such as science, business, and research.
Data is considered the key in this digital age.

Examples of data [images]

About Data

Data 01
Numbers: Quantity: 10; Price: 347.73
Quantitative Data: {1 = Poor, 2 = Fair, 3 = Good, 4 = Very Good, 5 = Excellent}; {AA = 10, AB = 9, BB = 8, …, DD = 4, F = 0}
Boolean Values: {True, False}
Qualitative Data: {"Un-married", "Married", "Divorced"}; {"Student", "Faculty", "Staff"}

Data 02
Temporal Data: Date: 15-Aug-2023; Time: 11:30 AM; Time Stamp: 11:30 AM, 15-Aug-2023
Text Data: Sequence of characters such as words, sentences or paragraphs
Mixed data: Combination of any of these data types

Data 03
Audio: An encoded file corresponding to a song
Image Data: Data captured through cameras; data captured through screenshots in computers
Video data: Combination of several images and associated audio data

Data – stored in a spreadsheet [image]
Data – stored in a notepad [image]

Real valued data

Analysis – example 01
To buy the book titled "The Linux Programming Interface" by Michael Kerrisk from an online website.
The objective is to buy this book from the website that offers the best price.
To achieve this objective, collect prices from various e-commerce websites.

Analysis – example 01: when data is real valued

Website            Price          Additional information
www.amazon.in      5594           In stock
www.flipkart.com   5599           In stock
books.rediff.com   5378           Out of stock
bookswagon.com     7738           Out of stock
www.snapdeal.com   Not available  Not available

Prices are real valued numbers.
The price at snapdeal.com is not available because the product is unavailable.
Computing the lowest price from the second column of the table is a challenging task, as price information is not available for every website.

Suppose the price offered by snapdeal is set to 0 to make the price a real valued number:

www.snapdeal.com   0              Not available

However, this change leads to the factually incorrect conclusion that the lowest price is offered by snapdeal.
Handling "not available" values is very important in data analysis; otherwise, decisions swing significantly, resulting in errors.

With the "Not available" entry left out, the lowest price is offered by books.rediff.com.
The additional information tells us that though this is the best price offered, the book is out of stock.
We therefore must search for the second lowest price.
Sort the prices in ascending order and pick the second element in the sorted list.
We find that Amazon offers the second lowest price, and the book is in stock there.

Sampling: 5 websites are visited out of many e-commerce websites.
Computing the minimum price value leads to decision making.
Is this example too easy? Well, the fact is: 9 out of 10 consumers price check a product on Amazon (https://www.bigcommerce.com/blog/amazon-statistics/).

Quantitative data

Analysis – example 02: when data is quantitative
Visit www.amazon.com.
Search for any product and access the associated product description page.
Apart from the book description, we get to see ratings given by users.
The ratings are quantitative values, such as a number of stars [image: https://www.vecteezy.com/].

A total of 1043 users gave ratings.
75% of users gave 5 stars.
17% of users gave 4 stars, and so on.
The ordered values are presented as a histogram visualization.
The average rating is computed and presented as 4.6 out of 5 stars.

User        Rating
User 1      5 stars
User 2      5 stars
User 3      4 stars
User 4      1 star
…           …
User 1043   5 stars

Analysis: the following are computed.
Number of users who gave 5 stars
Number of users who gave 4 stars
…
Number of users who gave 1 star
Average rating across all users

Product ratings influence conversions from viewing to buying.
A correlation between ratings and conversion has been observed: a product with a 3.7 rating has a 15% higher click through rate (CTR).
This is analysis involving only ratings.

From the data perspective:
The rating data may be visualized as stored in the table above, one row per user.
When a user visits the product page, the average rating and the histogram are generated from this kind of data.
Analytics plays a pivotal role in transforming data and influencing decisions.

Qualitative data

Analysis – example 03: when data is qualitative
Visit www.youtube.com.
Search for any video.
Apart from the video, we get to see how many users {liked, disliked} the video.
More likes on a video correlate strongly with an increase in revenue.
[image: https://www.vecteezy.com/]

Text data

Analysis – example 04
Visit www.amazon.com.
Search for any product and access the associated product description page.
Apart from the book description, we get to read detailed product reviews given by customers who bought the product.
This information is in text form.

Analyze the contents to find out the sentiment of the customer.
In analytics terms, perform sentiment analysis of customer reviews.
If the complete review has a positive sentiment, it suggests that the customer is happy with the product.
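The sentiment analysis step described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a word-counting approach: the two word lists and the sample reviews are made up for this sketch and do not come from the course material, and practical sentiment analysis typically relies on much larger lexicons or trained models.

```python
import re

# Hypothetical, deliberately tiny word lists (assumptions for illustration only).
POSITIVE_WORDS = {"good", "great", "excellent", "helpful", "clear", "love"}
NEGATIVE_WORDS = {"bad", "poor", "terrible", "confusing", "damaged", "hate"}


def sentiment_score(review: str) -> int:
    """Return (number of positive words) minus (number of negative words)."""
    words = re.findall(r"[a-z']+", review.lower())
    positives = sum(1 for word in words if word in POSITIVE_WORDS)
    negatives = sum(1 for word in words if word in NEGATIVE_WORDS)
    return positives - negatives


def classify_review(review: str) -> str:
    """Label a review as positive, negative or neutral from its score."""
    score = sentiment_score(review)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Made-up reviews, used only to exercise the functions.
reviews = [
    "Excellent book, very clear and helpful examples.",
    "The copy arrived damaged and the print quality is poor.",
    "It is a book about Linux.",
]
labels = [classify_review(review) for review in reviews]

for review, label in zip(reviews, labels):
    print(f"{label:8s} | {review}")
```

Counting words ignores negation ("not good") and context, which is one reason a real analysis would go beyond this sketch.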
A computer program
A computer program is data.
A collection of such programs is data.
A collection of executables is data.

Issues in Data Collection

Bias and Sampling Issues
Sampling Bias - When the sample collected does not represent the entire population.
Selection Bias - When participants are not selected randomly, leading to skewed or inaccurate results.
Nonresponse Bias - When a significant portion of the selected participants does not respond, leading to a biased sample.

Measurement Issues
Measurement Error: Inaccuracies or inconsistencies in measurement instruments, leading to incorrect or imprecise data.
Subjective Measurements: When data is collected through subjective methods like surveys, individuals might respond based on personal biases or interpretations.
Social Desirability Bias: Participants may provide responses they believe are socially acceptable rather than their true opinions or behaviors.

Data Quality and Integrity Issues
Data integrity issues:
Missing values: a data record that is not complete.
Data entry errors.
Duplications in data.

Ethical and Privacy Concerns
Informed Consent: Obtaining proper informed consent from participants, especially in sensitive or intrusive research.
Privacy Protection: Ensuring that personally identifiable information is not leaked or misused.
Anonymity: Striking a balance between collecting meaningful data and protecting participants' identities.

Technological Challenges
Data Security: Protecting data from breaches, leaks, or unauthorized access.
Data Storage: Ensuring proper and secure storage of collected data.
Technical Glitches: Issues with data collection tools, software, or hardware can affect data quality.
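Missing values, listed under the data integrity issues above, are the same problem met in the book price example earlier in this transcript, where one website reports "Not available". The Python sketch below shows one possible way to handle it, assuming a missing price is stored as None and skipped rather than replaced by 0. The prices are those from the example table; the function name lowest_price and its in_stock_only parameter are illustrative names, not part of the course material.

```python
# (website, price, stock status) from the book price example.
# None stands for a price that is "Not available" (an assumption of this sketch).
offers = [
    ("www.amazon.in", 5594, "In stock"),
    ("www.flipkart.com", 5599, "In stock"),
    ("books.rediff.com", 5378, "Out of stock"),
    ("bookswagon.com", 7738, "Out of stock"),
    ("www.snapdeal.com", None, "Not available"),
]


def lowest_price(rows, in_stock_only=False):
    """Return (website, price) with the lowest known price, or None if no row qualifies."""
    candidates = [
        (site, price)
        for site, price, status in rows
        if price is not None and (not in_stock_only or status == "In stock")
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda item: item[1])


# Replacing the missing price by 0 makes the unavailable offer look cheapest.
zero_filled = [(site, 0 if price is None else price, status)
               for site, price, status in offers]
wrong_site = min(zero_filled, key=lambda row: row[1])[0]

best_overall = lowest_price(offers)                       # ignores the missing price
best_in_stock = lowest_price(offers, in_stock_only=True)  # also requires stock

print("Zero-filled minimum picks:", wrong_site)
print("Lowest known price:", best_overall)
print("Lowest in-stock price:", best_in_stock)
```

Skipping the missing entry is only one option; which treatment is appropriate depends on why the value is missing.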
Cultural and Language Related Challenges
Cultural Bias: Data collection instruments might be biased towards a particular culture or group.
Translation Issues: Translating surveys or questionnaires into different languages can lead to discrepancies in meanings.

Resource Limitations
Financial Constraints: Adequate data collection might require funding for tools, personnel, and infrastructure.
Time Constraints: Rushed data collection might lead to errors or incomplete data.

Types of analytics

Analytics components
A business context is needed in order to take up data analytics. Two examples support the business context.

Example 1:
Pregnant women are likely to be price-insensitive.
Their willingness to spend more is the potential for business.
The market for baby products is 38 billion dollars.
Given a female customer, predict whether the customer is pregnant or not.
In the business context, if the prediction is accurate, offer pregnant women customized products and services.

Example 2:
In competitive markets such as the telecommunication industry, customers abruptly leave the services of the service provider.
Predicting beforehand which customers will leave the service helps address the core concerns of those customers.
In the business context, if the prediction is accurate, offer discounts to retain such customers.
Analytics types
Descriptive analytics
Predictive analytics

Descriptive analytics

Descriptive analytics
In descriptive analytics the focus is on summarizing and interpreting historical data.
It gives insights into past events and trends.
It strives to provide a clear and concise representation of data in order to understand the data and aid decision making.
It summarizes large volumes of data into manageable and interpretable forms.
It identifies patterns and trends present in the historical data.
It visualizes data, as visualizations make it easier to communicate complex data.

Descriptive analytics - example
An e-commerce company wants to analyze the sales data for a particular product over the past year.
It has data describing each sale, such as the date of the sale, the product sold, the quantity, and the revenue generated.

Descriptive analytics includes:
Summarization: compute the average revenue for each product.
Summarization: compute the total revenue of each product.
Visualization: plot sales of product A over a timeline.
Visualization: plot the quantity sold every month for product A.
Top selling products: determine them based on the total revenue generated by each product.
Customer segmentation: group customers based on purchase frequency and total spending, and identify high value and low value customers.
Peak sales times: identify the days of the week when sales are highest.

Predictive analytics

Predictive analytics
In predictive analytics the focus is on finding future trends given historical and current data.
The goal is to build models that can learn from historical data.
The learning is used to make predictions about future events (data).
The models are trained on known outcomes and patterns.
The trained models are employed to predict new or unseen data.
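The train-then-predict cycle described above can be illustrated with a very small Python sketch. It is not the method used in the course: it assumes a single numeric attribute (a hypothetical count of unresolved complaints per customer), made-up historical records with known churn outcomes, and a simple threshold rule as the "model". The names train_threshold and predict are illustrative.

```python
def train_threshold(samples):
    """Learn a threshold t from (value, churned) pairs.

    The rule 'predict churn when value >= t' is evaluated for every observed
    value, and the t with the highest training accuracy is returned.
    """
    best_threshold, best_correct = None, -1
    for candidate in sorted({value for value, _ in samples}):
        correct = sum((value >= candidate) == churned for value, churned in samples)
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold


def predict(threshold, value):
    """Apply the learned rule to a new, unseen value."""
    return value >= threshold


# Made-up historical data: (unresolved complaints, did the customer churn?).
history = [
    (0, False), (1, False), (1, False), (2, False),
    (3, True), (4, True), (5, True),
]

threshold = train_threshold(history)          # training on known outcomes
new_customers = [1, 4]                        # unseen data
predictions = [predict(threshold, value) for value in new_customers]

print("Learned threshold:", threshold)
for value, will_churn in zip(new_customers, predictions):
    print(f"unresolved complaints = {value}: predicted churn = {will_churn}")
```

A real model would use several attributes and be validated on data held out from training, rather than judged on the records it was trained on.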
Example - customer churn

Example - customer churn
Customer churn is the percentage of customers who stopped using the product or service.
In a competitive marketplace, identifying potential customers who would discontinue their services is key.
If the customer's intent to leave is known beforehand, the company may initiate a retention plan, which may include offering a coupon or providing a discount on the previous three months of service.
The main challenge is that no customer will explicitly state the reason for leaving the service.
The reasons are to be understood from multiple data sources. Here is an example.

Example - customer churn
Data from billing department [table]
Data from service department [table]

In this example a correlation exists between the number of complaints raised, resolved and unresolved, and the rating.
Obtain detailed data for all active customers for the past four months; a snapshot of the same is given below. [table]

Customer churn
When data for a large number of customers is presented, it is hard to establish relations between churn and the attributes of the data visually.
When a relationship is established, it must be validated.
In the previous example, customer ID 1235 is identified as likely to churn. This must be validated.
The outline of predictive analytics is:
Perform data analysis and modeling: analyze historical data and discover relationships, correlations, and patterns that can be used to make predictions.
In the customer churn example, complaints raised, number of unresolved complaints, and rating on resolved complaints are the attributes that carry these relationships.

Data analytics tools

Data analytics tools
Several tools are used for data analytics. Some of them are listed below.
Microsoft Excel: Widely used for basic data analysis, calculations, and charting.
Tableau: Known for its powerful data visualization capabilities and interactive dashboards.
Power BI: Microsoft's business analytics service for interactive visualizations and business intelligence.
QlikView/Qlik Sense: Tools for data visualization, reporting, and business intelligence.
R: A programming language and software environment for statistical computing and graphics.
Python: A versatile programming language used for data analysis and manipulation with libraries like pandas and NumPy.
SAS (Statistical Analysis System): Offers a suite of software for advanced analytics, business intelligence, and data management.
SPSS (Statistical Package for the Social Sciences): Software for statistical analysis, used in social science research and data mining.
MATLAB: A programming language and environment for numerical computing, often used in engineering and scientific research.
Stata: A software package used for data analysis and statistics.
JMP: Statistical discovery software often used for exploratory data analysis.

DA102 – Microsoft Excel
This course focuses on the basics of data analysis and business modeling.
The Microsoft Excel software tool is used.
This tool was chosen because students of this online programme are in their first semester and are yet to be introduced to programming languages.
For students who have already been introduced to programming languages, this course should enrich them with an orientation towards data manipulation through software systems.

DA102 – Detailed contents
Basic spreadsheet modeling
Understanding range names
LOOKUP functions
INDEX function
MATCH function
Text manipulation functions
Time: dates and date functions
Conditional statements (IF statement)
Three-dimensional formulas
Sensitivity analysis
COUNT family
SUM family
OFFSET function
INDIRECT function
Data validation
Filtering and removing duplicates
Consolidating data
Pivot tables