Conestoga College - Week 1,2, & 3 Data & Analytics PDF

Document Details

Conestoga College

Tags

data analytics data science artificial intelligence business analytics

Summary

These Conestoga College notes cover the fundamentals of data, data science, and artificial intelligence. The document introduces data analytics and its importance for businesses. It also mentions tools and techniques for data analysis, including Excel, R, and Python.

Full Transcript

WEEK 1
The World of Data, Data Science and Artificial Intelligence
INFO8066 - DATA ANALYTICS - SEC XX

DATA
Data has been a buzzword for years. Whether it is generated by a large-scale enterprise or by an individual, data has to be analyzed before anyone can benefit from it. But how do we do that? That is where the term "Data Analytics" comes in.

WHAT IS DATA?
"Data is raw, unorganized facts that need to be processed" (Diffen.com)
"Data" comes from the singular Latin word datum, which originally meant "something given." Its early usage dates back to the 1600s.
Data are simply facts or figures: bits of information, but not information itself.
When data are processed, interpreted, organized, structured or presented to make them meaningful or useful, they are called information.
Information provides much richer context about the phenomenon and, where applicable, a clear actionable insight.
Exploring your data at the onset of a project will help you communicate this need to data scientists and your team alike, narrowing the project's focus and streamlining the path toward results.
It is always important to understand how the data was generated before starting to analyze it.

DATA AND INFORMATION
"Data" and "information" are intricately tied together, whether one treats them as two separate words or uses them interchangeably, as is common today.
Examples of data, each paired with an example of information:
- Data: the number of visitors to a website in one month.
  Information: understanding that changes to a website have led to an increase or decrease in monthly site visitors.
- Data: inventory levels in a warehouse on a specific date.
  Information: identifying supply chain issues based on trends in warehouse inventory levels over time.
- Data: individual satisfaction scores on a customer service survey.
  Information: finding areas for improvement in customer service based on a collection of survey responses.
- Data: the price of a competitor's product.
  Information: determining whether a competitor is charging more or less for a similar product.

DATA CAN BE MISLEADING (SURVIVORSHIP BIAS)

SURVIVORSHIP BIAS
Survivorship bias is our tendency to study the people or companies who "survived" or were victorious in a certain situation while ignoring those that failed ("LinkedIn"). This can lead researchers to form incorrect conclusions because they study only a subset of the population (Kassiani Nikolopoulou).
Examples to get you thinking:
- A real estate investing course features testimonials from a handful of happy customers who "got rich" following the investing approach being sold by the company. They don't mention the thousands of others who failed using the same method.
- Mark Zuckerberg, Steve Jobs, and Bill Gates dropped out of college and became billionaires. Does this mean that if you follow their example and drop out of college you are more likely to become a billionaire too?
“LinkedIn.” Linkedin.com, 2023, www.linkedin.com/pulse/survivorship-bias-avoid-mistake-stephen-lynch/. Accessed 4 Sept. 2023.
Kassiani Nikolopoulou. “What Is Survivorship Bias?
| Definition & Examples.” Scribbr, 4 Oct. 2022, www.scribbr.com/research-bias/survivorship-bias/. Accessed 4 Sept. 2023.

WHAT IS DATA ANALYTICS?
Data analytics is the process of examining raw data to find patterns and insights. It helps businesses make better decisions, improve efficiency, and increase revenue by turning data into useful information.

DATA ANALYTICS TOOLS
To analyze data, you need tools that help you process and manipulate it. Commonly used data analytics tools include Excel, R, Python, SQL, Tableau, and Power BI.
- Excel is widely used for data analytics and is easy to use, especially for beginners.
- R and Python are programming languages commonly used for data analytics.
- SQL is a language used to query databases.
- Tableau and Power BI are tools used for data visualization.

WHY IS DATA ANALYTICS IMPORTANT?
Data analytics plays a key role in improving a business: it is used to gather hidden insights, generate reports, perform market analysis, and improve business requirements.
- Collect data
- Analyze data
- Generate reports

THE POWER OF DATA ANALYTICS - NETFLIX
- Founded in 1997 as a subscription mail-order DVD company
- Current valuation of over $282 billion (Yahoo Finance)
- Current user base of 151 million worldwide
- Retention rate of 93%
Jangam, R. (2023, March 23). The power of Data Analytics: A case study of netflix. Medium.
https://medium.com/@raj.w.2336/the-power-of-data-analytics-a-case-study-of-netflix-555ae819b0d7

NETFLIX – KEY STRATEGIES
- Developing customer persona
- User interaction data
- Robust feedback system

UBER USES DATA TO REINVENT TRANSPORTATION
- 8 million users
- 160k drivers
- 449 cities in 66 countries
- 1 million rides/day
- 2 billion rides recorded
https://www.projectpro.io/article/how-uber-uses-data-science-to-reinvent-transportation/290

UBER USES DATA TO REINVENT TRANSPORTATION
(Kivestu, 2020)

UBER USES DATA TO REINVENT TRANSPORTATION
Uber Knows You Well
- Tracking supply and demand allows Uber to implement "surge pricing," boosting fares at peak times to draw more drivers out.
- Uber's secret lies in its sophisticated supply chain management system, which uses data analytics to optimize every aspect of the ride-sharing experience.
- Uber relies heavily on real-time data to monitor supply and demand patterns, adjust operations, and optimize its driver allocation.
- Using data analytics, Uber analyzes user behavior, location, and other data points to predict demand patterns, identify potential bottlenecks, and adjust operations to ensure maximum efficiency.

UBER USES DATA TO REINVENT TRANSPORTATION
One way Uber uses data analytics is through its "heatmap" tool, which provides real-time insights into where and when riders are requesting rides.
This tool allows Uber to adjust its pricing and driver allocation to meet demand, resulting in a better user experience.
Codebasics. (n.d.). How uber uses data analytics to increase supply efficiency? https://codebasics.io/blog/how-uber-uses-data-analytics-to-increase-supply-efficiency

UBER USES DATA TO REINVENT TRANSPORTATION
Uber's predictive supply management system uses historical and real-time data to predict rider demand and driver supply in a given geographical area. By analyzing past demand patterns, Uber estimates the likelihood of future demand in a particular location at a specific time. The system also considers factors such as weather, events, and traffic to make more accurate predictions.
Codebasics. (n.d.). How uber uses data analytics to increase supply efficiency? https://codebasics.io/blog/how-uber-uses-data-analytics-to-increase-supply-efficiency
Uber's real-time data intelligence platform at scale: Improving Gairos scalability/reliability | Uber Blog. (n.d.). https://www.uber.com/blog/gairos-scalability/

USES OF DATA ANALYTICS IN VARIOUS INDUSTRIES
Applications of data analytics are all around us, whether in entertainment, manufacturing, e-commerce, health, or marketing. The rise of AI is primarily due to the billions of data points that we as humans generate every day, combined with rising computational power.

DATA IN HEALTHCARE
Among many use cases, the healthcare sector uses chatbots for medical scheduling and X-ray computer vision to detect early signs of lethal diseases, saving the lives of thousands of patients.
Nalini. (2023, December 20). AI in Healthcare: Benefits, Applications, and Cases. Apptunix Blog.
https://www.apptunix.com/blog/ai-in-healthcare-benefits-applications-and-cases/

VIDEO: HIGH-TECH HOSPITAL USES ARTIFICIAL INTELLIGENCE IN PATIENT CARE

DATA IN E-COMMERCE
Recommender system algorithms continue to generate profit for companies by showing customers targeted and useful product recommendations.

DATA IN ENTERTAINMENT
Entertainment giants like Netflix and YouTube use your past viewing data to recommend movies and videos you are likely to enjoy. According to an article on Medium.com, over 80% of Netflix viewing comes from its recommendation system.

DATA IN TRANSPORTATION
Fully self-driving cars are now on the roads, making every decision a human makes while driving, including stopping, steering, and braking for pedestrians or emergencies. This kind of AI-enabled use case is only possible with highly advanced computer vision algorithms and very fast computation.

VIDEO: TRANSFORMING TRANSPORTATION WITH AI | I AM AI

ARTIFICIAL INTELLIGENCE

INTELLIGENCE
All living organisms are intelligent: they interact with their environment and survive.
Examples:
- Crossing a road
- Discovering alternate paths
- Writing a poem, drawing a picture, creating a new recipe

ARTIFICIAL INTELLIGENCE
Living things are intelligent, but are man-made non-living things also intelligent? Can a machine:
- Make discoveries?
- Pass a ruling order in a court?
- Compose a symphony?
- Go for a PLAN B?
- Decide to wait or let go?
ARTIFICIAL INTELLIGENCE
- Traditional computers are powerful but not intelligent. They can compile MBs and GBs of code but may get stuck at a minor logical error.
- AI is a field of computer science that aims to build computer systems that can mimic human intelligence, much as we humans act when we don't have exact information about a situation but still go ahead and choose one of the many possible moves.

ARTIFICIAL INTELLIGENCE IN ACTION

ARTIFICIAL INTELLIGENCE
Subsets of AI - Javatpoint. www.javatpoint.com. (n.d.). https://www.javatpoint.com/subsets-of-ai

AI -> MACHINE LEARNING
"Learning is any process by which a system improves performance from experience." - Herbert Simon
ML is used when:
- Human expertise does not exist (navigating on Mars)
- Humans can't explain their expertise (speech recognition)
- Models must be customized (personalized medicine)
- Models are based on huge amounts of data (genomics)
Based on slide by E. Alpaydin

TYPES OF MACHINE LEARNING

APPLICATIONS OF MACHINE LEARNING
- Image Recognition
- Virtual Personal Assistant

APPLICATIONS OF MACHINE LEARNING
What other applications of Machine Learning can you think of?
PROS & CONS OF MACHINE LEARNING
Advantages:
- No human intervention
- Easily identify trends and patterns
- Wide applications
- Handling multi-variety data
- Continuous improvement
Disadvantages:
- Data acquisition
- Results interpretation
- Time & resources
- High error chances
- Elimination of human interface

DEEP LEARNING
DL tasks can be expensive: they depend on significant computing resources and require massive structured or unstructured data sets to train models on. In deep learning, the learning algorithm has to learn a huge number of parameters, which can initially produce many false positives.
Barn owl or apple? This example indicates how challenging learning from samples is, even for machine learning. Source: @teenybiscuit

APPLICATIONS OF DEEP LEARNING
- Medical Image Analysis
- Self Driving Car

NATURAL LANGUAGE PROCESSING (NLP)
"Natural language processing is the set of methods for making human language accessible to computers" - Jacob Eisenstein (Google Scientist)
"Natural language processing is the field at the intersection of Computer science (Artificial intelligence) and linguistics" - Christopher Manning (Professor, Stanford Uni)

NLP IN ACTION

APPLICATIONS OF NLP
- Sentiment Analysis
- Email Filtering

DATA GENERATION PROCESS (DGP)

DATA GENERATION
Data Generating Process (DGP) describes the rules with which the data has been generated ("Data Generating Process").
Extensive surveys are never easy to analyze.
Many factors go into the design of a survey, most notably the selection probabilities. Consider the Australian Bureau of Statistics (ABS) national surveys. The ABS data protocols for the DGP of every survey are extensive and warrant careful examination before engaging with the data. By reading the ABS data protocols you learn how the data was collected, the scale of each measure, the coverage of the survey, precision and estimation error, the sample employed, and more.
https://www.globalpatron.com/images/10-best-data-collection-forms-1024x576.png
It is always important to understand how the data has been captured, stored and presented to you before starting to analyze it.

DATA GENERATION
Some of the key points to check are:
- What does each row (observation) in the data represent?
- What does each column (variable) in the data represent?
- Are there any missing values in the dataset? If yes, what could have caused them?
- Has this data been processed or altered before?
- Will the next cycle of data arrive in the same format and follow the same set of rules?

DATA GENERATION CHALLENGES
- Data Quality Issues - Poor data quality can lead to incorrect insights and wasted resources.
- Data Security - Protecting data at every stage of its lifecycle, from collection to storage to disposal.
- Data Privacy - With the rise of data breaches and cyber attacks, customers are increasingly concerned about how their data is being used and who has access to it.
- Data Volume - The sheer volume of data generated by businesses can be overwhelming, making it difficult to extract meaningful insights.
- Siloed Data Sources - Combining data from multiple sources to create a complete picture of a business's operations.
The top challenges of data collection and how to overcome them. (2023, April 5).
https://aspenasolutions.com/challenges-of-data-collection-and-how-to-overcome-them

The End
What do you call data that floats?

WEEK 2
Disciplines of Business Analytics

WHAT IS BUSINESS ANALYTICS?
Analytics is the use of:
- Data
- Information technology
- Statistical analysis
- Quantitative methods
- Mathematical or computer-based models
The purpose is to help managers gain improved insight into their business operations and make better, fact-based decisions.

Business Analytics is a multidisciplinary field:
- Business intelligence / information systems
- Statistics
- Data mining
- Visualization
- Simulation & risk
- What if?
- Modelling & optimization

CRISP - DM
Cross-Industry Standard Process for Data Mining
Saltz, J. (2024, April 10). CRISP-DM is still the most popular framework for executing data science projects. Data Science Process Alliance. https://www.datascience-pm.com/crisp-dm-still-most-popular/

ROLES & STEPS IN THE CRISP PROCESS

ROLES & STEPS IN THE CRISP PROCESS
- Get a clear understanding of the business objectives. Examples: to reduce churn rates; to acquire valuable customers.
- Agree on success criteria. Example: to reduce our annual churn rate from 5% to 3%.
- Assess the situation.
- Translate to analytical objectives (if possible).
- Evaluate the cost/benefit.
- Clearly understand how action can be taken based on the likely outcomes.
- Document relevant resources, constraints, systems.

ROLES & STEPS IN THE CRISP PROCESS
- Identify the data sources and fields which may have a bearing on the business/analytical objectives.
- Review data schemas and any other data documentation.
- What looks relevant?
- What are the formats? Databases, text files, Excel, etc.
- What are the field names?
- Metadata
- Crucially: what is the likely target field that maps to the business objective? Examples:
  - Customers purchasing
  - Machinery failing
  - Revenue/Profit/ROI
  - Visits to the web site
  - Denial of service attacks detected
  - Customers churning

ROLES & STEPS IN THE CRISP PROCESS
- Data Understanding helps design this step.
- Together with Data Understanding, this can be more time-consuming than expected: sometimes 80% of a project, especially for newer projects.
- Typically integrates data from different sources.
- Aggregates data.
- Comparable to ETL (Extract, Transform, Load).

ROLES & STEPS IN THE CRISP PROCESS
Modelling
With clean data in hand, various modeling techniques are applied. Each method may require specific data formats, so it's not uncommon to loop back to the data preparation phase. Key tasks include:
- Selecting modeling techniques
- Designing tests
- Building the model
- Assessing the model
Evaluation
Before proceeding to deployment, the model's performance is thoroughly evaluated. This ensures that it meets the business objectives set in the first phase. Key tasks include:
- Evaluating results
- Reviewing the process
- Determining the next steps

ROLES & STEPS IN THE CRISP PROCESS
Deployment can be as simple as generating a report or as complex as implementing a repeatable data mining process. Key tasks include:
- Planning deployment
- Monitoring and maintenance
- Reviewing the project
- Finalizing the project

CRISP – DM IN ACTION
Suppose we want to build a spam detection system: for each email we get, we want to determine if it's spam or not. If it is, we want to put it into the "spam" folder. Let's see how we can solve this problem with CRISP-DM.
CRISP-DM – Machine Learning Bookcamp. (n.d.). Machine Learning Bookcamp.
https://mlbookcamp.com/article/crisp-dm

CRISP – DM IN ACTION
Phases: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment
Spam Detection System – What does the business need?
Our users started to complain about spam messages, so we decided to check if it's something we can solve.
Goal:
- Reduce the amount of spam messages, or
- Reduce the amount of complaints about spam
The goal must be measurable:
- Reduce the amount of spam by 50%
CRISP-DM – Machine Learning Bookcamp. (n.d.). Machine Learning Bookcamp. https://mlbookcamp.com/article/crisp-dm

CRISP – DM IN ACTION
Spam Detection System – What data do we have / need? Is it clean?
Identify the data sources:
- We have a report spam button
- Is the data behind the button good enough?
- Is it reliable?
- Do we track it correctly?
- Is the dataset large enough?
- Do we need to get more data?

CRISP – DM IN ACTION
Spam Detection System – How do we organize the data for modeling?
- Clean the data
- Build the pipelines
- Convert into tabular form

CRISP – DM IN ACTION
Spam Detection System – What modeling techniques should we apply?
Training a model:
- Try different models
  - Logistic Regression
  - Decision Tree
  - Neural Networks
- Select the best one

CRISP – DM IN ACTION
Spam Detection System – Which model best meets the business objectives?
Is the model good enough?
- Have we reached the goal?
- Do our metrics improve?
CRISP-DM – Machine Learning Bookcamp. (n.d.). Machine Learning Bookcamp. https://mlbookcamp.com/article/crisp-dm

CRISP – DM IN ACTION
Spam Detection System
Evaluation and deployment often happen together:
- Online evaluation: evaluation with live users
- This means deploying the model and then evaluating it

TYPES OF DATA
Previously we covered what data is and how more advanced forms of data analysis, such as machine learning, are reshaping the world around us. However, even the most basic data analysis requires knowing which types of data you are dealing with, how to convert one form of data to another, and the strengths and limitations of each type.

TYPES OF DATA
Quantitative Data
- Can be expressed in numerical values, making it countable
- "How much", "How many", "How often"
- Examples: Market Share Price (e.g. 50.4%, 23%); Number of cars in a parking lot (e.g. 5, 10, 100); Rolling a die (e.g. 1, 2, 3); Temperature (37.2°C)
Qualitative or Categorical Data
- Can't be measured or counted in the form of numbers
- Sorted by category, not numbers
- Audio, images, symbols, or text
- Examples: Gender (e.g. Girl/Boy); Grades (e.g. A+, B, D); Customer Satisfaction Rating (e.g. Fair, Good, Poor); Colors (e.g. blue, red)

TYPES OF DATA - EXPLAINED
Quantitative Data Type: Numeric Data Types
Continuous: This type of data can assume any whole number or fractional value, such as a dollar amount ($100.234) or a temperature (37.2°C).
Discrete: Discrete data consists only of whole-number values, such as the number of trips to a store, the number of cars, or the total number of household members.
Qualitative or Categorical Data Type: Qualitative or categorical data describes the object under consideration using a finite set of discrete classes. The gender of a person (male, female, or others) is a good example of this data type.
Nominal: These are values that don't possess a natural ordering. For example, the color of a car is a nominal data type because we can't numerically compare one color with another: we can't say that "white" is greater than "red".
Ordinal: These values have a natural ordering while maintaining their class of values. Clothing sizes, for example, can easily be sorted by their name tags in the order small < medium < large. The grading system used to mark candidates in a test is also ordinal: A+ is better than B. Similarly, one can differentiate and order happy, indifferent and unhappy customers.

TYPES OF BUSINESS ANALYTICS

TYPES OF BUSINESS ANALYTICS
Descriptive analytics is the simplest type of analytics and the foundation the other types are built on. It allows you to pull trends from raw data and succinctly describe what happened or is currently happening. Data visualization is a natural fit for communicating descriptive analysis because charts, graphs, and maps can show trends in data, as well as dips and spikes, in a clear, easily understandable way.

TYPES OF BUSINESS ANALYTICS
Diagnostic analytics provides crucial information about why a trend or relationship occurred. It is useful for getting at the root of an organizational issue.
TYPES OF BUSINESS ANALYTICS
Predictive analytics is used to make predictions about future trends or events and answers the question, "What might happen in the future?" Making predictions for the future can help your organization formulate strategies based on likely scenarios.

TYPES OF BUSINESS ANALYTICS
Prescriptive analytics takes into account all possible factors in a scenario and suggests actionable takeaways. This type of analytics can be especially useful when making data-driven decisions.

Job Roles In the World of Data
"Why did the data analyst bring a ladder to work? To reach the high data points!"

Data Analyst
- Average: $65,112. Range: $47,000 - $89,000.
- Responsible for gathering and organizing large sets of data, analyzing the data, and using the analysis to draw specific business conclusions.

Data Scientist
- Average: $98,789. Range: $70,000 - $137,000.
- Applies statistics and builds machine learning models to make predictions.

Machine Learning Engineer
- Average: $113,892. Range: $79,000 - $156,000.
- Responsible for researching, building, and designing the artificial intelligence responsible for machine learning, and for maintaining and improving existing artificial intelligence systems.
- Also responsible for transforming models created by data scientists into something that can be used in production in the real world.

Business Intelligence Analyst
- Average: $72,106. Range: $53,000 - $98,000.
- Focused on using data to improve an organization.
- Expected to translate their data analysis into actionable strategies to improve the business, and to present their strategic analysis to leadership.

Data Architect
- Average: $127,732. Range: $82,000 - $168,000.
- Designs the structures an organization needs to effectively acquire, organize, analyze, manage, and utilize data.

DATA ROLES + CRISP DM
Hotz, N. (2024, April 2).
Data Science Roles – A Definitive Guide. Data Science Process Alliance. https://www.datascience-pm.com/data-science-roles/

WEEK 3
Descriptive Analytics using Excel

DESCRIPTIVE STATISTICS PROVIDES A PICTURE ABOUT THE DATASET
1. What is the shape of the distribution? Do the values tend to fall into some recognizable pattern?
2. What is the location of the variable? That is, where are the numbers centered?
3. How much variation is involved? Are the values widely dispersed or are they all fairly close in value?

COMMON DESCRIPTIVE STATISTICS
Measures of Central Tendency / Location
- Mean
- Median
- Mode
Measures of Dispersion
- Range
- Interquartile range
- Variance
- Standard deviation
Frequency Distribution and Histogram
Measure of Association
- Correlation
Daytwo. (n.d.). Measures of central tendency: Mean, median and mode. https://ledidi.com/academy/measures-of-central-tendency-mean-median-and-mode

MEASUREMENT OF CENTRAL LOCATION
3 KINDS OF AVERAGES OF A DATASET
- Mean: the average, found by adding up all the values in a set of data and dividing the sum by the number of values. =AVERAGE(C2:C8)
- Median: the middle number when the values are sorted. =MEDIAN(C2:C8)
- Mode: the value that appears most often in the set. =MODE(C2:C8)
As an alternative to calling each function individually, the "Descriptive Statistics" option of the Excel Data Analysis ToolPak can compute the key summary statistics for a given range of data. In this course, you need to know how to use both ways to summarize data.

EXCEL DATA ANALYSIS TOOL PAK
The Analysis ToolPak is an Excel add-in that provides tools for complex data analysis. The ToolPak eliminates the need to know the detailed steps involved in executing certain calculations.
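
To cross-check the three averages outside a spreadsheet, they can be sketched with Python's standard `statistics` module. This is only an illustration: the values below are made up and merely stand in for a range such as C2:C8; they are not course data.

```python
import statistics

# Hypothetical values standing in for cells C2:C8 (not course data).
values = [2, 4, 4, 5, 7, 9, 11]

# Mean: sum of the values divided by how many there are (like =AVERAGE).
mean_value = statistics.mean(values)

# Median: the middle value once the data is sorted (like =MEDIAN).
median_value = statistics.median(values)

# Mode: the value that appears most often (like =MODE).
mode_value = statistics.mode(values)

print("mean:", mean_value)
print("median:", median_value)
print("mode:", mode_value)
```

With these seven values the sum is 42, so the mean is 6; the middle value of the sorted list is 5; and 4 is the only value that occurs twice.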
ENABLE DATA ANALYSIS TOOL PAK ADD-IN ON EXCEL
If you have never used the Data Analysis ToolPak, it is probably inactive in your copy of Excel.
1. Click on the File tab.
2. Click on Options.
3. Click on "Add-Ins" (left-hand side).
4. Find the Manage option located at the bottom.
5. Select "Excel Add-ins" from the drop-down and click Go.
6. Check Analysis ToolPak and click OK.

MEASURES OF CENTRAL TENDENCY / LOCATION – 2 WAYS
Excel Built-in Functions
- Mean: =AVERAGE(select range)
- Median: =MEDIAN(select range)
- Mode: =MODE(select range)
Excel Data Analysis ToolPak
- Click on "Data Analysis" in the Data tab
- Click on Descriptive Statistics
- Select input range
- Select "Summary Statistics"
- Click on OK
- A list of statistics will be created on a new worksheet

MEASURE OF DISPERSION
SCATTEREDNESS OF THE DATA SERIES AROUND ITS AVERAGE
1. Range
2. Interquartile range
3. Variance
4. Standard deviation
Frequency Distribution and Histogram
Find Clarity in the concept of dispersion. (2022, March 21). Unacademy. https://unacademy.com/content/cbse-class-11/study-material/mathematics/concept-of-dispersion/

MEASURE OF DISPERSION
SCATTEREDNESS OF THE DATA SERIES AROUND ITS AVERAGE
Range
It is the simplest possible measure of dispersion and is defined as the difference between the largest (L) and smallest (S) values of the variable:
R = L - S
Example: Imagine you're analyzing the salaries of employees in a company. If the range of salaries is wide, there is a significant difference between the lowest-paid and highest-paid employees. If the range is narrow, most salaries are relatively close to each other.

MEASURE OF DISPERSION
SCATTEREDNESS OF THE DATA SERIES AROUND ITS AVERAGE
Standard Deviation
The standard deviation tells you how much your data points tend to deviate from the mean on average.
Interpretation: If the standard deviation is low, most of the data points lie close to the mean; we expect them to be clustered around the mean. If the standard deviation is high, we expect the data points to be spread far from the mean.

MEASURE OF DISPERSION
SCATTEREDNESS OF THE DATA SERIES AROUND ITS AVERAGE
Variance
The average of the squared differences from the mean. Informally, it estimates how far a set of (random) numbers is spread out from its mean value. A lower variance indicates that the values are less spread out.

MEASURE OF DISPERSION
SCATTEREDNESS OF THE DATA SERIES AROUND ITS AVERAGE
Expect interpretation questions on Quiz 1
- Range: difference between the largest and smallest values of the variable
- Standard deviation: measures the dispersion of a dataset relative to its mean. =STDEV.S(C2:C8)
- Variance: average squared difference of the scores from the mean. =VAR.S(C2:C8)
STDEV.S and VAR.S compute the sample versions, which divide by n - 1 rather than n.
As an alternative to calling each function individually, the "Descriptive Statistics" option of the Excel Data Analysis ToolPak can compute the key summary statistics for a given range of data.

MEASURE OF DISPERSION
FREQUENCY DISTRIBUTION
Frequency is the number of times a particular value appears in the collected data. A frequency distribution, also known as a frequency table, is an orderly arrangement of data classified according to the size of the observations. When the data are grouped into classes of appropriate size and the number of observations in each class is recorded, we get a frequency distribution.
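
The dispersion measures and the grouping into classes described above can be sketched in Python with the standard library. Everything specific here is an assumption chosen for illustration: the scores are made up, the class width is set to 10, and the first lower class limit is 0. `statistics.variance` and `statistics.stdev` use the sample (n - 1) definitions, the same convention as VAR.S and STDEV.S.

```python
import statistics
from collections import Counter

# Hypothetical scores (not course data).
scores = [12, 25, 31, 38, 44, 47, 52, 58, 63, 70]

# Range: largest value minus smallest value (R = L - S).
value_range = max(scores) - min(scores)

# Sample variance and sample standard deviation (both divide by n - 1).
sample_variance = statistics.variance(scores)
sample_std_dev = statistics.stdev(scores)

# Frequency distribution: assign each score to the class whose lower limit is
# the score rounded down to a multiple of the (assumed) class width of 10.
class_width = 10
counts = Counter((score // class_width) * class_width for score in scores)

# Each class's share of all observations: class frequency / sum of frequencies.
total = sum(counts.values())
frequency_table = [
    (lower, lower + class_width - 1, counts[lower], counts[lower] / total)
    for lower in sorted(counts)
]

print("range:", value_range)
print("sample variance:", round(sample_variance, 2))
print("sample std dev:", round(sample_std_dev, 2))
for lower, upper, freq, share in frequency_table:
    print(f"{lower}-{upper}: frequency={freq}, share={share:.0%}")
```

Only classes that contain at least one score appear in the table; a class with no observations would need to be added explicitly if it should be listed with a frequency of zero.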
MEASURE OF DISPERSION: CONSTRUCTING A FREQUENCY DISTRIBUTION
Lower class limits: the smallest numbers that can belong to each class.
Upper class limits: the largest numbers that can belong to each class.
Class width: the difference between two consecutive lower class limits (10 for every class in the slide's example).

MEASURE OF DISPERSION: CONSTRUCTING A FREQUENCY DISTRIBUTION
1. Decide on the number of classes (should be between 5 and 20).
2. Calculate the class width and round up: class width ≈ (maximum value – minimum value) / (number of classes).
3. Starting point: choose a lower limit for the first class.
4. Using the lower limit of the first class and the class width, list the remaining lower class limits.
5. List the lower class limits in a vertical column, then enter the upper class limits.
6. Go through the data set, putting a tally in the appropriate class for each data value.

MEASURE OF DISPERSION: RELATIVE FREQUENCY DISTRIBUTION
With a relative frequency distribution, we want to know the percentages rather than the counts.
Relative frequency = (class frequency) / (sum of all frequencies)
The relative frequencies always sum to one, since the sum of all fractional parts must equal the whole. We often multiply the relative frequencies by 100 to express them as percentages.

The End
What did the data say when it was drunk?

WEEK 3
Descriptive Analytics using Excel

PERCENTILES
Percentiles are a way of expressing how a particular value in a data set compares to the rest of the values.
Essentially, a percentile tells you what percentage of the data falls below a certain value. For example, if the 75th percentile of a dataset is 1400, then 75% of the data falls below 1400 and 25% of the data is above it.
Roell, K. (2019, July 3). How to understand score percentiles. ThoughtCo. https://www.thoughtco.com/how-to-understand-score-percentiles-3211610

PERCENTILE
Example – Income Percentiles: If your family's income is in the 90th percentile, your family makes more money than 90% of other families.
Example – Height Percentiles: You are the fourth tallest person in a group of 20, so 16 of the 20 people (80%) are shorter than you. That means you are at the 80th percentile.

PERCENTILE
=PERCENTILE: This function comes from older versions of Excel and is still available for backward compatibility.
=PERCENTILE.INC: This is the newer version of the PERCENTILE function, with the same functionality. It is the most commonly used function for calculating a percentile. The value of k ranges from 0 to 1, inclusive.
=PERCENTILE.EXC: This function was introduced in newer versions of Excel. The value of k ranges from 0 to 1, exclusive of 0 and 1.

PERCENTILE
Given that the 80th percentile of the number of companies worked is 4.8, what can be inferred about the distribution of employees' tenure and mobility within the workforce?

QUARTILE
Quartiles are closely related to percentiles. Quartiles break the data into four parts.
The 25th percentile is called the first quartile, Q1;
The 50th percentile is called the second quartile, Q2;
The 75th percentile is called the third quartile, Q3; and
The 100th percentile is the fourth quartile, Q4.
One-fourth of the data fall below the first quartile, one-half are below the second quartile, and three-fourths are below the third quartile.
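The percentile and quartile definitions above can be checked with a few lines of Python. This is a minimal sketch under stated assumptions, not the course's method: the data values and heights are made up, `percent_below` is a hypothetical helper, and the `inclusive` and `exclusive` methods of `statistics.quantiles` are assumed here to correspond to Excel's .INC and .EXC functions, which is worth verifying against Excel on your own data.

```python
import statistics

# Made-up observations; in Excel these would sit in a cell range.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Quartile cut points Q1, Q2, Q3 with the inclusive method
# (assumed to correspond to PERCENTILE.INC / QUARTILE.INC).
q1_inc, q2_inc, q3_inc = statistics.quantiles(data, n=4, method="inclusive")

# The same cut points with the exclusive method
# (assumed to correspond to PERCENTILE.EXC).
q1_exc, q2_exc, q3_exc = statistics.quantiles(data, n=4, method="exclusive")


def percent_below(values, x):
    """Return the percentage of values strictly below x (hypothetical helper)."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)


# Height example: a group of 20 people in which you are the fourth tallest.
heights = list(range(150, 190, 2))  # 20 distinct made-up heights in cm
fourth_tallest = sorted(heights, reverse=True)[3]

print("Inclusive quartiles:", q1_inc, q2_inc, q3_inc)
print("Exclusive quartiles:", q1_exc, q2_exc, q3_exc)
print("Percent shorter than the fourth tallest:", percent_below(heights, fourth_tallest))
```

With these nine values the inclusive quartiles come out as 3, 5 and 7 and the exclusive ones as 2.5, 5 and 7.5, so the two methods can disagree on small data sets. The height example reports 80, matching the 80th percentile in the slide.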
You can use Excel's QUARTILE.INC function to find different data quartiles. Quartiles are often used in sales and survey data to divide populations into groups. For example, you can use QUARTILE to find the top 25 percent of incomes in a population.
“Fig. 2 Relationship of Quartiles and Inter-Quartile Range. Legends: Q 1...” ResearchGate, ResearchGate, 2018, www.researchgate.net/figure/Relationship-of-quartiles-and-inter-quartile-range-Legends-Q-1-first-quartile-Q-3_fig2_324532937. Accessed 19 Sept. 2023.

QUARTILE
Given that the third quartile (Q3) value for Job Satisfaction score is 8, what conclusions can be drawn about job satisfaction among employees?

CORRELATION
Correlation measures the linear relationship between two variables. By measuring and relating the variance of each variable, correlation gives an indication of the strength of the relationship. How much does variable A (the independent variable) explain variable B (the dependent variable)?
Kramer, L. (2021, December 18). How Can You Calculate Correlation Using Excel? Investopedia. https://www.investopedia.com/ask/answers/031015/how-can-you-calculate-correlation-using-excel.asp

CORRELATION
The CORREL function returns the correlation coefficient of two cell ranges.

CORRELATION
Using the Data Analysis ToolPak:
1. Select “Data Analysis” in the top right-hand corner of the Data tab.
2. Select Correlation.
3. Define your data range and output.
4. Evaluate your correlation coefficient.

INTERPRETING CORRELATION COEFFICIENTS
Values can range from -1 to +1.
1 means there is a perfect positive relationship. As one thing increases, the other thing also increases in a perfectly consistent way.
0 means there is no linear relationship. Changes in one thing do not predict changes in the other.
-1 means there is a perfect negative relationship. As one thing increases, the other thing decreases in a perfectly consistent way.
The Correlation Coefficient (r). (n.d.). https://sphweb.bumc.bu.edu/otlt/MPH-Modules/PH717-QuantCore/PH717-Module9-Correlation-Regression/PH717-Module9-Correlation-Regression4.html

PIVOT TABLES
A PivotTable is a powerful tool to calculate, summarize, and analyze data that lets you see comparisons, patterns, and trends in your data.

PIVOT TABLES – CASE STUDY
Create a pivot table for the U.S. Voters dataset and use it to answer the following:
1. How many states had a Voter Population % below 55%? Which states?
2. How many confirmed voters in CA were over 65 years old in 2012? What percentage does that represent out of the total confirmed voters in CA? What percentage out of the confirmed voters in the entire country?
3. Show both Citizen Population and Confirmed Voters by Age, as % of Column Total. What percentage of the citizen population do 45- to 64-year-olds represent? What percentage of the confirmed voter population?
4. As a politician seeking to improve voter turnout rates among young adults (18-24), which states would you target first?

QUIZ NEXT CLASS
In class, using LockDown Browser (ensure your devices are set up properly; no extra time will be given).
You need to be physically present in the classroom to write the quiz.
50 points, worth 15%.
Mix of MCQ, WR, TF, Matching, Interpretation and Analysis.
Covers Week 1 - Week 3 content.

The End
My exam question was, what is plagiarism? So I copied my answer from the person beside me!
