Introduction to Data Analytics PDF | Algonquin College

Summary

This document from Algonquin College introduces the field of data analytics, covering key concepts such as quantitative and qualitative data and the CRISP-DM methodology. It explores machine learning techniques and delves into a Titanic dataset analysis to illustrate the principles of business understanding and data preparation. The document provides an accessible overview suitable for those new to the subject of data analysis.

Full Transcript

CST8390 BUSINESS INTELLIGENCE & DATA ANALYTICS
Week 1

Introduction to Data Analytics

Data analytics refers to the set of quantitative and qualitative approaches for deriving valuable insights from data. It involves many processes, including extracting data and categorizing it, in order to derive patterns, relations, connections, and other valuable insights from it.
https://intellipaat.com/blog/what-is-data-analytics/#no1

Quantitative Data

(Data that we can count numerically, such as money or temperature.)

Definition: Quantitative data refers to any information that can be measured and expressed numerically, often used for statistical analysis. This type of data includes quantities, amounts, or ranges. The two main types of quantitative data are discrete data and continuous data.

Examples:
- Annual income in dollars
- Height in cm
- Volume of water in litres
- Number of items sold per day
- Temperature in degrees Celsius

Qualitative Data

(Data that we cannot count or measure numerically.)

Definition: Qualitative data is descriptive, non-numeric information that focuses on characteristics and concepts rather than statistics and numbers. It can't be measured, counted, or expressed numerically. Qualitative data is a type of categorical data.

Examples:
- Case studies
- Colors
- Blood type
- Gender
- Categories of plants
- Pass/fail

Mountains of Data

More data is now being gathered and collected. Governments are starting to adopt openness policies that make public data freely available on the internet.

- Canada Open Government: http://open.canada.ca/en
- Seattle Open Data: https://data.seattle.gov/
- Ontario Open Data: https://www.ontario.ca/search/data-catalogue
- Ottawa Open Data: http://data.ottawa.ca/

Seattle Bicycle Traffic

https://www.seattle.gov/transportation/projects-and-programs/programs/bike-program/bike-data

A traffic counter counts the number of bicycles on the East and West sidewalks.
There are traffic spikes from 7-9 am on the West side, and from 5-6 pm, but only 5 days a week.

Profit

As an entrepreneur, where would you sell hot dogs, or advertise?
http://www.blogto.com/eat_drink/2015/07/everything_to_know_about_hot_dog_stands_in_toronto/

Healthcare

- The Real-World Benefits of Machine Learning in Healthcare
- Machine Learning Healthcare Applications – 2018 and Beyond

CRISP-DM

CRoss-Industry Standard Process for Data Mining

CRISP-DM organizes the data mining process into six phases: business understanding, data understanding, data preparation, modelling, evaluation, and deployment.

| Business Understanding | Data Understanding | Data Preparation | Modeling | Evaluation | Deployment |
|---|---|---|---|---|---|
| Determine Business Objectives | Collect Initial Data | Select Data | Select Modeling Technique | Evaluate Results | Plan Deployment |
| Assess Situation | Describe Data | Clean Data | Generate Test Design | Review Process | Plan Monitoring & Maintenance |
| Determine Data Mining Goals | Explore Data | Construct Data | Build Model | Determine Next Steps | Produce Final Report |
| Produce Project Plan | Verify Data Quality | Integrate Data | Assess Model | | Review Project |
| | | Format Data | | | |

Titanic Dataset Analysis Using CRISP-DM

1. Business Understanding

Objective: Identify what you want to accomplish from a business perspective. In the case of the Titanic survival prediction, the business objective is to predict the likelihood of survival for each passenger based on their attributes (e.g., age, gender, class).

Assess Situation:
- Determine the availability of resources.
- Assess risks.
- Think about contingency plans for risks.

Determine Goal:
- Build a model to predict which Titanic passengers survived based on available features (gender, age, class, etc.).
- Identify key features that contribute to survival, such as class, age, or gender.
- Provide insights into how these factors might influence survival chances in similar real-world situations.
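One of the goals above is to identify which attributes relate to survival. As a rough first look, a minimal sketch like the following compares survival rates across groups. The rows are made-up illustrative records, not real Titanic data, and the helper name `survival_rate_by` and the field names are hypothetical.

```python
from collections import defaultdict

# Made-up illustrative rows (NOT real Titanic records). The field names
# are assumptions that mirror the attributes named in the goal.
passengers = [
    {"sex": "female", "age": 29, "pclass": 1, "survived": 1},
    {"sex": "male", "age": 40, "pclass": 3, "survived": 0},
    {"sex": "female", "age": 8, "pclass": 2, "survived": 1},
    {"sex": "male", "age": 35, "pclass": 1, "survived": 1},
    {"sex": "male", "age": 22, "pclass": 3, "survived": 0},
    {"sex": "female", "age": 50, "pclass": 3, "survived": 0},
]


def survival_rate_by(records, key):
    """Return {group value: fraction survived} for the given attribute."""
    totals = defaultdict(int)
    survivors = defaultdict(int)
    for row in records:
        totals[row[key]] += 1
        survivors[row[key]] += row["survived"]
    return {group: survivors[group] / totals[group] for group in totals}


rates_by_sex = survival_rate_by(passengers, "sex")
rates_by_class = survival_rate_by(passengers, "pclass")
print(rates_by_sex)
print(rates_by_class)
```

A large gap between groups only suggests that an attribute may be worth keeping as a feature; it says nothing about cause, and six toy rows are far too few to conclude anything.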
Produce Project Plan

Data Understanding

Collect initial data: Gather the Titanic dataset. It is good practice to record where and when the data was downloaded, the period it covers, and any links or sources associated with it.

Columns: Passenger ID, Name, Sex, Age, Fare, Survived

Describe data: check quantity of data, data format, consistent coding schemes.

| Attribute | Description | Datatype |
|---|---|---|
| Passenger ID | Unique identifier for passengers on the Titanic | String |
| Name | Name of the passenger | String |
| Sex | Gender of the passenger | Categorical |
| Age | Age of the passenger | Numeric |
| Fare | Ticket fare paid by the passenger | Numeric |
| Survived | Survival status (1 = survived, 0 = did not survive); this is the target class | Categorical |

Explore data: visualize data, identify relationships among data, query data, etc.
https://mpolinowski.github.io/docs/Development/Python/2023-05-12-matplotlib-seaborn-titanic-dataset/2023-05-12/

Data Quality

Verify data quality: how clean is the data? Is there any noise? Look for missing data, duplicates, errors, measurement errors, inconsistent representation, etc.

| Attribute | Data Quality Issue |
|---|---|
| Passenger ID | 1% of data missing, 5 duplicates |
| Name | No issue |
| Sex | Values are written as Male, M, m (inconsistent representation) |
| Age | Values in decimal (e.g., 23.5), 20% of data missing |
| Fare | 0 fare (a possible data entry error or missing value that should be investigated, or some passengers had a free ticket) |
| Survived | No issue |

Data Preparation

How can we organize the data to perform modeling?

Select data: determine which data will be used and document reasons for inclusion/exclusion.

| Attribute | Reason | Decision |
|---|---|---|
| Passenger ID | A unique identifier that has no relationship to survival. It's only useful for data organization or merging datasets. | Irrelevant |
| Name | The passenger's name alone cannot predict whether they survived. | Irrelevant |
| Sex | Gender played a significant role in survival outcomes due to "women and children first" policies. | Relevant |
| Age | Age influences survival likelihood, especially for children (higher priority for lifeboats). | Relevant |
| Fare | Reflects socio-economic status (SES). Higher fares might correlate with access to better cabins closer to lifeboats. | Relevant |
| Survived | Target variable indicating survival (1 = survived, 0 = did not survive). This is what we aim to predict. | Relevant |

- Clean data: solve all data quality issues
- Construct data: derive new attributes
- Integrate data: create new datasets by combining data from multiple sources
- Format data: re-format data as necessary (discretization, etc.)

The lengthiest process! When getting data from different sources, some work is needed when putting it together:

- Cleaning and filtering: Remove duplicate data, handle missing data, and resolve incomplete data. For example, "Woodroffe Ave", "Woodroffe", and "Woodroffe Avenue" should all be the same.
- Remove outliers (data that is far outside the average): Every semester, some students register for a course but don't drop it. This means they get 0 for everything, which lowers the class average. Another example is that sales for a store are $0 on some regional holidays.
- Variable transformations: Changing how variables are represented (metric/imperial).

You get a data matrix with variables, attributes, or features as the columns. Instances or records are the rows (N).

| Instance | Date | Price | Quantity | Label |
|---|---|---|---|---|
| 1 | May 2, 2016 | 5.50 | 1 | Regular |
| 2 | May 28, 2016 | 3.79 | 2 | Sale |

The data are typically brought together from various parts of the organization. They must be transformed into a single data format; for instance, prices must all be in $USD, or temperatures must all be in Celsius instead of Fahrenheit.

Data Cleaning

How do you detect outliers? One method is to sort the data.
The outliers will be at either end of the sorted sequence.

For tagged data, make all similar tags the same: Woodroffe Ave.

What about missing data?
- Replace with random numbers drawn from the average and standard deviation?
- Replace with a "Missing" or "Unknown" tag.

Modeling

What modelling technique should we apply?

- Select modelling technique: determine which algorithms to try
- Generate test design: how to split data for training, testing, validation, etc.
- Build model: build the chosen models
- Assess model: check generated models, apply domain knowledge to interpret the results

Evaluation

Which model best meets the business objectives?

- Evaluate results: do the models meet the business criteria? Which ones should we approve?
- Review process: review the work.
- Determine next steps: determine whether to proceed or iterate further, etc.

Machine Learning

- Supervised learning: classification, regression
- Unsupervised learning: clustering, outlier detection
- Semi-supervised learning

Supervised Learning: Classification

- Data has class labels.
- Based on the labels, classifiers are generated.
- New data will be classified based on the generated classifier.
- Predicts a discrete class label.

Example 1: Cancer dataset. Malignant and benign labels are present for each instance.
Example 2: Iris dataset. Data from 3 types of flowers; every instance has a class label.
https://sebastianraschka.com/Articles/2014_intro_supervised_learning.html

Supervised Learning: Regression

Regression predicts continuous values (numbers) as the output. For example, given housing prices for various houses (2 bedroom, 3 bedroom, garage size, property size), the computer must interpolate predictions.
https://medium.com/hanman/data-modeling-building-a-house-price-prediction-model-1450f825073b

Unsupervised Learning

- Data has no class labels.
- The algorithm tries to identify the objects as being part of some group using a clustering algorithm. Similar instances are grouped together to form clusters. (Ex. Insurance: identifying groups of motor insurance policy holders with a high average claim cost.)
- Anomaly detection tries to find those instances that are distinct from the majority of instances. (Ex. Financial fraud detection.)

Semi-supervised Learning

Typically, a small amount of labeled data with a large amount of unlabeled data.

Questions

Thank You
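To make the anomaly detection idea from the Unsupervised Learning slide more concrete, here is a minimal sketch that flags values far from the majority using a simple z-score rule. The claim amounts are made up, and the threshold of 2 standard deviations and the function name `find_anomalies` are assumptions chosen for illustration. Real fraud detection relies on much richer methods.

```python
import statistics


def find_anomalies(values, threshold=2.0):
    """Return values whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mean) / spread > threshold]


# Made-up claim amounts in dollars (illustrative only, not real data).
claims = [120, 135, 128, 131, 125, 990]
anomalies = find_anomalies(claims)
print(anomalies)
```

Note that no labels are used: the rule only looks at how far each value sits from the rest. A single extreme value also inflates the mean and standard deviation, which is one reason a median-based rule is often preferred on small samples.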
