Text and Image Classification
Document Details
Aligarh Muslim University
Dr. Mohammad Nadeem
Summary
This document collects lecture slides on text and image classification. It compares traditional feature-based methods with deep learning approaches using convolutional neural networks (CNNs) and recurrent neural networks (RNNs), and covers concepts such as bag of words, vocabulary mapping, and text preprocessing. The transcript also includes companion lectures on deep learning, an end-to-end machine learning project, Python and Google Colab, popular datasets, data science fundamentals, and introductory statistics.
Full Transcript
Text classification Dr. Mohammad Nadeem Department of Computer Science Aligarh Muslim University Copyright © Dept of Computer Science, AMU, Aligarh. Permission required for reproduction or display. Recap Image classification Traditional approach Convolutional neural networks Content Text classification Traditional approach Recurrent neural networks Classification Example: Whether game can be played tomorrow? Wind Temperature Game speed Possible? 34 27 Yes 50 35 No 43 33 Yes - - - - - - 39 22 ? Text classification Example: Whether the given email is a spam email? Email Spam/Not spam Dear Lucky Winner, You have been randomly selected in our Spam Big Money Giveaway! Click the link below to claim your prize of $1,000,000 immediately. Hi Team, This is a reminder for our weekly project sync-up Not Spam meeting scheduled for tomorrow at 3 PM in the main conference room. -------- ------ -------- ------ Hello, Are you suffering from any chronic pain! Our new miracle drug cures all diseases. Order now and receive a 50% ? discount on your first purchase. Text classification Example: Identify whether the given social media post exhibits suicidal tendency? Post Suicidal? Note vs no note. Would you want to be left one? Is it Yes inflicting more damage to leave a note for someone, or is it beneficial for closure? Is there any town in USA which is abandoned and where No people don’t visit it? -------- ------ -------- ------ I am leaving for Germany and will never come back! ? Traditional way Text Class x1 x2 --- xn Class text1 Yes - - - - Yes text2 No - - - - No -- -- - - - - - - - - - -- -- ? Bag of Words Email text: Attention please, This is to bring to your kind attention that the World Bank in affiliation with the (IMF) has sanctioned some African countries to compensate the scam victims, including people that had an unfinished transaction or international business that failed due to Government problems or due to corrupt Government officials. The compensation includes those that had lost their hard-earned money to scammers. Each of the victims will be compensated with the sum of $4.8 Million US dollar, to ensure that justice is done to the scam victims. This is as result of numerous reports of frauds perpetuated from some African countries. There have been reports that the victims had lost billions of dollars to the scammers, with the United States particularly targeted the most. Preprocessing One case: Convert email text either in lower-case or Upper-case. Treatment of Numbers: either remove them or convert them into ‘NUMBER’. Stemming of Words: all forms of words are reduced to basic form. Ex: ‘award’, ‘awarded’, ‘awarding’ -> ‘award’ Handling of non-words: extra spaces, punctuations etc. are removed. Removing stop-words: the, is, at, which, and. Preprocessing > Anyone knows how much it anyone know how much cost costs to host a web portal ? host web portal well depend how > many visitor expect anywhere Well, it depends on how many less than number buck visitors youre expecting. This month couple of dollar you can be anywhere from less should checkout httpaddr than 10 bucks a month to a couple of $100. You should perhaps Amazon if run checkout something big unsubscribe http://www.space.com/ or yourself from mail list send an perhaps Amazon EC2 if youre email to emailaddr running something big.. To unsubscribe yourself from this mailing list, send an email to: groupname- [email protected] Vocabulary Each pre-processed email in training set is index words broken into individual words, called tokens. 
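As a rough illustration of the preprocessing and tokenization steps described above, the following Python sketch lower-cases the text, maps URLs, e-mail addresses and numbers to placeholder tokens, strips punctuation, and removes a few stop-words. The regular expressions and the tiny stop-word list are illustrative assumptions rather than the lecture's exact pipeline; stemming (e.g. 'awarded' -> 'award') would normally be added with a library such as NLTK.

import re

STOP_WORDS = {"the", "is", "at", "which", "and", "a", "to", "of"}  # small illustrative list

def preprocess(text):
    """Lower-case, normalise numbers/URLs/e-mails, strip punctuation, drop stop-words."""
    text = text.lower()                           # one case
    text = re.sub(r"http\S+", "httpaddr", text)   # URLs -> placeholder token
    text = re.sub(r"\S+@\S+", "emailaddr", text)  # e-mail addresses -> placeholder token
    text = re.sub(r"\d+", "number", text)         # digits -> 'number'
    text = re.sub(r"[^a-z\s]", " ", text)         # remove punctuation and other non-words
    tokens = text.split()                         # break into individual tokens
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("Anyone knows how much it costs to host a web portal? Less than 10 bucks!"))
# ['anyone', 'knows', 'how', 'much', 'it', 'costs', 'host', 'web', 'portal', 'less', 'than', 'number', 'bucks']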
1 aa All tokens are combined. 2 ab 3 able The set of unique words is created....... 86 anyone Called vocabulary....... 2000 zip It serves as a foundation. Mapping Mapping of Vocabulary Spam Email Given Email anyone know how much cost index word X Value host web portal well depend how 1 aa X1 0 many visitor expect anywhere less than number buck 2 ab X2 0 month couple of dollar you 3 able X3 0 should checkout httpaddr perhaps Amazon if run............ something big unsubscribe 86 anyone X86 1 yourself from mail list send an email to emailaddr............ 2000 zip X2000 0 Y 1 (spam) Mapping X1 X2... X2000 Y Email #1 0 1.... 1 1 Email #2 0 0.... 0 0 Email #3 0 0.... 1 0 Email #4 1 0.... 0 0............................................................ Email #n 1 1.... 0 0 Deep Learning Word Embedding is created. Models are used that can process sequential data. Recurrent neural networks are common choice for text classification. Recurrent Neural Network (RNN) a class of deep neural networks. designed to recognize patterns in sequences of data such as text, speech etc. Traditional neural networks, which process inputs independently. RNNs have loops within them. Loops make RNNs particularly useful for sequential tasks. RNN Types One-to-many One-to-one Many-to-many Many-to-one RNNs for Text Classification Preprocessing: o Tokenization o Vectorization o Padding RNN Model: o Input layer o Recurrent Layers LSTM BiLSTM o Output Layer Summary Text classification has many applications. Traditional approach and CNNs do not perform well in complex sequential tasks. RNN based deep learning approach is efficient. Image classification Dr. Mohammad Nadeem Department of Computer Science Aligarh Muslim University Copyright © Dept of Computer Science, AMU, Aligarh. Permission required for reproduction or display. Recap Deep Learning Importance Difference with machine learning Content Image classification Traditional approach Convolutional neural networks Classification Example: Whether game can be played tomorrow? Wind Temperature Game speed Possible? 34 27 Yes 50 35 No 43 33 Yes - - - - - - 39 22 ? Image classification Example: Whether the given image is a dog? Image Dog? Yes No -- -- ? Image classification Example: Identify the emotion? Traditional way Image Dog? x1 x2 --- xn Dog? Yes - - - - Yes - - - - No - - - - - No - - - - ? -- -- ? Traditional way How to put it in a classification problem format: o Every grayscale image is a 20x20 pixel image, total 400 pixels. o Each pixel value can be considered an independent variable. P1 P2 P3............. P400 Dog? 255 89 0............. 67 1 0 46 104............. 34 0 - - -............. - - - - -............. - - Dataset Traditional way How to put it in a classification problem format: o Every colored image is a 20x20 pixel image, total 1,200 pixels. o Each channel of each pixel can be considered an independent variable. P1 P2 P3............. P1200 Dog? 255 89 0............. 67 1 0 46 104............. 34 0 - - -............. - - - - -............. - - Dataset Deep Learning No artificial fitting of pixel values to classification data format. Raw image can be input directly into the model, and the output is the classification result (End-to-End Learning). Simplified workflow. CNNs can capture complex patterns and relationships in the image. Convolutional Neural Networks a class of deep neural networks. used extensively in the field of computer vision. for image and video recognition, image classification, object detection, and many others. 
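Returning to the RNN-based text classification pipeline outlined earlier (tokenization, vectorization, padding, then an embedding layer, a recurrent layer and an output layer), a minimal Keras sketch might look as follows. The toy corpus, the padding length of 10 and the 32 units are assumptions chosen purely for illustration, not the lecture's exact model.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Toy corpus and labels (1 = spam, 0 = not spam), invented for illustration.
texts = ["claim your prize now", "weekly project meeting tomorrow"]
labels = np.array([1, 0])

# Tokenization and vectorization: map each word to an integer index (0 is reserved for padding).
vocab = {w: i + 1 for i, w in enumerate(sorted({w for t in texts for w in t.split()}))}
sequences = [[vocab[w] for w in t.split()] for t in texts]

# Padding: make every sequence the same length.
maxlen = 10
X = np.array([seq + [0] * (maxlen - len(seq)) for seq in sequences])

# RNN model: embedding layer -> recurrent layer -> output layer.
model = Sequential([
    Embedding(input_dim=len(vocab) + 1, output_dim=32),  # learned word embeddings
    LSTM(32),                                            # wrap in Bidirectional() for a BiLSTM
    Dense(1, activation="sigmoid"),                      # spam / not spam
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, labels, epochs=2, verbose=0)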
learn spatial hierarchies of features from input images. Components of CNNs Convolutional Layer Activation Function Pooling Layer Flatten Layer Fully Connected Layer Dropout Layer Normalization Layers (optional) Components of CNNs Working of CNNs Feature learning task Classification task Advantages: o Parameter sharing o Sparsity of connections Summary Image classification has many applications. Traditional approach does not perform well in complex tasks. CNN based deep learning approach is efficient. Deep Learning Dr. Mohammad Nadeem Department of Computer Science Aligarh Muslim University Copyright © Dept of Computer Science, AMU, Aligarh. Permission required for reproduction or display. Recap Machine Learning Supervised Learning Unsupervised Learning Content (Week) Deep Learning CNN RNN Image classification Text classification Content Deep Learning Importance Difference with machine learning Introduction A subset of machine learning. Uses neural networks with many layers (deep architectures) to learn from large amounts of data. Artificial neurons, layers, and activation functions are still the key components. Many benefits over traditional machine learning. Process Layers of connected nodes. Forward Propagation: Data inputs are processed layer by layer to make predictions. Backpropagation: Adjusting weights in the network based on the error of predictions. TensorFlow, Keras etc. simplify the development of deep learning models. Applications Computer Vision Speech Recognition and Generation Natural Language Processing (NLP) Autonomous Vehicles Recommendation Systems Machine Learning vs Deep Learning Core: o Data Handling o Feature Engineering o Complexity Training: o Model Architecture o Processing Power and Data Requirements o Learning Process o Accuracy Over Time Machine Learning vs Deep Learning Practical Implications : o Application Fields o Performance o Development Time and Resources o Interpretability Machine Learning vs Deep Learning More hyper-parameters: o Type of Layers o Activation Function o Batch Size o Optimizer o Dropout Rate Example X1 X2 X3............. Xn Dog? 1 1 0............. 1 Yes 0 1 0............. 0 No - - -............. - - - - -............. - - Assessment Handling Unstructured Data Automated Feature Extraction Complex Pattern Recognition Adaptability Technological factors Technological Advancements Increased Computational Power Big Data Explosion Improved Algorithms Cloud Computing Open Source Ecosystem Famous DNN architectures CNN RNN o LSTM o BiLSTM Transformers GAN Future Revolutionizing Industries Creating New Possibilities Economic and Social Impact Continuous Improvement Summary Deep Learning is an advanced form of machine learning. Used for complex tasks. Have more hyper-parameters. Many DNN architectures. End to End Project (Regression Task) Dr. Mohammad Nadeem Department of Computer Science Aligarh Muslim University, Aligarh-202002 1 Copyright © Dept of Computer Science, AMU, Aligarh. Permission required for reproduction or display. California Housing Prices Dataset The dataset contains median house prices for California districts derived from the 1990 census. It serves as an excellent introduction to implementing machine learning algorithms. The objective of the dataset is to predict median house value. 
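For reference back to the CNN components listed above (convolutional, activation, pooling, flatten, fully connected and dropout layers), the following Keras sketch assembles them into a small binary image classifier. The 20x20 grayscale input matches the toy example in these slides, while the filter counts and layer sizes are assumptions chosen purely for illustration.

from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential([
    Input(shape=(20, 20, 1)),                 # 20x20 grayscale input, as in the toy example
    Conv2D(16, (3, 3), activation="relu"),    # convolutional layer + activation function
    MaxPooling2D((2, 2)),                     # pooling layer
    Conv2D(32, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Flatten(),                                # flatten layer
    Dense(64, activation="relu"),             # fully connected layer
    Dropout(0.5),                             # dropout layer
    Dense(1, activation="sigmoid"),           # output layer: dog / not dog
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()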
Number of downloads: 136K Link to download dataset: https://www.kaggle.com/datasets/camnugent/california-housing-prices Features of California Housing Prices Dataset The datasets consist of nine (independent) variables and one target (dependent) variable, medianHouseValue (Median House Value). Total records: 20640 Features Independent variables includes: – longitude: A measure of how far west a house is; a higher value is farther west. – latitude: A measure of how far north a house is; a higher value is farther north. – housingMedianAge: Median age of a house within a block; a lower number is a newer building. – totalRooms: Total number of rooms within a block – totalBedrooms: Total number of bedrooms within a block Features Independent variables includes: – population: Total number of people residing within a block – households: Total number of households, a group of people residing within a home unit, for a block – medianIncome: Median income for households within a block of houses (measured in tens of thousands of US Dollars) – oceanProximity: Location of the house w.r.t ocean/sea Dependant Variable Dependent variable includes: – medianHouseValue: Median house value for households within a block (measured in US Dollars) Few Records of California Housing Prices Dataset longitude latitudehousing_ total_ total_bed population households median median_house ocean_proximity median_ room rooms _income _value age -122.23 37.88 41 880 129 322 126 8.3252 452600 NEAR BAY -122.22 37.86 21 7099 1106 2401 1138 8.3014 358500 NEAR BAY -122.24 37.85 52 1467 190 496 177 7.2574 352100 NEAR BAY -118.18 34.63 19 3562 606 1677 578 4.1573 228100 INLAND -118.17 34.61 7 2465 336 978 332 7.1381 292200 INLAND -118.16 34.6 2 11008 1549 4098 1367 6.4865 204400 INLAND -117.11 32.58 21 2894 685 2109 712 2.2755 125000 NEAR OCEAN -117.1 32.58 27 2616 591 1889 577 2.3824 127600 NEAR OCEAN Correlation Between Features Analysis Visualization Preprocessing – Handling null values – Apply standardization Linear Regression – Without polynomial features – With polynomial features Apply dimensionality reduction End to end Machine Learning Project Source: "Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow", Aurélien Géron, second edition, O’Reilly Dr. Faisal Anwer Department of Computer Science Aligarh Muslim University, Aligarh Copyright © Dept of Computer Science, AMU, Aligarh. Permission required for reproduction or display. Main steps for Machine learning project 1. Look at the big picture. 2. Get the data. 3. Discover and visualize the data to gain insights. 4. Prepare the data for Machine Learning algorithms. 5. Select a model and train it. 6. Fine-tune your model. 7. Present your solution. 8. Launch, monitor, and maintain your system. Frame the problem First, we need to frame the problem: is it supervised, unsupervised, or Reinforcement Learning? Is it a classification task, a regression task, or something else? Should we use batch learning or online learning techniques? Try to answer these questions for yourself. 
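The analysis plan listed above (handle null values, apply standardization, then linear regression with and without polynomial features) can be sketched with scikit-learn as follows. This is a rough sketch rather than the lecture's exact code: it assumes housing.csv has been downloaded from the Kaggle link, drops the categorical ocean_proximity column for simplicity, and evaluates both models with RMSE.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

housing = pd.read_csv("housing.csv")
housing = housing.drop(columns=["ocean_proximity"])   # keep numerical features only
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(housing["total_bedrooms"].median())  # handle nulls

X = housing.drop(columns=["median_house_value"])
y = housing["median_house_value"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Linear regression without and with degree-2 polynomial features, both on standardized inputs.
lin = make_pipeline(StandardScaler(), LinearRegression()).fit(X_train, y_train)
poly = make_pipeline(StandardScaler(), PolynomialFeatures(degree=2), LinearRegression()).fit(X_train, y_train)

for name, model in [("linear", lin), ("polynomial (deg 2)", poly)]:
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(name, "RMSE:", round(rmse, 1))

Adding degree-2 polynomial features lets the linear model capture simple feature interactions, at the cost of many more parameters.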
Working with Dataset A few places where we can get data: Popular open data repositories: —UC Irvine Machine Learning Repository —Kaggle datasets —Amazon's AWS datasets Meta portals (they list open data repositories): —http://dataportals.org/ —http://opendatamonitor.eu/ —http://quandl.com/ Other pages listing many popular open data repositories: —Wikipedia's list of Machine Learning datasets —Quora.com question —Datasets subreddit Case Study of California census data You are asked to build a model of housing prices in California using the California census data. This data has metrics such as the population, median income, median housing price, and so on for each block group in California. Block groups are the smallest geographical unit for which the US Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people). We will just call them "districts" for short. Your model should learn from this data and be able to predict the median housing price in any district, given all the other metrics. It is clearly a typical supervised learning task since you are given labeled training examples (each instance comes with the expected output, i.e., the district's median housing price). Moreover, it is also a typical regression task, since you are asked to predict a value. More specifically, this is a multiple regression problem since the system will use multiple features to make a prediction. It is also a univariate regression problem since we are only trying to predict a single value for each district. If we were trying to predict multiple values per district, it would be a multivariate regression problem. Finally, there is no continuous flow of data coming into the system, and the data is small enough to fit in memory, so plain batch learning should do just fine. Case Study of Pima Indians Diabetes Dataset The Pima Indians Diabetes Dataset, originally from the National Institute of Diabetes and Digestive and Kidney Diseases, contains information on 768 women from a population near Phoenix, Arizona, USA. The outcome tested was diabetes: 268 tested positive and 500 tested negative. Therefore, there is one target (dependent) variable and eight attributes (TYNECKI, 2018): pregnancies, OGTT (Oral Glucose Tolerance Test), blood pressure, skin thickness, insulin, BMI (Body Mass Index), age, and diabetes pedigree function. Your model should learn from this data and be able to predict whether a patient has diabetes, given the other attributes. It is clearly a typical supervised learning task since you are given labeled training examples (each instance comes with the expected output, i.e., positive (1) or negative (0)). Moreover, it is a typical classification task, since you are asked to predict one of two classes; more specifically, it is a binary classification problem that uses multiple features to make each prediction. Finally, there is no continuous flow of data coming into the system, and the data is small enough to fit in memory, so plain batch learning should do just fine. Selection of a Performance Measure A typical performance measure for regression problems is the Root Mean Square Error (RMSE). It gives an idea of how much error the system typically makes in its predictions: RMSE(X, h) = sqrt( (1/m) * sum over i of ( h(x(i)) - y(i) )^2 ), where m is the number of instances in the dataset you are measuring.
x(i) is a vector of all the feature values (excluding the label) of the ith instance in the dataset, and y(i) is its label (the desired output value for that instance). h is the system’s prediction function, also called a hypothesis. When the system is given an instance’s feature vector x(i), it outputs a predicted value h(x(i)) for that instance Selection of a Performance Measure Even though the RMSE is generally the preferred performance measure for regression tasks, in some contexts we may prefer to use another function: Mean Absolute Error (also called the Average Absolute Deviation). Both the RMSE and the MAE are ways to measure the distance between two vectors: the vector of predictions and the vector of target values. Dataset Loading A collection of data is called dataset. It is having the following two components − Features − The variables of data are called its features. They are also known as predictors, inputs or attributes. Response − It is the output variable that basically depends upon the feature variables. They are also known as target, label or output. Loading Dataset Let’s load the data using Pandas. You should write a small function to load the data: If loading dataset in google colab from google.colab import files k=files.upload() import pandas as pd housing = pd.read_csv('housing.csv') Look at the data structure of dataset print(housing.head(10)) --- displays 10 records with header print(housing.info()) ---The info() method is useful to get a quick description of the data, in particular the total number of rows, and each attribute’s type and number of non-null values print(housing["ocean_proximity"].value_counts()) --- shows count of each category of feature: ocean_proximity print(housing.describe()) ----- shows a summary of the numerical attributes. print(housing.iloc[:, 0:9]) --- shows arbitrary rows and columns. Visualizing Dataset Histogram housing.hist(bins=50, figsize=(20,30)) plt.show() To display histogram of 50 bins Box and Whisker Plots housing.boxplot("total_bedrooms") plt.show() Heat Map k=housing.corr() sb.heatmap(k) Scatter Plot import pandas as pf from pandas.plotting import scatter_matrix pd = pf.read_csv('PIMA_diabetes.csv') scatter_matrix(pd) plt.show() Scatter Matrix Plot Scatter plots shows how much one variable is affected by another or the relationship between them with the help of dots in two dimensions. Scatter plots are very much like line graphs in the concept that they use horizontal and vertical axes to plot data points. import pandas as pf from matplotlib import pyplot as plt from pandas.plotting import scatter_matrix import seaborn as sb pima = pf.read_csv('PIMA_diabetes.csv') scatter_matrix(pima, figsize=(18, 12)) plt.show() If you want to focus on a few promising attributes attributes = ["Pregnancies", "Glucose", "BloodPres sure","Insulin"] scatter_matrix(pima[attributes], figsize=(18, 12)) plt.show() Data Cleaning Most Machine Learning algorithms cannot properly work with missing features. You have three options: Get rid of the corresponding districts. Get rid of the whole attribute. Set the values to some value (zero, the mean, the median, etc.). You can accomplish these easily using DataFrame’s dropna(), drop(), and fillna() methods: housing.dropna(subset=["total_bedrooms"]) # option 1 housing.drop("total_bedrooms", axis=1) # option 2 median = housing["total_bedrooms"].median() # option 3 housing["total_bedrooms"].fillna(median, inplace=True) Feature Scaling Most probably our dataset comprises of the attributes with varying scale. 
Formally, If a feature in the dataset is big in scale compared to others, then in algorithms where Euclidean distance is measured this big scaled feature becomes dominating. Data rescaling makes sure that attributes are at same scale. There are two common ways to get all attributes to have the same scale: min-max scaling and standardization Min-max scaling Min-max scaling (many people call this normalization) is quite simple: values are shifted and rescaled so that they end up ranging from 0 to 1. We perform this by subtracting the min value and dividing by the max minus the min. Scikit-Learn provides a transformer called MinMaxScaler for this. It has a feature_range hyperparameter that lets you change the range if you don’t want 0–1 for some reason. from sklearn import preprocessing data_scaler= preprocessing.MinMaxScaler(feature_range=(0,1)) data_rescaled = data_scaler.fit_transform(X_test) Standardization Standardization is quite different: first it subtracts the mean value (so standardized values always have a zero mean), and then it divides by the standard deviation so that the resulting distribution has unit variance. Unlike min-max scaling, standardization does not bound values to a specific range, which may be a problem for some algorithms (e.g., neural networks often expect an input value ranging from 0 to 1). However, standardization is much less affected by outliers. For example, suppose a district had a median income equal to 100 (by mistake). Min-max scaling would then crush all the other values from 0–15 down to 0–0.15, whereas standardization would not be much affected. Scikit-Learn provides a transformer called StandardScaler for standardization. from sklearn import preprocessing data_scaler = preprocessing.StandardScaler() data_rescaled = data_scaler.fit_transform(X_test) Splitting the dataset To check the accuracy of our model, we split the dataset into two pieces-a training set and a testing set. Use the training set to train the model and testing set to test the model. In this we can evaluate how well our model performed. from sklearn.model_selection import train_test_split #for PIMA dataset pima = pd.read_csv('PIMA_diabetes.csv') pima_train, pima_test= train_test_split(pima, test_size = 0.3, random_state=42) #for housing dataset Train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42) This function has the following arguments − pima− Here, pima is the dataset, which need to be splited. test_size − This represents the ratio of test data to the total given data. As in the above example, we are setting test_data = 0.3. random_state − It is used to guarantee that the split will always be the same. This is useful in the situations where you want reproducible results. Train the Model Next, dataset can be used to train some prediction-model. As discussed, scikit-learn has wide range of Machine Learning (ML) algorithms which have a consistent interface for fitting, predicting accuracy, recall etc. Example: def modeling(X_train, X_test, y_train, y_test): classifier_knn = KNeighborsClassifier(n_neighbors = 3) classifier_knn.fit(X_train, y_train) y_pred = classifier_knn.predict_proba(X_test) return y_pred Image Classification We will be using the MNIST dataset, which is a set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau. Each image is labeled with the digit it represents. This set has been studied so much that it is often called the ―Hello World‖ of Machine Learning. 
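To round off the train-the-model step above, the sketch below prepares the Pima data, applies the standardization discussed earlier, and evaluates a 3-nearest-neighbours classifier with accuracy. The column name Outcome follows the dataset description given elsewhere in these lectures; the rest is an illustrative assumption rather than the lecture's exact code.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

pima = pd.read_csv("PIMA_diabetes.csv")
X = pima.drop(columns=["Outcome"])            # the medical predictor variables
y = pima["Outcome"]                           # 1 = diabetic, 0 = non-diabetic

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)       # fit the scaler on the training set only
X_test = scaler.transform(X_test)             # reuse the same scaling for the test set

classifier_knn = KNeighborsClassifier(n_neighbors=3)
classifier_knn.fit(X_train, y_train)
y_pred = classifier_knn.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))

Note that the scaler is fitted on the training split and only applied to the test split, which keeps information from the test data out of the preprocessing step.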
Whenever people come up with a new classification algorithm, they are curious to see how it will perform on MNIST. Whenever someone learns Machine Learning, sooner or later they tackle MNIST. Scikit-Learn provides many helper functions to download popular datasets. MNIST is one of them. The code mentioned in next slide fetches the MNIST dataset. Image Classification from sklearn.datasets import load_digits mnist = fetch_openml('mnist_784', version=1) mnist.keys() dict_keys(['data', 'target', 'feature_names', 'DESCR', 'details’, 'categories', 'url']) Datasets loaded by Scikit-Learn generally have a similar dictionary structure including: A DESCR key describing the dataset A data key containing an array with one row per instance and one column per feature A target key containing an array with the labels X, y = mnist["data"], mnist["target"] Performance Metrics for Classification problems from sklearn.metrics import precision_score, recall_score precision_score(y_train_5, y_train_pred) recall_score(y_train_5, y_train_pred) ROC The receiver operating characteristic (ROC) curve is another common metric used to check the performance of the classifiers. The ROC curve plots the true positive rate (TPR) (another name for recall) against the false positive rate (FPR). The FPR is the ratio of negative instances that are incorrectly classified as positive. It is equal to one minus the true negative rate (TNR), which is the ratio of negative instances that are correctly classified as negative. The TNR is also called specificity. Hence the ROC curve plots sensitivity (recall) versus 1 – specificity. Conclusion and Future Directions Dr. Mohammad Nadeem Department of Computer Science Aligarh Muslim University Copyright © Dept of Computer Science, AMU, Aligarh. Permission required for reproduction or display. Regarding this course About course Comprehensive Learning Experience Practical Application and Projects Future Directions Advanced Machine Learning Techniques Big Data Technologies Cloud Computing Specialized Fields Book references Kroese, Dirk P., Zdravko Botev, and Thomas Taimre. Data science and machine learning: mathematical and statistical methods. Chapman and Hall/CRC, 2019. Géron, Aurélien, Hands-on machine learning with Scikit- Learn, Keras, and TensorFlow, O'Reilly Media, Inc., 2022. Illowsky, Barbara, and Susan Dean, Introductory statistics, OpenStax, (2018). What more? Continued Learning Community Engagement Thank You AI, Machine Learning and Data Science Dr. Faisal Anwer Department of Computer Science Aligarh Muslim University, Aligarh-202002 1 Copyright © Dept of Computer Science, AMU, Aligarh. Permission required for reproduction or display. Review of Previous Lecture Motivation behind Data Science Objectives and Outcome of the Course Operations of Data Science Contents Data Science vs Machine Learning AI vs Data Science AI vs Machine Learning vs Data Science Data Science and Machine Learning Data Science Data science – Interdisciplinary field – Includes Scientific methods, Processes and Algorithms on structured and unstructured data to extract insights. Data Science combines: – Mathematics, – Statistics, – Computer science, and – specific expertise to uncover patterns, trends, and relationships within data. Data Science Stages Data collection. Data Storage Data preprocessing. Visualizing Data. Modeling and evaluation. Machine Learning Machine learning, a subfield of artificial intelligence. Machine learning algorithms detect patterns and correlations within dataset. 
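The ROC curve described above can be computed directly with scikit-learn. The labels and scores below are invented purely to keep the sketch self-contained; in the MNIST example one would instead pass the training labels and the classifier's decision scores.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])                       # toy ground-truth labels
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.55, 0.7])   # toy classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_scores)                # FPR = 1 - specificity, TPR = recall
auc = roc_auc_score(y_true, y_scores)

plt.plot(fpr, tpr, label=f"AUC = {auc:.2f}")
plt.plot([0, 1], [0, 1], "k--")                                   # a purely random classifier
plt.xlabel("False Positive Rate (1 - specificity)")
plt.ylabel("True Positive Rate (recall / sensitivity)")
plt.legend()
plt.show()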
Machine learning systems can make complex predictions and analyses. The fundamental principle of Machine Learning lies in its autonomous learning capability. Machine learning and Data Science (Similarities and Differences) Similarities Machine learning workflows, similar to data science Differences – Post Analysis Data scientists typically interpret and report their findings, whereas machine learning engineers focus on the model's deployment, monitoring, and maintaining in a production environment. Data Science and Machine Learning Artificial Intelligence and Data Science Artificial Intelligence Artificial intelligence makes machines seem like they have human intelligence. The goal of AI is to: – Mimic the human and – Create systems that can function intelligently and independently. AI and Data Science Tasks that constitute AI – Problem Solving – Knowledge Representation and Reasoning – Decision Making – Communication, perception and Actuation Problem Soving Problem Solving – Maze game – Tic tac toe (No data no modelling) (Need efficient algo. A*, DFS/BFS) Knowledge Representation & Reasoning Knowledge Representation & Reasoning – Eight Queen’s problem – complex game rules (Write rules and apply reasoning) (No data no modelling) (Need propositional and first order logic.) Decision Making Decision Making Expert Systems (when rules are well known) (Rules given by experts) (Rules encoded using knowledge representation) (Execution of rules by a program) Limitation of Expert Systems Rules may be too complex Rules may be inexpressible Rules may be unknown (specially for new diseases) Here we can take help from Machine Learning, Reinforcement Learning, Deep Learning ( No need to write rules). Communication, perception and Actuation Communication – Traditionally AI agents (or robots) are required to communicate with humans. – The agent should have capability of Natural Language Processing – Modern NLP is data driven and ML/DL is best choice for NLP (Example: An agent can perform voice analysis using Machine learning/Deep Learning) Communication, perception and Actuation Perception – How agent perceive the environment (i.e., how agent observe around its surrounding) – It gives ability of vision and speech – Modern CV (computer vision) and speech are data driven Actuation – Performing some action by an agent (i.e robot) – Reinforcement Learning – This is data driven now. AI vs Data Science Data Science AI Machine Learning Focus Extracting insight Enable computers to Provide a way to from structured accomplish complex synthesize data, and unstructured intellectual tasks like learn from it and data for decision- humans. use the insights to making and improve over planning. time. Application Solving Business Perform tasks like Extract problems using humans by learning, knowledge from descriptive and reasoning, and self- data to learn from predictive correction. that data and analytic. make predictions. Examples: Robots, Examples: Voice assistants, Online Examples: Disease gaming. recommendation prediction, system, Health Financial analysis monitoring. AI vs Machine Learning vs Data Science Summary In what way Data Science is related with Machine Learning. The way AI and Data Science are related Finally, the relation between AI, Machine Learning and Data Science. Introduction to Datasets Dr. Faisal Anwer Department of Computer Science Aligarh Muslim University, Aligarh-202002 1 Copyright © Dept of Computer Science, AMU, Aligarh. Permission required for reproduction or display. 
Recap of Previous Lecture Introduction to Google Colab Features of Google Colab How to read Datasets in Google Colab ? – Importing Kaggle Dataset into Colab – Downloading and uploading Dataset to Colab – Read directly through Google Drive How to read Github Project in Google Colab? Contents Popular Dataset Repositories Discussion on following Datasets: – Crop Recommendation Dataset – Pima Indians Diabetes Dataset – California Housing Prices Dataset – MNIST Dataset Working with Dataset A few places, where you can get data Popular open data repositories: —UC Irvine Machine Learning Repository (https://archive.ics.uci.edu/) —Kaggle datasets (https://www.kaggle.com/datasets) — Specific Datasets Other pages listing many popular open data repositories: —Wikipedia’s list of Machine Learning datasets (https://en.wikipedia.org/wiki/List_of_datasets_for_machine- learning_research) Crop Recommendation Dataset – A dataset which would allow the users to build a predictive model to recommend the most suitable crops to grow in a particular farm based on various parameters. – This dataset was build by augmenting datasets of rainfall, climate and fertilizer data available for India. – Link to download dataset: https://www.kaggle.com/datasets/atharvaingle/crop- recommendation-dataset Features of Crop Recommendation Dataset The datasets consist of seven independent variables and one target (dependent) variable, labels. Total records: 2200 Features of Crop Recommendation Dataset Independent Features: ▪ N - ratio of Nitrogen content in soil ▪ P - ratio of Phosphorous content in soil ▪ K - ratio of Potassium content in soil ▪ temperature - temperature in degree Celsius ▪ humidity - relative humidity in % ▪ ph - ph value of the soil ▪ rainfall - rainfall Dependant Feature: ▪ Label- Suitable crops Few Records of Crop Recommendation Dataset N P K temperature humidity ph rainfall label 90 42 43 20.87974 82.00274 6.502985 202.9355 rice 85 58 41 21.77046 80.31964 7.038096 226.6555 rice 60 55 44 23.00446 82.32076 7.840207 263.9642 rice 43 79 79 19.40752 18.98031 7.806748 80.25065 chickpea 44 74 85 20.18649 19.6372 7.150681 78.2604 chickpea 83 45 21 18.83344 58.75082 5.716223 79.75329 maize 100 48 16 25.71896 67.22191 5.549902 74.51491 maize Correlation Between Features Pima Indians Diabetes Dataset Dataset originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes. Number of download: 463K Link to download dataset: https://www.kaggle.com/datasets/uciml/pima-indians- diabetes-database Features of Pima Indians Diabetes Dataset The datasets consist of several medical predictor (independent) variables and one target (dependent) variable, Outcome. 
All patients here are females at least 21 years old Total records: 768 Features of Pima Indians Diabetes Dataset Independent variables includes: – Pregnancies: Number of times pregnant – Glucose: Glucose concentration – BloodPressure: Diastolic blood pressure – SkinThickness: Triceps skin fold thickness – Insulin: 2-Hour serum insulin – BMI: Body mass index – DiabetesPedigreeFunction: Diabetes pedigree function – Age Dependant Variable – Outcome: Class variable (0 or 1) 268 of 768 are 1, the others are 0 Few Records of Pima Indians Diabetes Dataset Diabetes BloodPre SkinThick Pedigree Pregnancies Glucose ssure ness Insulin BMI Function Age Outcome 6 148 72 35 0 33.6 0.627 50 1 1 85 66 29 0 26.6 0.351 31 0 8 183 64 0 0 23.3 0.672 32 1 1 89 66 23 94 28.1 0.167 21 0 0 137 40 35 168 43.1 2.288 33 1 5 116 74 0 0 25.6 0.201 30 0 Correlation Between Features California Housing Prices Dataset The dataset contains median house prices for California districts derived from the 1990 census. It serves as an excellent introduction to implementing machine learning algorithms. The objective of the dataset is to predict median house value. Number of downloads: 136K Link to download dataset: https://www.kaggle.com/datasets/camnugent/california- housing-prices Features of California Housing Prices Dataset The datasets consist of nine (independent) variables and one target (dependent) variable, medianHouseValue (Median House Value). Total records: 20640 Features of California Housing Prices Dataset Independent variables includes: – longitude: A measure of how far west a house is; a higher value is farther west – latitude: A measure of how far north a house is; a higher value is farther north – housingMedianAge: Median age of a house within a block; a lower number is a newer building – totalRooms: Total number of rooms within a block – totalBedrooms: Total number of bedrooms within a block Features of California Housing Prices Dataset Independent variables includes: – population: Total number of people residing within a block – households: Total number of households, a group of people residing within a home unit, for a block – medianIncome: Median income for households within a block of houses (measured in tens of thousands of US Dollars) – oceanProximity: Location of the house w.r.t ocean/sea Dependant Variable Dependent variable includes: – medianHouseValue: Median house value for households within a block (measured in US Dollars) Few Records of California Housing Prices Dataset latitu housing_median_a total_room total_bedroo populatio household median_incom median_house_val ocean_proximi longitude de ge s ms n s e ue ty -122.23 37.88 41 880 129 322 126 8.3252 452600NEAR BAY -122.22 37.86 21 7099 1106 2401 1138 8.3014 358500NEAR BAY -122.24 37.85 52 1467 190 496 177 7.2574 352100NEAR BAY -118.18 34.63 19 3562 606 1677 578 4.1573 228100INLAND -118.17 34.61 7 2465 336 978 332 7.1381 292200INLAND -118.16 34.6 2 11008 1549 4098 1367 6.4865 204400INLAND -117.11 32.58 21 2894 685 2109 712 2.2755 125000NEAR OCEAN -117.1 32.58 27 2616 591 1889 577 2.3824 127600NEAR OCEAN Correlation Between Features MNIST Dataset MNIST dataset, is a set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau. Each image is labeled with the digit it represents. This set has been studied so much that it is often called the “Hello World” of Machine Learning The objective of the dataset is to predict the image between 0 to 9. 
Number of downloads: 127K Link to download dataset: https://www.kaggle.com/datasets/oddrationale/mnist-in-csv Features of MNIST Dataset Each image has 784 features. This is because each image is 28×28 pixels. Each feature simply represents one pixel’s intensity, from 0 (white) to 255 (black). We considered csv format of the dataset. The dataset contains two files one for training data set and one for testing dataset. Total records: 70000 (Training: 60000 + Testing: 10000) Features of MNIST Dataset Independent variables includes: – 1x1 – 1x2 – 1x3 – 1x4 – 1x5 – 1x6 – ….. – ….. – 28x28 Dependent variable includes: – label : digits (from 0 to 9) Records of MNIST Dataset CATEGORY 0 1 2 3 4 5 6 7 8 9 TOTAL #Training 5,923 6,742 5,958 6,131 5,842 5,421 5,918 6,265 5,851 5,949 60,000 Samples #Testing 980 1,135 1,032 1,010 982 892 958 1,028 974 1,009 10,000 Samples Sample Images of MNIST Dataset Summary We have popular Dataset Repositories We Discussed on following Datasets: – Crop Recommendation Dataset – Pima Indians Diabetes Dataset – California Housing Prices Dataset – MNIST Dataset Introduction to Python Dr. Faisal Anwer Department of Computer Science Aligarh Muslim University, Aligarh-202002 Copyright © Dept of Computer Science, AMU, Aligarh. Permission required for reproduction or display. Recap of Previous Lecture Introduction to Data Science AI, Machine Learning and Data Science Introduction Python is Developed by Guido van Rossum. The Python language is built around 19 guidelines from the Zen of Python. The first release appeared in 1991 as Python 0.9.0. Introduction: Features Python is an interactive, general-purpose, object-oriented, and high- level programming language. Open-Source Simplicity & Readability Portable Huge community of developers/users Python works on different platforms Python is widely used in various types of applications Introduction: Features Vast Support of Libraries and frameworks Data Science: NumPy, Pandas, SciPy, PyTorch, etc. Machine Learning: Tensorflow, Scikit-Learn, PyBrain, PyML, etc. Artificial Intelligence: Keras, OpenCV, NLTK, etc. Web Development: Django, Flask, Pyramid, etc. Network Programming: Asyncio, Pulsar, Pyzmq, etc. INTRODUCTION: OFFICIAL HOME ✓ The official home of the Python Programming Language is www.python.org ✓ The latest release can be checked (https://www.python.org/downloads/) ✓ Python extension is.py IDEs & Editors ✓Integrated Development Environment (IDE) ✓Text/Code Editor ✓How to select IDE/Editor? Beginner — IDLE (or Online Python Editors) is perfect. Intermediate/ Advanced —Google Colab, Jupyter Notebook, PyCharm, Sublime, etc. ✓What’s your purpose? Data Science —Jupyter Notebook, Colab, PyCharm Professional Web development — PyCharm Professional, VS Code IDEs & Editors Integrated Development and Learning Environment (IDLE) ✓IDLE is a cross-platform IDEs & Editors: open-source IDE that comes by default IDLEwith Python. ✓Suitable for simple projects Installation Steps for IDLE ✓ Go to https://www.python.org/downloads/ ✓ Download the latest release of Python as per your OS ( Windows/Linux/Mac) ✓ Run the downloaded (.exe) file IDEs & Editors: Jupyter Notebook ✓A web-based interactive development environment ✓Easy to use, open-source software that allows you to create and share live code, visualizations, and others. 
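Relating back to the MNIST-in-CSV layout described above (a label column followed by 784 pixel columns), one row can be reshaped into a 28x28 image and displayed as follows. The file name mnist_train.csv is an assumption based on the linked Kaggle dataset; adjust the path to wherever the file was uploaded.

import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv("mnist_train.csv")

label = train.iloc[0, 0]                                # first column holds the digit label
pixels = train.iloc[0, 1:].to_numpy().reshape(28, 28)   # 784 pixel intensities -> 28x28 image

plt.imshow(pixels, cmap="binary")                       # 0 = white, 255 = black
plt.title(f"Label: {label}")
plt.axis("off")
plt.show()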
First Method: Installing Jupyter using Anaconda ✓Download latest version of Anaconda (https://www.anaconda.com/products/individual) ✓Install Anaconda (https://docs.anaconda.com/anaconda/install/windows/) IDEs & Editors: Jupyter Notebook Second Method: Installing Jupyter with pip (Python’s package manager) ✓Rum command prompt (cmd) ✓Upgrade pip (pip3 install --upgrade pip) ✓pip3 install jupyter Start with Jupyter Notebook ✓Rum command prompt (cmd) ✓Write jupyter notebook IDEs & Editors: Colab ✓A free Jupyter notebook environment that runs entirely in the cloud. ✓Requires no installation/setup ✓Allows you to work in a team ✓Supports many popular machine learning libraries IDEs & Editors: Getting Started ✓ Run IDLE or Jupyter Notebook or Colab ✓ print(‘Hello, World’) ✓ a=50 ✓ b=6 ✓ c=a%b Running code in the interactive shell Python is an interpreted language. Python expressions and statements can be run in an interactive programming environment called the shell. At the prompt (>>>) type: print (“Python: Hello World”) – Press [Enter] 13 Running code in the interactive shell Calculator: Interactive Mode Start the Python interpreter and use it as a calculator. Use an interactive shell to perform the following operations 2 + 3, 6/2, 6/4, 6%4, 6%2, 2*4-3, 4-2*6-4 14 Programming in Script Mode Open a New File from the File menu of the IDLE shell window Write your program and save your file using the “.py” extension Use Run Module (F5) from the Run menu. 15 Program 1.Write a program in Python programming and save it as MyfirstProgram.py to perform the following operations Addition, Subtraction, Multiplication, Division, Modulo division. 2.Write a program in Python programming and save it as MysecondProgram.py to display the following messages: “Hello World, I am studying Data Science using Python” “ Guido Van Rossum invented the Python programming language.” 16 Program Documentation Comment lines provide documentation about your program – Anything after the “#” symbol is a comment – non-executable (Ignored by the python shell) #My First Python Program #May February, 2024 17 Basic Concepts ✓ A case-sensitive language ✓ Comment (single line (#) or Multi-lines (triple-quotes)) ✓ Indentation is essential (no curly braces) ✓ Multiple statements are allowed with a semicolon ✓ a=5; b=6; c=a+b ✓ Keywords: A set of reserved words that can’t be used as variable names, function names, or any other identifier: ✓ import keyword as K ✓ len(K.kwlist) --- 36 ✓ To print the list, print(keyword.kwlist) Introduction to Python Advanced Libraries NumPy NumPy stands for Numerical Python. It is a library consisting of multidimensional array objects and a collection of routines for processing those arrays. Using NumPy, mathematical and logical operations on arrays can be performed. Pandas Pandas is a Python library used for working with data sets. It has functions for analyzing, cleaning, exploring, and manipulating data. Pandas can clean messy data sets, and make them readable and relevant. Seaborn Seaborn is a python library for making statistical graphics. It is built on top of matplotlib and integrates closely with pandas data structures. import pandas as pd import seaborn as sns df = pf.read_csv('PIMA_diabetes.csv') k=df.corr() sns.heatmap(k, annot=True, cmap="bwr") Scikit-learn (Sklearn) Scikit-learn (Sklearn) is the most useful and robust library for machine learning in Python. 
It provides a selection of efficient tools for machine learning and statistical modeling including classification, regression, clustering and dimensionality reduction via a consistence interface in Python. Scikit-learn Features Scikit-learn library focuses on modeling the data. Some of the most popular groups of models provided by Sklearn are as follows − Supervised Learning algorithms Unsupervised Learning algorithms Clustering − This model is used for grouping unlabeled data. Example from sklearn.linear_model import LinearRegression from sklearn.metrics import mean_absolute_error,mean_squared_error lin_reg = LinearRegression() lin_reg.fit(X_train, Y_train) Y_pred=lin_reg.predict(X_test) Summary Features of Python IDE and Editors How to run a python code Introduction of Advanced Libraries Introduction to Google Colaboratory (Google Colab) Dr. Faisal Anwer Department of Computer Science Aligarh Muslim University, Aligarh-202002 1 Copyright © Dept of Computer Science, AMU, Aligarh. Permission required for reproduction or display. Recap of Previous Lecture Features of Python IDE and Editors How to run a python code Introduction of Advanced Libraries Contents Introduction to Google Colab Features of Google Colab How to read Datasets in Google Colab ? – Importing Kaggle Dataset into Colab – Downloading and uploading Dataset to Colab – Read directly through Google Drive How to read Github Project in Google Colab? Introduction Google Colab is a cloud-based Jupyter notebook environment from Google Research. Jupyter is the open source project on which Colab is based. Colab, or "Colaboratory", allows to write and execute Python in browser. Colab helps us get started with our data science journey with almost no setup. Introduction Advantages of Google Colab: ✓Pre-installed data science libraries ✓Easy sharing and collaboration ✓Seamless integration with GitHub ✓Working with data from various sources ✓Seamless integration with Kaggle Datsets ✓Automatic storage and version control ✓Access to hardware accelerators such as GPUs and TPUs Google Colab Colab notebooks are stored in Google Drive. It can be shared similar to Google Docs or Sheets. Click on Share button present at the top right of any Colab notebook to share colab notebooks. Mounting Google Drive on Colab allows any code in your notebook to access any files in your Google Drive. Google Colab Notebook A notebook is a collection of cells. Cells contain either explanatory text or executable code and its output. Once the toolbar button indicates CONNECTED, click in the cell to select it and execute the contents; Getting Started With Google Colab To start working with Colab you first need to log in to your Google account, then go to this link https://colab.research.google.com. Once you've signed in to Colab, you can create a new notebook by clicking on 'File' → 'New notebook’, On creating a new notebook, it will create a Jupyter notebook with Untitled0.ipynb and save it to your google drive in a folder named Colab Notebooks. Google Colab allows you to write Python code as well as text Working with data from various sources Loading data from local machine Mounting Google drive to Colab instance – By clicking on drive icon – Alternatively, by using the following code: from google.colab import drive drive.mount(‘/content/drive') Access to computer hardware accelerators GPUs and TPUs Goto Runtime and click change runtime. Here you can select the hardware accelerators. If you want to access premium GPUs, you can purchase additional compute units. 
You can also manage sessions by accessing Runtime ->Manage Sessions. Here you can terminate the other sessions if need arises. Reading Kaggle Dataset Importing Kaggle Dataset into Colab 1. Open your Google Colab Notebook 2. Download and Install the required packages. pip install opendatasets 3. Visit Kaggle website(www.kaggle.com) and go to your profile. Next click on account. 4. On the next page an API section will be vissible, where you will find a “Create New API Token”. Click on it, and a kaggle.json file will be downloaded. This file will contain your username and key that will be used in our next step. 5. Import the opendatasets library and download your Kaggle dataset by pasting the link on it import opendatasets as od import pandas od.download(" "https://www.kaggle.com/datasets/ camnugent/california-housing-prices" ") Downloading and uploading Dataset to Colab 1. Visit Kaggle website(www.kaggle.com) and go to particular dataset. Let say https://www.kaggle.com/datasets/camnugent/california- housing-prices/data 2. Download the dataset. 3. Open your Google Colab Notebook 4. Press files present at left bar. 5. Click on Upload to session storage and follow the instruction. 6. Now, you can read the dataset by executing below code: import pandas as pd df = pd.read_csv("/content/housing.csv") Read directly through Google Drive 1. Visit Kaggle website(www.kaggle.com) and go to particular dataset. 2. Let say https://www.kaggle.com/datasets/camnugent/california- housing-prices/data 3. Download the dataset. 4. Upload the dataset to Google drive Read directly through Google Drive 5. Open your Google Colab Notebook and mount your google drive by executing below code from google.colab import drive drive.mount('/content/drive’) 6. Once you will execute the above code Google colab notebook will ask permission to access your Google Drive files, which you need to provide. 7. Now, you can read the dataset by executing below code: import pandas as pd df = pd.read_csv("/content/drive/MyDrive/ABE/housing.csv") Reading Github Project into Google Colab Reading Github Project Visit the website of Github https://github.com/ Sign in to Github using your credentials Go the project that you want to access Open the google colab file present in the project and press open in Colab option Summary Introduced Google Colab Features of Google Colab Reading Datasets in Google Colab Reading Github Project in Google Colab Introduction to Data Science Dr. Faisal Anwer Department of Computer Science Aligarh Muslim University, Aligarh-202002 1 Copyright © Dept of Computer Science, AMU, Aligarh. Permission required for reproduction or display. Motivation Amazon personalized recommendation system. Targeted Advertising Google Voice, Siri, Cortana etc Introduction ✓Data can be considered as raw material while Data Science is the process of extracting meaningful information from the Data. ✓Data science is an interdisciplinary field that combines ✓statistics, ✓computer science, and ✓domain expertise to extract knowledge and insights from structured and unstructured data. Introduction ✓Data science is being employed in a wide range of industries, including ✓Healthcare ✓Finance, ✓Agriculture, ✓Retail, and more. ✓It empowers organizations to make data-driven decisions and gain a competitive advantage. Popularity of Data Science Digital devices are cheaper and more powerful. Democratization of Hardware and Software Internet world is producing huge amount of data. 
According to a report of the International Data Corporation (IDC 2021), worldwide data will grow 61% to 175 zettabytes by 2025 In the same report it was predicted that 49% of the world’s stored data will reside in public cloud environments. Due to immense computational resources available with us, we can perform analytics and train a machine using massive data. Learning Objectives Learning objectives of this course are: 1.Understand the fundamentals of Python programming. ▪ Acquire a working knowledge of Python syntax and control structures. ▪ Learn to write Python functions and work with Python data structures like lists, tuples, dictionaries, and sets. 2.Gain proficiency in using Python’s data science libraries. ▪ Learn to manipulate and analyze data using Pandas. ▪ Perform numerical computations with NumPy. ▪ Create visualizations with Matplotlib and Seaborn. 3.Develop skills in data manipulation and cleaning. ▪ Understand how to handle missing data, duplicate data, and data errors. ▪ Learn data transformation techniques essential for preparing datasets for analysis. Learning Objectives 4. Learn the basics of statistical analysis and hypothesis testing. ▪ Comprehend the concepts of descriptive statistics and inferential statistics. ▪ Perform hypothesis testing using Python to make data-driven inferences. 5.Understand machine learning concepts and apply them using Scikit- learn. ▪ Grasp the basics of machine learning algorithms including supervised and unsupervised learning. ▪ Implement machine learning models and understand the principles of model selection and evaluation. 6.Learn to present data insights effectively. ▪ Develop the ability to communicate results through visualizations. ▪ Create comprehensive reports on the findings from data analysis. Learning Outcomes By the end of this course, students should be able to: 1.Write Python scripts and programs for various data science applications. 2.Efficiently use python libraries/frameworks for data preprocessing tasks. 3.Conduct exploratory data analysis to uncover insights and prepare data for modeling. 4.Apply statistical methods to analyze data and validate hypotheses using Python. Learning Outcomes 5. Create meaningful data visualizations to present findings in a clear and impactful manner. 6. Build, test, and evaluate machine learning models with Scikit-learn. 7. Demonstrate the ability to extract and interpret data to inform decision-making processes. 8. Possess a solid foundation in data science that can be applied to real-world problems across various domains. Data Science Operations ✓Data Science is described as a field revolving around 5 data-related operations. Collection Storing Processing Describing Modelling Collection Data Collection is the process of gathering data (text, video, audio and others). Depends on Data scientist and the environment that the Data Scientist is working. Let say a data scientists is working on an agriculture-based company. Which crops yields higher production in districts of Maharashtra? Effect of type of seed, fertilizer, irrigation on crops Data Already Exists Access using SQL Collection Suppose a data scientists working for a political party. What is the winning chance of a candidate in a particular region ? Data exists at different sources such as Facebook, opinion polls etc. Access using API or crawler. Take another example where a data scientists working with Healthcare Systems. Effect of new drugs on patients. 
Data not available Needs to design experiments Essential Expertise Needed for Data Collection Knowledge of programming Knowledge of statistics (incase of design experiments to collect data) Knowledge of Databases(Intermediatory) Storing Data Operational Data. Operational data is the vital part of an organization. Generated by day-to-day operations. It's transactional data that reflects the business activities and includes sales records, banking transactions, or customer interactions. Unstructured Data This data is the wild frontier of data science. It doesn't fit neatly into traditional row and column databases and includes: Test Speech Video (Data coming from web, social media posts, online reviews etc.) Storing Data Structured Data Structured data is highly organized and easily searchable. It stores in relational databases and is what you typically find in an e-commerce setting in the form of customer profiles. Data from multiple databases Combine into common repository Expertise Needed for Storing Data Programming Skills (Basic) Understanding of Databases (Basic) Understanding of NoSQL Databases like JSON, XML etc. Understanding of Data Warehouses (data from many different sources into a single data repository) Processing Data Data Wrangling (the process of transforming and mapping data from one "raw" data form into another format) Extract Transform and Structure Load Data Cleaning Fill missing values Correct spelling errors Identify and remove outliers Processing Data Processing Data Data scaling (adjusting the range of feature values) Kilometers to miles, rupees to dollars Normalizing (shifted and rescaled so that they end up ranging between 0 and 1) Min-Max scaling Standardizing values are centered around the mean Expertise Needed for Processing Data Programming Skills (Essential) SQL and NoSQL Databases (Optional) Basic Statistics (Essential) Describing Data Descriptive Statistics Descriptive statistics provide us with a powerful way to describe and summarize data. Visualising Data Visualization is about representing data graphically to uncover patterns, see trends, and communicate information effectively Summarising Data Summarizing data involves reducing large datasets to their basic essence. Mean Median Mode, etc. Skills Required for Describing Data Statistics Excel Python R Tableau Data Modelling Statistical Modelling: Underlying relationships What is the relationship between x and y of a dataset? E.g.: There is a linear relationship between no. of days of treatment and BP. Modelling underlying data distribution Give statistical guarantees (p-values, goodness-of-fit tests) Algorithmic Modelling (Machine Learning Modelling) Alternative to statistical modelling when the relationship among output and input variables are not simple. If you only care about the prediction and not about why certain thing are happening. Statistical Modelling VS Algorithmic Modelling Simple models - - Complex models More suited for low-dimensional data - - Can work with high dimensional data. Data lean models -- Data hungry models More of statistics -- More of ML,DL Linear Regression, Logistic Regression, Linear Discriminant Analysis -- Linear Regression, Logistic Regression, Linear Discriminant Analysis, Decision Trees, K-NNs, SVMs, Naïve Bayes, Multilayered Neural Networks Modelling (Skills Required) Inferential Statistics Probability Theory ML and DL Python packages and frameworks(numpy, pandas, scikit- learn, Tensor Flow, PyTorch, Keras) Introduction to Statistics Dr. 
Introduction to Statistics
Dr. Faisal Anwer
Department of Computer Science
Aligarh Muslim University, Aligarh-202002
Copyright © Dept of Computer Science, AMU, Aligarh. Permission required for reproduction or display.

Recap
Database repositories
Popular datasets and their explanation

Week-2 Contents
Introduction to Statistics
Descriptive Statistics
Inferential Statistics
Introduction to Hypothesis
Probability

Contents
Introduction to Statistics
Descriptive Statistics
Types of Data
Describing Data

Introduction to Statistics
Statistics is the science of collecting, describing, analyzing, and drawing inferences (modelling) from data. It uses numbers and graphs to represent the real world. It is the art and science of learning from data.

Introduction to Statistics
✓ Which party will win the current general election?
✓ It is infeasible to survey all citizens.
✓ Consider a proportion of the population.
✓ How do people react to a particular product?
✓ It is infeasible to survey all citizens.
✓ Consider a proportion of the population.

Introduction to Statistics
✓ Population: the total set of elements/objects that we are studying.
✓ Sample: the subgroup of the population that we study to draw inferences about the population.
✓ A quantity estimated from a sample is called a statistic. The mean, median, variance, and standard deviation computed from a sample are statistics.

Types of Statistics
There are two main branches in statistics:
1. Descriptive Statistics:
▪ Summarizing and organizing data.
▪ Can be broken down into measures of centrality and measures of spread.
2. Inferential Statistics:
▪ Making predictions or inferences about a population based on a sample of data taken from the population.
▪ It includes estimation, hypothesis testing, regression analysis, and others.

Descriptive Statistics
Measures of Centrality: include the mean, median, and mode, which are used to find the center of a data set.
Measures of Spread: include the range, variance, and standard deviation, which are used to find the spread or dispersion within a data set.

Types of Data
▪ What are the different types of data?
▪ How can we describe qualitative data?
▪ How can we describe quantitative data?
▪ How can we describe relationships among attributes?

Types of Data
Qualitative (Nominal, Ordinal) and Quantitative (Discrete, Continuous).

Qualitative Data
Qualitative data: attributes that describe the object using a finite set of discrete classes.
– Nominal data: does not have any natural ordering. Example: Red, Blue, Green.
– Ordinal data: natural ordering in the attributes. Example: Poor, Good, Excellent.

Quantitative Data
Quantitative data: attributes having numerical values, used to count or measure certain properties of a sample.
– Discrete data: quantitative attributes that can take only a finite number of numerical values. Examples: # of eggs in a basket, # of kids in a class, # of Facebook likes, # of diaper changes in a day, # of wins in a season, # of votes in an election.
– Continuous data: quantitative attributes that can take fractional values. Examples: body temperature, wind speed, water temperature, volts of electricity, latitude, longitude.

Statistical Analysis
The type of statistical analysis depends on the type of variable.
Qualitative attributes:
– What is the average color of the shirts of all the participants? (Wrong)
– What is the total number, or frequency, of each shirt color among the participants in the sample? (Right)
– Regression analysis is not valid.
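As an illustration of this distinction, here is a small Python sketch (with made-up data, not from the slides) showing that we summarize a qualitative attribute with frequencies, while averages only make sense for quantitative attributes.

# Frequencies for a nominal attribute vs. a mean for a quantitative attribute.
import pandas as pd

data = pd.DataFrame({
    "shirt_colour": ["Red", "Blue", "Red", "Green", "Blue", "Red"],   # nominal
    "students_enrolled": [120, 135, 150, 160, 158, 170],              # discrete quantitative
})

# Right for nominal data: frequency of each category.
print(data["shirt_colour"].value_counts())

# Wrong for nominal data: data["shirt_colour"].mean() has no meaning.

# Fine for quantitative data: average number of students enrolled.
print(data["students_enrolled"].mean())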
Statistical Analysis
Quantitative data (discrete):
– What is the average number of students enrolled in the last 10 years?
– Regression analysis
– Mean, median, mode
– Variance of the data

Describing Qualitative Data
The values of categorical data usually repeat. Frequency of a value:
– The number of times a low rating is given to a product.
– How many times "Red" appeared in the data?
The count of the total number of times a value appears in the data is called its frequency. A frequency plot is useful to describe the frequencies in the data. Frequency plots are also useful for analyzing errors in ML.

Describing Quantitative Data
For discrete data we can find frequencies, and for that purpose a histogram is useful. For continuous data we can also find frequencies using a histogram:
– We can adjust the bin (class) width.
– Choosing the right bin size can provide useful insights.
– We can also compute cumulative frequencies and compare multiple histograms.

Describing Quantitative Data
Frequency polygons: almost identical to a histogram, and used to compare sets of data.
– Sort the values.
– Choose the class intervals.
– Compute the frequency of each interval.
– Compute the midpoint of each interval.
– Plot the frequency above the midpoints.
– Cumulative frequency polygons: for each class interval, plot the sum of the frequencies of all class intervals up to and including that interval.

Histogram
Left-skewed histogram: most of the short bars are towards the left.
Right-skewed histogram: most of the short bars are towards the right.
Uniform histogram: most of the bars are of similar height.
Symmetric histogram: the bars are almost a mirror image of each other.

Histogram
Source URL: https://serc.carleton.edu/download/images/338546/histogram_skew.webp

Histogram
Use of histograms in ML:
– Identifying discriminatory features.
– Plot the histogram or frequency polygon of BP level for diabetic and non-diabetic patients.
– If the trends of BP level are very different for diabetic and non-diabetic patients, then BP level is a good discriminatory feature.

Stem and Leaf Plots
An efficient way of describing small to medium-sized data. A stem-and-leaf plot looks like a histogram in inverted form. It is more informative: it displays the within-group values. For example, consider the following data set: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97.

Scatter Plots: Describing Relations Between Attributes
There can be different kinds of relations between attributes. Scatter plots are not for qualitative variables. You may observe linear, quadratic, exponential, or mixed relations. If two features are correlated, then in ML we typically feed only one of the two to the model, so it is better to use uncorrelated features.

Summary
Basics of Statistics
Descriptive Statistics
Histogram
Frequency Plots
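The sketch below (synthetic data, purely illustrative) shows the two plot types discussed above: histograms with an adjustable bin count, used here to check whether BP level discriminates two groups, and a scatter plot of two related quantitative attributes.

# Histograms for two groups and a scatter plot of two related attributes.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
bp_diabetic = rng.normal(loc=140, scale=15, size=200)      # hypothetical BP readings
bp_nondiabetic = rng.normal(loc=120, scale=12, size=200)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histograms: overlapping frequency plots help judge whether BP discriminates the groups.
ax1.hist(bp_diabetic, bins=20, alpha=0.6, label="Diabetic")
ax1.hist(bp_nondiabetic, bins=20, alpha=0.6, label="Non-diabetic")
ax1.set_xlabel("BP level")
ax1.set_ylabel("Frequency")
ax1.legend()

# Scatter plot: relationship between two quantitative attributes.
days_of_treatment = np.arange(1, 51)
bp_reduction = 0.8 * days_of_treatment + rng.normal(scale=3, size=50)
ax2.scatter(days_of_treatment, bp_reduction)
ax2.set_xlabel("Days of treatment")
ax2.set_ylabel("BP reduction")

plt.tight_layout()
plt.show()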
Introduction to Hypothesis
Dr. Faisal Anwer
Department of Computer Science
Aligarh Muslim University, Aligarh-202002
Copyright © Dept of Computer Science, AMU, Aligarh. Permission required for reproduction or display.

Recap
Sampling Methods
Correlation and Regression
Techniques to validate inferences

Contents
Types of Hypothesis
Types of Hypothesis Tests
One Sample t-Test
Two Sample t-Test

Hypothesis
A hypothesis is an assumption or suggestion. It is a statement about a population parameter that we seek to analyze through sample data. A hypothesis is essentially an educated guess or prediction about a relationship.

What is a Hypothesis?
Example 1: A new drug for controlling BP.
– Hypothesis: The new drug is more effective in controlling BP.
– Now we need to test the hypothesis.
Example 2: Two fertilizers, A and B.
– Hypothesis: The mean yield per acre using Fertilizer A is greater than that using Fertilizer B.
– Now we need to test the hypothesis.

Types of Hypotheses
Null Hypothesis (H0): a statement of no effect or no difference. It is what we assume to be true.
Alternative Hypothesis (H1 or Ha): what we want to prove to be true. It is a statement that indicates the presence of an effect or a difference.

Null Hypothesis
A fundamental concept in statistical hypothesis testing. It is denoted H0 and posits that there is no effect, no difference, or no change: any observed effect in the data is due to chance or randomness, not a systematic cause.

Example of Hypothesis
Consider a pharmaceutical company that has developed a new drug intended to lower cholesterol levels. The company claims that its new drug is more effective than the current market leader. Perform a controlled experiment.
The null hypothesis, H0: The new drug is no more effective at lowering cholesterol than the current market leader.
The alternative hypothesis, H1: The new drug is more effective at lowering cholesterol than the current market leader.

Hypothesis Tests
The most commonly used hypothesis tests are:
– T-test
– Z-test
– ANOVA (Analysis of Variance)
– Chi-Square Test

T-test
A t-test is used to compare the means of two groups to see if they are statistically different from each other. It helps you assess whether any observed differences between the groups are likely to have occurred by chance or whether they are statistically significant. Suppose, for example, we want to test whether a new teaching method is more effective than the traditional method.

P-value
The p-value, or probability value, is used everywhere in statistical analysis. It determines statistical significance and is the measure used in significance testing.

P-value
– P-value > 0.05: do not reject the null hypothesis; the result is not statistically significant.
– P-value < 0.05: reject the null hypothesis in favour of the alternative hypothesis; the result is statistically significant.
– P-value < 0.01: reject the null hypothesis in favour of the alternative hypothesis; the result is highly statistically significant.

Types of T-tests
There are three types of t-tests that we can perform, based on the data at hand:
– One-sample t-test
– Independent two-sample t-test
– Paired sample t-test

One-sample t-test
In a one-sample t-test, we compare the average (mean) of one group against a set average. This set average can be the population mean. The formula for the one-sample t-test is:
t = (m − µ) / (s / √n)
where t = the t-statistic, m = the mean of the group, µ = the theoretical value or population mean, s = the standard deviation of the group, and n = the group (sample) size.

Two-sample t-test
The two-sample t-test is used to compare the means of two different samples. A commonly used form of the t-statistic for a two-sample t-test is:
t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)
where x̄₁ and x̄₂ are the two sample means, s₁² and s₂² are the sample variances, and n₁ and n₂ are the sample sizes.
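Rather than computing these by hand, the three t-tests listed above can be run with scipy.stats. The sketch below uses synthetic data; the group sizes, means, and random seed are arbitrary choices for illustration, not values from the slides.

# One-sample, independent two-sample, and paired t-tests with scipy.stats.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# One-sample t-test: compare a sample mean against a claimed population mean.
scores = rng.normal(loc=72, scale=8, size=25)            # hypothetical exam scores
t_one, p_one = stats.ttest_1samp(scores, popmean=70)
print(f"one-sample: t = {t_one:.3f}, p = {p_one:.3f}")

# Independent two-sample t-test: e.g. new teaching method vs. traditional method.
new_method = rng.normal(loc=75, scale=8, size=30)
traditional = rng.normal(loc=70, scale=8, size=30)
t_two, p_two = stats.ttest_ind(new_method, traditional)
print(f"two-sample: t = {t_two:.3f}, p = {p_two:.3f}")

# Paired t-test: the same group measured under two conditions (before/after).
before = rng.normal(loc=150, scale=10, size=20)
after = before - rng.normal(loc=5, scale=4, size=20)
t_paired, p_paired = stats.ttest_rel(before, after)
print(f"paired: t = {t_paired:.3f}, p = {p_paired:.3f}")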
Paired Sample t-test
We compare the means of the same group at two different times or under two different conditions.

Paired Sample t-test
To calculate the t-statistic for a paired t-test:
– First calculate the difference (dᵢ = yᵢ − xᵢ) between the two observations on each pair.
– Calculate the mean difference, d̄.
– Calculate the standard deviation of the differences, s_d, and use this to calculate the standard error of the mean difference: SE(d̄) = s_d / √n.
– Calculate the t-statistic, which is given by t = d̄ / SE(d̄).

ANOVA (Analysis of Variance)
ANOVA is used when comparing the means of three or more groups. It tests the hypothesis that the group means are equal.

Chi-Square Test
The chi-square test is used for categorical data to assess how likely it is that an observed distribution is due to chance. It is often used to see whether there is a significant relationship between two categorical variables.

How to Perform These Tests
The process of performing these tests generally follows these steps:
1. State the hypotheses.
2. Select the significance level (α).
3. Choose the test and calculate.
4. Make a decision.
5. Interpret the results.

Degrees of Freedom (DF)
The degrees of freedom (DF) in statistics indicate the number of independent values that can vary in an analysis. It is an essential idea that appears in many contexts throughout statistics, including hypothesis tests, probability distributions, and linear regression.

Degrees of Freedom Formula
The degrees of freedom are often the sample size minus the number of parameters you are estimating:
DF = N − P
where N = sample size and P = the number of parameters or relationships.

One-Tailed Test vs. Two-Tailed Test
A one-tailed test is performed when the alternative hypothesis specifies a direction, i.e. when it states that the parameter is either bigger or smaller than the hypothesized value.
A two-tailed test is performed when the alternative hypothesis does not specify a direction, i.e. when it simply states that the null hypothesis is wrong.

Example (One-Tailed)
Example 1: Imagine you are a quality control manager at a factory that produces light bulbs. The light bulbs are claimed to have a lifespan of 1,200 hours on average. You suspect that the latest batch does not meet this specification and that the bulbs are burning out too soon.
Solution:
H0: The mean lifetime of a light bulb is 1,200 hours.
H1: The mean lifetime of a light bulb is less than 1,200 hours.

Example (Two-Tailed)
Example 2: Imagine you are a quality control manager at a factory that produces light bulbs. The light bulbs are claimed to have a lifespan of 1,200 hours on average. Set up a hypothesis test to check this claim and comment on what sort of test we need to use.
Solution:
H0: The mean lifetime of a light bulb is 1,200 hours.
H1: The mean lifetime of a light bulb is not 1,200 hours.

Practice Question
Imagine you are a quality control manager at a factory that produces light bulbs. The light bulbs are supposed to have a lifespan of 1,200 hours. You suspect that the latest batch does not meet this specification and that the bulbs are burning out too soon. You randomly select 30 light bulbs from the batch and find that the average lifespan is 1,180 hours with a standard deviation of 100 hours.

Steps of Solution
1. State the hypotheses.
Null hypothesis (H0): The mean lifespan of the light bulbs is 1,200 hours.
Alternative hypothesis (H1): The mean lifespan of the light bulbs is less than 1,200 hours.
2. Calculate the test statistic. Using the formula for the test statistic in a one-sample t-test:
t = (1180 − 1200) / (100 / √30) ≈ −1.10
3. Now, using the t-distribution with 29 degrees of freedom (sample size minus one), look up the one-tailed p-value associated with the t-statistic of −1.10. This can be done using statistical tables found in textbooks or using statistical software.
4. Compare the p-value to the significance level.
5. Make your decision. Since the p-value is greater than the significance level, you fail to reject the null hypothesis.
6. Interpretation. There is not sufficient evidence to conclude that this batch of light bulbs has a mean lifespan of less than 1,200 hours.
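The same one-tailed result can be checked quickly in Python from the summary statistics alone. The sketch below computes the t-statistic and the one-tailed p-value with scipy; the 0.05 significance level is the conventional choice assumed here.

# One-sample, one-tailed t-test from summary statistics (mean 1180, s = 100, n = 30, mu0 = 1200).
import math
from scipy import stats

x_bar, mu0, s, n = 1180.0, 1200.0, 100.0, 30
t_stat = (x_bar - mu0) / (s / math.sqrt(n))
df = n - 1
p_one_tailed = stats.t.cdf(t_stat, df)   # P(T <= t) for the "less than" alternative

print(f"t = {t_stat:.3f}, df = {df}, one-tailed p = {p_one_tailed:.3f}")
# t is about -1.10 and p is about 0.14 > 0.05, so we fail to reject H0.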
Summary
Types of Hypothesis
Types of Hypothesis Tests
One Sample t-Test
Two Sample t-Test

Descriptive Statistics
Dr. Faisal Anwer
Department of Computer Science
Aligarh Muslim University, Aligarh-202002
Copyright © Dept of Computer Science, AMU, Aligarh. Permission required for reproduction or display.

Recap
Introduction to Statistics
Descriptive Statistics
Types of Data
Describing Data

Contents
Measures of Centrality
Measures of Spread

Measures of Centrality and Spread
A statistic is any numerical property of a sample of a population (used to estimate the corresponding property of the population). The following can be computed for quantitative data:
– Measures of centrality (mean, median, mode)
– Measures of spread (range, variance, standard deviation)

Measures of Centrality
Mean: the average value, x̄ = (1/n) Σ xᵢ.
Median: the middle value in the sorted list (if the total number of elements is even, the average of the two middle values).
Mode:
– The most frequently occurring value (single mode).
– There can be more than one most frequent value (multiple modes).

Measures of Centrality
Consider the sample: 23 24 25 27 27 28 30 31 32 34 36
Find out:
1. Mean?
2. Median?
3. Mode?

Sensitivity Towards Outliers
The mean is the center of gravity:
– The sum of the deviations of all points from the mean is 0.
– Σ (xᵢ − x̄) = 0
An outlier is any point that is far away from the other values in the data.
– The mean is sensitive to outliers.
– The median and mode are not sensitive to outliers.

Measure of Spread
Sample 1: 23 24 25 27 28 30 31 32 34 36
– Mean: 29
– Median: 29 (low variability in this sample)
Sample 2: 12 16 20 24 26 32 34 37 42 47
– Mean: 29
– Median: 29 (high variability in this sample)
Measures of centrality do not tell us anything about the spread and variability in the data.

Measure of Spread (Range)
Sample 1: 23 24 25 27 28 30 31 32 34 36
– Mean: 29, Median: 29
– Range: max value − min value = 36 − 23 = 13
Sample 2: 12 16 20 24 26 32 34 37 42 47
– Mean: 29, Median: 29
– Range: max value − min value = 47 − 12 = 35
The range tells us that the second sample has more variability/spread than the first.

Measure of Spread (Range)
Sample: 23 24 25 27 28 30 31 32 34 100
– Range: max value − min value = 100 − 23 = 77
– Most values are close to 28, but due to the outlier (100 in this case) the range is exaggerated.
– Like the mean, the range is very sensitive to outliers.

Measure of Spread (Variance)
How different are the values in the data from the typical value (the mean)?
One solution: compute the sum (or average) of the deviations of all points from the mean, Σ (xᵢ − x̄). The sum of deviations does not describe the spread of the data, since its value is 0.
Preferred solution: (1/n) Σ (xᵢ − x̄)².

Measure of Spread (Variance)
Variance: s² = (1/(n−1)) Σ (xᵢ − x̄)²
Standard deviation: s = √s² = √[(1/(n−1)) Σ (xᵢ − x̄)²]

Variance as a Measure of Consistency
Low variance indicates more consistent data. The primary objective of manufacturing industries is to ensure that there is little variance in their products. Zero variance is desirable, for example, in:
– the length of sleeves
– the size of bags (in each category)
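The measures above can be verified with Python's built-in statistics module. The sketch below uses the centrality sample from the earlier exercise; note that statistics.variance() and statistics.stdev() use the n − 1 (sample) formulas shown above.

# Measures of centrality and spread for the sample 23 24 25 27 27 28 30 31 32 34 36.
import statistics

sample = [23, 24, 25, 27, 27, 28, 30, 31, 32, 34, 36]

print("mean:", statistics.mean(sample))          # average value
print("median:", statistics.median(sample))      # middle value of the sorted list
print("mode:", statistics.mode(sample))          # most frequent value (27)
print("range:", max(sample) - min(sample))       # max - min
print("variance:", statistics.variance(sample))  # sample variance, divides by n - 1
print("std dev:", statistics.stdev(sample))      # square root of the variance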
Exercise on Variance and Standard Deviation
Suppose a sample of data is 1, 2, 3, 4, 5, 6. Calculate the following:
– Variance
– Standard deviation

Solution of Previous Exercise
Mean, x̄ = (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5
Variance, s² = [(1 − 3.5)² + (2 − 3.5)² + (3 − 3.5)² + (4 − 3.5)² + (5 − 3.5)² + (6 − 3.5)²] / (6 − 1) = (6.25 + 2.25 + 0.25 + 0.25 + 2.25 + 6.25) / 5 = 17.5 / 5 = 3.5
Standard deviation, s = √3.5 ≈ 1.87

Box and Whisker Plots
A box and whisker plot displays the five-number summary of a set of data. The five-number summary is the minimum, first (lower) quartile, median, third (upper) quartile, and maximum:
– Minimum (Q0 or 0th percentile)
– Maximum (Q4 or 100th percentile)
– Median (Q2 or 50th percentile)
– First quartile (Q1 or 25th percentile)
– Third quartile (Q3 or 75th percentile)

Box and Whisker Plots
In a box plot:
– We draw a box from the first quartile to the third quartile.
– A vertical line goes through the box at the median.
– The whiskers go from each quartile to the minimum or maximum.
The matplotlib.pyplot.boxplot() function plots a box plot.

Box Plots
Box plots are used for visualizing the spread, the median, and outliers. X is an outlier if X < Q1 − 1.5·IQR or X > Q3 + 1.5·IQR, where IQR = Q3 − Q1.

Exercise on Box Plot
A sample of 10 boxes of raisins has these weights (in grams): 28, 25, 29, 29, 30, 35, 34, 35, 37, 38. Draw a box plot for this data.

Solution of Previous Exercise
A sample of 10 boxes of raisins has these weights (in grams): 28, 25, 29, 29, 30, 35, 34, 35, 37, 38.
Step 1: Order the data from smallest to largest: 25, 28, 29, 29, 30, 34, 35, 35, 37, 38.
Step 2: Find the median. The median is the mean of the middle two numbers: the median is 32.
Step 3: Find the quartiles. The first quartile is the median of the data points to the left of the median: 25, 28, 29, 29, 30, so Q1 = 29. The third quartile is the median of the data points to the right of the median: 34, 35, 35, 37, 38, so Q3 = 35.
Step 4: Complete the five-number summary by finding the min and the max. The min is the smallest data point, which is 25. The max is the largest data point, which is 38.
The five-number summary is: 25, 29, 32, 35, 38.

Summary
Measures of Centrality
Measures of Spread
Effect of outliers on centrality and spread
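As a quick check of the exercise above, the sketch below computes the five-number summary and outlier fences for the raisin data with NumPy and draws the box plot with matplotlib.pyplot.boxplot(). With this data the default quartile convention happens to agree with the values worked out by hand (25, 29, 32, 35, 38), though other quartile conventions can differ slightly.

# Five-number summary, outlier fences, and a box plot for the raisin weights.
import numpy as np
import matplotlib.pyplot as plt

weights = [28, 25, 29, 29, 30, 35, 34, 35, 37, 38]

q1, median, q3 = np.percentile(weights, [25, 50, 75])
iqr = q3 - q1
print("five-number summary:", min(weights), q1, median, q3, max(weights))
print("outlier fences:", q1 - 1.5 * iqr, q3 + 1.5 * iqr)

plt.boxplot(weights)
plt.ylabel("Weight (grams)")
plt.title("Raisin box weights")
plt.show()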
Inferential Statistics
Dr. Faisal Anwer
Department of Computer Science
Aligarh Muslim University, Aligarh-202002
Copyright © Dept of Computer Science, AMU, Aligarh. Permission required for reproduction or display.

Recap
Measures of Centrality
Measures of Spread

Contents
Idea behind Inferential Statistics
Population vs. Sample
Sampling Methods
Correlation and Regression
Introduction to Hypothesis Testing

Why do we need Inferential Statistics?
Suppose you want to know the average age of data science professionals in India. Which of the following methods can be used to calculate it?
1. Meet every data science professional in India, or
2. Hand-pick a number of professionals in a city like Bangalore.

Inferential Statistics
Inferential statistics is at the heart of making decisions and predictions using data. Inferential statistics allow us to make predictions or inferences about a larger population from a sample. It is like being a statistical detective: investigating and making educated guesses from the data you have collected.

Importance of Inferential Statistics
– Drawing conclusions about the population from a sample.
– Checking whether a result obtained from a sample is statistically significant.
– Comparing two or more models.
– Useful in feature selection.

Key Concepts
Population vs. Sample
Sampling Methods
Correlation and Regression
Hypothesis Testing

Populations and Sampling
Every inferential statistic is based on the idea of a population and a sample. The population is the whole group you are interested in, while a sample is a smaller group drawn from that population. The goal is to use the sample data to make inferences about the population.

Populations and Sampling
There are different types of sampling methods, such as:
– Simple random sampling
– Stratified sampling
– Cluster sampling
A sample, and the resulting statistic, will be useful only if it is representative of the population.

Simple Random Sampling
We select a group of subjects (a sample) for study from a larger group (a population). Each individual is chosen randomly and entirely by chance; each individual has the same probability of being chosen at any stage during the sampling process. Example: imagine we are conducting a study on the eating habits of high school students in a city.

Stratified Sampling
A method of sampling that involves dividing the population into smaller groups, known as strata. Samples are then taken from each stratum to ensure that the sample includes members from each segment of the population.

Cluster Sampling
Cluster sampling is applied when "natural" groupings are evident. It involves dividing the population into clusters; one or more clusters are then chosen at random, and everyone within the selected clusters is sampled. The clusters may be, for example, individual villages or geographical areas.

Examples and Applications
Medical researchers use inferential statistics to determine the effectiveness of a new drug. Economists use inferential statistics to predict future economic conditions. Businesses use them to understand consumer behavior and to improve products and services.

Correlation and Regression
Correlation and regression analysis allow us to examine the relationships between variables. Correlation measures the strength and direction of a relationship between two variables; regression allows us to predict one variable based on the value of another.

Techniques to Validate Inferences
To determine whether the inferences made from a sample are valid, statisticians use a variety of techniques:
– Hypothesis testing
– Validation techniques in data science

Hypothesis Testing
One of the main tools of inferential statistics is hypothesis testing. It allows us to decide whether there is enough evidence to support a particular belief or hypothesis about a population. The validity of inferences is often tested by setting up a null hypothesis (H0) and an alternative hypothesis (H1).

Validation Techniques in Data Science
Cross-validation techniques such as k-fold cross-validation (a minimal sketch appears at the end of this section). Steps:
– Partition the data: split your data into k equally sized segments or "folds".
– Train and test: use k − 1 folds for training your model and the remaining fold for testing, rotating the test fold each time.
– Aggregate results: after testing with each fold, average the results to get an overall performance metric.

Summary
Sampling Methods
Correlation and Regression
Techniques to validate inferences
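Below is a minimal sketch of the k-fold procedure just described, using scikit-learn's cross_val_score on its built-in iris dataset; the choice of model (logistic regression) and k = 5 are illustrative assumptions, not prescribed by the slides.

# k-fold cross-validation: train on k-1 folds, test on the held-out fold, aggregate.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# cv=5 partitions the data into 5 folds and rotates the test fold each time.
scores = cross_val_score(model, X, y, cv=5)
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())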