Summary

This document is a summary of data science concepts, including key terms, data vs. big data, and data basics. It covers topics such as machine learning, statistical algorithms, data engineering, data preparation, and supervised vs. unsupervised learning. The summary also details clustering, dimensionality reduction, and data science practices.

Full Transcript

Data Science Summary

1. Definition of Key Terms
- The first data scientist was Johannes Kepler (analyst of astronomical data); he compiled the Rudolphine Tables.
- Data Science: analysis, processing and interpretation of data.
- Machine Learning: machines learn to predict and decide without being explicitly programmed, by learning patterns.
- Artificial Intelligence: the ability to complete tasks that require human intelligence.
- Business Intelligence: collects and analyzes data to ease decision-making in business organisations.
- Predictive Analytics: predictions based on historical data, statistical algorithms and machine learning techniques.
- Data Engineering: designing, maintaining and developing the systems that are needed to work with different kinds of data.

2. Data vs Big Data
- Data represents information stored digitally.
- Classical data: traditional data that is
  - structured (arranged in a predefined schema),
  - low to moderate in volume,
  - predictable (follows a consistent structure, easy to analyze),
  - stored centrally (centralized databases, DBMS),
  - processed with traditional tools and analyzed with classical statistics.
- Big Data: the three V's, Volume, Velocity and Variety (terabytes to petabytes).
  - Volume: massive size.
  - Velocity: generated and processed at high speed, often in real time.
  - Variety: various formats, structured, unstructured (e.g. PDFs, images) and semi-structured (e.g. e-mails, where sender and recipient are structured but the content is not).
  - Complexity: interconnectedness and diversity.
  - Processing requires advanced algorithms, storage is distributed, and the analytical approaches are machine learning, data mining and real-time processing.

3. Data Basics
- Dataset: a collection of related data in a structured format, often represented as a table; each row is a data object, each column an attribute.
- Object: an individual unit of data within the dataset (e.g. a person), characterized by a set of attributes.
- Attributes: characteristics or properties that describe the data objects in a dataset.
- Qualitative attributes: categorical, no magnitude.
- Quantitative attributes: discrete (take on a finite number of values) or continuous (take on any value within a given interval).
- Attribute types:
  - Nominal: names or labels that classify into distinct groups; categorical, no meaningful order. Binary attributes have just two categories: 0 indicates absence, 1 indicates presence.
  - Ordinal: meaningful order, but the difference between values is unknown (e.g. drink sizes).
  - Numerical: quantitative, measurable quantities, integer or real valued. Mean (average), median (middle value by magnitude) and mode (most frequent value) can be calculated.
  - Interval-scaled: a scale with equal-sized units; values can be positive, negative or zero; ranking is possible and meaningful.
  - Ratio-scaled: has an inherent zero point (e.g. Kelvin or income in euros).

4. Data Preparation
- Read the file.
- Remove duplicated rows.
- Merge data (join options, see the sketch below):
  - left: keep all rows of the first dataset,
  - right: keep all rows of the second dataset,
  - inner: keep only rows present in both,
  - outer: keep all rows, each only once.
- pd.Series.isin() checks whether each element of a series is contained in another series.
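A minimal pandas sketch of the merge options and pd.Series.isin() described in section 4; the tables customers and orders and their columns are made up for illustration.

```python
import pandas as pd

# Hypothetical example tables; names and columns are purely illustrative.
customers = pd.DataFrame({"customer_id": [1, 2, 3, 3],
                          "name": ["Ada", "Ben", "Cara", "Cara"]})
orders = pd.DataFrame({"customer_id": [2, 3, 4],
                       "amount": [10.0, 25.5, 7.9]})

customers = customers.drop_duplicates()  # remove duplicated rows

# The join options described above.
left = pd.merge(customers, orders, on="customer_id", how="left")    # all customers
right = pd.merge(customers, orders, on="customer_id", how="right")  # all orders
inner = pd.merge(customers, orders, on="customer_id", how="inner")  # only matches
outer = pd.merge(customers, orders, on="customer_id", how="outer")  # everything, once

# pd.Series.isin: which customers also appear in the orders table?
print(customers["customer_id"].isin(orders["customer_id"]))  # False, True, True
```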
5. Data Preprocessing
- Scikit-learn is the Swiss army knife for data preprocessing, transformation and learning.
- Missing values: data.isna() returns a DataFrame of Boolean values; data.isna().sum() counts the missing values per column.
- Missing values in disguise: missing values encoded in various other ways; replace them with a proper missing value across the whole DataFrame using DataFrame.replace().
- Reasons for missing values: data collection issues, technical challenges, intentional omissions.
- Dropping: remove columns (or rows) with data.dropna(axis=1), for example above a threshold of 10 percent missing values, but this means a loss of information!
- Imputation techniques: fill in manually, use a global constant (e.g. "no", but how to choose it?), use a central tendency (replace missing values with the mean, median or mode), or use machine learning (linear regression, k-nearest neighbors or decision trees).
- Transformation: discretize numerical attributes (binning). This removes meaningless information (noise); the data is separated into intervals, which has a smoothing effect.
  - fit(X) learns the parameters from the data; transform(X) applies the learned parameters to the data.
  - strategy="quantile" creates bins with the same number of observations; a matrix is returned.
- Normalization:
  - Min-max normalization: values are scaled to the range between 0 and 1.
  - Z-score normalization: data is centered around zero with a standard deviation of one (for normally distributed data).
- One-hot encoding: categorical variables are encoded as numerical attributes (a binary vector in which exactly one bit is set to 1).
- Label encoding: each category of a categorical variable is encoded as a unique integer.
Minimal sketches of these preprocessing steps follow below.
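A minimal sketch of the missing-value handling from section 5, assuming a small made-up DataFrame and using scikit-learn's SimpleImputer for the central-tendency imputation mentioned above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with missing values, some of them "in disguise" (encoded as "?").
data = pd.DataFrame({"age": [25, np.nan, 40, "?"],
                     "city": ["Ulm", None, "Ulm", "Berlin"]})

# Unify the disguised missing values across the whole DataFrame, then count per column.
data = data.replace("?", np.nan)
print(data.isna().sum())

# Option 1: drop columns or rows containing missing values (loss of information!).
without_cols = data.dropna(axis=1)
without_rows = data.dropna(axis=0)

# Option 2: impute with a central tendency (here the mean of the numeric column).
data["age"] = pd.to_numeric(data["age"])
data["age"] = SimpleImputer(strategy="mean").fit_transform(data[["age"]]).ravel()
```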
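A minimal sketch of quantile binning, assuming scikit-learn's KBinsDiscretizer is the discretizer behind the fit/transform and "quantile" remarks above; the age values are invented.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[18], [22], [25], [31], [38], [45], [52], [67]])  # e.g. ages

# Three bins with (roughly) the same number of observations each.
binner = KBinsDiscretizer(n_bins=3, encode="onehot-dense", strategy="quantile")
binner.fit(X)                    # fit(X): learn the bin edges from the data
X_binned = binner.transform(X)   # transform(X): apply the learned bin edges
print(X_binned)                  # a matrix is returned (one-hot encoded bins)
print(binner.bin_edges_)         # the learned quantile edges
```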
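A minimal sketch of min-max and z-score normalization plus one-hot and label encoding with scikit-learn; the income and drink-size values are invented.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, OneHotEncoder, StandardScaler

income = np.array([[1200.0], [2300.0], [3100.0], [5400.0]])

# Min-max normalization: values are scaled into the range [0, 1].
print(MinMaxScaler().fit_transform(income))

# Z-score normalization: centered around zero with standard deviation one.
print(StandardScaler().fit_transform(income))

# One-hot encoding: each category becomes a binary vector with exactly one 1.
# (sparse_output needs scikit-learn >= 1.2; older versions use sparse=False.)
sizes = np.array([["S"], ["M"], ["L"], ["M"]])
print(OneHotEncoder(sparse_output=False).fit_transform(sizes))

# Label encoding: each category becomes a unique integer (here L=0, M=1, S=2).
print(LabelEncoder().fit_transform(["S", "M", "L", "M"]))
```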
6. Supervised vs Unsupervised Learning
- An algorithm is a set of mathematical instructions or rules that, especially when given to a computer, helps to calculate an answer to a problem. Algorithms are used to solve problems such as modelling data to make predictions for unseen data, or clustering data to find patterns.
- Supervised learning: algorithms learn from labeled training data; the provided training data contains the correct answers.
  - Classification: predicting categories or labels; the output is a fixed set of classes (binary classification in the two-class case).
  - Regression: predicting continuous numerical values; finding patterns in the data to estimate a mathematical function.
  - Examples: logistic regression (classification), linear regression, decision trees, random forests.
- Unsupervised learning: works with unlabeled data to discover hidden patterns and structures; no correct answers are provided. The task is to find groupings, reduce complexity or reveal underlying structures (there is no output y as in supervised learning). Results are harder to evaluate, and domain knowledge is crucial.
  - Clustering: grouping similar data points together based on their features (e.g. k-means).
  - Dimensionality reduction: reducing the number of input features while preserving the most important information, in order to simplify complex data, speed up algorithms and improve model performance (e.g. Principal Component Analysis, PCA).

7. Regression
- In linear regression, the relationship between the variables is not described exactly; a random error term is added.
- The goal is to find the best-fit line by solving a minimization problem (minimizing the sum of the squared residuals).
- Coefficient of determination (R²): describes how well the regression model fits the observed data; it ranges from 0 (extremely poor fit) to 1 (all of the variance is explained). (A code sketch follows after section 11.)

8. Classification
- Logistic regression outputs probabilities between 0 and 1. It computes a linear combination of the features (each feature is assigned a weight) with a bias term added at the end.
- Sigmoid function: transforms this value z into a number between 0 and 1, σ(z) = 1 / (1 + e^(−z)). If the sigmoid output is greater than 0.5 we predict 1, otherwise 0.
- The optimization problem: we use MLE (maximum likelihood estimation) to find the best parameters (weights and bias), i.e. those that maximize the likelihood of the observed data. If the model predicts outcome 1 for a sigmoid value of 0.9, the loss is small; if the sigmoid value is 0.1 and the model predicts 1, the loss is high.
- Split the data into a training set (to teach our model) and a test set (to evaluate how well the model learned); set the random state to ensure the same split every time.
- Evaluate the model: model.score reports how often the model was correct. (A code sketch follows after section 11.)

9. Decision Tree
- Non-linear; can be used for regression and classification. The focus is on a particular algorithm called CART (Classification and Regression Trees).
- The CART algorithm divides the dataset into two subsets based on a specific feature and threshold. Greedy optimization starts with a single root node splitting the data into two partitions and then adds further nodes; each split is optimized locally.
- The best split is determined by a criterion that differs between regression and classification problems:
  - Regression: the feature/threshold combination is chosen by minimizing the residual sum of squares (RSS). With minimized RSS the splitting nodes are at better positions, homogeneous groups are built and the predictions become more accurate.
  - Classification: we minimize the Gini impurity. The Gini index encourages leaf nodes in which the majority of observations belong to a single class. The prediction at each leaf node is the majority class among the training observations in that node.
- Tree size: pruning is the process of removing nodes that do not improve the model's performance; it balances the RSS error or Gini impurity against model complexity.
- Advantages: easy to interpret and visualize. Limitations: overfitting, sensitive to the data. (A sketch covering decision trees and random forests follows after section 11.)

10. Random Forest
- A combination of multiple decision trees whose predictions are averaged: slightly reduced interpretability, but a more robust model that improves generalization and reduces overfitting.
- Bootstrap sampling: samples are drawn with replacement, so some samples are repeated and others are not included at all; each tree sees a different version of the training data.
- Random feature selection: only a random subset of the features is considered at each split.
- Regression: the decisions of all trees are averaged. Classification: a majority vote is taken.
- The trees are not pruned; each one on its own does not generalize well, but together they form a strong model.
- Feature importance: we can measure how much each input variable contributes to predicting the target variable (out-of-bag samples, i.e. samples not used to build a given tree, can be used for this). A high value indicates that the feature contributes more to making correct predictions. Feature importance helps with:
  - feature selection (which features are most relevant for predictions),
  - model interpretation (which features drive the model's decisions),
  - data collection (highlighting important measurements for future data collection).

11. Clustering
- Aims to uncover patterns or structures in the data.
- k-means: an algorithm that groups similar data points together based on their attributes. Each point x is assigned to the cluster with the closest center; the objective J is to minimize the sum of squared distances between the points and their assigned centers (the distortion measure).
  - First, the cluster centers are initialized with random values.
  - Assignment step: keep the centers fixed and minimize J by assigning each point to its closest center (the assignment variables r).
  - Update step: keep the assignments r fixed and minimize J by updating the cluster centers.
- Elbow method: a method to choose the number of clusters. Plot J for different numbers of clusters K; the optimal number is the point where the decrease flattens out.
- k-means requires numerical data. It can also be used for anomaly detection: form clusters and treat points that are far from every center as anomalies.
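A minimal sketch for section 7: ordinary linear regression on synthetic data with a random error term, and R² via model.score.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1.0, size=100)  # true line plus random error

model = LinearRegression().fit(X, y)   # minimizes the sum of squared residuals
print(model.coef_, model.intercept_)   # estimated slope and intercept
print(model.score(X, y))               # coefficient of determination R², between 0 and 1
```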
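A minimal sketch for section 8: logistic regression with a train/test split, a fixed random state and model.score; the built-in breast cancer dataset is used only as a convenient example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# random_state makes the split (and thus the outcome) reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Logistic regression: linear combination of the features plus bias, squashed by the
# sigmoid; predictions are 1 where the predicted probability exceeds 0.5.
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print(model.predict_proba(X_test[:3]))  # probabilities between 0 and 1
print(model.score(X_test, y_test))      # fraction of correct predictions
```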
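A minimal sketch for sections 9 and 10: a single CART tree and a random forest with feature importances, on the same example dataset (depth limit and tree count are illustrative choices).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A single CART tree (criterion="gini"); max_depth acts as a simple size/pruning control.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print("tree:", tree.score(X_test, y_test))

# Random forest: bootstrap samples + random feature subsets, majority vote of many trees.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
print("forest:", forest.score(X_test, y_test))
print("feature importances:", forest.feature_importances_[:5])  # contribution per feature
```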
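A minimal sketch for section 11: k-means on synthetic blobs, with the distortion J (inertia_) printed for different K as in the elbow method.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Elbow method: inspect the distortion J for different K; the optimal K is where
# the decrease flattens out.
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))

# Final clustering with the chosen K.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
```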
12. Dimensionality Reduction
- Principal Component Analysis (PCA): reduces dimensionality while retaining the critical information.
- What is PCA? A linear transformation technique that identifies the directions in which the data varies the most.
- Applications: data visualization, preprocessing, feature engineering.
- How does it work?
  - Compute the covariance matrix (the spread of the data).
  - Eigendecomposition: identify the eigenvectors (directions of the principal components) and eigenvalues (amount of variance captured by each component) of that matrix.
  - Rank the components from high variance to low variance.
  - Transform the data: project the original data onto the top principal components to reduce its dimensionality.
- Dimensionality reduction helps combat the curse of dimensionality: the performance of clustering models deteriorates as the number of features increases.
- Attention: PCA is sensitive to the scale of the features (preprocessing steps!). Retain enough components to explain 90 to 95 % of the variance. (A code sketch follows at the end of the document.)

13. Data Science in Practice
- Nominal and ordinal attributes: one-hot encoding.
- Bins for age etc., because of the large number of outliers.
- Z-score normalization of the remaining numerical features, to ensure that no feature has a disproportionately large impact on the model.
- We have to split the data before fitting any preprocessing, to prevent information leakage. Otherwise the test set is no longer a good representation of unseen data, and any scores calculated on it are no longer a good indicator of the model's performance!
- A column transformer bundles the preprocessing steps.
- Modelling: shuffle the data when splitting, if possible, to avoid an ordering effect. Never call fit on the test data (information leakage).
- If the classes are imbalanced, accuracy is not a good metric to evaluate the performance of the model; balanced accuracy is a better metric for imbalanced classes:
  - Confusion matrix: TP, TN, FP, FN; we want to minimize FP and FN.
  - Balanced accuracy in the binary case is the arithmetic mean of sensitivity and specificity.
  - It ranges from 0 to 1.
- The model's performance varies greatly if no random state is fixed, because the results are then not reproducible!
- Logistic regression models often achieve performance similar to a random forest.
- Hyperparameter tuning is a tool to find the optimal parameters of, for example, a random forest (depth, number of trees) to maximize the performance of the model.
- If the performance does not meet expectations: feature engineering, different preprocessing steps, model selection, and sometimes more (and more representative) data. (A code sketch follows at the end of the document.)

14. End to End
- Model persistence: with pickle we can save any Python object and load it back later; for example, we can pickle a list, or use a dictionary to bundle our model with other objects.
- Caution: pickle files can execute arbitrary code when loaded (security risk). (A code sketch follows at the end of the document.)

15. Bonus
- A pipeline is a sequence of steps, where each step is a tuple containing a name and a transformer/estimator.
- A grid search evaluates all models and hyperparameter combinations using k-fold cross-validation; the best model is selected based on the balanced accuracy. (See the pipeline sketch at the end of the document.)
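A minimal sketch for section 12: standardize first (PCA is scale-sensitive), then keep enough components for roughly 95 % of the variance; the dataset is again just an example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)

# PCA is sensitive to scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps just enough components to explain that share of variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape)                          # far fewer columns than the original
print(pca.explained_variance_ratio_.cumsum())   # cumulative explained variance
```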
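A minimal sketch for section 13: a ColumnTransformer bundling the preprocessing, a shuffled split with a fixed random state before any fitting, and balanced accuracy plus the confusion matrix as metrics. The tiny DataFrame, its columns and the churn target are entirely made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: one categorical and one numerical feature, binary target.
df = pd.DataFrame({
    "contract": ["monthly", "yearly", "monthly", "yearly", "monthly", "monthly"],
    "income": [1800, 4200, 2100, 3900, 2500, 2200],
    "churn": [1, 0, 1, 0, 0, 1],
})
X, y = df[["contract", "income"]], df["churn"]

# Split first (shuffled, fixed random_state), fit preprocessing only on training data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y)

preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["contract"]),  # nominal attributes
    ("num", StandardScaler(), ["income"]),                          # z-score normalization
])
clf = Pipeline([("prep", preprocess), ("model", LogisticRegression())])
clf.fit(X_train, y_train)  # never call fit on the test data

y_pred = clf.predict(X_test)
print(balanced_accuracy_score(y_test, y_pred))  # better than accuracy for imbalanced classes
print(confusion_matrix(y_test, y_pred))         # rows: true class, columns: predicted class
```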
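A minimal sketch for section 14: saving and reloading a model with pickle, wrapped in a dictionary; the file name is arbitrary, and the security caveat above applies (only unpickle files you trust).

```python
import pickle
from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])

# Save the model (here wrapped in a dictionary, so metadata can travel with it).
with open("model.pkl", "wb") as f:
    pickle.dump({"model": model, "version": 1}, f)

# Load it back later. Warning: pickle can execute arbitrary code, load trusted files only.
with open("model.pkl", "rb") as f:
    bundle = pickle.load(f)
print(bundle["model"].predict([[2.5]]))
```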
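A minimal sketch for section 15: a Pipeline of (name, transformer/estimator) tuples and a grid search with 5-fold cross-validation scored by balanced accuracy; the parameter grid values are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Each pipeline step is a (name, transformer/estimator) tuple.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("forest", RandomForestClassifier(random_state=0)),
])

# Hyperparameter tuning: depth and number of trees, 5-fold CV, balanced accuracy.
grid = GridSearchCV(
    pipe,
    param_grid={"forest__max_depth": [3, 5, None], "forest__n_estimators": [100, 200]},
    cv=5,
    scoring="balanced_accuracy",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))  # best model by balanced accuracy
```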