Pandas Resampling Methods

Machine Learning

Machine learning (ML) involves building mathematical models to help understand data.
"Learning" in ML refers to a model's tunable parameters being adapted to observed data, so that the fitted model can make predictions or decisions about new data.

Categories of Machine Learning

Supervised Learning: models learn from labeled data to predict labels or outcomes for new data.
- Classification: labels are discrete categories (e.g., spam vs. not spam emails).
- Regression: labels are continuous quantities (e.g., predicting a person's height).
Unsupervised Learning: models learn from unlabeled data to identify patterns or relationships.
- Clustering: identifying distinct groups of data.
- Dimensionality Reduction: finding more concise representations of data.
Semi-supervised Learning: combines supervised and unsupervised learning, using labeled and unlabeled data.
Reinforcement Learning: a model learns from interactions with a dynamic environment to achieve a goal.

Scikit-Learn

Scikit-learn is a Python library for machine learning, providing efficient algorithms for predictive data analysis.
Features:
- Classification, regression, and clustering algorithms.
- Support for supervised and unsupervised learning.
- Not included: reinforcement learning algorithms, and GPU acceleration is not a design goal of the library.

Dataset

A dataset is a collection of data, often organized as tabular data.
Each row represents a sample, and each column represents a feature or attribute.
Features can be numerical, categorical, or other types of data.
A target variable is the quantity to be predicted from the other features; its known values serve as labels during training.

Scikit-Learn's Estimator API

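Consistency: all objects share a common interface built from a limited set of methods.
Inspection: all specified parameter values are exposed as public attributes.
Limited object hierarchy: only algorithms are represented by Python classes; datasets use standard formats (NumPy arrays, pandas DataFrames, SciPy sparse matrices).
Composition: many algorithms are composed of more fundamental algorithms.
Sensible defaults: parameters that a user does not set take reasonable default values.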
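
The principles above can be seen in a few lines. This is a minimal sketch, assuming a made-up toy dataset (y = 3x) and LinearRegression as the example estimator; any other scikit-learn estimator follows the same construct, fit, predict pattern.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hyperparameters are set at construction; anything left out keeps its default.
model = LinearRegression(fit_intercept=True)

# Inspection: constructor parameters are available through get_params().
params = model.get_params()

# Toy data, invented for illustration: y = 3x exactly.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 6.0, 9.0, 12.0])

# Consistency: every estimator learns from data through fit().
model.fit(X, y)

# Quantities learned from data live in attributes that end with an underscore.
slope = model.coef_[0]
prediction = model.predict(np.array([[5.0]]))
```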

Dataset Loaders and Generators

Scikit-learn provides dataset loaders and generators.
Dataset loaders: load small standard datasets bundled with the library; related fetchers download larger datasets from online repositories.
Dataset generators: generate artificial datasets of controlled size and complexity.
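
As an illustration, the sketch below uses one loader (load_iris) and one generator (make_blobs). The sample counts and the random seed are arbitrary choices for the example.

```python
from sklearn.datasets import load_iris, make_blobs

# Loader: the iris dataset ships with scikit-learn (150 samples, 4 features).
# X_iris is the feature matrix (rows = samples), y_iris is the target vector.
X_iris, y_iris = load_iris(return_X_y=True)

# Generator: 200 synthetic 2-D points scattered around 3 cluster centers.
X_blobs, y_blobs = make_blobs(
    n_samples=200, n_features=2, centers=3, random_state=0
)
```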

Supervised Learning

Supervised learning involves learning from labeled data to predict labels or outcomes.
Goals: learn a function that maps input features to output labels.
Supervised learning algorithms: learn from labeled data and make predictions on new data.

Classification

Classification: predicting a discrete label or category.
Classification algorithms: predict a label or category based on input features.

Regression

Regression: predicting a continuous quantity or value.
Regression algorithms: predict a continuous value based on input features.

Fitting, Regression, and Least Squares

Linear regression: finds a linear relationship between input features and output values.
Least squares: minimizes the sum of squared differences between predicted and actual values.
Ordinary least squares (OLS): a common method for linear regression.
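
A small sketch of ordinary least squares, assuming synthetic data generated from y = 2x + 1 with a little noise. It fits the line with LinearRegression and then solves the same least-squares problem directly with NumPy, so the two results can be compared.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y = 2x + 1 plus small noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)

# LinearRegression fits its coefficients by ordinary least squares.
model = LinearRegression().fit(x.reshape(-1, 1), y)

# Same problem solved directly: minimize ||A @ [slope, intercept] - y||^2,
# where the design matrix A has columns [x, 1].
A = np.column_stack([x, np.ones_like(x)])
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
```

Both approaches minimize the same sum of squared residuals, so model.coef_[0] and model.intercept_ agree with slope and intercept up to floating-point error.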

Nearest Neighbors Regression

Nearest neighbors regression: predicts a value based on the nearest neighbors in a dataset.
k-NN regression: predicts a value based on the k nearest neighbors.
Radius-based regression: predicts a value based on neighbors within a fixed radius.
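
The two variants can be compared on a tiny hand-made dataset. This is a sketch with invented values (y = x squared at five points); with default settings both regressors predict the plain average of the selected neighbors' targets.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor, RadiusNeighborsRegressor

# Toy training data, invented for illustration: y = x**2.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([0.0, 1.0, 4.0, 9.0, 16.0])

# k-NN: average the targets of the 2 closest training points.
# For x = 2.4 those are x = 2 and x = 3, so the prediction is (4 + 9) / 2.
knn = KNeighborsRegressor(n_neighbors=2).fit(X, y)
knn_pred = knn.predict([[2.4]])[0]

# Radius-based: average the targets of all training points within distance 1.5.
# For x = 2.4 those are x = 1, 2 and 3, so the prediction is (1 + 4 + 9) / 3.
radius = RadiusNeighborsRegressor(radius=1.5).fit(X, y)
radius_pred = radius.predict([[2.4]])[0]
```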

Regression Metrics

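R² (coefficient of determination): measures the goodness of fit of a regression model; 1.0 is a perfect fit, and a model that always predicts the mean of the targets scores 0.0.
score(): the estimator method that returns R² for a regression model on the given data.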
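
A short sketch, on made-up data, showing that the definition R² = 1 - SS_res / SS_tot, the r2_score function, and a regressor's score() method all give the same number.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Toy data, invented for illustration: roughly y = x.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# R^2 from its definition: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# The same value from the metrics module and from the estimator itself.
r2_metric = r2_score(y, y_pred)
r2_method = model.score(X, y)
```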

Classification Metrics

Accuracy score: computes the accuracy of a classification model.
Confusion matrix: a table that summarizes the performance of a classification model.
Precision, recall, and F1 score: metrics that evaluate the performance of a classification model.

Distance Metrics

Euclidean distance metric for continuous variables
Hamming (overlap) distance metric for discrete variables
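
Both distances are easy to compute by hand. The sketch below uses NumPy only, with invented example vectors.

```python
import numpy as np

# Euclidean distance between two continuous feature vectors.
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])
euclidean = np.sqrt(np.sum((a - b) ** 2))  # sqrt(9 + 16 + 0)

# Hamming (overlap) distance between two discrete feature vectors:
# the number of positions at which the values differ.
u = np.array(["red", "small", "round"])
v = np.array(["red", "large", "round"])
hamming_count = int(np.sum(u != v))

# Some libraries report the fraction of differing positions instead.
hamming_fraction = hamming_count / len(u)
```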

K-Nearest Neighbors (KNN)

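Drawback: with a skewed class distribution, majority voting is dominated by the more frequent class, which hurts KNN performance
Mitigation: weight each neighbor's vote, commonly by 1/d, where d is the distance to that neighbor
Can also be used with correlation coefficients (Pearson, Spearman) as the similarity measure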
Value of k: A larger value reduces noise effect, but makes boundaries less distinct
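
A sketch of how these choices look in scikit-learn, assuming the bundled iris dataset and an arbitrary 70/30 split. weights="distance" makes closer neighbors count more (inverse-distance weighting), and the loop tries a few values of k.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Test-set accuracy for a few neighborhood sizes.
scores = {}
for k in (1, 5, 15):
    # weights="distance": each neighbor votes with weight 1 / distance.
    clf = KNeighborsClassifier(n_neighbors=k, weights="distance")
    clf.fit(X_train, y_train)
    scores[k] = clf.score(X_test, y_test)
```

In practice k is usually chosen by cross-validation rather than from a single split.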

Training Examples

Store feature vectors in multidimensional space with class labels
Used in classification to assign labels to unlabeled data

Multiclass SVM

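SVM is a 2-class classifier; multiclass problems are reduced to several binary problems
One vs. the rest: one classifier per class, separating that class from all others
One vs. one: one classifier per pair of classes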
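
The two strategies differ in how many binary classifiers they train. A minimal sketch, assuming a synthetic 4-class dataset from make_blobs and scikit-learn's generic multiclass wrappers around a linear SVC:

```python
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic data with 4 classes (sizes and seed are arbitrary).
X, y = make_blobs(n_samples=200, centers=4, random_state=0)

# One vs. the rest: one binary SVM per class.
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

# One vs. one: one binary SVM per pair of classes.
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

n_ovr = len(ovr.estimators_)  # number of classes
n_ovo = len(ovo.estimators_)  # n_classes * (n_classes - 1) / 2
```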

Clustering

Unsupervised learning: No target values, no labels, and no prior knowledge of classes
Clustering algorithms: Various algorithms with different understanding of clusters
Cluster models: Centroid, Connectivity, Distribution, Density

K-Means

Clustering algorithm: Partitioning n observations into k clusters
K-means with Sklearn: Compute k-means clustering, predict cluster index, and compute cluster centers
Attributes: cluster_centers_, labels_, inertia_
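
A minimal sketch of k-means in scikit-learn, assuming synthetic blob data and k = 3. It exercises the operations and fitted attributes listed above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data around 3 centers (sizes and seed are arbitrary).
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)

# Compute the clustering and return the cluster index of every sample.
labels = kmeans.fit_predict(X)

centers = kmeans.cluster_centers_  # one row per cluster center
inertia = kmeans.inertia_          # sum of squared distances to closest center

# Assign a new point to the nearest learned center.
new_label = kmeans.predict([[0.0, 0.0]])[0]
```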

K-Median

Variation of K-means: Calculates median instead of mean
More noise-tolerant: Minimizes the sum of absolute (L1) distances to the cluster center rather than the sum of squared distances, so outliers have less influence
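
The difference shows up in the center-update step. The sketch below is not a full k-medians implementation (scikit-learn does not ship one); it only compares the mean and the per-coordinate median of one invented cluster that contains an outlier.

```python
import numpy as np

# Points assigned to one cluster; the last point is an outlier (made-up data).
cluster = np.array([
    [1.0, 1.0],
    [2.0, 1.0],
    [1.0, 2.0],
    [2.0, 2.0],
    [50.0, 50.0],
])

# K-means update: the mean is dragged toward the outlier.
mean_center = cluster.mean(axis=0)

# K-medians update: the per-coordinate median stays with the bulk of the data.
median_center = np.median(cluster, axis=0)
```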

Mean Shift

Mode-seeking algorithm: Assigns datapoints to clusters based on density
Non-parametric method: No prior knowledge of number of clusters or shape of clusters
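
A sketch of mean shift with scikit-learn, assuming synthetic blob data. The number of clusters is not passed in; only a bandwidth is needed, and here it is estimated from the data with estimate_bandwidth (the quantile value is an arbitrary choice).

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Synthetic data around three well-separated centers (arbitrary choices).
X, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 0], [5, 5]],
    cluster_std=0.6,
    random_state=0,
)

# The bandwidth sets the size of the region used to estimate local density.
bandwidth = estimate_bandwidth(X, quantile=0.2, random_state=0)

ms = MeanShift(bandwidth=bandwidth).fit(X)

# The number of clusters is an output of the algorithm, not an input.
n_clusters = len(ms.cluster_centers_)
labels = ms.labels_
```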

Clustering Metrics

Homogeneity: Each cluster contains only one class
Completeness: All members of a class are assigned to the same cluster
Mutual Information: Measures agreement between two assignments, ignoring permutations
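
These metrics compare a clustering against known class labels. A small sketch with hand-written label lists, using adjusted_mutual_info_score as one mutual-information-based score:

```python
from sklearn.metrics import (
    adjusted_mutual_info_score,
    completeness_score,
    homogeneity_score,
)

labels_true = [0, 0, 1, 1]

# Same grouping with the cluster ids swapped: scores ignore the permutation.
permuted = [1, 1, 0, 0]
h_perm = homogeneity_score(labels_true, permuted)
c_perm = completeness_score(labels_true, permuted)
ami_perm = adjusted_mutual_info_score(labels_true, permuted)

# Every sample in its own cluster: each cluster is pure (homogeneous),
# but each class is split across clusters (not complete).
split = [0, 1, 2, 3]
h_split = homogeneity_score(labels_true, split)
c_split = completeness_score(labels_true, split)
```

Here the permuted labeling scores 1.0 on all three metrics, while the fully split labeling keeps homogeneity at 1.0 and drops completeness to 0.5.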

Resampling

Resampling to lower frequency: Involves aggregation operation
Resampling to higher frequency: Involves interpolation or data filling methods
pandas resample() method: Splits DatetimeIndex into time bins and groups data by time bin
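
A minimal sketch of both directions, assuming a made-up daily series. Downsampling needs an aggregation (sum here); upsampling creates new timestamps that have to be filled, for example by forward fill or interpolation.

```python
import pandas as pd

# Six daily observations (values invented for illustration).
idx = pd.date_range("2024-01-01", periods=6, freq="D")
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx)

# Lower frequency: group into 2-day bins and aggregate each bin.
down = s.resample("2D").sum()

# Higher frequency: 12-hour bins introduce timestamps with no data,
# so a filling method must be chosen.
up_ffill = s.resample("12h").ffill()         # repeat the last known value
up_interp = s.resample("12h").interpolate()  # linear interpolation
```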

Rolling Windows

Splitting data into time windows: Aggregates data with a function (e.g., mean, median, sum)
Overlapping windows: 'Roll' along at the same frequency as the original time series
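
A short sketch of a rolling mean on an invented daily series. Unlike resampling, the result keeps the original frequency: there is one output value per input timestamp, and the first window-1 entries are NaN because their windows are incomplete.

```python
import pandas as pd

idx = pd.date_range("2024-01-01", periods=5, freq="D")
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=idx)

# 3-observation window that slides forward one step at a time.
rolling_mean = s.rolling(window=3).mean()
```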

SVM - Classification

Supervised learning model: at its core a binary linear classifier; SVM methods are also used for regression and outlier detection
Maximal margin classifier: Finds optimal separating hyperplane, maximizing the margin

Kernel Trick

Non-linear classification: Implicitly maps inputs into high-dimensional feature spaces
Large margin classification: Fits the widest possible margin between two classes
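
The effect of the kernel trick can be seen on data that no straight line separates. This sketch assumes scikit-learn's make_circles generator (two concentric rings) and compares a linear kernel with an RBF kernel on the training data.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Linear kernel: restricted to a straight-line decision boundary.
linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)

# RBF kernel: implicitly works in a higher-dimensional feature space,
# where the two rings become separable.
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)
```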

Kernel Function

Determines the weight of nearby points when re-estimating the mean in the mean shift algorithm
Common kernel profiles: flat kernel (constant weight inside the bandwidth, zero outside) and Gaussian kernel
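
Both kernel profiles can be written as small NumPy functions. This is a sketch: the function names and the bandwidth parameter are illustrative, and the Gaussian is the standard form k(x) = exp(-x^2 / 2) / sqrt(2 pi).

```python
import numpy as np

def flat_kernel(x, bandwidth=1.0):
    """Weight 1 for points within the bandwidth, 0 outside."""
    return np.where(np.abs(x) <= bandwidth, 1.0, 0.0)

def gaussian_kernel(x):
    """Weight decays smoothly with distance: exp(-x^2 / 2) / sqrt(2 pi)."""
    return np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

# Distances of four hypothetical points from the current mean estimate.
distances = np.array([0.0, 0.5, 1.0, 2.0])

flat_weights = flat_kernel(distances)
gauss_weights = gaussian_kernel(distances)
```

With the flat kernel every point inside the bandwidth contributes equally; with the Gaussian kernel closer points contribute more.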

Hard Margin

Linearly separable data: Selects two parallel hyperplanes that separate the two classes so that the distance between them (the margin) is as large as possible

Soft Margin

Non-separable data: Maximizes the margin while penalizing points that fall inside it or on the wrong side, using the hinge loss function
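
The hinge loss for a label y in {-1, +1} and a decision score f(x) is max(0, 1 - y * f(x)). A minimal NumPy sketch with invented scores:

```python
import numpy as np

def hinge_loss(y_true, scores):
    """Per-sample hinge loss; y_true holds labels in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y_true * scores)

y_true = np.array([1, 1, -1, -1])
scores = np.array([2.0, 0.5, -0.3, 1.0])  # made-up decision values f(x)

losses = hinge_loss(y_true, scores)
```

The first sample is correct with enough margin (loss 0), the second and third are on the correct side but inside the margin (loss 0.5 and 0.7), and the fourth is misclassified (loss 2.0).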


Questions and Answers

What does the kernel function determine in the mean shift algorithm?

Which of the following is the formula for the FLAT KERNEL?

For which type of clustering algorithms is the GAUSSIAN kernel typically used?

What determines the size of the region over which the mean shift algorithm calculates the local density?

What are the two hyperplanes in hard margin classification known as?

Which method is used to predict the correct class label with enough margin in soft margin classification?

What is the hinge loss function associated with?

Which kernel function is defined as $k(x) = \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}$?

Which kernel has a constant output value within a specified bandwidth?

When dealing with linearly separable data, what is selected to separate the data as much as possible?

