Pandas Resampling Methods

EliteAgate2297

10 Questions

What does the kernel function determine in the mean shift algorithm?

The weight of nearby points for re-estimation of the mean

Which of the following is the formula for the FLAT KERNEL?

$k(x) = \begin{cases} 1 & \text{if } x \le h \\ 0 & \text{if } x > h \end{cases}$

For which type of clustering algorithms is the GAUSSIAN kernel typically used?

Density-based clustering

What determines the size of the region over which the mean shift algorithm calculates the local density?

Bandwidth

What are the two hyperplanes in hard margin classification known as?

Margin boundaries

Which method is used to predict the correct class label with enough margin in soft margin classification?

Gradient descent

What is the hinge loss function associated with?

Soft margin classification

Which kernel function is defined as $k(x) = \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}$?

Gaussian kernel

Which kernel has a constant output value within a specified bandwidth?

Flat kernel

When dealing with linearly separable data, what is selected to separate the data as much as possible?

Two hyperplanes

Study Notes

Machine Learning

  • Machine learning (ML) involves building mathematical models to understand data and make predictions or decisions based on that data.
  • "Learning" in ML refers to a model having tunable parameters that adapt to observed data; once fitted, the model can be applied to new, unseen data.

Categories of Machine Learning

  • Supervised Learning: models learn from labeled data to predict labels or outcomes for new data.
    • Classification: labels are discrete categories (e.g., spam vs. not spam emails).
    • Regression: labels are continuous quantities (e.g., predicting a person's height).
  • Unsupervised Learning: models learn from unlabeled data to identify patterns or relationships.
    • Clustering: identifying distinct groups of data.
    • Dimensionality Reduction: finding more concise representations of data.
  • Semi-supervised Learning: combines supervised and unsupervised learning, using labeled and unlabeled data.
  • Reinforcement Learning: a model learns from interactions with a dynamic environment to achieve a goal.

Scikit-Learn

  • Scikit-learn is a Python library for machine learning, providing efficient algorithms for predictive data analysis.
  • Features:
    • Classification, regression, and clustering algorithms.
    • Support for supervised and unsupervised learning.
    • Not included: reinforcement learning, which typically relies on GPUs for efficient computing.

Dataset

  • A dataset is a collection of data, often organized as tabular data.
  • Each row represents a sample, and each column represents a feature or attribute.
  • Features can be numerical, categorical, or other types of data.
  • A target variable is the column whose values the model learns to predict from the other features.

Scikit-Learn's Estimator API

  • Consistent interface for all objects.
  • Inspectable parameters.
  • Limited object hierarchy.
  • Composition: many algorithms are composed of more fundamental algorithms.
  • Sensible defaults: default values for parameters.
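
The principles above can be seen in a few lines. This is a minimal sketch, assuming `LinearRegression` as the example estimator and a made-up toy dataset; neither comes from the notes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data following y = 2x + 1 exactly (illustrative values only).
X = np.arange(5, dtype=float).reshape(-1, 1)   # shape (n_samples, n_features)
y = 2.0 * X.ravel() + 1.0

model = LinearRegression()       # sensible defaults: no arguments required
params = model.get_params()      # inspectable parameters, returned as a dict
model.fit(X, y)                  # consistent interface: estimators expose fit()
pred = model.predict(np.array([[10.0]]))

print(params["fit_intercept"])
print(model.coef_, model.intercept_)   # learned values end with an underscore
print(pred)
```

Swapping in another regressor generally only changes the import and the constructor line, which is the point of the consistent interface.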

Dataset Loaders and Generators

  • Scikit-learn provides dataset loaders and generators.
  • Dataset loaders: load small standard datasets bundled with the library; fetchers download larger popular datasets from online repositories.
  • Dataset generators: generate artificial datasets of controlled size and complexity.
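
As a short illustration, the sketch below uses one loader and one generator from `sklearn.datasets`. The choice of `load_iris` and `make_blobs`, and the sizes passed to the generator, are assumptions made for the example.

```python
from sklearn.datasets import load_iris, make_blobs

# Loader: returns a bundled dataset with features and target.
iris = load_iris()
X_iris, y_iris = iris.data, iris.target

# Generator: builds an artificial dataset of a chosen size and complexity.
X_blobs, y_blobs = make_blobs(
    n_samples=200, n_features=2, centers=3, random_state=0
)

print(X_iris.shape, X_blobs.shape)
```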

Supervised Learning

  • Supervised learning involves learning from labeled data to predict labels or outcomes.
  • Goals: learn a function that maps input features to output labels.
  • Supervised learning algorithms: learn from labeled data and make predictions on new data.

Classification

  • Classification: predicting a discrete label or category.
  • Classification algorithms: predict a label or category based on input features.

Regression

  • Regression: predicting a continuous quantity or value.
  • Regression algorithms: predict a continuous value based on input features.

Fitting, Regression, and Least Squares

  • Linear regression: finds a linear relationship between input features and output values.
  • Least squares: minimizes the sum of squared differences between predicted and actual values.
  • Ordinary least squares (OLS): a common method for linear regression.
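
To make the least squares idea concrete, here is a hedged sketch that fits a line two ways: directly with NumPy's least squares solver and with scikit-learn's `LinearRegression`. The synthetic data (slope 3, intercept -2, small noise) is invented for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3.0 * x - 2.0 + rng.normal(scale=0.5, size=x.size)

# Least squares by hand: minimize ||A @ [slope, intercept] - y||^2.
A = np.column_stack([x, np.ones_like(x)])
coef = np.linalg.lstsq(A, y, rcond=None)[0]
slope, intercept = coef

# The same fit through the estimator interface.
model = LinearRegression().fit(x.reshape(-1, 1), y)

print(slope, intercept)
print(model.coef_[0], model.intercept_)
```

Both routes minimize the same sum of squared residuals, so the two pairs of numbers should agree up to floating-point error.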

Nearest Neighbors Regression

  • Nearest neighbors regression: predicts a value based on the nearest neighbors in a dataset.
  • k-NN regression: predicts a value based on the k nearest neighbors.
  • Radius-based regression: predicts a value based on neighbors within a fixed radius.
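
The difference between the two variants shows up on a tiny example. This sketch assumes scikit-learn's `KNeighborsRegressor` and `RadiusNeighborsRegressor` with their default uniform weighting; the data points and the query are made up.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor, RadiusNeighborsRegressor

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([0.0, 1.0, 4.0, 9.0, 16.0])

knn = KNeighborsRegressor(n_neighbors=2).fit(X, y)
rnn = RadiusNeighborsRegressor(radius=1.5).fit(X, y)

query = np.array([[2.4]])
# k-NN: the two nearest points are x=2 and x=3, so the mean is (4 + 9) / 2.
knn_pred = knn.predict(query)[0]
# Radius: x=1, x=2 and x=3 lie within 1.5, so the mean is (1 + 4 + 9) / 3.
rnn_pred = rnn.predict(query)[0]

print(knn_pred, rnn_pred)
```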

Regression Metrics

  • R² (coefficient of determination): measures goodness of fit as the fraction of the target's variance that the model explains; 1.0 is a perfect fit.
  • score(): the method scikit-learn regressors provide to compute R² on given data.
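
For reference, the coefficient of determination can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks that against `sklearn.metrics.r2_score`; the four sample values are illustrative.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

r2_sklearn = r2_score(y_true, y_pred)

print(r2_manual, r2_sklearn)
```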

Classification Metrics

  • Accuracy score: computes the accuracy of a classification model.
  • Confusion matrix: a table that summarizes the performance of a classification model.
  • Precision, recall, and F1 score: metrics that evaluate the performance of a classification model.

Distance Metrics

  • Euclidean distance metric for continuous variables
  • Hamming (overlap) distance metric for discrete variables
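
Both metrics fit in a few lines of NumPy. The helper names `euclidean` and `hamming` are hypothetical, chosen for this sketch, and the Hamming version counts mismatching positions rather than returning a fraction.

```python
import numpy as np

def euclidean(a, b):
    """Straight-line distance between two numeric vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

def hamming(a, b):
    """Number of positions where two discrete vectors differ."""
    return int(np.sum(np.asarray(a) != np.asarray(b)))

d_cont = euclidean([0, 0], [3, 4])
d_disc = hamming(["red", "small", "round"], ["red", "large", "round"])

print(d_cont, d_disc)
```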

K-Nearest Neighbors (KNN)

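  • Drawback: With a skewed class distribution, the more frequent class tends to dominate the majority vote
  • Can be used with correlation coefficients (Pearson, Spearman) as the distance measure, and neighbors can be assigned weights (for example 1/d for a neighbor at distance d)
  • Value of k: A larger value reduces the effect of noise, but makes class boundaries less distinct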

Training Examples

  • Store feature vectors in multidimensional space with class labels
  • Used in classification to assign labels to unlabeled data
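
A hedged end-to-end sketch of KNN classification follows, using the iris dataset and a 70/30 split as assumptions for the example. It also computes two of the classification metrics from the notes. The `weights="distance"` option is scikit-learn's inverse-distance weighting and is optional.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# "Training" mostly stores the labeled feature vectors for later lookup.
clf = KNeighborsClassifier(n_neighbors=5, weights="distance")
clf.fit(X_train, y_train)

# Each test point gets the label favored by its 5 nearest stored neighbors.
y_pred = clf.predict(X_test)

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)   # rows: true class, columns: predicted

print(acc)
print(cm)
```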

Multiclass SVM

  • SVM is a 2-class classifier; multiclass problems are handled by combining binary classifiers: One vs. the rest, or One vs. one
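
The two strategies differ in how many binary classifiers they train. The sketch below, which assumes scikit-learn's `OneVsRestClassifier` and `OneVsOneClassifier` wrappers around a linear `SVC` and a synthetic 4-class dataset, counts them.

```python
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=4, random_state=0)

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

n_ovr = len(ovr.estimators_)   # one classifier per class: 4
n_ovo = len(ovo.estimators_)   # one per pair of classes: 4 * 3 / 2 = 6

print(n_ovr, n_ovo)
```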

Clustering

  • Unsupervised learning: No target values, no labels, and no prior knowledge of classes
  • Clustering algorithms: Different algorithms rest on different notions of what constitutes a cluster
  • Cluster models: Centroid, Connectivity, Distribution, Density

K-Means

  • Clustering algorithm: Partitioning n observations into k clusters
  • K-means with Sklearn: fit() computes the clustering and the cluster centers, predict() returns the index of the closest cluster for each sample
  • Attributes: cluster_centers_, labels_, inertia_
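
A minimal usage sketch, assuming synthetic blob data and `n_clusters=3`. Note that scikit-learn spells the fitted attributes with a trailing underscore.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)        # fit the model and return a cluster index per sample
new_labels = km.predict(X[:5])    # assign samples to the nearest learned center

print(km.cluster_centers_)        # one row per cluster center
print(km.labels_[:10])            # cluster index of each training sample
print(km.inertia_)                # sum of squared distances to the closest center
```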

K-Median

  • Variation of K-means: Uses the median of each cluster as its center instead of the mean
  • More noise-tolerant: Minimizes the sum of absolute distances rather than squared distances, so outliers have less pull

Mean Shift

  • Mode-seeking algorithm: Assigns datapoints to clusters based on density
  • Non-parametric method: No prior knowledge of number of clusters or shape of clusters
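
As a sketch of how this looks in scikit-learn, the example below runs `MeanShift` on two synthetic blobs. The number of clusters is never passed in; only a bandwidth is, estimated here with `estimate_bandwidth` (the `quantile=0.2` setting is an arbitrary choice for the example).

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [6, 6]], cluster_std=0.6, random_state=0
)

# The bandwidth sets the size of the region used to estimate local density.
bandwidth = estimate_bandwidth(X, quantile=0.2)

ms = MeanShift(bandwidth=bandwidth).fit(X)
n_clusters = len(ms.cluster_centers_)   # discovered from the data, not specified

print(bandwidth, n_clusters)
```

The number of modes found depends on the bandwidth, so other values may merge or split clusters.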

Clustering Metrics

  • Homogeneity: Each cluster contains only one class
  • Completeness: All members of a class are assigned to the same cluster
  • Mutual Information: Measures agreement between two assignments, ignoring permutations
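
These definitions can be checked on a four-point example. The sketch uses `adjusted_mutual_info_score` as one mutual-information variant available in scikit-learn; the label lists are made up.

```python
from sklearn.metrics import (
    adjusted_mutual_info_score,
    completeness_score,
    homogeneity_score,
)

labels_true = [0, 0, 1, 1]
swapped = [1, 1, 0, 0]   # same grouping, cluster names permuted
split = [0, 1, 2, 3]     # every point in its own cluster
merged = [0, 0, 0, 0]    # everything in one cluster

# Over-splitting keeps clusters pure but breaks classes apart.
h_split = homogeneity_score(labels_true, split)
c_split = completeness_score(labels_true, split)

# Merging keeps classes together but mixes them in one cluster.
h_merged = homogeneity_score(labels_true, merged)
c_merged = completeness_score(labels_true, merged)

# Renaming clusters does not change the agreement score.
ami_swapped = adjusted_mutual_info_score(labels_true, swapped)

print(h_split, c_split, h_merged, c_merged, ami_swapped)
```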

Resampling

  • Resampling to lower frequency: Involves aggregation operation
  • Resampling to higher frequency: Involves interpolation or data filling methods
  • pandas resample() method: Splits DatetimeIndex into time bins and groups data by time bin
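
A short sketch of both directions, using made-up daily and two-day series. Downsampling needs an aggregation such as `sum()`, while upsampling needs a fill method such as `ffill()` or `interpolate()`.

```python
import pandas as pd

# Downsampling: daily values aggregated into 2-day bins.
daily = pd.Series(
    [1, 2, 3, 4, 5, 6],
    index=pd.date_range("2024-01-01", periods=6, freq="D"),
)
two_day_sum = daily.resample("2D").sum()

# Upsampling: one value every 2 days expanded to daily frequency.
sparse = pd.Series(
    [0.0, 2.0, 4.0],
    index=pd.date_range("2024-01-01", periods=3, freq="2D"),
)
filled = sparse.resample("D").ffill()          # carry the last value forward
interp = sparse.resample("D").interpolate()    # linear interpolation between points

print(two_day_sum.tolist())   # [3, 7, 11]
print(filled.tolist())        # [0.0, 0.0, 2.0, 2.0, 4.0]
print(interp.tolist())        # [0.0, 1.0, 2.0, 3.0, 4.0]
```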

Rolling Windows

  • Splitting data into time windows: Aggregates data with a function (e.g., mean, median, sum)
  • Overlapping windows: 'Roll' along at the same frequency as the original time series
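
Unlike resampling, a rolling window keeps the original frequency: the output has one value per input row. A minimal sketch with an invented six-day series and a 3-row window:

```python
import pandas as pd

s = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    index=pd.date_range("2024-01-01", periods=6, freq="D"),
)

# Each output value is the mean of the current row and the two before it.
# The first two rows have no full window, so they are NaN.
roll_mean = s.rolling(window=3).mean()

print(roll_mean)
```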

SVM - Classification

  • Supervised learning model: At its core a binary linear classifier; variants cover classification, regression, and outlier detection
  • Maximal margin classifier: Finds optimal separating hyperplane, maximizing the margin

Kernel Trick

  • Non-linear classification: Implicitly maps inputs into high-dimensional feature spaces
  • Large margin classification: Fits the widest possible margin between two classes
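
The effect of the kernel trick is easiest to see on data that no straight line can separate. The sketch below assumes concentric circles from `make_circles` and compares a linear kernel with an RBF kernel on training accuracy only, which is enough to show the contrast but is not a proper evaluation.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2D space.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)

print(linear_acc, rbf_acc)
```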

Kernel Function

  • Determines weight of nearby points: For re-estimation of the mean in mean shift
  • Common kernel profiles: flat kernel and Gaussian kernel
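
To show how the kernel weights nearby points, here is a hedged NumPy sketch of a single mean shift update. The function names are hypothetical, the Gaussian profile is scaled by the bandwidth `h` as an assumption, and its normalizing constant is dropped because it cancels in the weighted mean.

```python
import numpy as np

def flat_kernel(dist, h):
    # Weight 1 inside the bandwidth, 0 outside.
    return (dist <= h).astype(float)

def gaussian_kernel(dist, h):
    # Smoothly decaying weight; the constant factor is omitted.
    return np.exp(-0.5 * (dist / h) ** 2)

def shift_once(point, data, kernel, h):
    """Re-estimate the mean as the kernel-weighted average of the data."""
    dist = np.linalg.norm(data - point, axis=1)
    w = kernel(dist, h)
    return (w[:, None] * data).sum(axis=0) / w.sum()

data = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
start = np.array([0.5, 0.5])

flat_next = shift_once(start, data, flat_kernel, h=2.0)
gauss_next = shift_once(start, data, gaussian_kernel, h=2.0)

print(flat_next, gauss_next)
```

With the flat kernel the far point at (10, 10) is ignored entirely; with the Gaussian kernel it still contributes, but with a negligible weight.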

Hard Margin

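  • Linearly separable data: Selects two parallel hyperplanes that separate the two classes with the largest possible distance between them; the region between them is the margin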

Soft Margin

  • Non-separable data: Keeps the margin wide while penalizing points that fall inside it or on the wrong side, using a hinge loss function
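
The hinge loss is commonly written as max(0, 1 - y * f(x)) for labels y in {-1, +1} and classifier score f(x). The sketch below evaluates it on four invented scores.

```python
import numpy as np

def hinge_loss(y, scores):
    """Per-sample hinge loss for labels in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * scores)

y = np.array([1, 1, -1, -1])
scores = np.array([2.0, 0.5, -0.3, 1.0])

# Sample 1: correct with enough margin, loss 0.
# Sample 2: correct but inside the margin, loss 0.5.
# Sample 3: correct but inside the margin, loss 0.7.
# Sample 4: wrong side of the boundary, loss 2.0.
losses = hinge_loss(y, scores)

print(losses)
```

The loss is zero only when a point is classified correctly with a score of at least 1 in magnitude, which is what "enough margin" means here.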

Learn about resampling methods in pandas, including aggregation and interpolation techniques, and how to use the DataFrame.resample() method to change frequency.
