Machine Learning - Data Preprocessing

Questions and Answers

Which of the following best describes the relationship between data and information?

  • Information is raw and unorganized, while data is processed and structured.
  • Information is used to create data.
  • Data is processed and formatted to become information. (correct)
  • Data and information are interchangeable terms with the same meaning.

In the hierarchy of data, information, and knowledge, what characterizes the transition from data to information?

  • An increase in quantity and an increase in the degree of abstraction.
  • An increase in quantity and a decrease in the degree of abstraction.
  • A decrease in quantity and a decrease in the degree of abstraction.
  • A decrease in quantity and an increase in the degree of abstraction. (correct)

Which of the following scenarios exemplifies the importance of data quality in machine learning?

  • Ignoring noise and outliers in the data to speed up the training process.
  • Using a complex algorithm to compensate for missing values in the dataset.
  • An intelligent algorithm failing to produce reliable results due to inconsistent data. (correct)
  • Achieving high accuracy with a small dataset containing numerous errors.

What percentage of data scientists' work is typically dedicated to data preparation?

80%

What is a key consideration when assessing the validity of a dataset?

Determining whether the dataset is appropriate for the intended use.

What is 'dimensionality' in the context of a dataset?

The number of attributes or features in the dataset.

Which of the following describes a potential consequence of the 'curse of dimensionality'?

The amount of training data required grows exponentially with dimensionality.

What is the primary goal of data preprocessing?

To transform raw data into an understandable and usable format.

Which of the following is an example of data aggregation?

Combining daily sales data to create monthly sales figures.

What is the primary purpose of sampling in the context of data mining?

To select a representative subset of data for analysis.

In the context of sampling, what does it mean for a sample to be 'representative'?

It mirrors the key properties and characteristics of the original dataset.

What is a potential drawback of using a very small sample size?

Inability to accurately detect underlying patterns in the data.

What is the main reason for performing dimensionality reduction on a dataset?

To facilitate data visualization and reduce computational requirements.

What is a potential consequence of excessively reducing the dimensionality of a dataset?

Important information is lost, potentially distorting results.

Which of the following is a characteristic of high data precision coupled with high bias?

Measurements are closely clustered but centered far from the true value.

What is the difference between feature selection and feature extraction in dimensionality reduction?

Feature extraction creates entirely new features, while feature selection picks a subset of the old ones.

In the context of feature subset selection, what are redundant features?

Features that are highly correlated with each other, duplicating information.

What is the goal of 'feature creation' in data preprocessing?

To automatically generate new features that capture important information more efficiently.

What is Wavelet transformation primarily used for in feature engineering?

To decompose signals into different frequency components, which is useful for feature extraction.

What does data discretization involve?

Transforming continuous variables into discrete or nominal variables.

What is the purpose of attribute transformation?

To map the original attribute values to new transformed values.

In data preprocessing, what does 'scaling' refer to?

The process of mapping all features to the same scale or range.

What is the purpose of Min-Max Normalization?

To scale all numeric variables between 0 and 1.

How is the new value $x'$ calculated using Min-Max Normalization, where the original value is $x$, and the new range is $[a,b]$?

$x' = a + \frac{(x - \min(x))(b - a)}{\max(x) - \min(x)}$

How is Unit Length scaling typically achieved?

By dividing each value in a vector by the magnitude of the vector.

What does 'binning' accomplish in data preprocessing?

Grouping numerical or continuous values into a small number of categories.

During text preprocessing, what is Tokenization used for?

Splitting the text into a sequence of individual words.

In text preprocessing, what is the purpose of removing stop words?

<p>To eliminate common words that do not carry important meaning. (C)</p> Signup and view all the answers

What is the difference between stemming and lemmatization?

Lemmatization accounts for word meaning.

In image preprocessing, what does 'thresholding' achieve?

Creating a binary image by setting pixel values above a threshold to one value and those below to another.

Which of the following represents a morphological operation used in Image Processing?

Image erosion.

What is the effect of 'erosion' in image processing?

It shrinks bright regions and enlarges dark regions.

In audio preprocessing, what does STFT stand for?

Short-Time Fourier Transform.

In audio processing, what is the purpose of Mel-Frequency Cepstral Coefficients (MFCCs)?

To transform audio signals into a format that mimics human auditory perception.

What is Zero-Crossing Rate?

The number of times a signal crosses zero amplitude within a time frame.

Which types of sound can noise reduction methods remove?

These methods may not always be able to remove all types of sound.

What is the difference between high-pass and low-pass filtering?

High-pass filtering attenuates frequency components below a cutoff frequency, while low-pass filtering attenuates those above it.

Which is a key characteristic of noise cancellation algorithms?

Unlike passive methodologies, noise cancellation uses active noise reduction.

Flashcards

What is Data?

Descriptions of things, events, activities, and transactions that have no inherent meaning and don't directly affect users.

What is Information?

Data that has been processed, formatted, and organized into a form that is meaningful and beneficial for decision-making.

Examples of bad data.

Errors, noise, outliers, duplicate data, missing values, inconsistent data, timeliness, relevance, etc.

Data Quality Issues

Mistakes in measurement, anomalous data points.

Precision vs. Bias

Closeness of measurements; systematic deviation from the true value.

Dimensionality

Number of attributes in a dataset

Curse of Dimensionality

As dimensionality grows, data thins out, making analysis difficult.

Data Preprocessing

It involves transforming raw data into an understandable and usable format.

Aggregation

Combining multiple attributes or objects into a single attribute.

Sampling

Selecting a subset that is representative of the larger dataset.

Simple Random Sampling

Each item has an equal chance of being selected during sampling.

Sampling without Replacement

A sampling technique where selected items are removed from the population.

Sampling with Replacement

A sampling method where items are not removed, allowing multiple selections.

Dimensionality Reduction

A technique in which data dimensions are reduced, potentially losing information.

Stratified Sampling

The data is divided into several parts and sampling is done randomly from each part.

Curse of Dimensionality

Data has too many features for available data points.

Feature Selection

Selecting relevant features to reduce dimensionality.

Feature Extraction

Creating new features from existing ones to improve model performance.

Information Gain

A measure of a feature's usefulness for predicting the class; dimensionality reduction should retain features with high information gain.

What is Binarization?

Converting numerical variables into binary variables.

What is Feature Subset Selection?

Choosing the most relevant subset of features from the data.

Feature Creation

Creating new attributes that represent the information present in the dataset.

What are the methods for mapping data to a new space?

Techniques such as the Fourier and Wavelet transforms, which map data into a new space.

Discretization

Converting continuous data into discrete buckets or intervals.

What is Attribute Transformation in data preprocessing?

Functions that map attribute values to new values.

What is Similarity?

A numeric measure of how alike two objects are.

What is Dissimilarity?

A numeric measure of how different two objects are.

Measuring distance

Mathematical calculations such as Euclidean, Minkowski, and Mahalanobis distance.

Scaling: Min-Max Normalization

Scales values to the range [a, b].

What is Binning?

Replacing numeric values with categories.

Data Cleaning for Text Data

URL removal, removal of irrelevant characters, lowercase conversion, tokenization, stopword removal, stemming and lemmatization.

What is Tokenization?

Splitting text into individual tokens (words).

Stopword Removal

Removing common words that carry little meaning from the text.

What is Stemming and Lemmatization?

Reducing words to their root or dictionary form; lemmatization accounts for word meaning.

Scaling is applied to what other data type?

Image data.

Common techniques used in Audio Data

Short-Time Fourier Transform (STFT), Mel-Frequency Cepstral Coefficients (MFCCs)

Noise Reduction in Audio Data Preprocessing

Active Noise Cancellation algorithms, Filtering Methods.

What is low-pass filtering?

Filters that allow low-frequency components to pass while attenuating high-frequency components.

Study Notes

  • Machine Learning involves data preprocessing.

Data and Information

  • Data is descriptions of objects, events, activities, and transactions without inherent meaning or direct user impact.
  • Information is data that has been processed and formatted into a meaningful form, beneficial for decision-making.

Data Importance

  • The quality of data significantly affects algorithm performance.
  • Bad data leads to algorithm failure.
  • Around 80% of data scientists' work involves data preparation.
  • Errors, noise, outliers, duplicate data, and missing values are examples of data issues.

Data Quality

  • Examples of data quality issues include noise and outliers, missing values, duplicate data, wrong data, and fake data.
  • Data quality can be checked using variance and covariance calculations.
  • Large quantities of data are prone to measurement errors and outliers.
  • Precision reflects the closeness of repeated measurements.
  • Bias marks the systematic deviation from the true value.
  • Variance measures the spread of the data and is inversely related to precision.
  • Covariance measures how two random variables change together and is applied to calculate variable correlations.

Dataset

  • Datasets can include databases, IoT sensors, text, audio, speech, images, and video.
  • Important dataset considerations are the validity, size relative to the case, the minimum data required for machine learning, the distribution quality, and the number of attributes.
  • Dataset types are Record, Graph, and Ordered.

Data Points vs. Dimensionality

  • Dimensionality is the number of attributes.
  • Traditional data analysis methods assume many more data points than dimensions.
  • Many datasets feature high dimensions yet few data objects.

Curse of Dimensionality

  • As dimensionality grows, data sparsity increases.
  • A set of n binary variables has 2^n joint states.
  • Learning requires extensive training data, and predictive power decreases as dimensionality increases.
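
To make the growth concrete, here is a tiny plain-Python sketch (the feature counts are arbitrary) of how the number of joint states explodes:

```python
# Joint states of n binary variables: 2^n.
# Covering each state with even one training example
# quickly becomes infeasible as n grows.
for n in [5, 10, 20, 30]:
    print(f"{n} binary features -> {2 ** n:,} joint states")
# 5 binary features -> 32 joint states
# 10 binary features -> 1,024 joint states
# 20 binary features -> 1,048,576 joint states
# 30 binary features -> 1,073,741,824 joint states
```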

Data Preprocessing

  • It transforms raw data into an understandable format.
  • It resolves data that is incomplete, inconsistent, or lacking in certain trends.
  • It addresses data likely to include many errors, and it reduces dimensionality.
  • Data preprocessing techniques fall into two categories:
  • Those based on the data objects (records), such as aggregation and sampling.
  • Those based on the attributes, such as dimensionality reduction and feature subset selection.

Topics in Data Preprocessing

  • Aggregation
  • Sampling
  • Dimensionality Reduction
  • Feature Subset Selection
  • Feature Creation
  • Discretization and Binarization
  • Attribute Transformation
  • Similarity and Dissimilarity calculation using:
  • Euclidean distance
  • Minkowski distance
  • Mahalanobis Distance
  • Simple Matching
  • Jaccard Coefficients
  • Cosine
  • Tanimoto
  • Correlation

Aggregation

  • Aggregation combines two or more attributes or objects into one. Goals:
  • Reducing data in terms of both attributes and objects.
  • Altering scales, e.g., combining city, province and country attributes.
  • Acquiring more "stable" data due to reduced variability.
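
As a minimal sketch of aggregation (daily records rolled up to monthly figures, as in the quiz above), assuming pandas; the column names and values are illustrative:

```python
import pandas as pd

# Hypothetical daily sales records.
daily = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=90, freq="D"),
    "sales": range(90),
})

# Aggregation: combine daily objects into monthly totals,
# reducing the number of objects and changing the time scale.
# "ME" = month-end in recent pandas; older versions use "M".
monthly = daily.resample("ME", on="date")["sales"].sum()
print(monthly)
```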

Sampling

  • Sampling is a primary technique for selecting data for investigation and analysis.
  • Sampling is used in data mining even when the complete dataset is available, because processing all of it can be too expensive.
  • Principles: effective sampling's output is as good as using the entire dataset, and samples must be representative, mirroring the original data's properties.
  • Types of sampling:
  • Simple random sampling: each item has an equal selection probability.
  • Sampling without replacement: selected items are removed and cannot be re-picked.
  • Sampling with replacement: items are not removed and can be picked multiple times.
  • Stratified sampling: the data is split into parts and samples are drawn randomly from each part.
  • Sample size should be carefully determined; a larger sample increases the probability that it is representative.
  • Adaptive/progressive sampling starts with a small sample size.
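
A minimal NumPy sketch of simple random sampling with and without replacement (the population is illustrative); for stratified sampling, scikit-learn's train_test_split(..., stratify=y) draws proportionally from each part:

```python
import numpy as np

rng = np.random.default_rng(seed=42)
population = np.arange(1000)  # illustrative population

# Without replacement: each item can be selected at most once.
sample_wo = rng.choice(population, size=100, replace=False)

# With replacement: items stay in the pool and can repeat.
sample_w = rng.choice(population, size=100, replace=True)

print(len(set(sample_wo)))  # always 100 distinct items
print(len(set(sample_w)))   # usually fewer than 100 distinct items
```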

Dimensionality Reduction

  • Needed because datasets can have numerous features, hindering data analysis and visualization.
  • Objectives include avoiding the Curse of Dimensionality, cutting memory use and processing time, aiding data visualization and eliminating irrelevant data or noise.
  • Involves feature selection and extraction.
  • Decreasing dimensions increases data structure density, risking information loss; expanding dimensions causes sparsity.
  • A feature that behaves the same across all classes, such as the 'Tinggi' (Height) feature in the lecture example, carries no discriminative information and can be dropped.
  • Techniques include feature subset selection, extraction, reduction via transformation, discretization and binarization.
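
As one illustration of reduction via transformation, a sketch using scikit-learn's PCA to project the four-feature Iris dataset onto two derived features (the dataset choice is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)  # 150 objects, 4 features

pca = PCA(n_components=2)          # extract 2 new, derived features
X_2d = pca.fit_transform(X)

print(X_2d.shape)                     # (150, 2) - easy to visualize
print(pca.explained_variance_ratio_)  # variance retained per component
```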

Feature Subset Selection

  • Used for reducing dimensions while avoiding redundant features.
  • Features could be redundant, duplicating most information in other attributes, or irrelevant - lacking usable information for tasks.
  • Techniques are Brute-force, Embedded, Filter and Wrapper
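
A minimal sketch of the Filter technique with scikit-learn: each feature is scored against the target independently of any model, and only the top k are kept (the dataset and k are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Filter approach: score features with an ANOVA F-test,
# then keep the two highest-scoring ones.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.get_support())  # boolean mask of the kept features
print(X_selected.shape)        # (150, 2)
```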

Feature Creation

  • It constructs new attributes that represent the important information on the dataset.
  • Modalities:
  • Feature extraction: extracting interesting characteristics from the raw data.
  • Domain-specific construction.
  • Mapping data to a new space: Fourier and Wavelet transforms.
  • Feature construction.
  • Feature combination.

Data Transformations

  • Transformations include Fourier and Wavelet transforms.

Discretization

  • Discretization can be applied using the class label (supervised discretization).
  • An entropy-based approach is a common supervised technique.
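
A minimal sketch of the entropy-based idea on a single numeric feature with a class label: candidate cut points are tried, and the one minimizing the weighted class entropy of the two resulting intervals is kept (the toy data is illustrative):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_cut(x, y):
    """Cut point on x that minimizes weighted class entropy."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_cut_point, best_score = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # no boundary between equal values
        score = (i * entropy(y[:i]) + (len(y) - i) * entropy(y[i:])) / len(y)
        if score < best_score:
            best_cut_point, best_score = (x[i] + x[i - 1]) / 2, score
    return best_cut_point

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_cut(x, y))  # 6.5 - cleanly separates the two classes
```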

Attribute Transformation

  • This maps attribute values to new values, such that each original value can be identified with one of the new values.
  • Common transformations include:
  • Simple functions such as $x^k$, $\log(x)$, $e^x$, and $|x|$.
  • Standardization and normalization.
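
A short NumPy sketch of these transformations (the input values are illustrative):

```python
import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0])

# Simple function transformations.
print(np.log10(x))  # log(x): compresses a wide range of magnitudes
print(x ** 2)       # x^k with k = 2
print(np.exp(np.array([0.0, 1.0, 2.0])))  # e^x on small values
print(np.abs(np.array([-3.0, 3.0])))      # |x|

# Standardization: zero mean, unit variance.
z = (x - x.mean()) / x.std()
print(z)
```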

Similarity and Dissimilarity

  • Similarity is a numerical measure of how alike two objects are; dissimilarity measures the differences between two objects.
  • Both can be measured using distance.
  • Similarity values typically fall in the range [0, 1], depending on the data type.

Techniques for Measuring Distance

  • Euclidean Distance
  • Minkowski distance
  • Mahalanobis Distance
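
A minimal SciPy sketch of the three distances (the vectors and data are illustrative; Mahalanobis additionally needs the inverse covariance matrix of the data):

```python
import numpy as np
from scipy.spatial import distance

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 6.0, 3.0])

print(distance.euclidean(u, v))       # straight-line distance: 5.0
print(distance.minkowski(u, v, p=1))  # p=1 is Manhattan distance: 7.0

# Mahalanobis: accounts for feature correlations via the
# inverse covariance matrix (VI) of a dataset.
X = np.random.default_rng(0).normal(size=(100, 3))
VI = np.linalg.inv(np.cov(X, rowvar=False))
print(distance.mahalanobis(u, v, VI))
```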

Scaling Methods

  • Min-Max Normalization
  • Mean Normalization
  • Standardization (Z-score Normalization)
  • Scaling to Unit Length
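
A minimal NumPy sketch of all four scaling methods, using the same min-max formula as the quiz question above (the input vector is illustrative):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-max normalization to a target range [a, b]
# (a=0, b=1 gives the usual [0, 1] scaling).
a, b = 0.0, 1.0
x_minmax = a + (x - x.min()) * (b - a) / (x.max() - x.min())

# Mean normalization: center on the mean, divide by the range.
x_meannorm = (x - x.mean()) / (x.max() - x.min())

# Standardization (z-score): zero mean, unit variance.
x_zscore = (x - x.mean()) / x.std()

# Scaling to unit length: divide by the vector's magnitude.
x_unit = x / np.linalg.norm(x)

print(x_minmax, x_meannorm, x_zscore, x_unit, sep="\n")
```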

Scaling Effect

  • Scaling Feature_2 with min-max normalization maps it to the range [0, 1].
  • The process can change the Euclidean distance from one record to another.
  • Record closeness can change drastically.

Binning

  • Binning groups values into categories; for example, it can replace raw salary values with bands.
  • Three example categories are Low, Mid, and High.
  • Binning bolsters model robustness but sacrifices precision.
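
A minimal pandas sketch of binning salaries into the three categories (the values and bin edges are illustrative):

```python
import pandas as pd

# Hypothetical salary values.
salaries = pd.Series([2500, 4000, 7500, 12000, 30000, 55000])

# Replace exact salaries with three coarse bands.
bins = [0, 5000, 20000, float("inf")]
labels = ["Low", "Mid", "High"]
bands = pd.cut(salaries, bins=bins, labels=labels)
print(bands)  # Low, Low, Mid, Mid, High, High
```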

Data Preprocessing for Text Data

  • Steps include removing URLs, removing irrelevant characters (such as numbers and punctuation), converting text to lowercase, tokenization, removing stopwords, stemming and lemmatization, removing short words, and converting tokens back into a string.
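
A minimal sketch of this pipeline using only the standard library; the stopword set and length threshold are toy stand-ins for a real stopword list and stemmer (e.g., from NLTK):

```python
import re

STOPWORDS = {"the", "a", "an", "is", "to", "and", "of"}  # toy list

def preprocess(text):
    text = re.sub(r"https?://\S+", "", text)    # remove URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)    # drop numbers/punctuation
    text = text.lower()                         # lowercase
    tokens = text.split()                       # tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopwords
    tokens = [t for t in tokens if len(t) > 2]  # remove short words
    return " ".join(tokens)                     # tokens back to a string

print(preprocess("The model is trained at https://example.com in 2024!"))
# -> "model trained"
```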

Image Data Preparation

  • Processes include cropping, resizing, image scaling, dimensionality reduction, morphological transformations, and thresholding.

Image Data Transformations Processes

  • Erosion shrinks bright regions and enlarges dark regions.
  • Dilation shrinks dark regions and enlarges bright regions.
  • Opening is erosion followed by dilation, removing small bright spots.
  • Closing is dilation followed by erosion, removing small dark spots.
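
A hedged sketch of thresholding and all four morphological operations, assuming OpenCV (cv2) is installed; the image path is hypothetical:

```python
import cv2
import numpy as np

# Hypothetical grayscale input image.
img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)

# Thresholding: pixels above 127 become 255, the rest become 0.
_, binary = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)

kernel = np.ones((3, 3), np.uint8)
eroded = cv2.erode(binary, kernel)    # shrinks bright regions
dilated = cv2.dilate(binary, kernel)  # enlarges bright regions
opened = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)   # erosion, then dilation
closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)  # dilation, then erosion
```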

Audio Data Features

  • Common audio features include Short-time Fourier Transform (STFT), Mel-Frequency Cepstral Coefficients (MFCCs), and zero-crossing rate.
  • Noise reduction and enhancement use filtering methods, spectral subtraction, and noise cancellation algorithms.

Short-Time Fourier Transform (STFT)

  • STFT is a mathematical technique used to analyze and represent the time-varying frequency content of a signal.

Mel-Frequency Cepstral Coefficients (MFCCs)

  • MFCCs are derived from the Mel-frequency scale and cepstral analysis of an audio signal's power spectrum; they are used in speech and audio pattern recognition and classification.
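
A hedged sketch of extracting all three features, assuming the librosa library; the audio file path is hypothetical:

```python
import librosa

# Hypothetical audio file (librosa resamples to 22,050 Hz by default).
y, sr = librosa.load("speech.wav")

# STFT: time-varying frequency content as a complex spectrogram.
S = librosa.stft(y)

# MFCCs: 13 coefficients modeled on human auditory perception.
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Zero-crossing rate per analysis frame.
zcr = librosa.feature.zero_crossing_rate(y)

print(S.shape, mfccs.shape, zcr.shape)
```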

Filtering methods (e.g., low-pass, high-pass)

  • Aim to emphasize or attenuate specific signal components to reduce noise.
  • Low-Pass Filtering allows low-frequency components to pass while attenuating high-frequency components.
  • High-Pass Filtering allows high-frequency components to pass while attenuating low-frequency components.
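
A minimal SciPy sketch of both filters on a toy signal containing a 50 Hz tone plus 2,000 Hz "noise" (the cutoff and sampling rate are illustrative):

```python
import numpy as np
from scipy import signal

fs = 8000  # sampling rate in Hz
t = np.arange(0, 1.0, 1 / fs)
x = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 2000 * t)

# Low-pass: keep components below the 500 Hz cutoff.
sos_lp = signal.butter(4, 500, btype="low", fs=fs, output="sos")
x_low = signal.sosfilt(sos_lp, x)   # mostly the 50 Hz tone remains

# High-pass: keep components above the 500 Hz cutoff.
sos_hp = signal.butter(4, 500, btype="high", fs=fs, output="sos")
x_high = signal.sosfilt(sos_hp, x)  # mostly the 2000 Hz tone remains
```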

Noise Cancellation Algorithms

  • Active noise cancellation (ANC) reduces or eliminates unwanted background noise.
