Questions and Answers
Which of the following best describes the relationship between data and information?
- Information is raw and unorganized, while data is processed and structured.
- Information is used to create data.
- Data is processed and formatted to become information. (correct)
- Data and information are interchangeable terms with the same meaning.
In the hierarchy of data, information, and knowledge, what characterizes the transition from data to information?
- An increase in quantity and an increase in the degree of abstraction.
- An increase in quantity and a decrease in the degree of abstraction.
- A decrease in quantity and a decrease in the degree of abstraction.
- A decrease in quantity and an increase in the degree of abstraction. (correct)
Which of the following scenarios exemplifies the importance of data quality in machine learning?
- Ignoring noise and outliers in the data to speed up the training process.
- Using a complex algorithm to compensate for missing values in the dataset.
- An intelligent algorithm failing to produce reliable results due to inconsistent data. (correct)
- Achieving high accuracy with a small dataset containing numerous errors.
What percentage of data scientists' work is typically dedicated to data preparation?
What is a key consideration when assessing the validity of a dataset?
What is 'dimensionality' in the context of a dataset?
Which of the following describes a potential consequence of 'the curse of dimensionality'?
What is the primary goal of data preprocessing?
Which of the following is an example of data aggregation?
What is the primary purpose of sampling in the context of data mining?
In the context of sampling, what does it mean for a sample to be 'representative'?
What is a potential drawback of using a very small sample size?
What is the main reason for performing dimensionality reduction on a dataset?
What is a potential consequence of excessively reducing the dimensionality of a dataset?
Which of the following is a characteristic of high data precision coupled with high bias?
What is the difference between feature selection and feature extraction in dimensionality reduction?
In the context of feature subset selection, what are redundant features?
What is the goal of 'feature creation' in data preprocessing?
What is Wavelet transformation primarily used for in feature engineering?
What does data discretization involve?
What is the purpose of attribute transformation?
In data preprocessing, what does 'scaling' refer to?
What is the purpose of Min-Max Normalization?
How is the new value $x'$ calculated using Min-Max Normalization, where the original value is $x$, and the new range is $[a,b]$?
How is Unit Length scaling typically achieved?
What does 'binning' accomplish in data preprocessing?
During text preprocessing, what is Tokenization used for?
In text preprocessing, what is the purpose of removing stop words?
What is the difference between stemming and lemmatization?
In image preprocessing, what does 'thresholding' achieve?
Which of the following represents a morphological operation used in Image Processing?
What is the effect of 'erosion' in image processing?
In audio preprocessing, what does STFT stand for?
In audio processing, what is the purpose of Mel-Frequency Cepstral Coefficients (MFCCs)?
What is the Zero-Crossing Rate?
What type of audio sound can noise reduction methods remove?
What is the difference between high-pass and low-pass filtering?
Which is a key characteristic of noise cancellation algorithms?
Flashcards
What is Data?
Descriptions of things, events, activities, and transactions that have no inherent meaning and don't directly affect users.
What is Information?
Data that has been processed, formatted, and organized into a form that is meaningful and beneficial for decision-making.
Examples of bad data
Errors, noise, outliers, duplicate data, missing values, inconsistent data, timeliness, relevance, etc.
Data Quality Issues
Precision vs. Bias
Dimensionality
Curse of Dimensionality
Data Preprocessing
Aggregation
Sampling
Simple Random Sampling
Sampling without Replacement
Sampling with Replacement
Dimensionality Reduction
Stratified Sampling
Feature Selection
Feature Extraction
Information Gain
What is Binarization?
What is Feature Subset Selection?
Feature Creation
Methods for Mapping Data to a New Space
Discretization
Attribute Transformation in Data Preprocessing
What is Similarity?
What is Dissimilarity?
Measuring Distance
Scaling: Min-Max Normalization
What is Binning?
Data Cleaning
What is Tokenization?
Stop Word Removal
What are Stemming and Lemmatization?
What is the purpose of scaling data?
Common techniques for Audio Data
Noise Reduction in Audio Data Preprocessing
What is low-pass filtering?
Study Notes
- Machine Learning involves data preprocessing.
Data and Information
- Data is descriptions of objects, events, activities, and transactions without inherent meaning or direct user impact.
- Information is data that has been processed and formatted into a meaningful form, beneficial for decision-making.
Data Importance
- The quality of data significantly affects algorithm performance.
- Bad data leads to algorithm failure.
- Around 80% of data scientists' work involves data preparation.
- Error, noise, outliers, duplicate data and missing values are examples of data issues.
Data Quality
- Examples of data quality issues include noise and outliers, missing values, duplicate data, wrong data, and fake data.
- Data quality can be checked using variance and covariance calculations.
- Large quantities of data are prone to measurement error and outliers.
- Precision reflects the closeness of repeated measurements.
- Bias marks the systematic deviation from the true value.
- Variance is the reciprocal of precision and measures the spread of data.
- Covariance measures how two random variables change together and is applied to calculate variable correlations.
Dataset
- Datasets can include databases, IoT sensors, text, audio, speech, images, and video.
- Important dataset considerations are the validity, size relative to the case, the minimum data required for machine learning, the distribution quality, and the number of attributes.
- Dataset types are Record, Graph, and Ordered.
Data Points vs. Dimensionality
- Dimensionality is the number of attributes.
- Traditional data analysis methods assume many more data points than dimensions.
- Many datasets feature high dimensions yet few data objects.
Curse of Dimensionality
- As dimensionality grows, data sparsity increases.
- A dataset of n binary variables has 2^n possible joint states.
- Learning requires extensive training and the predictive power reduces as dimensionality increases.
Data Preprocessing
- It transforms raw data into an understandable format.
- It resolves data that are incomplete, inconsistent, or lacking in certain behaviors or trends.
- It addresses data likely to include many errors, and it reduces dimensionality.
- Data preprocessing categories:
- Two types can be distinguished:
- Operations based on the data objects (records), such as aggregation or sampling
- Operations based on the attributes (creating or changing attributes), such as dimensionality reduction or feature subset selection
Topics in Data Preprocessing
- Aggregation
- Sampling
- Dimensionality Reduction
- Feature Subset Selection
- Feature Creation
- Discretization and Binarization
- Attribute Transformation
- Similarity and Dissimilarity calculation using:
- Euclidean distance
- Minkowski distance
- Mahalanobis Distance
- Simple Matching
- Jaccard Coefficients
- Cosine
- Tanimoto
- Correlation
Aggregation
- Aggregation combines two or more attributes or objects into one. Goals:
- Reducing data in terms of both attributes and objects.
- Altering scales, e.g., combining city, province and country attributes.
- Acquiring more "stable" data due to reduced variability.
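The goals above can be sketched in a few lines of pure Python. The sales records and the month-level roll-up here are hypothetical, chosen only to show many objects collapsing into fewer, more stable ones:

```python
from collections import defaultdict

# Hypothetical daily sales records: (date "YYYY-MM-DD", amount)
records = [
    ("2024-01-03", 120.0),
    ("2024-01-17", 80.0),
    ("2024-02-05", 200.0),
]

# Aggregate by month: many objects become one object per month,
# and the date attribute changes scale from day to month.
monthly = defaultdict(float)
for date, amount in records:
    month = date[:7]          # "YYYY-MM"
    monthly[month] += amount

print(dict(monthly))          # {'2024-01': 200.0, '2024-02': 200.0}
```

Summing (or averaging) within groups is what yields the reduced variability the notes mention: individual daily fluctuations cancel out in the monthly totals.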
Sampling
- Sampling is a primary technique for selecting data for investigation and analysis.
- Sampling in data mining exists although complete data is available.
- Principles: effective sampling produces output as good as using the entire dataset, and data samples must be representative, mirroring the original data's properties.
- Types include simple random sampling, where each item has an equal selection probability; sampling without replacement, where items can't be re-picked; stratified sampling, where the data are split into parts and samples are randomly selected from each part; and sampling with replacement, where items may be re-picked.
- Sample size should be carefully determined; a larger sample size increases the probability that the sample is representative.
- Adaptive/progressive sampling starts with a small sample size.
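As a minimal sketch of the sampling types above (the population and the two strata are hypothetical), using only Python's standard library:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible
population = list(range(100))

# Simple random sampling without replacement: items cannot be re-picked.
without_repl = random.sample(population, 10)

# Simple random sampling with replacement: items may be re-picked.
with_repl = [random.choice(population) for _ in range(10)]

# Stratified sampling: split the data into parts (strata),
# then sample randomly from each part.
strata = {"low": [x for x in population if x < 50],
          "high": [x for x in population if x >= 50]}
stratified = [random.choice(members) for members in strata.values()]

print(len(without_repl), len(set(without_repl)))  # 10 10 (no duplicates)
```

Stratified sampling guarantees every stratum is represented, which matters when some groups are rare in the population.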
Dimensionality Reduction
- Occurs because data sets can have numerous features, hindering data analysis and visualization.
- Objectives include avoiding the Curse of Dimensionality, cutting memory use and processing time, aiding data visualization and eliminating irrelevant data or noise.
- Involves feature selection and extraction.
- Decreasing dimensions increases data structure density, risking information loss; expanding dimensions causes sparsity.
- In dimensionality reduction, a feature that takes the same value across all classes (e.g., a "Tinggi"/height attribute that never varies) is useless and can be removed.
- Techniques include feature subset selection, extraction, reduction via transformation, discretization and binarization.
Feature Subset Selection
- Used for reducing dimensions while avoiding redundant features.
- Features could be redundant, duplicating most information in other attributes, or irrelevant - lacking usable information for tasks.
- Techniques are Brute-force, Embedded, Filter and Wrapper
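One simple Filter-style criterion can be sketched as follows: drop features whose values barely vary, since a near-constant feature carries no usable information for any task. The three-feature dataset and the threshold are hypothetical:

```python
# Filter approach to feature subset selection: rank/screen features by a
# statistic computed independently of any learning algorithm.
def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# Hypothetical dataset: rows of (f0, f1, f2); f1 is constant (irrelevant).
rows = [(1.0, 5.0, 0.2), (2.0, 5.0, 0.1), (3.0, 5.0, 0.9)]
columns = list(zip(*rows))

threshold = 1e-9
kept = [i for i, col in enumerate(columns) if variance(col) > threshold]
print(kept)  # [0, 2] -- the constant feature f1 is dropped
```

Real Filter methods use richer statistics (e.g., information gain against the class label); Wrapper and Embedded methods instead involve the learning algorithm itself.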
Feature Creation
- It constructs new attributes that represent the important information on the dataset.
- Modalities:
- Feature extraction: extraction of interesting characteristics
- Domain-specific approaches
- Mapping data to a new space: Fourier transform and Wavelet transform
- Feature construction
- Combination of features
Data Transformations
- Transformations include Fourier and Wavelet transforms.
Discretization
- Discretization is a technique that can be applied using the class label (supervised discretization).
- An entropy-based approach is commonly used.
Attribute Transformation
- This maps attribute values to a new set of values such that each original value can be identified with one of the new values.
- Common transformations include:
- simple functions like $x^k$, $\log(x)$, $e^x$, $|x|$
- standardization and normalization
Similarity and Dissimilarity
- Similarity: a numerical measure of how alike two objects are; dissimilarity measures the differences between two objects.
- Both can be expressed using distance measures.
- Similarity values often fall in the range [0, 1], depending on the measure used.
Techniques for Measuring Distance
- Euclidean Distance
- Minkowski distance
- Mahalanobis Distance
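The first two distances are easy to sketch in plain Python (Mahalanobis is omitted here, since it additionally requires the data's covariance matrix); cosine similarity is included as a similarity counterpart. The two example points are hypothetical:

```python
import math

def minkowski(p, q, r):
    """Minkowski distance; r=2 gives Euclidean, r=1 gives Manhattan."""
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

def cosine_similarity(p, q):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (math.hypot(*p) * math.hypot(*q))

p, q = (0.0, 3.0), (4.0, 0.0)
print(minkowski(p, q, 2))        # 5.0  (Euclidean)
print(minkowski(p, q, 1))        # 7.0  (Manhattan)
print(cosine_similarity(p, q))   # 0.0  (orthogonal vectors)
```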
Scaling Methods
- Min-Max Normalization
- Mean Normalization
- Standardization (Z-score Normalization)
- Scaling to Unit Length
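Three of the listed methods can be sketched directly; the min-max function implements the formula $x' = a + \frac{(x - \min)(b - a)}{\max - \min}$ asked about in the questions above. The input values are hypothetical:

```python
import math

def min_max(xs, a=0.0, b=1.0):
    """Min-Max Normalization: map values into the new range [a, b]."""
    lo, hi = min(xs), max(xs)
    return [a + (x - lo) * (b - a) / (hi - lo) for x in xs]

def z_score(xs):
    """Standardization: zero mean, unit standard deviation."""
    m = sum(xs) / len(xs)
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))
    return [(x - m) / s for x in xs]

def unit_length(xs):
    """Scale the vector so its Euclidean norm is 1."""
    n = math.sqrt(sum(x * x for x in xs))
    return [x / n for x in xs]

print(min_max([10.0, 20.0, 30.0]))        # [0.0, 0.5, 1.0]
print(unit_length([3.0, 4.0]))            # [0.6, 0.8]
```

Mean normalization works like min-max but subtracts the mean instead of the minimum, centering the result around zero.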
Scaling Effect
- Scaling Feature_2 with min-max normalization maps it into the range [0, 1].
- The process can change the Euclidean distance from one record to another.
- Record closeness can change drastically as a result.
Binning
- Binning groups continuous values into categories; for example, it can replace raw salary values with category labels.
- Three example categories are Low, Mid, and High.
- Binning bolsters model robustness but sacrifices precision.
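The salary example above can be sketched as follows; the cut points (3000 and 7000) are hypothetical, chosen only to produce the three categories from the notes:

```python
# Binning: replace a continuous value with a coarse category label.
def bin_salary(salary, cuts=(3000, 7000)):
    if salary < cuts[0]:
        return "Low"
    if salary < cuts[1]:
        return "Mid"
    return "High"

salaries = [2500, 4800, 9100]
print([bin_salary(s) for s in salaries])  # ['Low', 'Mid', 'High']
```

The precision loss is visible here: 4800 and 6900 would both map to "Mid", so the model can no longer distinguish them.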
Data Preprocessing for Text Data
- Steps include removing URLs, removing irrelevant characters (like numbers and punctuation), converting characters to lowercase, tokenization, removing stopwords, stemming and lemmatization, removing short words, and converting tokens back to text.
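Most of these steps can be sketched with the standard library alone. The tiny stop-word list here is hypothetical; real pipelines use larger lists and proper stemmers/lemmatizers (e.g., from NLTK or spaCy):

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration.
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "to", "of"}

def preprocess(text, min_len=3):
    text = re.sub(r"https?://\S+", "", text)         # remove URLs
    text = re.sub(r"[^a-zA-Z\s]", "", text).lower()  # strip non-letters, lowercase
    tokens = text.split()                            # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [t for t in tokens if len(t) >= min_len]  # drop short words

print(preprocess("The 2 cats are running to http://example.com quickly!"))
# ['cats', 'running', 'quickly']
```

Stemming would further reduce "running" to "run"; that step is omitted since it needs a rule set or library beyond this sketch.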
Image Data Preparation
- Processes include cropping, resizing, image scaling, dimensionality reduction, morphological transformations, and thresholding.
Image Data Transformations Processes
- Erosion shrinks bright regions and enlarges dark regions.
- Dilation shrinks dark regions and enlarges bright regions.
- Opening is erosion followed by dilation, removing small bright spots.
- Closing is dilation followed by erosion, removing small dark spots.
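For binary images these operations reduce to simple neighborhood rules, sketched here on a list-of-lists image with a 3x3 structuring element, treating pixels outside the image as dark (an assumption of this sketch; libraries like OpenCV offer other border modes):

```python
def neighbors(img, r, c):
    """Yield the 3x3 neighborhood of (r, c); out-of-bounds counts as dark (0)."""
    h, w = len(img), len(img[0])
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            yield img[rr][cc] if 0 <= rr < h and 0 <= cc < w else 0

def erode(img):
    """Keep a bright (1) pixel only if its whole neighborhood is bright."""
    return [[1 if all(neighbors(img, r, c)) else 0
             for c in range(len(img[0]))] for r in range(len(img))]

def dilate(img):
    """Make a pixel bright if any pixel in its neighborhood is bright."""
    return [[1 if any(neighbors(img, r, c)) else 0
             for c in range(len(img[0]))] for r in range(len(img))]

img = [[0, 0, 0, 0],
       [0, 1, 1, 0],
       [0, 1, 1, 0],
       [0, 0, 0, 0]]
print(erode(img))   # the small 2x2 bright block vanishes entirely
# Opening = dilate(erode(img)); Closing = erode(dilate(img))
```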
Audio Data Features
- Common audio features include Short-time Fourier Transform (STFT), Mel-Frequency Cepstral Coefficients (MFCCs), and zero-crossing rate.
- Noise reduction and enhancement use filtering methods, spectral subtraction, and noise cancellation algorithms.
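Of these features, the zero-crossing rate is simple enough to sketch directly (STFT and MFCCs are typically computed with libraries such as librosa). The two example signals are hypothetical:

```python
def zero_crossing_rate(signal):
    """Fraction of consecutive sample pairs whose signs differ.

    Zero samples are treated as non-negative, an assumption of this sketch.
    """
    crossings = sum(1 for a, b in zip(signal, signal[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(signal) - 1)

# A signal alternating in sign crosses zero at every step.
print(zero_crossing_rate([1.0, -1.0, 1.0, -1.0, 1.0]))  # 1.0
# A non-negative signal never crosses zero.
print(zero_crossing_rate([0.2, 0.5, 0.1, 0.4]))         # 0.0
```

A high zero-crossing rate suggests noisy or high-frequency content (e.g., unvoiced speech), a low rate suggests tonal, low-frequency content.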
Short-Time Fourier Transform (STFT)
- STFT is a mathematical technique used to analyze and represent the time-varying frequency content of a signal.
Mel-Frequency Cepstral Coefficients (MFCCs)
- MFCCs are derived from the Mel-frequency scale and cepstral analysis of the power spectrum of an audio signal; they are used in speech and audio pattern recognition and classification.
Filtering methods (e.g., low-pass, high-pass)
- These filters aim to emphasize or attenuate specific signal components to reduce noise.
- Low-Pass Filtering allows low-frequency components to pass while attenuating high-frequency components.
- High-Pass Filtering allows high-frequency components to pass while attenuating low-frequency components.
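The simplest low-pass filter is a moving average, and subtracting its output from the signal acts as a crude high-pass; real audio pipelines use properly designed filters (e.g., Butterworth via scipy.signal), so this is only an illustrative sketch on a hypothetical signal:

```python
def low_pass(signal, window=3):
    """Moving average: smooths out fast (high-frequency) fluctuations."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out

def high_pass(signal, window=3):
    """Complement of the low-pass: keeps only the fast fluctuations."""
    return [x - y for x, y in zip(signal, low_pass(signal, window))]

sig = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]   # rapidly alternating signal
print(low_pass(sig))    # interior values pulled toward the mean (smoothed)
print(high_pass(sig))   # what the low-pass removed
```

The alternating input is almost pure high-frequency content, so the low-pass flattens it while the high-pass retains most of it, which is exactly the complementary behavior described above.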
Noise Cancellation Algorithms
- Active noise cancellation (ANC) reduces or eliminates unwanted background noise.