Questions and Answers
What form does the input take in data mining methods?
- Numbers and symbols
- Text and audio data
- Images and videos
- Concepts, instances, and attributes (correct)
Instances in data mining are independent examples of the concept to be learned.
True (A)
What is the purpose of classification learning?
To learn a way of classifying unseen examples from classified examples.
In data mining, each instance is characterized by the values of its __________.
Which of the following describes association learning?
Match the learning style with its description:
Background knowledge should always be excluded from input representations.
What is the characteristic that most data mining schemes deal with?
What is a common way to handle missing values in a dataset?
Normalization can involve subtracting the mean and dividing by the standard deviation.
What is the term used for data that has unspecified values treated as zero?
In order to normalize a variable, you might divide by the ______ value.
Match the following data characteristics with their definitions:
What is the main goal of numeric prediction?
Classification learning is typically used for multilabelled instances.
What are the two sets used in supervised learning?
In association learning, rules are often limited to those that apply to a certain minimum number of examples called the ______.
Match the type of learning with its description:
What is the primary purpose of clustering?
Association rules usually involve numeric attributes.
What two metrics are often considered when evaluating association rules?
The challenge in association learning is to avoid being swamped by too many ______.
What is one way to measure the success of clustering?
What is the primary focus of numeric prediction in machine learning?
The presence of one attribute never depends on the value of another attribute.
What is the first step in preparing input for a data mining investigation?
In some cases, a single example comprises a set of _______.
Match the following types of attributes with their descriptions:
Which of the following is a common challenge when integrating data from different sources?
Data cleaning is a minor part of the effort in preparing data for a machine learning investigation.
What is an example of a situation where different instances might have different attributes?
A machine learning input can generally be represented as a matrix of instances versus _______.
In which field might you need to gather data from outside the organization?
Flashcards
Input Representation: Concepts
Concepts are the things to be learned. They are described in a way that is understandable, discussable, and applicable to real-world examples.
Input Representation: Instances
Instances are individual examples of the concept to be learned. Data is broken down into independent instances.
Input Representation: Attributes
Attributes measure different aspects of an instance, often numeric or categorical.
Classification Learning
Association Learning
Clustering Learning
Input Instances: Individual Examples
Input Data Preparation
Numeric Prediction
Concept
Concept Description
Multi-labeled Instances
Supervised Learning
Training Set
Test Set
Generalization
Cross-validation
Input Instances
Multi-Instance Example
Attributes
Nominal Attributes
Numeric Attributes
Ordinal Attributes
Interval Attributes
Ratio Attributes
Data Preparation
Overlay Data
Sparse Data
Normalization
Missing Values
Inaccurate Values
Study Notes
Input Representations
- The input to data mining (DM) methods takes the form of concepts, instances, and attributes.
- The concept is the thing to be learned; the concept description is what the learning scheme produces.
- Defining a concept precisely is challenging, so the aim is a description that is understandable, open to debate, and usable as a guide on actual examples.
Instances
- Input data is a collection of instances.
- Instances are individual, independent examples of the concept to be learned.
- Sometimes, raw data can't be neatly separated into individual instances.
- Background/contextual knowledge might be part of the input.
- Raw data might be a unified mass and not easily fragmented.
- A single, continuous sequence of data (e.g., time series) might not be easily subdivided.
- Instances are characterized by attribute values that measure different aspects.
- Many schemes use numeric and nominal attributes (categories).
- Input data can be a text file.
Styles of Learning
- Classification: Learning a way to classify new, unseen examples from a set of already classified examples. Most examples in Chapter 1 are classification problems (e.g., weather, contact lens). Multilabelled instances, where a single example belongs to more than one class, need special treatment.
- Association: Identifying and examining structures within data for any "interesting" relations among features, instead of focusing on predicting a specific class.
- Clustering: Identifying naturally grouped items.
- Numeric prediction: Predicting a numerical outcome rather than a category.
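Association rules are usually judged by how many examples they cover and how often they are right. The sketch below assumes the common definitions (coverage, also called support, is the number of instances a rule predicts correctly; accuracy, also called confidence, is that number as a fraction of the instances the rule applies to). The toy data and the names `applies` and `rule_stats` are hypothetical and not taken from the notes.

```python
# Hypothetical toy data: each instance maps attribute names to nominal values.
instances = [
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "sunny", "windy": "true", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "true", "play": "no"},
]


def applies(instance, conditions):
    """True if the instance has every attribute value listed in conditions."""
    return all(instance.get(attr) == val for attr, val in conditions.items())


def rule_stats(instances, antecedent, consequent):
    """Return (coverage, accuracy) for the rule 'antecedent => consequent'."""
    covered = [i for i in instances if applies(i, antecedent)]
    correct = [i for i in covered if applies(i, consequent)]
    coverage = len(correct)  # instances predicted correctly
    accuracy = coverage / len(covered) if covered else 0.0
    return coverage, accuracy


# Rule: windy = false  =>  play = yes
coverage, accuracy = rule_stats(instances, {"windy": "false"}, {"play": "yes"})
print(coverage, round(accuracy, 2))  # 2 0.67
```

Keeping only rules whose coverage and accuracy exceed chosen minimums is one way to avoid being swamped by too many rules.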
ARFF Format
- ARFF is a file format used to store data examples which can include numeric, nominal, string, or date type attributes.
- The ARFF file is structured with distinct sections for specifying the relation name, attributes, and the data instances.
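To make the three sections concrete, here is a small hypothetical ARFF file in the style of the weather data, held in a Python string, together with a deliberately minimal reader. The reader is only a sketch: it assumes unquoted, comma-separated values, handles numeric and nominal attributes only, and treats `?` as a missing value. The function name `parse_arff` is an illustration, not part of any library.

```python
ARFF_TEXT = """\
% Hypothetical example file (lines starting with % are comments)
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}

@data
sunny,85,no
overcast,83,yes
rainy,?,yes
"""


def parse_arff(text):
    """Return (relation_name, [(attribute, type_spec)], [row_dict])."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, spec), value in zip(attributes, values):
                if value == "?":
                    row[name] = None  # missing value
                elif spec.lower() == "numeric":
                    row[name] = float(value)
                else:
                    row[name] = value  # nominal value kept as text
            rows.append(row)
            continue
        keyword, _, rest = line.partition(" ")
        keyword = keyword.lower()
        if keyword == "@relation":
            relation = rest.strip()
        elif keyword == "@attribute":
            name, _, spec = rest.strip().partition(" ")
            attributes.append((name, spec.strip()))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(ARFF_TEXT)
print(relation, len(attributes), len(rows))  # weather 3 3
```

String and date attributes, quoted values, and the sparse data format are left out of this sketch on purpose.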
Sparse Data
- Examples of sparse data include market basket data and document word count data.
- In sparse data, unspecified values are represented as zeroes, not as missing values.
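As a rough illustration of the point about zeroes, the sketch below stores each market basket as a mapping from item index to count and expands it to a full row on demand. The item names, the baskets, and the helper `to_dense` are all hypothetical.

```python
# Hypothetical item vocabulary; position in the list is the attribute index.
vocabulary = ["bread", "milk", "eggs", "beer", "nappies"]

# Sparse form: only non-zero counts are stored, keyed by attribute index.
sparse_baskets = [
    {0: 1, 1: 2},  # 1 bread, 2 milk
    {3: 6},        # 6 beer
    {},            # empty basket
]


def to_dense(sparse_row, width):
    """Expand a sparse row; anything not stored is 0, not 'missing'."""
    return [sparse_row.get(index, 0) for index in range(width)]


dense = [to_dense(row, len(vocabulary)) for row in sparse_baskets]
print(dense[0])  # [1, 2, 0, 0, 0]
```

The sparse form pays off when most attributes are zero for most instances, as with word counts over a large vocabulary.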
Normalization
- Data normalization can be carried out by the DM algorithm itself.
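- Normalization rescales values to a fixed range (e.g., 0 to 1), either by dividing by the maximum value or by subtracting the minimum value and dividing by the range (maximum minus minimum).
- Another method, standardization, subtracts the mean and divides by the standard deviation, giving values with zero mean and unit standard deviation.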
- "Distance" between attribute values is often crucial for many DM algorithms.
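The normalization methods above can be sketched with the standard library alone. The function names are hypothetical, and the sketch assumes the attribute is not constant (otherwise the range or the standard deviation is zero) and, for `scale_by_max`, that the maximum is positive. It uses the population standard deviation (`pstdev`); using the sample version instead is an equally reasonable choice.

```python
from statistics import mean, pstdev


def scale_by_max(values):
    """Divide every value by the largest value encountered."""
    largest = max(values)
    return [v / largest for v in values]


def min_max(values):
    """Subtract the minimum and divide by the range, giving values in [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


temperatures = [2.0, 4.0, 6.0, 8.0]
print(scale_by_max(temperatures))  # [0.25, 0.5, 0.75, 1.0]
print(min_max(temperatures)[0], min_max(temperatures)[-1])  # 0.0 1.0
```

After any of these steps, attributes measured on very different scales contribute more evenly to a distance calculation.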
Measuring Distance for Nominal Attributes
- To measure "distance" between two values of a nominal attribute, a distance of 0 is assigned if the values are the same, and 1 if they are different.
- Alternatively, a nominal attribute can be replaced by several synthetic binary attributes.
- A genuine mapping between nominal values and numeric scales might exist (e.g., zip codes to latitude and longitude).
- Nominal values can sometimes be expressed as integers (e.g., employee codes).
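A minimal sketch of both ideas follows: the same-or-different distance (0 for equal values, 1 otherwise) and the conversion of one nominal attribute into synthetic binary attributes. Creating one binary attribute per nominal value is an assumption made here for illustration; the function names are hypothetical.

```python
def nominal_distance(a, b):
    """Distance between two nominal values: 0 if identical, 1 if different."""
    return 0 if a == b else 1


def to_binary_attributes(value, categories):
    """Replace one nominal value by one 0/1 attribute per category."""
    return {f"is_{category}": int(value == category) for category in categories}


outlooks = ["sunny", "overcast", "rainy"]
print(nominal_distance("sunny", "rainy"))       # 1
print(to_binary_attributes("rainy", outlooks))  # {'is_sunny': 0, 'is_overcast': 0, 'is_rainy': 1}
```

Integer codes such as employee numbers should go through the same treatment, since arithmetic differences between them carry no meaning.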
Ordinal vs. Nominal Data
- Some attributes can be represented as either ordinal or nominal (e.g., age group).
- The choice of representation depends on context (e.g., in the contact lens dataset).
Missing Values
- Missing values are often indicated by out-of-range entries in numeric attributes, and by blank entries or dashes in nominal attributes.
- Different kinds of missing value (unknown, unrecorded, irrelevant) might be coded using different out-of-range numbers, e.g., -1, -2, -3.
- The reasoning behind missing values can be important and might need to be considered.
- Missing information itself could be relevant to the analysis (e.g. an earlier test determined that the value wasn't applicable).
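One way to keep the reason for a missing value available to later analysis is to decode the sentinel codes into a value plus a reason. In the sketch below, the assignment of -1, -2, and -3 to particular reasons is purely hypothetical, as is the helper name `decode`; it also assumes the attribute is normally non-negative, so negative codes cannot be real measurements.

```python
# Hypothetical coding scheme for a numeric attribute that is normally >= 0.
MISSING_CODES = {-1: "unknown", -2: "unrecorded", -3: "irrelevant"}


def decode(raw):
    """Return (value, reason): value is None when raw is a missing-value code."""
    if raw in MISSING_CODES:
        return None, MISSING_CODES[raw]
    return raw, None


ages = [37, -1, 52, -3]
decoded = [decode(a) for a in ages]
print(decoded[1])  # (None, 'unknown')

# The reason can itself be used as an attribute, e.g. a flag per instance.
was_irrelevant = [reason == "irrelevant" for _, reason in decoded]
```

Keeping the reason separate from the value stops a learning scheme from treating -1 as a genuine measurement while still letting it use the fact that the value was absent.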
Inaccurate Values
- It is important to check the data mining file for invalid attributes and attribute values.
- Data for analysis is often not purposely gathered and might contain defects like unrecorded or blank values.
- Errors and omissions that did not matter for the data's original purpose can become important once the data is used for mining.
- Typing and other data-entry errors are common and need attention.
- Duplicate information might exist and need to be removed or recognized.
- The quality and freshness of the data might also be a factor to consider.
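Two of the checks above, spotting values outside the allowed set and recognizing duplicates, can be sketched as follows. The allowed values, the sample rows (including the deliberate typo `suny`), and both function names are hypothetical.

```python
# Hypothetical set of legal values for each nominal attribute.
ALLOWED = {
    "outlook": {"sunny", "overcast", "rainy"},
    "play": {"yes", "no"},
}

rows = [
    {"outlook": "sunny", "play": "no"},
    {"outlook": "suny", "play": "no"},   # typing error
    {"outlook": "rainy", "play": "yes"},
    {"outlook": "sunny", "play": "no"},  # duplicate of the first row
]


def find_invalid_values(rows, allowed):
    """Return (row_index, attribute, value) for every value not in the allowed set."""
    problems = []
    for index, row in enumerate(rows):
        for attribute, legal in allowed.items():
            if row.get(attribute) not in legal:
                problems.append((index, attribute, row.get(attribute)))
    return problems


def drop_duplicates(rows):
    """Keep the first occurrence of each identical row, preserving order."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


print(find_invalid_values(rows, ALLOWED))  # [(1, 'outlook', 'suny')]
print(len(drop_duplicates(rows)))          # 3
```

Whether a duplicate should be removed or kept depends on the data: identical rows may be repeated entries or genuinely separate observations.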