Questions and Answers
What form does the input take in data mining methods?
Instances in data mining are independent examples of the concept to be learned.
True
What is the purpose of classification learning?
To learn a way of classifying unseen examples from classified examples.
In data mining, each instance is characterized by the values of its __________.
Which of the following describes association learning?
Match the learning style with its description:
Background knowledge should always be excluded from input representations.
What is the characteristic that most data mining schemes deal with?
What is a common way to handle missing values in a dataset?
Normalization can involve subtracting the mean and dividing by the standard deviation.
What is the term used for data that has unspecified values treated as zero?
In order to normalize a variable, you might divide by the ______ value.
Match the following data characteristics with their definitions:
What is the main goal of numeric prediction?
Classification learning is typically used for multilabelled instances.
What are the two sets used in supervised learning?
In association learning, rules are often limited to those that apply to a certain minimum number of examples called the ______.
Match the type of learning with its description:
What is the primary purpose of clustering?
Association rules usually involve numeric attributes.
What two metrics are often considered when evaluating association rules?
The challenge in association learning is to avoid being swamped by too many ______.
What is one way to measure the success of clustering?
What is the primary focus of numeric prediction in machine learning?
The presence of one attribute never depends on the value of another attribute.
What is the first step in preparing input for a data mining investigation?
In some cases, a single example comprises a set of _______.
Match the following types of attributes with their descriptions:
Which of the following is a common challenge when integrating data from different sources?
Data cleaning is a minor part of the effort in preparing data for a machine learning investigation.
What is an example of a situation where different instances might have different attributes?
A machine learning input can generally be represented as a matrix of instances versus _______.
In which field might you need to gather data from outside the organization?
Study Notes
Input Representations
- Input to data mining (DM) methods is described in terms of concepts, instances, and attributes.
- A concept description is what's being learned.
- A concept is hard to define precisely, so the aim is a description that can be understood, debated, and used as a guide.
Instances
- Input data is a collection of instances.
- Instances are individual, independent examples of the concept to be learned.
- Sometimes, raw data can't be neatly separated into individual instances.
- Background/contextual knowledge might be part of the input.
- Raw data might be a unified mass and not easily fragmented.
- A single, continuous sequence of data (e.g., time series) might not be easily subdivided.
- Instances are characterized by attribute values that measure different aspects.
- Many schemes handle two kinds of attributes: numeric and nominal (categorical).
- Input data can be a text file.
Styles of Learning
- Classification: Learning, from a set of already classified examples, a way to classify new, unseen examples. Most examples in Chapter 1 are classification problems (e.g., weather, contact lens). Multilabelled instances, which belong to more than one class, can be tricky.
- Association: Identifying and examining structures within data for any "interesting" relations among features, instead of focusing on predicting a specific class.
- Clustering: Identifying naturally grouped items.
- Numeric prediction: Predicting a numerical outcome rather than a category.
ARFF Format
- ARFF is a file format used to store data examples which can include numeric, nominal, string, or date type attributes.
- The ARFF file is structured with distinct sections for specifying the relation name, attributes, and the data instances.
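The three ARFF sections can be illustrated with a minimal Python sketch that renders a tiny dataset as ARFF text. The helper name `to_arff`, the relation name, and the sample rows are assumptions made up for illustration; the `@relation`, `@attribute`, and `@data` keywords and the `?` marker for a missing value follow the ARFF convention.

```python
def to_arff(relation, attributes, rows):
    """Render a small dataset as ARFF text (hypothetical helper).

    attributes: list of (name, spec) pairs, where spec is either a type
    keyword such as "numeric" or a list of allowed nominal values.
    rows: list of value lists; None stands for a missing value.
    """
    # Header section: relation name, then one line per attribute.
    lines = [f"@relation {relation}", ""]
    for name, spec in attributes:
        if isinstance(spec, list):
            # Nominal attributes list their allowed values in braces.
            spec = "{" + ", ".join(spec) + "}"
        lines.append(f"@attribute {name} {spec}")
    # Data section: one comma-separated line per instance.
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines)


# Illustrative values only, loosely modelled on the weather example.
arff_text = to_arff(
    "weather",
    [("outlook", ["sunny", "overcast", "rainy"]),
     ("temperature", "numeric"),
     ("play", ["yes", "no"])],
    [["sunny", 85, "no"], ["overcast", None, "yes"]],
)
print(arff_text)
```

String and date attributes would need quoting and format handling that this sketch leaves out.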
Sparse Data
- Examples of sparse data include market basket data and document word count data.
- In sparse data, unspecified values are represented as zeroes, not as missing values.
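One way to picture this, as a rough sketch: store only the non-zero entries of each instance and read every absent entry back as 0. The function names and the sample basket are hypothetical, and a dictionary is just one possible storage choice.

```python
def to_sparse(row):
    """Keep only the non-zero entries of a dense row as {index: value}."""
    return {i: v for i, v in enumerate(row) if v != 0}


def to_dense(sparse_row, n_attributes):
    """Expand a sparse row; any index that is not stored is read back as 0."""
    return [sparse_row.get(i, 0) for i in range(n_attributes)]


# A made-up market basket: item counts for six products.
basket = [0, 2, 0, 0, 1, 0]
sparse_basket = to_sparse(basket)
print(sparse_basket)
print(to_dense(sparse_basket, len(basket)))
```

Note that an absent entry here means "zero", not "unknown"; a genuinely missing value would need its own marker.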
Normalization
- Data normalization can be carried out by the DM algorithm itself.
- One approach rescales values to a fixed range (e.g., 0 to 1), either by dividing by the maximum value or by subtracting the minimum value and dividing by the range (maximum minus minimum).
- Another method subtracts the mean and divides by the standard deviation.
- "Distance" between attribute values is often crucial for many DM algorithms.
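Both normalization methods can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prescribed implementation: the function names and sample values are made up, and the use of the population standard deviation (`statistics.pstdev`) rather than the sample version is a choice made here for simplicity.

```python
import statistics


def min_max_normalize(values):
    """Rescale values to the range 0 to 1: subtract the minimum, divide by the range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute has zero range; map every value to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


temperatures = [64, 70, 75, 85]  # illustrative values
scaled = min_max_normalize(temperatures)
z_scores = standardize(temperatures)
print(scaled)
print(z_scores)
```

After min-max scaling the smallest value becomes 0 and the largest becomes 1; after standardization the values have mean 0 and standard deviation 1, which keeps attributes with large raw ranges from dominating a distance calculation.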
Measuring Distance for Nominal Attributes
- The "distance" between two nominal values is usually taken to be 0 if the values are the same and 1 if they are different.
- Alternatively, each nominal attribute can be converted into several synthetic binary attributes, typically one per value.
- A genuine mapping between nominal values and numeric scales might exist (e.g., zip codes to latitude and longitude).
- Nominal values can sometimes be expressed as integers (e.g., employee codes).
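A small sketch of these ideas, using the common convention that identical nominal values are at distance 0 and different values at distance 1. The function names are hypothetical, and combining nominal and numeric differences in a Euclidean distance is one possible choice, assumed here for illustration (numeric attributes are presumed to be normalized already).

```python
import math


def nominal_difference(a, b):
    """0 if the two nominal values are the same, 1 if they differ."""
    return 0 if a == b else 1


def one_hot(value, categories):
    """Recode a nominal value as synthetic binary attributes, one per category."""
    return [1 if value == c else 0 for c in categories]


def mixed_distance(x, y, is_nominal):
    """Euclidean distance over instances that mix nominal and numeric attributes."""
    total = 0.0
    for a, b, nominal in zip(x, y, is_nominal):
        d = nominal_difference(a, b) if nominal else a - b
        total += d * d
    return math.sqrt(total)


outlooks = ["sunny", "overcast", "rainy"]
print(one_hot("rainy", outlooks))
print(mixed_distance(("sunny", 0.2), ("rainy", 0.5), (True, False)))
```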
Ordinal vs. Nominal Data
- Sometimes data can be represented either as ordinal or nominal (e.g., age group).
- The choice of representation depends on context (e.g., the age attribute in the contact lens dataset).
Missing Values
- Missing values are often indicated by out-of-range entries in numeric attributes, or by blank entries or dashes in nominal attributes.
- Different kinds of missing value (unknown, unrecorded, irrelevant) might be coded with different numbers, e.g., -1, -2, -3.
- The reasoning behind missing values can be important and might need to be considered.
- Missing information itself could be relevant to the analysis (e.g. an earlier test determined that the value wasn't applicable).
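As a sketch of how such codes might be handled, the snippet below separates the actual value from the reason it is missing, so the reason stays available to the analysis. The specific code-to-reason mapping and the sample values are assumptions for illustration; a real dataset would document its own codes.

```python
# Hypothetical mapping from out-of-range codes to the reason a value is missing.
MISSING_CODES = {-1: "unknown", -2: "unrecorded", -3: "irrelevant"}


def decode(value):
    """Split a raw entry into (value, missing_reason)."""
    if value in MISSING_CODES:
        return None, MISSING_CODES[value]
    return value, None


raw_ages = [34, -1, 51, -3]  # illustrative values
decoded = [decode(v) for v in raw_ages]
print(decoded)
```

Keeping the reason as a separate field means an "irrelevant" entry can still inform the analysis instead of being lumped together with values that were simply never recorded.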
Inaccurate Values
- It is important to determine if the DM file has invalid data or values.
- Data for analysis is often not purposely gathered and might contain defects like unrecorded or blank values.
- Data errors and omissions are often important details.
- Errors in data entry and typing errors are common and need attention.
- Duplicate information might exist and need to be removed or recognized.
- The quality and freshness of the data might also be a factor to consider.
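Two of these checks, spotting exact duplicates and flagging values outside the allowed set for a nominal attribute, can be sketched as follows. The function names and sample rows are made up for illustration, and real cleaning usually needs domain knowledge beyond such mechanical checks.

```python
def find_duplicates(rows):
    """Return the indices of rows that exactly repeat an earlier row."""
    seen = set()
    duplicates = []
    for i, row in enumerate(rows):
        key = tuple(row)
        if key in seen:
            duplicates.append(i)
        else:
            seen.add(key)
    return duplicates


def find_invalid(rows, column, allowed):
    """Return the indices of rows whose value in `column` is not an allowed value."""
    return [i for i, row in enumerate(rows) if row[column] not in allowed]


# Illustrative rows; the last one contains a typing error ("suny").
rows = [["sunny", "yes"], ["rainy", "no"], ["sunny", "yes"], ["suny", "no"]]
duplicate_rows = find_duplicates(rows)
invalid_rows = find_invalid(rows, 0, {"sunny", "overcast", "rainy"})
print(duplicate_rows)
print(invalid_rows)
```

Whether a flagged duplicate should be removed or kept is a judgment call, since two identical records can also be two genuine observations.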
Description
Explore the fundamentals of data mining methods, focusing on the different types of input representations, including concepts, instances, and attributes. This quiz will guide you through defining concepts and understanding how instances make up the training data used in various machine learning styles, particularly classification. Test your knowledge and enhance your understanding of how data is structured and utilized in data mining.