Data Mining Concepts and Instances

Questions and Answers

What form does the input take in data mining methods?

  • Numbers and symbols
  • Text and audio data
  • Images and videos
  • Concepts, instances, and attributes (correct)

Instances in data mining are independent examples of the concept to be learned.

True (A)

What is the purpose of classification learning?

To learn a way of classifying unseen examples from classified examples.

In data mining, each instance is characterized by the values of its __________.

attributes

Which of the following describes association learning?

Identifying patterns among features without class prediction (A)

Match the learning style with its description:

  • Classification Learning = Learning to classify unseen examples
  • Association Learning = Finding relationships among features
  • Clustering = Grouping examples that belong together
  • Instance Learning = Learning from individual examples

Background knowledge should always be excluded from input representations.

False (B)

What is the characteristic that most data mining schemes deal with?

Numeric and nominal attributes

What is a common way to handle missing values in a dataset?

Indicating them with out-of-range entries (D)

Normalization can involve subtracting the mean and dividing by the standard deviation.

True (A)

What is the term used for data that has unspecified values treated as zero?

Sparse Data

In order to normalize a variable, you might divide by the ______ value.

maximum

Match the following data characteristics with their definitions:

  • Nominal = Categorical data with no intrinsic ordering
  • Ordinal = Categorical data with a defined order
  • Sparse = Data with many unspecified zero values
  • Missing Values = Entries in the dataset indicating unknown information

What is the main goal of numeric prediction?

To predict a numeric quantity (A)

Classification learning is typically used for multilabelled instances.

False (B)

What are the two sets used in supervised learning?

Training set and test set

In association learning, rules are often limited to those that apply to a certain minimum number of examples called the ______.

support

Match the type of learning with its description:

  • Classification Learning = Predicting discrete classes
  • Association Learning = Discovering structure in nonnumeric data
  • Clustering = Grouping items without specified classes
  • Numeric Prediction = Forecasting a numeric outcome

What is the primary purpose of clustering?

To find natural groupings among items (D)

Association rules usually involve numeric attributes.

False (B)

What two metrics are often considered when evaluating association rules?

Support and confidence

The challenge in association learning is to avoid being swamped by too many ______.

rules

What is one way to measure the success of clustering?

Through subjective human judgment (B)

What is the primary focus of numeric prediction in machine learning?

Predicting numeric values (B)

The presence of one attribute never depends on the value of another attribute.

False (B)

What is the first step in preparing input for a data mining investigation?

Gather the data

In some cases, a single example comprises a set of _______.

instances

Match the following types of attributes with their descriptions:

  • Nominal = Categorical without a natural order
  • Ordinal = Categorical with a defined order
  • Interval = Numeric with meaningful differences but no true zero
  • Ratio = Numeric with meaningful differences and a true zero

Which of the following is a common challenge when integrating data from different sources?

Inconsistent data formats (B)

Data cleaning is a minor part of the effort in preparing data for a machine learning investigation.

False (B)

What is an example of a situation where different instances might have different attributes?

Molecules with different shapes

A machine learning input can generally be represented as a matrix of instances versus _______.

attributes

In which field might you need to gather data from outside the organization?

Weather data collection (C)

Flashcards

Input Representation: Concepts

Concepts are the things to be learned. They are described in a way that is understandable, discussable, and applicable to real-world examples.

Input Representation: Instances

Instances are individual examples of the concept to be learned. Data is broken down into independent instances.

Input Representation: Attributes

Attributes measure different aspects of an instance, often numeric or categorical.

Classification Learning

Learning to classify unseen examples based on labeled (already classified) examples.

Association Learning

Learning relationships amongst features, not just those leading to a class value.

Clustering Learning

Learning groups of examples that naturally belong together.

Input Instances: Individual Examples

Instances are the individual examples that make up the input in the simplest data mining setting, with one instance per example.

Input Data Preparation

Putting the raw data into a format that a data mining tool can read, such as a text file.

Numeric Prediction

Predicting a numerical value as the outcome, not a category.

Concept

The thing being learned in any type of learning process, like a rule or pattern.

Concept Description

The output produced by a learning scheme, describing the learned pattern.

Multi-labeled Instances

Examples can belong to multiple categories simultaneously.

Supervised Learning

Learning from labeled examples to make predictions on unseen data.

Training Set

Labeled examples used to train the learning model.

Test Set

Unseen examples used to evaluate the model's performance.

Generalization

The ability of a model to perform well on unseen data.

Cross-validation

A technique to assess the generalization ability of a model by splitting data into training and testing sets multiple times.

Input Instances

Individual examples used as input for machine learning algorithms. Each instance represents a single observation or data point.

Multi-Instance Example

A situation where each example is actually a set of instances, not just one. This is common when a single instance is not enough to represent the concept.

Attributes

Features or characteristics that describe an instance. They can be numerical or categorical.

Nominal Attributes

Attributes that have distinct categories without any order or ranking.

Numeric Attributes

Attributes that represent quantities measured on a numerical scale.

Ordinal Attributes

Attributes that have categories with a defined order or ranking, but the difference between categories may not be equal.

Interval Attributes

Numeric attributes measured in fixed, equal units, so differences between values are meaningful, but there is no true zero point.

Ratio Attributes

Numeric attributes with meaningful differences and a true zero point, so ratios between values can be calculated.

Data Preparation

The crucial process of cleaning, organizing, and transforming raw data before feeding it to a machine learning model.

Overlay Data

Data aggregated from multiple sources or time periods, giving a combined view.

Sparse Data

Data where most values are zero or unspecified, common in market basket analysis.

Normalization

Scaling data to a fixed range (often 0-1) for comparison and consistency.

Missing Values

Values that are absent or unknown in a dataset.

Inaccurate Values

Data with errors or inconsistencies that can affect analysis.

Study Notes

Input Representations

  • The input to data mining (DM) methods is described in terms of concepts, instances, and attributes.
  • The concept is the thing to be learned; the concept description is the output that the learning scheme produces.
  • Defining a concept precisely is hard, so the aim is a description that is understandable, open to discussion, and applicable to actual examples.

Instances

  • Input data is a collection of instances.
  • Instances are individual, independent examples of the concept to be learned.
  • Sometimes, raw data can't be neatly separated into individual instances.
  • Background/contextual knowledge might be part of the input.
  • Raw data might be a unified mass and not easily fragmented.
  • A single, continuous sequence of data (e.g., time series) might not be easily subdivided.
  • Instances are characterized by attribute values that measure different aspects.
  • Most schemes handle numeric attributes and nominal (categorical) attributes.
  • Input data can be a text file.
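
As a rough illustration of the points above, the input can be pictured as a matrix with one row per instance and one column per attribute. The sketch below uses a small made-up table in the spirit of the weather example; the attribute names, the values, and the helper functions `column` and `attribute_kind` are assumptions for illustration, not part of the lesson.

```python
# Hypothetical toy data: each row is one instance, each column one attribute.
attributes = ["outlook", "temperature", "humidity", "windy", "play"]
instances = [
    ["sunny", 85, 85, False, "no"],
    ["overcast", 83, 86, False, "yes"],
    ["rainy", 70, 96, False, "yes"],
]


def column(name):
    """Return every instance's value for one attribute."""
    idx = attributes.index(name)
    return [row[idx] for row in instances]


def attribute_kind(values):
    """Crude guess: plain numbers are numeric, everything else is nominal."""
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "numeric" if is_numeric else "nominal"


kinds = {name: attribute_kind(column(name)) for name in attributes}
print(kinds)
```

Here `temperature` and `humidity` come out as numeric, while `outlook`, `windy`, and `play` are treated as nominal.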

Styles of Learning

  • Classification: Learning a way to categorize (classify) new, unseen examples. Most examples in Chapter 1 are classification problems (e.g., weather, contact lens). Multilabelled instances can be tricky.
  • Association: Identifying and examining structures within data for any "interesting" relations among features, instead of focusing on predicting a specific class.
  • Clustering: Identifying naturally grouped items.
  • Numeric prediction: Predicting a numerical outcome rather than a category.
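
For association learning, the quiz above mentions support and confidence as the usual measures for a rule. A minimal sketch, assuming support is counted as the number of examples that contain every item in the rule and confidence is that count divided by the number of examples matching the rule's left-hand side; the transactions and function names are hypothetical.

```python
# Hypothetical market-basket transactions, one set of items per example.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support_count(itemset):
    """Number of examples that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(antecedent, consequent):
    """Fraction of examples matching the antecedent that also match the consequent."""
    covered = support_count(antecedent)
    if covered == 0:
        return 0.0
    return support_count(antecedent | consequent) / covered


# Rule under test: if bread then milk.
rule_support = support_count({"bread", "milk"})
rule_confidence = confidence({"bread"}, {"milk"})
print(rule_support, rule_confidence)
```

Keeping only rules whose support and confidence exceed chosen minimums is one way to avoid being swamped by too many rules.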

ARFF Format

  • ARFF is a file format used to store data examples which can include numeric, nominal, string, or date type attributes.
  • The ARFF file is structured with distinct sections for specifying the relation name, attributes, and the data instances.
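
To make the three sections concrete, the sketch below embeds a small hypothetical ARFF file in a Python string and reads it with a deliberately minimal, hand-written parser. The file contents and the `parse_arff` helper are assumptions for illustration; the parser only copes with simple, unquoted, dense data and is not a full ARFF reader. In ARFF, `%` starts a comment and `?` marks a missing value.

```python
ARFF_TEXT = """\
% Hypothetical example file.
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}

@data
sunny,85,no
overcast,83,yes
rainy,?,yes
"""


def parse_arff(text):
    """Minimal reader for simple, unquoted, dense ARFF text."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if in_data:
            # "?" marks a missing value; store it as None
            rows.append(
                [None if v.strip() == "?" else v.strip() for v in line.split(",")]
            )
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            attributes.append((name, spec))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(ARFF_TEXT)
print(relation, attributes, rows)
```

Values are kept as strings here; a fuller reader would convert them according to each attribute's declared type.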

Sparse Data

  • Examples of sparse data include market basket data and document word count data.
  • In sparse data, unspecified values are represented as zeroes, not as missing values.
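
A sparse instance can be stored by listing only its nonzero values. The sketch below, with a made-up vocabulary and word counts, shows one possible representation and how it expands to a dense row in which every unspecified entry is 0 rather than missing. The names `to_dense` and `to_sparse` are hypothetical.

```python
# Hypothetical attribute order for a document word-count dataset.
vocabulary = ["data", "mining", "weather", "lens", "rule"]

# Sparse form: only nonzero counts are stored, keyed by attribute index.
sparse_doc = {0: 3, 1: 2, 4: 1}


def to_dense(sparse, n_attributes):
    """Expand a sparse instance; unspecified entries become 0, not missing."""
    return [sparse.get(i, 0) for i in range(n_attributes)]


def to_sparse(dense):
    """Keep only the nonzero entries of a dense instance."""
    return {i: v for i, v in enumerate(dense) if v != 0}


dense_doc = to_dense(sparse_doc, len(vocabulary))
print(dense_doc)
```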

Normalization

  • Data normalization can be carried out by the DM algorithm itself.
  • Normalizing to a fixed range (e.g., 0-1) can be done by dividing by the maximum value, or by subtracting the minimum value and dividing by the range (maximum minus minimum).
  • Another method subtracts the mean and divides by the standard deviation.
  • "Distance" between attribute values is often crucial for many DM algorithms.
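
The two normalization methods in the notes can be sketched as follows. The function names and sample values are hypothetical, and using the population standard deviation (`statistics.pstdev`) rather than the sample version is an assumption; the notes do not say which one is meant.

```python
from statistics import mean, pstdev


def min_max(values):
    """Rescale to the range 0-1: subtract the minimum, divide by the range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant attribute: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


temperatures = [64, 70, 85]  # hypothetical attribute values
scaled = min_max(temperatures)
z = standardize(temperatures)
print(scaled, z)
```

After min-max scaling the smallest value maps to 0 and the largest to 1; after standardization the values have mean 0 and standard deviation 1, which keeps attributes with large raw ranges from dominating distance calculations.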

Measuring Distance for Nominal Attributes

  • To measure the "distance" between two values of a nominal attribute, a distance of 0 is assigned if the values are the same and 1 if they are different.
  • Alternatively, several synthetic binary attributes can be created for each nominal attribute, one per possible value.
  • A genuine mapping between nominal values and numeric scales might exist (e.g., zip codes to latitude and longitude).
  • Nominal values can sometimes be expressed as integers (e.g., employee codes).
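
A minimal sketch of both ideas, using the usual convention that identical nominal values are at distance 0 and different values at distance 1. The function names and the list of `outlook` values are assumptions for illustration.

```python
def nominal_distance(a, b):
    """Distance between two nominal values: 0 if equal, 1 otherwise."""
    return 0 if a == b else 1


def one_hot(value, categories):
    """Replace one nominal value by a list of synthetic binary attributes."""
    return [1 if value == c else 0 for c in categories]


outlooks = ["sunny", "overcast", "rainy"]  # hypothetical nominal attribute

print(nominal_distance("sunny", "sunny"))
print(nominal_distance("sunny", "rainy"))
print(one_hot("overcast", outlooks))
```

With the binary encoding, any two different values differ in exactly two positions, so no value is treated as closer to one neighbor than to another.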

Ordinal vs. Nominal Data

  • Sometimes data can be represented either as ordinal or nominal (e.g., age group).
  • The choice of representation depends on context (e.g., the age attribute in the contact lens dataset).

Missing Values

  • Missing values are often indicated by out-of-range entries, or by blank entries or dashes in nominal attributes.
  • Different kinds of missing value (unknown, unrecorded, irrelevant) can be coded with different numbers, e.g., -1, -2, -3.
  • The reasoning behind missing values can be important and might need to be considered.
  • Missing information itself could be relevant to the analysis (e.g. an earlier test determined that the value wasn't applicable).
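
One way to keep the reason for a missing value available to the analysis is to decode the out-of-range codes into a value plus a reason. This is a sketch only: which number stands for which reason is an assumption, and `MISSING_CODES`, `decode`, and the sample ages are hypothetical.

```python
# Hypothetical coding scheme; the mapping of codes to reasons is an assumption.
MISSING_CODES = {-1: "unknown", -2: "unrecorded", -3: "irrelevant"}


def decode(value):
    """Split a raw entry into (value, reason_missing)."""
    if value in MISSING_CODES:
        return None, MISSING_CODES[value]
    return value, None


raw_ages = [34, -1, 51, -3]  # hypothetical attribute with coded gaps
decoded = [decode(v) for v in raw_ages]
known = [v for v, reason in decoded if reason is None]
print(decoded, known)
```

Decoding first avoids a common mistake: feeding codes such as -1 into a mean or a distance calculation as if they were real measurements.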

Inaccurate Values

  • It is important to check the data file for invalid or inaccurate values.
  • Data used for mining was often not gathered for that purpose, so it may contain defects such as unrecorded or blank values.
  • Data errors and omissions are often important details.
  • Errors in data entry and typing errors are common and need attention.
  • Duplicate information might exist and need to be removed or recognized.
  • The quality and freshness of the data might also be a factor to consider.
