Module 1: Data Mining: Chapter 1 (Introduction), Chapter 2 (Getting to Know Your Data)

Questions and Answers

Which of the following is a result of the rapid growth of Data?

Decreased data collection
Digital Transformation and Automation (correct)
Reduced data availability
Data scarcity

Which of the following fields is NOT mentioned as being impacted by the rapid growth of data?

Business
Society
Arts (correct)
Science

What is the purpose of data mining?

To obscure knowledge within data
To complicate data analysis
To help find knowledge within data (correct)
To ignore data

Data mining is also known as:

Knowledge discovery in databases (KDD) (B)

What is the definition of Data Mining?

The process of extracting knowledge from data (C)

Data mining is helpful in extracting:

Interesting patterns (C)

Which of the following is NOT a step in the Knowledge Discovery (KDD) process?

Data Obfuscation (B)

What is the purpose of data cleaning in the KDD process?

To remove noise and inconsistent data (B)

What happens during data integration in the KDD process?

Multiple data sources may be combined (B)

What is done during data transformation in the KDD process?

Data is transformed and consolidated (C)

What is the purpose of knowledge presentation in the KDD process?

To present mined knowledge to users (A)

Which type of database is mentioned?

Relational database (B)

Which type of application is mentioned?

Time-series data (C)

What is the purpose of generalization in data mining?

To generalize, summarize, and contrast data characteristics (A)

Which of the following describes the goal of Cluster Analysis?

Unsupervised learning (A)

Which of the following is NOT a data mining technique?

Data Obfuscation (C)

What is the goal of classification in data mining?

To build a model that can classify objects (A)

In cluster analysis, data objects in a group are:

Similar (D)

Machine Learning uses which of the following?

All of the above (D)

Algorithms must be highly scalable to handle data volumes such as:

Tera-bytes of data (B)

What is Web page analysis used for?

Clustering to PageRank & HITS algorithms (B)

What does data mining NOT impact?

Quantum physics (D)

Which of the following is NOT something data mining faces?

Mining knowledge in a one dimensional space (D)

What should mining methods be able to do?

Handle complex types of data (D)

A data object represents?

An entity (B)

What object might you find in a sales database?

Store Item (A)

What is an attribute?

Data field (D)

Which of the following terms is used interchangeably with 'attribute'?

Feature (C)

Which term do data mining and database professionals commonly use?

Attribute (B)

Nominal attributes are also called:

Categorical (B)

Hair color is an example of:

Nominal attribute (C)

A binary attribute has how many categories or states?

Two (C)

Medical Test result is an example of:

Binary Attribute (B)

Attributes with a meaningful order but unknown magnitude between values are:

Ordinal (B)

Small, Medium and Large are examples of?

Ordinal Attributes (A)

Which attribute type can be interval-scaled?

Numeric (D)

Which of the following attributes is a ratio-scaled attribute?

Area (C)

Which attribute type has a finite set of values and is generally discrete?

Nominal (D)

Which is not one of the Basic Statistical Descriptions of Data?

Why (C)

What is the goal of finding the Basic Statistical Descriptions of Data?

To better understand the data (C)

What is a 'Mode'?

A value that occurs most frequently in the data (B)

Which of the following best describes a Boxplot?

Graphic display of five-number summary (B)

What does a Histogram show?

x-axis are values, y-axis represent frequencies (D)

What is a goal of data mining?

Automated analysis of massive data sets. (D)

Which task is NOT typically considered data mining?

Simple search and query processing. (D)

What is a key step in data mining?

To remove noise and inconsistent data (D)

What is the purpose of 'Data transformation'?

To transform and consolidate data into appropriate forms for mining (C)

What does data integration involve?

Combining multiple data sources. (B)

What is a 'relational database'?

A type of database-oriented data set (A)

What is 'association analysis'?

Identifying frequent patterns and relationships. (B)

Which of the following is a data mining technique?

Cluster Analysis (D)

What do data mining systems/tools assist with?

Invisible data mining. (A)

What is a challenge in data mining related to data?

Handling the diversity of data types (C)

What is a data object?

A representation of an entity. (A)

What is 'feature'?

Attribute (C)

What kind of attribute is 'zip code'?

Nominal (B)

What is a 'medical test result' an example of?

Binary Attribute (C)

Which of the following is an example of a discrete attribute?

Nominal (C)

In basic statistical descriptions of data, what is the primary motivation?

To better understand the data (C)

What are you measuring if looking at the central tendency?

Mean, Median, and Mode (B)

What is the 'Mean'?

An average. (C)

What is the main focus of the basic statistical descriptions of a set of data?

The data's central tendency and distribution. (A)

How is the median found when the data are already sorted?

The middle Value (B)

What is 'Midrange'?

Another measure of central tendency (A)

What is the main purpose of plotting outliers individually in a box plot?

To see data outside of the main dataset. (A)

What does a quantile plot display?

All of the data sorted in increasing order. (C)

In data visualization, what should the aim be?

Communicate clearly and effectively through graphical representation (B)

Flashcards

Why Data Mining?

The rapid increase and accessibility of data, driven by digital transformation and automation, across various domains.

What is Data Mining?

The process of extracting meaningful information from data, regardless of its purpose. Also known as KDD.

Data Cleaning

Removing noise and inconsistencies from data.

Data Integration

Combining data from various sources.

Data Selection

Retrieving data relevant to the analysis task from the database.

Data Transformation

Transforming and consolidating data into appropriate forms for mining.

Data Mining (process)

Applying intelligent methods to find data patterns/knowledge.

Pattern Evaluation

Identifying genuinely interesting knowledge-representing patterns.

Knowledge Presentation

Presenting mined knowledge through visualizations for users.

Four Attribute Types

Nominal, Binary, Ordinal and Numeric

Nominal Attributes

Representing a category, code, or state without meaningful order.

Binary Attributes

Representing with only two categories or states (0 or 1).

Ordinal Attributes

Representing values with meaningful ranks, but unknown magnitude.

Numeric Attributes

Quantitative, measurable values that are either interval-scaled or ratio-scaled.

Discrete Attribute

Finite or countably infinite set of values.

Continuous Attribute

Real numbers as attribute values.

Basic Statistical Descriptions

Measuring data's central tendency and distribution characteristics.

Mean

The average value, a measure of the data's center: the sum of the values divided by the sample size.

Median

Sorted data's middle value.

Mode

A value that happens the most in the data.

Data dispersion

Describes data distribution, using quartiles, outliers, and boxplots.

Quantile Plots

Displays all of the data, sorted in increasing order.

Quantile-Quantile Plot

Compares the quantiles of one distribution against the corresponding quantiles of another.

Scatter Plot

Plots each pair of values as a point to look at the relationship between two attributes.

Data Visualization

Clear communication of data using graphical representation.

Pixel Oriented Visualization Techniques

Techniques where each window represents an attribute or feature.

Geometric Projection Visualization Techniques

Visualization of geometric transformations and projections of the data.

Study Notes

DS472 focuses on Data Mining.
Module 1, Chapter 1 is an introduction to Data Mining.
Module 1, Chapter 2 addresses getting to know your data.

Learning Outcomes

A key learning outcome is to define data mining and its applications.
Students will recognize the kinds of data and patterns that can be mined.
Another aim is to describe data objects and attribute types.

Introduction to Data Mining

Data mining is important because of the rapid growth of data.
Digital transformation and automation have increased data availability and collection.
Data mining is applicable across all fields, including business (e-commerce, transactions, stocks), science (remote sensing, bioinformatics, scientific simulation), and society (news, digital cameras, YouTube).
Data mining helps find knowledge within the large amounts of available data.
Data mining offers automated analysis of massive data sets.
Data mining goes by various names, depending on the era and field, such as KDD (knowledge discovery in databases) and business intelligence.
Data mining extracts knowledge from data regardless of the intended use.
In essence, it extracts interesting, non-trivial, implicit, previously unknown, and potentially useful patterns or knowledge.
Note that it is easy to overstate what counts as data mining: similar tasks such as simple search and query processing, or the use of expert systems, are not data mining.

Knowledge Discovery Process

Data cleaning removes noise and inconsistent data.
Data integration combines multiple data sources.
Data selection retrieves relevant data for the analysis task from the database.
Data transformation consolidates data into appropriate forms for mining.
Data mining extracts data patterns or knowledge using intelligent methods.
Pattern evaluation identifies interesting patterns representing knowledge.
Knowledge presentation presents the mined knowledge to users, for example through visualizations.
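
The steps above can be sketched end to end on a toy example. This is only a minimal illustration, not a real mining system: the two store lists, the field names, and the `MIN_TOTAL` threshold are all hypothetical, and the "mining" step is a simple frequency count standing in for an intelligent method.

```python
from collections import Counter

# Hypothetical toy sources; None marks a missing (noisy) value.
store_a = [{"customer": "c1", "item": "bread", "qty": 2},
           {"customer": "c2", "item": "milk", "qty": None}]
store_b = [{"customer": "c3", "item": "bread", "qty": 1},
           {"customer": "c4", "item": "eggs", "qty": 3}]

def clean(records):
    """Data cleaning: drop records that contain a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]

# Data cleaning and data integration: clean each source, then combine them.
integrated = clean(store_a) + clean(store_b)

# Data selection: keep only the fields relevant to the analysis task.
selected = [(r["item"], r["qty"]) for r in integrated]

# Data transformation: consolidate quantities per item.
totals = Counter()
for item, qty in selected:
    totals[item] += qty

# Data mining and pattern evaluation: keep items that meet the threshold.
MIN_TOTAL = 2  # hypothetical interestingness threshold
patterns = {item: n for item, n in totals.items() if n >= MIN_TOTAL}

# Knowledge presentation: show the result in a readable form.
for item, n in sorted(patterns.items()):
    print(f"{item}: {n} units sold")
```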

Types of Data

Many kinds of data can be mined.
Database-oriented data sets include relational databases, data warehouses, and transactional databases.
Also included are data streams, sensor data, time-series and temporal data, sequence data (including bio-sequences), structured data, graphs, social networks, multi-linked data, and object-relational and heterogeneous databases.
Spatial, spatiotemporal, multimedia, and text databases and the World Wide Web are also mineable data.

Patterns that can be mined

Generalization techniques summarize and contrast data characteristics, for example dry vs. wet regions.
Association and correlation analysis identifies frequent patterns (or frequent itemsets), such as items frequently purchased together at a supermarket.
An association rule such as Bread -> Milk [0.5%, 75%] is quantified by its support (0.5%) and its confidence (75%).
Association and correlation must be distinguished from each other and from causality.
Not all strongly associated items are strongly correlated.
Classification (supervised learning) builds a model that classifies objects based on their characteristics.
The model is learned from training data.
Typical applications include diagnosis and credit card fraud detection.
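
The support and confidence of an association rule can be computed directly from transaction data. The sketch below uses a small hypothetical basket list (the transactions and the helper names `support` and `confidence` are assumptions for illustration): support is the fraction of transactions containing all items of the rule, and confidence is the fraction of transactions containing the left-hand side that also contain the right-hand side.

```python
# Hypothetical supermarket baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"eggs"},
    {"butter"},
    {"eggs", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

s = support({"bread", "milk"}, transactions)
c = confidence({"bread"}, {"milk"}, transactions)
print(f"Bread -> Milk [support={s:.1%}, confidence={c:.0%}]")
```

Here 3 of the 8 baskets contain both bread and milk (support 37.5%), and 3 of the 4 baskets with bread also contain milk (confidence 75%).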

Relevant Technologies

Machine learning, pattern recognition, and statistics are relevant technologies for data mining.
Visualization, algorithms, database technology, and high-performance computing also enable data mining.

Multiple Disciplines in Data Mining

Having scalable algorithms enables handling tera-bytes of data.
Micro-arrays can have tens of thousands of dimensions.
Highly complex data includes Data streams and sensor data, time-series and temporal data, sequence data, structured data, graphs, social networks, multi-linked data, heterogeneous databases, legacy databases, spatial, spatiotemporal, multimedia, text and Web data and programs.

Data Mining Applications

Web page analysis ranges from web page classification and clustering to the PageRank and HITS algorithms.
Other examples range from analysis and recommender systems to basket data analysis and targeted marketing.
Biological and medical data analysis and software engineering are also application areas.
There are many existing data mining tools (e.g., SAS, Oracle Data Mining Tools).

Issues in Data Mining

Data mining algorithms must be efficient and scalable.
Mining methods must be parallel, distributed, stream-oriented, and incremental.
Mining methods must handle diverse and complex types of data.
Mining must be able to handle dynamic, networked, global data repositories.
Social impacts, privacy-preserving techniques, and invisible mining are all important.

Data Objects and Attributes

Data sets are made up of data objects.
A data object represents an entity. In a sales database, the objects may be customers, store items, and sales.
In a medical database they may be patients; in a university database, students and professors.
Data objects are described by attributes.
Data objects are also referred to as samples, examples, instances, data points, or objects.

Important definitions

An attribute is a data field that represents a characteristic or feature of a data object.
The nouns attribute, dimension, feature, and variable are used interchangeably.
Dimension is most common in data warehousing.
Machine learning literature tends to use the term feature.
Statisticians tend to use the term variable.
Data mining and database professionals commonly use the term attribute.

Attribute Types

The type of an attribute is determined by the set of possible values it can take.
There are four types: nominal, binary, ordinal, and numeric (interval-scaled or ratio-scaled).
Nominal attributes are categories, codes, or states without meaningful order, such as hair color or marital status.
Binary attributes have only two categories or states, such as 0/1 (absent/present) or Boolean values.
Ordinal attributes have values with a meaningful order but unknown magnitude between successive values, e.g., size or grades.
Numeric attributes are measurable quantities represented by integer or real values.

Categorizing numeric attributes

Interval-scaled attributes are measured on a scale of equal-size units and have no true zero point, like temperature in °C or °F.
Ratio-scaled attributes are numeric attributes with a true zero point.
Because of the zero point, one value can be expressed as a multiple (ratio) of another; examples include area, weight, height, counts, and monetary quantities.
Discrete attributes (nominal attributes are generally discrete) have a finite or countably infinite set of values, like zip codes or professions.
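
The difference between interval-scaled and ratio-scaled attributes shows up as soon as a ratio is computed. In this small sketch (the temperatures and the helper `celsius_to_kelvin` are chosen for illustration), dividing two Celsius values gives a number that has no physical meaning, while the same temperatures in Kelvin, a scale with a true zero point, give a meaningful ratio. Differences are meaningful on both scales.

```python
def celsius_to_kelvin(c):
    """Convert Celsius (interval-scaled) to Kelvin (ratio-scaled)."""
    return c + 273.15

t1_c, t2_c = 10.0, 20.0

# Differences are meaningful on an interval scale.
difference = t2_c - t1_c

# This ratio is 2.0, but 20 °C is not "twice as warm" as 10 °C,
# because 0 °C is not a true zero point.
naive_ratio = t2_c / t1_c

# On the Kelvin scale the ratio is meaningful: only about 3.5% higher.
kelvin_ratio = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)

print(difference, naive_ratio, round(kelvin_ratio, 4))
```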

Continuous Variables

Continuous attributes (typically ratio- and interval-scaled) have real numbers as values.
In practice, real values can be measured and represented only with a finite number of digits.
They are typically represented as floating-point variables.
Statistical Descriptors help understand central tendency and distribution.

Basic Statistical Descriptions of Data

Basic statistical descriptions measure the data's central tendency and its dispersion (spread and variation).
Central tendency is measured by the mean, median, mode, and midrange.
Dispersion is measured by the range, max, min, variance, standard deviation, quantiles, and outliers.

Working Out The Tendency

The mean is the sum of all values divided by the sample size.
A weighted average attaches a weight to each value to reflect its significance.
A trimmed mean is the mean computed after chopping off extreme values.
To find the median, first sort the data.
The median is the middle value if the number of values is odd; otherwise it is the average of the two middle values.
Computing the median exactly is expensive for big data, so approximations are used.
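
The measures above can be tried on a small sample. This is a minimal sketch using a hypothetical list of values with one extreme value; the weights and the helper `trimmed_mean` (which drops the k smallest and k largest values) are assumptions for illustration, not a prescribed method.

```python
import statistics

# Hypothetical sample with one extreme value (110).
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Mean: sum of all values divided by the sample size.
mean = sum(values) / len(values)

# Weighted average: hypothetical weights; the extreme value counts half.
weights = [1.0] * 11 + [0.5]
weighted_mean = sum(w * x for w, x in zip(weights, values)) / sum(weights)

def trimmed_mean(data, k):
    """Mean after dropping the k smallest and k largest values."""
    ordered = sorted(data)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Median: middle value of the sorted data, or the average of the two
# middle values when the number of values is even (as it is here).
median = statistics.median(values)

print(mean, round(weighted_mean, 2), trimmed_mean(values, 1), median)
```

The extreme value pulls the mean (58.0) above the median (54.0); trimming one value from each end brings the mean down to 55.6.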

Describing the median in detail

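The median can be approximated when the data are grouped into intervals of values and the frequency of each interval is known.
First, find the median interval, the interval that contains the median frequency.
Then compute the median by interpolation:

\[ \text{median} = L_1 + \left( \frac{N/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}} \right) \times \text{width} \]

where L_1 is the lower boundary of the median interval, N is the total number of sample values, (∑ freq)_l is the sum of the frequencies of all intervals lower than the median interval, freq_median is the frequency of the median interval, and width is the width of the median interval.
The mode is the value that occurs most frequently in the data.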
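
The interpolation for the grouped median can be sketched as a short function. The name `grouped_median` and the example intervals are hypothetical; the function assumes the intervals are given as (lower boundary, upper boundary, frequency) tuples sorted by lower boundary.

```python
def grouped_median(intervals):
    """Approximate the median of grouped data by interpolation.

    intervals: list of (lower, upper, freq) tuples, sorted by lower boundary.
    """
    n = sum(freq for _, _, freq in intervals)
    cumulative = 0  # sum of frequencies of the intervals below the current one
    for lower, upper, freq in intervals:
        if cumulative + freq >= n / 2:
            # This is the median interval: interpolate inside it.
            width = upper - lower
            return lower + (n / 2 - cumulative) / freq * width
        cumulative += freq
    raise ValueError("intervals must contain at least one value")

# Hypothetical grouped data: 50 values in four intervals of width 10.
groups = [(0, 10, 5), (10, 20, 15), (20, 30, 20), (30, 40, 10)]
approx_median = grouped_median(groups)
print(approx_median)
```

With these groups, N/2 = 25 falls in the interval 20 to 30 (20 values lie below it), so the estimate is 20 + (25 − 20)/20 × 10 = 22.5.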

More tendency data characteristics

A data set may have one or more modes: the value or values with the highest frequency.
A set with one mode is unimodal, with two bimodal, and with three trimodal.
For unimodal data that are moderately skewed, an empirical approximation is mean − mode = 3 × (mean − median).
The midrange is the average of the minimum and maximum values.
It is easy to compute in SQL with the aggregate functions max() and min().
In a symmetric unimodal distribution, the mean, median, and mode all lie at the same center value.
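
Modes and the midrange are easy to compute with the Python standard library. The sketch below uses a hypothetical sample; `statistics.multimode` returns every value that shares the highest frequency, so the length of its result tells whether the sample is unimodal, bimodal, or trimodal.

```python
import statistics

# Hypothetical sample in which 52 and 70 each occur twice.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# All values that share the highest frequency.
modes = statistics.multimode(values)
kind = {1: "unimodal", 2: "bimodal", 3: "trimodal"}.get(len(modes), "multimodal")

# Midrange: average of the minimum and maximum values.
midrange = (min(values) + max(values)) / 2

print(modes, kind, midrange)
```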

Dispersion basics

Real-world data are often not symmetric.
Their distributions are skewed.
Quartiles: Q1 is the 25th percentile and Q3 is the 75th percentile.
The inter-quartile range is IQR = Q3 − Q1.
The five-number summary is min, Q1, median, Q3, max.
In a boxplot, the ends of the box are at the quartiles.
Whiskers extend outside the box toward the minimum and maximum, and outliers, usually values more than 1.5 × IQR above Q3 or below Q1, are plotted individually.
The standard deviation is written s for a sample and σ for a population.
Variance and standard deviation are algebraic measures, so they can be computed scalably.
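
The five-number summary and the 1.5 × IQR outlier rule can be sketched as follows. The sample is hypothetical, and the quartiles depend on the interpolation convention: this sketch assumes the "inclusive" method of `statistics.quantiles`, and other tools may give slightly different Q1 and Q3 for the same data.

```python
import statistics

# Hypothetical sample with one extreme value (110).
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Quartiles (assumption: "inclusive" interpolation method).
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Five-number summary: min, Q1, median, Q3, max.
five_number = (min(values), q1, q2, q3, max(values))

# Values beyond 1.5 x IQR from the quartiles are treated as outliers.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in values if x < low_fence or x > high_fence]

print(five_number, iqr, outliers)
```

With these values, Q1 = 49.25 and Q3 = 64.75, so IQR = 15.5 and the fences are 26.0 and 88.0; only 110 is plotted as an outlier.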

Common formulas

Variance (population):

\[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2 \]

The standard deviation σ is the square root of the variance σ².
Popular plots help visualize such distributions.
A boxplot is a graphic display of the five-number summary.
In a histogram, the x-axis shows values and the y-axis shows frequencies.
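
The two forms of the population variance can be checked against each other on a small hypothetical sample. The second form needs only the running sums of x and x², which is why it suits a single pass over large data, although with floating-point numbers it can lose precision when the mean is large relative to the spread. The `Counter` at the end gives the value-to-frequency pairs that a histogram would plot.

```python
import math
from collections import Counter

# Hypothetical sample.
values = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(values)
mu = sum(values) / n

# Definition: mean squared deviation from the mean.
var_definition = sum((x - mu) ** 2 for x in values) / n

# Algebraic form: mean of the squares minus the square of the mean.
var_algebraic = sum(x * x for x in values) / n - mu ** 2

# Standard deviation: square root of the variance.
sigma = math.sqrt(var_definition)

# Histogram data: each distinct value (x-axis) with its frequency (y-axis).
frequencies = Counter(values)

print(var_definition, var_algebraic, sigma, sorted(frequencies.items()))
```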

Data Visualization

Visualization is used extensively across many applications, such as business operations and tracking the progress of tasks, and it helps discover data relationships that are otherwise not easily observable.
It can also provide visual proof of computer representations derived from the data.
Pixel-oriented visualization and geometric projection are two available techniques.
In pixel-oriented visualization, each window represents one attribute (feature), and the appearance of each pixel reflects the corresponding attribute value.
Space-filling layouts make good use of the available space.
Geometric projection techniques visualize geometric transformations and projections of the data.
Additionally, direct visualization and scatterplots can be used.

Icon-based visualization

Matrices, projection pursuit techniques, and hyperslices can also be leveraged.
General techniques include shape coding, color icons, and small icon representations.

Hierarchical Visualization

For large data sets of high dimensionality, the dimensions can be partitioned into subsets so that the data are easier to visualize at once.
Dimensional Stacking, Worlds-within-Worlds, Tree-Map, Cone Trees, and InfoCube are relevant methods.

Hierarchical visualization techniques

Dimensional stacking partitions the n-dimensional attribute space into 2-D subspaces, which are 'stacked' into each other.
It is adequate for data with ordinal attributes of low cardinality.
Displaying more than nine dimensions is difficult.
Mapping dimensions appropriately makes the display more effective.

Key Readings

Data Mining: Concepts and Techniques, (introduction)
Knowing Your Data
Data Mining: Practical Machine Learning Tools and Techniques Chapter concept
