Module 1: Data Mining: Chapter 1 (Introduction) and Chapter 2 (Getting to Know Your Data)

Questions and Answers

Which of the following is a result of the rapid growth of Data?

  • Decreased data collection
  • Digital Transformation and Automation (correct)
  • Reduced data availability
  • Data scarcity

Which of the following fields is NOT mentioned as being impacted by the rapid growth of data?

  • Business
  • Society
  • Arts (correct)
  • Science

What is the purpose of data mining?

  • To obscure knowledge within data
  • To complicate data analysis
  • To help find knowledge within data (correct)
  • To ignore data

Data mining is also known as:

  • Knowledge discovery in databases (KDD) (correct, option B)

What is the definition of Data Mining?

  • The process of extracting knowledge from data (correct, option C)

Data mining is helpful in extracting:

  • Interesting patterns (correct, option C)

Which of the following is NOT a step in the Knowledge Discovery (KDD) process?

  • Data Obfuscation (correct, option B)

What is the purpose of data cleaning in the KDD process?

  • To remove noise and inconsistent data (correct, option B)

What happens during data integration in the KDD process?

  • Multiple data sources may be combined (correct, option B)

What is done during data transformation in the KDD process?

  • Data is transformed and consolidated (correct, option C)

What is the purpose of knowledge presentation in the KDD process?

  • To present mined knowledge to users (correct, option A)

Which type of database is mentioned?

  • Relational database (correct, option B)

Which type of application is mentioned?

  • Time-series data (correct, option C)

What is the purpose of generalization in data mining?

  • To generalize, summarize, and contrast data characteristics (correct, option A)

Which of the following describes the goal of Cluster Analysis?

  • Unsupervised learning (correct, option A)

Which of the following is NOT a data mining technique?

  • Data Obfuscation (correct, option C)

What is the goal of classification in data mining?

  • To build a model that can classify objects (correct, option A)

In cluster analysis, data objects in a group are:

  • Similar (correct, option D)

Machine Learning uses which of the following?

  • All of the above (correct, option D)

Algorithms must be highly scalable to handle data volumes such as:

  • Terabytes of data (correct, option B)

What is Web page analysis used for?

  • Clustering to PageRank & HITS algorithms (correct, option B)

What does data mining NOT impact?

  • Quantum physics (correct, option D)

Which of the following is NOT something data mining faces?

  • Mining knowledge in a one-dimensional space (correct, option D)

What should mining methods be able to do?

  • Handle complex types of data (correct, option D)

What does a data object represent?

  • An entity (correct, option B)

What object might you find in a sales database?

  • Store item (correct, option A)

What is an attribute?

  • A data field (correct, option D)

Which of the following terms is used interchangeably with 'attribute'?

  • Feature (correct, option C)

Which term do data mining and database professionals commonly use?

  • Attribute (correct, option B)

Nominal attributes are also called:

  • Categorical (correct, option B)

Hair color is an example of:

  • Nominal attribute (correct, option C)

A binary attribute has how many categories or states?

  • Two (correct, option C)

A medical test result is an example of:

  • Binary attribute (correct, option B)

Attributes with a meaningful order but unknown magnitude between values are:

  • Ordinal (correct, option B)

Small, Medium and Large are examples of:

  • Ordinal attributes (correct, option A)

Which type of attribute can be interval-scaled?

  • Numeric (correct, option D)

Which of the following attributes is a ratio-scaled attribute?

  • Area (correct, option C)

Which attribute type, with a finite set of values, is generally discrete?

  • Nominal (correct, option D)

Which is NOT a basic statistical description of data?

  • Why (correct, option C)

What is the goal of finding the basic statistical descriptions of data?

  • To better understand the data (correct, option C)

What is a 'Mode'?

  • A value that occurs most frequently in the data (correct, option B)

Which of the following best describes a Boxplot?

  • Graphic display of the five-number summary (correct, option B)

What does a Histogram show?

  • The x-axis shows values and the y-axis shows frequencies (correct, option D)

What is a goal of data mining?

  • Automated analysis of massive data sets (correct, option D)

Which task is NOT typically considered data mining?

  • Simple search and query processing (correct, option D)

What is a key step in data mining?

  • To remove noise and inconsistent data (correct, option D)

What is the purpose of 'Data transformation'?

  • To transform and consolidate data into appropriate forms for mining (correct, option C)

What does data integration involve?

  • Combining multiple data sources (correct, option B)

What is a 'relational database'?

  • A type of database-oriented data set (correct, option A)

What is 'association analysis'?

  • Identifying frequent patterns and relationships (correct, option B)

Which of the following is a data mining technique?

  • Cluster Analysis (correct, option D)

What do data mining systems/tools assist with?

  • Invisible data mining (correct, option A)

What is a challenge in data mining related to data?

  • Handling the diversity of data types (correct, option C)

What is a data object?

  • A representation of an entity (correct, option A)

What is a 'feature'?

  • Attribute (correct, option C)

What kind of attribute is 'zip code'?

  • Nominal (correct, option B)

What is a 'medical test result' an example of?

  • Binary attribute (correct, option C)

Which of the following is an example of a discrete attribute?

  • Nominal (correct, option C)

What is the primary motivation for computing basic statistical descriptions of data?

  • To better understand the data (correct, option C)

What are you measuring if looking at the central tendency?

  • Mean, median, and mode (correct, option B)

What is the 'Mean'?

  • An average (correct, option C)

What is the main focus of the basic statistical descriptions of a set of data?

  • The data's central tendency and distribution (correct, option A)

How is the median found when the data are already sorted?

  • The middle value (correct, option B)

What is 'Midrange'?

  • Another measure of central tendency (correct, option A)

What is the main purpose of plotting outliers individually in a box plot?

  • To see data outside of the main dataset (correct, option A)

What does a quantile plot display?

  • All of the data sorted in increasing order (correct, option C)

In data visualization, what should the aim be?

  • Communicate clearly and effectively through graphical representation (correct, option B)

Flashcards

Why Data Mining?

The rapid increase and accessibility of data, driven by digital transformation and automation, across various domains.

What is Data Mining?

The process of extracting meaningful information from data, regardless of its purpose. Also known as KDD.

Data Cleaning

Removing noise and inconsistencies from data.

Data Integration

Combining data from various sources.

Data Selection

Retrieving data relevant to the analysis task from the database.

Data Transformation

Transforming and consolidating data into appropriate forms for mining.

Data Mining (process)

Applying intelligent methods to find data patterns/knowledge.

Pattern Evaluation

Identifying genuinely interesting knowledge-representing patterns.

Knowledge Presentation

Presenting mined knowledge through visualizations for users.

Four Attribute Types

Nominal, Binary, Ordinal and Numeric

Nominal Attributes

Representing a category, code, or state without meaningful order.

Binary Attributes

Attributes with only two categories or states (0 or 1).

Ordinal Attributes

Representing values with meaningful ranks, but unknown magnitude.

Numeric Attributes

Quantitative, measurable values that are either interval-scaled or ratio-scaled.

Discrete Attribute

Finite or countably infinite set of values.

Continuous Attribute

Real numbers as attribute values.

Basic Statistical Descriptions

Measuring data's central tendency and distribution characteristics.

Mean

The average value, a measure of the data's center: the sum of the values divided by the sample size.

Median

Sorted data's middle value.

Mode

The value that occurs most frequently in the data.

Data dispersion

Describes data distribution, using quartiles, outliers, and boxplots.

Quantile Plots

Displaying all of the data sorted in increasing order.

Quantile-Quantile Plot

Graphing the quantiles of one distribution against the corresponding quantiles of another.

Scatter Plot

Plotting each pair of values as a point to look at the relationship between two attributes.

Data Visualization

Clear communication of data using graphical representation.

Pixel Oriented Visualization Techniques

Techniques in which each window represents one attribute (feature).

Geometric Projection Visualization Techniques

Visualization of geometric transformations and projections of the data.

Study Notes

  • DS472 focuses on Data Mining.
  • Module 1, Chapter 1 is an introduction to Data Mining.
  • Module 1, Chapter 2 addresses getting to know your data.

Learning Outcomes

  • A key learning outcome is to define data mining and its applications.
  • Students will recognize the kinds of data and patterns that can be mined.
  • Another aim is to describe data objects and attribute types.

Introduction to Data Mining

  • Data mining is important because of the rapid growth of data.
  • Digital transformation and automation have increased data availability and collection.
  • Data mining is applicable across fields including business (e-commerce, transactions, stocks), science (remote sensing, bioinformatics, scientific simulation), and society (news, digital cameras, YouTube).
  • Data mining helps find knowledge within the large amounts of available data.
  • Data mining offers automated analysis of massive data sets.
  • Data mining goes by various names depending on era and field, such as knowledge discovery in databases (KDD) and business intelligence.
  • Data mining extracts knowledge from data regardless of the intended use.
  • In essence, it extracts interesting, non-trivial, implicit, previously unknown, and potentially useful patterns or knowledge.
  • It is easy to overstate what counts as data mining: simple search and query processing and the use of expert systems are not data mining.

Knowledge Discovery Process

  • Data cleaning removes noise and inconsistent data.
  • Data integration combines multiple data sources.
  • Data selection retrieves relevant data for the analysis task from the database.
  • Data transformation transforms and consolidates data into appropriate forms for mining.
  • Data mining extracts data patterns or knowledge using intelligent methods.
  • Pattern evaluation identifies the truly interesting patterns that represent knowledge.
  • Knowledge presentation presents the mined knowledge to users, typically through visualizations.
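The steps above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the records, the field layout `(customer_id, item, till)`, the function names, and the 0.6 support threshold are all hypothetical, and the "mining" step is a plain item count rather than a real mining algorithm.

```python
# Hypothetical toy records: (customer_id, item, till). None marks a missing item.
store_a = [(1, "bread", "till-1"), (1, "milk", "till-1"),
           (2, "bread", "till-2"), (2, None, "till-2")]
store_b = [(3, "Bread ", "till-9"), (3, "milk", "till-9"), (1, "milk", "till-9")]


def clean(records):
    """Data cleaning: drop records whose item is missing."""
    return [r for r in records if r[1] is not None]


def integrate(*sources):
    """Data integration: combine multiple sources into one list."""
    return [record for source in sources for record in source]


def select(records):
    """Data selection: keep only the fields relevant to the task."""
    return [(customer, item) for customer, item, _till in records]


def transform(records):
    """Data transformation: normalize names, consolidate into one basket per customer."""
    baskets = {}
    for customer, item in records:
        baskets.setdefault(customer, set()).add(item.strip().lower())
    return baskets


def mine(baskets):
    """Data mining (toy version): count the baskets that contain each item."""
    counts = {}
    for items in baskets.values():
        for item in items:
            counts[item] = counts.get(item, 0) + 1
    return counts


def evaluate(counts, n_baskets, min_support=0.6):
    """Pattern evaluation: keep items whose support reaches the threshold."""
    return {item: c / n_baskets for item, c in counts.items()
            if c / n_baskets >= min_support}


baskets = transform(select(integrate(clean(store_a), clean(store_b))))
patterns = evaluate(mine(baskets), len(baskets))

# Knowledge presentation: report the surviving patterns to the user.
for item, support in sorted(patterns.items()):
    print(f"{item}: support {support:.2f}")
```

In practice the steps are iterative rather than strictly linear, so a real pipeline would often loop back from evaluation to selection or transformation.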

Types of Data

  • Many kinds of data can be mined.
  • These include relational databases, data warehouses, and transactional databases.
  • Also included are data streams, sensor data, time-series and temporal data, sequence data (including bio-sequences), structured data, graphs, social networks, multi-linked data, and object-relational and heterogeneous databases.
  • Spatial, spatiotemporal, multimedia, and text databases and the World Wide Web are also mineable data.

Patterns that can be mined

  • Generalization techniques summarize and contrast data characteristics, for example dry vs. wet regions.
  • Association and correlation analysis identifies frequent patterns (or frequent itemsets), such as items frequently co-purchased at a supermarket.
  • An association rule such as Bread -> Milk [0.5%, 75%] carries two measures: support (0.5%) and confidence (75%).
  • Association, correlation, and causality are distinct kinds of relationship.
  • Not all strongly associated items are strongly correlated.
  • Classification (supervised learning) builds a model that assigns objects to classes based on their characteristics.
  • The model is learned from training data.
  • Typical applications include medical diagnosis and credit card fraud detection.
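The support and confidence of a rule such as Bread -> Milk can be computed directly from a list of transactions. The sketch below uses eight made-up baskets (an assumption, not data from the course) and the usual definitions: support is the fraction of transactions containing both sides of the rule, and confidence is that support divided by the support of the left-hand side.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical market baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"bread"},
    {"eggs"},
    {"milk", "eggs"},
    {"eggs"},
    {"tea"},
]

rule_support = support(transactions, {"bread", "milk"})          # 3 of 8 baskets
rule_confidence = confidence(transactions, {"bread"}, {"milk"})  # 3 of the 4 bread baskets
print(f"Bread -> Milk [{rule_support:.1%}, {rule_confidence:.0%}]")
```

A high confidence alone does not establish correlation or causality; it only says how often milk appears among the baskets that contain bread.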

More on Patterns that can be Mined

  • Cluster analysis (unsupervised learning) groups data objects so that objects within a group are similar to each other.
  • One application is targeting ads to specific groups, such as news readers.
  • Outlier analysis detects data objects that do not comply with the general behavior of the data.
  • Identifying shared and disparate object characteristics helps define outliers.
  • Noise is not the same as an outlier.
  • Outlier analysis can use classification or clustering techniques.
  • Rare event analysis and transaction fraud detection are examples of usage.
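As a small illustration of grouping similar objects without labels, here is a one-dimensional k-means sketch. k-means is one common clustering method and is not named in the notes; the reader ages and the two starting centers are hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Assign each value to its nearest center, then move each center to its cluster mean."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical ages of news readers; no class labels are given (unsupervised).
ages = [18, 20, 22, 45, 47, 50]
centers, clusters = kmeans_1d(ages, centers=[18, 50])
print(clusters)
```

A value that lies far from every cluster center would be a candidate outlier, which is one way clustering can support outlier analysis.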

Relevant Technologies

  • Machine learning, pattern recognition, and statistics are relevant technologies for data mining.
  • Visualization, algorithms, database technology, and high-performance computing all enable data mining.

Multiple Disciplines in Data Mining

  • Algorithms must be highly scalable to handle terabytes of data.
  • Data can be high-dimensional: micro-array data can have tens of thousands of dimensions.
  • Highly complex data includes data streams and sensor data, time-series and temporal data, sequence data, structured data, graphs, social networks, multi-linked data, heterogeneous databases, legacy databases, spatial, spatiotemporal, multimedia, text and Web data, and software programs.

Data Mining Applications

  • Web page analysis ranges from web page classification and clustering to the PageRank and HITS algorithms.
  • Other examples include recommender systems and basket data analysis for targeted marketing.
  • Data mining is also applied to biological and medical data analysis and to software engineering.
  • There are many existing data mining tools (e.g., SAS, Oracle Data Mining Tools).

Issues in Data Mining

  • Data mining algorithms must be efficient and scalable.
  • Parallel, distributed, stream-oriented, and incremental mining methods are needed.
  • Mining methods must handle diverse and complex types of data.
  • Mining must be able to handle dynamic, networked, global data repositories.
  • Social impacts, privacy-preserving data mining, and invisible data mining are further issues.

Data Objects and Attributes

  • Data sets are made up of data objects.
  • A data object represents an entity: in a sales database, the objects may be customers, store items, and sales.
  • In a medical database the objects may be patients; in a university database, students and professors.
  • Data objects are described or represented by attributes.
  • Data objects are also called samples, examples, instances, data points, or objects.

Important definitions

  • An attribute is a data field that represents a characteristic or feature of a data object.
  • The nouns attribute, dimension, feature, and variable are used interchangeably.
  • Dimension is commonly used in data warehousing.
  • Machine learning literature tends to use the term feature.
  • Statisticians prefer the term variable.
  • Data mining and database professionals commonly use the term attribute.

Attribute Types

  • The type of an attribute is determined by the set of possible values it can take.
  • The four types are nominal, binary, ordinal, and numeric (interval-scaled or ratio-scaled).
  • Nominal attributes are categories, codes, or states without meaningful order, such as hair color or marital status.
  • Binary attributes have only two categories or states, such as 0/1 (absent/present); they are called Boolean when the states are true and false.
  • Ordinal attributes have values with a meaningful order but unknown magnitude between them, e.g., size (small, medium, large) or grades.
  • Numeric attributes are quantities represented by integer or real values.
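The four attribute types differ in which operations make sense on them. The sketch below is one possible way to model this in Python; the class names, sample values, and the choice of `Enum` for nominal and `IntEnum` for ordinal attributes are illustrative assumptions.

```python
from collections import Counter
from enum import Enum, IntEnum
from statistics import mean, median_low


class HairColor(Enum):      # nominal: categories with no meaningful order
    BLACK = "black"
    BROWN = "brown"
    BLOND = "blond"


class Size(IntEnum):        # ordinal: ordered, but gaps between ranks are not meaningful
    SMALL = 1
    MEDIUM = 2
    LARGE = 3


hair = [HairColor.BROWN, HairColor.BLACK, HairColor.BROWN]
test_positive = [True, False, False]            # binary: two states
sizes = [Size.LARGE, Size.SMALL, Size.MEDIUM]
heights_cm = [170.0, 182.5, 165.5]              # numeric: arithmetic is meaningful

most_common_hair = Counter(hair).most_common(1)[0][0]       # mode suits nominal data
share_positive = sum(test_positive) / len(test_positive)    # proportion suits binary data
middle_size = median_low(sizes)                             # median suits ordinal data
average_height = mean(heights_cm)                           # mean suits numeric data

print(most_common_hair.name, middle_size.name, round(average_height, 1))
```

Plain `Enum` members deliberately do not support `<` or `>`, which mirrors the fact that nominal values have no order.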

Categorizing numeric attributes

  • Interval-scaled attributes are measured on a scale of equal-sized units without a true zero-point, such as temperature in °C or °F.
  • Ratio-scaled attributes are numeric attributes with a true zero-point.
  • Because of the true zero, one value can be expressed as a multiple (ratio) of another; examples are area, weight, height, counts, and monetary quantities.
  • Discrete attributes (nominal attributes are generally of this kind) have a finite or countably infinite set of values, such as zip codes or professions.
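A short worked example shows why ratios are meaningful only with a true zero-point. The temperatures are made up; the conversion to kelvin (a ratio scale with a true zero) is added here for contrast and is not part of the notes.

```python
def celsius_to_kelvin(celsius):
    """Convert to kelvin, a temperature scale with a true zero-point."""
    return celsius + 273.15


t1_c, t2_c = 10.0, 20.0

difference = t2_c - t1_c        # differences are meaningful on an interval scale
naive_ratio = t2_c / t1_c       # 2.0, but 20 °C is not "twice as hot" as 10 °C
kelvin_ratio = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)   # ratio on a true-zero scale

print(difference, naive_ratio, round(kelvin_ratio, 3))
```

For a ratio-scaled attribute such as weight, the plain quotient is already meaningful: 10 kg is twice 5 kg.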

Continuous Variables

  • Continuous attributes (typically interval-scaled and ratio-scaled) have real numbers as values.
  • In practice, real values can only be measured and represented using a finite number of digits.
  • They are typically represented as floating-point variables.
  • Statistical descriptors help us understand central tendency and distribution.

Basic Statistical Descriptions of Data

  • Basic statistical descriptions measure a data set's central tendency and its dispersion (spread, variation).
  • Measures of central tendency: mean, median, mode, and midrange.
  • Measures of dispersion: range, max, min, variance, standard deviation, quantiles, and outliers.

Working Out The Tendency

  • The mean is the sum of all values divided by the sample size.
  • A weighted average (weighted mean) attaches a weight to each value to reflect its significance.
  • A trimmed mean is computed after chopping off extreme values at both ends of the sorted data.
  • To find the median, first sort the data.
  • The median is the middle value if the number of values is odd; otherwise it is the average of the two middle values.
  • Sorting is expensive for large data sets, so the median is often approximated instead.
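These measures are short enough to write out by hand. The sketch below uses a made-up list of twelve salaries (in thousands); the function names and the 10% trimming proportion are illustrative choices.

```python
def mean(values):
    """Sum of the values divided by the sample size."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Each value counts in proportion to its weight."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, proportion):
    """Mean after dropping the lowest and highest `proportion` of the sorted values."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def median(values):
    """Middle value for odd sizes, average of the two middle values otherwise."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical salaries in thousands.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(mean(salaries))               # pulled upward by the single large value
print(trimmed_mean(salaries, 0.1))  # drops 30 and 110 before averaging
print(median(salaries))
print(weighted_mean([80, 90], [1, 3]))
```

The trimmed mean and the median are less sensitive than the mean to the extreme value 110.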

Describing the median in detail

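  • The approximation assumes that the data are grouped into intervals of values and that the frequency of each interval is known.
  • First, find the median interval (the interval that contains the median frequency).
  • Then estimate the median by interpolation: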

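$$\text{median} = L_1 + \left(\frac{N/2 - \left(\sum \text{freq}\right)_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$$

where $L_1$ is the lower boundary of the median interval, $N$ is the total number of sample values, $\left(\sum \text{freq}\right)_l$ is the sum of the frequencies of all intervals lower than the median interval, $\text{freq}_{\text{median}}$ is the frequency of the median interval, and width is the width of the median interval.

The mode is the value that occurs most frequently in the data.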
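The interpolation for the grouped median can be sketched as follows. The `(lower, upper, frequency)` tuple layout and the example intervals are assumptions made for illustration.

```python
def grouped_median(intervals):
    """Approximate the median from (lower, upper, frequency) intervals sorted by lower bound."""
    n = sum(freq for _, _, freq in intervals)
    below = 0  # sum of the frequencies of the intervals below the current one
    for lower, upper, freq in intervals:
        if below + freq >= n / 2:
            # This is the median interval: interpolate inside it.
            return lower + (n / 2 - below) / freq * (upper - lower)
        below += freq
    raise ValueError("no intervals given")


# Hypothetical grouped data: 5 values in [0, 10), 10 in [10, 20), 5 in [20, 30).
intervals = [(0, 10, 5), (10, 20, 10), (20, 30, 5)]
print(grouped_median(intervals))
```

Here N = 20, so N/2 = 10 falls in the second interval; 5 values lie below it, giving 10 + (10 − 5)/10 × 10 = 15.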

More tendency data characteristics

  • A data set can have more than one mode, because several values may share the highest frequency.
  • A set with one mode is unimodal, with two modes bimodal, and with three modes trimodal.
  • For unimodal data that are moderately skewed, the empirical relation mean − mode ≈ 3 × (mean − median) holds.
  • The midrange is the average of the minimum and maximum values.
  • It is easy to compute with the SQL aggregate functions max() and min().
  • In a symmetric unimodal distribution, the mean, median, and mode coincide at the center.
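A minimal sketch of these measures, again on made-up salary values. The helper names are hypothetical, and `approximate_mode` simply rearranges the empirical relation, so it is only a rough estimate for unimodal, moderately skewed data.

```python
from collections import Counter


def modes(values):
    """All values that share the highest frequency, in increasing order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


def midrange(values):
    """Average of the smallest and largest value."""
    return (min(values) + max(values)) / 2


def approximate_mode(mean_value, median_value):
    """Rearranged empirical relation: mode ≈ mean − 3 × (mean − median)."""
    return mean_value - 3 * (mean_value - median_value)


# Hypothetical salaries in thousands.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(modes(salaries))               # two modes, so this set is bimodal
print(midrange(salaries))
print(approximate_mode(58.0, 54.0))  # illustrative inputs only
```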

Dispersion basics

  • Data in real-world applications tend to be asymmetric, with skewed distributions.
  • Quartiles: Q1 is the 25th percentile and Q3 is the 75th percentile.
  • Interquartile range: IQR = Q3 − Q1.
  • The five-number summary is min, Q1, median, Q3, max.
  • In a boxplot, the ends of the box are at the quartiles, so the length of the box is the IQR.
  • Whiskers extend from the box toward the minimum and maximum; values more than 1.5 × IQR above Q3 or below Q1 are treated as outliers and plotted individually.
  • Standard deviation is written s for a sample and σ for a population.
  • Variance is an algebraic measure, so it can be computed scalably.
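The five-number summary and the 1.5 × IQR rule can be sketched as below. Quartiles have several competing definitions; this sketch assumes the "median of each half" convention (excluding the middle value when the size is odd), so other tools may report slightly different Q1 and Q3. The salary values are made up.

```python
def median_sorted(ordered):
    """Median of an already sorted list."""
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max); needs at least two values."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower, upper = ordered[:half], ordered[-half:]
    return (ordered[0], median_sorted(lower), median_sorted(ordered),
            median_sorted(upper), ordered[-1])


def outliers(values):
    """Values more than 1.5 x IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    return [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]


# Hypothetical salaries in thousands.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(five_number_summary(salaries))
print(outliers(salaries))
```

With Q1 = 48.5 and Q3 = 66.5 the IQR is 18, so the upper fence is 66.5 + 27 = 93.5 and only 110 is flagged.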

Common formulas

  • Variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
  • The standard deviation σ is the square root of the variance.
  • Popular visualization plots help visualize such distributions.
  • A boxplot is a graphic display of the five-number summary.
  • In a histogram, the x-axis shows values and the y-axis shows frequencies.
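Both forms of the population variance formula can be checked against each other. The second form needs only the running count, sum, and sum of squares, which is why variance can be computed in a single pass over large data. The sample values are made up.

```python
import math


def population_variance(values):
    """Definition: mean squared deviation from the mean (two passes over the data)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def population_variance_one_pass(values):
    """Algebraic form: mean of the squares minus the square of the mean (one pass)."""
    n, total, total_sq = 0, 0.0, 0.0
    for x in values:
        n += 1
        total += x
        total_sq += x * x
    mu = total / n
    return total_sq / n - mu * mu


data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values with mean 5

variance = population_variance(data)
print(variance, population_variance_one_pass(data), math.sqrt(variance))
```

One caveat: with floating-point numbers the one-pass form can lose precision when the mean is large relative to the spread, so numerically careful code often uses a more stable update scheme.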

More on graphic visualization

  • A quantile plot displays all of the data sorted in increasing order: each value x_i is paired with f_i, indicating that approximately 100 × f_i % of the data are at or below x_i.
  • A quantile-quantile (q-q) plot graphs the quantiles of one univariate distribution against the corresponding quantiles of another.
  • Data visualization aims to communicate data clearly and effectively through graphical representation.
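The points behind both plots can be computed without any plotting library. This sketch assumes the common convention f_i = (i − 0.5) / N for the quantile plot, which the notes do not state, and it handles only the simple q-q case where both samples have the same size. All sample values are made up.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / N, returned as (f_i, x_i)."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


def qq_points(sample_a, sample_b):
    """Pair the sorted values of two equally sized samples, quantile against quantile."""
    if len(sample_a) != len(sample_b):
        raise ValueError("this sketch only handles samples of equal size")
    return list(zip(sorted(sample_a), sorted(sample_b)))


prices_branch_1 = [40, 10, 30, 20]   # hypothetical unit prices
prices_branch_2 = [44, 12, 35, 21]

print(quantile_plot_points(prices_branch_1))
print(qq_points(prices_branch_1, prices_branch_2))
```

If the q-q points lie above the line y = x, the second sample tends to have higher values at the same quantile.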

Data visualization uses and techniques

  • Visualization is used extensively across many applications, such as business operations and tracking the progress of tasks, and to discover data relationships that are otherwise not easily observable.
  • It can also provide visual proof of computer-derived representations.
  • Pixel-oriented visualization and geometric projection are two available families of techniques.
  • In pixel-oriented visualization, each window represents one attribute (feature), and the pixel drawn for a value reflects that value.
  • Space filling also saves screen space.
  • Geometric projection techniques visualize geometric transformations and projections of the data.
  • Additionally, direct visualization and scatter plots can be used.

Icon-based visualization

  • Scatterplot matrices, projection pursuit techniques, and hyperslices can also be leveraged.
  • General icon-based techniques include shape coding, color coding, and other small representations of the data values.

Hierarchical Visualization

  • Hierarchical techniques partition the dimensions into subsets that are visualized together, which makes large, high-dimensional data sets easier to view.
  • Dimensional Stacking, Worlds-within-Worlds, Tree-Map, Cone Trees, and InfoCube are relevant methods.

Hierarchical visualization techniques

  • Dimensional stacking partitions the n-dimensional attribute space into 2-D subspaces, which are 'stacked' into each other.
  • It is adequate for data with ordinal attributes of low cardinality.
  • It becomes difficult to display when the data have more than nine dimensions.
  • Mapping dimensions appropriately makes the display more effective.
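One way to picture dimensional stacking for four ordinal attributes is to let two "outer" attributes pick a block in a grid and two "inner" attributes pick the cell inside that block. The function below is a hypothetical sketch of that index arithmetic, not a description of any particular tool.

```python
def stacked_position(outer, inner, inner_cardinality):
    """Map two outer and two inner attribute values to one (column, row) grid cell.

    outer: (ox, oy) values of the outer attributes, counted from 0
    inner: (ix, iy) values of the inner attributes, counted from 0
    inner_cardinality: (cx, cy) number of distinct values of each inner attribute
    """
    (ox, oy), (ix, iy), (cx, cy) = outer, inner, inner_cardinality
    return ox * cx + ix, oy * cy + iy


# Hypothetical record: outer attributes at (1, 2), inner attributes at (2, 1),
# where the inner attributes have 3 and 2 possible values.
print(stacked_position(outer=(1, 2), inner=(2, 1), inner_cardinality=(3, 2)))
```

Each block is 3 cells wide and 2 cells tall, so the record lands at column 1 × 3 + 2 = 5 and row 2 × 2 + 1 = 5. The grid grows multiplicatively with every stacked attribute, which is why low cardinality matters.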

Key Readings

  • Data Mining: Concepts and Techniques, Chapter 1 (Introduction)
  • Data Mining: Concepts and Techniques, Chapter 2 (Getting to Know Your Data)
  • Data Mining: Practical Machine Learning Tools and Techniques Chapter concept
