Questions and Answers
Which of the following is a result of the rapid growth of Data?
- Decreased data collection
- Digital Transformation and Automation (correct)
- Reduced data availability
- Data scarcity
Which of the following fields is NOT mentioned as being impacted by the rapid growth of data?
- Business
- Society
- Arts (correct)
- Science
What is the purpose of data mining?
- To obscure knowledge within data
- To complicate data analysis
- To help find knowledge within data (correct)
- To ignore data
Data mining is also known as:
What is the definition of Data Mining?
Data mining is helpful in extracting:
Which of the following is NOT a step in the Knowledge Discovery (KDD) process?
What is the purpose of data cleaning in the KDD process?
What happens during data integration in the KDD process?
What is done during data transformation in the KDD process?
What is the purpose of knowledge presentation in the KDD process?
Which type of database is mentioned?
Which type of application is mentioned?
What is the purpose of generalization in data mining?
Which of the following describes the goal of Cluster Analysis?
Which of the following is NOT a data mining technique?
What is the goal of classification in data mining?
In cluster analysis, data objects in a group are:
Machine Learning uses which of the following?
Algorithms must be highly scalable to handle data such as:
What is Web page analysis used for?
What does data mining NOT impact?
Which of the following is NOT something data mining faces?
What should mining methods be able to do?
A data object represents?
What object might you find in a sales database?
What is an attribute?
Which of the following terms is used interchangeably with 'attribute'?
Which term do data mining and database professionals commonly use?
Nominal attributes are also called:
Hair color is an example of:
A binary attribute has how many categories or states?
Medical Test result is an example of:
Attributes with a meaningful order but unknown magnitude between values are:
Small, Medium and Large are examples of?
Which attribute is interval-scaled?
Which of the following attributes is a ratio-scaled attribute?
An attribute with a finite set of values is what kind of attribute?
Which is NOT a basic statistical description of data?
What is the goal of finding the Basic Statistical Descriptions of Data?
What is a 'Mode'?
Which of the following best describes a Boxplot?
What does a Histogram show?
What is a goal of data mining?
Which task is NOT typically considered data mining?
What is a key step in data mining?
What is the purpose of 'Data transformation'?
What does data integration involve?
What is a 'relational database'?
What is 'association analysis'?
What is the data mining technique?
What do data mining systems/tools assist with?
What is a challenge in data mining related to data?
What is a data object?
What is a 'feature'?
What kind of attribute is 'zip code'?
What is a 'medical test result' an example of?
Which of the following is an example of a discrete attribute?
In data analysis, what is the primary purpose of motivation?
What are you measuring if looking at the central tendency?
What is the 'Mean'?
What is the main focus of the basic statistical descriptions of a set of data?
How is the median found when the data are already sorted?
What is 'Midrange'?
What is the main purpose of plotting outliers individually in a box plot?
What does a quantile plot display?
In data visualization, what should the aim be?
Flashcards
Why Data Mining?
The rapid increase and accessibility of data, driven by digital transformation and automation, across various domains.
What is Data Mining?
The process of extracting meaningful information from data, regardless of its purpose. Also known as KDD.
Data Cleaning
Removing noise and inconsistencies from data.
Data Integration
Data Selection
Data Transformation
Data Mining (process)
Pattern Evaluation
Knowledge Presentation
Four Attribute Types
Nominal Attributes
Binary Attributes
Ordinal Attributes
Numeric Attributes
Discrete Attribute
Continuous Attribute
Basic Statistical Descriptions
Mean
Median
Mode
Data dispersion
Quantile Plots
Quantile-Quantile Plot
Scatter Plot
Data Visualization
Pixel Oriented Visualization Techniques
Geometric Projection Visualization Techniques
Study Notes
- DS472 focuses on Data Mining.
- Module 1, Chapter 1 is an introduction to Data Mining.
- Module 1, Chapter 2 addresses getting to know your data.
Learning Outcomes
- A key learning outcome is to define data mining and its applications.
- Students will recognize the kinds of data and patterns that can be mined.
- Another aim is to describe data objects and attribute types
Introduction to Data Mining
- Data mining is important because of the rapid growth of data
- Digital Transformation and Automation have increased data availability and collection.
- Data mining is applicable across all fields including business (e-commerce, transactions, stocks), science (remote sensing, bioinformatics, scientific simulation), and society (news, digital cameras, YouTube).
- Data mining helps find knowledge within the large amounts of available data.
- Data mining offers automated analysis of massive data sets.
- Data mining goes by various names depending on the era and field, such as knowledge discovery in databases (KDD) and business intelligence.
- Data mining extracts knowledge from data regardless of the intended use.
- In essence, it extracts interesting, non-trivial, implicit, previously unknown, and potentially useful patterns or knowledge.
- It is easy to overstate what counts as data mining: similar-looking tasks such as simple query processing or the use of expert systems are not data mining.
Knowledge Discovery Process
- Data cleaning removes noise and inconsistent data.
- Data integration combines multiple data sources.
- Data selection retrieves relevant data for the analysis task from the database.
- Data transformation consolidates data into appropriate forms for mining.
- Data mining extracts data patterns or knowledge using intelligent methods.
- Pattern evaluation identifies interesting patterns representing knowledge.
- Knowledge presentation uses visualization and knowledge representation techniques to present the mined knowledge to users.
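The seven steps above can be sketched as a chain of small functions. This is only a toy illustration: the two store record lists, the field names, and the `min_count` threshold are hypothetical, and counting frequent items stands in for a real mining algorithm.

```python
from collections import Counter

# Two hypothetical data sources with noisy and inconsistent records.
store_a = [{"customer": "ann", "item": "Bread "}, {"customer": "bob", "item": None}]
store_b = [{"customer": "cat", "item": "bread"}, {"customer": "dan", "item": "milk"}]

def clean(records):
    """Data cleaning: drop records with a missing item, normalize spelling."""
    return [
        {**r, "item": r["item"].strip().lower()}
        for r in records
        if r["item"] is not None
    ]

def integrate(*sources):
    """Data integration: combine multiple sources into one collection."""
    return [record for source in sources for record in source]

def select(records, field):
    """Data selection: keep only the field relevant to the analysis task."""
    return [r[field] for r in records]

def transform(values):
    """Data transformation: consolidate values into counts, a form ready for mining."""
    return Counter(values)

def mine(counts, min_count):
    """Data mining and pattern evaluation: keep items seen at least min_count times."""
    return {item: n for item, n in counts.items() if n >= min_count}

def present(patterns):
    """Knowledge presentation: render the result for a human reader."""
    return [f"{item}: seen {n} times" for item, n in sorted(patterns.items())]

records = integrate(clean(store_a), clean(store_b))
patterns = mine(transform(select(records, "item")), min_count=2)
report = present(patterns)
print(report)
```

Real pipelines rarely run in a single straight pass; the steps are usually revisited as the analysis reveals new data problems.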
Types of Data
- Many kinds of data can be mined.
- These include relational databases, data warehouses, and transactional databases.
- Also included are data streams, sensor data, time-series, temporal data, sequence data (including bio-sequences), structured data, graphs, social networks, multi-linked data, and object-relational and heterogeneous databases.
- Spatial, spatiotemporal, multimedia, and text databases and the World Wide Web are mineable as well.
Patterns that can be mined
- Generalization summarizes and contrasts data characteristics, for example dry vs. wet regions.
- Association and Correlation Analysis identifies frequent patterns (or frequent itemsets), such as items frequently co-purchased at a supermarket.
- An association rule such as Bread -> Milk [0.5%, 75%] carries two measures: support (0.5%) and confidence (75%).
- Association and correlation should be distinguished from causality.
- Not all strongly associated items are strongly correlated.
- Classification (supervised learning) builds a model that assigns objects to classes based on their characteristics.
- The model is learned from training data.
- Typical applications include medical diagnosis and credit card fraud detection.
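To make the two numbers attached to an association rule concrete, here is a minimal sketch of how support and confidence could be computed. The `baskets` data are made up for illustration, and the function names are not taken from any particular library.

```python
def count_containing(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t))

def support(transactions, itemset):
    """Fraction of all transactions that contain the whole itemset."""
    return count_containing(transactions, itemset) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also
    contain the consequent. Assumes the antecedent occurs at least once."""
    both = count_containing(transactions, set(antecedent) | set(consequent))
    return both / count_containing(transactions, antecedent)

# Hypothetical supermarket baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "milk", "butter"},
]

rule_support = support(baskets, {"bread", "milk"})
rule_confidence = confidence(baskets, {"bread"}, {"milk"})
print(f"Bread -> Milk [{rule_support:.0%}, {rule_confidence:.0%}]")
```

Here three of the five baskets contain both items, and three of the four baskets with bread also contain milk.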
More on Patterns that can be Mined
- Cluster Analysis (unsupervised learning) groups data objects that are similar to each other.
- One application is targeting ads to specific groups, such as news readers.
- Outlier Analysis detects data objects that do not comply with the general behavior of the data.
- Identifying shared and disparate object characteristics helps define outliers.
- Noise is not the same as an outlier.
- Outlier analysis can use classification or clustering techniques.
- Rare event analysis and transaction fraud detection are examples of usage.
Relevant Technologies
- Machine learning, pattern recognition, and statistics are relevant technologies for data mining.
- Visualization, algorithms, database technology, and high-performance computing also enable data mining.
Multiple Disciplines in Data Mining
- Scalable algorithms are needed to handle terabytes of data.
- Data can be high-dimensional: micro-arrays may have tens of thousands of dimensions.
- Highly complex data include data streams and sensor data; time-series, temporal, and sequence data; structured data, graphs, social networks, and multi-linked data; heterogeneous and legacy databases; spatial, spatiotemporal, multimedia, text, and Web data; and software programs.
Data Mining Applications
- Web page analysis ranges from web page classification and clustering to the PageRank and HITS algorithms.
- Recommender systems and basket data analysis for targeted marketing are also examples.
- Data mining is also applied to biological and medical data analysis and to software engineering.
- There are many existing data mining tools (e.g., SAS, Oracle Data Mining Tools).
Issues in Data Mining
- Data mining algorithms must be efficient and scalable.
- Mining methods should support parallel, distributed, stream-oriented, and incremental processing.
- Mining must handle diverse and complex types of data.
- Mining must be able to handle dynamic, networked, global data repositories.
- Societal issues matter as well: the social impact of data mining, privacy-preserving techniques, and invisible mining.
Data Objects and Attributes
- Data sets are made up of data objects.
- A data object represents an entity. In a sales database, the objects may be customers, store items, and sales.
- In a medical database, the objects may be patients; in a university database, students and professors.
- Data objects are described or represented by attributes.
- Data objects are also referred to as samples, examples, instances, data points, or objects.
Important definitions
- An attribute is a data field that represents a characteristic or feature of a data object.
- The nouns attribute, dimension, feature, and variable are often used interchangeably.
- Dimension is most commonly used in data warehousing.
- Machine learning literature tends to use the term feature.
- Statisticians prefer the term variable.
- Data mining and database professionals commonly use the term attribute.
Attribute Types
- The type of an attribute is determined by the set of possible values it can take.
- The four types: Nominal, Binary, Ordinal and Numeric (interval/ratio scaled).
- Nominal attributes take values that are categories, codes, or states without meaningful order, like hair color or marital status.
- Binary attributes are nominal attributes with only two categories or states, 0 and 1 (typically absent/present); they are called Boolean when the two states correspond to true and false.
- Ordinal attributes have values with a meaningful order, but the magnitude between successive values is unknown, e.g. size (small, medium, large) or grades.
- Numeric Attributes are quantities represented by integer/real values.
Categorizing numeric attributes
- Interval-scaled attributes are measured on a scale of equal-size units and have no true zero point, like temperature in °C or °F.
- Ratio-scaled attributes are numeric attributes with an inherent (true) zero point.
- Because of the true zero point, one value can be expressed as a multiple (ratio) of another; examples include area, weight, height, counts, and monetary quantities.
- Discrete attributes (nominal attributes are generally discrete) have a finite or countably infinite set of values, like zip codes or profession.
Continuous Variables
- Continuous attributes (typically ratio- and interval-scaled) have real numbers as values.
- In practice, real values can only be measured and represented using a finite number of digits.
- They are typically represented as floating-point variables
- Statistical Descriptors help understand central tendency and distribution.
Basic Statistical Descriptions of Data
- Basic statistical descriptions measure the central tendency, variation, and spread of the data.
- Measures of central tendency: mean, median, mode, and midrange.
- Measures of dispersion: range, max, min, quantiles, outliers, variance, and standard deviation.
Working Out The Tendency
- The mean is the sum of all values divided by the sample size.
- A weighted average assigns each value a weight that reflects its significance.
- A trimmed mean is computed after chopping off the extreme values at both ends of the data.
- To find the median, first sort the data.
- Then the middle value is taken if the data size is odd; otherwise the two middle values are averaged.
- Computing the exact median is expensive for large data sets, so it is often approximated.
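The measures described above can be sketched in a few lines of plain Python. The `salaries` list is illustrative data (in thousands), and the trimming `fraction` parameter is an assumption about how one might expose the trimmed mean.

```python
def mean(values):
    """Sum of all values divided by the sample size."""
    return sum(values) / len(values)

def weighted_mean(values, weights):
    """Each value contributes in proportion to its weight."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the values at each end (fraction < 0.5)."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

def median(values):
    """Middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Illustrative salary data in thousands; 110 is an extreme value.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(mean(salaries), trimmed_mean(salaries, 0.1), median(salaries))
```

The single extreme salary pulls the mean above both the trimmed mean and the median, which is why the latter two are preferred for skewed data.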
Describing the median in detail
- The approximation assumes that the data are grouped into intervals of x-values and that the frequency of each interval is known.
- First, find the median interval.
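- Then compute the median by interpolation:
$$\text{median} = L_1 + \left( \frac{N/2 - \left(\sum \text{freq}\right)_l}{\text{freq}_{\text{median}}} \right) \times \text{width}$$
where
$L_1$ = lower boundary of the median interval
$N$ = total number of sample values
$\left(\sum \text{freq}\right)_l$ = sum of the frequencies of all intervals lower than the median interval
$\text{freq}_{\text{median}}$ = frequency of the median interval
width = width of the median interval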
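As a rough sketch of the interpolation formula, the function below walks the intervals until the cumulative frequency reaches N/2 and then interpolates inside that interval. The `(lower, upper, freq)` tuple layout and the example intervals are assumptions made for illustration.

```python
def grouped_median(intervals):
    """Approximate median of grouped data.

    `intervals` is a list of (lower, upper, freq) tuples sorted by lower bound.
    """
    n = sum(freq for _, _, freq in intervals)
    below = 0  # sum of frequencies of the intervals below the current one
    for lower, upper, freq in intervals:
        if below + freq >= n / 2:
            # This is the median interval: interpolate within it.
            return lower + (n / 2 - below) / freq * (upper - lower)
        below += freq
    raise ValueError("no intervals given")

# Hypothetical grouped data: 2 values in [0, 10), 5 in [10, 20), 3 in [20, 30).
groups = [(0, 10, 2), (10, 20, 5), (20, 30, 3)]
approx = grouped_median(groups)
print(approx)
```

With N = 10, the median interval is [10, 20), two values lie below it, and the estimate is 10 + (5 − 2) / 5 × 10.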
Mode
- The mode is the value that occurs most frequently in the data.
More tendency data characteristics
- A data set can have more than one mode: with one most frequent value it is unimodal, with two it is bimodal, and with three it is trimodal.
- For unimodal, moderately skewed data, the empirical relation mean − mode ≈ 3 × (mean − median) holds.
- The midrange is the average of the minimum and maximum values.
- It is easy to compute with the SQL aggregate functions max() and min().
- In a symmetric unimodal distribution, the mean, median, and mode coincide at the center.
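The mode, midrange, and the empirical relation can be illustrated as follows. The data are made up, and `approx_mode` is a hypothetical helper name for the relation rearranged as mode ≈ 3 × median − 2 × mean.

```python
from collections import Counter

def modes(values):
    """All values that share the highest frequency, in ascending order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

def midrange(values):
    """Average of the smallest and largest values."""
    return (min(values) + max(values)) / 2

def approx_mode(mean, median):
    """Empirical estimate for unimodal, moderately skewed data:
    mean - mode ~ 3 * (mean - median), so mode ~ 3 * median - 2 * mean."""
    return 3 * median - 2 * mean

# Illustrative data: 52 and 70 both occur twice, so the set is bimodal.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(modes(salaries), midrange(salaries))
```

Returning a list from `modes` keeps multimodal data visible instead of silently picking one value.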
Dispersion basics
- Data in most real applications are not symmetric; their distributions are skewed.
- Quartiles: Q1 is the 25th percentile and Q3 is the 75th percentile.
- The inter-quartile range (IQR) equals Q3 – Q1.
- The five-number summary is min, Q1, median, Q3, max.
- In a boxplot, the ends of the box are at the quartiles, so the box length is the IQR.
- Whiskers extend from the box toward the minimum and maximum, and outliers, usually values more than 1.5 × IQR above Q3 or below Q1, are plotted individually.
- Standard deviation is written s for a sample and σ for a population.
- Variance and standard deviation are algebraic measures, so they can be computed in a scalable way.
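A short sketch of the five-number summary and the 1.5 × IQR outlier rule, using Python's standard `statistics.quantiles`. Quartile conventions differ between textbooks and tools; the "inclusive" interpolation method used here is one assumption among several, and the data are illustrative.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartile interpolation."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))

def outliers(values):
    """Values more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]

# Illustrative salary data in thousands.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(five_number_summary(salaries))
print(outliers(salaries))
```

A boxplot of this data would draw the box from Q1 to Q3 and plot the flagged value as an individual point.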
Common formulas
- Variance:
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$$
- The standard deviation σ is the square root of the variance.
- Popular visualization plots help visualize such distributions.
- A boxplot is a graphic display of the five-number summary.
- In a histogram, the x-axis shows the values while the y-axis shows the frequency.
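The two forms of the population variance formula can be checked against each other with a small sketch. The second form needs only running sums, which is what makes it usable in a single scan over large data; the sample values are made up.

```python
import math

def variance_two_pass(values):
    """Population variance as the mean squared deviation from the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n

def variance_one_pass(values):
    """Population variance as mean of squares minus squared mean.

    Needs only running sums, so the data can be scanned once.
    """
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in values:
        n += 1
        total += x
        total_sq += x * x
    mu = total / n
    return total_sq / n - mu ** 2

data = [2, 4, 4, 4, 5, 5, 7, 9]
sigma = math.sqrt(variance_two_pass(data))
print(variance_two_pass(data), variance_one_pass(data), sigma)
```

One caveat: the one-pass form can lose floating-point precision when the mean is very large relative to the spread, so the two-pass form is the safer choice when the data fit in memory.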
More on graphic visualization
- A quantile plot pairs each value x_i with a fraction f_i, indicating that approximately 100·f_i% of the data are less than or equal to x_i.
- A quantile-quantile (q-q) plot graphs the quantiles of one univariate distribution against the corresponding quantiles of another.
- Data visualization aims to communicate data clearly and effectively through graphical representation.
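The points behind a quantile plot and a q-q plot can be computed without any plotting library. The notes do not say how the fractions are assigned; the sketch below assumes the common convention f_i = (i − 0.5) / N for the i-th smallest value, and the q-q helper assumes both samples have the same size.

```python
def quantile_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / N (assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

def qq_points(sample_a, sample_b):
    """Pair corresponding quantiles of two equal-sized samples."""
    if len(sample_a) != len(sample_b):
        raise ValueError("this sketch only handles samples of equal size")
    return list(zip(sorted(sample_a), sorted(sample_b)))

points = quantile_points([40, 10, 30, 20])
pairs = qq_points([3, 1, 2], [30, 10, 20])
print(points)
print(pairs)
```

Plotting `points` with f on the x-axis gives the quantile plot; plotting `pairs` gives the q-q plot, where points near the diagonal suggest similar distributions.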
Data visualization can
- Visualization is used extensively in many applications, such as reporting on business operations and tracking the progress of tasks, and it can reveal data relationships that are otherwise not easily observable.
- It can provide visual proof of computer representations.
- Pixel-oriented visualization and geometric projection are two available families of techniques.
- In pixel-oriented visualization, each pixel represents one attribute value, and its appearance reflects that value.
- Space-filling layouts make efficient use of the display area.
- Geometric projection techniques use geometric transformations and projections of the data to enhance the view.
- Additionally, direct visualization and scatter plots can be used.
Icon-based visualization
- Scatter-plot matrices, projection pursuit techniques, and hyperslices can also be leveraged for geometric projection.
- General icon-based techniques include shape coding, color coding, and small icons that represent the data.
Hierarchical Visualization
- For large, high-dimensional data sets it is hard to visualize all dimensions at once; hierarchical techniques partition the dimensions into subsets that are easier to visualize.
- Dimensional Stacking, Worlds-within-Worlds, Tree-Map, Cone Trees and InfoCube are relevant methods
Hierarchical visualization techniques
- Dimensional Stacking partitions the n-dimensional attribute space into 2-D subspaces, which are 'stacked' into each other.
- It is adequate for data with ordinal attributes of low cardinality.
- Displaying more than nine dimensions is difficult.
- Mapping dimensions appropriately makes the display more effective.
Key Readings
- Data Mining: Concepts and Techniques, (introduction)
- Knowing Your Data
- Data Mining: Practical Machine Learning Tools and Techniques Chapter concept