Untitled

Questions and Answers

What is the primary distinction between noise and outliers in data analysis?

  • Noise is systematic error, while outliers are random errors.
  • Noise represents data points with high variance, while outliers are data points with missing attributes.
  • Noise refers to irrelevant or meaningless data, whereas outliers are data points that deviate significantly from the norm. (correct)
  • Noise consists of extreme values in a dataset, while outliers are data points that conform to the general pattern.

Which of the following techniques are applicable for outlier analysis?

  • Hypothesis testing and A/B testing
  • Classification and clustering (correct)
  • Data normalization and feature scaling
  • Regression analysis and time series forecasting

Which of the following scenarios is most suitable for outlier analysis?

  • Detecting fraudulent transactions in a financial dataset. (correct)
  • Predicting customer churn based on historical transaction data.
  • Segmenting customers into different groups based on purchasing behavior.
  • Optimizing marketing campaign performance through A/B testing.

Which one is an application of outlier analysis?

Rare event analysis (A)

How does using classification in outlier analysis improve the detection process compared to manual inspection?

Classification allows for the identification of outliers based on predefined categories, increasing efficiency. (D)

Which attribute type is characterized by values that represent categories or names, without any inherent order?

Nominal Attributes (A)

What distinguishes ratio-scaled attributes from interval-scaled attributes?

Ratio-scaled attributes possess a non-arbitrary zero point. (D)

Which of the following scenarios is best described using an ordinal attribute?

Assigning shirt sizes (S, M, L, XL). (C)

Consider a dataset containing information about different types of fruits. Which attribute type would be most suitable for representing the color of each fruit?

Nominal Attributes (D)

In statistical analysis, which attribute type allows for the calculation of meaningful ratios between observations?

Ratio-Scaled Attributes (B)

Which type of attribute is characterized by values representing categories without any inherent order?

Nominal Attributes (A)

A dataset contains attributes such as 'city of residence,' 'eye color,' and 'type of car.' Which type of attribute do these most likely represent?

Nominal Attributes (B)

In a survey, respondents are asked about their preferred brand of coffee. The brands are coded as 'A,' 'B,' 'C,' and 'D.' What type of attribute are these brand codes?

Nominal Attributes (D)

Which of the following attributes cannot be meaningfully used for ranking or ordering?

Street names in a city (A)

A researcher is analyzing survey data that includes responses about favorite colors (red, blue, green, etc.). What is the most appropriate way to describe the nature of the 'favorite color' attribute?

The attribute is nominal and represents distinct categories. (A)

A dataset contains the following values: 12, 15, 18, 21, 15, 12, 15. Which measure of central tendency would be most appropriate to represent this data if the goal is to reflect the most frequently occurring value?

Mode (D)

In a dataset with extreme outliers, which measure of central tendency is least affected by these outliers?

Median (D)

To determine the average sale price of homes in a neighborhood, which measure of central tendency would be most appropriate if the dataset includes a few very expensive homes that are significantly higher in value than the others?

Median (C)

A teacher wants to quickly estimate the average score on a test. They sort the scores in ascending order and take the average of the highest and lowest scores. Which measure of central tendency are they calculating?

Midrange (C)

A real estate company wants to describe the 'typical' home price in a certain area to potential clients. They have collected historical sales data, but notice that there are a few very high-priced homes that could skew the average. Which measure of central tendency would give the most accurate representation of a 'typical' home price in this scenario?

Median (C)

How does trimming the data typically affect the calculated mean?

It reduces the influence of extreme values on the mean. (D)

Which scenario would benefit most from using a trimmed mean instead of a regular mean?

Analyzing a dataset with several extreme outliers that could skew the average. (D)

In the context of data mining, what is the primary role of attributes?

To represent specific characteristics or measurements of instances. (C)

Which element in data mining provides the raw material or individual examples that are characterized by attributes?

Instance (A)

What is the primary reason for using a trimmed mean in statistical analysis?

To make the mean more resistant to the effects of outliers. (A)

If a dataset contains information about customers including age, purchase history, and email address, what are these individual pieces of information considered as in data mining terminology?

Attributes (C)

If a dataset has a significant positive skew, how would the trimmed mean compare to the regular mean?

The trimmed mean would typically be lower than the regular mean, because trimming removes the large values in the long right tail. (C)

In calculating a 10% trimmed mean for a dataset of 100 values, how many values are removed from each end of the dataset?

10 values from each end. (B)

Consider a scenario where you're building a model to predict customer churn. What constitutes an 'instance' in this data mining task?

A single customer with their associated data. (B)

What is the relationship between 'concepts', 'instances', and 'attributes' in a data mining task focused on classifying different types of flowers?

Concepts define instances, which in turn are described by attributes. (A)

Flashcards

What is noise?

Unwanted background disturbances that obscure relevant information.

What are outliers?

Data points that significantly deviate from the norm.

Outlier analysis with classification

Identifying unusual data points using categorization methods.

Outlier analysis with clustering

Identifying unusual data points by grouping similar points together.

Applications of outlier analysis

Detecting fraudulent activities and spotting unusual occurrences.

Nominal Attributes

Attributes that represent categories, codes, or states.

Nominal Attribute Order

Categorical attributes without any intrinsic order.

Categorical Attributes

Nominal attributes are also referred to as this.

Nominal Attribute Values

Values represented as categories or codes.

Nominal Attributes: Definition

A type of data attribute that represents categories, codes, or states without any meaningful order.

Central Tendency

A measure that identifies the center of a data set.

Data Distribution

How spread out or varied the data points are.

Mean

The sum of all values divided by the number of values.

Median

The middle value when data is ordered from least to greatest.

Mode

The value that appears most frequently in a data set.

Binary Attributes

Nominal attributes with only two categories or states (0 or 1, true or false, yes or no).

Ordinal Attributes

Attributes with values that have a meaningful order or ranking, but the magnitude between successive values is not known.

Numeric Attributes

Attributes that are quantifiable and represented in integer or real values.

Interval-Scaled Attributes

Measured on a scale of equal-sized units, but has no true zero point (e.g., temperature in Celsius or Fahrenheit).

What is the 'mean'?

The average value of a dataset.

How is the mean calculated?

Sum of all values divided by the number of values.

What is $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$?

The formula for calculating the mean.

What is a 'trimmed mean'?

A mean calculated after removing extreme values from the dataset to reduce the influence of outliers.

What is trimming data?

Removing extreme values.

What is Data Mining?

The process of discovering interesting and useful patterns and knowledge from large amounts of data.

What is Data?

Raw facts, figures, and symbols; can be anything about the world.

What is a Concept?

A description of a single, self-contained task or example to be learned.

What are Instances?

Specific examples of the concept being studied, characterized by their attributes.

What are Attributes?

Characteristics or properties describing each instance; they can be numerical or categorical.

Study Notes

  • Module 1 is an introduction to data mining and how to get to know your data.
  • The learning outcomes are to define data mining and its applications, recognize data types and patterns, and describe data objects and attribute types.

Chapter 1: Introduction (1.1 - 1.5)

  • The topics covered will be: why data mining, data mining definition, data types that can be mined, pattern types that can be mined, technologies used, applications, and major issues.

Chapter 2: Getting to Know Your Data (2.1 - 2.3)

  • The topics covered will be: data objects and attribute types, basic statistical descriptions of data, and data visualization.

Why Data Mining

  • Data is growing at a rapid rate with increased collection and availability due to digital transformation and automation across all fields.
  • Data mining helps find knowledge within this data because the amount of data is overwhelming.
  • Data mining is the automated analysis of massive data sets, fulfilling the necessity driven by data overload.

What is Data Mining?

  • Data mining has gone by different names depending on the era and field: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
  • Data mining extracts knowledge from data no matter its purpose.
  • In other words, it is the extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from data.
  • Do not mistake data mining for simple search, query processing or expert systems that do not rely on data.

Knowledge Discovery in Databases (KDD) process

  • This includes data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation.
  • Data cleaning removes noise and inconsistent data.
  • Data integration combines multiple data sources.
  • Data selection retrieves the data relevant to the analysis task from the database.
  • Data transformation consolidates the data into forms appropriate for mining, for example by performing summary or aggregation operations.
  • Data mining uses intelligent methods to extract data patterns or knowledge.
  • Pattern evaluation identifies the truly interesting patterns representing knowledge based on interestingness measures.
  • Knowledge presentation uses visualization and knowledge representation techniques to present mined knowledge to users.
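
The steps above can be sketched as a chain of small functions. This is only a toy illustration: the record layout, the function names, and the `min_count` threshold are all hypothetical, and the "mining" step is a simple frequency count rather than a real mining algorithm.

```python
from collections import Counter

# Hypothetical raw purchase records from two separate sources.
store_a = [
    {"customer": "c1", "item": "bread", "qty": 2},
    {"customer": "c2", "item": None, "qty": 1},      # missing item (noise)
    {"customer": "c2", "item": "bread", "qty": 1},
]
store_b = [
    {"customer": "c1", "item": "milk", "qty": 1},
    {"customer": "c2", "item": "milk", "qty": 1},
    {"customer": "c3", "item": "bread", "qty": -5},  # inconsistent quantity
    {"customer": "c3", "item": "milk", "qty": 2},
]

def clean(records):
    """Data cleaning: drop records with missing items or invalid quantities."""
    return [r for r in records if r["item"] is not None and r["qty"] > 0]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Data selection: keep only the fields relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data transformation: aggregate records into one item set per customer."""
    baskets = {}
    for r in records:
        baskets.setdefault(r["customer"], set()).add(r["item"])
    return baskets

def mine(baskets):
    """Data mining (toy): count in how many baskets each item appears."""
    return Counter(item for items in baskets.values() for item in items)

def evaluate(counts, min_count):
    """Pattern evaluation: keep items that meet an interestingness threshold."""
    return {item: n for item, n in counts.items() if n >= min_count}

records = integrate(clean(store_a), clean(store_b))
baskets = transform(select(records, ["customer", "item"]))
interesting = evaluate(mine(baskets), min_count=3)

# Knowledge presentation: here, simply print the surviving patterns.
for item, n in sorted(interesting.items()):
    print(f"{item}: bought by {n} customers")
```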

Data Types

  • Data types that can be mined include relational databases, data warehouses, and transactional databases.
  • Advanced data that can be mined include data streams, sensor data, time-series data, temporal data, sequence data, structure data, graphs, social networks, multi-linked data, object-relational databases, heterogeneous databases, legacy databases, spatial/spatiotemporal data, multimedia databases, text databases, and the World-Wide Web.

What kinds of patterns can be mined?

  • Generalization summarizes and contrasts data characteristics, e.g., dry vs. wet regions.
  • Association and correlation analysis identifies frequent patterns or itemsets, e.g., items frequently purchased together in a supermarket.
    • An association or correlation between items does not by itself imply causality.
    • A typical association rule is: Bread -> Milk [0.5%, 75%] (support, confidence).
  • Classification is supervised learning: a model learns from examples (training data) to classify objects based on their characteristics. Applications include credit card fraud detection and patient diagnosis.
  • Cluster analysis is unsupervised learning, where data objects are grouped by similarity. It is used, for example, to segment online shoppers for targeted ads.
  • Outlier analysis identifies data objects that do not comply with general behavior and can utilize classification/clustering techniques. Applications include transaction fraud detection and rare event analysis.
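
As a rough illustration of the (support, confidence) pair attached to an association rule, the sketch below computes both measures for Bread -> Milk. The basket data and the function names `support` and `confidence` are made up for this example; support is taken as the fraction of transactions containing all items in the rule, and confidence as the fraction of transactions containing the antecedent that also contain the consequent.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante > 0 else 0.0

# Hypothetical toy market-basket data.
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "milk", "eggs"}),
    frozenset({"bread", "butter"}),
    frozenset({"milk", "eggs"}),
    frozenset({"bread", "milk", "butter"}),
]

rule_support = support(baskets, {"bread", "milk"})          # 3 of 5 baskets
rule_confidence = confidence(baskets, {"bread"}, {"milk"})  # 3 of 4 bread baskets
print(f"Bread -> Milk [{rule_support:.0%}, {rule_confidence:.0%}]")
```

In this toy data the rule has 60% support and 75% confidence; a high confidence still says nothing about whether buying bread causes buying milk.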

Technologies Used for Data Mining

  • Machine learning
  • Pattern recognition
  • Statistics
  • Applications
  • Database technology
  • Visualization
  • Algorithms
  • High-performance computing

Why Confluence of Multiple Disciplines

  • The tremendous amount of data requires algorithms to be highly scalable.
  • Data can be high-dimensional; microarrays, for example, may have tens of thousands of dimensions.
  • Data can be highly complex, as in data streams, sensor data, time-series data, social networks, multimedia, software program simulations, etc.

Data Mining Applications

  • Web page analysis, from classification and clustering to the PageRank and HITS algorithms.
  • Collaborative analysis and recommender systems for targeted marketing.
  • Biological and medical data analysis, including classification, cluster analysis (microarray data analysis), biological sequence analysis, and biological network analysis.
  • Data mining can be used in software engineering as seen in the IEEE Computer, Aug. 2009 issue.
  • Data mining can be performed using SAS, MS SQL-Server Analysis Manager, Oracle Data Mining Tools and invisible data mining systems or tools.

Major Issues in Data Mining

  • Mining methodology:
    • Mining various and new kinds of knowledge.
    • Mining knowledge in multi-dimensional space.
    • Data mining as an interdisciplinary effort.
    • Boosting the power of discovery in a networked environment.
    • Handling noise, uncertainty, and incompleteness of data.
    • Pattern evaluation and pattern- or constraint-guided mining.
  • User interaction:
    • Interactive mining.
    • Incorporation of background knowledge.
    • Presentation and visualization of data mining results.
  • Efficiency and scalability:
    • Efficiency and scalability of data mining algorithms.
    • Parallel, distributed, stream, and incremental mining methods.
  • Diversity of data types:
    • Handling complex types of data.
    • Mining dynamic, networked, and global data repositories.
  • Data mining and society:
    • Social impacts of data mining.
    • Privacy-preserving data mining.
    • Invisible data mining.

Data Objects and Attribute Types

  • Datasets are made up of data objects, each of which represents an entity.
    • Examples of entities: in a sales database - customers, store items, and sales; in a medical database - patients; in a university database - students, professors, and courses.
  • Data objects are described/represented by attributes.
  • Data objects can also be referred to as samples, examples, instances, data points, or objects.

Attributes

  • An attribute is a data field that represents a characteristic/feature of a data object.
  • The words attribute, dimension, feature, and variable are often used interchangeably. But:
    • "Dimension" is commonly used in data warehousing.
    • "Feature" is often used in Machine learning literature.
    • Statisticians prefer "variable."
    • Data mining and database professionals commonly use the term attribute, and "attribute" is used in this course.

Attribute Types

  • An attribute type is based on the set of possible values it can have. The four types are: Nominal, Binary, Ordinal, and Numeric.
  • Nominal Attributes: Each value represents some kind of category, code, or state. Values have no meaningful order. E.g., hair color, marital status, occupation, ID numbers, zip codes.
  • Binary Attributes: A nominal type with only two categories/states, 0 or 1, where 0 typically means the attribute is absent and 1 means it is present. If the states correspond to true and false, binary attributes are referred to as Boolean. E.g., a medical test result or gender.
  • Ordinal Attributes: Attributes with a meaningful order/ranking, but the magnitude between successive values is unknown. E.g., size = {small, medium, large}, grades = {A, B, C, D, F}, and army rankings.
  • Numeric Attributes: Quantitative, measurable values that can be integer or real. Numeric attributes can be interval-scaled or ratio-scaled.
    • Interval-Scaled Attributes: Measured on a scale of equal-size units. There is no true zero-point. E.g., temperature in °C or °F, calendar dates.
    • Ratio-Scaled Attributes: Numeric with an inherent zero-point. E.g., area, weight, height, length, counts, monetary quantities. The ratio between two data objects' attribute values can be calculated meaningfully.
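
The difference between interval-scaled and ratio-scaled attributes can be shown with a small worked example. The sketch below assumes two temperature readings chosen for illustration; it relies on the fact that the Kelvin scale has a true zero, so Kelvin temperatures are ratio-scaled while Celsius temperatures are only interval-scaled.

```python
def celsius_to_kelvin(celsius):
    """Convert an interval-scaled Celsius value to ratio-scaled Kelvin."""
    return celsius + 273.15

# Two hypothetical temperature readings in degrees Celsius.
a_c, b_c = 10.0, 20.0

# Differences are meaningful on an interval scale.
difference_c = b_c - a_c

# A ratio of Celsius values depends on the arbitrary zero point,
# so "20 °C is twice as hot as 10 °C" is not a valid statement.
naive_ratio = b_c / a_c

# On the Kelvin scale (true zero), the ratio is meaningful.
kelvin_ratio = celsius_to_kelvin(b_c) / celsius_to_kelvin(a_c)

print(f"difference: {difference_c:.1f} degrees")
print(f"ratio of Celsius values: {naive_ratio:.2f}")
print(f"ratio of Kelvin values:  {kelvin_ratio:.4f}")
```

The Celsius ratio comes out as 2, while the physically meaningful Kelvin ratio is only about 1.035.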

Discrete vs. Continuous Attributes

  • Discrete Attribute (Nominal, Binary, and Ordinal): Has only a finite or countable set of values. E.g., zip codes, profession, or the set of words in a collection of documents. It is often represented as an integer variable.
  • Continuous or Numeric Attribute (Ratio and Interval): Has real numbers as attribute values. E.g., temperature, height, or weight. In practice, real values can only be measured and represented using a finite number of digits, so a continuous attribute is typically represented as a floating-point variable.

Basic Statistical Descriptions of Data

  • Statistical descriptions are used to better understand the data by measuring its central tendency and distribution (variation and spread).
  • This includes measuring characteristics like mean, median, mode, midrange, range, max, min, quantiles, outliers, variance, and standard deviation.

Measuring Central Tendency Characteristics

  • The mean is the average, a measure of the center of the data, calculated as the sum of the values divided by the sample size.
  • The trimmed mean is the mean computed after trimming (removing) extreme values from the dataset.
  • The weighted average (weighted arithmetic mean) differs from the regular mean by weighting each value according to its significance or importance.
  • The median is the middle value after sorting the dataset. With an odd number of values it is the middle value; otherwise it is the average of the two middle values.
  • The mode is the value that occurs most frequently in the data. A dataset can have more than one mode: it is unimodal (1), bimodal (2), or trimodal (3) depending on the number of modes.
  • The midrange is the average of the minimum and maximum data values.
  • Data in real-world applications tend to be asymmetrically distributed, that is, positively or negatively skewed.
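
The measures above can be computed with Python's standard `statistics` module, as in the sketch below. The helper names `midrange`, `trimmed_mean`, and `weighted_mean` are written for this example, the weights and the skewed sample are invented, and the trimmed mean assumes the convention of dropping the given proportion of values from each end.

```python
import statistics

def midrange(values):
    """Average of the smallest and largest values."""
    return (min(values) + max(values)) / 2

def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from EACH end
    (assumed convention; other definitions trim the total instead)."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)
    return statistics.fmean(ordered[k:len(ordered) - k])

def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

data = [12, 15, 18, 21, 15, 12, 15]

mean = statistics.fmean(data)         # 108 / 7
median = statistics.median(data)      # middle of the 7 sorted values
modes = statistics.multimode(data)    # list of all most frequent values
mid = midrange(data)                  # (12 + 21) / 2

# A hypothetical positively skewed sample with one extreme value.
skewed = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
plain = statistics.fmean(skewed)
trimmed = trimmed_mean(skewed, 0.10)  # drops 1 value from each end

# Hypothetical scores where the second one counts three times as much.
weighted = weighted_mean([80, 90], [1, 3])

print(mean, median, modes, mid)
print(plain, trimmed, weighted)
```

For the skewed sample, trimming the single extreme value from each end moves the mean from 14.5 down to 5.5, which is much closer to the median.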

Measuring Distribution Characteristics

  • This includes identifying quartiles and outliers using boxplots. Quartiles: Q1 (25th percentile) and Q3 (75th percentile).
  • Inter-quartile range: IQR = Q3 - Q1. Five-number summary: min, Q1, median, Q3, max.
    • Boxplot: the ends of the box are the quartiles; the median is marked; whiskers are added, and outliers are plotted individually.
    • Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1.
  • Standard deviation: the square root of the variance $s^2$ (or $\sigma^2$).
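
A minimal sketch of these measures, using only the standard library, is shown below. The sample is hypothetical, and quartile conventions differ between tools; this example assumes the "inclusive" interpolation method of `statistics.quantiles`, so other software may report slightly different quartiles for the same data.

```python
import math
import statistics

data = [2, 4, 4, 5, 6, 7, 8, 9, 30]  # hypothetical sample

# Quartiles (assumed convention: inclusive linear interpolation).
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
summary = (min(data), q1, median, q3, max(data))  # five-number summary

# 1.5 x IQR rule: flag values beyond the fences around the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Sample variance s^2 and its square root, the standard deviation s.
sample_variance = statistics.variance(data)
sample_std = statistics.stdev(data)

print("five-number summary:", summary)
print("IQR:", iqr, "fences:", (lower_fence, upper_fence))
print("outliers:", outliers)
print("variance:", sample_variance, "std dev:", sample_std)
```

Here the fences are -2 and 14, so only the value 30 is flagged as an outlier.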

Distribution of Data

  • The distribution of the data should be visualized.
  • A boxplot is a graphic display of the five-number summary.
  • Histograms: the x-axis shows values and the y-axis shows frequencies.
  • Quantile plots pair each data value $x_i$ with $f_i$, indicating that approximately $100 f_i$% of the data are less than or equal to $x_i$. Quantile-quantile (q-q) plots graph the quantiles of one univariate distribution against the corresponding quantiles of another.
  • Scatter plots: pairs of values serve as coordinates for points, plotted on a plane
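
The pairing used in a quantile plot can be sketched as follows. The notes do not say how $f_i$ is computed; this example assumes one common convention, $f_i = (i - 0.5) / n$ for the $i$-th smallest of $n$ values, and the function name `quantile_plot_points` is invented for illustration.

```python
def quantile_plot_points(values):
    """Return (f_i, x_i) pairs for a quantile plot.

    Assumes f_i = (i - 0.5) / n for the i-th smallest value (1-based),
    which is one common convention, not the only one.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

# Hypothetical unsorted sample.
points = quantile_plot_points([40, 10, 30, 20])
for f, x in points:
    print(f"about {100 * f:.1f}% of the data are <= {x}")
```

Plotting `f` on the horizontal axis against `x` on the vertical axis gives the quantile plot.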

Data Visualization

  • Data visualization aims to make data clear via graphical representation. It helps discover relationships in the data and provides visual proof of them.
  • This can be done through pixel-oriented, geometric projection, icon-based, or hierarchical visualization techniques and visualizing complex data/relations.

Data Visualization - Pixel-oriented

  • Each window represents an attribute, and each data object is represented as a pixel in each window. The density/color of the pixel is proportional to the value.
  • Pixels can be laid out in circle segments. This saves space and shows the connections among attributes.

Data Visualization - Geometric projection

  • Geometric Visualization consists of geometric transformations and projections.
  • This includes:
    • Direct visualization
    • Scatterplots and scatterplot matrices
    • Landscapes
    • Projection pursuit technique: helps users find meaningful projections of multidimensional data
    • Prosection views
    • Hyperslice
    • Parallel coordinates

Data Visualization - Icon-based

  • Icon-based techniques visualize data values as features of icons.
  • Common methods are Chernoff faces and stick figures.
  • With shape coding, shapes represent certain information.
  • Color icons use color to encode information.
  • Tile bars use small icons to represent feature vectors in document retrieval.

Data Visualization - Hierarchical

  • For data with high dimensionality, it is difficult to visualize all dimensions simultaneously.
  • Hierarchical visualization therefore partitions the dimensions into subsets, which are visualized in a hierarchical manner.
  • Methods include dimensional stacking, Worlds-within-Worlds, tree-maps, and cone trees.
  • Dimensional stacking partitions the attribute values into ranges and stacks the resulting subspaces into one another; the important attributes should be placed on the outer levels.

Complex data relations

  • Complex data and relations can also be visualized, most notably with tag clouds.
  • These present non-numerical data such as text and social networks; in a tag cloud, the size of a tag represents its importance.
