Questions and Answers
What is the primary focus of graph mining?
How do links in information networks contribute to analysis?
Which method is NOT associated with trend analysis?
What is an important aspect of evaluation in data analysis?
Which of the following best describes web mining?
Which type of attribute can be categorized but does not have a meaningful order between categories?
Which attribute type allows for all arithmetic operations, including multiplication and division?
What characteristic differentiates a binary attribute from a nominal attribute?
In which attribute type are values ordered but the exact magnitude of difference between them is not known?
Which of the following is an example of an interval-scaled attribute?
What type of attribute is represented by real numbers?
Which of the following examples represents a discrete attribute?
What does the median represent in a data set?
Which mean is calculated by removing extreme values from the data set?
Which of the following best describes graph data?
If a data set has two values that occur with the highest frequency, what is it classified as?
In the context of statistical measures, what is the primary purpose of calculating the mode?
Which variable representation is commonly used for continuous attributes?
Which of the following is NOT considered a dimension of evaluation in data analysis?
In which application area would you primarily find claims and fraud analysis?
Which challenge in data analysis involves ensuring algorithms can handle growing data sizes efficiently?
What are the data objects in a data set typically described by?
Which of the following represents a major challenge in mining methodology?
What aspect does privacy-preserving data analysis focus on?
Which of the following best describes data objects in the context of a dataset?
What type of data analysis method involves processing data in real-time as it is generated?
What is the value of the inter-quartile range (IQR) if Q1 is 20 and Q3 is 30?
Which of the following describes a positively skewed distribution?
What is a key characteristic of a boxplot?
What does the presence of outliers indicate in a dataset when using the 1.5 x IQR rule?
How is variance for a sample denoted mathematically?
What does a histogram primarily display?
What is the characteristic of the Jaccard coefficient?
What does the term 'proximity' refer to in statistical analysis?
Which of the following is NOT a property of the Minkowski distance?
Which distance form corresponds to h = 2 in the Minkowski distance?
How are outliers plotted in a boxplot?
Which of the following statements is true about quartiles?
In proximity measures for nominal attributes, what is the formula for Simple Matching?
What is the main use of a boxplot compared to a histogram?
Study Notes
Graph Mining
- Finding frequent subgraphs and substructures, e.g., in chemical compounds and web fragments
Information Network Analysis
- Actors are objects or nodes.
- Relationships are edges.
- Examples: author networks in computer science, terrorist networks.
- A person can be part of multiple information networks
- Links carry semantic information.
Web Mining
- The web is a large information network; Google's PageRank, for example, ranks pages by analyzing its link structure
- Analysis includes usage mining, opinion mining, and web community discovery
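The PageRank idea mentioned above can be sketched as a power iteration over a link graph. This is a minimal illustration under stated assumptions, not how any search engine is actually implemented: the three-page graph `toy_web`, the damping factor of 0.85, the iteration count, and the function name `pagerank` are all hypothetical choices made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict {page: [pages it links to]}.

    Assumes every link target also appears as a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page splits its damped rank evenly over its out-links.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical toy web: A links to B and C, B links to C, C links to A.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
print(ranks)
```

In this toy graph, page C ends up with the highest score because it is linked from both A and B, which illustrates how links carry information beyond the page content itself.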
Time and Ordering
- Trend, time-series, and deviation analysis: e.g., regression and value prediction
- Sequential pattern mining: e.g., buying a digital camera then a large SD memory card.
- Periodicity analysis: e.g., weekly sales spikes
- Biological sequence analysis: e.g., motifs
- Mining data streams: ordered, time-varying, potentially infinite data
- Conversational analysis
Evaluation
- Knowledge discovery, predictions, and assumptions are validated.
- Multi-dimensional evaluation includes: accuracy, interestingness, completeness, efficiency, explainability, diversity, and representativeness.
Application Areas
- Application areas include: E-commerce, social media, finance, insurance, telecommunications, transport, and data service providers
Major Challenges in Data Analysis (1): Mining Methodology
- Mining various and new types of knowledge
- Data analysis is an interdisciplinary effort.
- Handling noise, uncertainty, and incompleteness of data
- Pattern evaluation and pattern- or constraint-guided mining
Major Challenges in Data Analysis (2): User Interaction
- Interactive analysis
- Incorporation of background knowledge
- Presentation and visualization of data analysis results
Major Challenges in Data Analysis (3): Efficiency and Scalability
- Efficiency and scalability of data analysis algorithms
- Parallel, distributed, stream, and incremental analysis methods
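One way to picture an incremental (stream) method is a running mean and variance that is updated one value at a time, without storing the stream. The sketch below uses Welford's online algorithm as an example of such a method; the class name `RunningStats` and the sample values are assumptions made for illustration.

```python
class RunningStats:
    """Welford's online algorithm: mean and variance in a single pass."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the deviation from the old mean and from the updated mean.
        self._m2 += delta * (x - self.mean)

    def sample_variance(self):
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for value in [10.0, 12.0, 23.0, 23.0, 16.0, 23.0, 21.0, 16.0]:
    stats.update(value)  # values could just as well arrive from a stream

print(stats.mean, stats.sample_variance())
```

Memory use stays constant regardless of how many values arrive, which is the property that matters for potentially infinite data streams.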
Major Challenges in Data Analysis (4): Diversity of Data Types
- Handling complex types of data
- Analyzing dynamic, networked, and global data repositories
Major Challenges in Data Analysis (5): Data Analysis and Society
- Social impacts of data analysis
- Privacy-preserving data analysis
Data Objects and Attribute Types
- Data sets are made up of data objects.
- Each data object represents an entity.
- Examples: sales data (customers, store items, sales); medical data (patients, treatments); university data (students, professors, courses).
- Also called samples, examples, instances, data points, objects, or tuples.
Attributes
- An attribute is a data field that represents a characteristic of a data object.
- Examples: customer_ID, name, address
- Types include: nominal, binary, ordinal, and numeric (interval-scaled or ratio-scaled).
Attribute Types: Nominal
- Categories, states, or “names of things”
- Examples: Hair color, marital status, occupation, ID numbers, zip codes
Attribute Types: Binary
- Nominal attributes with only 2 states (0 and 1)
- Symmetric binary: both outcomes equally important (e.g., gender)
- Asymmetric binary: outcomes not equally important (e.g., medical test)
- Convention: assign 1 to the most important outcome (e.g., HIV positive)
Attribute Types: Ordinal
- Values have a meaningful order (ranking) but magnitude between successive values is not known.
- Examples: Size = {small, medium, large}, grades, army rankings
Attribute Types: Numeric
- Quantity (integer or real-valued)
- Interval: measured on a scale of equal-sized units. Values have order. No true zero-point.
- Examples: temperature in °C, calendar dates.
- Ratio: a numeric attribute with an inherent zero-point.
- Examples: 10 dollars is twice as high as 5 dollars
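The practical difference between interval and ratio scales is whether ratios of values are meaningful. A small sketch, assuming temperatures of 10 °C and 20 °C as example values (the helper `celsius_to_kelvin` is introduced here for illustration): differences are meaningful on both scales, but "twice as high" only makes sense on a scale with a true zero-point such as Kelvin.

```python
def celsius_to_kelvin(c):
    """Convert an interval-scaled Celsius value to ratio-scaled Kelvin."""
    return c + 273.15


c_low, c_high = 10.0, 20.0

# Differences are meaningful on an interval scale.
difference = c_high - c_low

# A ratio of Celsius values is computable but has no physical meaning,
# because 0 degrees Celsius is not a true zero-point.
naive_ratio = c_high / c_low

# On the Kelvin scale the ratio is meaningful, and far from 2.
kelvin_ratio = celsius_to_kelvin(c_high) / celsius_to_kelvin(c_low)

print(difference, naive_ratio, round(kelvin_ratio, 3))
```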
Discrete vs. Continuous Attributes
- Discrete Attribute: has only a finite or countably infinite set of values.
- Examples: zip codes, profession, or the set of words in a collection of documents.
- Continuous Attribute: has real numbers as attribute values, measured and represented using a finite number of digits.
- Examples: temperature, height, or weight.
Some Special Data Formats
- Sequential data: ordered objects, time series, text data
- Graph data: nodes (vertices), edges (links)
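Graph data of this kind is often stored as an adjacency structure. A minimal sketch, assuming a small undirected network with made-up node names:

```python
from collections import defaultdict

# Hypothetical undirected edges (links) between nodes (vertices).
edges = [("alice", "bob"), ("bob", "carol"), ("alice", "carol"), ("carol", "dave")]

# Adjacency sets: each node maps to the set of its neighbors.
adjacency = defaultdict(set)
for u, v in edges:
    adjacency[u].add(v)
    adjacency[v].add(u)

# Degree = number of links attached to each node.
degree = {node: len(neighbors) for node, neighbors in adjacency.items()}
print(degree)
```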
Measuring the Central Tendency
- Mean (algebraic measure, sample vs. population): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
- Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
- Trimmed mean: chopping extreme values
- Median: middle value
- Mode: value that occurs most frequently
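The measures above can be computed with Python's standard `statistics` module plus two small helpers. This is a sketch with assumed example data; the helper names `weighted_mean` and `trimmed_mean`, the weights, and the choice to trim one value from each end are illustrative.

```python
import statistics

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]


def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


mean = statistics.mean(data)           # arithmetic mean
median = statistics.median(data)       # middle value (average of the two middle ones here)
modes = statistics.multimode(data)     # all values with the highest frequency
trimmed = trimmed_mean(data, 1)        # drops 30 and 110
weighted = weighted_mean([80, 90], [1, 3])

print(mean, median, modes, trimmed, weighted)
```

Here two values share the highest frequency, so the data set is bimodal, and the trimmed mean is lower than the plain mean because the extreme value 110 no longer pulls it upward.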
Symmetric vs. Skewed Data
- Median, mean and mode of symmetric, positively and negatively skewed data
- Positively skewed: mode < median < mean
- Negatively skewed: mean < median < mode
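The orderings above can be checked on small made-up samples. The data below is an assumption chosen so that the pattern is visible; real skewed data usually, but not always, follows the same ordering.

```python
import statistics

right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]   # long tail toward large values
left_skewed = [15 - x for x in right_skewed]      # mirror image: tail toward small values


def summarize(values):
    """Return (mode, median, mean) of a sample."""
    return statistics.mode(values), statistics.median(values), statistics.mean(values)


mode_r, median_r, mean_r = summarize(right_skewed)
mode_l, median_l, mean_l = summarize(left_skewed)

print(mode_r, median_r, mean_r)   # mode < median < mean
print(mode_l, median_l, mean_l)   # mean < median < mode
```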
Measuring the Dispersion of Data
- Quartiles, outliers and boxplots
- Quartiles: Q1 (25th percentile), Q3 (75th percentile)
- Inter-quartile range: IQR = Q3 – Q1
- Five number summary: min, Q1, median, Q3, max
- Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually
- Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
- Variance and standard deviation (sample: s, population: σ)
- Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
- Standard deviation: s = the square root of the variance
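A short sketch of the dispersion measures, using assumed example data. It contrasts the sample formulas (divide by n - 1) with the population formulas (divide by n), and works the IQR for the quartile values Q1 = 20 and Q3 = 30.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = sum(data) / n
sum_sq_dev = sum((x - mean) ** 2 for x in data)

sample_variance = sum_sq_dev / (n - 1)      # s^2
sample_std = math.sqrt(sample_variance)     # s
population_variance = sum_sq_dev / n        # sigma^2
population_std = math.sqrt(population_variance)  # sigma

# The standard library gives the same results.
assert math.isclose(sample_variance, statistics.variance(data))
assert math.isclose(population_variance, statistics.pvariance(data))

# Inter-quartile range for given quartiles.
q1, q3 = 20, 30
iqr = q3 - q1

print(sample_variance, sample_std, population_variance, population_std, iqr)
```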
Boxplot Analysis
- Five-number summary of a distribution: minimum, Q1, median, Q3, maximum
- Boxplot:
- Ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
- Median is marked by a line within the box
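- Whiskers: two lines outside the box extended to the minimum and maximum (or, when outliers are plotted separately, to the most extreme values within the outlier threshold)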
- Outliers: plotted individually, beyond a specified outlier threshold
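The numbers behind a boxplot can be computed without any plotting library. This sketch assumes example data with one extreme value and uses the 1.5 x IQR rule as the outlier threshold. Quartile conventions differ between tools; the `"inclusive"` method of `statistics.quantiles` is just one reasonable choice.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]

# Quartiles (the middle cut point is the median).
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Outlier thresholds ("fences") from the 1.5 x IQR rule.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
inside = [x for x in data if lower_fence <= x <= upper_fence]

# Whiskers reach the most extreme values that are not outliers.
whisker_low, whisker_high = min(inside), max(inside)
five_number_summary = (min(data), q1, median, q3, max(data))

print(five_number_summary, outliers, (whisker_low, whisker_high))
```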
Histogram Analysis
- Histogram: Graph display of tabulated frequencies, shown as bars
- Shows the proportion of cases falling into each interval
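The tabulation behind a histogram is a count per interval. A minimal sketch with equal-width bins; the function name `histogram`, the data, and the bin range are assumptions for illustration, and values are assumed to lie within `[low, high]`.

```python
def histogram(values, low, high, bins):
    """Count values per equal-width interval; the last interval includes `high`."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return counts


values = [1, 3, 4, 7, 8, 8, 9, 12, 15, 19]
counts = histogram(values, low=0, high=20, bins=4)
proportions = [c / len(values) for c in counts]

# Crude text rendering: one bar per interval.
for i, c in enumerate(counts):
    print(f"[{i * 5:>2}, {(i + 1) * 5:>2}) " + "#" * c)
```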
Similarity and Dissimilarity
- Similarity: Numerical measure of how alike two data objects are
- Value higher when objects are more alike
- Dissimilarity (e.g., distance): Numerical measure of how different two data objects are
- Lower when objects are more alike
- Proximity refers to a similarity or dissimilarity
Proximity Measure for Nominal Attributes
- Method 1: Simple matching: d (i, j) = (p - m) / p, where m is the number of matching attributes and p is the total number of attributes
- Method 2: Create a new binary attribute for each of the M nominal states, then apply binary proximity measures
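Both methods can be sketched in a few lines. Simple matching counts the share of attributes on which two objects disagree; the second method turns each nominal state into its own binary attribute. The objects, state list, and function names below are hypothetical.

```python
def simple_matching_distance(a, b):
    """Share of mismatching attributes: (p - m) / p."""
    p = len(a)                                   # total number of attributes
    m = sum(1 for x, y in zip(a, b) if x == y)   # number of matches
    return (p - m) / p


def one_hot(value, states):
    """One binary attribute per nominal state."""
    return [1 if value == s else 0 for s in states]


# Hypothetical objects described by (hair color, marital status, occupation).
obj_i = ("red", "single", "engineer")
obj_j = ("red", "married", "engineer")

distance = simple_matching_distance(obj_i, obj_j)
encoded = one_hot("red", ["red", "green", "blue"])
print(distance, encoded)
```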
Proximity Measure for Binary Attributes
- Contingency table for binary data. Measures include a distance measure for symmetric binary variables, a distance measure for asymmetric binary variables, and the Jaccard coefficient (similarity measure for asymmetric binary variables)
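The notes name these measures without giving formulas. The sketch below uses the common textbook definitions, which are an assumption here rather than something stated in the notes: with q = number of attributes where both objects are 1, r = 1 in the first only, s = 1 in the second only, and t = both 0, the symmetric distance is (r + s) / (q + r + s + t), the asymmetric distance is (r + s) / (q + r + s), and the Jaccard coefficient is q / (q + r + s). The two patient vectors are made up.

```python
def contingency(a, b):
    """Return (q, r, s, t) for two equal-length 0/1 vectors."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    t = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    return q, r, s, t


# Hypothetical patients: 1 = symptom present or test positive.
patient_a = [1, 0, 1, 0, 0, 0]
patient_b = [1, 0, 1, 0, 1, 0]

q, r, s, t = contingency(patient_a, patient_b)

symmetric_distance = (r + s) / (q + r + s + t)   # 0-0 matches count
asymmetric_distance = (r + s) / (q + r + s)      # 0-0 matches ignored
jaccard = q / (q + r + s)                        # similarity; equals 1 - asymmetric distance

print((q, r, s, t), symmetric_distance, asymmetric_distance, jaccard)
```

The asymmetric measures ignore t because, for attributes such as medical tests, two negative results say little about how alike two objects are.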
Distance on Numeric Data: Minkowski Distance
- A popular distance measure: $d(i, j) = \left(|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h\right)^{1/h}$
- Properties:
- d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
- d(i, j) = d(j, i) (Symmetry)
- d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
- A distance is a metric if it satisfies these properties.
Special Cases of Minkowski Distance
- h = 1: Manhattan (city block, L1 norm) distance
- h = 2: Euclidean (L2 norm) distance
- h → ∞: Supremum (L∞ norm) distance
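A compact sketch of the Minkowski distance and its three special cases. The function name `minkowski` and the two example points are assumptions for illustration; the supremum case is handled separately as the largest per-attribute difference.

```python
import math


def minkowski(x, y, h):
    """Minkowski distance of order h between two equal-length points."""
    if h == math.inf:
        # Limit case: the largest absolute difference over all attributes.
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)


x1, x2 = (1, 2), (3, 5)

manhattan = minkowski(x1, x2, 1)          # |1-3| + |2-5|
euclidean = minkowski(x1, x2, 2)          # sqrt(2^2 + 3^2)
supremum = minkowski(x1, x2, math.inf)    # max(2, 3)

print(manhattan, round(euclidean, 2), supremum)
```

For any pair of points the three values are ordered supremum ≤ Euclidean ≤ Manhattan, and all three satisfy the metric properties listed above.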
Description
Explore the realms of graph mining and information network analysis in this quiz. Discover concepts like frequent subgraphs, the structure of information networks, and the techniques used in web mining for effective data analysis. Test your knowledge on time-series analysis and the evaluation of knowledge discovery.