Questions and Answers
What is the primary focus of graph mining?
- Evaluating the performance of social networks
- Finding frequent subgraphs and substructures (correct)
- Analyzing time-series data for trends
- Mining text documents for sentiment analysis
How do links in information networks contribute to analysis?
- They provide visual representations of data.
- They are utilized only in web mining, not in social network analysis.
- They are used primarily for filtering data streams.
- They carry semantic information that enhances understanding of relationships. (correct)
Which method is NOT associated with trend analysis?
- Sequential pattern mining
- Value prediction
- Approximate and consecutive motifs (correct)
- Time-series analysis
What is an important aspect of evaluation in data analysis?
Which of the following best describes web mining?
Which type of attribute can be categorized but does not have a meaningful order between categories?
Which attribute type allows for all arithmetic operations, including multiplication and division?
What characteristic differentiates a binary attribute from a nominal attribute?
In which attribute type are values ordered but the exact magnitude of difference between them is not known?
Which of the following is an example of an interval-scaled attribute?
What type of attribute is represented by real numbers?
Which of the following examples represents a discrete attribute?
What does the median represent in a data set?
Which mean is calculated by removing extreme values from the data set?
Which of the following best describes graph data?
If a data set has two values that occur with the highest frequency, what is it classified as?
In the context of statistical measures, what is the primary purpose of calculating the mode?
Which variable representation is commonly used for continuous attributes?
Which of the following is NOT considered a dimension of evaluation in data analysis?
In which application area would you primarily find claims and fraud analysis?
Which challenge in data analysis involves ensuring algorithms can handle growing data sizes efficiently?
What are the data objects in a data set typically described by?
Which of the following represents a major challenge in mining methodology?
What aspect does privacy-preserving data analysis focus on?
Which of the following best describes data objects in the context of a dataset?
What type of data analysis method involves processing data in real-time as it is generated?
What is the value of the inter-quartile range (IQR) if Q1 is 20 and Q3 is 30?
Which of the following describes a positively skewed distribution?
What is a key characteristic of a boxplot?
What does the presence of outliers indicate in a dataset when using the 1.5 x IQR rule?
How is variance for a sample denoted mathematically?
What does a histogram primarily display?
What is the characteristic of the Jaccard coefficient?
What does the term 'proximity' refer to in statistical analysis?
Which of the following is NOT a property of the Minkowski distance?
Which distance form corresponds to h = 2 in the Minkowski distance?
How are outliers plotted in a boxplot?
Which of the following statements is true about quartiles?
In proximity measures for nominal attributes, what is the formula for Simple Matching?
What is the main use of a boxplot compared to a histogram?
Study Notes
Graph Mining
- Finding frequent subgraphs, substructures, chemical compounds, web fragments
Information Network Analysis
- Actors are objects or nodes.
- Relationships are edges.
- Examples: author networks in computer science, terrorist networks.
- A person can be part of multiple information networks
- Links carry semantic information.
Web Mining
- The web is a large information network, analyzed with link-based methods such as PageRank (used by Google)
- Analysis includes usage mining, opinion mining, and web community discovery
Time and Ordering
- Trend, time-series, and deviation analysis: e.g., regression and value prediction
- Sequential pattern mining: e.g., buying a digital camera then a large SD memory card.
- Periodicity analysis: e.g., weekly sales spikes
- Biological sequence analysis: e.g., motifs
- Mining data streams: ordered, time-varying, potentially infinite data
- Conversational analysis
Evaluation
- Knowledge discovery, predictions, and assumptions are validated.
- Multi-dimensional evaluation includes: accuracy, interestingness, completeness, efficiency, explainability, diversity, and representativeness.
Application Areas
- Application areas include: E-commerce, social media, finance, insurance, telecommunications, transport, and data service providers
Major Challenges in Data Analysis (1): Mining Methodology
- Mining various and new types of knowledge
- Data analysis is an interdisciplinary effort.
- Handling noise, uncertainty, and incompleteness of data
- Pattern evaluation and pattern- or constraint-guided mining
Major Challenges in Data Analysis (2): User Interaction
- Interactive analysis
- Incorporation of background knowledge
- Presentation and visualization of data analysis results
Major Challenges in Data Analysis (3): Efficiency and Scalability
- Efficiency and scalability of data analysis algorithms
- Parallel, distributed, stream, and incremental analysis methods
Major Challenges in Data Analysis (4): Diversity of Data Types
- Handling complex types of data
- Analyzing dynamic, networked, and global data repositories
Major Challenges in Data Analysis (5): Data Analysis and Society
- Social impacts of data analysis
- Privacy-preserving data analysis
Data Objects and Attribute Types
- Data sets are made up of data objects.
- Each data object represents an entity.
- Examples: sales data (customers, store items, sales); medical data (patients, treatments); university data (students, professors, courses).
- Also called samples, examples, instances, data points, objects, or tuples.
Attributes
- An attribute is a data field that represents a characteristic of a data object.
- Examples: customer_ID, name, address
- Types include: nominal, binary, ordinal, and numeric (interval-scaled or ratio-scaled).
Attribute Types: Nominal
- Categories, states, or “names of things”
- Examples: Hair color, marital status, occupation, ID numbers, zip codes
Attribute Types: Binary
- Nominal attributes with only 2 states (0 and 1)
- Symmetric binary: both outcomes equally important (e.g., gender)
- Asymmetric binary: outcomes not equally important (e.g., medical test)
- Convention: assign 1 to the most important outcome (e.g., HIV positive)
Attribute Types: Ordinal
- Values have a meaningful order (ranking) but magnitude between successive values is not known.
- Examples: Size = {small, medium, large}, grades, army rankings
Attribute Types: Numeric
- Quantity (integer or real-valued)
- Interval: measured on a scale of equal-sized units. Values have order. No true zero-point.
- Examples: temperature in °C, calendar dates.
- Ratio: a numeric attribute with an inherent zero-point.
- Examples: \$10 is twice as high as \$5
Discrete vs. Continuous Attributes
- Discrete Attribute: has only a finite or countably infinite set of values.
- Examples: zip codes, profession, or the set of words in a collection of documents.
- Continuous Attribute: has real numbers as attribute values, measured and represented using a finite number of digits.
- Examples: temperature, height, or weight.
Some Special Data Formats
- Sequential data: ordered objects, time series, text data
- Graph data: nodes (vertices), edges (links)
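As a rough illustration of graph data, the sketch below stores a small undirected network as an adjacency list and counts each node's links. The author names and edges are hypothetical, and `build_adjacency` is a helper name chosen for this example.

```python
from collections import defaultdict

# Hypothetical co-authorship network: each pair is an edge (link)
# between two nodes (authors).
edges = [("Ana", "Ben"), ("Ana", "Chen"), ("Ben", "Chen"), ("Chen", "Dara")]

def build_adjacency(edge_list):
    """Map every node to the set of nodes it is linked to (undirected)."""
    adjacency = defaultdict(set)
    for u, v in edge_list:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return dict(adjacency)

graph = build_adjacency(edges)

# Degree = number of edges touching a node.
degree = {node: len(neighbors) for node, neighbors in graph.items()}
print(degree)  # {'Ana': 2, 'Ben': 2, 'Chen': 3, 'Dara': 1}
```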
Measuring the Central Tendency
- Mean (algebraic measure, sample vs. population): $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
- Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
- Trimmed mean: chopping extreme values
- Median: middle value
- Mode: value that occurs most frequently
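The measures above can be sketched with Python's standard `statistics` module. The data values are made up for illustration, and `weighted_mean` and `trimmed_mean` are helper names chosen here. The trimmed mean drops a fixed count `k` from each end, which is one of several possible trimming conventions.

```python
from statistics import mean, median, multimode

def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values."""
    ordered = sorted(values)
    return mean(ordered[k:len(ordered) - k])

# Illustrative data; 110 is an extreme value that pulls the mean upward.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(mean(salaries))             # 58
print(trimmed_mean(salaries, 1))  # 55.6 (30 and 110 removed)
print(median(salaries))           # 54.0 (average of the two middle values)
print(multimode(salaries))        # [52, 70] -> two modes, i.e. bimodal
print(weighted_mean([1, 2, 3], [1, 1, 2]))  # 2.25
```

The median and trimmed mean move less than the plain mean when the extreme value is present, which is why they are often preferred for skewed data.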
Symmetric vs. Skewed Data
- Median, mean and mode of symmetric, positively and negatively skewed data
- Positively skewed: mode < median < mean
- Negatively skewed: mean < median < mode
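A quick check of these orderings on two small made-up samples is sketched below. The ordering is a rule of thumb that holds for these samples; it is not guaranteed for every skewed data set.

```python
from statistics import mean, median, mode

# Illustrative sample with a long right tail (positively skewed).
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]
# Mirror image of the same sample: long left tail (negatively skewed).
left_skewed = [16 - x for x in right_skewed]

for name, data in [("positive", right_skewed), ("negative", left_skewed)]:
    print(name, "mode:", mode(data), "median:", median(data), "mean:", mean(data))
```

For the first sample the mode is 2, the median 3.0, and the mean 4.6; for the mirrored sample the order reverses (mean 11.4, median 13.0, mode 14).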
Measuring the Dispersion of Data
- Quartiles, outliers and boxplots
- Quartiles: Q1 (25th percentile), Q3 (75th percentile)
- Inter-quartile range: IQR = Q3 – Q1
- Five number summary: min, Q1, median, Q3, max
- Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually
- Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
- Variance and standard deviation (sample: s, population: σ)
- Variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
- Standard deviation: s = the square root of the variance
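The following sketch computes the five-number summary, the IQR, the 1.5 x IQR outlier fences, and the sample variance for a made-up data set. Quartile values depend on the interpolation convention; this example assumes the `inclusive` method of `statistics.quantiles`, and `dispersion_summary` is a helper name chosen here.

```python
from statistics import quantiles, variance, stdev

def dispersion_summary(values):
    """Return (five-number summary, IQR, outliers by the 1.5 x IQR rule)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [x for x in values if x < low_fence or x > high_fence]
    return (min(values), q1, q2, q3, max(values)), iqr, outliers

# Illustrative data, not taken from the notes.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

five, iqr, outliers = dispersion_summary(salaries)
print(five)      # (30, 49.25, 54.0, 64.75, 110)
print(iqr)       # 15.5
print(outliers)  # [110], since 110 > 64.75 + 1.5 * 15.5 = 88.0

# Sample variance divides by n - 1; the standard deviation is its square root.
print(variance(salaries), stdev(salaries))
```

With Q1 = 20 and Q3 = 30, the same rule gives IQR = 10 and fences at 5 and 45.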
Boxplot Analysis
- Five-number summary of a distribution: minimum, Q1, median, Q3, maximum
- Boxplot:
- Ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
- Median is marked by a line within the box
- Whiskers: two lines outside the box extended to Minimum and Maximum
- Outliers: plotted individually, beyond a specified outlier threshold
Histogram Analysis
- Histogram: Graph display of tabulated frequencies, shown as bars
- Shows the proportion of cases falling into each interval
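A minimal sketch of the tabulation behind a histogram, assuming equal-width intervals of the form [k*w, (k+1)*w). The data and the bin width are illustrative, and `histogram` is a helper name chosen here.

```python
from collections import Counter

def histogram(values, bin_width):
    """Count values per equal-width interval, keyed by the interval's lower edge."""
    counts = Counter((x // bin_width) * bin_width for x in values)
    return dict(sorted(counts.items()))

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
counts = histogram(salaries, 20)
print(counts)  # {20: 2, 40: 5, 60: 4, 100: 1}

# Text rendering: one bar per interval, with the proportion of cases.
for lower, count in counts.items():
    share = count / len(salaries)
    print(f"[{lower:>3}, {lower + 20:>3}) {'#' * count} {share:.0%}")
```

Unlike a boxplot, this view shows where values concentrate inside the range, including the empty interval between 80 and 100.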
Similarity and Dissimilarity
- Similarity: Numerical measure of how alike two data objects are
- Value higher when objects are more alike
- Dissimilarity (e.g., distance): Numerical measure of how different two data objects are
- Lower when objects are more alike
- Proximity refers to a similarity or dissimilarity
Proximity Measure for Nominal Attributes
- Method 1: Simple matching: $d(i, j) = \frac{p - m}{p}$, where m is the number of matching attributes and p is the total number of attributes
- Method 2: Creating a new binary attribute for each of the M nominal states – then apply binary proximity measures
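Simple matching is usually defined as the share of attributes on which two objects disagree, d(i, j) = (p - m) / p, with m matches out of p attributes. The sketch below assumes each object is a tuple of nominal values in the same attribute order; the example values are hypothetical.

```python
def simple_matching_distance(a, b):
    """Dissimilarity for nominal attributes: (p - m) / p."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    p = len(a)                                   # total number of attributes
    m = sum(1 for x, y in zip(a, b) if x == y)   # number of matches
    return (p - m) / p

# Hypothetical objects: (hair color, marital status, occupation)
person_1 = ("black", "single", "engineer")
person_2 = ("black", "married", "engineer")

print(simple_matching_distance(person_1, person_2))  # one mismatch out of three
print(simple_matching_distance(person_1, person_1))  # 0.0
```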
Proximity Measure for Binary Attributes
- Contingency table for binary data. Measures include a distance measure for symmetric binary variables, a distance measure for asymmetric binary variables, and the Jaccard coefficient (similarity measure for asymmetric binary variables)
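The measures named above can be sketched from the four cells of the contingency table. The cell names q, r, s, t follow a common textbook convention and are an assumption here, as are the example vectors. The asymmetric measures ignore t (the 0/0 matches) and assume at least one of the two objects has a 1.

```python
def contingency(a, b):
    """Return (q, r, s, t): counts of (1,1), (1,0), (0,1), (0,0) positions."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    t = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    return q, r, s, t

def symmetric_distance(a, b):
    q, r, s, t = contingency(a, b)
    return (r + s) / (q + r + s + t)

def asymmetric_distance(a, b):
    q, r, s, _ = contingency(a, b)
    return (r + s) / (q + r + s)   # 0/0 matches are treated as uninformative

def jaccard_similarity(a, b):
    q, r, s, _ = contingency(a, b)
    return q / (q + r + s)         # equals 1 - asymmetric_distance

# Illustrative test results for two patients (1 = positive).
patient_a = [1, 0, 1, 0, 0, 0]
patient_b = [1, 0, 1, 0, 1, 0]

print(contingency(patient_a, patient_b))  # (2, 0, 1, 3)
print(symmetric_distance(patient_a, patient_b))
print(asymmetric_distance(patient_a, patient_b))
print(jaccard_similarity(patient_a, patient_b))
```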
Distance on Numeric Data: Minkowski Distance
- A popular distance measure: $d(i, j) = \left( |x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h \right)^{1/h}$
- Properties:
- d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
- d(i, j) = d(j, i) (Symmetry)
- d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
- A distance is a metric if it satisfies these properties.
Special Cases of Minkowski Distance
- h = 1: Manhattan (city block, L1 norm) distance
- h = 2: (L2 norm) Euclidean distance
- h → ∞: Supremum (L∞ norm) distance
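The three special cases can be sketched with a single function that takes the order h as a parameter. The points are made up for illustration, and passing `math.inf` for the supremum case is a convention chosen for this example.

```python
import math

def minkowski(x, y, h):
    """Minkowski distance of order h (h >= 1); h = math.inf gives the supremum distance."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(h):
        return max(diffs)              # limit as h -> infinity
    return sum(d ** h for d in diffs) ** (1 / h)

p1, p2, p3 = (1, 2), (3, 5), (2, 0)

print(minkowski(p1, p2, 1))         # Manhattan: |1-3| + |2-5| = 5.0
print(minkowski(p1, p2, 2))         # Euclidean: sqrt(4 + 9)
print(minkowski(p1, p2, math.inf))  # Supremum: max(2, 3) = 3

# Spot check of the metric properties on these points (h = 2).
assert minkowski(p1, p1, 2) == 0
assert minkowski(p1, p2, 2) == minkowski(p2, p1, 2)
assert minkowski(p1, p2, 2) <= minkowski(p1, p3, 2) + minkowski(p3, p2, 2)
```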
Description
Explore the realms of graph mining and information network analysis in this quiz. Discover concepts like frequent subgraphs, the structure of information networks, and the techniques used in web mining for effective data analysis. Test your knowledge on time-series analysis and the evaluation of knowledge discovery.