Graph Mining and Information Network Analysis

Questions and Answers

What is the primary focus of graph mining?

Evaluating the performance of social networks
Finding frequent subgraphs and substructures (correct)
Analyzing time-series data for trends
Mining text documents for sentiment analysis

How do links in information networks contribute to analysis?

They provide visual representations of data.
They are utilized only in web mining, not in social network analysis.
They are used primarily for filtering data streams.
They carry semantic information that enhances understanding of relationships. (correct)

Which method is NOT associated with trend analysis?

Sequential pattern mining
Value prediction
Approximate and consecutive motifs (correct)
Time-series analysis

What is an important aspect of evaluation in data analysis?

Ensuring assumptions are validated and arguments are substantiated (B)

Which of the following best describes web mining?

The study of web community discovery and similar patterns (A)

Which type of attribute can be categorized but does not have a meaningful order between categories?

Nominal (A)

Which attribute type allows for all arithmetic operations, including multiplication and division?

Ratio (D)

What characteristic differentiates a binary attribute from a nominal attribute?

Binary attributes have only two possible states. (B)

In which attribute type are values ordered but the exact magnitude of difference between them is not known?

Ordinal (D)

Which of the following is an example of an interval-scaled attribute?

Temperature in Celsius (B)

What type of attribute is represented by real numbers?

Continuous attribute (A)

Which of the following examples represents a discrete attribute?

Zip code (B)

What does the median represent in a data set?

The middle value (A)

Which mean is calculated by removing extreme values from the data set?

Trimmed mean (B)

Which of the following best describes graph data?

A set of nodes linked by edges (C)

If a data set has two values that occur with the highest frequency, what is it classified as?

Bimodal (D)

In the context of statistical measures, what is the primary purpose of calculating the mode?

To identify the most frequent value (B)

Which variable representation is commonly used for continuous attributes?

Floating-point variables (C)

Which of the following is NOT considered a dimension of evaluation in data analysis?

Creativity (B)

In which application area would you primarily find claims and fraud analysis?

Insurance

Which challenge in data analysis involves ensuring algorithms can handle growing data sizes efficiently?

Efficiency and scalability (C)

What are the data objects in a data set typically described by?

Attributes (B)

Which of the following represents a major challenge in mining methodology?

Mining various and new kinds of knowledge (B)

What aspect does privacy-preserving data analysis focus on?

Protecting individual privacy during data analysis (B)

Which of the following best describes data objects in the context of a dataset?

Entities represented within the dataset (C)

What type of data analysis method involves processing data in real-time as it is generated?

Stream analysis (A)

What is the value of the inter-quartile range (IQR) if Q1 is 20 and Q3 is 30?

10 (B)

Which of the following describes a positively skewed distribution?

Mean > Median > Mode (A)

What is a key characteristic of a boxplot?

It represents the five-number summary. (A)

What does the presence of outliers indicate in a dataset when using the 1.5 x IQR rule?

Variability beyond typical data points (B)

How is variance for a sample denoted mathematically?

$s^2$ (A)

What does a histogram primarily display?

Tabulated frequencies for intervals (C)

What is the characteristic of the Jaccard coefficient?

It is a similarity measure for asymmetric binary attributes.

What does the term 'proximity' refer to in statistical analysis?

Similarity or dissimilarity between data objects (B)

Which of the following is NOT a property of the Minkowski distance?

Monotonicity (A)

Which distance form corresponds to h = 2 in the Minkowski distance?

Euclidean distance (C)

How are outliers plotted in a boxplot?

Individually outside the whiskers (A)

Which of the following statements is true about quartiles?

Q1 represents the 25th percentile. (B)

In proximity measures for nominal attributes, what is the formula for Simple Matching?

$(p - m) / p$ (A)

What is the main use of a boxplot compared to a histogram?

To summarize key descriptive statistics succinctly. (D)

Study Notes

Graph Mining

Finding frequent subgraphs and substructures, e.g., in chemical compounds and web fragments

Information Network Analysis

Actors are objects or nodes.
Relationships are edges.
Examples: author networks in computer science, terrorist networks.
A person can be part of multiple information networks
Links carry semantic information.
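
As a rough illustration of the bullets above, an information network can be stored as a list of labeled edges and turned into an adjacency map. This is only a sketch: the actor names, the link labels, and the helper `build_adjacency` are hypothetical and not taken from the lesson.

```python
from collections import defaultdict

# Hypothetical author network: each edge is (actor, actor, link label).
# The label is the semantic information carried by the link.
edges = [
    ("alice", "bob", "co-author"),
    ("alice", "carol", "co-author"),
    ("bob", "dave", "same-institution"),
]


def build_adjacency(edge_list):
    """Return {node: [(neighbor, label), ...]} for an undirected network."""
    adjacency = defaultdict(list)
    for source, target, label in edge_list:
        adjacency[source].append((target, label))
        adjacency[target].append((source, label))
    return dict(adjacency)


network = build_adjacency(edges)

# Degree (number of links per actor) is one of the simplest network statistics.
degree = {node: len(neighbors) for node, neighbors in network.items()}
print(degree)
```

Keeping the label on every edge is what lets later analysis distinguish, say, co-authorship from shared affiliation instead of treating all links alike.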

Web Mining

The web is a large information network; link analysis such as PageRank (used by Google) exploits its structure
Analysis includes usage mining, opinion mining, and web community discovery

Time and Ordering

Trend, time-series, and deviation analysis: e.g., regression and value prediction
Sequential pattern mining: e.g., buying a digital camera then a large SD memory card.
Periodicity analysis: e.g., weekly sales spikes
Biological sequence analysis: e.g., motifs
Mining data streams: ordered, time-varying, potentially infinite data
Conversational analysis

Evaluation

Knowledge discovery, predictions, and assumptions are validated.
Multi-dimensional evaluation includes: accuracy, interestingness, completeness, efficiency, explainability, diversity, and representativeness.

Application Areas

Application areas include: E-commerce, social media, finance, insurance, telecommunications, transport, and data service providers

Major Challenges in Data Analysis (1): Mining Methodology

Mining various and new types of knowledge
Data analysis is an interdisciplinary effort.
Handling noise, uncertainty, and incompleteness of data
Pattern evaluation and pattern- or constraint-guided mining

Major Challenges in Data Analysis (2): User Interaction

Interactive analysis
Incorporation of background knowledge
Presentation and visualization of data analysis results

Major Challenges in Data Analysis (3): Efficiency and Scalability

Efficiency and scalability of data analysis algorithms
Parallel, distributed, stream, and incremental analysis methods
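
To make "incremental analysis" concrete, the sketch below updates a mean and sample variance one value at a time without storing the data. It uses Welford's online algorithm, which is one well-known technique chosen here for illustration; the lesson does not name a specific method, and the class name `RunningStats` is hypothetical.

```python
class RunningStats:
    """Incrementally track count, mean, and sample variance (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the running statistics."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def sample_variance(self):
        """Sample variance with the (n - 1) denominator; 0.0 if fewer than 2 values."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:  # stands in for values arriving from a stream
    stats.update(value)

print(stats.count, stats.mean, stats.sample_variance)
```

Each update takes constant time and memory, which is the property that matters when the input is a potentially unbounded stream.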

Major Challenges in Data Analysis (4): Diversity of Data Types

Handling complex types of data
Analyzing dynamic, networked, and global data repositories

Major Challenges in Data Analysis (5): Data Analysis and Society

Social impacts of data analysis
Privacy-preserving data analysis

Data Objects and Attribute Types

Data sets are made up of data objects.
Each data object represents an entity.
Examples: sales data: customers, store items, sales; medical data: patients, treatments; university data: students, professors, courses.
Also called samples, examples, instances, data points, objects, or tuples.

Attributes

An attribute is a data field that represents a characteristic of a data object.
Examples: customer_ID, name, address
Types include: nominal, binary, ordinal, and numeric (interval-scaled or ratio-scaled).

Attribute Types: Nominal

Categories, states, or “names of things”
Examples: Hair color, marital status, occupation, ID numbers, zip codes

Attribute Types: Binary

Nominal attributes with only 2 states (0 and 1)
Symmetric binary: both outcomes equally important (e.g., gender)
Asymmetric binary: outcomes not equally important (e.g., medical test)
Convention: assign 1 to the most important outcome (e.g., HIV positive)

Attribute Types: Ordinal

Values have a meaningful order (ranking) but magnitude between successive values is not known.
Examples: Size = {small, medium, large}, grades, army rankings

Attribute Types: Numeric

Quantity (integer or real-valued)
Interval: measured on a scale of equal-sized units. Values have order. No true zero-point.
Examples: temperature in °C, calendar dates.
Ratio: a numeric attribute with an inherent zero-point, so values can be compared as multiples of one another.
Example: 10 dollars is twice as much as 5 dollars.

Discrete vs. Continuous Attributes

Discrete Attribute: has only a finite or countably infinite set of values.
Examples: zip codes, profession, or the set of words in a collection of documents.
Continuous Attribute: has real numbers as attribute values, measured and represented using a finite number of digits.
Examples: temperature, height, or weight.

Some Special Data Formats

Sequential data: ordered objects, time series, text data
Graph data: nodes (vertices), edges (links)

Measuring the Central Tendency

Mean (algebraic measure, sample vs. population): $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Trimmed mean: chopping extreme values
Median: middle value
Mode: value that occurs most frequently
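
The measures above can be tried out with Python's standard `statistics` module. This is a minimal sketch: the salary figures are illustrative data, and `weighted_mean` and `trimmed_mean` are hypothetical helpers (the trimming fraction is applied to each end of the sorted data, which is one common convention).

```python
from statistics import mean, median, multimode


def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def trimmed_mean(values, fraction):
    """Mean after dropping the lowest and highest `fraction` of the values."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of values cut from each end
    kept = ordered[k:len(ordered) - k] if k > 0 else ordered
    return mean(kept)


# Illustrative salaries (in thousands); 110 is an extreme value.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print("mean:", mean(salaries))
print("median:", median(salaries))
print("modes:", multimode(salaries))  # two modes, so the data is bimodal
print("trimmed mean (10% per end):", trimmed_mean(salaries, 0.1))
print("weighted mean:", weighted_mean([80, 90], [1, 3]))
```

The trimmed mean comes out lower than the plain mean here because the single large value no longer pulls the average up.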

Symmetric vs. Skewed Data

Median, mean and mode of symmetric, positively and negatively skewed data
Positively skewed: mode < median < mean
Negatively skewed: mean < median < mode

Measuring the Dispersion of Data

Quartiles, outliers and boxplots
Quartiles: Q1 (25th percentile), Q3 (75th percentile)
Inter-quartile range: IQR = Q3 – Q1
Five number summary: min, Q1, median, Q3, max
Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually
Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
Variance and standard deviation (sample: s, population: σ)
Variance: $s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
Standard deviation: s = the square root of the variance
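
A short sketch of these dispersion measures, using the standard `statistics` module on illustrative salary data. Quartile values depend on the interpolation convention; the `method="inclusive"` setting below is an assumption made for this example, and outliers are flagged as values more than 1.5 x IQR below Q1 or above Q3.

```python
from statistics import quantiles, stdev, variance

# Illustrative salaries (in thousands).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Quartiles: n=4 yields the three cut points Q1, median, Q3.
q1, q2, q3 = quantiles(salaries, n=4, method="inclusive")
iqr = q3 - q1

# 1.5 x IQR rule: values outside these fences are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in salaries if x < lower_fence or x > upper_fence]

five_number_summary = (min(salaries), q1, q2, q3, max(salaries))

s2 = variance(salaries)  # sample variance, (n - 1) denominator
s = stdev(salaries)      # sample standard deviation, sqrt of s2

print("five-number summary:", five_number_summary)
print("IQR:", iqr, "outliers:", outliers)
print("variance:", s2, "std dev:", s)
```

The five-number summary and the outlier list are exactly what a boxplot draws: the box spans Q1 to Q3, and the flagged values are plotted individually.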

Boxplot Analysis

Five-number summary of a distribution: minimum, Q1, median, Q3, maximum
Boxplot:
Ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
Median is marked by a line within the box
Whiskers: two lines outside the box extended to Minimum and Maximum
Outliers: plotted individually, beyond a specified outlier threshold

Histogram Analysis

Histogram: Graph display of tabulated frequencies, shown as bars
Shows the proportion of cases falling into each interval
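
The tabulation behind a histogram can be sketched in a few lines. The helper `histogram`, the bin width, and the data are hypothetical choices for illustration; bins are half-open intervals, and empty bins are simply left out of the result.

```python
def histogram(values, bin_width, start):
    """Count values per half-open interval [low, low + bin_width)."""
    counts = {}
    for value in values:
        index = int((value - start) // bin_width)
        low = start + index * bin_width
        key = (low, low + bin_width)
        counts[key] = counts.get(key, 0) + 1
    return dict(sorted(counts.items()))


# Illustrative salaries (in thousands).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

counts = histogram(salaries, bin_width=20, start=20)

# Proportion of cases in each interval, i.e. the relative bar heights.
proportions = {interval: c / len(salaries) for interval, c in counts.items()}

for (low, high), c in counts.items():
    print(f"[{low}, {high}): {'#' * c}")
```

Unlike the five-number summary of a boxplot, this view keeps the shape of the distribution, such as the gap between the bulk of the data and the extreme value.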

Similarity and Dissimilarity

Similarity: Numerical measure of how alike two data objects are
Value higher when objects are more alike
Dissimilarity (e.g., distance): Numerical measure of how different two data objects are
Lower when objects are more alike
Proximity refers to a similarity or dissimilarity

Proximity Measure for Nominal Attributes

Method 1: Simple matching: $d(i, j) = (p - m) / p$, where m is the number of matching attributes and p is the total number of attributes
Method 2: Creating a new binary attribute for each of the M nominal states – then apply binary proximity measures
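
Both methods can be sketched briefly, assuming the usual simple matching definition d(i, j) = (p - m) / p, with m the number of attributes on which the two objects agree and p the total number of attributes. The function names and the sample records are hypothetical.

```python
def simple_matching_distance(obj_i, obj_j):
    """Method 1: fraction of nominal attributes on which two objects differ."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of attributes")
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p


def one_hot(value, states):
    """Method 2: one binary attribute per nominal state."""
    return [1 if value == state else 0 for state in states]


# Hypothetical records: (hair color, marital status, occupation).
person_a = ("black", "single", "engineer")
person_b = ("black", "married", "engineer")

print(simple_matching_distance(person_a, person_b))
print(one_hot("married", ["single", "married", "divorced"]))
```

After one-hot encoding, any of the binary proximity measures can be applied to the resulting 0/1 vectors.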

Proximity Measure for Binary Attributes

Contingency table for binary data. Measures include a distance measure for symmetric binary variables, a distance measure for asymmetric binary variables, and the Jaccard coefficient (similarity measure for asymmetric binary variables)
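
The sketch below builds the contingency counts for two binary objects and derives the three measures from them. It assumes the common notation q (both 1), r (1 and 0), s (0 and 1), t (both 0); the function names and the patient records are hypothetical.

```python
def contingency(obj_i, obj_j):
    """Return (q, r, s, t): counts of (1,1), (1,0), (0,1), (0,0) pairs."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of attributes")
    q = r = s = t = 0
    for a, b in zip(obj_i, obj_j):
        if a == 1 and b == 1:
            q += 1
        elif a == 1 and b == 0:
            r += 1
        elif a == 0 and b == 1:
            s += 1
        else:
            t += 1
    return q, r, s, t


def symmetric_distance(obj_i, obj_j):
    """Both outcomes matter equally, so 0-0 matches count in the denominator."""
    q, r, s, t = contingency(obj_i, obj_j)
    return (r + s) / (q + r + s + t)


def asymmetric_distance(obj_i, obj_j):
    """0-0 matches are ignored because the 0 outcome is the unimportant one."""
    q, r, s, _ = contingency(obj_i, obj_j)
    return (r + s) / (q + r + s) if (q + r + s) else 0.0


def jaccard(obj_i, obj_j):
    """Similarity for asymmetric binary data: 1 - asymmetric distance."""
    q, r, s, _ = contingency(obj_i, obj_j)
    return q / (q + r + s) if (q + r + s) else 1.0


# Hypothetical patients: fever, cough, and four test results (1 = positive).
patient_a = [1, 0, 1, 0, 0, 0]
patient_b = [1, 0, 1, 0, 1, 0]

print(symmetric_distance(patient_a, patient_b))
print(asymmetric_distance(patient_a, patient_b))
print(jaccard(patient_a, patient_b))
```

Ignoring the 0-0 matches is the design choice that matters here: two patients who both test negative on many items should not look similar for that reason alone.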

Distance on Numeric Data: Minkowski Distance

A popular distance measure: $d(i, j) = \left( |x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h \right)^{1/h}$
Properties:
d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
d(i, j) = d(j, i) (Symmetry)
d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
A distance is a metric if it satisfies these properties.

Special Cases of Minkowski Distance

h = 1: Manhattan (city block, L1 norm) distance
h = 2: (L2 norm) Euclidean distance
h → ∞: Supremum (L∞ norm) distance
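
A minimal sketch of the Minkowski distance and its special cases, assuming the standard definition (sum of absolute differences raised to the power h, then the 1/h root). The function name `minkowski` and the sample points are hypothetical; passing `math.inf` for h selects the supremum distance.

```python
import math


def minkowski(x, y, h):
    """Minkowski distance of order h between two numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(h):
        # Limit h -> infinity: the largest single-attribute difference.
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1 / h)


x1, x2 = (1, 2), (3, 5)

manhattan = minkowski(x1, x2, 1)         # |1-3| + |2-5|
euclidean = minkowski(x1, x2, 2)         # sqrt(2^2 + 3^2)
supremum = minkowski(x1, x2, math.inf)   # max(2, 3)

print(manhattan, euclidean, supremum)
```

For the same pair of points the distance shrinks as h grows, ending at the largest per-attribute difference.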
