Graph Mining and Information Network Analysis

Questions and Answers

What is the primary focus of graph mining?

  • Evaluating the performance of social networks
  • Finding frequent subgraphs and substructures (correct)
  • Analyzing time-series data for trends
  • Mining text documents for sentiment analysis

How do links in information networks contribute to analysis?

  • They provide visual representations of data.
  • They are utilized only in web mining, not in social network analysis.
  • They are used primarily for filtering data streams.
  • They carry semantic information that enhances understanding of relationships. (correct)

Which method is NOT associated with trend analysis?

  • Sequential pattern mining
  • Value prediction
  • Approximate and consecutive motifs (correct)
  • Time-series analysis

What is an important aspect of evaluation in data analysis?

  • Ensuring assumptions are validated and arguments are substantiated (B)

Which of the following best describes web mining?

  • The study of web community discovery and similar patterns (A)

Which type of attribute can be categorized but does not have a meaningful order between categories?

  • Nominal (A)

Which attribute type allows for all arithmetic operations, including multiplication and division?

  • Ratio (D)

What characteristic differentiates a binary attribute from a nominal attribute?

  • Binary attributes have only two possible states. (B)

In which attribute type are values ordered but the exact magnitude of difference between them is not known?

  • Ordinal (D)

Which of the following is an example of an interval-scaled attribute?

  • Temperature in Celsius (B)

What type of attribute is represented by real numbers?

  • Continuous attribute (A)

Which of the following examples represents a discrete attribute?

  • Zip code (B)

What does the median represent in a data set?

  • The middle value (A)

Which mean is calculated by removing extreme values from the data set?

  • Trimmed mean (B)

Which of the following best describes graph data?

  • A set of nodes linked by edges (C)

If a data set has two values that occur with the highest frequency, what is it classified as?

  • Bimodal (D)

In the context of statistical measures, what is the primary purpose of calculating the mode?

  • To identify the most frequent value (B)

Which variable representation is commonly used for continuous attributes?

  • Floating-point variables (C)

Which of the following is NOT considered a dimension of evaluation in data analysis?

  • Creativity (B)

In which application area would you primarily find claims and fraud analysis?

  • Social Media (D)

Which challenge in data analysis involves ensuring algorithms can handle growing data sizes efficiently?

  • Efficiency and scalability (C)

What are the data objects in a data set typically described by?

  • Attributes (B)

Which of the following represents a major challenge in mining methodology?

  • Mining various and new kinds of knowledge (B)

What aspect does privacy-preserving data analysis focus on?

  • Protecting individual privacy during data analysis (B)

Which of the following best describes data objects in the context of a dataset?

  • Entities represented within the dataset (C)

What type of data analysis method involves processing data in real-time as it is generated?

  • Stream analysis (A)

What is the value of the inter-quartile range (IQR) if Q1 is 20 and Q3 is 30?

  • 10 (B)

Which of the following describes a positively skewed distribution?

  • Mean > Median > Mode (A)

What is a key characteristic of a boxplot?

  • It represents the five-number summary. (A)

What does the presence of outliers indicate in a dataset when using the 1.5 x IQR rule?

  • Variability beyond typical data points (B)

How is variance for a sample denoted mathematically?

  • $s^2$ (A)

What does a histogram primarily display?

  • Tabulated frequencies for intervals (C)

What is the characteristic of the Jaccard coefficient?

  • It is applicable only to nominal data. (D)

What does the term 'proximity' refer to in statistical analysis?

  • Similarity or dissimilarity between data objects (B)

Which of the following is NOT a property of the Minkowski distance?

  • Monotonicity (A)

Which distance form corresponds to h = 2 in the Minkowski distance?

  • Euclidean distance (C)

How are outliers plotted in a boxplot?

  • Individually outside the whiskers (A)

Which of the following statements is true about quartiles?

  • Q1 represents the 25th percentile. (B)

In proximity measures for nominal attributes, what is the formula for Simple Matching?

  • $\frac{p - m}{p}$ (A)

What is the main use of a boxplot compared to a histogram?

  • To summarize key descriptive statistics succinctly. (D)

Study Notes

Graph Mining

  • Finding frequent subgraphs and substructures, e.g., in chemical compounds and web fragments

Information Network Analysis

  • Actors are objects or nodes.
  • Relationships are edges.
  • Examples: author networks in computer science, terrorist networks.
  • A person can be part of multiple information networks
  • Links carry semantic information.
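
The node-and-edge view above can be sketched in a few lines of Python. This is only an illustration: the names `edges`, `build_network`, and `neighbors_by_type`, the people, and the link labels are all made up for the example. Each edge carries a label to stand in for the semantic information a link holds.

```python
from collections import defaultdict

# Hypothetical links: (actor, actor, link type). The link type is the
# semantic information attached to the relationship.
edges = [
    ("alice", "bob", "co-author"),
    ("alice", "carol", "co-author"),
    ("bob", "carol", "same-venue"),
]

def build_network(edge_list):
    """Return an adjacency map: node -> list of (neighbor, link_type)."""
    network = defaultdict(list)
    for source, target, link_type in edge_list:
        # Links are treated as undirected in this sketch.
        network[source].append((target, link_type))
        network[target].append((source, link_type))
    return network

def neighbors_by_type(network, node, link_type):
    """Neighbors of `node` connected through links of one semantic type."""
    return sorted(n for n, t in network[node] if t == link_type)

network = build_network(edges)
print(neighbors_by_type(network, "alice", "co-author"))
print(neighbors_by_type(network, "bob", "same-venue"))
```

Filtering by link type shows why labeled links matter: the same actor has different neighborhoods depending on which relationship is being analyzed.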

Web Mining

  • The web is a large information network; its analysis underlies methods such as PageRank, used by Google
  • Analysis includes usage mining, opinion mining, and web community discovery

Time and Ordering

  • Trend, time-series, and deviation analysis: e.g., regression and value prediction
  • Sequential pattern mining: e.g., buying a digital camera then a large SD memory card.
  • Periodicity analysis: e.g., weekly sales spikes
  • Biological sequence analysis: e.g., motifs
  • Mining data streams: ordered, time-varying, potentially infinite data
  • Conversational analysis
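
As a rough illustration of sequential pattern mining, the sketch below counts how often one ordered pattern (camera first, SD card later) occurs across purchase histories. The function names and the `purchases` data are hypothetical, and a real miner would search for frequent patterns rather than test a single given one.

```python
def contains_in_order(sequence, pattern):
    """True if the items of `pattern` appear in `sequence` in the same
    order (not necessarily next to each other)."""
    remaining = iter(sequence)
    # `item in remaining` consumes the iterator up to the first match,
    # so later pattern items must occur after earlier ones.
    return all(item in remaining for item in pattern)

def support(sequences, pattern):
    """Fraction of sequences that contain `pattern` as an ordered subsequence."""
    hits = sum(contains_in_order(seq, pattern) for seq in sequences)
    return hits / len(sequences)

# Made-up purchase histories, one list per customer, in time order.
purchases = [
    ["digital camera", "tripod", "SD card"],
    ["SD card", "digital camera"],
    ["digital camera", "SD card"],
    ["laptop", "mouse"],
]

print(support(purchases, ["digital camera", "SD card"]))
print(support(purchases, ["SD card", "digital camera"]))
```

Order matters here: the second customer bought both items but in the reverse order, so that history supports only the reversed pattern.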

Evaluation

  • Knowledge discovery, predictions, and assumptions are validated.
  • Multi-dimensional evaluation includes: accuracy, interestingness, completeness, efficiency, explainability, diversity, and representativeness.

Application Areas

  • Application areas include: E-commerce, social media, finance, insurance, telecommunications, transport, and data service providers

Major Challenges in Data Analysis (1): Mining Methodology

  • Mining various and new types of knowledge
  • Data analysis is an interdisciplinary effort.
  • Handling noise, uncertainty, and incompleteness of data
  • Pattern evaluation and pattern- or constraint-guided mining

Major Challenges in Data Analysis (2): User Interaction

  • Interactive analysis
  • Incorporation of background knowledge
  • Presentation and visualization of data analysis results

Major Challenges in Data Analysis (3): Efficiency and Scalability

  • Efficiency and scalability of data analysis algorithms
  • Parallel, distributed, stream, and incremental analysis methods

Major Challenges in Data Analysis (4): Diversity of Data Types

  • Handling complex types of data
  • Analyzing dynamic, networked, and global data repositories

Major Challenges in Data Analysis (5): Data Analysis and Society

  • Social impacts of data analysis
  • Privacy-preserving data analysis

Data Objects and Attribute Types

  • Data sets are made up of data objects.
  • Each data object represents an entity.
  • Examples: sales data (customers, store items, sales); medical data (patients, treatments); university data (students, professors, courses).
  • Also called samples, examples, instances, data points, objects, or tuples.

Attributes

  • An attribute is a data field that represents a characteristic of a data object.
  • Examples: customer_ID, name, address
  • Types include: nominal, binary, ordinal, and numeric (interval-scaled or ratio-scaled).

Attribute Types: Nominal

  • Categories, states, or “names of things”
  • Examples: Hair color, marital status, occupation, ID numbers, zip codes

Attribute Types: Binary

  • Nominal attributes with only 2 states (0 and 1)
  • Symmetric binary: both outcomes equally important (e.g., gender)
  • Asymmetric binary: outcomes not equally important (e.g., medical test)
  • Convention: assign 1 to the most important outcome (e.g., HIV positive)

Attribute Types: Ordinal

  • Values have a meaningful order (ranking) but magnitude between successive values is not known.
  • Examples: Size = {small, medium, large}, grades, army rankings

Attribute Types: Numeric

  • Quantity (integer or real-valued)
  • Interval: measured on a scale of equal-sized units. Values have order. No true zero-point.
  • Examples: temperature in °C, calendar dates.
  • Ratio: a numeric attribute with an inherent zero-point.
  • Examples: monetary amounts (10 dollars is twice as high as 5 dollars)

Discrete vs. Continuous Attributes

  • Discrete Attribute: has only a finite or countably infinite set of values.
  • Examples: zip codes, profession, or the set of words in a collection of documents.
  • Continuous Attribute: has real numbers as attribute values; in practice these are measured and represented using a finite number of digits.
  • Examples: temperature, height, or weight.

Some Special Data Formats

  • Sequential data: ordered objects, time series, text data
  • Graph data: nodes (vertices), edges (links)

Measuring the Central Tendency

  • Mean (algebraic measure, sample vs. population): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
  • Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
  • Trimmed mean: chopping extreme values
  • Median: middle value
  • Mode: value that occurs most frequently
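
A minimal sketch of these measures using Python's standard `statistics` module. The helper names `weighted_mean` and `trimmed_mean` and the `salaries` values are illustrative assumptions, and trimming a fixed count `k` from each end is just one common way to define a trimmed mean.

```python
import statistics

def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

def trimmed_mean(values, k):
    """Mean after dropping the k smallest and the k largest values."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Made-up salaries (in thousands); 110 is an extreme value.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(statistics.mean(salaries))       # arithmetic mean
print(statistics.median(salaries))     # middle value (average of the two middle ones)
print(statistics.multimode(salaries))  # all most-frequent values
print(trimmed_mean(salaries, 1))       # mean without the lowest and highest value
print(weighted_mean([1, 2, 3], [1, 1, 2]))
```

In this sample the trimmed mean sits below the plain mean because dropping the extreme value 110 removes most of the upward pull, and `multimode` returns two values because the data is bimodal.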

Symmetric vs. Skewed Data

  • Median, mean and mode of symmetric, positively and negatively skewed data
  • Positively skewed: mode < median < mean
  • Negatively skewed: mean < median < mode
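
The ordering of the three measures can be checked on small made-up samples. The function `skew_direction` below is a hypothetical helper that applies the ordering above as a rule of thumb; it is not a formal skewness statistic, and it assumes the sample has a single clear mode.

```python
import statistics

def skew_direction(values):
    """Classify skew from the ordering of mode, median, and mean."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    mode = statistics.mode(values)
    if mode < median < mean:
        return "positive"
    if mean < median < mode:
        return "negative"
    return "undetermined"  # symmetric data or an ordering the rule does not cover

# Made-up samples: a long right tail, and its mirror image.
right_tailed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 19]
left_tailed = [20 - x for x in right_tailed]

print(skew_direction(right_tailed))
print(skew_direction(left_tailed))
```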

Measuring the Dispersion of Data

  • Quartiles, outliers and boxplots
  • Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  • Inter-quartile range: IQR = Q3 – Q1
  • Five number summary: min, Q1, median, Q3, max
  • Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually
  • Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
  • Variance and standard deviation (sample: s, population: σ)
  • Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
  • Standard deviation: s = the square root of the variance
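
The sketch below computes these dispersion measures for a small made-up sample. Quartile conventions differ between textbooks and tools; this example assumes the "inclusive" method of `statistics.quantiles`, and the helper names are illustrative.

```python
import math
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the inclusive quartile method."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

def iqr_outliers(values):
    """Values more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / (len(values) - 1)

data = [10, 20, 22, 24, 25, 26, 28, 30, 60]

print(five_number_summary(data))
print(iqr_outliers(data))
print(sample_variance(data), math.sqrt(sample_variance(data)))
```

With Q1 = 22 and Q3 = 28 the IQR is 6, so the outlier fences fall at 13 and 37, and both 10 and 60 are flagged.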

Boxplot Analysis

  • Five-number summary of a distribution: minimum, Q1, median, Q3, maximum
  • Boxplot:
  • Ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
  • Median is marked by a line within the box
  • Whiskers: two lines outside the box extended to Minimum and Maximum
  • Outliers: plotted individually, beyond a specified outlier threshold

Histogram Analysis

  • Histogram: Graph display of tabulated frequencies, shown as bars
  • Shows the proportion of cases falling into each interval
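
A histogram's underlying table can be built by counting values per interval. This is a minimal sketch assuming equal-width, half-open bins; the function names, the bin width, and the `ages` sample are illustrative choices.

```python
def histogram(values, bin_width, start):
    """Count values per interval [left, left + bin_width), keyed by left edge.
    Intervals that contain no values are omitted from the result."""
    counts = {}
    for x in values:
        k = int((x - start) // bin_width)  # index of the bin containing x
        left = start + k * bin_width
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))

def proportions(counts):
    """Convert bin counts into the proportion of cases in each interval."""
    total = sum(counts.values())
    return {left: c / total for left, c in counts.items()}

ages = [21, 23, 25, 31, 34, 35, 38, 42, 47, 58]
counts = histogram(ages, bin_width=10, start=20)

print(counts)
print(proportions(counts))
```

Each key is the left edge of a bar and each value is the bar's height, either as a frequency or as a proportion.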

Similarity and Dissimilarity

  • Similarity: Numerical measure of how alike two data objects are
  • Value higher when objects are more alike
  • Dissimilarity (e.g., distance): Numerical measure of how different two data objects are
  • Lower when objects are more alike
  • Proximity refers to a similarity or dissimilarity

Proximity Measure for Nominal Attributes

  • Method 1: Simple matching: $d(i, j) = \frac{p - m}{p}$, where m is the number of matching attributes and p is the total number of attributes
  • Method 2: Creating a new binary attribute for each of the M nominal states – then apply binary proximity measures
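
Both methods can be sketched briefly, assuming the usual simple matching definition d(i, j) = (p - m) / p with p attributes and m matches. The function names and the example objects are hypothetical.

```python
def simple_matching_distance(obj_i, obj_j):
    """d(i, j) = (p - m) / p for two objects described by nominal attributes."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of attributes")
    p = len(obj_i)                                   # total number of attributes
    m = sum(a == b for a, b in zip(obj_i, obj_j))    # number of matches
    return (p - m) / p

def one_hot(value, states):
    """Method 2: one binary attribute per nominal state."""
    return [1 if value == state else 0 for state in states]

# Made-up objects: (hair color, marital status, occupation).
person_a = ("red", "single", "engineer")
person_b = ("red", "married", "engineer")

print(simple_matching_distance(person_a, person_b))
print(one_hot("red", ["black", "red", "blond"]))
```

The two objects differ in one of three attributes, so the distance is one third; identical objects would give 0 and objects with no matches would give 1.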

Proximity Measure for Binary Attributes

  • Contingency table for binary data. Measures include a distance measure for symmetric binary variables, a distance measure for asymmetric binary variables, and the Jaccard coefficient (similarity measure for asymmetric binary variables)
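
The three measures can be sketched from the contingency counts. The notes do not spell out the formulas, so this example assumes the commonly used definitions: with q = both 1, r = 1 in i only, s = 1 in j only, and t = both 0, the symmetric distance is (r + s) / (q + r + s + t), the asymmetric distance is (r + s) / (q + r + s), and the Jaccard coefficient is q / (q + r + s). Function names and patient vectors are made up.

```python
def contingency(x, y):
    """Return (q, r, s, t) for two equal-length 0/1 vectors."""
    q = sum(a == 1 and b == 1 for a, b in zip(x, y))
    r = sum(a == 1 and b == 0 for a, b in zip(x, y))
    s = sum(a == 0 and b == 1 for a, b in zip(x, y))
    t = sum(a == 0 and b == 0 for a, b in zip(x, y))
    return q, r, s, t

def symmetric_distance(x, y):
    q, r, s, t = contingency(x, y)
    return (r + s) / (q + r + s + t)

def asymmetric_distance(x, y):
    # Negative matches (t) are ignored; assumes at least one attribute is 1.
    q, r, s, _ = contingency(x, y)
    return (r + s) / (q + r + s)

def jaccard(x, y):
    # Similarity counterpart of the asymmetric distance.
    q, r, s, _ = contingency(x, y)
    return q / (q + r + s)

# Hypothetical test results (1 = positive) for two patients.
patient_a = [1, 0, 1, 0, 0, 0]
patient_b = [1, 0, 1, 0, 1, 0]

print(symmetric_distance(patient_a, patient_b))
print(asymmetric_distance(patient_a, patient_b))
print(jaccard(patient_a, patient_b))
```

Under these definitions the Jaccard coefficient equals one minus the asymmetric distance, which is why it serves as the similarity measure for asymmetric binary variables.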

Distance on Numeric Data: Minkowski Distance

  • A popular distance measure: $d(i, j) = \left(|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h\right)^{1/h}$, where $h \geq 1$
  • Properties:
  • d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
  • d(i, j) = d(j, i) (Symmetry)
  • d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
  • A distance is a metric if it satisfies these properties.

Special Cases of Minkowski Distance

  • h = 1: Manhattan (city block, L1 norm) distance
  • h = 2: Euclidean (L2 norm) distance
  • h → ∞: Supremum (L∞ norm) distance
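
A minimal sketch of the Minkowski distance and its special cases, assuming the standard form (sum of |x_ik - x_jk|^h) raised to the power 1/h. The function names and the two points are illustrative.

```python
import math

def minkowski(x, y, h):
    """Minkowski distance of order h between two equal-length vectors."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

def supremum(x, y):
    """Limit of the Minkowski distance as h grows: the largest coordinate gap."""
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (1, 2), (3, 5)

print(minkowski(x, y, 1))   # Manhattan: |1 - 3| + |2 - 5|
print(minkowski(x, y, 2))   # Euclidean: sqrt(2^2 + 3^2)
print(supremum(x, y))       # Supremum: max(2, 3)
print(minkowski(x, y, 50))  # a large h approaches the supremum distance
```

For these points the Manhattan distance is 5, the Euclidean distance is the square root of 13, and the supremum distance is 3, so the distance shrinks toward the largest single coordinate difference as h increases.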
