Graph Mining and Information Network Analysis

Questions and Answers

What is the primary focus of graph mining?

  • Evaluating the performance of social networks
  • Finding frequent subgraphs and substructures (correct)
  • Analyzing time-series data for trends
  • Mining text documents for sentiment analysis

How do links in information networks contribute to analysis?

  • They provide visual representations of data.
  • They are utilized only in web mining, not in social network analysis.
  • They are used primarily for filtering data streams.
  • They carry semantic information that enhances understanding of relationships. (correct)

Which method is NOT associated with trend analysis?

  • Sequential pattern mining
  • Value prediction
  • Approximate and consecutive motifs (correct)
  • Time-series analysis

What is an important aspect of evaluation in data analysis?

  • Ensuring assumptions are validated and arguments are substantiated (correct)

Which of the following best describes web mining?

  • The study of web community discovery and similar patterns (correct)

Which type of attribute can be categorized but does not have a meaningful order between categories?

  • Nominal (correct)

Which attribute type allows for all arithmetic operations, including multiplication and division?

  • Ratio (correct)

What characteristic differentiates a binary attribute from a nominal attribute?

  • Binary attributes have only two possible states. (correct)

In which attribute type are values ordered but the exact magnitude of difference between them is not known?

  • Ordinal (correct)

Which of the following is an example of an interval-scaled attribute?

  • Temperature in Celsius (correct)

What type of attribute is represented by real numbers?

  • Continuous attribute (correct)

Which of the following examples represents a discrete attribute?

  • Zip code (correct)

What does the median represent in a data set?

  • The middle value (correct)

Which mean is calculated by removing extreme values from the data set?

  • Trimmed mean (correct)

Which of the following best describes graph data?

  • A set of nodes linked by edges (correct)

If a data set has two values that occur with the highest frequency, what is it classified as?

  • Bimodal (correct)

In the context of statistical measures, what is the primary purpose of calculating the mode?

  • To identify the most frequent value (correct)

Which variable representation is commonly used for continuous attributes?

  • Floating-point variables (correct)

Which of the following is NOT considered a dimension of evaluation in data analysis?

  • Creativity (correct)

In which application area would you primarily find claims and fraud analysis?

  • Insurance (correct)

Which challenge in data analysis involves ensuring algorithms can handle growing data sizes efficiently?

  • Efficiency and scalability (correct)

What are the data objects in a data set typically described by?

  • Attributes (correct)

Which of the following represents a major challenge in mining methodology?

  • Mining various and new kinds of knowledge (correct)

What aspect does privacy-preserving data analysis focus on?

  • Protecting individual privacy during data analysis (correct)

Which of the following best describes data objects in the context of a dataset?

  • Entities represented within the dataset (correct)

What type of data analysis method involves processing data in real-time as it is generated?

  • Stream analysis (correct)

What is the value of the inter-quartile range (IQR) if Q1 is 20 and Q3 is 30?

  • 10 (correct)

Which of the following describes a positively skewed distribution?

  • Mean > Median > Mode (correct)

What is a key characteristic of a boxplot?

  • It represents the five-number summary. (correct)

What does the presence of outliers indicate in a dataset when using the 1.5 x IQR rule?

  • Variability beyond typical data points (correct)

How is variance for a sample denoted mathematically?

  • $s^2$ (correct)

What does a histogram primarily display?

  • Tabulated frequencies for intervals (correct)

What is the characteristic of the Jaccard coefficient?

  • It is a similarity measure for asymmetric binary attributes. (correct)

What does the term 'proximity' refer to in statistical analysis?

  • Similarity or dissimilarity between data objects (correct)

Which of the following is NOT a property of the Minkowski distance?

  • Monotonicity (correct)

Which distance form corresponds to h = 2 in the Minkowski distance?

  • Euclidean distance (correct)

How are outliers plotted in a boxplot?

  • Individually outside the whiskers (correct)

Which of the following statements is true about quartiles?

  • Q1 represents the 25th percentile. (correct)

In proximity measures for nominal attributes, what is the formula for Simple Matching?

  • $\frac{p - m}{p}$ (correct)

What is the main use of a boxplot compared to a histogram?

  • To summarize key descriptive statistics succinctly. (correct)

    Study Notes

    Graph Mining

    • Finding frequent subgraphs and substructures, e.g., in chemical compounds and web fragments

    Information Network Analysis

    • Actors are objects or nodes.
    • Relationships are edges.
    • Examples: author networks in computer science, terrorist networks.
    • A person can be part of multiple information networks
    • Links carry semantic information.
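
The node/edge view above can be sketched as a small data structure. This is only an illustration under stated assumptions: the actor names and the relation labels (`co-author`, `same-institution`) are hypothetical, and links are treated as undirected for simplicity.

```python
# Hypothetical toy network: nodes are actors, edges are labeled relationships.
edges = [
    ("alice", "bob", "co-author"),
    ("alice", "carol", "same-institution"),
    ("bob", "carol", "co-author"),
]

# Build an undirected adjacency map: node -> list of (neighbor, relation).
network = {}
for u, v, relation in edges:
    network.setdefault(u, []).append((v, relation))
    network.setdefault(v, []).append((u, relation))


def neighbors_by_relation(node, relation):
    """Return the sorted neighbors of `node` reached via links with this label."""
    return sorted(n for n, r in network.get(node, []) if r == relation)


print(neighbors_by_relation("alice", "co-author"))
print(neighbors_by_relation("carol", "co-author"))
```

Keeping the label on each link is what lets a query distinguish one kind of relationship from another, which is the sense in which links carry semantic information.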

    Web Mining

    • The web is a large information network; link analysis (e.g., Google's PageRank) exploits its structure
    • Analysis includes usage mining, opinion mining, and web community discovery

    Time and Ordering

    • Trend, time-series, and deviation analysis: e.g., regression and value prediction
    • Sequential pattern mining: e.g., buying a digital camera then a large SD memory card.
    • Periodicity analysis: e.g., weekly sales spikes
    • Biological sequence analysis: e.g., motifs
    • Mining data streams: ordered, time-varying, potentially infinite data
    • Conversational analysis

    Evaluation

    • Knowledge discovery, predictions, and assumptions are validated.
    • Multi-dimensional evaluation includes: accuracy, interestingness, completeness, efficiency, explainability, diversity, and representativeness.

    Application Areas

    • Application areas include: E-commerce, social media, finance, insurance, telecommunications, transport, and data service providers

    Major Challenges in Data Analysis (1): Mining Methodology

    • Mining various and new types of knowledge
    • Data analysis is an interdisciplinary effort.
    • Handling noise, uncertainty, and incompleteness of data
    • Pattern evaluation and pattern- or constraint-guided mining

    Major Challenges in Data Analysis (2): User Interaction

    • Interactive analysis
    • Incorporation of background knowledge
    • Presentation and visualization of data analysis results

    Major Challenges in Data Analysis (3): Efficiency and Scalability

    • Efficiency and scalability of data analysis algorithms
    • Parallel, distributed, stream, and incremental analysis methods

    Major Challenges in Data Analysis (4): Diversity of Data Types

    • Handling complex types of data
    • Analyzing dynamic, networked, and global data repositories

    Major Challenges in Data Analysis (5): Data Analysis and Society

    • Social impacts of data analysis
    • Privacy-preserving data analysis

    Data Objects and Attribute Types

    • Data sets are made up of data objects.
    • Each data object represents an entity.
    • Examples: sales data (customers, store items, sales); medical data (patients, treatments); university data (students, professors, courses).
    • Also called samples, examples, instances, data points, objects, or tuples.

    Attributes

    • An attribute is a data field that represents a characteristic of a data object.
    • Examples: customer_ID, name, address
    • Types include: nominal, binary, ordinal, and numeric (interval-scaled or ratio-scaled).

    Attribute Types: Nominal

    • Categories, states, or “names of things”
    • Examples: Hair color, marital status, occupation, ID numbers, zip codes

    Attribute Types: Binary

    • Nominal attributes with only 2 states (0 and 1)
    • Symmetric binary: both outcomes equally important (e.g., gender)
    • Asymmetric binary: outcomes not equally important (e.g., medical test)
    • Convention: assign 1 to the most important outcome (e.g., HIV positive)

    Attribute Types: Ordinal

    • Values have a meaningful order (ranking) but magnitude between successive values is not known.
    • Examples: Size = {small, medium, large}, grades, army rankings

    Attribute Types: Numeric

    • Quantity (integer or real-valued)
    • Interval: measured on a scale of equal-sized units. Values have order. No true zero-point.
    • Examples: temperature in °C, calendar dates.
    • Ratio: a numeric attribute with an inherent zero-point.
    • Example: 10 dollars is twice as high as 5 dollars.
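
A short worked example may help separate the two numeric scales. It assumes the standard Celsius-to-Kelvin conversion (Kelvin has a true zero-point, Celsius does not); the function name is made up for illustration.

```python
def celsius_to_kelvin(c):
    """Convert degrees Celsius to Kelvin (a scale with a true zero-point)."""
    return c + 273.15


# Differences are meaningful on an interval scale and survive the conversion.
diff_celsius = 20.0 - 10.0
diff_kelvin = celsius_to_kelvin(20.0) - celsius_to_kelvin(10.0)

# Ratios are not: 20 °C is not "twice as hot" as 10 °C.
ratio_celsius = 20.0 / 10.0  # 2.0, an artifact of where 0 °C happens to sit
ratio_kelvin = celsius_to_kelvin(20.0) / celsius_to_kelvin(10.0)  # roughly 1.035

print(diff_celsius, diff_kelvin)
print(ratio_celsius, ratio_kelvin)
```

Only the ratio computed on the scale with an inherent zero-point has a physical interpretation, which is why multiplication and division are reserved for ratio-scaled attributes.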

    Discrete vs. Continuous Attributes

    • Discrete Attribute: has only a finite or countably infinite set of values.
    • Examples: zip codes, profession, or the set of words in a collection of documents.
    • Continuous Attribute: has real numbers as attribute values, measured and represented using a finite number of digits.
    • Examples: temperature, height, or weight.

    Some Special Data Formats

    • Sequential data: ordered objects, time series, text data
    • Graph data: nodes (vertices), edges (links)

    Measuring the Central Tendency

    • Mean (algebraic measure, sample vs. population): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
    • Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
    • Trimmed mean: chopping extreme values
    • Median: middle value
    • Mode: value that occurs most frequently
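
The measures above can be tried out with Python's standard `statistics` module. This is a minimal sketch on made-up data; `trimmed_mean` and `weighted_mean` are hypothetical helper names, and the trimming rule (drop the `k` smallest and `k` largest values) is one common convention, not the only one.

```python
from statistics import mean, median, multimode

data = [2, 3, 3, 4, 5, 6, 40]  # hypothetical sample with one extreme value


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values."""
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k is too large for this sample")
    return mean(ordered[k:len(ordered) - k])


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


print(mean(data))             # 9, pulled upward by the value 40
print(trimmed_mean(data, 1))  # mean of [3, 3, 4, 5, 6]
print(median(data))           # 4
print(multimode(data))        # [3]
print(weighted_mean([80, 90], [1, 3]))  # 87.5
```

On this sample the extreme value moves the mean far more than the median or the trimmed mean, which is the usual reason for preferring the latter two on data with outliers.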

    Symmetric vs. Skewed Data

    • Median, mean and mode of symmetric, positively and negatively skewed data
    • Positively skewed: mode < median < mean
    • Negatively skewed: mean < median < mode

    Measuring the Dispersion of Data

    • Quartiles, outliers and boxplots
    • Quartiles: Q1 (25th percentile), Q3 (75th percentile)
    • Inter-quartile range: IQR = Q3 – Q1
    • Five number summary: min, Q1, median, Q3, max
    • Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually
    • Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
    • Variance and standard deviation (sample: s, population: σ)
    • Variance: $s^2 = \frac{1}{n - 1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
    • Standard deviation: s = the square root of the variance
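
The dispersion measures above can be computed as in the following sketch on made-up data. One assumption to note: quartile conventions differ between textbooks and tools, and the `"inclusive"` method of `statistics.quantiles` is just one choice. Outliers are flagged with the usual rule of lying more than 1.5 x IQR below Q1 or above Q3.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 30]  # hypothetical sample

# Quartiles and inter-quartile range (one of several quartile conventions).
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# 1.5 x IQR rule: values beyond these fences are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

five_number_summary = (min(data), q1, q2, q3, max(data))

# Sample variance and standard deviation (divide by n - 1).
n = len(data)
x_bar = sum(data) / n
sample_variance = sum((x - x_bar) ** 2 for x in data) / (n - 1)
sample_std = sample_variance ** 0.5

print(five_number_summary, iqr, outliers)
print(sample_variance, sample_std)
```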

    Boxplot Analysis

    • Five-number summary of a distribution: minimum, Q1, median, Q3, maximum
    • Boxplot:
    • Ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
    • Median is marked by a line within the box
    • Whiskers: two lines outside the box extended to Minimum and Maximum
    • Outliers: plotted individually, beyond a specified outlier threshold

    Histogram Analysis

    • Histogram: Graph display of tabulated frequencies, shown as bars
    • Shows the proportion of cases falling into each interval
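
A histogram's underlying table can be built by counting values per interval. The sketch below assumes equal-width intervals over a caller-supplied range, with the last interval closed on the right; the function name `histogram_counts` and the data are made up for illustration.

```python
def histogram_counts(values, low, high, bins):
    """Count values in `bins` equal-width intervals covering [low, high].

    Each interval is half-open [a, b), except the last, which includes `high`.
    Values outside [low, high] are ignored.
    """
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if v < low or v > high:
            continue
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return counts


values = [1, 2, 2, 3, 5, 7, 8, 9, 10]  # hypothetical sample
counts = histogram_counts(values, low=0, high=10, bins=5)
proportions = [c / len(values) for c in counts]

print(counts)       # frequencies for [0,2), [2,4), [4,6), [6,8), [8,10]
print(proportions)  # share of cases falling into each interval
```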

    Similarity and Dissimilarity

    • Similarity: Numerical measure of how alike two data objects are
    • Value higher when objects are more alike
    • Dissimilarity (e.g., distance): Numerical measure of how different two data objects are
    • Lower when objects are more alike
    • Proximity refers to a similarity or dissimilarity

    Proximity Measure for Nominal Attributes

    • Method 1: Simple matching: d(i, j) = (p - m) / p, where m is the number of matching attributes and p is the total number of attributes
    • Method 2: Creating a new binary attribute for each of the M nominal states – then apply binary proximity measures
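
Both methods can be sketched in a few lines. For simple matching, the dissimilarity is taken here as the share of mismatching attributes, (p - m) / p, with m the number of matches and p the number of attributes. The attribute values and function names are hypothetical.

```python
def simple_matching_distance(a, b):
    """Dissimilarity (p - m) / p between two tuples of nominal values."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    p = len(a)
    m = sum(1 for x, y in zip(a, b) if x == y)
    return (p - m) / p


def one_hot(value, states):
    """Method 2: one binary attribute per nominal state."""
    return [1 if value == state else 0 for state in states]


person_1 = ("red", "single", "teacher")
person_2 = ("red", "married", "teacher")
print(simple_matching_distance(person_1, person_2))  # one mismatch out of three

print(one_hot("blue", ["red", "green", "blue"]))  # [0, 0, 1]
```

After the one-hot step, each object is a vector of 0/1 values, so the binary proximity measures apply directly.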

    Proximity Measure for Binary Attributes

    • Contingency table for binary data. Measures include a distance measure for symmetric binary variables, a distance measure for asymmetric binary variables, and the Jaccard coefficient (similarity measure for asymmetric binary variables)
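
The three measures can be written out from the contingency counts. The notation below is an assumption borrowed from common textbook usage rather than from these notes: q counts attributes where both objects are 1, r where only the first is 1, s where only the second is 1, and t where both are 0. The sketch also assumes at least one attribute is 1 in either object, since the asymmetric measures are otherwise undefined.

```python
def binary_counts(a, b):
    """Contingency counts (q, r, s, t) for two equal-length 0/1 vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    t = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    return q, r, s, t


def symmetric_distance(a, b):
    """Both states count equally: (r + s) / (q + r + s + t)."""
    q, r, s, t = binary_counts(a, b)
    return (r + s) / (q + r + s + t)


def asymmetric_distance(a, b):
    """Joint absences (t) are ignored: (r + s) / (q + r + s)."""
    q, r, s, _ = binary_counts(a, b)
    return (r + s) / (q + r + s)


def jaccard_similarity(a, b):
    """Jaccard coefficient: q / (q + r + s)."""
    q, r, s, _ = binary_counts(a, b)
    return q / (q + r + s)


a = [1, 0, 1, 0, 0, 0]  # hypothetical test results for two patients
b = [1, 0, 0, 0, 1, 0]
print(binary_counts(a, b))
print(symmetric_distance(a, b), asymmetric_distance(a, b), jaccard_similarity(a, b))
```

With these definitions the Jaccard coefficient equals one minus the asymmetric distance.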

    Distance on Numeric Data: Minkowski Distance

    • A popular distance measure: $d(i, j) = \left(|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h\right)^{1/h}$
    • Properties:
    • d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
    • d(i, j) = d(j, i) (Symmetry)
    • d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
    • A distance is a metric if it satisfies these properties.

    Special Cases of Minkowski Distance

    • h = 1: Manhattan (city block, L1 norm) distance
    • h = 2: (L2 norm) Euclidean distance
    • h → ∞: Supremum (L∞ norm) distance
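
The special cases can be checked with a small function. This sketch takes the Minkowski distance as the h-th root of the sum of absolute coordinate differences raised to the power h, and handles the limit h → ∞ as the largest absolute difference; the function name and the points are made up for illustration.

```python
import math


def minkowski(x, y, h):
    """Minkowski distance of order h between two equal-length numeric vectors.

    Pass h=math.inf for the supremum distance.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if h == math.inf:
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1 / h)


i, j = (0, 0), (3, 4)
print(minkowski(i, j, 1))         # Manhattan: 3 + 4
print(minkowski(i, j, 2))         # Euclidean: sqrt(9 + 16)
print(minkowski(i, j, math.inf))  # Supremum: max(3, 4)
```

For the same pair of points the distance shrinks as h grows, from 7 at h = 1 toward 4 in the limit.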


    Related Documents

    Week 1B PDF - Data Analysis

    Description

    Explore the realms of graph mining and information network analysis in this quiz. Discover concepts like frequent subgraphs, the structure of information networks, and the techniques used in web mining for effective data analysis. Test your knowledge on time-series analysis and the evaluation of knowledge discovery.
