Data Attributes and Missing Values Quiz
44 Questions

Questions and Answers

Which attribute type includes values with a meaningful order but unknown magnitude between successive values?

  • Discrete
  • Ordinal (correct)
  • Nominal
  • Binary

Which of the following is an example of a ratio-scaled attribute?

  • Temperature in Celsius
  • Weight in kilograms (correct)
  • Hair color
  • Army rankings

What best describes a binary attribute?

  • A nominal attribute with only two states. (correct)
  • An attribute with three or more possible values.
  • An attribute that can take on any numeric value.
  • An attribute that is always measurable on an ordered scale.

In which of the following scenarios would you use an interval-scaled attribute?

  • Recording temperatures in Fahrenheit. (correct)

Which of the following is NOT a characteristic of nominal attributes?

  • Values can be ordered or ranked. (correct)

What is a common issue with manually filling in missing values?

  • It is tedious and often infeasible. (correct)

Which of the following methods is considered smarter for filling in missing values?

  • Using the attribute mean for all samples belonging to the same class. (correct)

What causes noise in data?

  • Faulty data collection instruments. (correct)

Which of the following is NOT listed as a problem associated with data?

  • Excessive documentation. (correct)

What is the purpose of binning in handling noisy data?

  • To reduce cardinality and smooth the data. (correct)

What does the cosine similarity formula calculate?

  • The similarity between two vectors (correct)

What is a common issue encountered when integrating multiple databases?

  • Data redundancy (correct)

Which operation is used to calculate the dot product of two vectors?

  • Multiplying corresponding elements of the vectors and summing the products (correct)

Which of the following is NOT a method of data transformation?

  • Data validation (correct)

What is the significance of calculating the lengths of vectors in cosine similarity?

  • To normalize the vectors for accurate comparison (correct)

What issue do traditional similarity measures face?

  • They fail to capture complex semantics (correct)

What does the min-max normalization formula achieve?

  • It maps values to a new specified range (correct)

Which of the following is an attribute that may vary across different databases?

  • Derived attribute (correct)

How is the cosine similarity value represented mathematically?

  • As the dot product divided by the product of the magnitudes of the two vectors (correct)

What method can be used to detect redundant attributes within datasets?

  • Correlation analysis (correct)

Which example illustrates the limitation of vector space models?

  • Both previous sentences convey the same meaning (correct)

Which of the following describes 'smoothing' in data transformation?

  • Removing noise from data (correct)

What new methods are suggested for handling complex semantics?

  • Distributed representation and representation learning (correct)

What is the primary purpose of discretization in data transformation?

  • To categorize continuous data (correct)

What fundamental aspect do data quality and data cleaning focus on?

  • Maintaining integrity and consistency of data (correct)

What does attribute construction involve in data transformation?

  • Creating new attributes from existing ones (correct)

What does a boxplot represent regarding a dataset?

  • The quartiles and outliers of the dataset (correct)

What is the inter-quartile range (IQR)?

  • The difference between the 75th percentile and the 25th percentile (correct)

What does a quantile-quantile (q-q) plot compare?

  • The quantiles of one distribution with those of another (correct)

Which statement about positively and negatively correlated data is true?

  • One half can be positively correlated while the other is negatively correlated. (correct)

Which of the following describes an ordinal variable?

  • A variable where the order of values is significant, like rankings (correct)

What is the correct threshold for identifying potential outliers in a dataset using IQR?

  • 1.5 x IQR (correct)

In a scatter plot, what does each point represent?

  • A pair of coordinates representing two variables (correct)

What can cosine similarity be used to measure?

  • The similarity between two documents represented as term-frequency vectors (correct)

Which of the following is NOT part of the five-number summary?

  • Range (correct)

How is the distance between ordinal values calculated?

  • By using the rank of the values (correct)

What is a key limitation when using equal-frequency partitioning for data presentation?

  • It does not handle skewed data well. (correct)

Which method is employed in smoothing data using the binning approach?

  • Replacing all values in a bin with the bin's mean. (correct)

What is a common challenge when managing categorical attributes in data processing?

  • They can complicate data partitioning. (correct)

What characterizes lossless compression in data handling?

  • It ensures that the original data can be perfectly reconstructed. (correct)

Which of the following is true regarding audio/video compression?

  • It generally employs lossy techniques, often with progressive refinement. (correct)

What does the term 'bin boundaries' refer to in the context of data smoothing?

  • The nearest values used to represent bins. (correct)

What happens to outliers when using straightforward data representation methods?

  • They can dominate the presentation of the data. (correct)

What is the result of using equal-depth partitioning?

  • All intervals contain approximately the same number of samples. (correct)

    Study Notes

    Data, Measurements, and Data Preprocessing

    • This chapter covers various types of data sets and methods for preprocessing data.
    • Data types include relational records, data matrices (e.g., numerical matrices, crosstabs), transaction data, and document data (represented as term-frequency vectors).
    • Graphs and networks are also mentioned, including transportation networks, World Wide Web, molecular structures, and social/information networks.
    • Ordered data is presented, including video data (sequences of images), temporal data (time series), sequential data (transaction sequences), and genetic sequence data.

    Types of Data Sets

    • (1) Record Data: Relational records (highly structured relational tables), data matrices, and crosstabs.
    • Examples include records for persons and cars in a data set, and a data matrix.
    • (2) Graphs and Networks: Transportation networks, the World Wide Web, molecular structures, and social/information networks.
    • (3) Ordered Data: Video data, temporal data (time series), sequential data (transaction sequences), and genetic sequence data.
    • (4) Spatial, Image, and Multimedia Data: Spatial data (maps, vector and raster representations), image data, and video data.

    Data Objects

    • Data sets consist of data objects.
    • Each data object represents an entity.
    • Examples are sales databases (customers, items, sales), medical databases (patients, treatments), and university databases (students, professors, courses).
    • Data objects are characterized by attributes.
    • Database rows represent data objects, while columns represent attributes.

    Attributes

    • Attributes are characteristics or features/variables of a data object.
    • Examples are customer ID, name, and address.
    • Attribute types include nominal (e.g., red, blue), binary (e.g., true, false), ordinal (e.g., freshman, sophomore), and numeric (quantitative), which is either interval-scaled or ratio-scaled.
    • Discrete attributes have finite or countably infinite values; continuous attributes have real-valued attribute values.

    Attribute Types

    • Nominal: Categorical values (e.g., hair color, marital status)
    • Binary: Two states (0/1 or true/false) - symmetric (equal importance) or asymmetric (unequal importance).
    • Ordinal: Values with a meaningful order (e.g., grades, rankings).
    • Numeric: Quantitative values; further categorized as:
      • Interval-scaled: Differences between values are meaningful (e.g., Celsius temperature, calendar dates)
      • Ratio-scaled: Values have a true zero point (e.g., Kelvin temperature, length, counts)
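
Because ordinal values carry an order but no known magnitude between successive values, one common way to compare them is through their ranks. The sketch below is illustrative only: the `LEVELS` list and the helper names are hypothetical, and scaling the rank onto [0, 1] is one conventional choice rather than something these notes prescribe.

```python
# Hypothetical ordinal attribute, listed from lowest to highest level.
LEVELS = ["freshman", "sophomore", "junior", "senior"]

def rank(value, levels=LEVELS):
    """Return the 1-based rank of an ordinal value."""
    return levels.index(value) + 1

def scaled_rank(value, levels=LEVELS):
    """Map the rank onto [0, 1] so attributes with different numbers of
    levels become comparable (an assumed convention)."""
    return (rank(value, levels) - 1) / (len(levels) - 1)

def ordinal_distance(a, b, levels=LEVELS):
    """Distance between two ordinal values, computed on their scaled ranks."""
    return abs(scaled_rank(a, levels) - scaled_rank(b, levels))

print(rank("junior"))                          # 3
print(ordinal_distance("freshman", "senior"))  # 1.0
```

The distance reflects only position in the ordering; it says nothing about how far apart the underlying quantities really are.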

    Statistics of Data

    • Measures of central tendency (mean, median, mode) and dispersion are important for understanding the data distribution.
    • Covariance and correlation analysis reveal the relationship between numerical variables.
    • Basic statistical descriptions can be displayed graphically with boxplots, histograms, quantile plots, and scatter plots.

    Measuring the Central Tendency

    • Mean: The average value (algebraic measure, sample vs. population)
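    • Weighted arithmetic mean: Multiplies each value by its weight, sums the products, and divides by the sum of the weights.
    • Trimmed mean: The mean computed after removing extreme values at both ends.
    • Median: The middle value of the ordered data (the average of the two middle values when the count is even). It can be estimated by interpolation on grouped data.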
    • Mode: The most frequent value.
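
As a rough illustration of these measures, the following sketch uses Python's standard `statistics` module together with two small helpers. The helper names (`weighted_mean`, `trimmed_mean`) and the sample values are assumptions made for this example.

```python
import statistics

def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values."""
    kept = sorted(values)[k:len(values) - k]
    return sum(kept) / len(kept)

# Hypothetical sample with one extreme value (110).
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(statistics.mean(data))            # 58
print(statistics.median(data))          # 54.0
print(statistics.multimode(data))       # [52, 70]
print(trimmed_mean(data, 1))            # 55.6
print(weighted_mean([80, 90], [1, 3]))  # 87.5
```

Note how the single extreme value pulls the mean (58) above the median (54.0), while trimming one value from each end moves the mean back toward the middle of the data.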

    Measuring the Dispersion of Data

    • Quartiles: Q₁ (25th percentile) and Q₃ (75th percentile); the interquartile range is IQR = Q₃ − Q₁.
    • Five-number summary: Min, Q₁, median, Q₃, and Max.
    • Boxplots: Graphical representation of the five-number summary.
    • Outliers: Values lying more than 1.5 × IQR below Q₁ or above Q₃ are commonly flagged as potential outliers.
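
The five-number summary and the common 1.5 × IQR outlier rule can be sketched as follows. Quartile conventions differ between textbooks and tools; this example assumes the "inclusive" interpolation method of `statistics.quantiles`, and the function names and data are hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile
    method from the standard library (one convention among several)."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(five_number_summary(data))  # (30, 49.25, 54.0, 64.75, 110)
print(iqr_outliers(data))         # [110]
```

With a different quartile convention the quartiles, and therefore the outlier fences, may shift slightly.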

    Measuring Data Distribution

    • Variance: A measure of the spread of data points around the mean.
    • Standard deviation: The square root of variance.
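    • The population variance divides the sum of squared deviations from the mean by N; the sample variance divides by n − 1. The corresponding standard deviations are their square roots.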
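
A minimal sketch of the population and sample variance, written out explicitly rather than through a library call so the two divisors are visible. The function names and the data are hypothetical.

```python
import math

def population_variance(values):
    """Average squared deviation from the mean (divide by N)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n

def sample_variance(values):
    """Squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values
print(population_variance(data))             # 4.0
print(math.sqrt(population_variance(data)))  # 2.0
print(sample_variance(data))                 # about 4.571
```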

    Correlation

    • Correlation measures the linear relationship between two numerical variables.
    • It is the covariance of the two variables divided by the product of their standard deviations.
    • Correlation values range from -1 to 1: values near 1 indicate a strong positive linear relationship, values near -1 a strong negative one, and values near 0 little or no linear relationship.
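
To make the "normalized covariance" idea concrete, here is a small sketch of the Pearson correlation coefficient. The function name `pearson` and the sample vectors are assumptions for illustration; the sketch does not guard against constant inputs, which would give a zero standard deviation.

```python
import math

def pearson(xs, ys):
    """Covariance of xs and ys divided by the product of their standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)

x = [1, 2, 3, 4, 5]
print(pearson(x, [2, 4, 6, 8, 10]))  # close to 1: y rises with x
print(pearson(x, [10, 8, 6, 4, 2]))  # close to -1: y falls as x rises
print(pearson(x, [2, 1, 3, 1, 2]))   # close to 0: no linear relationship
```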

    Visualizing Changes of Correlation Coefficients

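    • Scatter plots make the strength and direction of correlation visible as the coefficient varies between -1 and 1: points cluster around a rising line near 1 and around a falling line near -1.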

    Graphic Displays of Basic Statistical Descriptions

    • Boxplots: Graphically show the five-number summary (minimum, Q1, median, Q3, maximum).

    • Histograms: Display data frequencies by creating bins of values along the X-axis.

    • Quantile plots: Pair each value with its rank to show the data distribution.

    • Scatter plots: Plot pairs of values of two attributes.

    Data Quality Issues

    • Measures of data quality include accuracy, completeness, consistency, believability, and interpretability.

    Data Cleaning

    • Incomplete data: Missing attribute values.
    • Noisy data: Errors or random variations in measured values.
    • Inconsistent data: Different representations or values for the same entity.
    • Handling missing values: Ignore the record, fill in values manually (tedious and often infeasible), or impute them with a global constant, the attribute mean or most frequent value, or the attribute mean of samples in the same class.
    • Handling noisy data: Smooth the data by binning, or detect and remove outliers, for example by clustering.
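
Two of the cleaning techniques mentioned here, filling a missing value with the mean of its class and smoothing by bin means, can be sketched as below. The function names, the dictionary-based row layout, and the sample data are all assumptions for this example. The imputation helper also assumes every class has at least one observed value.

```python
from collections import defaultdict

def impute_by_class_mean(rows, attr, label):
    """Fill missing (None) values of `attr` with the mean of that attribute
    over rows sharing the same class label."""
    groups = defaultdict(list)
    for row in rows:
        if row[attr] is not None:
            groups[row[label]].append(row[attr])
    means = {cls: sum(vals) / len(vals) for cls, vals in groups.items()}
    return [
        {**row, attr: means[row[label]]} if row[attr] is None else dict(row)
        for row in rows
    ]

def smooth_by_bin_means(values, depth):
    """Sort values, split them into equal-depth bins of `depth` items, and
    replace every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for i in range(0, len(ordered), depth):
        bin_ = ordered[i:i + depth]
        smoothed.extend([sum(bin_) / len(bin_)] * len(bin_))
    return smoothed

rows = [
    {"income": 40, "risk": "low"},
    {"income": 60, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 20, "risk": "high"},
]
print(impute_by_class_mean(rows, "income", "risk")[2])
# {'income': 50.0, 'risk': 'low'}

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```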

    Data Integration

    • Combining data from multiple sources (databases, data cubes, files).
    • Goals include reducing redundancy and inconsistency, obtaining a more complete data view, and improving the speed and quality of data mining.
    • Methods include schema integration (e.g., merging metadata from different sources), entity identification (matching real-world entities that appear under different names in different sources), and handling data value conflicts (e.g., conflicting values of the same attribute in different files for the same real-world object).

    Data Transformation

    • Normalization: Scaling data to fall in a specified range (e.g., min-max, z-score, normalization by decimal scaling)
    • Discretization: Converting continuous data into categorical data (e.g., equal-width, equal-frequency)
    • Data compression: Reduces data size, either losslessly (the original data can be perfectly reconstructed) or lossily.
    • Sampling: Reducing data set size by selecting a representative subset.
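
The three normalization methods named above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names and sample values are hypothetical, `z_score` uses the population standard deviation as an assumed choice, and `min_max` does not guard against a constant attribute (where max equals min).

```python
import statistics

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map values from [min, max] onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Center on the mean and scale by the population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by the smallest power of 10 that brings every |value| below 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

incomes = [12000, 54000, 73600, 98000]  # hypothetical values
print(min_max(incomes))                   # smallest maps to 0.0, largest to 1.0
print(z_score([2, 4, 4, 4, 5, 5, 7, 9]))  # mean 5 and deviation 2 give -1.5 ... 2.0
print(decimal_scaling([-986, 917]))       # [-0.986, 0.917]
```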

    Automatic Concept Hierarchy Generation

    • Concept hierarchies can be generated automatically from the attributes of a data set.
    • Example attribute groups are (weekday, month, quarter, year) for time and (country, province/state, city, street) for location.
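
One widely used heuristic for generating a hierarchy automatically, assumed here since the notes do not name a specific method, is to order attributes by their number of distinct values: the attribute with the fewest distinct values goes at the top. The function name and the counts below are hypothetical.

```python
def order_hierarchy(distinct_counts):
    """Order attributes from the top of the hierarchy (fewest distinct values)
    to the bottom (most distinct values)."""
    return sorted(distinct_counts, key=distinct_counts.get)

# Hypothetical distinct-value counts for location attributes.
counts = {"street": 674339, "country": 15, "city": 3567, "province_or_state": 365}
print(order_hierarchy(counts))
# ['country', 'province_or_state', 'city', 'street']
```

The heuristic does not always hold. Time attributes are a common exception: there are only 7 weekdays and 12 months, yet a data set may span more than 12 years, so the ordering by distinct counts would place year below month.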

    Sampling

    • Methods for selecting a representative subset from a large data set.
    • Sampling methods include simple random sampling and stratified sampling.
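
The difference between the two sampling methods can be sketched with the standard `random` module. The function names, the `fraction` parameter, the rule of taking at least one item per stratum, and the customer data are assumptions for this example.

```python
import random
from collections import defaultdict

def simple_random_sample(items, n, seed=None):
    """Draw n items without replacement, each item equally likely."""
    return random.Random(seed).sample(items, n)

def stratified_sample(items, key, fraction, seed=None):
    """Sample roughly the same fraction from every stratum so that small
    groups stay represented (at least one item per stratum)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical customers: 90 in the "adult" group and 10 in the "senior" group.
customers = [("adult", i) for i in range(90)] + [("senior", i) for i in range(10)]
picked = stratified_sample(customers, key=lambda c: c[0], fraction=0.1, seed=0)
print(len(picked))  # 10: nine adults and one senior
```

A simple random sample of 10 customers could by chance contain no seniors at all; the stratified version guarantees the smaller group appears.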

    Cosine Similarity

    • Cosine similarity is a measure of the similarity between two vectors (e.g., term-frequency vectors, gene feature vectors).
    • It's used to determine the degree of similarity between two documents (or other objects) based on the angle between their vectors in a vector space.
    • The cosine similarity of two vectors is their dot product divided by the product of their lengths: cos(d1, d2) = (d1 · d2) / (‖d1‖ ‖d2‖).
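
A minimal sketch of cosine similarity, computed as the dot product divided by the product of the vector lengths. The function name and the two term-frequency vectors are hypothetical, and the sketch does not handle all-zero vectors, for which the measure is undefined.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical term-frequency vectors for two documents.
d1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
d2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]
print(round(cosine_similarity(d1, d2), 2))  # 0.94
```

Dividing by the lengths means a long document and a short one with the same term proportions score as highly similar, since only the angle between the vectors matters.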

    Description

    Test your knowledge on data attributes including nominal, ordinal, interval, and ratio scales. Learn about handling missing values, data noise, and the purpose of binning in data processing. This quiz covers key concepts important for data handling and analysis.