Data Management: Concepts and Techniques

Podcast

Play an AI-generated podcast conversation about this lesson

Download our mobile app to listen on the go

Get App

Questions and Answers

The data ______ encompasses the stages of creation, storage, access, and preservation of data objects.

lifecycle

Before processing, assessing data ______ is important to address problems, such as missing values, inconsistencies and noise

quality

______ involves resolving redundancies and correlations to enhance data integrity and storage efficiency during data integration.

Data cleaning

Methods such as forward selection and backward elimination are examples of attribute ______ selection techniques used in data reduction.

subset Signup and view all the answers

______ with and without replacement is a data reduction technique used to reduce data.

Sampling Signup and view all the answers

______, which allows a user to extract and view data from different angles, holds a central role in the realm of data reduction in data management.

Data cube Signup and view all the answers

In data management, ______ encompasses the policies and procedures that govern the collection, storage, and usage of data within an organization.

data governance Signup and view all the answers

Within data architecture, contrasting concepts such as tight versus loose ______ dictate the level of inter-dependence between system components.

coupling Signup and view all the answers

[Blank] systems, such as those used for purchases and banking records, are common sources of data generation.

Transactional Signup and view all the answers

The process of extracting data from websites is known as ______.

Web scraping Signup and view all the answers

[Blank] and Playwright are examples of headless browsers, often used for web scraping and automated testing.

Puppeteer Signup and view all the answers

In the context of data warehousing, information collected from multiple sources is stored under a unified schema, typically residing at a single site; this structure is know as ______.

Data warehouse Signup and view all the answers

In data analysis, rows of a database correspond to data objects, while columns correspond to ______.

attributes Signup and view all the answers

The rows of a database correspond to the data objects, and the columns correspond to the ______.

attributes Signup and view all the answers

A ______ attribute is a categorical attribute that relates to names and does not have any inherent order.

Nominal Signup and view all the answers

A ______ attribute is a type of nominal attribute with only two categories, and is often referred to as Boolean.

Binary Signup and view all the answers

A ______ attribute is characterized by data with a specific order but without equal intervals between categories, such as education level or product ratings.

ordinal Signup and view all the answers

[Blank] data refers to data with infinite possible values within a given range, exemplified by measurements like height, weight, and temperature.

continuous Signup and view all the answers

The ______ is a measure of central tendency that represents the value separating the higher half from the lower half of a data sample, useful when data is skewed.

median Signup and view all the answers

When using Pandas, the `______` method is a convenient shortcut to count the number of entries in each category of a variable, providing insights into data distribution.

value_counts Signup and view all the answers

The ______ is calculated as the average of the squared differences between each data point and the mean, quantifying the spread of data around the average.

variance Signup and view all the answers

[Blank] divide a sorted dataset into four equal parts, with Q2 representing the median of the entire dataset.

quartiles Signup and view all the answers

The ______ is calculated as Q3 - Q1 and provides a measure of statistical dispersion, indicating the spread of the middle 50% of the data.

interquartile range Signup and view all the answers

A ______ measure returns a value of 0 if two objects are completely unlike and increases as the objects become more similar, typically reaching 1 for identical objects.

similarity Signup and view all the answers

A data matrix, also known as a 'two-mode' matrix, organizes `n` data objects as an ________ table, showing `n` objects by `p` attributes.

relational Signup and view all the answers

Unlike a data matrix, a ________ matrix stores proximities for all pairs of `n` objects, indicating the dissimilarity or difference `d(i, j)` between objects `i` and `j`.

dissimilarity Signup and view all the answers

For nominal attributes, dissimilarity can be measured using methods that account for the absence or presence of specific ________ across different objects.

characteristics Signup and view all the answers

The ________ distance measures the 'straight-line' distance between two points, allowing for diagonal movement and representing the shortest path.

Euclidean Signup and view all the answers

The ________ distance calculates the distance between two points as the sum of the absolute differences of their coordinates, resembling movement along city blocks.

Manhattan Signup and view all the answers

The Minkowski distance is a generalization of both Euclidean and Manhattan distances, defined as $\sqrt______{\sum_{i=1}^{n} |x_i - y_i|^p}$, where varying ________ values change the nature of the distance calculated.

p Signup and view all the answers

Also known as Lmax or L∞ norm, ________ distance quantifies the maximum difference along any coordinate dimension between two points in a multidimensional space.

Chebyshev Signup and view all the answers

While the triangle inequality specifies that $d(x, z) ≤ d(x, y) + d(y, z)$, the ________ distance quantifies dissimilarity between data sets based on the ratio of shared characteristics to the total characteristics.

Jaccard Signup and view all the answers

Flashcards

Binary Attribute

Attributes with only two categories.

Ordinal Attribute

Attributes with a meaningful order but inconsistent intervals.