Data Management: Concepts and Techniques

Questions and Answers

The data ______ encompasses the stages of creation, storage, access, and preservation of data objects.

lifecycle

Before processing, assessing data ______ is important for addressing problems such as missing values, inconsistencies, and noise.

quality

______ involves resolving redundancies and correlations to enhance data integrity and storage efficiency during data integration.

Data cleaning

Methods such as forward selection and backward elimination are examples of attribute ______ selection techniques used in data reduction.

subset

______, with or without replacement, is a technique used for data reduction.

Sampling

______, which allows a user to extract and view data from different angles, plays a central role in data reduction.

Data cube

In data management, ______ encompasses the policies and procedures that govern the collection, storage, and usage of data within an organization.

data governance

Within data architecture, contrasting concepts such as tight versus loose ______ dictate the level of inter-dependence between system components.

coupling

______ systems, such as those used for purchases and banking records, are common sources of data generation.

Transactional

The process of extracting data from websites is known as ______.

Web scraping

______ and Playwright are examples of headless browser tools, often used for web scraping and automated testing.

Puppeteer

In the context of data warehousing, information collected from multiple sources is stored under a unified schema, typically residing at a single site; this structure is known as a ______.

data warehouse

In data analysis, rows of a database correspond to data objects, while columns correspond to ______.

attributes

The rows of a database correspond to the data objects, and the columns correspond to the ______.

attributes

A ______ attribute is a categorical attribute that relates to names and does not have any inherent order.

Nominal

A ______ attribute is a type of nominal attribute with only two categories, and is often referred to as Boolean.

Binary

A ______ attribute is characterized by data with a specific order but without equal intervals between categories, such as education level or product ratings.

ordinal

______ data refers to data with infinite possible values within a given range, exemplified by measurements like height, weight, and temperature.

Continuous

The ______ is a measure of central tendency that represents the value separating the higher half from the lower half of a data sample, useful when data is skewed.

median

When using Pandas, the ______ method is a convenient shortcut to count the number of entries in each category of a variable, providing insights into data distribution.

value_counts

The ______ is calculated as the average of the squared differences between each data point and the mean, quantifying the spread of data around the average.

variance

______ divide a sorted dataset into four equal parts, with Q2 representing the median of the entire dataset.

Quartiles

The ______ is calculated as Q3 - Q1 and provides a measure of statistical dispersion, indicating the spread of the middle 50% of the data.

interquartile range

A ______ measure returns a value of 0 if two objects are completely unlike and increases as the objects become more similar, typically reaching 1 for identical objects.

similarity

A data matrix, also known as a 'two-mode' matrix, organizes n data objects as a ________ table, showing n objects by p attributes.

relational

Unlike a data matrix, a ________ matrix stores proximities for all pairs of n objects, indicating the dissimilarity or difference d(i, j) between objects i and j.

dissimilarity

For nominal attributes, dissimilarity can be measured using methods that account for the absence or presence of specific ________ across different objects.

characteristics

The ________ distance measures the 'straight-line' distance between two points, allowing for diagonal movement and representing the shortest path.

Euclidean

The ________ distance calculates the distance between two points as the sum of the absolute differences of their coordinates, resembling movement along city blocks.

Manhattan

The Minkowski distance is a generalization of both Euclidean and Manhattan distances, defined as $\left(\sum_{i=1}^{n} |x_i - y_i|^p\right)^{1/p}$, where varying ________ values change the nature of the distance calculated.

p

Also known as Lmax or L∞ norm, ________ distance quantifies the maximum difference along any coordinate dimension between two points in a multidimensional space.

Chebyshev

While the triangle inequality specifies that $d(x, z) \leq d(x, y) + d(y, z)$, the ________ distance quantifies dissimilarity between data sets based on the ratio of shared characteristics to the total characteristics.

Jaccard

Flashcards

Binary Attribute

Attributes with only two categories.

Ordinal Attribute

Attributes with a meaningful order but no guarantee of equal intervals between categories.

Interval Scaled Attribute

Numeric attributes measured in equal-sized units, with no true zero point.

Ratio Scaled Attribute

Numeric attributes with a true zero point.

Continuous Data

Infinite possible values within a range.

Discrete Data

A finite or countably infinite set of values.

Central Tendency

Measures the location of the middle or center of a data distribution.

Standard Deviation

The square root of the variance; roughly, the typical distance between a data point and the mean.

Data Cleaning

The process of correcting inaccurate, incomplete, or irrelevant data.

Data Integration

Combining data from multiple sources into a unified dataset.

Data Selection

The process of choosing relevant data for analysis.

Data Transformation

Converting data into a suitable format for analysis.

Data Generation/Collection

Data generated through active input (e.g., surveys) or passive collection (e.g., sensor logs).

Web Scraping

Extracting data from websites, often for analysis or aggregation.

Relational Database

A structured collection of tables with related data.

Nominal Attribute

Categorical data without inherent order (e.g. colors).

Data Matrix

Stores n data objects as an n-by-p matrix (n objects x p attributes). Also called a 'two-mode' matrix.

Dissimilarity Matrix

Stores proximities (similarities or dissimilarities) for all pairs of n objects. Also called a 'one-mode' matrix.

Euclidean Distance

Straight-line distance between two points.

Manhattan Distance

Sum of absolute differences of coordinates.

Minkowski Distance

A generalization of Euclidean and Manhattan distances.

Chebyshev / Supremum Distance

Maximum difference along any coordinate dimension.

Non-negativity

Distance is always non-negative: d(x, y) ≥ 0

Jaccard Distance

Measures dissimilarity between two sets of data.

Data Curation

The process of organizing, maintaining, and adding value to data throughout its lifecycle.

Data Lifecycle

The complete journey of data from its creation to its eventual disposal or archiving.

Handling Missing Values

Techniques for dealing with missing values, either by imputing them or by removing the affected records.

Data Reduction

The process of reducing the volume of data while preserving its integrity for analysis.

ETL

Extract, Transform, Load: a traditional data integration approach that transforms data before loading it into the data warehouse.

ELT

Extract, Load, Transform: a modern data integration approach that loads data first and transforms it afterwards.

Data Governance

Management of data assets in an organization with policies and procedures related to data quality, availability and usability.

Study Notes

  • The world is data rich, but information poor
  • Key steps in data curation are cleaning, integration, selection, and transformation, leading to analysis and visualization

Data Lifecycle

  • The stages of the data lifecycle are generation, collection, processing, storage, interpretation, visualization, analysis, and management
  • Data is created or acquired from various sources during generation
  • Data is gathered from different sources and prepared for processing during collection
  • Raw data is processed and manipulated to be usable and consistent during processing
  • Processed data is securely stored in databases or data warehouses during storage
  • Results are interpreted to inform decision-making and drive actions during interpretation
  • Insights are presented in graphical or visual formats for easier interpretation during visualization
  • Data is examined to extract insights and patterns during data analysis
  • Data is organized, maintained, and governed to ensure quality and accessibility during data management

Data Generation vs Data Collection

  • Data generation can be either active or passive
  • Common sources of data include human-generated data like surveys and forms, machine-generated data like IoT sensors and logs, transactional systems like purchases and banking records, and web scraping

Web Scraping

  • Web scraping is the extraction of data from websites
  • Considerations for web scraping include how the website content is presented, whether it is structured, and how the extracted content will be saved
  • Sentiment analysis is a common application of web scraping
  • Possible ways to do web scraping include using libraries in programming languages, browser automation tools, APIs for data retrieval, headless browsers, and no-code or low-code tools
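
As an illustration of the library-based approach, here is a minimal sketch that extracts structured records from HTML using only Python's standard-library html.parser module. The HTML snippet, the class names ("product", "name", "price"), and the ProductParser class are hypothetical; a real scraper would first download the page and would need to match that site's actual markup.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this HTML first.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Kettle</span> <span class="price">24.99</span></li>
  <li class="product"><span class="name">Toaster</span> <span class="price">31.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects the text inside <span class="name"> and <span class="price">."""

    def __init__(self):
        super().__init__()
        self.current_field = None  # field whose text is currently being read
        self.rows = []             # one dict per product list item

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.rows.append({})   # start a new record
        elif tag == "span" and css_class in ("name", "price"):
            self.current_field = css_class

    def handle_endtag(self, tag):
        if tag == "span":
            self.current_field = None

    def handle_data(self, data):
        # Only keep text that appears inside a tracked span.
        if self.current_field and self.rows:
            self.rows[-1][self.current_field] = data.strip()


parser = ProductParser()
parser.feed(SAMPLE_HTML)

# Save the extracted content in a structured form (a list of dicts here).
products = [{"name": r["name"], "price": float(r["price"])} for r in parser.rows]
print(products)
```

The same considerations from the notes apply: the parser depends on how the content is presented and structured, and the final list of dictionaries is one possible way of saving the extracted content.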

Headless Browsers

  • Headless browser tools such as Puppeteer and Playwright have applications in e-commerce price tracking and social media monitoring

Common Data Sources

  • Relational databases consist of tables with unique names, attributes (columns or fields), and tuples (records or rows), and are modeled using the Entity-Relationship Model
  • Data warehouses are repositories of information collected from multiple sources under a unified schema, residing at a single site, and often modeled as a data cube

Data Objects

  • Datasets are composed of data objects representing entities
  • In sales data, objects can be customers, store items, and sales
  • Data objects can also be referred to as samples, examples, instances, or data points
  • Data objects stored in a database are known as data tuples
  • Data objects are described by attributes; the rows of a database correspond to the data objects, while the columns correspond to the attributes
  • The terms attribute, dimension, feature, and variable are often used interchangeably

Types of Data Attributes

  • Nominal attributes relate to names and are categorical, like gender, color, or city, without any inherent order
  • Binary attributes are nominal attributes with only two categories and are also called Boolean
  • Ordinal attributes have a specific order but lack equal intervals between categories, such as education level or product ratings (low, medium, high)
  • Numerical attributes can be interval scaled, with no true zero point, like temperature in Celsius or Fahrenheit, or ratio scaled, with a true zero point, like temperature in Kelvin
  • Continuous data has infinite possible values within a given range, such as height, weight, or temperature
  • Discrete data has a finite or countably infinite set of values, such as the number of children or products sold

Statistical Description of Data

  • Statistical description aims to provide an overall picture of the data
  • Central tendency measures the location of the middle or center of a data distribution
  • Dispersion measures how the data is spread out
  • Common approaches for measuring central tendency include the mean, median, and mode
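
A small sketch of the three measures, using Python's standard statistics module. The income values are made up for illustration; they include one large outlier to show why the median is often preferred for skewed data.

```python
import statistics

# Hypothetical incomes (in thousands); 250 is a deliberate outlier.
incomes = [30, 32, 35, 35, 38, 40, 250]

mean_income = statistics.mean(incomes)      # pulled upward by the outlier
median_income = statistics.median(incomes)  # middle value of the sorted data
mode_income = statistics.mode(incomes)      # most frequent value

print(mean_income, median_income, mode_income)
```

Here the median and mode are both 35, while the mean is roughly 65.7, well above most of the values.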

Summary Statistics using Pandas

  • size counts all entries, including NaN values, while count excludes NaN values
  • value_counts is a shortcut to count the number of entries in each category of a variable
  • Operations typically follow a Split-Apply-Combine strategy
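
The points above can be sketched with a tiny DataFrame. The column names and values are hypothetical, and the example assumes pandas is installed.

```python
import pandas as pd

# Hypothetical sales records; one sales value is missing (NaN).
df = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Bergen", "Bergen", "Bergen"],
    "sales": [10.0, float("nan"), 7.0, 3.0, 5.0],
})

n_rows = df["sales"].size       # counts every entry, NaN included
n_valid = df["sales"].count()   # counts only non-NaN entries

# Number of entries in each category of the "city" variable.
city_counts = df["city"].value_counts()

# Split-Apply-Combine: split rows by city, apply mean to each group,
# combine the per-group results into one Series.
mean_by_city = df.groupby("city")["sales"].mean()

print(n_rows, n_valid)
print(city_counts)
print(mean_by_city)
```

Note that mean skips NaN by default, so the Oslo group average is computed from its single valid value.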

Measuring the Dispersion of Data

  • Range is calculated as the maximum value minus the minimum value
  • Standard deviation is the square root of the variance and measures the typical distance between each data point and the mean
  • Variance is the average of the squared differences between each data point and the mean
  • Quartiles divide the dataset into four equal parts when sorted
  • Q1 (1st quartile) is the median of the lower half (25th percentile)
  • Q2 (2nd quartile) is the median of the dataset (50th percentile)
  • Q3 (3rd quartile) is the median of the upper half (75th percentile)
  • Interquartile Range (IQR) is the distance between the first and third quartiles and is calculated as IQR = Q3 - Q1
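
A minimal sketch of these dispersion measures with Python's standard library. The data values and the quartiles helper are illustrative assumptions. The helper follows the median-of-halves description above (for an odd number of values it leaves the middle value out of both halves, which is one of several conventions); library functions such as statistics.quantiles may use other conventions and give slightly different quartiles.

```python
import statistics


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # For an odd number of values, skip the middle element.
    upper = data[half:] if len(data) % 2 == 0 else data[half + 1:]
    return (statistics.median(lower),
            statistics.median(data),
            statistics.median(upper))


data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical sample, mean = 5

value_range = max(data) - min(data)      # maximum minus minimum
variance = statistics.pvariance(data)    # average squared difference from the mean
std_dev = statistics.pstdev(data)        # square root of the variance
q1, q2, q3 = quartiles(data)
iqr = q3 - q1

print(value_range, variance, std_dev, (q1, q2, q3), iqr)
```

The sketch uses the population formulas (pvariance, pstdev) because they match the "average of the squared differences" definition; the sample versions divide by n - 1 instead.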

Measuring Data Similarity / Dissimilarity

  • Measuring similarity and dissimilarity helps understand relationships between data points and is also called proximity measurement
  • A similarity measure for two objects, i and j, typically returns a value of 0 if the objects are completely unalike
  • The higher the similarity value, the greater the similarity between objects, where a value of 1 typically indicates complete similarity
  • A dissimilarity measure works the opposite way: it returns 0 for identical objects and grows as the objects become more different
  • Data can be represented as either a data matrix or a dissimilarity matrix

Data vs Dissimilarity Matrix

  • A data matrix stores n data objects in a relational table or an n-by-p matrix, where n is the number of objects and p is the number of attributes; it is also called a "two-mode" matrix
  • A dissimilarity matrix stores a collection of proximities available for all pairs of n objects; it is also called a "one-mode" matrix
  • d(i, j) represents the measured dissimilarity or difference between objects i and j
    • sim(i, j) = 1 - d(i, j), where sim is the similarity and d is scaled to the range [0, 1]
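
To make the difference concrete, the sketch below builds an n-by-n dissimilarity matrix from a small n-by-p data matrix. The values are hypothetical, and straight-line (Euclidean) distance is used as the dissimilarity d(i, j) purely as an example; any other dissimilarity measure could be substituted.

```python
import math

# Data matrix: n = 3 objects (rows) by p = 2 attributes (columns).
data_matrix = [
    [0.0, 0.0],
    [3.0, 4.0],
    [6.0, 8.0],
]


def euclidean(x, y):
    """Straight-line distance between two equal-length rows."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def dissimilarity_matrix(rows):
    """Return the n-by-n matrix whose (i, j) entry is d(i, j)."""
    n = len(rows)
    return [[euclidean(rows[i], rows[j]) for j in range(n)] for i in range(n)]


d = dissimilarity_matrix(data_matrix)
for row in d:
    print(row)
```

The result is symmetric with zeros on the diagonal, which is why such matrices are often stored as a lower triangle only.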

Measuring Dissimilarity - Nominal Attributes

  • Dissimilarity between nominal attributes is measured based on whether the attributes match or differ
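
The notes do not give a formula, so the sketch below assumes one common formulation, simple matching: d(i, j) = (p - m) / p, where p is the number of attributes and m is the number of attributes on which the two objects match. The function name and example objects are hypothetical.

```python
def nominal_dissimilarity(obj_a, obj_b):
    """Simple-matching dissimilarity: (p - m) / p."""
    if len(obj_a) != len(obj_b):
        raise ValueError("objects must have the same number of attributes")
    p = len(obj_a)                                         # total attributes
    m = sum(1 for a, b in zip(obj_a, obj_b) if a == b)     # matching attributes
    return (p - m) / p


# Hypothetical objects described by (color, city, occupation).
obj_i = ("red", "Paris", "student")
obj_j = ("red", "Rome", "teacher")

d_ij = nominal_dissimilarity(obj_i, obj_j)   # 1 match out of 3 attributes
print(d_ij)
```

With one match out of three attributes the dissimilarity is 2/3; identical objects give 0 and objects with no matches give 1.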

Euclidean Distance

  • Measures the straight-line distance
  • Diagonal movement is allowed
  • The shortest distance between any two points
  • Calculated as d(x, y) = √((x1 - y1)² + (x2 - y2)²) in a two-dimensional plane
  • Extended to n dimensions: d(x, y) = √(Σ(xi - yi)²)

Manhattan Distance

  • Measures the sum of the absolute differences of the coordinates
  • Calculated as d(x, y) = |x1 - y1| + |x2 - y2| in a 2-D plane
  • In n-dimensional space: d(x, y) = ∑ |xi - yi|

Minkowski Distance

  • Minkowski distance is a generalization of the Euclidean and Manhattan distances
  • Defined as d(x, y) = (∑|xi - yi|^p)^(1/p) for p ≥ 1
  • If p = 1, then the Minkowski distance is the same as the Manhattan distance (L1 norm)
  • If p = 2, then the Minkowski distance is equivalent to the Euclidean distance (L2 norm)
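
A short sketch of the definition above, showing that p = 1 and p = 2 reproduce the Manhattan and Euclidean distances. The function name and the sample points are illustrative.

```python
def minkowski(x, y, p):
    """Minkowski distance (sum |xi - yi|^p)^(1/p), defined here for p >= 1."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


x, y = (1, 2), (4, 6)   # coordinate differences are 3 and 4

manhattan = minkowski(x, y, 1)   # |3| + |4|
euclidean = minkowski(x, y, 2)   # sqrt(3^2 + 4^2)

print(manhattan, euclidean)
```

For these points the Manhattan distance is 7 and the Euclidean distance is 5, so the city-block path is longer than the straight line.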

Chebyshev / Supremum Distance

  • Measures the maximum difference along any coordinate dimension between two points in a multidimensional space
  • It is also known as Lmax, L∞ norm, or uniform norm
  • d_Chebyshev(P, Q) = max_i |p_i - q_i|
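
The sketch below computes the Chebyshev distance directly and, as a sanity check, compares it with a Minkowski distance using a large p, since the Chebyshev distance is the limit of the Minkowski distance as p grows. The function names, the sample points, and the choice p = 50 are illustrative.

```python
def chebyshev(p_point, q_point):
    """Maximum absolute coordinate difference between two points."""
    return max(abs(a - b) for a, b in zip(p_point, q_point))


def minkowski(x, y, p):
    """Minkowski distance (sum |xi - yi|^p)^(1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


p_point, q_point = (1, 2, 3), (4, 0, 8)   # differences: 3, 2, 5

d_max = chebyshev(p_point, q_point)            # largest difference is 5
d_large_p = minkowski(p_point, q_point, 50)    # very close to 5

print(d_max, d_large_p)
```

As p increases, the largest coordinate difference dominates the sum, which is why the two values nearly coincide.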

Mathematical Properties

  • Non-negativity: distance is never negative, d(x, y) ≥ 0
  • Identity: the distance between a point and itself is always 0, and d(x, y) = 0 if and only if x = y
  • Symmetry: the order of points doesn't matter in distance calculation, d(x, y) = d(y, x)
  • Triangle inequality: the direct distance from x to z cannot be greater than the distance from x to y plus the distance from y to z, d(x, z) ≤ d(x, y) + d(y, z)
  • For the Chebyshev distance, this means the maximum difference between x and z along any dimension cannot be greater than the sum of the maximum differences from x to y and from y to z
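
The properties listed above can be spot-checked numerically. The sketch below tests them on a handful of hypothetical points; passing such a check is not a proof, but a failure is a genuine counterexample. Squared Euclidean distance is included to show a measure that breaks the triangle inequality.

```python
import itertools
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def satisfies_metric_properties(dist, points, tol=1e-12):
    """Check the four properties on every triple of the given points."""
    for x, y, z in itertools.product(points, repeat=3):
        if dist(x, y) < 0:                                # non-negativity
            return False
        if (dist(x, y) == 0) != (x == y):                 # identity
            return False
        if abs(dist(x, y) - dist(y, x)) > tol:            # symmetry
            return False
        if dist(x, z) > dist(x, y) + dist(y, z) + tol:    # triangle inequality
            return False
    return True


points = [(0, 0), (3, 4), (6, 8), (-1, 2)]

print(satisfies_metric_properties(euclidean, points))
print(satisfies_metric_properties(squared_euclidean, points))
```

For the collinear points (0, 0), (3, 4), (6, 8), the squared distances are 25, 25, and 100, so 100 > 25 + 25 and the triangle inequality fails for the squared version.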

Jaccard Distance

  • Quantifies the dissimilarity between two sets of data
  • Derived from the Jaccard Index (or Jaccard Similarity Coefficient)
  • Jaccard Distance = 1 - Jaccard Index
  • Jaccard Index = |A∩B| / |A∪B|
  • |A∩B|: the number of elements common to both sets A and B (intersection)
  • |A∪B|: the number of unique elements in either set A or B (union)
  • The Jaccard Index ranges from 0 to 1, where 1 indicates identical sets and 0 indicates disjoint sets
  • Use cases: comparing documents or text for similarity, or comparing sets of pixels in images
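
A minimal sketch of the Jaccard Index and Jaccard Distance applied to the document-comparison use case, treating each text as a set of words. The two sentences are made up, and returning 1.0 for two empty sets is a convention assumed here, since the ratio is otherwise undefined.

```python
def jaccard_index(a, b):
    """|A ∩ B| / |A ∪ B| for two collections treated as sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0   # assumed convention: two empty sets count as identical
    return len(set_a & set_b) / len(set_a | set_b)


def jaccard_distance(a, b):
    """Jaccard Distance = 1 - Jaccard Index."""
    return 1.0 - jaccard_index(a, b)


doc_a = "the quick brown fox".split()
doc_b = "the lazy brown dog".split()

index = jaccard_index(doc_a, doc_b)        # 2 shared words out of 6 unique words
distance = jaccard_distance(doc_a, doc_b)

print(index, distance)
```

The two documents share "the" and "brown" out of six distinct words, giving an index of 1/3 and a distance of 2/3.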

Description

Explore data management principles covering creation, storage, access, and preservation. Learn about data quality assessment to address inconsistencies, and data integration techniques to resolve redundancies. Discover attribute selection and sampling methods for effective data reduction.
