Questions and Answers
The data ______ encompasses the stages of creation, storage, access, and preservation of data objects.
lifecycle
Before processing, assessing data ______ is important to address problems, such as missing values, inconsistencies and noise
quality
______ involves resolving redundancies and correlations to enhance data integrity and storage efficiency during data integration.
Data cleaning
Methods such as forward selection and backward elimination are examples of attribute ______ selection techniques used in data reduction.
______ with and without replacement is a data reduction technique used to reduce data.
______, which allows a user to extract and view data from different angles, holds a central role in the realm of data reduction in data management.
In data management, ______ encompasses the policies and procedures that govern the collection, storage, and usage of data within an organization.
Within data architecture, contrasting concepts such as tight versus loose ______ dictate the level of inter-dependence between system components.
[Blank] systems, such as those used for purchases and banking records, are common sources of data generation.
The process of extracting data from websites is known as ______.
[Blank] and Playwright are examples of headless browsers, often used for web scraping and automated testing.
In the context of data warehousing, information collected from multiple sources is stored under a unified schema, typically residing at a single site; this structure is known as ______.
In data analysis, rows of a database correspond to data objects, while columns correspond to ______.
The rows of a database correspond to the data objects, and the columns correspond to the ______.
A ______ attribute is a categorical attribute that relates to names and does not have any inherent order.
A ______ attribute is a type of nominal attribute with only two categories, and is often referred to as Boolean.
A ______ attribute is characterized by data with a specific order but without equal intervals between categories, such as education level or product ratings.
[Blank] data refers to data with infinite possible values within a given range, exemplified by measurements like height, weight, and temperature.
The ______ is a measure of central tendency that represents the value separating the higher half from the lower half of a data sample, useful when data is skewed.
When using Pandas, the ______ method is a convenient shortcut to count the number of entries in each category of a variable, providing insights into data distribution.
The ______ is calculated as the average of the squared differences between each data point and the mean, quantifying the spread of data around the average.
[Blank] divide a sorted dataset into four equal parts, with Q2 representing the median of the entire dataset.
The ______ is calculated as Q3 - Q1 and provides a measure of statistical dispersion, indicating the spread of the middle 50% of the data.
A ______ measure returns a value of 0 if two objects are completely unlike and increases as the objects become more similar, typically reaching 1 for identical objects.
A data matrix, also known as a 'two-mode' matrix, organizes n data objects as an ________ table, showing n objects by p attributes.
Unlike a data matrix, a ________ matrix stores proximities for all pairs of n objects, indicating the dissimilarity or difference d(i, j) between objects i and j.
For nominal attributes, dissimilarity can be measured using methods that account for the absence or presence of specific ________ across different objects.
The ________ distance measures the 'straight-line' distance between two points, allowing for diagonal movement and representing the shortest path.
The ________ distance calculates the distance between two points as the sum of the absolute differences of their coordinates, resembling movement along city blocks.
The Minkowski distance is a generalization of both Euclidean and Manhattan distances, defined as $\sqrt[p]{\sum_{i=1}^{n} |x_i - y_i|^p}$, where varying ________ values change the nature of the distance calculated.
Also known as Lmax or L∞ norm, ________ distance quantifies the maximum difference along any coordinate dimension between two points in a multidimensional space.
While the triangle inequality specifies that $d(x, z) ≤ d(x, y) + d(y, z)$, the ________ distance quantifies dissimilarity between data sets based on the ratio of shared characteristics to the total characteristics.
Flashcards
Binary Attribute
Attributes with only two categories.
Ordinal Attribute
Attributes with a meaningful order but inconsistent intervals.
Interval Scaled Attribute
No true zero point.
Ratio Scaled Attribute
Continuous Data
Discrete Data
Central Tendency
Standard Deviation
Data Cleaning
Data Integration
Data Selection
Data Transformation
Data Generation/Collection
Web Scraping
Relational Database
Nominal Attribute
Data Matrix
Dissimilarity Matrix
Euclidean Distance
Manhattan Distance
Minkowski Distance
Chebyshev / Supremum Distance
Non-negativity
Jaccard Distance
Data Curation
Data Lifecycle
Handling Missing Values
Data Reduction
ETL
ELT
Data Governance
Study Notes
- The world is data rich, but information poor
- Key steps in data curation are cleaning, integration, selection, and transformation, leading to analysis and visualization
Data Lifecycle
- The stages in the data lifecycle are generation, collection, processing, storage, interpretation, visualization, analysis, and management
- Data is created or acquired from various sources during generation
- Data is gathered from different sources and prepared for processing during collection
- Raw data is processed and manipulated to be usable and consistent during processing
- Processed data is securely stored in databases or data warehouses during storage
- Results are interpreted to inform decision-making and drive actions during interpretation
- Insights are presented in graphical or visual formats for easier interpretation during visualization
- Data is examined to extract insights and patterns during data analysis
- Data is organized, maintained, and governed to ensure quality and accessibility during data management
Data Generation vs Data Collection
- Data generation can be either active or passive
- Common sources of data include human-generated data like surveys and forms, machine-generated data like IoT sensors and logs, transactional systems like purchases and banking records, and web scraping
Web Scraping
- Web scraping is the extraction of data from websites
- Considerations for web scraping include how the website content is presented, whether it is structured, and how the extracted content will be saved
- Sentiment analysis is a common application of web scraping
- Possible ways to do web scraping include using libraries in programming languages, browser automation tools, APIs for data retrieval, headless browsers, and no-code or low-code tools
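As a rough illustration of the library-based approach, the sketch below pulls review text out of an HTML fragment using only Python's standard `html.parser` module. The HTML snippet, the `review` class name, and the `ReviewParser` and `extract_reviews` names are all hypothetical; a real scraper would first download the page and would need to respect the site's terms of use.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; a real scraper would download the HTML first.
SAMPLE_HTML = """
<ul>
  <li class="review">Great battery life</li>
  <li class="review">Screen scratches easily</li>
  <li class="ad">Buy now!</li>
</ul>
"""


class ReviewParser(HTMLParser):
    """Collects the text of every <li class="review"> element."""

    def __init__(self):
        super().__init__()
        self.in_review = False
        self.reviews = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag == "li" and ("class", "review") in attrs:
            self.in_review = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_review = False

    def handle_data(self, data):
        # Keep only non-empty text found inside a review item
        if self.in_review and data.strip():
            self.reviews.append(data.strip())


def extract_reviews(html):
    """Return the review texts found in an HTML string."""
    parser = ReviewParser()
    parser.feed(html)
    return parser.reviews


reviews = extract_reviews(SAMPLE_HTML)
print(reviews)  # ['Great battery life', 'Screen scratches easily']
```

The extracted strings could then be saved to a file or passed to a sentiment analysis step.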
Headless Browsers
- Headless browser automation tools such as Puppeteer and Playwright have applications in e-commerce price tracking and social media monitoring
Common Data Sources
- Database data, specifically relational databases, consist of tables with unique names, attributes (columns or fields), and tuples (records or rows), modeled using the Entity-Relationship Model
- Data warehouses are repositories of information collected from multiple sources under a unified schema, residing at a single site, and often modeled as a data cube
Data Objects
- Datasets are composed of data objects representing entities
- In sales data, objects can be customers, store items, and sales
- Data objects can also be referred to as samples, examples, instances, or data points
- Data objects stored in a database are known as data tuples
- Data objects are described by attributes and the rows of a database correspond to the data objects, while the columns correspond to the attributes
- The terms attribute, dimension, feature, and variable are often used interchangeably
Types of Data Attributes
- Nominal attributes relate to names and are categorical, like gender, color, or city, without any inherent order
- Binary attributes are nominal attributes with only two categories and are also called Boolean
- Ordinal attributes have a specific order but lack equal intervals between categories, such as education level or product ratings (low, medium, high)
- Numerical attributes can be interval scaled, with no true zero point, like temperature in Celsius or Fahrenheit, or ratio scaled, with a true zero point, like temperature in Kelvin
- Numerical attributes can also be continuous or discrete: continuous data has infinitely many possible values within a given range, such as height, weight, or temperature
- Discrete data takes a finite or countable number of values, such as the number of children or products sold
Statistical Description of Data
- Statistical description aims to provide an overall picture of the data
- Central tendency measures the location of the middle or center of a data distribution
- Dispersion measures how the data is spread out
- Common approaches for measuring central tendency include the mean, median, and mode
Summary Statistics using Pandas
- size counts all entries, including NaN values, while count excludes NaN values
- value_counts is a shortcut to count the number of entries in each category of a variable
- Group-wise operations typically follow a Split-Apply-Combine strategy
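A minimal sketch of these pandas calls, assuming a small made-up table of city ratings (the column names and values are hypothetical). The `groupby` call splits the rows by city, the aggregation is applied per group, and pandas combines the results into one Series.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two ratings are missing (NaN)
df = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Lima", "Lima", "Lima"],
    "rating": [4.0, np.nan, 3.0, 5.0, np.nan],
})

grouped = df.groupby("city")["rating"]  # split
sizes = grouped.size()                  # apply: counts every row, NaN included
counts = grouped.count()                # apply: counts only non-NaN ratings
means = grouped.mean()                  # apply: NaN values are skipped by default

# value_counts tallies how many rows fall into each category
city_counts = df["city"].value_counts()

print(sizes, counts, means, city_counts, sep="\n\n")
```

Here Lima has a size of 3 but a count of 2, because one of its three ratings is missing.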
Measuring the Dispersion of Data
- Range is calculated as the maximum value minus the minimum value
- Standard deviation is the square root of the variance and indicates how far data points typically lie from the mean
- Variance is the average of the squared differences between each data point and the mean
- Quartiles divide the dataset into four equal parts when sorted
- Q1 (1st quartile) is the median of the lower half (25th percentile)
- Q2 (2nd quartile) is the median of the dataset (50th percentile)
- Q3 (3rd quartile) is the median of the upper half (75th percentile)
- Interquartile Range (IQR) is the distance between the first and third quartiles and is calculated as IQR = Q3 - Q1
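The dispersion measures above can be sketched with Python's standard `statistics` module. The sample values and the `quartiles` helper are hypothetical; the helper follows the median-of-halves rule described in these notes and assumes at least two values. Other quartile conventions exist and can give slightly different results.

```python
import statistics


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule.

    Assumes at least two values.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[-half:]  # skips the middle element when the length is odd
    return (
        statistics.median(lower),
        statistics.median(data),
        statistics.median(upper),
    )


values = [2, 4, 4, 5, 7, 9, 10, 12]  # hypothetical sample

data_range = max(values) - min(values)    # maximum minus minimum
variance = statistics.pvariance(values)   # average squared difference from the mean
std_dev = statistics.pstdev(values)       # square root of the variance
q1, q2, q3 = quartiles(values)
iqr = q3 - q1

print(data_range)   # 10
print(q1, q2, q3)   # 4.0 6.0 9.5
print(iqr)          # 5.5
print(variance, std_dev)
```

`pvariance` and `pstdev` divide by n, matching the "average of the squared differences" definition; the sample versions (`variance`, `stdev`) divide by n - 1 instead.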
Measuring Data Similarity / Dissimilarity
- Measuring similarity and dissimilarity helps understand relationships between data points and is also called proximity measurement
- A similarity measure for two objects, i and j, typically returns a value of 0 if the objects are completely unalike
- The higher the similarity value, the greater the similarity between objects, where a value of 1 typically indicates complete similarity
- A dissimilarity measure works the opposite way: it returns 0 for identical objects and grows as the objects become more different
- Data can be represented as either a data matrix or a dissimilarity matrix
Data vs Dissimilarity Matrix
- A data matrix stores n data objects in a relational table or an n-by-p matrix, where n is the number of objects and p is the number of attributes; it is also called a "two-mode" matrix
- A dissimilarity matrix stores a collection of proximities available for all pairs of n objects; it is also called a "one-mode" matrix
- d(i, j) represents the measured dissimilarity or difference between objects i and j
- sim(i, j) = 1 - d(i, j), where sim is the similarity (this assumes d is scaled to the range 0 to 1)
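To make the relationship between the two matrices concrete, here is a small sketch that turns a 3-by-2 data matrix into a 3-by-3 dissimilarity matrix. The data values are made up, and the sum of absolute differences is used for d(i, j) purely as an illustrative choice. Rescaling by the largest entry before computing sim(i, j) = 1 - d(i, j) is likewise an assumption, made so that the similarity falls between 0 and 1.

```python
def manhattan(a, b):
    """Sum of absolute coordinate differences (an illustrative choice of d)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def dissimilarity_matrix(data, distance):
    """Build the n-by-n matrix of d(i, j) from an n-by-p data matrix."""
    n = len(data)
    return [[distance(data[i], data[j]) for j in range(n)] for i in range(n)]


# Hypothetical data matrix: n = 3 objects (rows), p = 2 attributes (columns)
data_matrix = [
    [1, 2],
    [4, 6],
    [1, 3],
]

d = dissimilarity_matrix(data_matrix, manhattan)
print(d)  # [[0, 7, 1], [7, 0, 6], [1, 6, 0]]

# Assumption: dividing by the largest entry rescales d to [0, 1],
# so that sim(i, j) = 1 - d(i, j) is meaningful.
largest = max(max(row) for row in d)
sim = [[1 - value / largest for value in row] for row in d]
```

The diagonal of `d` is all zeros and the matrix is symmetric, which is why only one triangle is usually stored.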
Measuring Dissimilarity - Nominal Attributes
- Dissimilarity between nominal attributes is measured based on whether the attributes match or differ
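One common way to turn "match or differ" into a number is the simple matching ratio d(i, j) = (p - m) / p, where p is the number of attributes and m is the number of attributes on which the two objects agree. The notes do not state this formula, so treat it as an assumption; the function name and example objects below are hypothetical.

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Fraction of nominal attributes on which two objects differ.

    Assumes the simple matching ratio d = (p - m) / p.
    """
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of attributes")
    p = len(obj_i)                                       # total attributes
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)   # matching attributes
    return (p - m) / p


# Hypothetical objects described by (color, city, occupation)
first = ("red", "Oslo", "student")
second = ("red", "Lima", "student")

print(nominal_dissimilarity(first, second))  # one mismatch out of three
print(nominal_dissimilarity(first, first))   # 0.0
```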
Euclidean Distance
- Measures the straight-line distance
- Diagonal movement is allowed
- The shortest distance between any two points
- Calculated as d(x, y) = √((x1 - y1)² + (x2 - y2)²) in a two-dimensional plane
- Extended to n dimensions: d(x, y) = √(Σ(xi - yi)²)
Manhattan Distance
- Measures the sum of the absolute differences of the coordinates
- Calculated as d(x, y) = |x1 - y1| + |x2 - y2| in a 2-D plane
- In n-dimensional space, d(x, y) = ∑ |xi - yi|
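A short sketch contrasting the straight-line and city-block distances for the same pair of points. The points (1, 2) and (4, 6) are made up for illustration, and the function names are not from any particular library.

```python
import math


def euclidean(x, y):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """City-block distance: sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


x, y = (1, 2), (4, 6)

print(euclidean(x, y))  # 5.0  (sqrt(3² + 4²))
print(manhattan(x, y))  # 7    (3 + 4)
```

The Manhattan distance is never smaller than the Euclidean distance for the same two points, since the straight line is the shortest path.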
Minkowski Distance
- Minkowski distance is a generalization of the Euclidean and Manhattan distances
- Defined as d(x, y) = (∑|xi - yi|^p)^(1/p) for p ≥ 1
- If p = 1, then the Minkowski distance is the same as the Manhattan distance (L1 norm)
- If p = 2, then the Minkowski distance is equivalent to the Euclidean distance (L2 norm)
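The special cases can be checked numerically. This sketch implements the formula directly; the function name and the sample points are hypothetical.

```python
def minkowski(x, y, p):
    """Minkowski distance (sum of |xi - yi|^p) ^ (1/p), defined here for p >= 1."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


x, y = (1, 2), (4, 6)

print(minkowski(x, y, 1))  # 7.0, the Manhattan distance
print(minkowski(x, y, 2))  # 5.0, the Euclidean distance
print(minkowski(x, y, 3))  # between 4 and 5; shrinks toward the largest difference as p grows
```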
Chebyshev / Supremum Distance
- Measures the maximum difference along any coordinate dimension between two points in a multidimensional space
- It is also known as Lmax, L∞ norm, or uniform norm
- Calculated as d(P, Q) = max_i |pi - qi|
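The sketch below computes the Chebyshev distance for two made-up points and compares it with a Minkowski-style calculation using a large exponent, which illustrates why it is called the L∞ norm. The function name, points, and choice of exponent 50 are assumptions for illustration.

```python
def chebyshev(p, q):
    """Largest absolute difference along any single coordinate."""
    return max(abs(a - b) for a, b in zip(p, q))


p, q = (1, 2), (4, 6)

print(chebyshev(p, q))  # 4, because the differences are 3 and 4

# With a large exponent, the Minkowski formula is dominated by the
# largest coordinate difference and approaches the Chebyshev distance.
exponent = 50
approx = sum(abs(a - b) ** exponent for a, b in zip(p, q)) ** (1 / exponent)
print(approx)  # very close to 4
```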
Mathematical Properties
- Non-negativity: distance is never negative, d(x, y) ≥ 0
- Identity: the distance between a point and itself is always 0, and d(x, y) = 0 if and only if x = y
- Symmetry: the order of the points doesn't matter in the distance calculation, d(x, y) = d(y, x)
- Triangle inequality: the distance from x to z cannot be greater than the sum of the distances from x to y and from y to z, d(x, z) ≤ d(x, y) + d(y, z)
- For the Chebyshev distance, this means the maximum difference between x and z along any dimension cannot be greater than the sum of the maximum differences from x to y and from y to z
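These properties can be spot-checked on a handful of points. The sketch below is only a check on sample data, not a proof; the point sets and function names are hypothetical. Squared Euclidean distance is included as a counterexample because it is known to violate the triangle inequality.

```python
from itertools import product


def chebyshev(p, q):
    """Largest absolute difference along any single coordinate."""
    return max(abs(a - b) for a, b in zip(p, q))


def squared_euclidean(p, q):
    """Sum of squared differences, with no square root taken."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def satisfies_metric_properties(points, distance):
    """Check the four properties for every triple drawn from the given points."""
    for x, y, z in product(points, repeat=3):
        if distance(x, y) < 0:                                   # non-negativity
            return False
        if (distance(x, y) == 0) != (x == y):                    # identity
            return False
        if distance(x, y) != distance(y, x):                     # symmetry
            return False
        if distance(x, z) > distance(x, y) + distance(y, z):     # triangle inequality
            return False
    return True


points = [(0, 0), (1, 2), (4, 6), (-3, 5)]
line = [(0, 0), (1, 0), (2, 0)]

print(satisfies_metric_properties(points, chebyshev))        # True
print(satisfies_metric_properties(line, squared_euclidean))  # False: 4 > 1 + 1
```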
Jaccard Distance
- Quantifies the dissimilarity between two sets of data
- Derived from the Jaccard Index (or Jaccard Similarity Coefficient)
- Jaccard Distance = 1 - Jaccard Index
- Jaccard Index = |A∩B| / |A∪B|
- |A∩B|: the number of elements common to both sets A and B (intersection)
- |A∪B|: the number of unique elements in either set A or B (union)
- Jaccard Index ranges from 0 to 1, where 1 indicates identical sets and 0 indicates disjoint sets
- Use cases: comparing documents for text similarity, or comparing sets of pixels in images
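As a small illustration of the document-comparison use case, the sketch below treats each sentence as a set of words. The sentences and function names are made up, and returning 1.0 for two empty sets is an assumption, since the ratio is undefined in that case.

```python
def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(a & b) / len(a | b)


def jaccard_distance(a, b):
    """Jaccard distance, defined as 1 minus the Jaccard index."""
    return 1 - jaccard_index(a, b)


doc_a = "the quick brown fox".split()
doc_b = "the lazy brown dog".split()

# Shared words: {"the", "brown"} -> 2; all distinct words -> 6
print(jaccard_index(doc_a, doc_b))     # 2 / 6
print(jaccard_distance(doc_a, doc_b))  # 1 - 2 / 6
```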
Description
Explore data management principles covering creation, storage, access, and preservation. Learn about data quality assessment to address inconsistencies, and data integration techniques to resolve redundancies. Discover attribute selection and sampling methods for effective data reduction.