Data Curation Techniques
Siddharth R

Syllabus

Unit 1: Introduction to Data Lifecycle
The data lifecycle (creation, storage, access, preservation) - Data objects and attribute types - Basic statistical descriptions of data - Measuring data similarity and dissimilarity - Database vs data warehouse - Data quality: why preprocess the data?

Unit 2: Data Preprocessing Techniques
Data cleaning workflow - Handling missing values, noisy data, outliers - Data integration: redundancy and correlation, duplication - Data transformation: normalization - Data discretization using binning, histograms, clusters - Concept hierarchy generation for nominal data.
Hands-on activity: Data preprocessing using Python libraries (NumPy, Pandas)

Unit 3: Data Reduction
Attribute subset selection methods: forward selection, backward elimination, equal-width and equal-frequency histograms - Sampling with and without replacement - Data aggregation and summarization - Overview of the data cube - Basics of feature selection and feature extraction.
Hands-on activity: Data reduction using Python (scikit-learn)

Unit 4: Data Management
ETL vs ELT - Data governance - Data modeling and design - Data integration and interoperability - Challenges of working with heterogeneous data sources - Common data formats (CSV, JSON, XML) - Master data management - Metadata
Discussion: Integration challenges in IoT data

Unit 5: Data Architecture
Principles of good data architecture - Architecture concepts: tight vs loose coupling - User access: single vs multi-tenant - Event-driven architecture - Data storage systems - Storage abstraction - Hot, warm, and cold data
Discussion: Latest trends in storage using open-source tools.

Text / Reference Books
1. “Data Mining: Concepts and Techniques” by Jiawei Han, Jian Pei, and Hanghang Tong, Morgan Kaufmann, 2022, ISBN: 9780128117613 (for Units 1 to 3)
2. “Fundamentals of Data Engineering” by Joe Reis and Matt Housley, O’Reilly Media, 2022, ISBN: 9781098108304 (for Units 4 and 5)
3. Latest related research articles from reputed journals/conferences

Assessment components
- Quiz 1: 5%
- Mid-Term: 25%
- Quiz 2: 5%
- Lab Components: 25%
- End Sem: 40%

Why Data Curation?
The world is data rich but information poor.
Key steps:
- Data Cleaning
- Data Integration
- Data Selection
- Data Transformation
- Data Analysis and Visualization

Data Life Cycle
Source: https://www.knime.com/blog/the-data-lifecycle

Data Generation vs Data Collection
- Either active or passive
- Common sources:
    - Human-generated (e.g., surveys, forms)
    - Machine-generated (e.g., IoT sensors, logs)
    - Transactional systems (e.g., purchases, banking records)
    - Web scraping

What is Web Scraping (Common Data Collection)
- Extraction of data from websites
- How is the website content presented? Is it structured?
- How are you going to save the extracted content?
- Can you name a few common applications?
    ○ Sentiment analysis?

Possible Ways to do Web Scraping
- Using libraries in programming languages
- Browser automation tools
- APIs for data retrieval
- Headless browsers
- No-code or low-code tools

Reading Assignment
Headless browsers: Puppeteer and Playwright
Key applications:
1. E-commerce price tracking
2. Social media monitoring

Common Data Sources
- Database data
    - A relational database is a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows).
    - Model: Entity-Relationship model
- Data warehouse
    - A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.
    - Model: Data cube

Data Objects
- Data sets are made up of data objects. A data object represents an entity.
    ○ In a sales database, the objects may be customers, store items, and sales.
    ○ In a university database?
- Data objects can also be referred to as samples, examples, instances, data points, or objects.
- If the data objects are stored in a database, they are data tuples.
- Data objects are typically described by attributes. That is, the rows of a database correspond to the data objects, and the columns correspond to the attributes.
- The terms attribute, dimension, feature, and variable are often used interchangeably.

Types of data attributes
Nominal attribute:
    ○ Relating to names
    ○ Categorical
    ○ Data without any inherent order. Examples: gender, color, city
    ○ Is your Aadhaar number nominal?
    ○ How do you measure the central tendency?
Binary attribute:
    ○ Nominal attribute with only two categories
    ○ Also called Boolean
    ○ Can you give some examples?
    ○ Symmetric vs asymmetric binary attribute
Ordinal attribute:
    ○ Data with a specific order but without equal intervals between categories.
    ○ Examples: education level, product rating (low, medium, high)
    ○ What is the preferred measure of central tendency?
Are nominal, binary, and ordinal data quantitative or qualitative?
Numerical attribute:
    ○ Interval-scaled attributes: no true zero point. Example: temperature in Celsius / Fahrenheit
    ○ Ratio-scaled attributes: true zero point. Example: temperature in Kelvin
    ○ Continuous data: data with infinitely many possible values within a given range. Examples: height, weight, temperature
    ○ Discrete data: data with a finite or countably infinite set of values. Examples: number of children, number of products sold

Statistical Description of Data
- Gives an overall picture of your data
- Central tendency: measures the location of the middle or center of a data distribution.
- Dispersion of data: describes how the data are spread out.
- Common approaches for measuring the central tendency: mean, median, and mode

Summary Statistics using Pandas [Refer Pandas code]
Key points to remember:
- size includes NaN values, whereas count excludes missing values.
- value_counts is a convenient shortcut to count the number of entries in each category of a variable.
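The key points above can be tried out with a minimal sketch. The DataFrame and its column names (city, rating) are made up for illustration and are not taken from the course's Pandas code; the sketch only shows how size, count, value_counts, and the three measures of central tendency behave when values are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only; not from the course material.
df = pd.DataFrame({
    "city": ["Chennai", "Delhi", "Chennai", "Mumbai", "Chennai", "Delhi"],
    "rating": [4.0, np.nan, 3.0, 5.0, 3.0, np.nan],
})

n_rows = df["rating"].size       # 6: size counts every entry, including NaN
n_valid = df["rating"].count()   # 4: count skips the missing values

# Central tendency; pandas ignores NaN by default in these reductions.
mean_rating = df["rating"].mean()
median_rating = df["rating"].median()
mode_rating = df["rating"].mode().iloc[0]  # mode() returns a Series; take the first mode

# Number of entries in each category of a nominal attribute.
city_counts = df["city"].value_counts()

print(n_rows, n_valid)
print(mean_rating, median_rating, mode_rating)
print(city_counts)
```

For a nominal attribute such as city, only the mode (the top entry of value_counts) is a meaningful measure of central tendency; the mean and median are computed here on the numeric column.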
Split-Apply-Combine

Measuring the Dispersion of Data
- Range: maximum - minimum
- Variance: the average of the squared differences between each data point and the mean.
- Standard deviation: the square root of the variance. It indicates how far, on average, the data points lie from the mean.
- Quartiles: quartiles divide the dataset into four equal parts when it is sorted.
    ○ Q1 (1st quartile): median of the lower half (25th percentile).
    ○ Q2 (2nd quartile): median of the dataset (50th percentile).
    ○ Q3 (3rd quartile): median of the upper half (75th percentile).
- Interquartile range (IQR): the distance between the first and third quartiles, IQR = Q3 - Q1

Measuring Data Similarity / Dissimilarity
- Why do we need to measure similarity / dissimilarity?
- Also called a measure of proximity
- A similarity measure for two objects, i and j, will typically return the value 0 if the objects are completely unalike. The higher the similarity value, the greater the similarity between objects. (Typically, a value of 1 indicates complete similarity, that is, the objects are identical.)
- A dissimilarity measure works the opposite way.
- Data matrix vs dissimilarity matrix

Data Matrix vs Dissimilarity Matrix
- Data matrix: this structure stores the n data objects in the form of a relational table, or n-by-p matrix (n objects × p attributes).
    ○ Also called a “two-mode” matrix
- Dissimilarity matrix: this structure stores a collection of proximities that are available for all pairs of n objects, as an n-by-n table whose entry in row i, column j is d(i, j), where d(i, j) is the measured dissimilarity or difference between objects i and j.
    ○ sim(i, j) = 1 - d(i, j), where sim is the similarity
    ○ Also called a “one-mode” matrix

Measuring Dissimilarity - Nominal Attributes

Euclidean Distance
- Measures the straight-line distance (as the crow flies).
- Diagonal movement allowed
- Euclidean distance is the shortest distance between any two points.
- Mathematically, the Euclidean distance between the points x = (x1, x2) and y = (y1, y2) in a two-dimensional plane is given by:
  $$d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}$$
- Extending to n dimensions, the points x and y are of the form x = (x1, x2, …, xn) and y = (y1, y2, …, yn), and the distance is:
  $$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Manhattan Distance (city-block)
- Measures the sum of the absolute differences of the coordinates.
- In a 2-D plane, the Manhattan distance between the points x and y is given by:
  $$d(x, y) = |x_1 - y_1| + |x_2 - y_2|$$
- In n-dimensional space, where each point has n coordinates, the Manhattan distance is given by:
  $$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$

Minkowski Distance
- Minkowski distance is a generalization of the Euclidean and Manhattan distances. It is defined as:
  $$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$
- If p = 1, the Minkowski distance takes the same form as the Manhattan distance (L1 norm).
- Similarly, for p = 2, the Minkowski distance is equivalent to the Euclidean distance (L2 norm).

Chebyshev / Supremum distance
- It is a measure of distance that calculates the maximum difference along any coordinate dimension between two points in a multidimensional space:
  $$d(x, y) = \max_{1 \le i \le n} |x_i - y_i|$$
- Also known as the Lmax, L∞ norm, or uniform norm
- Let x1 = (1, 2) and x2 = (3, 5). Find the Euclidean, Manhattan, and supremum distances.

Mathematical Properties
- Non-negativity: distance is always non-negative. d(x, y) ≥ 0
- Identity of indiscernibles: the distance between a point and itself is always 0. d(x, y) = 0 if and only if x = y
- Symmetry: the order of points doesn't matter in distance calculation. d(x, y) = d(y, x)
- Triangle inequality: the distance from x to z cannot be greater than the sum of the distances from x to y and from y to z. d(x, z) ≤ d(x, y) + d(y, z)

Jaccard Distance
    ○ Quantifies the dissimilarity between two sets of data.
    ○ Derived from the Jaccard Index (or similarity coefficient)
    ○ Jaccard Distance = 1 − Jaccard Index, where Jaccard Index = ∣A∩B∣ / ∣A∪B∣
    ○ ∣A∩B∣: the number of elements common to both sets A and B (intersection).
    ○ ∣A∪B∣: the number of unique elements in either set A or B (union).
    ○ The Jaccard Index ranges from 0 to 1, where 1 indicates identical sets and 0 indicates disjoint sets.
    ○ Set A = {1, 2, 3, 4} and Set B = {3, 4, 5, 6}: ∣A∩B∣ = ?, ∣A∪B∣ = ?, Jaccard Index = ?, Jaccard Distance = ?
    ○ Use case: comparing documents or text for similarity, or sets of pixels in images

Overview

Thank You !!!
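As a companion to the "Measuring the Dispersion of Data" slide, the sketch below computes each measure with Python's standard library. The sample values are made up, and the quartile helper is an assumption: it follows the slide's "median of the lower/upper half" wording and, for an odd number of values, leaves the middle value out of both halves. Other quartile conventions (for example the interpolation used by pandas or NumPy) can give slightly different results.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention.

    Assumption: for an odd number of values, the middle value is
    excluded from both halves. Expects at least two values.
    """
    s = sorted(values)
    n = len(s)
    half = n // 2
    lower = s[:half]
    upper = s[half:] if n % 2 == 0 else s[half + 1:]
    return statistics.median(lower), statistics.median(s), statistics.median(upper)

# Hypothetical sample, for illustration only.
data = [2, 4, 4, 5, 7, 9, 10, 12]

data_range = max(data) - min(data)
variance = statistics.pvariance(data)  # average squared difference from the mean
std_dev = statistics.pstdev(data)      # square root of that variance
q1, q2, q3 = quartiles(data)
iqr = q3 - q1

print(data_range)       # 10
print(variance, std_dev)
print(q1, q2, q3, iqr)  # 4.0 6.0 9.5 5.5
```

The population forms (pvariance, pstdev) are used because the slide defines variance as a plain average; statistics.variance and statistics.stdev divide by n - 1 instead.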
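The exercise from the Chebyshev / Supremum distance slide (x1 = (1, 2), x2 = (3, 5)) can be checked with a short sketch. The function names are illustrative, not from the course code; the sketch also shows that the Minkowski distance with p = 1 and p = 2 reproduces the Manhattan and Euclidean distances.

```python
import math

def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def euclidean(x, y):
    """Straight-line (L2) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """City-block (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def chebyshev(x, y):
    """Supremum (L-infinity) distance: largest coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))

x1, x2 = (1, 2), (3, 5)

print(euclidean(x1, x2))     # sqrt(2^2 + 3^2) = sqrt(13), about 3.606
print(manhattan(x1, x2))     # 2 + 3 = 5
print(chebyshev(x1, x2))     # max(2, 3) = 3
print(minkowski(x1, x2, 1))  # same value as the Manhattan distance
print(minkowski(x1, x2, 2))  # same value as the Euclidean distance (up to rounding)
```

For these two points the supremum distance is the smallest of the three and the Manhattan distance the largest, which holds in general for the L∞, L2, and L1 norms.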
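The Jaccard exercise (A = {1, 2, 3, 4}, B = {3, 4, 5, 6}) can be worked through the same way using Python's built-in sets. The function names are illustrative, and treating two empty sets as identical is an assumption made here to avoid dividing by zero; the slides do not cover that case.

```python
def jaccard_index(a, b):
    """Jaccard Index |A ∩ B| / |A ∪ B| for two collections of hashable items."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(union)

def jaccard_distance(a, b):
    """Jaccard Distance = 1 - Jaccard Index."""
    return 1 - jaccard_index(a, b)

A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

print(len(A & B), len(A | B))  # 2 6
print(jaccard_index(A, B))     # 2/6, about 0.333
print(jaccard_distance(A, B))  # 1 - 2/6, about 0.667
```

For the document-similarity use case, the same functions can be applied to sets of words, for example jaccard_index("data curation".split(), "data cleaning".split()).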