AI Course: Intro and Data Quality


Questions and Answers

In the context of data analysis, what does 'ETL' stand for in the staging area of a data warehouse architecture?

  • Extract, Transfer, Load
  • Evaluate, Translate, Load
  • Execute, Transform, Load
  • Extract, Transform, Load (correct)

What is the primary focus of Statistics, Data Mining, and Data Science concerning the 'inductive phase of learning'?

  • Creating new data collection methods.
  • Moving from idea/theory to observation.
  • Moving from observation to idea/theory/hypothesis. (correct)
  • Analyzing data to validate existing theories.

What is the general equation that represents the composition of data?

  • Data = Signal - Noise
  • Data = Information + Error
  • Data = Signal + Interference
  • Data = Fit + Noise (correct)

Which "V" of Big Data refers to the trustworthiness and accuracy of the data?

  • Veracity (correct)

What is the primary goal when dealing with data?

  • To reveal information, models, patterns, associations, trends, and clusters hidden in the data. (correct)

What is 'data warehousing or data lakes'?

  • Assembling historical data in a consistent manner from transactional processes. (correct)

In the context of machine learning, what process relies primarily on inductive learning?

  • Using computers to make machines program themselves. (correct)

What is the Machine Learning equivalent of 'Individuals' in Statistics?

  • Instances (correct)

Within the context of Data Science as an interdisciplinary field, which components are combined?

  • Computer Science/IT, Math and Statistics, and Domain/Business Knowledge (correct)

Which of the following is NOT a typical step in the data science process?

  • Deployment to production (correct)

Under what condition does multivariate data arise?

  • When researchers record the values of several variables/attributes on a set of units in which they are interested. (correct)

What is a key characteristic of the rows in a data file?

  • They represent individuals or instances. (correct)

What is a key restriction regarding columns of a data file?

  • The same variables are measured in all individuals, in the same order. (correct)

What type of variable is 'citizenship' ('Mexican', 'German', etc.)?

  • Nominal variable (correct)

What type of variable is 'size in clothes' ('XXL', 'XL', 'L', 'M', 'S')?

  • Ordinal variable (correct)

What is a primary characteristic of 'Count' data?

  • It is the result of a count. (correct)

Which file is used to understand the meaning of the variables in the data?

  • The metadata file (correct)

What is the purpose of the 'Feature Selection' step in data preprocessing?

  • To filter out the uninteresting variables. (correct)

What does 'complete case analysis' imply in handling missing values?

  • Omitting any case with a missing value on any of the variables. (correct)

What should be done about the presence of outliers?

  • Detect them and decide how to treat them. (correct)

What is a goal of multivariate outlier detection?

  • To detect observations not detected by univariate methods. (correct)

What does 'LOF' refer to?

  • Local Outlier Factor is an algorithm to find density-based local outliers. (correct)

Which is a technique for detecting outliers by means of Random Forest?

  • Isolation Tree algorithm. (correct)

Why do outliers require fewer divisions than normal data with the Isolation Tree method?

  • Because anomalies are easier to isolate, they require fewer divisions, which makes them easy to detect. (correct)

Once a PCA model is obtained (eigenvectors, projections, and means have been calculated), what can be done?

  • The initial observations can be reconstructed and used. (correct)

How is the mean squared reconstruction error calculated?

  • It is the average of the squared differences between the original variables and the reconstructed variables. (correct)

What is Attribute wise Learning for Scoring Outliers?

  • An unsupervised outlier detection algorithm. (correct)

Once you have detected outliers, what do you have to do?

  • Evaluate the possible actions to see which ones benefit the data and which ones don't. (correct)

Which of the following is true about preprocessing?

  • Automated EDA tools can be used to understand the data faster. (correct)

Why is preprocessing important?

  • The data needs cleaning to achieve better data quality. (correct)

What is one of the first steps inside the data mining chain?

  • The first summary of the data: measures of central tendency and dispersion. (correct)

What is the role of variables in the data?

  • They must have meaning, as described in the metadata. (correct)

What are the steps of data processing?

  • Data collection and data preparation (wrangling during interactive data analysis), which leverage visualization for exploratory data analysis (EDA), followed by data preprocessing and feature engineering. (correct)

In the context of handling outliers, what does 'Declare outliers as missing values' mean?

  • Treat outliers as null/undefined values. (correct)

What is the use of 'Internal encoding'?

  • To order ordinal variables. (correct)

In which situations do you have to treat biased data?

  • Unbias and balance the data (detection & mitigation) in a data preprocessing step. (correct)

What is the first step for data preprocessing?

  • Inventory data sources. (correct)

In data preprocessing, what does the Scale (Normalize, Standardize) step involve?

  • Transforming variables to the same scale by scaling or normalization. (correct)

What is the main characteristic of outlier detection with Random Forest (anomaly score)?

  • Outliers require fewer divisions to isolate than normal data. (correct)

Flashcards

What is Learning?

An iterative process that happens between real-world facts and the hypothesized theories.

What is deduction?

Movement from an idea, theory or hypothesis to observation.

What is Induction?

Movement from observation to idea, theory, or hypothesis.

Statistics/Data mining/Data Science

The inductive phase of learning.


What is Big Data?

The exponential increase of data generation and storage.


What is Velocity in Big Data?

Speed at which the data is generated and analysed.


What is Volume in Big Data?

Amount of data generated each second (social media, credit cards).


What is Value in Big Data?

The worth of the extracted data, needing correct data amount.


What is Variety in Big Data?

Describes the different types of data generated.


What is Veracity of Data?

How trustworthy the data is. If the data is inaccurate or poor quality, it is of little use.


What is Visualisation?

How challenging data can be to use. Limitations such as poor scalability or functionality can impact visualisation.

What is data warehousing or data lakes?

The process of assembling historical data in a consistent manner from transactional processes.

What is Statistics?

Making inferences with confidence measures.


What is Computer Science?

Developing machines/algorithms that solve problems.


What is Machine Learning?

Based on statistics, machines (algorithms) that program themselves to solve tasks.


What are multivariate data?

Arise when researchers/users record the values of several variables/attributes on a set of units in which they are interested.


What are Tables?

Tables of individuals by variables (continuous or categorical).

What are rows in a data file?

Each row represents individual data points.


What are columns in a data file?

Columns represent the characteristics.


What are nominal variables?

Represented by symbols, serving only as labels or names.


What are Ordinal Variables?

Impose order on values, but no distance between values is defined.

What is Count data?

Very often a variable is the result of a count.

What is Preprocessing?

First summary of data: measures of central tendency and dispersion.


What is Data Preprocessing?

Clean (Replace, Impute, Remove Outliers, Duplicates).


What is Correlation Between Features?

Drops features that have a high correlation with others.


What are Statistical Tests?

Checks the relationship of each feature individually with the output variable.


Recursive Feature Elimination

An algorithm trains a model with the dataset and calculates the performance of the model.


What is Variance Threshold?

Detects features with high variability and selects those that go over the threshold.


What is Feature selection?

Filtering out the uninteresting variables.

What is Feature extraction?

Deriving new variables.

What do you do with errors (typos, etc.)?

Identify and correct them.

What do you do with missing values?

They may bias the results. They arise for several reasons: non-response in sample surveys, dropouts in longitudinal data, or refusal to answer particular questions in a questionnaire.

What are outliers?

Douglas Hawkins defines an outlier as an observation which deviates so much from the other observations as to arouse suspicion that it was generated by a different mechanism.

What is the boxplot (Tukey, 1977)?

A graphical display for exploratory data analysis, in which the outliers appear tagged.

What is Mahalanobis distance?

The distance between a point i and a distribution G.

Study Notes

Introduction to Artificial Intelligence Degree Course

  • Introduction and Data Quality, lectured by Prof. Dante Conti and Prof. Sergi Ramirez.

Course Resources

  • The Guia Docente & Web Site / Atenea is available at:
  • https://www.fib.upc.edu/es/estudios/grados/grado-en-inteligencia-artificial/plan-de-estudios/asignaturas/PMAAD-GIA
  • https://ramia-lab.github.io/AdvancedModelling/

Project Details

  • Project groups consist of 5-6 people, with a total of 4 groups per laboratory.
  • Practical Work:
    • Choose a "real-world" problem or case study.
    • Implement algorithms and methods.
    • Write a technical/managerial report.
    • Oral defense.
  • R is the primary language, but Python is also acceptable.

Importance of Data in Learning

  • Learning involves an iterative process between real-world facts and hypothesized theories.
  • Deduction moves from idea/theory/hypothesis to observation.
  • Induction moves from observation to idea/theory/hypothesis.
  • Statistics, Data mining, and Data Science focus on the inductive phase of learning.
  • Data can be represented as Data = Fit + Noise.
  • Exponential increase in data generation and storage,
    • including bank, telecom, scientific, web, text, e-commerce, and social network data.
  • Increase in data formats:
    • relational tables, non-structured tables, log files, textual data, and image data.
  • Real-time, streaming data sources.

Big Data Overview

  • Big Data = Transactions + Interactions + Observations
  • Encompasses various inputs like sensors, mobile web, user click streams, and weblogs.
  • Includes outputs like user-generated content, sentiment analysis, social interactions, spatial coordinates, and external demographics.
  • Characterized by increasing data variety and complexity.

The 8 Vs of Big Data

  • Velocity: The speed at which data is generated, collected, and analyzed.
  • Volume: The amount of data generated each second.
  • Value: The worth of the extracted data.
  • Variety: Describes the different types of data generated, often referring to unstructured data.
  • Veracity: How trustworthy the data is.
  • Validity: Accuracy of the data for its intended use.
  • Volatility: The age of the data; fresh data can quickly make stored data irrelevant.
  • Visualisation: Challenges in using the data, impacted by factors like scalability and functionality.

Data as a Valuable Resource

  • Stored data contains information about the generating phenomenon (statistical regularity).
  • The goal is to reveal information like models, patterns, associations, trends, and clusters hidden in the data.
  • Data is a treasure for organizations if the data quality is reliable.
  • All digital interactions can be valuable data sources that can be enhanced through collected data analysis.
  • Interesting insights are revealed by selecting and reporting data.
  • SQL queries alone are not sufficient.
  • Assembling historical data in a consistent manner is called data warehousing or data lakes; these are the memory of the company.
  • There is a need to learn from the data.

Data Warehouse Architecture

  • Data flows from sources through staging areas to a warehouse, then to data marts, and finally to users who perform analysis, reporting, and mining.
  • Key components include operational systems, ERP, CRM, flat files, metadata, summary data, and raw data.
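The staging-area ETL flow (Extract, Transform, Load) can be sketched as a minimal pipeline. This is an illustrative sketch only: the record fields and helper names are made up, not part of any warehouse product.

```python
# Minimal ETL sketch: extract rows from a source system, transform them,
# and load them into a target store. All names and fields are illustrative.

def extract(source):
    """Extract raw records from a source system (here, a list of dicts)."""
    return list(source)

def transform(records):
    """Transform: clean field values and derive a total per transaction."""
    out = []
    for r in records:
        out.append({
            "customer": r["cust"].strip().title(),
            "total": r["qty"] * r["unit_price"],
        })
    return out

def load(records, warehouse):
    """Load transformed records into the warehouse (here, a plain list)."""
    warehouse.extend(records)
    return warehouse

raw = [{"cust": " alice ", "qty": 2, "unit_price": 10.0},
       {"cust": "bob",     "qty": 1, "unit_price": 5.5}]
warehouse = load(transform(extract(raw)), [])
print(warehouse[0]["customer"], warehouse[0]["total"])  # Alice 20.0
```

In a real warehouse the same three stages run against database connections and bulk loaders rather than Python lists, but the separation of concerns is the same.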

Interdisciplinarity in Machine Learning

  • Computer Science is for developing machines or algorithms that solve problems.
  • Statistics is designed to make inferences with confidence measures.
  • Machine learning is based on statistics to create machines/algorithms that self-program to solve tasks.

Rosetta Stone Comparison of Statistics and Machine Learning

  • Statistics terms map to machine learning equivalents:
    • Variables are attributes/features.
    • Individuals are instances.
    • Explanatory variables are inputs.
    • Response variables are outputs/targets/concepts.
    • Models are networks/trees.
    • Coefficients are weights.
    • Fit criteria are cost functions.
    • Estimation is learning/training.
    • Classification is clustering/unsupervised classification.
    • Discrimination is supervised classification.

Data Science Venn Diagram

  • Data Science combines Computer Science/IT, Math and Statistics, and Domain/Business Knowledge.

Data Science Process Steps

  • Data Collection
  • Data Preparation
  • Model Fitting
  • Model Evaluation
  • Hyperparameter Tuning

Multivariate Data Definition

  • Multivariate data arise when researchers/users record values of several variables/attributes on a set of units.
  • This leads to a vector-valued or multidimensional observation for each unit.

Data File Characteristics

  • Data is multivariate.
  • Tables can contain individual variables or counts.
  • Useful data types:
    • Transaction data
    • Graphs
    • Similarity matrices, link data
    • Textual data: documents, HTML/XML
    • Stream data: sensors, podcasting
    • Image data: medical, Instagram

The Data Matrix

  • A multivariate data matrix X ∈ R^(n×p) has entries x_ij, the value of the jth variable for the ith unit.
  • Theoretical entities describing univariate distributions are denoted by random variables X_1, ..., X_p.
  • The rows of X can be written as vectors x_1, x_2, ..., x_n, and the columns as x_(1), x_(2), ..., x_(p).

Sample Mean Vector and Covariance Matrix

  • The sample mean of the jth variable is x̄_j = (1/n) Σ_i x_ij.
  • The sample mean vector is x̄ = (x̄_1, ..., x̄_p)^T.
  • The sample covariance of the jth and kth variables is s_jk = (1/(n-1)) Σ_i (x_ij - x̄_j)(x_ik - x̄_k).
  • The p × p matrix S = (s_jk) is the sample covariance matrix.
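The sample mean vector and covariance matrix formulas can be checked numerically. A small NumPy sketch with illustrative values:

```python
import numpy as np

# n x p data matrix: n = 4 units, p = 2 variables (values are illustrative)
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0],
              [4.0, 8.0]])
n = X.shape[0]

xbar = X.mean(axis=0)                      # sample mean vector (x̄_1, ..., x̄_p)
S = (X - xbar).T @ (X - xbar) / (n - 1)    # sample covariance matrix S = (s_jk)

# np.cov with rowvar=False (columns = variables) uses the same 1/(n-1) convention
assert np.allclose(S, np.cov(X, rowvar=False))
print(xbar)  # [2.5 5. ]
```

Since the second column is exactly twice the first, every entry of S is a multiple of the first column's variance.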

Rows and Columns Characteristics

  • Rows of a data file represent individuals or instances, numbering from tens to millions.
  • Also referred to as samples, examples, or records.
  • Represent repeated units forming a population under study and can be classified, associated, or clustered and characterized by a predetermined set of attributes.
  • Use all available data.

Columns Characteristics:

  • Each row is described by a set of features, variables, or attributes.
  • Variables can take several values (according to a distribution).
  • Attribute types:
    • Binary, nominal, ordinal, interval, ratio, textual, ...
  • The same variables are measured in all individuals, in the same order.
  • A dictionary of the variables may appear in the first rows.

Variable Types Hierarchy

  • Data types include discrete and continuous, quantitative and qualitative:
    • Discrete data include categorical (nominal/ordinal) variables.
    • Continuous data include quantitative (metric) variables.

Nominal Variables

  • Nominal variables have distinct categories represented by symbols, and these serve as labels or names.
  • No relation is implied among nominal values, with no ordering or distance measure.
  • Percentages and tables can be calculated, and bar plots are used for graphical representation.
    • Ex: citizenship (Mexican, German, French) or marital status (single, married, divorced, widowed).
  • A special case is the binary/dichotomous variable.
  • DS algorithms cannot operate on nominal data directly, since the input should be numeric, so binarisation (one-hot encoding) needs to be performed.
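One-hot encoding of a nominal variable can be done, for example, with pandas; the column values below reuse the citizenship example:

```python
import pandas as pd

# Nominal variable: labels only, with no order or distance between them
df = pd.DataFrame({"citizenship": ["Mexican", "German", "French", "German"]})

# Binarisation (one-hot encoding): one 0/1 indicator column per category
onehot = pd.get_dummies(df["citizenship"], prefix="citizenship")
print(onehot.columns.tolist())
# ['citizenship_French', 'citizenship_German', 'citizenship_Mexican']
```

Each original label becomes its own numeric indicator column, which is what most DS algorithms expect as input.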

Ordinal Variables

  • Impose order on values, no distance between values defined.
    • Ex: size in clothes (XXL > XL > L > M > S).
    • Social status is ordinal (upper class > middle/high > middle > middle/lower > lower class).
  • Arithmetic calculations are not possible.
  • Tables, percentages, and bar plots can be used, emphasizing the ordering of the values.
  • Internal encoding of ordinal variables preserves the order (e.g. lower class → 1).
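An internal encoding that preserves the order can be as simple as a mapping; the numeric codes below are one illustrative choice, not a prescribed scheme:

```python
# Ordinal variable: the order matters, the distances between codes do not
order = {"S": 1, "M": 2, "L": 3, "XL": 4, "XXL": 5}

sizes = ["M", "XXL", "S", "L"]
encoded = [order[s] for s in sizes]
print(encoded)  # [2, 5, 1, 3]

# The encoding preserves the ordering S < M < L < XL < XXL
assert order["S"] < order["M"] < order["L"] < order["XL"] < order["XXL"]
```

Unlike one-hot encoding for nominal data, this keeps a single column, but arithmetic on the codes (e.g. averaging) remains meaningless.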

Count or Discrete Data

  • Very often a variable is the result of a count.
    • Ex: number of words in a sentence, number of students, number of bugs.
    • Modeled by the Poisson distribution.
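The Poisson model for counts can be checked with the standard library alone; the rate λ = 3 is an illustrative value:

```python
import math

# Poisson pmf: P(X = k) = exp(-lam) * lam**k / k!
def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 3.0
ks = range(30)                               # tail beyond k = 30 is negligible here
probs = [poisson_pmf(k, lam) for k in ks]
mean = sum(k * p for k, p in zip(ks, probs))
print(round(mean, 6))  # ≈ 3.0: the Poisson mean equals its rate lam
```

A defining property of the Poisson distribution is that its mean and variance both equal λ, which is a quick diagnostic for whether count data are over- or under-dispersed.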

Understanding the Data

  • Variables must have meaning, described in the metadata file.
  • Role of variables:
    • Response variables (targets to model and predict), usually denoted Y.
    • Explanatory variables (inputs, predictors) used to predict the response, denoted X.
  • Data origin: primary data we collect (surveys/sampling) or secondary sources (public data).

Two Types of Data Files

  • The framework depends on whether or not there is a response variable (ex: transactions, ecological, survey data).
  • Data to explore, describe, and find associations.
  • Data to build a model and predict the response.

Data Mining Chain (Batch Mode)

    1. Preprocessing: summary, cleaning, analysis.
    2. Summary: univariate, bivariate.
    3. Multivariate exploration: visualization, clustering, profiling.
    4. Modeling: optimal model, estimation.
    5. Deployment: communication, usage for a certain context.

Advanced Preprocessing

  • Inventory Data Sources
  • Fix Quality Issues
  • Identify Important Features
  • Apply Feature Engineering Libraries
  • Validate Results
  • Repeat or Complete

Additional Steps For Data Preprocessing

  • Data profiling
  • Data cleansing
  • Data extraction
  • Data transformation
  • Data enrichment
  • Data validation

Process

  • Correlation between Features
  • Statistical Tests
  • Recursive Feature Elimination
  • Variance Threshold
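The variance-threshold idea can be sketched directly in NumPy (scikit-learn's `VarianceThreshold` implements the same logic); the data matrix and the 0.1 threshold are illustrative:

```python
import numpy as np

# Variance-threshold feature selection: keep only features whose variance
# exceeds a chosen threshold. A constant column carries no information.
X = np.array([[1.0, 0.0, 10.0],
              [1.0, 1.0, 20.0],
              [1.0, 0.0, 30.0],
              [1.0, 1.0, 40.0]])

threshold = 0.1
variances = X.var(axis=0)   # population variance of each column
keep = variances > threshold
print(variances)            # [  0.     0.25 125.  ]
print(X[:, keep])           # the constant first column is dropped
```

Note that this criterion looks at each feature in isolation; the other techniques in the list above (statistical tests, recursive feature elimination) also account for the relationship with the output variable.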

Data Processing Steps

  • Data collection (with labels).
  • Data preprocessing: cleaning and removing duplicates; split into test and validation sets.
  • Scale, balance, and augment the data.

Preprocessing:

  • Feature selection: filtering variables.
  • Feature extraction: deriving new variables.
  • Transformations: recoding numerical variables as categorical, quantifying a nominal variable, normalizing.

Data Cleaning

  • Errors and typos, missing data, outliers.
  • Missing data arise from non-response in sample surveys, dropouts in longitudinal data, or refusal to answer particular questions.
  • Options for missing data: complete case analysis, estimating the missing quantities, or imputation.
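The two main strategies for missing values (complete case analysis vs. imputation) can be contrasted with pandas; the toy columns are hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age":    [25.0, np.nan, 40.0, 33.0],
                   "income": [30.0, 45.0, np.nan, 52.0]})

# Complete case analysis: omit any case with a missing value on any variable
complete = df.dropna()
print(len(complete))  # 2 of the 4 cases survive

# Simple imputation: replace each missing value with its column mean
imputed = df.fillna(df.mean())
print(imputed["age"].tolist())
```

Complete case analysis is simple but discards information (and may bias results if missingness is not random); mean imputation keeps all cases at the cost of shrinking the variance.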

Outlier Detections

  • An outlier is an observation that deviates so much from the other observations that it suggests a different generating mechanism.
  • Outliers are always judged relative to the data and its statistics.
  • Events not linked to the current model.
  • Data that fail to fit current models.

Univariate Detections

  • The boxplot is a graphical display for exploration in which outliers appear tagged.
    • Mild and extreme outliers: an observation x is declared an extreme outlier if it lies outside [Q1 − 3·IQR, Q3 + 3·IQR]; mild outliers use the 1.5·IQR fences.
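Tukey's fences can be computed directly; a sketch with made-up data containing one obvious outlier:

```python
import numpy as np

# Tukey's fences: mild outliers lie beyond 1.5*IQR, extreme beyond 3*IQR
x = np.array([2.0, 3.0, 3.0, 4.0, 5.0, 5.0, 6.0, 30.0])

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
mild = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
extreme = (x < q1 - 3 * iqr) | (x > q3 + 3 * iqr)
print(x[mild], x[extreme])  # the value 30 is tagged by both fences
```

These are exactly the observations a boxplot would tag beyond its whiskers.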

Multivariate Detections

  • Some outliers are not detectable by univariate methods.
  • Detection is then based on computing the Mahalanobis distance.
  • The distance is measured between a point i and the distribution G that generated the data.

Mahalanobis Distance

  • For normally distributed data, the distribution of the distances is known.
  • This allows establishing a cutoff for declaring outliers.
  • Short distances occur more often.
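The Mahalanobis distance of each observation to G = (mean, covariance) can be computed as below; the four points are illustrative, with the last one far from the bulk:

```python
import numpy as np

# Mahalanobis distance of each point to the distribution G = (mu, cov)
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 3.0],
              [10.0, 10.0]])   # last point is far from the other three

mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False)
cov_inv = np.linalg.inv(cov)

# Quadratic form (x - mu)^T cov_inv (x - mu) per row, then square root
d = np.sqrt(np.einsum('ij,jk,ik->i', X - mu, cov_inv, X - mu))
print(d.argmax())  # 3: the distant point has the largest distance
```

Note that with so few points the outlier inflates the mean and covariance it is measured against, which motivates the robust (iteratively re-estimated) variant described below.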

Multivariate Outliers

  • Problem: the estimates of G and V are themselves corrupted by the outliers.
  • Initialize G and V.
    • Compute the Mahalanobis distances, rank the observations, keep those with the lowest distances, and update V and G until convergence.

Data Outliers (R)

  • A metrics function based on quantiles.
  • Values:
    • Md: the classical Mahalanobis distances.
    • Rd: the robust Mahalanobis distances.
    • A cutoff value separates the outliers.

Non-parametric

  • Local outliers:
    • The algorithm identifies outliers with metrics based on local density.

Summary Table

  • LOF estimates the degree of outlierness; values greater than 1 suggest an outlier.
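A minimal scikit-learn sketch of LOF, with toy coordinates and the contamination parameter left at its default:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# LOF: density-based local outlier scores; values well above 1 flag outliers
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [0.1, 0.1], [5.0, 5.0]])       # last point is isolated

lof = LocalOutlierFactor(n_neighbors=2)
labels = lof.fit_predict(X)                  # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_       # the LOF values themselves
print(labels)
```

The four clustered points have LOF ≈ 1 (their local density matches their neighbors'), while the isolated point's LOF is far above 1 and it is labeled -1.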

Outlier: Random Forest

  • The algorithm is the Isolation Tree.
    • Anomalies require fewer divisions to isolate.
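The idea that anomalies are isolated with fewer random splits is what scikit-learn's `IsolationForest` exploits; the cluster and the single anomaly below are simulated for illustration:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Isolation Forest: anomalies need fewer random splits to isolate,
# so they receive higher anomaly scores (predict() marks them with -1)
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 0.5, size=(100, 2)),  # dense normal cluster
               [[8.0, 8.0]]])                       # one clear anomaly

iso = IsolationForest(random_state=42).fit(X)
labels = iso.predict(X)
print(labels[-1])  # -1: the isolated point is flagged as an anomaly
```

Because each tree splits on random features at random thresholds, a point far from the bulk ends up alone in a leaf after very few splits, and its short average path length translates into a high anomaly score.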

Approaches

  • Encoding / reconstruction (e.g. via PCA).
    • The reconstruction error must be taken into consideration in the analysis.
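The PCA reconstruction and its mean squared reconstruction error can be sketched with NumPy; the data values and the `reconstruct` helper name are illustrative:

```python
import numpy as np

# PCA model: eigenvectors + mean. Observations can be projected onto the
# top-k eigenvectors, reconstructed, and scored by reconstruction error.
X = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2],
              [4.0, 3.9], [10.0, 0.0]])

mu = X.mean(axis=0)
Xc = X - mu
eigval, eigvec = np.linalg.eigh(np.cov(Xc, rowvar=False))  # ascending order

def reconstruct(k):
    """Project onto the top-k eigenvectors and map back to the original space."""
    V = eigvec[:, -k:]
    return (Xc @ V) @ V.T + mu

# Mean squared reconstruction error per observation, keeping k = 1 component
mse = ((X - reconstruct(1)) ** 2).mean(axis=1)
print(mse.round(3))

# Keeping all p components reconstructs the data exactly
assert np.allclose(reconstruct(2), X)
```

Observations that fit the retained components poorly get a large reconstruction error, which is how this encoding/reconstruction approach scores potential outliers.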

Steps

  • Cleaning the data.
  • Detecting rare events: fraud, network analysis.
  • Possible treatments: eliminate them, weight the individuals to diminish their influence on the statistics, or re-estimate the individuals' values.

Automate

  • Goal: automate analysis to understand the data faster.
  • Example: EDA packages for R.

Additional Libraries

  • R: dataMaid, DataExplorer, SmartEDA
  • Python: YData Profiling, D-Tale, Sweetviz, AutoViz
