Questions and Answers
In the context of data analysis, what does 'ETL' stand for in the staging area of a data warehouse architecture?
- Extract, Transfer, Load
- Evaluate, Translate, Load
- Execute, Transform, Load
- Extract, Transform, Load (correct)
What is the primary focus of Statistics, Data Mining, and Data Science concerning the 'inductive phase of learning'?
- Creating new data collection methods.
- Moving from idea/theory to observation.
- Moving from observation to idea/theory/hypothesis. (correct)
- Analyzing data to validate existing theories.
What is the general equation that represents the composition of data?
- Data = Signal - Noise
- Data = Information + Error
- Data = Signal + Interference
- Data = Fit + Noise (correct)
Which "V" of Big Data refers to the trustworthiness and accuracy of the data?
What is the primary goal when dealing with data?
What is 'data warehousing or data lakes'?
In the context of machine learning, what process relies primarily on inductive learning?
What is the Machine Learning equivalent of 'Individuals' in Statistics?
Within the context of Data Science as an interdisciplinary field, which components are combined?
Which of the following is NOT a typical step in the data science process?
Under what condition does multivariate data arise?
What is a key characteristic of the rows in a data file?
What is a key restriction regarding columns of a data file?
What type of variable is 'citizenship' ('Mexican', 'German', etc.)?
What type of variable is 'size in clothes' ('XXL', 'XL', 'L', 'M', 'S')?
What is a primary characteristic of 'Count' data?
Which file is used to understand the meaning of the variables in the data?
What is the purpose of the 'Feature Selection' step in data preprocessing?
What does 'complete case analysis' imply in handling missing values?
What should be done about the presence of outliers?
What is a goal of multivariate outlier detection?
What does 'LOF' refer to?
Which is a technique for detecting outliers by means of Random Forest?
Why do outliers require fewer splits than normal data in the Isolation Tree method?
Once a PCA model is obtained (eigenvectors, projections, and means are calculated), what is done next?
How is the mean squared reconstruction error calculated?
What is Attribute-wise Learning for Scoring Outliers?
Once you have detected outliers, what do you have to do?
Which of the following is true about preprocessing?
Why is preprocessing important?
What is one of the first steps in the data mining chain?
What is the role of variables in the data?
What are the steps in data processing?
In the context of handling outliers, what does 'Declare outliers as missing values' mean?
What is the use of 'Internal encoding'?
In which situations do you have to treat biased data?
What is the first step for data preprocessing?
In data preprocessing, what does the Scale (Normalize, Standardize) step involve?
What are the main characteristics of outlier detection in Random Forest (anomaly score)?
Flashcards
What is Learning?
An iterative process that happens between real-world facts and the hypothesized theories.
What is deduction?
Movement from an idea, theory or hypothesis to observation.
What is Induction?
Movement from observation to idea, theory, or hypothesis.
Statistics/Data mining/Data Science
What is Big Data?
What is Velocity in Big Data?
What is Volume in Big Data?
What is Value in Big Data?
What is Variety in Big Data?
What is Veracity of Data?
What is Visualisation?
What is data warehousing or data lakes?
What is Statistics?
What is Computer Science?
What is Machine Learning?
What are multivariate data?
What are Tables?
What are rows in a data file?
What are columns in a data file?
What are nominal variables?
What are Ordinal Variables?
What is Count data?
What is Preprocessing?
What is Data Preprocessing?
What is Correlation Between Features?
What are Statistical Tests?
Recursive Feature Elimination
What is Variance Threshold?
What is Feature selection?
What is Feature extraction?
What do you have to do with errors such as typos?
What do you have to do with missing values?
What are outliers?
What is the boxplot (Tukey, 1977)?
What is Mahalanobis distance?
Study Notes
Introduction to Artificial Intelligence Degree Course
- Introduction and Data Quality, taught by Prof. Dante Conti and Prof. Sergi Ramirez.
Course Resources
- The Guía Docente and the web site / Atenea are available at:
- https://www.fib.upc.edu/es/estudios/grados/grado-en-inteligencia-artificial/plan-de-estudios/asignaturas/PMAAD-GIA
- https://ramia-lab.github.io/AdvancedModelling/
Project Details
- Project groups consist of 5-6 people, with a total of 4 groups per laboratory.
- Practical Work:
- Choose a "real-world" problem or case study.
- Implement algorithms and methods.
- Write a technical/managerial report.
- Oral defense.
- R is the primary language, but Python is also acceptable.
Importance of Data in Learning
- Learning involves an iterative process between real-world facts and hypothesized theories.
- Deduction moves from idea/theory/hypothesis to observation.
- Induction moves from observation to idea/theory/hypothesis.
- Statistics, Data mining, and Data Science focus on the inductive phase of learning.
- Data can be represented as Data = Fit + Noise.
Trends Leading to Data Flood
- Exponential increase in data generation and storage,
- including bank, telecom, scientific, web, text, e-commerce, and social network data.
- Increase in data formats:
- relational tables, non-structured tables, log files, textual data, and image data.
- Real-time, streaming data sources.
Big Data Overview
- Big Data = Transactions + Interactions + Observations
- Encompasses various inputs like sensors, mobile web, user click streams, and weblogs.
- Includes outputs like user-generated content, sentiment analysis, social interactions, spatial coordinates, and external demographics.
- Characterized by increasing data variety and complexity.
The 8 Vs of Big Data
- Velocity: The speed at which data is generated, collected, and analyzed.
- Volume: The amount of data generated each second.
- Value: The worth of the extracted data.
- Variety: Describes the different types of data generated, often referring to unstructured data.
- Veracity: How trustworthy the data is.
- Validity: Accuracy of the data for its intended use.
- Volatility: The age of the data; fresh data can quickly make stored data irrelevant.
- Visualisation: The challenge of presenting and using the data, impacted by factors like scalability and functionality.
Data as a Valuable Resource
- Stored data contains information about the generating phenomenon (statistical regularity).
- The goal is to reveal information like models, patterns, associations, trends, and clusters hidden in the data.
- Data is a treasure for organizations if the data quality is reliable.
- All digital interactions can be valuable data sources that can be enhanced through collected data analysis.
- Revealing interesting insights requires more than selecting and reporting: SQL queries alone are not sufficient.
- Consistently assembling and storing historical data is called data warehousing (or data lakes); these form the memory of the company.
- Need to learn from the data.
Data Warehouse Architecture
- Data flows from sources through staging areas to a warehouse, then to data marts, and finally to users who perform analysis, reporting, and mining.
- Key components include operational systems, ERP, CRM, flat files, metadata, summary data, and raw data.
Interdisciplinarity in Machine Learning
- Computer Science is for developing machines or algorithms that solve problems.
- Statistics is designed to make inferences with confidence measures.
- Machine learning is based on statistics to create machines/algorithms that self-program to solve tasks.
Rosetta Stone Comparison of Statistics and Machine Learning
- Statistics terms map to machine learning equivalents:
- Variables are attributes/features.
- Individuals are instances.
- Explanatory variables are inputs.
- Response variables are outputs/targets/concepts.
- Models are networks/trees.
- Coefficients are weights.
- Fit criteria are cost functions.
- Estimation is learning/training.
- Classification is clustering/unsupervised classification.
- Discrimination is supervised classification.
Data Science Venn Diagram
- Data Science combines Computer Science/IT, Math and Statistics, and Domain/Business Knowledge.
Data Science Process Steps
- Data Collection
- Data Preparation
- Model Fitting
- Model Evaluation
- Hyperparameter Tuning
Multivariate Data Definition
- Multivariate data arise when researchers/users record values of several variables/attributes on a set of units.
- This leads to a vector-valued or multidimensional observation for each unit.
Data File Characteristics
- Data is multivariate.
- Tables can contain individual variables or counts.
- Useful data types:
- Transaction data
- Graphs
- Similarity matrices, Link data
- Textual data - Documents, Html/Xml
- Stream data - Sensors, Podcasting
- Image data - Medical, Instagram
The Data Matrix
- A multivariate data matrix X ∈ R^(n×p) has entries x_ij, where x_ij is the value of the jth variable for the ith unit.
- Theoretical entities describing the univariate distributions are denoted by random variables X_1, ..., X_p.
- The rows of X can be written as vectors x_1, x_2, ..., x_n, and the columns as x_(1), x_(2), ..., x_(p).
Sample Mean Vector and Covariance Matrix
- The sample mean of the jth variable is calculated as x̄_j = (1/n) Σ x_ij.
- The sample mean vector is x̄ = (x̄_1, ..., x̄_p)^T.
- The sample covariance of the jth and kth variables is s_jk = (1/(n-1)) Σ (x_ij - x̄_j)(x_ik - x̄_k).
- The p × p matrix S = (s_jk) is the sample covariance matrix.
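As a quick sketch, the sample mean vector and covariance matrix can be computed with NumPy; the 5 × 3 data matrix below is made up for illustration:

```python
import numpy as np

# Toy data matrix X with n = 5 units and p = 3 variables (hypothetical values)
X = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 3.0, 2.0],
    [4.0, 2.0, 5.0],
    [5.0, 4.0, 3.0],
])

# Sample mean vector: x̄_j = (1/n) Σ_i x_ij
mean_vec = X.mean(axis=0)

# Sample covariance matrix S = (s_jk) with the 1/(n-1) factor
S = np.cov(X, rowvar=False)

print(mean_vec)   # [3.  2.4 3.4]
print(S.shape)    # (3, 3)
```

`np.cov` uses the same 1/(n−1) convention as the formula for s_jk above.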
Rows and Columns Characteristics
- Rows of a data file represent individuals or instances, ranging from tens to millions.
- Also referred to as samples, examples, or records.
- They represent repeated units forming a population under study; they can be classified, associated, or clustered, and are characterized by a predetermined set of attributes.
- Use all available data.
Columns Characteristics:
- Each row is defined by a set of features, variables, or attributes.
- A variable can take several values (according to a distribution).
- Attribute types:
- Binary, nominal, ordinal, interval, ratio, textual,...
- The same variables are measured for all individuals, in the same order.
- A dictionary of the variables appears in the first row (header).
Variable Types Hierarchy
- Data types include discrete and continuous, quantitative and qualitative.
- Discrete data include categorical variables (nominal/ordinal).
- Continuous data include quantitative (metric) variables.
Nominal Variables
- Nominal variables have distinct categories represented by symbols, and these serve as labels or names.
- No relation is implied among nominal values, with no ordering or distance measure.
- Percentages and tables can be calculated, and bar plots are used for graphical representation.
- Ex: citizenship (Mexican, German, French) or marital status (single, married, divorced, widowed).
- A special case is the binary/dichotomous variable.
- DS algorithms cannot operate on nominal data directly, since their input must be numeric, so binarisation (one-hot encoding) needs to be performed.
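A minimal sketch of one-hot encoding with pandas; the toy citizenship column is hypothetical:

```python
import pandas as pd

# Hypothetical nominal variable: citizenship
df = pd.DataFrame({"citizenship": ["Mexican", "German", "French", "German"]})

# Binarisation (one-hot encoding): one 0/1 indicator column per category
onehot = pd.get_dummies(df["citizenship"], prefix="citizenship")

print(sorted(onehot.columns.tolist()))
# ['citizenship_French', 'citizenship_German', 'citizenship_Mexican']
```

Each row now has exactly one active indicator, a form that numeric algorithms can consume directly.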
Ordinal Variables
- Ordinal variables impose an order on values, but no distance between values is defined.
- Ex: size in clothes (XXL > XL > L > M > S); social status is also ordinal
- (upper class > middle/high > middle > middle/lower > lower class).
- Arithmetic calculations are not possible.
- Tables, percentages, and bar plots apply, emphasizing the ordering of the values.
- Internal encoding of ordinal variables preserves the order (e.g., lower class -> 1).
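An order-preserving internal encoding can be sketched with a simple mapping; the sizes follow the clothing example above:

```python
import pandas as pd

# Ordinal variable: the encoding must preserve S < M < L < XL < XXL
sizes = pd.Series(["M", "S", "XXL", "L", "XL"])
order = {"S": 1, "M": 2, "L": 3, "XL": 4, "XXL": 5}

# Map each label to its rank
encoded = sizes.map(order)
print(encoded.tolist())  # [2, 1, 5, 3, 4]
```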
Count or Discrete Data
- Count data often result from counting occurrences of something.
- Ex: number of words in a sentence, number of students, number of bugs.
- Typically modeled by the Poisson distribution.
Understanding the Data
- Variables must have meaning, which is described in the metadata file.
- Role of variables:
- Response variables (targets, the variables to predict) are usually denoted Y.
- Explanatory variables (inputs, predictors used to predict the response) are denoted X.
- Data origin can be primary (what we collect via surveys/sampling) or secondary (public data).
Two Types of Data Files
- The framework depends on whether a response variable is present.
- Ex: transactions, ecological data, surveys.
- Data to explore, describe, and find associations (no response).
- Data to build a model and predict the response.
Data Mining Chain (Batch Mode)
- Preprocessing: summary, cleaning, analysis.
- Summary: univariate, bivariate.
- Multivariate exploration: visualization, clustering, profiling.
- Modeling: optimal model, estimation.
- Deployment: communication and usage in a given context.
Advanced Preprocessing
- Inventory Data Sources
- Fix Quality Issues
- Identify Important Features
- Apply Feature Engineering Libraries
- Validate Results
- Repeat or Complete
Additional Steps For Data Preprocessing
- Data profiling
- Data cleansing
- Data extraction
- Data transformation
- Data enrichment
- Data validation
Feature Selection Techniques
- Correlation between Features
- Statistical Tests
- Recursive Feature Elimination
- Variance Threshold
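As one concrete example of these techniques, scikit-learn's `VarianceThreshold` drops low-variance features; the tiny matrix below is made up:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical matrix: the first column is constant (zero variance)
X = np.array([
    [0.0, 1.0, 10.0],
    [0.0, 2.0, 20.0],
    [0.0, 3.0, 30.0],
])

selector = VarianceThreshold(threshold=0.0)  # remove zero-variance features
X_reduced = selector.fit_transform(X)

print(X_reduced.shape)  # (3, 2): the constant column is gone
```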
Data Processing Steps
- Data collection with labels.
- Data preprocessing: cleaning and removing duplicates; splitting into training, validation, and test sets.
- Scaling, balancing, and augmenting the data.
Preprocessing:
- Feature selection: filtering variables.
- Feature extraction: deriving new variables.
- Transformations: recoding numerical variables as categorical, quantifying a nominal variable, normalizing.
Data Cleaning
- Errors and typos, missing data, outliers.
- Non-response in sample surveys.
- Drop-outs in longitudinal data.
- Refusal to answer questions.
- Complete case analysis: keep only the rows with no missing values.
- Estimate the affected quantities.
- Impute the missing data.
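The two strategies above, complete case analysis versus imputation, can be sketched with pandas on a made-up table:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values
df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 35.0],
                   "income": [30.0, 45.0, np.nan, 60.0]})

# Complete case analysis: drop every row with any missing value
complete = df.dropna()
print(len(complete))  # 2 rows survive

# Imputation: fill each missing value with its column mean
imputed = df.fillna(df.mean())
print(imputed.isna().sum().sum())  # 0 missing values remain
```

Complete case analysis discards information; mean imputation keeps every row but shrinks the variance of the imputed columns.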
Outlier Detection
- An outlier is an observation that deviates so much from the other observations as to suggest it was generated by a different mechanism.
- Outliers are almost always present when working with real data.
- Events not linked to the current model.
- Data that current models fail to fit.
Univariate Detection
- The boxplot display is used for exploration; outliers are tagged.
- Distinguishes mild and extreme outliers: an observation x is declared a mild outlier if it lies outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR], and an extreme outlier if it lies outside [Q1 − 3·IQR, Q3 + 3·IQR].
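The Tukey fences can be computed directly; the sample, with one planted extreme value, is hypothetical:

```python
import numpy as np

# Hypothetical sample with one planted extreme value
x = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0, 5.0, 30.0])

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

# Mild outliers: outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
mild = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
# Extreme outliers: outside [Q1 - 3*IQR, Q3 + 3*IQR]
extreme = x[(x < q1 - 3.0 * iqr) | (x > q3 + 3.0 * iqr)]

print(mild.tolist(), extreme.tolist())  # [30.0] [30.0]
```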
Multivariate Detection
- Multivariate outliers are not necessarily visible in any single variable.
- Detection is based on computing the Mahalanobis distance between each individual i and the centroid G of the data.
Mahalanobis Distance
- For normally distributed data, squared Mahalanobis distances approximately follow a chi-squared distribution.
- This allows a cut-off to be established for declaring outliers.
- Short distances occur more often than large ones.
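A minimal sketch of Mahalanobis-based detection with a chi-squared cut-off; the data and the planted outlier are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[0] = [8.0, 8.0]  # plant one clear multivariate outlier

g = X.mean(axis=0)                           # centroid G
S_inv = np.linalg.inv(np.cov(X, rowvar=False))

# Squared Mahalanobis distance of every observation to the centroid
diff = X - g
d2 = np.einsum('ij,jk,ik->i', diff, S_inv, diff)

# Under normality, d2 is roughly chi-squared with p = 2 df;
# the 97.5% quantile (about 7.38) is a common cut-off.
outliers = np.where(d2 > 7.38)[0]
print(0 in outliers)  # True: the planted point is flagged
```

Note that the classical mean and covariance used here are themselves affected by outliers, which motivates the robust iteration described next.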
Robust Multivariate Outliers
- Problem: the centroid G and the covariance V are themselves corrupted by the outliers.
- Robust approach: initialize G and V, compute the Mahalanobis distances, rank the observations, re-estimate V and G from those with the lowest distances, and iterate until convergence.
Outlier Detection in R
- Functions report, for each observation:
- md: the classical Mahalanobis distance.
- rd: the robust Mahalanobis distance.
- The cut-off for declaring outliers is based on quantiles of these distances.
Local Outlier Factor (LOF)
- A non-parametric, local method: each point is scored by comparing its local density with that of its neighbours.
- LOF values close to 1 suggest inliers; values clearly greater than 1 suggest outliers.
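A sketch using scikit-learn's `LocalOutlierFactor` on synthetic data with one isolated point:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
X[0] = [6.0, 6.0]  # isolated point with low local density

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_     # LOF values; > 1 suggests outlier

print(labels[0])  # -1: the isolated point is flagged
```

`negative_outlier_factor_` stores the negated LOF, so negating it back recovers the scores described above.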
Outlier Detection with Random Forests: Isolation Forest
- The base algorithm is the Isolation Tree: data are recursively split on random attributes at random split values.
- Anomalies are isolated with fewer splits than normal points, so a shorter average path length yields a higher anomaly score.
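An Isolation Forest sketch with scikit-learn, again on synthetic data with one planted anomaly:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[0] = [7.0, 7.0]  # planted anomaly

iso = IsolationForest(n_estimators=100, random_state=0)
labels = iso.fit_predict(X)     # -1 = anomaly, 1 = normal
scores = iso.score_samples(X)   # lower score = more anomalous

print(labels[0])  # -1: the planted anomaly is flagged
```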
Reconstruction-Based Approaches
- Encode the data into a lower-dimensional representation (e.g., with PCA), then reconstruct it.
- Observations with a large reconstruction error are candidate outliers and deserve consideration in the analysis.
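The reconstruction-error idea can be sketched with a plain NumPy PCA; the plane-structured data and the off-plane point are synthetic:

```python
import numpy as np

# Synthetic data lying close to a 2-D plane inside R^3,
# with one observation pushed off the plane.
rng = np.random.default_rng(0)
a = rng.normal(size=(100, 2))
noise = 0.05 * rng.normal(size=100)
X = np.column_stack([a[:, 0], a[:, 1], a[:, 0] + a[:, 1] + noise])
X[0, 2] += 8.0  # break the linear structure for observation 0

# PCA model: center, eigen-decompose the covariance, keep k = 2 components
mean = X.mean(axis=0)
Xc = X - mean
eigval, eigvec = np.linalg.eigh(np.cov(Xc, rowvar=False))
V = eigvec[:, -2:]             # eigenvectors of the 2 largest eigenvalues

# Project, reconstruct, and score by mean squared reconstruction error
Z = Xc @ V                     # projections onto the model
X_hat = Z @ V.T + mean         # reconstruction
mse = ((X - X_hat) ** 2).mean(axis=1)

print(int(mse.argmax()))       # the off-plane observation stands out
```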
Treatment of Detected Outliers
- Clean the data.
- Detect rare events (fraud and network-intrusion analysis).
- Options once detected: eliminate them; down-weight the individuals so they influence the statistics less; or declare them missing and estimate (impute) their values.
Automated EDA
- Goal: automate exploratory data analysis to reach decisions about the data faster.
- Example: automated EDA packages for R.
Additional Libraries
- R: dataMaid, DataExplorer, SmartEDA
- Python: YData Profiling, D-Tale, Sweetviz, AutoViz