Questions and Answers
Distinguish between data and information within the context of data analytics.
Data represents recorded facts and figures, while information is data presented in a meaningful context, providing insights and understanding.
What are the three 'V's that characterize big data, and why are they significant?
Volume (massive size of datasets), Velocity (speed at which data is generated), and Variety (diversity of data types). They are significant because they pose unique challenges for storage, processing, and analysis.
Explain the difference between structured and unstructured data, and provide an example of each.
Structured data is organized and formatted, typically stored in databases with well-defined rows and columns (e.g., financial records). Unstructured data lacks a predefined format (e.g., text, images, video).
Describe the purpose of data cleaning in the data analytics process.
Data cleaning converts raw data into an understandable and usable format for further analysis, for example by identifying and correcting errors and handling missing values.
How does SELECT DISTINCT differ from SELECT in SQL, and when would you use SELECT DISTINCT?
SELECT returns all matching rows, including duplicates; SELECT DISTINCT returns only unique rows. Use SELECT DISTINCT when duplicates would distort the result, such as listing each customer only once.
Explain the purpose of the GROUP BY clause in SQL and provide an example of its use.
GROUP BY groups rows that share common column values so aggregate functions can be computed per group, e.g., SELECT dept, AVG(salary) FROM employees GROUP BY dept.
What is the difference between the WHERE and HAVING clauses in SQL, and when should each be used?
WHERE filters individual rows before grouping and cannot be used with aggregate functions; HAVING filters groups after aggregation and can. Use WHERE for row-level conditions and HAVING for conditions on aggregates.
Explain the purpose of using joins in SQL, and give an example of a situation where a LEFT JOIN would be appropriate.
Joins combine rows from more than one table via common keys. A LEFT JOIN is appropriate when all records from the left table must be kept even without a match, e.g., listing all customers, including those with no orders.
In the context of SQL, what are NULL values, and why is it important to handle them properly?
NULL values represent missing or empty entries. They cannot be tested with ordinary comparison operators (use IS NULL or IS NOT NULL), so unhandled NULLs can silently exclude rows or distort results.
What distinguishes supervised learning from unsupervised learning in machine learning?
Supervised learning uses labeled datasets with a known desired output to classify data or predict values; unsupervised learning uses unlabeled datasets to discover hidden patterns or similarities, with no desired output given.
Give an example of a classification problem and a regression problem in the context of supervised learning.
Classification: deciding whether an email is spam or not. Regression: forecasting tomorrow's temperature.
What is factor analysis, and when is it most appropriately used?
Factor analysis finds a smaller number of new (latent) variables, i.e., factors, that describe a set of interrelated variables based on their correlations; it is most appropriate for dimension reduction and for uncovering underlying factors.
Explain the purpose of Principal Component Analysis (PCA) and how it achieves dimensionality reduction.
PCA describes a dataset by a few new, uncorrelated variables (principal components) that are linear combinations of the original variables; it reduces dimensionality by retaining only the components that explain the most variance.
What is the Kaiser criterion, and how is it used in the context of Principal Component Analysis (PCA)?
The Kaiser criterion retains only components with an eigenvalue greater than 1, i.e., components that explain more variance than a single standardized original variable.
Describe the K-means clustering algorithm and the steps involved in assigning data points to clusters.
K-means chooses k initial cluster centres, assigns each observation to the nearest centre, updates each centre to the mean of its assigned observations, and repeats the assignment and update steps until convergence.
What are cluster centroids, and how are they used when interpreting the results of a K-means clustering analysis?
Cluster centroids are the averages of the clustering variables over all observations in a cluster; they are used to characterize each cluster and as input to subsequent analyses.
In the context of supervised learning, what is meant by 'splitting' the data, and why is it important?
Splitting divides the data into a training set and a test set (e.g., 80%/20%) so the model can be evaluated on data it has not seen, which guards against overfitting.
Describe the concept of imputation and when it is used in data preprocessing.
Imputation predicts missing values from the other variables (e.g., with k-nearest neighbours, tree-based models, or linear models) and substitutes the predictions; it is used when observations have missing values.
What is the purpose of feature selection, and what is one method for achieving it?
Feature selection keeps only the most informative predictors. One method is a simple filter based on a numeric score such as the t-statistic; alternatives include recursive feature elimination and stepwise selection.
Explain what is meant by 'overfitting' in the context of model evaluation, and why it is a problem.
Overfitting means the model memorizes the training set rather than learning generalizable patterns; it is a problem because the model then performs poorly on new, unseen data.
Flashcards
Problem Definition
Identifying and clearly communicating the problem that needs to be solved.
Data Collection
Gathering and measuring data through surveys, interviews, web scraping, etc.
Data Cleaning
Converting raw data into an understandable and usable format for further analysis.
Data Analytics
Transforming and analyzing raw data to produce usable information.
Big Data
A collection of data of huge volume, characterized by Volume, Velocity, and Variety.
What is Data?
Recorded facts and figures; information is data presented in a meaningful context.
Structured Data
Organized, formatted data with well-defined rows and columns, typically stored in databases (e.g., financial records).
Unstructured Data
Data with no predefined format and flexible content, such as text, images, and video.
SQL
Structured Query Language; a language for querying and interacting with relational databases.
Data Retrieval
Extracting specific data from a database, for example with SQL queries.
SELECT (SQL)
Determines which columns to return.
FROM (SQL)
Identifies the table from which to draw data.
WHERE clause
Retrieves records that satisfy a given condition; used for filtering data.
DELETE (SQL)
Deletes existing records in a table.
ORDER BY (SQL)
Sorts results in ascending (ASC, the default) or descending (DESC) order.
LIMIT (SQL)
Specifies the number of records to return.
Machine Learning
Uses statistical techniques to learn from data without being explicitly programmed.
Supervised Learning
Learning from labeled datasets to classify data or predict outputs (e.g., spam detection, weather forecasting).
Unsupervised Learning
Learning from unlabeled datasets to discover hidden patterns or similarities among observations (e.g., clustering).
Factor Analysis
A dimension-reduction technique that finds a smaller number of latent factors describing the original variables, based on the correlations between them.
Study Notes
Data Analytics Process
Problem Definition
- Identify and clearly communicate the problem that needs to be solved.
Data Collection
- Gather and measure data through surveys, interviews, observations, web scraping, and third-party databases.
Data Cleaning
- Convert raw data into an understandable format for further analysis, using tools and methods like SQL.
- These tools enable data selection, filtering, aggregation, and manipulation.
Data Analysis
- Followed by data interpretation and result communication.
What is “Data Analytics”?
- Transform and analyze raw data to produce effectively usable information.
Big Data
- A collection of data of huge volume.
- Volume: Massive size of datasets.
- Velocity: Speed at which data is generated.
- Variety: Diversity of data types and sources.
Data
- Recorded facts and figures; when presented in a meaningful context, data becomes information.
- Structured data is organized and formatted with well-defined rows and columns.
- It is typically stored in databases: financial records, customer info, inventory lists, etc.
- Unstructured data has no predefined structure and flexible content and format, like text, images, and video.
Forms of data
- Quantitative (numeric), such as sales figures or temperature readings.
- Qualitative (contextual), such as customer feedback and employer reviews.
SQL (Structured Query Language)
- A database query language for interacting with relational databases, which link tables through common keys.
- Retrieves specific data, performs various types of analysis, integrates data from different databases, and ensures data quality by identifying and correcting errors
SQL in R
- R packages are installed with install.packages("XXX") and loaded with library(XXX).
Commands
- SELECT [column(s)]: Determines which columns to return
- FROM [table]: Used to identify table from which to draw data
- WHERE [condition]: Retrieves records that satisfy a certain condition for filtering data
- HAVING [condition]: Filters groups after aggregation; unlike WHERE, it can be used with aggregate functions (see the example after this list)
- INSERT: Inserts new records in a table
- UPDATE: Modifies existing records in a table
- DELETE: Deletes existing records in a table
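A minimal sketch of these commands in R using the sqldf package, which runs SQL queries against data frames; the employees data frame and its columns are hypothetical:

library(sqldf)  # runs SQL queries on R data frames

# hypothetical example data
employees <- data.frame(
  name   = c("Ann", "Bob", "Cas", "Dee"),
  dept   = c("IT", "IT", "HR", "HR"),
  salary = c(60000, 52000, 48000, 55000)
)

# WHERE filters rows; GROUP BY + HAVING filter on aggregates
sqldf("SELECT dept, AVG(salary) AS avg_salary
       FROM employees
       WHERE salary > 45000
       GROUP BY dept
       HAVING AVG(salary) > 50000
       ORDER BY avg_salary DESC")

Note how WHERE acts on individual rows before grouping, while HAVING acts on the aggregated groups afterward.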
Comparison operators
- < (less than)
- > (greater than)
- != (not equal to)
- <= (less than or equal to)
- >= (greater than or equal to)
Logical operators
- AND (display if ALL conditions are true)
- OR (display if ANY conditions are true)
- NOT (display if condition is not true)
Testing for a range of values
- BETWEEN selects values within a given range, including the start and end values
Set Values
- IN allows specification of multiple values in a WHERE clause
Sorting Data
- ORDER BY sorts results in ascending (ASC) or descending (DESC) order, with ascending as the default.
Limiting Data
- LIMIT specifies the number of records to return
Aggregate Functions
- MIN returns the smallest value of the selected column
- MAX returns the largest value of the selected column
- AVG returns the average value of the selected column
- SUM returns the total sum of the selected column
- COUNT returns the number of rows
Grouping Data
- GROUP BY groups rows that have common column values.
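A minimal dplyr sketch of grouping with aggregates, reusing the hypothetical employees data frame from the earlier example:

library(dplyr)

employees %>%
  group_by(dept) %>%                     # like GROUP BY dept
  summarise(avg_salary = mean(salary),   # like AVG(salary)
            n = n())                     # like COUNT(*)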
Multi Table Operations
- Joins combine rows from more than one table
- LEFT JOIN keeps all records from the left table and matching records from the right table.
- RIGHT JOIN keeps all records from the right table and matching records from the left table.
- INNER JOIN keeps records that match both tables.
- FULL OUTER JOIN keeps all records from both tables.
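A minimal join sketch with dplyr; the customers and orders data frames are hypothetical:

library(dplyr)

customers <- data.frame(customer_id = 1:3, name = c("Ann", "Bob", "Cas"))
orders    <- data.frame(customer_id = c(1, 1, 3), amount = c(20, 35, 50))

# LEFT JOIN: keep all customers, even those without orders (Bob gets NA)
left_join(customers, orders, by = "customer_id")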
NULL Values
- NULL values cannot be tested with ordinary comparison operators.
- IS NULL tests for empty values.
- IS NOT NULL tests for non-empty values.
- IFNULL (or COALESCE) handles NULL values by providing an alternative value.
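A minimal sketch of the R equivalents (NULL in a SQL table corresponds to NA in an R data frame); the df data frame and its score column are hypothetical:

library(dplyr)

df <- data.frame(id = 1:4, score = c(10, NA, 7, NA))

df %>% filter(!is.na(score))               # like WHERE score IS NOT NULL
df %>% mutate(score = coalesce(score, 0))  # like IFNULL(score, 0)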
SQL Equivalents in R
- SQL: SELECT columns is select(df, column1, column2, ...) in R.
- SQL: WHERE condition is filter(df, condition) in R.
- SQL: ORDER BY column is arrange(df, column) in R.
- SQL: LIMIT n is head(df, n) in R.
- SQL: GROUP BY column is group_by(df, column) in R.
- SQL: LEFT JOIN is left_join(df1, df2, by = "column") in R.
- SQL: RIGHT JOIN is right_join(df1, df2, by = "column") in R.
- SQL: INNER JOIN is inner_join(df1, df2, by = "column") in R.
- SQL: FULL OUTER JOIN is full_join(df1, df2, by = "column") in R.
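Putting these equivalents together, a minimal sketch of one SQL query and its dplyr translation; the sales data frame and its columns are hypothetical:

library(dplyr)

# SQL: SELECT region, amount FROM sales
#      WHERE amount > 100 ORDER BY amount DESC LIMIT 10
sales %>%
  select(region, amount) %>%
  filter(amount > 100) %>%
  arrange(desc(amount)) %>%
  head(10)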
Machine Learning
- Uses statistical techniques to learn from data without being explicitly programmed.
Types of Machine Learning
- Supervised learning requires human intervention: it uses labeled datasets and a known desired output to classify data (e.g., email vs. spam) or to predict values via regression (e.g., weather forecasting).
- Example algorithms: linear regression, logistic regression, etc.
- Unsupervised learning requires no human intervention: it uses unlabeled datasets to discover hidden patterns or similarities among observations, with no desired output given.
- Dimensionality reduction, clustering (e.g., customer segmentation), and association (e.g., recommendation systems) are examples.
Factor Analysis (focus on variables)
- Given a set of interrelated variables, find a smaller number of new (latent) variables, i.e., factors, that describe the original set as well as possible, based on the correlations between the individual variables.
- A dimension-reduction technique for finding underlying factors.
Spurious correlation
- When two variables are correlated but do not have a causal relationship.
Types of Factor Analysis
- Exploratory factor analysis (EFA)
- Confirmatory factor analysis (CFA)
- Principal component analysis (PCA)
- Attempts to describe the variables in a dataset as well as possible by a few new variables that are linear combinations of the original variables
Principal components
- Used as a dimension-reduction technique or as a first step on which to base subsequent analyses.
- Aim to explain as much of the variance of the original variables as possible; the amount of variance explained is measured by the eigenvalue.
- The components are uncorrelated, and their number is less than or equal to the number of original variables.
- Terminology: feature = predictor = IV = X; outcome = DV = label = Y
PCA
- The larger the eigenvalue, the more variance the component Z explains. Standardize variables first, because PCA otherwise gives more weight to variables with higher variances; standardization puts all variables on a single unit of measurement, with a mean of zero and variance equal to one.
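A minimal standardization sketch in base R; the dataset object is assumed to be an all-numeric data frame. Note that FactoMineR's PCA() standardizes by default (scale.unit = TRUE), so manual scaling matters mainly for other workflows:

dataset_std <- scale(dataset)      # mean 0, variance 1 per column
round(colMeans(dataset_std), 10)   # approximately 0 for every variable
apply(dataset_std, 2, sd)          # 1 for every variable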
Steps to conducting PCA
- Formulate the problem: identify the objective and the variables to include.
- Examine the correlations: variables that are highly correlated with each other tend to load on the same dimension(s).
R
The following code is to be used in R
- Correlation matrix: corr_matrix <- round(cor(dataset), digits = 2)
- Principal component analysis (FactoMineR package): res.pca <- PCA(dataset, graph = FALSE)
Determine the number of Dimensions
- Approaches: a priori determination, the Kaiser criterion, the scree plot, and the proportion of variance explained.
- Kaiser criterion (retain components with an eigenvalue greater than 1), using the factoextra package: eig.value <- get_eigenvalue(res.pca); eig.value <- round(eig.value, digits = 2)
- Scree plot: fviz_eig(res.pca, addlabels = TRUE, ylim = c(0, X))
- Retain variables based on their contribution to a given dimension, e.g., keep those whose contribution exceeds the average expected contribution.
Cluster Analysis
- Group a collection of objects into subgroups (clusters) so that objects are more similar to members of their own cluster than to members of other clusters, based on the similarity between individual objects.
- An explorative technique, used for market segmentation, fraud detection, and data reduction.
Types of clustering methods
- K-means clustering divides objects into k distinct, non-overlapping clusters:
  1. Choose k initial cluster centres.
  2. Assign each observation to the nearest cluster centre.
  3. Update each centre's coordinates to the mean of the observations assigned to it.
- Steps 2 and 3 are repeated until a convergence criterion is met.
Steps of conducting K-means
- Formulate the problem and select a distance measure, such as Euclidean distance (Pythagoras).
Decide on a number of clusters
- The elbow method or average silhouette method are used.
- The NbClust package provides 30 indices for determining the optimal number of clusters.
- Cluster visualization uses PCA to reduce the data to two dimensions.
- Interpret the clusters via the cluster centroids (the averages of the clustering variables over all observations in a cluster) and use cluster membership in subsequent analyses.
- Before clustering, standardize the variables so they share a single unit of measurement.
- Factors affecting the reliability and validity of k-means results: initial centroids, data scaling, and outliers (see the sketch below).
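A minimal k-means sketch with the factoextra package, assuming a hypothetical all-numeric data frame df and an assumed choice of k = 3:

library(factoextra)

df_std <- scale(df)                  # single unit of measurement
set.seed(123)                        # initial centroids are chosen at random

fviz_nbclust(df_std, kmeans, method = "wss")         # elbow method
fviz_nbclust(df_std, kmeans, method = "silhouette")  # average silhouette

km <- kmeans(df_std, centers = 3, nstart = 25)  # nstart: multiple random starts
km$centers                           # cluster centroids for interpretation
fviz_cluster(km, data = df_std)      # PCA-based two-dimensional visualization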
Supervised Learning Workflow
- Explore the data: check its structure, check for missing values, and detect outliers
Data Preprocessing
- Splitting: divide the data into a training set and a test set (e.g., 80% training, 20% test) and evaluate models on the test data (see the sketch after this list).
- Missing values: imputation predicts the missing values from the other variables and substitutes the predictions; methods include k-nearest neighbours, tree-based models, and linear models.
- Rule of thumb: if more than 20% of a variable's values are missing, consider dropping the variable rather than imputing.
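A minimal base-R sketch of an 80/20 split; the df data frame and the seed value are assumptions:

set.seed(42)                                               # reproducible split
train_idx <- sample(seq_len(nrow(df)), size = floor(0.8 * nrow(df)))
train <- df[train_idx, ]                                   # 80% for model fitting
test  <- df[-train_idx, ]                                  # 20% held out for evaluation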
Outliers
- Detect outliers with simple functions (summary(), hist(), boxplot()) or convenient packages (e.g., the outliers package); handle them with winsorization, truncation, or imputation (see the sketch below).
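A minimal winsorization sketch in base R; the 5th/95th-percentile cutoffs and the amount column are assumptions, not prescribed by the notes:

x <- df$amount                                         # hypothetical numeric column
q <- quantile(x, probs = c(0.05, 0.95), na.rm = TRUE)  # chosen cutoffs
x_wins <- pmin(pmax(x, q[1]), q[2])                    # cap extreme values at the cutoffs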
Normalization
- Transformation and normalization via logarithms, normalization, scaling, and centering
Feature creation and selection
- Create new features by calculating ratios and creating dummy variables.
- Use simple filters based on a numeric score (e.g., the t-statistic), recursive feature elimination, or stepwise selection
Train models on the data using algorithms such as:
- Linear regression
- Logistic regression
Naive Bayes
- A classification technique based on conditional probabilities via Bayes' theorem: P(Y|X) = P(X|Y) P(Y) / P(X).
- Pros: easy to build for large datasets and good performance with high-dimensional data. Cons: the assumption of independent features fails in most cases (see the sketch below).
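A minimal sketch with the e1071 package (one common naive Bayes implementation in R; the notes do not name a package). The spam factor column and the train/test data frames are hypothetical:

library(e1071)

model <- naiveBayes(spam ~ ., data = train)  # estimates P(Y|X) via Bayes' theorem,
                                             # assuming independent features
pred  <- predict(model, newdata = test)      # predicted classes
table(predicted = pred, actual = test$spam)  # quick confusion matrix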
K-Nearest Neighbour
- Predicts new cases based on similarity measure (distance function)
- For classification, the prediction is based on the majority class among the neighbours; for regression, on their mean or median. A common heuristic sets K to approximately the square root of N.
- Look at data
- Calculate distances
- Find neighbours
- Vote on outcome (for classification)
- Distance measure: Euclidean distance for continuous features; for categorical features, a mismatch count such as |x_i - x'_i|.
- Advantages: simple, and makes no assumptions about the data.
- Disadvantages: computationally expensive and potentially sensitive to irrelevant features (see the sketch below).
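A minimal sketch with the class package (again an assumed implementation); the train_x/test_x matrices are assumed numeric and already standardized, and train_y is the factor of training labels:

library(class)

k <- round(sqrt(nrow(train_x)))  # heuristic: K is about sqrt(N)
pred <- knn(train = train_x,     # stored training cases
            test  = test_x,      # cases to classify
            cl    = train_y,     # true classes of the training cases
            k     = k)           # majority vote among the k nearest neighbours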
Decision Tree
- Splits the data through a sequence of yes/no decisions; each branch leads to a sub-tree or a leaf.
- Leaf node: the node where the final decision (the predicted outcome) is taken.
- Be aware of overfitting.
Random Forest
- Builds upon standard decision trees with three features: random sampling with replacement (bootstrapping), random selection of the feature set at each split, and voting to aggregate results (see the sketch below).
- Disadvantage: lack of interpretability.
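A minimal sketch with the randomForest package (an assumed choice of implementation); the outcome factor column and the train/test data frames are hypothetical:

library(randomForest)

set.seed(1)                                  # bootstrap sampling is random
rf <- randomForest(outcome ~ ., data = train,
                   ntree = 500)              # trees built on bootstrap samples,
                                             # random feature subset at each split
pred <- predict(rf, newdata = test)          # votes aggregated across the trees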
Gradient Boosting
- Builds tree after tree to improve on the deficiencies of the previous trees.
- Powerful and efficient.
- Disadvantage: loss of interpretability.
Model Evaluation on test data
- Metrics include the mean absolute percentage error, accuracy, the confusion matrix, and ROC plots.
- Overfitting: the model memorizes the training set rather than 'understanding' the underlying patterns, so it generalizes poorly to new data.
Two types of model evaluation
- Domain evaluation: assess whether the model makes sense, using expert assessment and knowledge of how the data were generated.
- Statistical evaluation: fit on the training data (in-sample accuracy) versus performance on held-out test data.
Model Evaluation
- Root mean squared error (RMSE): sensitive to outliers. Mean absolute error (MAE): less sensitive to outliers. Mean absolute percentage error (MAPE): undefined when actual values are zero.
Evaluating Results
- Accuracy = correct decisions / total number of decisions.
- Error rate = 1 - accuracy.
- False positives and false negatives correspond to Type I and Type II errors.
- Sensitivity (true positive rate) = TP / (TP + FN).
- Specificity (true negative rate) = TN / (TN + FP); the false alarm rate = FP / (FP + TN) = 1 - specificity.
Expected Value
- Weight each outcome's value by its probability, decomposing p(x, y) = p(y) * p(x | y).
Visualizing Model Performance: ROC Graphs
- Receiver Operating Characteristics (ROC) graphs are two-dimensional plots with the false positive rate on the X axis and the true positive rate on the Y axis.
- The diagonal represents guesswork; classifiers toward the lower left are conservative, and those toward the upper right are permissive.
- A scoring classifier traces an ROC curve as its decision threshold varies.
- The area under the curve (AUC) is a useful single-number summary; an AUC of 0.5 is equivalent to random guessing (see the sketch below).
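A minimal sketch of these metrics in R; pred and actual are hypothetical factor vectors with levels "no"/"yes", and scores are hypothetical predicted probabilities. The pROC package is one common choice for ROC/AUC:

cm <- table(predicted = pred, actual = actual)  # confusion matrix
TP <- cm["yes", "yes"]; FN <- cm["no", "yes"]
FP <- cm["yes", "no"];  TN <- cm["no", "no"]

accuracy    <- (TP + TN) / sum(cm)  # correct decisions / total decisions
sensitivity <- TP / (TP + FN)       # true positive rate
specificity <- TN / (TN + FP)       # true negative rate; 1 - false alarm rate

library(pROC)
roc_obj <- roc(response = actual, predictor = scores)  # ROC curve
plot(roc_obj)                                          # FPR vs. TPR trade-off
auc(roc_obj)                                           # single-number summary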