Questions and Answers
Which type of data analysis is regarded as the fundamental layer, providing summaries and visualizations of prevalent data trends?
- Diagnostic
- Prescriptive
- Predictive
- Descriptive (correct)
In data analysis, inferential statistics is particularly useful when:
- The complete dataset is readily available for analysis.
- Describing the central tendencies of a dataset.
- Predicting future outcomes with certainty is required.
- Analyzing a sample to generalize findings to a larger population is necessary. (correct)
The primary goal of A/B testing within the context of hypothesis testing is to:
- Reject all hypotheses.
- Determine whether to retain or reject a hypothesis based on data analysis. (correct)
- Generate new hypotheses.
- Avoid data analysis altogether.
Which measure of central tendency is most influenced by outliers in a dataset?
Which data visualization is best suited for displaying the frequency distribution of a single numerical variable?
Secondary data is characterized by which of the following?
What is the primary advantage of using APIs for data collection?
In the context of data cleaning, what does 'data wrangling' primarily involve?
Which aspect of data quality focuses on ensuring that data values fall within acceptable parameters or ranges?
What does a 'consistency check' primarily aim to achieve during data cleaning?
In data preprocessing, what is the likely outcome of ignoring outliers?
For what type of dataset is the Z-score method most appropriate for outlier detection?
When dealing with missing data, what is the key characteristic of data that is 'Missing Completely at Random' (MCAR)?
Which imputation method involves replacing missing values with a value drawn randomly from the observed values of that variable?
In time series data, which imputation method replaces missing values with the most recent prior observation?
What is the purpose of the MICE (Multiple Imputation by Chained Equations) algorithm?
Which type of bias occurs when a dataset disproportionately represents certain segments of a population due to non-random sampling methods?
What type of bias is introduced when data is inaccurately measured or classified differently across various groups?
What is a key characteristic of unsupervised learning?
In machine learning, what is the purpose of 'one-hot encoding'?
What is the primary purpose of k-fold cross-validation?
In a confusion matrix, what does a 'false positive' represent?
What does the linearity assumption in linear regression imply?
What is the primary purpose of the Gradient Descent algorithm?
In a decision tree, what does each internal node represent?
What is the role of entropy in the ID3 algorithm?
What is Gini impurity?
What does feature engineering primarily involve?
Which feature scaling method is most sensitive to outliers?
The primary goal of feature selection is to:
What characterizes ensemble learning methods?
How does 'hard voting' work in ensemble methods?
Which characteristic distinguishes boosting from bagging?
In the context of clustering, what is the role of 'similarity measures'?
How does k-means handle outliers?
What is a key aspect of DBSCAN?
What data may non-personalized filtering be based on?
When does the cold-start problem take effect?
Which feedback method includes 'thumbs up', star ratings, and written reviews?
What does 'univariate' mean?
What may convert date or timestamp features to numeric features by taking the difference between two timestamps or dates?
Flashcards
Descriptive analytics
Summary of past data to understand what happened.
Diagnostic analytics
Analysis to determine why something happened.
Predictive analytics
Prediction based on current data.
Prescriptive analytics
Recommendation of actions based on insights.
Ordinal data
Categorical data with a rank or order.
Nominal data
Categorical data with distinct labels and no inherent order.
Discrete data
Numerical data with specific, countable values.
Continuous data
Numerical data measured within a range, including fractions and decimals.
Structured data
Data organized in a fixed schema, such as tables.
Unstructured data
Data without a predefined structure, such as free text or images.
Semi-structured data
Data with some organizational markers but no rigid schema, such as JSON.
Big Data
Large volumes of varied data generated at high velocity.
Small Data
Controlled, steady data with a fixed schema.
Primary data
Data collected firsthand with specific research goals in mind.
Secondary data
Data collected by others and readily available.
Internal data
Data from within an organization.
External data
Data from sources outside the organization.
Historical data
Data collected about past events.
Real-time data
Data generated and processed as events occur.
Measures of Central Tendency
Statistics describing the center of a dataset: mean, median, mode.
Mean
The average value.
Median
The middle value in a sorted dataset.
Mode
The most frequent value.
Measures of Dispersion
Statistics describing data spread: range, variance, standard deviation.
Range
The difference between the minimum and maximum values.
Variance
A measure of data spread relative to the mean.
Standard Deviation
The square root of the variance.
Histogram
A chart displaying the frequency of numerical values divided into intervals (bins).
Human-Reported Data
Data that comes directly from people, e.g. surveys, interviews, and focus groups.
Surveys
Questionnaires used to collect self-reported data from many respondents.
Interviews
One-on-one questioning used to collect detailed self-reported data.
Focus Groups
Moderated group discussions used to collect qualitative data.
Behavioural and Observational Data
Data tracked through observations, logs, and sensors.
Observations
Data recorded by watching subjects directly.
Log Files
Automatically recorded traces of system or user activity.
Sensor Data
Data captured by physical sensors.
Experimental Data
Data gathered through controlled manipulation of variables.
Digital Platform Data
Data collected from online platforms, e.g. via web scraping.
Transaction Data
Data tracking exchanges within systems.
Secondary Data
Pre-existing data collected by others.
Study Notes
Insights from Data
- Descriptive analytics summarize and visualize data trends to provide context.
- Diagnostic analytics explores why events occurred.
- Predictive analytics anticipates future outcomes based on current data.
- Prescriptive analytics recommends actions based on insights.
Data Types
- Qualitative data consists of categories or labels.
- Quantitative data consists of numerical values.
- Ordinal data has a rank or order.
- Nominal data has distinct labels without order.
- Discrete data has specific counted values
- Continuous data has values measured within a range, including fractions and decimals.
- Data can be structured (tables), semi-structured (JSON), or unstructured (free text, images).
- Big Data involves large volumes of varied data generated at high velocity.
- Small Data is controlled, steady, and has a fixed schema.
- Primary data is collected firsthand with specific research goals in mind.
- Secondary data is collected by others and is readily available.
- Internal data comes from within an organization
- External data comes from sources outside the organization.
- Data can be historical or real-time.
Mathematical Foundations in Analytics
- Descriptive analytics uses measures of central tendency (mean, median, mode)
- Descriptive analytics uses measures of dispersion (range, variance, standard deviation).
- Inferential statistics generalizes findings from a sample to a larger population.
- Hypothesis testing validates assumptions.
- Diagnostic analytics distinguishes causation from correlation.
- Predictive analytics utilizes probability, information theory, calculus and linear algebra.
- Regression models and decision trees are used for prediction.
- The Gradient Descent Algorithm minimizes loss functions in machine learning.
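A minimal gradient-descent sketch (pure Python, made-up toy data) that minimizes a mean-squared-error loss for a one-parameter linear model:

```python
# Fit y = w * x to data by repeatedly stepping against the gradient of the MSE loss.
def gradient_descent(xs, ys, lr=0.01, steps=1000):
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of MSE = (2/n) * sum((w*x - y) * x)
        grad = (2 / n) * sum((w * x - y) * x for x, y in zip(xs, ys))
        w -= lr * grad  # step in the direction that decreases the loss
    return w

# Data generated from y = 3x; the fitted weight should approach 3.
w = gradient_descent([1, 2, 3, 4], [3, 6, 9, 12])
```

The learning rate `lr` controls step size: too large and the updates diverge, too small and convergence is slow.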
Measures and Visualizing Dispersion
- Mean signifies the average value.
- Median is the middle value in a sorted dataset.
- Mode represents the most frequent value.
- Range is the difference between the minimum and maximum values.
- Variance measures data spread relative to the average.
- Standard deviation is the square root of the variance.
- Histograms display the frequency of numerical values divided into intervals (bins).
- Count plots display the counts of observations in a category.
- Scatter plots show the relationship between two continuous variables.
- Heatmaps visualize matrix data, using color to represent values.
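The central-tendency and dispersion measures above can be computed with Python's standard `statistics` module (illustrative values):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)            # average value
median = statistics.median(data)        # middle value of the sorted data
mode = statistics.mode(data)            # most frequent value
data_range = max(data) - min(data)      # max minus min
variance = statistics.pvariance(data)   # population variance (spread around the mean)
stdev = statistics.pstdev(data)         # square root of the variance
```

For this dataset: mean 5, median 4.5, mode 4, range 7, variance 4, standard deviation 2.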
Data Collection
- Human-reported data comes directly from surveys, interviews, and focus groups.
- Behavioral and Observational Data is tracked through observations, logs, and sensors.
- Experimental data is gathered through controlled variable manipulation.
- Digital and platform-based data is collected from online platforms, e.g. via web scraping.
- Transactional data tracks exchanges within systems.
- Secondary data is pre-existing data collected by others.
Sources for Data
- Government databases offer demographic trends.
- Academic research reveals correlations.
- Industry reports forecast trends.
- Media sources reflect public opinion.
Inferential Statistics
- Inference is the process of drawing conclusions about a population.
- The normal distribution is symmetric and described by mean and standard deviation.
- The Empirical Rule (68-95-99.7) describes the spread of data in a normal distribution.
- The Central Limit Theorem states that the distribution of sample means approximates a normal distribution as sample size grows.
- Tail heaviness of the t-distribution is controlled by its degrees-of-freedom parameter.
- A t-test compares the means of groups (or a group mean against a reference value).
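A pooled two-sample t statistic can be sketched in pure Python (toy data; a real analysis would also compute a p-value, e.g. with `scipy.stats.ttest_ind`):

```python
import math
import statistics

def two_sample_t(a, b):
    """Pooled (equal-variance) two-sample t statistic comparing group means."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    # Pool the variances, weighting each group by its degrees of freedom.
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    se = math.sqrt(pooled * (1 / na + 1 / nb))  # standard error of the mean difference
    return (statistics.mean(a) - statistics.mean(b)) / se

# Toy data: two groups with clearly different means give a large t value.
t = two_sample_t([5.1, 4.9, 5.0, 5.2], [4.0, 4.2, 3.9, 4.1])
```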
Hypothesis Testing
- Null hypothesis: the means of the two groups are equal.
- Alternative hypothesis: the means of the two groups are not equal.
- Non-directional hypothesis: tests whether any difference exists between two groups.
- Directional hypothesis: tests whether one group is higher or lower than the other.
Data Wrangling and Quality
- Data wrangling transforms raw data into a usable format.
- Data cleaning removes errors.
- Data profiling reviews dataset content for issues.
Data Quality Metrics
- Completeness measures the percentage of missing data.
- Accuracy assesses how closely data represents real-world values.
- Consistency ensures data is synchronized across systems.
- Validity checks data against defined ranges and rules.
- Uniqueness identifies duplicates.
- Integrity traces relationships between data.
Consequences of Unclean Data
- Incorrect analysis arises.
- Unreliable models arise.
- Misleading insights arise.
Data Checks
- Type checks: values use the correct data types.
- Code checks: values are validated against allowed code lists.
- Range checks: values fall within appropriate ranges.
- Format checks: values follow the correct format.
- Consistency checks: values are logically consistent with each other.
- Uniqueness checks: each entity appears only once.
- Presence checks: all required fields are present.
- Length checks: values have the correct number of characters.
- Look-up checks: fields are validated against a set of allowed values.
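Several of these checks can be sketched for a single record (the field names, ranges, and allowed values below are made up for illustration):

```python
import re

def validate_record(record):
    """Run basic data-quality checks on one record; return a list of failures."""
    errors = []
    # Presence check: all required fields must be present.
    for field in ("name", "age", "country", "email"):
        if field not in record:
            errors.append(f"missing field: {field}")
            return errors
    # Type check: age must be an integer.
    if not isinstance(record["age"], int):
        errors.append("age: wrong type")
    # Range check: age must fall within an acceptable range.
    elif not 0 <= record["age"] <= 120:
        errors.append("age: out of range")
    # Look-up check: country must come from a set of allowed values.
    if record["country"] not in {"DE", "FR", "UK"}:
        errors.append("country: not in allowed list")
    # Format check: email must match a simple pattern.
    if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", record["email"]):
        errors.append("email: bad format")
    return errors

ok = validate_record({"name": "Ada", "age": 36, "country": "UK", "email": "ada@example.com"})
bad = validate_record({"name": "Bob", "age": 200, "country": "XX", "email": "not-an-email"})
```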
Outliers
- Outliers distort statistics and analyses.
- Methods for finding outliers include: interquartile range (IQR), Z-score, k-NN, Local Outlier Factor.
- IQR measures the spread of the middle 50% of the data.
- A Z-score measures how many standard deviations a value lies from the mean.
- K-Nearest Neighbors flags points that are far from their neighbors.
- To handle outliers: remove them, transform the data, use robust statistics, or impute values.
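IQR- and Z-score-based detection, sketched with toy data. Note that in a small sample an extreme value inflates the standard deviation and so mutes its own Z-score, which is why a threshold below the usual 3 is used here:

```python
import statistics

def zscore_outliers(data, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)
    return [x for x in data if abs(x - mu) / sigma > threshold]

def iqr_outliers(data):
    """Flag values beyond 1.5 * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # quartiles
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]

data = [10, 11, 12, 11, 10, 12, 11, 100]
iqr_out = iqr_outliers(data)                      # 100 lies far outside the IQR fences
z_out = zscore_outliers(data, threshold=2.5)      # lower threshold for the small sample
```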
Missing data
- Missing Completely at Random (MCAR): values are missing at random, with no pattern.
- Missing at Random (MAR): missingness is related to other observed data.
- Missing Not at Random (MNAR): missingness is linked to the missing value itself.
Removing missing data
- Listwise removal: remove entire row
- Pairwise removal: remove entry just for that analysis
- Attribute removal: remove column
- Imputation: replace missing values with another value
- Time-based: use previous or later entries to fill the missing values
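Mean imputation and time-based forward fill, sketched in pure Python with `None` marking missing values:

```python
import statistics

def mean_impute(values):
    """Replace missing values (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

def forward_fill(values):
    """Time-based imputation: carry the most recent prior observation forward."""
    result, last = [], None
    for v in values:
        last = v if v is not None else last
        result.append(last)
    return result

series = [3, None, 5, None, 7]
by_mean = mean_impute(series)      # gaps filled with the mean of {3, 5, 7}
by_ffill = forward_fill(series)    # gaps filled with the previous observation
```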
Evaluate missing data
- MSE: mean squared error
- MAE: mean absolute error
Data biases
- Selection: the sample is incomplete or non-random.
- Sampling: random sampling is not done properly.
- Convergence: data is not selected properly.
- Participation: responses aren't fully representative.
- Historical: the dataset carries past biases.
- Availability: analysis relies only on the data that happens to be available.
- Outlier: outliers are not accounted for.
- Imputation: missing values are replaced using other flawed data.
Types of ML
- Supervised learning: learning with known targets (labels).
- Unsupervised learning: learning without known targets.
- Semi-supervised learning: models that use both labelled and unlabelled data.
Preparing ML models
- Gather training data and make sure it's balanced.
- Determine the input for the learning function, which is encoded as a feature vector.
- Algorithms create the model.
- Linear regression minimizes the prediction error (e.g., the sum of squared residuals).
- Decision trees create new decision boundaries.
- A multilayer perceptron creates decision surfaces.
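One common way to turn categorical input into a feature vector is one-hot encoding (also asked about in the questions above); a minimal sketch with made-up categories:

```python
def one_hot(value, categories):
    """Encode a categorical value as a binary vector with a single 1."""
    return [1 if value == c else 0 for c in categories]

colors = ["red", "green", "blue"]
vec = one_hot("green", colors)
```

Each category gets its own position, so no artificial ordering is imposed on nominal data.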
Evaluating models
- K-fold cross-validation
- A confusion matrix is a table used to evaluate the performance of a model.
- Macro average: average the metric computed separately per class.
- Micro average: compute the metric over all data pooled together.
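Macro vs. micro averaging can be made concrete with per-class precision derived from confusion-matrix counts (toy labels):

```python
from collections import Counter

def precision_scores(y_true, y_pred):
    """Per-class precision plus macro (mean per class) and micro (pooled) averages."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp = Counter(), Counter()  # true/false positives per predicted class
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[p] += 1
        else:
            fp[p] += 1
    per_class = {c: tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0 for c in classes}
    macro = sum(per_class.values()) / len(classes)       # every class weighted equally
    micro = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))  # pooled counts
    return per_class, macro, micro

y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "b", "b"]
per_class, macro, micro = precision_scores(y_true, y_pred)
```

Macro averaging weights rare classes (like "c" here) the same as common ones, so the two averages diverge on imbalanced data.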
Terms
- ŷ = predicted output
- b = point of intersection with the y-axis (intercept)
- ε (epsilon) = error term
- residuals = difference between actual and predicted output
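The intercept and residuals can be made concrete with a closed-form least-squares fit (toy, perfectly linear data, so every residual is zero):

```python
def fit_line(xs, ys):
    """Least-squares fit of y = m*x + b; returns slope, intercept, residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least-squares slope and intercept.
    m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    b = mean_y - m * mean_x  # b is where the fitted line crosses the y-axis
    residuals = [y - (m * x + b) for x, y in zip(xs, ys)]  # actual minus predicted
    return m, b, residuals

m, b, res = fit_line([0, 1, 2, 3], [1, 3, 5, 7])  # data generated from y = 2x + 1
```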
Assumptions of Linear Regression
- Linearity: the dependent and independent variables are linearly related.
- Equal variance (homoscedasticity): residuals have constant variance.
- Independence: observations don't influence each other.
- Lack of multicollinearity: independent variables are not strongly correlated with each other.
- Absence of endogeneity: independent variables are not correlated with the error term.
Decision tree
- Internal node: a test on an attribute/feature.
- Branch: a decision rule.
- Leaf node: an outcome.
- The algorithm selects the best attribute to split on (e.g., by information gain or Gini impurity).
Gini Index (Categorical features)
How often an element would be incorrectly labelled if labelled randomly according to the node's class distribution.
- Lowest value: 0 = all elements within the node belong to the same class.
- Highest value: 0.5 (for two classes) = elements are evenly distributed.
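Gini impurity as a short function (pure Python, toy labels):

```python
from collections import Counter

def gini_impurity(labels):
    """1 minus the sum of squared class proportions in the node."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

pure = gini_impurity(["yes"] * 6)                # one class only -> 0
even = gini_impurity(["yes"] * 3 + ["no"] * 3)   # 50/50 binary split -> 0.5
```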
Analysis (trees)
- Stopping criteria prevent overfitting.
- Max depth limits the depth of the tree.
- Min samples for split: a node must have a minimum number of samples before splitting.
- Early stopping: stop when further splits don't yield significant gain.
What to scale for Machine Learning
- Scaling can be applied with both supervised and unsupervised models.
- Scaling applies to variables with numeric values.
- Transformation = changing variables so they are better suited for models.
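Min-max scaling vs. Z-score standardization, sketched with toy data containing an outlier (min-max is the scaling method most sensitive to outliers, since the extreme value sets the whole range):

```python
import statistics

def min_max_scale(data):
    """Rescale values to [0, 1]; the min and max define the range."""
    lo, hi = min(data), max(data)
    return [(x - lo) / (hi - lo) for x in data]

def standardize(data):
    """Z-score standardization: zero mean, unit (population) standard deviation."""
    mu, sigma = statistics.mean(data), statistics.pstdev(data)
    return [(x - mu) / sigma for x in data]

data = [1, 2, 3, 4, 100]
scaled = min_max_scale(data)   # the outlier squashes the other values near 0
z = standardize(data)          # centered on the mean instead
```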
Ensemble Model Techniques
- Bagging requires diversity in the input dataset.
- Boosting increases diversity by reweighting the dataset toward hard examples.
- Ensembles can combine simple base models (e.g., linear regression) or use boosted methods.
- Ensemble learning combines multiple models to create a stronger model.
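Hard voting (asked about in the questions above) reduces to a majority vote over the class predictions of the individual models; a minimal sketch with made-up labels:

```python
from collections import Counter

def hard_vote(predictions):
    """Hard voting: each model casts one vote; the majority class wins."""
    return Counter(predictions).most_common(1)[0][0]

# Three hypothetical models predict a class for the same sample.
label = hard_vote(["spam", "spam", "ham"])  # two votes to one
```

Soft voting, by contrast, averages the models' predicted probabilities instead of counting discrete votes.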