Data Insights and Analytics Types


Questions and Answers

Which type of data analysis is regarded as the fundamental layer, providing summaries and visualizations of prevalent data trends?

  • Diagnostic
  • Prescriptive
  • Predictive
  • Descriptive (correct)

In data analysis, inferential statistics is particularly useful when:

  • The complete dataset is readily available for analysis.
  • Describing the central tendencies of a dataset.
  • Predicting future outcomes with certainty is required.
  • Analyzing a sample to generalize findings to a larger population is necessary. (correct)

The primary goal of A/B testing within the context of hypothesis testing is to:

  • Reject all hypotheses.
  • Determine whether to retain or reject a hypothesis based on data analysis. (correct)
  • Generate new hypotheses.
  • Avoid data analysis altogether.

Which measure of central tendency is most influenced by outliers in a dataset?

Answer: Mean

Which data visualization is best suited for displaying the frequency distribution of a single numerical variable?

Answer: Histogram

Secondary data is characterized by which of the following?

Answer: Collected by others, potentially for a different purpose

What is the primary advantage of using APIs for data collection?

Answer: They allow interaction with external platforms through defined interfaces.

In the context of data cleaning, what does 'data wrangling' primarily involve?

Answer: Transforming raw data into a usable format

Which aspect of data quality focuses on ensuring that data values fall within acceptable parameters or ranges?

Answer: Validity

What does a 'consistency check' primarily aim to achieve during data cleaning?

Answer: Ensuring data is entered in a logically consistent manner

In data preprocessing, what is the likely outcome of ignoring outliers?

Answer: Distorted correlations and model performance

For what type of dataset is the Z-score method most appropriate for outlier detection?

Answer: Normally distributed datasets

When dealing with missing data, what is the key characteristic of data that is 'Missing Completely at Random' (MCAR)?

Answer: There is no systematic reason for the missing values.

Which imputation method involves replacing missing values with a value drawn randomly from the observed values of that variable?

Answer: Random sample imputation

In time series data, which imputation method replaces missing values with the most recent prior observation?

Answer: LOCF (Last Observation Carried Forward)

What is the purpose of the MICE (Multiple Imputation by Chained Equations) algorithm?

Answer: To impute missing values using an iterative process

Which type of bias occurs when a dataset disproportionately represents certain segments of a population due to non-random sampling methods?

Answer: Selection bias

What type of bias is introduced when data is inaccurately measured or classified differently across various groups?

Answer: Measurement bias

What is a key characteristic of unsupervised learning?

Answer: It learns from unlabeled data.

In machine learning, what is the purpose of 'one-hot encoding'?

Answer: Converting categorical data to numerical

What is the primary purpose of k-fold cross-validation?

Answer: To improve model generalization by training and validating on different subsets of the data

In a confusion matrix, what does a 'false positive' represent?

Answer: An instance that was incorrectly predicted as positive

What does the linearity assumption in linear regression imply?

Answer: The relationship between predictors and the response variable is linear.

What is the primary purpose of the Gradient Descent algorithm?

Answer: To find the minimum of a function

In a decision tree, what does each internal node represent?

Answer: A feature (attribute)

What is the role of entropy in the ID3 algorithm?

Answer: To measure the information content of an attribute

What is gini impurity?

Answer: A measure of randomness

What does feature engineering primarily involve?

Answer: Creating new data representations to improve model performance

Which feature scaling method is most sensitive to outliers?

Answer: Min-Max Scaling

The primary goal of feature selection is to:

Answer: Reduce the number of features, improving processing speed or model interpretation.

What characterizes ensemble learning methods?

Answer: Using multiple models to obtain better predictive performance

How does 'hard voting' work in ensemble methods?

Answer: By selecting the class with the most votes from base models

Which characteristic distinguishes boosting from bagging?

Answer: Boosting attempts to correct the errors from prior models

In the context of clustering, what is the role of 'similarity measures'?

Answer: To quantify the relatedness between data points

How does k-means handle outliers?

Answer: It is sensitive to initial conditions, which can greatly affect where points are assigned.

What is a key aspect of DBSCAN?

Answer: Density-based clustering

What data may non-personalized filtering be based on?

Answer: Recency, popularity and trending

When does the cold-start problem occur?

Answer: When a new product is loaded to the platform.

Which feedback method includes 'thumbs up', star ratings, and written reviews?

Answer: Explicit Feedback

What is univariate data?

Answer: A single variable recorded over time

What may convert date or timestamp features to numeric features by taking the difference between two timestamps or dates?

Answer: Age/time Difference

Flashcards

Descriptive analytics

Summary of past data to understand what happened.

Diagnostic analytics

Analysis to determine why something happened.

Predictive analytics

Prediction based on current data.

Prescriptive analytics

Recommendation of actions based on data analysis.

Ordinal data

Categorical data with order or ranking.

Nominal data

Distinct, unordered categorical data.

Discrete data

Specific, distinct, countable values.

Continuous data

Any value within a range.

Structured data

Data organized in rows and columns.

Unstructured data

Data lacking a predefined format.

Semi-structured data

Data with some organizational properties, but not strictly fixed.

Big Data

Extremely large, rapidly generated data.

Small Data

Smaller volumes of controlled data.

Primary data

Data collected for a specific research goal.

Secondary data

Data collected by others, possibly for different reasons.

Internal data

Data from within an organization.

External data

Data from outside an organization.

Historical data

Describes past conditions and events.

Real-time data

Data that is updated immediately.

Measures of Central Tendency

Describe a dataset's central point.

Mean

Average value.

Median

Middle value in a sorted dataset.

Mode

Most frequent value.

Measures of Dispersion

Describe how spread out data is.

Range

Minimum to maximum.

Variance

How far values spread from the average.

Standard Deviation

Square root of the variance.

Histogram

Displays frequency of values.

Human-Reported Data

Data collected directly from people.

Surveys

Systematic data collection from individuals.

Interviews

Detailed qualitative insights from in-depth exploration.

Focus Groups

Diverse perspectives from guided discussion.

Behavioural and Observational Data

Data from monitoring or observing behaviours.

Observations

Collecting data by watching and recording actions.

Log Files

Records of system/user activities.

Sensor Data

Measurements from devices monitoring conditions.

Experimental Data

Data gathered by manipulating variables in a controlled setup.

Digital Platform Data

Data originating from digital platforms.

Transaction Data

Data from internal system transactions.

Secondary Data

Data collected by someone else.

Study Notes

Insights from Data

  • Descriptive analytics summarizes and visualizes data trends to provide context.
  • Diagnostic analytics explores why events occurred.
  • Predictive analytics anticipates future outcomes based on current data.
  • Prescriptive analytics recommends actions based on insights.

Data Types

  • Qualitative data includes categories
  • Quantitative data includes numerical values.
  • Ordinal data has a rank
  • Nominal data has distinct labels
  • Discrete data has specific counted values
  • Continuous data has values measured within a range, including fractions and decimals.
  • Data can be structured (tables), unstructured (free text, images), or semi-structured (JSON).
  • Big Data involves large volumes of data with high variety generated quickly.
  • Small Data is controlled, steady, and has a fixed schema.
  • Primary data is collected firsthand with specific research goals in mind.
  • Secondary data is collected by others and is readily available.
  • Internal data comes from within an organization
  • External data comes from sources outside the organization.
  • Data can be historical or real-time.

Mathematical Foundations in Analytics

  • Descriptive analytics uses measures of central tendency (mean, median, mode)
  • Descriptive analytics uses measures of dispersion (range, variance, standard deviation).
  • Inferential Statistics extrapolates tendencies about a large population using only a sample
  • Hypothesis testing validates assumptions.
  • Diagnostic analytics distinguishes causation from correlation.
  • Predictive analytics utilizes probability, information theory, calculus and linear algebra.
  • Regression models and decision trees are used for prediction.
  • The Gradient Descent Algorithm minimizes loss functions in machine learning.
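As a sketch of the gradient descent idea mentioned above (a toy example I'm adding, not from the lesson): minimizing f(x) = (x - 3)² by repeatedly stepping against the derivative.

```python
# Toy gradient descent: minimize f(x) = (x - 3)^2, whose derivative is 2*(x - 3).
def gradient_descent(start, lr=0.1, steps=100):
    x = start
    for _ in range(steps):
        grad = 2 * (x - 3)   # gradient of the loss at the current point
        x -= lr * grad       # step against the gradient
    return x

x_min = gradient_descent(start=0.0)
# x_min converges toward 3, the minimum of the function
```

The same loop, with the gradient computed over a dataset, is what minimizes loss functions in machine learning.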

Measures and Visualizing Dispersion

  • Mean signifies the average value.
  • Median is the middle value in a sorted dataset.
  • Mode represents the most frequent value.
  • Range is the difference between the minimum and maximum values.
  • Variance measures data spread relative to the average.
  • Standard deviation is the square root of the variance.
  • Histograms display the frequency of numerical values divided into intervals (bins).
  • Count plots display the counts of observations in a category.
  • Scatter plots show the relationship between two continuous variables.
  • Heatmaps visualize matrix data, using color to represent values.
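The central-tendency and dispersion measures above can be computed with Python's `statistics` module; the toy dataset here is my own and also shows why the mean is the measure most influenced by outliers.

```python
import statistics

data = [10, 12, 12, 13, 14, 100]  # 100 is an outlier

mean = statistics.mean(data)          # pulled up by the outlier
median = statistics.median(data)      # robust middle value
mode = statistics.mode(data)          # most frequent value
data_range = max(data) - min(data)    # minimum to maximum
variance = statistics.variance(data)  # sample variance
std_dev = statistics.stdev(data)      # square root of the variance

# The outlier drags the mean well above the median.
```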

Data Collection

  • Human-reported data comes directly from surveys, interviews, and focus groups.
  • Behavioral and Observational Data is tracked through observations, logs, and sensors.
  • Experimental data is gathered through controlled variable manipulation.
  • Digital and platform-based data is collected from online platforms, e.g. by web scraping or APIs
  • Transactional data tracks exchanges within systems.
  • Secondary data is pre-existing

Sources for Data

  • Government databases offer demographic trends.
  • Academic research reveals correlations.
  • Industry reports forecast trends.
  • Media sources reflect public opinion.

Inferential Statistics

  • Inference is the process of drawing conclusions about a population.
  • The normal distribution is symmetric and described by mean and standard deviation.
  • The Empirical Rule (68-95-99.7) describes the spread of data
  • The Central Limit Theorem states that the distribution of sample means approximates a normal distribution as the sample size grows
  • Tail heaviness is determined by a parameter of the t-distribution called degrees of freedom.
  • A t-test compares group means.
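The Empirical Rule can be checked with a quick simulation (a sketch I'm adding, using Python's `random` module with a fixed seed):

```python
import random

random.seed(42)  # make the sketch deterministic
mu, sigma, n = 0.0, 1.0, 100_000
samples = [random.gauss(mu, sigma) for _ in range(n)]

def within(k):
    """Fraction of samples within k standard deviations of the mean."""
    return sum(abs(x - mu) <= k * sigma for x in samples) / n

frac1, frac2, frac3 = within(1), within(2), within(3)
# Empirically these land near 0.68, 0.95, and 0.997
```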

Hypothesis Testing

  • Null hypothesis: assumes the means of the two groups are equal
  • Alternative hypothesis: assumes the means of the two groups are not equal
  • Non-directional hypothesis: assesses whether a difference exists between two groups
  • Directional hypothesis: assesses whether one group is higher or lower than the other group
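A t-test for comparing two group means can be sketched by computing Welch's t-statistic directly (toy data, my own; comparing it to a critical value is omitted):

```python
import statistics, math

def welch_t(a, b):
    """Welch's t-statistic for the difference between two group means."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    se = math.sqrt(va / len(a) + vb / len(b))                # standard error of the difference
    return (ma - mb) / se

t = welch_t([1, 2, 3], [2, 3, 4])
# Negative t: the first group's mean is below the second group's
```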

Data Wrangling and Quality

  • Data wrangling transforms raw data.
  • Data cleaning removes errors.
  • Data profiling reviews dataset content for issues.

Data Quality Metrics

  • Completeness measures the percentage of missing data.
  • Accuracy assesses data representation.
  • Consistency ensures data synchronization.
  • Validity checks data against defined ranges.
  • Uniqueness identifies duplicates.
  • Integrity traces data relationships.

Consequences of Unclean Data

  • Incorrect analysis arises.
  • Unreliable models arise.
  • Misleading insights arise.

Data Checks

  • Type checks ensure values have the correct data type.
  • Code checks validate values against predefined lists.
  • Range checks ensure values fall within appropriate ranges.
  • Format checks validate the correct format.
  • Consistency checks validate that data is logically consistent.
  • Uniqueness checks validate that each entry appears only once.
  • Presence checks validate that all required fields are present.
  • Length checks validate the correct number of characters.
  • Look-up checks validate fields against a set of allowed values.
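Several of the checks above can be sketched against a toy record (the field names and rules here are made-up examples, not from the lesson):

```python
import re

# Hypothetical record for illustration only
record = {"id": "A123", "age": 34, "country": "DE", "email": "user@example.com"}

checks = {
    "type": isinstance(record["age"], int),                 # type check
    "range": 0 <= record["age"] <= 120,                     # range check
    "code": record["country"] in {"DE", "FR", "US"},        # code / look-up check
    "format": re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+",
                           record["email"]) is not None,    # format check
    "presence": all(record.get(f) not in (None, "")
                    for f in ("id", "age", "country")),     # presence check
    "length": len(record["id"]) == 4,                       # length check
}

all_valid = all(checks.values())
```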

Outliers

  • Outliers distort data.
  • Methods for finding outliers include: Interquartile Range, Z-Score, K-NN, Local Outlier Factor.
  • IQR measures the spread of the middle 50%.
  • Z-Score measures how many standard deviations a value lies from the mean.
  • K-Nearest Neighbors identifies points that are distant from their neighbours.
  • Handling options: remove outliers, transform data, use robust statistics, or impute values.
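The IQR and Z-score methods above can be sketched in a few lines (toy data, my own; thresholds of 1.5 × IQR and |z| > 2 are common conventions):

```python
import statistics

data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 100]

# IQR method: flag points beyond 1.5 * IQR from the quartiles
q1, _, q3 = statistics.quantiles(data, n=4)  # quartile cut points
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [x for x in data if x < lower or x > upper]

# Z-score method (assumes roughly normal data): flag |z| > 2
mu, sd = statistics.mean(data), statistics.stdev(data)
z_outliers = [x for x in data if abs((x - mu) / sd) > 2]
```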

Missing data

  • Missing Completely at Random (MCAR): values are missing at random, with no systematic pattern
  • Missing at Random (MAR): missingness is related to other observed data
  • Missing Not at Random (MNAR): missingness is linked to the missing value itself

Removing missing data

  • Listwise removal: remove entire row
  • Pairwise removal: remove entry just for that analysis
  • Attribute removal: remove column
  • Imputation: replace missing values with another value
  • Time-based: use previous or later entries to fill the missing values
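Two of the approaches above, mean imputation and the time-based LOCF fill, can be sketched in pure Python (function names are my own):

```python
import statistics

def mean_impute(values):
    """Replace None with the mean of the observed values (mean imputation)."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

def locf(values):
    """Last Observation Carried Forward: fill gaps with the most recent prior value."""
    result, last = [], None
    for v in values:
        if v is not None:
            last = v
        result.append(last)
    return result

mean_impute([1, None, 3])       # -> [1, 2, 3]
locf([5, None, None, 7, None])  # -> [5, 5, 5, 7, 7]
```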

Evaluate missing data

  • MSE: mean squared error
  • MAE: mean absolute error
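Both error measures can be written directly (a small sketch I'm adding):

```python
def mse(actual, predicted):
    """Mean squared error: average of squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mae(actual, predicted):
    """Mean absolute error: average of absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

mse([1, 2, 3], [1, 2, 5])  # -> (0 + 0 + 4) / 3
mae([1, 2, 3], [1, 2, 5])  # -> (0 + 0 + 2) / 3
```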

Data biases

  • Selection: sampling is incomplete or non-random
  • Sampling: random sampling is not done properly
  • Convergence: data is not selected in a representative way
  • Participation: responses aren't fully representative
  • Historical: the dataset reflects past biases
  • Availability: relying only on the data that is most readily available
  • Outlier: not accounting for outliers
  • Imputation: replacing missing values using other flawed data

Types of ML

  • Supervised learning learns from data with known targets (labels)
  • Unsupervised learning learns from data without known targets
  • Semi-supervised learning uses both labelled and unlabelled data

Preparing ML models

  • Gather training data and make sure it is balanced
  • Determine the inputs to the learning function, which are turned into a feature vector
  • Algorithms create the model from this data
  • Linear regression minimizes the sum of squared errors
  • Decision trees create new decision boundaries
  • A multilayer perceptron creates decision surfaces
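Since linear regression minimizes the sum of squared errors, the single-feature case can be sketched with the closed-form least-squares solution (toy data and function name are my own):

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: fit y = m*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))  # covariance term
    sxx = sum((x - mean_x) ** 2 for x in xs)                        # variance term
    m = sxy / sxx                # slope
    b = mean_y - m * mean_x      # intercept (point of intersection with the y-axis)
    return m, b

m, b = fit_line([0, 1, 2, 3], [1.1, 2.9, 5.2, 6.8])
# m is close to 2 and b close to 1 for this near-linear toy data
```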

Evaluating models

  • K-fold cross-validation
  • A confusion matrix is a table used to evaluate the performance of a model
  • Macro average: average the result per class
  • Micro average: average over all data pooled together
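The confusion-matrix cells and the metrics built from them can be sketched on toy predictions (my own example data):

```python
# Toy binary classification results (1 = positive, 0 = negative)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positive: predicted positive, actually negative
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negative: predicted negative, actually positive

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found
```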

Terms

  • b = point of intersection with the y-axis (intercept)
  • error = epsilon (ε)
  • residuals = difference between actual and predicted output

Assumptions of Linear Regression

  • Linearity: dependent and independent variables are linearly related
  • Equal variance (homoscedasticity): residuals have constant variance
  • Independence: observations do not influence each other
  • Lack of multicollinearity: predictors are not highly correlated with one another
  • Absence of endogeneity: predictors are not correlated with the error term

Decision tree

  • Internal node: a feature (attribute)
  • Branch: a decision rule
  • Leaf node: an outcome
  • The algorithm selects the best attribute to split on

Gini Index (Categorical features)

How often an element would be incorrectly labelled if labelled randomly according to the class distribution in the node.

  • lowest value: 0 = all elements within the node belong to the same class
  • highest value: 0.5 = elements are evenly distributed across two classes
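Gini impurity, and the entropy measure that ID3 uses for comparison, can each be sketched in a few lines (function names are my own):

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: chance a random element is labelled incorrectly
    if labelled according to the node's class distribution."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy (used by ID3): information content of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

gini(["a", "a", "a"])  # -> 0.0 (pure node)
gini(["a", "b"])       # -> 0.5 (even split, two classes)
entropy(["a", "b"])    # -> 1.0 bit
```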

Analysis (trees)

  • stopping criteria prevent overfitting
  • max. depth limits the depth of the tree
  • min. samples for split = a node MUST have a minimum number of samples before splitting
  • early stopping = stopping WHEN further splits don't yield significant gain

What to scale for Machine Learning

  • Scaling can be applied in both supervised and unsupervised models
  • Scaling applies to numeric variables and values
  • Transformation = changing variables so they are better suited for models
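The two common scaling approaches can be sketched side by side (toy data is my own); this also illustrates why min-max scaling is the method most sensitive to outliers:

```python
import statistics

data = [1, 2, 3, 4, 100]  # 100 is an outlier

# Min-max scaling squashes everything into [0, 1]; the outlier
# compresses all the other values toward 0.
lo, hi = min(data), max(data)
minmax = [(x - lo) / (hi - lo) for x in data]

# Standardisation (z-scores) centres on the mean with unit variance.
mu, sd = statistics.mean(data), statistics.stdev(data)
zscores = [(x - mu) / sd for x in data]
```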

Ensemble Model Techniques

  • Bagging requires diversity in the inputs and dataset (each model trains on a different subset)
  • Boosting increases diversity by having later models focus on earlier models' errors
  • Ensembles can be built from simple base models such as linear regression, or with boosted methods
  • Ensemble learning combines multiple models to create a stronger model
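The hard-voting rule used to combine ensemble members can be sketched in one function (the example predictions are my own):

```python
from collections import Counter

def hard_vote(predictions):
    """Hard voting: pick the class with the most votes from the base models."""
    return Counter(predictions).most_common(1)[0][0]

# Three hypothetical base models predicting a class for one sample
hard_vote(["spam", "ham", "spam"])  # -> "spam"
```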
