Data Analytics Fundamentals


Questions and Answers

Distinguish between data and information within the context of data analytics.

Data represents recorded facts and figures, while information is data presented in a meaningful context, providing insights and understanding.

What are the three 'V's that characterize big data, and why are they significant?

Volume (massive size of datasets), Velocity (speed at which data is generated), and Variety (diversity of data types). They are significant because they pose unique challenges for storage, processing, and analysis.

Explain the difference between structured and unstructured data, and provide an example of each.

Structured data is organized and formatted, typically stored in databases with well-defined rows and columns (e.g., financial records). Unstructured data lacks a predefined format (e.g., text, images, video).

Describe the purpose of data cleaning in the data analytics process.

Data cleaning involves converting raw data into a format that is understandable and usable for further analysis. This includes handling missing values, correcting errors, and ensuring consistency.

How does SELECT DISTINCT differ from SELECT in SQL, and when would you use SELECT DISTINCT?

SELECT retrieves all selected rows, including duplicates, while SELECT DISTINCT returns only the unique values from the specified column(s). Use SELECT DISTINCT when you need to eliminate duplicate entries in the result set.
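
A minimal sketch in R using the sqldf package (the customers data frame and its city column are illustrative assumptions):

    library(sqldf)
    customers <- data.frame(city = c("Oslo", "Bergen", "Oslo"))
    sqldf("SELECT city FROM customers")           # three rows, duplicate included
    sqldf("SELECT DISTINCT city FROM customers")  # two rows: Oslo, Bergen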

Explain the purpose of the GROUP BY clause in SQL and provide an example of its use.

The GROUP BY clause groups rows that have the same values in specified columns into summary rows, like finding total sales per region. For example: SELECT region, SUM(sales) FROM sales_table GROUP BY region;

What is the difference between the WHERE and HAVING clauses in SQL, and when should each be used?

The WHERE clause filters rows before grouping occurs, based on individual row values. The HAVING clause filters groups after grouping, based on the aggregate values of the groups. HAVING is used only with the GROUP BY clause.
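
A hedged sketch reusing the sales_table example from above (the year column is an assumption for illustration):

    sqldf("SELECT region, SUM(sales) AS total_sales
           FROM sales_table
           WHERE year = 2024              -- row filter, applied before grouping
           GROUP BY region
           HAVING SUM(sales) > 1000")     -- group filter, applied after aggregation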

Explain the purpose of using joins in SQL and give an example of a situation where a LEFT JOIN would be appropriate?

Joins combine rows from two or more tables based on a related column. A LEFT JOIN is appropriate when you want to keep all records from the left table and include matching records from the right table, showing NULLs where no match exists. For example, joining Customers and Orders on CustomerID.
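
A minimal sketch of the Customers/Orders example in R via sqldf (the OrderID column is an assumption):

    sqldf("SELECT c.CustomerID, o.OrderID
           FROM Customers c
           LEFT JOIN Orders o ON c.CustomerID = o.CustomerID")
    # customers with no orders still appear, with NA (SQL NULL) in OrderID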

In the context of SQL, what are NULL values, and why is it important to handle them properly?

NULL values represent missing or unknown data. They are important to handle properly because comparison operators cannot be used with them directly. Use IS NULL or IS NOT NULL to test for them.
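
A one-line illustration, assuming a hypothetical Orders table with a ship_date column:

    sqldf("SELECT * FROM Orders WHERE ship_date IS NULL")   # correct NULL test
    # WHERE ship_date = NULL would match nothing: comparison operators never match NULL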

What distinguishes supervised learning from unsupervised learning in machine learning?

Supervised learning requires labeled datasets to train models for prediction or classification, whereas unsupervised learning uses unlabeled datasets to discover hidden patterns or structures within the data.

Give an example of a classification problem and a regression problem in the context of supervised learning.

Classification: classifying emails as spam or not spam. Regression: predicting the price of a house based on its features.

What is factor analysis, and when is it most appropriately used?

Factor analysis is a technique used to reduce a large number of interrelated variables into a smaller set of uncorrelated latent variables (factors). It is used to simplify data, identify underlying relationships, and reduce dimensionality.

Explain the purpose of Principal Component Analysis (PCA) and how it achieves dimensionality reduction.

PCA aims to describe the variables in a dataset as well as possible by a few new variables that are linear combinations of the original variables. It achieves dimensionality reduction by identifying uncorrelated 'principal components' that capture most of the variance in the original data.

What is the Kaiser criterion, and how is it used in the context of Principal Component Analysis (PCA)?

The Kaiser criterion is a rule of thumb for determining the number of principal components to retain in PCA. It suggests keeping components with eigenvalues greater than 1, as they explain more variance than a single original variable.

Describe the K-means clustering algorithm and the steps involved in assigning data points to clusters.

K-means clustering divides objects into k distinct, non-overlapping clusters. Steps: (1) choose initial cluster centres; (2) assign each observation to the nearest cluster centre; (3) update each centre's coordinates to the mean of the observations assigned to it so far; repeat steps 2 and 3 until some convergence criterion is met.

What are cluster centroids, and how are they used when interpreting the results of a K-means clustering analysis?

Cluster centroids represent the average values of the clustering variables across all observations in a cluster. They are used to characterize and interpret each cluster, indicating the typical values for each variable within that cluster.

In the context of supervised learning, what is meant by 'splitting' the data, and why is it important?

Splitting the data involves dividing the dataset into a training set (e.g., 80%) and a test set (e.g., 20%). It is important because the model is trained on the training set and evaluated on the unseen test data, which gives an honest estimate of how well it generalizes.
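
A minimal R sketch of an 80/20 split, assuming a data frame named df:

    set.seed(123)                                            # reproducible split
    idx   <- sample(seq_len(nrow(df)), floor(0.8 * nrow(df)))
    train <- df[idx, ]                                       # 80% used to fit the model
    test  <- df[-idx, ]                                      # 20% held out for evaluation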

Describe the concept of imputation and when it is used in data preprocessing?

Imputation is the process of replacing missing values in a dataset with estimated values. It is used when missing data could bias the analysis or prevent the use of certain modeling techniques.
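
A minimal sketch of mean imputation in R (df and its age column are assumptions; prediction-based methods are usually preferable):

    df$age[is.na(df$age)] <- mean(df$age, na.rm = TRUE)   # replace NAs with the column mean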

What is the purpose of feature selection, and describe one method for achieving it?

Feature selection reduces the number of input variables to the model by keeping only the most informative variables. One method relies on examining numeric scores, e.g., the t-statistic.

Explain what is meant by 'overfitting' in the context of model evaluation, and why is it a problem?

Overfitting occurs when a model learns the training data too well, memorizing its specific characteristics and noise rather than generalizing to the underlying phenomenon. The model performs well on the training set but poorly on unseen data, indicating poor generalization.

Flashcards

Problem Definition

Identifying and clearly communicating the problem that needs to be solved.

Data Collection

Gathering and measuring data through surveys, interviews, web scraping, etc.

Data Cleaning

Converting raw data into an understandable and usable format for further analysis.

Data Analytics

Transforming and analyzing raw data to produce effectively usable information.


Big Data

Data with Volume, Velocity, and Variety.


What is Data?

Recorded facts and figures; data becomes information when presented in a meaningful context.


Structured Data

Organized and stored in databases with well-defined rows and columns.


Unstructured Data

No structure; flexible in terms of content and format, e.g., text, images, video.


SQL

Database query language for interacting with relational databases.


Data Retrieval

Extract specific data from databases.


SELECT (SQL)

Identifies which columns to include in a query result.


FROM (SQL)

Identifies the table from which to draw data in an SQL query.


WHERE clause

Retrieves records that satisfy a certain condition.


DELETE (SQL)

Deletes existing records in a table.


ORDER BY (SQL)

Sort results in ascending or descending order.


LIMIT (SQL)

Specifies the number of records to return.


Machine Learning

Using statistical techniques to learn from data without being explicitly programmed.


Supervised Learning

Algorithms use labeled datasets to classify data or predict outputs.


Unsupervised Learning

Algorithms use unlabeled datasets to discover hidden patterns.


Factor Analysis

Find a smaller number of new (latent) variables that describe the original set of variables.


Study Notes

  • Data Analytics Process

Problem Definition

  • Identify and clearly communicate the problem that needs solving.

Data Collection

  • Gather and measure data through surveys, interviews, observations, web scraping, and third-party databases.

Data Cleaning

  • Convert raw data into an understandable format for further analysis, using tools and methods like SQL.
  • These tools enable data selection, filtering, aggregation, and manipulation.

Data Analysis

  • Data Interpretation
  • Result Communication

What is “Data Analytics”?

  • Transform and analyze raw data to produce effectively usable information.

Big Data

  • Collection of data in huge volume.
  • Volume: Massive size of datasets.
  • Velocity: Speed at which data is generated.
  • Variety: Diversity of data types and sources.

Data

  • Recorded facts and numbers; they become information when presented in a meaningful context.
  • Structured data is organized and formatted with well-defined rows and columns.
  • It can be stored in databases and includes financial records, customer information, inventory lists, etc.
  • Unstructured data has no predefined structure and is flexible in content and format, e.g., text, images, and video.

Forms of data

  • quantitative (numeric), such as sales figures or temperature readings
  • qualitative (contextual), such as customer feedback and employer reviews

SQL (Structured Query Language)

  • Database query language that interacts with relational databases using tables linked by common keys.
  • Retrieves specific data, performs various types of analysis, integrates data from different databases, and ensures data quality by identifying and correcting errors.

SQL in R

  • Install and load R packages with install.packages("XXX") and library(XXX).
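
For example, the sqldf package (one common choice, named here as an illustration) lets you run SQL directly on R data frames:

    install.packages("sqldf")   # run once
    library(sqldf)
    sqldf("SELECT cyl, AVG(mpg) FROM mtcars GROUP BY cyl")  # built-in mtcars dataset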

Commands

  • SELECT [column(s)]: Determines which columns to include in the returned result
  • FROM [table]: Identifies the table from which to draw data
  • WHERE [condition]: Retrieves records that satisfy a certain condition, used for filtering data
  • HAVING: Filters groups after GROUP BY; unlike WHERE, it can be used with aggregate functions
  • INSERT: Inserts new records into a table
  • UPDATE: Modifies existing records in a table
  • DELETE: Deletes existing records in a table

Comparison operators

  • < (less than)
  • > (greater than)
  • = (equal to)
  • != (not equal to)
  • <= (less than or equal to)
  • >= (greater than or equal to)

Logical operators

  • AND (display if ALL conditions are true)
  • OR (display if ANY conditions are true)
  • NOT (display if condition is not true)

Testing for a range of values

  • BETWEEN selects values within a given range including the begin and end values

Set Values

  • IN allows specification of multiple values in a WHERE clause

Sorting Data

  • ORDER BY sorts results in ascending (ASC) or descending (DESC) order, with ascending as the default.

Limiting Data

  • LIMIT specifies the number of records to return

Aggregate Functions

  • MIN returns the smallest value of the selected column
  • MAX returns the largest value of the selected column
  • AVG returns the average value of the selected column
  • SUM returns the total sum of the selected column
  • COUNT returns the number of rows

Grouping Data

  • GROUP BY groups rows that have common column values.

Multi Table Operations

  • Joins combine rows from more than one table
  • LEFT JOIN keeps all records from the left table and matching records from the right table.
  • RIGHT JOIN keeps all records from the right table and matching records from the left table.
  • INNER JOIN keeps only the records that have matches in both tables.
  • FULL OUTER JOIN keeps all records from both tables.

NULL Values

  • Cannot be tested with comparison operators.
  • IS NULL tests for empty values.
  • IS NOT NULL tests for non-empty values.
  • IFNULL handles NULL values by providing an alternative value.

SQL Equivalents in R (dplyr)

  • SQL: SELECT columns is select(df, column1, column2, ...) in R.
  • SQL: WHERE condition is filter(df, condition) in R.
  • SQL: ORDER BY column is arrange(df, column) in R.
  • SQL: LIMIT n is head(df, n) in R.
  • SQL: GROUP BY column is group_by(df, column) in R.
  • SQL: LEFT JOIN is left_join(df1, df2, by = "column") in R.
  • SQL: RIGHT JOIN is right_join(df1, df2, by = "column") in R.
  • SQL: INNER JOIN is inner_join(df1, df2, by = "column") in R.
  • SQL: FULL OUTER JOIN is full_join(df1, df2, by = "column") in R.
  • These equivalents come together in the pipeline sketch below.
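
A minimal dplyr pipeline tying these equivalents together, using the built-in mtcars dataset:

    library(dplyr)
    mtcars %>%
      filter(mpg > 20) %>%                # SQL: WHERE mpg > 20
      group_by(cyl) %>%                   # SQL: GROUP BY cyl
      summarise(avg_hp = mean(hp)) %>%    # SQL: SELECT cyl, AVG(hp)
      arrange(desc(avg_hp))               # SQL: ORDER BY avg_hp DESC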

Machine Learning

  • Uses statistical techniques to learn from data without being explicitly programmed.

Types of Machine Learning

  • Supervised learning requires human intervention: it uses labeled datasets to classify data (e.g., email as spam or not spam) or to predict outputs via regression (e.g., weather forecasting).
  • Examples: linear regression, logistic regression, etc.
  • Unsupervised learning requires no human intervention.
  • It uses unlabeled datasets to discover hidden patterns or similarities among observations, with no desired output given.
  • Dimensionality reduction, clustering (e.g., customer segmentation), and association (e.g., recommendation systems) are examples.

Factor Analysis (focus on variables)

  • Given a set of interrelated variables, find a smaller number of new (latent) variables, i.e., factors, that describe the original set as well as possible, based on correlations between individual variables.
  • A dimension-reduction technique for finding underlying factors.

Spurious correlation

  • When two variables are correlated but do not have a causal relationship.

Types of Factor Analysis

  • Exploratory factor analysis (EFA)
  • Confirmatory factor analysis (CFA)
  • Principal component analysis (PCA), which attempts to describe the variables in a dataset as well as possible by a few new variables that are linear combinations of the original variables

Principal components

  • Used as a dimension-reduction technique or as a first step on which to base subsequent analyses.
  • Aims to explain as much variance of the original variables as possible; the amount of variance explained is measured by the eigenvalue.
  • The components are uncorrelated, and their number is at most the number of original variables.
  • Feature = predictor = IV = X
  • Outcome = DV = label = Y

PCA

  • The larger a component's eigenvalue, the more variance it explains.
  • Standardize the variables first: otherwise PCA gives more weight to variables with higher variances. Standardization puts all variables on a single unit of measurement, with a mean of zero and a variance equal to one.

Steps to conducting PCA

  • Formulate the problem: identify the objective and the variables to analyze.
  • Variables that correlate highly with each other tend to load on the same dimension(s).

R

The following code is to be used in R:

  • Correlation matrix: corr_matrix <- round(cor(dataset), digits = 2)
  • Principal component analysis (from the FactoMineR package): res.pca <- PCA(dataset, graph = FALSE)

Determine the number of Dimensions

  • Approaches: a priori determination, the Kaiser criterion, the scree plot, and the proportion of variance explained.
  • Kaiser criterion: eig.value <- get_eigenvalue(res.pca), then eig.value <- round(eig.value, digits = x)
  • Scree plot: fviz_eig(res.pca, addlabels = TRUE, ylim = c(0, X))
  • Retain variables by examining each variable's contribution to a given dimension; contributions above the average mark the variable as important for that dimension (see the sketch below).
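
A minimal end-to-end sketch, assuming the FactoMineR and factoextra packages (which provide the functions named above) and a numeric data frame called dataset:

    library(FactoMineR)                                  # provides PCA()
    library(factoextra)                                  # provides get_eigenvalue(), fviz_eig(), fviz_contrib()
    res.pca   <- PCA(dataset, scale.unit = TRUE, graph = FALSE)  # standardize, then extract components
    eig.value <- round(get_eigenvalue(res.pca), digits = 2)      # Kaiser criterion: keep eigenvalues > 1
    fviz_eig(res.pca, addlabels = TRUE)                  # scree plot
    fviz_contrib(res.pca, choice = "var", axes = 1)      # variable contributions to dimension 1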

Cluster Analysis

  • Groups a collection of objects into subgroups (clusters) so that objects are more similar to other members of their cluster than to members of other clusters, based on similarity between individual objects.
  • An explorative technique, used for market segmentation, fraud detection, and data reduction.

Types of clustering methods

  • K-means clustering divides objects into k distinct, non-overlapping clusters: choose initial cluster centres, assign each observation to the nearest centre, then update each centre's coordinates to the mean of the observations assigned to it.
  • The assignment and update steps are repeated until some convergence criterion is met.

Steps of conducting K-means

  • Formulate the problem and select a distance measure, such as Euclidean distance (Pythagoras).

Decide on a number of clusters

  • The elbow method or the average silhouette method can be used (see the sketch below).
  • The NbClust package provides 30 indices for determining the optimal number of clusters.
  • Cluster visualization uses PCA to reduce the data to two dimensions.
  • Interpret the clusters: always inspect the cluster centroids (the average values of the clustering variables over all observations in a cluster); clusters can then be used in subsequent analyses.
  • Before clustering, standardize the variables so they share a single unit of measurement.
  • The reliability and validity of k-means results depend on the initial centroids, data scaling, and outliers.
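
A minimal K-means sketch in R, assuming a numeric data frame called df and the factoextra package:

    df_std <- scale(df)                             # standardize to a single unit of measurement
    library(factoextra)
    fviz_nbclust(df_std, kmeans, method = "wss")    # elbow method for choosing k
    set.seed(123)                                   # results depend on the initial centroids
    km <- kmeans(df_std, centers = 3, nstart = 25)  # k = 3 is an illustrative choice
    km$centers                                      # cluster centroids, for interpretation
    fviz_cluster(km, data = df_std)                 # PCA-based two-dimensional visualization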

Supervised Learning Workflow

  • Explore the data: check its structure, check for missing values, and detect outliers

Data Preprocessing

  • Splitting: divide the data into a training set and a test set (e.g., 80% training, 20% test) and evaluate models on the test data.
  • Missing values: imputation (prediction using other variables) replaces missing values with predicted ones, using k-nearest neighbours, tree-based models, or linear models.
  • Rule of thumb: when a variable has more than 20% missing data, consider dropping it rather than imputing.

Outliers

  • Detect outliers using simple functions (summary(), hist(), boxplot()) or convenient packages (e.g., outliers); handle them with winsorization, truncation, or imputation

Normalization

  • Transform and normalize via logarithms, scaling, and centering

Feature creation and selection

  • Create new features by calculating ratios and creating dummy variables.
  • Select features using simple filters based on a numeric score (e.g., the t-statistic), recursive feature elimination, or stepwise selection

Train the data using algorithms

  • Linear regression
  • Logistic regression

Naive Bayes

  • Classification technique based on conditional probabilities: Bayes' theorem gives P(Y|X) = P(X|Y) · P(Y) / P(X).
  • Pros: easy to build for large datasets and performs well with high-dimensional data. Cons: the assumption that features are independent fails in most cases (see the sketch below).
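
A minimal sketch with the e1071 package (one common naive Bayes implementation) on the built-in iris dataset:

    library(e1071)
    model <- naiveBayes(Species ~ ., data = iris)   # P(class | features) via Bayes' theorem
    predict(model, head(iris))                      # predicted classes for the first rows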

K-Nearest Neighbour

  • Predicts new cases based on a similarity measure (distance function).
  • For classification, the prediction is the majority class among the neighbours; for regression, it is their mean or median. K is often chosen as √N.
  1. Look at the data
  2. Calculate distances
  3. Find the k nearest neighbours
  4. Vote on the outcome (for classification)
  5. Use Euclidean distance for continuous features; for categorical features, use a simple mismatch distance such as |Xi − X'i|
  • Advantages: simple and makes no assumptions about the data.
  • Disadvantages: computationally expensive and potentially sensitive to irrelevant features (see the sketch below).
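
A minimal sketch with knn() from the class package, using the built-in iris dataset:

    library(class)
    set.seed(123)
    idx  <- sample(seq_len(nrow(iris)), floor(0.8 * nrow(iris)))
    pred <- knn(train = iris[idx, 1:4], test = iris[-idx, 1:4],
                cl = iris$Species[idx], k = round(sqrt(length(idx))))  # k ≈ √N
    table(Predicted = pred, Actual = iris$Species[-idx])               # test-set confusion matrix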

Decision Tree

  • Splits the data by asking yes/no questions at each node.
  • Branches (subtrees) lead to leaf nodes, which determine decisions.

Leaf Node

  • The node where decisions are taken.

Outcome

  • Be aware of overfitting

Random Forest

  • Builds upon standard decision trees with three features: random sampling with replacement, random selection of the feature set, and voting to aggregate results.
  • Disadvantage: lack of interpretability.

Gradient Boosting

  • Builds tree after tree to improve on the deficiencies of the previous trees.
  • Powerful and efficient.
  • Disadvantage: loss of interpretability.

Model Evaluation on test data

  • Mean absolute percentage error, accuracy, confusion matrices, and ROC plots.
  • Watch for overfitting: the model 'memorizes' the training set rather than 'understanding' the underlying phenomenon.

Two types of model evaluation

  • Domain-based: assess whether the model makes sense, using expert assessment and knowledge of the data-generating process.
  • Data-based: accuracy of the model fitted on the training data (in-sample accuracy).

Model Evaluation

  • Root mean squared error (RMSE) is sensitive to outliers; mean absolute error (MAE) is less sensitive to outliers; mean absolute percentage error (MAPE) is undefined when actual values are zero (see the sketch below).
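
The three error measures in R, assuming numeric vectors actual and predicted of equal length:

    rmse <- sqrt(mean((actual - predicted)^2))               # penalizes large errors heavily
    mae  <- mean(abs(actual - predicted))                    # less sensitive to outliers
    mape <- mean(abs((actual - predicted) / actual)) * 100   # breaks down when actual == 0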

Evaluating Results

  • Accuracy = correct decisions / total number of decisions.
  • Error rate = 1 − accuracy.
  • False positives and false negatives correspond to Type I and Type II errors.

Sensitivity and Specificity

  • Sensitivity (true positive rate) = TP / (TP + FN).
  • Specificity (true negative rate) = TN / (TN + FP).
  • False alarm rate = FP / (FP + TN) = 1 − specificity (see the sketch below).
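
A minimal R sketch, assuming factor vectors pred and actual with levels "yes" (positive) and "no":

    cm <- table(Predicted = pred, Actual = actual)
    accuracy    <- sum(diag(cm)) / sum(cm)               # correct / total decisions
    sensitivity <- cm["yes", "yes"] / sum(cm[, "yes"])   # TP / (TP + FN)
    specificity <- cm["no",  "no"]  / sum(cm[, "no"])    # TN / (TN + FP)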

Expected Value

  • The expected value weights each outcome's value by its probability: EV = Σ p(x, y) · v(x, y), where p(x, y) = p(y) · p(x|y).

Visualize Model Performance: ROC Graphs

  • Receiver operating characteristic (ROC) graphs are two-dimensional plots with the false positive rate on the x axis and the true positive rate on the y axis.
  • The diagonal corresponds to guesswork; the lower-left region is conservative, the upper-right permissive.
  • A classifier that outputs a score traces a curve as the decision threshold is varied.
  • The area under the curve (AUC) is a useful single-number summary of performance (see the sketch below).
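
A minimal ROC sketch with the pROC package (one common choice), assuming a vector of true labels actual and model scores scores:

    library(pROC)
    roc_obj <- roc(response = actual, predictor = scores)   # build the ROC object
    plot(roc_obj)                                           # ROC curve: TPR vs. FPR
    auc(roc_obj)                                            # single-number summary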
