Summary

This document summarizes lectures on data analytics, covering topics such as SQL, big data, machine learning, factor analysis, and cluster analysis. It also discusses the supervised learning workflow and includes information on data collection, data cleaning, and data analysis.

Full Transcript

Lecture 1: - Data Analytics Process o Problem Definition  Process of identifying and clearly communicating the problem that needs to be solved o Data Collection  Process of gathering and measuring data  Methods...

Lecture 1: - Data Analytics Process o Problem Definition  Process of identifying and clearly communicating the problem that needs to be solved o Data Collection  Process of gathering and measuring data  Methods  Surveys and questionnaires  Interviews  Observations  Web scraping  Thid-party databases o Data Cleaning  Process of converting raw data into a format that is understandable and usable for further analysis  Tools and methods (SQL)  Data selection  Filtering  Aggregation  manipulation o Data Analysis o Data Interpretation o Result Communication - What is “Data Analytics”? o The process of transforming and analysing raw data to produce information that can be effectively used - Big data -> collection of data in huge volume o 3 Vs of Big data  Volume (massive size of datasets)  Velocity (Speed at data is generated)  Variety (data from different sources) - What is Data? o Recorded facts and numbers  Info -> data presented in meaningful context o Different formats for data  Structured data  Organized and formatted  Stored in databases (structured collection of data) with well-defined rows and columns o Financial records, customer info, inventory list, etc.  Unstructured data  No structure  Flexible in terms of content and format o Text, images, video, etc. o Different forms of data  Quantitative (numeric)  Sales figures, temp readings, etc.  Qualitative (contextual)  Customer feedback, employer reviews, etc. - SQL o Database query language  Interacting with relational databases (common keys links tables) o Helps to  Data retrieval (extract specific data)  Data analysis (perform various types of analysis)  Data integration (integrate data from different databases)  Data Quality Assurance (identify and correct errors in data) o S o SQL in R  R packages  Install.packages(“XXX”)  Library(XXX)   Commands  Returning data o SELECT (determines which columns to include  * -> selects al columns) o SELECT DINSTINCT (returns unique values form column) o FROM (identifies table from which to draw data)  Filtering data o WHERE (retrieves records that satisfy a certain condition) o HAVING (WHERE cannot be used with aggregate functions)  Manipulating data o INSERT (inserts new records in a table) o UPDATE (modifies existing records in a table) o DELETE (deletes existing records in a table)  Comparison operators o = (equals) o < (less) o > (greater) o != (not equal) o = (greater or equal)  Logical operators o AND (display if all conditions are true) o OR (display if any conditions are true) o NOT (display if condition is not true)  Testing for a range of values o BETWEEN (selects values within a given range)  Including begin and end values  Testing for a set of values o IN (allows to specify multiple values in a WHERE clause)  Sorting data o ORDER BY (sort results in ascending (ASC) or descending (DESC) order  Ascending is default order  Limiting data results o LIMIT (specifies the number or records to return)  Aggregate functions o MIN (returns smallest value of selected column) o MAX (returns largest value of selected column) o AVG (returns average value of selected column) o SUM (returns total sum of selected column) o COUNT (returns number of rows)  Grouping data o GROUP BY (groups rows that have common column values)  Multi-table operations o Joins (combines rows from more than one table)  LEFT JOIN (keeps all record s from left table, matching records from right table)  RIGHT JOIN (Keeps all records from right table, matching records from left table)  INNER JOIN (keep records that match both tables)  FULL OUTER JOIN (keeps all records form both tables)  NULL values o Cannot use comparison operators (no value/ ‘unknown’) o IS NULL (test for empty values) o IS NOT NULL (test for non-empty values) o IF NULL (handles null values by providing alternative) Lecture 2: - Machine Learning o Uses statistical techniques to learn from data without being explicitly programmed o 2 types  Supervised learning (requires human interventions)  Algorithms use labelled datasets to accurately classify data or predict outputs o Desired output is given  Main uses: o Classification (e.g. email or spam) o Regression (e.g. weather forecasting)  Algorithms: o Linear-, logistic regression, etc.  Terms used: o Feature = predictor = IV = X o Outcome = DV = label = Y  Unsupervised learning (no human interventions required)  Algorithms use unlabelled datasets to discover hidden patterns or similarities among observations. o No desired output is given  Main uses: o Dimensionality reduction o Clustering (e.g. customer segmentation) o Association (e.g. recommendation systems)  Algorithms: o Factor Analysis, Cluster Analysis, etc. - Factor Analysis (focus on variables) o Given a set of interrelated variables, find a smaller number of new (latent) variables, i.e., factors, that somehow describe the original set as good as possible.  Based on correlations between individual variables o When used?  Dimension reduction technique  Finding underlying factors o Spurious correlation  When two variables are correlated but do not have a causal relationship. - Types of factor analysis o Exploratory factor analysis (EFA) o Confirmatory factor analysis (CFA) o Principal component analysis (PCA)  Attempts to describe the variables in a dataset as good as possible by a few new variables that are linear combinations of the original variables  New variables are called principal components o Are uncorrelated o Number of principal components are less tan or equal to number of original variables.  Used?  As a dimension reduction technique  As first step: o Subsequent analyses are done on the principal components  Principal components (Zs)  Explain as much variance of the original variables as possible o Amount of variance measured by the eigenvalue  Larger eigenvalue, Z explains large amount of variance in the data  Important to standardize datasets  More weight to variables with higher variances  Standardization provides single unit of measurement o Mean zero and variance equal to one  Conducting PCA  Formulate the problem o Problem identification o Variable identification  Variables with high correlation are also highly correlated with same dimension(s) o Code in R  Correlation matrix  Corr_matrix 60%) o Code kaiser criterion  Eig.value