Data Analytics Summary Lectures

Lecture 1: - Data Analytics Process o Problem Definition  Process of identifying and clearly communicating the problem that needs to be solved o Data Collection  Process of gathering and measuring data  Methods  Surveys and questionnaires  Interviews  Observations  Web scraping  Thid-party databases o Data Cleaning  Process of converting raw data into a format that is understandable and usable for further analysis  Tools and methods (SQL)  Data selection  Filtering  Aggregation  manipulation o Data Analysis o Data Interpretation o Result Communication - What is “Data Analytics”? o The process of transforming and analysing raw data to produce information that can be effectively used - Big data -> collection of data in huge volume o 3 Vs of Big data  Volume (massive size of datasets)  Velocity (Speed at data is generated)  Variety (data from different sources) - What is Data? o Recorded facts and numbers  Info -> data presented in meaningful context o Different formats for data  Structured data  Organized and formatted  Stored in databases (structured collection of data) with well-defined rows and columns o Financial records, customer info, inventory list, etc.  Unstructured data  No structure  Flexible in terms of content and format o Text, images, video, etc. o Different forms of data  Quantitative (numeric)  Sales figures, temp readings, etc.  Qualitative (contextual)  Customer feedback, employer reviews, etc. - SQL o Database query language  Interacting with relational databases (common keys links tables) o Helps to  Data retrieval (extract specific data)  Data analysis (perform various types of analysis)  Data integration (integrate data from different databases)  Data Quality Assurance (identify and correct errors in data) o S o SQL in R  R packages  Install.packages(“XXX”)  Library(XXX)   Commands  Returning data o SELECT (determines which columns to include  * -> selects al columns) o SELECT DINSTINCT (returns unique values form column) o FROM (identifies table from which to draw data)  Filtering data o WHERE (retrieves records that satisfy a certain condition) o HAVING (WHERE cannot be used with aggregate functions)  Manipulating data o INSERT (inserts new records in a table) o UPDATE (modifies existing records in a table) o DELETE (deletes existing records in a table)  Comparison operators o = (equals) o < (less) o > (greater) o != (not equal) o = (greater or equal)  Logical operators o AND (display if all conditions are true) o OR (display if any conditions are true) o NOT (display if condition is not true)  Testing for a range of values o BETWEEN (selects values within a given range)  Including begin and end values  Testing for a set of values o IN (allows to specify multiple values in a WHERE clause)  Sorting data o ORDER BY (sort results in ascending (ASC) or descending (DESC) order  Ascending is default order  Limiting data results o LIMIT (specifies the number or records to return)  Aggregate functions o MIN (returns smallest value of selected column) o MAX (returns largest value of selected column) o AVG (returns average value of selected column) o SUM (returns total sum of selected column) o COUNT (returns number of rows)  Grouping data o GROUP BY (groups rows that have common column values)  Multi-table operations o Joins (combines rows from more than one table)  LEFT JOIN (keeps all record s from left table, matching records from right table)  RIGHT JOIN (Keeps all records from right table, matching records from left table)  INNER JOIN (keep records that match both tables)  FULL OUTER JOIN (keeps all records form both tables)  NULL values o Cannot use comparison operators (no value/ ‘unknown’) o IS NULL (test for empty values) o IS NOT NULL (test for non-empty values) o IF NULL (handles null values by providing alternative) Lecture 2: - Machine Learning o Uses statistical techniques to learn from data without being explicitly programmed o 2 types  Supervised learning (requires human interventions)  Algorithms use labelled datasets to accurately classify data or predict outputs o Desired output is given  Main uses: o Classification (e.g. email or spam) o Regression (e.g. weather forecasting)  Algorithms: o Linear-, logistic regression, etc.  Terms used: o Feature = predictor = IV = X o Outcome = DV = label = Y  Unsupervised learning (no human interventions required)  Algorithms use unlabelled datasets to discover hidden patterns or similarities among observations. o No desired output is given  Main uses: o Dimensionality reduction o Clustering (e.g. customer segmentation) o Association (e.g. recommendation systems)  Algorithms: o Factor Analysis, Cluster Analysis, etc. - Factor Analysis (focus on variables) o Given a set of interrelated variables, find a smaller number of new (latent) variables, i.e., factors, that somehow describe the original set as good as possible.  Based on correlations between individual variables o When used?  Dimension reduction technique  Finding underlying factors o Spurious correlation  When two variables are correlated but do not have a causal relationship. - Types of factor analysis o Exploratory factor analysis (EFA) o Confirmatory factor analysis (CFA) o Principal component analysis (PCA)  Attempts to describe the variables in a dataset as good as possible by a few new variables that are linear combinations of the original variables  New variables are called principal components o Are uncorrelated o Number of principal components are less tan or equal to number of original variables.  Used?  As a dimension reduction technique  As first step: o Subsequent analyses are done on the principal components  Principal components (Zs)  Explain as much variance of the original variables as possible o Amount of variance measured by the eigenvalue  Larger eigenvalue, Z explains large amount of variance in the data  Important to standardize datasets  More weight to variables with higher variances  Standardization provides single unit of measurement o Mean zero and variance equal to one  Conducting PCA  Formulate the problem o Problem identification o Variable identification  Variables with high correlation are also highly correlated with same dimension(s) o Code in R  Correlation matrix  Corr_matrix 60%) o Code kaiser criterion  Eig.value

Data Analytics Summary Lectures

Document Details

Tags

Related

Summary

Full Transcript