L4 Data Analysis (PDF)
Eulogio 'Amang' Rodriguez Institute of Science and Technology
Summary
This document provides an overview of data analysis techniques, including preliminary concepts, statistical analysis, characteristics of samples, and the use of libraries like pandas. It's designed to be a helpful resource for understanding and working with data.
Full Transcript
Analyzing Data Preliminaries
▪ Data is changed from its raw format into information after it has been gathered, prepared, analyzed, and presented in a usable format.
▪ Exploratory data analysis is a set of procedures designed to produce descriptive and graphical summaries of data with the notion that the results may reveal interesting patterns.

Analyzing Data Preliminaries cont…
▪ IoT Concerns
  IoT data may come in large volume and in different forms.
  IoT data may require more advanced analytic tools for structured and unstructured data.
  IoT data is frequently streaming in real time or nearly real time.
▪ Observations, Variables, and Values
  A variable is anything that varies from one instance to another and is something that can be measured, manipulated or controlled.
  The recordings of the values, patterns and occurrences for a set of variables is an observation.
  The set of values for a specific observation is called a data point.

Analyzing Data Preliminaries cont…
▪ Categorical variables include:
  Nominal – Two or more categories or names that identify the object
  Ordinal – Two or more categories in which the order of the values matters
▪ Numerical variables include:
  Continuous – Quantitative along a continuum or range of values
  Ratio – Interval variables where zero (0) means none
  Discrete – Quantitative with a specific value from a finite set of values

Analyzing Data Statistical Analysis
▪ Statistics is the collection and analysis of data using mathematical techniques.
▪ Sample and Population
  A population is a group of similar entities such as people, objects, or events that share some common set of characteristics.
  A sample is a representative group from the population.

Analyzing Data Statistical Analysis cont…
▪ Descriptive statistics describe or summarize the values and observations of a data set.
▪ Inferential statistics is the process of collecting, analyzing and interpreting data gathered from a sample to make generalizations or predictions about a population.

Analyzing Data Characteristics of Samples
▪ Distribution
  A variable and its frequency or probability
▪ Centrality
  The mean, median, and mode
▪ Dispersion
  The variability in the distribution

Analyzing Data Analysis Using Descriptive Statistics
▪ Pandas
  Open source library for Python that adds high-performance data structures and tools for analysis of large data sets
  Import data from files
  Import data from web
  Descriptive statistics in pandas

Analyzing Data Analysis Using Correlation
▪ “Correlation does not imply causation”
  Causation is a relationship in which one thing changes, or is created, directly because of something else.
  Correlation is a relationship between phenomena in which two or more things change at a similar rate.
  Correlations can be positive or negative.

Analyzing Data Analysis Using Correlation cont…
▪ Correlations can be calculated for multiple variables simultaneously.
▪ A heat map shows how the values for correlation coefficients relate to one another.

Preparation for Chapter 3 Internet Meter Lab Basic Analysis with pandas
▪ More often than not, the data sets that you work with will have incompatibilities.
▪ Cleaning data can involve removing missing or unwanted values, or altering the format of the values to make them consistent.
▪ NaN (Not a Number) values are used to represent data that is undefined or cannot be represented.
  pandas refers to missing data as NaN values.
  NaTs are used for timestamps.
▪ Pandas has many built-in functions for:
  converting the datatypes
  manipulating data frames
  running statistical analysis on data sets
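
The cleaning, descriptive statistics, and correlation steps described in these slides can be sketched in pandas roughly as follows. This is only an illustration under stated assumptions: the column names (ping_ms, download_mbps, upload_mbps) and every value are hypothetical and are not taken from the Internet Meter Lab. Dropping incomplete rows is just one possible cleaning choice.

```python
import numpy as np
import pandas as pd

# Hypothetical internet speed measurements. The column names and values are
# made up for illustration; they do not come from the lab data.
raw = pd.DataFrame(
    {
        "ping_ms": ["20", "25", None, "30", "22"],       # numbers stored as text
        "download_mbps": [90.0, 85.0, 88.0, np.nan, 92.0],
        "upload_mbps": [10.0, 9.0, 9.5, 8.0, 10.5],
    }
)

# Converting datatypes: turn the text column into numbers.
# Entries that cannot be parsed (or are missing) become NaN.
raw["ping_ms"] = pd.to_numeric(raw["ping_ms"], errors="coerce")

# Cleaning: drop every observation (row) that contains a missing value.
clean = raw.dropna()

# Descriptive statistics: count, mean, std, min, quartiles, and max per column.
summary = clean.describe()

# Centrality and dispersion for a single variable.
mean_download = clean["download_mbps"].mean()
median_download = clean["download_mbps"].median()
std_download = clean["download_mbps"].std()

# Correlation coefficients for all pairs of numeric variables at once.
correlations = clean.corr()

print(summary)
print(correlations)
```

The table returned by corr() is the kind of matrix that a heat map would visualize. A correlation coefficient computed from so few rows says little on its own, and in any case it does not establish causation.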