Introduction to Data Science Unit 1

34 Questions

What are the different types of data in data science?

Semi-Structured Data

Define Data Science.

Data Science is a multi-disciplinary science that aims to perform data analysis to generate knowledge for decision making.

Structured data in data science can be associated with a schema.

True

Semi-structured data has some structure due to the use of ______ or key/value pairs.

tags

Match the types of data with their characteristics:

  • Structured Data = Associated with a schema
  • Semi-Structured Data = Contains tags or key/value pairs
  • Unstructured Data = Does not follow any schema definition
  • Data Streams = Characterized by a sequence of data over time

What are the two distinct types of data that can be used in statistical analysis?

Categorical data and Quantitative data

Which of the following defines the categories of 'categorical data'?

Occupation

What type of data would define age categories as '0 or more but less than 26', '26 or more but less than 46'?

Ordinal

Quantitative data can be used to define different __________ of data.

scale

Match the measurement scale with its characteristics and examples:

  • Nominal = Yes IDV, No M, No EI, No MZV
  • Ordinal = Yes IDV, For rank M, No EI, No MZV
  • Interval = Yes IDV, Yes M, Yes EI, No MZV
  • Ratio = Yes IDV, Yes M, Yes EI, Yes MZV

What is the purpose of sampling in data science?

To enhance the speed of exploratory data analysis and develop exploratory models.

What does the Central Limit Theorem state?

With the increase in sample size, the sampling distribution of the mean approaches closer to a normal distribution.

Does the Central Limit Theorem impose constraints on the distribution of the population?

No

What is the Equation 15 a result of?

Central Limit Theorem

What is needed for the Central Limit Theorem to be applicable?

All of the above

What is the purpose of hypothesis testing?

To make decisions or inferences about a population based on sample data.

What is the equation of a single linear regression line?

ypredicted = a + bx

What is the purpose of the method of least squares in finding the regression line?

Minimizing the sum of squares of residuals

The value of r squared (r^2) represents the predictive power of the regression model.

True

The term 'Multiple R' in Regression Statistics defines the correlation between the dependent variable (y) with the set of ______________ variables in the regression model.

independent or explanatory

What is the sample mean of the height of students of class 12?

166

What is the Confidence Interval for the average height of class 12th students with 95% confidence level?

163.8 to 168.2

What is the formula used to compute the t-value in the context of sampling distribution?

t = (x̅ - μ) / (s / √n)

Correlation coefficient can have a value beyond the range of -1 to 1.

False

What does a positive correlation value indicate in correlation coefficient?

Value of y increases with increase in value of x and decreases with decrease in x.

What is the confidence interval for the population proportions of students who favor increasing practical sessions, considering a confidence level of 90%, 95%, and 99%?

For 90%: 0.4475 to 0.6125, For 95%: 0.432 to 0.628, For 99%: 0.401 to 0.659

Calculate the estimated weight of the student population given the weights of 20 students (in kilograms) as follows: 65, 75, 55, 60, 50, 59, 62, 70, 61, 57, 62, 71, 63, 69, 55, 51, 56, 67, 68, 60.

Mean = 61.8 kg, Standard Error of the mean = 1.52 kg (sample standard deviation ≈ 6.79 kg)
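
The figures for this exercise can be checked directly from the 20 weights given in the question. The following is a minimal sketch using only Python's standard library; the variable names are illustrative. Note that the sample standard deviation and the standard error of the mean are different quantities.

```python
import math
import statistics

# Weights (kg) of the 20 students, copied from the question.
weights = [65, 75, 55, 60, 50, 59, 62, 70, 61, 57,
           62, 71, 63, 69, 55, 51, 56, 67, 68, 60]

n = len(weights)
mean = statistics.mean(weights)   # point estimate of the population mean
sd = statistics.stdev(weights)    # sample standard deviation (n - 1 denominator)
se = sd / math.sqrt(n)            # standard error of the mean

print(f"n = {n}, mean = {mean:.1f} kg, sd = {sd:.2f} kg, se = {se:.2f} kg")
```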

With a confidence level of 95% (significance level of 0.05), can you conclude if the training course was useful for the class of 10 students based on their marks before and after the course? Explain the hypothesis and analysis.

You would need to conduct a paired sample t-test to determine if there is a significant difference in the mean test results before and after the training course.
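
A paired t-test of this kind might be sketched as follows. The marks below are hypothetical (the original marks are not reproduced on this page), and the critical value 1.833 is the standard one-sided t-table value for 9 degrees of freedom at the 0.05 level. The null hypothesis is that the mean difference (after minus before) is zero; the alternative is that it is greater than zero.

```python
import math
import statistics

# Hypothetical marks for 10 students; not the data from the original exercise.
before = [55, 60, 48, 70, 65, 52, 58, 63, 50, 67]
after = [60, 64, 50, 75, 66, 58, 61, 70, 55, 69]

# Work on the per-student differences, since the samples are paired.
diffs = [a - b for a, b in zip(after, before)]
n = len(diffs)
mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)

# t = (mean difference - 0) / (s_d / sqrt(n)), with n - 1 degrees of freedom.
t_stat = mean_diff / (sd_diff / math.sqrt(n))

T_CRITICAL = 1.833  # one-sided, alpha = 0.05, df = 9 (from a t-table)
reject_null = t_stat > T_CRITICAL

print(f"mean difference = {mean_diff:.2f}, t = {t_stat:.3f}, reject H0: {reject_null}")
```

With these made-up marks the test statistic exceeds the critical value, so the null hypothesis of no improvement would be rejected; real data may of course lead to a different conclusion.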

What is the mean of the given data set?

13.82

What is the median of the given data set?

14

How do outliers impact the mean and median?

Outliers can shift the mean substantially, while the median is largely unaffected.

What does the standard normal distribution have the mean (μ) and standard deviation (σ) set as?

Mean (μ) as zero and standard deviation (σ) as 1

Match the following probability distributions with their respective names:

  • Poisson distribution = Discrete probability distribution
  • Uniform Distribution = Equal likelihood of all outcomes
  • Chi-square distribution = Used in hypothesis testing

What is the main purpose of sampling distribution?

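To show how a sample statistic (such as the mean) is distributed across random samples from the population, so that the probability of obtaining a particular sample result can be assessed.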

Study Notes

Introduction to Data Science

  • Data science is a multi-disciplinary science that aims to generate knowledge from data for decision making.
  • Data science involves collecting data from multiple sources, cleaning, integrating, and processing it to produce useful information.

Definition of Data Science

  • Data science is a way to extract knowledge from data to support decision making.
  • It involves processing data to generate patterns, models, and insights that can be used for decision making.

Types of Data

  • Structured Data: follows a specific schema, can be associated with a schema, and is typically stored in relational databases. Examples: customer data, account data, transaction data.
  • Semi-Structured Data: has some structure, but not a fixed schema, and is often stored in XML, JSON, or other formats. Examples: XML data, JSON objects, server logs.
  • Unstructured Data: does not follow a specific schema, and can be in the form of text, images, audio, or video. Examples: social media data, email data, image data.
  • Data Streams: a sequence of data generated over time, can be structured, semi-structured, or unstructured, and is often processed in real-time. Examples: IoT sensor data, social media feeds.

Statistical Data Types

  • Categorical Data: used to define categories, can be nominal (categories have no inherent order) or ordinal (categories have a natural order).
  • Quantitative Data: numeric data, can be discrete (distinct numbers) or continuous (continuous values).

Measurement Scales

  • Nominal Scale: categorical data with no relationship between categories, examples: gender, occupation.
  • Ordinal Scale: categorical data whose categories have a meaningful order, examples: age categories, income categories.
  • Interval Scale: quantitative data with equal intervals between values, but no absolute zero, examples: IQ, temperature in Celsius.
  • Ratio Scale: quantitative data with equal intervals between values and an absolute zero, examples: temperature in Kelvin, age.

Sampling

  • Population: the entire set of data being studied.
  • Sample: a representative subset of the population.
  • Statistic: a value computed from the sample data.
  • Parameter: a value that describes the population; it is usually unknown and is estimated from a sample statistic.

Basic Methods of Data Analysis

  • Descriptive Analysis: summary statistics and data visualization to understand the data.
  • Exploratory Analysis: identifying patterns and relationships in the data.
  • Inferential Analysis: using sample data to make inferences about the population.
  • Predictive Analysis: using data to predict future outcomes or trends.

Data Analysis Methods

  • Three basic methods used for analyzing data:
    • Descriptive analysis
    • Exploratory data analysis
    • Inferential data analysis

Descriptive Analysis

  • Used to present basic summaries about data without interpreting it
  • Includes different statistical values and graphs
  • Different types of data are described in different ways
  • Examples of descriptive analysis:
    • Frequency table of various categories for categorical data
    • Measures of central tendency (mean, median) and spread (range, interquartile range) for quantitative data

Categorical Data

  • Gender is a categorical variable
  • Summary of categorical data is presented in a frequency table
  • Frequency table includes:
    • Frequency of each category
    • Proportion of each category
    • Percentage of each category
  • Graphs for categorical data:
    • Bar chart or pie chart
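
A frequency table of this kind can be built in a few lines. This is a minimal sketch; the gender values below are hypothetical and only illustrate the frequency, proportion, and percentage columns.

```python
from collections import Counter

# Hypothetical categorical observations (gender of 8 students).
genders = ["F", "M", "F", "F", "M", "F", "M", "F"]

counts = Counter(genders)
total = len(genders)

# One row per category: frequency, proportion, percentage.
table = {
    category: {
        "frequency": freq,
        "proportion": freq / total,
        "percentage": 100 * freq / total,
    }
    for category, freq in counts.items()
}

for category, row in sorted(table.items()):
    print(f"{category}: {row['frequency']:>2}  {row['proportion']:.3f}  {row['percentage']:.1f}%")
```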

Quantitative Data

  • Height is a quantitative variable
  • Descriptive statistics for quantitative data:
    • Measures of central tendency (mean, median)
    • Measures of spread (range, interquartile range)
  • Mean and median are two basic measures of central tendency
  • Mean is sensitive to outliers, while median is not
  • Mode is another measure of central tendency, but it's not commonly used
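
The different sensitivity of the mean and the median to outliers is easy to see on a small example. The values below are hypothetical.

```python
import statistics

# Hypothetical data set, then the same data with one extreme value added.
values = [12, 13, 14, 15, 16]
with_outlier = values + [90]

mean_before = statistics.mean(values)
median_before = statistics.median(values)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(f"without outlier: mean = {mean_before}, median = {median_before}")
print(f"with outlier:    mean = {mean_after:.2f}, median = {median_after}")
```

Here the single outlier moves the mean from 14 to about 26.67, while the median only moves from 14 to 14.5.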

Sampling Distribution and Central Limit Theorem

  • Sampling distribution is a probability distribution of means of random samples from a population
  • Sampling distribution is used to determine if sample statistics are close to population parameters
  • Central Limit Theorem states that as the sample size increases, the sampling distribution of the mean approaches a normal distribution, whatever the shape of the population distribution
  • Conditions for Central Limit Theorem:
    • Independent random samples
    • Sufficiently large sample size (and, when sampling without replacement, less than 10% of the population)
  • The sampling distribution of the mean is approximately normal with mean = population mean and standard deviation = population standard deviation / sqrt(sample size)
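
The Central Limit Theorem can be illustrated with a small simulation. This sketch assumes an exponential population with rate 1 (mean 1, standard deviation 1), chosen only because it is clearly non-normal; the sample size, number of samples, and seed are arbitrary choices.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

POPULATION_MEAN = 1.0   # exponential distribution with rate 1
POPULATION_SD = 1.0
SAMPLE_SIZE = 50
NUM_SAMPLES = 5000

# Draw many independent samples and record the mean of each one.
sample_means = [
    statistics.mean([random.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(NUM_SAMPLES)
]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)
expected_sd = POPULATION_SD / math.sqrt(SAMPLE_SIZE)

print(f"mean of sample means: {mean_of_means:.3f} (population mean {POPULATION_MEAN})")
print(f"sd of sample means:   {sd_of_means:.3f} (sigma / sqrt(n) = {expected_sd:.3f})")
```

The mean of the sample means should land close to the population mean, and their spread close to the population standard deviation divided by sqrt(n), even though the population itself is strongly skewed.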

Standard Normal Distribution

  • Standard normal distribution is a standardized form of normal distribution
  • Mean of standard normal distribution is 0, and standard deviation is 1
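  • The z-score of a value x from a normal distribution with mean μ and standard deviation σ is z = (x - μ) / σ; for a sample mean x̅ it is z = (x̅ - μ) / (σ / √n)
  • About 95% of the area under the standard normal curve lies between -1.96 and 1.96 (roughly -2 and 2)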
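
The standard normal distribution is available in Python's standard library (3.8 or later) as `statistics.NormalDist`. The sketch below uses the usual definition z = (x - μ) / σ and checks the area under the curve; the height figures passed to `z_score` are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def z_score(x, mu, sigma):
    """Standardize x: how many standard deviations it lies from the mean."""
    return (x - mu) / sigma

# Hypothetical example: a height of 180 cm when mu = 166 cm and sigma = 7 cm.
z = z_score(180, 166, 7)

# Area under the standard normal curve between -k and +k.
area_within_2 = std_normal.cdf(2) - std_normal.cdf(-2)
area_within_196 = std_normal.cdf(1.96) - std_normal.cdf(-1.96)

print(f"z = {z:.1f}")
print(f"area within +/-2:    {area_within_2:.4f}")
print(f"area within +/-1.96: {area_within_196:.4f}")
```

The area within two standard deviations is about 0.954, and the area within 1.96 standard deviations is about 0.950, which is why 1.96 is the value used for 95% intervals.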

Inferential Data Analysis

  • Used to make inferences about a population based on a sample
  • Includes hypothesis testing and confidence intervals

Estimation of Parameters of the Population

  • A point estimate is a single value that is used to estimate a population parameter.
  • A good point estimate should be unbiased and have a small standard error.
  • An example of a point estimate is the proportion of students who play some sport (40.5%).
  • A confidence interval is a range of values within which the population parameter is likely to lie.
  • The confidence interval is calculated using the sample proportion and the standard error.
  • The probability that the confidence interval contains the population parameter is called the confidence level (e.g. 95%).
  • The chosen confidence level determines the z-value used in the interval, that is, the number of standard errors on either side of the estimate (e.g. about 1.96 for 95%).
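
A confidence interval for a proportion might be computed as in the sketch below. The sample proportion 0.405 is the example from the notes; the sample size of 200 is a hypothetical value, since the notes do not state it, and `proportion_ci` is an illustrative helper name.

```python
import math
from statistics import NormalDist

def proportion_ci(p_hat, n, confidence=0.95):
    """Normal-approximation confidence interval for a population proportion."""
    # z-value that leaves (1 - confidence) / 2 in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Standard error of the sample proportion.
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

p_hat = 0.405  # proportion of students who play some sport (from the notes)
n = 200        # hypothetical sample size, assumed for illustration

for level in (0.90, 0.95, 0.99):
    low, high = proportion_ci(p_hat, n, level)
    print(f"{level:.0%} confidence interval: {low:.3f} to {high:.3f}")
```

A higher confidence level gives a larger z-value and therefore a wider interval.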

Confidence Interval to Estimate Mean

  • The confidence interval can be used to estimate the mean of a population.
  • The standard error in the estimated mean is calculated using the sample standard deviation.
  • The formula for the confidence interval is: (sample mean - z * standard error) to (sample mean + z * standard error).
  • For example, a confidence interval computed from a sample of students' heights estimates the average height of the whole student population.
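
The interval formula above can be sketched as a small function. As an illustration it is applied to the 20 student weights listed in the quiz section of this page; `mean_ci` is an illustrative name. The sketch uses the z-value as in the notes, although for a sample this small a t-value is usually preferred and would give a slightly wider interval.

```python
import math
import statistics
from statistics import NormalDist

def mean_ci(values, confidence=0.95):
    """z-based confidence interval for the population mean."""
    n = len(values)
    mean = statistics.mean(values)
    # Standard error of the mean from the sample standard deviation.
    se = statistics.stdev(values) / math.sqrt(n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * se, mean + z * se

# Weights (kg) of 20 students, from the quiz question on this page.
weights = [65, 75, 55, 60, 50, 59, 62, 70, 61, 57,
           62, 71, 63, 69, 55, 51, 56, 67, 68, 60]

lower, upper = mean_ci(weights, confidence=0.95)
print(f"95% confidence interval for the mean weight: {lower:.2f} to {upper:.2f} kg")
```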

Significance Testing of Statistical Hypothesis

  • Significance testing involves testing a hypothesis to determine if the data supports it.
  • The process of significance testing involves:
    1. Testing pre-conditions on the data.
    2. Making a statistical hypothesis (null and alternative hypotheses).
    3. Performing the desired statistical analysis.
    4. Analysing the results.
  • The null hypothesis defines a particular value for the parameter or specifies that there is no difference or change.
  • The alternative hypothesis specifies the values or difference in parameter values.

Correlation and Regression

  • Correlation is used to determine the strength of linear association between two quantitative variables.
  • The correlation coefficient (r) measures the strength and direction of the linear relationship.
  • The value of r lies between -1 and 1, where a positive value indicates a positive relationship and a negative value indicates a negative relationship.
  • A correlation coefficient close to 1 or -1 indicates a strong linear association.
  • Correlation does not imply causation.
  • Simple linear regression predicts a response variable using one explanatory variable.
  • The equation for simple linear regression is: ypredicted = a + bx, where a is the intercept and b is the slope.
  • The method of least squares is used to find the regression line by minimizing the sum of squares of the residuals.
  • The predictive power of the model is measured by the coefficient of determination (r²), the proportion of the variation in the response variable that the regression line explains.
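
The least-squares line, the correlation coefficient, and r² can all be computed from the same sums of squares. This is a minimal sketch with hypothetical data; `fit_line` is an illustrative helper, not a library function.

```python
import math

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b, r)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                    # slope that minimizes the sum of squared residuals
    a = mean_y - b * mean_x          # intercept: the line passes through (mean_x, mean_y)
    r = sxy / math.sqrt(sxx * syy)   # correlation coefficient, always in [-1, 1]
    return a, b, r

# Hypothetical explanatory (x) and response (y) values.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b, r = fit_line(xs, ys)
r_squared = r ** 2

print(f"ypredicted = {a:.1f} + {b:.1f}x, r = {r:.3f}, r^2 = {r_squared:.2f}")
```

For this data the fitted line is ypredicted = 2.2 + 0.6x with r² = 0.6, meaning the line accounts for 60% of the variation in y.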

Learn the basics of data science, including its definition, types of data, and basic methods of data analysis. This quiz covers the foundational concepts of data science.
