Data Science Concepts: Z-Score and Data Ethics

FervidSplendor9692
40 Questions

What is the primary purpose of subsetting the data in data science?

To divide the data into smaller, more manageable parts

What is the term for a small set of data taken from a larger set of data?

Subset

What is the primary focus of the second chapter of the textbook?

Use of Statistics in Data Science

What is the term for a measure of how much individual data points deviate from the mean?

Standard Deviation

What is the term for the middle value in a dataset when it is arranged in order?

Median

What is the primary purpose of studying data science?

To analyze and visualize data

What is the term for a set of data that is a part of a larger dataset?

Subset

What is the term for the process of dividing a larger dataset into smaller parts?

Data Subsetting

What is the primary purpose of subsetting in data analysis?

To observe just the required set of data by filtering out unnecessary content

What type of subsetting involves selecting specific columns from the dataset?

Column-based subsetting

What is the result of subsetting a table with 100 rows and 100 columns to retrieve the first 5 rows and columns?

A dataset with 5 rows and 5 columns

What is the term used to describe the process of selecting a part of the data from a data frame?

Data subsetting

What is the benefit of subsetting a large dataset?

It reduces the size of the dataset

What is the purpose of row-based subsetting?

To retrieve a part of the data from the top or bottom of the table

What is the term used to describe the smaller table that is created after subsetting?

Subset

Why is subsetting a significant component of data management?

It helps to focus on the relevant data and reduce unnecessary information

What is the process of selecting specific columns of a dataset known as?

Column-based subsetting

What is the purpose of data-based subsetting?

To select specific rows of a dataset based on certain conditions

What is a two-way frequency table used to demonstrate?

The observed number or frequency for two variables

What do the rows in a two-way frequency table indicate?

The categories of one of the two variables being studied

What information can be obtained from a two-way frequency table?

The number of people who were questioned

What is the purpose of subsetting in data analysis?

To select specific parts of the data based on certain conditions

What is the result of subsetting based on age categories?

A table with selected rows

What is the benefit of using a two-way frequency table?

It allows two categorical variables to be analyzed together

What type of table is used when there are different sample sizes in a dataset?

Two-way relative frequency table

What is the main difference between a two-way frequency table and a two-way relative frequency table?

A two-way frequency table shows counts, while a two-way relative frequency table shows percentages

What is the purpose of using row relative frequencies or column relative frequencies in a two-way relative frequency table?

To compare preferences in different contexts

What is the definition of mean in data science?

The central value around which the data is spread out; it does not have to be a value that appears in the dataset

How is the mean of a dataset calculated?

By adding up all the values in the dataset and dividing by the number of values

What is the example dataset used to illustrate how to find the mean?

Array = {10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20}

What is the purpose of calculating the mean of a dataset?

To summarize the central tendency of the data

What is true about how values are weighted when calculating the mean of a dataset?

All values are weighted equally

In a dataset with an odd number of records, how is the median calculated?

By finding the middle value

What is the main advantage of using the median as a measure of central tendency?

It is less affected by outliers in the dataset

What happens to the mean when there is an outlier in the dataset?

It deviates greatly from the regular values

Why is the median a more effective measure of central tendency in certain scenarios?

Because it is less affected by outliers in the dataset

What is the result of calculating the median for an even number of records?

The average of the two middle points

What is the primary reason for using the median instead of the mean in certain scenarios?

Because the median is less affected by outliers

What is the advantage of using the median as a measure of central tendency in the given example?

It represents the typical value better than the mean when there are outliers

What is the result of having an outlier in the dataset in the given example?

The mean deviates greatly from the regular values

Study Notes

Data Analysis Techniques

  • Subsetting is a significant component of data management, used for selecting and filtering variables and observations.
  • It helps to observe only the required set of data by filtering out unnecessary content.

Subsetting Methods

  • Row-based subsetting: selecting specific rows from the top or bottom of the table.
  • Column-based subsetting: selecting specific columns from the dataset.
  • Data-based subsetting: breaking down the data into specific categories and selecting only those rows that meet the criteria.
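
The three methods above can be sketched in plain Python. This is a minimal illustration, assuming a small hypothetical table stored as a list of dictionaries; the column names (name, age, city) and the values are made up for the example and do not come from the textbook. A data-frame library would offer equivalent operations.

```python
# Hypothetical table: each dict is one row; columns and values are illustrative only.
table = [
    {"name": "Asha", "age": 14, "city": "Pune"},
    {"name": "Ben", "age": 17, "city": "Delhi"},
    {"name": "Chen", "age": 15, "city": "Pune"},
    {"name": "Dia", "age": 18, "city": "Mumbai"},
    {"name": "Eli", "age": 16, "city": "Delhi"},
]

# Row-based subsetting: take rows from the top or bottom of the table.
top_two = table[:2]
bottom_two = table[-2:]

# Column-based subsetting: keep only the selected columns in every row.
wanted = ["name", "age"]
name_age = [{col: row[col] for col in wanted} for row in table]

# Data-based subsetting: keep only the rows that meet a condition.
age_16_plus = [row for row in table if row["age"] >= 16]

print([row["name"] for row in top_two])
print([row["name"] for row in bottom_two])
print(name_age[0])
print([row["name"] for row in age_16_plus])
```

In each case the original table is left unchanged and the subset is a new, smaller table.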

Two-Way Frequency Tables

  • A two-way frequency table is a statistical table that demonstrates the observed number or frequency for two variables.
  • It shows how many data points fit in each category.
  • The row category and column category are used to organize the data.
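
As a rough sketch of how such a table can be built, the example below counts hypothetical survey responses by age group (rows) and preferred sport (columns). The data and the helper name two_way_table are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical survey responses as (age group, preferred sport) pairs.
responses = [
    ("teen", "cricket"), ("teen", "football"), ("teen", "cricket"),
    ("adult", "football"), ("adult", "football"), ("adult", "cricket"),
    ("teen", "cricket"), ("adult", "football"),
]

def two_way_table(pairs):
    """Count how many observations fall into each (row, column) cell."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    return {r: {c: counts[(r, c)] for c in cols} for r in rows}

table = two_way_table(responses)
for row_label, cells in table.items():
    print(row_label, cells, "row total:", sum(cells.values()))
```

Adding up every cell gives the total number of people questioned, which is 8 in this made-up sample.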

Interpreting Two-Way Tables

  • Two-way relative frequency tables are helpful when there are different sample sizes in a dataset.
  • Converting counts to percentages makes it easier to compare preferences between groups of different sizes.
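
To see why percentages help, consider the sketch below, which converts hypothetical counts into row relative frequencies. The two groups have deliberately different sample sizes (30 teens and 120 adults); the numbers and the function name row_relative_frequencies are assumptions for illustration.

```python
# Hypothetical counts with different sample sizes per row (30 teens, 120 adults).
counts = {
    "teen": {"cricket": 18, "football": 12},
    "adult": {"cricket": 48, "football": 72},
}

def row_relative_frequencies(table):
    """Convert each row of counts into percentages of that row's total."""
    result = {}
    for row_label, cells in table.items():
        total = sum(cells.values())
        result[row_label] = {c: round(100 * n / total, 1) for c, n in cells.items()}
    return result

relative = row_relative_frequencies(counts)
print(relative)
```

In raw counts more adults (48) than teens (18) chose cricket, yet the row percentages show that 60% of teens and only 40% of adults prefer it. Dividing by column totals instead would give column relative frequencies.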

Mean

  • Mean is a measure of central tendency, also known as the simple average.
  • It is calculated by adding up all the values in the data set and dividing the sum by the number of values.
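
The calculation can be checked on the example array from the quiz, {10, 11, ..., 20}. The sketch below also appends a hypothetical outlier (500, chosen only for illustration) to show how the mean shifts while the median, discussed in the quiz questions, barely moves.

```python
from statistics import mean, median

values = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

# Mean: sum of all values divided by how many values there are.
manual_mean = sum(values) / len(values)
print(manual_mean, mean(values))

# Hypothetical outlier appended to show its effect on each measure.
with_outlier = values + [500]
print(mean(with_outlier), median(with_outlier))
```

Without the outlier the mean is 165 / 11 = 15. With it, the mean jumps to 665 / 12 (about 55.4), while the median of the twelve values is the average of the two middle points, 15 and 16, which is 15.5.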

Data Merging and Statistics

  • Z-Score: a measure of how many standard deviations an element is from the mean.
  • Percentiles: a measure of the value below which a certain percentage of data points fall.
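  • Quartiles: the three values that split ordered data into four equal parts, below which 25%, 50%, and 75% of data points fall.
  • Deciles: the nine values that split ordered data into ten equal parts, below which 10%, 20%, ..., 90% of data points fall.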
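
These measures can be sketched with Python's standard statistics module, reusing the array {10, 11, ..., 20} from the quiz. Two choices here are assumptions rather than anything stated in the source: the z-score uses the population standard deviation (pstdev), and the cut points use the "inclusive" interpolation method. Other conventions give slightly different values.

```python
from statistics import mean, pstdev, quantiles

values = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

def z_score(x, data):
    """Number of population standard deviations that x lies from the mean."""
    return (x - mean(data)) / pstdev(data)

print(z_score(20, values))

# Cut points that split the ordered data into 4 and 10 equal-sized groups.
quartiles = quantiles(values, n=4, method="inclusive")
deciles = quantiles(values, n=10, method="inclusive")
print(quartiles)
print(deciles)
```

A z-score of 0 means the value equals the mean; here the value 20 lies about 1.58 standard deviations above it. Percentiles follow the same pattern with n=100, which returns 99 cut points.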

Ethics in Data Science

  • Data governance framework: a set of rules and guidelines for managing and analyzing data.
  • Ethical guidelines around data analysis: guidelines for ensuring that data analysis is fair and unbiased.
  • Discarding the Data: the importance of properly disposing of data that is no longer needed.

This quiz covers key data science concepts including z-score, percentiles, quartiles, and deciles. It also explores ethics in data science, including data governance and guidelines.
