Questions and Answers
The function ______() displays the first few rows of a DataFrame in pandas.
head
The ______() method in pandas provides a concise summary of a DataFrame, including data types and non-null counts.
info
The ______() method in pandas calculates summary statistics such as mean, median, and standard deviation for numerical columns in a DataFrame.
describe
The ______ attribute of a pandas Series reveals the data type of the Series.
The ______() method in pandas counts the number of times each unique value appears in a Series.
The ______() method is used to change the data type of a column in a pandas DataFrame.
In pandas, the ~ operator is used to negate a boolean Series, effectively selecting the ______ of the condition.
To select rows in a pandas DataFrame based on a condition, you can pass a boolean Series inside square ______.
The min() and max() methods in pandas are used to find the smallest and largest values, respectively, in a ______ or DataFrame column.
______ are used to visualize the distribution of a continuous variable and to compare distributions across different groups.
The ______ style in Seaborn provides a clean look with white backgrounds and gridlines, enhancing readability.
The ______ and ylabel() functions in Matplotlib and Seaborn are used to label the axes of a plot, making it more informative.
The ______() method returns the mean and standard deviation and other statistics of the data in a Pandas DataFrame.
Using ______ on a DataFrame groups the rows based on one or more columns and allows you to perform aggregate calculations on each group.
A pandas ______ is a one-dimensional labeled array capable of holding any data type.
To count missing values in each column of a DataFrame, you can use the ______().sum() methods in pandas.
The ______ attribute of a DataFrame provides the number of rows and columns as a tuple.
When dealing with missing values, a common approach is to ______ columns that have a percentage of missing values exceeding a certain threshold.
The ______ of missing data help to identify which columns have null attributes and how many records have null values.
In statistics, ______ are data points that differ significantly from other observations.
One way to handle outliers is to ______ values, replacing extreme values with upper or lower limits.
The ______ Transformation can be used to reduce the impact of outliers to make data look more normal.
The ______ is a measure of statistical dispersion, equal to the difference between the 75th and 25th percentiles.
To parse columns as dates when reading a CSV file into pandas, you can use the ______ parameter in the read_csv() function.
The .dt.year attribute of a datetime column in pandas allows you to extract the ______ component.
A ______ Plot is commonly used to visualize the trend of a variable over time, such as the average number of kids by marriage year.
A ______ Heatmap is a graphical representation of the correlation matrix between different variables.
A ______ Plot is useful for visualizing the relationship between two numerical variables.
A scatter plot can be enhanced by using the ______ parameter to add a third dimension of information through color.
______ Density Estimate plots are useful for visualizing the distribution of a single variable.
Setting the ______ parameter in a KDE plot prevents smoothing beyond the extreme data points, ensuring a more accurate representation of the distribution.
A ______ Distribution is a probability distribution that indicates the probability that a variable takes a value less than or equal to a certain value.
______-tabulation helps in identifying how observations occur in combination with one another.
The .dt.month attribute of a datetime column in pandas allows you to extract the ______ of the datetime.
The .dt.weekday attribute of a datetime column in pandas allows you to extract the ______ of the datetime.
When categorizing numerical data, the ______() function is useful for binning values into discrete intervals.
A ______ Plot displays how two variables are related to each other.
______ computing is the delivery of computing services (including servers, storage, databases, networking, software, analytics, and intelligence) over the Internet.
______ as a Service (IaaS) provides you with the computing infrastructure: servers, virtual machines (VMs), storage, networks, operating systems.
______ as a Service (PaaS) provides a framework upon which companies can build code.
______ as a Service (SaaS) provides the end user with code that's already running.
Flashcards
.head() function
Displays initial rows (default 5) of a DataFrame.
.info() function
Provides a concise summary of a DataFrame, including data types and missing values.
.describe() function
Generates descriptive statistics of a DataFrame, including mean, median, and standard deviation.
.value_counts() method
.astype() method
Tilde operator (~)
Boolean indexing
.min() method
.max() method
Boxplot
.agg(["mean", "std"])
.groupby() method
.isnull().sum()
.dropna() method
.fillna() method
Z-score
Outliers
IQR Method
parse_dates parameter
pd.to_datetime()
.dt.year
Correlation
Correlation Heatmap
Scatterplot
Scatterplot with hue
Kernel Density Estimate (KDE)
Cumulative Distribution Function (CDF)
Class Imbalance
Cross-tabulation
.dt.month
.dt.weekday
Categorical binning
Barplot
Cloud Computing
IaaS
PaaS
SaaS
Public Cloud
Private Cloud
Hybrid Cloud
Study Notes
Pandas DataFrames
- The code reads a CSV file into a pandas DataFrame named cars.
- The .head() method displays the first few rows of the DataFrame.
- The .info() method provides a summary of the DataFrame, including data types and missing values.
- The .describe() method calculates descriptive statistics for numeric columns.
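A minimal sketch of these three inspection calls. The CSV file itself is not shown in the notes, so the in-memory text and its column names (model, mpg, cylinders) are hypothetical stand-ins.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file, which the notes do not show.
csv_text = """model,mpg,cylinders
A,21.0,6
B,22.8,4
C,18.7,8
D,,4
"""
cars = pd.read_csv(io.StringIO(csv_text))

first_rows = cars.head(2)   # first two rows; head() defaults to 5
print(first_rows)

cars.info()                 # prints dtypes and non-null counts per column

summary = cars.describe()   # count, mean, std, min, quartiles, max
print(summary)
```

The empty mpg value in the last row is read as NaN, so info() reports three non-null values for that column.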
Working with Data Types
unemployment.dtypes
returns the data type of each column in the unemployment DataFrame.
unemployment["2019"] = unemployment["2019"].astype(float)
converts the "2019" column to the float data type.
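As a small illustration, assuming a toy unemployment DataFrame (the country values and rates here are made up) whose "2019" column was read in as text:

```python
import pandas as pd

# Made-up data: the "2019" rates are stored as strings.
unemployment = pd.DataFrame({
    "country": ["A", "B", "C"],
    "2019": ["4.5", "7.1", "3.9"],
})

print(unemployment.dtypes)  # "2019" is not numeric yet

# Convert the column and assign it back to the DataFrame.
unemployment["2019"] = unemployment["2019"].astype(float)
print(unemployment["2019"].dtype)
```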
Value Counts and Filtering
unemployment['continent'].value_counts()
counts the occurrences of each unique value in the 'continent' column, providing a frequency distribution.
not_oceania = ~unemployment["continent"].isin(["Oceania"])
creates a boolean Series that is True for rows whose continent is not Oceania.
unemployment[not_oceania]
filters the DataFrame to exclude the rows for countries in Oceania.
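The counting and filtering steps can be tried on a toy table; the countries below are illustrative and not taken from the course data.

```python
import pandas as pd

# Illustrative rows only.
unemployment = pd.DataFrame({
    "country": ["Fiji", "Chile", "Kenya", "Samoa", "Peru"],
    "continent": ["Oceania", "South America", "Africa",
                  "Oceania", "South America"],
})

# Frequency of each continent.
counts = unemployment["continent"].value_counts()

# isin() marks Oceania rows as True; ~ flips the mask.
not_oceania = ~unemployment["continent"].isin(["Oceania"])

# Boolean indexing keeps only the rows where the mask is True.
filtered = unemployment[not_oceania]
print(counts)
print(filtered)
```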
Descriptive Statistics and Visualization
unemployment["2021"].min() and unemployment["2021"].max()
calculate the minimum and maximum unemployment rates in 2021.
sns.boxplot(data=unemployment, x="2021", y="continent")
generates a boxplot of 2021 unemployment rates by continent using Seaborn.
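A sketch of the numbers behind these calls, using invented rates. No plot is drawn here; instead the per-continent quartiles are computed, since the box in a box plot spans the 25th to 75th percentile with a line at the median.

```python
import pandas as pd

# Invented rates for illustration.
unemployment = pd.DataFrame({
    "continent": ["Africa", "Africa", "Europe", "Europe"],
    "2021": [10.0, 6.0, 3.0, 5.0],
})

lowest = unemployment["2021"].min()
highest = unemployment["2021"].max()

# Quartiles per continent: the values a box plot summarizes.
quartiles = unemployment.groupby("continent")["2021"].quantile([0.25, 0.5, 0.75])
print(lowest, highest)
print(quartiles)
```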
Aggregation
unemployment.loc[:, "2010":"2021"].agg(["mean", "std"])
calculates the mean and standard deviation of the unemployment rate for each year column from 2010 to 2021.
unemployment.groupby("continent").agg(["mean", "std"])
calculates the mean and standard deviation of each year column, grouped by continent.
- The code groups the unemployment data by continent and calculates the mean and standard deviation of the 2021 unemployment rate for each continent.
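The three aggregations can be sketched as follows. The data is invented, and the output names mean_rate_2021 and std_rate_2021 are assumptions, since the notes do not show the code for the per-continent 2021 statistics. The year columns are selected explicitly before grouping so that no non-numeric column is aggregated.

```python
import pandas as pd

# Invented rates for two years.
unemployment = pd.DataFrame({
    "continent": ["Africa", "Africa", "Europe", "Europe"],
    "2020": [8.0, 6.0, 4.0, 6.0],
    "2021": [10.0, 6.0, 3.0, 5.0],
})

# Mean and standard deviation of each year column.
overall = unemployment.loc[:, "2020":"2021"].agg(["mean", "std"])

# The same statistics, computed separately for each continent.
by_continent = unemployment.groupby("continent")[["2020", "2021"]].agg(["mean", "std"])

# Named aggregation: one output column per (column, function) pair.
stats_2021 = unemployment.groupby("continent").agg(
    mean_rate_2021=("2021", "mean"),
    std_rate_2021=("2021", "std"),
)
print(overall)
print(by_continent)
print(stats_2021)
```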
Handling Missing Values
planes.info()
provides a summary of the DataFrame to identify missing values.
planes.isnull().sum()
counts the number of missing values in each column.
- The code calculates a threshold for dropping columns based on the percentage of missing values.
planes.columns[planes.isna().sum() < threshold]
identifies columns whose missing-value count is below the threshold.
planes.drop(columns=cols_to_drop, inplace=True)
removes the columns whose missing values exceed the threshold.
planes.dropna(subset=["column_name"], inplace=True)
removes rows that have a missing value in any of the specified columns.
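One way these steps might fit together, under stated assumptions: the planes data below is invented, and the 50% cutoff is an arbitrary choice for the example, not a value given in the notes.

```python
import numpy as np
import pandas as pd

# Invented data: "price" is mostly missing, "airline" has one gap.
planes = pd.DataFrame({
    "airline": ["X", "Y", None, "Z", "X", "Y", "Z", "X", "Y", "Z"],
    "price": [100.0, np.nan, np.nan, np.nan, 120.0,
              np.nan, 90.0, np.nan, 110.0, np.nan],
    "stops": [0, 1, 0, 2, 1, 0, 0, 1, 2, 0],
})

missing = planes.isna().sum()     # missing values per column
threshold = len(planes) * 0.5     # assumed cutoff: 50% of the rows

# Columns with too many gaps are dropped entirely.
cols_to_drop = planes.columns[missing > threshold]
planes = planes.drop(columns=cols_to_drop)

# For a column with few gaps, drop only the affected rows.
planes = planes.dropna(subset=["airline"])
print(planes)
```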
Imputing Missing Values
planes["column_name"] = planes["column_name"].fillna(planes["column_name"].mean())
fills missing values in a column with the mean of that column.
planes["column_name"] = planes["column_name"].fillna(planes["column_name"].median())
fills missing values with the median.
planes["column_name"] = planes["column_name"].fillna(planes["column_name"].mode()[0])
fills missing values with the mode.
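A small sketch of the three imputation strategies on invented data. The filled column is assigned back to the DataFrame, a form that behaves the same across pandas versions.

```python
import numpy as np
import pandas as pd

# Invented data with one gap per column.
planes = pd.DataFrame({
    "price": [100.0, np.nan, 300.0, 200.0],
    "duration": [2.0, 4.0, np.nan, 40.0],
    "airline": ["X", None, "X", "Y"],
})

# Mean for a roughly symmetric numeric column.
planes["price"] = planes["price"].fillna(planes["price"].mean())

# Median is less affected by the extreme value 40.0.
planes["duration"] = planes["duration"].fillna(planes["duration"].median())

# Mode (most frequent value) for a categorical column.
planes["airline"] = planes["airline"].fillna(planes["airline"].mode()[0])
print(planes)
```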
Detecting and Removing Outliers
Outlier Detection
- Box plots and scatter plots are used to visually identify outliers.
- IQR (Interquartile Range) Method:
  - Calculate Q1 (25th percentile) and Q3 (75th percentile).
  - Compute IQR = Q3 - Q1.
  - Define lower bound = Q1 - (1.5 * IQR) and upper bound = Q3 + (1.5 * IQR).
  - Data points outside these bounds are considered outliers.
- Z-Score Method:
  - Calculate the Z-score for each data point.
  - Data points with a Z-score greater than 3 or less than -3 are considered outliers.
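The Z-score method can be sketched as below. The values are invented, and the sample needs to be reasonably large: with only a handful of points, no single value can reach a Z-score of 3.

```python
import pandas as pd

# Invented sample: 19 ordinary values and one extreme value.
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
                    12, 10, 11, 13, 12, 11, 10, 12, 11, 100])

# Z-score: distance from the mean in units of standard deviation.
z_scores = (values - values.mean()) / values.std()

# Flag points more than 3 standard deviations from the mean.
outliers = values[z_scores.abs() > 3]
print(outliers)
```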
Outlier Removal
- Whether to remove outliers depends on the context:
  - Remove them only if they are caused by an error.
  - Keep them if they provide insights, such as unusually high revenue.
IQR Outlier Removal Code
- The code calculates lower and upper thresholds from the IQR, then subsets the data according to whether price falls below the lower threshold or above the upper one.
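The original removal code is not shown in the notes, so the following is only a sketch of the IQR approach on an invented price column.

```python
import pandas as pd

# Invented prices; 1000 is the obvious extreme value.
planes = pd.DataFrame({"price": [100, 110, 120, 130, 140, 150, 160, 1000]})

q1 = planes["price"].quantile(0.25)
q3 = planes["price"].quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the rows whose price lies inside the two bounds.
no_outliers = planes[(planes["price"] >= lower) & (planes["price"] <= upper)]
print(lower, upper)
print(no_outliers)
```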
Handling Date Formats
divorce = pd.read_csv("divorce.csv", parse_dates=["divorce_date", "dob_man", "dob_woman", "marriage_date"])
automatically parses the specified columns as dates when reading the CSV file.
divorce["marriage_date"] = pd.to_datetime(divorce["marriage_date"], errors="coerce")
converts a column to DateTime format, setting unparseable dates to NaT.
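Both routes can be tried without the real divorce.csv. The in-memory CSV text below is a made-up stand-in with only two of the columns.

```python
import io

import pandas as pd

# Route 1: parse dates while reading (made-up, clean data).
clean_csv = "marriage_date,num_kids\n2000-06-24,1\n2001-09-12,2\n"
clean = pd.read_csv(io.StringIO(clean_csv), parse_dates=["marriage_date"])

# Route 2: read as text, then convert; bad entries become NaT.
messy_csv = "marriage_date,num_kids\n2000-06-24,1\n2001-09-12,2\nnot a date,0\n"
divorce = pd.read_csv(io.StringIO(messy_csv))
divorce["marriage_date"] = pd.to_datetime(divorce["marriage_date"], errors="coerce")

print(clean.dtypes)
print(divorce)
```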
Visualizing Time Series Data
divorce["marriage_year"] = divorce["marriage_date"].dt.year
extracts the year from the "marriage_date" column to create a "marriage_year" column.
sns.lineplot(data=divorce, x="marriage_year", y="num_kids")
creates a line plot to visualize the relationship between "marriage_year" and "num_kids".
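When several rows share the same x value, sns.lineplot draws the mean of y by default. The sketch below skips the plot and computes that per-year average directly, using invented rows.

```python
import pandas as pd

# Invented marriages: two per year.
divorce = pd.DataFrame({
    "marriage_date": pd.to_datetime(
        ["2000-06-24", "2000-11-02", "2001-09-12", "2001-01-05"]
    ),
    "num_kids": [1, 3, 2, 0],
})

# Extract the year component of each date.
divorce["marriage_year"] = divorce["marriage_date"].dt.year

# Average number of kids per marriage year.
avg_kids = divorce.groupby("marriage_year")["num_kids"].mean()
print(avg_kids)
```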
Correlation and Heatmaps
corr_matrix = divorce.corr(numeric_only=True)
computes the correlation matrix of the numeric columns in the DataFrame.
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
generates a heatmap of the correlation matrix.
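A sketch of the correlation step on invented data. The column names income_man and education_man are assumptions for illustration, and numeric_only=True (available in recent pandas versions) leaves the text column out of the matrix.

```python
import pandas as pd

# Invented data with one non-numeric column.
divorce = pd.DataFrame({
    "marriage_duration": [5, 10, 15, 20],
    "num_kids": [1, 2, 3, 4],
    "income_man": [40, 30, 20, 10],
    "education_man": ["Secondary", "Professional", "Secondary", "Other"],
})

# Pairwise Pearson correlation of the numeric columns only.
corr_matrix = divorce.corr(numeric_only=True)
print(corr_matrix)
```

Values near 1 or -1 indicate a strong linear relationship; the resulting matrix is what a heatmap would then color.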
Scatterplots and KDE Plots
sns.scatterplot(data=divorce, x="marriage_duration", y="num_kids")
creates a scatter plot to visualize the relationship between two variables.
- The hue parameter colors the points by a third variable, and the palette parameter chooses the colors used.
sns.kdeplot(...)
generates a Kernel Density Estimate plot showing the distribution of one variable; it also accepts hue to compare that distribution across the groups of a second variable.
Class Imbalance and Cross-Tabulation
salaries["Job_Category"].value_counts(normalize=True)
calculates the relative frequency of each category in the "Job_Category" column, useful for identifying class imbalance.
pd.crosstab(salaries["Company_Size"], salaries["Experience"])
performs cross-tabulation to show the relationship between company size and experience.
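Both calls can be illustrated on a four-row toy table; the category values are invented.

```python
import pandas as pd

# Invented survey rows.
salaries = pd.DataFrame({
    "Job_Category": ["Data Science", "Data Science", "Data Science", "ML/AI"],
    "Company_Size": ["S", "M", "M", "L"],
    "Experience": ["Entry", "Senior", "Senior", "Entry"],
})

# Share of rows per category; a lopsided split signals class imbalance.
shares = salaries["Job_Category"].value_counts(normalize=True)

# Count of rows for each (company size, experience) combination.
table = pd.crosstab(salaries["Company_Size"], salaries["Experience"])
print(shares)
print(table)
```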
Feature Extraction and Salary Categorization
salaries["month"] = salaries["date_of_response"].dt.month
extracts the month from a datetime column.
salaries["weekday"] = salaries["date_of_response"].dt.weekday
extracts the weekday as an integer (Monday is 0, Sunday is 6).
pd.cut(...)
categorizes salaries into different levels based on specified bins and labels.
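A sketch of the feature extraction and binning. The Salary_USD column name, the bin edges, and the labels are assumptions, because the notes elide the arguments to pd.cut.

```python
import pandas as pd

# Invented responses.
salaries = pd.DataFrame({
    "date_of_response": pd.to_datetime(["2022-01-03", "2022-06-18", "2022-12-25"]),
    "Salary_USD": [40000, 95000, 160000],
})

salaries["month"] = salaries["date_of_response"].dt.month
salaries["weekday"] = salaries["date_of_response"].dt.weekday  # Monday = 0

# Assumed bin edges and labels; each bin includes its right edge.
bins = [0, 50000, 100000, float("inf")]
labels = ["entry", "mid", "senior"]
salaries["salary_level"] = pd.cut(salaries["Salary_USD"], bins=bins, labels=labels)
print(salaries)
```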
Hypothesis Creation and Visual Comparison
- Barplots are used to compare salaries based on location, company size, or employment status using sns.barplot().
usa_and_gb = salaries[salaries["Employee_Location"].isin(["US", "GB"])]
filters data to include only employees in the US or GB.
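By default sns.barplot shows the mean of the y variable for each category, so the comparison can be previewed numerically. The data and the Salary_USD column name below are assumptions for illustration.

```python
import pandas as pd

# Invented salaries.
salaries = pd.DataFrame({
    "Employee_Location": ["US", "GB", "DE", "US", "GB"],
    "Salary_USD": [120000, 80000, 70000, 100000, 60000],
})

# Keep only employees located in the US or GB.
usa_and_gb = salaries[salaries["Employee_Location"].isin(["US", "GB"])]

# Mean salary per location: the bar heights a bar plot would show.
mean_salary = usa_and_gb.groupby("Employee_Location")["Salary_USD"].mean()
print(mean_salary)
```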
Cloud Computing Basics
- Cloud computing delivers on-demand computing resources over the internet.
- IaaS provides virtualized computing resources.
- PaaS provides a platform for developing and deploying applications.
- SaaS delivers software applications over the internet.
- Cloud deployment models include public, private, hybrid, multi-cloud, and community clouds.
- Cloud roles include cloud architect, DevOps engineer, and cloud security engineer.
- GDPR is a major regulation affecting cloud services, ensuring data protection and privacy.
- Major cloud providers include AWS, Azure, and GCP. AWS is the market leader.
Key Data Structures
- Lists are ordered, mutable collections.
- Tuples are ordered, immutable collections.
- Sets are unordered collections with unique values.
- Dictionaries are mutable collections of key-value pairs; since Python 3.7 they preserve insertion order.
- To cast a variable is to convert it from one type to another.
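The properties listed above can be checked in a few lines of plain Python; all names and values are illustrative.

```python
# List: ordered and mutable.
fruits = ["apple", "banana"]
fruits.append("cherry")

# Tuple: ordered but immutable, so item assignment raises TypeError.
point = (3, 4)
tuple_is_immutable = False
try:
    point[0] = 10
except TypeError:
    tuple_is_immutable = True

# Set: duplicates collapse to unique values.
unique = set([1, 2, 2, 3])

# Dictionary: key-value pairs, kept in insertion order.
ages = {"Ann": 31}
ages["Bob"] = 27

# Casting: converting a value from one type to another.
number = int("42")
as_text = str(3.5)

print(fruits, point, unique, ages, number, as_text)
```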