Data Cleaning with the janitor Package

Questions and Answers

Which of the following is NOT a typical function of the `clean_names()` function in the janitor package?

Converting '%' to 'percent' and '#' to 'number' to retain meaning in variable names.
Automatically correcting inaccurate data entries by referencing external databases. (correct)
Handling special characters and spaces, including transliterating characters.
Parsing letter cases and separators to a consistent format, such as converting to snake_case.

What is the primary purpose of the `compare_df_cols()` function in the janitor package?

To identify and summarize differences in column types, presence, and absence across multiple data frames. (correct)
To clean and standardize the column names of multiple data frames to ensure consistency.
To perform statistical comparisons of the data contained within the columns of different data frames.
To merge multiple data frames into a single data frame, resolving any column conflicts.

Which of the following best describes the functionality of the `get_dupes()` function?

It searches for and returns records with duplicated values across specified columns in a data frame. (correct)
It generates a summary report of all unique values present in a dataset, excluding any duplicates.
It efficiently merges multiple datasets by identifying and resolving any duplicate column names.
It automatically corrects common data entry errors by identifying similar, but non-identical entries.
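As a rough illustration of the `get_dupes()` behavior described above, the sketch below uses a small made-up data frame (`orders` and its columns are hypothetical, not from the lesson). It assumes the janitor package is installed.

```r
library(janitor)

# Hypothetical order data: order_id 102 appears twice.
orders <- data.frame(
  order_id = c(101, 102, 102, 103),
  item     = c("pen", "ink", "ink", "pad")
)

# Return only the rows whose order_id is duplicated.
# get_dupes() also adds a dupe_count column with the size of each duplicate set.
dupes <- get_dupes(orders, order_id)
print(dupes)
```

If no columns are named, `get_dupes()` checks for rows duplicated across all columns.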

What is the purpose of the `get_one_to_one()` function?

To identify and group columns that have a one-to-one relationship with each other within a data frame.

How does the `make_clean_names()` function enhance the process of cleaning column names compared to base R functions?

It offers stylings and case choices consistent with the janitor package's conventions, integrating seamlessly into janitor workflows.

What problem does the `single_value()` function address, particularly in conjunction with `dplyr::group_by()`?

It validates that within each group, a column has only one unique value, which can be used to identify inconsistencies or errors in the data.

In what scenarios is the `remove_empty()` function most useful, and what does it do?

It is most useful for cleaning Excel files that contain empty rows or columns after being read into R. It removes rows and/or columns that are entirely `NA`.

What does the `remove_constant()` function do, and how does it treat `NA` values by default?

It drops columns from a data frame that contain only a single constant value. By default (`na.rm = FALSE`), `NA` counts as a distinct value, so a column holding one value plus `NA` is kept; setting `na.rm = TRUE` ignores `NA` values when checking for constancy.
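The two cleanup helpers above, `remove_empty()` and `remove_constant()`, can be sketched together. The data frames `raw` and `flags` are invented for illustration, and the example assumes janitor is installed.

```r
library(janitor)

# Hypothetical spreadsheet import with an all-NA row and an all-NA column.
raw <- data.frame(
  id     = c(1, 2, NA),
  source = c("export", "export", NA),
  blank  = c(NA, NA, NA)
)

# Drop rows and columns that are entirely NA.
no_empty <- remove_empty(raw, which = c("rows", "cols"))

# Drop columns holding a single repeated value (here, `source`).
no_constant <- remove_constant(no_empty)
names(no_constant)

# NA handling: with the default na.rm = FALSE, NA counts as its own value.
flags <- data.frame(id = 1:2, flag = c("y", NA))
kept    <- remove_constant(flags)                # `flag` is kept
dropped <- remove_constant(flags, na.rm = TRUE)  # `flag` is dropped
```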

How does `round_half_up()` differ from R's base `round()` function in handling halves?

`round()` implements 'banker's rounding' (rounds halves to the nearest even number), while `round_half_up()` always rounds halves up.
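A short comparison makes the difference concrete. The input vector `x` is an arbitrary example, and the sketch assumes janitor is installed.

```r
library(janitor)

x <- c(0.5, 1.5, 2.5, 3.5)

# Base R: halves go to the nearest even number.
base_result <- round(x)

# janitor: halves are rounded up.
janitor_result <- round_half_up(x)

print(data.frame(x, base_result, janitor_result))
```

The values 0.5 and 2.5 are where the two functions disagree, since the nearest even numbers (0 and 2) lie below them.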

What specific type of data inconsistency does the `round_to_fraction()` function address, and how does it resolve it?

It enforces a fractional distribution by rounding values to the nearest specified denominator, correcting imprecise or user-entered 'bad' values.

What type of problem does `excel_numeric_to_date()` solve, and what options are available for handling different Excel date encoding systems?

It converts serial dates originating from Excel into `Date` format. By default it assumes the modern Excel date system (`date_system = "modern"`), and it also supports the older `"mac pre-2011"` encoding.
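As a minimal sketch of the two date systems, the same serial number can be converted under each setting. The serial value 40000 is just a sample input, and janitor is assumed to be installed.

```r
library(janitor)

# Default: modern Excel date system.
modern <- excel_numeric_to_date(40000)

# Older Excel for Mac files count days from a different origin.
mac <- excel_numeric_to_date(40000, date_system = "mac pre-2011")

print(modern)
print(mac)
```

The two results differ by roughly four years, which is why the date system matters when a file's origin is unknown.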

How do `convert_to_date()` and `convert_to_datetime()` enhance date and datetime conversions in comparison to `excel_numeric_to_date()`?

They are more robust to handling a mix of inputs. This is useful when reading many spreadsheets that should have the same column formats, but don't.

What is the primary function of the `row_to_names()` function and what is the main impact it has on the data frame?

It elevates a specified row to become the column names of the data frame. By default, that row and any rows above it are then removed from the data.

What does the `find_header()` function do in relation to `row_to_names()`?

`find_header()` works in conjunction with `row_to_names()`: it locates the row that contains the column headers, and that row number can then be passed to `row_to_names()`.
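The pairing of `find_header()` and `row_to_names()` might look like the sketch below. The data frame `raw` is a made-up import with a title line above the real header, and the example assumes a janitor version that includes `find_header()`. With no extra arguments, `find_header()` is assumed here to pick the first row with no missing values.

```r
library(janitor)

# Hypothetical import: a title row sits above the real header row.
raw <- data.frame(
  X1 = c("Report 2023", "id", "1", "2"),
  X2 = c(NA, "score", "90", "85")
)

# Locate the header: the first row without missing values.
header_row <- find_header(raw)

# Promote that row to column names; it and the rows above it are dropped.
tidy <- row_to_names(raw, row_number = header_row)
print(tidy)
```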

For what type of data analysis is the `top_levels()` function originally designed and what kind of output does it provide?

`top_levels()` is designed for use with Likert survey data stored as factors and returns a frequency table with appropriately named rows, grouped into head/middle/tail groups.

The text mentions the `adorn_*` functions. In the context of using `tabyl()`, what is the purpose of these functions?

They format the output of `tabyl()` to customize its appearance.

What is the main purpose of the `tabyl()` function as a replacement for `table()`?

To create frequency tables (counts and percentages) in a way that integrates smoothly with tidyverse tools.
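To show how `tabyl()` and the `adorn_*` functions fit together, here is a small sketch using the built-in `mtcars` dataset. The choice of columns and formatting options is illustrative only, and the native pipe assumes R 4.1 or later.

```r
library(janitor)

# Two-way frequency table of cylinders by gears; the result is a data frame.
counts <- tabyl(mtcars, cyl, gear)

# Layer formatting on top: totals row, row percentages, and raw counts.
report <- counts |>
  adorn_totals("row") |>
  adorn_percentages("row") |>
  adorn_pct_formatting(digits = 1) |>
  adorn_ns()

print(report)
```

Because `counts` is an ordinary data frame, it can also be passed to other data frame tools before any adornment is applied.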

When using `compare_df_cols()` with the argument `return = "mismatch"`, what kind of output should you expect?

A summary exclusively of the columns that differ across the compared data frames.
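A brief sketch of the `return = "mismatch"` output, using two invented monthly extracts (`jan` and `feb` are hypothetical) in which one column was read with a different type. It assumes janitor is installed.

```r
library(janitor)

# Hypothetical extracts: `amount` is numeric in one and character in the other.
jan <- data.frame(id = 1:2, amount = c(10.5, 20))
feb <- data.frame(id = 3:4, amount = c("15", "n/a"))

# Full summary: one row per column name, one column per data frame.
all_cols <- compare_df_cols(jan, feb)

# Only the columns whose classes disagree.
mismatch <- compare_df_cols(jan, feb, return = "mismatch")
print(mismatch)
```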

You have a dataset with mixed date formats, some as ‘yyyy-mm-dd’ strings and others as Excel serial numbers. Which function would most efficiently standardize all dates into a consistent `Date` class?

`convert_to_date()`.
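For the mixed-format scenario in this question, a minimal sketch could look like the following. The vector `mixed` is an invented example, and janitor is assumed to be installed.

```r
library(janitor)

# Hypothetical column mixing an ISO date string with an Excel serial number
# that was read in as text.
mixed <- c("2020-02-29", "40000.1")

# convert_to_date() handles both representations in one call.
dates <- convert_to_date(mixed)
print(dates)
```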

In a dataset with columns named `customerName`, `order_ID`, and `% Discount`, what is the best way to automatically clean these names to follow a consistent snake_case format?

Use the `clean_names()` function.
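Applied to the three column names from the question, a sketch might look like this. The data frame `orders` and its values are hypothetical, and janitor is assumed to be installed.

```r
library(janitor)

# check.names = FALSE keeps the messy names exactly as written.
orders <- data.frame(
  customerName = "Ada",
  order_ID     = 1,
  `% Discount` = 0.1,
  check.names  = FALSE
)

# clean_names() defaults to snake_case and spells out '%' as 'percent'.
cleaned <- clean_names(orders)
names(cleaned)
```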

Flashcards

Janitor Functions

Functions that expedite initial data exploration and cleaning of new datasets.

clean_names()

A function to clean data frame column names. It handles problematic variable names by parsing letter cases, special characters, and duplicated names to create consistency.

clean_names() features

Parses letter cases and separators to a consistent format, with snake_case as the default.

compare_df_cols()

Helps identify differences in columns across multiple data frames, highlighting missing columns and differing column types.