Data Cleaning with Janitor Package

Questions and Answers

Which of the following is NOT a typical function of the clean_names() function in the janitor package?

  • Converting '%' to 'percent' and '#' to 'number' to retain meaning in variable names.
  • Automatically correcting inaccurate data entries by referencing external databases. (correct)
  • Handling special characters and spaces, including transliterating characters.
  • Parsing letter cases and separators to a consistent format, such as converting to snake_case.

What is the primary purpose of the compare_df_cols() function in the janitor package?

  • To identify and summarize differences in column types, presence, and absence across multiple data frames. (correct)
  • To clean and standardize the column names of multiple data frames to ensure consistency.
  • To perform statistical comparisons of the data contained within the columns of different data frames.
  • To merge multiple data frames into a single data frame, resolving any column conflicts.

Which of the following best describes the functionality of the get_dupes() function?

  • It searches for and returns records with duplicated values across specified columns in a data frame. (correct)
  • It generates a summary report of all unique values present in a dataset, excluding any duplicates.
  • It efficiently merges multiple datasets by identifying and resolving any duplicate column names.
  • It automatically corrects common data entry errors by identifying similar, but non-identical entries.

What is the purpose of the get_one_to_one() function?

  • To identify and group columns that have a one-to-one relationship with each other within a data frame. (correct)

How does the make_clean_names() function enhance the process of cleaning column names compared to base R functions?

  • It offers stylings and case choices consistent with the janitor package's conventions, integrating seamlessly into janitor workflows. (correct)

What problem does the single_value() function address, particularly in conjunction with dplyr::group_by()?

  • It validates that within each group, a column has only one unique value, which can be used to identify inconsistencies or errors in the data. (correct)

In what scenarios is the remove_empty() function most useful, and what does it do?

  • It removes empty rows and columns from a data frame, which is most useful for cleaning Excel files that contain empty rows or columns after being read into R. (correct)

What does the remove_constant() function do, and how does it treat NA values by default?

  • It drops columns from a data frame that contain only a single constant value, with an option to consider or ignore NA values during the removal process. (correct)

How does round_half_up() differ from R's base round() function in handling halves?

  • round() implements 'banker's rounding' (rounds halves to the nearest even number), while round_half_up() always rounds halves up. (correct)

What specific type of data inconsistency does the round_to_fraction() function address, and how does it resolve it?

  • It enforces a fractional distribution by rounding values to the nearest specified denominator, correcting imprecise or user-entered 'bad' values. (correct)

What type of problem does excel_numeric_to_date() solve, and what options are available for handling different Excel date encoding systems?

  • It converts serial dates originating from Excel into Date format. By default it assumes a standard Excel-based system, but it offers options for other Excel date encoding systems. (correct)

How do convert_to_date() and convert_to_datetime() enhance date and datetime conversions in comparison to excel_numeric_to_date()?

  • They are more robust to handling a mix of inputs. This is useful when reading many spreadsheets that should have the same column formats, but don't. (correct)

What is the primary function of the row_to_names() function and what is the main impact it has on the data frame?

  • It elevates a specified row to become the column names of the data frame. (correct)

What does the find_header() function do in relation to row_to_names()?

  • find_header() works in conjunction with row_to_names(); its primary purpose is to locate the row containing the column headers. (correct)

For what type of data analysis is the top_levels() function originally designed and what kind of output does it provide?

  • top_levels() is designed for use with Likert survey data stored as factors and returns a frequency table with appropriately-named rows, grouped into head/middle/tail groups. (correct)

The text mentions the adorn_* functions. In the context of using tabyl(), what is the purpose of these functions?

  • They format the output of tabyl() to customize its appearance. (correct)

What is the main purpose of the tabyl() function as a replacement for table()?

  • To create tables of descriptive statistics in a way that integrates smoothly with tidyverse tools. (correct)

When using compare_df_cols() with the argument return = "mismatch", what kind of output should you expect?

  • A summary exclusively of the columns that differ across the compared data frames. (correct)

You have a dataset with mixed date formats, some as ‘yyyy-mm-dd’ strings and others as Excel serial numbers. Which function would most efficiently standardize all dates into a consistent Date class?

  • convert_to_date() (correct)

In a dataset with columns named customerName, order_ID, and % Discount, what is the best way to automatically clean these names to follow a consistent snake_case format?

  • Use the clean_names() function. (correct)

Flashcards

Janitor Functions

Functions that expedite the initial exploration and cleaning of new datasets.

clean_names()

A function to clean dataframe names, handling problematic variable names by parsing letter cases, special characters and duplicated names to create consistency.

clean_names() features

Parses letter cases and separators to a consistent format, with snake_case as default

compare_df_cols()

Helps identify differences in columns across multiple dataframes, highlighting missing or differing column types.

tabyl()

A tidyverse-oriented replacement for table() that counts combinations of one, two, or three variables, and then can be formatted with a suite of adorn_* functions to look just how you want

get_dupes()

Returns the duplicated records (and inserts a count of duplicates) so you can examine the problematic cases.

get_one_to_one()

Shows which, if any, columns in a data frame have one-to-one relationships with each other.

make_clean_names()

Cleans a character vector of names, offering the same stylings and case choices as clean_names().

single_value()

Returns the single value in a column, often used in combination with dplyr::group_by() to validate that every value of X has only one associated value of Y.

remove_empty()

Removes empty rows and columns from a data frame, useful for cleaning imported Excel files.

remove_constant()

Drops columns from a data.frame that contain only a single constant value (with an na.rm option to control whether NAs should be considered as different values from the constant).

round_half_up()

Rounds halves up

round_to_fraction()

Enforces the desired fractional distribution by rounding the values to the nearest value given the specified denominator.

excel_numeric_to_date()

Converts Excel serial date numbers to class Date, with options for different Excel date encoding systems and for preserving fractions of a date as time.

convert_to_date() and convert_to_datetime()

Functions that convert a mix of date and datetime formats to dates or datetimes, and that are more robust to mixed inputs than excel_numeric_to_date().

row_to_names()

Elevates the specified row to become the names of the data.frame and optionally (by default) removes the row in which the names were stored and/or the rows above it.

top_levels()

Returns a tbl_df frequency table with appropriately-named rows, grouped into head/middle/tail groups.

Study Notes

Data Cleaning Using Janitor Package

  • The janitor package expedites initial data exploration and cleaning.
  • Attaching janitor with library(janitor) masks chisq.test() and fisher.test() from package:stats.

Cleaning Data Frames

  • Use clean_names() to clean data.frame names, especially after reading data with readxl::read_excel() or readr::read_csv().
  • It works within a %>% pipeline and handles problematic variable names.
  • Letter cases and separators are parsed into a consistent format, defaulting to snake_case, with other cases like camelCase available.
  • Special characters and spaces are handled, including transliterating characters like œ to oe.
  • Numbers are appended to duplicated names.
  • "%" converts to "percent" and "#" converts to "number".
  • Spacing around numbers is preserved.
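
As a rough illustration of the renaming rules above, the sketch below applies clean_names() to a small made-up data frame. The `orders` data and its column names are assumptions chosen for illustration and do not come from the lesson.

```r
library(janitor)

# Hypothetical data frame with messy column names.
# check.names = FALSE keeps the names exactly as typed.
orders <- data.frame(
  customerName = "Ann",
  order_ID = 1L,
  `% Discount` = 0.1,
  check.names = FALSE
)

# Default case is snake_case; "%" is spelled out as "percent".
cleaned <- clean_names(orders)
names(cleaned)
```

Other cases can be requested through the case argument, for example clean_names(orders, case = "lower_camel").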

Exploring Duplicates

  • get_dupes() identifies and examines duplicate records, inserting a count of duplicates for analysis.
  • It helps find unexpected duplicates, like unique IDs repeated for each year in a tidy data.frame.
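
A minimal sketch of this check, assuming a hypothetical `records` data frame in which each id should appear once per year:

```r
library(janitor)

# Hypothetical tidy data: one row per id per year,
# except id 3, which was accidentally entered twice for 2020.
records <- data.frame(
  id    = c(1, 1, 2, 3, 3),
  year  = c(2020, 2021, 2020, 2020, 2020),
  value = c(10, 11, 20, 30, 31)
)

# Return only the rows that share an id/year combination,
# with a dupe_count column showing how often each combination occurs.
dupes <- get_dupes(records, id, year)
dupes
```

Checking id alone here would flag every id that appears in more than one year, so the choice of columns passed to get_dupes() determines what counts as a duplicate.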

Examining Column Relationships

  • get_one_to_one() identifies columns in a data.frame with one-to-one relationships.
  • A toy example shows how variables can be grouped into one-to-one clusters using the first four rows of the starwars data.frame from the dplyr package.
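
The idea can also be sketched on a smaller, made-up data frame than the starwars one. Here `lookup` and its columns are hypothetical: code and label always change together, while score does not track either of them.

```r
library(janitor)

# Hypothetical data: each code always pairs with the same label.
lookup <- data.frame(
  code  = c("a", "b", "c", "a"),
  label = c("Alpha", "Beta", "Gamma", "Alpha"),
  score = c(1, 2, 3, 4)
)

# Returns a list; each element is a character vector of column
# names that map one-to-one onto each other.
clusters <- get_one_to_one(lookup)
clusters
```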

Manipulating Vectors

  • make_clean_names() cleans a character vector of names, offering the same stylings and case choices as clean_names().
  • While clean_names() operates on a data.frame in a %>% pipeline, make_clean_names() works on any character vector, so it has wider uses.
  • It can be an argument to .name_repair in tibble::as_tibble.
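
A short sketch of both uses, assuming a made-up vector `raw_names` and using the built-in iris data for the .name_repair case (this assumes the tibble package is installed):

```r
library(janitor)

# Hypothetical vector of messy names, not tied to any data frame.
raw_names <- c("customerName", "order_ID", "% Discount")
clean_vec <- make_clean_names(raw_names)
clean_vec

# Passing the function as .name_repair cleans names while the tibble is built.
iris_tbl <- tibble::as_tibble(iris, .name_repair = make_clean_names)
names(iris_tbl)
```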

Validating Column Data

  • single_value() returns the single value in a column and is often used with dplyr::group_by() to validate that every value of X has only one associated value of Y.
  • It fills that value of Y into any missing values in the column. If more than one value of Y is found, the info argument helps pinpoint where that occurs.
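
A minimal sketch of the grouped pattern, assuming a hypothetical `pairs` data frame in which each x should map to exactly one y:

```r
library(dplyr)
library(janitor)

# Hypothetical data: y is missing in one row of group "a".
pairs <- data.frame(
  x = c("a", "a", "b", "b"),
  y = c("red", NA, "blue", "blue")
)

# Within each group, single_value() returns the one non-missing value,
# which mutate() then recycles across the group (filling the NA).
filled <- pairs %>%
  group_by(x) %>%
  mutate(y = single_value(y)) %>%
  ungroup()
filled
```

If a group held two different y values, single_value() would stop with an error instead of choosing one, which is what makes it usable as a validation step.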

Managing Rows and Columns

  • remove_empty() eliminates empty rows and columns in data.frames.
  • It is useful for cleaning Excel files with empty rows and columns after reading the data into R.
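
A small sketch, using a made-up `sheet` data frame that stands in for a spreadsheet read with a blank row and a blank column:

```r
library(janitor)

# Hypothetical import result: row 2 and column b are entirely NA.
sheet <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, NA),
  c = c("x", NA, "z")
)

# Drop rows and columns that contain nothing but NA.
trimmed <- remove_empty(sheet, which = c("rows", "cols"))
trimmed
```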

Removing Constant Columns

  • remove_constant() drops columns with a single constant value from a data.frame.
  • The na.rm option controls whether NA is treated as a distinct value from the constant (the default, na.rm = FALSE) or ignored.
  • remove_constant() and remove_empty() work on matrices as well as data.frames.
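
The effect of na.rm can be sketched with a hypothetical `survey` data frame in which one column is fully constant and another is constant apart from a missing value:

```r
library(janitor)

# Hypothetical data: source never varies; flag is "y" or NA.
survey <- data.frame(
  id     = 1:3,
  source = c("web", "web", "web"),
  flag   = c("y", NA, "y")
)

# Default (na.rm = FALSE): NA counts as a different value, so flag is kept.
kept_flag <- remove_constant(survey)
names(kept_flag)

# na.rm = TRUE: NA is ignored, so flag is also treated as constant.
dropped_flag <- remove_constant(survey, na.rm = TRUE)
names(dropped_flag)
```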

Applying Directionally-Consistent Rounding

  • round_half_up() rounds halves up, unlike R's default "banker's rounding" that rounds to the nearest even number.
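
A quick comparison on a few halves (the input vector is just an illustrative choice):

```r
library(janitor)

halves <- c(0.5, 1.5, 2.5)

# Base R rounds halves to the nearest even number.
base_result <- round(halves)

# round_half_up() always rounds halves up.
up_result <- round_half_up(halves)

base_result
up_result
```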
