Data Wrangling, Cleaning, and Preparation

Questions and Answers

Which of the following is a challenge posed by large volumes of data?

  • Overloaded memory. (correct)
  • Decreased I/O operations.
  • Reduced algorithm complexity.
  • Minimized CPU starvation.

What is a key consideration when performing analysis on large datasets?

  • Using algorithms that always load the entire dataset into memory.
  • Focusing solely on memory usage to avoid swapping.
  • Ignoring I/O and CPU starvation issues to increase speed.
  • Addressing issues like I/O and CPU starvation. (correct)

What happens when more data is processed than fits in the available RAM?

  • The OS efficiently utilizes all the memory.
  • The OS starts swapping out memory blocks to disk. (correct)
  • The OS starts using the CPU as additional memory.
  • The OS compresses all data to fit in the available RAM.

What is a primary concern regarding algorithms and large datasets?

  • Few algorithms can handle large data sets without loading them entirely into memory. (correct)

Which technique helps in solving memory issues when handling large datasets?

  • Data set compression. (correct)

Why is RAM sometimes preferable to a hard disk when handling constantly changing information?

  • RAM offers quicker data access compared to hard disks. (correct)

What characteristic defines an algorithm suited for handling large data?

  • It does not need to load the entire dataset into memory. (correct)

What is a key advantage of using the MapReduce paradigm?

  • It makes it easy to parallelize and distribute computations. (correct)

In the context of online learning algorithms, what is mini-batch learning?

  • Feeding the algorithm a small set of observations. (correct)

How does 'bcolz' optimize memory usage when dealing with large arrays?

  • By compressing data and using the hard drive when the array doesn't fit in memory. (correct)

What does Dask help to achieve when handling large data?

  • Optimizing the flow of calculations and making it easier to perform calculations in parallel. (correct)

Which data structure is best suited for storing data where most values are zero?

  • Sparse matrix. (correct)

How do trees improve data retrieval in databases?

  • By using an index to avoid scanning the table entirely. (correct)

What is the primary function of a hash table?

  • To calculate a key for every value and store these keys in buckets. (correct)

Which characteristic makes Python dictionaries a close relative of key-value stores?

  • Their implementation as hash tables. (correct)

What is the purpose of using tools like Numexpr when dealing with large numerical computations in Python?

  • To evaluate numerical expressions faster than pure Python. (correct)

What is achieved by using Numba?

  • Greater speed via just-in-time compiling. (correct)

How does Bcolz help in overcoming memory issues when using NumPy?

  • It overcomes out-of-memory issues with compressed arrays. (correct)

What is the primary benefit of using Blaze?

  • Providing a Pythonic interface to multiple data stores. (correct)

What distinguishes Theano from other tools for handling large data?

  • It works directly with the graphics processing unit and simplifies equations. (correct)

What capability does Dask provide for optimizing calculations?

  • Flow optimization and distribution of calculations. (correct)

What is the most important reason to utilize existing tools and libraries?

  • To avoid unnecessary reinvention and save time. (correct)

When should user-defined functions and procedures be used in analytical base tables inside a database?

  • Only when features are simple. (correct)

What is the benefit of feeding the CPU compressed data?

  • To avoid CPU starvation. (correct)

Under what condition can switching to the GPU be beneficial?

  • When your computations are highly parallelizable. (correct)

What is the purpose of profiling code?

  • To detect slow sections of the code. (correct)

What is a reason to compile the code yourself?

  • To implement the slowest parts of your code in a low-level language. (correct)

Why is avoiding pulling all data into memory helpful?

  • It enables calculations on extremely large data sets. (correct)

What is the primary goal of data wrangling?

  • To ensure quality, useful data. (correct)

What does the term 'wrangling' refer to in the context of data?

  • Rounding up information in a certain way. (correct)

Which step involves removing outliers and formatting nulls?

  • Cleaning. (correct)

In data wrangling, what does the validation step ensure?

  • Data is consistent throughout the dataset. (correct)

Which of the following use-cases is a use of data wrangling?

  • Detecting corporate fraud. (correct)

Which tool is described as the most basic manual data wrangling tool?

  • Spreadsheets / Excel Power Query. (correct)

What are the major steps to working with datasets in pandas?

  • Cleaning, transforming, and merging. (correct)

Flashcards

Problems Faced Handling Large Data

Challenges with overloaded memory and algorithms that run indefinitely.

Data Wrangling

The process of refining, modifying, and integrating raw data into a more suitable format for analysis.

Cleaning Data

Resolving missing entries, duplicates, and mismatched data.

Transforming Data

Adapting data arrangements, employing functions, and creating new features.

Merging Data

Combining different datasets for a complete analysis.

Data set compression

Data is stored in a smaller form; this shifts work from the slow hard disk to the fast CPU, which affects computation speed.

Algorithm suitable for large data

Doesn't require loading entire data set into memory to make predictions; supports parallelized calculations.

Mini-batch learning

Feeding the algorithm small batches of observations to train with.

bcolz

Store data arrays compactly; uses hard drive when array doesn't fit into main memory.

Dask

Optimizes the flow of calculations and makes it easier to perform them in parallel.

MapReduce

A paradigm that is easy to parallelize and distribute.

Sparse Matrix

A data structure for data sets that contain relatively little actual information compared to their number of entries; most values are zero.

Trees

A structure with a root value and subtrees of children; retrieves information much faster than scanning a table.

Hash Tables

Calculate a key for every value in your data and put the keys in a bucket.

Cython

Lets you specify data types while developing the program; the compiled program runs much faster.

Numexpr

A numerical expression evaluator for NumPy that can be many times faster than NumPy itself.

Numba

Greater speed by compiling your code right before you execute it; also known as just-in-time compiling.

Bcolz

Overcome the out-of-memory problem that can occur when using NumPy.

Blaze

Translates your Python code into SQL but can handle many more data stores than relational databases, such as CSV files and Spark.

Theano

Works directly with the graphics processing unit (GPU) and performs symbolic simplifications whenever possible.

Discovery

Data transformation requires thought about the data's essence before starting.

Organization

The way your raw data is placed inside a dataset.

Data Cleaning

Removing outliers and formatting nulls and duplicates.

Data enrichment

A step back from the data to determine whether you have enough data to proceed.

Study Notes

Data Science Unit 2: Data Wrangling, Data Cleaning, and Preparation

  • Unit 2 covers data wrangling, cleaning, and preparation, including techniques for handling large volumes of data.
  • It also covers data wrangling processes such as cleaning, transforming, merging, and reshaping, including combining and merging datasets, merging on index, concatenating, combining with overlap, and pivoting.
  • It likewise covers data cleaning and preparation techniques such as handling missing data, data transformation, string manipulation, summarizing, binning, classing, standardization, and outlier/noise/anomaly management.

Challenges in Handling Large Data

  • Large data volume poses challenges like overloaded memory and algorithms that run endlessly.
  • Adapting and expanding techniques becomes necessary due to the challenges of working with large data.
  • Analysis can be affected by I/O (input/output) and CPU starvation, potentially causing speed issues.
  • Computers have limited RAM; exceeding it leads the OS to swap memory blocks to disk, reducing efficiency.
  • Algorithms designed for smaller datasets can cause out-of-memory errors when loading large datasets.

Overcoming Large Data Challenges

  • Insufficient memory, unending processes, bottlenecking components, and insufficient speed are common problems.
  • Solutions involve choosing suitable algorithms, appropriate data structures, and the correct tools.
  • Data set compression can alleviate memory problems by reducing data set size; it shifts work from the slow hard disk to the fast CPU, which affects computation speed.
  • Hard disks offer permanent storage, but writing to them takes more time than changing data in RAM.
  • Algorithms should ideally not require loading the entire dataset into memory and support parallelized calculations.

Online Learning Algorithms

  • Full batch learning (statistical learning): Involves feeding the algorithm all data at once.
  • Mini-batch learning: Feeds the algorithm a subset of observations (e.g., 100 or 1,000) at a time; see the sketch after this list.
  • Online learning: Feeds the algorithm one data observation at a time.
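
A minimal mini-batch sketch using scikit-learn's SGDClassifier, whose partial_fit method trains on one batch at a time; the file name, chunk size, and column names are illustrative assumptions:

    # Mini-batch learning: train on one chunk of observations at a time.
    import pandas as pd
    from sklearn.linear_model import SGDClassifier

    model = SGDClassifier()
    classes = [0, 1]  # partial_fit needs every class declared up front

    # "observations.csv" and the "label" column are invented for this sketch.
    for chunk in pd.read_csv("observations.csv", chunksize=1000):
        X = chunk.drop(columns=["label"])
        y = chunk["label"]
        model.partial_fit(X, y, classes=classes)  # one mini-batch per call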

Algorithm Selection

  • bcolz: Compactly stores data arrays and utilizes hard drive space when the array exceeds main memory.
  • Dask: Optimizes calculation flow and facilitates easier, parallel calculations.
  • MapReduce: Useful for parallelizing and distributing tasks.
    • For instance, consider counting votes nationally across 25 parties, 1,500 offices, and 2 million people: ballots can either be gathered and counted centrally, or each office can count its own ballots and hand over per-party results to be aggregated.
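
A tiny pure-Python sketch of the same idea: each office tallies its own ballots (the map step, which can run in parallel), and the per-office tallies are then combined per party (the reduce step). The ballot data is invented:

    # MapReduce-style vote counting with invented example data.
    from collections import Counter
    from functools import reduce

    office_ballots = [
        ["Party A", "Party B", "Party A"],  # ballots at office 1
        ["Party B", "Party B", "Party C"],  # ballots at office 2
    ]

    # Map: every office independently counts its own ballots.
    per_office = [Counter(ballots) for ballots in office_ballots]

    # Reduce: combine the per-office counts into a national result.
    national = reduce(lambda a, b: a + b, per_office, Counter())
    print(national)  # Counter({'Party B': 3, 'Party A': 2, 'Party C': 1})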

Data Structure Considerations

  • Data storage is as vital as the selection of algorithms.
  • Data structures affect storage needs and CRUD (create, read, update, delete) operations performance.
  • Sparse data sets contain relatively little actual information in their entries; most values are zero.
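
A small sketch using SciPy's csr_matrix, which stores only the nonzero entries of such a data set:

    # Sparse storage: only the nonzero values are kept.
    import numpy as np
    from scipy import sparse

    dense = np.zeros((1000, 1000))
    dense[0, 5] = 3.0   # just two nonzero entries
    dense[42, 7] = 1.5

    sp = sparse.csr_matrix(dense)  # compressed sparse row format
    print(sp.nnz)                  # 2 stored values instead of 1,000,000
    print(sp.data.nbytes, "vs", dense.nbytes)  # bytes for values: tiny vs ~8 MB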

Tree Data Structures

  • Trees offer a way of retrieving information significantly faster than scanning tables.
  • Trees contain a root value and subtrees of children, with each child potentially having its own children.
  • Databases often use trees together with other data structures (e.g., hash tables) to index data and avoid scanning the whole table.
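
A sketch of this effect using Python's built-in sqlite3 module; the table, column, and index names are invented. Once the index exists, the query planner walks the index (typically a B-tree) instead of scanning the table:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE people (name TEXT, age INTEGER)")
    con.executemany("INSERT INTO people VALUES (?, ?)",
                    [("alice", 30), ("bob", 25), ("carol", 41)])

    # Index the lookup column so queries no longer scan every row.
    con.execute("CREATE INDEX idx_people_name ON people(name)")

    plan = con.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM people WHERE name = 'bob'").fetchall()
    print(plan)  # reports a search USING INDEX idx_people_name, not a table scan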

Hash Tables

  • Hash tables calculate a key for each data value and store keys in buckets.
  • Data can then be retrieved quickly by looking up the right bucket.
  • Python dictionaries are an implementation of hash tables.
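
A short sketch of why Python dictionaries are such close relatives of key-value stores; every lookup hashes the key to locate its bucket directly:

    # A dict is a hash table: keys are hashed into buckets, so lookups
    # stay roughly constant-time no matter how large the dict grows.
    ratings = {}
    ratings["alice"] = 5   # hash("alice") selects the bucket
    ratings["bob"] = 3

    print(ratings["alice"])    # 5: direct bucket lookup, no scanning
    print("carol" in ratings)  # False: membership is also O(1) on average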

Tool Selection for Data Handling

  • Choosing a tool typically happens after selecting the right algorithms and data structures.
  • The right tool is typically either a Python library or a tool that is controlled from Python.
  • A number of Python libraries exist that deal with large data using smarter data structures, code optimizers and just-in-time compilers.
    • Cython: Requires specification of data types during program development, making compilation and execution faster.
    • Numexpr: A numerical expression evaluator for NumPy, often significantly faster than NumPy itself; see the sketch after this list.
    • Numba: Achieves higher speeds using just-in-time compilation to compile code right before it runs.
    • Bcolz: Overcomes the out-of-memory problem when using NumPy, storing and working with arrays in a compressed form. It also uses Numexpr in the background.
    • Blaze: Translates Python code into SQL, with support for handling data stores like CSV, Spark, and more.
    • Theano: GPU access and symbolic simplifications, coming with a just-in-time compiler.
    • Dask: Optimizes calculation flow, allowing efficient execution and distribution.
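
A minimal sketch contrasting plain NumPy with Numexpr on a single vectorized expression (the array sizes are arbitrary):

    # Numexpr evaluates the whole expression in one pass over the data,
    # avoiding the temporary arrays NumPy allocates for each operation.
    import numpy as np
    import numexpr as ne

    a = np.random.rand(1_000_000)
    b = np.random.rand(1_000_000)

    result_np = 2 * a + 3 * b                 # NumPy: builds temporaries
    result_ne = ne.evaluate("2 * a + 3 * b")  # Numexpr: single fused pass

    assert np.allclose(result_np, result_ne)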

General Programming Advice

  • Leverage existing tools and libraries to avoid reinventing the wheel.
  • Maximize hardware potential through adaptations.
  • Reduce computing needs by streamlining memory and processing requirements.
  • The first reaction most data scientists have when working with large data sets is to prepare their analytical base tables inside a database.
  • Exploit databases for simple feature preparation.
  • Adopt user-defined functions for advanced modeling when available.
  • Use optimized libraries like Mahout and Weka.
  • Only use advanced modeling when necessary.

Hardware Optimization

  • Feed the CPU compressed data.
  • Switching to the GPU for highly parallelizable computations can provide higher throughput.
  • Use multiple Python threads or processes to keep all CPU cores busy.
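
Because the interpreter's global lock limits pure-Python threads for CPU-bound work, a process pool is the usual way to occupy every core; a minimal sketch with an invented work function:

    # Spread independent CPU-bound tasks over all cores with a process pool.
    from multiprocessing import Pool

    def simulate(n: int) -> int:
        # Hypothetical expensive, independent computation.
        return sum(i * i for i in range(n))

    if __name__ == "__main__":
        with Pool() as pool:  # defaults to one worker per CPU core
            results = pool.map(simulate, range(100_000, 100_008))
        print(results)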

Reducing Computing Needs

  • Remove as much of the work as possible before the data is processed.
  • Use a profiler to detect slow parts of your code.
  • Use compiled code or optimized numerical computation packages where appropriate.
  • Compile the code yourself in lower-level languages such as C or Fortran.
  • Read data in chunks rather than pulling it all into memory at once; see the sketch after this list.
  • Use generators to avoid intermediate data storage.
  • Use only the absolute minimum data you require.
  • Simplify calculations as much as possible.
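
A sketch combining chunked reading with a generator so that no intermediate results pile up in memory; the file name and column are illustrative assumptions:

    # Aggregate a large CSV without ever holding it all in memory.
    import pandas as pd

    def chunk_sums(path, column, chunksize=100_000):
        # Generator: yields one partial sum per chunk, storing none of them.
        for chunk in pd.read_csv(path, chunksize=chunksize):
            yield chunk[column].sum()

    total = sum(chunk_sums("big_file.csv", "amount"))
    print(total)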

Data Wrangling Explained

  • Data wrangling, also known as data munging, transforms raw data into a valuable, usable format for analytics.
  • Data wrangling assures data quality and utility.
  • Key operations involve discovery, organization, cleaning, enrichment, validation, and publishing.
    • Discovery: Understand the source data.
    • Organization: Structure your raw data once gathered.
    • Cleaning: Remove outliers, nulls, and duplicates.
    • Enrichment: Take a step back and check that you have enough data to proceed.
    • Validation: Apply validation rules to check for consistency throughout your dataset.
    • Publishing: Provide data notes/documentation and give other people access to use the data.
  • It can be used for tasks such as corporate fraud detection via analysis of multi-party or multi-layered emails.
  • It also facilitates improved insights into areas such as customer behaviour analysis.

Data Wrangling Tools

  • Spreadsheets/Excel with Power Query are basic manual data wrangling tools.
  • OpenRefine is an automated cleaning tool requiring some programming skills.
  • Tabula is suited to extracting tabular data from documents such as PDFs.
  • Google DataPrep is a data service that explores, cleans, and prepares data.
  • Data Wrangler is a data cleaning and transforming tool.
  • Plotly (with Python) is useful for map and chart data.

Common Data Operations

  • When dealing with datasets in pandas, the three primary steps include cleaning, transforming, and merging.
    • Cleaning: Managing missing values, duplicates, and inconsistent data.
    • Transforming: Includes altering data formats, applying functions, and feature engineering.
    • Merging: Combining multiple datasets.

Working with Data in Pandas

  • Clean your data by handling missing or inconsistent values using dropna() and fillna() functions in pandas.
  • For example, missing values can be replaced with "Unknown" or the mean value using fillna().
  • Functions such as apply() can be used to apply data transformations to certain columns.
  • Datatypes for a column can be changed as appropriate with the astype() command.
  • An existing column can be renamed with the rename() command.
  • Columns can have their values derived from other columns using the assign() command; for example, df.assign(Age_Squared=df['Age'] ** 2) creates a column "Age_Squared" whose values are the age squared.
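
A compact sketch tying these pandas operations together on an invented DataFrame:

    import pandas as pd

    df = pd.DataFrame({"name": ["Ann", "Ben", None],
                       "Age": [28, None, 35]})

    df["name"] = df["name"].fillna("Unknown")       # fill missing strings
    df["Age"] = df["Age"].fillna(df["Age"].mean())  # or fill with the mean
    df = df.dropna()                                # drop any remaining gaps

    df["Age"] = df["Age"].astype(int)               # change a column's dtype
    df = df.rename(columns={"name": "Name"})        # rename a column
    df = df.assign(Age_Squared=df["Age"] ** 2)      # derive a new column
    df["Name"] = df["Name"].apply(str.upper)        # apply a function
    print(df)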
