Data Acquisition and Preparation

Questions and Answers

Which of the following considerations is most crucial when gathering data during the data acquisition phase?

  • The physical location of the data servers.
  • The individuals authorized to access the data.
  • The completeness and origin of the data. (correct)
  • The brand of the data collection tools.

What is data preparation primarily focused on?

  • Transforming data from various sources into a unified format. (correct)
  • Securing data with advanced encryption techniques.
  • Implementing the latest business strategies directly into the dataset.
  • Rapid data collection, disregarding quality for speed.

Addressing which of the following issues is most likely to prevent project delays and cost overruns during data preparation?

  • Ignoring data quality issues to meet deadlines.
  • Underestimating the time and resources needed. (correct)
  • Using the latest technology regardless of its suitability.
  • Limiting the project scope to a single data source.

When addressing missing values in a dataset, choosing to impute involves:

  • Calculating missing values based on other data observations. (correct)

Which scenario demonstrates the application of 'customer householding' in data cleansing?

  • Linking personal and business accounts of family members under one household. (correct)

What is the primary goal of data wrangling?

  • To transform data into a format suitable for analysis and BI tools. (correct)

In data wrangling, the term 'granularity' refers to:

  • The level of detail represented by the data. (correct)

Which of the following actions is considered a transformation related to data 'structuring' during data wrangling?

  • Changing the order of fields within a record. (correct)

What does 'data profiling' primarily aim to achieve?

  • Understanding the content, quality, and relationships within data. (correct)

Which of the following is an example of 'semantic checks' in individual data profiling?

  • Confirming that a 'date of order' does not fall on a public holiday if no delivery is scheduled. (correct)

When conducting 'set-based profiling,' what is the significance of checking for duplication?

  • To identify redundant information across multiple records. (correct)

If a data analyst chooses to apply the 'Last Observation Carried Forward' (LOCF) method, what issue are they most likely addressing?

  • Handling missing values in time series data. (correct)

What data wrangling transformation best describes concatenating two datasets 'horizontally' to combine their attributes into a wider table?

  • Unions (correct)

Which of the following dataset structure types is characterized by having varied record lengths?

  • Jagged dataset (correct)

When preparing for predictive modeling, why is it important to examine data properties and compute descriptive statistics?

  • To discover data anomalies and identify patterns and trends. (correct)

In the context of predictive modeling, what does 'model decay' refer to?

  • The decline in a model's effectiveness due to changing conditions. (correct)

Which action is part of monitoring and maintenance in the data science lifecycle?

  • Alerting business of model decay and modifying models. (correct)

What is the primary purpose of 'visualization and communication' in the data science lifecycle?

  • To present findings and insights to business users in an understandable format. (correct)

Which BI and analytics maturity level is characterized by inconsistent data?

  • Opportunistic (correct)

In the BI and Analytics Maturity Model, what advancement signifies a move from the 'Standards' to the 'Enterprise' level?

  • The deployment of an enterprise metrics framework. (correct)

According to research, what is a common challenge faced by businesses regarding big data and AI initiatives?

  • Business adoption. (correct)

Which factor is most likely to impact analytics maturity in an organization?

  • The level of executive support for analytics. (correct)

What is a critical mitigation strategy to avoid analytics projects failure related to 'lack of expertise'?

  • Ensuring deep understanding of the business. (correct)

During data preparation, what is the purpose of reformatting data?

  • To convert data into a common format and schema. (correct)

What does the 'Validity checks' technique primarily address in data cleansing?

  • Checking and correcting invalid formats and values. (correct)

Which data wrangling step involves selecting a smaller, more relevant portion of a dataset?

  • Filter (correct)

What does transforming unstructured data into a numeric or categorical format exemplify within data wrangling?

  • Restructuring & de-normalizing (correct)

When profiling data, what does 'value range' refer to?

  • Whether the value falls within a permissible set of values. (correct)

What is the main objective of rescaling data values into a range from 0 to 1 during data normalization?

  • Preparing data for specific algorithms by having all values on the same scale. (correct)

Using samples of big data to iteratively refine data wrangling steps is called:

  • Data Sampling (correct)

How would you characterize a dataset where different entities have varied structures?

  • Heterogeneous (correct)

Timestamps may be used to identify when a record was created or the last date on which it was known to be accurate. Which descriptive data wrangling term does this relate to?

  • Temporality (correct)

What does calculating a sentiment score from a chat bot transcript exemplify?

  • Computation of new data values (correct)

Which activity is not considered part of data acquisition?

  • Transforming data to fit business requirements. (correct)

Applying business rules, algorithms, and filters, and creating associations are all processes used to accomplish what during data preparation?

  • Transform Data (correct)

What aspect of data quality introduces bad data through operational system defects, human errors, manual steps or environmental instability as examples?

  • Data quality at source (correct)

What are the three courses of action for improving analytics maturity?

  • Predict and forecast the future, infer why it has occurred, and determine the best course of action. (correct)

Flashcards

Data Acquisition

Gather, extract, and mine data from an enterprise's source systems, cloud-based applications, and external sources.

Data Preparation

Set of processes to gather data from diverse sources, transform according to rules, and stage it for useful information.

Reformat data

Convert data from multiple systems into common format & schema using a data dictionary.

Consolidate & Validate data

Consolidate using standard definitions; validate by querying against pre-defined business rules.

Transform data

Transform data into business information using business rules, algorithms, & filters.

Cleanse data

Analyze data for quality and clean up data issues.

Store data

Store the resulting data for further processing.

Data Preparation Challenge

Volume, variety and veracity of data.

Validity checks

Check and correct invalid format/values (e.g. range, syntax, typos, whitespace).

Relevance checks

Detect and remove irrelevant data (corrupted, inaccurate, or irrelevant for analytics goals).

Duplicate removal

Find, resolve, and remove duplicate information from multiple sources or processes.

Consistency checks

Detect contradictory/incompatible values using constraints/business rules (e.g. age vs. birth year).

Data profiling

Use summary statistics to assess quality (range, mean, outliers).

Visualization

Visualize data to detect unexpected/erroneous values (e.g. outliers).

Missing values

Handle missing values with discard, imputation, or flagging.

Data Normalization

Rescale data values into a range from 0 to 1, for normally distributed data.

Data Wrangling

Aggregation, summarization, and enrichment of data for use with BI tools.

Gather

Gather data from various sources, considering different formats and structures.

Filter

Choose a smaller, relevant part of the dataset: tables, rows, and columns.

Subset

Create subsets relevant to the analytics problem.

Profile data

Examine data content, quality, and relationships.

Restructure & de-normalize

Transform data from source schema to the target BI tool schema.

Enrich Data

Perform business transformations required for business purposes.

Individual values profiling

Understanding the validity of individual record fields.

Set-based profiling

Checking the validity of a value distribution: is it as expected?

Key analysis

Finds potential primary keys or potential foreign keys.

Dependency analysis

Determine dependent relationships within a dataset or across tables.

Structuring

Actions that change the form or schema of a record.

Granularity

Actions that change the level of detail of the data, such as aggregations and pivots.

Cleansing of missing values

Predominantly manipulating individual field values.

Invalid data

Data that is inconsistent with other fields.

De-duplication

Removal of duplicate records to fix irregularities.

Data standardization

Replacing differing values, codes, or spellings with a standard form.

Subsetting

Split data into subsets to wrangle them separately.

Sampling

Using samples of big data to iteratively refine data wrangling steps.

Enriching

Actions that add new values to the dataset.

Monitor model performance

Models need to adapt to changing business conditions and data.

Analytics Maturity Factors

Key factors that affect how organizations scale analytics: people, culture, organization, process, data quality, and technology.

Analytics Projects Failure Reasons

Projects not focused on the business, or with unclear or inaccurate results.

Study Notes

Data Acquisition

  • Data is gathered and extracted from enterprise source systems, cloud-based applications, and external sources
  • Data understanding requires knowledge of its origin, completeness, collection points, handling, processing and quality issues

Data Preparation

  • It is defined as a set of processes to gather data from diverse sources and transform it according to business and technical rules
  • Data preparation stages data for conversion into useful information
  • Reformatting converts data from multiple systems into a common format, requiring schema and column definitions, also known as a data dictionary
  • Consolidation standardizes data definitions
  • Validation tests data by querying it against predefined business rules
  • Transformation converts data into business information by applying business rules, algorithms, and filters, and by creating associations
  • Cleansing analyzes data for quality and consistency and addresses issues
  • The final step involves storing the processed data for further use
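
The validation step above can be sketched in a few lines of plain Python. The record fields and the two business rules here are illustrative assumptions, not part of the source material:

```python
# Hypothetical records and business rules, for demonstration only.
records = [
    {"customer_id": 1, "age": 34, "country": "SG"},
    {"customer_id": 2, "age": -5, "country": "SG"},   # violates the age rule
    {"customer_id": 3, "age": 51, "country": ""},     # violates the country rule
]

business_rules = {
    "age must be between 0 and 120": lambda r: 0 <= r["age"] <= 120,
    "country must be non-empty":     lambda r: bool(r["country"]),
}

def validate(records, rules):
    """Query each record against pre-defined rules; return (valid, failures)."""
    valid, failures = [], []
    for r in records:
        broken = [name for name, check in rules.items() if not check(r)]
        if broken:
            failures.extend((r, name) for name in broken)
        else:
            valid.append(r)
    return valid, failures

valid, failures = validate(records, business_rules)
```

Collecting failures rather than silently dropping records keeps an audit trail for the cleansing step that follows.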

Data Preparation Challenges

  • Project delays and cost overruns result from underestimating time and resources
  • Variations in format, differing rules, and collection rates present a challenge
  • Operational defects, human errors, manual steps, and environmental instability introduce data quality issues
  • Skewed information arises through biased transformation and aggregation involving defects, errors, and incorrect algorithms
  • Incomplete understanding of source data leads to suboptimal data models and may cause missed relationships
  • The failure to bring data from multiple sources to a consistent definition inhibits joining or comparing it

Data Cleansing Techniques

  • Validity checks correct invalid formats and values like syntax errors, typos, and white space
  • Relevance checks detect and remove irrelevant data that is corrupted, inaccurate
  • Duplicate removal finds and resolves duplicate information, such as events recorded by two sources, the same event processed twice, or duplicate customer addresses
  • Consistency checks identify contradictory or incompatible values
  • Assess data quality using summary statistics to detect range, mean, distribution, unique values, and outliers during data profiling
  • Visualization helps detect unexpected or irregular values, such as outliers
  • Missing values can be addressed using one of three methods, by discarding, imputing, or flagging
  • Data normalization rescales data values into a range from 0 to 1 for normally distributed data
  • Standardizing names and addresses are examples of data cleansing
  • Linking personal and business accounts of family members under a household grouping is considered an example of data cleansing
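
Two of the techniques above, imputing missing values and rescaling into the 0-1 range, can be sketched with the standard library alone. The sample ages are made up for illustration:

```python
from statistics import mean

ages = [25, None, 40, 35, None, 50]  # None marks a missing value

# Impute: replace each missing value with the mean of the observed values.
observed = [a for a in ages if a is not None]
imputed = [a if a is not None else mean(observed) for a in ages]

# Normalize: rescale every value into the range [0, 1] (min-max scaling).
lo, hi = min(imputed), max(imputed)
normalized = [(a - lo) / (hi - lo) for a in imputed]
```

Discarding or flagging, the other two options the notes mention, would instead drop the `None` rows or add a boolean "was missing" field alongside the imputed value.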

Data Wrangling

  • Data wrangling, a.k.a. data munging, involves aggregation, summarization, and enrichment of data for use with business intelligence (BI) tools
  • The BI tools selected influence the data wrangling process
  • Data wrangling is an iterative process that can involve profiling and transformation
  • Data is gathered from sources in different formats, filtered down to the relevant part of the dataset, and split into analytical subsets
  • Data is examined and relationships are uncovered by profiling
  • Transforming data from a source schema to the target BI tool schema is an aspect of data wrangling
  • Advanced cleansing may repeat cleaning performed earlier during the data preparation phases
  • Enriching data involves applying business transformations, joining multiple datasets, and creating aggregations
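
A toy walk-through of the gather, filter, and subset steps above, using plain Python lists of dicts. The source names and fields are illustrative assumptions:

```python
# Gather: data arrives from two hypothetical sources.
crm_rows = [{"id": 1, "region": "EU", "spend": 120},
            {"id": 2, "region": "US", "spend": 80}]
web_rows = [{"id": 3, "region": "EU", "spend": 200}]
gathered = crm_rows + web_rows

# Filter: keep only rows relevant to the analytics question (EU customers).
eu_rows = [r for r in gathered if r["region"] == "EU"]

# Subset: project only the columns the downstream BI tool needs.
subset = [{"id": r["id"], "spend": r["spend"]} for r in eu_rows]
```

In practice a data wrangling tool or pandas would do this declaratively, but the shape of the pipeline is the same.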

Data Profiling Questions

  • Questions for understanding data include:
    • What's in the data
    • What is its quality
    • Is it complete and unique
    • Are there problem records or anomalies
    • What is the data distribution
    • What are the ranges of values

Types of Data Profiling

  • Individual values profiling involves checking the validity of individual record fields, such as data formats, value ranges, and semantic checks for contextual relevance
  • Set-based profiling involves understanding value distributions and checking validity across multiple records, including analysis of numeric, categorical, geospatial, and date-time fields
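
A minimal sketch of the two profiling styles: an individual-value range check per field, and a set-based check that flags values far from the rest of the distribution. The order amounts are invented for the demo:

```python
from statistics import mean, pstdev

order_amounts = [19.9, 25.0, 22.5, 24.0, 980.0]  # last value looks anomalous

# Individual values profiling: is each field valid on its own (range check)?
individually_valid = [0 < x < 10_000 for x in order_amounts]

# Set-based profiling: is the distribution as expected?
# Flag values more than 1.5 standard deviations from the mean.
mu, sigma = mean(order_amounts), pstdev(order_amounts)
outliers = [x for x in order_amounts if abs(x - mu) > 1.5 * sigma]
```

Note that 980.0 passes the individual range check but fails the set-based check, which is exactly why both styles are needed.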

Data Wrangling Transformations

  • Structuring changes a dataset's form or schema through intra-record actions (from changing the order of fields within a record to combining fields into complex structures) and inter-record actions
  • Granularity transformations involve actions such as aggregations and pivots, which shift records into fields
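
A granularity change sketched in plain Python: aggregating order-level records up to one value per customer. The field names are illustrative assumptions:

```python
from collections import defaultdict

orders = [
    {"customer": "A", "amount": 10},
    {"customer": "A", "amount": 15},
    {"customer": "B", "amount": 7},
]

# Aggregate: the output has customer-level granularity, not order-level.
totals = defaultdict(int)
for o in orders:
    totals[o["customer"]] += o["amount"]

per_customer = dict(totals)
```

The input has three order-level records; the output has two customer-level ones, which is the granularity shift the notes describe.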

Data Wrangling Methods

  • Cleansing missing values fixes irregularities in the dataset, primarily by handling missing (or NULL) values through one of the following methods:
    • Discarding records, imputing values using statistical methods such as the mean or median, or keeping the records and adding a flag
  • Cleansing invalid or inconsistent data involves correcting inconsistent values, either by overwriting the original values or by marking them as invalid
  • Duplicate record removal and record reconciliation are also aspects of data wrangling
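
Minimal sketches of two of the methods above: Last Observation Carried Forward (LOCF) for missing values in a time series, and duplicate-record removal keyed on a record identifier. The sample data is invented:

```python
def locf(series):
    """Fill each None with the most recent non-missing value (LOCF)."""
    filled, last = [], None
    for v in series:
        if v is not None:
            last = v
        filled.append(last)
    return filled

readings = [3.1, None, None, 3.4, None]
filled = locf(readings)

# De-duplication: keep only the first occurrence of each record key.
records = [("evt-1", "login"), ("evt-2", "click"), ("evt-1", "login")]
seen, deduped = set(), []
for r in records:
    if r[0] not in seen:
        seen.add(r[0])
        deduped.append(r)
```

LOCF is appropriate for time series where the last known value is a reasonable stand-in; it would be a poor choice for data with no temporal ordering.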

Dataset Structure Types

  • Rectangular datasets are arranged in a basic database matrix.
  • Jagged datasets vary in record length
  • Heterogeneous datasets feature differing structures in a single dataset

Transformations and Data Enrichment

  • Data is enriched with multiple datasets by joins, unions, and metadata enrichment from data-related information
  • Computing new values derives new data from existing data; categorization reduces the number of categories
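
A sketch of enrichment by joining two datasets "horizontally" (adding columns for matching keys) and then computing a new derived value. The dataset contents are illustrative assumptions:

```python
customers = {1: {"name": "Ana"}, 2: {"name": "Ben"}}
balances  = {1: {"balance": 120.0}, 2: {"balance": -30.0}}

# Join: combine the attributes of matching keys into a wider record.
enriched = {
    cid: {**customers[cid], **balances.get(cid, {})}
    for cid in customers
}

# Compute a new value: derive a category from the joined data.
for row in enriched.values():
    row["status"] = "in_credit" if row["balance"] >= 0 else "overdrawn"
```

A union, by contrast, would stack two datasets of the same shape vertically, adding records rather than attributes.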

Metadata Elements

  • Data structure refers to the records and field encodings
  • Granularity refers to the depth or number of entities represented by a data record.
  • Accuracy refers to the quality and consistency of data
  • Temporality refers to the time sensitivity of the dataset, which impacts dataset quality
  • Scope refers to which attributes of the entities are represented in the dataset

Data Temporality Questions

  • Questions to ask about data temporality include the following:
    • When was the data collected
    • When were records collected or measured
    • Are there any timestamps associated with the data
    • Were records modified
    • At what point does the data turn stale

Data Scope Questions

  • Questions to ask about data scope include the following:
    • What features of records are captured, and are they accessible by name or position
    • Does the dataset have missing records
    • Are multiple records the same, i.e. does the data require deduplication
    • Are heterogeneous records included

Data Publishing

  • Data publishing involves storing refined datasets into the target analytics platform including the logic and scripts needed to generate the datasets

Predictive Modeling Process

  • The predictive model process involves the following steps:
    • Exploring data for anomalies, and patterns
    • Formulating a hypothesis
    • Training machine learning models
    • Evaluating the model performance
    • Deploying the best performing model to production
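
The train-evaluate loop above can be illustrated with a deliberately tiny one-parameter threshold "model", so no ML library is needed. The data and the model are toy assumptions; a real project would use a proper ML framework:

```python
# Toy labeled data: (feature, label) pairs, invented for illustration.
data = [(0.2, 0), (0.4, 0), (0.6, 1), (0.8, 1), (0.3, 0), (0.7, 1)]
train, test = data[:4], data[2:]  # hold out data for evaluation
train, test = data[:4], data[4:]

def accuracy(threshold, rows):
    """Fraction of rows where predicting 1 iff x >= threshold is correct."""
    return sum((x >= threshold) == bool(y) for x, y in rows) / len(rows)

# "Train": pick the threshold that maximizes accuracy on the training set.
best = max((t / 10 for t in range(1, 10)), key=lambda t: accuracy(t, train))

# "Evaluate": measure performance on held-out data before deploying.
test_acc = accuracy(best, test)
```

Evaluating on held-out data is the step that guards against deploying a model that merely memorized its training set.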

Model Decay

  • Model decay occurs when a predictive model's effectiveness gradually declines as conditions change
  • Decay occurs because relationships between behaviours change, new data becomes available, data becomes unavailable, or organizational objectives change

Monitoring & Maintenance

  • Monitoring model performance involves alerting the business when model decay sets in and effectiveness declines
  • This step requires maintaining and updating documentation, and tuning and optimizing models

Analytics Maturity

  • Business intelligence (BI) evolves through five levels:
    • Unaware: spreadsheets dominate
    • Opportunistic: data is inconsistent
    • Standards: business executives become BI champions
    • Enterprise: an enterprise metrics framework is deployed
    • Transformative: analytics drives the company forward and transforms the industry

Analytics Success Data

  • In 2020, 80% of AI projects did not scale
  • Only 20% of analytic insights will deliver business outcomes through 2022
  • 87% of data science projects are never deployed to production

Analytics Maturity Factors

  • Maturity in analytics relies on key factors: people, culture, organization, process, data quality, and technology
  • Executive support and data governance drive maturity

Failure Reasons

  • A lack of focus on business value and over-reliance on software lead to project failure
  • Expertise must be adequate, data must be properly organized and analyzed, and analytics must be properly incorporated
