Questions and Answers
Which of the following considerations is most crucial when gathering data during the data acquisition phase?
- The physical location of the data servers.
- The individuals authorized to access the data.
- The completeness and origin of the data. (correct)
- The brand of the data collection tools.
What is data preparation primarily focused on?
- Transforming data from various sources into a unified format. (correct)
- Securing data with advanced encryption techniques.
- Implementing the latest business strategies directly into the dataset.
- Rapid data collection, disregarding quality for speed.
Addressing which of the following issues is most likely to prevent project delays and cost overruns during data preparation?
- Ignoring data quality issues to meet deadlines.
- Underestimating the time and resources needed. (correct)
- Using the latest technology regardless of its suitability.
- Limiting the project scope to a single data source.
When addressing missing values in a dataset, choosing to impute involves:
Which scenario demonstrates the application of 'customer householding' in data cleansing?
What is the primary goal of data wrangling?
In data wrangling, the term 'granularity' refers to:
Which of the following actions is considered a transformation related to data 'structuring' during data wrangling?
What does 'data profiling' primarily aim to achieve?
Which of the following is an example of 'semantic checks' in individual data profiling?
When conducting 'set-based profiling,' what is the significance of checking for duplication?
If a data analyst chooses to apply the 'Last Observation Carried Forward' (LOCF) method, what issue are they most likely addressing?
What data wrangling transformation best describes concatenating two datasets 'horizontally' to combine their attributes into a wider table?
Which of the following dataset structure types is characterized by having varied record lengths?
When preparing for predictive modeling, why is it important to examine data properties and compute descriptive statistics?
In the context of predictive modeling, what does 'model decay' refer to?
Which action is part of monitoring and maintenance in the data science lifecycle?
What is the primary purpose of 'visualization and communication' in the data science lifecycle?
Which BI and analytics maturity level is characterized by inconsistent data?
In the BI and Analytics Maturity Model, what advancement signifies a move from the 'Standards' to the 'Enterprise' level?
According to research, what is a common challenge faced by businesses regarding big data and AI initiatives?
Which factor is most likely to impact analytics maturity in an organization?
What is a critical mitigation strategy to avoid analytics projects failure related to 'lack of expertise'?
During data preparation, what is the purpose of reformatting data?
What does the 'Validity checks' technique primarily address in data cleansing?
Which data wrangling step involves selecting a smaller, more relevant portion of a dataset?
What does transforming unstructured data into a numeric or categorical format exemplify within data wrangling?
When profiling data, what does 'value range' refer to?
What is the main objective of rescaling data values into a range from 0 to 1 during data normalization?
Using samples of big data to iteratively refine data wrangling steps is called:
How would you characterize a dataset where different entities have varied structures?
Timestamps used to identify when a record was created, or the last date on which it was known to be accurate, relate to which descriptive data wrangling term?
What does calculating a sentiment score from a chat bot transcript exemplify?
Which activity is not considered part of data acquisition?
Applying business rules, algorithms, and filters, and creating associations, are all processes to perform what during data preparation?
What aspect of data quality introduces bad data through, for example, operational system defects, human errors, manual steps, or environmental instability?
What are the three courses of action for improving analytics maturity?
Flashcards
Data Acquisition
Gather, extract, and mine data from the enterprise's source systems, cloud-based applications, and external sources.
Data Preparation
Set of processes to gather data from diverse sources, transform according to rules, and stage it for useful information.
Reformat data
Convert data from multiple systems into common format & schema using a data dictionary.
Consolidate & Validate data
Transform data
Cleanse data
Store data
Data Preparation Challenge
Validity checks
Relevance checks
Duplicate removal
Consistency checks
Data profiling
Visualization
Missing values
Data Normalization
Data Wrangling
Gather
Filter
Subset
Profile data
Restructure & de-normalize
Enrich Data
Individual values profiling
Set-based profiling
Key analysis
Dependency analysis
Structuring
Granularity
Cleansing of missing values
Invalid data
De-duplication
Data standardization
Subsetting
Sampling
Enriching
Monitor model performance
Analytics Maturity Factors
Analytics Projects Failure Reasons
Study Notes
Data Acquisition
- Data is gathered and extracted from enterprise source systems, cloud-based applications, and external sources
- Data understanding requires knowledge of its origin, completeness, collection points, handling, processing and quality issues
Data Preparation
- It is defined as a set of processes to gather data from diverse sources and transform it according to business and technical rules
- Data preparation stages it for conversion into useful information
- Reformatting converts data from multiple systems into a common format, needing schema and column definitions, also known as a data dictionary
- Consolidation standardizes data definitions
- Validation tests data against predefined business rules by querying
- Transformation converts data into business information using business rules, algorithms, filters, and associations
- Cleansing analyzes data for quality, consistency, and addresses issues
- The final step involves storing the processed data for further use
Data Preparation Challenges
- Project delays and cost overruns result from underestimating time and resources
- Variations in format, differing rules, and collection rates present a challenge
- Operational defects, human errors, manual steps, and environmental instability introduce data quality issues
- Skewed information arises through biased transformation and aggregation involving defects, errors, and incorrect algorithms
- Incomplete understanding of source data leads to suboptimal data models and may cause missed relationships
- The failure to bring data from multiple sources to a consistent definition inhibits joining or comparing it
Data Cleansing Techniques
- Validity checks correct invalid formats and values like syntax errors, typos, and white space
- Relevance checks detect and remove irrelevant data that is corrupted or inaccurate
- Duplicate removal finds and resolves duplicate information, such as events recorded by two sources, the same event processed twice, or duplicate customer addresses
- Consistency checks identify contradictory or incompatible values
- Data profiling assesses data quality using summary statistics such as range, mean, distribution, unique values, and outliers
- Visualization uses statistical methods to detect unexpected or irregular values
- Missing values can be addressed using one of three methods, by discarding, imputing, or flagging
- Data normalization rescales data values into a range from 0 to 1 for normally distributed data
- Standardizing names and addresses are examples of data cleansing
- Linking the personal and business accounts of family members under a household grouping (customer householding) is another example of data cleansing
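The 0-to-1 rescaling mentioned above is min-max normalization. A minimal standard-library sketch (the function name is illustrative, not from any particular tool):

```python
def min_max_normalize(values):
    """Rescale a list of numeric values into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: every value maps to 0 to avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_normalize([10, 20, 30, 40]))
```

Note that min-max scaling is sensitive to outliers: a single extreme value compresses the rest of the column toward 0.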
Data Wrangling
- Data wrangling, a.k.a. data munging, involves aggregation, summarization, and enrichment of data with the use of business intelligence (BI) tools
- The BI tools selected influence the data wrangling process
- Data wrangling is an iterative process that can involve profiling and transformation steps
- Data is gathered from different formats, filtered down to a smaller, more relevant portion, and split into analytical subsets
- Data is examined and relationships are uncovered by profiling
- Transforming data from a source schema to the target BI tool's schema is an aspect of data wrangling
- Advanced cleansing may repeat cleaning performed earlier, during the data preparation phase
- Enriching data involves using business transformations, joining multiple datasets, and creating aggregations
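The filter-and-subset step above can be sketched with plain Python; records are modeled as dicts, and the function name and sample data are hypothetical:

```python
def subset(records, predicate, columns):
    """Filter rows with a predicate, then keep only the relevant columns."""
    return [{c: r[c] for c in columns} for r in records if predicate(r)]

sales = [
    {"region": "EU", "amount": 120, "rep": "a"},
    {"region": "US", "amount": 80,  "rep": "b"},
]
# Keep only EU rows, and only the region and amount fields
print(subset(sales, lambda r: r["region"] == "EU", ["region", "amount"]))
```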
Data Profiling Questions
- Questions for understanding data include:
- What's in the data
- What is its quality
- Is it complete and unique
- Are there problem records or anomalies
- What is the data distribution
- What are the ranges of values
Types of Data Profiling
- Individual values profiling involve checking the validity of individual record fields like data formats, value ranges, and semantic checks for contextual relevance
- Set-based profiling involves understanding value distributions and checking validity across multiple records, including analysis of numeric, categorical, geospatial, and date-time fields
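A set-based profile of a single column (range, mean, uniqueness, duplicates) can be sketched with the standard library; the exact statistics chosen here are illustrative:

```python
import statistics
from collections import Counter

def profile_column(values):
    """Summarize one column: value range, mean, unique count, duplicated values."""
    counts = Counter(values)
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "unique": len(counts),
        "duplicates": [v for v, c in counts.items() if c > 1],
    }

print(profile_column([5, 7, 7, 9, 12]))
```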
Data Wrangling Transformations
- Structuring changes a dataset's structure through intra-record actions (from changing the order of fields within a record to combining fields into complex structures) and inter-record actions
- Granularity transformations involve actions like aggregations and pivots, which shift records into fields
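A granularity-coarsening aggregation can be sketched as follows; the record shape and field names are invented for illustration:

```python
from collections import defaultdict

def aggregate(records, group_key, value_key):
    """Coarsen granularity: roll individual records up to one total per group."""
    totals = defaultdict(float)
    for r in records:
        totals[r[group_key]] += r[value_key]
    return dict(totals)

orders = [{"cust": "a", "amt": 10}, {"cust": "a", "amt": 5}, {"cust": "b", "amt": 7}]
# Order-level records become one row per customer
print(aggregate(orders, "cust", "amt"))
```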
Data Wrangling Methods
- Cleansing missing values fixes irregularities in the dataset by handling missing (or NULL) values through one of the following methods:
- Discarding records, imputing with a statistical value such as the mean or median, or keeping the record and adding a flag
- Cleansing invalid or inconsistent data involves correcting inconsistent data by overwriting the original values or marking values as invalid
- Duplicate record removal and record reconciliation are aspects of data wrangling
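The three missing-value strategies, plus the Last Observation Carried Forward (LOCF) method from the quiz, can be sketched in plain Python (missing values modeled as `None`; function names are illustrative):

```python
def discard(values):
    """Strategy 1: drop records with missing values."""
    return [v for v in values if v is not None]

def impute_mean(values):
    """Strategy 2: fill missing values with the mean of the observed ones."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

def locf(values):
    """Last Observation Carried Forward: fill each gap with the prior value."""
    out, last = [], None
    for v in values:
        if v is None:
            v = last  # a leading None stays None (no prior observation)
        out.append(v)
        if v is not None:
            last = v
    return out

print(locf([1, None, None, 4, None]))
```

Strategy 3, flagging, would simply keep the record and add a boolean "was missing" column alongside it.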
Dataset Structure Types
- Rectangular datasets are arranged in a basic database matrix.
- Jagged datasets vary in record length
- Heterogeneous datasets feature differing structures in a single dataset
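The difference between jagged and rectangular datasets can be illustrated by padding varied-length records to a fixed width; this is a toy sketch, not a standard API:

```python
def pad_to_rectangular(records, fill=None):
    """Pad jagged records (varied lengths) out to the width of the longest."""
    width = max(len(r) for r in records)
    return [list(r) + [fill] * (width - len(r)) for r in records]

jagged = [[1, 2], [3, 4, 5], [6]]
print(pad_to_rectangular(jagged))
```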
Transformations and Data Enrichment
- Data is enriched with multiple datasets by joins, unions, and metadata enrichment from data-related information
- Computing new values derives new data from existing data and reduces the number of categories through categorization
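The horizontal join that widens records by combining attributes from two datasets can be sketched with dicts keyed on a shared field; the datasets and key are invented for illustration:

```python
def horizontal_join(left, right, key):
    """Left-join two datasets (lists of dicts) on a shared key, widening each record."""
    index = {r[key]: r for r in right}
    # Records with no match on the right keep only their left-side fields
    return [{**l, **index.get(l[key], {})} for l in left]

customers = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Lin"}]
purchases = [{"id": 1, "total": 40}]
print(horizontal_join(customers, purchases, "id"))
```

A union, by contrast, would stack the two datasets vertically (`left + right`), adding records rather than attributes.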
Metadata Elements
- Data structure refers to the records and field encodings
- Granularity refers to the depth or number of entities represented by a data record.
- Accuracy refers to the quality and consistency of data
- Temporality refers to the time sensitivity of the dataset, which impacts dataset quality
- Scope refers to the attributes represented in a dataset
Data Temporality Questions
- Questions to ask about data temporality include the following:
- When was the data collected?
- When were records collected or measured?
- Are there timestamps associated with the data?
- Were records modified?
- At what point did the data turn stale?
Data Scope Questions
- Questions to ask about data scope include the following:
- What features of records are captured, and are they accessible by name or position?
- Does the dataset have missing records?
- Are multiple records the same, requiring deduplication?
- Are heterogeneous records included?
Data Publishing
- Data publishing involves storing refined datasets into the target analytics platform including the logic and scripts needed to generate the datasets
Predictive Modeling Process
- The predictive model process involves the following steps:
- Exploring data for anomalies, and patterns
- Formulating a hypothesis
- Training machine learning models
- Evaluating the model performance
- Deploying the best performing model to production
Model Decay
- Model decay occurs when a predictive model's effectiveness gradually declines
- Decay occurs because relationships between behaviours change, new data becomes available, data becomes unavailable, or organizational objectives change
Monitoring & Maintenance
- Monitoring model performance involves alerts for model decay and alerts for when effectiveness declines
- This step requires documentation maintenance and updates, as well as ongoing tuning and optimization
Analytics Maturity
- Business intelligence (BI) evolves through five levels:
- Unaware: where spreadsheets dominate
- Opportunistic: data is inconsistent
- Standards: business executives become BI champions
- Enterprise: technology and practices advance to the point that analytics spans the enterprise
- Transformative: analytics drives the company forward and transforms the industry
Analytics Success Data
- In 2020, 80% of AI projects did not scale
- Only 20% of analytic insights will deliver business outcomes through 2022
- 87% of data science projects are never deployed to production
Analytics Maturity Factors
- Maturity in analytics relies on key factors: people, culture, organization, process, data quality, and technology
- Executive support, data governance drives maturity
Failure Reasons
- A lack of business value and over-reliance on software lead to project failure
- Expertise must be adequate, data must be properly organized and analyzed, and analytics must be properly incorporated