DAB 202 IT Service Management Week 6


Questions and Answers

Which of the following is considered a 'related job role' in the field of data engineering?

  • Data Scientist (correct)
  • Network Security Engineer
  • Database Administrator specializing in backups
  • Frontend Web Developer

Organizations are projected to increase their investment in data and analytics services by what percentage through 2026?

  • 45 percent (correct)
  • 25 percent
  • 15 percent
  • 35 percent

Which of the following best describes the use of data analytics?

  • Applying mathematical models to handle unstructured data in real-time.
  • Systematic analysis of large datasets (big data) to find patterns and trends to produce actionable insights. (correct)
  • Analyzing large and complex datasets to uncover intricate relationships between variables.
  • Making predictions from data at a scale impossible for humans.

What is the primary function of AI/ML in contrast to data analytics?

  • To make predictions from data at a scale that is difficult or impossible for humans. (correct)

A retail store analyzes historical total revenue per customer and categorizes customers based on their spending habits. Which type of analytics is the store primarily using?

  • Descriptive analytics to segment customers. (correct)

A healthcare provider uses machine learning to forecast which patients are most likely to be readmitted within 30 days. Which category of analytics does this fall under?

  • Predictive (correct)

A manufacturing plant analyzes sensor data to determine the root causes of recent equipment failures. Which category of analytics are they employing?

  • Diagnostic (correct)

A city uses traffic pattern analysis to recommend adjustments to signal timings aiming to reduce congestion. Which type of data analytics are they applying?

  • Prescriptive analytics (correct)

What is the crucial initial step in designing a data pipeline for data-driven decisions?

  • Identifying the business problem to be solved. (correct)

Which of the following best describes data wrangling in the context of a data pipeline?

  • The tasks of discovering, cleaning, normalizing, transforming, and augmenting data as it passes through the pipeline. (correct)

In a data-driven organization, what is the primary focus of data engineering?

  • Building and maintaining the data infrastructure. (correct)

Which of the following practices aligns with the 'Unify' aspect of modern data strategies?

  • Breaking down data silos to create a single source of truth. (correct)

What is the primary goal of 'Modernizing' data infrastructure in data-driven organizations?

  • To increase agility and reduce undifferentiated heavy lifting. (correct)

Which action exemplifies the 'Innovate' pillar of a modern data strategy?

  • Applying AI/ML to uncover new insights in unstructured data. (correct)

Which of the following is NOT one of the 'Five Vs' of data?

  • Volatility (correct)

Which 'V' of data is most closely associated with the trustworthiness and accuracy of data?

  • Veracity (correct)

How do volume and velocity most directly impact data pipeline design?

  • They drive decisions about the data's required processing power and storage capacity. (correct)

Which scenario exemplifies streaming ingestion?

  • Clickstream data analyzed continuously from a retailer's website (correct)

Which data type is characterized by elements and attributes, has a self-describing structure, and is exemplified by formats like JSON and XML?

  • Semistructured (correct)

Why is it important to consider data variety when designing a data pipeline?

  • Different data types and sources may require different processing and transformation techniques. (correct)

Which of the following is generally true about unstructured data compared to structured data?

  • It is harder to query but more flexible. (correct)

In data analytics, what is the significance of 'data lineage'?

  • It traces the transformations applied to the data and its origins. (correct)

A data engineer discovers inconsistencies in how customer addresses are formatted across two merged datasets. Which action best addresses this issue?

  • Normalizing the address format to a consistent standard. (correct)

What is the advantage of storing timestamped details instead of aggregated values in a data analytics system?

  • Allows for detailed analysis to find and debug errors. (correct)

Which of the following strategies would most effectively enhance the 'Veracity' of data in a data pipeline?

  • Securing all layers of the pipeline and preventing unwanted data changes. (correct)

A sensor network is established at a wind farm which sends constant streams of data about wind speed, direction, and temperature. Which category of data source best describes this situation?

  • Events, IoT devices, and sensors (correct)

A data scientist is tasked with building a predictive model for customer churn. During the data exploration phase, they notice a significant number of missing values in a key demographic field. Which action is the MOST appropriate first step to address this issue?

  • Investigate the reasons for the missing data and assess potential biases. (correct)

A company has traditionally relied on nightly batch processing of sales data to generate reports. However, business stakeholders are now demanding real-time insights into sales performance. Which data architecture change would best enable this capability?

  • Implement a streaming data ingestion and processing pipeline alongside the batch pipeline. (correct)

A financial institution wants to detect fraudulent transactions as quickly as possible. Which type of data analysis approach is MOST suitable for this purpose?

  • Streaming analytics (correct)

Which of the following is a critical consideration when choosing a data storage solution for a high-volume, high-velocity data stream, such as sensor data from industrial equipment?

  • Scalability to handle increasing data volumes and ingestion rates. (correct)

A large e-commerce company captures user browsing behavior, product views, purchases, and free-text comments to personalize recommendations. Which data type best categorizes the user comments?

  • Unstructured (correct)

A health tech company is developing a personalized medicine app that combines patient medical history, genomic data, and real-time data from wearable sensors. What challenge is MOST likely to arise due to the variety of data sources?

  • Integrating different data types and formats effectively. (correct)

A retail company implements a new data governance program. Which activity would directly support maintaining data integrity and consistency?

  • Securing all layers of the data pipeline. (correct)

After a systems upgrade, a data analyst notices that customer birthdates are being incorrectly recorded in the database, resulting in many customers appearing to be born on January 1, 1900. Which step would be the MOST effective in addressing this data quality issue?

  • Identify the root cause of the data entry issue and implement measures to prevent recurrence. (correct)

A data team is designing a new data warehouse. Which approach is typically recommended to best support traceability and debugging in data analytics?

  • Save all the raw, unmodified data. (correct)

A company captures data about user interactions with their web application. Over time, they transition from capturing simple page view events to more complex events including mouse movements and form inputs. Which 'V' of data does this change primarily reflect?

  • Variety (correct)

Flashcards

Data analytics

A systematic analysis of large datasets to find patterns and trends, producing actionable insights.

AI/ML

Mathematical models used to make predictions from data at a scale that is difficult for humans.

Descriptive Analytics

Analyzing past performance, history, and trends.

Diagnostic Analytics

Describing the reasons behind trends in data.

Predictive Analytics

Forecasting future trends based on data.

Prescriptive Analytics

Recommending decisions or courses of action based on data analysis.

Data Pipeline

The infrastructure for data-driven decision-making.

Data Wrangling

How data is acted upon as it passes through the pipeline.

Data Volume

The amount of data that needs to be processed.

Data Velocity

How quickly data enters and moves through the pipeline.

Data Variety

The range of data types and sources, including structured, semistructured, and unstructured data.

Data Veracity

The extent to which data can be trusted for analysis.

Data Value

The value or insight that can be pulled from data.

Data Consistency

How closely different metrics, derived from the same dataset, agree.

Structured Data

Data organized in a predefined format, typically rows and columns.

Semi-structured Data

Data with some organizational properties, making it easier to analyze than unstructured data.

Unstructured Data

Data that has no predefined format.

On-Premises Databases

Application data that is stored and managed directly within the organization.

Public Datasets

Data aggregated about a topic, such as census data, health data and population data.

Time-series Data

Data generated continually by events and includes a time-based element.

Data Veracity

Ensuring the trustworthiness of source data to prevent bad data from entering the pipeline.

Data Integrity

The assurance that data remains accurate, complete, and consistent throughout the pipeline.

Discover Data Errors

The action of discovering data issues that decrease veracity.

Immutable data

Data that is not changed after it is written; transformations produce new versions so the source data is preserved.

Study Notes

  • DAB 202 is IT Service Management for Week 6

Job Roles

  • Job roles related to the data engineer include data analyst, data scientist, extract, transform, and load (ETL) developer, and machine learning (ML) practitioner

Data-Driven Decisions

  • Through 2026, organizations plan to increase investments in data and analytics services by 45% to become more data-driven and digital (Gartner, March 2022)

Making Decisions

  • Individuals use data to decide on restaurants, where to buy items, fantasy football picks, job choices, and home prices
  • Organizations apply data to identify fraudulent transactions, measure the impact of webpage designs, find at-risk patients, detect security issues, and determine harvest times
  • Large greenhouse tomato growers, for example, can make data-driven decisions

Data Analytics vs AI/ML

  • Data analytics systematically analyzes large datasets (big data) to identify patterns and trends that produce actionable insights
  • Data analytics uses programming logic to answer questions from data and works well with structured data and a limited number of variables
  • AI/ML uses mathematical models to make predictions from data at a scale that is difficult or impossible for humans
  • AI/ML learns from examples in large amounts of data to answer questions
  • AI/ML works well when data is complex and unstructured

Customer Relationship Management Example

  • A retail business uses data analytics to analyze total revenue per customer and segment customers by spending
  • Segmentation allows the business to provide a higher level of customer service to high-spending customers
  • The same business can use AI/ML to analyze customer churn, that is, how often customers come and go
  • AI/ML helps discover what influences churn, enabling changes that better retain customers (a descriptive-segmentation sketch follows this list)
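
To make the descriptive-analytics half of this example concrete, here is a minimal sketch in pandas, assuming a hypothetical order history with customer_id and order_total columns; the column names and spend thresholds are illustrative, not part of the course material.

```python
import pandas as pd

# Hypothetical order history; column names and values are assumptions for illustration.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "order_total": [120.0, 80.0, 15.0, 300.0, 250.0, 400.0],
})

# Descriptive analytics: summarize past behavior (total revenue per customer).
revenue = (
    orders.groupby("customer_id")["order_total"]
    .sum()
    .reset_index(name="total_revenue")
)

# Segment customers by spending so high spenders can receive a higher service level.
revenue["segment"] = pd.cut(
    revenue["total_revenue"],
    bins=[0, 100, 500, float("inf")],
    labels=["low", "medium", "high"],
)
print(revenue)
```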

Categories of Analytics

  • Descriptive analytics describes past performance, history and trends to answer "What happened?"
  • Diagnostic analytics describes the reasons behind trends and helps answer "Why did it happen?"
  • Predictive analytics forecasts future trends to answer "What will happen?"
  • Prescriptive analytics recommends a course of action and helps answer "How can we make it happen?"

Analytics Categories and Questions

  • "Which customers churned?" is Descriptive.
  • "Why did the customer churn?" is Diagnostic.
  • "Which customers will churn?" is Predictive.
  • "What can I do to change the outcome of customer churn?" is Prescriptive.

Importance of data value and insights

  • More valuable insights are more difficult to derive
  • Descriptive analytics produces lower-value insights that are easy to derive
  • Diagnostic analytics produces medium-value insights that are more difficult to derive
  • Predictive analytics produces higher-value insights that are even more difficult to derive
  • Prescriptive analytics provides the highest value and is the hardest to derive
  • More data and fewer barriers equate to more data-driven decisions

Data Science in Daily Life

  • Everyday experiences include seeing shoe ads online after shopping, streaming recommendations and pizza delivery tracking
  • Further examples are fraud alerts when using a credit card and navigation apps alerting of traffic jams
  • More data doesn't always equate to more value
  • More collected data also brings higher data costs, more unstructured data, greater security risks and slower query processing
  • Data becomes less valuable for decision-making over time

Data Value

  • The scale from most to least valuable is preventive or predictive, actionable, reactive, and then historical
  • Data that is near real time is more valuable than data from days or months ago
  • Cost, speed, and accuracy are the trade-offs taken when dealing with data driven decisions
  • Organizations use data to make informed decisions

Data Analytics vs AI/ML usage

  • Data analytics relies on programming logic and tools to derive answers from data and works better with a limited number of variables
  • AI/ML can learn to make predictions from examples in data and works best when data is unstructured and the variables are complex

The Data Engineering Pipeline

  • Data pipelines provide infrastructure for data-driven decision-making
  • The simplest data pipeline has stages for data collection, storage and processing, and then building something practical with data
  • Design the infrastructure by first identifying the decision to be made and then determining what data is essential to support that decision
  • The infrastructure should be designed weighing cost, speed and accuracy
  • Pipeline infrastructure contains layers for data sources, ingestion, storage, processing, visualization and then, finally, predictions and decisions
  • Data wrangling is how data is transformed as it moves through the pipeline and includes steps such as discovery, cleaning, normalization, enrichment, and transformation
  • Data is often processed iteratively to evaluate and improve the results

Key Pipeline actions

  • In the pipeline, data is ingested, stored, processed, analyzed, and visualized
  • When designing a pipeline, begin by determining the problem that needs solving and then decide what data is required
  • Data wrangling acts on data passing through the pipeline through steps such as discovery, cleaning, normalization, transformation, and augmentation (a minimal wrangling sketch follows this list)
  • Data is iteratively processed to evaluate and refine results
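
A minimal sketch of these wrangling steps, assuming a hypothetical customer extract with inconsistent formatting; the column names and cleanup rules are illustrative only, not a prescribed process.

```python
import pandas as pd

# Hypothetical raw extract; column names and values are assumptions for illustration.
raw = pd.DataFrame({
    "email": ["A@Example.com", "b@example.com", None, "b@example.com"],
    "country": ["usa", "US", "Canada", "US"],
    "signup_date": ["2024-01-03", "2024/01/05", "2024-02-10", "2024/01/05"],
})

wrangled = (
    raw
    .dropna(subset=["email"])      # clean: drop records missing a key field
    .drop_duplicates()             # clean: remove duplicate records
    .assign(
        # normalize: consistent casing and country codes
        email=lambda df: df["email"].str.lower(),
        country=lambda df: df["country"].replace({"usa": "US"}),
        # transform: parse dates after normalizing the separator
        signup_date=lambda df: pd.to_datetime(df["signup_date"].str.replace("/", "-")),
    )
    # enrich: add a derived attribute for downstream analysis
    .assign(signup_month=lambda df: df["signup_date"].dt.to_period("M"))
)
print(wrangled)
```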

Roles of the Data Scientist and Data Engineer

  • Data scientists and data engineers work with the data pipeline and certain tasks can be fulfilled by either
  • Data engineering primarily concerns the infrastructure while a data scientist works with the data inside the pipeline
  • Building the best pipeline requires asking questions about the required outcomes and the data, and then iterating

Data Strategies

  • A three-pronged strategy to build data infrastructure involves modernizing, unifying, and innovating
  • Modernizing means moving to cloud-based, purpose-built services to reduce administrative and operational overhead
  • Unifying means creating a single source of truth for data, making it available across the organization
  • Innovating means seeking new value from data, specifically by applying AI and ML

The Five Vs of Data

  • The module lists the five Vs of data: volume, velocity, variety, veracity, and value
  • The module will describe the impact of volume and velocity on a data pipeline
  • The module will compare and contrast structured, semi-structured and unstructured data types
  • The module will identify data sources that are commonly used to feed data pipelines and also questions to assess data veracity
  • The module will suggest methods to improve data veracity in a pipeline

Data Pipeline Q&A

  • Common data pipeline questions include asking if the organization has the needed data, where it is stored and its format
  • Additional common questions are whether the data requires combining from multiple sources and what its quality and security requirements are
  • Questions to ask include what mechanisms are in place to move the data into a pipeline, how much data there is, how frequently it is updated, and how important speed is
  • Further questions to ask are what the data can determine, how to evaluate results, visualization specifics, formats, tools, and whether a big data solution is required
  • Final questions are which AI/ML models are best suited and what the simplest way is to implement AI/ML

Data Characteristics

  • Data characteristics drive infrastructure decisions in respect to volume, velocity, variety, veracity and value
  • Volume relates to dataset size and how much new data is being generated
  • Velocity considers new data being generated, how often and when
  • Variety evaluates the types, formats and amount of data sources
  • Veracity looks at data accuracy, precision and trustworthiness
  • Value reviews what insights can be assembled from the data
  • Strategies to get the best value from data include confirming that available data meets the need and evaluating the feasibility of acquiring additional data
  • Additional strategies include matching pipeline design to the data, cataloging data, focusing on the business need, and implementing governance policies

Volume and Velocity

  • Scaling a pipeline considers data volume and velocity
  • Data volume and pace drive design choices and all pipeline layers are impacted
  • Each pipeline layer must be evaluated for its individual requirements
  • Balance the costs of throughput and storage against answer time and accuracy

Ingestion, Volume and Velocity

  • Analyze the best ingestion method, the amount of data to be ingested and the frequency of new data to be ingested and processed
  • An example of streaming ingestion is clickstream data from a retailer's website, which arrives as a continuous flow of small records and requires immediate analysis
  • An example of batch ingestion is sales transaction data from retail locations that is sent to a central location periodically
  • Batch data is typically analyzed overnight, with reports sent the following morning (a sketch contrasting the two styles follows this list)
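
A minimal sketch contrasting the two ingestion styles, assuming an AWS environment with a hypothetical Kinesis stream named clickstream and a hypothetical S3 bucket named sales-landing; the names, fields, and paths are illustrative.

```python
import json
import boto3

kinesis = boto3.client("kinesis")
s3 = boto3.client("s3")

# Streaming ingestion: each click event is sent to the stream as it happens.
def send_click_event(event: dict) -> None:
    kinesis.put_record(
        StreamName="clickstream",               # hypothetical stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event["session_id"]),
    )

# Batch ingestion: a day's worth of sales transactions is uploaded periodically.
def upload_daily_sales(local_path: str, business_date: str) -> None:
    s3.upload_file(
        Filename=local_path,
        Bucket="sales-landing",                 # hypothetical bucket name
        Key=f"sales/date={business_date}/transactions.csv",
    )
```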

Storage and Volume

  • Storage must scale with the volume of data being ingested and keep data accessible for processing and analysis when required
  • An example of long-term, reporting-style access is storing five years of sales data for trend analysis on a monthly cadence
  • An example of short-term, very fast access is incoming ecommerce transaction data used to suggest additional purchases within the current session

Processing, Volume and Velocity

  • Assess the volume of data that must be processed in a single iteration and whether a distributed solution is needed
  • Also address how quickly and how frequently the processing needs to occur
  • An example of big data processing is analyzing all US-based credit card transactions from the past week
  • Streaming analytics can produce real-time alerts on log data to identify potential fraud as it occurs (a simple sliding-window sketch follows this list)
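
As a toy illustration of the streaming-analytics idea (not the course's fraud method), the sketch below flags a card when more than a threshold number of transactions arrive within a short sliding window; the window size, threshold, and event fields are assumptions.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60          # illustrative sliding window
MAX_TXNS_PER_WINDOW = 5      # illustrative threshold

recent = defaultdict(deque)  # card_id -> timestamps of recent transactions

def process_transaction(card_id: str, timestamp: float) -> bool:
    """Return True if this transaction looks suspicious (too many in the window)."""
    window = recent[card_id]
    window.append(timestamp)
    # Drop events that fell out of the sliding window.
    while window and timestamp - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) > MAX_TXNS_PER_WINDOW

# Example: the sixth transaction within one minute triggers an alert.
for i in range(7):
    if process_transaction("card-123", timestamp=float(i)):
        print(f"ALERT: suspicious activity on card-123 at t={i}")
```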

Analysis, Visualization, Volume and Velocity

  • Historical analysis involves visualizing a year’s worth of sales data, enabling users to drill down by region and salesperson
  • A streaming example is Internet of Things (IoT) sensor data from a plant, analyzed and reported on at the rate it arrives

Data Volume

  • Volume relates to the amount of data being processed
  • Velocity is how quickly data enters and moves through a pipeline
  • Volume and velocity decide the expected throughput and scaling needs of a pipeline
  • Every pipeline layer requires evaluation for volume and velocity
  • Balance costs and throughput needs

Data Variety and Pipelines

  • Pipeline design is influenced by the type and source of data
  • Each data type lends itself to certain types of processing and analysis
  • Each data source requires different amounts of discovery and transformation work
  • The data source type determines the type and scope of the ingestion layer

Types of Data

  • Data types include structured, semistructured and unstructured
  • Structured data is easy to use, involving rows and columns and a well-defined schema, like relational databases
  • Semistructured data has a self-describing structure of elements and attributes, as in CSV, JSON, and XML
  • Unstructured data consists of files with no predefined structure, such as images, videos, and clickstream data
  • Unstructured data is harder to query but more flexible, whereas structured data is easier to query but less flexible
  • 80% or more of available data is unstructured

How data types are used

  • A structured example is querying a relational database to report on customer service within a specified time period
  • A semistructured example is extracting and analyzing customer comments from an online chat application that saves conversations as JSON (a JSON-parsing sketch follows this list)
  • An unstructured example is analyzing the sentiment of customer service emails
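
A minimal sketch of the semistructured case, assuming a hypothetical chat export whose session and messages fields are made up for illustration; only the Python standard library is used.

```python
import json

# Hypothetical JSON export from a chat application; field names are assumptions.
raw = """
{
  "session": "abc-123",
  "messages": [
    {"sender": "customer", "text": "My order arrived damaged."},
    {"sender": "agent", "text": "Sorry to hear that, let me help."},
    {"sender": "customer", "text": "Thanks, that was quick!"}
  ]
}
"""

conversation = json.loads(raw)

# The self-describing structure lets us pull out just the customer comments.
customer_comments = [
    m["text"] for m in conversation["messages"] if m["sender"] == "customer"
]
print(customer_comments)
```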

Data Source Types

  • Data includes structured, semi-structured and unstructured
  • Structured is the easiest to query but least flexible
  • Unstructured is hardest to query but more flexible
  • Most of the growth in available data is unstructured
  • Data source types include organizational stores, public datasets, and time-series data
  • Combining datasets enriches analysis but can complicate processing

Common Data Sources

  • On-premises databases or file stores hold application data that is owned and managed by the organization
  • Public datasets are aggregated data about a topic, such as census, health, or population data
  • Events, IoT devices, and sensors generate data continuously, including a time-based component

Pipeline Considerations

  • On-premises databases are controlled by the organization and may contain private, structured information
  • Public datasets might not be in the required format, may need transformation or merging, and are often semistructured
  • Events, IoT devices, and sensors require streaming ingestion and time-series storage, requiring real-time processing

Data Veracity Challenges

  • An application in healthcare runs analysis on customer data to determine which patients have not received proper care
  • An application in healthcare combines public health data with customer data for more accurate and improved mobile alerts
  • A mobile application provides real-time heart rate monitoring and alerts given risk patterns
  • Inconsistent data formatting can impact the ability to analyze the data
  • Data can become difficult to maintain given multiple data types and merges

Veracity and Value

  • Data veracity is easier to maintain when data integrity is enforced and there are fewer data challenges
  • The data's type, sources, and governance influence the data issues that arise
  • Trustworthy data enables insights that are more relevant to the business
  • Volume means having the right amount of data of the type needed
  • The nature of the data helps you derive more business insights with the appropriate speed
  • Data source types range from organizational stores to public datasets and time-series data
  • Combining datasets enriches analysis but adds complexity to processing

Data and Pipeline Security

  • Bad data is worse than no data; veracity is the extent to which data can be trusted
  • To maintain trustworthy data, clean the data and implement controls
  • Prevent unwanted changes to stored data and ensure consistency

Data Integrity

  • The ingestion process must track data state and validity
  • Cleaning processes must preserve data integrity
  • Processing must preserve this validity as the data is analyzed
  • Data issues include duplicates, outdated data, and missing records, among other concerns (a minimal validation sketch follows this list)
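
A minimal sketch of such integrity checks, assuming a hypothetical customer table with customer_id, birthdate, and updated_at columns; the checks, placeholder date, and staleness cutoff are illustrative.

```python
import pandas as pd

# Hypothetical customer extract; column names and values are assumptions for illustration.
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "birthdate": ["1985-06-01", "1900-01-01", "1900-01-01", None],
    "updated_at": ["2024-05-01", "2019-01-15", "2019-01-15", "2024-04-20"],
})
customers["birthdate"] = pd.to_datetime(customers["birthdate"])
customers["updated_at"] = pd.to_datetime(customers["updated_at"])

issues = {
    # Duplicate records (same customer_id appears more than once).
    "duplicates": int(customers.duplicated(subset="customer_id").sum()),
    # Missing values in a key field.
    "missing_birthdate": int(customers["birthdate"].isna().sum()),
    # Placeholder birthdates that suggest a data-entry defect (see the quiz question above).
    "placeholder_birthdate": int((customers["birthdate"] == pd.Timestamp("1900-01-01")).sum()),
    # Stale records not updated recently (illustrative cutoff).
    "stale_records": int((customers["updated_at"] < pd.Timestamp("2023-01-01")).sum()),
}
print(issues)
```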

Maintaining Consistency

  • Keep new sources in mind to prevent data integrity issues
  • Ask how new information will be managed in relation to the processes already in place
  • Establish ways to discover, fix, and prevent data issues
  • Secure every layer of the pipeline
  • Assign permissions based on least privilege
  • Design processes with data usage and integrity in mind when implementing new steps
  • Keep audit trails (records of past actions) to support data integrity

Data Management Tips

  • Ask questions about data trustworthiness and data lineage
  • Work from known, trusted facts and avoid designs that produce inconsistent details about the same facts
  • Be conscious of when conversions add or change details in the data
  • Store timestamped details to keep track of values over time
  • Follow best practices for maintaining data integrity and keeping data safe
  • Data management plans are a must (a lineage and timestamping sketch follows this list)
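
A minimal sketch of the timestamping and lineage idea, assuming a small ingestion helper that tags each record with its source and ingestion time; the field names and source label are illustrative.

```python
from datetime import datetime, timezone

def tag_with_lineage(record: dict, source: str) -> dict:
    """Attach illustrative lineage metadata: where the record came from and when it was ingested."""
    return {
        **record,
        "_source": source,                                       # data lineage: origin of the record
        "_ingested_at": datetime.now(timezone.utc).isoformat(),  # timestamp for traceability
    }

# Example usage: raw values are preserved; metadata supports later debugging and audits.
raw_event = {"customer_id": 42, "order_total": 120.0}
print(tag_with_lineage(raw_event, source="on_prem_sales_db"))
```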
