Understanding Data Engineering Concepts

Questions and Answers

What is the primary role of a data engineer in an organization?

  • Developing and maintaining the organization's data infrastructure. (correct)
  • Designing user interfaces for data visualization tools.
  • Analyzing data to derive insights and make decisions.
  • Creating machine learning models for predictive analysis.

Data engineers are primarily responsible for analyzing data and generating reports for business stakeholders.

False (B)

In the context of data, what distinguishes 'unorganized information' from 'meaningful' data?

Processing

The value derived from data is heavily dependent on its accuracy and its _______ when it is needed.

accessibility

Match the stages of the Data Engineering Lifecycle with their descriptions:

  • Generation = The stage where data is initially produced or created.
  • Ingestion = The process of bringing data into the data system.
  • Transformation = The stage where data is cleaned and converted into a usable format.
  • Serving = The process of making data available for consumption by analysts and other stakeholders.

Which of the following responsibilities is typically performed by a data engineer?

Designing and managing data pipelines. (A)

Data must always be perfectly accurate to be considered analytics-ready.

False (B)

Name two categories into which the skills of a data engineer can be divided.

Any two of: technical, functional, and soft skills.

A data engineer acts as a _______ between data producers and data consumers.

hub

Match the following roles with their primary focus:

  • Data Architect = Designing the blueprint for organizational data management.
  • Software Engineer = Building the software and systems that run a business and generate internal data.
  • Data Scientist = Using data to make predictions and recommendations.
  • Data Analyst = Using data scientists' insights to drive business decisions.

According to the 'Data Science Hierarchy of Needs', what is the most basic step a company needs to take?

Collecting data. (A)

Most data scientists spend the majority of their time on complex data analysis and building machine learning models.

False (B)

Name two activities involved in the 'explore/transform' level of the data science hierarchy of needs.

Anomaly detection and data cleaning.

At the pinnacle of the Data Science Hierarchy of Needs lies _______ and deep learning.

artificial intelligence

Match the type of data with the percentage it represents of all enterprise data:

  • Structured Data = 5% to 10%
  • Semi-Structured Data = 10% to 20%
  • Unstructured Data = Over 80%

Which of the following is a characteristic of structured data?

It can be stored in well-defined schemas. (C)

Unstructured data can be easily organized and stored in a relational database.

False (B)

Name one type of source of structured data.

SQL database or spreadsheet

Data organized in rows and columns is referred to as _______ data.

structured

Match the following tools with their primary use in working with structured data:

  • PostgreSQL = Object-relational database management system.
  • MySQL = Widely used relational database management system.
  • Oracle Database = Advanced database management system with a multi-model structure.
  • Microsoft SQL Server = Relational database management system developed by Microsoft.

Which of the following is a key characteristic of unstructured data?

It does not follow any particular format or sequence. (C)

Unstructured data requires less expertise to analyze compared to structured data.

False (B)

Name two sources of unstructured data.

Web pages and social media feeds

The adaptability of unstructured data increases the file formats in the database, which widens the _______ pool.

data

What is a defining characteristic of semi-structured data?

It contains tags and elements for organization but lacks a fixed schema. (C)

Semi-structured data can be easily stored in relational databases without any modifications.

False (B)

Name two different sources of semi-structured data.

Emails and XML files

Metadata for semi-structured data includes _______ and other markers just like in JSON, XML, or CSV.

tags

Match the following examples to their corresponding data types:

  • Customer relationship management (CRM) = Structured data
  • User profiles from social media = Semi-structured data
  • Analyzing voice recordings = Unstructured data

Which term refers to an approach where the structure of the data is applied when it is read, rather than when it is written?

Schema-on-Read (C)

A relational database is a database that does not use the tabular schema of rows and columns.

False (B)

What does 'ETL' stand for in the context of data engineering?

Extract, Transform, Load

A _______ is a centralized repository that allows the user to store all structured and unstructured data at any scale.

data lake

Match the following phases of ETL to their description:

  • Extract = Getting data from various sources.
  • Transform = Manipulating data to fit a project's needs, such as filtering, cleansing, de-duplicating, validating, and authenticating data.
  • Load = Storing data in the target location.

According to the content, which step is literally just the movement and storage of data?

Data Loading (A)

In the ETL process, the 'Transform' step always precedes the 'Load' step.

False (B)

Name the two types of data storage mentioned in the content.

Data Warehouse and Data Lake

While a data warehouse holds relational data from transactional systems, a data lake holds both _______ and relational data.

non-relational

Data Warehouses vs Data Lake

  • Data Warehouse = Schema designed prior to the warehouse implementation (schema-on-write).
  • Data Lake = Schema written at the time of analysis (schema-on-read).

Flashcards

What is Data?

Unorganized information that is processed to make it meaningful. It comprises facts, observations, and perceptions that can be interpreted.

What is Data Engineering?

Data engineering involves creating interfaces and mechanisms to manage the flow and access of information.

Data Engineering Lifecycle

Stages that turn raw data ingredients into a useful end product, ready for consumption by analysts, data scientists, ML engineers, and others

Data Engineer

Converts raw data into usable data and provides analytics-ready data to data consumers. Ensures data is accurate, reliable, and accessible.

Knowledge of operating systems

Operating systems such as UNIX, Linux, and Windows, including commonly used administrative tools, system utilities and commands.

Knowledge of infrastructure components

Virtual machines, networking, and application services, such as load balancing and application performance monitoring

Cloud-based services

Cloud services offered by providers such as Amazon, Google, IBM, and Microsoft.

Experience of working with databases

Relational database management systems (RDBMS) and NoSQL databases such as Redis, MongoDB, Cassandra, and Neo4j.

Data Lakes

Azure Data Lake Storage, AWS Lake Formation, and Google BigLake.

Data Pipelines

Apache Beam, Airflow, and Dataflow.

ETL Tools

IBM InfoSphere Information Server, AWS Glue, and Improvado.

Languages for querying and manipulating data

Query languages for accessing and manipulating data in a database, such as SQL for relational databases and SQL-like query languages for NoSQL databases.

Programming languages

Python, R, and Java.

Shell and Scripting languages

Unix/Linux Shell and PowerShell.

Structured data

Data that has a predefined structure or adheres to a specified data model.

Sources of Structured data

SQL Databases.

Online booking

Hotel and ticket reservation data (e.g., dates, prices, destinations, etc.) fits the 'rows and columns' format indicative of the pre-defined data model.

PostgreSQL

An object-relational database management system (ORDBMS). It supports a large part of the SQL standard and offers many modern features like complex queries, transactional integrity, and multi-version concurrency control.

SQLite

An in-process library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. SQLite reads and writes directly to ordinary disk files.

MySQL

A widely used relational database management system (RDBMS). It is free and open-source and suits both small and large applications.

Oracle Database

An advanced database management system with a multi-model structure. It can be used for data warehousing, online transaction processing, and mixed database workloads.

Microsoft SQL Server

Microsoft SQL Server is a relational database management system developed by Microsoft. As a database server, it is a software product with the primary function of storing and retrieving data as requested by other software applications.

Unstructured data

Does not have an easily identifiable structure.

Source of Unstructured data

Web pages.

MongoDB

MongoDB is a non-relational document database that provides support for JSON-like storage.

Hadoop

A distributed storage system that can store any file format in a scalable manner.

Amazon DynamoDB

A fully managed NoSQL database service provided by Amazon Web Services (AWS). It is designed to provide seamless scalability, high performance, and low latency for applications that require single-digit millisecond response times.

Semi-Structured data

Has some organizational properties but lacks a fixed or rigid schema.

Sources of Semi-structured data

E-mails

Cassandra

Apache Cassandra is an open-source, distributed NoSQL database that offers scalability and high availability without compromising performance.

Bigtable

Google's distributed storage system for structured data, built on top of the Google File System.

ETL systems

The process by which data is moved from databases and other sources into a single repository, such as a data warehouse.

Data extraction

Data is copied or exported from source locations to a staging area.

Boring stats about data science work

80% of data science work is data preparation; 75% of data scientists find this to be the most boring aspect of the job.

Data Loading

Storing data in the target location.

Data Warehouse

Holds relational data from transactional systems, operational databases, and line-of-business applications.

Data Lake

Holds non-relational and relational data from IoT devices, websites, mobile apps, social media, and corporate applications.

Study Notes

  • Data consists of facts, observations, and perceptions.
  • Data is essential in science, business, healthcare, and tech
  • Data is processed to make it meaningful.
  • Organizations use data to gain a competitive edge

Data Value

  • Accuracy and accessibility are key components
  • The job of a Data Engineer is to ensure accuracy and accessibility

Data Engineering

  • It involves creating interfaces and mechanisms to manage data flow
  • Data Engineers maintain data to ensure usability
  • Data Engineers establish and manage an organization's data infrastructure
  • Analysts and scientists use the data after it is prepared by Data Engineers

Data Engineering Lifecycle

  • Generation is the starting point of the data engineering lifecycle
  • The lifecycle then involves storage, ingestion, and transformation
  • Serving data is the final stage of the lifecycle
  • The lifecycle is supported by security, data management, DataOps, and data architecture

Data Engineer

  • Transforms raw data into usable analytics-ready data
  • Ensures data is accurate, reliable, and follows regulations
  • Accessible data is a key result
  • A data engineer's responsibilities include extracting, organizing, and integrating data
  • Data engineers prepare data for analysis and reporting by transforming it
  • They design and manage data pipelines and set up the necessary infrastructure
  • They make data accessible for business uses and to stakeholders
  • "Big data engineers" are now simply called "data engineers"

Essential Skills

  • Technical skills
  • Functional skills
  • Soft skills

Technical Prowess

  • Knowledge of operating systems like UNIX, Linux, and Windows
  • Understanding infrastructure components: virtual machines, networking, etc
  • Cloud-based services from Amazon, Google, IBM, and Microsoft
  • RDBMS knowledge, such as IBM Db2, MySQL, Oracle, and PostgreSQL
  • Understanding of NoSQL databases: Redis, MongoDB, Cassandra, and Neo4j
  • Data warehouses: Oracle Exadata, IBM Db2 Warehouse on Cloud, Amazon Redshift
  • Data lakes: Azure Data Lake Storage, AWS Lake Formation, Google BigLake
  • Working with data pipelines is useful
  • Data pipeline solutions include Apache Beam, Airflow, and Dataflow
  • ETL tools such as IBM InfoSphere, AWS Glue, and Improvado are important
  • Proficiency in languages for data querying/manipulation/processing is needed
  • SQL for relational databases and SQL-like query languages for NoSQL databases
  • Programming languages like Python, R, and Java
  • Shell and Scripting languages, such as Unix/Linux Shell, PowerShell
  • Familiarity with big data processing tools like Hadoop, Hive, and Spark

Functional Skills

  • Convert business requirements into technical specifications: a core functional skill
  • Working with complete software development lifecycle stages is important
  • Software development lifecycle stages: Ideation, architecture, design, testing, etc
  • A data engineer must understand the potential applications of data in business
  • A data engineer must also understand the risks of poor data management, covering data quality, privacy, security, and compliance

Soft Skills

  • Interpersonal skills
  • Teamwork
  • Collaboration
  • Effective communication

Technical Roles

  • The Data Engineer is a hub connecting data producers and consumers
  • Data producers: software engineers, data architects, and DevOps/SREs
  • Data Consumers: data analysts, data scientists, and ML engineers
  • Data engineers interact with those in operational roles like DevOps engineers

Upstream Stakeholders

  • Data architects design the blueprint
  • They map processes for organizational data management
  • Act as a bridge between technical and nontechnical sides
  • Software engineers build the software and systems that generate the data data engineers use
  • DevOps/SREs produce data through operational monitoring
  • They may be downstream too, consuming data through dashboards

Downstream Stakeholders

  • Data Scientists use Data Analytics and Data Engineering to make predictions and recommendations
  • Data Analysts (Business Analysts) use those predictions to drive decisions
  • ML engineers overlap with Data Engineers and Data Scientists
  • They develop advanced techniques, train models, and maintain infrastructure

Data Engineering vs Data Science

  • Data engineering sits upstream from data science
  • Data engineers provide the inputs that data scientists convert into something useful

Data Science Hierarchy

  • Data collection is the first step for a data scientist
  • Next come the movement, organization, and secure storage of data
  • Data exploration and transformation, including data cleaning and anomaly detection, are then performed
  • Data classification and basic analytics occur during the data aggregation stage
  • Analytics, metrics, and training data allow testing, learning, and optimization
  • AI and deep learning, with the right resources and data, are at the top

Types of Data:

  • Structured data: facts and values (5-10%)
  • Unstructured data: contains information (80%)
  • Semi-structured data: tags with elements (10 - 20%)

Structured Data

  • It has a well-defined structure, can be stored in schemas, can be tabular
  • Represents only 5% to 10% of all data
  • Supports objective numbers and facts
  • Is the simplest way to manage information
  • Can be collected, exported, stored, and organized in typical databases

Sources of Structured Data

  • SQL Databases
  • OLTP systems
  • Spreadsheets like Excel and Google Sheets
  • Online forms
  • Sensors such as GPS and RFID tags
  • Network and web server logs

Pros of Structured Data

  • Structured data is easy for machine learning algorithms to use
  • It is easier to use for business users who understand the subject matter
  • More tools are available because structured data predates unstructured data

Cons of Structured Data

  • Data can only be used for its intended purpose
  • Commonly stored with rigid schemas in "data warehouses"
  • Any change to the schema or requirements costs significant time and resources

Use Cases for Structured Data

  • CRM software analyzes customer behavior with analytical tools
  • Hotel and ticket bookings
  • Accounting firms and departments
  • HR employee data
  • Admissions/Enrollment

Tools for Structured Data

  • PostgreSQL
  • SQLite
  • MySQL
  • Oracle Database
  • Microsoft SQL Server
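
As a small illustration of the "rows and columns" model, the sketch below uses SQLite through Python's standard `sqlite3` module. The `bookings` table, its columns, and the sample rows are hypothetical, chosen only to echo the hotel booking use case.

```python
import sqlite3

# Hypothetical booking records; table and column names are illustrative only.
BOOKINGS = [
    ("Lisbon", "2024-05-01", 120.0),
    ("Oslo", "2024-05-03", 210.5),
    ("Lisbon", "2024-06-10", 135.0),
]


def total_by_destination(rows):
    """Load rows into an in-memory SQLite table and total prices per destination."""
    conn = sqlite3.connect(":memory:")
    try:
        # The schema is declared up front: every row must fit these columns.
        conn.execute(
            "CREATE TABLE bookings ("
            "destination TEXT NOT NULL, check_in TEXT NOT NULL, price REAL NOT NULL)"
        )
        conn.executemany("INSERT INTO bookings VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT destination, SUM(price) FROM bookings "
            "GROUP BY destination ORDER BY destination"
        )
        return cursor.fetchall()
    finally:
        conn.close()


totals = total_by_destination(BOOKINGS)
print(totals)
```

Because the structure is known in advance, a single SQL query is enough to aggregate the data.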

Unstructured Data

  • Does not have an identifiable structure
  • Cannot be easily organized
  • Is increasing rapidly, representing over 80% of all data
  • Does not follow any particular format, sequence, semantics, or rules
  • Comes from varied sources and has analytics applications

Sources of Unstructured Data

  • Web pages
  • Social Media feeds
  • Images, Videos, and Audio files
  • Documents
  • Surveys
  • Media logs
  • PowerPoint presentations

Pros of Unstructured Data

  • Retains adaptability, which increases the file formats that can be stored and widens the data pool
  • Can be collected very quickly
  • Can be stored with pay-as-you-use pricing

Cons of Unstructured Data

  • Requires data science expertise to analyze
  • Requires specialized tools, which increases time and resource costs
  • Its complexity makes it slower to use with algorithms

Use Cases for Unstructured Data

  • Analyzing text in social media posts for sentiment
  • Processing video for facial recognition
  • Analyzing voice recordings for speech sentiment
  • Extracting insights from high-volume big data
  • Mining data for consumer habits
  • Analytics that alert on patterns and shifts
  • Analyzing text in chatbot conversations

Tools for Unstructured Data

  • MongoDB
  • Hadoop
  • Data lakes
  • Amazon DynamoDB

Semi-Structured Data

  • Has organizational traits, but lacks a rigid schema
  • Contains tags and elements to organize it correctly
  • Cannot be stored directly in rows and columns
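
To make the idea of tagged data without a fixed schema concrete, here is a minimal sketch that flattens JSON records into rows. The profile fields and the `flatten` helper are assumptions made for illustration; real pipelines often handle lists and missing fields differently.

```python
import json

# Hypothetical user profiles; the field names are made up for illustration.
RAW = """
[
  {"id": 1, "name": "Ada", "contact": {"email": "ada@example.com"}},
  {"id": 2, "name": "Lin", "tags": ["admin", "beta"]}
]
"""


def flatten(record, prefix=""):
    """Turn nested keys into dotted column names; lists become comma-joined text."""
    flat = {}
    for key, value in record.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{column}."))
        elif isinstance(value, list):
            flat[column] = ",".join(str(item) for item in value)
        else:
            flat[column] = value
    return flat


records = [flatten(r) for r in json.loads(RAW)]
# The records do not share one fixed set of fields, so take the union of columns.
columns = sorted({c for r in records for c in r})
rows = [[r.get(c) for c in columns] for r in records]
print(columns)
print(rows)
```

The keys act as the tags that organize each record, yet the two records carry different fields, which is why a tabular layout only emerges after this extra flattening step.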

Sources of semi-structured data

  • XML files
  • Binary executables
  • Zipped files
  • Email

Pros of Semi-Structured Data

  • Flexibility: the structure can be modified as the data changes
  • Scalability to handle various formats
  • Easy to capture because no predefined schema is required
  • Good performance when querying nested structures

Cons of Semi-Structured Data

  • Complexity when dealing with nested structures
  • Challenges when indexing efficiently
  • Integration with other data is difficult and may need custom handling

Relational vs Non-Relational Databases

  • Both relational and non-relational data can have defined relationships
  • NoSQL stands for "not only SQL"
  • The two kinds of database differ in the types of data they store

Data Engineering Skills

  • ETL (extract, transform, and load) systems move data from databases and other sources into a single repository, such as a data warehouse
  • Understanding data storage is necessary for designing sound data solutions

Data Extraction, Transformation, Loading

  • Data is copied from multiple source locations into a staging area
  • Common sources include SQL and NoSQL servers, CRMs, text and document files, and websites
  • Transformation involves manipulating the data: filtering, cleansing, de-duplicating, validating, and authenticating it
  • Loading can happen before or after transformation, depending on the target system
  • Cloud computing is an efficient way to process all the data at once
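
The three ETL phases can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production pipeline: `SOURCE_ROWS`, the `payments` table, and the cleansing rules are all hypothetical, and an in-memory SQLite database stands in for the target repository.

```python
import sqlite3

# Hypothetical source rows, standing in for an export from a CRM or log file.
SOURCE_ROWS = [
    {"email": " Ada@Example.com ", "amount": "19.99"},
    {"email": "lin@example.com", "amount": "5.00"},
    {"email": "ada@example.com", "amount": "19.99"},  # duplicate after cleansing
    {"email": "", "amount": "7.50"},  # fails validation
]


def extract():
    """Extract: copy rows from the source into a staging list."""
    return list(SOURCE_ROWS)


def transform(staged):
    """Transform: cleanse, validate, and de-duplicate the staged rows."""
    seen, clean = set(), []
    for row in staged:
        email = row["email"].strip().lower()
        if not email:  # validation: drop rows without an email
            continue
        record = (email, float(row["amount"]))
        if record not in seen:  # de-duplication
            seen.add(record)
            clean.append(record)
    return clean


def load(records, conn):
    """Load: store the transformed rows in the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS payments (email TEXT, amount REAL)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
loaded = conn.execute("SELECT email, amount FROM payments ORDER BY email").fetchall()
conn.close()
print(loaded)
```

Swapping the order so that raw rows are loaded first and cleaned inside the target system would give the ELT variant of the same flow.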

Data Warehouses vs Data Lake

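  • Data warehouses use schemas designed in advance (schema-on-write) and hold relational data from transactional systems and operational databases
  • They deliver fast query results but rely on higher-cost storage
  • They hold highly curated data that serves as the central version of the truth
  • Data lakes hold non-relational and relational data from IoT devices, websites, mobile apps, social media, and corporate applications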
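
The difference between schema-on-write and schema-on-read can be shown with a short sketch. Here an in-memory SQLite table plays the role of a warehouse and a list of raw JSON strings plays the role of a lake; the `readings` table, the sensor events, and `read_temperatures` are hypothetical names used only for illustration.

```python
import json
import sqlite3

# Schema-on-write: the table is defined before any data arrives,
# and a row that violates the definition is rejected at write time.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE readings (device TEXT NOT NULL, temp_c REAL NOT NULL)"
)
warehouse.execute("INSERT INTO readings VALUES (?, ?)", ("sensor-1", 21.5))
try:
    warehouse.execute("INSERT INTO readings VALUES (?, ?)", ("sensor-2", None))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True  # the NOT NULL constraint blocked the incomplete row
warehouse.close()

# Schema-on-read: raw events are stored as-is, and structure is applied
# only when a reader decides which fields it needs.
lake = [
    '{"device": "sensor-1", "temp_c": 21.5}',
    '{"device": "sensor-2", "humidity": 40}',
]


def read_temperatures(raw_lines):
    """Apply a (device, temp_c) schema while reading, skipping events that lack it."""
    for line in raw_lines:
        event = json.loads(line)
        if "temp_c" in event:
            yield event["device"], float(event["temp_c"])


temps = list(read_temperatures(lake))
print(rejected, temps)
```

The lake accepts the humidity-only event without complaint, and a later reader interested in humidity could apply a different schema to the same raw lines.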
