Data Science vs Data Engineering

Questions and Answers

Which of the following best describes the role of a data engineer?

  • Designing user interfaces for data visualization tools.
  • Building and maintaining data pipelines and systems to make data accessible and reliable. (correct)
  • Extracting meaningful insights from data using statistical methods.
  • Creating predictive models to forecast future trends.

Data Scientists typically need deeper knowledge of data warehousing than Data Engineers.

False (B)

What does data maturity primarily depend on within an organization?

  • How the data is leveraged as a competitive advantage. (correct)
  • The age of the company.
  • The size of the IT department.
  • The annual revenue of the company.

Which of the following is NOT a primary responsibility of a data engineer?

Developing marketing strategies based on data insights. (A)

Data Engineers focus exclusively on tasks related to data storage.

False (B)

Why is proficiency in coding languages crucial for data engineers?

To automate data processing tasks and build data pipelines. (C)

A key skill for data engineers is familiarity with both relational and ______ databases.

non-relational

Which of the following is a primary advantage of using Python in data engineering?

It has diverse libraries and frameworks suitable for various tasks. (B)

Which statement accurately describes ETL systems?

They move data from multiple sources into a single repository. (C)

Match the data engineering tasks with their descriptions:

  • Data Extraction = Copying data from various sources to a staging area.
  • Data Transformation = Converting raw data into a useful format for analysis.
  • Data Loading = Storing data in the target location.

What is a key characteristic of a 'data lake' compared to a 'data warehouse'?

Data lakes are suitable for storing structured and unstructured data. (B)

Which function is performed during the 'transformation' stage of ETL?

Filtering, cleaning, and validating the data. (B)

A data engineer's work is mostly data preparation.

True (A)

What is the function of a data pipeline?

Design of systems for processing and storing data (A)

What are the functions of building data systems and pipelines?

All mentioned options (C)

Data engineers do not build data pipelines.

False (B)

What does raw data describe?

Data in its most basic (A)

Which of the following is a software engineering task fulfilled by Data Engineers?

Scaling (D)

What are examples of Big Data Tools?

Hadoop, MongoDB, Kafka

What is the purpose of performing complex data analysis to find trends and patterns?

To report results using dashboards and data visualizations. (C)

Data maturity depends simply on the age or revenue of a company.

False (B)

Ensuring the data is complete, has been cleansed, and that rules have been established for outliers is part of ______ data.

preparing

A relational database is organized with:

All of the above (D)

Machine learning is mostly a concern for data scientists, not data engineers.

True (A)

What is the function of cloud computing?

Cloud storage and computing

Flashcards

Data Analysis

Turning raw information into knowledge that can be acted on.

Data Modeling

Using existing data to estimate desired data.

Data Engineering

Enhancing speed, robustness, and scalability of data processes.

Domain Knowledge (Data Analysis)

Translating business needs into questions and making accuracy-cost trade-offs.

Research (Data Analysis)

Gathering data and designing experiments.

Interpretation (Data Analysis)

Summarizing, visualizing, and applying statistical tools to data.

Supervised Learning

Classification, regression, and anomaly detection.

Unsupervised Learning

Clustering, dimensionality reduction, and anomaly detection.

Custom Algorithm Development

Feature engineering and numerical optimization.

Data Management

Database management, pipeline construction, and data collection.

Production

Automation, system integration, and robustification.

Software Engineering

Ensuring maintainability, scaling, and collaborative development.

Data Engineers

More technical, with solid data warehousing and programming backgrounds.

Data Maturity

Progression toward higher data utilization, capabilities, and integration across the organization.

Data Pipeline

The design of systems for processing and storing data, including capture, cleanse and transform.

Coding (Data Engineering)

Proficiency in languages like SQL, NoSQL, Python, Java, R, and Scala.

ETL Systems

Moving data from databases and other sources into a single repository.

Relational Database

Collection of data items with pre-defined relationships organized as tables.

Non-relational Database

Does not use the tabular schema. NoSQL stands for not only SQL.

Data Extraction

Moving and copying of data from source locations to a staging area.

Data Transformation

Transforming data to be useful for analysis

More Data Transformation...

Filtering, cleaning, de-duplicating, validating, and authenticating the data

Keep transforming...

Performing calculations, translations, or summaries based on the raw data

Still transforming...

Formatting the data into tables or joined tables to match the target data warehouse schema

Cloud to the rescue

Cloud computing separates storage and computational machines

Study Notes

Data Science vs Data Engineering

  • Data Science combines domain expertise, coding skills, and knowledge of mathematics and statistics to extract meaningful insights from data
  • Data Engineering focuses on data formats, storage, extraction, and transformation
  • Data analysis translates business needs into questions and makes accuracy-cost trade-offs
  • Data Modeling includes classification, regression, and anomaly detection
  • Data Engineering includes: data management, production and software engineering

Data Scientist vs Data Engineer

  • Data engineers usually have more technical expertise and solid data warehousing and programming backgrounds
  • Data scientists tend to be more mathematical
  • There is crossover between the roles
  • Machine learning models require writing small applications and heavy data manipulation

Data Maturity and the Data Engineer

  • Data engineering complexity depends on a company's data maturity
  • Data maturity is the progression toward higher data utilization, capabilities, and integration across the organization
  • Data maturity depends on how data is leveraged as a competitive advantage

Data Engineer Responsibilities

  • Analyzing and organizing raw data
  • Building data systems and pipelines
  • Evaluating business needs and objectives
  • Interpreting trends and patterns
  • Preparing data for prescriptive and predictive modeling
  • Building algorithms and prototypes
  • Developing analytical tools and programs

Data Engineering Skills

  • Coding proficiency is essential, in languages and technologies such as SQL, NoSQL, Python, Java, R, and Scala
  • Should be familiar with relational and non-relational databases and how they work
  • Needs knowledge of ETL (extract, transform, and load) systems. ETL is the process of moving data from databases and other sources into a single repository, such as a data warehouse
  • Requires familiarity with big data tools and related technologies
  • Should understand cloud computing and data security

Why Python

  • Python is easy to learn and simple to read
  • Python is efficient: it performs bulky tasks using fewer lines of code
  • Python has diverse libraries and frameworks
  • Python is versatile: it can be used with almost all software, tasks, and infrastructure
  • Python has a vast community that supports Python learners
  • Python is portable and extensible: code can be used on other platforms without significant changes
  • Python is flexible: developers can choose between an object-oriented (OOP) style and a scripting style
  • Python has attractive documentation, lessons, and tutorials

Relational and Non-relational Databases

  • A relational database is a collection of data items with pre-defined relationships between them, organized as a set of tables with rows and columns
  • A non-relational database does not use the tabular schema of rows and columns
  • NoSQL stands for "not only SQL"
  • Examples of data suited to this model are documents, semi-structured data, and large volumes of unstructured data resulting from the Internet of Things (IoT), social networks, and the rise of AI
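
The contrast between the two models can be sketched in Python using only the standard library. This is an illustrative sketch, not a recommendation of any particular database: `sqlite3` stands in for a relational database, and a JSON document stands in for a document-style NoSQL record. The table names, fields, and values are all hypothetical.

```python
import json
import sqlite3

# Relational: tables with rows and columns, and a schema declared up front.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customers(id), "
    "total REAL)"
)
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 25.0), (11, 1, 17.5)],
)

# The pre-defined relationship (orders.customer_id -> customers.id)
# is what makes the join possible.
relational_total = conn.execute(
    "SELECT c.name, SUM(o.total) FROM customers c "
    "JOIN orders o ON o.customer_id = c.id GROUP BY c.id"
).fetchone()

# Non-relational (document style): one self-describing record with nested
# data and optional fields, and no fixed set of columns.
document = json.loads(
    '{"id": 1, "name": "Ada", "tags": ["newsletter"], '
    '"orders": [{"id": 10, "total": 25.0}, {"id": 11, "total": 17.5}]}'
)
document_total = sum(order["total"] for order in document["orders"])

print(relational_total)
print(document_total)
```

Both approaches arrive at the same total. The difference is where the structure lives: in the table definitions for the relational case, and inside each record for the document case.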

ETL Process

  • ETL is the process of moving data from databases and other sources into a single repository like a data warehouse
  • The components of the ETL process are: extraction, transformation, and load

Data Extraction

  • Extraction retrieves the data from its sources
  • The data is copied or exported from source locations to a staging area
  • The data comes from structured or unstructured sources such as SQL or NoSQL servers, CRM and ERP systems, text and document files, emails, web pages, and more
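
As a minimal sketch of the extraction step, the Python below copies two sources unchanged into a staging directory. Everything here is assumed for illustration: the in-memory strings stand in for a CRM export and a web event feed, a temporary directory stands in for the staging area, and `extract_to_staging` is a hypothetical helper name.

```python
import csv
import io
import json
import tempfile
from pathlib import Path

# Hypothetical sources: a CSV export (structured) and a JSON feed
# (semi-structured). Real sources would be servers, files, or APIs.
CRM_EXPORT = "id,email\n1,ada@example.com\n2,grace@example.com\n"
WEB_EVENTS = '[{"user": 1, "page": "/home"}, {"user": 2, "page": "/pricing"}]'


def extract_to_staging(staging_dir):
    """Copy each source, unmodified, into the staging area."""
    staged = []
    for name, payload in (("crm.csv", CRM_EXPORT), ("events.json", WEB_EVENTS)):
        target = staging_dir / name
        target.write_text(payload, encoding="utf-8")
        staged.append(target)
    return staged


with tempfile.TemporaryDirectory() as tmp:
    staged_files = extract_to_staging(Path(tmp))
    staged_names = [path.name for path in staged_files]
    # Later steps read from staging, not from the original sources.
    crm_text = staged_files[0].read_text(encoding="utf-8")
    crm_rows = list(csv.DictReader(io.StringIO(crm_text)))
    events = json.loads(staged_files[1].read_text(encoding="utf-8"))

print(staged_names)
print(len(crm_rows), len(events))
```

Nothing is cleaned or reshaped at this stage. Keeping the staged copies raw means the transformation step can be rerun without going back to the source systems.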

Data Transformation

  • By a commonly cited estimate, 80% of data science work is data preparation, and 75% of data scientists find it the most boring part of the job
  • Raw data is transformed to be useful for analysis and to fit the schema of the eventual target data warehouse
  • Data engineers bring their skill in manipulating data to a project, which includes:
    • Filtering, cleansing, de-duplicating, validating, and authenticating the data
    • Performing calculations, translations, or summaries based on the raw data
    • Formatting the data into tables or joined tables to match the target data warehouse schema
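
The transformation tasks listed above can be sketched in plain Python. This is a simplified illustration under stated assumptions: the raw rows, the field names, and the target shape `(customer, total_amount)` are hypothetical, and authentication of the data is not covered.

```python
from collections import defaultdict

# Hypothetical raw rows as they might arrive from a staging area.
raw_rows = [
    {"order_id": "1", "customer": " Ada ", "amount": "25.00"},
    {"order_id": "1", "customer": " Ada ", "amount": "25.00"},  # duplicate
    {"order_id": "2", "customer": "grace", "amount": "17.50"},
    {"order_id": "3", "customer": "Ada", "amount": "not-a-number"},  # invalid
    {"order_id": "4", "customer": "", "amount": "9.99"},  # missing customer
]


def clean(row):
    """Cleanse one row; return None when it fails validation."""
    customer = row["customer"].strip().title()
    try:
        amount = float(row["amount"])
    except ValueError:
        return None
    if not customer or amount < 0:
        return None
    return {"order_id": int(row["order_id"]), "customer": customer, "amount": amount}


def transform(rows):
    """Filter, de-duplicate, and summarize rows for a hypothetical target table."""
    seen_ids = set()
    orders = []
    for row in rows:
        cleaned = clean(row)
        if cleaned is None or cleaned["order_id"] in seen_ids:
            continue  # drop invalid rows and repeated order ids
        seen_ids.add(cleaned["order_id"])
        orders.append(cleaned)

    # Summary calculation: total amount per customer.
    totals = defaultdict(float)
    for order in orders:
        totals[order["customer"]] += order["amount"]

    # Format as (customer, total_amount) tuples, ready to insert into a table.
    return orders, sorted(totals.items())


orders, summary = transform(raw_rows)
print(summary)
```

Of the five raw rows, only two survive: the duplicate, the non-numeric amount, and the row with no customer are all dropped before the summary is computed.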

Data Loading

  • Loading is the movement of the data into the target location, where it is stored
  • Data engineers sometimes swap the load and transform steps (ELT: extract, load, transform) when dealing with big data technologies such as Hadoop or Spark. With ELT:
    • The extraction process is cheaper
    • The processing burden is spread across multiple machines or clusters
  • Cloud computing separates storage from compute, so the expensive machines used to process data can be scaled down without affecting the stored data
  • A data lake is a centralized repository that allows storing structured and unstructured data at any scale
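
The load-then-transform ordering (ELT) mentioned above can be sketched with Python's built-in `sqlite3` module. This is only an illustration of the ordering: an in-memory SQLite database stands in for the target system, whereas a real ELT setup would use a distributed engine such as Hadoop or Spark. The table names and event rows are hypothetical.

```python
import sqlite3

# Hypothetical extracted records: (username, day, clicks).
raw_events = [
    ("ada", "2024-01-01", 3),
    ("ada", "2024-01-02", 5),
    ("grace", "2024-01-01", 2),
]

conn = sqlite3.connect(":memory:")

# Load: store the extracted data in the target as-is.
conn.execute("CREATE TABLE raw_events (username TEXT, day TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", raw_events)

# Transform: runs afterwards, inside the target system, using its own engine.
conn.execute(
    "CREATE TABLE clicks_per_user AS "
    "SELECT username, SUM(clicks) AS total_clicks "
    "FROM raw_events GROUP BY username"
)

result = conn.execute(
    "SELECT username, total_clicks FROM clicks_per_user ORDER BY username"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_events").fetchone()[0]

print(result)
print(raw_count)
```

Because the raw table is kept, new transformations can be added later without extracting the data again.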

Data Warehouses vs Data Lake

  • Data: warehouses hold relational data coming from transactional systems, operational databases, and line-of-business applications, while data lakes hold non-relational and relational data from IoT devices, web sites, mobile apps, social media, and corporate applications
  • Schema: a warehouse schema is designed prior to the warehouse implementation (schema-on-write), while a lake schema is written at the time of analysis (schema-on-read)
  • Price and performance: warehouses deliver the fastest query results using higher-cost storage, while lakes deliver query results that are getting faster using low-cost storage
  • Data quality: warehouses hold highly curated data that serves as the central version of the truth, while lakes hold any data, which may or may not be curated (i.e., raw data)
  • Users: warehouses are analyzed by business analysts, while lakes are analyzed by data scientists, data developers, and business analysts (using curated data)
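
The difference between schema-on-write and schema-on-read can be sketched in Python. This is a toy illustration under stated assumptions: an in-memory SQLite table stands in for a warehouse, a list of raw JSON strings stands in for a lake, and the sensor records and the `read_celsius` helper are hypothetical.

```python
import json
import sqlite3

# Schema-on-write: the table shape is fixed before any data arrives,
# and rows that do not fit are rejected at write time.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE readings (sensor TEXT NOT NULL, celsius REAL NOT NULL)"
)
warehouse.execute("INSERT INTO readings VALUES (?, ?)", ("s1", 21.5))
try:
    warehouse.execute("INSERT INTO readings VALUES (?, ?)", ("s2", None))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True  # the NOT NULL constraint refuses the incomplete row

# Schema-on-read: the lake keeps raw records in whatever shape they arrive.
lake = [
    '{"sensor": "s1", "celsius": 21.5}',
    '{"sensor": "s2", "fahrenheit": 68.0}',
    '{"sensor": "s3", "note": "offline"}',
]


def read_celsius(raw):
    """Apply a schema at analysis time; return None when no reading exists."""
    record = json.loads(raw)
    if "celsius" in record:
        return record["celsius"]
    if "fahrenheit" in record:
        return (record["fahrenheit"] - 32) * 5 / 9
    return None


temps = [t for t in map(read_celsius, lake) if t is not None]

print(rejected)
print(temps)
```

The warehouse guarantees every stored row fits the schema, at the cost of rejecting data up front. The lake accepts everything and leaves each analysis to decide how to interpret it.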

Data Engineering Skills

  • Data engineers need to grasp the basic concepts of machine learning and understand the needs of the data scientists on their team
  • Data engineers are tasked with managing big data with technologies that include Hadoop, MongoDB, and Kafka
  • Data engineers need to understand cloud storage and computing
  • Data engineers are tasked with securely managing and storing data to protect it from loss or theft
