Data Science Overview

Questions and Answers

Which of the following best describes data science?

  • A career path suitable for only experienced professionals.
  • A process of converting information into raw data.
  • A multidisciplinary field that uses scientific methods, processes, and algorithms to extract knowledge and insights from various types of data. (correct)
  • A specialized tool for decision support.
  • A field focused solely on analyzing data.

Data is essentially the same as information.

False (B)

What are the four main stages of the data processing cycle?

Input, Processing, Output, Storage

A data type is an ______ of data that tells the compiler or interpreter how the programmer intends to use the data.

attribute

Which of the following is an example of structured data?

Excel files (E)

Semi-structured data conforms to the formal structure of data models associated with relational databases.

False (B)

What are the three common data types from a data analytics perspective?

Structured, semi-structured, unstructured

Data that doesn't have a predefined data model or isn't organized in a predefined manner is known as ______ data.

unstructured

Which of the following is NOT one of the 4 V's that characterize big data?

Validity (D)

In the context of big data, 'Veracity' refers to the validation of data sources.

False (B)

Name at least three activities identified in a big data value chain.

Data Acquisition, Data Analysis, Data Curation

______ is the process of gathering, filtering, and cleaning data before it is put in a data warehouse.

Data acquisition

What is the primary goal of data analysis in the context of the Data Value Chain?

Discovering useful information for business decision-making. (B)

Data curation is a one-time process to ensure data quality at the moment the data is added to a database.

False (B)

What is the aim of 'Data Curation'?

To ensure data meets quality requirements

Data ______ refers to the persistence and management of data in a scalable way that satisfies the needs of applications needing fast access to data.

Storage

What does 'Data Usage' generally cover?

The business activities that need access to data and its analysis. (E)

Computing only refers to calculations.

False (B)

What is 'Cluster Computing'?

A group of interconnected computers working together

In cluster computing, each computer is known as a ______.

node

What advantage does 'Cluster Computing' have?

It allows for the distribution of computational load across multiple machines. (C)

Big data cluster systems are designed to manage and process small volumes of data.

False (B)

What is the role of the 'head node' in big data?

The launching point for jobs

A 'classic cluster' allows nodes to share ______ and collaborate by sharing program data while those programs are running.

infrastructure

Which of the following is NOT guaranteed by a classic cluster?

Enhanced data analysis (B)

Big data clustering software offers only a single benefit; otherwise, individual computers are preferred for handling big data systems.

False (B)

What is Hadoop's YARN?

Resource management software

Big Data clustering software provides the benefit of high ______, which helps prevent hardware and software failures from affecting access to data and processing.

availability

Hadoop is designed for what purpose?

To simplify interaction with big data through distributed storage and processing. (B)

Hadoop systems are expensive and require specialized hardware.

False (B)

Name 2 important characteristics of Hadoop.

Economical, Flexible, Reliable, Scalable

Hadoop provides four core components: data management, data access, data ______, and data storage.

processing

What is the function of 'MapReduce' within the Hadoop ecosystem?

A programming model and processing engine for parallel processing. (C)

HDFS is a SQL database.

False (B)

What does YARN stand for?

Yet Another Resource Negotiator

The high-level scripting platform in Hadoop that simplifies data processing tasks is called ______.

Pig

In the context of the Big Data Life Cycle with Hadoop, what does 'ingesting data into the system' refer to?

Transferring data to Hadoop from various sources. (D)

In Hadoop's Big Data Life Cycle, the last step is cleaning data.

False (B)

What is the third stage of big data lifecycle processing with Hadoop?

Analyzing data

In Hadoop's Big Data Life Cycle, ______ facilitates data transfer from RDBMS to HDFS.

Sqoop

Match the following components with their descriptions:

HDFS = A distributed file system providing reliable and scalable storage.
YARN = A resource management framework for managing resources and scheduling tasks.
MapReduce = A programming model for parallel processing.
Spark = A general-purpose cluster computing system with in-memory processing.

Flashcards

What is Data Science?

A multidisciplinary field using scientific methods, processes, and algorithms to extract knowledge and insights from various data types.

What is Data?

A representation of facts, concepts, or instructions in a formalized manner suitable for communication, interpretation, or processing.

What is Information?

Data that has been organized or classified, providing meaningful values for the receiver.

Data Processing Cycle

A sequence of steps or operations to transform raw data into a usable form, increasing its usefulness and adding value.

Input stage

The initial stage of the data processing cycle where raw data is prepared in a convenient form for processing.

Processing stage

Changes and transforms the input data into a more useful form, converting it into meaningful information through suitable methods.

Output stage

The stage where the results of processing are collected and presented in a human-readable format, such as reports, charts, graphs, and dashboards.

Storage stage

Refers to how and where the results of data processing are kept for future use in databases, file systems, or other storage devices.

Data Type

An attribute of data that tells the compiler or interpreter how the programmer intends to use the data.

Structured Data

A data format that adheres to a predefined data model, typically tabular with defined relationships, making it straightforward to analyze.

Semi-structured Data

Data that doesn't conform to a formal structure, but contains tags or markers to separate semantic elements.

Unstructured Data

Data that lacks a predefined data model and is stored in its native format, typically text-heavy but potentially containing dates, numbers, and facts.

Metadata

Additional information about a specific set of data, describing when and where the data was created or modified.

Big Data

A dataset so large and complex that it becomes difficult to process using typical database management tools.

Big Data: Volume

The amount of data.

Big Data: Velocity

The speed at which data is processed.

Big Data: Variety

The different types of data.

Big Data: Veracity

The quality of the data.

Data Value Chain

Describes the flow of data and identifies the stages that transform raw data into valuable insights.

Data Acquisition

Process of gathering, filtering, and cleaning the data for analysis.

Data Analysis

Cleaning, transforming, and modeling data to discover useful information.

Data Curation

The active management of data to ensure it meets the necessary quality requirements for effective usage.

Data Storage

Persistence and management of data in a scalable way.

Data Usage

Data-driven business activities such as analysis and integration.

Computing

Process of using computers to perform various tasks, including calculations, data processing, and problem-solving, either on a single computer or distributed across multiple computers.

Cluster Computing

A specific type of computing that involves a group of interconnected computers or servers working together to perform a task or solve a problem.

Head Node

The central access point and launching point for jobs running in a cluster.

Big Data Cluster System

Multiple machines working together to store and process big data.

Hadoop

An open-source framework that simplifies interaction with big data.

Hadoop characteristic: Reliable

Hadoop is reliable because it stores copies of data on different machines, making it robust to hardware failures.

Hadoop characteristic: Economical

Hadoop systems are highly economical because they can utilize ordinary computers for data processing.

Hadoop characteristic: Scalable

Hadoop is easily scalable both horizontally and vertically.

Spark

A fast, general-purpose cluster computing system with in-memory processing capabilities.

Pig

A high-level scripting platform in Hadoop that simplifies data processing.

Hive

A data warehouse infrastructure built on Hadoop.

Study Notes

Overview of Data Science

  • Data science is a multidisciplinary field using scientific methods, processes, and algorithms to extract knowledge and insights.
  • This extraction involves structured, semi-structured, and unstructured data.
  • Data science is more than simply analyzing data.
  • It offers a range of roles and requires diverse skills.
  • Data science is a promising and in-demand career path.

What is Data?

  • Data represents facts, concepts, or instructions in a formalized manner suitable for interpretation, communication, or processing.
  • This processing can be by humans or electronic machines.
  • Data can be described as unprocessed facts and figures.
  • Data consists of streams of raw facts representing events within an organization or the physical environment.
  • These facts must be organized and arranged for people to understand or use.
  • Data is defined as groups of non-random symbols in the form of text, images, and voice representing quantities, actions, and objects.
  • Data is represented using alphabets (A-Z, a-z) or special characters (+, -, /, *, <, >, = etc.).

What is Information?

  • Information refers to organized or classified data with meaningful values for the receiver.
  • Information is processed data upon which actions and decisions are based.
  • Raw facts on their own cannot help in decision-making.
  • Interpreted data is created from organized, structured, and processed data within a specific context.

Summary: Data vs. Information

  • Data is unprocessed or raw facts and figures, while information is described as processed data.
  • Data cannot aid in decision-making, but information can.
  • Data is the raw material that is organized, structured, and interpreted into useful information.
  • Interpreted data is created from organized, structured, and processed data within a particular context.
  • Data is groups of non-random symbols in the form of text, images, and voice representing quantities, actions, and objects.
  • Information is processed data presented in forms such as text, images, and voice, representing quantities, actions, and objects.

Data vs Information - Examples

  • Data: the number of cars sold by a dealership last month was 100.
  • Information: sales increased by 10% last month.
  • Data: the temperature in Addis Ababa on October 21, 2021, at 6:00 PM was 23 degrees Celsius.
  • Information: the temperature was above average for that time of year.

Data Processing Cycle

  • The data processing cycle is a sequence of steps or operations converting raw data into a usable form.
  • It involves restructuring or reordering data by people or machines to increase usefulness and add value.
  • It transforms raw data into meaningful information.
  • The output of one step is the input for the next.
  • The value of data is realized when processed into actionable information.
  • Data processing is used for various purposes, including business intelligence, research, or decision support.
  • The cycle consists of four main stages: input, processing, output, and storage.

Input

  • Input data is prepared in a convenient form for processing.
  • The specific form needed depends on the processing machine.
  • When using electronic computers, input data can be recorded on media such as flash disks or hard disks.

Processing

  • Input data is transformed to produce data in a more useful form, changing it into meaningful information.
  • Raw data is processed using a suitable or selected processing method.
  • For example, a summary of sales for the month is calculated from the sales-order data.

Output

  • The result of the processing step is collected.
  • Processed data is presented in a human-readable format, like reports, charts, graphs, and dashboards.
  • The output form depends on the data's use, like total sales in a month.

Storage

  • Storage refers to how and where data processing output is stored for future use.
  • Processed data is stored in databases or file systems on devices like hard drives, solid-state drives, and cloud storage (a short end-to-end sketch of the cycle follows this list).
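To make the four stages concrete, here is a minimal Python sketch (an illustration, not part of the lesson) that walks hypothetical sales-order data through input, processing, output, and storage:

    import json

    # Input: raw sales-order data prepared in a machine-readable form.
    sales_orders = [
        {"order_id": 1, "amount": 250.0},
        {"order_id": 2, "amount": 120.5},
        {"order_id": 3, "amount": 330.0},
    ]

    # Processing: transform the raw data into meaningful information
    # (a monthly sales summary).
    summary = {
        "month": "2021-10",
        "orders": len(sales_orders),
        "total_sales": sum(order["amount"] for order in sales_orders),
    }

    # Output: present the result in a human-readable format.
    print(f"Total sales for {summary['month']}: {summary['total_sales']:.2f}")

    # Storage: persist the processed result for future use.
    with open("monthly_summary.json", "w") as f:
        json.dump(summary, f, indent=2)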

Data Types and Their Representation

  • In computer science and programming, a data type is an attribute of the data, telling the compiler how the programmer intends to use the data.
  • Data types define the operations possible on the data, the meaning of the data, and how the values of that type can be stored.

Common Data Types

  • Integers (int) store whole numbers, including positives and negatives.
  • Booleans (bool) represent values as true (1) or false (0).
  • Characters (char) store a single text character.
  • Floating-point numbers (float) store real numbers.
  • Alphanumeric strings (string) store a mix of characters and numbers (a short sketch of these types follows this list).
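A minimal Python illustration of these types (my sketch, not from the lesson; Python infers types at runtime rather than declaring them to a compiler, and a one-character string stands in for a dedicated char type):

    count = 42            # int: a whole number, positive or negative
    is_valid = True       # bool: true (1) or false (0)
    grade = "A"           # char: a single text character (a 1-character string in Python)
    temperature = 23.5    # float: a real (floating-point) number
    label = "Room 12B"    # string: a mix of characters and numbers

    for value in (count, is_valid, grade, temperature, label):
        print(type(value).__name__, value)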

Data Types from Data Analytics Perspective

  • From a data analytics perspective, three common types of data structures exist: structured, semi-structured, and unstructured.

Structured Data

  • Data adheres to a predefined data model and is straightforward to analyze.
  • Structured data exists in fixed fields within a file or record.
  • It conforms to a tabular format with relationships between rows and columns.
  • Its use depends on creating a data model that defines the types of data to include and how to store/process it.
  • A data model is a visual representation of a database structure.
  • A database is an organized collection of structured data stored in a computer system.
  • Common examples of structured data are SQL databases and Excel files (a small SQL example follows this list).
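As a small illustration (not from the lesson), the sketch below uses Python's built-in sqlite3 module and a hypothetical sales table to show why a predefined tabular model makes analysis straightforward:

    import sqlite3

    # Structured data: a predefined, tabular model with typed columns.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (order_id INTEGER PRIMARY KEY, product TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO sales (order_id, product, amount) VALUES (?, ?, ?)",
        [(1, "Laptop", 850.0), (2, "Mouse", 20.5), (3, "Monitor", 210.0)],
    )

    # Because the schema is fixed, queries and analysis are straightforward.
    (total,) = conn.execute("SELECT SUM(amount) FROM sales").fetchone()
    print("Total sales:", total)
    conn.close()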

Semi-Structured Data

  • It does not conform to the formal structure of data models associated with relational databases or other forms of data tables.
  • It contains tags or markers to separate semantic elements and enforce hierarchies, so it is known as a self-describing structure (see the JSON sketch after the examples below).

Semi-Structured Data Examples

  • JSON (JavaScript Object Notation)
  • XML (Extensible Markup Language).
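The JSON snippet below (hypothetical data, my illustration) shows the self-describing structure: keys and nesting separate and organize the semantic elements, yet individual records may carry different fields, unlike rows in a relational table:

    import json

    record = """
    {
      "customer": "Abebe",
      "orders": [
        {"id": 1, "items": ["laptop", "mouse"]},
        {"id": 2, "items": ["monitor"], "gift_wrap": true}
      ]
    }
    """

    data = json.loads(record)
    # Fields can vary between elements: order 2 has "gift_wrap", order 1 does not.
    for order in data["orders"]:
        print(order["id"], order.get("gift_wrap", False))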

Unstructured Data

  • This is data that either has no predefined data model or is not organized in a predefined manner.
  • There is no data model, and it is stored in its native format.
  • It is typically text-heavy but may also contain dates, numbers, and facts.
  • Common examples include PDFs, images, NoSQL databases, video files, and audio files.

Metadata

  • Data about data, it provides additional information about a specific set of data.
  • The metadata of a photo could describe when and where the photo was taken.
  • Metadata thus provides fields for dates and locations that can themselves be considered structured data (a tiny file-metadata sketch follows).
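As a small, hedged sketch (the file name is hypothetical and not from the lesson), basic file-system metadata can be read with Python's standard library:

    import os
    from datetime import datetime

    path = "photo.jpg"  # hypothetical file
    if os.path.exists(path):
        info = os.stat(path)
        # Metadata: data about the file, not its contents.
        print("Size (bytes):", info.st_size)
        print("Last modified:", datetime.fromtimestamp(info.st_mtime))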

What is Big Data?

  • These are datasets so large and complex that traditional tools have difficulty processing them.
  • A "data set" is an ordered collection of data.
  • The common scale of big data sets is constantly shifting and varies from one organization to another.
  • Big Data is characterized by the 4 V's.

4 V's of Big Data

  • Volume refers to the amount of data, which is very large (on the order of zettabytes).
  • Velocity is the speed at which data is generated and processed; it is live streaming or in motion.
  • Variety refers to the number of types of data: diverse forms from diverse sources.
  • Veracity is the quality of the data, asking questions such as "Can we trust the data?"

Other V's of Big Data

  • Value refers to the usefulness of gathered data for the business.
  • Variability refers to the number of inconsistencies in the data and the inconsistent speed at which big data is loaded into the database.
  • Validity is the data quality, governance, and master data management on massive scales.
  • Venue refers to distributed, heterogeneous data drawn from multiple platforms.
  • Vocabulary refers to the data models and semantics that describe the data's structure.
  • Vulnerability: big data brings new security concerns, since a data breach with big data is a big breach.
  • Volatility: because of the velocity and volume of big data, how long data remains valid and needs to be kept must be considered carefully, asking questions such as "How long does data need to be kept?"
  • Visualization: different ways of representing the data, such as data clustering or using tree maps, sunbursts, parallel coordinates, circular network diagrams, and cone trees.
  • Vagueness: confusion over the meaning of big data and the tools used.

Data Value Chain

  • It is the series of steps needed to generate value and useful insights from data.
  • This chain includes data acquisition, data analysis, data curation, data storage, and data usage.

Data Acquisition

  • The process of gathering, filtering, and cleaning data before placing it in a data warehouse or storage solutions.
  • This then allows data analysis to be carried out.
  • One major challenge in acquiring data is the infrastructure requirement.
  • Infrastructure should deliver low, predictable latency in capturing data and executing queries.
  • The infrastructure should be able to handle very high transaction volumes, often in a distributed environment.
  • The infrastructure supports flexible and dynamic data structures.

Data Analysis

  • The process of cleaning, transforming, and modeling data to discover useful information for business decision-making.
  • This involves exploring, transforming, and modeling data to highlight relevant data.
  • It also involves synthesizing and extracting useful hidden information with high potential from a business point of view.
  • Related areas include data mining, business intelligence, and machine learning.

Data Curation

  • This is the active management of data over its life cycle to ensure it meets data quality requirements for effective usage.
  • Data curation processes can be categorized into creation, selection, classification, transformation, validation, and preservation.
  • Data curation is performed by expert curators responsible for improving the accessibility and quality of data.
  • Data curators also hold the responsibility of ensuring data is trustworthy, discoverable, accessible, reusable, and fit for its purpose.
  • Community and crowd sourcing are a key trend in big data curation.

Data Storage

  • The persistence and management of data in a scalable way that satisfies application needs requiring fast access to the data.
  • Relational Database Management Systems (RDBMS) have been the main solution for nearly 40 years.
  • Relational databases lack flexibility regarding schema changes, performance, and fault tolerance when data volumes and complexity increase.
  • NoSQL technologies are designed with scalability in mind.

Data Usage

  • This covers data-driven business activities needing access to data.
  • Data analysis and the tools needed to integrate data analysis within a business activity are included.
  • This can enhance competitiveness through reduction of costs, increased added value, or parameter measurement against existing performance criteria.

Cluster Computing

  • Computing is the use of computers to perform calculations, data processing, and problem-solving.
  • It involves manipulating and transforming data using software applications and algorithms.
  • Computing is done on a single computer or distributed across multiple computers connected through a network.
  • Cluster computing is a specific type of computing involving the use of a cluster.
  • A cluster is a group of interconnected computers or servers working together to perform a task or solve a problem.
  • Cluster computing refers to multiple connected computers functioning as a single entity.
  • It distributes computational load, enabling faster processing and increased computational power.

Cluster Computing Setup

  • Each computer in the cluster, or "node", works in parallel with other nodes to handle different parts of a larger workload.
  • Nodes connect through a high-speed network and communicate to coordinate tasks, and each node performs a dedicated task.
  • Many nodes connect with a single node called the "head node".
  • Accessing a cluster system typically means accessing a head node or gateway node.
  • A head node is the launching point for jobs running on the cluster and the main access point.
  • Classic clusters essentially allow nodes to share infrastructure like disk space.
  • Clusters are also used to collaborate by sharing program data while those programs are running.
  • Cluster computing offers solutions to complicated problems by providing faster computational speed and enhanced data integrity (a toy parallelism sketch follows this list).
  • Data integrity refers to the overall accuracy, completeness, and consistency of data.
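The toy Python sketch below (my illustration; it uses local worker processes as a stand-in for cluster nodes) shows the basic idea of splitting a workload, processing the parts in parallel, and combining partial results at a coordinating point:

    from multiprocessing import Pool

    def process_chunk(chunk):
        """Each worker ('node') handles one part of the larger workload."""
        return sum(x * x for x in chunk)

    if __name__ == "__main__":
        workload = list(range(1_000_000))
        # Split the workload into chunks, one per worker.
        chunks = [workload[i::4] for i in range(4)]
        with Pool(processes=4) as pool:
            partial_results = pool.map(process_chunk, chunks)
        # A coordinating step (like a head node) combines the partial results.
        print("Total:", sum(partial_results))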

Big Data Cluster System

  • Individual computers are often inadequate for handling big data at most stages.
  • Therefore, computer clusters address the high storage and computational needs of big data.
  • The specialized cluster is designed to manage/process large volumes of data, and enable scalable and distributed processing of big data across multiple nodes within the cluster.
  • Big Data clustering software combines resources of smaller machines.
  • Examples of big data clustering software/tools include Hadoop's YARN (Yet Another Resource Negotiator), Qubole, HPCC, Cassandra, MongoDB, Apache Storm, CouchDB, and Statwing.
  • A big data cluster provides solutions such as managing cluster membership, coordinating resource sharing, and scheduling actual work on individual nodes.
  • Cluster membership & resource allocation can be handled by software like Hadoop's YARN (Yet Another Resource Negotiator).
  • The assembled computing cluster acts as a foundation which other software interfaces with to process data.
  • Additionally, the machines in the computing cluster manage a distributed storage system.

Benefits of Big Data Clustering Software

  • Resource pooling combines the available storage space to hold data, as well as CPU and memory.
  • High availability: clusters offer varying levels of fault tolerance, helping to prevent hardware and software failures from affecting access to data and processing.
  • Easy scalability: clusters facilitate horizontal scaling by adding machines to the group.

Hadoop

  • Hadoop is an open-source framework designed to simplify interaction with big data.
  • It is designed for distributed storage and processing large datasets across clusters of computers, and inspired by a Google technical document.
  • Open-source software allows anyone to inspect, modify, and enhance its source code.
  • This development and distribution model provides the public with access to the underlying (source) code of a program or application.
  • Source code refers to the part of software that programmers modify to alter the function of software.
  • A software framework is an abstraction that provides generic functionality, allowing users to extend it with additional code to create application-specific software.

Characteristics of Hadoop

  • The system is highly economical because it utilizes ordinary computers for data processing.
  • Hadoop systems are reliable due to storing copies on different machines, making it robust to hardware failures.
  • It is easily scalable both horizontally and vertically, enabling the framework to expand with the addition of extra nodes.
  • Extremely flexible because it enables the storage of structured and unstructured data for future use as needed.

Hadoop Ecosystem

  • Hadoop has an ecosystem that evolved from its four core components: data management, data access, data processing, and data storage.
  • The four components work together to provide a comprehensive ecosystem for managing, accessing, processing, and storing big data.
  • This ecosystem offers tools and technologies that address different aspects of the data lifecycle and cater to a range of big data use cases.

Components of Hadoop Ecosystem

  • Hadoop Distributed File System (HDFS) is a distributed file system that provides reliable and scalable storage.
  • Yet Another Resource Negotiator (YARN) manages resources and schedules tasks across the cluster.
  • MapReduce is a programming model and processing engine for parallel processing that divides tasks into map and reduce phases (a toy word-count sketch follows this list).
  • Spark is a cluster computing system with in-memory processing capabilities; it integrates with Hadoop and offers higher performance for certain workloads.
  • Pig is a high-level scripting platform that simplifies data processing using a language called Pig Latin.
  • Hive is a data warehouse infrastructure providing a high-level query language known as HiveQL.
  • HBase is a database that runs on top of Hadoop, offering random, real-time access to big data.
  • Mahout is a library of machine learning algorithms for Hadoop, used for tasks like clustering, classification, and recommendation.
  • MLlib is a machine learning library in Spark for scalable machine learning tasks, including data preprocessing, feature extraction, model training, and evaluation.
  • Solr is an open-source Apache search platform providing powerful search capabilities.
  • Lucene is a Java library providing indexing and searching that serves as the core for search-related applications.
  • ZooKeeper is a coordination service for distributed systems, offering infrastructure for maintaining configuration information, synchronizing processes, and managing distributed locks.
  • Oozie is a workflow scheduling system for Hadoop, enabling users to manage workflows.
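To make the MapReduce model concrete, here is a toy single-machine word count in Python (my illustration; a real job would be distributed across the cluster by Hadoop or Spark):

    from collections import defaultdict

    documents = ["big data needs big clusters", "hadoop processes big data"]

    # Map phase: each record is turned into (key, value) pairs.
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle phase: pairs are grouped by key.
    grouped = defaultdict(list)
    for word, count in mapped:
        grouped[word].append(count)

    # Reduce phase: the values for each key are aggregated.
    counts = {word: sum(values) for word, values in grouped.items()}
    print(counts)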

Big Data Life Cycle with Hadoop

  • Ingesting data into the system is the first step when processing big data.
  • Sqoop facilitates data transfer from an RDBMS to HDFS, while Flume handles event data transfer.
  • Processing and storing the data is the second step: data is distributed across the HDFS distributed file system, while NoSQL data is distributed in HBase.
  • Spark and MapReduce perform the processing.
  • Computing and analyzing the data is the third step; frameworks used here include Pig, Hive, and Impala.
  • Pig applies map and reduce techniques.
  • Hive is suited for structured data.
  • Visualizing the results is the final step; tools involved include Hue and Cloudera Search (a hedged PySpark sketch of this flow follows this list).
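The hedged PySpark sketch below ties the stages together under assumed inputs (the HDFS paths and the columns of the hypothetical sales.csv are mine, and a Spark installation is required):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("BigDataLifeCycleSketch").getOrCreate()

    # Ingest: read data previously landed in HDFS (e.g., via Sqoop or Flume).
    df = spark.read.csv("hdfs:///data/raw/sales.csv", header=True, inferSchema=True)

    # Process: clean and transform the data.
    clean = df.dropna(subset=["amount"]).filter(df.amount > 0)

    # Analyze: run a Hive-style SQL query over the cleaned data.
    clean.createOrReplaceTempView("sales")
    summary = spark.sql("SELECT product, SUM(amount) AS total FROM sales GROUP BY product")

    # Store: write results back to HDFS for visualization tools (e.g., Hue) to pick up.
    summary.write.mode("overwrite").parquet("hdfs:///data/curated/sales_summary")
    spark.stop()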
