Questions and Answers
Which of the following best describes data science?
- A career path suitable for only experienced professionals.
- A process of converting information into raw data.
- A multidisciplinary field using scientific methods and algorithm systems to extract knowledge and insights from various types of data. (correct)
- A specialized tool for decision support.
- A field focused solely on analyzing data.
Data is essentially the same as information.
False
What are the four main stages of the data processing cycle?
Input, Processing, Output, Storage
A data type is an ______ of data that tells the compiler or interpreter how the programmer intends to use the data.
Which of the following is an example of structured data?
Semi-structured data conforms to the formal structure of data models associated with relational databases.
What are the three common data types from a data analytics perspective?
Data that doesn't have a predefined data model or isn't organized in a predefined manner is known as ______ data.
Which of the following is NOT one of the 4 V's that characterize big data?
In the context of big data, 'Veracity' refers to the validation of data sources.
Name at least three activities identified in a big data value chain.
______ is the process of gathering, filtering, and cleaning data before it is put in a data warehouse.
What is the primary goal of data analysis in the context of the Data Value Chain?
Data curation is a one-time process to ensure data quality at the moment the data is added to a database.
What is the aim of 'Data Curation'?
Data ______ refers to the persistence and management of data in a scalable way that satisfies the needs of applications needing fast access to data.
What does 'Data Usage' generally cover?
Computing only refers to calculations.
What is 'Cluster Computing'?
In cluster computing, each computer is known as a ______.
What advantage does 'Cluster Computing' have?
Big data cluster systems are designed to manage and process small volumes of data.
What is the role of the 'head node' in big data?
A 'classic cluster' allows nodes to share ______ and collaborate by sharing program data while those programs are running.
Which of the following is NOT guaranteed from the classic cluster?
There is only a single benefit in using big data clustering software. Otherwise, individual computers are preferred to handle big data systems.
What is Hadoop's YARN?
Big Data clustering software provides the benefit of high ______, guaranteeing to prevent hardware/software failures from affecting access to data and processing.
Hadoop is designed for what purpose?
Hadoop systems are expensive and require specialized hardware.
Name 2 important characteristics of Hadoop.
Hadoop provides four core components: data management, data access, data ______, and data storage.
What is the function of 'MapReduce' within the Hadoop ecosystem?
HDFS is a SQL database.
What does YARN stand for?
The high-level scripting platform in Hadoop that simplifies data processing tasks is called ______.
In the context of the Big Data Life Cycle with Hadoop, what does 'ingesting data into the system' refer to?
In Hadoop's Big Data Life Cycle, the last step is cleaning data.
What is the third stage of big data lifecycle processing with Hadoop?
In Hadoop's Big Data Lifecycle, ______ facilitates data transfer from RDBMS to HDFS.
Match the following components with their descriptions:
Flashcards
What is Data Science?
A multidisciplinary field using scientific methods, processes, and algorithms to extract knowledge and insights from various data types.
What is Data?
A representation of facts, concepts, or instructions in a formalized manner suitable for communication, interpretation, or processing.
What is Information?
Data that has been organized or classified, providing meaningful values for the receiver.
Data Processing Cycle
Input stage
Processing stage
Output stage
Storage stage
Data Type
Structured Data
Semi-structured Data
Unstructured Data
Metadata
Big Data
Big Data: Volume
Big Data: Velocity
Big Data: Variety
Big Data: Veracity
Data Value Chain
Data Acquisition
Data Analysis
Data Curation
Data Storage
Data Usage
Computing
Cluster Computing
Head Node
Big Data Cluster System
Hadoop
Hadoop characteristic: Reliable
Hadoop characteristic: Economical
Hadoop characteristic: Scalable
Spark
Pig
Hive
Study Notes
Overview of Data Science
- Data science is a multidisciplinary field using scientific methods, processes, and algorithms to extract knowledge and insights.
- This extraction involves structured, semi-structured, and unstructured data.
- Data science is more than simply analyzing data.
- It offers a range of roles and requires diverse skills.
- Data science is a promising and in-demand career path.
What is Data?
- Data represents facts, concepts, or instructions in a formalized manner suitable for interpretation, communication, or processing.
- This processing can be by humans or electronic machines.
- Data can be described as unprocessed facts and figures.
- Data consists of streams of raw facts representing events within an organization or the physical environment.
- These facts must be organized and arranged for people to understand or use.
- Data is defined as groups of non-random symbols in the form of text, images, and voice representing quantities, actions, and objects.
- Data is represented using alphabets (A-Z, a-z) or special characters (+, -, /, *, <, >, = etc.).
What is Information?
- Information refers to organized or classified data with meaningful values for the receiver.
- Information is processed data upon which actions and decisions are based.
- Raw facts on their own cannot help in decision-making.
- Interpreted data is created from organized, structured, and processed data within a specific context.
Summary: Data vs. Information
- Data is unprocessed or raw facts and figures, while information is described as processed data.
- Data cannot aid in decision-making, but information can.
- Data is the raw material that is organized, structured, and interpreted into useful information.
- Interpreted data is created from organized, structured, and processed data within a particular context.
- Data consists of groups of non-random symbols in the form of text, images, and voice representing quantities, actions, and objects.
- Information is the processed form of that data, presented as text, images, and voice that represent quantities, actions, and objects.
Data vs Information - Examples
- Data: the number of cars sold by a dealership last month was 100.
- Information: sales increased by 10% last month.
- Data: the temperature in Addis Ababa on October 21, 2021, at 6:00 PM was 23 degrees Celsius.
- Information: the temperature was above average for that time of year.
Data Processing Cycle
- The data processing cycle is a sequence of steps or operations that converts raw data into a usable form.
- It involves restructuring or reordering data by people or machines to increase usefulness and add value.
- It transforms raw data into meaningful information.
- The output of one step is the input for the next.
- The value of data is realized when processed into actionable information.
- Data processing is used for various purposes, including business intelligence, research, or decision support.
- The cycle consists of four main stages: input, processing, output, and storage.
Input
- Input data is prepared in a convenient form for processing.
- The specific form needed depends on the processing machine.
- When using electronic computers, input data can be recorded on mediums like flash disks or hard disks.
Processing
- Input data is transformed to produce data in a more useful form, changing it into meaningful information.
- Raw data is processed using a suitable or selected processing method.
- For example, a monthly sales summary is calculated from the sales order data.
Output
- The result of the processing step is collected.
- Processed data is presented in a human-readable format, like reports, charts, graphs, and dashboards.
- The output form depends on the data's use, like total sales in a month.
Storage
- Storage refers to how and where data processing output is stored for future use.
- Processed data is stored in databases or file systems on devices like hard drives, solid-state drives, and cloud storage.
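- As a small illustration, the following is a minimal, hypothetical Python sketch of the four stages applied to the monthly-sales example above; the file names and the "amount" column are assumptions made only for the example:

```python
import csv
import json

# Input: read raw sales-order records (hypothetical file and columns).
def read_orders(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))  # rows assumed to contain an "amount" column

# Processing: transform raw orders into a monthly sales summary.
def summarize(orders):
    total = sum(float(row["amount"]) for row in orders)
    return {"orders": len(orders), "total_sales": total}

# Output: present the summary in a human-readable form.
def report(summary):
    print(f"Orders: {summary['orders']}, total sales: {summary['total_sales']:.2f}")

# Storage: persist the processed result for future use.
def store(summary, path):
    with open(path, "w") as f:
        json.dump(summary, f)

if __name__ == "__main__":
    orders = read_orders("sales_orders.csv")   # input
    summary = summarize(orders)                # processing
    report(summary)                            # output
    store(summary, "monthly_summary.json")     # storage
```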
Data Types and its Representation
- In computer science and programming, a data type is an attribute of the data, telling the compiler how the programmer intends to use the data.
- Data types define the operations possible on the data, the meaning of the data, and how the values of that type can be stored.
Common Data Types
- Integers (int) store whole numbers, including positives and negatives.
- Booleans (bool) represent values as true (1) or false (0).
- Characters (char) store a single text character.
- Floating-point numbers (float) store real numbers.
- Alphanumeric strings (strings) store a mix of characters and numbers.
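- The listed types can be illustrated in Python (a sketch only; Python is dynamically typed and has no separate char type, so a single character is simply a one-character string):

```python
# Common data types, illustrated with Python values.
age = 25              # int: whole number (positive or negative)
is_active = True      # bool: true (1) or false (0)
grade = "A"           # char: a single text character (one-character string)
temperature = 23.5    # float: real (floating-point) number
user_id = "user42"    # string: a mix of characters and numbers

for value in (age, is_active, grade, temperature, user_id):
    print(type(value).__name__, value)
```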
Data Types from Data Analytics Perspective
- From a data analytics perspective, three common types of data structures exist: structured, semi-structured, and unstructured.
Structured Data
- Data adheres to a predefined data model and is straightforward to analyze.
- Structured data exists in fixed fields within a file or record.
- It conforms to a tabular format with relationships between rows and columns.
- Its use depends on creating a data model that defines the types of data to include and how to store/process it.
- A data model is a visual representation of a database structure.
- A database is an organized collection of structured data stored in a computer system.
- Common examples of structured data are SQL databases and Excel files.
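- A tiny, hypothetical example of structured data using Python's built-in SQLite support (the table and its rows are invented purely for illustration): the predefined schema is the data model, and the tabular rows and columns make analysis straightforward.

```python
import sqlite3

# Structured data: a fixed schema with typed columns in a relational (SQL) database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, car TEXT, amount REAL)")
conn.executemany("INSERT INTO sales (car, amount) VALUES (?, ?)",
                 [("Sedan", 25000.0), ("SUV", 31000.0)])

# Because the structure is predefined, querying is straightforward.
for row in conn.execute("SELECT car, amount FROM sales"):
    print(row)
conn.close()
```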
Semi-Structured Data
- It does not conform to the formal structure of data models associated with relational databases or other forms of data tables.
- It contains tags or markers to separate semantic elements and enforce hierarchies; it is therefore said to have a self-describing structure.
Semi-Structured Data Examples
- JSON (JavaScript Object Notation)
- XML (Extensible Markup Language).
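- A small, made-up JSON document shows the self-describing structure: the field names act as tags that separate semantic elements and express hierarchy (the record contents below are purely illustrative).

```python
import json

# Semi-structured data: tags/markers describe the elements and their hierarchy.
record = """
{
  "name": "Abebe",
  "contact": {"email": "abebe@example.com", "phone": "+251-11-000-0000"},
  "courses": ["Data Science", "Emerging Technologies"]
}
"""
data = json.loads(record)
print(data["contact"]["email"])   # navigate the hierarchy by tag name
```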
Unstructured Data
- This is data that either has no predefined data model or is not organized in a predefined manner.
- There is no data model, and it is stored in its native format.
- It is typically text-heavy but may also contain dates, numbers, and facts.
- Common examples include PDFs, images, NoSQL databases, video files, and audio files.
Metadata
- Metadata is data about data; it provides additional information about a specific set of data.
- The metadata of a photo could describe when and where the photo was taken.
- Metadata then provides fields for dates and locations that can be considered structured data.
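- For instance, a hypothetical metadata record for a photo might look like the following: the photo itself is unstructured data, but its metadata fields (date, location, device) form structured data that can be searched and filtered.

```python
# Hypothetical photo metadata: structured fields describing unstructured data.
photo_metadata = {
    "file": "IMG_0421.jpg",
    "taken_at": "2021-10-21T18:00:00",
    "location": "Addis Ababa",
    "camera": "Phone camera",
}
print(photo_metadata["taken_at"], photo_metadata["location"])
```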
What is Big Data?
- Big data refers to datasets so large and complex that traditional tools have difficulty processing them.
- A "data set" is an ordered collection of data.
- The common scale of big data sets is constantly shifting and varies from one organization to another.
- Big Data is characterized by the 4 V's.
4 V's of Big Data
- Volume refers to the amount of data, which can reach very large scales (zettabytes).
- Velocity is the speed at which data is generated and processed; data may be live-streaming or in motion.
- Variety refers to the number of types of data, which arrive in diverse forms from diverse sources.
- Veracity refers to the quality and trustworthiness of the data, asking questions such as "Can we trust the data?"
Other V's of Big Data
- Value refers to the usefulness of gathered data for the business.
- Variability refers to the number of inconsistencies in the data and the inconsistent speed at which big data is loaded into the database.
- Validity is the data quality, governance, and master data management on massive scales.
- Venue refers to distributed, heterogeneous data drawn from multiple platforms.
- Vocabulary refers to the data models and semantics that describe the data's structure.
- Vulnerability: big data brings new security concerns, since a data breach with big data is a big breach.
- Volatility: given the volume of big data, how long data remains useful and must be kept needs careful consideration ("How long does data need to be kept for?").
- Visualization: different ways of representing data, such as clustering or using tree maps, sunbursts, parallel coordinates, circular network diagrams, and cone trees.
- Vagueness: confusion over the meaning of big data and the tools used.
Data Value Chain
- It is the series of steps needed to generate value and useful insights from data.
- This chain includes data acquisition, data analysis, data curation, data storage, and data usage.
Data Acquisition
- The process of gathering, filtering, and cleaning data before placing it in a data warehouse or storage solutions.
- This then allows data analysis to be carried out.
- One major challenge in acquiring data is the infrastructure requirement.
- The infrastructure should deliver low, predictable latency in capturing data and executing queries.
- It should be able to handle very high transaction volumes, often in a distributed environment.
- It should also support flexible and dynamic data structures.
Data Analysis
- The process of cleaning, transforming, and modeling data to discover useful information for business decision-making.
- This involves exploring, transforming, and modeling data to highlight relevant data.
- It also involves synthesizing and extracting useful hidden information with high potential from a business point of view.
- Related areas include data mining, business intelligence, and machine learning.
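- A minimal data-analysis sketch in Python, assuming the pandas library is installed (the month/amount columns are invented for illustration): cleaning, transforming, and aggregating data to surface information useful for decisions.

```python
import pandas as pd

# Hypothetical raw data with a missing value.
raw = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb", "Feb"],
    "amount": [100.0, None, 150.0, 120.0, 130.0],
})

clean = raw.dropna(subset=["amount"])                   # cleaning: drop missing values
clean = clean.assign(amount_k=clean["amount"] / 1000)   # transforming: derive a new field
summary = clean.groupby("month")["amount"].sum()        # aggregating for decision-making
print(summary)
```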
Data Curation
- This is the active management of data over its life cycle to ensure it meets data quality requirements for effective usage.
- Data curation processes can be categorized into creation, selection, classification, transformation, validation, and preservation.
- Data curation is performed by expert curators responsible for improving the accessibility and quality of data.
- Data curators are also responsible for ensuring data is trustworthy, discoverable, accessible, reusable, and fit for its purpose.
- Community and crowdsourcing approaches to curation are a key trend for big data.
Data Storage
- The persistence and management of data in a scalable way that satisfies application needs requiring fast access to the data.
- Relational Database Management Systems (RDBMS) have been the main solution for nearly 40 years.
- Relational databases lack flexibility regarding schema changes, performance, and fault tolerance when data volumes and complexity increase.
- NoSQL technologies are designed with scalability in mind.
Data Usage
- This covers data-driven business activities needing access to data.
- Data analysis and the tools needed to integrate data analysis within a business activity are included.
- This can enhance competitiveness through reduction of costs, increased added value, or parameter measurement against existing performance criteria.
Cluster Computing
- Computing is the use of computers for calculations, data processing, and problem-solving.
- It involves manipulating and transforming data using software applications and algorithms.
- Computing is done on a single computer or distributed across multiple computers connected through a network.
- Cluster computing is a specific type of computing involving the use of a cluster.
- A cluster is a group of interconnected computers or servers working together to perform a task or solve a problem.
- Cluster computing refers to multiple connected computers functioning as a single entity.
- It distributes computational load, enabling faster processing and increased computational power.
Cluster Computing Setup
- Each computer in the cluster, or "node", works in parallel with other nodes to handle different parts of a larger workload.
- Nodes connect through a high-speed network and communicate to coordinate tasks, and each node performs a dedicated task.
- Many nodes connect with a single node called the "head node".
- Accessing a cluster system typically means accessing a head node or gateway node.
- A head node is the launching point for jobs running on the cluster and the main access point.
- Classic clusters essentially allow nodes to share infrastructure like disk space.
- Clusters are also used for collaboration, sharing program data while those programs are running.
- Cluster computing offers solutions to complicated problems by providing faster computational speed and enhanced data integrity.
- Data integrity refers to the overall accuracy, completeness, and consistency of data.
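- As a single-machine analogy (not a real cluster), the sketch below uses a pool of Python worker processes in place of nodes: each worker handles part of a larger workload in parallel, and the results are combined, much as a head node coordinates work across nodes.

```python
from multiprocessing import Pool

# Each worker process plays the role of a "node" handling its share of the workload.
def process_chunk(chunk):
    return sum(x * x for x in chunk)  # stand-in for one node's computation

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]       # split the workload across 4 "nodes"
    with Pool(processes=4) as pool:
        partial_results = pool.map(process_chunk, chunks)
    print(sum(partial_results))                   # combine the partial results
```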
Big Data Cluster System
- Individual computers are often inadequate for handling big data at most stages.
- Therefore, computer clusters address the high storage and computational needs of big data.
- These specialized clusters are designed to manage and process large volumes of data, enabling scalable and distributed processing of big data across multiple nodes within the cluster.
- Big Data clustering software combines resources of smaller machines.
- Examples of big data clustering software/tools include Hadoop's YARN (Yet Another Resource Negotiator), Qubole, HPCC, Cassandra, MongoDB, Apache Storm, CouchDB, and Statwing.
- A big data cluster provides solutions for managing cluster membership, coordinating resource sharing, and scheduling actual work on individual nodes.
- Cluster membership & resource allocation can be handled by software like Hadoop's YARN (Yet Another Resource Negotiator).
- The assembled computing cluster acts as a foundation which other software interfaces with to process data.
- Additionally, the machines in the computing cluster manage a distributed storage system.
Benefits of Big Data Clustering Software
- Resource Pooling: combining the available storage space to hold data, as well as pooling CPU and memory.
- High Availability: clusters offer varying levels of fault tolerance and availability guarantees to prevent hardware or software failures from affecting access to data and processing.
- Easy Scalability: clusters facilitate easy horizontal scaling by adding machines to the group.
Hadoop
- Hadoop is an open-source framework designed to simplify interaction with big data.
- It is designed for distributed storage and processing large datasets across clusters of computers, and inspired by a Google technical document.
- Open-source software allows anyone to inspect, modify, and enhance its source code.
- This development and distribution model provides the public with access to the underlying (source) code of a program or application.
- Source code refers to the part of software that programmers modify to alter the function of software.
- A software framework is an abstraction that provides generic functionality, allowing users to extend it with additional code to create application-specific software.
Characteristics of Hadoop
- The system is highly economical because it utilizes ordinary computers for data processing.
- Hadoop systems are reliable because copies of the data are stored on different machines, making them robust to hardware failure.
- It is easily scalable both horizontally and vertically, enabling the framework to expand with the addition of extra nodes.
- It is extremely flexible because it enables the storage of structured and unstructured data for future use as needed.
Hadoop Ecosystem
- Hadoop has an ecosystem that has evolved from its four core components: data management, data access, data processing, and data storage.
- These four components work together to provide a comprehensive ecosystem for managing, accessing, processing, and storing big data.
- The ecosystem offers tools and technologies that address different aspects of the data lifecycle and cater to a wide range of big data use cases.
Components of Hadoop Ecosystem
- Hadoop Distributed File System (HDFS) is a distributed file system that provides reliable and scalable storage.
- Yet Another Resource Negotiator (YARN) manages resources and schedules tasks across the cluster.
- MapReduce is a programming model and processing engine for parallel processing that divides tasks into map and reduce phases (see the sketch after this list).
- Spark is a cluster computing system with in-memory processing capabilities; it integrates with Hadoop and offers higher performance for certain workloads.
- Pig is a high-level scripting platform that simplifies data processing tasks using a language called Pig Latin.
- Hive is a data warehouse infrastructure that provides a high-level, SQL-like query language known as HiveQL.
- HBase is a database that runs on top of Hadoop and offers random, real-time access to big data.
- Mahout is a library of machine learning algorithms for Hadoop, for tasks like clustering, classification, and recommendation.
- MLlib is a machine learning library in Spark for scalable machine learning tasks, including data preprocessing, feature extraction, model training, and evaluation.
- Solr is an open-source search platform based on Apache Lucene, providing powerful search capabilities.
- Lucene is a Java library providing indexing and searching that serves as the core for search-related applications.
- ZooKeeper is a coordination service for distributed systems, offering infrastructure for maintaining configuration information, synchronizing processes, and managing distributed locks.
- Oozie is a workflow scheduling system for Hadoop, enabling users to define and manage workflows.
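- The map and reduce phases can be illustrated with a small, framework-free Python sketch of the classic word count; a real MapReduce job runs distributed across the cluster, whereas this only mimics the two phases in memory.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # map: emit (key, value) pairs for each word in a line
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    # shuffle/reduce: group values by key and combine them
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

lines = ["big data needs big clusters", "hadoop processes big data"]
mapped = chain.from_iterable(map_phase(line) for line in lines)
print(reduce_phase(mapped))   # e.g. {'big': 3, 'data': 2, ...}
```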
Big Data Life Cycle with Hadoop
- Ingesting data into the system is the first step in processing big data: data is transferred from source systems into Hadoop.
- Sqoop facilitates data transfer from an RDBMS to HDFS, while Flume handles event data transfer.
- Processing the data in storage is the second step: data is stored in the distributed file system HDFS, and NoSQL data is stored in HBase.
- Spark and MapReduce then perform the data processing.
- Computing and analyzing the data is the third step; analysis frameworks include Pig, Hive, and Impala.
- Pig applies map and reduce techniques to the data.
- Hive is suited for structured data.
- Visualizing the results is the final step, using tools such as Hue and Cloudera Search.