Questions and Answers
What is the main purpose of Hadoop?
What is the importance of High Availability in clustered computing?
What is the benefit of Easy Scalability in clustered computing?
What is the first step in the Big Data Life Cycle with Hadoop?
What is the primary component of a computer system that clustered computing aims to provide high availability for?
What is the source of inspiration for Hadoop?
What is the main advantage of using Hadoop for big data processing?
What is the last step in the Big Data Life Cycle with Hadoop?
What is the primary benefit of using clustered computing for big data processing?
What is the primary component of a computer system that Hadoop is designed to work with?
Study Notes
Big Data Value Chain
- The Big Data Value Chain describes the information flow within a big data system, with the aim of generating value and useful insights from data.
- The Big Data Value Chain identifies the following key high-level activities:
- Data Acquisition
- Data Analysis
- Data Curation
- Data Storage
- Data Usage
Data Acquisition
- Data Acquisition is the process of gathering, filtering, and cleaning data before it is put in a data warehouse or any other storage solution on which data analysis can be carried out (a minimal sketch of this step follows the list below).
- The infrastructure required to support the acquisition of big data must provide:
- Low latency
- High transaction volumes
- Flexible and dynamic data structures
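
A minimal sketch of the acquisition step in Python, assuming the raw input is a newline-delimited JSON file; the file format, the `id` field, and the cleaning rules are illustrative assumptions, not part of the notes above.

```python
import json

def acquire(path):
    """Gather, filter, and clean raw events before loading them into storage.

    Assumes newline-delimited JSON (one event per line); the `id` field and
    the cleaning rules are illustrative only.
    """
    cleaned = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue                      # filter: skip blank lines
            try:
                record = json.loads(line)     # parse one raw event
            except json.JSONDecodeError:
                continue                      # filter: drop malformed input
            # clean: trim whitespace in string fields
            record = {k: v.strip() if isinstance(v, str) else v
                      for k, v in record.items()}
            if record.get("id"):              # filter: require an identifier
                cleaned.append(record)
    return cleaned

# The cleaned records would then be written to a warehouse or data lake.
```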
Data Analysis
- Data Analysis is concerned with making the acquired raw data amenable to use in decision-making, as well as in domain-specific applications.
- Data scientists need to:
- Be curious and result-oriented
- Have strong communication skills that allow them to explain highly technical results to their non-technical counterparts
- Have a strong quantitative background in statistics and linear algebra, as well as programming knowledge with a focus on data warehousing, mining, and modeling, in order to build and analyze algorithms
Data Vs Information
- Data can be described as unprocessed facts and figures.
- Data can be defined as a collection of facts, concepts, or instructions in a formalized manner.
- Data must be interpreted or processed, by a human or an electronic machine, to have true meaning.
- Data can be presented in the form of:
- Alphabets (A-Z, a-z)
- Digits (0-9)
- Special characters (+, -, /, *, =, etc.)
Information
- Information is the processed data on which decisions and actions are based.
- It is data that has been processed into a form that is meaningful to the recipient and is of real or perceived value in the recipient's current or prospective actions or decisions.
- Information is interpreted data; created from organized, structured, and processed data in a particular context.
Data Types
- From a data analytics point of view, there are three common data types or structures:
- Structured data
- Semi-structured data
- Unstructured data
Structured Data
- Structured data is data that adheres to a pre-defined Data Model and is therefore straightforward to analyze.
- Structured data conforms to a tabular format with a relationship between the different rows and columns.
- Common examples of structured data are Excel files or SQL databases.
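
To make the tabular format concrete, here is a small sketch using Python's built-in sqlite3 module; the table name, columns, and rows are made up for illustration.

```python
import sqlite3

# Structured data: every row follows the same pre-defined schema (data model).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)",
    [(1, "Abebe", "Addis Ababa"), (2, "Sara", "Bahir Dar"), (3, "Lensa", "Addis Ababa")],
)

# Because the schema is known in advance, analysis is straightforward:
for city, count in conn.execute("SELECT city, COUNT(*) FROM customers GROUP BY city"):
    print(city, count)
```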
Semi-Structured Data
- Semi-structured data is a form of structured data that does not obey the tabular structure of data models associated with relational databases or other forms of data tables.
- Semi-structured data contains tags or other markers to separate semantic elements within the data.
- Examples of semi-structured data include XML, JSON, etc.
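
A short sketch of semi-structured data in JSON, parsed with Python's json module; the sensor records are invented for illustration. Note that the two records do not share an identical schema, yet each field is tagged with a key.

```python
import json

# Semi-structured: keys/tags separate semantic elements, but records are not
# forced into one fixed tabular schema.
raw = """[
  {"id": 1, "name": "sensor-a", "readings": [21.5, 22.0]},
  {"id": 2, "name": "sensor-b", "location": {"lat": 9.03, "lon": 38.74}}
]"""

for record in json.loads(raw):
    # Fields vary per record, so access optional ones defensively.
    print(record["id"], record.get("location", "no location recorded"))
```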
Unstructured Data
- Unstructured data does not have a predefined data model and is not organized in a pre-defined manner.
- Unstructured information is typically text-heavy but may contain data such as dates, numbers, and facts as well.
- Unstructured data is difficult to understand using traditional programs as compared to data stored in structured databases.
- Common examples of unstructured data include audio files, video files, PDF and Word documents, and NoSQL databases.
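
A small example of why unstructured text needs extra work: the dates and numbers embedded in free text have to be pulled out explicitly (here with regular expressions); the sentence is invented for illustration.

```python
import re

# Unstructured: free text with no pre-defined data model.
note = "Order 4521 was shipped on 2023-11-02 and weighed 3.2 kg."

dates = re.findall(r"\d{4}-\d{2}-\d{2}", note)       # ISO-style dates
numbers = re.findall(r"\d+(?:\.\d+)?", note)         # integers and decimals
print(dates)    # ['2023-11-02']
print(numbers)  # ['4521', '2023', '11', '02', '3.2']
```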
Data Curation
- Data curation is the active management of data over its life cycle to ensure it meets the necessary data quality requirements for its effective usage.
- Data curation is performed by expert curators (data curators, scientific curators, or data annotators) who are responsible for improving the accessibility, quality, trustworthiness, discoverability, and reusability of data.
Data Storage
- Data Storage is the persistence and management of data in a scalable way that satisfies the needs of applications that require fast access to the data.
- The best solution for storing big data is a data lake, because it can support various data types; data lakes are typically based on Hadoop clusters, cloud object storage services, NoSQL databases, or other big data platforms.
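
A sketch of the kind of layout a data lake uses, with the local filesystem standing in for HDFS or a cloud object store; the folder names and the date-based partitioning scheme are assumptions for illustration, not a prescribed structure.

```python
import datetime
import json
from pathlib import Path

def store_raw(records, lake_root="datalake", dataset="events"):
    """Write raw records into a date-partitioned folder, mimicking the
    object/file layout of a data lake (backend-agnostic sketch)."""
    today = datetime.date.today().isoformat()
    partition = Path(lake_root) / dataset / f"ingest_date={today}"
    partition.mkdir(parents=True, exist_ok=True)
    out = partition / "part-00000.jsonl"
    with out.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return out

print(store_raw([{"id": 1, "value": 42}]))
# e.g. datalake/events/ingest_date=2024-05-01/part-00000.jsonl
```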
Data Usage
- Data usage in business decision-making can enhance competitiveness through the reduction of costs, increased added value, or any other parameter that can be measured against existing performance criteria.
- Data usage covers the data-driven business activities that need access to data, its analysis, and the tools needed to integrate the data analysis within the business activity.
Big Data Life Cycle with Hadoop
- The activities, or life cycle stages, involved in big data processing are as follows (a minimal sketch of the compute step appears after the list):
- Ingesting data into the system
- Persisting the data in storage
- Computing and analyzing data
- Visualizing the results
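
As a concrete, hedged example of the "computing and analyzing" step, the classic word-count job can be expressed as map and reduce functions in Python. The sketch below simulates the map, sort, and reduce pipeline locally; on a real cluster the same logic would be split into mapper and reducer scripts and submitted with Hadoop Streaming (the script names and the exact path to the streaming jar are assumptions that vary by installation).

```python
import sys
from itertools import groupby

def mapper(lines):
    """Map step: emit (word, 1) for every word (what a mapper.py would print)."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def reducer(pairs):
    """Reduce step: sum counts per word, assuming pairs arrive sorted by key
    (Hadoop's shuffle/sort guarantees this ordering between map and reduce)."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Local simulation of the map -> sort -> reduce pipeline on stdin.
    mapped = sorted(mapper(sys.stdin))
    for word, total in reducer(mapped):
        print(f"{word}\t{total}")

# On a cluster the same logic would be split into mapper.py / reducer.py and
# submitted with Hadoop Streaming, e.g. (jar path varies by installation):
#   hadoop jar hadoop-streaming.jar \
#     -input /data/raw -output /data/wordcount \
#     -mapper mapper.py -reducer reducer.py
```

Locally this can be tested by piping any text file into the script, e.g. `cat sample.txt | python wordcount.py` (file names are hypothetical).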
Description
Learn about the Big Data Value Chain, a process that generates insights from data, involving data acquisition, analysis, curation, storage, and usage.