Questions and Answers
What is the main purpose of Hadoop?
What is the importance of High Availability in clustered computing?
What is the benefit of Easy Scalability in clustered computing?
What is the first step in the Big Data Life Cycle with Hadoop?
What is the primary component of a computer system that clustered computing aims to provide high availability for?
What is the source of inspiration for Hadoop?
What is the main advantage of using Hadoop for big data processing?
What is the last step in the Big Data Life Cycle with Hadoop?
What is the primary benefit of using clustered computing for big data processing?
What is the primary component of a computer system that Hadoop is designed to work with?
Study Notes
Big Data Value Chain
- The Big Data Value Chain describes the information flow within a big data system, with the aim of generating value and useful insights from data.
- The Big Data Value Chain identifies the following key high-level activities:
- Data Acquisition
- Data Analysis
- Data Curation
- Data Storage
- Data Usage
Data Acquisition
- Data Acquisition is the process of gathering, filtering, and cleaning data before it is put in a data warehouse or any other storage solution on which data analysis can be carried out (a minimal sketch of this step follows the list below).
- The infrastructure required to support the acquisition of big data must provide:
- Low latency
- High transaction volumes
- Flexible and dynamic data structures
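
A minimal sketch of the acquisition step in Python, assuming the raw input is a newline-delimited JSON file; the file format, the `id` field, and the cleaning rules are illustrative assumptions, not part of the notes above.

```python
import json

def acquire(path):
    """Gather, filter, and clean raw events before loading them into storage.

    Assumes newline-delimited JSON (one event per line); the `id` field and
    the cleaning rules are illustrative only.
    """
    cleaned = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue                      # filter: skip blank lines
            try:
                record = json.loads(line)     # parse one raw event
            except json.JSONDecodeError:
                continue                      # filter: drop malformed input
            # clean: trim whitespace in string fields
            record = {k: v.strip() if isinstance(v, str) else v
                      for k, v in record.items()}
            if record.get("id"):              # filter: require an identifier
                cleaned.append(record)
    return cleaned

# The cleaned records would then be written to a warehouse or data lake.
```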
Data Analysis
- Data Analysis is concerned with making the acquired raw data amenable to use in decision-making, as well as in domain-specific applications.
- Data scientists need to:
- Be curious and result-oriented
- Have strong communication skills that allow them to explain highly technical results to their non-technical counterparts
- Have a strong quantitative background in statistics and linear algebra, as well as programming knowledge with a focus on data warehousing, mining, and modeling, in order to build and analyze algorithms
Data Vs Information
- Data can be described as unprocessed facts and figures.
- Data can be defined as a collection of facts, concepts, or instructions in a formalized manner.
- Data must be interpreted or processed, by a human or an electronic machine, to have true meaning.
- Data can be presented in the form of:
- Alphabets (A-Z, a-z)
- Digits (0-9)
- Special characters (+, -, /, *, =, etc.)
Information
- Information is the processed data on which decisions and actions are based.
- It is data that has been processed into a form that is meaningful to the recipient and is of real or perceived value in the recipient's current or prospective actions or decisions.
- Information is interpreted data; created from organized, structured, and processed data in a particular context.
Data Types
- From a data analytics point of view, there are three common data types or structures:
- Structured data
- Semi-structured data
- Unstructured data
Structured Data
- Structured data is data that adheres to a pre-defined Data Model and is therefore straightforward to analyze.
- Structured data conforms to a tabular format with a relationship between the different rows and columns.
- Common examples of structured data are Excel files or SQL databases.
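
To make the tabular format concrete, here is a small sketch using Python's built-in sqlite3 module; the table name, columns, and rows are made up for illustration.

```python
import sqlite3

# Structured data: every row follows the same pre-defined schema (data model).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)",
    [(1, "Abebe", "Addis Ababa"), (2, "Sara", "Bahir Dar"), (3, "Lensa", "Addis Ababa")],
)

# Because the schema is known in advance, analysis is straightforward:
for city, count in conn.execute("SELECT city, COUNT(*) FROM customers GROUP BY city"):
    print(city, count)
```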
Semi-Structured Data
- Semi-structured data is a form of structured data that does not obey the tabular structure of data models associated with relational databases or other forms of data tables.
- Semi-structured data contains tags or other markers to separate semantic elements within the data.
- Examples of semi-structured data include XML, JSON, etc.
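
A short sketch of semi-structured data in JSON, parsed with Python's json module; the sensor records are invented for illustration. Note that the two records do not share an identical schema, yet each field is tagged with a key.

```python
import json

# Semi-structured: keys/tags separate semantic elements, but records are not
# forced into one fixed tabular schema.
raw = """[
  {"id": 1, "name": "sensor-a", "readings": [21.5, 22.0]},
  {"id": 2, "name": "sensor-b", "location": {"lat": 9.03, "lon": 38.74}}
]"""

for record in json.loads(raw):
    # Fields vary per record, so access optional ones defensively.
    print(record["id"], record.get("location", "no location recorded"))
```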
Unstructured Data
- Unstructured data does not have a predefined data model and is not organized in a pre-defined manner.
- Unstructured information is typically text-heavy but may contain data such as dates, numbers, and facts as well.
- Unstructured data is difficult to understand using traditional programs as compared to data stored in structured databases.
- Common examples of unstructured data include audio files, video files, PDF and Word documents, and NoSQL databases.
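
A small example of why unstructured text needs extra work: the dates and numbers embedded in free text have to be pulled out explicitly (here with regular expressions); the sentence is invented for illustration.

```python
import re

# Unstructured: free text with no pre-defined data model.
note = "Order 4521 was shipped on 2023-11-02 and weighed 3.2 kg."

dates = re.findall(r"\d{4}-\d{2}-\d{2}", note)       # ISO-style dates
numbers = re.findall(r"\d+(?:\.\d+)?", note)         # integers and decimals
print(dates)    # ['2023-11-02']
print(numbers)  # ['4521', '2023', '11', '02', '3.2']
```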
Data Curation
- Data curation is the active management of data over its life cycle to ensure it meets the necessary data quality requirements for its effective usage.
- Data curation is performed by expert curators (data curators, scientific curators, or data annotators) who are responsible for improving the accessibility, quality, trustworthiness, discoverability, and reusability of data.
Data Storage
- Data Storage is the persistence and management of data in a scalable way that satisfies the needs of applications that require fast access to the data.
- The best solution for storing big data is a data lake, because it can support various data types; data lakes are typically based on Hadoop clusters, cloud object storage services, NoSQL databases, or other big data platforms.
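
A sketch of the kind of layout a data lake uses, with the local filesystem standing in for HDFS or a cloud object store; the folder names and the date-based partitioning scheme are assumptions for illustration, not a prescribed structure.

```python
import datetime
import json
from pathlib import Path

def store_raw(records, lake_root="datalake", dataset="events"):
    """Write raw records into a date-partitioned folder, mimicking the
    object/file layout of a data lake (backend-agnostic sketch)."""
    today = datetime.date.today().isoformat()
    partition = Path(lake_root) / dataset / f"ingest_date={today}"
    partition.mkdir(parents=True, exist_ok=True)
    out = partition / "part-00000.jsonl"
    with out.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return out

print(store_raw([{"id": 1, "value": 42}]))
# e.g. datalake/events/ingest_date=2024-05-01/part-00000.jsonl
```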
Data Usage
- Data usage in business decision-making can enhance competitiveness through the reduction of costs, increased added value, or any other parameter that can be measured against existing performance criteria.
- Data usage covers the data-driven business activities that need access to data, its analysis, and the tools needed to integrate the data analysis within the business activity.
Big Data Life Cycle with Hadoop
- The activities, or life cycle stages, involved in big data processing are as follows (a minimal sketch of the compute step appears after the list):
- Ingesting data into the system
- Persisting the data in storage
- Computing and analyzing data
- Visualizing the results
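
As a concrete, hedged example of the "computing and analyzing" step, the classic word-count job can be expressed as map and reduce functions in Python. The sketch below simulates the map, sort, and reduce pipeline locally; on a real cluster the same logic would be split into mapper and reducer scripts and submitted with Hadoop Streaming (the script names and the exact path to the streaming jar are assumptions that vary by installation).

```python
import sys
from itertools import groupby

def mapper(lines):
    """Map step: emit (word, 1) for every word (what a mapper.py would print)."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def reducer(pairs):
    """Reduce step: sum counts per word, assuming pairs arrive sorted by key
    (Hadoop's shuffle/sort guarantees this ordering between map and reduce)."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Local simulation of the map -> sort -> reduce pipeline on stdin.
    mapped = sorted(mapper(sys.stdin))
    for word, total in reducer(mapped):
        print(f"{word}\t{total}")

# On a cluster the same logic would be split into mapper.py / reducer.py and
# submitted with Hadoop Streaming, e.g. (jar path varies by installation):
#   hadoop jar hadoop-streaming.jar \
#     -input /data/raw -output /data/wordcount \
#     -mapper mapper.py -reducer reducer.py
```

Locally this can be tested by piping any text file into the script, e.g. `cat sample.txt | python wordcount.py` (file names are hypothetical).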
Description
Learn about the Big Data Value Chain, a process that generates insights from data, involving data acquisition, analysis, curation, storage, and usage.