Big Data Fundamentals

Questions and Answers

Which statement correctly describes the role of big data analytics in business?

  • It solely automates data preprocessing tasks.
  • It mainly concerns itself with data visualization techniques.
  • It aids in making informed decisions to improve financial and operational outcomes. (correct)
  • It primarily focuses on data storage solutions.

In the context of big data, what distinguishes semi-structured data from structured data?

  • Semi-structured data lacks any identifiable markers for semantic elements.
  • Semi-structured data is exclusively composed of images, audio, and video files.
  • Semi-structured data contains tags or markers to delineate semantic elements but doesn't conform to rigid database structures. (correct)
  • Semi-structured data strictly adheres to predefined relational database schemas.

Which of the following is a characteristic of unstructured data?

  • It always maintains a standard hierarchy, facilitating easy data retrieval.
  • It is characterized by a lack of a predefined format, varying in content and structure. (correct)
  • It neatly fits into predefined rows and columns for easy analysis.
  • It is easily managed and accessed by humans and computers.

Why is 'veracity' considered a crucial 'V' in the context of big data?

It emphasizes the importance of data accuracy and consistency due to its vulnerability to inconsistencies and uncertainty.

What is the primary goal of the 'value' characteristic within the concept of the 5Vs of big data?

To highlight the need to process large datasets to extract meaningful insights and knowledge.

How do transportation sectors utilize live tracking reports derived from big data analytics?

To track vehicles, manage customer requests, process payments, provide emergency warnings, and assess revenue.

In what way has big data analytics played a crucial role during crisis situations, such as the earthquake in Nepal in April 2015?

By facilitating rescue and relief operations through analysis and coordination.

In the context of data analytics, why is the life cycle often depicted as an iterative process?

To allow for continuous learning and refinement as new information is discovered, potentially leading back to earlier phases.

Which of the following activities is part of the initial data discovery phase in the data analytics life cycle?

Identifying needed data sources and assessing their accessibility.

Why is data preparation considered one of the most labor-intensive phases in the data analytics life cycle?

Because it often involves extensive data cleaning, combining datasets, and conditioning data for analysis, consuming a significant amount of project time.

What is the purpose of an analytical sandbox in the data preparation phase?

To create a separate workspace where the team can manipulate and analyze data without affecting the production environment.

The ETL process involves moving financial data from an organization's main database to a computational sandbox. What is one reason for doing this?

To minimize risk to the integrity and performance of the live production database.

What does the 'transformation' step in ETL primarily ensure regarding data?

That the data is correct, complete, coherent, and unambiguous for further use.

Why might data scientists prefer to load raw data into a data warehouse rather than immediately transforming it?

To preserve the original data for future analysis, preventing the accidental removal of potentially important outliers or patterns.

What is the primary function of Application Programming Interfaces (APIs) in the context of data preparation?

To facilitate access to large amounts of data from various websites to support projects.

How does Hadoop assist data scientists in data preparation?

By enabling the exploration of data complexity, even without fully understanding it, and allowing storage without needing to know the specifics.

During the model planning phase of data analytics, what should a team consider regarding analytical techniques?

Whether the technique meets the project's objectives, and whether a single model suffices or a series of techniques is needed.

What role do the types of input and output variables play in the model selection substep?

They are a vital consideration in determining the appropriate model(s) for a project, and they guide the selection of an analytical technique or a shortlist of techniques to potentially use.

What is one primary use of SAS modules in the context of big data analytics?

To profile customers and prospects through web, social media, and marketing analytics.

In model building, what best describes the role of the testing data?

It evaluates the performance and predictive accuracy of the model after it has been trained.

Which of the following questions is most relevant to consider during the model building phase?

Is the model sufficiently accurate to meet the goal?

What should the data analytics team do if the model fails to solve the research problem?

The team must revisit the data analytics life cycle and the model building process.

Which of the following is an advantage of using Apache Spark?

It offers faster processing speeds than MapReduce.

In the context of the data analytics life cycle, what is addressed during the 'Communicate Results' phase?

The project's findings and the business value of the model are presented to stakeholders.

Where is a great deal of time spent in the data analytics life cycle?

Data discovery and data preparation.

What is the role of text transformation in text mining?

Monitoring and controlling the capitalization of the text.

Which phrase is another name for the Feature Selection step of text mining?

Variable Selection.

What is one difference between Data Mining and Text Mining?

Data Mining is a statistical technique for processing raw data in a structured form, whereas Text Mining deals with unstructured forms.

What is the purpose of Information Retrieval?

To convert unstructured text into a structured form and obtain important information.
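
The retrieval step described here can be sketched in plain Python. This is a minimal, hypothetical example; the stopword list and helper names are invented for illustration and are not from the lesson:

```python
import re
from collections import Counter

def transform(text):
    # Text transformation: control the capitalization of the text
    return text.lower()

def to_structured(text, stopwords=frozenset({"the", "a", "is", "of"})):
    # Information retrieval: convert unstructured text into a
    # structured form (term -> frequency), dropping stopwords
    tokens = re.findall(r"[a-z]+", transform(text))
    return Counter(t for t in tokens if t not in stopwords)

features = to_structured("The data is the key: Data drives decisions.")
print(features.most_common(2))
```

Real text-mining pipelines add stemming, feature selection, and mining steps on top of this structured representation.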

What are the R packages for text mining framework applications and for network analysis?

The R packages are tm and igraph, respectively.

What are the components of Natural Language Processing (NLP)?

Natural Language Understanding (NLU) and Natural Language Generation (NLG).

What is the Lexicon in Lexical Analysis?

The words and phrases in a language.

In the context of data science, what is the primary benefit of using Python?

The availability of numerous libraries for data science or data analytics.

How are variable names improved?

By separating words using an underscore.
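
As a hypothetical illustration of this naming convention (the variable names below are invented for the example):

```python
# Improving variable names: words separated using an underscore
# (Python's snake_case convention).
nd = 30                 # unclear abbreviation
numdaysinmonth = 30     # words run together, hard to read
num_days_in_month = 30  # improved: words separated by underscores
print(num_days_in_month)
```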

What is meant by lists being mutable in Python?

Lists are mutable and thus can be altered by adding new items, deleting some items, or modifying the existing items.
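
A short sketch of list mutability, with invented sample data:

```python
# Lists are mutable: items can be added, deleted, or modified in place.
languages = ["python", "r"]
languages.append("scala")    # add a new item
languages[0] = "Python"      # modify an existing item
del languages[1]             # delete an item ("r")
print(languages)
```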

How can a tuple be created?

As a singleton tuple, e.g. (a,). Note: even when creating a single-valued tuple, a comma must be included after the value.
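
A minimal demonstration of the trailing-comma rule, using a string literal in place of the lesson's `a`:

```python
single = ("a",)      # trailing comma: a one-element tuple
not_a_tuple = ("a")  # no comma: just a parenthesized string
print(type(single).__name__, type(not_a_tuple).__name__)
```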

What makes a dictionary unique in Python as compared to a list?

A dictionary has unique keys, but the values in the dictionary can be repeated.
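
A small, hypothetical example of unique keys with repeated values:

```python
# Keys are unique; values may repeat.
grades = {"ann": "A", "ben": "A", "cara": "B"}  # "A" repeats as a value
grades["ann"] = "B"  # reusing a key overwrites; keys remain unique
print(len(grades), grades["ann"])
```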

What does the use of NumPy allow?

NumPy arrays support vectorized operations, which Python lists do not; NumPy can also be used for linear algebra and matrices, and it extends the functionality of arrays.
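
A brief sketch of what NumPy adds, assuming NumPy is installed (the array values are invented):

```python
import numpy as np

prices = np.array([10.0, 20.0, 30.0])
taxed = prices * 1.1            # vectorized: no explicit Python loop
m = np.array([[1, 2], [3, 4]])
product = m @ m                 # linear algebra: matrix multiplication
print(taxed)
print(product)
```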

What does dimensionality reduction aim to do as the number of dimensions increases?

Dimensionality reduction usually aims to retrieve a few low-dimensional representations of the data that, in a way, maintain the relevant qualities of the entire dataset.

According to the text, into what categories is supervised learning further subdivided?

Supervised learning is further subdivided into classification and regression.

Scikit-learn is used to implement many models. Which of the following are features of its implementation?

It features uniform and slimmed-down application programming interfaces (APIs) with practical and detailed documentation; it offers many common algorithms, such as clustering, dimensionality reduction, classification, and regression; and it makes the transition to a new model or algorithm straightforward once you are familiar with the basic usage and syntax of scikit-learn on one model form.
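
The uniform fit/predict interface can be seen in a tiny sketch, assuming scikit-learn is installed (the toy dataset is invented):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Invented toy data: one feature, two classes
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]

# The same fit/predict calls work for both estimators,
# so swapping one model for another is straightforward.
for Model in (LogisticRegression, DecisionTreeClassifier):
    clf = Model().fit(X, y)
    print(Model.__name__, clf.predict([[0], [3]]))
```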

Flashcards

What is Big Data?

An umbrella term for the techniques and technologies required to collect, aggregate, process, and gain insights from massive datasets.

What is Volume in Big Data?

Quantity of data

What is Velocity in Big Data?

The speed at which data is generated

What is Variety in Big Data?

Nature and type of data

What is Veracity in Big Data?

Addresses inconsistencies and uncertainties in the data

What is Value in Big Data?

The useful insights or knowledge that can be extracted from big data

What is Fraud Detection?

Focuses on identifying and analyzing potential fraud cases, security breaches, and unauthorized transactions within financial systems.

What is Live Tracking?

Involves the use of real-time location data to track vehicles, manage fleets, optimize routes, and improve delivery times.

What is Sales Forecasting?

Used to predict future sales trends, forecast demand, and make informed decisions about inventory management, pricing strategies, and marketing campaigns.

What is Live Data Handling?

Focuses on leveraging real-time data to provide insights, personalize content, optimize user experiences, and drive engagement.

What is Alert Generation?

Used to create automated alerts based on predefined rules or thresholds, enabling organizations to monitor critical events, identify anomalies, and respond quickly to potential issues.

What are Google Analytics Reports?

Reports that provide insights into website traffic, user behavior, engagement metrics, and conversion rates, helping organizations optimize website performance, improve user experience, and drive business outcomes.

What is the Data Analytics Life Cycle?

A circular process consisting of six basic phases: data discovery, data preparation, model planning, model building, communicate results, and operationalization.

What is Data Discovery?

The first step in the life cycle of data analytics: analyzing the problem and establishing meaning and understanding.

What is Data Preparation?

Involves cleaning, sampling, combining, and aggregating datasets or elements for training and testing.

What is an Analytical Sandbox?

A workspace in which the data science team can work with the data for the duration of the project and conduct analytics.

What is Conservation?

The practice of taking raw data as it is and loading it into the warehouse, preserving the original for later analysis.

What is ETL?

Extract, Transform, and Load

What is ELT?

Extract, Load, and Transform

What is Hadoop?

Allows data scientists to explore the complexities that exist in the data, even if they cannot make sense of it.

What is Alpine Miner?

Includes a graphical user interface (GUI) to develop analytical workflows, including data manipulation and a sequence of analytical events, on PostgreSQL and other big data sources.

What is OpenRefine?

A standalone open-source tool for data cleanup and transformation to other formats, called data wrangling.

What is Model Planning?

The phase in which a model is selected based on the project type.

What is Model Building?

The phase in which the selected analytical technique is applied to a set of training data.

What is SAS?

A data analysis tool designed specifically for statistical operations.

What is Apache Spark?

A versatile analytics engine and one of the most widely used data analysis platforms.

What is BigML?

Structured software that uses cloud computing to meet market requirements.

What is MATLAB?

Closed-source software providing matrix functions, algorithmic execution, and simulation of statistical results.

What is Jupyter?

A framework based on IPython, designed to help developers create open-source software.

What is Scikit?

A Python library used to implement machine learning algorithms.

What is TensorFlow?

An open-source and ever-evolving toolkit famous for its performance and high computational capabilities.

What is Weka?

A machine learning tool written in Java.

What is Communicate Results?

The phase in which the project's findings and the business value of the model are communicated to sponsors and stakeholders.

What is Operationalization?

The phase that implements the model in the production environment.

What is Text Mining?

The process of gaining insights and knowledge from huge unstructured data sources.

What is Text Transformation?

The step that monitors and controls the capitalization of the text.

What is Data Preprocessing?

Used in the field of text mining to derive valuable information and knowledge from unstructured text data.

What is Feature Selection?

A significant part of data mining, defined as the process of reducing the number of input variables; also known as variable selection.

What is Information Retrieval?

An automatic process that extracts organized data, important words, attributes, and relationships between entities from loosely organized and unstructured data.

What is Data Mining?

Finding hidden patterns in the extracted data.

What is Natural Language Processing (NLP)?

The study of human language, with the aim of reading, decoding, and comprehending human languages.

Study Notes

  • Big data refers to nontraditional techniques and tech for collecting, aggregating, processing, and gaining insights from massive datasets
  • The big data process involves data acquisition, preprocessing, mining, prediction, and visualization
  • Skilled analysts are needed to effectively leverage big data; basic big data analytics include data preparation, model planning, and model building

Analytics for Data Science

  • Big data's components consist of structured, semi-structured, and unstructured data
  • Structured data has a well-defined structure in columns and rows for easy management and accessibility
  • Semi-structured data does not adhere to formal structures but uses tags to distinguish semantic elements
  • Unstructured data lacks a defined structure or standard hierarchy
  • Big data has five characteristics known as the 5 V's: volume, velocity, variety, veracity, and value

The 5 V's of Big Data

  • Volume refers to the huge amount of information; the size of the data indicates whether it should be classified as "big"
  • Internet traffic in 2016 measured 6.2 exabytes/month; expected to reach 40,000 exabytes by 2020
  • Velocity refers to the speed of data generation, from machines, networks, social media, and mobile phones
  • More than 3.5 billion searches occur on Google daily, and Facebook users are rising by about 22% yearly
  • Variety refers to the nature of the data, which can be structured, semi-structured, or unstructured; data comes from both inside and outside enterprises
  • Veracity signifies data's vulnerability to inconsistencies and uncertainty because data are collected from various sources in huge amounts and monitoring data quality is challenging
  • Value means data is processed to extract value and knowledge; the insights derived are what matter

Examples of Data Analytics

  • Fraud detection reports identify fraudulent transactions and unauthorized account access
  • Live tracking reports are used to track cars, handle customer requests and payment processing, and determine needs and revenue
  • Sales forecast and plan analysis are made to assess customers' sales, profits, needs, and evaluate future targets
  • Live data are handled to provide stock market data and other real-time reports
  • Alerts are generated based on events, such as data center alerts; these notifications are an example of big data analytics in action
  • Google Analytics reports get user visit counts, user location, and client computer specs

Data Analytics Life Cycle

  • The data analytics life cycle is required for big data problems and data science applications
  • Consists of six basic phases
  • The process is iterative
  • Data discovery involves analyzing the problem, finding data sources, and forming initial hypotheses
  • Data preparation includes cleaning, sampling, combining, and aggregating data, often requiring extensive time and labor
  • Model planning
  • Model building
  • Communicate results
  • Operationalization
  • An analytical sandbox is created for data preparation, aggregating relevant data
  • A copy of the financial data is accessed in the computational sandbox rather than working on the production version of the organization's main database, since the production database is closely managed and required for financial reporting
  • The ETL (Extract, Transform, Load) process is a type of data integration
  • Analytical sandbox size should be at least 5–10 times the original datasets
  • ETL involves extraction, transformation, and loading

ETL

  • Extraction involves being aware of ODBC/JDBC drivers, understanding data structures, and handling resources; CDC (change data capture) extracts data that has changed over time
  • Transformation cleans and integrates data for further use
  • Loading loads data into target structure for further processing
  • Data scientists' desire to keep raw data in warehouses leads to ELT, in which data is loaded before it is transformed
  • Application Programming Interfaces (APIs) are popular for accessing data
  • Hadoop allows data scientists to explore data complexities and store data without needing to grasp all specifics
  • Alpine Miner develops analytical workflows with a GUI for data manipulation and analytics
  • OpenRefine is a standalone open-source tool for cleaning, transforming, and wrangling data
  • Trifacta Wrangler enables analysts to get data from various sources to prepare for analytical or visualization tools
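
The extract, transform, and load steps above can be sketched in Python; the rows, cleaning rules, and in-memory "warehouse" here are hypothetical stand-ins for real databases and drivers:

```python
# Hypothetical source rows, standing in for a production database.
raw_rows = [
    {"id": 1, "amount": " 100 ", "currency": "usd"},
    {"id": 2, "amount": "250", "currency": None},
]

def extract():
    # In practice: read via an ODBC/JDBC connection to the source system.
    return list(raw_rows)

def transform(rows):
    # Make the data correct, complete, coherent, and unambiguous.
    return [
        {
            "id": r["id"],
            "amount": float(r["amount"].strip()),
            "currency": (r["currency"] or "usd").upper(),
        }
        for r in rows
    ]

warehouse = []  # stand-in for the target structure

def load(rows):
    warehouse.extend(rows)

load(transform(extract()))
print(warehouse)
```

Swapping the order of `transform` and `load` would give the ELT variant, in which raw data lands in the warehouse first and is transformed there.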

Model Planning

  • The team selects an analytical model and appropriate variables
  • Key consideration is assessing the structure of datasets and ensuring the analytical technique meets objectives
  • Data exploration aims to know relationships among variables and variable reduction addresses significant correlations
  • Model selection chooses an analytical technique based on project goals and revisits analytic challenges
  • R manipulates data easily, runs on multiple platforms, has many packages
  • Tableau Public links any data source, generates visualizations, allows file sharing
  • SAS accesses/handles data, has customer intelligence products/marketing analytics, can predict, monitor, and refine behaviors
