Big Data Analytics Fundamentals

Questions and Answers

Which characteristic is NOT typically associated with structured data?

  • Defined in fixed fields.
  • Conforms to a data model.
  • Variable schema. (correct)
  • Organized in rows and columns.

Which statement accurately describes semi-structured data?

  • Contains tags or metadata to define hierarchy. (correct)
  • Easily queried using SQL.
  • Cannot be stored in rows and columns. (correct)
  • Strictly adheres to a relational database model.

Which is a characteristic of unstructured data?

  • Easy integration with relational databases.
  • Well-defined schema.
  • Limited data analysis possibilities.
  • Non-relational model without a specific schema. (correct)

Which storage solution is most suitable for unstructured data requiring high scalability and availability?

  • Distributed file system with data sharding and replication. (correct)

How does a distributed file system (DFS) enhance data accessibility for programmers?

  • By allowing file access from any network or computer. (correct)

Which of the following is a main benefit of using a Distributed File System (DFS)?

  • Automatic data backup and recovery, ensuring data reliability. (correct)

With which type of Distributed File System (DFS) are Google FS and Hadoop FS associated?

  • Cloud File System. (correct)

What inspired the creation of cloud file systems like Hadoop, Grid, Amazon, and Azure file systems?

  • The Google File System (GFS). (correct)

What is the average file size optimized for in the Google File System (GFS)?

  • Gigabytes. (correct)

In Google File System (GFS), what is the size of each chunk into which files are divided?

  • 64 MB. (correct)

What happens to updates in metadata within the Google File System (GFS) architecture?

  • They are stored in a log at the master storage. (correct)

What is the role of chunk-servers in the Google File System (GFS)?

  • To store the actual file data in chunks. (correct)

What happens if a chunk-server in Google File System (GFS) fails to send a heartbeat signal to the master server?

  • The master server marks the chunk-server as potentially unavailable. (correct)

Which of the following describes the 'consistent' state in the Google File System (GFS) consistency model?

  • All redundant chunks contain the same data after a write. (correct)

In Google File System (GFS), what is the role of the primary replica during a write operation?

  • To determine the order of write operations and ensure consistency across all secondary replicas. (correct)

What happens in Google File System (GFS) if a chunk misses a serial number during write operations?

  • The chunk is marked as an orphan, and the master completes the missing data in the next heartbeat. (correct)

How does distributed computing make a network of computers appear to end-users?

  • As a single, powerful computer. (correct)

Which of the following is a key advantage of distributed computing?

  • Improved scalability. (correct)

What characterizes the client-server architecture in distributed computing?

  • Servers provide services and manage their own databases. (correct)

What is a key limitation of client-server architecture in distributed computing?

  • Server bottlenecks under heavy request loads. (correct)

In a three-tier architecture, what is the primary responsibility of the database tier?

  • Retrieving and ensuring the consistency of data. (correct)

What characterizes N-tier architecture in distributed systems?

  • Client-server systems communicating to solve a problem. (correct)

What is a defining characteristic of peer-to-peer architecture?

  • Each node has equal responsibilities. (correct)

What is an example of content sharing that commonly uses peer-to-peer architecture?

  • File streaming services. (correct)

How does parallel computing differ in memory access compared to typical distributed computing?

  • Parallel computing provides shared memory access for all processors. (correct)

Which type of computing emphasizes performance and coordination across multiple networks?

  • Grid computing. (correct)

What is a core difference in coupling between grid computing and other distributed systems?

  • Grids are loosely coupled externally while tightly coupled internally. (correct)

What is the defining characteristic of cloud computing?

  • Delivering hosted services over the internet. (correct)

Which programming paradigm inspired the creation of MapReduce?

  • Lisp. (correct)

Which services have adopted MapReduce as a key technology?

  • Hadoop, Mongo, AWS, and Azure. (correct)

In the MapReduce programming model, what is the role of the 'reduce' function?

  • To merge intermediate values associated with the same key. (correct)

If using MapReduce to count words in a set of distributed documents, what is the responsibility of the map function?

  • Produce word-occurrence pairs to intermediate storage. (correct)

How does the MapReduce library initially divide the input files when starting the processing on a cluster of machines?

  • Into segments of 64 megabytes each. (correct)

In MapReduce, what step follows after a worker is assigned a map task and reads the corresponding input split?

  • The worker passes each key/value pair to the user-defined Map function. (correct)

What action must a MapReduce worker complete when it has read all intermediate data?

  • Sort the data by the intermediate keys. (correct)

How does MapReduce handle a worker failure during a computation task?

  • The master redistributes the failed worker's task to another worker. (correct)

Why is a combiner function used in MapReduce?

  • To reduce the amount of intermediate data by processing similar keys at map workers. (correct)

Which default partitioning function is used in MapReduce?

  • HASH(key) mod R. (correct)

What happens if the master task dies in MapReduce?

  • A new master is started from the last checkpointed state. (correct)

Flashcards

Data Model Types

A database system can be divided according to data model into three types: Structured, Semi-structured, and Unstructured.

Structured Data

Data with an identifiable structure, presented in rows and columns, and organized so the definition, format, and meaning are explicitly understood.

SQL

A query model for structured data using SQL, which is efficient at handling complex joins.

Structured Data Tools

Technologies such as MySQL, ORACLE, SQL Server, and PostgreSQL.

Semi-structured Data

Data that does not reside in a relational database but has organizational properties, using tags or markers to enforce hierarchies of records and fields within the data.

Semi-structured Data Formats

Specific data formats such as XML and JSON.

Semi-Structured Data Tools

Technologies such as MongoDB, Cassandra, BigTable, and HBase.

Unstructured Data

Data that does not conform to any data model and has no easily identifiable structure.

Unstructured Data Tools

Hadoop File System (HFS), Google File System (GFS), GridFS.

Distributed File System (DFS)

A file system distributed across multiple file servers or locations, allowing access to files from any network or computer as if they were local.

DFS Transparency

Users are unaware of the physical location of files.

DFS Scalability

Ability to grow without limit.

DFS Reliability

No fear of losing data; backups are easy.

DFS Availability

Data remains available even with hardware failure.

DFS Accessibility

Data is relocated to serve users at their locations.

DFS Integrity

File integrity is maintained, even with concurrent mutations.

DFS Remote Access

Users can access files remotely.

DFS Performance

Improved access time and network efficiency.

RAID DFS

A distributed file system across multiple storage devices in the same computer, e.g. RAID.

Network File System DFS

A file system allowing users to access multiple remote systems via a network interface, e.g. NFS, Netware, SMB.

Cloud File System DFS

Distributed file systems across multiple nodes, e.g. Google FS, Hadoop FS, Grid FS.

Google File System (GFS)

A distributed file system created by Google that has successfully met its storage needs.

GFS: Commodity Hardware

The system is built of many inexpensive commodity machines that often fail.

GFS: Large Files

The system is optimized for large files, multi-gigabyte on average.

GFS Chunk

Files are divided into 64 MB chunks, distributed across multiple chunk-servers; this is also known as sharding.

GFS Metadata

Metadata (files, chunk-servers, and chunks) is stored in the master server.

GFS Heartbeat

Chunk-servers must send a heartbeat signal to the master to confirm liveness and to update chunk information.

GFS Consistent Writes

Write attempts must be designed to keep data consistent and defined.

GFS Write Mechanism

Writes are performed with data consistency in mind; each write is uniquely serially identified to prevent data from being orphaned.

GFS Orphans

Orphaned data occurs if writes arrive out of order; the master server maintains the correct replication level.

Distributed Computing

A method of making multiple computers work together to solve a common problem, appearing as a single computer that provides large-scale resources.

Distributed Computing: Scalability

Add more nodes to increase computing capability.

Distributed Computing: Availability

The system continues operating even with hardware failure.

Distributed Computing: Consistency

Data is shared or duplicated across nodes and remains consistent throughout the system.

Client-Server Architecture

Consists of multiple clients and servers. Can cause communication bottlenecks when several machines make requests simultaneously.

Three-Tier Architecture

Consists of a client tier, an application tier, and a database tier.

N-Tier Architecture

N-tier models include several different client-server systems communicating with each other to solve the same problem.

Peer-to-Peer Architecture

Assigns equal responsibilities to all networked computers, with no separation between client and server.

Parallel Computing

A tightly coupled form of distributed computing.

Grid Computing

Emphasizes performance and coordination between several networks.

Cloud Computing

A general term for anything that involves delivering hosted services over the internet: infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS).

Study Notes

Lecture 1: Fundamentals of Big-Data Analytics

  • This lecture covers big data concepts.
  • It covers the data model, distributed file system, and distributed computing.

Data Models

  • Database systems are divided into structured, semi-structured, and unstructured types.

Structured Data

  • Structured data has an identifiable structure conforming to a specific data model.
  • Presented in rows and columns, is well-organized for definition, format, and meaning.
  • Data resides in fixed fields within files or records.
  • Elements can be efficiently analyzed and processed.
  • RDBMS (relational database management systems) are an example.
  • Query Model uses Structured Query Language (SQL) and is efficient with complex joins.
  • Data Analysis and Visualization is easy using any programming language, but limited to available relations.
  • Technologies used include MySQL, ORACLE, SQL Server, PostgreSQL, MySQL Cluster, and Oracle Clusterware.

Semi-Structured Data

  • Does not conform to a relational database, but contains elements of structure, though less rigid.
  • Contains metadata and tags for grouping and describing storage.
  • Organized hierarchically.
  • Automating, managing, and accessing using programs is difficult.
  • XML, JSON, emails, zipped/web files, and binary executables all are forms of semi-structured data.
  • The Data Model type is semi-relational with a variable schema, using specific data formats like XML and JSON.
  • Query Model supports specific search mechanisms.
  • Data Analysis and Visualization requires intermediate data processing, but allows for more analysis options.
  • It uses distributed file systems that employ data sharding and replication for scaling data and ensuring high availability.
  • Technologies and tools include MongoDB, Cassandra, BigTable, and HBase.

Unstructured Data

  • Unstructured data does not conform to any model and lacks easily identifiable structure or organization.
  • This data type is incompatible with traditional database structures, having no predefined rules or formats.
  • Videos, reports, surveys, Word documents, images, and memos all fall under this type.
  • Its Data Model is non-relational, lacking any fixed schema; examples include Word documents, video, and images.
  • Query Model is limited, such as only full-text search for documents.
  • Data Analysis and Visualization also needs advanced processing, like natural language processing or image and video processing.
  • Hadoop File System (HFS), Google File System (GFS), and GridFS are technologies and tools for scaling data and ensuring high availability.

Distributed File System (DFS)

  • DFS spans multiple file servers or locations.
  • Allows programs to access files as if they were local, from any network or computer.

DFS Benefits:

  • Transparency: Users do not need to know the physical location of the files
  • Scalability: Ability to grow without limit
  • Reliability: No fear of losing data, backup is not a concern
  • Availability: Data that remains accessible even with partial hardware failure
  • Accessibility: Data can be relocated to users at their locations
  • Integrity: Files integrity is maintained, even with mutations.
  • Remote Access: Allows users to access files from remote locations.
  • Performance: Improved access time and network efficiency.

DFS Types include:

  • RAID: Distributed file system across multiple storage devices within the same computer.
  • Network File System: Enables users to access remote file systems through a network interface, e.g. NFS, Netware, SMB.
  • Cloud File System: Distributed file systems across multiple nodes, e.g. Google FS, Hadoop FS, Grid FS.

Google File System (GFS)

  • A research paper, published by Google in 2003, describes its distributed file system.
  • GFS met Google's storage needs, providing the infrastructure for generating and processing the large datasets required by its research and development efforts.
  • The largest cluster provided hundreds of terabytes of storage across thousands of disks on over a thousand machines accessed by hundreds of clients.
  • GFS has inspired other cloud file systems, including Hadoop, Grid, Amazon, and Azure file systems.

GFS Assumptions:

  • Is built of inexpensive commodity hardware prone to failure.
  • Optimized for large, multi-gigabyte files on average.
  • Uses large streaming reads more often than small random reads.
  • Performs writes as continuous appends rather than small updates at random locations.
  • Supports concurrent read/write access while keeping data integrity.
  • Values sustained bandwidth over low latency.

GFS Architecture:

  • Files are divided into 64 MB chunks.
  • Chunks are distributed along multiple chunk-servers through sharding (chunk=shard).
  • Replication is used according to a replication factor (RF), giving more redundancy and durability.
  • The master server stores metadata.
  • Clients get file chunks from the master server, then work directly with chunk-servers.
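The chunk layout can be illustrated with a small sketch (the helper names, file name, and server IDs are hypothetical; the real GFS client library differs): a byte offset maps to a chunk index, and the master's metadata maps each (file, chunk index) pair to its replica locations.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, as in GFS

def chunk_index(offset: int) -> int:
    """Map a byte offset within a file to its chunk index."""
    return offset // CHUNK_SIZE

# Hypothetical master metadata: (filename, chunk index) -> replica locations.
# With replication factor RF = 3, each chunk lives on three chunk-servers.
metadata = {
    ("logs/web.log", 0): ["cs-1", "cs-4", "cs-7"],
    ("logs/web.log", 1): ["cs-2", "cs-5", "cs-8"],
}

def locate(filename: str, offset: int) -> list:
    """What the master returns: the chunk-servers holding the chunk at `offset`."""
    return metadata[(filename, chunk_index(offset))]

print(chunk_index(70 * 1024 * 1024))              # offset 70 MB falls in chunk 1
print(locate("logs/web.log", 70 * 1024 * 1024))   # its three replica locations
```

After this lookup the client talks to the chunk-servers directly, keeping the master off the data path.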

GFS: Master Server

  • Stores metadata in memory for quick lookup and update.
  • Logs all metadata updates at master storage.
  • Regularly stores metadata snapshots using B-Tree indices for faster reload.
  • Uses the logs to complete the metadata after the last snapshot is reloaded, in case of disaster.
  • Shadow masters are used for disaster recovery, and old logs and snapshots can be stored in other locations.

GFS: Chunk-Servers

  • Replicas are present for every chunk.
  • When a client requests a file at a certain offset, the master replies with the set of redundant chunk locations and specifies a primary chunk.
  • The nearest chunk-server is chosen for fast access.
  • Chunk-servers must follow a mechanism that keeps data consistent and defined, even under concurrency.
  • Liveness is ensured and chunk information is updated through heartbeat signals sent to the master.

GFS: Consistency Model

  • Writes should keep data consistent and defined.
  • Consistent: Redundant chunks must have the same data after writes.
  • Defined: Chunks must contain the last written data; after concurrent updates of multiple chunks, all redundant chunks should be consistent.

GFS: Write Mechanism

  • The primary replica assigns a serial number to each write attempt.
  • All secondary replicas perform the write attempts in the same serial order to preserve consistency.
  • If a chunk misses a serial number, it becomes an orphan and is completed in the next heartbeat to maintain the desired replication level.
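The serial-numbering idea can be sketched as follows (a simplified illustration with hypothetical class names, not GFS's actual protocol): the primary stamps each write with a serial number, and every replica applies writes strictly in serial order, so all replicas converge to the same state even when writes arrive out of order.

```python
import itertools

class Primary:
    """Assigns a serial number to each write attempt."""
    def __init__(self):
        self._serial = itertools.count()
    def stamp(self, data: bytes):
        return next(self._serial), data

class Replica:
    """Applies writes strictly in serial order, buffering out-of-order arrivals."""
    def __init__(self):
        self.log = []        # applied writes, in order
        self._pending = {}   # serial -> data that arrived early
        self._next = 0       # next serial expected
    def receive(self, serial: int, data: bytes):
        self._pending[serial] = data
        while self._next in self._pending:   # apply any contiguous run
            self.log.append(self._pending.pop(self._next))
            self._next += 1

primary = Primary()
w0, w1, w2 = (primary.stamp(d) for d in (b"a", b"b", b"c"))
r = Replica()
for write in (w2, w0, w1):   # writes may arrive out of order...
    r.receive(*write)
print(r.log)                 # ...but are applied in serial order: [b'a', b'b', b'c']
```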

Distributed Computing

  • Involves multiple computers working together to solve a common problem, so that the network appears as a single powerful computer.
  • It can encrypt large data volumes or render high-quality 3D video animation.
  • Distributed systems, programming, and algorithms relate to it.

Advantages of Distributed Computing:

  • Scalability: Add nodes to increase computing capability
  • Availability: Continued operation even if some nodes fail
  • Consistency: Data shared across nodes stays consistent
  • Transparency: Logical separation between the user and the physical devices
  • Efficiency: Faster performance with optimal resource use

Distributed Computing Architectures:

  • Client-Server Architecture
  • Three-Tier Architecture
  • N-Tier architecture
  • Peer-to-Peer Architecture

Client-Server Architecture

  • It causes communication bottlenecks when several machines make requests simultaneously.
  • Consists of a set of clients, with limited capabilities, and servers, which perform specific services and manage databases.

Three-Tier Architecture

  • Workload distribution is generally better than in client-server architecture; however, each computation task is still performed at a single server.
  • Three tiers are client, application which contains logic, and database which is in charge of data retrieval and consistency.

N-Tier Architecture

  • Modern systems use an n-tier architecture, integrating enterprise applications.
  • Client-server systems communicate with each other in order to solve the same problem.

Peer-to-Peer Architecture

  • There is no separation between client and server, and any computer can perform all responsibilities.
  • Peer-to-peer distributed systems give equal responsibilities to all networked computers.
  • Its use has increased for content sharing, file streaming, and blockchain networks.

Distributed Computing Types include:

  • Parallel Computing
  • Grid Computing
  • Cloud Computing

Parallel Computing

  • A tightly coupled form of distributed computing.
  • Processors use shared memory to exchange information.
  • In typical distributed computing, by contrast, each node has private memory and nodes exchange information via message passing.

Grid Computing

  • A highly scaled form of distributed computing that emphasizes performance and coordination between several networks.
  • Acts like a tightly coupled computing system internally.
  • Is more loosely coupled externally with each grid network performing individual functions.

Cloud Computing

  • Involves delivering hosted services over the internet.
  • Divided into infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS).

Distributed Computing Technologies:

  • Map Reduce
  • Spark

Map Reduce

  • Inspired by the map and reduce primitives present in Lisp.
  • Introduced by Google in 2004.
  • Is a programming model used for processing and generating large data sets through distributed file system infrastructure.
  • Hundreds of MapReduce jobs run daily on Google's clusters.
  • Hadoop, Mongo, AWS, and Azure use MR.

Map Reduce, Idea

  • Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs.
  • A reduce function merges all intermediate values associated with the same intermediate key.
  • Example: counting each word across a set of distributed documents.
  • The map produces word-occurrence pairs to intermediate storage; the reduce calculates each word's total count.
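The word-count idea above can be sketched in a few lines of Python (a single-process simulation of the model, not an actual distributed run):

```python
from collections import defaultdict

def map_fn(doc_name, text):
    """Map: emit a <word, 1> pair for every word occurrence."""
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: merge all intermediate values for the same key."""
    return word, sum(counts)

documents = {"d1": "big data big ideas", "d2": "big data"}

# Shuffle/group intermediate pairs by key, then reduce each group.
groups = defaultdict(list)
for name, text in documents.items():
    for word, one in map_fn(name, text):
        groups[word].append(one)

result = dict(reduce_fn(w, c) for w, c in groups.items())
print(result)  # {'big': 3, 'data': 2, 'ideas': 1}
```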

Map Reduce Processing

  • The MapReduce library in the user program splits the input files into M pieces of 64 megabytes (MB) per piece.
  • One copy of the program serves as the master; the rest are workers.
  • The master assigns the M map tasks and R reduce tasks to idle workers on a cluster of machines.
  • A worker assigned a map task parses key/value pairs from its input split.
  • The user-defined Map function generates intermediate key/value pairs, buffered in memory.
  • These buffered pairs are written to local disk, partitioned into R regions.
  • Their locations are passed back to the master, who forwards them to the reduce workers.
  • Reduce workers read the buffered data using remote procedure calls, then sort it by intermediate keys for grouping.
  • A reduce worker iterates over the sorted data, applying the Reduce function to each intermediate key and its values and appending the result to its output file.
  • Once all tasks complete, the master wakes up the user program.
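The processing steps above can be condensed into a single-process sketch (illustrative only: Python's built-in `hash` stands in for the partitioning hash, the input strings stand in for 64 MB splits, and the real library distributes these steps across machines):

```python
from collections import defaultdict

R = 2  # number of reduce tasks / intermediate regions

def run_mapreduce(splits, map_fn, reduce_fn):
    # Map phase: each of the M splits is processed independently and its
    # intermediate pairs are partitioned into R regions by hashing the key.
    regions = [defaultdict(list) for _ in range(R)]
    for split in splits:
        for key, value in map_fn(split):
            regions[hash(key) % R][key].append(value)
    # Reduce phase: each reduce worker sorts its region by intermediate key
    # and applies the Reduce function to each key's grouped values.
    output = {}
    for region in regions:
        for key in sorted(region):
            output[key] = reduce_fn(key, region[key])
    return output

splits = ["big data big", "ideas big", "data"]   # M = 3 stand-in splits
word_map = lambda text: ((w, 1) for w in text.split())
word_reduce = lambda key, values: sum(values)
print(run_mapreduce(splits, word_map, word_reduce))  # big=3, data=2, ideas=1
```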

Map Reduce Techniques include:

  • Partitioning Function.
  • Fault Tolerance.
  • Combiner Function.

Map Reduce: Partitioning Function

  • The default function uses hashing, HASH(key) mod R, which usually results in balanced partitions.
  • Custom partitioning is sometimes needed; for example, URL keys can use HASH(hostname(key)) mod R so that pages from the same host land in the same partition.
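Both partitioning schemes can be sketched directly; here Python's built-in `hash` and `urllib.parse.urlparse` stand in for HASH and hostname():

```python
from urllib.parse import urlparse

R = 4  # number of reduce tasks / partitions

def default_partition(key: str) -> int:
    """Default MapReduce partitioning: HASH(key) mod R."""
    return hash(key) % R

def url_partition(key: str) -> int:
    """Custom partitioning: HASH(hostname(key)) mod R, so every page
    from the same host lands in the same partition."""
    return hash(urlparse(key).hostname) % R

a = url_partition("http://example.com/page1")
b = url_partition("http://example.com/page2")
print(a == b)  # True: same host, same partition
```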

Map Reduce: Fault Tolerance

  • If no response is received from a worker after a certain time, the master marks it as failed and resets its associated map or reduce tasks.
  • The master writes periodic checkpoints; if the master task dies, a new copy takes over from the last checkpoint.
  • Alternatively, the client restarts the MapReduce job if the master fails.

Map Reduce: Combiner Function

  • An optional combiner function is used to save storage and time.
  • In the word-counting example, map workers would otherwise emit huge numbers of <k, 1> pairs to be processed by reduce workers.
  • A combiner instead emits <k, x>, replacing x records with one.
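The saving can be illustrated concretely (a minimal sketch of the combiner's effect on the word-count example; the real combiner runs inside each map worker before the shuffle):

```python
from collections import Counter

words = "big data big big data analytics".split()

# Without a combiner: one <word, 1> record per occurrence leaves the map worker.
without = [(w, 1) for w in words]

# With a combiner: occurrences of the same key are pre-summed locally, so a
# single <word, x> record per distinct word leaves the map worker instead.
with_combiner = list(Counter(words).items())

print(len(without), len(with_combiner))  # 6 records shrink to 3
```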
