Questions and Answers
Which characteristic is NOT typically associated with structured data?
- Defined in fixed fields.
- Conforms to a data model.
- Variable schema. (correct)
- Organized in rows and columns.
Which statement accurately describes semi-structured data?
- Contains tags or metadata to define hierarchy. (correct)
- Easily queried using SQL.
- Cannot be stored in rows and columns. (correct)
- Strictly adheres to a relational database model.
Which is a characteristic of unstructured data?
- Easy integration with relational databases.
- Well-defined schema.
- Limited data analysis possibilities.
- Non-relational model without a specific schema. (correct)
Which storage solution is most suitable for unstructured data requiring high scalability and availability?
How does a distributed file system (DFS) enhance data accessibility for programmers?
Which of the following is a main benefit of using a Distributed File System (DFS)?
With which type of Distributed File System (DFS) are Google FS and Hadoop FS associated?
What inspired the creation of cloud file systems like Hadoop, Grid, Amazon, and Azure file systems?
What is the average file size optimized for in the Google File System (GFS)?
In Google File System (GFS), what is the size of each chunk into which files are divided?
What happens to updates in metadata within the Google File System (GFS) architecture?
What is the role of chunk-servers in the Google File System (GFS)?
What happens if a chunk-server in Google File System (GFS) fails to send a heartbeat signal to the master server?
Which of the following describes the 'consistent' state in the Google File System (GFS) consistency model?
In Google File System (GFS), what is the role of the primary replica during a write operation?
What happens in Google File System (GFS) if a chunk misses a serial number during write operations?
How does distributed computing make a network of computers appear to end-users?
Which of the following is a key advantage of distributed computing?
What characterizes the client-server architecture in distributed computing?
What is a key limitation of client-server architecture in distributed computing?
In a three-tier architecture, what is the primary responsibility of the database tier?
What characterizes N-tier architecture in distributed systems?
What is a defining characteristic of peer-to-peer architecture?
What is an example of content sharing that commonly uses peer-to-peer architecture?
How does parallel computing differ in memory access compared to typical distributed computing?
Which type of computing emphasizes performance and coordination across multiple networks?
What is a core difference in coupling between grid computing and other distributed systems?
What is the defining characteristic of cloud computing?
Which programming paradigm inspired the creation of MapReduce?
Which services have adopted MapReduce as a key technology?
In the MapReduce programming model, what is the role of the 'reduce' function?
If using MapReduce to count words in a set of distributed documents, what is the responsibility of the map function?
How does the MapReduce library initially divide the input files when starting the processing on a cluster of machines?
In MapReduce, what step follows after a worker is assigned a map task and reads the corresponding input split?
What action must a MapReduce worker complete when it has read all intermediate data?
How does MapReduce handle a worker failure during a computation task?
Why is a combiner function used in MapReduce?
Which default partitioning function is used in MapReduce?
What happens if the master task dies in MapReduce?
Flashcards
Data Model Types
A database system can be divided according to data model into three types: Structured, Semi-structured, and Unstructured.
Structured Data
Data with an identifiable structure, presented in rows and columns, and organized so the definition, format, and meaning are explicitly understood.
SQL
A query model for structured data using SQL, which is efficient to handle complex joins.
Structured Data tools
Semi-structured Data
Semi-structured Data Formats
Semi-Structured Data Tools
Unstructured Data
Unstructured Data Tools
Distributed File System (DFS)
DFS Transparency
DFS Scalability
DFS Reliability
DFS Availability
DFS Accessibility
DFS Integrity
DFS Remote Access
DFS Performance
RAID DFS
Network File System DFS
Cloud File System DFS
Google File System (GFS)
GFS: Commodity Hardware
GFS: Large Files
GFS Chunk
GFS Metadata
GFS Heartbeat
GFS Consistent Writes
GFS Write Mechanism
GFS Orphans
Distributed Computing
Distributed Computing: Scalability
Distributed Computing: Availability
Distributed Computing: Consistency
Client-Server Architecture
Three-Tier Architecture
N-Tier Architecture
Peer-to-Peer Architecture
Parallel Computing
Grid Computing
Cloud Computing
Study Notes
Lecture 1: Fundamentals of Big-Data Analytics
- This lecture covers big data concepts.
- It covers the data model, distributed file system, and distributed computing.
Data Models
- Database systems are divided into structured, semi-structured, and unstructured types.
Structured Data
- Structured data has an identifiable structure conforming to a specific data model.
- It is presented in rows and columns, and is well organized so that definition, format, and meaning are clear.
- Data resides in fixed fields within files or records.
- Elements can be efficiently analyzed and processed.
- RDBMS (relational database management systems) are an example.
- The Query Model uses Structured Query Language (SQL) and is efficient with complex joins.
- Data Analysis and Visualization is easy using any programming language, but limited to available relations.
- Technologies used include MySQL, ORACLE, SQL Server, PostgreSQL, MySQL Cluster, and Oracle Clusterware.
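The fixed rows-and-columns model and SQL joins described above can be sketched with Python's built-in sqlite3 module; the table and column names here are illustrative, not from the lecture:

```python
import sqlite3

# An in-memory SQLite database; table and column names are illustrative.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER, name TEXT)")
con.execute("CREATE TABLE orders (user_id INTEGER, item TEXT)")
con.execute("INSERT INTO users VALUES (1, 'Alice'), (2, 'Bob')")
con.execute("INSERT INTO orders VALUES (1, 'book'), (1, 'pen')")

# The fixed schema lets SQL express a join over the two tables directly.
rows = con.execute(
    "SELECT u.name, o.item FROM users u "
    "JOIN orders o ON u.id = o.user_id ORDER BY o.item"
).fetchall()
print(rows)   # [('Alice', 'book'), ('Alice', 'pen')]
```

Because every row has the same fixed fields, the query engine can resolve the join efficiently, which is exactly the advantage the notes attribute to structured data.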
Semi-Structured Data
- Does not conform to a relational database, but contains elements of structure, though less rigid.
- Contains metadata and tags for grouping and describing storage.
- Organized hierarchically.
- Automating, managing, and accessing using programs is difficult.
- XML, JSON, emails, zipped/web files, and binary executables all are forms of semi-structured data.
- The data model type is semi-relational with a variable schema, using specific data formats like XML and JSON.
- Query Model supports specific search mechanisms.
- Data Analysis and Visualization requires intermediate data processing, but allows for more analysis options.
- It uses distributed file systems that employ data sharding and replication for scaling data and ensuring high availability.
- Technologies and tools include MongoDB, Cassandra, BigTable, and HBase.
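A minimal sketch of the points above, using a hypothetical JSON record: the keys act as the tags/metadata that define the hierarchy, and fields are allowed to vary between records (variable schema):

```python
import json

# A hypothetical semi-structured record: keys act as tags/metadata
# that define the hierarchy, and fields may vary between records.
doc = '{"user": {"name": "Alice", "tags": ["admin"]}, "age": 30}'

record = json.loads(doc)
print(record["user"]["name"])   # navigate the hierarchy by tag -> Alice
print(record.get("email"))      # missing fields are allowed -> None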
Unstructured Data
- Unstructured data does not conform to any model and lacks easily identifiable structure or organization.
- This data type is incompatible with traditional database structures, having no predefined rules or formats.
- Videos, reports, surveys, Word documents, images, and memos all fall under this type.
- Its Data Model is a non-relational model, lacking any fixed scheme, for example Word, Video, and Images.
- Query Model is limited, such as only full-text search for documents.
- Data Analysis and Visualization also needs advanced processing, like natural language processing or image and video processing.
- Hadoop File System (HFS), Google File System (GFS), and GridFS are technologies and tools for scaling data and ensuring high availability.
Distributed File System (DFS)
- DFS spans multiple file servers or locations.
- It allows programs to access remote files as if they were local, enabling access across networks or computers.
DFS Benefits:
- Transparency: Users do not need to know the physical location of the files
- Scalability: Ability to grow without limit
- Reliability: No fear of losing data, backup is not a concern
- Availability: Data that remains accessible even with partial hardware failure
- Accessibility: Data can be relocated to users at their locations
- Integrity: Files integrity is maintained, even with mutations.
- Remote Access: Allows users to access files from remote locations.
- Performance: Improved access time and network efficiency.
DFS Types include:
- RAID: Distributed file system within the same computer.
- Network File System: Enables users to access remote file systems through a network interface, ex: NFS, Netware, SMB.
- Cloud File System: Distributed file systems across multiple nodes, ex: Google FS, Hadoop FS, Grid FS.
Google File System (GFS)
- A research paper, published by Google in 2003, describes its distributed file system.
- GFS met Google's storage needs by providing the infrastructure for data generation and processing, supporting research and development efforts that require large datasets.
- The largest cluster provided hundreds of terabytes of storage across thousands of disks on over a thousand machines accessed by hundreds of clients.
- GFS has inspired other cloud file systems, including Hadoop, Grid, Amazon, and Azure file systems.
GFS Assumptions:
- Is built of inexpensive commodity hardware prone to failure.
- Optimized for large, multi-gigabyte files on average.
- Uses large streaming reads more often than small random reads.
- Performs writes as continuous appends rather than small updates at random locations.
- Supports concurrent read/write access while keeping data integrity.
- Values sustained bandwidth over low latency.
GFS Architecture:
- Files are divided into 64 MB chunks.
- Chunks are distributed along multiple chunk-servers through sharding (chunk=shard).
- Replication is used according to a replication factor (RF), giving more redundancy and durability.
- The master server stores metadata.
- Clients get chunk locations from the master server, then work directly with the chunk-servers.
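The fixed 64 MB chunk size means a client can translate a byte offset into a chunk index with simple integer arithmetic before contacting the master; this sketch uses hypothetical names, not the real GFS client API:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # the 64 MB chunk size from the notes

def chunk_index(offset):
    # The client translates a byte offset within a file into the chunk
    # index it sends to the master when asking for chunk locations.
    return offset // CHUNK_SIZE

print(chunk_index(0))                  # 0: offset 0 is in the first chunk
print(chunk_index(200 * 1024 * 1024))  # 3: offset 200 MB is in the fourth chunk
```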
GFS: Master Server
- Stores metadata in memory for quick lookup and update.
- Logs all metadata updates at master storage.
- It regularly stores metadata snapshots, using B-Tree indices for faster reload.
- In case of disaster, the last snapshot is reloaded and the logs are replayed to bring the metadata up to date.
- Shadow masters are used for disaster recovery, and old logs and snapshots can be stored in other locations.
GFS: Chunk-Servers
- Replicas are present for every chunk.
- When a client requests a file at a certain offset, the master replies with the set of redundant chunk locations and designates a primary chunk.
- Nearest chunk-server is chosen for fast access.
- Chunk-servers must follow a mechanism that keeps data consistent and defined, even with concurrent access.
- Liveness is ensured, and chunk information is updated, through heartbeat signals sent to the master.
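The heartbeat bookkeeping on the master side can be sketched as follows; the timeout value and all names here are hypothetical, chosen only to illustrate the mechanism:

```python
HEARTBEAT_TIMEOUT = 10.0  # seconds; a hypothetical value for illustration

last_seen = {}  # last heartbeat time reported by each chunk-server

def heartbeat(server, now):
    # Chunk-servers periodically report liveness and chunk information.
    last_seen[server] = now

def dead_servers(now):
    # The master marks servers whose heartbeat is overdue as failed,
    # so their chunks can be re-replicated elsewhere.
    return [s for s, t in last_seen.items() if now - t > HEARTBEAT_TIMEOUT]

heartbeat("cs1", now=0.0)
heartbeat("cs2", now=5.0)
print(dead_servers(now=12.0))   # ['cs1']
```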
GFS: Consistency Model
- Writes should keep data consistent and defined.
- Consistent: Redundant chunks must have the same data after writes.
- Defined: Chunks must contain the last written data; after concurrent updates of multiple chunks, all redundant chunks should be consistent.
GFS: Write Mechanism
- The primary replica assigns a serial number to each write attempt.
- All secondary replicas perform the write attempts in the same serial order to preserve consistency.
- If a chunk misses a serial number, it becomes an orphan and is completed in the next heartbeat to maintain the desired replication level.
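Why serial ordering preserves consistency can be shown with a tiny sketch: if every replica applies writes in the serial order assigned by the primary, all replicas converge to identical data regardless of network arrival order (names hypothetical):

```python
def apply_writes(replica, writes):
    # Each write is a (serial, data) pair; every replica sorts by the
    # serial assigned by the primary before applying, so all replicas
    # end up with identical contents regardless of arrival order.
    for _, data in sorted(writes):
        replica.append(data)
    return replica

writes_at_a = [(1, "x"), (2, "y"), (3, "z")]   # arrival order at replica A
writes_at_b = [(3, "z"), (1, "x"), (2, "y")]   # different order at replica B
assert apply_writes([], writes_at_a) == apply_writes([], writes_at_b)
```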
Distributed Computing
- Involves multiple computers working together to solve a common problem, making the computer network a powerful single unit.
- It can encrypt large data volumes or render high-quality 3D video animation.
- Distributed systems, programming, and algorithms relate to it.
Advantages of Distributed Computing:
- Scalability: Increase computing capabilities
- Availability: Continued operation through nodes
- Consistency: Data sharing along nodes
- Transparency: Logical separation between user and physical devices.
- Efficiency: Faster performance with optimum resource use.
Distributed Computing Architectures:
- Client-Server Architecture
- Three-Tier Architecture
- N-Tier architecture
- Peer-to-Peer Architecture
Client-Server Architecture
- Consists of a set of clients, with limited capabilities, and servers, which perform specific services and manage databases.
- It causes communication bottlenecks when several machines make requests simultaneously.
Three-Tier Architecture
- Workload distribution is generally better than in client-server architecture; however, a single computation task is still performed at a single server.
- The three tiers are the client, the application tier, which contains the logic, and the database tier, which is in charge of data retrieval and consistency.
N-Tier Architecture
- Modern systems use an n-tier architecture, integrating enterprise applications.
- Multiple client-server systems communicate with each other in order to solve the same problem.
Peer-to-Peer Architecture
- Peer-to-peer distributed systems give equal responsibilities to all networked computers.
- There is no separation between client and server; any computer can perform all responsibilities.
- Its use has increased for content sharing, file streaming, and blockchain networks.
Distributed Computing Types include:
- Parallel Computing
- Grid Computing
- Cloud Computing
Parallel Computing
- A tightly coupled form of distributed computing.
- Processors exchange information through shared memory.
- In typical distributed computing, by contrast, each node has private memory and nodes exchange information via message passing.
Grid Computing
- A highly scaled form of distributed computing that emphasizes performance and coordination across several networks.
- Internally, each grid acts like a tightly coupled computing system.
- Externally, it is more loosely coupled, with each grid network performing its own functions.
Cloud Computing
- Involves delivering hosted services over the internet.
- Divided into infrastructure (IaaS), platform (PaaS), and software (SaaS) as a service.
Distributed Computing Technologies:
- Map Reduce
- Spark
Map Reduce
- Inspired by the map and reduce primitives present in Lisp.
- Introduced by Google in 2004.
- Is a programming model used for processing and generating large data sets through distributed file system infrastructure.
- Hundreds of MapReduce jobs run daily on Google's clusters.
- Hadoop, MongoDB, AWS, and Azure use MapReduce.
Map Reduce, Idea
- Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs.
- Reduce function merges all intermediate values associated with the same intermediate key.
- For example, each individual word can be counted across a set of distributed documents.
- The map function produces word-occurrence pairs into intermediate storage; the reduce function calculates each word's total count.
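The word-counting idea can be sketched in a few lines of Python; the function names are illustrative, not the MapReduce API, and the "shuffle" grouping step is done in-process:

```python
from collections import defaultdict

def map_fn(doc):
    # Map: emit a <word, 1> pair for every word occurrence in a document.
    for word in doc.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: merge all intermediate values associated with the same key.
    return sum(counts)

docs = ["big data big ideas", "big clusters"]

intermediate = defaultdict(list)       # the shuffle: group values by key
for doc in docs:
    for word, one in map_fn(doc):
        intermediate[word].append(one)

result = {w: reduce_fn(w, vals) for w, vals in intermediate.items()}
print(result)   # {'big': 3, 'data': 1, 'ideas': 1, 'clusters': 1}
```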
Map Reduce Processing
- The MapReduce library in the user program splits the input files into M pieces of 64 megabytes (MB) per piece.
- One copy of the program serves as the master, which assigns the M map tasks and R reduce tasks to idle workers on the cluster of machines.
- A worker assigned to a map task parses key/value pairs from its input split.
- The user-defined Map function generates intermediate key/value pairs buffered in memory.
- These buffered pairs are written to local disk, partitioned into R regions.
- Their locations are passed back to the master, who forwards them to the reduce workers.
- Reduce workers read buffered data using remote procedure calls, then sort by intermediate keys for grouping.
- A reduce worker iterates over the sorted data, applying the Reduce function to each unique intermediate key and its set of values, and appending the result to its output file.
- Once all tasks complete, the master wakes up the user program.
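The processing steps above can be sketched as a single-process driver that mimics the map phase, the hash partitioning into R regions, the per-region sort, and the reduce phase; all names are hypothetical and no cluster, master, or RPC is involved:

```python
from collections import defaultdict

R = 2   # number of reduce tasks / output partitions

def run_job(splits, map_fn, reduce_fn):
    # Map phase: each input split produces intermediate pairs, which are
    # partitioned into R regions using the default hash partitioner.
    regions = [defaultdict(list) for _ in range(R)]
    for split in splits:
        for key, value in map_fn(split):
            regions[hash(key) % R][key].append(value)
    # Reduce phase: each reduce worker sorts its region by key and applies
    # the Reduce function to every key's grouped values.
    output = {}
    for region in regions:
        for key in sorted(region):
            output[key] = reduce_fn(key, region[key])
    return output

word_map = lambda split: ((w, 1) for w in split.split())
word_reduce = lambda key, values: sum(values)

splits = ["a b a", "b c", "a"]   # M = 3 input splits
print(run_job(splits, word_map, word_reduce))
```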
Map Reduce Techniques include:
- Partitioning Function.
- Fault Tolerance.
- Combiner Function.
Map Reduce: Partitioning Function
- The default function uses hashing, HASH(key) mod R, which usually results in balanced partitions.
- Custom partitioning is sometimes needed; for URLs, for example, HASH(hostname(key)) mod R can be used so that all URLs from the same host end up in the same partition.
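Both partitioners can be sketched directly; Python's built-in hash stands in for the (unspecified) hash function, and R is a hypothetical partition count:

```python
from urllib.parse import urlparse

R = 4  # number of reduce tasks / partitions (hypothetical)

def default_partition(key):
    # Default partitioner: HASH(key) mod R.
    return hash(key) % R

def url_partition(key):
    # Custom partitioner: HASH(hostname(key)) mod R, so all URLs from
    # the same host land in the same partition.
    return hash(urlparse(key).hostname) % R

same = url_partition("http://example.com/page1") == url_partition("http://example.com/page2")
print(same)   # True: same hostname, same partition
```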
Map Reduce: Fault Tolerance
- If no response is received from a worker after a certain time, the master marks the worker as failed and resets its associated map or reduce tasks for reassignment.
- The master can take periodic checkpoints; if the master task dies, a new copy takes over from the last saved checkpoint.
- Alternatively, the client restarts the MapReduce job if the master fails.
Map Reduce: Combiner Function
- Additional combiner functions are used in order to save storage and time.
- In the word-counting example, MapReduce would otherwise produce huge numbers of <k, 1> pairs to be processed by the reduce workers.
- A combiner merges these locally on the map worker, generating a single <k, x> record instead of x separate records.
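The saving is easy to see in a sketch of the word-counting case; the function names are illustrative, and the combiner here is simply a local Counter over the map output:

```python
from collections import Counter

def map_fn(doc):
    # Without a combiner, the map side emits one <word, 1> pair per
    # occurrence, and every pair must be shuffled to a reduce worker.
    for word in doc.split():
        yield word, 1

def combine(pairs):
    # Combiner: merge pairs locally on the map worker, producing a single
    # <word, x> record per word instead of x separate <word, 1> records.
    counts = Counter()
    for word, one in pairs:
        counts[word] += one
    return list(counts.items())

pairs = list(map_fn("to be or not to be"))
combined = combine(pairs)
print(len(pairs))      # 6 records before combining
print(len(combined))   # 4 records after combining
```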