Questions and Answers
What data format does Elasticsearch primarily use to store data?
- CSV
- JSON (correct)
- XML
- BSON
Which of the following is a key feature of Elasticsearch?
- Limited scalability
- Dependency on specific hardware
- Strict schema enforcement
- Real-time search capabilities (correct)
What is a 'node' in the context of Elasticsearch?
- A collection of indexes
- A backup of the entire cluster
- A unit of data within an index
- A single running instance of Elasticsearch (correct)
How do shards contribute to Elasticsearch's scalability?
What role do replicas play in Elasticsearch?
Which of the following is an advantage of using Elasticsearch?
What is a limitation of Elasticsearch regarding multi-language support?
How does Kibana enhance the use of Elasticsearch?
In the ELK stack, what is the role of Logstash?
Which feature of Kibana allows users to combine various visualizations into a single view?
What is one of the disadvantages of Kibana?
In Logstash, what is the purpose of the 'Filter' stage in the pipeline?
Which of the following is a key advantage of using Logstash?
What is a potential drawback of using Logstash?
How does MongoDB differ from traditional relational databases?
What does 'schema-less' mean in the context of MongoDB?
Which of the following is an advantage of MongoDB over RDBMS?
In MongoDB, what is the equivalent of a 'table' in a relational database?
What is a key consideration when designing a schema in MongoDB?
In MongoDB data modeling, what is the approach of Embedded Data Model?
What makes Apache Cassandra suitable for applications requiring high availability and scalability?
Which of the following best describes Cassandra's data model?
What is a Keyspace in Cassandra?
How does Cassandra achieve fault tolerance?
What is the purpose of the Commit Log in Cassandra?
What is the role of SSTable in Cassandra?
How does Cassandra differ from an RDBMS in terms of data structure?
Which programming language that runs on the JVM is Logstash written in?
What is CQL in Cassandra?
Flashcards
What is Elasticsearch?
A real-time distributed, open-source, full-text search and analytics engine accessible from RESTful web service interface.
How scalable is Elasticsearch?
Elasticsearch is scalable up to petabytes of structured and unstructured data.
What is a Node in Elasticsearch?
A single running Elasticsearch instance. One server can accommodate multiple nodes, depending on its physical resources.
What is a Cluster in Elasticsearch?
What is an Index in Elasticsearch?
What is a Document in Elasticsearch?
What is a Shard in Elasticsearch?
What are Replicas in Elasticsearch?
What is Kibana?
What is the ELK stack?
What does Logstash do?
What is a Visualization in Kibana?
What is a Dashboard in Kibana?
What are Dev Tools in Kibana?
What is Timelion in Kibana?
What is Canvas in Kibana?
What is Logstash?
What is Event Object?
What is a Pipeline in Logstash?
What is Input in Logstash?
What is Filter in Logstash?
What is Output in Logstash?
What is MongoDB?
What is a Database in MongoDB?
What is a Collection in MongoDB?
What is a Document in MongoDB?
What is Schema less in MongoDB?
What is Apache Cassandra?
What is a NoSQL Database?
Keyspace in Cassandra
Study Notes
Elasticsearch Overview
- Elasticsearch is an Apache Lucene-based search server developed by Shay Banon, first released in 2010, and maintained by Elasticsearch BV. When these notes were written, the latest version was 7.0.0.
- It is a real-time, distributed, open-source full-text search and analytics engine.
- It is accessible via a RESTful web service interface and uses schema-less JSON documents for data storage.
- Built on Java, Elasticsearch runs on various platforms, enabling high-speed exploration of large datasets.
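The REST-plus-JSON interface described above can be sketched as follows. This is a minimal illustration rather than a definitive client: the host `http://localhost:9200`, the index name `posts`, the sample document, and the helper `build_index_request` are all assumptions made for the example. The request is only constructed, never sent.

```python
import json
from urllib.request import Request


def build_index_request(host, index, doc_id, document):
    """Build (but do not send) a PUT request that would index one JSON document."""
    url = f"{host}/{index}/_doc/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical host, index name, and document, used only for illustration.
req = build_index_request(
    "http://localhost:9200", "posts", "1", {"title": "Hello", "likes": 3}
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Sending the request would need a running node and a call such as `urllib.request.urlopen(req)`; the sketch leaves that out so it runs anywhere.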
Elasticsearch General Features
- Scalable to petabytes of structured and unstructured data.
- Can replace document stores like MongoDB and RavenDB.
- Uses denormalization to enhance search performance.
- Widely used by large organizations like Wikipedia, The Guardian, StackOverflow, and GitHub.
- Open source under the Apache License 2.0 as of version 7.0; later releases changed the licensing terms.
Elasticsearch Key Concepts
Node
- A single running instance of Elasticsearch.
- A single physical or virtual server can accommodate multiple nodes, depending on its physical resources such as RAM, storage, and processing power.
Cluster
- A collection of one or more nodes.
- Provides collective indexing and search capabilities across all nodes for the entire data set.
Index
- Collection of different types of documents and their properties.
- Uses shards to improve performance.
- For example, one index might hold the set of documents that contains the data of a social networking application.
Document
- Collection of fields in a specific manner defined in JSON format.
- Belongs to a type and resides inside an index.
- Has a unique identifier called the UID.
Shard
- Indexes are horizontally subdivided into shards, each containing all document properties, but fewer JSON objects than the index.
- Enables storage on independent nodes.
- Primary shards are the original horizontal parts of an index; they are then replicated into replica shards.
Replicas
- Allows users to create copies of indexes and shards.
- Increases data availability in case of failure.
- Enhances search performance through parallel search operations.
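How an index is split across primary shards can be sketched with a simple hash-and-modulo rule. This is an illustrative sketch under stated assumptions: `zlib.crc32` stands in for the hash function Elasticsearch actually uses, and the function names, document IDs, and shard count are hypothetical.

```python
import zlib
from collections import defaultdict


def pick_shard(doc_id: str, num_primary_shards: int) -> int:
    """Map a document ID to a primary shard number.

    crc32 is a stand-in hash chosen because it is deterministic across runs
    (unlike Python's built-in hash() for strings).
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def distribute(doc_ids, num_primary_shards):
    """Group document IDs by the shard each one is routed to."""
    shards = defaultdict(list)
    for doc_id in doc_ids:
        shards[pick_shard(doc_id, num_primary_shards)].append(doc_id)
    return dict(shards)


layout = distribute([f"doc-{i}" for i in range(10)], num_primary_shards=3)
for shard, ids in sorted(layout.items()):
    print(f"shard {shard}: {ids}")
```

Because each shard holds only part of the documents, shards can live on different nodes, and a replica of each shard can serve searches in parallel with its primary.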
Elasticsearch Advantages
- Developed in Java, ensuring platform compatibility.
- Provides near real-time search capabilities: added documents become searchable after about one second, the default refresh interval.
- Distributed architecture allows easy scaling and integration in large organizations.
- Simplifies full backups using the gateway concept.
- Offers easy multi-tenancy compared to Apache Solr.
- Uses JSON objects as responses, compatible with numerous programming languages.
- Supports almost all document types, except those that do not support text rendering.
Elasticsearch Disadvantages
- Lacks multi-language support for request and response data: it handles only JSON, unlike Apache Solr, which supports CSV, XML, and JSON.
- Can encounter "split brain" situations, in which a network partition leaves the cluster with more than one master node.
Comparison between Elasticsearch and RDBMS
- Cluster is similar to Database
- Shard is similar to Shard
- Index is similar to Table
- Field is similar to Column
- Document is similar to Row
Kibana Overview
- Kibana is an open-source, browser-based visualization tool for analyzing large volumes of logs using various formats such as line graphs, bar graphs, pie charts, heat maps, region maps, coordinate maps, gauges, goals, and Timelion.
- It facilitates the prediction or observation of changes in trends of errors or other significant events.
- Kibana works together with Elasticsearch and Logstash to form the ELK stack.
ELK Stack
- ELK stands for Elasticsearch, Logstash, and Kibana, a popular log management platform used worldwide for log analysis.
- Logstash extracts logging data from diverse input sources, processes it, and stores it in Elasticsearch.
- Kibana is a visualization tool that accesses logs from Elasticsearch and displays them to users using graphs and charts.
Kibana Features
- Visualization: Offers diverse ways to visualize data, including vertical bar charts, horizontal bar charts, pie charts, line graphs, and heat maps.
- Dashboard: Place visualizations onto a single board, providing a clear overview of events.
- Dev Tools: Used to manage indexes, add dummy data, and create visualizations.
- Reports: Exports data from visualizations and dashboards as reports (for example, in CSV format), as code that can be embedded, or as shareable URLs.
- Filters and Search query: Enables detailed data retrieval from dashboards or visualization tools.
- Plugins: Supports addition of third-party plugins for new visualizations.
- Coordinate and Region Maps: Displays visualization on geographical maps.
- Timelion: Visualization tool for time-based data analysis, using simple expressions to connect to indexes and perform calculations for data comparison over time.
- Canvas: Powerful feature for data representation via color combinations, shapes, texts and workpads.
Kibana Advantages
- Open-source browser-based visualization tool.
- Simple and easy to understand.
- Allows easy conversion of visualizations and dashboards into reports.
- Offers Canvas visualization for analyzing complex data and Timelion visualization for comparing historical data.
Kibana Disadvantages
- Adding plugins can be tedious due to version mismatches.
- Upgrading from older versions can lead to issues.
Logstash Overview
- Logstash is a tool that gathers, processes, and generates logs or events using the pipes-and-filters pattern, helping to centralize and analyze logs from different sources in real time.
- It is written in JRuby, which runs on the JVM, so it works on a variety of operating systems.
- It collects data such as events, logs, packets, timestamped data, and transactions from sources like social media, e-commerce, news, CRM, game data, mobile devices, web trends, financial data, and the Internet of Things.
Logstash General Features
- Collects data from different sources and sends it to multiple destinations.
- Can handle various logging data types.
- Handles HTTP requests and response data.
- Offers filters that parse and transform data, helping users extract more meaning from it.
- Used for handling sensor data in the Internet of Things.
- Open source under the Apache license version 2.0.
Logstash Key Concepts
Event Object
- The main object in Logstash.
- Encapsulates the data flow in the Logstash pipeline.
- Used to store the input data and to add extra fields during the filter stage.
- Also referred to as Logging Data Event, Log Event, Log Data, Input Log Data, Output Log Data, and so on.
Pipeline
- Comprises the data flow stages from input to output.
- Input data enters the pipeline and is processed as an event.
- The event is then sent to an output destination in the desired format.
Input
- The first stage, where data is received into the Logstash pipeline.
- Plugins like File, Syslog, Redis, and Beats are used.
Filter
- The middle stage where the actual processing of events occurs.
- Pre-defined Regex Patterns are used to create sequences for differentiating fields.
- Filter plugins like Grok, Mutate, Drop, Clone, and Geoip are used.
Output
- The final stage where output events are formatted and sent to destination systems.
- Plugins such as Elasticsearch, File, Graphite, and Statsd are used.
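The input, filter, and output stages described above can be sketched as three chained Python functions. This is a toy model, not Logstash's implementation or configuration syntax: the log line format, the regular expression, the `parse_failed` tag, and the rule that drops DEBUG events are all assumptions made for illustration.

```python
import re

# Hypothetical pattern for lines such as "2024-01-01 12:00:00 ERROR disk full".
LINE_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<message>.*)"
)


def input_stage(lines):
    """Input: wrap each raw line in an event (a plain dict in this sketch)."""
    for line in lines:
        yield {"raw": line.rstrip("\n")}


def filter_stage(events):
    """Filter: parse named fields out of the raw line and drop DEBUG events."""
    for event in events:
        match = LINE_PATTERN.match(event["raw"])
        if match is None:
            event["tags"] = ["parse_failed"]  # hypothetical tag name
        else:
            event.update(match.groupdict())   # add extra fields to the event
            if event["level"] == "DEBUG":
                continue                      # drop the event
        yield event


def output_stage(events):
    """Output: collect events; a real output would ship them to a destination."""
    return list(events)


def run_pipeline(lines):
    return output_stage(filter_stage(input_stage(lines)))


sample = [
    "2024-01-01 12:00:00 ERROR disk full",
    "2024-01-01 12:00:01 DEBUG heartbeat",
    "not a log line",
]
results = run_pipeline(sample)
for event in results:
    print(event)
```

Each event starts as the raw input and gains fields during the filter stage, which mirrors how the Event Object is described earlier in these notes.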
Logstash Advantages
- Regex pattern sequences identify and parse various fields in any input event.
- Supports a variety of web servers and data sources for extracting logging data.
- Multiple plugins parse and transform logging data as desired.
- Centralized, easing the processing and collection of data from different servers.
- Supports many databases, network protocols, and other services.
- Communicates with Elasticsearch over HTTP, so Elasticsearch can be upgraded without upgrading Logstash at the same time.
Logstash Disadvantages
- Uses HTTP, which can negatively affect the processing of logging data.
- Working with it can be complex and requires a good understanding of the input logging data and its analysis.
- Filter plugins are not generic, so users must find the correct pattern for their data.
MongoDB Overview
- MongoDB is a cross-platform, document-oriented database that provides high performance, high availability, and easy scalability.
- It works on the concept of collections and documents.
Database
- A physical container for collections, with each database having its own file set.
- A single MongoDB server typically manages multiple databases.
Collection
- A group of MongoDB documents, equivalent to an RDBMS table. A collection exists within a single database and does not enforce a schema.
- Documents within a collection can have different fields.
- Typically, documents are of similar or related purpose.
Document
- Set of key-value pairs with a dynamic schema.
- Documents in the same collection do not need to have the same fields or structure.
- Common fields can hold different types of data.
MongoDB/RDBMS Terminology
Each RDBMS term is followed by its MongoDB equivalent:
- Database is Database
- Table is Collection
- Tuple/Row is Document
- Column is Field
- Table Join is Embedded Document
- Primary Key is Primary Key (Default key _id provided by MongoDB itself)
- Database server: mysqld/Oracle is mongod
- Database client: mysql/sqlplus is mongo
Sample Document Structure
- _id: ObjectId(7df78ad8902c)
  - _id is a 12-byte hexadecimal number that assures the uniqueness of every document; MongoDB provides a unique ID if none is declared.
  - The first 4 bytes are the current timestamp.
  - The next 3 bytes are the machine ID.
  - The next 2 bytes are the process ID of the MongoDB server.
  - The remaining 3 bytes are a simple incremental value.
- title: 'MongoDB Overview'
- description: 'MongoDB is no sql database'
- by: 'tutorials point'
- url: 'http://www.tutorialspoint.com'
- tags: ['mongodb', 'database', 'NoSQL']
- likes: 100
- comments: [{user: 'user1', message: 'My first comment', dateCreated: new Date(2011,1,20,2,15), like: 0}]
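The 12-byte _id layout described in these notes (4-byte timestamp, 3-byte machine ID, 2-byte process ID, 3-byte counter) can be sketched as a small parser. This is an illustration under stated assumptions: the 24-character hex value is made up (the sample `7df78ad8902c` above is too short to be a full ObjectId), big-endian byte order is assumed for every field, and `split_object_id` is a hypothetical helper, not a MongoDB API. Newer MongoDB versions use a 5-byte random value in place of the machine and process ID fields.

```python
from datetime import datetime, timezone


def split_object_id(hex_id: str) -> dict:
    """Split a 24-hex-character ObjectId using the layout described in these notes."""
    raw = bytes.fromhex(hex_id)
    if len(raw) != 12:
        raise ValueError("an ObjectId is 12 bytes (24 hex characters)")
    return {
        # 4 bytes: seconds since the Unix epoch
        "timestamp": datetime.fromtimestamp(
            int.from_bytes(raw[0:4], "big"), tz=timezone.utc
        ),
        # 3 bytes: machine ID
        "machine_id": raw[4:7].hex(),
        # 2 bytes: process ID of the server
        "process_id": int.from_bytes(raw[7:9], "big"),
        # 3 bytes: incremental counter
        "counter": int.from_bytes(raw[9:12], "big"),
    }


# Made-up value: timestamp + machine ID + process ID + counter.
parts = split_object_id("5f1d7f3e" "a1b2c3" "04d2" "00002a")
for name, value in parts.items():
    print(name, value)
```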
MongoDB Advantages over RDBMS
- Schema-less: A document database where one collection holds different documents. Number of fields, content, and size can differ.
- Structure of a single object is clear.
- No complex joins.
- Supports dynamic queries on documents via a document-based language nearly as powerful as SQL.
- Tuning.
- Easy to scale-out.
- Conversion/mapping of application objects to database objects is not needed.
- Uses internal memory to store the (windowed) working set, enabling faster access to data.
Why Use MongoDB?
- Document Oriented Storage: Data is stored in JSON style documents.
- Index on any attribute.
- Replication and high availability.
- Auto-Sharding.
- Rich queries.
- Fast in-place updates.
- Professional support available.
MongoDB Use Cases
- Big Data
- Content Management and Delivery
- Mobile and Social Infrastructure
- User Data Management
- Data Hub
MongoDB Data Modelling
- Data in MongoDB has a flexible schema: documents in the same collection do not need identical fields or structure.
- Common fields may hold different types of data.
Data Model Design
- MongoDB supports embedded and normalized data models; the choice depends on the requirements.
Embedded Data Model
- Known as a de-normalized data model, it consolidates all related data into a single document.
- Example: Embedding details like Personal_details, Contact, and Address of an employee in one document.
Example of Embedded Data Model
- _id: ,
- Emp_ID: "10025AE336"
- Personal_details: {First_Name: "Radhika", Last_Name: "Sharma", Date_Of_Birth: "1995-09-26"}
- Contact: {e-mail: "[email protected]", phone: "9848022338"}
- Address: {city: "Hyderabad", Area: "Madapur", State: "Telangana"}
Normalized Data Model
- In this model, sub-documents are referenced in the original document.
Example of Normalized Data Model
- Employee: {_id: <ObjectId101>, Emp_ID: "10025AE336"}
- Personal_details: {_id: <ObjectId102>, empDocID: "ObjectId101", First_Name: "Radhika", Last_Name: "Sharma", Date_Of_Birth: "1995-09-26"}
- Contact: {_id: <ObjectId103>, empDocID: "ObjectId101", e-mail: "[email protected]", phone: "9848022338"}
- Address: {_id: <ObjectId104>, empDocID: "ObjectId101", city: "Hyderabad", Area: "Madapur", State: "Telangana"}
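The difference between the two models can be sketched with plain Python dictionaries standing in for MongoDB documents. This is a minimal illustration, not MongoDB driver code: the string IDs stand in for real ObjectIds, the Contact document is omitted for brevity, and `to_embedded` is a hypothetical helper that resolves the `empDocID` references by hand.

```python
# Normalized model: separate documents linked by a reference field (empDocID).
employee = {"_id": "ObjectId101", "Emp_ID": "10025AE336"}
related = {
    "Personal_details": [
        {"_id": "ObjectId102", "empDocID": "ObjectId101",
         "First_Name": "Radhika", "Last_Name": "Sharma",
         "Date_Of_Birth": "1995-09-26"},
    ],
    "Address": [
        {"_id": "ObjectId104", "empDocID": "ObjectId101",
         "city": "Hyderabad", "Area": "Madapur", "State": "Telangana"},
    ],
}


def to_embedded(employee_doc, related_collections):
    """Fold referenced sub-documents into one embedded (de-normalized) document."""
    embedded = dict(employee_doc)
    for name, collection in related_collections.items():
        for sub in collection:
            if sub["empDocID"] == employee_doc["_id"]:
                # Drop the linking fields; they are not needed once embedded.
                embedded[name] = {
                    k: v for k, v in sub.items() if k not in ("_id", "empDocID")
                }
    return embedded


embedded_employee = to_embedded(employee, related)
print(embedded_employee)
```

Reading the embedded document needs a single lookup, while the normalized form needs one lookup per referenced document. That trade-off is what the schema design considerations below weigh.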
MongoDB Schema Design Considerations
- Design the schema around user requirements, and avoid the need for complex joins.
- Optimize for the most frequent use cases.
- Do complex aggregation in the schema.
- Duplicate data (within limits), because disk space is cheaper than compute time.
- Combine objects into one document if they are used together; otherwise keep them separate.
MongoDB RDBMS Schema Differences
- Client requirements drive the database design in MongoDB.
For a blog/website application example:
- MongoDB needs only one collection, post.
- An RDBMS needs a minimum of three tables.
Cassandra Overview
- Apache Cassandra is a highly scalable, high-performance distributed database designed to handle big data across commodity servers while providing high availability. It is a type of NoSQL database.
- NoSQL Database: Short for "Not Only SQL"; it stores and retrieves data differently from the tabular relations used in relational databases.
- NoSQL databases are generally schema-free, support easy replication, have a simple API, are eventually consistent, and can handle huge amounts of data.
- Primary objectives of a NoSQL database: simplicity of design, horizontal scaling, and finer control over availability.
- Some operations are faster because NoSQL databases use different data structures than relational databases.
- HBase and MongoDB are also popular NoSQL databases.
NoSQL vs. Relational Databases
- Relational Database: Supports a powerful query language, has a fixed schema, follows ACID, and supports transactions.
- NoSQL Database: Typically supports a very simple query language, has no fixed schema, is only eventually consistent, and does not support transactions.
Apache Cassandra
- Cassandra is an open-source, distributed, decentralized storage system for managing very large amounts of structured data spread across the world.
- Provides a highly available service with no single point of failure.
Notable Points of Cassandra
- Scalable, fault-tolerant, and consistent.
- Column-oriented.
- Distribution design based on Amazon’s Dynamo and its data model on Google’s Bigtable.
- Created at Facebook; it differs sharply from relational database management systems.
- Implements a Dynamo-style replication model with no single point of failure, with a "column family" data model.
- Used by Facebook, Twitter, Cisco, Rackspace, eBay, Netflix, and more.
Features of Cassandra
- Elastic Scalability: More hardware can be added as scaling needs grow.
- Always-On Architecture: Continuously available for business-critical applications.
- Fast Linear Scale Performance: Throughput increases with nodes.
- Flexible Data Storage: Supports structured, semi-structured, and unstructured data, and can dynamically accommodate changes to data structures.
- Easy Data Distribution: Provides data across multiple data centers by replicating.
- Transaction Support: Cassandra supports properties like Atomicity, Consistency, Isolation, and Durability (ACID), though not full relational-style transactions.
- Fast Writes: Runs on cheap commodity hardware and performs very fast writes.
Cassandra History
- Developed at Facebook for inbox search.
- Open-sourced by Facebook in July 2008.
- Accepted into Apache Incubator in March 2009.
- Apache top-level project since February 2010.
Cassandra Architecture
- The goal is to handle big data workloads across multiple nodes without any single point of failure. Cassandra has a peer-to-peer architecture, with data distributed among all nodes.
- All nodes in a cluster play the same role; each node is independent yet interconnected with the others.
- Each node can accept read and write requests, regardless of where the data is actually located in the cluster.
- When a node goes down, read and write requests can be served from other nodes in the network.
Data Replication in Cassandra
- One or more nodes in a cluster act as replicas for a given piece of data.
- If some replicas respond with an out-of-date value, Cassandra returns the most recent value to the client and then performs a read repair in the background to update the stale values.
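The read-repair idea can be sketched with timestamped values held by several replicas. This is a toy model under stated assumptions: each replica is a plain dict mapping a key to a `(timestamp, value)` pair, `read_with_repair` is a hypothetical function, and the repair runs inline here, whereas Cassandra performs it in the background.

```python
def read_with_repair(replicas, key):
    """Return the newest value for key and bring stale replicas up to date.

    replicas: list of dicts mapping key -> (timestamp, value).
    """
    responses = [replica[key] for replica in replicas if key in replica]
    if not responses:
        return None
    newest = max(responses, key=lambda pair: pair[0])  # highest timestamp wins
    for replica in replicas:
        if replica.get(key) != newest:
            replica[key] = newest  # repair the stale or missing copy
    return newest[1]


# Three replicas; two of them hold an out-of-date or missing value.
replica_a = {"user:1": (100, "alice@old")}
replica_b = {"user:1": (250, "alice@new")}
replica_c = {}
value = read_with_repair([replica_a, replica_b, replica_c], "user:1")
print(value)
print(replica_a, replica_c)
```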
Cassandra Components
- Node: The place where data is stored.
- Data Center: A collection of related nodes.
- Cluster: Contains one or more data centers.
- Commit Log: A crash-recovery mechanism; every write operation is written to the commit log.
- Mem-Table: A memory-resident data structure; data is written to it after the commit log.
- SSTable: A disk file to which data is flushed from the mem-table when its contents reach a threshold value.
- Bloom Filter: A quick, probabilistic algorithm for testing whether an element is a member of a set.
Cassandra Query Language
- Users access Cassandra through its nodes using the Cassandra Query Language (CQL), which treats the keyspace as a container of tables.
- Programmers work with CQL through cqlsh, an interactive prompt, or through a separate application.
Client approaches
- Clients can approach any node for their read and write operations.
- That node, the coordinator, acts as a proxy between the client and the nodes holding the data.
Write ops
- Every write operation on a node is captured in the commit log.
- The data is then captured and stored in the mem-table.
- When the mem-table is full, its data is written to an SSTable; all writes are automatically partitioned across the cluster.
- Cassandra regularly consolidates SSTables, discarding unnecessary data.
Read ops
- Cassandra gets values from the mem-table and checks the bloom filter to locate the SSTable that holds the required data.
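The write and read paths above can be sketched as a toy single-node store. This is a simplified model, not Cassandra's implementation: the class names, the tiny mem-table limit, the SHA-256-based bloom filter, and the in-memory "SSTables" are all assumptions made for illustration, and details such as partitioning, replication, and commit log truncation are left out.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: may report false positives, never false negatives."""

    def __init__(self, size=256, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


class ToyNode:
    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only record used for crash recovery
        self.memtable = {}     # in-memory writes
        self.sstables = []     # flushed (bloom filter, data) pairs, newest last
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. commit log first
        self.memtable[key] = value            # 2. then the mem-table
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. flush when full

    def flush(self):
        bloom = BloomFilter()
        for key in self.memtable:
            bloom.add(key)
        self.sstables.append((bloom, dict(self.memtable)))
        self.memtable.clear()

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for bloom, data in reversed(self.sstables):  # newest SSTable first
            if bloom.might_contain(key) and key in data:
                return data[key]
        return None

    def compact(self):
        """Merge all SSTables into one, keeping only the newest value per key."""
        merged = {}
        for _, data in self.sstables:
            merged.update(data)
        bloom = BloomFilter()
        for key in merged:
            bloom.add(key)
        self.sstables = [(bloom, merged)]


node = ToyNode(memtable_limit=2)
node.write("a", 1)
node.write("b", 2)   # mem-table is full, so it is flushed to an SSTable
node.write("a", 3)   # newer value for "a" lives in the mem-table
print(node.read("a"), node.read("b"), node.read("zzz"))
node.write("c", 4)   # second flush
node.compact()
print(len(node.sstables), node.read("a"))
```

The bloom filter lets a read skip SSTables that definitely do not contain the key, which is why it is consulted before the SSTable data itself.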
Cassandra Data Model
- The data model of Cassandra is different from an RDBMS.
- Cluster: The Cassandra database is distributed over several machines that operate together.
- Keyspace: The outermost container for data in the cluster.
Keyspace Replication
- Replication: The number of copies of the data kept on machines in the cluster.
- Replica placement strategy: How the copies are placed across the machines. Options include a simple strategy and network-topology-based strategies.
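A simple placement strategy can be sketched as walking clockwise around a token ring: the first replica goes to the node that owns the key's token, and the remaining copies go to the next nodes on the ring. This is a sketch under stated assumptions: the node names and tokens are made up, a SHA-256-based hash stands in for Cassandra's partitioner, and the function names are hypothetical.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32  # assumed token range for this sketch


def token_for(key: str) -> int:
    """Stand-in partitioner: hash a key onto the ring."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % RING_SIZE


def replicas_for_token(key_token, node_tokens, replication_factor):
    """Pick the owning node and the next nodes clockwise on the ring.

    node_tokens: dict mapping node name -> token.
    """
    ring = sorted((token, name) for name, token in node_tokens.items())
    tokens = [token for token, _ in ring]
    # First node whose token is >= the key's token, wrapping around the ring.
    start = bisect.bisect_left(tokens, key_token) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


def replicas_for(key, node_tokens, replication_factor):
    return replicas_for_token(token_for(key), node_tokens, replication_factor)


# Hypothetical four-node ring with evenly spaced tokens.
nodes = {"n1": 0, "n2": 2 ** 30, "n3": 2 ** 31, "n4": 3 * 2 ** 30}
print(replicas_for("user:42", nodes, replication_factor=3))
```

A network-topology-based strategy would additionally take racks and data centers into account when choosing the next nodes; that is not modeled here.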
Column Families: Keyspace Container
- A keyspace is a container for a list of one or more column families.
- A column family, in turn, is a container for a collection of rows, with each row containing columns.
- Column families represent the structure of the data.
Column Family Relational Table Differences
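- Relational table schema: Fixed; when inserting data, every column of a row must be filled, at least with a null value.
- Relational table definition: Defines only the columns, which the user then fills with values.
- Column family attributes:
- Keys cache: The number of key locations to keep cached per SSTable.
- Rows cache: The number of rows to cache in memory.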