Elasticsearch: Features and Concepts

Questions and Answers

What data format does Elasticsearch primarily use to store data?

  • CSV
  • JSON (correct)
  • XML
  • BSON

Which of the following is a key feature of Elasticsearch?

  • Limited scalability
  • Dependency on specific hardware
  • Strict schema enforcement
  • Real-time search capabilities (correct)

What is a 'node' in the context of Elasticsearch?

  • A collection of indexes
  • A backup of the entire cluster
  • A unit of data within an index
  • A single running instance of Elasticsearch (correct)

How do shards contribute to Elasticsearch's scalability?

  • By horizontally dividing the index and distributing data across multiple nodes (correct)

What role do replicas play in Elasticsearch?

  • They increase data availability and improve search performance through parallel operations (correct)

Which of the following is an advantage of using Elasticsearch?

  • Real-time data searchability (correct)

What is a limitation of Elasticsearch regarding multi-language support?

  • It handles request and response data only in JSON, with no support for other formats (correct)

How does Kibana enhance the use of Elasticsearch?

  • By offering data visualization tools for Elasticsearch data (correct)

In the ELK stack, what is the role of Logstash?

  • Data collection, parsing, and transformation (correct)

Which feature of Kibana allows users to combine various visualizations into a single view?

  • Dashboard (correct)

What is one of the disadvantages of Kibana?

  • Potential issues during upgrades (correct)

In Logstash, what is the purpose of the 'Filter' stage in the pipeline?

  • To process and transform the events (correct)

Which of the following is a key advantage of using Logstash?

  • Centralized data processing and collection (correct)

What is a potential drawback of using Logstash?

  • Its use of HTTP, which can negatively affect the processing of logging data (correct)

How does MongoDB differ from traditional relational databases?

  • It uses JSON-like documents with dynamic schemas (correct)

What does 'schema-less' mean in the context of MongoDB?

  • Documents within the same collection can have different fields and structures (correct)

Which of the following is an advantage of MongoDB over RDBMS?

  • Ease of scale-out (correct)

In MongoDB, what is the equivalent of a 'table' in a relational database?

  • Collection (correct)

What is a key consideration when designing a schema in MongoDB?

  • Design the schema according to user requirements (correct)

In MongoDB data modeling, what is the approach of Embedded Data Model?

  • All related data is kept in a single document; this is also known as the de-normalized data model (correct)

What makes Apache Cassandra suitable for applications requiring high availability and scalability?

  • Its distributed and decentralized architecture (correct)

Which of the following best describes Cassandra's data model?

  • Column-oriented with flexible schemas (correct)

What is a Keyspace in Cassandra?

  • A container for column families (correct)

How does Cassandra achieve fault tolerance?

  • By replicating data across multiple nodes (correct)

What is the purpose of the Commit Log in Cassandra?

  • To record every write operation for crash recovery (correct)

What is the role of SSTable in Cassandra?

  • A disk file to which data is flushed from the mem-table (correct)

How does Cassandra differ from an RDBMS in terms of data structure?

  • In Cassandra, a table is a list of nested key-value pairs, unlike the fixed schema of an RDBMS (correct)

In which programming language, running on the JVM, is Logstash written?

  • JRuby (correct)

What is CQL in Cassandra?

  • Cassandra Query Language (correct)

Flashcards

What is Elasticsearch?

A real-time, distributed, open-source full-text search and analytics engine, accessible through a RESTful web service interface.

How scalable is Elasticsearch?

Elasticsearch is scalable up to petabytes of structured and unstructured data.

What is a Node in Elasticsearch?

A single running instance of Elasticsearch. One server can host multiple nodes, depending on its physical resources.

What is a Cluster in Elasticsearch?

A collection of one or more nodes that provide collective indexing and search capabilities.

What is an Index in Elasticsearch?

A collection of different types of documents with their properties, using shards to improve performance.

What is a Document in Elasticsearch?

A collection of fields in a specific manner, defined in JSON format, residing inside an index.

What is a Shard in Elasticsearch?

Horizontally subdivided parts of an index, containing all document properties but fewer JSON objects.

What are Replicas in Elasticsearch?

Copies of indexes and shards, increasing data availability and search performance.

What is Kibana?

Open-source browser-based visualization tool used to analyze large volumes of logs via graphs, charts, etc.

What is the ELK stack?

Elasticsearch, Logstash, and Kibana combined for log management and analysis.

What does Logstash do?

Collects data from remote sources, parses, transforms, and sends it to Elasticsearch.

What is a Visualization in Kibana?

Data representation in the form of line graphs, bar graphs, pie charts, etc.

What is a Dashboard in Kibana?

A board in Kibana where visualizations are placed for observing different sections.

What are Dev Tools in Kibana?

Used in Kibana for working with indexes; adding, updating, and deleting data.

What is Timelion in Kibana?

Tool for time-based data analysis in Kibana using expression language.

What is Canvas in Kibana?

A feature in Kibana that allows data representation using different color combinations, shapes, and texts.

What is Logstash?

A tool for gathering, processing, and generating logs or events, centralizing real-time analysis.

What is an Event Object in Logstash?

The main object in Logstash, encapsulating data flow in the Logstash pipeline.

What is a Pipeline in Logstash?

Data flow stages in Logstash, from input to output, processing the data and sending it to the destination.

What is Input in Logstash?

First stage in the Logstash pipeline, used to get data from different platforms.

What is Filter in Logstash?

Middle stage of Logstash, where the actual processing of events takes place.

What is Output in Logstash?

Last stage in the Logstash pipeline, where output events are formatted and sent to destination systems.

What is MongoDB?

A cross-platform, document-oriented database providing high performance and scalability.

What is a Database in MongoDB?

A physical container for collections; each database has its own set of files on the file system.

What is a Collection in MongoDB?

A group of MongoDB documents, equivalent to an RDBMS table, existing within a single database.

What is a Document in MongoDB?

A set of key-value pairs, with dynamic schema, in MongoDB.

What does schema-less mean in MongoDB?

One collection can hold documents that differ in their fields, content, and size.

What is Apache Cassandra?

An open source, distributed database for handling large amounts of data across commodity servers.

What is a NoSQL Database?

Provides a mechanism to store and retrieve data other than tabular relations.

What is a Keyspace in Cassandra?

The outermost container for data; CQL treats the database (keyspace) as a container for tables.

Study Notes

Elasticsearch Overview

  • Elasticsearch is an Apache Lucene-based search server, first released by Shay Banon in 2010 and now maintained by Elasticsearch BV. These notes describe version 7.0.0, the latest release at the time they were written.
  • It is a real-time, distributed, open-source full-text search and analytics engine.
  • It is accessible via a RESTful web service interface and uses schema-less JSON documents for data storage.
  • Built on Java, Elasticsearch runs on various platforms, enabling high-speed exploration of large datasets.
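To make the "RESTful interface plus schema-less JSON" point concrete, here is a minimal sketch that builds, but does not send, an HTTP request storing one document. The host address, the index name `posts`, and the document fields are assumptions chosen for illustration; only the Python standard library is used.

```python
import json
import urllib.request

# Assumed local node address; 9200 is Elasticsearch's default HTTP port.
BASE_URL = "http://localhost:9200"


def build_index_request(index, doc_id, document):
    """Build (but do not send) a request that stores one JSON document."""
    url = f"{BASE_URL}/{index}/_doc/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical index name and fields; no schema is declared beforehand.
request = build_index_request("posts", 1, {"title": "Hello", "likes": 3})
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it to a running node; that step is left out so the sketch works without a cluster.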

Elasticsearch General Features

  • Scalable to petabytes of structured and unstructured data.
  • Can replace document stores like MongoDB and RavenDB.
  • Uses denormalization to enhance search performance.
  • Widely used by large organizations like Wikipedia, The Guardian, StackOverflow, and GitHub.
  • Released as open source under the Apache License 2.0 as of the version described here; licensing terms changed in later releases.

Elasticsearch Key Concepts

Node

  • A single running instance of Elasticsearch.
  • A single physical or virtual server can host multiple nodes, depending on its physical resources such as RAM, storage, and processing power.

Cluster

  • A collection of one or more nodes.
  • Provides collective indexing and search capabilities across all nodes for entire data.

Index

  • Collection of different types of documents and their properties.
  • Uses shards to improve performance.
  • For example, one index might hold the set of documents belonging to a social networking application.

Document

  • Collection of fields in a specific manner defined in JSON format.
  • Belongs to a type and resides inside an index.
  • Has a unique identifier (the UID), exposed as the document's `_id` field.

Shard

  • Indexes are horizontally subdivided into shards, each containing all document properties, but fewer JSON objects than the index.
  • Enables storage on independent nodes.
  • Primary shards are the original horizontal partitions of an index; each primary shard is then copied into replica shards.
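The idea of horizontally subdividing an index can be sketched as hashing each document's routing key and taking the result modulo the number of primary shards. This is a toy model under stated assumptions: `pick_shard` is a hypothetical helper, and MD5 stands in for whatever hash function Elasticsearch uses internally.

```python
import hashlib


def pick_shard(routing_key, num_primary_shards):
    """Map a routing key (here, the document ID) to a primary shard number."""
    # Stand-in hash: Elasticsearch uses its own hash function internally.
    digest = hashlib.md5(routing_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_primary_shards


# Hypothetical index with 3 primary shards and six document IDs.
placement = {}
for doc_id in ["1", "2", "3", "4", "5", "6"]:
    placement.setdefault(pick_shard(doc_id, 3), []).append(doc_id)
print(placement)
```

Because the shard number depends on the shard count, a scheme like this only stays stable while the number of primary shards is fixed.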

Replicas

  • Elasticsearch lets users create replicas (copies) of their indexes and shards.
  • Increases data availability in case of failure.
  • Enhances search performance through parallel search operations.
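A rough sketch of why replicas help availability: if no node holds two copies of the same shard, losing one node still leaves every shard reachable. The round-robin placement and the helper names below are assumptions for illustration, not Elasticsearch's actual allocation algorithm (which, for instance, leaves replicas unassigned rather than raising an error when nodes are scarce).

```python
def allocate(num_primary_shards, num_replicas, nodes):
    """Assign shard copies to nodes so no node holds two copies of a shard."""
    if num_replicas + 1 > len(nodes):
        # Simplification: the sketch refuses instead of leaving copies unassigned.
        raise ValueError("not enough nodes to keep every copy on a separate node")
    layout = {node: [] for node in nodes}
    for shard in range(num_primary_shards):
        for copy in range(num_replicas + 1):
            node = nodes[(shard + copy) % len(nodes)]
            role = "primary" if copy == 0 else "replica"
            layout[node].append((shard, role))
    return layout


def shards_available(layout, failed_node):
    """Return the shard numbers still reachable after one node fails."""
    return {
        shard
        for node, copies in layout.items()
        if node != failed_node
        for shard, _role in copies
    }


layout = allocate(3, 1, ["node-1", "node-2", "node-3"])
print(layout)
print(shards_available(layout, "node-1"))
```

With one replica per shard, any single node can fail and all three shards remain available on the other two nodes.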

Elasticsearch Advantages

  • Developed in Java, so it runs on almost any platform.
  • Provides near real-time search: newly added documents become searchable after about one second.
  • Distributed architecture allows easy scaling and integration in large organizations.
  • Simplifies full backups using the gateway concept.
  • Offers easy multi-tenancy compared to Apache Solr.
  • Uses JSON objects as responses, compatible with numerous programming languages.
  • Supports almost all document types, except those without text rendering.

Elasticsearch Disadvantages

  • Handles request and response data only in JSON, unlike Apache Solr, which supports CSV, XML, and JSON.
  • Can encounter "split brain" situations, in which a partitioned cluster ends up with more than one node acting as master.

Comparison between Elasticsearch and RDBMS

  • Cluster is similar to Database
  • Shard is similar to Shard
  • Index is similar to Table
  • Field is similar to Column
  • Document is similar to Row

Kibana Overview

  • Kibana is an open-source, browser-based visualization tool for analyzing large volumes of logs through line graphs, bar graphs, pie charts, heat maps, region maps, coordinate maps, gauges, goals, and Timelion.
  • It makes it easier to predict or observe changes in the trends of errors or other significant events.
  • Kibana works together with Elasticsearch and Logstash to form the ELK stack.

ELK Stack

  • ELK stands for Elasticsearch, Logstash, and Kibana, a popular log management platform used worldwide for log analysis.
  • Logstash extracts logging data from diverse input sources, processes it, and stores it in Elasticsearch.
  • Kibana is a visualization tool that accesses logs from Elasticsearch and displays them to users using graphs and charts.

Kibana Features

  • Visualization: Offers diverse ways to visualize data, including vertical bar charts, horizontal bar charts, pie charts, line graphs, and heat maps.
  • Dashboard: Place visualizations onto a single board, providing a clear overview of events.
  • Dev Tools: Used to manage indexes, add dummy data, and create visualizations.
  • Reports: Converts data from visualizations and dashboards into various formats (CSV), embeds in code, or generates shareable URLs.
  • Filters and Search query: Enables detailed data retrieval from dashboards or visualization tools.
  • Plugins: Supports addition of third-party plugins for new visualizations.
  • Coordinate and Region Maps: Displays visualization on geographical maps.
  • Timelion: Visualization tool for time-based data analysis, using simple expressions to connect to indexes and perform calculations for data comparison over time.
  • Canvas: Powerful feature for data representation via color combinations, shapes, texts and workpads.

Kibana Advantages

  • Open-source browser-based visualization tool.
  • Simple and easy to understand.
  • Allows easy conversion of visualizations and dashboards into reports.
  • Offers Canvas visualization for analyzing complex data and Timelion visualization for comparing historical data.

Kibana Disadvantages

  • Adding plugins can be tedious due to version mismatches.
  • Upgrading from older versions can lead to issues.

Logstash Overview

  • Logstash gathers, processes, and generates logs and events using a filter/pipes pattern, helping to centralize and analyze logs from different sources in real time.
  • It is written in JRuby, which runs on the JVM, so it works on a variety of operating systems.
  • It collects data such as events, logs, packets, timestamped data, and transactions from sources like social media, e-commerce, news, CRM, game data, mobile devices, web trends, financial data, and the Internet of Things.

Logstash General Features

  • Collects data from different sources and sends it to multiple destinations.
  • Can handle various logging data types.
  • Handles HTTP requests and response data.
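  • Offers a variety of filters that help users extract meaning from data by parsing and transforming it.
  • Used for handling sensor data in the Internet of Things.
  • Released as open source under the Apache License 2.0 as of the version described here; licensing terms changed in later releases.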

Logstash Key Concepts

Event Object

  • The main object in Logstash.
  • Encapsulates the data flowing through the Logstash pipeline.
  • Stores the input data and receives extra fields created during the filter stage.
  • Also referred to as a logging data event, log event, log data, input log data, or output log data.

Pipeline

  • Comprises the data flow stages from input to output.
  • Input data enters the pipeline and is processed as an event.
  • The event is then sent to an output destination in the desired format.

Input

  • The first stage, where data is received into the Logstash pipeline.
  • Plugins like File, Syslog, Redis, and Beats are used.

Filter

  • The middle stage, where the actual processing of events occurs.
  • Pre-defined regex patterns are used to build sequences that identify and separate the fields in an event.
  • Filter plugins like Grok, Mutate, Drop, Clone, and Geoip are used.

Output

  • The final stage, where output events are formatted and sent to destination systems.
  • Plugins such as Elasticsearch, File, Graphite, and Statsd are used.
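The three stages can be modelled in a few lines of Python. This is a toy illustration of the input, filter, and output flow, not Logstash's configuration language or implementation; the log line format, the regular expression (a simplified stand-in for a Grok pattern), and the function names are all assumptions.

```python
import re

# Simplified stand-in for a Grok pattern: named groups become event fields.
LINE_PATTERN = re.compile(
    r"(?P<client>\S+) (?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3})"
)


def input_stage(lines):
    """Input: wrap each raw line in an event (a plain dict here)."""
    for line in lines:
        yield {"message": line}


def filter_stage(events):
    """Filter: parse the message, add fields, and drop events that do not match."""
    for event in events:
        match = LINE_PATTERN.fullmatch(event["message"])
        if match is None:
            continue  # comparable in spirit to dropping an event
        event.update(match.groupdict())
        event["status"] = int(event["status"])  # comparable to a type conversion
        yield event


def output_stage(events):
    """Output: collect the events; a real pipeline would ship them elsewhere."""
    return list(events)


raw = ["10.0.0.1 GET /index.html 200", "garbage line", "10.0.0.2 POST /login 401"]
events = output_stage(filter_stage(input_stage(raw)))
for event in events:
    print(event)
```

Each event keeps its original `message` and gains the parsed fields, mirroring how the event object accumulates extra fields during the filter stage.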

Logstash Advantages

  • Regex pattern sequences identify and parse various fields in any input event.
  • Supports a variety of web servers and data sources for extracting logging data.
  • Multiple plugins parse and transform logging data as desired.
  • Centralized, easing the processing and collection of data from different servers.
  • Supports many databases, network protocols, and other services.
  • Communicates with Elasticsearch over HTTP, so Elasticsearch can be upgraded without having to upgrade Logstash in lockstep.

Logstash Disadvantages

  • Uses HTTP, which can negatively affect the processing of logging data.
  • Can be complex to work with and requires a good understanding of the input logging data and how it should be analyzed.
  • Filter plugins are not generic, so users may need to find the correct sequence of patterns to avoid parsing errors.

MongoDB Overview

  • MongoDB is a cross-platform, document-oriented database that provides high performance, high availability, and easy scalability.
  • It works on the concept of collections and documents.

Database

  • A physical container for collections; each database has its own set of files on the file system.
  • A single MongoDB server typically manages multiple databases.

Collection

  • A group of MongoDB documents, equivalent to an RDBMS table. A collection exists within a single database and does not enforce a schema.
  • Documents within a collection can have different fields.
  • Typically, documents are of similar or related purpose.

Document

  • Set of key-value pairs with a dynamic schema.
  • Documents in the same collection do not need to have the same fields or structure.
  • Common fields can hold different types of data.

MongoDB/RDBMS Terminology

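  • RDBMS Database corresponds to MongoDB Database
  • RDBMS Table corresponds to MongoDB Collection
  • RDBMS Tuple/Row corresponds to MongoDB Document
  • RDBMS Column corresponds to MongoDB Field
  • RDBMS Table Join corresponds to MongoDB Embedded Document
  • RDBMS Primary Key corresponds to MongoDB Primary Key (the default key _id is provided by MongoDB itself)
  • Database server: mysqld/Oracle corresponds to mongod
  • Database client: mysql/sqlplus corresponds to mongo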

Sample Document Structure

  • _id: ObjectId(7df78ad8902c) (a 12-byte hexadecimal number that guarantees uniqueness; MongoDB generates one if it is not supplied)
    • The first 4 bytes hold the current timestamp
    • The next 3 bytes hold the machine ID
    • The next 2 bytes hold the process ID of the MongoDB server
    • The remaining 3 bytes are a simple incremental value
  • title: 'MongoDB Overview'
  • description: 'MongoDB is no sql database'
  • by: 'tutorials point'
  • url: 'http://www.tutorialspoint.com'
  • tags: ['mongodb', 'database', 'NoSQL']
  • likes: 100
  • comments: an array of embedded documents, each with the fields:
    • user: 'user1'
    • message: 'My first comment'
    • dateCreated: new Date(2011,1,20,2,15)
    • like: 0
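The 4-3-2-3 byte layout of `_id` described above can be checked with a short sketch. The function name and the sample ObjectId value are assumptions chosen for illustration; newer MongoDB releases define the middle five bytes differently, so treat this as following the layout in these notes rather than as a definitive parser.

```python
from datetime import datetime, timezone


def split_object_id(hex_string):
    """Split a 24-character hex ObjectId using the 4-3-2-3 byte layout above."""
    raw = bytes.fromhex(hex_string)
    if len(raw) != 12:
        raise ValueError("an ObjectId is 12 bytes (24 hex characters)")
    seconds = int.from_bytes(raw[0:4], "big")
    return {
        "timestamp": datetime.fromtimestamp(seconds, tz=timezone.utc),
        "machine": raw[4:7].hex(),
        "process": int.from_bytes(raw[7:9], "big"),
        "counter": int.from_bytes(raw[9:12], "big"),
    }


# Hypothetical ObjectId value chosen for illustration.
parts = split_object_id("507f1f77bcf86cd799439011")
print(parts)
```

Because the leading bytes are a timestamp, ObjectIds generated later sort after earlier ones, which is one reason the default `_id` works as a rough creation-time ordering.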

MongoDB Advantages over RDBMS

  • Schema-less: A document database where one collection holds different documents. Number of fields, content, and size can differ.
  • Structure of a single object is clear.
  • No complex joins.
  • Supports dynamic queries on documents via a document-based language nearly as powerful as SQL.
  • Tuning.
  • Easy to scale-out.
  • Conversion/mapping of application objects to database objects is not needed.
  • Uses internal memory to hold the (windowed) working set, enabling faster access to data.

Why Use MongoDB?

  • Document Oriented Storage: Data is stored in JSON style documents.
  • Index on any attribute.
  • Replication and high availability.
  • Auto-Sharding.
  • Rich queries.
  • Fast in-place updates.
  • Professional support available.

MongoDB Use Cases

  • Big Data
  • Content Management and Delivery
  • Mobile and Social Infrastructure
  • User Data Management
  • Data Hub

MongoDB Data Modelling

  • Data in MongoDB has a flexible schema: documents in the same collection do not need to have identical fields or structure.
  • Fields shared by several documents in a collection may hold different types of data.

Data Model Design

  • MongoDB supports both embedded and normalized data models; which one to use depends on the requirements.

Embedded Data Model

  • Known as a de-normalized data model, it consolidates all related data into a single document.
  • Example: embedding the Personal_details, Contact, and Address of an employee in one document.

Example of Embedded Data Model

  • _id: <ObjectId101>
  • Emp_ID: "10025AE336"
  • Personal_details: {First_Name: "Radhika", Last_Name: "Sharma", Date_Of_Birth: "1995-09-26"}
  • Contact: {e-mail: "[email protected]", phone: "9848022338"}
  • Address: {city: "Hyderabad", Area: "Madapur", State: "Telangana"}

Normalized Data Model

  • In this model, the sub-documents are stored separately and linked to the original document through references.

Example of Normalized Data Model

  • Employee: {_id: <ObjectId101>, Emp_ID: "10025AE336"}
  • Personal_details: {_id: <ObjectId102>, empDocID: "ObjectId101", First_Name: "Radhika", Last_Name: "Sharma", Date_Of_Birth: "1995-09-26"}
  • Contact: {_id: <ObjectId103>, empDocID: "ObjectId101", e-mail: "[email protected]", phone: "9848022338"}
  • Address: {_id: <ObjectId104>, empDocID: "ObjectId101", city: "Hyderabad", Area: "Madapur", State: "Telangana"}
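The practical difference between the two models shows up when reading data. The sketch below uses plain Python dictionaries as stand-ins for documents, with the IDs written as strings and only some of the fields from the example; the helper function names are hypothetical.

```python
# Embedded (de-normalized): one document carries all related data.
embedded_employee = {
    "_id": "ObjectId101",
    "Emp_ID": "10025AE336",
    "Personal_details": {"First_Name": "Radhika", "Last_Name": "Sharma"},
    "Address": {"city": "Hyderabad", "State": "Telangana"},
}

# Normalized: separate documents point back to the employee by ID.
employees = [{"_id": "ObjectId101", "Emp_ID": "10025AE336"}]
addresses = [
    {"_id": "ObjectId104", "empDocID": "ObjectId101",
     "city": "Hyderabad", "State": "Telangana"},
]


def city_embedded(employee):
    """One lookup: the address lives inside the employee document."""
    return employee["Address"]["city"]


def city_normalized(emp_id):
    """Two lookups: find the employee, then follow the reference."""
    employee = next(e for e in employees if e["Emp_ID"] == emp_id)
    address = next(a for a in addresses if a["empDocID"] == employee["_id"])
    return address["city"]


print(city_embedded(embedded_employee))
print(city_normalized("10025AE336"))
```

The embedded form answers the question with a single read, while the normalized form avoids duplicating the address if several documents need to refer to it.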

MongoDB Schema Design Considerations

  • Design the schema according to user requirements, and make sure it does not depend on complex joins.
  • Optimize the schema for the most frequent use cases.
  • Perform complex aggregation in the schema.
  • Duplicate data (within limits), because disk space is cheaper than compute time.
  • Combine objects into one document if they are used together; otherwise keep them separate. Do joins at write time rather than at read time.

MongoDB RDBMS Schema Differences

  • Client requirements dictate the database design. For a blog/website application, for example:
    • MongoDB needs only one collection (post).
    • An RDBMS needs a minimum of three tables.

Cassandra Overview

  • Apache Cassandra: a highly scalable, high-performance distributed database designed to handle large amounts of data across many commodity servers while providing high availability. It is a type of NoSQL database.
  • NoSQL Database: "Not Only SQL"; stores and retrieves data using models other than the tabular relations of relational databases. Typical characteristics:
    • Schema-free
    • Easy replication
    • Simple API
    • Eventually consistent
    • Handles huge amounts of data
  • Primary objectives of a NoSQL database:
    • Simplicity of design
    • Horizontal scaling
    • Finer control over availability
  • NoSQL databases use different data structures from relational databases, which makes some operations faster.
  • HBase and MongoDB are also popular NoSQL databases.

NoSQL vs. Relational Databases

  • Relational Database: Supports a powerful query language, has a fixed schema, follows ACID, and supports transactions.
  • NoSQL Database: Typically supports a very simple query language, has no fixed schema, is only eventually consistent, and does not support transactions.

Apache Cassandra

  • Cassandra is an open-source, distributed, decentralized storage system for managing very large amounts of structured data spread across the world.
  • Provides a highly available service with no single point of failure.

Notable Points of Cassandra

  • Scalable, fault-tolerant, and consistent.
  • Column-oriented.
  • Its distribution design is based on Amazon's Dynamo, and its data model on Google's Bigtable.
  • Created at Facebook; it differs sharply from relational database management systems.
  • Implements a Dynamo-style replication model with no single point of failure, combined with a more powerful "column family" data model.
  • Used by Facebook, Twitter, Cisco, Rackspace, eBay, Netflix, and more.

Features of Cassandra

  • Elastic Scalability: More hardware can be added to accommodate more customers and data.
  • Always-On Architecture: No single point of failure, so it is continuously available for business-critical applications.
  • Fast Linear Scale Performance: Throughput increases as nodes are added to the cluster.
  • Flexible Data Storage: Supports structured, semi-structured, and unstructured data, and can accommodate changes to data structures dynamically.
  • Easy Data Distribution: Distributes data where it is needed by replicating it across multiple data centers.
  • Transaction Support: The source credits Cassandra with Atomicity, Consistency, Isolation, and Durability (ACID) properties; in practice these guarantees are narrower than in an RDBMS, with consistency tunable per operation.
  • Fast Writes: Designed to run on cheap commodity hardware, with very fast writes.

Cassandra History

  • Developed at Facebook for inbox search.
  • Open-sourced by Facebook in July 2008.
  • Accepted into Apache Incubator in March 2009.
  • Apache top-level project since February 2010.

Cassandra Architecture

  • The design goal is to handle big data workloads across multiple nodes without any single point of failure. Cassandra has a peer-to-peer architecture, with data distributed among all nodes in the cluster.
  • All nodes in a cluster play the same role; each node is independent and at the same time interconnected with the other nodes.
  • Each node can accept read and write requests regardless of where the data is actually located in the cluster.
  • When a node goes down, read and write requests can be served from other nodes in the network.

Data Replication in Cassandra

  • One or more nodes in a cluster act as replicas for a given piece of data.
  • If some replicas respond with an out-of-date value, Cassandra returns the most recent value to the client and then performs a read repair in the background to update the stale values.
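Read repair can be sketched as "pick the newest version, then overwrite the stale copies". The replica representation (a dict from key to a `(timestamp, value)` pair) and the function name are assumptions for illustration; the real system does the repair asynchronously and across the network.

```python
def read_with_repair(replicas, key):
    """Return the newest value for key and overwrite stale replica copies."""
    # Each replica maps key -> (write_timestamp, value).
    versions = [replica[key] for replica in replicas if key in replica]
    newest = max(versions, key=lambda versioned: versioned[0], default=None)
    if newest is None:
        return None  # no replica has the key
    for replica in replicas:
        if replica.get(key) != newest:
            replica[key] = newest  # done in the background in the real system
    return newest[1]


# Hypothetical replicas: one current, one stale, one missing the key entirely.
replicas = [
    {"user:1": (2, "new address")},
    {"user:1": (1, "old address")},
    {},
]
print(read_with_repair(replicas, "user:1"))
print(replicas)
```

After the read, all three replicas hold the version with timestamp 2, so a later read from any of them returns the same value.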

Cassandra Components

  • Node: The place where data is stored.
  • Data Center: A collection of related nodes.
  • Cluster: A component that contains one or more data centers.
  • Commit Log: A crash-recovery mechanism; every write operation is written to the commit log.
  • Mem-Table: A memory-resident data structure; data is written to it after being written to the commit log.
  • SSTable: A disk file to which data is flushed from the mem-table when its contents reach a threshold value.
  • Bloom Filter: A quick, probabilistic algorithm for testing whether an element is a member of a set.
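A Bloom filter answers "might this key be present?" with no false negatives and occasional false positives. The sketch below is a minimal textbook version, assuming SHA-256 with different seeds as the hash functions and a deliberately small bit array; it is not Cassandra's implementation.

```python
import hashlib


class BloomFilter:
    """A minimal Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive num_hashes positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for position in self._positions(item):
            self.bits[position] = 1

    def might_contain(self, item):
        return all(self.bits[position] for position in self._positions(item))


sstable_filter = BloomFilter()
sstable_filter.add("row-42")
print(sstable_filter.might_contain("row-42"))
```

A `False` answer means the key is definitely absent, which is what lets a read skip an SSTable without touching the disk.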

Cassandra Query Language

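  • Users access Cassandra through its nodes using the Cassandra Query Language (CQL), which treats the database (keyspace) as a container of tables.
  • Programmers work with CQL through cqlsh, an interactive prompt, or through separate application language drivers.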

Client approaches

  • Clients can approach any node for their read and write operations.
  • That node acts as the coordinator, a proxy between the client and the nodes that hold the data.

Write ops

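  • Every write to a node is first captured in the commit log written on that node.
  • The data is then captured and stored in the mem-table.
  • When the mem-table is full, its data is written to an SSTable data file. All writes are automatically partitioned and replicated throughout the cluster.
  • Cassandra periodically consolidates SSTables, discarding unnecessary data.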

Read ops

  • During a read, Cassandra gets values from the mem-table and checks the Bloom filter to find the SSTable that holds the required data.
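The write and read paths described above can be put together in one toy node. Everything here is a simplification under stated assumptions: `ToyNode`, its threshold of two entries, and the in-memory lists are hypothetical stand-ins, and a real node also trims its commit log and consults Bloom filters before opening SSTables.

```python
class ToyNode:
    """A toy model of one node's commit log, mem-table, and SSTables."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # in-memory store of the most recent writes
        self.sstables = []     # flushed tables, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. log for crash recovery
        self.memtable[key] = value            # 2. store in the mem-table
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. flush at the threshold

    def flush(self):
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def compact(self):
        merged = {}
        for table in self.sstables:  # later tables override earlier ones
            merged.update(table)
        self.sstables = [dict(sorted(merged.items()))]

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest first
            if key in table:  # a Bloom filter would let us skip most tables
                return table[key]
        return None


node = ToyNode(memtable_limit=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    node.write(key, value)
print(node.read("a"), len(node.sstables))
node.compact()
print(node.sstables)
```

Compaction merges the two flushed tables into one and discards the superseded value of `a`, which is the "discarding unnecessary data" step in miniature.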

Cassandra Data Model

  • The data model of Cassandra is different from an RDBMS.
  • Cluster: A Cassandra database is distributed over several machines that operate together.
  • Keyspace: The outermost container for data in a cluster.

Keyspace Replication

  • Replication: Copies of the same data are kept on multiple machines in the cluster.
  • Replica placement strategy: How the copies are placed around the machines (simple and network-topology-based options).
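One way to picture the simple placement option is a hash ring: hash the key to find its owning node, then walk clockwise to place the remaining copies. The sketch below is an assumption-laden illustration, with MD5 as a stand-in partitioner and hypothetical node names; it ignores data centers and racks, which the network-topology-based option takes into account.

```python
import hashlib
from bisect import bisect_right


def token(value):
    """Stand-in partitioner: hash a string onto a 0..2**32-1 ring."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def replicas_for(key, nodes, replication_factor):
    """Pick the owning node, then walk clockwise for the remaining copies."""
    ring = sorted((token(node), node) for node in nodes)
    tokens = [node_token for node_token, _node in ring]
    start = bisect_right(tokens, token(key)) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


# Hypothetical four-node cluster with three copies of each row.
cluster = ["node-a", "node-b", "node-c", "node-d"]
owners = replicas_for("user:1", cluster, 3)
print(owners)
```

Every node can run the same calculation, so any node acting as coordinator can work out which peers hold a given key.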

Column Families in a Keyspace

  • A keyspace is a container for a list of one or more column families.
  • Each column family, in turn, is a collection of rows, with each row containing ordered columns.
  • Column families represent the structure of the data.

Column Family Relational Table Differences

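  • Relational table: The schema is fixed. Once columns are defined, every inserted row must fill every column, at least with a null value.
  • Relational table: Defines only the columns; the user fills the table with values.
  • Column family attributes:
    • Keys cache: the number of locations to keep cached per SSTable.
    • Rows cache: the number of rows whose contents are cached in memory.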
