MongoDB Replication, Sharding, & Storage (Group 4)

{01} Replication
Replication is the process of creating a copy of the same data set on more than one MongoDB server.
- Increases fault tolerance by keeping backups
- Enables load balancing
A "Replica Set" is a group of MongoDB instances that maintain the same data set and pertain to any mongod process. It enables database admins to provide:
1. Data Redundancy
2. High Availability of Data

{01} Replication
How does Replication Work?
Replication works through a "Replica Set", which consists of multiple MongoDB nodes grouped together. A replica set requires a minimum of three MongoDB nodes:
- Primary node - receives all the write operations
- Secondary nodes - replicate the data from the primary node
- Replica set elections - the selection process that promotes the most suitable secondary node as a new primary node
- Arbiter - participates in elections without holding a copy of the data set; cannot become a primary node

How to replicate sets and modify nodes in MongoDB?

Step 01 - SET UP NODES
To start, you'll need MongoDB installed on three or more nodes. Each of the nodes in the cluster must be able to communicate with the others over a standard port (27017 by default). Additionally, each replica set member needs a hostname that is resolvable from the others.
1. Get the directory path of MongoDB from Program Files and add cd before the directory path (e.g., cd C:\Program Files\MongoDB\Server\8.0\bin).
2. Go to C:\data and create 3 folders, one data directory per replica set member. For this example, name them rs1, rs2, and rs3.

Step 02 - Start each member of the replica set with the appropriate options.
Open Command Prompt and type these commands:
start mongod --replSet test1 --logpath \data\rs1.log --dbpath \data\rs1 --port 2001
start mongod --replSet test1 --logpath \data\rs2.log --dbpath \data\rs2 --port 2002
start mongod --replSet test1 --logpath \data\rs3.log --dbpath \data\rs3 --port 2003
This opens 3 command prompts in total, each running one member of the replica set named test1. A log path and a db path are set for each member's respective directory, along with the port where it will be hosted.

Step 03 - Connect a mongosh shell to the nodes.
For this example, the node on port 2001 will become the primary node and the other two will be secondary nodes.
mongosh --port 2001

Step 04 - Set up the config.
Use this code to set up a config that lists all of the nodes:
config = { _id: "test1", members: [
  { _id: 0, host: "localhost:2001" },
  { _id: 1, host: "localhost:2002" },
  { _id: 2, host: "localhost:2003" }
] };

Step 05 - Initiate the replica set.
From the mongosh shell connected to member 0, run rs.initiate(config). This command initializes the replica set and should be run on only one member; the remaining members are added from the configuration. If no config document is passed, rs.initiate() starts a default one-member replica set.
rs.initiate(config)

Step 06 - View the replica set configuration.
Use rs.conf() to view the replica set configuration.

Step 07 - Ensure that the replica set has a primary.
Use rs.status() to identify the primary in the replica set; it also shows the details of all the nodes.

{01} Replication
Adding a new node to the Replica Set
Use the rs.add() command to add a new node to an existing replica set. Useful member settings:
- host - IP address or hostname of the new node
- priority - relative eligibility of the new node to become the primary node (priority 0 = the node cannot become the primary node)
- votes - whether the node is capable of voting in elections that select the primary node
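As a quick illustration of rs.add() with these settings, here is a minimal mongosh sketch; the fourth member on localhost:2004 is hypothetical and would be started the same way as the members above.

// Add a hypothetical fourth member that can vote but can never become primary.
rs.add({ host: "localhost:2004", priority: 0, votes: 1 })
// Confirm the membership and see which node is currently primary.
rs.status().members.forEach(m => print(`${m.name} ${m.stateStr}`))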
{01} Replication
Removing a node from the Replica Set
Use the rs.remove() command to remove a node from the replica set; the node is specified by its name (hostname:port).

{01} Replication
Benefits of Replication
- Replication increases data availability and reliability by maintaining multiple live copies.
- It protects data during hardware failures or server crashes, minimizing downtime.
- Replication safeguards against data loss by distributing data across multiple servers.
- It supports collaboration for distributed analytics teams on business intelligence projects.

{02} Sharding
Sharding is a way of sharing data among numerous devices. MongoDB employs sharding to facilitate installations with massive data sets and high-performance processes.

WHY SHARDING?
- Sharding in MongoDB allows horizontal scaling to manage large workloads.
- It uses a shared-nothing architecture, where nodes don't share resources.
- Data is divided into shards, each functioning as a replica set for redundancy and availability.
- Shards handle specific workloads, and partitions can be added or removed based on demand.

{02} Sharding
TWO METHODS OF ADDRESSING SYSTEM GROWTH
- Vertical Scaling - expanding the capacity of a single server, e.g., upgrading CPU, RAM, or storage space.
- Horizontal Scaling - distributing the system's dataset and load over numerous servers, and adding more servers as needed to boost capacity.

{02} Sharding
SHARDED CLUSTER
- Shard - contains a subset of the sharded data; deployed as a replica set.
- Mongos - serves as a query router, connecting client applications to the sharded cluster; MongoDB can enable hedged reads to reduce latencies.
- Config servers - store cluster metadata and configuration parameters; must be deployed as a config server replica set (CSRS).

{02} Sharding
MONGODB SHARDING BENEFITS
INCREASED READ/WRITE THROUGHPUT
- Parallelism is achieved by distributing the data set across multiple shards.
- If each shard can process 1,000 operations per second, adding a shard increases throughput by an additional 1,000 operations per second.
INCREASED STORAGE CAPACITY
- Increasing the number of shards increases total storage capacity.
- If one shard holds 4 TB of data, adding a shard increases storage capacity by an additional 4 TB. This approach offers near-infinite storage scalability.
HIGH AVAILABILITY
- Replica sets are essential for sharding to function. They improve data availability by utilizing multiple servers.
- A dataset is divided into shards (e.g., S1, S2, S3) and distributed across servers. Each shard has one or more replica copies (e.g., R1, R2, R3) for redundancy and high availability.
DATA LOCALITY
- Zone sharding facilitates the design of distributed databases for geographically dispersed applications.
- It supports policies requiring data residency in specific regions. Each zone can contain one or more shards.

{02} Sharding
DATA DISTRIBUTION: SHARD KEY
- Sharding in MongoDB occurs at the collection level; the user selects which collection to shard.
- A shard key is used to distribute a collection's documents across shards.
- Data is partitioned into non-overlapping intervals (chunks) based on shard key values. Each data chunk has a maximum size of 128 MB.
- MongoDB distributes these chunks evenly across the cluster's shards (see the inspection sketch below).
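A minimal mongosh sketch for inspecting how data ends up spread across shards; the database and collection names (shop.orders) are hypothetical and assume the collection has already been sharded.

// Cluster-wide summary of shards, databases, and chunk ranges.
sh.status()
// Per-shard document and chunk counts for one (hypothetical) sharded collection.
use shop
db.orders.getShardDistribution()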
{02} Sharding
DATA DISTRIBUTION: BALANCER
- A background process that automatically migrates data chunks between shards.
- Ensures that each shard maintains an equal amount of data.

{02} Sharding
Sharding Strategy: Ranged Sharding
- Divides data into ranges based on shard key values; shard keys with similar values are likely to be stored in the same chunk.
- Enables targeted operations, as MongoDB can route operations to the specific shards that contain the relevant data.

{02} Sharding
Zone Sharding
- Separates data into distinct zones based on application requirements; zones can be created for specific user groups.
- Each zone can be associated with one or more shards, and each shard can store data from one or more zones.

{02} Sharding
Sharding Strategy: Hashed Sharding
- Involves creating a hash of the shard key field's value; data chunks are assigned ranges based on the hashed shard key values.
- Ensures that data is spread evenly across shards.

How to create a sharded collection in MongoDB?

Step 01 - SET UP THE CONFIG SERVERS
Each config server replica set can have any number of mongod processes (up to 50), with the following exceptions: no arbiters and no zero-priority members. For each of these, start it with the --configsvr option. For example:
mongod --configsvr --replSet <configReplSetName> --dbpath <path> --port 27019 --bind_ip localhost,<hostname>
From there, connect to just one of the replica set members:
mongosh --host <hostname> --port 27019
And run rs.initiate() on only one of the replica set members:
rs.initiate( {
  _id: "<configReplSetName>",
  configsvr: true,
  members: [
    { _id: 0, host: "<host1:port>" },
    { _id: 1, host: "<host2:port>" },
    { _id: 2, host: "<host3:port>" }
  ]
} )

Step 02 - SET UP SHARDS
As previously stated, each shard is a replica set in and of itself. This process is similar to setting up the config servers, but with the --shardsvr option. Use a separate replica set name for each shard.
mongod --shardsvr --replSet <shardReplSetName> --dbpath <path> --port 27018 --bind_ip <hostname>
From there, connect to just one of the replica set members:
mongosh --host <hostname> --port 27018
Execute rs.initiate() on only one of the replica set members. Make sure to leave out the --configsvr option:
rs.initiate( {
  _id: "<shardReplSetName>",
  members: [
    { _id: 0, host: "<host1:port>" },
    { _id: 1, host: "<host2:port>" },
    { _id: 2, host: "<host3:port>" }
  ]
} )

Step 03 - START THE MONGOS
Finally, start mongos and point it to your config server replica set:
mongos --configdb <configReplSetName>/<cfg1:port>,<cfg2:port>,<cfg3:port> --bind_ip localhost,<hostname>

Step 04 - CONFIGURE AND TURN ON SHARDING FOR THE DATABASE
Connect to your mongos:
mongosh --host <hostname> --port <port>
Add your shards to the cluster. Repeat this for each shard:
sh.addShard( "<shardReplSetName>/<host1:port>,<host2:port>,<host3:port>" )
Enable sharding for your database:
sh.enableSharding("<database>")

Step 05 - SHARD THE COLLECTION
Finally, shard your collection with the sh.shardCollection() method. You can use hashed sharding to distribute data evenly between shards:
sh.shardCollection("<database>.<collection>", { <shard key field>: "hashed", ... } )
Alternatively, you can use range-based sharding to optimize distribution among shards depending on shard key values. For certain collections, this will improve the efficiency of queries across data ranges. The command goes as follows:
sh.shardCollection("<database>.<collection>", { <shard key field>: 1, ... } )
And that's it! You've now created your first sharded cluster. Any future application interaction should be limited to the routers (mongos instances).
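To make the placeholder commands concrete, here is a minimal sketch run from a mongosh session against the mongos; the shard replica set name, hostnames, database "shop", collection "orders", and field names are all hypothetical, and only one of the two sh.shardCollection() forms would be used for a given collection.

// Add one shard (repeat per shard), then enable sharding on a hypothetical database.
sh.addShard("shardA/shard1.example.net:27018,shard2.example.net:27018,shard3.example.net:27018")
sh.enableSharding("shop")
// Hashed shard key: spreads writes evenly across shards.
sh.shardCollection("shop.orders", { customerId: "hashed" })
// Range-based alternative (instead of the hashed form): efficient for range queries on order dates.
// sh.shardCollection("shop.orders", { orderDate: 1 })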
{03} STORAGE
3 STORAGE SYSTEMS OF MongoDB
- Storage Engine: the primary component of MongoDB, responsible for managing data.
- GridFS (for self-managed deployments): a versatile storage system designed for handling large files that exceed the 16 MB document size limit.
- Journal: a write-ahead log used in combination with checkpoints to help recover data after a hard shutdown. It has configurable options to balance performance and reliability.

{03} STORAGE
WiredTiger Storage Engine
The default storage engine in MongoDB, automatically selected unless otherwise specified. Supported in all deployments of MongoDB: Atlas, Enterprise, & Community.

{03} STORAGE
WiredTiger Storage Engine: OPERATION AND LIMITATIONS
TRANSACTION CONCURRENCY
- Transactions are dynamically optimized for performance, with a limit of 128 read and 128 write transactions per node.
- To specify a maximum number of read and write transactions (read and write tickets) that the dynamic maximum cannot exceed, use the storageEngineConcurrentReadTransactions and storageEngineConcurrentWriteTransactions parameters (see the startup sketch below).
- The WiredTiger cache isn't partitioned for reads/writes, and performance may degrade under heavy write workloads.
- WiredTiger uses optimistic concurrency control.

{03} STORAGE
WiredTiger Storage Engine: DOCUMENT-LEVEL CONCURRENCY
- Uses MultiVersion Concurrency Control (MVCC), which allows multiple clients to write to different documents simultaneously.
SNAPSHOTS AND CHECKPOINTS
- Snapshots: present consistent views of data at the start of operations.
- Checkpoints: data is periodically written to disk in consistent snapshots every 60 seconds, ensuring data durability.

{03} STORAGE
JOURNAL & COMPRESSION
Journal
- Persists all data modifications between checkpoints.
- Journal data is compressed by default using the snappy library, unless otherwise specified with the storage.wiredTiger.engineConfig.journalCompressor setting.
Compression
- Minimizes storage use at the expense of additional CPU usage.
- Uses block compression with the snappy compression library for all collections, and prefix compression for all indexes.
- To specify an alternate compression algorithm or no compression, use the storage.wiredTiger.collectionConfig.blockCompressor setting. For indexes, to disable prefix compression, use the storage.wiredTiger.indexConfig.prefixCompression setting.

{03} STORAGE
WiredTiger Storage Engine: MEMORY USAGE
- Internal Cache: allocates 50% of available RAM minus 1 GB (or a 256 MB minimum). This cache stores data and indexes in memory for faster access.
- To adjust the size of the WiredTiger internal cache, see storage.wiredTiger.engineConfig.cacheSizeGB and --wiredTigerCacheSizeGB. Avoid increasing the WiredTiger internal cache size above its default value.
- Filesystem Cache: the OS uses available memory for cached disk I/O, supplementing WiredTiger's internal cache.

{03} STORAGE
Journaling and the WiredTiger Storage Engine
- Journaling logs write operations to on-disk journal files, maintaining consistent data on disk.
- Journaling is required to recover information written after the last checkpoint.
- Journaling is always enabled.

{03} STORAGE
WiredTiger Storage Engine: Journaling Process
- Journal Records: one journal record is created for each client-initiated write operation (including internal modifications like index updates).
- Buffering: WiredTiger stores journal records temporarily in memory buffers (up to 128 kB).
- Syncing Conditions: the buffer is synced to disk every 100 milliseconds, and a new journal file is created after 100 MB.
- Risk: updates can be lost if a crash occurs while journal records are still in the memory buffer.
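The cache, compression, and concurrency settings named above can be combined at startup; a minimal command-line sketch with hypothetical paths and values, assuming MongoDB 7.0+ for the storageEngineConcurrent* parameter names (the equivalent storage.wiredTiger.* options can go in the configuration file instead):

REM Hypothetical standalone mongod with explicit WiredTiger cache, compression, and concurrency settings.
mongod --dbpath \data\db --port 27017 ^
  --wiredTigerCacheSizeGB 1 ^
  --wiredTigerJournalCompressor snappy ^
  --wiredTigerCollectionBlockCompressor zstd ^
  --wiredTigerIndexPrefixCompression true ^
  --setParameter storageEngineConcurrentReadTransactions=128 ^
  --setParameter storageEngineConcurrentWriteTransactions=128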
{03} STORAGE
WiredTiger Storage Engine: Journal Files and Records
- Location: MongoDB stores journal files in a journal subdirectory under the dbPath.
- Naming: files are named using the format WiredTigerLog.<sequence>, where <sequence> is a zero-padded number starting from 0000000001.
- Each journal record is uniquely identified, has a minimum size of 128 bytes, and includes both the initial write operation and any internal modifications.

{03} STORAGE
WiredTiger Storage Engine: Journal Compression, Size Limit, and Pre-Allocation
Compression
- Customization: other compression algorithms can be specified, or compression can be disabled.
- Records smaller than 128 bytes are not compressed.
Journal File Size Limit
- Maximum Size: WiredTiger journal files have a size limit of 100 MB.
- Management: older journal files are automatically removed once recovery from the last checkpoint is possible.
Pre-Allocation
- WiredTiger pre-allocates journal files to optimize performance.
- Storage Space Calculation: estimating disk space for journal files is complex; overestimating space is safer. Additional disk space should be allocated to ensure proper operation and recovery.

THANK YOU!
GROUP 4: BIONG, ABIGAIL; GARCIA, RENDEL; GONZALES, MARCO; MARIÑAS, JUSTIN KYLLE; OBLIGADO, ATHENA