MongoDB PDF
Document Details
Uploaded by FaithfulDanburite6249
ENSAM
Summary
This document provides an introduction to MongoDB, a document-oriented database. It explains its core concepts, including documents, collections, and databases. The document also details scaling and data management aspects on MongoDB.
Full Transcript
Introduction to MongoDB

Introduction
MongoDB is a document-oriented database, not a relational one.
MongoDB was designed to scale out (partitioning data across more machines), as opposed to scaling up (getting a bigger machine).
MongoDB automatically takes care of balancing data and load across a cluster, redistributing documents automatically and routing reads and writes to the correct machines.
Some features common to relational databases are not present in MongoDB, notably complex joins.
The MongoDB project is best summarized by its main focus: to create a full-featured data store that is scalable, flexible, and fast.

Getting Started
A document is the basic unit of data and is roughly equivalent to a row in a relational database (but much more expressive).
A collection can be thought of as a table with a dynamic schema.
A single instance of MongoDB can host multiple independent databases.
Every document has a special key, "_id", that is unique within a collection.

Documents
A document is an ordered set of keys with associated values.
{"greeting" : "Hello, world!"}
The keys in a document are strings.
MongoDB is type-sensitive and case-sensitive.
Documents cannot contain duplicate keys.
A collection is a group of documents.

Dynamic Schemas
Collections have dynamic schemas. This means that the documents within a single collection can have any number of different "shapes."
The empty string ("") is not a valid collection name.
You should not create any collections with names that start with "system."
User-created collections should not contain the reserved character $.

Databases
MongoDB groups collections into databases. A single instance of MongoDB can host several databases.
Database names are limited to a maximum of 64 bytes.
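The naming rules above can be sketched as two small checks. This is only an illustration in plain JavaScript: checkCollectionName and databaseNameFits are hypothetical helper names, not MongoDB functions, and they cover only the rules listed in these notes.

```javascript
// Hypothetical helper: lists which of the collection-naming rules a name breaks.
function checkCollectionName(name) {
  const problems = [];
  if (name === "") problems.push("the empty string is not a valid name");
  if (name.startsWith("system.")) problems.push('the "system." prefix is reserved');
  if (name.includes("$")) problems.push('the name contains the reserved character "$"');
  return problems;
}

// Hypothetical helper: the database-name limit is counted in bytes, not characters.
function databaseNameFits(name) {
  return new TextEncoder().encode(name).length <= 64;
}

console.log(checkCollectionName("movies"));
console.log(checkCollectionName("system.users"));
console.log(databaseNameFits("video"));
```

Counting bytes rather than characters matters for names with non-ASCII characters, which take more than one byte each in UTF-8.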
To start the server
$ mongod

Running the Shell
$ mongosh
The shell automatically attempts to connect to a MongoDB server running on the local machine on startup, so make sure you start mongod before starting the shell.
The shell is a full-featured JavaScript interpreter, capable of running arbitrary JavaScript programs.
We can also leverage all of the standard JavaScript libraries.
We can even define and call JavaScript functions.

To see the database to which db is currently assigned:
> db
test
To select which database to use:
> use video
switched to db video
To access collections from the db variable:
> db.movies

Basic Operations with the Shell
The insertOne function adds a document to a collection (or use insertMany()):
> db.movies.insertOne({"test" : 3})
find and findOne can be used to query a collection:
> db.movies.findOne()
To perform an update (or use updateMany()):
> db.colc1.updateOne({"movie" : "nothing"}, {$set : {"movie" : "yaas"}})
Delete (or use deleteMany()):
> db.movies.deleteOne({title : "Star Wars: Episode IV - A New Hope"})

Data Types
Null - Boolean - Number - String - Date - Regular expression - Array - Embedded document - Object ID - Binary data - Code
{"x" : new Date()}
{"x" : /foobar/i}
{"x" : ObjectId()}
{"x" : function() { }}
MongoDB also makes it possible to store arbitrary JavaScript in queries and documents.

_id and ObjectIds
Every document stored in MongoDB must have an "_id" key.
The "_id" key's value can be any type, but it defaults to an ObjectId.
In a single collection, every document must have a unique value for "_id".
ObjectIds use 12 bytes of storage, which gives them a string representation that is 24 hexadecimal digits: 2 digits for each byte.
The 12 bytes of an ObjectId are generated as follows:
Bytes 0-3: a timestamp in seconds.
Bytes 4-8: a random value.
Bytes 9-11: a counter that starts with a random value.
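As a rough illustration of the 12-byte layout described above, the following plain-JavaScript sketch splits a 24-digit hexadecimal string into its three parts. parseObjectId is a hypothetical helper written for these notes, not a MongoDB function, and the sample string is used only as an illustrative value.

```javascript
// Hypothetical helper: splits a 24-hex-digit ObjectId string into the three
// parts described above (4-byte timestamp, 5-byte random value, 3-byte counter).
function parseObjectId(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new Error("expected 24 hexadecimal digits");
  }
  const timestampSeconds = parseInt(hex.slice(0, 8), 16); // bytes 0-3
  return {
    timestampSeconds,
    createdAt: new Date(timestampSeconds * 1000), // seconds -> milliseconds
    random: hex.slice(8, 18),                     // bytes 4-8, kept as hex text
    counter: parseInt(hex.slice(18, 24), 16),     // bytes 9-11
  };
}

const parts = parseObjectId("4b23c3ca7525f35f94b60a2d");
console.log(parts.createdAt.toISOString(), parts.random, parts.counter);
```

Because the timestamp comes first, ObjectIds created later tend to sort after earlier ones.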
Tips for Using the Shell
Database-level help is provided by db.help() and collection-level help by db.foo.help().
A good way of figuring out what a function is doing is to type it without the parentheses. This will print the JavaScript source code for the function.

Running Scripts with the Shell
$ mongo script1.js script2.js script3.js
To run a script using a connection to a mongod on a nondefault host/port:
$ mongo server-1:30000/foo --quiet script1.js script2.js script3.js
You can also run scripts from within the interactive shell using the load function:
> load("script1.js")
Shell helpers such as use db or show collections do not work from files. There are valid JavaScript equivalents to each.

Creating a .mongorc.js
If you have frequently loaded scripts, you might want to put them in your .mongorc.js file. This file is run whenever you start up the shell.

Customizing Your Prompt
The default shell prompt can be overridden by setting the prompt variable to either a string or a function.

Creating, Updating, and Deleting Documents

Inserting Documents
> db.movies.insertOne({"title" : "Stand by Me"})

insertMany
> db.movies.insertMany([{"title" : "Ghostbusters"},
                        {"title" : "E.T."},
                        {"title" : "Blade Runner"}],
                       {"ordered" : false});
When using insertMany for bulk inserts, the choice between ordered and unordered operations affects the insertion sequence.
For ordered inserts (the default), documents are inserted in array order; an error halts all inserts past that point.
For unordered inserts (ordered: false), MongoDB tries to insert every document regardless of errors, and may reorder the inserts to improve performance.

Insert Validation
MongoDB does minimal checks on data being inserted: it checks the document's basic structure and adds an "_id" field if one does not exist. One of the basic structure checks is size: all documents must be smaller than 16 MB.
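The difference between ordered and unordered inserts can be sketched with a toy in-memory model. This is not MongoDB itself: simulateInsertMany is a hypothetical function, and the only error it models is a duplicate "_id".

```javascript
// Toy model (not MongoDB): "_id" values must be unique, and a duplicate counts as an error.
function simulateInsertMany(existingIds, docs, ordered = true) {
  const ids = new Set(existingIds);
  const inserted = [];
  const errors = [];
  for (const doc of docs) {
    if (ids.has(doc._id)) {
      errors.push(doc._id);
      if (ordered) break; // ordered: stop at the first error
      continue;           // unordered: keep trying the remaining documents
    }
    ids.add(doc._id);
    inserted.push(doc._id);
  }
  return { inserted, errors };
}

const docs = [{ _id: 0 }, { _id: 1 }, { _id: 1 }, { _id: 2 }];
console.log(simulateInsertMany([], docs, true));  // stops before _id 2
console.log(simulateInsertMany([], docs, false)); // still inserts _id 2
```

In the ordered run the document after the failure is never attempted; in the unordered run only the duplicate itself is skipped.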
Removing Documents
> db.movies.deleteOne({"_id" : 4})
> db.movies.deleteMany({"year" : 1984})

drop
If you want to clear an entire collection, it is faster to drop it:
> db.movies.drop()
Or remove all of its documents:
> db.movies.deleteMany({})

Updating Documents

Document Replacement
replaceOne fully replaces a matching document with a new one (here joe is a variable holding the replacement document):
> db.users.replaceOne({"name" : "joe"}, joe);

Using Update Operators
When using operators, the value of "_id" cannot be changed. The "_id" field is immutable, so a replacement document must also either omit "_id" or keep the same value. Values for any other key, including other uniquely indexed keys, can be modified.

The "$set" modifier
"$set" sets the value of a field. If the field does not yet exist, it will be created.
> db.users.updateOne({"name" : "joe"}, {"$set" : {"favorite book" : "Green Eggs and Ham"}})
You can remove the key altogether with "$unset":
> db.users.updateOne({"name" : "joe"}, {"$unset" : {"favorite book" : 1}})

Incrementing and decrementing
We can use the "$inc" modifier to add 50 to the player's score:
> db.games.updateOne({"game" : "pinball", "user" : "joe"}, {"$inc" : {"score" : 50}})

Array operators: adding elements
"$push" adds elements to the end of an array if the array exists and creates a new array if it does not.
> db.blog.posts.updateOne({"title" : "A blog post"}, {"$push" : {"comment" : "that is nice"}})
You can push multiple values in one operation using the "$each" modifier for "$push":
> db.stock.ticker.updateOne({"_id" : "GOOG"}, {"$push" : {"hourly" : {"$each" : [562.776, 562.790, 559.123]}}})
This pushes three new elements onto the array.
You can use the "$slice" modifier with "$push" to prevent an array from growing beyond a certain size:
> db.movies.updateOne({"genre" : "horror"}, {"$push" : {"top10" : {"$each" : ["Night", "Saw"], "$slice" : -10}}})
This example limits the array to the last 10 elements pushed. "$slice" can be used to create a queue in a document.
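To make the combined effect of "$push", "$each", and "$slice" concrete, here is a plain-JavaScript model of what happens to the array. pushEachSlice is a hypothetical function written for illustration; it operates on ordinary arrays and does not talk to a database.

```javascript
// Plain-JavaScript model of {"$push": {field: {"$each": values, "$slice": n}}}.
function pushEachSlice(array, values, slice) {
  // "$push" creates the array when it does not exist yet.
  const result = (array || []).concat(values);
  if (slice === undefined) return result;
  if (slice === 0) return [];
  // A negative "$slice" keeps the last |n| elements; a positive one keeps the first n.
  return slice < 0 ? result.slice(slice) : result.slice(0, slice);
}

// A fixed-size queue: the oldest entries fall off the front as new ones arrive.
console.log(pushEachSlice(["a", "b", "c"], ["d", "e"], -3));
console.log(pushEachSlice(undefined, ["Night", "Saw"], -10));
```

When the array is shorter than the limit, nothing is trimmed, which is why the second call returns both pushed values.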
You can apply the "$sort" modifier to "$push" operations before trimming:
> db.movies.updateOne({"genre" : "horror"},
    {"$push" : {"top10" : {"$each" : [{"name" : "Nightmare on Elm Street", "rating" : 6.6},
                                      {"name" : "Saw", "rating" : 4.3}],
                           "$slice" : -10,
                           "$sort" : {"rating" : -1}}}})
This sorts all of the objects in the array by their "rating" field and then trims the array to 10 elements. Because "$slice" is negative here, the last 10 elements of the sorted array are kept; a positive "$slice" keeps the first 10 instead. Note that you must include "$each"; you cannot just "$slice" or "$sort" an array with "$push".

You might want to treat an array as a set, only adding values if they are not present. This can be done using "$ne" in the query document:
> db.papers.updateOne({"authors cited" : {"$ne" : "Richie"}}, {$push : {"authors cited" : "Richie"}})
Or with "$addToSet":
> db.users.updateOne({"_id" : 6}, {"$addToSet" : {"emails" : "[email protected]"}})

Array operators: removing elements
There are a few ways to remove elements from an array. If you want to treat the array like a queue or a stack, you can use "$pop", which can remove elements from either end.
{"$pop" : {"key" : 1}} removes an element from the end of the array.
{"$pop" : {"key" : -1}} removes it from the beginning.
"$pull" is used to remove elements of an array that match the given criteria.
> db.lists.updateOne({}, {"$pull" : {"todo" : "laundry"}})

Array operators: positional array modifications
Arrays use 0-based indexing, and elements can be selected as though their index were a document key. If we want to increment the number of votes for the first comment, we can say the following:
> db.blog.updateOne({"post" : post_id}, {"$inc" : {"comments.0.votes" : 1}})
MongoDB's positional operator $ enables updating a specific element within an array without knowing its index beforehand. In an update operation such as updateOne, "$" stands for the array element that matched the query document.
For instance, the update below renames the comment author "John" to "Jim". The positional operator updates only the first match: if John left more than one comment, only the first is changed and the others keep the original name.
> db.blog.updateOne({"comments.author" : "John"}, {"$set" : {"comments.$.author" : "Jim"}})

Array operators: updates using array filters
If we want to hide all comments with five or more down votes, we can do something like the following:
> db.blog.updateOne(
    {"post" : post_id},
    { $set: { "comments.$[elem].hidden" : true } },
    { arrayFilters: [ { "elem.votes": { $lte: -5 } } ] }
  )
This command defines elem as the identifier for each matching element in the "comments" array. If the votes value for the comment identified by elem is less than or equal to -5, we add a field called "hidden" to that comment and set its value to true.

Upserts
The "upsert" option in MongoDB's update operations combines update and insert. When it is set, the operation attempts to update a document that matches the specified criteria. If no matching document is found, it creates a new one from the criteria and applies the update to it.
For instance,
> db.collection.updateOne({"field": "value"}, {"$set": {"field2": "newValue"}}, {"upsert": true})
will either update an existing document where "field" equals "value" or create a new one with "field" as "value" and "field2" set to "newValue".
Sometimes a field needs to be set when a document is created, but not changed on subsequent updates. This is what "$setOnInsert" is for. "$setOnInsert" is an operator that only sets the value of a field when the document is being inserted.
> db.users.updateOne({}, {"$setOnInsert" : {"createdAt" : new Date()}}, {"upsert" : true})

Updating Multiple Documents
updateMany follows the same semantics as updateOne and takes the same parameters. The key difference is in the number of documents that might be changed.
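The upsert behaviour described above, including the role of "$setOnInsert", can be sketched with a toy model over an ordinary array. upsertOne is a hypothetical function: it handles only equality filters and the two operators shown, and it leaves out the "_id" that a real insert would add.

```javascript
// Toy model of updateOne(filter, {$set, $setOnInsert}, {upsert: true}).
// Only equality filters are modelled; the collection is a plain array.
function upsertOne(collection, filter, update) {
  const matches = (doc) => Object.keys(filter).every((k) => doc[k] === filter[k]);
  const existing = collection.find(matches);
  if (existing) {
    // A document matched: apply "$set" only; "$setOnInsert" is ignored.
    Object.assign(existing, update.$set);
    return "updated";
  }
  // No match: seed the new document from the filter, then apply both operators.
  collection.push({ ...filter, ...update.$setOnInsert, ...update.$set });
  return "inserted";
}

const users = [];
console.log(upsertOne(users, { name: "joe" }, { $set: { visits: 1 }, $setOnInsert: { createdAt: "day-1" } }));
console.log(upsertOne(users, { name: "joe" }, { $set: { visits: 2 }, $setOnInsert: { createdAt: "day-2" } }));
console.log(users);
```

After the second call the document still carries the first "createdAt" value, which is the point of "$setOnInsert".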
Querying
> db.users.find({"age" : 27})
> db.users.find({"username" : "joe", "age" : 27})
An empty query document (i.e., {}) matches everything in the collection. If find isn't given a query document, it defaults to {}.

Specifying Which Keys to Return
Sometimes you do not need all of the key/value pairs in a document returned. If this is the case, you can pass a second argument to find (or findOne) specifying the keys you want.
> db.users.find({}, {"username" : 1, "email" : 1})
The "_id" key is returned by default, even if it isn't specifically requested. You can prevent "_id" from being returned:
> db.users.find({}, {"username" : 1, "_id" : 0})

Query Criteria

Query Conditionals
"$lt", "$lte", "$gt", and "$gte" are all comparison operators, corresponding to <, <=, >, and >=, respectively.
To look for users who are between the ages of 18 and 30:
> db.users.find({"age" : {"$gte" : 18, "$lte" : 30}})
To query for documents where a key's value is not equal to a certain value, you must use another conditional operator, "$ne", which stands for "not equal."
> db.users.find({"username" : {"$ne" : "joe"}})

OR Queries
"$in" is very flexible and allows you to specify criteria of different types as well as values.
> db.users.find({"user_id" : {"$in" : [12345, "joe"]}})
The opposite of "$in" is "$nin", which returns documents that don't match any of the criteria in the array.
> db.raffle.find({"ticket_no" : {"$nin" : [725, 542, 390]}})
"$or" takes an array of possible criteria.
> db.raffle.find({"$or" : [{"ticket_no" : 725}, {"winner" : true}]})

"$and"
{ $and: [ { Expression1 }, { Expression2 }, ..., { ExpressionN } ] }

"$not"
> db.users.find({"id_num" : {"$not" : {"$mod" : [5, 1]}}})

Type-Specific Queries

null
In MongoDB, null matches itself, so you can query for documents where a field's value is null. However, null also matches fields that do not exist. For instance, querying {"z": null} also returns documents without "z".
To find only keys whose value is null, combine "$exists" and "$eq":
> db.c.find({"z": {"$eq": null, "$exists": true}})

Regular Expressions
If we want to find all users with the name "Joe" or "joe," we can use a regular expression to do case-insensitive matching:
> db.users.find({"name" : {"$regex" : /joe/i}})
Regular expressions can also match themselves. Very few people insert regular expressions into the database, but if you insert one, you can match it with itself:
> db.foo.find({"bar" : /baz/})
{ "_id" : ObjectId("4b23c3ca7525f35f94b60a2d"), "bar" : /baz/ }

Querying Arrays
Querying for elements of an array is designed to behave the way querying for scalars does. For example, if the array is a list of fruits, like this:
> db.food.insertOne({"fruit" : ["apple", "banana", "peach"]})
the following query will successfully match the document:
> db.food.find({"fruit" : "banana"})
If you need to match arrays by more than one element, you can use "$all". This allows you to match a list of elements.
> db.food.find({fruit : {$all : ["apple", "banana"]}})
You can also query by exact match using the entire array. However, exact match will not match a document if any elements are missing or superfluous.
> db.food.find({"fruit" : ["apple", "banana", "peach"]})
If you want to query for a specific element of an array, you can specify an index using the syntax key.index:
> db.food.find({"fruit.2" : "peach"})
A useful conditional for querying arrays is "$size", which allows you to query for arrays of a given size.
> db.food.find({"fruit" : {"$size" : 3}})
"$size" cannot be combined with another $ conditional (for example, "$gt").
As mentioned earlier in this chapter, the optional second argument to find specifies the keys to be returned. The special "$slice" operator can be used to return a subset of elements for an array key.
> db.blog.posts.findOne(criteria, {"comments" : {"$slice" : 10}})
This returns the first 10 comments. If we wanted the last 10 comments, we could use -10.
> db.blog.posts.findOne(criteria, {"comments" : {"$slice" : [23, 10]}})
This would skip the first 23 elements and return the 24th through 33rd. If there were fewer than 33 elements in the array, it would return as many as possible.
Unless otherwise specified, all keys in a document are returned when "$slice" is used.
You can return the matching element with the $ operator.
> db.blog.posts.find({"comments.name" : "bob"}, {"comments.$" : 1})

Scalars (nonarray elements) in documents must match each clause of a query's criteria. For example, if you queried for {"x" : {"$gt" : 10, "$lt" : 20}}, "x" would have to be both greater than 10 and less than 20. However, if a document's "x" field is an array, the document matches if there is an element of "x" that matches each part of the criteria, but each query clause can match a different array element.
> db.test.find({"x" : {"$gt" : 10, "$lt" : 20}})
{"x" : [5, 25]}
Neither 5 nor 25 is between 10 and 20, but the document is returned because 25 matches the first clause (it is greater than 10) and 5 matches the second clause (it is less than 20).
You can use "$elemMatch" to force MongoDB to compare both clauses with a single array element. However, the catch is that "$elemMatch" won't match nonarray elements:
> db.test.find({"x" : {"$elemMatch" : {"$gt" : 10, "$lt" : 20}}})

Querying on Embedded Documents
We can query for someone named Joe Schmoe with the following:
> db.people.find({"name" : {"first" : "Joe", "last" : "Schmoe"}})
However, a query for a full subdocument must exactly match the subdocument. If Joe decides to add a middle name field, this query will no longer work.
You can query for embedded keys using dot notation:
> db.people.find({"name.first" : "Joe", "name.last" : "Schmoe"})

$where Queries
For security, use of "$where" clauses should be highly restricted or eliminated.
End users should never be allowed to execute arbitrary "$where" clauses.
We'd like to return documents where any two of the fields are equal. It's unlikely MongoDB will ever have a $ conditional for this, so we can use a "$where" clause to do it with JavaScript:
> db.foo.find({"$where" : function () {
    for (var current in this) {
      for (var other in this) {
        if (current != other && this[current] == this[other]) {
          return true;
        }
      }
    }
    return false;
  }});
If the function returns true, the document will be part of the result set; if it returns false, it won't be.
"$where" queries should not be used unless strictly necessary: they are much slower than regular queries.

Cursors
To try out a cursor, first insert some documents in a loop, calling db.collection.insertOne({x : i}) on each iteration, and then run a query:
> var cursor = db.collection.find();
cursor.hasNext() checks that the next result exists, and cursor.next() fetches it.
> var cursor = db.people.find();
> while (cursor.hasNext()) {
...   obj = cursor.next();
...   // do stuff
... }
The cursor can also be iterated with forEach:
> cursor.forEach(function(x) {
...   print(x.name);
... });
Almost every method on a cursor object returns the cursor itself, so that you can chain options in any order.
> var cursor = db.foo.find().sort({"x" : 1}).limit(1).skip(10);
At this point, the query has not been executed yet. All of these functions merely build the query. Now, suppose we call the following:
> cursor.hasNext()
At this point, the query will be sent to the server. The shell fetches the first 100 results or first 4 MB of results (whichever is smaller) at once so that the next calls to next or hasNext will not have to make trips to the server. After the client has run through the first set of results, the shell will again contact the database and ask for more results with a getMore request. getMore requests basically contain an identifier for the cursor and ask the database if there are any more results, returning the next batch if there are. This process continues until the cursor is exhausted and all results have been returned.
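The lazy, batch-by-batch behaviour described above can be sketched with a toy cursor over an ordinary array. makeCursor is a hypothetical function, the batch size is a parameter rather than the shell's real limits, and each call to fetchBatch merely stands in for a round trip such as a getMore request.

```javascript
// Toy model of a cursor: nothing is fetched until hasNext() or next() is called,
// and a new "round trip" happens only when the current batch has been used up.
function makeCursor(allResults, batchSize) {
  let position = 0; // index of the next result still on the "server"
  let batch = [];   // results already delivered to the "client"
  let roundTrips = 0;

  function fetchBatch() {
    batch = allResults.slice(position, position + batchSize);
    position += batch.length;
    roundTrips += 1;
  }

  return {
    hasNext() {
      if (batch.length === 0 && position < allResults.length) fetchBatch();
      return batch.length > 0;
    },
    next() {
      if (!this.hasNext()) throw new Error("cursor exhausted");
      return batch.shift();
    },
    roundTrips: () => roundTrips,
  };
}

const cursor = makeCursor(Array.from({ length: 250 }, (_, i) => ({ x: i })), 100);
let seen = 0;
while (cursor.hasNext()) {
  cursor.next();
  seen += 1;
}
console.log(seen, cursor.roundTrips());
```

Most calls to next are served from the batch already in memory, which is why iterating a cursor does not cost one server trip per document.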
Limits, Skips, and Sorts
To set a limit, chain the limit function onto your call to find. To only return three results, use this:
> db.c.find().limit(3)
skip works similarly. This will skip the first three matching documents and return the rest of the matches:
> db.c.find().skip(3)
sort takes an object where the keys are key names and the values are the sort directions: 1 (ascending) or -1 (descending).
> db.c.find().sort({username : 1, age : -1})

A server-side cursor represents a database operation and consumes memory and resources. Cursors are typically freed when they run out of results, go out of scope on the client side, or remain inactive for 10 minutes. This prevents idle cursors from tying up server resources.
To avoid automatic closure due to inactivity, some drivers offer an "immortal" function that disables the timeout for a cursor. If you use it, you must either iterate through all results or explicitly close the cursor; otherwise it persists in the database and consumes resources until the server restarts.

Indexes
After loading a large number of users into a collection in a loop, examine how a query is executed with explain:
> db.users.find({"username": "user101"}).explain("executionStats")
Creating an index on the "username" field:
> db.users.createIndex({"username" : 1})
Creating the index should take no longer than a few seconds, unless you made your collection especially large. If the createIndex call does not return after a few seconds, run db.currentOp() (in a different shell) or check your mongod's log to see the index build's progress.
With the index in place, the query is now almost instantaneous.
However, indexes have their price: write operations (inserts, updates, and deletes) that modify an indexed field will take longer.

Introduction to Compound Indexes
In general, if MongoDB uses an index for a query it will return the resulting documents in index order.
A compound index is an index on more than one field. It is useful if your query has multiple sort directions or multiple keys in the criteria.
> db.users.createIndex({"age" : 1, "username" : 1})
Without an index to sort on, MongoDB must sort results in memory. If you have more than 32 MB of results, MongoDB will just error out, refusing to sort that much data.

Introduction to the Aggregation Framework
To aggregate, we pass in an aggregation pipeline. A pipeline is an array with documents as elements. Each of the documents must stipulate a particular stage operator.

Getting Started with Stages: Familiar Operations
Let's do a simple filter looking for all companies that were founded in 2004:
> db.companies.aggregate([
    {$match: {founded_year: 2004}},
  ])
Now let's add a project stage to our pipeline to reduce the output to just a few fields per document.
> db.companies.aggregate([
    {$match: {founded_year: 2004}},
    {$project: {_id: 0, name: 1, founded_year: 1}}
  ])
{"name": "Digg", "founded_year": 2004 }

> db.companies.aggregate([
    {$match: {founded_year: 2004}},
    {$sort: {name: 1}},
    {$skip: 10},
    {$limit: 5},
    {$project: {
      _id: 0,
      name: 1}},
  ])
Let's review our pipeline one more time. We have five stages. First, we're filtering the companies collection, looking only for documents where the "founded_year" is 2004. Then we're sorting based on the name in ascending order, skipping the first 10 matches, and limiting our end results to 5. Finally, we pass those five documents on to the project stage, where we reshape the documents such that our output documents contain just the company name.

$project
> db.companies.aggregate([
    {$match: {"funding_rounds.investments.financial_org.permalink": "greylock"}},
    {$project: {
      _id: 0,
      name: 1,
      ipo: "$ipo.pub_year",
      valuation: "$ipo.valuation_amount",
      funders: "$funding_rounds.investments.financial_org.permalink"
    }}
  ]).pretty()
The project stage we have defined in this aggregation pipeline will suppress the "_id" and include the "name". It will also promote some nested fields. This project uses dot notation to express field paths that reach into the "ipo" field and the "funding_rounds" field to select values from those nested documents and arrays.

$unwind
> db.companies.aggregate([
    { $match: {"funding_rounds.investments.financial_org.permalink": "greylock"} },
    { $unwind: "$funding_rounds" },
    { $project: {
      _id: 0,
      name: 1,
      amount: "$funding_rounds.raised_amount",
      year: "$funding_rounds.funded_year"
    } }
  ])

Array Expressions
> db.companies.aggregate([
    { $match: {"funding_rounds.investments.financial_org.permalink": "greylock"} },
    { $project: {
      _id: 0,
      name: 1,
      founded_year: 1,
      rounds: { $filter: {
        input: "$funding_rounds",
        as: "round",
        cond: { $gte: ["$$round.raised_amount", 100000000] } } }
    } },
    { $match: {"rounds.investments.financial_org.permalink": "greylock" } },
  ]).pretty()
The rounds field uses a filter expression. The $filter operator is designed to work with array fields and specifies the options we must supply. The first option to $filter is input. For input, we simply specify an array. In this case, we use a field path specifier to identify the "funding_rounds" array found in documents in our companies collection. Next, with as, we specify the name we'd like to use for each element of this "funding_rounds" array throughout the rest of our filter expression. Then, as the third option, we need to specify a condition. The condition should provide criteria used to filter whatever array we've provided as input, selecting a subset. In this case, we're filtering such that we only select elements where the "raised_amount" for a funding round is greater than or equal to 100 million. In specifying the condition, we've made use of $$.
We use $$ to reference a variable defined within the expression we're working in. The as clause defines a variable within our filter expression.

The $arrayElemAt operator enables us to select an element at a particular slot within an array. The following pipeline provides an example of using $arrayElemAt:
> db.companies.aggregate([
    { $match: { "founded_year": 2010 } },
    { $project: {
      _id: 0, name: 1, founded_year: 1,
      first_round: { $arrayElemAt: [ "$funding_rounds", 0 ] },
      last_round: { $arrayElemAt: [ "$funding_rounds", -1 ] }
    } }
  ]).pretty()

Related to $arrayElemAt is the $slice expression. This allows us to return not just one but multiple items from an array in sequence, beginning with a particular index.
> db.companies.aggregate([
    { $match: { "founded_year": 2010 } },
    { $project: {
      _id: 0, name: 1, founded_year: 1,
      early_rounds: { $slice: [ "$funding_rounds", 1, 3 ] }
    } }
  ]).pretty()
Here, again with the funding_rounds array, we begin at index 1 and take three elements from the array.

Accumulators
The accumulators that the aggregation framework provides enable us to perform operations such as summing all values in a particular field ($sum), calculating an average ($avg), etc. We also consider $first and $last to be accumulators because these consider values in all documents that pass through the stage in which they are used. $max and $min are two more examples of accumulators that consider a stream of documents and save just one of the values they see. We can use $mergeObjects to combine multiple documents into a single document.
We also have accumulators for arrays. We can $push values onto an array as documents pass through a pipeline stage. $addToSet is very similar to $push except that it ensures no duplicate values are included in the resulting array.
> db.companies.aggregate([
    { $match: { "funding_rounds": { $exists: true, $ne: [ ]} } },
    { $project: {
      _id: 0,
      name: 1,
      largest_round: { $max: "$funding_rounds.raised_amount" }
    } }
  ])

Introduction to Grouping
> db.companies.aggregate([
    { $group: {
      _id: { founded_year: "$founded_year" },
      average_number_of_employees: { $avg: "$number_of_employees" }
    } },
    { $sort: { average_number_of_employees: -1 } }
  ])
{ "_id" : { "founded_year" : 1847 }, "average_number_of_employees" : 405000 }
Here we're using a group stage to aggregate together all companies based on the year they were founded, then calculate the average number of employees for each year.

> db.companies.aggregate([
    { $match: { "relationships.person": { $ne: null } } },
    { $project: { relationships: 1, _id: 0 } },
    { $unwind: "$relationships" },
    { $group: {
      _id: "$relationships.person",
      count: { $sum: 1 }
    } },
    { $sort: { count: -1 } }
  ]).pretty()
With $sum: 1, each document adds 1 to its group, so count is the number of relationships that share the same "_id" (the same person). With $sum: 2 instead, each count would be doubled.
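What the group stage computes can be sketched in plain JavaScript over an ordinary array. groupBy is a hypothetical function written for illustration, the sample documents are made up, and it assumes every document has a numeric value for the averaged field.

```javascript
// Plain-JavaScript model of a $group stage with {$sum: 1} and {$avg: "$field"},
// followed by a descending $sort on the average.
function groupBy(docs, key, field) {
  const groups = new Map();
  for (const doc of docs) {
    const g = groups.get(doc[key]) || { _id: doc[key], count: 0, total: 0 };
    g.count += 1;          // {$sum: 1} adds 1 for every document in the group
    g.total += doc[field]; // running total used to compute the average
    groups.set(doc[key], g);
  }
  return [...groups.values()]
    .map((g) => ({ _id: g._id, count: g.count, average: g.total / g.count }))
    .sort((a, b) => b.average - a.average);
}

// Made-up sample data, for illustration only.
const companies = [
  { founded_year: 2004, number_of_employees: 10 },
  { founded_year: 2004, number_of_employees: 30 },
  { founded_year: 2010, number_of_employees: 50 },
];
console.log(groupBy(companies, "founded_year", "number_of_employees"));
```

Replacing the increment of 1 with 2 would double every count, which mirrors the remark about $sum: 2 above.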