Handling Unstructured Data for Data Science PDF

Summary

This document provides an overview of handling unstructured data in data science, focusing on NoSQL databases. It explains the concept of NoSQL, its advantages over traditional relational databases, and its flexible data models. It touches on various NoSQL types like key-value, document, columnar, and graph databases and how they excel at handling large distributed data sets.

Full Transcript

Advanced NoSQL for Data Science

NoSQL is a database design that can accommodate various data models, including key-value, document, columnar, and graph formats. NoSQL, which means "not only SQL", is an alternative to relational databases, which store data in tables with a fixed schema. NoSQL databases are very useful for working with large distributed data. They were built in the early 2000s to deal with large-scale database clustering in web and cloud applications.

NoSQL has a flexible schema, unlike the traditional relational database model: rows can have different structures or attributes. NoSQL databases are very useful for big data tasks because they follow the Basically Available, Soft State, Eventual Consistency (BASE) approach instead of Atomicity, Consistency, Isolation, and Durability, commonly known as the ACID properties.

Two major drawbacks of SQL databases are rigidity when adding columns and attributes to tables, and slow performance when many tables need to be joined or when tables store a large amount of data. NoSQL databases try to overcome these two drawbacks by offering a more flexible, schema-free solution that can work with unstructured data.

ACID in the Traditional Relational Database Model

Atomicity
Example: The transaction to transfer $100 involves two operations: debiting $100 from your savings account and crediting $100 to your checking account.
Both operations must succeed or fail together. If debiting $100 from your savings account succeeds but crediting $100 to your checking account fails, the entire transaction is rolled back, and no money is moved.

Consistency
Example: Suppose your savings account has $500 and your checking account has $300 before the transaction.
After transferring $100, the total balance across both accounts should remain $800.
The database enforces rules to ensure that the accounts' total remains consistent and that no data integrity rules (such as a minimum balance) are violated.

Isolation
Example: Two transactions occur simultaneously: one transferring $100 from savings to checking, and another withdrawing $50 from savings.
Each transaction should operate as if it were the only transaction in the system. If these transactions are not isolated, one might see an intermediate state where the savings account balance is incorrect. Isolation ensures that the result is the same as if the first transaction had completed entirely before the second one started, or vice versa, preventing interference.

Durability
Example: After successfully transferring $100, the system crashes.
Once the transaction is committed (i.e., the $100 has been debited and credited accordingly), the changes must be permanently recorded. Even if the system crashes right after the transaction is committed, the transferred amount should not be lost, and the state of the accounts should reflect the completed transaction when the system recovers.

BASE in NoSQL

Basically Available: The system guarantees availability of the data, meaning it will always respond to a request, although it might not immediately return the most recent write.
Soft State: The state of the system may change over time, even without input. This is in contrast to ACID, where state consistency is strictly maintained. In BASE, intermediate states are allowed.
Eventual Consistency: While the system may not be consistent at all times, it will become consistent eventually. This allows the system to continue to operate and accept writes, even if some of the nodes are temporarily out of sync.

Why NoSQL

NoSQL supports unstructured and semi-structured data. In many applications, an attribute needs to be added on the fly for specific rows but not for every row, and it may be of a different type than the attributes in other rows.
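The point about adding an attribute on the fly to some rows only can be pictured with plain JavaScript objects. This is only a sketch of the idea, not database code: the `rows` array and the sample names are made up for illustration.

```javascript
// Three "rows" kept together; each one carries a different set of attributes.
const rows = [
  { id: 1, name: "Asha" },
  { id: 2, name: "Ravi", age: 21 },
  { id: 3, name: "Mina", tags: ["alumni"] },
];

// Add an attribute on the fly to one specific row only.
const target = rows.find((row) => row.id === 1);
target.age = "twenty"; // a string here, a number in row 2: types may differ per row

// List the attribute names each row actually has.
const shapes = rows.map((row) => Object.keys(row).sort().join(","));
console.log(shapes);
```

A relational table would need a schema change (and a single column type) before row 1 could receive the new attribute; here no other row is affected.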
Features

NoSQL does not use the relational model to store data. It runs well on clusters. It is mostly open-source. It is capable of handling large amounts of social media data. It is schema-less.

Document Databases for Data Science

Document-based NoSQL databases store data as JSON objects. Each document is a structure of key-value pairs. Document-based NoSQL databases are simple for engineers to work with because items map directly to JSON objects. JSON is a very common data format, widely adopted by web developers, and it permits the structure to change whenever required. Some examples of document-based NoSQL databases are CouchDB, MongoDB, OrientDB, and BaseX.

JSON document format:

    {
      "id": 1,
      "name": { "First": "John", "last": "Backus" },
      "Contribs": [ "Fortran", "ALGOL", "Form", "FP" ],
      "awards": [
        { "award": "Dowell Award", "year": 1988, "by": "Computer Society" },
        { "award": "First Prize", "year": 1993, "by": "National Academy of Engineering" }
      ]
    }

Graph Databases for Data Science

A graph database stores data in the form of nodes and edges. A node stores information about the main entities, such as people, places, and products, and an edge stores the relationships between them. Graph databases are very useful for finding patterns or relationships in data, as in social networks and recommendation engines. Examples of graph databases are Neo4j and Amazon Neptune.

MongoDB

MongoDB is a popular open-source NoSQL database that stores data in a document-oriented format. Unlike traditional relational databases, which store data in tables, MongoDB stores data as JSON-like documents, making it more flexible and scalable. MongoDB is a powerful, flexible, and scalable general-purpose database. MongoDB is a document-oriented database, not a relational one.
The primary reason for moving away from the relational model is to make scaling out easier. A document-oriented database replaces the concept of a "row" with a more flexible model, the "document." By allowing embedded documents and arrays, the document-oriented approach makes it possible to represent complex hierarchical relationships with a single record. There are also no predefined schemas: a document's keys and values are not of fixed types or sizes. Without a fixed schema, adding or removing fields as needed becomes easier. Generally, this makes development faster, as developers can quickly iterate. It is also easier to experiment.

Easy Scaling

Data set sizes for applications are growing at an incredible pace. Increases in available bandwidth and cheap storage have created an environment where even small-scale applications need to store more data than many databases were meant to handle.

Scaling a database comes down to the choice between scaling up (getting a bigger machine) or scaling out (partitioning data across more machines). Scaling up is often the path of least resistance, but it has drawbacks: large machines are often very expensive, and eventually a physical limit is reached where a more powerful machine cannot be purchased at any cost. The alternative is to scale out: to add storage space or increase performance, buy another commodity server and add it to your cluster. This is both cheaper and more scalable; however, it is more difficult to administer a thousand machines than it is to care for one.

MongoDB was designed to scale out. Its document-oriented data model makes it easier to split up data across multiple servers. MongoDB automatically takes care of balancing data and load across a cluster, redistributing documents automatically and routing user requests to the correct machines. This allows developers to focus on programming the application, not scaling it.
When a cluster needs more capacity, new machines can be added and MongoDB will figure out how the existing data should be spread across them.

Features

MongoDB is intended to be a general-purpose database, so aside from creating, reading, updating, and deleting data, it provides an ever-growing list of unique features:

Indexing: MongoDB supports generic secondary indexes, allowing a variety of fast queries, and provides unique, compound, geospatial, and full-text indexing capabilities as well.
Aggregation: MongoDB supports an "aggregation pipeline" that allows you to build complex aggregations from simple pieces and allows the database to optimize them.
Special collection types: MongoDB supports time-to-live collections for data that should expire at a certain time, such as sessions. It also supports fixed-size collections, which are useful for holding recent data, such as logs.
File storage: MongoDB supports an easy-to-use protocol for storing large files and file metadata.

Some features common to relational databases were deliberately left out of early MongoDB releases, notably joins and complex multirow transactions. Omitting these was an architectural decision to allow for greater scalability, as both of those features are difficult to provide efficiently in a distributed system. Later releases added a $lookup aggregation stage for join-like queries and support for multi-document transactions.

Intro

A document is the basic unit of data for MongoDB and is roughly equivalent to a row in a relational database management system. A collection can be thought of as a table with a dynamic schema. Every document has a special key, "_id", that is unique within a collection. MongoDB comes with a simple but powerful JavaScript shell, which is useful for the administration of MongoDB instances and data manipulation.

Database: In MongoDB, a database is a container for collections. It is where all the data is stored. You can think of a database as a namespace for collections. You can create as many databases as you need, and each database can have one or more collections.
Collections: A collection in MongoDB is a group of related documents. It is similar to a table in a relational database, but without a fixed schema. You can add or remove fields in individual documents at any time, without affecting other documents in the collection. A collection can have one or more indexes, which can be used to optimize query performance.

Documents: A document in MongoDB is a JSON-like data structure that represents a single instance of data. It is similar to a row in a relational database, but with a more flexible and dynamic structure. A document can contain any number of fields, which can be of different data types, and can have nested structures.

    {
      "name": "Sam",
      "age": 35,
      "gender": "male",
      "married": true,
      "address": { "street": "cherry Road", "city": "Salem", "state": "Tamil Nadu" },
      "hobbies": [ { "name": "Cooking" }, { "name": "Sports" } ]
    }

How MongoDB Looks Compared to an RDBMS

RDBMS table:

    first_name   last_name   email
    Joe          Satana      [email protected]
    Bob          Michel      [email protected]

MongoDB documents:

    [
      { "first_name": "Joe", "last_name": "Satana", "email": "[email protected]" },
      { "first_name": "Bob", "last_name": "Michel", "email": "[email protected]" }
    ]

Documents

At the heart of MongoDB is the document: an ordered set of keys with associated values. In JavaScript, for example, documents are represented as objects:

    {"greeting" : "Hello, world!"}

This simple document contains a single key, "greeting", with a value of "Hello, world!". Most documents will be more complex than this simple one and often will contain multiple key/value pairs:

    {"greeting" : "Hello, world!", "foo" : 3}

Values in documents can be one of several different data types (or even an entire embedded document). In the above example, the value for "greeting" is a string, whereas the value for "foo" is an integer.

The keys in a document are strings. Any UTF-8 character is allowed in a key, with a few notable exceptions: Keys must not contain the character \0 (the null character).
This character is used to signify the end of a key. The . and $ characters have some special properties and should be used only in certain circumstances.

MongoDB is type-sensitive and case-sensitive. For example, these documents are distinct:

    {"foo" : 3}
    {"foo" : "3"}

as are these:

    {"foo" : 3}
    {"Foo" : 3}

A document in MongoDB cannot contain duplicate keys. For example, the following is not a legal document:

    {"greeting" : "Hello, world!", "greeting" : "Hello, MongoDB!"}

Key/value pairs in documents are ordered: {"x" : 1, "y" : 2} is not the same as {"y" : 2, "x" : 1}.

Collections

A collection is a group of documents. If a document is the MongoDB analog of a row in a relational database, then a collection can be thought of as the analog to a table.

Dynamic Schemas

Collections have dynamic schemas. This means that the documents within a single collection can have any number of different "shapes." For example, both of the following documents could be stored in a single collection:

    {"greeting" : "Hello, world!"}
    {"foo" : 5}

These documents not only have different types for their values (string versus integer) but also have entirely different keys.

Naming

The empty string ("") is not a valid collection name. Collection names may not contain the character \0 (the null character) because this delineates the end of a collection name. You should not create any collections that start with system., a prefix reserved for internal collections. For example, the system.users collection contains the database's users, and the system.namespaces collection contains information about all of the database's collections. User-created collections should not contain the reserved character $ in the name. The various drivers available for the database do support using $ in collection names because some system-generated collections contain it, but you should not use $ in a collection name of your own.

Databases

In addition to grouping documents by collection, MongoDB groups collections into databases.
A single instance of MongoDB can host several databases, each grouping together zero or more collections. A good rule of thumb is to store all data for a single application in the same database. Separate databases are useful when storing data for several applications or users on the same MongoDB server.

The empty string ("") is not a valid database name. A database name cannot contain any of these characters: /, \, ., ", *, <, >, :, |, ?, $, (a single space), or \0 (the null character). Basically, stick with alphanumeric ASCII. Database names are case-sensitive, even on non-case-sensitive filesystems. To keep things simple, try to just use lowercase characters. Database names are limited to a maximum of 64 bytes.

There are also several reserved database names, which you can access but which have special semantics. These are as follows:

admin: This is the "root" database, in terms of authentication. If a user is added to the admin database, the user automatically inherits permissions for all databases. There are also certain server-wide commands that can be run only from the admin database, such as listing all of the databases or shutting down the server.
local: This database will never be replicated and can be used to store any collections that should be local to a single server.
config: When MongoDB is being used in a sharded setup, it uses the config database to store information about the shards.

Introduction to the MongoDB Shell

MongoDB comes with a JavaScript shell that allows interaction with a MongoDB instance from the command line. The shell is useful for performing administrative functions and inspecting a running instance.

Running the Shell

To start the shell, run the mongo executable:

    $ mongo
    MongoDB shell version: 2.4.0
    connecting to: test

The shell can use the standard JavaScript libraries, and we can even define and call JavaScript functions. You can also create multiline commands. The shell will detect whether the JavaScript statement is complete when you press Enter.
If the statement is not complete, the shell will allow you to continue writing it on the next line. Pressing Enter three times in a row will cancel the half-formed command and get you back to the > prompt.

CREATE NEW DATABASE

Create a new database with the use command followed by the desired database name. For example, to switch to a new database named "school", type:

    use school

If the "school" database already exists, mongosh switches to it. If it doesn't exist, mongosh still switches to it, and the database is created when data is first stored in it.

Inserting a new document

You can then start inserting data into the new "school" database. For example, you can use the insertOne command to insert documents into collections within the "school" database. Here's an example of inserting a document into a "students" collection in the "school" database:

    db.students.insertOne({"name": "James", "age": 17, "city": "Chennai"})

This inserts a document with the fields "name", "age", and "city" into the "students" collection in the "school" database. If the "school" database does not already exist, MongoDB automatically creates it for you.

Display or select documents from the collection (Read Operations)

Use the db.collection_name.find() method to query data from a specific collection. For example, if you have a collection named students, you can use the following command to find all documents in the students collection:

    db.students.find()

INSERT AN ARRAY OF DATA IN A DOCUMENT

To insert an array of data into a document in MongoDB, you can use the insertOne() method with an object that contains an array field. The array field can then contain any number of elements, including strings, numbers, objects, or other arrays.
Here's an example query that inserts a document with an array of hobbies into a collection called "students":

    db.students.insertOne({"_id": 1, "name": "Jona", "hobbies": ["Coding", "Playing", "Gaming"]});

If you want to insert multiple documents with arrays, you can use the insertMany() method instead of insertOne(), and pass an array of objects to the method.

INSERT MULTIPLE DOCUMENTS

insertMany() is a MongoDB method that allows you to insert multiple documents into a collection in one operation. Here's an example of how to use insertMany() to insert multiple documents into the students collection:

    db.students.insertMany([
      {"_id": 2, "name": "Joel"},
      {"_id": 3, "name": "Trina", "age": 17},
      {"name": "Joseph", "age": 18}
    ]);

    db.students.insertMany([
      { "Name": "Jabez", "Age": 17, "Gender": "Male", "City": "Covai" },
      { "Name": "Jonah", "Age": 17, "Gender": "Male", "City": "Madurai" },
      { "Name": "Abi", "Age": 17, "Gender": "Female", "City": "Coimbatore" }
    ])

    db.students.find()

Returns all documents in the students collection.

Displaying certain documents

    db.students.find({ name: "Joel" })

Returns all documents in the students collection that have a name field equal to "Joel".

    db.students.find({}, {"_id": 0, "name": 1, "age": 1})

Returns all documents in the students collection, but only with the name and age fields. The _id field is excluded because it is given as 0; name and age are given as 1, so those fields are included.

Show All Databases

In MongoDB, you can list all the databases by using the show dbs command.

Create Collection inside a Database

Switch to the database: use the use command to switch to the database where you want to create the collection. Replace "user" with the name of your target database:

    use user

Syntax

    db.createCollection("collectionName")

Example

    db.createCollection("student")

If the collection is successfully created, MongoDB will acknowledge it with a { "ok" : 1 } response.
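Returning to the projection passed to find() earlier in this part, { "_id": 0, "name": 1, "age": 1 }, the inclusion rule can be mimicked in plain JavaScript. The `project` helper below is hypothetical and written only for illustration; it is not part of MongoDB, and it covers just the simple case of top-level fields marked 1 plus an optional `_id: 0`.

```javascript
// Hypothetical helper mimicking an inclusion projection such as
// { _id: 0, name: 1, age: 1 }: fields marked 1 are kept, and _id is kept
// unless it is explicitly marked 0.
function project(doc, projection) {
  const result = {};
  if (projection._id !== 0 && "_id" in doc) {
    result._id = doc._id;
  }
  for (const [field, flag] of Object.entries(projection)) {
    if (field !== "_id" && flag === 1 && field in doc) {
      result[field] = doc[field];
    }
  }
  return result;
}

// Sample documents shaped like the ones inserted above.
const students = [
  { _id: 2, name: "Joel" },
  { _id: 3, name: "Trina", age: 17 },
];

const projected = students.map((doc) => project(doc, { _id: 0, name: 1, age: 1 }));
console.log(projected);
```

A document that lacks a requested field (Joel has no age) simply comes back without it, which matches the dynamic-schema behavior described earlier.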
Show Collections in a Database

    show collections

Delete Database

    db.dropDatabase()

Update

Update the age of the student whose name is Abi:

    db.students.updateOne({"Name": "Abi"}, {$set: {"Age": 18}})

The same $set form inserts a new field in the document when the field does not exist yet.

Replace

    db.users.replaceOne(
      { username: "john_doe" },   // Filter
      {                           // Replacement document
        username: "john_doe",
        name: "John Doe",
        email: "[email protected]",
        age: 30,
        city: "New York"
      },
      { upsert: true }            // Options: upsert inserts the document if no match is found
    );

Upserts

An upsert is a special type of update. If no document is found that matches the update criteria, a new document will be created by combining the criteria and the updated document.

Delete Operations

The deleteOne() method in MongoDB is used to remove a single document from a collection that matches a specified filter.

    db.student.deleteOne({age: 17})

This deletes the first document that matches the filter (age: 17) from the student collection.

$addToSet operator

In MongoDB, the $addToSet operator is used to add a value to an array field, but only if the value does not already exist in the array. This operation is particularly useful when you want to maintain an array of unique values within a document, avoiding duplicates automatically.
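The add-only-if-absent rule can be pictured in plain JavaScript. The `addToSet` function below is a hypothetical stand-in written for illustration, not MongoDB's implementation; it compares with strict equality, so it is only meant for simple values such as strings and numbers, and the sample document is made up.

```javascript
// Hypothetical helper: returns a copy of `doc` in which `value` has been added
// to the array at doc[field], but only when the value is not already present.
function addToSet(doc, field, value) {
  const current = doc[field] ?? [];
  const updated = current.includes(value) ? current : [...current, value];
  return { ...doc, [field]: updated };
}

const learner = { _id: 7, name: "Asha", skills: ["python", "sql"] };

const withR = addToSet(learner, "skills", "r");   // "r" is new, so it is appended
const again = addToSet(withR, "skills", "sql");   // "sql" already exists, so nothing changes

console.log(withR.skills);
console.log(again.skills);
```

Running the same call twice leaves the array as it was after the first call, which is what makes this kind of update safe to repeat.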
Example Usage: Suppose you have a document in a collection called users that looks like this:

    { "_id": 1, "name": "John Doe", "hobbies": ["reading", "swimming"] }

If you want to add the hobby "cycling" to John's list of hobbies, you can use the $addToSet operator:

    db.users.updateOne( { _id: 1 }, { $addToSet: { hobbies: "cycling" } } )

If you run this operation, the document will be updated as follows:

    { "_id": 1, "name": "John Doe", "hobbies": ["reading", "swimming", "cycling"] }

If you try to add "swimming" again using $addToSet, the array will remain unchanged because "swimming" is already present:

    db.users.updateOne( { _id: 1 }, { $addToSet: { hobbies: "swimming" } } )

$set modifier

The $set modifier in MongoDB is used in update operations to update the value of a specific field in a document. If the field does not exist, $set will add the field to the document with the specified value.

Example Usage: Assume you have a document in a collection called users:

    { "_id": 1, "name": "John Doe", "age": 30, "city": "New York" }

Updating an Existing Field: If you want to update the age of John Doe to 31:

    db.users.updateOne( { _id: 1 }, { $set: { age: 31 } } )

Adding a New Field: If you want to add a new field, such as occupation, to the document:

    db.users.updateOne( { _id: 1 }, { $set: { occupation: "Engineer" } } )

The document will now look like this:

    { "_id": 1, "name": "John Doe", "age": 31, "city": "New York", "occupation": "Engineer" }

$inc operator

To increment the pageviews field by 1 for a document with the URL "www.example.com" in MongoDB, you can use the $inc operator in an update command. Here's the command:

    db.collection.updateOne( { url: "www.example.com" }, { $inc: { pageviews: 1 } } )

batchInsert()

The batchInsert() method in MongoDB was used to insert multiple documents into a collection in a single operation. However, it's important to note that batchInsert() was deprecated in MongoDB 2.6 and is no longer supported in more recent versions of MongoDB.
The recommended method for inserting multiple documents is now insertMany().

Old form:

    db.collection.batchInsert([
      { name: "Alice", age: 25 },
      { name: "Bob", age: 30 },
      { name: "Charlie", age: 35 }
    ]);

Current form:

    db.collection.insertMany([
      { name: "Alice", age: 25 },
      { name: "Bob", age: 30 },
      { name: "Charlie", age: 35 }
    ]);

Insert operation

    db.foo.insertOne({"Name": "Fanny"});

getCollection

getCollection() is used to access a collection within the current database. It is functionally equivalent to accessing a collection directly using db.collectionName.

Direct access:

    db.users.find();

Access through a variable holding the collection name:

    var n = "collectionName";
    db.getCollection(n).find();

.mongorc.js

The .mongorc.js file is a JavaScript file that can be used to customize the behavior of the MongoDB shell (the mongo command). When you start the MongoDB shell, it automatically looks for a file named .mongorc.js in your home directory and executes the commands within it. This file is useful for setting up environment-specific configurations, custom functions, aliases, and other repetitive tasks you want to automate each time you start the shell. It can also be used to override built-in MongoDB shell functions to prevent certain operations.

updateOne()

This method updates a single document that matches the filter criteria. The first argument is the filter, for example { name: "Alice" } to find the document where name is "Alice"; the second argument describes the change, for example a $set document.

deleteOne()

This method deletes a single document that matches the specified filter. If multiple documents match the filter, only the first matching document is deleted.

    db.collection.deleteOne(
      { name: "Alice" }   // Filter: find the document where name is "Alice"
    )

deleteMany() deletes every document that matches the filter:

    db.collection.deleteMany(
      { status: "inactive" }   // Filter: find all documents where status is "inactive"
    )

Duplicate _id values

If a batch of documents contains a duplicate _id value, the operation will fail when it encounters the duplicate.
    db.foo.batchInsert([
      { "_id": 0 },
      { "_id": 1 },
      { "_id": 1 },   // Duplicate _id
      { "_id": 2 }
    ])

MongoDB will successfully insert the first two documents ({ "_id": 0 } and { "_id": 1 }). When MongoDB encounters the third document ({ "_id": 1 }), it will detect the duplicate _id and raise an error. The fourth document ({ "_id": 2 }) will not be inserted because the operation is terminated as soon as the error is encountered.

Data Types

null: Null can be used to represent both a null value and a nonexistent field:
    {"x" : null}
boolean: There is a boolean type, which can be used for the values true and false:
    {"x" : true}
number: The shell defaults to using 64-bit floating point numbers. Thus, these numbers look "normal" in the shell:
    {"x" : 3.14}
or:
    {"x" : 3}
For integers, use the NumberInt or NumberLong classes, which represent 4-byte or 8-byte signed integers, respectively.
    {"x" : NumberInt("3")}
    {"x" : NumberLong("3")}
string: Any string of UTF-8 characters can be represented using the string type:
    {"x" : "foobar"}
date: Dates are stored as milliseconds since the epoch. The time zone is not stored:
    {"x" : new Date()}
regular expression: Queries can use regular expressions using JavaScript's regular expression syntax:
    {"x" : /foobar/i}
array: Sets or lists of values can be represented as arrays:
    {"x" : ["a", "b", "c"]}
embedded document: Documents can contain entire documents embedded as values in a parent document:
    {"x" : {"foo" : "bar"}}
object id: An object id is a 12-byte ID for documents.
    {"x" : ObjectId()}
binary data: Binary data is a string of arbitrary bytes. It cannot be manipulated from the shell. Binary data is the only way to save non-UTF-8 strings to the database.
code: Queries and documents can also contain arbitrary JavaScript code:
    {"x" : function() { }}

_id and ObjectIds

Every document stored in MongoDB must have an "_id" key. The "_id" key's value can be any type, but it defaults to an ObjectId.
In a single collection, every document must have a unique value for "_id", which ensures that every document in a collection can be uniquely identified. ObjectId is the default type for "_id". ObjectIds use 12 bytes of storage, which gives them a string representation that is 24 hexadecimal digits: 2 digits for each byte.
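The arithmetic behind the 24-digit representation (12 bytes, 2 hexadecimal digits each) can be checked with a few lines of plain JavaScript. This is a sketch only: `bytesToHex` is a hypothetical helper, and the twelve bytes are arbitrary sample values rather than a real ObjectId.

```javascript
// Hypothetical helper: turns a sequence of byte values (0-255) into a hex string,
// writing exactly two hexadecimal digits per byte.
function bytesToHex(bytes) {
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join("");
}

// Twelve arbitrary sample bytes, the same size as an ObjectId (not a real one).
const sampleBytes = Uint8Array.from([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 255]);

const hex = bytesToHex(sampleBytes);
console.log(hex);        // 000102030405060708090aff
console.log(hex.length); // 24
```

The padStart call matters: without it, bytes below 16 would produce a single digit and the string would come out shorter than 24 characters.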
