Advances in Social Science, Education and Humanities Research, volume 176
2nd International Conference on Management, Education and Social Science (ICMESS 2018)

Research on NoSQL Database Technology

LI Jun-shan, LI Jian-jun
The institute of information science & technology, South China Business College Guangdong University of Foreign Studies, Guangzhou 510545, China

Copyright © 2018, the Authors. Published by Atlantis Press. This is an open access article under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).

Abstract: Due to the demands of ultra-large-scale, high-concurrency, purely dynamic social networking websites and of big data management, traditional relational databases can no longer meet the storage and access requirements of massive data, so NoSQL database systems for specific applications have emerged. First, this paper introduces the background of the emergence of NoSQL non-relational databases. Second, it introduces the concept and overall architecture of the NoSQL database. Then, the notable features of NoSQL databases relative to relational databases are given. Next, the classification of NoSQL databases by storage type is described. Finally, the development status and prospects of NoSQL database technology are given. It is expected that this paper can further promote the research and application of non-relational database technology.

Keywords: NoSQL; Non-relational databases; Massive data management; Data storage; Data logical model

I. INTRODUCTION

Since a large number of relational database management systems (RDBMS) were commercialized in the 1980s and came into wide use, the RDBMS has not only become the mainstream database product but has also been applied in every sector of the national economy, and it holds an absolutely dominant position in the field of data management. However, with the rapid development and wide application of Internet technology, traditional relational databases have exposed many problems that are hard to overcome when serving ultra-large-scale, high-concurrency, purely dynamic Web 2.0 websites of the social networking (SNS) type. To meet the challenges brought by large-scale data sets and multiple data types, especially those of big data applications, the non-relational data storage system NoSQL emerged.

The full name of NoSQL is Not Only SQL, which means "not just SQL". Narrowly it refers to non-relational databases, and broadly to "non-relational data stores." Some literature interprets it as follows: "Use a relational database where a relational database is appropriate. Where it is not, there is no need to insist on a relational database; instead, consider a more appropriate data store." Clearly, NoSQL is a data storage system that differs from a relational database (in general, NoSQL is also called a non-relational database system, in contrast to a relational one). For relational database applications, which currently cover almost all fields, the non-relational concept of NoSQL is undoubtedly an injection of new thinking and an entirely new revolution in database technology.

II. THE PRODUCTION OF NOSQL DATABASE

In the early days of Internet applications, website traffic was generally small and pages were mostly static; even websites with dynamic interaction offered only limited dynamic functions. A single relational database could therefore handle the load easily. With the rapid development of websites, popular forums, blogs, SNS, and microblogs gradually came to lead the trend on the web. Especially after many powerful dynamic websites emerged, the RDBMS, although it can tolerate a certain degree of irregularity and missing structure in data, proved awkward in the face of massive, sparse, loosely structured data.

At the same time, from Inktomi, which can be regarded as the first search engine, to the later Google, the widely used RDBMS revealed a series of problems of its own when applied to massive data. To this end, Google built a massively scalable infrastructure for processing massive data in parallel, supporting its search engine and other applications (including Google Maps, Google Earth, Gmail, Google Finance, and Google Apps). With the release of Google's related technologies and solutions, the inventor of the open source search engine Lucene developed the first open source software that mimics some characteristics of Google's infrastructure. Lucene's core developers then joined Yahoo and, with the support of many open source contributors, created the open source Hadoop product, together with its subprojects and related projects, which can replace all parts of Google's infrastructure.

In fact, the term NoSQL and its ideas predate the first release of Hadoop. The name first came from a small open source relational database developed by Carlo Strozzi in 1998: because that database stored all data as ASCII files and used shell scripts instead of SQL to access data, it was named NoSQL. At a conference on distributed open source databases and data storage and processing topics, held in San Francisco in June 2009, Eric Evans of Rackspace raised the concept of NoSQL again. At the conference the term did not originally carry any deeper meaning, but it quickly spread across the Internet and became a new trend in the IT field. The reason is that Google's success had helped the public accept the concept of distributed computing in the new era and spurred interest in parallel large-scale processing and distributed non-relational data storage, while the emergence of Hadoop laid a solid foundation for the rapid development of NoSQL. Further, NoSQL was supported by two leading Web giants, Google and Amazon, and accordingly the two most important new systems in the field emerged. The first is BigTable, Google's distributed, column-oriented, multidimensional, sparse, multi-version table system. It splits a large table of data by row-key value and distributes the pieces to multiple servers, and it is a strongly consistent system. The second is Dynamo, Amazon's distributed storage system based on a P2P architecture. It distributes data with a consistent hashing algorithm and has better availability and failure recovery but relatively weak consistency; it is an eventually consistent system. Since then, a large number of developers have started to use, clone, or mix the two products in their own applications, and many different implementations have emerged. In less than five years, NoSQL and the concept of managing big data have been widely disseminated. Numerous well-known companies, including Facebook, Netflix, Yahoo, eBay, and Hulu, have begun to use NoSQL for a variety of use cases, and many of them have also contributed their own extension components and new products to the world through open source. The further development and extensive application of Web 2.0 has in turn promoted the development and application of NoSQL database technology.

III. NOSQL DATABASE CONCEPT

The NoSQL non-relational database currently has no recognized authoritative definition. Sourav Mazumder, chief technology architect at InfoSys Technologies, gave a relatively comprehensive description:

(1) Logically model data using an extensible, loosely coupled data schema.

(2) Are designed to follow the consistency-availability-partition tolerance (CAP) theorem across multi-node data distribution models, supporting horizontal scaling.

(3) Have data persistence capabilities on disk and/or in memory.

(4) Support multiple "Non-SQL" interfaces for data access.

Based on the above description and other related literature, NoSQL differs from traditional relational databases in at least the following aspects.

(1) NoSQL uses a loosely typed, extensible data model. The data model has no strict definition, does not need to be determined before data is stored, and can be changed dynamically while the system is running, which makes it well suited to storing most of the semi-structured and unstructured data in web applications.

(2) NoSQL uses a multi-node data distribution model that distributes records over multiple nodes through data partitioning. It scales horizontally, can adapt to the rapid growth of Web applications, and performs better on large amounts of data in a distributed architecture.

(3) The NoSQL database no longer supports the long-established ACID properties (Atomicity, Consistency, Isolation, Durability) of the traditional relational database management system, and does not require transaction management.

(4) It stores data persistently on disk, in memory, or both.

(5) It does not support JOIN operations, it supports large-scale data processing, and most of the technologies are open source.

For example, in a NoSQL database that stores data as key-value pairs, the data structure is not fixed: each tuple can have different fields and can add key-value pairs of its own. For another example, in a NoSQL database that stores documents, an application may store data of any structure in one data element; since storage is not limited to a fixed structure, some unnecessary time and space overhead is avoided, no fixed table structure is required, and there are no table join operations. For another example, Sina's Weibo system supports distributed data access by a large number of users through distributed computing over 400 servers. All of the above shows that NoSQL has performance advantages in big data access that relational databases cannot match.

In addition, the concept of NoSQL can be further understood through the overall NoSQL architecture (see Fig. 1) given by Sourav Mazumder, who divides the NoSQL database into four layers:

(1) Interface layer. Provides reasonable and convenient programming-language and data-call interfaces for upper-layer applications, mainly including REST (Representational State Transfer), the RPC protocol Thrift from Facebook, MapReduce for large-scale data processing, Get/Put access similar to Memcached, language-specific APIs, and so on.

(2) Data logic model layer. Describes the logical representation of data in the database, including key-value storage, column-family storage, document storage, and graph-structure storage.

(3) Data distribution layer. Defines how data is distributed, including the CAP mechanism for horizontal scaling, the multi-data-center support mechanism that keeps a NoSQL database running smoothly across multiple data centers, and the dynamic deployment support mechanism.

(4) Data persistence layer. Defines the form of data storage, including memory-based storage, hard-disk-based storage, storage based on both memory and hard disk, and custom pluggable persistence.

Fig. 1. The overall architecture of NoSQL
Interface layer: language-specific API, REST, Thrift, MapReduce, GET/PUT, SQL subset
Data logic model layer: Key-Value, Column Family, Document, Graph
Data distribution layer: CAP support, support for multiple data centers, dynamic deployment
Data persistence layer: memory-based, hard-disk-based, memory-and-hard-disk-based, customizable pluggable

IV. NOSQL DATABASE FEATURES

Compared to relational databases, NoSQL databases have the following salient features:

A. Easy data dispersion

Relational databases are premised on JOIN operations: associations between data are implemented using JOIN. To process a JOIN, a relational database has to keep the data on the same server, which is obviously not conducive to dispersing data. NoSQL databases, by contrast, do not support JOIN processing in the first place, and each piece of data is designed to be independent, so it is easy to spread the data across multiple servers. Since the data is distributed across multiple servers, the amount of data on each server is reduced, making it easy to write and read large amounts of data. This satisfies the high-concurrency database read and write requirements of ultra-large-scale, purely dynamic Web 2.0 websites of the SNS type.

B. No sharing operation

One of the reasons the NoSQL database emerged is to "make it easier to write large amounts of data." From the standpoint of implementation technology, a server can be made to handle larger amounts of data in only two ways: improving performance (scaling up) or increasing the number of machines (scaling out). The direct way to improve the performance of the current server is to purchase a server with double the performance, which does not require changing the program but generally requires 5 to 10 times the investment. Scaling out means using more inexpensive servers to increase processing capacity. Although this requires changes to the program, the use of inexpensive servers keeps costs under control, and further inexpensive servers can be added as needed. This avoids the complexity and high cost of the sharing operations of traditional commercial databases.

C. Flexible expansion

In the past, when the load on a relational database increased, the relational database management system could not easily scale out on commodity clusters (that is, by connecting multiple low-cost servers together). As load increased, database administrators always made maximum use of resources by scaling up (that is, purchasing larger and more powerful servers to carry the increased load). However, because big data analysis needs a large amount of computing power to process the target data set, because databases must be highly scalable and available, and because databases need to migrate to cloud or virtualized environments, new NoSQL databases are designed to scale out transparently on low-cost commodity hardware by adding new nodes. In other words, a NoSQL database system is a storage system composed of databases distributed over different nodes. Nodes can be added (or removed) dynamically without downtime for maintenance, and data can be migrated automatically.

D. Flexible data model

Change management is a difficult task for large relational DBMS products. Even minor changes to the data model of a relational DBMS may require system downtime or degraded service. The NoSQL database places loose constraints on the data model. Key-value stores and document databases allow applications to store data of any structure in a data element. Even the relatively strict BigTable-based NoSQL databases (e.g., Cassandra, HBase) are usually not too restrictive about creating new columns. Therefore, in a NoSQL database, changes to the application or the database schema do not need to be managed as one complex unit of change; in theory, this allows applications to iterate faster.

E. Asynchronous replication

Early relational DBMSs ran on a single CPU, and read and write operations were performed by a single database instance. NoSQL replication technology allows database reads and writes to be spread over separate servers running on different CPUs. NoSQL database replication is the one-way propagation of information between different database instances. A network connection is established between the instance being replicated (the master) and the instance replicating it (the slave). Usually the master actively sends data to the slave, and the slave stores the received data in its own instance, so that the data is in effect backed up on different instances. The master library concentrates on write requests and the slave libraries on read requests, which improves the system's query service capability and provides high availability, high read performance, and horizontal scalability.

Replication in NoSQL is often log-based asynchronous replication, so that data can be written to a node as quickly as possible without being held up by network delays. The disadvantage of asynchronous replication is that the written data sent by the master server is not necessarily received by the slave server. Consistency between master and slave data therefore cannot always be guaranteed, and data may be lost.

V. CLASSIFICATION OF NOSQL DATABASES

According to storage type, NoSQL databases are divided into key-value store databases, columnar storage databases, document storage databases, and graph storage databases.

A. Key-value Store Database

A key-value store database is usually implemented as a hash table holding a specific key and a pointer to a specific value. A key-value store database is thus a NoSQL database that organizes and stores data as key-value pairs and queries the data by exact match on the key [11-12]. Key-value storage does not need to consider the storage format of the data and uses the key directly to find the required data quickly. It is very suitable for data that does not involve many data relationships or business relationships. It effectively reduces the number of disk reads and writes and has extremely high read and write performance. When a key-value store database saves and reads values by key, system efficiency is very high because it is free of constraints such as an SQL processor, an indexing system, and an analysis system. The key-value storage scheme not only provides efficient access but is also cheap to implement and scalable. The ability to deliver extremely high read/write performance is the most significant feature of key-value store NoSQL databases.

According to how data is saved, key-value storage is divided into three modes: temporary, permanent, and a combination of both.

(1) Temporary key-value storage. "Temporary" means that "data may be lost." Memcached is a NoSQL database of this type. Memcached keeps all data in memory, so saving and reading are very fast, but when Memcached stops, the data no longer exists. Since the data is kept in memory, data beyond the memory capacity cannot be handled. The typical features of temporary key-value storage are that data is saved in memory, saving and reading are very fast, and data may be lost.

(2) Permanent key-value storage. In contrast to temporary storage, "permanent" means that "data will not be lost." Tokyo Tyrant is a NoSQL database of this type. Unlike Memcached, permanent key-value storage does not keep data in memory but saves it on a hard disk. Since saving data to the hard disk necessarily incurs disk I/O, its performance falls short of Memcached's, but its greatest advantage is that data is not lost. The typical features of permanent key-value storage are that data is saved on the hard disk, saving and reading are still very fast, and data will not be lost.

(3) Combined key-value storage. This mode is both temporary and permanent; Redis is a NoSQL database of this type. Redis combines the advantages of temporary and permanent key-value storage: it first saves data in memory and writes it to the hard disk when certain conditions are met. This ensures the speed of in-memory data processing while guaranteeing the permanence of the data through the writes to disk. This type of database is particularly suitable for processing array-type data. The typical features of combined key-value storage are that data is held both in memory and on the hard disk, saving and reading are very fast, data saved on the hard disk will not disappear, and it is suitable for array-type data.

B. Columnar Storage Database

A columnar storage database is a NoSQL database that stores the data of one column together and then the data of the next column, and that stores, retrieves, and controls permissions in units of column families (each column belongs to one column family). It facilitates the storage of structured and semi-structured data and favors data compression. Physically speaking, a table is a collection of columns, and each column is essentially a table with only one field. Queries on one column or a few columns therefore have a very large I/O advantage. A columnar storage database is highly scalable: even as the amount of data grows, the processing speed (especially the write speed) does not fall. Columnar storage databases are usually used for batch data processing, ad hoc queries, and the storage of business intelligence and analytical data. Cassandra is a column-store NoSQL database. The most typical characteristic of the columnar storage NoSQL database is that data storage requires no fixed table structure and places no restriction on the columns of each record.

C. Document Storage Database

A document storage database, also referred to simply as a document database, is a type of NoSQL database that stores data as key-value pairs whose values are documents (semi-structured data stored in a specific format) with no mandatory structure. A document database mainly relies on the storage engine's ability to group different documents into different collections. A document is equivalent to a record in a relational database. Multiple documents form a collection, and multiple collections are logically organized together as a document database. Unlike key-value storage, document storage is concerned with the internal structure of the document, which allows the storage engine to support secondary indexes directly and so to query any field efficiently. The document storage model supports nesting, which means that the "values" of fields can themselves hold other documents. The most significant feature of document-store NoSQL databases is their ability to meet massive storage requirements with high query performance. MongoDB is a document-store NoSQL database.

D. Graphical Storage Database

A graph storage database, also referred to as a graph database for short, is a NoSQL database that represents and stores graph data through nodes, edges, and attributes. In a graph storage database, each element directly contains a pointer to its adjacent elements, so adjacency is resolved without an index. Graph storage databases are the best store for graph relationships: they extend naturally to larger datasets without the need for join operators and are faster at querying associated datasets. Graph databases can be used to model things and their relationships, such as relationship graphs, social networks, and recommendation systems. AllegroGraph is a graph-store NoSQL database.

Finally, it should be noted that although there are more than 100 kinds of NoSQL databases, they do not share a unified architecture, and each has its own strengths. At present, a successful NoSQL database must be particularly well suited to certain applications or occasions, where its performance must be far better than that of relational databases and of other NoSQL databases.

VI. NOSQL DATABASE DEVELOPMENT STATUS AND PROSPECTS

Under current technological conditions, computing architectures demand a high degree of scalability in data storage, and NoSQL is working to provide it. At present, Google, Yahoo, Facebook, Twitter, and Amazon all use NoSQL databases on a large scale. In many areas, NoSQL has succeeded not only in industry but also in academia. Universities have begun to realize that the standard relational database alone is no longer enough and that NoSQL must be added to the curriculum; from a technical point of view, NoSQL is a very important supplement to relational databases.

In the few years since the NoSQL concept was introduced in 2009, NoSQL-type databases have grown explosively, producing more than 100 new databases. With the development of technology and the spread of applications, some interesting mergers are taking place, such as Couchbase, which arose from combining CouchDB and Membase. These developments go hand in hand with the explosive growth of the Internet, big data, sensors, and many future technologies, which in turn produces more data and different demands for processing it.

Most of the NoSQL database systems deployed today still face many challenging issues. In terms of generality, existing NoSQL products are mostly application-specific solutions, so their applicability is limited, they lack global system considerations and versatility, and their functionality is limited. In terms of technical and theoretical maturity, no coherent body of technical results has formed: NoSQL lacks strong theory comparable to that of relational databases (such as relational calculus, functional dependency theory, Armstrong's axioms, and relational schema normalization), comparable technologies (such as query optimization strategies and two-phase locking protocols), standard specifications (such as the SQL language), and built-in security mechanisms. In terms of system performance, the maturity, stability, and functionality of the RDBMS are reassuring; in comparison, most NoSQL databases still have many features to implement. In terms of technical support, all RDBMS vendors spare no effort to provide good enterprise support, whereas most NoSQL systems are open source projects. Although each database has several companies providing support, these are mostly small start-ups with no global support resources and without the reassuring public trust enjoyed by Oracle and Microsoft. In terms of management support, the design goal of NoSQL is to provide zero-management solutions, but today's reality is still far from this goal. In terms of professional support, many business units around the world have people familiar with RDBMS concepts and programming, whereas almost every NoSQL developer is still learning, although this will change with time; it is not easy to find an experienced NoSQL programmer now. As for its open source advantage, NoSQL currently takes considerable skill to use and considerable human and material resources to maintain. In terms of applications, it is difficult for NoSQL to guarantee data integrity, and data integrity is essential in enterprise applications, so current NoSQL projects are difficult to popularize in the enterprise. In addition, it takes a long time for the NoSQL database to go from emergence to acceptance by its various users. In summary, NoSQL has great room for improvement and development at both the technical level and the application level.

VII. CONCLUSION

The emergence of big data has promoted the development of NoSQL database technology. The NoSQL database has in turn created an environment for the storage, transmission, and processing of big data, which further promotes its own application. With the development of big data processing, cloud computing, the Internet, and other technologies, and with the emergence of many new applications in cloud environments, such as social networking, mobile services, and collaborative editing, new demands are being placed on massive data management systems. Scalability, elasticity, fault tolerance, self-management, and "strong consistency" have become the design goals of massive data management systems in the cloud computing era, which provides a good opportunity for the NoSQL database. With the further increase of demand and the passage of time, NoSQL database systems will gradually mature and gain wider application.

ACKNOWLEDGMENT

Supported by Teaching Quality Improvement Project and Teaching Reform Project for Guangdong Undergraduate Universities (No.296); Educational Innovation Project of Educational Department of Guangdong (Education Research) (No.2017GXJK243)

REFERENCES

[1] YAO L, ZHANG Y K. Solution of No SQL Distributed Storage and Extension [J]. Computer Engineering, 2012, 38: 40-42. (In Chinese)
[2] Basic knowledge of nosql database. http://www.ituring.com.cn/article/1069
[3] Shashank T. Professional NoSQL [M]. Birmingham: Wrox, 2011.
[4] Lith, Adam, Jakob M. Investigating storage solutions for large data: A comparison of well performing and scalable data storage solutions for real time extraction and batch insertion of data [D]. Göteborg, Sweden: Chalmers University of Technology, 2010.
[5] SHEN D R, YU G, WANG X T, et al. Survey on No SQL for Management of Big Data [J]. Journal of Software, 2013, 24(8): 1786-1803. (In Chinese)
[6] Fay C, Jeffery D, Sanjay G, et al. Bigtable: A Distributed Storage System for Structured Data [A]. //7th Symposium on Operating System Design and Implementation [C]. Seattle, WA, USA: 2006.
[7] Giuseppe D, Deniz H, Madan J, et al. Dynamo: Amazon's Highly Available Key-value Store [A]. SOSP'07 [C]. Stevenson, Washington, USA: 2007.
[8] Sourav M. NoSQL in the Enterprise [J]. Architect, 2010, (8): 62-64.
[9] You have to know about NoSQL database 10 key characteristics. http://www.xue163.com/exploit/184/1842623.html. (In Chinese)
[10] Redis drill (6) redis replication (Active-backup). http://www.itnose.net/detail/6637390.html. (In Chinese)
[11] Emmanuel G. Implementation of key-value pairs store (1): Why is the key value of storage, Why do you want to achieve it. http://www.codecapsule.com
[12] Emmanuel G. Implementation of key-value pairs stored (2): To the existing key value is stored as a model. http://www.codecapsule.com
[13] Data analysis tool: Column type storage database. http://blog.csdn.net/physicsdandan/article/details/51988172. (In Chinese)
[14] Get the NoSQL database: Domestic application case inventory. http://www.dataguru.cn/thread-42932-1-1.html. (In Chinese)
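
Supplementary illustration. Section II states that Dynamo distributes data with a consistent hashing algorithm, and Section IV.C states that nodes can be added or removed dynamically with automatic data migration. The following is a minimal Python sketch of how those two statements fit together. It is not Dynamo's actual implementation: the class name `ConsistentHashRing`, the `vnodes` parameter, the node names, and the choice of SHA-256 are all assumptions made for illustration.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Toy consistent-hash ring (hypothetical names, illustration only)."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes   # virtual points placed on the ring per node
        self._points = []      # sorted hash positions on the ring
        self._owner = {}       # hash position -> node name
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(text):
        # Any stable, well-spread hash would do; SHA-256 is an arbitrary choice.
        return int(hashlib.sha256(text.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        # Place `vnodes` points for this node at pseudo-random ring positions.
        for i in range(self.vnodes):
            point = self._hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove_node(self, node):
        # Remove exactly the points that add_node placed for this node.
        for i in range(self.vnodes):
            point = self._hash(f"{node}#{i}")
            del self._owner[point]
            self._points.remove(point)

    def node_for(self, key):
        # A key belongs to the first point at or after its hash, wrapping around.
        idx = bisect.bisect_left(self._points, self._hash(key))
        return self._owner[self._points[idx % len(self._points)]]


# Usage: adding a fourth node moves only the keys that the new node takes over.
ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}
ring.add_node("node-d")
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved, all to node-d:",
      all(after[k] == "node-d" for k in moved))
```

In this sketch, a key changes owner only when one of the new node's points lands between the key's hash and its previous owner's point, so the existing nodes never exchange keys with one another. That property is what keeps data migration small when the cluster grows or shrinks; a production system would add replication and failure handling on top, which this sketch omits.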