BIG DATA part 1.pdf
BIG DATA INTRODUCTION

Data refers to the quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media. The term data is often used interchangeably with information.

Big Data is a term used to describe a collection of data that is huge in size and yet growing exponentially with time. In short, it is data so large and complex that it cannot be stored or processed efficiently by traditional data management systems.

Examples of Big Data
◦ The New York Stock Exchange generates about one terabyte of new trade data per day.
◦ Social media: statistics show that 500+ terabytes of new data are generated through photo and video uploads, text messages, comments, etc.
◦ A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.

DATA TYPES
1. Structured
2. Unstructured
3. Semi-structured

1. Structured
This is data that is stored, accessed and processed in a fixed format. It includes data stored in data management systems.

2. Unstructured
This is data lacking any form or structure. It is a heterogeneous data source containing a combination of simple text files, images, videos, etc. Unstructured data needs to be organised in a structured way before it can be processed.

3. Semi-structured
Semi-structured data can contain both forms of data, structured and unstructured.

Characteristics (Big Data)

THE SIX Vs OF BIG DATA
1. Volume – This refers to the enormous amounts of information generated every second. The data is generated from social media, cell phones, cars, credit cards, M2M sensors, images and video. The name Big Data itself relates to size, which is enormous.
2. Variety – This refers to the heterogeneous (different) sources and formats of data and to its nature, both structured and unstructured.
   It includes data in the form of emails, photos, videos, output from monitoring devices, PDFs and audio.
3. Velocity – This refers to the speed at which data is generated. How fast the data is generated and processed to meet demand determines the real potential in the data.
4. Variability – This refers to the inconsistency that the data can show at times, which hampers the ability to handle and manage the data effectively.
5. Veracity – This refers to the degree of certainty in the data sets, i.e. how reliable the data is. Considerations include the cost of finding errors versus the cost of accepting errors, and legal consequences.
6. Value – It is not just the amount of data stored or processed that matters, but the amount of valuable, reliable and trustworthy data that needs to be stored, processed and analysed to find insights.

Benefits of Big Data Processing
◦ Businesses can utilise outside intelligence when making decisions
◦ Improved customer service
◦ Early identification of risk to the product/services, if any
◦ Better operational efficiency

INFRASTRUCTURE AND SERVICES: BIG DATA
1. Collection
2. Storage
3. Transmission

1. Collection Methods
◦ Transactional Data
◦ Online Marketing Analytics
◦ Social Media
◦ Loyalty Cards
◦ Maps

Transactional Data
Transactional data includes multiple variables, such as what, how much, how and when customers purchased, as well as what promotions or coupons they used. Point of Sale (POS) software can automatically store this information in CRM (Customer Relationship Management) software. This data can also be sold by supermarkets and big stores to parties who require it.

Online Marketing Analytics
These tools collect data when a user browses a website.
◦ Google Analytics can provide a lot of demographic insight on each visitor. This information is useful in building marketing campaigns, as well as in website performance analysis.
◦ Heatmaps provide information on which sections of each website page generate the most 'action' (mouse clicks or interactions).
◦ Social media analytics allows for customer demographic as well as behavioural analysis, and powerful Facebook marketing tools can help you market to audiences that mirror your current following.

Social Media
Social media is used frequently and in many ways: networking, procrastinating, gossiping, sharing, educating, games, etc. 71 percent of internet users are social network users and, by next year, it is estimated that there will be around 2.77 billion social media users around the globe. That equates to a lot of social media data being gathered.

Loyalty Cards
Many businesses are willing to give customers a discount simply in exchange for their personal information. This is because the data is valuable to them: they can predict many shopping habits and in turn develop customised marketing efforts targeting those customers. Loyalty programs have the power to double overall sales by encouraging repeat shopping.

Maps
Map information has the potential to provide businesses with customer location demographics and a detailed picture of the kind of people who live and work in certain areas.

2. Storage

A. Distributed File Systems
These distribute data across multiple machines, offering fault tolerance and scalability. Example: HDFS (Hadoop Distributed File System). It is more efficient to break up the data and distribute it in many parts, so that different parts can be processed and analysed concurrently. This architecture allows for parallel processing and efficient data storage, making it suitable for storing large volumes of unstructured data.

NoSQL Databases (Not Only SQL)
NoSQL databases are designed to handle unstructured and semi-structured data, and provide flexibility and scalability beyond traditional relational databases. Examples: MongoDB, Cassandra.

Cloud
Cloud computing refers to a broad set of products that are sold as a service and delivered over a network.
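The idea behind distributed file systems described above, breaking data into parts and processing the parts concurrently, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it does not use HDFS or Hadoop, worker threads on one machine stand in for separate machines, and the function names and sample records are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, chunk_size):
    """Break a list of records into fixed-size parts (the last may be shorter)."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def count_words(chunk):
    """Process one part independently: count the words in its records."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(records, chunk_size=2, workers=4):
    """Count words by processing chunks concurrently, then merging the results."""
    chunks = split_into_chunks(records, chunk_size)
    # Threads stand in for separate machines in this sketch.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


# Hypothetical sample records, invented for illustration.
sample_records = [
    "big data is big",
    "data grows with time",
    "big data needs distributed storage",
]
result = parallel_word_count(sample_records)
```

Each chunk is counted without reference to the others, and the partial results are merged only at the end. That independence is what lets the work be spread over many machines in a real distributed system.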
Data Warehouse
This is a technique for collecting and managing data from varied sources to provide meaningful business insights. It merges information coming from different sources into one comprehensive database. The data structure and schema are defined in advance to optimise for fast SQL queries.

Data Lake
This is a centralised repository that holds a vast amount of raw data in its native format until it is needed. Data is stored as-is, without first having to be structured. It stores relational data from line-of-business applications, and non-relational data from mobile apps, IoT devices and social media.

3. Big Data Transmission
New protocols and technologies in use include:
◦ Reliable Multi-Destination Transport (RMDT),
◦ UDP-based Data Transfer Protocol (UDT),
◦ Quick UDP Internet Connections (QUIC),
◦ advances in fibre optics,
◦ advances in data compression.
UDT is a high-performance data transfer protocol built on UDP and designed specifically for the high-volume transfer of large datasets over high-speed WANs.

IMPACT OF STORING BIG DATA
Although big data has an impact on the way organisations carry out their activities, it also introduces some concerns. These include:
◦ Access
◦ Transmission time
◦ Security
◦ Processing time

Access
The data is stored at an off-site location away from the enterprise. An enterprise must decide:
1. who stores the data
2. who has access to the data, inside or outside the organisation
3. how the data is accessed (types of queries to be used)

Transmission time
An organisation must select the best media to allow for faster transfer of data.
Factors to consider are:
◦ bandwidth and cost,
◦ requirements for building/upgrading networks and systems to cope with increasing data movement,
◦ international aspects,
◦ latency,
◦ synchronisation problems.

Security
Security concerns include:
◦ theft of data from online/third-party storage,
◦ ransomware,
◦ DDoS attacks,
◦ insider attacks,
◦ data misuse by authorised users,
◦ insertion of fake data,
◦ changes to data access control,
◦ security audits,
◦ physical security.

Processing time
◦ balance between security (encrypting all data) and processing time
◦ problems with having to decrypt, process and then re-encrypt the data
◦ process types
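To make the transmission-time concern above concrete, the rough estimate below computes how long one terabyte (the daily New York Stock Exchange figure quoted earlier) would take to move over links of different bandwidth. It is a simplified sketch: the function name and the link speeds are assumptions chosen for illustration, and the model ignores protocol overhead, congestion and compression.

```python
def transfer_time_seconds(size_bytes, bandwidth_bits_per_second, latency_seconds=0.0):
    """Estimate transfer time as latency plus payload size divided by bandwidth."""
    if bandwidth_bits_per_second <= 0:
        raise ValueError("bandwidth must be positive")
    size_bits = size_bytes * 8
    return latency_seconds + size_bits / bandwidth_bits_per_second


TERABYTE = 10 ** 12  # decimal terabyte, in bytes

# Hypothetical link speeds chosen for illustration, in bits per second.
links = {
    "100 Mbit/s": 100 * 10 ** 6,
    "1 Gbit/s": 10 ** 9,
    "10 Gbit/s": 10 * 10 ** 9,
}

for name, bandwidth in links.items():
    hours = transfer_time_seconds(TERABYTE, bandwidth) / 3600
    print(f"1 TB over {name}: about {hours:.2f} hours")
```

Even this simple model shows why bandwidth and cost head the list of factors: a tenfold increase in bandwidth cuts the estimated transfer time tenfold, whereas a fixed latency matters little for a single very large transfer.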