Full Transcript

Introduction

Big data is a combination of structured, semi-structured and unstructured data that organizations collect, analyze and mine for information and insights. It is used in machine learning projects, predictive modeling and other advanced analytics applications.

1. Structured Data - Data that is to the point, factual, and highly organized is referred to as structured data. It is quantitative in nature, meaning it contains measurable values such as numbers, dates, and times. Structured data exists in a predefined format, which makes it easy to search and analyze. A relational database consisting of tables with rows and columns is one of the best examples of structured data. Structured data generally lives in tables such as Excel files and Google Sheets spreadsheets. The query language SQL (Structured Query Language) is used to manage structured data. SQL was developed at IBM in the 1970s and is mainly used to work with relational databases and data warehouses.

2. Unstructured Data - Unstructured data is data that lacks any predefined model or format. It includes free-form files such as log files, audio files, and image files. Because it cannot be represented in a data model or schema, it is hard to manage, analyze, or search. It also requires a lot of storage space and is hard to secure. Some organizations hold large amounts of such data but do not know how to derive value from it because the data is raw. Unstructured data comes in many formats, such as text, images, audio and video files. It is qualitative in nature and is sometimes stored in a non-relational (NoSQL) database.

3. Semi-Structured Data - Semi-structured data is a type of data that is not purely structured, but also not completely unstructured.
It contains some level of organization or structure, but does not conform to a rigid schema or data model, and may contain elements that are not easily categorized or classified. Semi-structured data is becoming increasingly common as organizations collect and process more data from a variety of sources, including social media, IoT devices, and other unstructured sources. While semi-structured data can be more challenging to work with than strictly structured data, it offers greater flexibility and adaptability, which makes it valuable for data analysis and management.

SOURCES OF BIG DATA

Social media
IoT devices
E-commerce
Global Positioning System (GPS)
Transactional data

CHARACTERISTICS OF BIG DATA

1. Volume - The name Big Data itself refers to an enormous size. The size of the data plays a crucial role in determining how much value can be derived from it. Whether a particular data set can be considered Big Data at all also depends on its volume. Hence, volume is one characteristic that must be considered when dealing with Big Data solutions.

2. Variety - The next aspect of Big Data is its variety. Variety refers to the heterogeneous sources and nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining and analyzing data.

3. Velocity - The term velocity refers to the speed at which data is generated. How fast data is generated and processed to meet demand determines the real potential of the data. Big Data velocity deals with the speed at which data flows in from sources such as business processes, application logs, networks, social media sites, sensors, and mobile devices. The flow of data is massive and continuous.
4. Variability - This refers to the inconsistency that data can show at times, which hampers the ability to handle and manage the data effectively.
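
To make the three data types from the introduction more concrete, here is a minimal Python sketch. It is only an illustration under assumptions: the `orders` table, the JSON events, and the review text are all hypothetical examples that do not come from the transcript. It uses only the standard-library `sqlite3` and `json` modules.

```python
import json
import sqlite3

# Structured data: rows in a relational table with a fixed schema, queried with SQL.
# The "orders" table and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL, order_date TEXT)"
)
conn.executemany(
    "INSERT INTO orders (order_id, amount, order_date) VALUES (?, ?, ?)",
    [(1, 20.0, "2024-01-05"), (2, 5.5, "2024-01-06")],
)
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
conn.close()

# Semi-structured data: JSON records share some keys ("user") but follow no rigid
# schema; the second record has a nested "device" field that the first one lacks.
raw_events = """
[
  {"user": "ana", "likes": 3},
  {"user": "raj", "device": {"type": "sensor", "temp_c": 21.5}}
]
"""
events = json.loads(raw_events)
users = [event["user"] for event in events]
events_with_device = [event for event in events if "device" in event]

# Unstructured data: free text has no fields to query, so even a simple
# question needs text processing (here, a naive keyword check).
review = "Delivery was late, but the product itself works great."
mentions_late = "late" in review.lower()

print(total, users, len(events_with_device), mentions_late)
```

The structured query is a single SQL statement because the schema is known in advance. The semi-structured records need a check for which keys are present, and the unstructured text offers nothing to query beyond its raw content.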
