Chapter 1: Introduction to Data Science

Summary

This document provides an introduction to data science, covering fundamental concepts, the importance of data science, and the difference between data and datasets. It discusses the four Vs of big data (volume, velocity, variety, and veracity) and the data mining process, details the tasks involved in data science projects, and presents a typical data architecture. It is suited to students and professionals learning about data science.

Full Transcript

OBJECTIVES

- Describe the fundamental concepts of data science.
- Discuss the importance of data science and some of the factors that are driving its adoption.
- Determine the difference between data and data sets in the field of data science.
- Discuss the benefits and uses of data science in almost every business industry.
- Understand clearly the activities involved in data science projects.

BIG DATA

Big data is a collection of huge data sets that normal management techniques, such as a relational database management system (RDBMS), cannot process. It is a large chunk of raw data that is collected, stored, and analyzed through various tools and techniques.

Big data can be categorized into two (2) forms:

- Structured data - more easily analyzed and organized into a database. Typically, data that can be stored in a table, where every instance in the table has the same structure (i.e., the same set of attributes).
- Unstructured data - much harder to analyze and comes in a variety of formats; it is not easily interpreted by traditional data models and processes.

The characteristics of big data are often referred to as the four (4) V's:

- Volume - how much data is there?
- Velocity - at what speed is new data generated?
- Variety - how diverse are the types of data? Data comes in all formats: structured, numeric data in a traditional database, or unstructured text documents, video, audio, email, and stock-ticker data.
- Veracity - how accurate is the data?

DATA MINING

Data mining is a process used by companies to turn raw data into useful information, while big data refers to a collection of large datasets (for example, datasets in Excel sheets that are too large to be handled easily). In other words, data mining refers to the activity of going through a large chunk of data to look for relevant or pertinent information.
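The structured/unstructured distinction above can be made concrete with a small Python sketch. The field names, records, and email text below are invented for illustration:

```python
# Structured data: every record shares the same fields (table-like).
# Field names and values here are invented for illustration.
customers = [
    {"name": "Ana", "age": 34, "city": "Manila"},
    {"name": "Ben", "age": 28, "city": "Cebu"},
]
# A shared schema makes column-wise analysis straightforward:
average_age = sum(c["age"] for c in customers) / len(customers)  # 31.0

# Unstructured data: free-form content with no fixed schema.
email = "Hi team, the Q3 report is attached. Let's meet Friday."
# Analysis first requires imposing some structure, e.g. tokenizing:
words = email.lower().split()
```

Note how the structured records can be aggregated directly, while the unstructured text needs a preprocessing step before any analysis is possible.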
By using software/applications to look for patterns in large batches of data, businesses can learn more about their customers to develop more effective marketing strategies, increase sales, and decrease costs. Data mining depends on effective data collection, warehousing, and computer processing.

Big data and data mining are two different concepts: big data is a term that refers to a large amount of data, whereas data mining refers to a deep dive into the data to extract key knowledge, patterns, or information from a small or large amount of data. The main concept in data mining is to dig deep into analyzing the patterns and relationships in data, which can then be used further in artificial intelligence, predictive analysis, and so on. The main concept in big data, on the other hand, is the source, variety, and volume of the data, and how to store and process that amount of data. Analyzing big data to produce a business solution or a business definition plays a crucial role in determining growth: big data is the asset, and data mining is the manager of that asset, used to provide beneficial results.

DATA SCIENCE

This is where we use theories and techniques in the context of statistics, mathematics, and machine learning. Data science is related to data mining, machine learning, and big data; it combines statistics, data analysis, domain knowledge, and their related methods in order to understand phenomena via the analysis of data. The ultimate goal of data science is improving decision making, as this is generally of paramount interest to business. Figure 2 places data science in the context of other closely related data processes in the organization.

These principles and techniques are applied broadly across the functional areas in business. Using data science, we can extract different types of patterns. For example, we might want to extract patterns that help us to identify groups of customers exhibiting similar behavior and tastes.
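Finding such groups of similar customers is the kind of pattern a clustering algorithm extracts. Here is a minimal pure-Python k-means sketch; the spending figures and the choice of two starting centers are invented for illustration:

```python
# Minimal 1-D k-means: group customers by monthly spend.
# All numbers here are made up for illustration.
spend = [12, 15, 14, 80, 85, 90]  # monthly spend per customer

def kmeans_1d(values, centers, iterations=10):
    """Assign each value to its nearest center, then recompute centers."""
    for _ in range(iterations):
        clusters = {c: [] for c in centers}
        for v in values:
            nearest = min(centers, key=lambda c: abs(v - c))
            clusters[nearest].append(v)
        centers = [sum(vs) / len(vs) for vs in clusters.values() if vs]
    return sorted(centers)

centers = kmeans_1d(spend, centers=[0, 100])
# Two segments emerge: low spenders (center ~13.67) and high spenders (center 85.0).
```

Real projects would use a library implementation over many attributes at once, but the loop above is the whole idea: assign each customer to the nearest group, then move each group's center to the average of its members.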
This task is known as customer segmentation, and in data science terminology it is called clustering. On the other hand, we might want to extract a pattern that identifies products that are frequently bought together, a process called association-rule mining. Probably the broadest business applications are in marketing, for tasks such as targeted marketing, online advertising, and recommendations for cross-selling.

Figure 2. Data Science in the Context of Closely Related Processes in the Organization

The patterns that we extract using data science are useful only if they give us insight into the problem that enables us to do something to help solve it. The phrase actionable insight is sometimes used in this context to describe what we want the extracted patterns to give us. The term insight highlights that the pattern should give us relevant information about the problem that is not obvious. The term actionable highlights that the insight we get should also be something that we have the capacity to use in some way.

Data science is commonly defined as a methodology by which actionable insights can be inferred from data. Performing data science is a task with an ambitious objective: the production of beliefs informed by data, to be used as the basis of decision making. The representation of complex environments by rich data opens up the possibility of applying all the scientific knowledge we have regarding how to infer knowledge from data. In the absence of data, we are uninformed, and decisions, in the best of cases, are based on best practices or intuition.

Data and Data Set

Data science is really dependent on data. In its most basic form, a piece of data is an abstraction of a real-world entity (person, object, or event). The terms variable, feature, and attribute are often used interchangeably to denote an individual abstraction. Each entity is typically described by a number of attributes.
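The core of association-rule mining of this kind can be approximated with a simple co-occurrence count over transactions. A pure-Python sketch, with the shopping baskets invented for illustration:

```python
from itertools import combinations
from collections import Counter

# Each basket lists the items bought together in one transaction (toy data).
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

# Count how often each unordered pair of items appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The most frequent pair suggests a "frequently bought together" rule.
top_pair, support = pair_counts.most_common(1)[0]
# Here ("bread", "butter") co-occurs in 3 of the 4 baskets.
```

Full association-rule algorithms such as Apriori extend this idea to larger item sets and add confidence measures, but frequent pair counting is the starting point.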
For example, a person might have the following attributes: name, address, age, contact number, status, and so on.

Data - characteristics or information, represented as text, numbers, or multimedia, that are collected through observation. In a technical sense, data is a set of values of qualitative or quantitative variables about one or more persons or objects. Examples: name, address, age, contact number, status, etc.

Dataset - describes values for each variable (such as height, weight, temperature, volume, etc.) of an object. Each value in this set is known as a datum. The data set consists of the data of one or more members, corresponding to each row (see Table 1).

As we can see in Table 1, a data set is normally presented in a tabular pattern where every column describes a particular variable and each row corresponds to a given member of the data set. This is a part of data management. Today, in big databases, datasets are usually in CSV, JSON, or XML data format.

Table 2 illustrates an analytics record for a data set of classic books. Each row in the table describes one book. The terms instance, example, entity, object, case, individual, and record are used in the data science literature to refer to a row. So, a data set contains a set of instances, and each instance is described by a set of attributes; a data set consists of the data relating to a collection of entities, with each entity described in terms of a set of attributes.

In its most basic form, a data set is organized in an N x M data matrix called the analytics record, where N is the number of entities (rows) and M is the number of attributes (columns). The construction of the analytics record is a prerequisite for doing data science. The majority of the time and effort in data science projects is spent on creating, cleaning, and updating the analytics record.
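The N x M analytics record can be pictured with a small CSV table of classic books, echoing the Table 2 example. The attribute set and the page counts below are invented for illustration:

```python
import csv
import io

# A tiny analytics record: N entities (rows) x M attributes (columns).
# The attributes chosen and the page counts are invented for illustration.
raw = """title,author,year,pages
Moby Dick,Herman Melville,1851,635
Dracula,Bram Stoker,1897,418
Emma,Jane Austen,1815,474
"""

rows = list(csv.DictReader(io.StringIO(raw)))  # each row = one instance
n_entities = len(rows)          # N = 3 books (rows)
m_attributes = len(rows[0])     # M = 4 attributes (columns)
```

Every instance shares the same four attributes, which is exactly what makes the table an analytics record rather than a loose pile of facts.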
The analytics record is often constructed by merging information from many different sources: data may have to be extracted from multiple databases, data warehouses, or computer files in different formats (e.g., spreadsheets or CSV files), or scraped from the web or social media streams.

What Does a Data Scientist Do?

- Usually, a data scientist works closely with business stakeholders to understand their goals and determine how data can be used to achieve those goals.
- They design data modeling processes, create algorithms and predictive models to extract the data the business needs, then help analyze the data and share insights with peers.

While each project is different, the process for gathering and analyzing data generally follows the steps below:

1. Ask the right questions to begin the discovery process.
2. Acquire data.
3. Process and clean the data.
4. Integrate and store data.
5. Perform initial data investigation and exploratory data analysis.
6. Choose one or more potential models and algorithms.
7. Apply data science methods and techniques, such as machine learning, statistical modeling, and artificial intelligence.
8. Measure and improve results.
9. Present final results to stakeholders.
10. Make adjustments based on feedback.
11. Repeat the process to solve a new problem.

A Data Science Ecosystem

Because data science is growing so rapidly, we now have a massive ecosystem of useful tools. The set of technologies used to do data science varies across organizations: the larger the organization, or the greater the amount of data being processed, the greater the complexity of the technology ecosystem supporting the data science activities. In most cases, this ecosystem contains tools and components from a number of different software suppliers, processing data in many different formats. Figure 3 gives a high-level overview of a typical data architecture. This architecture is not just for big-data environments, but for data environments of all sizes.
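The merging of sources mentioned above, where an analytics record is assembled from several systems, can be sketched with Python's standard library alone. The file contents, field names, and join key below are invented for illustration:

```python
import csv
import io
import json

# Source 1: a CSV export from an operational database (toy data).
orders_csv = """customer_id,order_total
1,120.50
2,35.00
"""

# Source 2: a JSON feed from a CRM system (toy data).
crm_json = '[{"customer_id": 1, "name": "Ana"}, {"customer_id": 2, "name": "Ben"}]'

orders = {int(r["customer_id"]): float(r["order_total"])
          for r in csv.DictReader(io.StringIO(orders_csv))}
crm = {c["customer_id"]: c["name"] for c in json.loads(crm_json)}

# Join on customer_id to build one row per entity: the analytics record.
record = [{"customer_id": cid, "name": crm[cid], "order_total": orders[cid]}
          for cid in sorted(orders)]
```

In practice this join would be done in SQL or with a dataframe library, but the principle is the same: pick a shared key and combine each source's attributes into one row per entity.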
In this diagram, there are three main areas:

- Data sources - where all the data in an organization are generated.
- Data storage - where the data are stored and processed.
- Applications - where the data are shared with consumers of these data.

Figure 3. Small-Data and Big-Data Architecture for Data Science (inspired by a figure from the Hortonworks newsletter, April 23, 2013, https://hortonworks.com/blog/hadoop-and-the-data-warehouse-when-to-usewhich)

- All organizations have applications that generate and capture data about customers, transactions, and operational data on everything to do with how the organization operates.
- Such data sources and applications include customer management, orders, manufacturing, delivery, invoicing, banking, finance, customer-relationship management (CRM), call center, enterprise resource planning (ERP) applications, and so on.
- These types of applications are commonly referred to as online transaction processing (OLTP) systems.
- For many data science projects, the data from these applications will be used to form the initial input data set for the ML algorithms.
- Over time, the volume of data captured by the various applications in the organization grows ever larger, and the organization will start to branch out to capture data that was ignored, wasn't captured previously, or wasn't available previously.
- These newer data are commonly referred to as "big-data sources" because the volume of data captured is significantly higher than that of the organization's main operational applications.

The most popular form of traditional data-integration and storage software is the relational database management system (RDBMS). These traditional systems are often the backbone of the business intelligence (BI) solutions within an organization.
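As a toy illustration of an OLTP-style workload on an RDBMS, here is a sqlite3 sketch in Python; the table name, columns, and rows are invented for illustration:

```python
import sqlite3

# A tiny OLTP-style table: many small, row-at-a-time transactions.
# Table name and data are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, total) VALUES (?, ?)",
    [("Ana", 120.50), ("Ben", 35.00), ("Ana", 19.99)],
)
conn.commit()

# A data science project might later extract an input data set from it:
rows = conn.execute(
    "SELECT customer, SUM(total) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
```

The transactional table serves the operational application; the aggregated extract at the end is the kind of query that feeds a data science input data set.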
A BI solution is a user-friendly decision support system that provides data aggregation, integration, and reporting, as well as analysis functionality. Depending on the maturity level of a BI architecture, it can consist of anything from a basic copy of an operational application, to an operational data store (ODS), to massively parallel processing (MPP) BI database solutions and data warehouses.

Figure 3 shows the typical architecture of the data science ecosystem.

- It is suitable for most organizations, both small and large.
- As an organization scales in size, so too will the complexity of its data science ecosystem. For example, smaller-scale organizations may not require the Hadoop component, but for very large organizations the Hadoop component becomes very important.

Online Video/Tutorials:
https://www.youtube.com/watch?v=bAyrObl7TYE
https://www.youtube.com/watch?v=GLAK3dF7dfg
https://www.youtube.com/watch?v=8pHzROP1D-w
https://www.youtube.com/watch?v=dMRtcQ_gIe8

Thank you!
