Full Transcript


Data Science Introduction

What is Data Science?
▲ Data scientist: "The Sexiest Job of the 21st Century" (Davenport and Patil, Harvard Business Review, 2012)
▲ Much of the data science explosion is coming from the tech world
▲ What does Data Science mean? Is it the science of Big Data? What is Big Data anyway? Who does Data Science, and where?
▲ Data science is the process of extracting and analysing useful information from data to solve problems that are difficult to solve analytically.
▲ For example, when you visit an e-commerce site and browse a few categories and products before making a purchase, you create data that analysts can use to figure out how you make purchases.
▲ It draws on different disciplines, such as mathematical and statistical modelling, extracting data from its source, and applying data visualization techniques.
▲ It also involves big data technologies for gathering both structured and unstructured data.
▲ It helps you find patterns that are hidden in raw data.
▲ The term "Data Science" has evolved as mathematical statistics, data analysis, and "big data" have changed over time.
▲ Data science is an interdisciplinary field that lets you learn from both organised and unorganised data. With data science, you can turn a business problem into a research project and then turn the research into a real-world solution.

History of Data Science

Why Data Science?
▲ According to IDC, worldwide data will reach 175 zettabytes by 2025.
▲ We generate massive amounts of data about many aspects of our lives, covering both online and offline activity, in real time as well as from the past.
▲ Datafication = "taking all aspects of life and turning them into data". "Once we datafy things, we can transform their purpose and turn the information into new forms of value."
▲ Data Science helps businesses comprehend vast amounts of data from different sources, extract useful insights, and make better data-driven choices.
▲ Data Science is used extensively in several industries, such as marketing, healthcare, finance, banking, and policy work.

Significant advantages of using Data Analytics Technology:
1) Data is the oil of the modern age. With the proper tools, technologies, and algorithms, we can leverage data to create a unique competitive edge.
2) Data Science can help detect fraud using sophisticated machine learning techniques.
3) It helps you avoid severe financial losses.
4) It enables the development of intelligent machines.
5) You can use sentiment analysis to gauge the brand loyalty of your customers.
6) It helps you make better and quicker decisions, and it lets you propose the right product to the right consumer to grow your business.

Need for Data Science
▲ The data we have and how much data we generate
▲ How companies have benefited from data science
▲ Demand and average salary of a data scientist

Impact of Data Science
▲ Data Science has had a significant influence on many aspects of modern society.
▲ The significance of Data Science to organisations keeps increasing. According to one study, the worldwide market for data science will reach $115 billion by 2023.
▲ The healthcare industry has benefited from the rise of data science.
▲ Google was able to build one of the first systems for monitoring the spread of diseases by using data science.
▲ The sports sector has similarly profited from data science.
▲ In 2019, a data scientist found ways to measure and calculate how goal attempts increase a soccer team's odds of winning.
▲ Indeed, data science is used to compute statistics easily in several sports.
▲ Government agencies also use data science on a daily basis.
▲ Governments throughout the globe employ databases to track information on social security, taxes, and other data pertaining to their residents.
▲ As the Internet has become the primary medium of human communication, the popularity of e-commerce has also grown.
▲ With data science, online firms can monitor the whole customer experience, including marketing efforts, purchases, and consumer trends.
▲ Advertising is one of the best examples of e-commerce firms using data science.
▲ Apps such as Tinder and Facebook use algorithms to help users locate precisely what they are seeking.
▲ The Internet is a growing treasure trove of data, and the gathering and analysis of this data will continue to expand.

What is data in data science?
▲ Data is the foundation of data science.
▲ Data is the systematic record of specified characters, quantities, or symbols on which computer operations are performed, and which may be stored and transmitted.
▲ It is a compilation of facts to be used for a certain purpose, such as a survey or an analysis.
▲ When structured, data may be referred to as information.
▲ The data source (original data, secondary data) is also an essential consideration.
▲ Data from a random experiment are often stored in a table or spreadsheet.
▲ By statistical convention, variables are referred to as features or columns, and individual items (or units) as rows.

Types of data:
There are mainly two types of data:

Qualitative data:
▲ Qualitative data consists of information that cannot be counted, quantified, or expressed simply using numbers.
▲ It is gathered from text, audio, and pictures and shared using data visualization tools, including word clouds, concept maps, graph databases, timelines, and infographics.
▲ The objective of qualitative data analysis is to answer questions about the activities and motivations of individuals.
▲ Collecting and analyzing this kind of data can be time-consuming.
▲ A researcher or analyst who works with qualitative data is referred to as a qualitative researcher or analyst.
Types of Qualitative data:
▲ There are mainly two types of qualitative data:
▲ Nominal data: In statistics, nominal data (also known as the nominal scale) is used to label variables without giving them a numerical value. It is the most basic type of measurement scale. In contrast to ordinal data, nominal data cannot be ordered or quantified. Examples include a person's name, hair colour, and nationality. Nominal data may be both qualitative and quantitative.
▲ Analyzing nominal data: Nominal data can be analyzed using the grouping approach: the variables are sorted into categories, and the frequency or percentage is determined for each category. The data may also be shown graphically, for example using a pie chart. Although nominal data cannot be manipulated with mathematical operators, it can still be studied using statistical techniques. Hypothesis testing is one approach to analysing such data. With nominal data, nonparametric tests such as the chi-squared test may be used to test hypotheses. The purpose of the chi-squared test is to evaluate whether there is a statistically significant discrepancy between the expected frequency and the observed frequency of the given values.
▲ Ordinal data: Ordinal data is a type of data in statistics where the values fall in a natural order. One of the most important properties of ordinal data is that the differences between the data values cannot be determined. Most of the time, the width of the data categories does not match the increments of the underlying attribute. In some cases, the characteristics of interval or ratio data can be obtained by grouping the values of the data. For instance, income ranges are ordinal data, while actual income is ratio data. Ordinal data cannot be manipulated with mathematical operators the way interval or ratio data can. The median is the only meaningful measure of the centre of a set of ordinal data.
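The chi-squared test for nominal data described above can be sketched with the standard library alone, using the textbook formula χ² = Σ (O − E)² / E. The hair-colour counts and the uniform null hypothesis here are invented illustration values, not data from the text:

```python
# Chi-squared goodness-of-fit sketch for nominal data.
# Observed counts are hypothetical hair-colour frequencies;
# the null hypothesis assumes every colour is equally likely.
observed = {"black": 50, "brown": 30, "red": 20}

total = sum(observed.values())
expected = total / len(observed)  # uniform expected frequency per category

# chi^2 = sum over categories of (O - E)^2 / E
chi_squared = sum((o - expected) ** 2 / expected for o in observed.values())

print(round(chi_squared, 3))  # prints 14.0 for these counts
```

A large statistic relative to the chi-squared distribution (here with 2 degrees of freedom) indicates a significant gap between observed and expected frequencies; in practice `scipy.stats.chisquare` performs the full test, including the p-value.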
▲ This data type is widely found in the fields of finance and economics. Consider an economic study that examines the GDP levels of various nations. If the report ranks the nations by their GDP, the rankings are ordinal data.

Analyzing ordinal data:
▲ Using visualisation tools is the easiest way to evaluate ordinal data.
▲ For example, the data may be displayed as a table in which each row represents a distinct category.
▲ In addition, it may be represented graphically using different charts.
▲ The bar chart is the most popular style of graph used to display this type of data.
▲ Ordinal data may also be studied using sophisticated statistical analysis methods such as hypothesis testing.

Qualitative data collection methods:
1. Data records: Using already existing data as the data source is a good technique for qualitative research. Similar to visiting a library, you can examine books and other reference materials to obtain data that can be used in the research.
2. Interviews: Personal interviews are one of the most common ways to collect data for qualitative research. The interview may be casual, without a set plan, and is often like a conversation. The interviewer or researcher obtains the information directly from the interviewee.
3. Focus groups: Focus groups are made up of 6 to 10 people who talk to each other. The moderator's job is to keep an eye on the conversation and direct it based on the focus questions.
4. Case studies: Case studies are in-depth analyses of an individual or group, with an emphasis on the relationship between developmental characteristics and the environment.
5. Observation: A technique in which the researcher observes the subject and takes notes in order to capture innate responses and reactions without prompting.

Quantitative data:
▲ Quantitative data consists of numerical values, has numerical features, and supports mathematical operations such as addition.
▲ Quantitative data is mathematically verifiable and evaluable due to its numerical character.
▲ The simplicity of its mathematical derivations makes it possible to control the measurement of different parameters.
▲ It is gathered for statistical analysis through surveys, polls, or questionnaires given to a subset of a population.
▲ Researchers can then generalise the collected findings to an entire population.

Types of Quantitative data:
▲ There are mainly two types of quantitative data:
▲ Discrete data: Data that can only take certain values, as opposed to a range. For instance, data about the blood type or gender of a population is discrete data. Another example is the number of visitors to a website: a site could have 150 visits in one day, but not 150.6 visits. Usually, tally charts, bar charts, and pie charts are used to represent discrete data.

Characteristics of discrete data:
▲ Since discrete data is simple to summarise and calculate, it is often used in elementary statistical analysis.
▲ Some other essential characteristics of discrete data:
1. Discrete data is made up of discrete variables that are finite, measurable, countable, and non-negative (5, 10, 15, and so on).
2. Simple statistical methods, such as bar charts, line charts, and pie charts, make it easy to show and explain discrete data.
3. Discrete data can also be categorical, meaning it has a fixed number of data values, such as a person's gender.
4. Discrete data that is time- and space-bound is distributed randomly.
5. Discrete distributions make it easier to analyse discrete values.

Continuous data:
▲ Data that may take any value within a certain range, including the greatest and lowest possible values.
▲ The difference between the greatest and least value is known as the data range.
▲ For instance, the height and weight of schoolchildren are continuous data.
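The tally-chart summary mentioned above for discrete data can be sketched with a simple frequency count; the daily visit numbers below are invented for illustration:

```python
from collections import Counter

# Hypothetical discrete data: number of website visits recorded each day.
daily_visits = [150, 151, 150, 149, 151, 150, 152]

# A frequency table (tally) is the natural summary for discrete values.
tally = Counter(daily_visits)
for value, count in sorted(tally.items()):
    print(f"{value} visits: {'|' * count}")
```

The same `tally` counts feed directly into a bar or pie chart, the usual visual forms for discrete data.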
▲ The tabular representation of continuous data is known as a frequency distribution.
▲ It may be depicted visually using histograms.

Characteristics of continuous data:
▲ Continuous data can be numeric or spread over time and date.
▲ This data type calls for advanced statistical analysis methods because there are an infinite number of possible values.
▲ The important characteristics of continuous data are:
1. Continuous data changes over time and can have different values at different points in time.
2. Continuous data is made up of random variables, which may or may not be whole numbers.
3. Data analysis tools such as line graphs, skews, and so on are used to measure continuous data.
4. Regression analysis is one of the most common types of continuous data analysis.

Quantitative data collection methods:
1. Surveys and questionnaires: These are good for getting detailed feedback from users and customers, especially about how people feel about a product, service, or experience.
2. Open-source datasets: There are many public datasets that can be found online and analysed for free. Researchers sometimes analyse data that has already been collected and interpret it in a way that fits their own research project.
3. Experiments: A common method is an experiment, which usually has a control group and an experimental group. The experiment is set up so that it can be controlled and the conditions can be changed as needed.
4. Sampling: When there are a lot of data points, it may not be possible to survey each person or data point. In this case, quantitative research is done with the help of sampling. Sampling is the process of choosing a sample of data that is representative of the whole. The two types of sampling are random sampling (also called probability sampling) and non-random sampling.
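The random versus non-random sampling distinction described above can be sketched as follows; the population of 1,000 respondent IDs is invented for illustration:

```python
import random

# Hypothetical population: 1,000 respondent IDs.
population = list(range(1000))

# Random (probability) sampling: every member has an equal chance of
# selection. A fixed seed keeps the sketch reproducible.
random.seed(42)
sample = random.sample(population, k=50)

# Non-random sampling, by contrast, might just take the first 50
# respondents, which risks systematic bias.
convenience_sample = population[:50]

print(len(sample), len(convenience_sample))  # prints: 50 50
```

`random.sample` draws without replacement, so the 50 IDs are distinct; a representative sample like this is what lets survey results generalise to the whole population.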
Types of Data collection:
▲ Data collection can be classified into two types according to the source:
▲ Primary data: Data acquired for the first time, for a particular purpose, by an investigator. Primary data are "pure" in the sense that they have not been subjected to any statistical manipulation and are authentic. An example of primary data is the Census of India.
▲ Secondary data: Data originally gathered by some other entity. This means the data have already been collected by researchers or investigators and are accessible in either published or unpublished form. Such data are "impure" in the sense that statistical computations may already have been performed on them. Examples include information available on the websites of the Government of India or the Department of Finance, or in other archives, books, journals, etc.

Big data:
▲ Big data is data whose volume is so large that dealing with it requires overcoming logistical challenges.
▲ Big data refers to bigger, more complicated data sets, particularly from novel data sources.
▲ Some data sets are so extensive that conventional data processing software cannot handle them.
▲ But these vast quantities of data can be used to solve business challenges that were previously unsolvable.
▲ Data science is the study of how to analyse huge amounts of data and extract information from them.
▲ We can compare big data and data science to crude oil and an oil refinery.
▲ Data science and big data grew out of statistics and traditional data management, but they are now regarded as distinct fields.
▲ The three Vs of big data:
1. Volume: How much data is there?
2. Variety: How varied are the kinds of data?
3. Velocity: How fast is new data generated?

How do we use data in data science?
▲ All data must undergo pre-processing.
▲ This is an essential series of steps that converts raw data into a more comprehensible and valuable format for further processing.
▲ Common procedures are:
1) Collecting and storing the dataset
2) Data cleaning: a) handling missing data, b) handling noisy data
3) Data integration
4) Data transformation: a) generalization, b) normalization, c) attribute selection, d) aggregation

Data Science Lifecycle
▲ What is the data science lifecycle?
▲ A data science lifecycle is a systematic approach to finding a solution to a data problem; it shows the steps taken to develop, deliver/deploy, and maintain a data science project.
▲ A standard data science lifecycle comprises the use of machine learning algorithms and statistical procedures that result in more accurate prediction models.
▲ Data extraction, preparation, cleaning, modelling, assessment, etc., are some of the most important data science stages. This methodology is known as the "Cross Industry Standard Process for Data Mining" (CRISP-DM).
▲ Data science process flowchart (O'Neil and Schutt)
▲ How many phases are there in the data science lifecycle? Six.

Identifying the problem and understanding the business:
▲ The data science lifecycle starts with "why?"
▲ Figure out what the problem is.
▲ This helps to set a clear goal around which all the other steps can be planned.
▲ This phase should:
1. Specify the issue, i.e. why the problem must be resolved immediately and demands an answer.
2. Specify the business project's potential value.
3. Identify risks, including ethical concerns, associated with the project.
4. Create and communicate a flexible, highly integrated project plan.

Data collection:
▲ Obtain raw data from appropriate and reliable sources.
▲ The data that is collected can be either organized or unorganized.
▲ Data could be collected from website logs, social media, online data repositories, data streamed from online sources using APIs, web scraping, or data in Excel or any other source.
▲ Information may be gathered through surveys or, more commonly, through automated data collection such as internet cookies, which is a primary source of unanalysed data.
▲ We can also use secondary data from open-source datasets.
▲ There are many websites from which we can collect data, for example:
▲ Kaggle (https://www.kaggle.com/datasets),
▲ UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/index.php),
▲ Google Public Datasets (https://cloud.google.com/bigquery/public-data/).

Data processing:
▲ The purpose of data processing is to check whether there are any problems with the acquired data so they can be resolved before proceeding to the next phase.
▲ Common problems: the data may have missing values in multiple rows or columns; it may include outliers, inaccurate values, timestamps in varying time zones, and problems with date ranges. For example, if data is gathered from many thermometers and any of them are defective, that data may need to be discarded or recollected.
▲ This phase admits multiple solutions.
▲ For example, if the data includes missing values, we can replace them with zero or with the column's mean value.
▲ However, if the column is missing a large number of values, it may be preferable to remove the column completely, since it holds too little data to be useful in solving the problem.
▲ Label encoding: All data must be in numeric representation for machine learning models. If a dataset includes categorical data, it must be converted to numeric values before the model can be executed.

Data analysis:
▲ Exploratory Data Analysis (EDA) is a set of visual techniques for analysing data.
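The missing-value handling and label encoding steps described above can be sketched with the standard library alone; the small age column and species labels are invented, and real projects would typically use pandas and scikit-learn for this:

```python
# Hypothetical column with missing entries (None marks a missing value).
ages = [25, None, 31, None, 40]

# Mean imputation: replace each missing value with the column mean.
known = [a for a in ages if a is not None]
mean_age = sum(known) / len(known)  # (25 + 31 + 40) / 3 = 32.0
ages_imputed = [a if a is not None else mean_age for a in ages]

# Label encoding: map each category to an integer so that machine
# learning models can consume the column.
species = ["setosa", "virginica", "setosa", "versicolor"]
mapping = {name: code for code, name in enumerate(sorted(set(species)))}
species_encoded = [mapping[s] for s in species]

print(ages_imputed)     # missing entries replaced by 32.0
print(species_encoded)  # setosa -> 0, versicolor -> 1, virginica -> 2
```

If a column were mostly `None`, the same logic would argue for dropping it instead of imputing, as the text notes.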
▲ With this method, we can get specific details on the statistical summary of the data.
▲ We can also deal with duplicate values and outliers, and identify trends or patterns within the collection.
▲ Here we attempt to gain a better understanding of the acquired and processed data.
▲ We apply statistical and analytical techniques to draw conclusions about the data and determine the relationships between columns in our dataset.
▲ Using pictures, graphs, charts, plots, etc., we can use visualisations to better comprehend and describe the data.
▲ Professionals use statistical measures such as the mean and median to better understand the data.
▲ Using histograms, spectrum analysis, and population distributions, they also visualise data and evaluate its distribution patterns.
▲ The data is analysed according to the problem at hand.

Data visualization:
▲ Target column: Our target column will be the Species column, since in the end we want results based on species.
▲ The Matplotlib and Seaborn libraries will be used for data visualization.
▲ There are many other visualization plots in data science. To learn more about them, see https://www.tutorialspoint.com/machine_learning_with_python/machine_learning_with_python_understanding_data_with_visualization.htm

Data modelling:
▲ Data modelling is one of the most important aspects of data science and is sometimes referred to as the core of data analysis.
▲ The intended output of a model should be derived from prepared and analysed data.
▲ The environment required to execute the data model is chosen and constructed before the specified criteria are met.
▲ At this phase, we develop datasets for training and testing the model for production-related tasks.
▲ This also involves selecting the correct model type and determining whether the problem involves classification, regression, or clustering.
▲ After deciding on the model type, we must choose the appropriate implementation algorithms.
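The train/test dataset preparation described above can be sketched without external libraries as an 80/20 split of a hypothetical 100-row dataset (the rows here are just indices; in practice scikit-learn's `train_test_split` does this job):

```python
import random

# Hypothetical dataset of 100 rows (represented by their indices).
rows = list(range(100))

# Shuffle first so the split is random, then cut at the 80% mark.
random.seed(0)
random.shuffle(rows)

split = int(0.8 * len(rows))
train, test = rows[:split], rows[split:]

print(len(train), len(test))  # prints: 80 20
```

Keeping the test rows out of training is what makes the later model assessment honest: the model is evaluated only on data it has never seen.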
▲ This must be performed with care, as it is crucial to extract the relevant insights from the provided data.
▲ Here machine learning comes into the picture. Machine learning models are broadly divided into classification, regression, and clustering models, and each model type has algorithms that are applied to the dataset to extract the relevant information.

Model deployment:
▲ The model is finally ready to be deployed in the desired format and on the chosen channel after a detailed review process.
▲ Note that a machine learning model has no utility unless it is deployed in production.
▲ Generally speaking, these models are associated and integrated with products and applications.
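As a minimal sketch of the deployment idea above: a trained model must be saved in some portable format so that an application can load and use it. The "model" here is just a dict of coefficients invented for illustration, serialized with the standard `pickle` module (one common approach among many):

```python
import pickle

# Toy stand-in for a trained model: a dict of learned coefficients.
model = {"intercept": 0.5, "slope": 2.0}

# Serialize the model so an application or service can load it later
# (in practice this blob would be written to a file or model registry).
blob = pickle.dumps(model)

# On the serving side, the application deserializes the model and
# uses it to make predictions.
loaded = pickle.loads(blob)
prediction = loaded["intercept"] + loaded["slope"] * 3.0

print(prediction)  # 0.5 + 2.0 * 3.0 = 6.5
```

Real deployments wrap this load-and-predict step in a product or application interface, which is the integration the text describes.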
