Summary

This document introduces data science: what data is, how data science and the data scientist are defined, why data science is needed, its main areas, and the data science process.

Full Transcript

INTRODUCTION TO DATA SCIENCE

Outline
o What is Data?
o What is Data Science and Data Scientist?
o Why Data Science?
o Main areas of Data Science
o Data Science process
o Example

What is Data?
Wikipedia: Data (singular datum) are individual units of information. A datum describes a single quality or quantity of some object.
Another definition: Data is a collection of facts, such as numbers, words, measurements, observations, or even just descriptions of things.

Data All Around
Lots of data is being collected and warehoused:
o Web data
o Telecom
o Bank/credit transactions
o Online trading and purchasing
o Social networks

Data is the New Oil
It's valuable, but if unrefined it cannot really be used.
o What to do with the collected data?
o How to utilize data?

Digging for data: Datafication
According to Cukier and Mayer-Schoenberger (2013), datafication is:
o A technological trend turning many aspects of our life into data, or
o A process of taking all aspects of life and turning them into data.
Once we datafy things, we can transform their purpose and turn the information into new forms of value.
K. Cukier and V. Mayer-Schoenberger (2013). "The Rise of Big Data".

Datafication Example 1: Social platforms (Facebook)
What data can we collect? What benefits can we get?
Social platforms collect and monitor data about our actions and friendships to market products and services to us.

Datafication Example 2: Banking
What data can we collect? What benefits can we get?
Data such as income, gender, age, etc. can be used to determine the likelihood of a person paying back a loan.

Datafication Example 3: Life insurance industry
What data can we collect? What benefits can we get?

What is Data Science?
Data science is the art and science of acquiring knowledge through data.
Data science is all about how we take data, use it to acquire knowledge, and then use that knowledge to:
o Make decisions
o Predict the future
o Understand the past/present
o Create new industries/products
o etc.

Data Scientist
There are several definitions of a data scientist.
o A data scientist is*: a specialist who applies their expertise in statistics and building machine learning models to make predictions and answer key questions, and who needs to be able to clean, analyze, and visualize data.
The highly popular term "Data Scientist" was coined by DJ Patil and Jeff Hammerbacher.
o Data scientists are**: those who crack complex data problems with their strong expertise in certain scientific disciplines. They work with several elements related to mathematics, statistics, computer science, etc. (though they may not be experts in all these fields).
* https://www.dataquest.io/
** https://www.edureka.co/

Why Data Science?
In this data age, it's clear that we have a surplus of data.
o But why should that necessitate an entirely new set of vocabulary?
o What was wrong with our previous forms of analysis?

Why Data Science? (cont.)
o The sheer volume of data makes it literally impossible for a human to parse it in a reasonable time.
o Data is collected in various forms and from different sources, and often arrives very unorganized.
o Data can be missing, incomplete, or just flat out wrong.
o Data often comes on very different scales.

Main Areas of Data Science
Understanding data science begins with three basic areas:
o Math/statistics: This is the use of equations and formulas to perform analysis.
o Computer programming: This is the ability to use code to create outcomes on the computer.
o Domain knowledge: This refers to understanding the problem domain (medicine, finance, social science, and so on).

Main Areas of Data Science (cont.)
While having only two of these three qualities can make you intelligent, it will also leave a gap. In order to gain knowledge from data, we must be able to:
o utilize computer programming (to access and manipulate data, develop models, visualize the results, etc.),
o understand the mathematics behind the models we derive,
o above all, understand our analyses' place in the domain we are in (domain expertise allows you to apply concepts and results in a meaningful and effective way).

The Math
o Math and statistics knowledge allows you to theorize about and evaluate algorithms, and to tweak existing procedures to fit specific situations.
o Math and statistics can be used to formalize relationships between variables.
o We will study basic mathematical and statistical principles that are handy when dealing with data science.
o Advice from Hadley Wickham, Chief Scientist at RStudio

Computer Programming
Computers help us accomplish tedious, time-consuming tasks that would otherwise take us ages to do manually. Computer languages help us communicate with machine processors. A computer can be programmed in many languages; similarly, data science can be done in many languages. Python, Julia, and R are some of the many languages available to us.
Python
In this course we will learn Python for a variety of reasons:
o Python is an extremely simple language to read and write, even if you've never coded before.
o It is one of the most common languages, both in production and in the academic setting (one of the fastest growing, as a matter of fact).
o The language's online community is vast and friendly.
o Python has prebuilt data science modules that data scientists can utilize.

Domain Knowledge
This category focuses mainly on having knowledge about the particular topic you are working on. Examples of such domains include medicine, marketing, banking, and industry.

Data Science Process
[Figure: the data science process, from Schutt, R., & O'Neil, C. (2013). Doing Data Science: Straight Talk from the Frontline.]
Annotations on the figure:
o If the data has duplicates, missing values, or outliers, we may go back to collect more data or spend more time cleaning the dataset.
o Examples of data products: a spam classifier, a search ranking algorithm, a recommendation system.
o This process is normally done by data engineers.

Data Engineer
A data engineer is responsible for preparing data for analytical or operational uses. Typical tasks include building data pipelines to pull together information from different source systems; integrating and cleansing data; and structuring it for use in individual analytics applications.
"Data pipelines are sequences of processing and analysis steps applied to data for a specific purpose."
The data engineer often works as part of an analytics team, providing data in a ready-to-use form to data scientists.

Overview of the Main Steps
The five essential steps to perform data science are as follows:
1. Asking an interesting question
2. Obtaining the data
3. Exploring the data
4. Modeling the data
5. Communicating and visualizing the results

1. Asking an interesting question
o This step can be seen as a brainstorming session.
o Understand the problem that needs to be addressed and solved.
o Data scientists have to frame the problem as a data science problem.
o Thus, they need to learn the domain knowledge and combine technical knowledge with data to come up with a solution that drives business value.

2. Obtaining the data
Once the question is determined, it is time to look to the world for data that might be able to answer that question. There are several sources of data, which can be private or public, for example:
o Open data, which is open to everyone (e.g. the World Health Organization (WHO) database)
o Data from companies
o Data from surveys
o Simulated data
o etc.

3. Exploring the data / Exploratory data analysis (EDA)
The basic tools of EDA are plots, graphs, and summary statistics, i.e. data profiling. Generally speaking, it's a method of:
o systematically going through the data,
o plotting distributions of all variables (e.g. using box plots),
o plotting time series of data,
o transforming variables (e.g. one-hot encoding),
o looking at all pairwise relationships between variables using scatterplot matrices,
o generating summary statistics for all of them (computing each variable's mean, minimum, maximum, and upper and lower quartiles, and identifying outliers).

EDA
With EDA, you want to understand the data, understand its shape, and try to connect your understanding of the process that generated the data to the data itself.
Although there's lots of visualization involved in EDA, we distinguish between EDA and data visualization: EDA is done toward the beginning of analysis, and data visualization is done toward the end to communicate one's findings.

4. Modeling the data
This step involves the use of statistical and machine learning models.
In this step, we are not only fitting and choosing models; we are also implementing mathematical validation metrics in order to quantify the models and their effectiveness.

5. Communicating and visualizing the results
This could take the form of reporting the results to a manager or coworkers, or publishing a paper in a journal. The main goal of data visualization is to have the reader quickly digest the data, including possible trends, relationships, and more. We must ensure that we are making each visual as effective as possible.

EXAMPLE: PREDICTING NEONATAL INFECTION
Map the problem into the data science process: Ask, Get, Explore, Model, Visualize.

THE DATA SCIENCE WORKFLOW: DEFINE THE PROBLEM / QUESTION
o Can I predict infection before it occurs?

THE DATA SCIENCE WORKFLOW: IDENTIFY AND COLLECT DATA
o Vital areas: heart rate, blood pressure, etc.
o Want to collect all data on the claim form (mostly free text)

THE DATA SCIENCE WORKFLOW: EXPLORE AND PREPARE DATA
o Aggregate data at the minute level
o Cluster like words

THE DATA SCIENCE WORKFLOW: BUILD AND EVALUATE MODELS
o Compare Decision Tree with Logistic Regression
o Start with Naïve Bayes Classifier

THE DATA SCIENCE WORKFLOW
Can you help JETT!
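The exploration and validation steps of the process (steps 3 and 4) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the course's own code: it uses only the standard library, the heart-rate readings and the actual/predicted labels are made-up values, and the function names (summarize, iqr_outliers, one_hot, classification_metrics) are hypothetical. The 1.5 x IQR rule for flagging outliers is a common convention assumed here; the slides do not say which rule to use.

```python
from statistics import mean, quantiles

# Made-up heart-rate readings (beats per minute); illustrative values only.
heart_rate = [120, 125, 118, 130, 122, 127, 190, 124]


def summarize(values):
    """Summary statistics: mean, minimum, maximum, and quartiles."""
    q1, median, q3 = quantiles(values, n=4)  # three cut points = quartiles
    return {
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
        "q1": q1,
        "median": median,
        "q3": q3,
    }


def iqr_outliers(values, k=1.5):
    """Flag values outside [q1 - k*IQR, q3 + k*IQR] (assumed rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def one_hot(labels):
    """Transform a categorical variable into 0/1 indicator columns."""
    categories = sorted(set(labels))
    rows = [[1 if label == c else 0 for c in categories] for label in labels]
    return categories, rows


def classification_metrics(actual, predicted):
    """Accuracy, precision, and recall for binary labels (1 = positive)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up labels: 1 = infection, 0 = no infection.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

print(summarize(heart_rate))
print(iqr_outliers(heart_rate))
print(one_hot(["web", "bank", "web"]))
print(classification_metrics(actual, predicted))
```

In practice these steps are usually done with libraries such as pandas and scikit-learn; the standard library is used here only to keep the sketch self-contained.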
