Introduction To Data Science PDF
Summary
This document introduces data science, its history, and its evolution. It also covers the field's rise in popularity, including the rise of programming languages like Python, techniques for collecting, analyzing, and interpreting data, and the role of the people and organizations that shaped its development.
Full Transcript
Introduction to Data Science Part 2

History of Data Science

The term data science was coined to describe a new profession tasked with making sense of massive amounts of data. Data science became the popular field it is today due to the rise of programming languages like Python and techniques for collecting, analyzing, and interpreting data.

John Tukey: American mathematician
Peter Naur: Danish computer engineer
C. F. Jeff Wu

o In 1962, John Tukey described a field he called "data analysis", which resembles modern data science.
o In 1974, Peter Naur proposed "data science" as an alternative name to computer science.
o In 1997, C. F. Jeff Wu used the term "data science" for the first time as an alternative name for statistics.

Timeline:
o 2000s: Internet
o 2005: Big data
o 2015: Artificial Intelligence (AI), Machine Learning, and Deep Learning

When Jonathan Goldman arrived for work at LinkedIn, the business networking site, in 2006, the place still felt like a startup. "People You May Know" ads achieved a click-through rate 30% higher than the rate obtained by other prompts to visit more pages on the site.

In 2008, Dr DJ Patil and Jeff Hammerbacher, heads of analytics and data at LinkedIn and Facebook respectively, coined the term 'data science' to describe the emerging field of study that focused on teasing out the hidden value in the data that was being collected from touchpoints all over the retail and business sectors.

How has the phrase Data Science evolved?

Data science finds its foundation and beginning in statistics. Its advancement and evolution have been facilitated mainly by:
o Artificial Intelligence
o Machine Learning
o Internet of Things

Data science began to grow in other industries, including medicine and engineering, as a result of the influx of fresh data and corporations seeking new ways to improve profit and make better judgments.

What is Data Science?
Data can come from diverse sources such as sensors, surveys, social media, business transactions, or scientific experiments. Data Science is a field that draws on structured and unstructured data, using scientific methods and algorithms, to generate insights, make predictions, and devise data-driven solutions. It uses large amounts of data, together with statistics and computation, to produce meaningful insights for decision making.

Data Science (data-driven science) is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processes, algorithms, and systems to extract or extrapolate knowledge and insights from noisy, structured, and unstructured data. It is a field of applied mathematics and statistics that provides useful information based on large amounts of complex data, or big data. Data scientists are often responsible for collecting and cleaning data, selecting appropriate analytical techniques, and deploying models in real-world scenarios.

Data science is the field of study that combines domain expertise, programming skills, and knowledge of mathematics and statistics to extract meaningful insights from data. It is about data gathering, analysis, and decision-making: finding patterns in data through analysis and making predictions about the future.

Data science is all about how we take data, use it to acquire knowledge, and then use that knowledge to do the following:
o Make better decisions (should we choose A or B?)
o Predictive analysis (what will happen next? predict the future)
o Pattern discovery (find patterns, or hidden information, in the data)
o Create new industries and products

Where is Data Science Needed?

Data Science is used in many industries in the world today, e.g. banking, consultancy, healthcare, and manufacturing.
Examples of where Data Science is needed:
o To foresee delays for flights, ships, trains, etc. (through predictive analysis)
o To create promotional offers
o To find the best-suited time to deliver goods
o To forecast the next year's revenue for a company
o To analyze the health benefits of training
o To predict who will win elections

Data Science can be applied in nearly every part of a business where data is available. Examples are:
o Consumer goods
o Stock markets
o Industry
o Politics
o Logistics companies
o E-commerce

Why Data Science?

o The sheer volume of data makes it literally impossible for a human to parse it in a reasonable time.
o Data is collected in various forms and from different sources, and often arrives unorganized or unstructured.
o Data can be missing, incomplete, or just flat out wrong.
o Often, we have data on very different scales, which makes it tough to compare.

Example – Sigma Technologies

What is a Data Scientist?

The title data scientist has skyrocketed in popularity over the past five years. Demand has been driven by the impact on an organization of using data effectively. There are chief data scientists now in startups, in large companies, in nonprofits, and in government. So, what exactly is a data scientist?

A data scientist is someone who extracts insights from messy data.
o Example 1: Facebook asks you to list your hometown and your current location, ostensibly to make it easier for your friends to find and connect with you. But it also analyzes these locations to identify global migration patterns and where the fanbases of different football teams live.
o Example 2: As a large retailer, Target tracks your purchases and interactions, both online and in-store. It uses the data to predictively model which of its customers are pregnant, to better market baby-related purchases to them.

How Does a Data Scientist Work?
A Data Scientist requires expertise in several areas:
o Machine Learning
o Statistics
o Programming (Python or R)
o Mathematics
o Databases

A data scientist doesn't do anything fundamentally new. We've long had statisticians, analysts, and programmers. What's new is the way data scientists combine several different skills in a single profession.

A Data Scientist must find patterns within the data. Before they can find the patterns, they must organize the data in a standard format. Here is how a Data Scientist works:
o Ask the right questions - To understand the business problem. Asking the right questions involves domain knowledge and expertise, coupled with a keen ability to see the problem, see the available data, and match up the two.
o Explore and collect data - From databases, web logs, customer feedback, etc.
o Extract the data - Transform the data to a standardized format.
o Clean the data - Remove erroneous values from the data.
o Find and replace missing values - Check for missing values and replace them with a suitable value (e.g. an average value).
o Normalize data - Scale the values to a practical range. For example, 140 cm is smaller than 1.8 m, yet the number 140 is larger than 1.8, so values must be brought to a common scale before they are compared.
o Analyze data, find patterns, and make future predictions.
o Represent the result - Present the result with useful insights in a way the "company" can understand.

Finally, a data scientist must be able to communicate. Data scientists are valued for their ability to create narratives around their work. They don't live in an abstract, mathematical world; they understand how to integrate the results into a larger story and recognize that if their results don't lead to action, those results are meaningless.
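The missing-value and normalization steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the height values are made up, the helper names (`to_metres`, `fill_missing_with_mean`, `min_max_scale`) are hypothetical, and min-max scaling is just one of several ways to bring values into a practical range.

```python
from statistics import mean

def to_metres(value, unit):
    """Convert a height to metres so that all values share one unit."""
    factors = {"cm": 0.01, "m": 1.0}
    return value * factors[unit]

def fill_missing_with_mean(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    average = mean(known)
    return [average if v is None else v for v in values]

def min_max_scale(values):
    """Rescale values linearly into the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        # All values identical: nothing to spread out.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

# Hypothetical raw heights recorded in mixed units, with one missing entry.
raw = [(140, "cm"), (1.8, "m"), None, (160, "cm")]

# Step 1: bring every known value to the same unit (metres).
heights_m = [None if r is None else to_metres(*r) for r in raw]

# Step 2: replace the missing value with the average of the known ones.
filled = fill_missing_with_mean(heights_m)

# Step 3: scale into [0, 1] so the values are directly comparable.
scaled = min_max_scale(filled)

print(heights_m)
print(filled)
print(scaled)
```

After the unit conversion, 140 cm correctly compares as smaller than 1.8 m, which the raw numbers 140 and 1.8 did not show. Filling with the mean is the simple option the text mentions; whether it is suitable depends on the data.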
Case study – what's in a job description?

Looking for a job in data science? Great, let us help. In this case study, 1,000 job descriptions for companies actively hiring data scientists (as of January 2016) were taken from the Web. The goal here is to look at some of the most common keywords people use in their job descriptions. The results are as follows (represented as the phrase, and then the number of times it occurred):

What Is a Data-Driven Organization?

Being a data-driven organization means culturally treating data as a strategic asset and then building capabilities to put that asset to use, not just for big decisions but also for everyday action on the frontline. In a data-driven organization, strategic decisions are based on the analysis and interpretation of data. In other words, companies take full advantage of business intelligence to improve their customer and market knowledge.

If the data scientists are isolated in a group that has no real contact with the decision makers, your organization's leadership will suffer from a lack of context and expertise. Major corporations and governments have created roles such as the chief data scientist (CDS) and chief data officer (CDO) to ensure that their leadership teams have data expertise. Examples include Walmart, the New York Stock Exchange, the cities of Los Angeles and New York, and even the US Department of Commerce and National Institutes of Health. The CDS/CDO is responsible for ensuring that the organization is data driven.

The most well-known data-driven organizations are consumer Internet companies: Google, Amazon, Facebook, and LinkedIn. However, being data driven isn't limited to the Internet. Walmart has pioneered the use of data since the 1970s. It was one of the first organizations to build large data warehouses to manage inventory across its business.
In the 1980s, Walmart realized that the quality of its data was insufficient, so to acquire better data it became the first company to use barcode scanners at the cash registers. The company wanted to know what products were selling and how the placement of those products in the store impacted sales. It also needed to understand seasonal trends and how regional differences impacted its customers. As the number of stores and the volume of goods increased, the complexity of its inventory management increased. Thanks to its historical data, combined with a fast predictive model, the company was able to manage its growth curve. To further decrease the time for its data to turn into a decision, it became the first large company to invest in RFID technologies.

Similarly, General Electric uses data to improve the efficiency of its airline engines. There are approximately 20,000 airplanes operating with 43,000 GE engines. Over the next 15 years, 30,000 more engines are expected to be in use. A 1% improvement in efficiency would result in $30 billion in savings over the next 15 years. Part of its effort to attack these problems has been the new GEnx engine. Each engine weighs 13,740 pounds, has 4,000 parts with 18 fan blades spinning at 1,242 ft/sec, and has a discharge temperature of 1,325°F. But one of the most radical departures from traditional engines is the amount of data that is recorded in real time. According to GE, a typical flight will generate a terabyte of data.

This data is used by the pilots to make better decisions about efficiencies, and by the airlines to find optimal flight paths as well as to anticipate potential issues and conduct preventative maintenance.
In Building Data Science Teams, we said that a data-driven organization acquires, processes, and leverages data in a timely fashion to create efficiencies, iterate on and develop new products, and navigate the competitive landscape.

The first steps in working with data are acquiring and processing. The best data-driven organizations focus on keeping their data clean. The data must be organized, well documented, consistently formatted, and error free. Cleaning the data is often the most taxing part of data science and is frequently 80% of the work. Setting up the process to clean data at scale adds further complexity. Successful organizations invest heavily in tooling, processes, and regular audits. They have developed a culture that understands the importance of data quality; otherwise, as the adage goes, garbage in, garbage out.

o Jonathan Goldman created one of the first data products at LinkedIn, People You May Know, which transformed the growth trajectory of the company.
o DJ Patil built and grew the data science team at LinkedIn into a powerhouse and co-coined the term "Data Scientist."
o Riley Newman worked on developing product analytics that was instrumental in Airbnb's growth.
o Jace Kohlmeier led the data team at Khan Academy that helped optimize learning for millions of students.

What is Data Analysis?

Data analysis typically involves working with smaller, structured datasets to answer specific questions or solve specific problems. This can involve tasks such as data cleaning, data visualization, and exploratory data analysis to gain insights into the data and develop hypotheses about relationships between variables. Data analysis focuses on extracting insights from existing data. Data analysts typically use statistical methods to test these hypotheses and draw conclusions from the data.
For example, a data analyst might analyze sales data to identify trends in customer behavior and make recommendations for marketing strategies.

Data Analysis vs Data Science

Data science and data analysis are both important disciplines in the field of data management and analysis, but they differ in several key ways. Both fields involve working with data. Data science is the more interdisciplinary of the two: it applies statistical, computational, and machine learning methods to extract insights from data and make predictions. Data analysis is more focused on examining and interpreting data to identify patterns and trends.
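The difference can be made concrete with a small Python sketch. It assumes a hypothetical list of monthly sales figures. The first function does what the text calls data analysis: it describes the trend already present in the data. The other two take one step toward data science by fitting a straight line with ordinary least squares and predicting the next month. The figures, the function names, and the choice of a linear model are illustrative assumptions, not a recommended forecasting method.

```python
def month_over_month_change(sales):
    """Data analysis: describe the existing trend as successive differences."""
    return [later - earlier for earlier, later in zip(sales, sales[1:])]

def fit_line(ys):
    """Least-squares fit of y = intercept + slope * x, with x = 0, 1, 2, ...

    Requires at least two observations.
    """
    n = len(ys)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope

def predict_next(ys):
    """Data science: use the fitted line to predict the next, unseen value."""
    intercept, slope = fit_line(ys)
    return intercept + slope * len(ys)

# Hypothetical monthly sales figures (units sold).
monthly_sales = [100, 110, 120, 130]

changes = month_over_month_change(monthly_sales)  # [10, 10, 10]
forecast = predict_next(monthly_sales)            # close to 140 for this data

print("Month-over-month change:", changes)
print("Forecast for next month:", forecast)
```

The first result summarizes what has already happened; the second is a statement about data that does not exist yet, which is why it needs a model and carries uncertainty.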