Document Details


Uploaded by VersatilePlateau1277

Al-Ahliyya Amman University

Dr. Ashraf ALDabbas

Tags

data science, data collection, data storage, data pipelines

Summary

This document provides a lecture on data science fundamentals, covering data collection and management, storage and retrieval, and data pipelines. It details various data sources and types, including web data, survey data, and open data. The document also discusses the concept of data pipelines and how they are used to automate data collection and storage.

Full Transcript


DATA SCIENCE FUNDAMENTALS
Data Collection & Management in Data Science (Part 1) · Data Storage and Retrieval (Part 2) · Data Pipelines (Part 3)
Dr. Ashraf ALDabbas

Data Collection & Management in Data Science

Now that we understand the data science workflow, we'll dive deeper into its first step: data collection and storage. We'll learn about the different data sources you can draw from, what that data looks like, how to store the data once it's collected, and how a data pipeline can automate the process.

Sources of data

We generate vast amounts of data on a daily basis simply by surfing the internet, tracking a run, or paying by card in a shop. The companies behind these services collect this data internally and use it to make data-driven decisions. On the other hand, there are also many free, open data sources available, meaning the data can be freely used, shared, and built on by anyone. Note that companies sometimes share parts of their data with the wider public as well. Let's first take a look at company data sources.

Company data

Some of the most common company sources of data are web events, survey data, customer data, logistics data, and financial transactions. Let's dive a bit deeper into web data.

Web data

When you visit a web page or click on a link, this information is usually tracked by companies in order to calculate conversion rates or monitor the popularity of different pieces of content. The following information is captured: the name of the event, which could be the URL of the page visited or an identifier for the element that was clicked; the timestamp of the event; and an identifier for the user who performed the action.

Survey data

Data can also be collected by asking people for their opinions in surveys.
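A single tracked web event like the ones described above can be pictured as a small record. This is only a sketch: the field names and values below are illustrative, not a fixed standard.

```python
# An illustrative web event: the event name (a URL or an identifier
# for the clicked element), a timestamp, and a user identifier.
web_event = {
    "event": "click:signup-button",
    "timestamp": "2023-03-03T14:25:31Z",
    "user_id": "u-48151623",
}

# Conversion rate for a page: users who clicked divided by users who
# viewed it (made-up numbers for illustration).
views, clicks = 2000, 150
conversion_rate = clicks / views
print(f"{conversion_rate:.1%}")  # 7.5%
```

Aggregating many such records over time is what lets a company monitor the popularity of different pieces of content.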
This can be done, for example, through a face-to-face interview, an online questionnaire, or a focus group.

Net Promoter Score

You've likely answered a question like the one shown in the image before. This is a very common type of survey data used by companies: the Net Promoter Score, or NPS, which asks how likely a user is to recommend a product to a friend or colleague.

Open data

There are multiple ways to access open data. Two of them are APIs and public records.

Public data APIs

Let's begin with APIs. API stands for Application Programming Interface. It's an easy way of requesting data from a third party over the internet. Many companies have public APIs to let anyone access their data. Some notable APIs include Twitter, Wikipedia, Yahoo! Finance, and Google Maps, but there are many, many more.

Tracking a hashtag

Let's look at an example of the Twitter API. Suppose we want to track tweets with the hashtag #DataFramed, DataCamp's podcast on data science. We can use the Twitter API to request all tweets with this hashtag. At this point, we have many options for analysis. We could perform a sentiment analysis on the text of each tweet to get an idea of how people like the podcast. We could simply track how often #DataFramed appears each week. We could also combine this data with our download figures and see whether positive tweets are correlated with more downloads.

Public records

Public records are another great way of gathering data. They can be collected and shared by international organizations like the World Bank, the UN, or the WTO; by national statistical offices, which use census and survey data; or by government agencies, which make information about, for example, the weather, environment, or population publicly available. In the US, data.gov offers health, education, and commerce data for free download; in the EU, data.europa.eu offers similar data.
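The hashtag-tracking idea from the Twitter example above can be sketched without the API itself. Assume we have already pulled a batch of tweets; the `tweets` list below is made-up stand-in data for what such a request might return, with each tweet tagged by ISO week.

```python
from collections import Counter

# Made-up (ISO week, tweet text) pairs standing in for API results.
tweets = [
    ("2023-W01", "Loved the latest #DataFramed episode!"),
    ("2023-W01", "New year, new data projects"),
    ("2023-W02", "#DataFramed had a great guest this week"),
    ("2023-W02", "Listening to #DataFramed on my commute"),
]

# Count how often the hashtag appears each week.
mentions_per_week = Counter(
    week for week, text in tweets if "#DataFramed" in text
)
print(mentions_per_week["2023-W01"], mentions_per_week["2023-W02"])  # 1 2
```

The same per-week counts could then be joined with download figures to check whether mentions and downloads move together.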
Exercise: Sorting data sources

Data collection is the first step in the data science workflow; without data, there wouldn't be any data science. Depending on the project you are working on, you will need different data sources. Consider the following data needs: is this data publicly available, or something that is tracked by companies internally?

Exercise: Classifying data types

It's important to know what type of data you have collected. This matters later in the data science workflow, when you want to store the data, and later still when you perform the analyses. There are two types of data: qualitative data and quantitative data. Instructions: classify the data as the correct data type.

Exercise: Asthma frequencies

You've realized by now that data can come from various sources, and not all of them are publicly available. A data science report contains the following visualization. With your new knowledge of data sources, you can identify where the data behind it originated. What source has the data scientist most likely used to collect this data?

Data types

You now know where to collect data. But what does that data look like? In this lecture we'll talk about the different types of data.

Why care about data types?

You might wonder why it's important to know what type of data you have collected. This will be essential later in the data science process. For instance, it's especially relevant when you want to store the data, which we'll cover in the next lecture, as not all types of data can be stored in the same place. Furthermore, when you're visualizing or analyzing the data, it's important to know the type of data you're dealing with: not all visualizations or analyses can be performed with all data types. So, let's dive in.

Quantitative vs. qualitative data

There are two general types of data: qualitative and quantitative data.
It's important to understand the key differences between the two. Quantitative data can be counted, measured, and expressed using numbers. Qualitative data is descriptive and conceptual: it can be observed but not measured. Now that we know the differences, let's dive into each type with a real-world example.

Quantitative data

Quantitative data can be expressed in numbers. For example, the fridge is 60 inches tall, has two apples in it, and costs 1,000 dollars.

Qualitative data

Qualitative data, on the other hand, covers things that can be observed but not measured, like: the fridge is red, was built in Italy, and might need to be cleaned out because it smells like fish.

Other data types

Beyond traditional quantitative and qualitative data, there are many other data types that are becoming more and more important: image data, text data, geospatial data, network data, and many more. Note that these other data types aren't mutually exclusive with quantitative and qualitative data; often they are a mix of the two. Let's look at some examples.

Other data types: Image data

Digital images are everywhere. An image is made up of pixels, which contain information about color and intensity. Typically, the pixels are stored in computer memory. In the example, you can see that if we zoom in on the image we can distinguish the individual pixels.

Other data types: Text data

Emails, documents, reviews, social media posts, and so on: text data can be found in many places. This data can be stored and analyzed to find relevant insights. On the slide, you can see an example of a restaurant review.

Other data types: Geospatial data

Geospatial data is data with location information.
In the example, you can see that many different types of information can be captured with geospatial data. For a specific region we can keep track of where the roads, buildings, and vegetation are. This is especially useful for navigation apps like Waze and Google Maps.

Other data types: Network data

Network data consists of the people or things in a network, depicted by circles on the slide, and the relationships between them, depicted by lines. Here you can see an example of a social network: you can easily see who knows whom.

Recap

In this lecture we looked at the most common data types: quantitative data, qualitative data, image data, text data, geospatial data, and network data. These can all serve as inputs for your data science analysis. But before that, the data needs to be stored, which is what we'll cover in the next class. First, let's see if you know the difference between the different types of data. Let's practice!

Exercise: Classifying data types

It's important to know what type of data you have collected. This matters later in the data science workflow, when you want to store the data, and later still when you perform the analyses. There are two types of data: qualitative data and quantitative data. Instructions: classify each item as the correct data type.

1. The price of a cup of coffee in Parisian cafés
2. The eye color of people participating in a study
3. The daily average temperature in New York during 2019
4. Images of several cats
5. The reviews for a property on Airbnb
6. The individual weight of all the dogs in a shelter

Exercise: Net Promoter Score

Net Promoter Score (or NPS) is a common metric companies use to track the success of a product or website. It's measured by asking a simple question: how likely is it that you would recommend [insert brand/website/service/product] to a friend or colleague?
Users respond on a scale of 0 to 10, with 0 being not at all likely to recommend and 10 being extremely likely to recommend. Which of the following best describes NPS data? Possible answers (select one): qualitative data; quantitative data.

Exercise: Activity tracker

Jane's New Year's resolution this year was to get into the best shape of her life. To help her achieve this goal, she decided to invest in an activity tracker. After some months of tracking her activity, there is quite a lot of data available. The company that manufactured the activity tracker has a public API that allows access to your personal data. Jane is specifically interested in the GPS data of her runs because she wants to make a heatmap showing her most common running routes. What type of data will she be extracting from the API? Possible answers (select one): image data; text data; geospatial data; network data.

Data Storage and Retrieval

Previously in this chapter, you learned about different data sources and data types. Now, let's discuss efficient ways of storing and retrieving the data that was collected. As you can see, this is still part of the first step in the data science workflow we defined before.

Things to consider when storing data

When storing data, there are multiple things to take into consideration. First, we need to determine where we want to store the data. Then, we need to know what kind of data we are storing. And lastly, we need to consider how we can retrieve our data from storage. Let's take a closer look.

Location: Parallel storage solutions

Data science projects can require large amounts of data. At that point, the data probably can't be stored on a single computer anymore. To make sure that all data is saved and easy to access, it is stored across many different computers.
Large companies often have their own set of storage computers, called a cluster or a server, on premises.

Location: The cloud

Alternatively, you could pay another company to store data for you. This is referred to as cloud storage. Common cloud storage providers include Microsoft Azure, Amazon Web Services (AWS), and Google Cloud. These services provide more than just data storage; they can also help your organization with data analytics, machine learning, and deep learning. For now, we'll just focus on data storage.

Types of data storage

Different types of data require different storage solutions. Some data is unstructured, like email, text, video and audio files, web pages, and social media messages. This type of data is often stored in a document database. More commonly, data can be expressed as tables of information, like what you might find in a spreadsheet. A database that stores information in tables is called a relational database. Both of these types of databases can be found on the cloud storage providers mentioned earlier.

Retrieval: Data querying

Once data has been stored in a document database or a relational database, we'll need to access it. At a basic level, we'll want to be able to request a specific piece of data, such as "all of the images that were created on March 3rd" or "all of the customer addresses in Montana". In addition, we might even want to do some analysis, such as summing, counting, or averaging data. Each type of database has its own query language: document databases mainly use NoSQL, while relational databases mainly use SQL. SQL stands for Structured Query Language, and NoSQL stands for "Not only SQL".

Putting it all together: Location

Storing your data is like building a library. First, you need to decide where to build your library.
That corresponds to choosing a location: either an on-premises cluster or one of the cloud providers we discussed before: Azure, AWS, or Google Cloud.

Putting it all together: Data type

Next, you need to decide what types of shelves to install to store your books. The types of shelves will depend on the types of books. This is analogous to choosing between a document database for unstructured data and a relational database for tabular data. Just like a library might have multiple types of shelves, you might need some data stored in a document database and other data stored in a relational database.

Putting it all together: Queries

Finally, you'll need a system for referencing and checking out books. The way you locate and retrieve each book depends on how that book is stored. Similarly, you need a query language to speak to the database: for document databases, we generally use NoSQL, and for relational databases, we generally use SQL. Now that you understand different ways of storing data, let's practice!

Exercise: Cloud platforms

Jerome has collected a lot of data for a data science project he's working on. His goal is to build a face recognition algorithm, and to do that he has collected thousands of images. He needs to decide which cloud provider to choose for storing the data. Which of the following is NOT an example of a cloud provider? Possible answers (select one): Google Cloud; Amazon Web Services; Microsoft Azure; SQL Server.

Exercise: Which type of database?

It's important to understand what type of data you are dealing with because it will affect your storage decision. Some data is tabular and belongs in a relational database; some is unstructured and belongs in a document database. Instructions: sort each dataset into the correct type of database.

Data Pipelines

Let's learn about data pipelines.
So far we've learned about data collection and storage, but how can we scale all this? This is where data pipelines come in.

Data collection and storage

Data engineers work to collect and store data so that others, like analysts and data scientists, can access it for their work, whether that's visualization or building machine learning models.

How do we scale?

But how do we scale this? Consider the different data sources you learned about: what if we're collecting data from more than one source? And what if these sources have different types of data? For example, consider real-time streaming data, which is continuously being generated, like tweets from all around the world. This makes storing the incoming data complicated, because as a data engineer you want to make sure data is organized and easy to access.

What is a data pipeline?

Enter the data pipeline. A data pipeline moves data through defined stages, for example from data ingestion through an API to loading data into a database. A key feature is that pipelines automate this movement. Data is constantly coming in, and it would be tedious to ask a data engineer to manually run programs to collect and store it. Instead, a data engineer schedules tasks, whether hourly or daily, or tasks can be triggered by an event. Because of this automation, data pipelines need to be monitored. Luckily, alerts can be generated automatically, for example when 95% of storage capacity has been reached or when an API is responding with an error.

Data pipelines aren't necessary for all data science projects, but they are when working with a lot of data from different sources. There isn't a set way to make a pipeline: pipelines are highly customized depending on your data, storage options, and the ultimate use of the data. ETL, which stands for extract, transform, and load, is a popular framework for data pipelines.
Let's explore it with a case study.

Case study: smart home

After learning about IoT devices and APIs, you decide to try out both. Specifically, you want to use APIs and devices in your house to better understand the status of your house and neighborhood. You gather a list of data sources and their associated information. The first two are provided by APIs: every 30 minutes you get the weather conditions, and you get tweets geotagged in your neighborhood whenever they are published. The remaining rows are IoT devices that send their sensor data over the internet at the specified frequencies.

Extract

How does it all come together? First, we begin by extracting all the data from the sources we listed, whether that means calling an API or setting up an IoT device. However, a quick look at the frequencies and structures makes us realize that storing the raw data as-is won't work.

Transform

This is where the transform phase comes in. In this stage, we transform data to keep it organized so that others can easily find relevant data and use it. A common transformation is joining data from different sources into one dataset. Another is converting the incoming data's structure to fit a database's schema; a database's schema tells us how data must be structured before loading it in. A transformation can also be removing irrelevant data. For example, the Twitter API gives you not only a tweet but other details, like the number of followers the author has, which is not useful in our scenario, so we shouldn't store it. Although data gets altered throughout the data science workflow, it's important to note that analytical tasks like data preparation and exploration don't occur at this stage; we'll see those in the next chapter. With all the data coming in, how do we keep it organized and easy to use?
Example transformations:
- Joining data sources into one dataset
- Converting data structures to fit database schemas
- Removing irrelevant data

Data preparation and exploration do not occur at this stage.

Load

Finally, we load the data into storage so that it can be used for visualization and analysis.

Automation

Once we've set up all those steps, we automate. For example, we can say that every time we get a tweet, we transform it in a certain way and store it in a specific table in our database. There are tools that specialize in this; the most popular is called Airflow. Let's practice!

Exercise: Data pipeline characteristics

Which of the following statements is true? Possible answers (select one):
- A data pipeline is essential for every data science project.
- In the transform phase of ETL, data analysts perform exploratory data analysis.
- Data engineers design and build custom data pipelines for projects.
- Data pipelines do not require automation.

Exercise: Extract Transform Load

Tech companies have complex, large-scale data pipelines to deal with the huge amount of data coming from millions of users, whether it's incoming social media posts, viewership data of TV episodes, or recent online purchases. Imagine the data pipeline of a music streaming service. In this exercise, several pipeline tasks are listed. Instructions: classify the tasks within the categories extract, transform, and load.
