Data Preparation, Exploration, and Visualization PDF
Document Details
Uploaded by VersatilePlateau1277
Al-Ahliyya Amman University
Dr. Ashraf ALDabbas
Tags
Summary
This presentation provides a foundational understanding of data preparation and visualization steps within data science. It explains why data preparation is essential and introduces common issues like missing values, outliers, and duplicates. The presentation also highlights the importance of ensuring data quality for accurate analysis and visualization.
Full Transcript
Preparation, Exploration, and Visualization Data preparation Dr. Ashraf ALDabbas DATA SCIENCE Data preparation is fundamental: data scientists spend 80% of their time cleaning and manipulating data, and only 20% of their time actually an...
Preparation, Exploration, and Visualization Data preparation Dr. Ashraf ALDabbas DATA SCIENCE Data preparation is fundamental: data scientists spend 80% of their time cleaning and manipulating data, and only 20% of their time actually analyzing it. This chapter will show you how to diagnose problems in your data, deal with missing values and outliers. You will then learn about visualization, another essential tool to both explore your data and convey your findings. Data workflow Data preparation happens after collecting and storing the data. DATA SCIENCE Why prepare data? Real-life data is messy ( Skipping this step may lead to)Preparation is done to prevent: errors incorrect results biasing algorithms You would not use vegetables without cleaning, peeling and dicing them, as your soup would taste weird and no one would eat it. Well, if you don't clean, peel and dice your data, your results will look weird, and no DATA SCIENCE one will use them! Let's start cleaning Sara Lis Hadrien Lis Age "27" "30" "30" Size 1.77 5.58 1.80 5.58 Country "Belgium" "USA" "FR" "USA" Let's take a simple, but dirty, dataset, and clean it together. Maybe you can already notice a few things. DATA SCIENCE Tidy data Befor e Sara Lis Hadrien Lis Age "27" "30" "30" Size 1.77 5.58 1.80 5.58 Country "Belgium" "USA" "FR" "USA" One fundamental aspect of cleaning data is "tidiness". Tidy data is a way of presenting a matrix of data, with observations on rows and variables as columns. This is not the case here. Our observations (people) are in columns, and their features are on rows. Let's take care of that. It's easy to do that programmatically with Python or R. They also DATA help with the SCIENCE Tidy data output Befor Afte e r Sara Lis Hadrien Lis Name Age Size Country Age "27" "30" "30" Sara "26" 1.78 "Belgium" Size 1.77 5.58 1.80 5.58 Lis "30" 5.58 "USA" Country "Belgium" "USA" "FR" "USA" Hadrien 1.80 "FR" Lis "30" 5.58 "USA" The data looks much clearer this way. DATA SCIENCE Remove duplicates Befor e Name Age Size Country Sara "27" 1.77 "Belgium" Lis "30" 5.58 "USA" Hadrien 1.80 "FR" Lis "30" 5.58 "USA" In general, you want to remove duplicates. Python and R make them easy to identify. Here we can see that Lis appears twice. DATA SCIENCE Remove duplicates | output Befor Afte e r Name Age Size Country Name Age Size Country Sara "27" 1.77 "Belgium" Sara "27" 1.77 "Belgium" Lis "30" 5.58 "USA" Lis "30" 5.58 "USA" Hadrien 1.80 "FR" Hadrien 1.80 "FR" Lis "30" 5.58 "USA" Let's remove the duplicate. DATA SCIENCE Unique ID Befor e Name Age Size Country Sara "27" 1.77 "Belgium" Lis "30" 5.58 "USA" Hadrien 1.80 "FR" What if there's another person called Lis? Then, you want a way to uniquely identify each observation. It can be a combination of features (name plus last name plus year of birth, for example), DATA SCIENCE Unique ID | output Befor Afte e r Name Age Size Country ID Name Age Size Country Sara "27" 1.77 "Belgium" 0 Sara "27" 1.77 "Belgium" Lis "30" 5.58 "USA" 1 Lis "30" 5.58 "USA" Hadrien 1.80 "FR" 2 Hadrien 1.80 "FR" but the safest way is to assign a unique ID. Sara's ID is now 0, Lis' 1 and Hadrien's 2. DATA SCIENCE Homogenei ty Befor e ID Name Age Size Country 0 Sara "27" 1.77 "Belgium" 1 Lis "30" 5.58 "USA" 2 Hadrien 1.80 "FR" Something fishy is going on in the size column. Lis simply can't be that tall (or Hadrien and Sara that small, depending where you're from). Lis is in the US, she inputted her size in feet. Sara and Hadrien are based in Europe, they use the metric system. All variables should use the same standard. DATA SCIENCE Homogeneity | output Befor Afte e r ID Name Age Size Country ID Name Age Size Country 0 Sara "27" 1.77 "Belgium" 0 Sara "27" 1.77 "Belgium" 1 Lis "30" 5.58 "USA" 1 Lis "30" 1.70 "USA" 2 Hadrien 1.80 "FR" 2 Hadrien 1.80 "FR" Programmatically, you can filter values above 2.5 meters, and apply a division by 3.281 to get the metric value. Here we go. DATA SCIENCE Homogeneity, again Befor e ID Name Age Size Country 0 Sara "27" 1.77 "Belgium" 1 Lis "30" 1.70 "USA" 2 Hadrien 1.80 "FR" Similarly, countries should follow the same format. The United States and France are abbreviated, but Belgium is written in full. Let's fix that. DATA SCIENCE Homogeneity, again | output Before After ID Name Age Size Country ID Name Age Size Country 0 Sara "27" 1.77 "Belgium" 0 Sara "27" 1.77 "BE" 1 Lis "30" 1.70 "US" 1 Lis "30" 1.70 "US" 2 Hadrien 1.80 "FR" 2 Hadrien 1.80 "FR" Looking better already! DATA SCIENCE Data types Befor e ID Name Age Size Country 0 Sara "27" 1.77 "BE" 1 Lis "30" 1.70 "US" 2 Hadrien 1.80 "FR" Another common issue relates to data types. The tools you use might be able to infer data types for each column, but you'd better make sure they are correct. Here, the Age column is encoded as text. If you try to get the mean, you'll get an error, because the average of two words doesn't make sense. You should change the type of this feature to numbers. DATA SCIENCE Data types | output Befor Afte e r ID Name Age Size Country ID Name Age Size Country 0 Sara "27" 1.77 "BE" 0 Sara 27 1.77 "BE" 1 Lis "30" 1.70 "US" 1 Lis 30 1.70 "US" 2 Hadrien 1.80 "FR" 2 Hadrien 1.80 "FR" Ages are now numbers; you can see the quotes have disappeared. DATA SCIENCE Missing Reason s: values Befor ID Name Age Size Country data entry e error 0 Sara 27 1.77 "BE" 1 Lis 30 1.70 "US" valid missing value 2 Hadrien 1.80 "FR" Solutions: Last but not least, missing values. They are common and occur for imput various reasons: the agent doing the entry was distracted, the person surveyed did not understand the question, or it's on e purpose, for example an event that has not happened yet. drop There are several ways to deal with missing values. You can keep substitute the exact value if you have access to the source. For example, you can take an aggregate value, like the mean, median or max depending on the situation. You can drop the observation altogether, but each observation you remove means less training data for your model. Or, you can keepDATA it SCIENCE Missing values | output Befor Afte e r ID Name Age Size Country ID Name Age Size Country 0 Sara 27 1.77 "BE" 0 Sara 27 1.77 "BE" 1 Lis 30 1.70 "USA" 1 Lis 30 1.70 "US" 2 Hadrien 1.80 "FR" 2 Hadrien 28 1.80 "FR" Here, we take the mean, 27.5, and round it up to get 28, which happens to be the correct value. DATA SCIENCE Let's practice! Classify the following statements as True (T) or False (F) Exploratory Data Analysis Dr. Ashraf ALDabbas DATA SCIENCE What is EDA? Data Exploratory Analysis: Exploring the data Formulating hypotheses Assessing characteristics Visualizing Exploratory Data Analysis, or EDA for short, is a process that was promoted by John Tukey, a respected statistician. It consists in exploring the data and formulating hypotheses about it, and assessing its main characteristics, with a strong emphasis on visualization. DATA SCIENCE Data workflow EDA happens after data preparation, but they can get mixed. EDA can reveal new things that need cleaning. DATA SCIENCE Let's dive right in Dataset 1 Dataset 2 Dataset 3 Dataset 4 |x |y | |x |y | |x |y | |x |y | | | | | | | | | | | | | |10.0 |8.04 | |10.0 |9.14 | |10.0 |7.46 | |8.0 |6.58 | |8.0 |6.95 | |8.0 |8.14 | |8.0 |6.77 | |8.0 |5.76 | |13.0 |7.58 | |13.0 |8.74 | |13.0 |12.74| |8.0 |7.71 | |9.0 |8.81 | |9.0 |8.77 | |9.0 |7.11 | |8.0 |8.84 | |11.0 |8.33 | |11.0 |9.26 | |11.0 |7.81 | |8.0 |8.47 | |14.0 |9.96 | |14.0 |8.10 | |14.0 |8.84 | |8.0 |7.04 | |6.0 |7.24 | |6.0 |6.13 | |6.0 |6.08 | |8.0 |5.25 | |4.0 |4.26 | |4.0 |3.10 | |4.0 |5.39 | |19.0 |12.50| |12.0 |10.84| |12.0 |9.13 | |12.0 |8.15 | |8.0 |5.56 | |7.0 |4.82 | |7.0 |7.26 | |7.0 |6.42 | |8.0 |7.91 | |5.0 |5.68 | |5.0 |4.74 | |5.0 |5.73 | |8.0 |6.89 | Let's dive right in. What can you say about these four different datasets? Well, from these lines of data, probably very little. DATA SCIENCE Surpri se! All four datasets display: identical mean and variance for x identical mean and variance for y identical correlation coeffi cient identical linear regression equation In short: they look quite similar, But is that the case? DATA SCIENCE Anscombe's quartet No! Here are the four graphs. They all tell a different story, that pure metrics can't fully convey. DATA SCIENCE The first graph displays a linear relationship, DATA SCIENCE while the second one has a non-linear relationship. DATA SCIENCE In the third graph, we see the linear line is thrown off by one point that has an extreme y value. DATA SCIENCE A similar thing happens with the fourth dataset. We should have no correlation, but one extreme point is enough to display a strong one. In short, streaming through the data gives you little information. Descriptive statistics do better, but can be misleading; visualization teaches us the most. That's why EDA relies heavily on this last technique. This was an extreme DATA SCIENCE Let's look at SpaceX launches! DATA SCIENCE Knowing your data I mean, let's look at the data behind SpaceX launches. The first thing to do is to know what features we're looking at. We have different information, such as the flight number or what the rocket transported. All have the correct data type. Flight Number Payload Mass (kg) (number) (number) Date (datetime) Orbit (text) Time (UTC) Customer (datetime) Booster (text) Version (text) Mission Outcome Launch Site (text) (text) DATA SCIENCE Payload (text) Previewing your data Flight Date Time (UTC) Booster Version Launch Site Payload 1 2010-06-04 18:45:00 F9 v1.0 B0003 CCAFS LC-40 Dragon Spacecraft Qualification Unit 2 2010-12-08 15:43:00 F9 v1.0 B0004 CCAFS LC-40 Dragon demo flight C1, two CubeSats... 3 2012-05-22 7:44:00 F9 v1.0 B0005 CCAFS LC-40 Dragon demo flight C2+ 4 2012-10-08 0:35:00 F9 v1.0 B0006 CCAFS LC-40 SpaceX CRS-1 5 2013-03-01 15:10:00 F9 v1.0 B0007 CCAFS LC-40 SpaceX CRS-2 Payload Mass (kg) Orbit Customer Mission Outcome Landing Outcome NaN LEO SpaceX Success Failure (parachute) LEO (ISS) NASA (COTS) NRO Failure NaN LEO NASA Success (parachute) No (ISS) (COTS) attempt 525 Success 500 LEO (ISS) NASA (CRS) Success No attempt 677 LEO (ISS) NASA (CRS) Success No attempt Looking at your tables helps make sense of your observations. Can you notice the missing payload mass for the first two rows? DATA SCIENCE Descriptive statistics Flight Date Time (UTC) Booster Version Launch Site Payload count 55 55 55 55 55 55 unique 55 55 53 51 4 55 top 6 2018-03-30 4:45:00 F9 v1.1 CCAFS LC-40 SES-9 freq 1 1 2 5 26 1 Payload Mass (kg) Orbit Customer Mission Outcome Landing Outcome count 53 55 55 55 55 unique 47 8 28 2 12 top 9,600 GTO NASA (CRS) Success No attempt freq 5 22 14 54 18 It's always a good idea to calculate descriptive statistics. The SpaceX dataset is mainly qualitative, but we still get a lot of information. We have a count of 55 pretty much everywhere, because we have 55 launches. The Payload Mass column shows 53 because of the two missing values we saw before. Only 1 mission failed. Most of the time, there is no attempt at landing. You could also calculate the average payload mass, or the count of launches per year. But do you know what would be best for this last option? DATA SCIENCE Visualiz e! Visualization! In a glance we can see that there were no launches in 2011. The count of launches then gradually increased before doubling in 2017. 2018 is lower, but remember we only have 3 months of data for this year, so it actually looks like it's going to double again. DATA SCIENCE Ask more questions! Now this launch count is informative, but you probably have a couple more. How about count by launch site? Rockets originally launched from Cape Canaveral Air Force Station, but in 2017 most rockets launched from Kennedy Space Center Launch Complex 39. DATA SCIENCE Ask more questions! How about mission outcome? Just one failure in 2015! DATA SCIENCE Outlier s Another thing you do during EDA is look for outliers, that is, unusual values. Whether they are errors or valid, it's nice to know about them, as they can throw your results off. Here, we can see we have only 5 launches with a weight greater than 7,000 kg, when the average mass is closer 3800 kg. DATA SCIENCE Let's practice! Numerical EDA A few years ago, KIC 8462852, also known as Tabby's star, started decreasing strongly in luminosity. We're not sure why yet, but scientists are investigating by recording observations at different times and locations. Below are some descriptive statistics on the observations. Unlike the qualitative dataset we saw in the video, the Tabby's star dataset is made of quantitative data only, so we get statistics, like the mean and median. Luminosity Observations +-------+----------+ | count | 52 | | mean | 1.169356 | | std | 1.230953 | | min | 0.974110 | | 25% | 0.995637 | | 50% | 1.002465 | | 75% | 1.008685 | | max | 9.874800 | +-------+----------+ How many separate luminosity observations make up the dataset? Possible Answers: Select one answer 52 About 1.23 1 9.87 Visual EDA Your friend graphed the star's luminosity using data he copied from public sources. He shows you the graph below. He concludes that scientists have it completely backwards. Scientists say the star is dimming, as the graph shows the luminosity is actually pretty flat, except one moment when it increases a lot. Which of the following statements is true? Possible Answers Select one answer The data and graph has to be correct. As the graph shows, the star must have been stable, except for a brief luminous spike in mid-2012. Scientists should focus on what made that star spike in luminosity. Something weird is going on. That jump here is an outlier and the data should be reviewed. If scientists and journalists say the star is dimming, then the star is dimming. It does not matter that the data states otherwise. Interactive dashboards Dr. Ashraf ALDabbas Visualization: Interactive dashboards We finished the last lesson looking at some graphs, so let's talk a little more about visualization. One picture... One picture is worth a thousand words......if the picture makes sense. However, there are a few things to pay attention to, to ensure your chart is easily understandable and straight to the point. DATA SCIENCE Use color purposefully For example, you should use color purposefully. Remember this graph? Count of launches by year, pretty straightforward. DATA SCIENCE Use color purposefully What about this one? DATA SCIENCE Use color purposefully So colorful, so much better! What do you think?? DATA SCIENCE Use color purposefully Wrong! Granted, it's aesthetically pleasing, but it's also confusing. What do the colors correspond to? Nothing. We're just counting launches per year. One color is enough. DATA SCIENCE Colorblindne ss E xisting color palettes accessible to everyone You should also be mindful of colorblindness. You may distinguish red and green very well, but some people don't, and more than you think. Red and green is the most common (but not the only one). You can find a lot of information on colorblindness online, as well as palettes of colors that are accessible to colorblind people. DATA SCIENCE Readable fonts sans- serif You should also use readable fonts. Sans-serif ones are easier to read. There are nicer fonts available, sure, but your readers should focus on your viz message, not on the font. DATA SCIENCE Label, label, label title x axis label y axis label legend An image is worth a thousand words, but words do help. Your graphs should always have a title, so we know what we're looking at; the x and y axis should have labels, otherwise they could be anything; and you should provide a legend if you use colors and patterns, so that we know what they refer to. DATA SCIENCE Axe s There are some cases when you can start your axes higher than zero, if you want to zoom in on evolution, for example. But it can be misleading. Here, it looks like Obamacare enrollment is far from the goal. DATA SCIENCE Axe s when it's actually close. DATA SCIENCE And the award goes to... Follow these advises, and you should prevent your readers the confusion that comes with badly designed graphs like this one, DATA SCIENCE Honorable mention Or this one. I don't know where to start. DATA SCIENCE Questi on 1 picture = 1000 words 1000 pictures = ? If a picture is worth a thousand words, then what is worth a thousand pictures? DATA SCIENCE A dashboard! A dashboard! Well, technically, a dashboard of a few pictures is enough already. My point is, showing several pictures together can be more insightful than looking at them separately, or trying to pack all the insights in one graph. Your car dashboard indicates the car speed, the motor rotation speed and the proportion of gas left. Individually, these pieces of information are useful. But together, they paint a much bigger 1 Photo by M arek Szturc on Unsplash picture and make your trip more DATA SCIENCE safe and more comfortable. That's what dashboards do: group all the relevant information in one place to make it easier to gather insights and act on them. On this dashboard, any sales person can see not only how sales are progressing this quarter, but also how this progression compares to previous quarters. On top of that, they can keep track of transactions and opportunities, as well as of customer count. They can filter all of this data by software, service or maintenance sales. All of DATA SCIENCE that in one place, customized BI tools Business Intelligence tools let you clean, explore, visualize data, and build dashboards, without requiring any programming knowledge. Such tools are Tableau, Looker, or Power BI. Of course, you can also do that programmatically using Python, R, or even JavaScript. DATA SCIENCE Next level The next level is to make your visualization interactive, which BI tools make really easy. It lets you display more targeted information when hovering over an element, like you can see here. We also talked about filters: here the user can filter on the type of sales and the quarter. Giving power to the user is a great way to engage them. DATA SCIENCE Let's practice! Interactive dashboards Dashboards are great, interactive dashboards are even better. Find here a dashboard made for tracking a website's activity. It consists of (A) average page visit length segmented by age, (B) the number of visits today, (C) the monthly active users over time, and (D) the source of traffic segmented by month. If you hover over or tap certain elements of the bar and line graphs, you'll see some interactivity that shows more detail about the selected data. You can always reset any plot to its original state by double-clicking it. Using this interactivity, how much website traffic in March was from search engines? Instructions 0.471 0.389 0.898 0.6 Bank Dashboard Improving a dashboard Here's another dashboard, which you may remember from the first chapter. It's the dashboard of data collected by vibration sensors on the Brooklyn Bridge to measure the amount of traffic. Since the first chapter, you've learned more about visualization, so take a closer look at this dashboard and see if you can come up with some suggestions to improve it. Remember, some of the plots in the dashboard are interactive and may change as you interact with them. You can always reset any plot to its original state by double-clicking it. Which suggestion would you not give? Instructions The bars in the "Cars by Color" graph should match the car colors they represent. MPH (miles per hours) should be added to the title and axis labels referring to average speed. The "Total Toll Amount" and "Number of Cars" components need a date associated with them. The y-axis in the top right "Average Speed" graph needs to start at 0, as opposed to 52.