Big Data Part 8 PDF

Summary

This document presents various case studies in data analysis and visualization. Data analysis techniques and the implications of data interpretation using visual methods like graphs and charts are discussed in detail. It also covers the different types of sampling biases, and how to distinguish good from bad information.

Full Transcript

Case Study 1 (Colors)

Colored background: “heavy” background; eyes go to the background
White background: eyes go to the data, so it is easy to focus on the data
In general: avoid colored backgrounds as much as possible and use a white background instead

Client's version
Feels completely different
Colors seem to create a lot of visual noise

Case Study 2 (Presentation)

Too much data noise for the audience

Today, I’m going to talk you through a success story: the increase in Moonville users over time. First, let me set up what we are looking at. On the vertical y-axis of this graph, we’re going to plot active users. This is defined as the number of unique users in the past 30 days. We’ll look at how this has changed over time, from the launch in late 2013 to today, shown along the horizontal x-axis.

We launched Moonville in September 2013. By the end of that first month, we had just over 5,000 active users, denoted by the big blue dot at the bottom left of the graph.

Early feedback on the game was mixed. In spite of this, and our practically complete lack of marketing, the number of active users nearly doubled in the first four months, to almost 11,000 active users by the end of December.

In early 2014, the number of active users increased along a steeper trajectory. This was primarily the result of the friends and family promotions we ran during this time to increase awareness of the game.

Growth was pretty flat over the rest of 2014 as we halted all marketing efforts and focused on quality improvements to the game.

Uptake this year, on the other hand, has been incredible, surpassing our expectations. The revamped and improved game has gone viral. The partnerships we’ve forged with social media channels have proven successful for continuing to increase our active user base. At recent growth rates, we anticipate we’ll surpass 100,000 active users in June!
Salient points can be added to the circulated version

Case Study 3

Logic in order
No “story” recognizable
Too many numbers mean too much data noise

Telling the “positive” story
Features A and B show the greatest customer satisfaction
The audience immediately looks at Features A and B because of the descending order

Telling the “negative” story
Features N and J show the least customer satisfaction
The audience immediately looks at Features N and J because they are placed at the top and the order is descending

Telling the story of the “unused” features
Feature O is the least used
The audience immediately looks at Feature O because of the placement and coloring

“Base” visual without anything highlighted

1st slide of story
Shades of blue
“Best” features in the eyes of users
Text to illustrate the point

2nd slide of story
Shades of orange
“Worst” features in the eyes of users
Text for explanation
Not as easy as the 1st story, but nevertheless good

3rd slide of story
Shades of turquoise
“Not used” features
Text for explanation

Putting the complete storyline in one chart
Strategic use of color allows a “zigzag” view (similarity of colors)
Too much “noise” for a live presentation, but good for a report document
The audience should not draw its “own” conclusions; with coloring and text you give them the conclusion

Case Study 4

“Spaghetti” graph
Difficult to concentrate on one line because of the crisscrossing
Too many elements competing for attention

Emphasizing a single line
Using visual cues (color and text)

Emphasizing another trend line

Vertical separation
Only one color is needed
The y-axis is not needed because of the alignment of the charts
Assumption: the trend for each category is more important than the comparison between categories

Horizontal separation
E.g. comparing 2015 for each category

Combination of separation and highlighting method
Vertical separation
Works better for report presentation

Do I need all categories? All years?
When appropriate, reducing the amount of data shown can also make the challenge of graphing data like that shown in this example easier.

Combination of separation and highlighting method
Horizontal separation
Works better for report presentation

Much work for the audience (noise)
Aim: not annoying the audience

Alternative 1: Show numbers directly
Alternative 2: Bar chart
Alternative 3: Horizontal stacked bars

Easily compare before and after
Easy to see and compare
One glance: see the steep increase in “Excited”

Which alternative you choose depends on the visual emphasis and how you want the audience to interact with your data.

Statistics is the science of making effective use of numerical data. It deals with all aspects of this, including the collection, analysis, and interpretation of data.

Big data means more responsibility

Data and information are so prevalent in our lives today that this era is known as the “Information Age”.
Being literate today means not just being able to read, but being able to understand the massive amount of information thrown at us every day, much of it on the computer.

Field of manipulation: Collection of data

In order to analyze and interpret data, we must first collect it. The data that is collected is known as a sample. The sample is collected from a population.

There are many different types of sampling bias. Some examples include:
Area Bias
Self-Selection Bias
Leading Question Bias
Social Desirability Bias

Area Bias

The World Wildlife Fund (WWF) has written on the threats posed to polar bears from global warming. However, also according to them, about 20 distinct polar bear populations exist, accounting for approximately 22,000 polar bears worldwide.
Only 2 of the groups are decreasing.
10 populations are stable.
2 populations are increasing.
The status of the remaining 6 populations is unknown.
If you only looked at the 2 groups that are decreasing, it would be easy to say that “Polar Bear Population is Decreasing”.
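The population counts quoted above can be tallied directly. The following is a minimal sketch that uses only the four figures from the text; the variable names are illustrative.

```python
# Population status counts as quoted in the text (about 20 populations in total).
status_counts = {
    "decreasing": 2,
    "stable": 10,
    "increasing": 2,
    "unknown": 6,
}

total = sum(status_counts.values())

# Share of all populations that falls into each status category.
shares = {status: count / total for status, count in status_counts.items()}

for status, share in shares.items():
    print(f"{status:>10}: {status_counts[status]:2d} of {total} ({share:.0%})")
```

Reporting only the "decreasing" row describes a small fraction of the populations and says nothing about the rest.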
You need to look at the whole picture to get the whole story.

Self-Selection Bias

In self-selection bias, a participant's decision to participate may be correlated with traits that affect the study, making the participants a non-representative sample.
For example: if you were to set up a booth to ask people about their grooming habits, the people who respond are more likely to be those who take more time to primp in the morning than those who just throw on something and head out the door.

Leading Question Bias

If you have a survey that asks:
Don’t you think that … are paid too little?
A) Yes they should earn more
B) No they should not earn more
C) No opinion
You are suggesting by the tone of the question what you believe the answer should be. That will bias your results (is it always bad?)

Social Desirability Bias

If you ask people in a survey about how often they shower, or how often they recycle, your data is going to be biased by the fact that nobody wants to admit to doing something that is considered socially undesirable.

Conclusion of Sampling Bias

Adding a sampling bias to your data collection is an important tool if you want to lie, cheat, manipulate, or mislead with your study results!

Field of manipulation: Data analysis

Data analysis is a process of gathering, modeling, and transforming data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making.

Example: Analysis of data

What if you were a real-estate agent trying to convince people to move into a particular neighborhood? You could, with perfect honesty and “truthfulness”, tell different people that the average income in the neighborhood is:
a) $150,000
b) $35,000
c) $10,000
The $150,000 figure is the arithmetic mean of the incomes of all the families in the neighborhood. The $35,000 figure is the median. The $10,000 figure is the mode.
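To see how all three figures can be true at once, here is a small sketch with a made-up list of nine family incomes. The individual values are assumptions chosen for illustration; only the three summary figures come from the text.

```python
import statistics

# Hypothetical incomes for nine families (illustrative values, not from the source).
# One very large income stands in for a single extremely wealthy household.
incomes = [10_000, 10_000, 10_000, 20_000, 35_000,
           40_000, 50_000, 60_000, 1_115_000]

mean_income = statistics.mean(incomes)      # sum of incomes divided by the number of families
median_income = statistics.median(incomes)  # middle value once the incomes are sorted
mode_income = statistics.mode(incomes)      # most frequently occurring income

print(f"mean:   ${mean_income:,.0f}")
print(f"median: ${median_income:,.0f}")
print(f"mode:   ${mode_income:,.0f}")
```

In this sketch a single very large income pulls the mean far above what most families earn, while the median and the mode do not move at all.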
This particular neighborhood is lucky enough to be near a cliff… and the ONE home with an ocean view is a giant mansion on 50 acres that is owned by a Hollywood Star. With gates. And spikes. And security to keep out the riffraff of the rest of the neighborhood of poor people and the few middle class that live nearby.

Field of manipulation: Interpretation

Interpreting data often involves displaying it in some useful way.
Charts are a type of graphic.
If your goal is to lie, cheat, manipulate, or mislead, graphics are your friend…

This is real data. The top graph shows the cosmic radiation rate in neutrons per hour. The lower graph shows the temperature change since 1975, when it started. All from the BBC’s website.

Here, the data is the same, but by changing the axis labels, someone was able to suggest that the difference in population was much greater than it was.

Once again, both of these charts show the same information if you ONLY look at the HEIGHT of the frogs. The volume of an image is a great way to lie, cheat, manipulate, or mislead.

How to distinguish between good and bad information

Look at the sources. If none are given, do NOT trust the information.
Check to see if there are any obvious sources of bias in the data.
Look at how the data was collected and where it was collected from.
Look very closely at the data axis and legend.
And finally, do NOT believe everything you are shown just because it is “Science” and “Data”. Try to figure out if the source has some ulterior motive to manipulate your opinion.

Tips for Graphical Excellence (GE)

The principles of Graphical Excellence (GE) are:
GE is the well-designed presentation of interesting data: a matter of substance, of statistics, and of design.
GE consists of complex ideas communicated with clarity, precision, and efficiency.
GE is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space.

Thank you for your attention!
